pod5 subset cannot shard a large aggregate pod5 file #72

billytcl · 2023-09-17T18:47:51Z

Using Python 3.7 and pod5 0.2.4, I'm trying to use pod5 subset to break up a large pod5 file into smaller ones, but it crashes with the error shown below.

I used pod5 inspect reads to generate a reads table, which I want to split on the "channel" column:

nohup pod5 inspect reads converted.pod5 > inspect_reads.txt &

First 10k lines:

read_id filename        read_number     channel mux     end_reason      start_time      start_sample    duration        num_samples     minknow_events  sample_rate     median_before   predicted_scaling_scale predicted_scaling_shift   tracked_scaling_scale   tracked_scaling_shift   num_reads_since_mux_change      time_since_mux_change   run_id  sample_id       experiment_id   flow_cell_id    pore_type
001e35eb-1c55-4c7b-8886-dcdcb6dca15f    converted.pod5  74306   148     4       signal_positive 22092.86475000  88371459        0.64550000      2582    144     4000    205.11199951    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00356992-b31c-49a9-bb54-b5f04e8feccb    converted.pod5  67712   850     4       signal_positive 22086.55175000  88346207        0.71600000      2864    142     4000    208.46253967    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00399263-1258-42df-be8e-623435a5e3b0    converted.pod5  68784   1138    3       signal_positive 22092.13000000  88368520        0.93625000      3745    195     4000    205.83134460    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0046e652-310e-42e8-a9eb-a97afe61cd6c    converted.pod5  71090   2653    3       signal_positive 22093.76100000  88375044        0.94250000      3770    201     4000    204.88726807    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
004f3d41-1063-43d5-a2fe-ab70094d784c    converted.pod5  74144   1178    3       signal_positive 22089.63250000  88358530        0.90700000      3628    194     4000    208.21510315    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00625954-411f-4c18-ad6d-f7d0b5234f84    converted.pod5  65143   1218    1       signal_positive 22094.15175000  88376607        0.78825000      3153    163     4000    213.10104370    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
007ff853-7128-4f74-bb12-88a31ac23205    converted.pod5  46458   2338    1       signal_positive 22088.29975000  88353199        0.67450000      2698    152     4000    210.64157104    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0080d4a1-8abd-4a06-9a9a-e6b8d856122a    converted.pod5  74703   1140    4       signal_positive 22090.70975000  88362839        1.09300000      4372    258     4000    206.32232666    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0089e982-2261-4a08-9945-07b25a7f7214    converted.pod5  67244   56      1       signal_positive 22091.08575000  88364343        1.03300000      4132    196     4000    211.88562012    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set

Then I run subset:

nohup pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s pod5/inspect_reads.head.txt -c channel &

It crashes with:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

POD5 has encountered an error: ''

For detailed information set POD5_DEBUG=1'
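
The OverflowError in the traceback is raised by Python's pickle module, not by pod5 code: pickle protocols below 4 cannot encode a bytes object of 4 GiB or more, and Python 3.7 defaults to protocol 3 (the default became 4 in Python 3.8). Below is a minimal sketch for checking the interpreter in use. The helper name is hypothetical, and whether a newer Python avoids this particular pod5 failure is an untested assumption.

```python
import pickle
import sys

# Protocol 4 (PEP 3154) is the first pickle protocol with 8-byte length
# fields, so it is the first that can hold a bytes object of 4 GiB or more.
LARGE_OBJECT_PROTOCOL = 4


def default_protocol_handles_large_objects() -> bool:
    """Hypothetical helper: True if pickle's default protocol is >= 4."""
    return pickle.DEFAULT_PROTOCOL >= LARGE_OBJECT_PROTOCOL


if __name__ == "__main__":
    print("Python version:", ".".join(map(str, sys.version_info[:3])))
    print("Default pickle protocol:", pickle.DEFAULT_PROTOCOL)
    print("Handles >4 GiB objects by default:",
          default_protocol_handles_large_objects())
```

If the last line reports False, any object sent through a multiprocessing queue with the default settings is subject to the 4 GiB limit.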

0x55555555 · 2023-09-18T07:27:10Z

That's a new one - thanks @billytcl,

We will take a look and try to work out what's going on internally...

George

billytcl · 2023-09-18T07:29:24Z

Thanks! I have an immediate need for this so fingers crossed that it's not a terrible bug.

0x55555555 · 2023-09-18T07:31:51Z

A short term workaround might be to work in smaller batches, if that helps?

If you are able to rerun with POD5_DEBUG=1 set during the execution it may provide us more information to debug.

How large was the input dataset (and approximate read lengths) - so I can ensure we have an equivalent internal dataset to test on?

Thanks,

George

billytcl · 2023-09-18T07:51:07Z

Unfortunately I can't work in smaller batches as the pod5 was generated as an aggregate of an entire run's fast5s (using pod5 convert fast5, without the one-to-one option). Ironically I was trying to use subset to break it into smaller batches grouped by channel! I am repodding all my fast5s using the one-to-one option, but that takes a decent amount of time to go through all of our runs. The input dataset is ~30-90k fast5 files, so that could be anywhere from 500G to 2TB. Read lengths should be short -- ~170bp on average.

0x55555555 · 2023-09-18T08:09:17Z

Ok - no worries.

Could you try using pod5 view rather than pod5 inspect? This will cut down the size of the input table passed to subset:

pod5 view --include "read_id, channel" converted.pod5

This should produce a significantly smaller file for subset to process.
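
If the full inspect table has already been generated, the same reduction can be approximated by trimming it to the two columns of interest. This is a sketch only: it assumes the table is tab-separated with a header row (as the pasted output suggests) and that subset accepts a tab-separated table. The function name `keep_columns` is hypothetical and not part of pod5.

```python
import csv
import io


def keep_columns(src, dst, columns=("read_id", "channel"), delimiter="\t"):
    """Copy only `columns` from a delimited table in `src` to `dst`.

    Rows are streamed one at a time, so the whole table is never in memory.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(src, delimiter=delimiter)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"columns not found in header: {missing}")
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    rows = 0
    for row in reader:
        writer.writerow([row[c] for c in columns])
        rows += 1
    return rows


# Small in-memory demonstration using one read from the table above.
sample = (
    "read_id\tfilename\tchannel\n"
    "001e35eb-1c55-4c7b-8886-dcdcb6dca15f\tconverted.pod5\t148\n"
)
out = io.StringIO()
count = keep_columns(io.StringIO(sample), out)
print(count)
print(out.getvalue(), end="")
```

For a real table, open file handles would be passed in place of the `io.StringIO` objects, for example `open("inspect_reads.txt", newline="")` for the input.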

billytcl · 2023-09-18T08:13:27Z

Going to try the pod5 view with the reduced columns now. In the meantime here are the logs with debug mode:

2023-09-18--00-55-28-p-3088300-pod5.log
2023-09-18--00-55-26-main-pod5.log

Here is the error output captured by nohup:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/bin/pod5", line 8, in <module>
    sys.exit(main())
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/main.py", line 60, in main
    return run_tool(parser)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 41, in run_tool
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 38, in run_tool
    return tool_func(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 564, in run
    return subset_pod5(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 647, in subset_pod5
    force_overwrite=force_overwrite,
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 585, in subset_pod5s_with_mapping
    sources_df = parse_sources(inputs, duplicate_ok, threads)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 321, in parse_sources
    items.append(parsed_sources.get(timeout=60))
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty
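
The two tracebacks appear to be related. A multiprocessing queue pickles each item in a background feeder thread. When pickling fails, the traceback is printed and the item is dropped, so the consumer's get() eventually times out with Empty. The sketch below reproduces that pattern on a small scale. It uses an unpicklable lambda as a stand-in for the oversized object, which is an assumption made to keep the example cheap to run. The function name is hypothetical, and this is not pod5's code.

```python
import multiprocessing
import queue


def get_after_failed_put(timeout: float = 1.0) -> str:
    """Put an unpicklable item on a queue, then try to read it back."""
    q = multiprocessing.Queue()
    # put() returns immediately; the feeder thread pickles the item later.
    # A lambda cannot be pickled, so the feeder prints a traceback to
    # stderr and silently drops the item.
    q.put(lambda: None)
    try:
        q.get(timeout=timeout)
        return "received"
    except queue.Empty:
        # Nothing ever reached the pipe, so the consumer only sees a timeout.
        return "empty"
    finally:
        q.close()
        q.join_thread()


if __name__ == "__main__":
    print(get_after_failed_put())
```

In the log above the same sequence is visible: the OverflowError from `_feed` comes first, and the `_queue.Empty` raised by `parsed_sources.get(timeout=60)` follows it.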
