pod5 subset cannot shard a large aggregate pod5 file #72

Open
billytcl opened this issue Sep 17, 2023 · 6 comments
Comments

@billytcl

I'm using Python 3.7 and pod5 0.2.4. I'm trying to use pod5 subset to break up a large pod5 file into smaller ones, but it crashes with a strange error.

I used pod5 inspect reads to generate a reads table so that I could split on the "channel" column:

nohup pod5 inspect reads converted.pod5 > inspect_reads.txt &

First 10k lines:

read_id filename        read_number     channel mux     end_reason      start_time      start_sample    duration        num_samples     minknow_events  sample_rate     median_before   predicted_scaling_scale predicted_scaling_shift   tracked_scaling_scale   tracked_scaling_shift   num_reads_since_mux_change      time_since_mux_change   run_id  sample_id       experiment_id   flow_cell_id    pore_type
001e35eb-1c55-4c7b-8886-dcdcb6dca15f    converted.pod5  74306   148     4       signal_positive 22092.86475000  88371459        0.64550000      2582    144     4000    205.11199951    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00356992-b31c-49a9-bb54-b5f04e8feccb    converted.pod5  67712   850     4       signal_positive 22086.55175000  88346207        0.71600000      2864    142     4000    208.46253967    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00399263-1258-42df-be8e-623435a5e3b0    converted.pod5  68784   1138    3       signal_positive 22092.13000000  88368520        0.93625000      3745    195     4000    205.83134460    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0046e652-310e-42e8-a9eb-a97afe61cd6c    converted.pod5  71090   2653    3       signal_positive 22093.76100000  88375044        0.94250000      3770    201     4000    204.88726807    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
004f3d41-1063-43d5-a2fe-ab70094d784c    converted.pod5  74144   1178    3       signal_positive 22089.63250000  88358530        0.90700000      3628    194     4000    208.21510315    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
00625954-411f-4c18-ad6d-f7d0b5234f84    converted.pod5  65143   1218    1       signal_positive 22094.15175000  88376607        0.78825000      3153    163     4000    213.10104370    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
007ff853-7128-4f74-bb12-88a31ac23205    converted.pod5  46458   2338    1       signal_positive 22088.29975000  88353199        0.67450000      2698    152     4000    210.64157104    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0080d4a1-8abd-4a06-9a9a-e6b8d856122a    converted.pod5  74703   1140    4       signal_positive 22090.70975000  88362839        1.09300000      4372    258     4000    206.32232666    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set
0089e982-2261-4a08-9945-07b25a7f7214    converted.pod5  67244   56      1       signal_positive 22091.08575000  88364343        1.03300000      4132    196     4000    211.88562012    NaN     NaN     NaN     NaN     0       0.00000000        389c96a8-a692-4592-85de-b4c93643740e    Seq_Output      not_set PAM91261        not_set

Then I run subset:

nohup pod5 subset pod5/*.pod5 -o pod5_subset/ -f -r -s pod5/inspect_reads.head.txt -c channel &

It crashes with:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

POD5 has encountered an error: ''

For detailed information set POD5_DEBUG=1'
@0x55555555
Collaborator

That's a new one - thanks @billytcl,

We will take a look and try to work out what's going on internally...

  • George

@billytcl
Author

billytcl commented Sep 18, 2023 via email

@0x55555555
Collaborator

A short-term workaround might be to work in smaller batches, if that helps?

If you are able to rerun with POD5_DEBUG=1 set during the execution, it may provide us with more information to debug.

How large was the input dataset, and what were the approximate read lengths? I'd like to make sure we have an equivalent internal dataset to test on.

Thanks,

  • George
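The batching workaround suggested above could be sketched as follows, assuming "smaller batches" means splitting the read table passed to subset into several smaller tables that each keep the header row. The helper name `batch_table` and the batch size are illustrative assumptions, not part of pod5, and the thread does not establish whether a smaller table avoids the error.

```python
from typing import Iterator, List


def batch_table(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield chunks of a table, repeating the header line in every chunk.

    `lines` is the whole table as a list of text lines (header first).
    This is a hypothetical helper, not a pod5 function.
    """
    if not lines:
        return
    header, rows = lines[0], lines[1:]
    for start in range(0, len(rows), batch_size):
        # Each batch is a standalone table: header plus up to batch_size rows.
        yield [header] + rows[start:start + batch_size]
```

Each yielded batch could then be written to its own file and passed to a separate subset run.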

@billytcl
Author

billytcl commented Sep 18, 2023 via email

@0x55555555
Collaborator

Ok - no worries.

Could you try using view rather than inspect? This will cut down the size of the input CSV passed to subset:

pod5 view --include "read_id, channel" converted.pod5

This should produce a significantly smaller file for subset to process.
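If the full inspect table has already been generated, the same reduction could be done after the fact. This is a minimal sketch that keeps only the named columns, assuming the table is tab-separated as it appears in the first post. `keep_columns` is a hypothetical helper, and the delimiter that subset expects is an assumption here.

```python
import csv
import io
from typing import List


def keep_columns(table_text: str, columns: List[str], delimiter: str = "\t") -> str:
    """Return a copy of a delimited table containing only `columns`.

    Hypothetical helper (not part of pod5); assumes the first line is a header.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        delimiter=delimiter,
        extrasaction="ignore",  # silently drop every column not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, `keep_columns(text, ["read_id", "channel"])` would keep just the two columns named in the view command above.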

@billytcl
Author

Going to try the pod5 view with the reduced columns now. In the meantime here are the logs with debug mode:

2023-09-18--00-55-28-p-3088300-pod5.log
2023-09-18--00-55-26-main-pod5.log

Alongside is the error written to nohup:

Parsed 9999 targets
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Traceback (most recent call last):
  File "/home/billylau/.conda/envs/pod5/bin/pod5", line 8, in <module>
    sys.exit(main())
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/main.py", line 60, in main
    return run_tool(parser)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 41, in run_tool
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 38, in run_tool
    return tool_func(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/parsers.py", line 564, in run
    return subset_pod5(**kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 647, in subset_pod5
    force_overwrite=force_overwrite,
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 585, in subset_pod5s_with_mapping
    sources_df = parse_sources(inputs, duplicate_ok, threads)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 184, in wrapper
    raise exc
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/utils.py", line 181, in wrapper
    ret = func(*args, **kwargs)
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/site-packages/pod5/tools/pod5_subset.py", line 321, in parse_sources
    items.append(parsed_sources.get(timeout=60))
  File "/home/billylau/.conda/envs/pod5/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty
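For readers following the two tracebacks above: the first is raised in the queue's background feeder thread while pickling, and the second in the main process when `get(timeout=60)` finds nothing to read. On Python 3.7 the default pickle protocol is 3, which cannot serialise a bytes object larger than 4 GiB. Protocol 4 lifts that limit and became the default in Python 3.8. The sketch below reproduces the same pattern in miniature, with an unpicklable lambda standing in for the oversized payload. It is an illustration only, not pod5's code, and `get_after_failed_put` is a hypothetical name.

```python
import multiprocessing
import queue


def get_after_failed_put(timeout: float = 1.0) -> str:
    """Put an unpicklable object on a multiprocessing.Queue, then try to read it."""
    q = multiprocessing.Queue()
    # put() only buffers the object; a background feeder thread pickles it later.
    # The lambda cannot be pickled, so the feeder logs a traceback and drops it.
    q.put(lambda: None)
    try:
        q.get(timeout=timeout)
        return "received"
    except queue.Empty:
        # Nothing ever reached the pipe, so the consumer only sees a timeout.
        return "empty"
    finally:
        q.close()
        q.join_thread()
```

Calling `get_after_failed_put()` prints the feeder thread's pickling traceback to stderr and then reports a timeout, which mirrors the order of the two errors in the log.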
