pod5 subset cannot shard a large aggregate pod5 file #72
That's a new one - thanks @billytcl. We will take a look and try to work out what's going on internally...
Thanks! I have an immediate need for this so fingers crossed that it's not
a terrible bug.
A short term workaround might be to work in smaller batches, if that helps? If you are able to rerun with POD5_DEBUG=1 set during the execution, it may provide us more information to debug. How large was the input dataset (and approximate read lengths) - so I can ensure we have an equivalent internal dataset to test on? Thanks,
- George
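The suggested debugging rerun amounts to re-executing the failing command with `POD5_DEBUG=1` in the environment. A minimal sketch of that pattern (the subset arguments shown are placeholders, not the reporter's actual paths):

```python
import os

def debug_env() -> dict:
    """Return a copy of the current environment with POD5_DEBUG enabled."""
    env = dict(os.environ)
    env["POD5_DEBUG"] = "1"  # per the maintainer, enables extra pod5 diagnostics
    return env

# Where pod5 is installed, the failing command would be rerun like:
# import subprocess
# subprocess.run(["pod5", "subset", "input.pod5", "--csv", "mapping.csv"],
#                env=debug_env(), check=True)  # placeholder arguments
```

Copying the environment rather than mutating `os.environ` keeps the debug flag scoped to the one subprocess invocation.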
Unfortunately I can't work in smaller batches, as the pod5 was generated as an aggregate of an entire run's fast5s (using pod5 convert fast5 without the one-to-one option). Ironically, I was trying to use subset to break it into smaller batches grouped by channel! I am repodding all my fast5s using the one-to-one option, but that takes a decent amount of time to go through all of our runs.
The input dataset is ~30-90k fast5 files, so that could be anywhere from 500G to 2TB. Read lengths should be short -- ~170bp on average.
Ok - no worries. Could you try using:
It should produce a significantly smaller file for subset to process.
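The exact command suggested here did not survive in the thread, but the idea is to hand subset a table with only the columns it needs (e.g. read_id and channel) rather than a full `pod5 view` dump. A stand-in sketch of that column reduction, assuming a tab-separated table with those header names:

```python
import csv
import io

# Columns assumed to be all that subsetting-by-channel actually needs.
KEEP = ("read_id", "channel")

def reduce_columns(table_tsv: str, keep=KEEP) -> str:
    """Rewrite a TSV, keeping only the named columns."""
    reader = csv.DictReader(io.StringIO(table_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(keep), delimiter="\t",
                            extrasaction="ignore")  # silently drop other columns
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

demo = "read_id\tchannel\tread_number\nr1\t5\t100\nr2\t5\t101\n"
print(reduce_columns(demo))
```

On a table with many columns per read, dropping everything but the two needed fields shrinks the file subset has to parse considerably.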
Going to try the pod5 view with the reduced columns now. In the meantime, here are the logs with debug mode: 2023-09-18--00-55-28-p-3088300-pod5.log. Alongside is the error sent to nohup:
Using Python 3.7 and pod5 0.2.4. I'm trying to use pod5 subset to break up a large pod5 file into smaller ones, but it crashes with a strange error.
I used pod5 view to generate a reads table and split on the "channel" column:
First 10k lines:
Then I run subset:
It crashes with:
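The per-channel split the issue attempts can be sketched independently of pod5 itself. Assumptions: the `pod5 view` table is tab-separated with read_id and channel columns, and pod5 subset accepts a target-to-read-id mapping table (check the pod5 documentation for the exact format it expects; the output filenames here are hypothetical):

```python
import csv
import io
from collections import defaultdict

def reads_by_channel(table_tsv: str) -> dict:
    """Group read_ids from a pod5-view-style TSV by their channel value."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_tsv), delimiter="\t"):
        groups[row["channel"]].append(row["read_id"])
    return dict(groups)

def mapping_csv(groups: dict) -> str:
    """Emit a target,read_id mapping: one hypothetical output shard per channel."""
    out = ["target,read_id"]
    for channel, read_ids in sorted(groups.items()):
        for read_id in read_ids:
            out.append(f"channel-{channel}.pod5,{read_id}")
    return "\n".join(out) + "\n"

demo = "read_id\tchannel\nr1\t5\nr2\t7\nr3\t5\n"
print(mapping_csv(reads_by_channel(demo)))
```

With short reads (~170bp) in a multi-terabyte aggregate file, the mapping table itself can hold many millions of rows, which is consistent with subset hitting scaling limits here.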