pod5 doesn't seem to save much space on short reads #66
Comments
We usually write pod5 in MinKNOW, currently still in 4K batches (the default), just like fast5. We always convert the raw data files (fast5 or pod5) into a single large pod5 file for immediate basecalling; this process is quite fast. After basecalling, this single pod5 is removed. We always keep the original files (written by MinKNOW) for long-term storage.
AFAICS the single pod5 is just the size of the run data, at least for the data I have worked with here so far.
Is there a big performance difference with a single pod5 vs a bunch of individual pod5 files? I may just do the one-to-one conversion instead.
I haven't benchmarked, but generally I'd say reading from one big file is more efficient than reading a few hundred or even a few thousand files. You may run into I/O issues with many small files, so that your GPU is loaded suboptimally because it does not get data fast enough.
I have different sources of sequencing data, so converting/merging whatever I get into a single pod5 file makes dorado always start with the same input type. Converting/merging is quite fast, even for large datasets. So if basecalling takes 30h instead of 29h .. well, I don't care ;-)
Is there any special reason why you want to convert one-by-one?
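The "many small files vs one big file" point above can be sketched with a quick stdlib-only micro-benchmark. The payload is identical either way; the difference is the extra open/close and seek cost per small file, which is what can starve a GPU on slow or networked storage. File names, counts, and sizes here are arbitrary, and a tmpfs-backed temp dir will understate the gap compared to NFS.

```python
# Micro-sketch: reading N small files vs one equally sized big file.
# Illustrative only -- real pod5 batches are far larger than 4 KiB.
import tempfile, time
from pathlib import Path

def write_many(root: Path, n: int, size: int) -> None:
    """Write n small files of `size` bytes each."""
    for i in range(n):
        (root / f"part{i}.bin").write_bytes(b"\0" * size)

def read_many(root: Path) -> int:
    """Read every small file back; return total bytes read."""
    return sum(len(p.read_bytes()) for p in sorted(root.glob("part*.bin")))

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    write_many(root, 200, 4096)                          # 200 small files
    (root / "big.bin").write_bytes(b"\0" * 200 * 4096)   # one big file

    t0 = time.perf_counter(); many = read_many(root); t1 = time.perf_counter()
    big = len((root / "big.bin").read_bytes()); t2 = time.perf_counter()
    assert many == big == 200 * 4096                     # identical payloads
    print(f"many files: {t1 - t0:.4f}s, one file: {t2 - t1:.4f}s")
```

On local SSDs the gap is often small; over NFS, per-file metadata round trips make it much larger, which matches the comment above.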
I'm just scared of a single 2 TB file being corrupted, and we are thinking of ways to archive a bunch of our old runs that were pre-pod5! With the amount of reads it actually takes 8-17 h to convert the fast5s to a single pod5.
I just converted a small fast5 dataset (from 2020), ~1000 files, 650 GB, in 40 min. Data was read from an NFS mount; it may be even faster from local storage.
We are also thinking about archiving old run data; we will probably go for per-flowcell pod5 files.
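The two conversion timings quoted in this thread can be turned into a quick back-of-envelope throughput comparison. All numbers come from the comments above; decimal units (1 GB = 1e9 B) are assumed.

```python
# Average conversion throughput for the two datasets discussed above.

def throughput_mb_s(total_bytes: float, seconds: float) -> float:
    """Average throughput in MB/s (decimal megabytes)."""
    return total_bytes / seconds / 1e6

# 650 GB converted in 40 min over NFS:
print(round(throughput_mb_s(650e9, 40 * 60)))     # ~271 MB/s

# 1.7 TB taking 8-17 h works out to roughly 28-59 MB/s:
print(round(throughput_mb_s(1.7e12, 8 * 3600)))   # ~59 MB/s
print(round(throughput_mb_s(1.7e12, 17 * 3600)))  # ~28 MB/s
```

The much lower per-byte rate for the 1.7 TB dataset is consistent with the short-read explanation below: far more reads per byte means far more per-read work during conversion.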
There could be a lot of overhead in our dataset, considering the reads are short, ~160 bp! We are also converting from an NFS mount.
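The overhead intuition above can be made concrete with some rough arithmetic: for very short reads the raw-signal payload shrinks, but per-read metadata does not. The sampling figures below are typical assumptions (~10 signal samples per base at 2 bytes per sample) and the 1 kB of fixed per-read metadata is hypothetical, not a measured pod5 value.

```python
# Rough model of why short reads leave little room for compression savings.

def signal_bytes(read_len_bp: int, samples_per_base: float = 10.0,
                 bytes_per_sample: int = 2) -> float:
    """Approximate uncompressed raw-signal size for one read (assumed rates)."""
    return read_len_bp * samples_per_base * bytes_per_sample

def overhead_fraction(read_len_bp: int, metadata_bytes: float = 1000.0) -> float:
    """Share of a read's footprint taken by fixed metadata (hypothetical 1 kB)."""
    sig = signal_bytes(read_len_bp)
    return metadata_bytes / (sig + metadata_bytes)

print(overhead_fraction(160))     # ~160 bp read: metadata is roughly a quarter
print(overhead_fraction(10_000))  # 10 kb read: metadata well under 1%
```

Since signal compression only shrinks the signal portion, a dataset dominated by ~160 bp reads would be expected to land much closer to a ratio of 1 than a long-read dataset, under these assumptions.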
Hi @billytcl, I don't believe you're doing anything wrong - it looks like the dataset shrinks to about 0.88x its original size. I agree I have seen more significant reductions in the past; it will depend on the original compression of the fast5 dataset and on the length of the reads.
I think this is fair; this is a significant amount of data, and converting this amount from fast5 could take some time. I'm not sure I can recommend whether one massive file is better than several smaller ones for archiving - it likely depends on your storage system and backup processes. If you are open to sharing some of the source/destination data, I can investigate why more space wasn't saved in the conversion. It might also be worth doing some estimations of the number of bytes used per read for the whole dataset - does that number seem reasonable? It might hint that either the fast5 dataset is smaller than expected, or the pod5 is larger. Thanks,
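The bytes-per-read estimation suggested above can be sketched with the standard library: sum the on-disk sizes and divide by the read count. The paths are hypothetical and the read count must come from elsewhere (e.g. a sequencing summary); the 5e8-read figure below is purely illustrative.

```python
# Sketch: estimate average bytes per read for a dataset on disk.
from pathlib import Path

def dir_size_bytes(root: str) -> int:
    """Total size of all regular files under `root` (e.g. a fast5 directory)."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())

def bytes_per_read(total_bytes: float, n_reads: int) -> float:
    """Average on-disk footprint per read."""
    return total_bytes / n_reads

# Illustrative numbers only: if the 1.7 TB dataset held, say, 5e8 reads,
# that would be 3.4 kB/read -- large for ~160 bp reads, hinting that
# per-read overhead rather than signal dominates the footprint.
print(bytes_per_read(1.7e12, int(5e8)))  # 3400.0
```

Comparing this figure for the fast5 directory against the converted pod5 file shows whether the pod5 is unexpectedly large or the fast5s were already small, as suggested above.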
I have some fast5s of short reads (~170 bp) that total 1.7 TB. When I convert them to a single pod5 file, the file size doesn't seem to change much, although I thought that was one of the main advantages compared to fast5. Am I doing anything wrong here? The faster performance is nice, but having a large file like that is a little scary, especially if it gets corrupted.