
pod5 doesn't seem to save much space on short reads #66

Open
billytcl opened this issue Aug 30, 2023 · 7 comments

Comments

@billytcl

I have some fast5s of short reads (~170bp) that total 1.7T:

billylau@suzuki:/mnt/ix1/Seq_Runs/20230721_PRM_1377/Seq_Output$ find . -iname '*.fast5' -print0 | du -ch --files0-from=- | tail
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_61.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_35.fast5
19M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_14.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_86.fast5
19M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_2.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_18.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_77.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_56.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_39.fast5
1.7T    total

When I convert them to a single pod5 file, the total size doesn't seem to change much, although I thought smaller files were one of the main advantages over fast5:

billylau@suzuki:/mnt/ix1/Seq_Runs/20230721_PRM_1377/pod5$ ll -h
total 1.5T
drwxrwxr-x 2 billylau jiseqruns    4 Aug 18 22:01 ./
drwxrwxr-x 7 billylau jiseqruns   13 Aug 25 14:50 ../
-rw------- 1 billylau jiseqruns 6.4M Aug 18 22:01 nohup.out
-rw-rw-r-- 1 billylau jiseqruns 1.5T Aug 18 22:01 output.pod5

Am I doing anything wrong here? The faster performance is nice, but having a large file like that is a little scary, especially if it gets corrupted.
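For reference, a conversion along these lines would produce the single output.pod5 - a minimal sketch, assuming ONT's pod5 tools CLI (installable with `pip install pod5`); the input path is hypothetical:

# convert an entire fast5 directory tree into one pod5 file
pod5 convert fast5 ./Seq_Output --recursive --output output.pod5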

@sklages

sklages commented Aug 30, 2023

We usually write pod5 directly in MinKNOW, currently still in 4k batches (which is the default), just like fast5.

We always convert the raw data files (either fast5/pod5) into a single large pod5 file for immediate basecalling. This process is quite fast.

After basecalling, this single pod5 is removed. We always keep the original files (written by MinKNOW) for long-term storage.

AFAICS the single POD5 is just the size of the run data. At least for the data I have worked with here so far.
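That workflow, roughly (a minimal sketch - the paths and the dorado model name are hypothetical, not taken from this thread):

# merge the per-batch pod5 files written by MinKNOW into one file for basecalling
pod5 merge pod5_pass/*.pod5 --output run_merged.pod5
# basecall from the single file, then drop it; the original batch files are kept
dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v4.1.0 run_merged.pod5 > calls.bam
rm run_merged.pod5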

@billytcl
Author

billytcl commented Aug 30, 2023 via email

@sklages

sklages commented Aug 30, 2023

I haven't benchmarked, but generally I'd say reading from one big file is more efficient than reading a few hundred or even a few thousand files. With many small files you may run into I/O issues, so the GPU is loaded suboptimally because it doesn't get data fast enough.

I get sequencing data from different sources, so converting/merging whatever I get into a single pod5 file means dorado always starts from the same input type. Converting/merging is quite fast, even for large datasets. So if basecalling takes 30h instead of 29h .. well, I don't care ;-)

Is there any special reason why you want to convert one-by-one?

@billytcl
Author

billytcl commented Aug 30, 2023 via email

@sklages

sklages commented Aug 30, 2023

I just converted a small fast5 dataset (from 2020), ~1000 files, 650GB, in 40min. The data was read from an NFS mount .. it may be even faster from local storage.

We are also thinking about archiving old run data .. we will probably go for per-flowcell pod5 files, merged along the lines sketched below.
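Merging one archive per flowcell could look something like this (a sketch only - `pod5 merge` is the real subcommand, but the directory layout here is hypothetical):

mkdir -p archive
for fc in runs/*/; do
    # one merged pod5 per flowcell directory (hypothetical layout)
    pod5 merge "${fc}"pod5_pass/*.pod5 --output "archive/$(basename "$fc").pod5"
done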

@billytcl
Author

billytcl commented Aug 30, 2023 via email

@0x55555555
Collaborator

Hi @billytcl,

I don't believe you're doing anything wrong - it looks like the dataset shrinks to about 0.88 of its original size (1.5T from 1.7T). I agree I have seen more significant reductions in the past. It will depend on the original compression of the fast5 dataset and on the read length: pod5 compresses signal with VBZ, just as recent fast5 files already do, and with very short reads the per-read metadata makes up a larger share of the file.
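One way to check that original compression (a sketch - `h5dump -p -H` prints each dataset's filter properties; ONT's VBZ filter registers as HDF5 filter id 32020; the filename is taken from the listing above):

# look for the VBZ filter on the fast5 Signal datasets
h5dump -p -H PAQ68008_pass_barcode11_9290fd40_633ff27b_61.fast5 | grep -A 3 FILTERS | head -n 20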

> I’m just scared of a single 2TB file being corrupted.

I think this is fair - it is a significant amount of data, and converting that much from fast5 could take some time. I'm not sure I can recommend whether one massive file is better than several smaller ones for archiving; it likely depends on your storage system and backup processes.

If you are open to sharing some of the source/destination data, I can investigate why more space wasn't saved in the conversion. It might also be worth estimating the number of bytes used per read across the whole dataset - does that number seem reasonable? That could hint at whether the fast5 dataset is smaller than expected or the pod5 is larger.
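A quick way to get that estimate (a sketch - it assumes `pod5 view`, which prints one TSV row per read after a header line, and GNU stat; the filename matches the listing above):

# reads = rows in the read table, minus the header line
reads=$(pod5 view output.pod5 | tail -n +2 | wc -l)
# file size in bytes (GNU coreutils stat)
bytes=$(stat -c %s output.pod5)
echo "$((bytes / reads)) bytes per read"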

Thanks,

- George
