
lambda3 mkindexn on a large fasta file #229

Open · ArmandBester opened this issue Jul 19, 2024 · 1 comment

@ArmandBester

Dear lambda creators

I think I may be missing something. I am trying to create a nucleotide index on a 677 GB FASTA (nt) file and, as expected, I get this warning:

WARNING: Your sequence file is already larger than your physical memory!
         This means you will likely encounter a crash with "bad_alloc".
         Split you sequence file into many smaller ones or use a computer
         with more memory!
free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi        31Gi       432Gi       4.1Gi        39Gi       466Gi
Swap:          31Gi       1.8Gi        30Gi

My questions are: if I split the FASTA file, say into 3 (e.g. along the lines sketched below), and create a separate index for each:

    1. How would I run the search against the 3 .lba files?
    2. Would I not still have too little memory?
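
For reference, this is roughly how I was thinking of splitting the file (just a sketch; it splits by record count rather than by bytes, but only at record boundaries):

    # count the records, then distribute them over 3 output files, splitting only at '>' lines
    total=$(grep -c '^>' nt.fasta)
    awk -v parts=3 -v total="$total" '
        /^>/ { n++; part = int((n - 1) * parts / total) + 1 }
        { print > ("nt.part" part ".fasta") }
    ' nt.fasta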

Kind regards
Armand

@h-2
Member

h-2 commented Jul 21, 2024

Dear Armand,

Even assuming that you manage to create the database, what is your use case for it? Unless you are searching more than ~10 GB of query sequences, your program's runtime will be dominated by just loading the database (which will take very long, as the index will be around 2 TB in total).

If you are searching very large query files, this could still be worth it, but you will need to split the database, run the searches individually, and then merge the output files manually (e.g. as sketched below). In that case, I would recommend using m8 output, reducing the desired number of hits per query, and then merging the files with a combination of the shell commands sort (increase its allowed memory usage and thread count) and awk (for filtering).
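
A rough sketch of what that could look like; the lambda3 flag names below are assumptions from memory (please check lambda3 searchn --help), and the file names are placeholders:

    # search each database chunk separately (flag names are assumptions -- see `lambda3 searchn --help`)
    for i in 1 2 3; do
        lambda3 searchn -q queries.fasta -i nt.part${i}.fasta.lba -o part${i}.m8
    done

    # merge: sort by query id, then by bit score (m8 column 12) descending;
    # give sort plenty of memory and threads
    sort --parallel=16 -S 100G -k1,1 -k12,12gr part*.m8 > merged.m8

    # keep at most 25 hits per query with awk
    awk -v max=25 '$1 != prev { prev = $1; n = 0 } ++n <= max' merged.m8 > merged.top25.m8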

If you want to proceed with splitting the index, I would suggest the following:

  • Try with a small chunk (~30 GB) first. Use /usr/bin/time -v to measure runtime and peak memory usage (the "Maximum resident set size" / MaxRSS value); see the sketch after this list.
  • This will give you an indication of whether the runtimes are viable for you and how large you can make the chunks in a production setting.
  • I would definitely recommend using .lba.gz to reduce the on-disk size of the index files. This may even make loading faster.
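
For the measurement, something along these lines should work (again, the mkindexn flags are an assumption; the /usr/bin/time -v output fields are standard GNU time):

    # build an index for one ~30 GB chunk and record wall-clock time and peak RSS
    /usr/bin/time -v lambda3 mkindexn -d nt.part1.fasta 2> mkindexn.time.log

    # pull out the two values of interest
    grep -E 'Elapsed \(wall clock\)|Maximum resident set size' mkindexn.time.log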

If you have any further questions, feel free to ask :)
