[Feature request] Add option to exclude symlink creation for temporary files (for easy use with AWS S3) #93

Open
jolespin opened this issue Sep 30, 2024 · 4 comments

Comments

@jolespin

jolespin commented Sep 30, 2024

I'd like to run MetaEuk on hundreds of eukaryotic genome assemblies using AWS, but writing to disk on EFS is extremely expensive. The alternative is to use an S3 bucket for temporary storage. However, this currently isn't possible with MetaEuk because it creates a symlink called "latest" in the temporary directory, which isn't supported on S3: it's not a traditional file system even when mounted (e.g., you can't create symlinks, remove files, or edit files once they are created, with the latter being possible using the AWS CLI).

Would it be possible to make the temporary files more "S3 friendly" by adding an option to not create any symlinks in the temporary directory (and also not edit or remove files once they are created)?
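For readers who want to test a candidate temporary directory before launching a long run, here is a minimal Python sketch that probes whether symlinks can be created there. The function name `supports_symlinks` is hypothetical and not part of MetaEuk or MMseqs2. On a mount that also forbids deletion, the small probe file may be left behind.

```python
import os
import uuid


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper for illustration; not part of MetaEuk.
    """
    token = uuid.uuid4().hex
    target = os.path.join(directory, f".probe_{token}")
    link = os.path.join(directory, f".probe_{token}.link")
    try:
        # Create a small regular file, then try to symlink to it.
        with open(target, "w") as handle:
            handle.write("probe")
        os.symlink(target, link)
        ok = os.path.islink(link)
    except (OSError, NotImplementedError):
        ok = False
    finally:
        # Best-effort cleanup; removal can itself fail on some mounts.
        for path in (link, target):
            try:
                os.remove(path)
            except OSError:
                pass
    return ok
```

On a local disk this should return `True`; on a file system without symlink support it should return `False`.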

Also, an unrelated question while I have your attention. If you run the following command:
metaeuk easy-predict ${FASTA} ${DB} ${OUTPUT_DIRECTORY} ${TMP} where FASTA is a genome assembly FASTA file and DB is an MMseqs2 protein database, does the input FASTA file get converted to an MMseqs2 database in the backend?

@milot-mirdita
Member

The tmp directory should be on a local NVMe or SSD for performance reasons. I don't think that it's a good idea for performance to put them on S3. I believe that AWS provides either bare metal or network attached versions (EBS??) of either that are fast. The contents of the tmp directory can be thrown away after the run (actually metaeuk should be mostly cleaning up after itself unless you specify --remove-tmp-files 0).

I don't think EFS makes sense for this, as the tmp contents should not become overly large.

Your unrelated question is actually another reason why I don't really want to implement this. We convert FASTA files into internal formats and if possible do this through symlinks to not actually use additional storage space. Also some of the intermediate databases that get created are also symlinks to other intermediate databases.

So avoiding symlinks is not quite trivial in the current implementation.
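A minimal sketch of the setup suggested in this comment: keep the tmp directory on local disk, let it be discarded after the run, and write only the final output elsewhere. The helper names and the `scratch_root` parameter are hypothetical. The command layout follows the `metaeuk easy-predict` call quoted in the question, and `--remove-tmp-files 1` is assumed to be the counterpart of the `--remove-tmp-files 0` setting mentioned above.

```python
import subprocess
import tempfile


def build_easy_predict_command(fasta, db, output_prefix, tmp_dir):
    """Assemble the argument list for `metaeuk easy-predict` (illustrative)."""
    return [
        "metaeuk", "easy-predict",
        str(fasta), str(db), str(output_prefix), str(tmp_dir),
        # Assumption: 1 asks MetaEuk to clean up its own tmp files.
        "--remove-tmp-files", "1",
    ]


def run_with_local_scratch(fasta, db, output_prefix,
                           scratch_root=None, runner=subprocess.run):
    """Run MetaEuk with a throwaway tmp directory on local storage.

    `scratch_root` is a hypothetical path to a local NVMe/SSD mount;
    None falls back to the system default temporary location.
    """
    with tempfile.TemporaryDirectory(dir=scratch_root,
                                     prefix="metaeuk_tmp_") as tmp_dir:
        command = build_easy_predict_command(fasta, db, output_prefix, tmp_dir)
        runner(command, check=True)
    # The tmp directory is deleted here, whatever MetaEuk left in it.
    return command
```

Only the files under `output_prefix` then need to be copied to S3 (for example with the AWS CLI), which avoids placing symlinks on the bucket.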

@jolespin
Author

jolespin commented Oct 1, 2024

> The tmp directory should be on a local NVMe or SSD for performance reasons. I don't think that it's a good idea for performance to put them on S3.

There's an AWS service called S3 Express One Zone that is high-performance and low-latency, which makes it good for scratch space (at least better than standard S3, which is very slow when mounted).

> I believe that AWS provides either bare metal or network attached versions (EBS??) of either that are fast.

Yea EBS is great but it's a bit tricky to do at scale because (from my understanding) you allocate it to a specific EC2 instance so you can't really run multiple jobs with it.

> The contents of the tmp directory can be thrown away after the run (actually metaeuk should be mostly cleaning up after itself unless you specify --remove-tmp-files 0).

I saw this parameter and it's great, especially if one were to try to use this with S3 Express One Zone scratch space.

> I don't think EFS makes sense for this, as the tmp contents should not become overly large.

At least from my experience, it depends on the scale. I tried gene predictions on ~200 genomes or so and the temporary directory exceeded 1.5TB. For instance, I'm running it right now on GCA_900893395.1 against MicroEuk50, which has 30M proteins, and it's currently at 57G in temporary space with --split-memory-limit 24G.

$ du -sh metaeuk_testing/MicroEuk50/
57G     metaeuk_testing/MicroEuk50/

If I use --split-memory-limit 12G it's even more at 206GB (and growing):

$ du -sh metaeuk_tmp
206G    metaeuk_tmp
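For tracking tmp growth from inside a Python workflow, the following is a rough sketch of a `du`-like measurement. It sums the apparent sizes of regular files and skips symlinks, so data that MetaEuk links rather than copies is not counted twice. The numbers can differ from `du -sh`, which reports allocated disk blocks. Both function names are hypothetical.

```python
import os
import stat


def directory_size_bytes(root: str) -> int:
    """Sum the apparent size of regular files under `root`, skipping symlinks."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for current, _dirs, files in os.walk(root):
        for name in files:
            try:
                info = os.lstat(os.path.join(current, name))
            except OSError:
                continue  # file vanished while the job was running
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total


def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units, similar to `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024
```

Calling `human_readable(directory_size_bytes("metaeuk_tmp"))` periodically gives a simple way to log the footprint of a running job.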

EFS is wildly expensive to run analyses on. Right now, my only affordable option is EBS, but that's better suited to one-shot analyses and not as easily reproducible for workflows or use at scale (I need to estimate the storage and memory footprints for all the jobs, create an EC2 instance with those specs, then use GNU parallel to run the jobs at the same time, hoping it doesn't crash the VM). It would be great if I could deploy jobs and set the temporary directory to S3 Express One Zone. Here is some of the pricing for reference.

> Your unrelated question is actually another reason why I don't really want to implement this. We convert FASTA files into internal formats and if possible do this through symlinks to not actually use additional storage space. Also some of the intermediate databases that get created are also symlinks to other intermediate databases.

Hmm... yea I can see how this will be tricky to adapt. I've had a great experience with MetaEuk so far and would love to continue using it at scale for larger projects. I guess this GitHub issue can serve more as an example of potential limitations for using the tool at scale, to consider during further development, than as a bona fide feature request.

Alternatively, have you had any success with miniprot by any chance? I'm seeing it used more and more, but the number of genes I get out is orders of magnitude higher (>100k genes) compared to MetaEuk (with the clustered Microeukaryotic Protein Database I made in the VEBA 2.0 publication, Table 2, for use with MetaEuk).

If I knew C I would definitely offer some pull requests, but unfortunately I only know Python at a production level. It's on my ever-growing to-do list, though! Regardless, I appreciate the insight and your time for the responses (also the amazing tool you developed, which has allowed me to do much more robust climate change and public health research).

@elileka
Member

elileka commented Oct 3, 2024

Hi Josh,

I'll let @milot-mirdita continue the discussion about S3, but here are a couple of relevant points:

  • The size of the TMP folder is very much affected by the size of your reference database because MetaEuk's gene calls are saved there, so there is potentially a lot of redundancy. I don't know what reference DB you are using, but it may be worth clustering it or making profiles. Even a conservative clustering may get rid of quite a lot of redundancy.
  • In addition, you can always split your input genomes to smaller batches or even run contigs/scaffolds of each genome separately. This would allow you to use normal ephemeral instance storage or EBS and avoid the S3 workaround.
  • Concerning Miniprot, we don't have first-hand experience with it. I know that the BUSCO team has made it the default gene predictor, but they also mention some of its caveats compared to MetaEuk: "may underperform for highly divergent assemblies" (I saw this comment also on the Miniprot GitHub page), "Metaeuk uses less memory than Miniprot.".
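The batching suggestion in the second point can be sketched in Python as follows. This is an illustration under simple assumptions (plain-text FASTA, one `>` header per record). The function names and the `batch_00000.fasta` naming scheme are hypothetical. Setting `records_per_batch=1` writes each contig or scaffold to its own file.

```python
from pathlib import Path


def iter_fasta_records(path):
    """Yield (header_line, sequence_text) pairs from a plain-text FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(lines)
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, "".join(lines)


def split_fasta(path, out_dir, records_per_batch=1):
    """Write consecutive groups of records to separate FASTA files."""
    if records_per_batch < 1:
        raise ValueError("records_per_batch must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    batch_paths, handle = [], None
    try:
        for index, (header, sequence) in enumerate(iter_fasta_records(path)):
            if index % records_per_batch == 0:
                # Start a new batch file every `records_per_batch` records.
                if handle is not None:
                    handle.close()
                batch_path = out_dir / f"batch_{index // records_per_batch:05d}.fasta"
                handle = open(batch_path, "w")
                batch_paths.append(batch_path)
            handle.write(header)
            handle.write(sequence)
    finally:
        if handle is not None:
            handle.close()
    return batch_paths
```

Each resulting file can then be passed to a separate `metaeuk easy-predict` run with its own small tmp directory on ephemeral instance storage or EBS.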

@jolespin
Author

jolespin commented Oct 3, 2024

I really appreciate the insight on this. Right now the database I'm using is clustered (similar to UniRef, where it's clustered at 100%, 90%, and 50%), and the database I'm testing has around 30M proteins. I'm working on some methods to do more targeted iterative gene predictions with MetaEuk (casting a wide net of markers, then building a smaller, more targeted set from source organisms), but it's early in development and I need to benchmark against ground truths.

If I have any developments, I will add them here in case it's helpful.


3 participants