Storing big files (issue-collecting issue) #53

charlesreid1 · 2018-02-14T23:17:10Z

There are currently several open issues about large files and a few ideas, so I thought I would create a single issue to collect references to these issues, summarize possible solutions, and provide links to some resources.

Open issues

Issue "Kaiju database is corrupted or not in .tgz format" (#36) began as a problem downloading a file and evolved into concerns about hosting large files and download times.
Issue "large files present in git repository" (#29) covers large files in the repository and why we should avoid adding or changing any large files in the git repo itself.
Issues "where do we put big databases?" (#2) and "Determine where to host datasets" (#8) are open questions about where to host files.

(Any objections to closing these issues?)

Solutions

A few proposed solutions:

The Open Science Framework (OSF) provides hosting for large files for scientific projects. It has a web interface, but direct file URLs can also be obtained, so files can be downloaded with wget or curl. We currently have several files hosted on OSF.
Synapse.org also provides hosting of data sets for open science and can provide DOI numbers for resources.
Amazon S3 has several storage classes with different pricing. There's a monthly price calculator tool.
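
Since hosted files can be fetched by URL, and issue #36 started with a download that was corrupted or not in .tgz format, a download step could check the archive before anything uses it. Below is a minimal sketch using only the Python standard library. The function names are hypothetical and not part of this project, and the URL passed to `download` would be whatever the hosting provider gives us.

```python
import hashlib
import shutil
import tarfile
import urllib.request
import zlib


def download(url, dest):
    """Stream the file at `url` to the local path `dest` (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_tgz(path):
    """Return True if `path` opens and lists as a gzipped tar archive."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for _ in archive:  # walk every member header in the archive
                pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

If we publish a checksum next to each hosted file, comparing it against `sha256_of(path)` after download would catch truncated or altered files even when the archive still opens.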

Topics for Discussion

Can we pin numbers on our requirements to get a sense of how much this might cost? (If the budget is zero, that's useful to know too!)

What is the size/number of data sets we might realistically end up hosting?
What is the expiration date for this data?
How many users do we expect to download the public data sets?
What constraints do the users have (e.g., is a slower connection acceptable in return for lower cost, or does the data need to be available fast and reliably)?
Is there a preference (in terms of ease of disbursing funds or space-wise) between a physical server with storage and cloud storage?
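
Once we have rough answers to the questions above, a back-of-the-envelope estimate is simple arithmetic: storage cost plus outbound transfer cost. The sketch below is only an illustration. The function name and parameters are hypothetical, and the per-GB prices are inputs the caller must look up (for example with the provider's price calculator); no real rates are assumed here.

```python
def estimate_monthly_cost(stored_gb, downloads_per_month, avg_download_gb,
                          storage_price_per_gb, egress_price_per_gb):
    """Rough monthly cost: storage plus outbound transfer.

    All prices are caller-supplied placeholders, not real provider rates.
    """
    storage_cost = stored_gb * storage_price_per_gb
    egress_cost = downloads_per_month * avg_download_gb * egress_price_per_gb
    return storage_cost + egress_cost
```

Plugging in a few scenarios (few large data sets versus many small ones, tens versus hundreds of downloads) should show quickly whether storage or download traffic dominates the bill.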

There are some cloud workflows for avoiding large downloads as well, depending on the constraints and where we want to dedicate time. These would definitely be useful in the context of testing.

Can we develop workflows for how to use cloud storage drives to store/share databases? (Seems useful to users of both open and private databases.)
Example: develop a workflow that builds a dataset from databases (public or private) and saves it as a cloud drive image/snapshot; then develop further workflows that expect that data to be local, which is accomplished by mounting the image/snapshot via network file storage or some other mechanism.
How can we leverage the cloud provider's network? For example, mounting an S3 bucket via network storage still has slow transfer rates, but the transfer happens entirely inside Amazon's network (probably faster than downloading from outside, and definitely cheaper).
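
One way to let workflows "expect that data to be local" regardless of how it got there is a small lookup step: use the mounted drive if the database is present, otherwise fall back to downloading into a cache. This is a sketch under assumptions, not an existing part of the project; `resolve_database`, `mount_root`, `cache_dir`, and the `fetch` callback are all hypothetical names.

```python
from pathlib import Path


def resolve_database(name, mount_root, cache_dir, fetch):
    """Return a local path for database `name` (hypothetical helper).

    Prefer a copy under `mount_root` (e.g. a mounted cloud drive or
    snapshot); otherwise call `fetch(name, destination)` to download
    it into `cache_dir`, reusing any copy that is already cached.
    """
    mounted = Path(mount_root) / name
    if mounted.exists():
        return mounted

    cached = Path(cache_dir) / name
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        fetch(name, cached)
    return cached
```

With this shape, the same workflow could run on a cloud node with the snapshot mounted and on a laptop that downloads the file once, which seems useful for testing.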

This was referenced Mar 12, 2018

Kaiju database is corrupted or not in .tgz format #36

Closed

large files present in git repository #29

Closed

Determine where to host datasets #8

Closed

charlesreid1 mentioned this issue Mar 23, 2018

Reducing cloning time #65

Closed

brooksph added the Strategy/Planning label Apr 19, 2018
