Storing big files (issue-collecting issue) #53

Open
charlesreid1 opened this issue Feb 14, 2018 · 0 comments
charlesreid1 commented Feb 14, 2018

There are currently several open issues about large files and a few ideas, so I thought I would create a single issue to collect references to these issues, summarize possible solutions, and provide links to some resources.

Open issues

(Any objections to closing these issues?)

Solutions

A few proposed solutions:

  • The Open Science Framework (OSF) provides hosting for large files for scientific projects. They have a web interface, but each file also has a direct URL, so files can be downloaded via wget or curl. We currently have several files hosted by OSF.
  • Synapse.org also provides hosting of data sets for open science and can provide DOI numbers for resources.
  • Amazon S3 has several storage classes with different pricing. There's a monthly price calculator tool.
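
Since OSF files can be fetched by URL with wget or curl, the same thing can be scripted. Here is a minimal sketch using only the Python standard library; the function name `fetch_large_file` and the `.part` temporary-file convention are my own assumptions, not anything OSF prescribes. It streams the download in chunks so a large file never has to fit in memory, and skips the download if the file is already present.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_large_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest` in chunks; reuse `dest` if it already exists.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest)
    if dest.exists():
        # Already downloaded: avoid re-fetching a large file.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    partial.replace(dest)
    return dest


# Usage (the URL is a placeholder, not a real hosted file):
# fetch_large_file("https://example.org/path/to/dataset.bin", Path("data/dataset.bin"))
```

This does not verify checksums or resume partial downloads; both would be worth adding if we settle on this approach.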

Topics for Discussion

Can we pin numbers on our requirements to get a sense of how much this might cost? (If the budget is zero, that's useful to know too!)

  • What is the size/number of data sets we might realistically end up hosting?
  • What is the expiration date for this data?
  • How many users do we expect to download the public data sets?
  • What constraints do the users have (e.g., is a slower connection in return for lower cost acceptable, or does the data need to be available fast and reliably)?
  • Is there a preference (in terms of ease of disbursing funds or space-wise) between a physical server with storage and cloud storage?
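
Once we have answers to the questions above, a back-of-the-envelope estimate is straightforward. The sketch below is only an illustration: the function name is hypothetical, and the per-GB prices are inputs that have to be looked up from the provider's current pricing (the numbers in the usage example are made-up placeholders, not real quotes). It assumes a simple model with two cost components: storage and outbound transfer.

```python
def estimate_monthly_cost(
    stored_gb: float,
    downloads_per_month: float,
    avg_download_gb: float,
    storage_price_per_gb: float,
    egress_price_per_gb: float,
) -> float:
    """Rough monthly cost: storage plus outbound data transfer.

    Ignores request fees, free tiers, and tiered pricing (assumed simplification).
    """
    storage_cost = stored_gb * storage_price_per_gb
    egress_cost = downloads_per_month * avg_download_gb * egress_price_per_gb
    return storage_cost + egress_cost


# Placeholder inputs: 100 GB stored, 10 downloads a month of 2 GB each,
# with made-up prices of 0.5 per GB stored and 0.25 per GB transferred.
print(estimate_monthly_cost(100, 10, 2, 0.5, 0.25))  # prints 55.0
```

Even with rough inputs, this shows whether storage or downloads dominate, which affects which storage class or host makes sense.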

There are some cloud workflows for avoiding large downloads as well, depending on the constraints and where we want to dedicate time. These would definitely be useful in the context of testing.

  • Can we develop workflows for how to use cloud storage drives to store/share databases? (Seems useful to users of both open and private databases.)
  • Example: develop a workflow to create a cloud drive image/snapshot containing a dataset built from databases (public or private); then develop further workflows that expect that data to be local, which is accomplished by mounting that cloud drive image/snapshot via network file storage or some other mechanism.
  • How can we leverage the cloud provider's network? For example, mounting an S3 bucket via network storage still has slow transfer rates, but the transfer happens entirely inside Amazon's network (probably faster, and definitely cheaper).
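
To make the "workflows that expect the data to be local" idea concrete, here is a small sketch of how a test or workflow might locate a dataset without caring whether it sits on a mounted cloud drive or in a local directory. The function name and the example directory names are hypothetical; the only assumption is that a mounted snapshot shows up as an ordinary directory.

```python
from pathlib import Path
from typing import Iterable


def locate_dataset(name: str, roots: Iterable[Path]) -> Path:
    """Return the first existing path `root / name` among the candidate roots.

    Roots are checked in order, so list a mounted cloud drive first and a
    local cache second (or the reverse) depending on which should win.
    """
    checked = []
    for root in roots:
        candidate = Path(root) / name
        if candidate.exists():
            return candidate
        checked.append(str(candidate))
    raise FileNotFoundError(f"dataset {name!r} not found; looked in: {checked}")


# Usage (both directories are placeholders):
# path = locate_dataset("example.db", [Path("/mnt/cloud-data"), Path("./data")])
```

Downstream code then takes a plain path, so the same workflow runs unchanged on a laptop with a downloaded copy and on a cloud instance with the snapshot mounted.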