IDEA: upload remote files #666

Open
agy-why opened this issue Aug 29, 2022 · 5 comments

@agy-why

agy-why commented Aug 29, 2022

Dear developers,

I have a question regarding the reana-client upload feature.

Is it already possible, or is it planned, to execute something like:

reana-client upload s3://.../my-s3-data/

And if yes, which services do you currently support: scp, ftp, s3, Google Cloud, WebDAV, ...?

I thank you in advance,

Yori

@agy-why agy-why changed the title upload remote files IDEA: upload remote files Aug 30, 2022
@agy-why
Author

agy-why commented Aug 31, 2022

I realized this may not be the proper place to ask. Shall I open this issue on reana-client instead?

@tiborsimko
Member

Hi @agy-why, this repository is a perfect location for this issue; there is no need to move it.

Currently, we don't support remote storage services in the way suggested above. What is possible is for researchers to express their remote file access needs via special stage-in and stage-out steps in their computational workflow graphs. That is, the first step of the workflow would download the inputs from S3, and the last step would upload the results back to S3. For a live example, please see the EOS stage-out example in the documentation: https://docs.reana.io/advanced-usage/storage-backends/eos/
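To give a flavour, here is a minimal sketch of such a stage-in/analyse/stage-out pattern using a serial workflow. The bucket, file names, and container images are purely illustrative; the images are assumed to ship the respective clients, and the S3 credentials would need to be made available to the jobs (e.g. as secrets):

workflow:
  type: serial
  specification:
    steps:
      # Stage-in: download the inputs from S3 into the workspace.
      - name: stage-in
        environment: 'amazon/aws-cli'
        commands:
          - aws s3 cp s3://mybucket/myfile.csv ./myfile.csv
      # Analysis: work on the staged files in the workspace.
      - name: analyse
        environment: 'python:3.10'
        commands:
          - python analyse.py --input myfile.csv --output results.csv
      # Stage-out: upload the results back to S3.
      - name: stage-out
        environment: 'amazon/aws-cli'
        commands:
          - aws s3 cp ./results.csv s3://mybucket/results.csv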

We support virtually any external storage system where we can use Kerberos authentication or VOMS proxy authentication mechanisms. Examples include EOS or WLCG sites. Note also that we are in the middle of adding support for Rucio, see reanahub/reana-auth-rucio#1
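For instance, a Kerberos-authenticated EOS stage-out step could look roughly as follows (paths and image are illustrative; the Kerberos keytab would be registered beforehand as REANA secrets):

workflow:
  type: serial
  specification:
    steps:
      # Stage-out to EOS; `kerberos: true` makes the registered
      # credentials available inside the job container.
      - name: stage-out
        environment: 'reanahub/reana-env-root6'
        kerberos: true
        commands:
          - xrdcp results.root root://eosuser.cern.ch//eos/user/j/johndoe/results.root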

That said, we have been planning to support remote file syntax sugar in a way rather similar to what you suggested. We have thought of allowing a syntax like:

inputs:
  files:
    - s3(`mybucket`, `myfile.csv`)

REANA would then do an automatic stage-in and stage-out for this file. One advantage is that researchers wouldn't have to write explicit data staging steps in their DAG workflows.

This is a bit similar to Snakemake support for remote storage, see https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html and the examples therein for AWS or S3 in Snakemake rules.

We hope to start working on similar syntax sugar for remote file storage sometime this winter.

@tiborsimko
Member

P.S. Another related idea we have been thinking about is adding support for popular protocols so that the REANA workspace could be manipulated via tools such as rclone. This might simplify the initial stage-in upload and the final stage-out download, especially when using many files or very large files.
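In such a scenario, one could imagine configuring the REANA workspace as an rclone remote and driving it with the standard rclone commands. This is a purely hypothetical sketch, since no such protocol endpoint exists yet (the reana: remote below is invented):

# Hypothetical: 'reana:' configured as an rclone remote for a REANA workspace.
rclone copy ./inputs/ reana:myanalysis/inputs/       # initial stage-in upload
rclone copy reana:myanalysis/results/ ./results/     # final stage-out download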

@agy-why
Author

agy-why commented Sep 5, 2022

Dear Tibor,

thank you for your clear and detailed response.

My personal use-case would be to have a single workflow that could work with various data origins: my dev data are on a server that I can access via scp, and my prod data are on a private S3 infrastructure, but they may move to another one (not necessarily S3) after publication of the results.

Therefore I would find it useful to be able to specify not only the source but also the protocol for accessing the data outside of the workflow (git repo).

Currently, I need to implement two variants of my first step (get_data) to get the data into the workspace, and I choose between them via input parameters (sketched at the end of this comment). It is fully acceptable that way, but I'd greatly appreciate the rclone feature you suggested.

This would allow me to plug input data in and out of the same workflow by populating my workspace accordingly.
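For reference, a sketch of my current parameter-switched setup (names, images, and paths are illustrative):

inputs:
  parameters:
    data_source: scp   # or: s3
workflow:
  type: serial
  specification:
    steps:
      # get_data: pick the staging variant via the input parameter.
      - name: get_data
        environment: 'myorg/my-staging-image'
        commands:
          - |
            if [ "${data_source}" = "s3" ]; then
              aws s3 cp s3://mybucket/data.csv ./data.csv
            else
              scp user@devserver:/data/data.csv ./data.csv
            fi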

@agy-why
Author

agy-why commented Sep 5, 2022

An alternative would be the ability to mix workflows together; I don't know how far this is possible.

I have:

  • a git-repo with workflow get_data_from_scp
  • a git-repo with workflow get_data_from_s3
  • a git-repo with workflow analyse_data

It would be an acceptable solution for me to be able to propagate the workspace of one of the get_data workflows to the analyse_data workspace, or to create a new workflow from, say, get_data_from_scp + analyse_data.
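For instance, something along these lines, bridging the two workspaces manually with the existing client commands (file and workflow names are mine):

reana-client download data.csv -w get_data_from_scp
reana-client upload data.csv -w analyse_data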

Is this already possible?

I thank you in advance.
