
Update download_utils.py #26

Closed
wants to merge 4 commits

Conversation

@bsantan (Collaborator) commented Feb 5, 2024

No description provided.

@caufieldjh (Member)

Thanks @bsantan!
Is this still backwards compatible with older download.yaml configs? I ask because the beginning of this block:

```python
# Download file
if "local_name" in item and item["local_name"] != "uniprot_genome_features":
```

makes it look like a local_name is mandatory. That's usually present, but what happens if there's just a url provided?
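To illustrate the backwards-compatibility concern, a guard that tolerates url-only entries could use `dict.get` instead of membership plus indexing. This is a hypothetical sketch, not the PR's actual code; the helper name `is_plain_download` is invented, and only the field names come from the snippet above:

```python
# Hypothetical sketch: route download.yaml items that may or may not
# carry a local_name. "uniprot_genome_features" is the special case
# this PR introduces; everything else takes the generic download path.

def is_plain_download(item: dict) -> bool:
    """Return True for items handled by the generic download path."""
    # .get() returns None for url-only entries, so older configs
    # without a local_name still fall through to the generic path.
    return item.get("local_name") != "uniprot_genome_features"

print(is_plain_download({"url": "https://example.org/data.tsv"}))    # True
print(is_plain_download({"local_name": "uniprot_genome_features"}))  # False
```

With `"local_name" in item and ...`, a url-only entry evaluates to False and would be skipped entirely, which is the behavior change being questioned here.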

@caufieldjh (Member)

Note, as per the KG construction call today: it may be helpful to split the source-specific UniProt material into its own submodule to be called by download_utils.py.

@hrshdhgd hrshdhgd marked this pull request as draft February 5, 2024 20:09
@bsantan (Collaborator, Author) commented Feb 5, 2024

@caufieldjh this is meant to check both that local_name exists and that it is specific to the uniprot case I added. Note that after discussing with Harshad and Marcin, we think a better design may be to add this as a separate step in kg-microbe (perhaps as a GitHub workflow) that runs regularly and downloads all organism.json files to an S3 bucket, from which the kg download step would pull. This would remove some of the changes that also affect the download_from_yaml function here.
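The scheduled-workflow idea described above could be sketched roughly as follows. This is a hypothetical outline only: the workflow name, cron schedule, script path, bucket name, and secret names are all placeholders, not anything from this repository:

```yaml
# Hypothetical sketch of a scheduled cache-refresh workflow.
name: refresh-uniprot-cache
on:
  schedule:
    - cron: "0 4 * * 0"  # e.g. weekly; cadence is a placeholder
jobs:
  fetch-and-upload:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download organism.json files from UniProt
        run: python scripts/fetch_uniprot_organisms.py --out data/  # script path is hypothetical
      - name: Sync to S3
        run: aws s3 sync data/ s3://example-kg-microbe-cache/uniprot/  # bucket is a placeholder
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

The kg download step would then fetch from the bucket rather than hitting the UniProt API directly.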

@hrshdhgd (Member) commented Feb 5, 2024

Putting a pin in this PR for now. @caufieldjh, Marcin will reach out to you in the next few days to discuss a strategy involving S3 buckets.

In short, we were thinking of running the UniProt API to get JSON files (only the ones we need) and populating an S3 bucket, then adding this S3 bucket URL to the download.yaml file as a source and downloading the JSON files that way.

  • Would this be possible?
    • Can we use a GH Workflow, or will we need Jenkins?
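A download.yaml entry pointing at such a bucket might look like the sketch below. The bucket host and file names are placeholders, not real endpoints; the `url`/`local_name` keys mirror the fields discussed earlier in this thread:

```yaml
# Hypothetical download.yaml entry sourcing the pre-populated S3 cache.
- url: https://example-kg-microbe-cache.s3.amazonaws.com/uniprot/organisms.tar.gz
  local_name: uniprot_organisms.tar.gz
```

This keeps the kg download step on the generic code path, since the S3 object is fetched like any other URL source.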

@caufieldjh (Member)

Sure, that would be great functionality to have, and it's possible with GH Actions as long as the workflow has access to the right credentials.
There are plenty of examples of pushing a build artifact to S3, of course. In this case we'd prefer to go directly from UniProt to S3, and I don't think that's quite doable, but it is absolutely possible to stream the download from the Actions instance to S3.
Jenkins would help here too: you could download the files to the build server and cache them to S3. That might be faster as well.
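The "stream the download to S3" hand-off can be sketched as a chunked copy between file-like objects. This is a stdlib-only illustration, not the project's code: in practice the source would be an HTTP response opened with `requests.get(url, stream=True)` and the destination would go through boto3's `upload_fileobj`, both of which work with file-like objects like these; `stream_copy` is a hypothetical helper name:

```python
import io

def stream_copy(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy src to dst in fixed-size chunks.

    The full payload never sits in memory at once, which is the point
    of streaming from the Actions runner to S3 instead of downloading
    the whole file first.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # read() returns b"" at end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# Stand-ins: an in-memory "download" copied to an in-memory "upload".
src = io.BytesIO(b"x" * 100)
dst = io.BytesIO()
print(stream_copy(src, dst, chunk_size=32))  # 100
```

The chunk size only bounds peak memory; the copy logic is the same whatever the endpoints are.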

@hrshdhgd (Member) commented Feb 5, 2024

Excellent! We'll need your guidance on this, @caufieldjh! Are there any examples already in the kg-hub universe?

@caufieldjh (Member)

In terms of pushing artifacts to S3 during the Jenkins build, technically all (or most) of the builds already do that; they just do it through s3cmd.

@hrshdhgd (Member) commented Feb 6, 2024

Closed in favor of #27

@hrshdhgd hrshdhgd closed this Feb 6, 2024
3 participants