Merge pull request #33 from monarch-initiative/develop
update deps and readme
glass-ships committed Apr 24, 2024
2 parents 675b879 + 712e360 commit 390c894
Showing 3 changed files with 536 additions and 255 deletions.
61 changes: 34 additions & 27 deletions docs/index.md
@@ -1,6 +1,6 @@
# KG-Hub Downloader

| [Documentation](https://monarch-initiative.github.io/kghub-downloader) |
| [Documentation](https://monarch-initiative.github.io/kghub-downloader) | [Repository](https://github.com/monarch-initiative/kghub-downloader) | [PyPI](https://pypi.org/project/kghub-downloader) |

### Overview

@@ -9,17 +9,19 @@ This is a configuration based file caching downloader with initial support for h
### Installation

KGHub Downloader is available to install via pip:

```
pip install kghub-downloader
```

### Configure
### Configure

The downloader requires a YAML file which contains a list of target URLs to download, and local names to save those downloads.
For an example, see [example/download.yaml](example/download.yaml)

Available options are:
- \***url**: The URL to download from. Currently supported:

- \***url**: The URL to download from. Currently supported:
- `http(s)`
- `ftp`
- with the `glob:` option to download files with specific extensions (currently FTP only; the search is recursive).
@@ -29,29 +31,33 @@ Available options are:
- **local_name**: The name to save the file as locally
- **tag**: A tag to use to filter downloads
- **api**: The API to use to download the file. Currently supported: `elasticsearch`
- elastic search options
- elastic search options
- **query_file**: The file containing the query to run against the index
- **index**: The elastic search index for query

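The `tag` option above can be pictured as a simple filter over the parsed YAML entries. The following is a minimal sketch, not the downloader's actual implementation: the helper name `select_by_tags` and the sample entries are hypothetical, and it assumes that entries without a matching `tag` are skipped whenever tags are requested.

```python
from typing import Iterable, Optional


def select_by_tags(entries: Iterable[dict], tags: Optional[Iterable[str]] = None) -> list:
    """Return the entries whose `tag` is one of `tags`.

    Hypothetical helper: when no tags are requested, every entry is kept.
    """
    if not tags:
        return list(entries)
    wanted = set(tags)
    return [entry for entry in entries if entry.get("tag") in wanted]


# Sample entries shaped like the items of a download.yaml file (made-up URLs).
entries = [
    {"url": "https://example.com/a.tsv", "local_name": "a.tsv", "tag": "zfin_gene_to_phenotype"},
    {"url": "https://example.com/b.tsv", "local_name": "b.tsv", "tag": "other"},
    {"url": "https://example.com/c.tsv", "local_name": "c.tsv"},
]

selected = select_by_tags(entries, ["zfin_gene_to_phenotype"])
print([entry["local_name"] for entry in selected])
```
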
> \* Note:
> Google Cloud Storage URLs require that you have set up your credentials as described [here](https://cloud.google.com/artifact-registry/docs/python/authentication#keyring-user). You must:
> - [create a service account](https://cloud.google.com/iam/docs/service-accounts-create)
> - [add the service account to the relevant bucket](https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-iam) and
> Google Cloud Storage URLs require that you have set up your credentials as described [here](https://cloud.google.com/artifact-registry/docs/python/authentication#keyring-user). You must:
>
> - [create a service account](https://cloud.google.com/iam/docs/service-accounts-create)
> - [add the service account to the relevant bucket](https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-iam) and
> - [download a JSON key](https://cloud.google.com/iam/docs/keys-create-delete) for that service account.
> Then, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to that file.
> Then, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to that file.
>
> Mirroring local files to an Amazon AWS S3 bucket requires the following:
> - [Create an AWS account](https://portal.aws.amazon.com/)
> - [Create an IAM user in AWS](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started.html): This enables getting the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` needed for authentication. These two should be stored as environment variables in the user's system.
> - [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html): This will be the destination for pushing local files.
>
> - [Create an AWS account](https://portal.aws.amazon.com/)
> - [Create an IAM user in AWS](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started.html): This enables getting the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` needed for authentication. These two should be stored as environment variables in the user's system.
> - [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html): This will be the destination for pushing local files.
You can also include any secrets like API keys you have set as environment variables using `{VARIABLE_NAME}`, for example:

You can also include any secrets like API keys you have set as environment variables using `{VARIABLE_NAME}`, for example:
```yaml
---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json
```
Note: `YOUR_SECRET` *MUST* be set as an environment variable, and be sure to include the {curly braces} in the url string.
Note: `YOUR_SECRET` _MUST_ be set as an environment variable, and be sure to include the {curly braces} in the url string.
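
To illustrate the `{VARIABLE_NAME}` convention described above, here is a minimal sketch of how such placeholders could be expanded from environment variables. It is an assumption about the behaviour, not the downloader's own code: `expand_placeholders` and `PLACEHOLDER` are hypothetical names, and raising an error for an unset variable is a design choice made for this example.

```python
import os
import re

# Matches {NAME} where NAME looks like an environment variable name.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(url, env=None):
    """Replace each {NAME} in `url` with the value of NAME from `env`.

    `env` defaults to os.environ; an unset variable raises KeyError so that
    a missing secret fails loudly instead of producing a broken URL.
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name!r} is not set")
        return env[name]

    return PLACEHOLDER.sub(substitute, url)


# Usage with an explicit mapping instead of the real environment.
print(expand_placeholders(
    "https://example.com/myfancyfile.json?key={YOUR_SECRET}",
    {"YOUR_SECRET": "abc123"},
))
```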

### Usage

@@ -67,30 +73,31 @@ download_from_yaml(yaml_file="download.yaml", output_dir="data")

#### Command Line

To download files listed in a download.yaml file:

```bash
$ downloader [OPTIONS] ARGS
```
╰ Download files listed in a download.yaml file

| OPTIONS | |
| --- | --- |
| yaml_file | A string pointing to the download.yaml file, to be parsed for things to download.<br>Defaults to `./download.yaml` |
| ignore_cache | Ignore cache and download files even if they exist [false] |
| snippet_only | Downloads only the first 5 kB of each uncompressed source, for testing and file checks |
| tags | Limit to only downloads with this tag |
| mirror | Remote storage URL to mirror download to. Supported buckets: Google Cloud Storage |
| OPTIONS | |
| ------------ | ----------------------------------------------------------------------------------------------------- |
| yaml_file | Path to the download.yaml file, to be parsed for things to download.<br>Defaults to `./download.yaml` |
| ignore_cache | Ignore cache and download files even if they exist (Default `False`) |
| snippet_only | Downloads only the first 5 kB of each uncompressed source, for testing and file checks |
| tags | Limit to only downloads with this tag |
| mirror | Remote storage URL to upload downloaded files to.<br/>Supported buckets: Google Cloud Storage |


| ARGUMENTS | |
| --- | --- |
| output_dir | A string pointing to where to write out downloaded files. |
| ARGUMENTS | |
| ---------- | ------------------------------- |
| output_dir | Where to save downloaded files. |
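
The `snippet_only` option in the table above keeps only the first 5 kB of each source. The sketch below shows one way such a truncated copy could work. It is illustrative only: `copy_snippet` and `SNIPPET_BYTES` are hypothetical names, and treating 5 kB as 5 * 1024 bytes is an assumption, since the documentation does not state the exact byte count.

```python
import io

# Assumption: "5 kB" is taken to mean 5 * 1024 bytes.
SNIPPET_BYTES = 5 * 1024


def copy_snippet(source, dest, limit=SNIPPET_BYTES, chunk_size=1024):
    """Copy at most `limit` bytes from `source` to `dest`; return bytes written."""
    written = 0
    while written < limit:
        # Never request more than the remaining budget.
        chunk = source.read(min(chunk_size, limit - written))
        if not chunk:  # source exhausted before reaching the limit
            break
        dest.write(chunk)
        written += len(chunk)
    return written


# Usage with in-memory streams standing in for a download and a local file.
source = io.BytesIO(b"x" * 10_000)
dest = io.BytesIO()
print(copy_snippet(source, dest))
```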

Examples:

```bash
$ downloader --output_dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output_dir example_output --mirror gs://your-bucket/desired/directory
# Note that if your YAML file is named `download.yaml`,
# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output
```
@@ -111,4 +118,4 @@ poetry install
poetry run pytest
```

NOTE: The tests require gcloud credentials to be set up as described above, using the monarch github actions service account.
NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch github actions service account.