update documentation, update tests #25

Merged
merged 1 commit into from
Oct 10, 2023
58 changes: 41 additions & 17 deletions docs/index.md
@@ -15,29 +15,35 @@ pip install kghub-downloader

### Configure

#### Download Configuration

The downloader requires a YAML file which contains a list of target URLs to download, and local names to save those downloads.
The format for the file is:
```yaml
---
-
  url: "http://example.com/myawesomefile.tsv"
  local_name: myawesomefile.tsv
-
  url: "http://example.com/myokfile.json"
  local_name: myokfile.json

```
For an example, see [example/download.yaml](example/download.yaml)

Available options are:
- \***url**: The URL to download from. Currently supported:
  - `http(s)`
  - Google Cloud Storage (`gs://`)
  - Google Drive (`gdrive://` or `https://drive.google.com/...`). The file must be publicly accessible.
- **local_name**: The name to save the file as locally
- **tag**: A tag to use to filter downloads
- **api**: The API to use to download the file. Currently supported: `elasticsearch`
- Elasticsearch options:
  - **query_file**: The file containing the query to run against the index
  - **index**: The Elasticsearch index to query
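
As a rough illustration of how these options fit together, the Python sketch below classifies a `url` by the download methods listed above and filters entries by `tag`. The helper names `url_kind` and `select_by_tag` are hypothetical and are not part of kghub-downloader; the entries are assumed to have been parsed from the YAML file into a list of dictionaries already.

```python
from urllib.parse import urlparse


def url_kind(url):
    """Classify a URL by the download methods listed above (illustrative only)."""
    parsed = urlparse(url)
    if parsed.scheme == "gs":
        return "gcs"
    if parsed.scheme == "gdrive" or parsed.netloc == "drive.google.com":
        return "gdrive"
    if parsed.scheme in ("http", "https"):
        return "http"
    raise ValueError(f"unsupported URL: {url}")


def select_by_tag(entries, tags=None):
    """Keep entries whose `tag` is in `tags`; with no tags given, keep everything."""
    if not tags:
        return list(entries)
    return [entry for entry in entries if entry.get("tag") in tags]


# Entries as they might look after parsing a download configuration.
entries = [
    {"url": "https://zfin.org/downloads/phenoGeneCleanData_fish.txt",
     "local_name": "zfin/fish_phenotype.txt"},
    {"url": "gdrive:10ojJffrPSl12OMcu4gyx0fak2CNu6qOs",
     "local_name": "gdrive_test_1.txt", "tag": "testing"},
]

for entry in select_by_tag(entries, tags=["testing"]):
    print(entry["local_name"], url_kind(entry["url"]))
```

Treating a `drive.google.com` host as a Google Drive download, rather than a plain HTTPS one, is an assumption of this sketch based on the list above.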

> \* Note:
> Google Cloud Storage URLs require that you have set up your credentials as described [here](https://cloud.google.com/artifact-registry/docs/python/authentication#keyring-user). You must:
> - [create a service account](https://cloud.google.com/iam/docs/service-accounts-create)
> - [add the service account to the relevant bucket](https://cloud.google.com/storage/docs/access-control/using-iam-permissions#bucket-iam) and
> - [download a JSON key](https://cloud.google.com/iam/docs/keys-create-delete) for that service account.
> Then, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to that file.
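
Before starting a run that includes `gs://` URLs, it can help to confirm that the variable is set and points to an existing file. The following is a minimal sketch that assumes nothing about how kghub-downloader itself performs this check; `gcs_credentials_path` is a hypothetical helper.

```python
import os
from pathlib import Path


def gcs_credentials_path(env=None):
    """Return the key file path if GOOGLE_APPLICATION_CREDENTIALS is usable, else None."""
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        return None  # variable is unset or empty
    path = Path(value)
    # Only report the path when it points at an existing regular file.
    return path if path.is_file() else None


if gcs_credentials_path() is None:
    print("GOOGLE_APPLICATION_CREDENTIALS is not set or does not point to a file")
```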

You can also include any secrets like API keys you have set as environment variables using `{VARIABLE_NAME}`, for example:
```yaml
---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json
```
Note: `YOUR_SECRET` *MUST* be set as an environment variable, and the {curly braces} must be included in the URL string.
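
One way to picture the substitution: each `{VARIABLE_NAME}` placeholder in the URL is replaced by the value of the environment variable of that name. The sketch below is an assumption about the behavior described here, not the downloader's actual code; `fill_secrets` is a hypothetical name, and raising an error for a missing variable is a choice made for this sketch.

```python
import os
import re

# Matches {NAME}, where NAME looks like an environment variable name.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_secrets(url, env=None):
    """Replace each {NAME} placeholder in `url` with the value of env[NAME]."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, url)


# Passing an explicit mapping keeps the example independent of the real environment.
print(fill_secrets(
    "https://example.com/myfancyfile.json?key={YOUR_SECRET}",
    env={"YOUR_SECRET": "abc123"},
))
```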

### Usage

@@ -80,3 +86,21 @@ $ downloader --output_dir example_output --mirror gs://your-bucket/desired/direc
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output
```

### Development

#### Install

```bash
git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install
```

#### Run tests

```bash
poetry run pytest
```

NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.
26 changes: 12 additions & 14 deletions example/download.yaml
@@ -1,23 +1,21 @@
 ---
--
-  url: https://zfin.org/downloads/phenoGeneCleanData_fish.txt
+- url: https://zfin.org/downloads/phenoGeneCleanData_fish.txt
   local_name: zfin/fish_phenotype.txt
--
-  url: gs://monarch-test/kghub_downloader_test_file.yaml
+
+- url: gs://monarch-test/kghub_downloader_test_file.yaml
   local_name: test_file.yaml
+
+- url: gdrive:10ojJffrPSl12OMcu4gyx0fak2CNu6qOs
+  local_name: gdrive_test_1.txt
   tag: testing
-# -
+
+- url: https://drive.google.com/uc?id=10ojJffrPSl12OMcu4gyx0fak2CNu6qOs
+  local_name: gdrive_test_2.txt
+
+# - url: https://www.ebi.ac.uk/chembl/elk/es/
 #   api: elasticsearch
-#   url: https://www.ebi.ac.uk/chembl/elk/es/
 #   query_file: example/query.json
 #   local_name: molecule.json
 #   index: chembl_28_molecule
 #   tag: ebi
-
--
-  url: gdrive:10ojJffrPSl12OMcu4gyx0fak2CNu6qOs
-  local_name: gdrive_test_1.txt
-
--
-  url: https://drive.google.com/uc?id=10ojJffrPSl12OMcu4gyx0fak2CNu6qOs
-  local_name: gdrive_test_2.txt
+
7 changes: 5 additions & 2 deletions test/integration/test_download.py
@@ -24,8 +24,11 @@ def test_download():


 def test_tag():
-    files = ["test/output/zfin/fish_phenotype.txt", "test/output/test_file.yaml"]
-    tagged_files = ["test/output/test_file.yaml"]
+    files = [
+        "test/output/zfin/fish_phenotype.txt",
+        "test/output/test_file.yaml"
+    ]
+    tagged_files = ["test/output/gdrive_test_1.txt"]

     for file in files:
         if exists(file):