Add globus_token to get_assaytype_data (#1343)
* adding token to assayclassifier call, switching string-concatenated URLs to urllib

* fixed tests

* updated tests

* linters

* fixing query params

* forgot changed file

* General: Update Visium w/ probes documentation

* General: Modify CHANGELOG since we're on 0.0.23

---------

Co-authored-by: Juan Puerto <=>
gesinaphillips committed Jul 2, 2024
1 parent 31853d8 commit 8a14a9a
Showing 11 changed files with 56 additions and 45 deletions.
CHANGELOG.md: 3 additions & 0 deletions
@@ -1,5 +1,8 @@
 # Changelog
 
+## v0.0.23
+- Add token to validation_utils.get_assaytype_data, replace URL string concatenation with urllib
+
 ## v0.0.22
 - Fix logging length issue in Upload.multi_parent
 - Minor change to Visium with probes directory schema
docs/visium-with-probes/current/index.md: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ Related files:
 - [📝 TSV template](https://raw.githubusercontent.com/hubmapconsortium/dataset-metadata-spreadsheet/main/visium-with-probes/latest/visium-with-probes.tsv): Alternative for metadata entry.
 
 
-REQUIRED - For this assay, you must also prepare and submit two additional metadata.tsv files following the metadata schemas linked here for [RNAseq](https://hubmapconsortium.github.io/ingest-validation-tools/rnaseq/current/) and [Histology](https://hubmapconsortium.github.io/ingest-validation-tools/histology/current/). [This link](https://docs.google.com/spreadsheets/d/1YnmdTAA0Z9MKN3OjR3Sca8pz-LNQll91wdQoRPSP6Q4/edit#gid=0) lists the set of fields that are required in the OME TIFF file XML header.
+REQUIRED - For this assay, you must also prepare and submit two additional metadata.tsv files following the metadata schemas linked here for [RNAseq (with probes)](https://hubmapconsortium.github.io/ingest-validation-tools/rnaseq-with-probes/current/) and [Histology](https://hubmapconsortium.github.io/ingest-validation-tools/histology/current/). [This link](https://docs.google.com/spreadsheets/d/1YnmdTAA0Z9MKN3OjR3Sca8pz-LNQll91wdQoRPSP6Q4/edit#gid=0) lists the set of fields that are required in the OME TIFF file XML header.
 
 ## Metadata schema
 
@@ -1,7 +1,7 @@
 ```
 Directory Errors:
 ? examples/dataset-examples/bad-cedar-multi-assay-visium-bad-dir-structure/upload/Visium_9OLC_A4_S1
-  (as visium-no-probes-v2.0)
+  (as visium-no-probes-v2.1)
 : - Required but missing:
   - lab_processed\/.*.
   - lab_processed\/images\/.*.
examples/dataset-examples/bad-scrnaseq-v0/fixtures.json: 1 addition & 1 deletion
@@ -1 +1 @@
-{"assaytype": {"scRNAseq-10xGenomics": {"assaytype": "scRNAseq-10xGenomics-v3", "contains-pii": true, "dataset-type": "RNAseq", "description": "RNAseq (10x Genomics v3)", "dir-schema": "scrnaseq-v0", "primary": true, "tbl-schema": "scrnaseq-v0", "vitessce-hints": []}}, "validation": {"scRNAseq-10xGenomics-v3": {}, "contributors": {}}}
+{"assaytype": {"scRNAseq-10xGenomics": {"assaytype": "scRNAseq-10xGenomics-v3", "contains-pii": true, "dataset-type": "RNAseq", "description": "scRNAseq (10x Genomics v3)", "dir-schema": "scrnaseq-v0", "primary": true, "tbl-schema": "scrnaseq-v0", "vitessce-hints": []}}, "validation": {"scRNAseq-10xGenomics-v3": {}, "contributors": {}}}
@@ -7,14 +7,14 @@ TSVs:
   good-visium-assay-metadata.tsv:
     Schema: visium-no-probes-v2
     Metadata schema version: '2'
-    Directory schema version: visium-no-probes-v2.0
+    Directory schema version: visium-no-probes-v2.1
   good-visium-histology-metadata.tsv:
     Schema: h-and-e-v2
     Metadata schema version: '2'
-    Directory schema version: visium-no-probes-v2.0
+    Directory schema version: visium-no-probes-v2.1
   good-visium-rnaseq-metadata.tsv:
     Schema: rnaseq-visium-no-probes-v2
     Metadata schema version: '2'
-    Directory schema version: visium-no-probes-v2.0
+    Directory schema version: visium-no-probes-v2.1
 ```
@@ -1,4 +1,4 @@
-description_md: 'REQUIRED - For this assay, you must also prepare and submit two additional metadata.tsv files following the metadata schemas linked here for [RNAseq](https://hubmapconsortium.github.io/ingest-validation-tools/rnaseq/current/) and [Histology](https://hubmapconsortium.github.io/ingest-validation-tools/histology/current/). [This link](https://docs.google.com/spreadsheets/d/1YnmdTAA0Z9MKN3OjR3Sca8pz-LNQll91wdQoRPSP6Q4/edit#gid=0) lists the set of fields that are required in the OME TIFF file XML header.'
+description_md: 'REQUIRED - For this assay, you must also prepare and submit two additional metadata.tsv files following the metadata schemas linked here for [RNAseq (with probes)](https://hubmapconsortium.github.io/ingest-validation-tools/rnaseq-with-probes/current/) and [Histology](https://hubmapconsortium.github.io/ingest-validation-tools/histology/current/). [This link](https://docs.google.com/spreadsheets/d/1YnmdTAA0Z9MKN3OjR3Sca8pz-LNQll91wdQoRPSP6Q4/edit#gid=0) lists the set of fields that are required in the OME TIFF file XML header.'
 fields:
 - name: is_cedar
   description: 'Identifies whether the version is hosted by CEDAR'
src/ingest_validation_tools/upload.py: 23 additions & 15 deletions
@@ -10,6 +10,7 @@
 from functools import cached_property
 from pathlib import Path
 from typing import DefaultDict, Dict, List, Optional, Union
+from urllib.parse import urljoin, urlsplit
 
 import requests
 
@@ -178,8 +179,14 @@ def get_errors(self, **kwargs) -> ErrorDict:
     def get_app_context(self, submitted_app_context: Dict):
         """
         Ensure that all default values are present, but privilege any
-        submitted values.
+        submitted values (after making a basic validity check).
         """
+        for url_type in ["entities_url", "ingest_url", "constraints_url"]:
+            if submitted_app_context.get(url_type):
+                split_url = urlsplit(submitted_app_context[url_type])
+                assert (
+                    split_url.scheme and split_url.netloc
+                ), f"{url_type} URL is incomplete: {submitted_app_context[url_type]}"
         self.app_context = {
             "entities_url": "https://entity.api.hubmapconsortium.org/entities/",
             "ingest_url": "https://ingest.api.hubmapconsortium.org/",
@@ -371,8 +378,10 @@ def _constraint_checks(self, schema: SchemaVersion):
             "Authorization": f"Bearer {self.globus_token}",
             "Content-Type": "application/json",
         }
-        url = f"{self.app_context['constraints_url']}?match=True&order={CONSTRAINTS_CHECK_METHOD}"
-        response = requests.post(url, headers=headers, data=data)
+        params = {"match": True, "order": CONSTRAINTS_CHECK_METHOD}
+        response = requests.post(
+            self.app_context["constraints_url"], headers=headers, data=data, params=params
+        )
         if self.verbose:
             print("Ancestor-Descendant pairs sent:")
             self._print_constraint_pairs(payload)
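For comparison, the two styles of building this request side by side: handing `requests` a `params` dict yields the same URL as the old f-string while also URL-encoding values (a standalone sketch with placeholder values; `prepare()` shows the resulting URL without sending anything):

```python
import requests

CONSTRAINTS_CHECK_METHOD = "ancestors"  # placeholder for the module constant
constraints_url = "http://constraints_test/"  # placeholder endpoint

# Before: query string assembled by hand, values not URL-encoded.
manual = f"{constraints_url}?match=True&order={CONSTRAINTS_CHECK_METHOD}"

# After: requests builds and encodes the query string itself.
prepared = requests.Request(
    "POST",
    constraints_url,
    params={"match": True, "order": CONSTRAINTS_CHECK_METHOD},
).prepare()

print(manual)        # http://constraints_test/?match=True&order=ancestors
print(prepared.url)  # http://constraints_test/?match=True&order=ancestors
```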
@@ -563,27 +572,24 @@ def _get_constrained_fields(self, schema: SchemaVersion):
 
         constrained_fields = {}
         schema_name = schema.schema_name
+        entities_url = self.app_context.get("entities_url")
 
         if schema_name in [
             OtherTypes.SOURCE,
             *Sample.with_parent_type(),
         ]:
-            constrained_fields["source_id"] = self.app_context.get("entities_url")
+            constrained_fields["source_id"] = entities_url
             if schema_name in Sample.with_parent_type():
-                constrained_fields["sample_id"] = self.app_context.get("entities_url")
+                constrained_fields["sample_id"] = entities_url
         elif schema_name in OtherTypes.ORGAN:  # Deprecated, included for backward-compatibility
-            constrained_fields["organ_id"] = self.app_context.get("entities_url")
+            constrained_fields["organ_id"] = entities_url
         elif schema_name == OtherTypes.CONTRIBUTORS:
             if schema.is_cedar:
-                constrained_fields["orcid"] = (
-                    "https://pub.orcid.org/v3.0/expanded-search/?q=orcid:"
-                )
+                constrained_fields["orcid"] = "https://pub.orcid.org/v3.0/expanded-search/"
             else:
-                constrained_fields["orcid_id"] = (
-                    "https://pub.orcid.org/v3.0/expanded-search/?q=orcid:"
-                )
+                constrained_fields["orcid_id"] = "https://pub.orcid.org/v3.0/expanded-search/"
         else:
-            constrained_fields["parent_sample_id"] = self.app_context.get("entities_url")
+            constrained_fields["parent_sample_id"] = entities_url
         return constrained_fields
 
     def _get_url_fields(
@@ -611,7 +617,7 @@ def _check_url(
         """
         Returns entity_type if checking a field in check_fields.
         """
-        url = constrained_fields[field] + value
+        url = urljoin(constrained_fields[field], value)
         if field in self.check_fields:
             headers = self.app_context.get("request_header", {})
             response = get_entity_api_data(url, self.globus_token, headers)
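One subtlety of moving from `+` to `urljoin`: the join resolves against the last path segment, so the trailing slash on the base URL (which the app-context defaults above keep) is what preserves the old concatenation behavior (standalone illustration; the entity ID is made up):

```python
from urllib.parse import urljoin

# With a trailing slash, the value is appended, matching the old "+":
print(urljoin("https://entity.api.hubmapconsortium.org/entities/", "HBM123.ABCD.456"))
# -> https://entity.api.hubmapconsortium.org/entities/HBM123.ABCD.456

# Without one, urljoin replaces the final path segment instead:
print(urljoin("https://entity.api.hubmapconsortium.org/entities", "HBM123.ABCD.456"))
# -> https://entity.api.hubmapconsortium.org/HBM123.ABCD.456
```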
@@ -628,7 +634,9 @@
             )
         elif field in ["orcid_id", "orcid"]:
             headers = {"Accept": "application/json"}
-            response = requests.get(url, headers=headers)
+            response = requests.get(
+                constrained_fields[field], headers=headers, params={"q": f"orcid:{value}"}
+            )
             num_found = response.json().get("num-found")
             if num_found == 1:
                 return
src/ingest_validation_tools/validation_utils.py: 6 additions & 11 deletions
@@ -4,6 +4,7 @@
 from csv import DictReader
 from pathlib import Path, PurePath
 from typing import Dict, List, Optional, Union
+from urllib.parse import urljoin
 
 import requests
 
@@ -92,10 +93,7 @@ def get_schema_version(
         message.append('Has "orcid_id": Contributors TSV found where metadata TSV expected.')
     if message:
         raise PreflightError(" ".join([msg for msg in message]))
-    assay_type_data = get_assaytype_data(
-        rows[0],
-        ingest_url,
-    )
+    assay_type_data = get_assaytype_data(rows[0], ingest_url, globus_token)
     if not assay_type_data:
         message.append(f"No match found in assayclassifier for TSV {path}.")
         if "assay_type" in rows[0]:
@@ -151,20 +149,17 @@ def get_other_schema_data(
         raise PreflightError(f"Metadata TSV {path} contains invalid field: {unique_field}")
     if entity_type == OtherTypes.SAMPLE:
         # Sample types require additional data
-        return get_entity_info_from_entity_api(url + row[unique_field], globus_token)
+        return get_entity_info_from_entity_api(urljoin(url, row[unique_field]), globus_token)
     else:
         return EntityTypeInfo(entity_type)
 
 
-def get_assaytype_data(
-    row: Dict,
-    ingest_url: str,
-) -> Dict:
+def get_assaytype_data(row: Dict, ingest_url: str, globus_token: str) -> Dict:
     if not ingest_url:
         ingest_url = "https://ingest.api.hubmapconsortium.org/"
     response = requests.post(
-        f"""{ingest_url.strip("/")}/assaytype""",
-        headers={"Content-Type": "application/json"},
+        urljoin(ingest_url, "assaytype"),
+        headers={"Content-Type": "application/json", "Authorization": f"Bearer {globus_token}"},
         data=json.dumps(row),
     )
     response.raise_for_status()
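A hypothetical call against the new signature (the row values and token here are made up for illustration): passing an empty `ingest_url` falls through to the production default, and the token is forwarded as a bearer header.

```python
from ingest_validation_tools.validation_utils import get_assaytype_data

# Illustrative metadata row and token; real rows come from a metadata TSV.
row = {"dataset_type": "RNAseq", "parent_sample_id": "HBM123.ABCD.456"}
assay_type_data = get_assaytype_data(row, "", "my-globus-token")
# POSTs the row as JSON to https://ingest.api.hubmapconsortium.org/assaytype
# with the header "Authorization: Bearer my-globus-token".
```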
tests-manual/test-plugins.py: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ def test_info_reporting(self):
         ):
             with patch("ingest_validation_tools.upload.Upload.online_checks"):
                 fake_now = datetime.now()
-                for test_dir in glob.glob(f"examples/plugin-tests/**"):
+                for test_dir in glob.glob("examples/plugin-tests/**"):
                     upload = Upload(
                         Path(f"{test_dir}/upload"), **PLUGIN_EXAMPLES_OPTS, verbose=False
                     )
tests/test_dataset_examples.py: 11 additions & 7 deletions
@@ -279,8 +279,8 @@ def test_validate_dataset_examples(self, verbose: bool = False, full_diff: bool
                 opts = {}
             with patch(
                 "ingest_validation_tools.validation_utils.get_assaytype_data",
-                side_effect=lambda row, ingest_url: assaytype_side_effect(
-                    test_dir, row, ingest_url
+                side_effect=lambda row, ingest_url, globus_token: assaytype_side_effect(
+                    test_dir, row, ingest_url, globus_token
                 ),
             ) as mock_assaytype_data:
                 with patch("ingest_validation_tools.upload.Upload.online_checks"):
@@ -332,7 +332,7 @@ def multi_dataset_assert(self, tsv_paths: List[str], mock_assaytype_data: Mock):
             with open(tsv_path, encoding="ascii") as f:
                 rows = list(DictReader(f, dialect="excel-tab"))
                 f.close()
-            calls.append(call(rows[0], "https://ingest.api.hubmapconsortium.org/"))
+            calls.append(call(rows[0], "https://ingest.api.hubmapconsortium.org/", ""))
         try:
             mock_assaytype_data.assert_has_calls(calls, any_order=True)
         except AssertionError as e:
@@ -343,7 +343,9 @@ def multi_dataset_assert(self, tsv_paths: List[str], mock_assaytype_data: Mock):
 def prep_offline_upload(test_dir: str, opts: Dict) -> Upload:
     with patch(
         "ingest_validation_tools.validation_utils.get_assaytype_data",
-        side_effect=lambda row, ingest_url: assaytype_side_effect(test_dir, row, ingest_url),
+        side_effect=lambda row, ingest_url, globus_token: assaytype_side_effect(
+            test_dir, row, ingest_url, globus_token
+        ),
     ):
         with patch("ingest_validation_tools.upload.Upload.online_checks"):
             upload = Upload(Path(f"{test_dir}/upload"), **opts)
@@ -354,7 +356,9 @@ def prep_offline_upload(test_dir: str, opts: Dict) -> Upload:
     def prep_dir_schema_upload(self, test_dir: str, opts: Dict, patch_data: Dict) -> Upload:
         with patch(
             "ingest_validation_tools.validation_utils.get_assaytype_data",
-            side_effect=lambda row, ingest_url: assaytype_side_effect(test_dir, row, ingest_url),
+            side_effect=lambda row, ingest_url, globus_token: assaytype_side_effect(
+                test_dir, row, ingest_url, globus_token
+            ),
         ):
             with patch(
                 "ingest_validation_tools.validation_utils.get_possible_directory_schemas",
@@ -484,14 +488,14 @@ def test_bad_multi_assay_parents(self):
             ),
         ):
             bad_upload = Upload(
-                Path(f"test_path"),
+                Path("test_path"),
                 tsv_paths=["repeated_parent_fake_path_1", "repeated_parent_fake_path_2"],
                 **DATASET_EXAMPLES_OPTS,
             )
             with self.assertRaises(PreflightError):
                 bad_upload.multi_parent
             good_upload = Upload(
-                Path(f"test_path"),
+                Path("test_path"),
                 tsv_paths=["unique_parent_fake_path_1", "unique_parent_fake_path_2"],
                 **DATASET_EXAMPLES_OPTS,
             )
tests/test_single_tsv.py: 5 additions & 4 deletions
@@ -17,7 +17,6 @@
     BAD_DATASET_SCHEMA_WITH_ANCESTORS,
     GOOD_DATASET_EXPECTED_PAYLOAD,
     GOOD_DATASET_SCHEMA_WITH_ANCESTORS,
-    SAMPLE_BLOCK_CONSTRAINTS_RESPONSE_BAD,
     SAMPLE_BLOCK_CONSTRAINTS_RESPONSE_GOOD,
     SAMPLE_BLOCK_PARTIAL_CEDAR_RESPONSE_GOOD,
     SAMPLE_BLOCK_PARTIAL_ENTITY_API_RESPONSE,
@@ -26,7 +25,7 @@
     TEST_GET_TSV_ERRORS_PARAMS,
 )
 
-CONSTRAINTS_URL_PARAMS = "?match=True&order=ancestors"
+CONSTRAINTS_URL_PARAMS = {"match": True, "order": "ancestors"}
 CONSTRAINTS_URL = "http://constraints_test/"
 ENTITIES_URL = "http://entities_test/"
 
@@ -78,9 +77,10 @@ def test_constraints_bad(self, mock_request):
             ],
         )
         mock_request.assert_any_call(
-            CONSTRAINTS_URL + CONSTRAINTS_URL_PARAMS,
+            CONSTRAINTS_URL,
             headers={"Authorization": "Bearer test", "Content-Type": "application/json"},
             data=json.dumps(BAD_DATASET_EXPECTED_PAYLOAD),
+            params=CONSTRAINTS_URL_PARAMS,
         )
 
     @patch("ingest_validation_tools.upload.requests.post")
@@ -89,9 +89,10 @@ def test_constraints_good(self, mock_request):
         # Shouldn't return anything
         self.upload._constraint_checks(GOOD_DATASET_SCHEMA_WITH_ANCESTORS)
         mock_request.assert_any_call(
-            CONSTRAINTS_URL + CONSTRAINTS_URL_PARAMS,
+            CONSTRAINTS_URL,
             headers={"Authorization": "Bearer test", "Content-Type": "application/json"},
             data=json.dumps(GOOD_DATASET_EXPECTED_PAYLOAD),
+            params=CONSTRAINTS_URL_PARAMS,
         )
 
     @patch("ingest_validation_tools.upload.requests.post")