Development (#1337)
* Entity type constraints (#1330)

* work in progress on entity constraints querying

* basic logic sketched in

* fixing linting error

* very unfinished, need to review entity constraint endpoint documentation further

* architected, testing, definitely broken

* at least one existing test passing, need to run all + design new tests

* added testing

* still testing, lots of edge cases

* messy code, will tidy up

* removed unnecessary code

* fixing formatting errors

* putting enums where they belong

* forgot a file

* need to fix my linter

* made sample constraints work I hope

* working on tests

* fixed broken test

* removed testing from dataset file, hopefully fixed requirements-dev install

* testing fixing action

* second testing fixing actions

* changes to fixtures from assaytype endpoint; fixing globus_token mistake

* fixing malformed constraints endpoint query

* removing breakpoint

* fixing sample checking logic error

* removing breakpoint again

* fixing adding dataset sub_type to SchemaVersion.entity_type_info

* fixing the same query URL mistake in test file

* updating test output with line number changes

* General: Update changelog to reflect releases/versioning updates. (#1334)

Co-authored-by: Juan Puerto <=>

* Update to entity constraints error reporting (#1335)

* scaffolding for update to error reporting to use get_errors

* updated constraint checking to use _get_message

* fixing bugs

* missed some files

* row numbering changes from online testing

* fixing some enum referencing

* linting update

* updated validate_tsv.py for testing get_tsv_errors; fixed some issues with type enums

* sources do not need constraint checks

* Docs: Update CHANGELOG

* Revert "Update to entity constraints error reporting (#1335)"

This reverts commit f92146d.

* Phillips/entity constraints errors (#1338)

* scaffolding for update to error reporting to use get_errors

* updated constraint checking to use _get_message

* fixing bugs

* missed some files

* row numbering changes from online testing

* fixing some enum referencing

* linting update

* updated validate_tsv.py for testing get_tsv_errors; fixed some issues with type enums

* sources do not need constraint checks

---------

Co-authored-by: Gesina Phillips <[email protected]>

* Plugin run reporting (#1336)

* Mods to handle signaling of whether work was done by plugins

* changed the way info is reported via get_info to allow plugin names to be returned

* fixing weird TODO

* changelog

* added test

* moved plugin test to manual testing

---------

Co-authored-by: Joel Welling <[email protected]>

---------

Co-authored-by: gesinaphillips <[email protected]>
Co-authored-by: Juan Puerto <=>
Co-authored-by: Gesina Phillips <[email protected]>
Co-authored-by: Joel Welling <[email protected]>
4 people committed May 28, 2024
1 parent de58a30 commit 3e4bb1e
Showing 21 changed files with 513 additions and 365 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,13 @@
# Changelog

## v0.0.19 (in progress)
## v0.0.21
- Fix the changelog to reflect the current version.
- Fix row number mismatch between validation and spreadsheet validator response

## v0.0.20
- Fix row number mismatch between validation and spreadsheet validator response

## v0.0.19
- Directory validation changes for "shared" uploads
- Update Phenocycler directory schema
- Remove bad paths from LC-MS directory schema
@@ -15,6 +22,7 @@
- Add semantic version to plugin test base class
- Fix row number mismatch between validation and spreadsheet validator response
- Adding entity constraints check
- Adding ability to report names of successfully run plugins

## v0.0.18

6 changes: 3 additions & 3 deletions examples/dataset-examples/bad-cedar-assay-histology/README.md
@@ -1,10 +1,10 @@
```
Spreadsheet Validator Errors:
examples/dataset-examples/bad-cedar-assay-histology/upload/bad-histology-metadata.tsv:
- On row 0, column "parent_sample_id", value "wrong" fails because of error "invalidValueFormat".
- On row 1, column "contributors_path", value "" fails because of error "missingRequired".
- On row 2, column "parent_sample_id", value "wrong" fails because of error "invalidValueFormat".
- On row 3, column "contributors_path", value "" fails because of error "missingRequired".
examples/dataset-examples/bad-cedar-assay-histology/upload/contributors.tsv:
- On row 0, column "orcid", value "0000-0002-8928-abcd" fails because of error "invalidValueFormat".
- On row 2, column "orcid", value "0000-0002-8928-abcd" fails because of error "invalidValueFormat".
URL Check Errors:
examples/dataset-examples/bad-cedar-assay-histology/upload/bad-histology-metadata.tsv:
- 'On row 2, column "parent_sample_id", value "wrong" fails because of error "HTTPError":
@@ -1 +1 @@
{"assaytype": {"Histology": {"assaytype": "h-and-e", "contains-pii": false, "dataset-type": "Histology", "description": "H&E Stained Microscopy", "dir-schema": "histology-v2", "primary": true, "vitessce-hints": []}}, "validation": {"h-and-e": {"URL Check Errors": ["On row 2, column \"parent_sample_id\", value \"wrong\" fails because of error \"HTTPError\": 400 Client Error: Bad Request for url: https://entity.api.hubmapconsortium.org/entities/wrong"], "Spreadsheet Validator Errors": ["On row 0, column \"parent_sample_id\", value \"wrong\" fails because of error \"invalidValueFormat\"", "On row 1, column \"contributors_path\", value \"\" fails because of error \"missingRequired\""]}, "contributors": {"URL Check Errors": ["On row 2, column \"orcid\", value \"0000-0002-8928-abcd\" fails because of error \"Exception\": ORCID 0000-0002-8928-abcd does not exist."], "Spreadsheet Validator Errors": ["On row 0, column \"orcid\", value \"0000-0002-8928-abcd\" fails because of error \"invalidValueFormat\""]}}}
{"assaytype": {"Histology": {"assaytype": "h-and-e", "contains-pii": false, "dataset-type": "Histology", "description": "H&E Stained Microscopy", "dir-schema": "histology-v2", "primary": true, "vitessce-hints": []}}, "validation": {"h-and-e": {"URL Check Errors": ["On row 2, column \"parent_sample_id\", value \"wrong\" fails because of error \"HTTPError\": 400 Client Error: Bad Request for url: https://entity.api.hubmapconsortium.org/entities/wrong"], "Spreadsheet Validator Errors": ["On row 2, column \"parent_sample_id\", value \"wrong\" fails because of error \"invalidValueFormat\"", "On row 3, column \"contributors_path\", value \"\" fails because of error \"missingRequired\""]}, "contributors": {"URL Check Errors": ["On row 2, column \"orcid\", value \"0000-0002-8928-abcd\" fails because of error \"Exception\": ORCID 0000-0002-8928-abcd does not exist."], "Spreadsheet Validator Errors": ["On row 2, column \"orcid\", value \"0000-0002-8928-abcd\" fails because of error \"invalidValueFormat\""]}}}
@@ -1,3 +1,3 @@
parent_sample_id lab_id preparation_protocol_doi dataset_type analyte_class is_targeted acquisition_instrument_vendor acquisition_instrument_model source_storage_duration_value source_storage_duration_unit time_since_acquisition_instrument_calibration_value time_since_acquisition_instrument_calibration_unit contributors_path data_path is_image_preprocessing_required stain_name stain_technique is_batch_staining_done is_staining_automated preparation_instrument_vendor preparation_instrument_model slide_id tile_configuration scan_direction tiled_image_columns tiled_image_count intended_tile_overlap_percentage metadata_schema_id
wrong Visium_9OLC_A4_S1 https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1 Histology DNA No Zeiss Microscopy Axio Observer 7 24 day ./contributors.tsv ./dataset-1 Yes H&E Progressive staining Yes No HTX Technologies SunCollect Sprayer V11A19-078 Snake-by-rows Right-and-down 10 120 30 e7475329-9a60-4088-8e34-19a3828e0b3b
HBM854.FXDQ.783 Visium_9OLC_A4_S2 https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1 Histology DNA No Zeiss Microscopy Axio Observer 7 24 day ./dataset-2 Yes H&E Progressive staining Yes No HTX Technologies SunCollect Sprayer V11A19-078 Snake-by-rows Right-and-down 10 120 30 e7475329-9a60-4088-8e34-19a3828e0b3b
HBM733.HSZF.798 Visium_9OLC_A4_S2 https://dx.doi.org/10.17504/protocols.io.eq2lyno9qvx9/v1 Histology DNA No Zeiss Microscopy Axio Observer 7 24 day ./dataset-2 Yes H&E Progressive staining Yes No HTX Technologies SunCollect Sprayer V11A19-078 Snake-by-rows Right-and-down 10 120 30 e7475329-9a60-4088-8e34-19a3828e0b3b
@@ -1,8 +1,8 @@
```
Spreadsheet Validator Errors:
examples/dataset-examples/bad-cedar-multi-assay-visium-bad-child-metadata/upload/bad-visium-rnaseq-metadata.tsv:
- On row 1, column "parent_sample_id", value "" fails because of error "missingRequired".
- On row 2, column "preparation_protocol_doi", value "wrong" fails because of error
- On row 3, column "parent_sample_id", value "" fails because of error "missingRequired".
- On row 4, column "preparation_protocol_doi", value "wrong" fails because of error
"invalidUrl".
URL Check Errors:
examples/dataset-examples/bad-cedar-multi-assay-visium-bad-child-metadata/upload/bad-visium-rnaseq-metadata.tsv:
@@ -1 +1 @@
{"assaytype": {"RNAseq": {"assaytype": "rnaseq-visium-no-probes", "contains-pii": true, "dataset-type": "RNAseq", "description": "Capture bead RNAseq (10x Genomics v3)", "dir-schema": "rnaseq-v2", "primary": true, "vitessce-hints": []}, "Visium (no probes)": {"assaytype": "visium-no-probes", "contains-pii": true, "dataset-type": "Visium (no probes)", "description": "Visium (no probes)", "dir-schema": "visium-no-probes-v2", "is-multi-assay": true, "must-contain": ["Histology", "RNAseq"], "primary": true, "vitessce-hints": []}, "Histology": {"assaytype": "h-and-e", "contains-pii": false, "dataset-type": "Histology", "description": "H&E Stained Microscopy", "dir-schema": "histology-v2", "primary": true, "vitessce-hints": []}}, "validation": {"rnaseq-visium-no-probes": {"URL Check Errors": ["On row 3, column \"parent_sample_id\", value \"\" fails because of error \"HTTPError\": 404 Client Error: Not Found for url: https://entity.api.hubmapconsortium.org/entities/"], "Spreadsheet Validator Errors": ["On row 1, column \"parent_sample_id\", value \"\" fails because of error \"missingRequired\"", "On row 2, column \"preparation_protocol_doi\", value \"wrong\" fails because of error \"invalidUrl\""]}, "contributors": {}, "visium-no-probes": {}, "h-and-e": {}}}
{"assaytype": {"RNAseq": {"assaytype": "rnaseq-visium-no-probes", "contains-pii": true, "dataset-type": "RNAseq", "description": "Capture bead RNAseq (10x Genomics v3)", "dir-schema": "rnaseq-v2", "primary": true, "vitessce-hints": []}, "Visium (no probes)": {"assaytype": "visium-no-probes", "contains-pii": true, "dataset-type": "Visium (no probes)", "description": "Visium (no probes)", "dir-schema": "visium-no-probes-v2", "is-multi-assay": true, "must-contain": ["Histology", "RNAseq"], "primary": true, "vitessce-hints": []}, "Histology": {"assaytype": "h-and-e", "contains-pii": false, "dataset-type": "Histology", "description": "H&E Stained Microscopy", "dir-schema": "histology-v2", "primary": true, "vitessce-hints": []}}, "validation": {"rnaseq-visium-no-probes": {"URL Check Errors": ["On row 3, column \"parent_sample_id\", value \"\" fails because of error \"HTTPError\": 404 Client Error: Not Found for url: https://entity.api.hubmapconsortium.org/entities/"], "Spreadsheet Validator Errors": ["On row 3, column \"parent_sample_id\", value \"\" fails because of error \"missingRequired\"", "On row 4, column \"preparation_protocol_doi\", value \"wrong\" fails because of error \"invalidUrl\""]}, "contributors": {}, "visium-no-probes": {}, "h-and-e": {}}}
4 changes: 3 additions & 1 deletion script-docs/README-validate_tsv.py.md
@@ -2,7 +2,7 @@
usage: validate_tsv.py [-h] --path PATH --schema
{sample,sample-block,sample-suspension,sample-section,antibodies,contributors,metadata,source}
[--globus_token GLOBUS_TOKEN]
[--output {as_text,as_md}]
[--output {as_text,as_md}] [--app_context APP_CONTEXT]
Validate a HuBMAP TSV. REMINDER: Use of validate_tsv.py is deprecated; use the HuBMAP Metadata Spreadsheet Validator to validate single TSVs instead (https://metadatavalidator.metadatacenter.org).
@@ -13,6 +13,8 @@ optional arguments:
--globus_token GLOBUS_TOKEN
Token for URL checking using Entity API.
--output {as_text,as_md}
--app_context APP_CONTEXT
App context values.
Exit status codes:
0: Validation passed
48 changes: 26 additions & 22 deletions src/ingest_validation_tools/enums.py
@@ -152,55 +152,59 @@


@unique
class Sample(str, Enum):
BLOCK = "sample-block"
SUSPENSION = "sample-suspension"
SECTION = "sample-section"
ORGAN = "organ"
class EntityTypes(str, Enum):

# TODO: I believe this can be streamlined with the StrEnum class added in 3.11
@classmethod
def full_names_list(cls) -> List[str]:
return [sample_type.value for sample_type in cls]
def value_list(cls) -> List[str]:
return [entity_type.value for entity_type in cls]

@classmethod
def just_subtypes_list(cls) -> List[str]:
return [sample_type.name.lower() for sample_type in cls]
def key_list(cls) -> List[str]:
return [entity_type.name.lower() for entity_type in cls]

@classmethod
def get_key_from_val(cls, val) -> str:
match = [sample_type.name for sample_type in cls if sample_type.value == val]
def get_enum_from_val(cls, val) -> str:
match = [entity_type for entity_type in cls if entity_type.value == val]
if not match:
return ""
return match[0]


class DatasetType(str, Enum):
class DatasetType(EntityTypes):
DATASET = "dataset"


@unique
class OtherTypes(str, Enum):
class OtherTypes(EntityTypes):
ANTIBODIES = "antibodies"
CONTRIBUTORS = "contributors"
SOURCE = "source"
SAMPLE = "sample"
ORGAN = "organ"
DONOR = "donor"

@classmethod
def value_list(cls):
return [other_type.value for other_type in cls]

@classmethod
def get_sample_types(cls):
return Sample.just_subtypes_list()
return Sample.key_list()

@classmethod
def get_sample_types_full_names(cls):
return Sample.full_names_list()
return Sample.value_list()

@classmethod
def with_sample_subtypes(cls):
all_types = [*cls.value_list(), *cls.get_sample_types_full_names()]
def with_sample_subtypes(cls, with_sample=True):
all_types = [entity_type for entity_type in [*cls, *Sample]]
if not with_sample:
all_types.remove(OtherTypes.SAMPLE)
return all_types


class Sample(EntityTypes):
BLOCK = "sample-block"
SUSPENSION = "sample-suspension"
SECTION = "sample-section"
ORGAN = "organ"

@classmethod
def with_parent_type(cls):
return [*[entity_type for entity_type in cls], OtherTypes.SAMPLE]
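The refactor above moves the shared lookup helpers onto an `EntityTypes` base class that `DatasetType`, `OtherTypes`, and `Sample` all inherit from. A minimal, self-contained sketch of the pattern (member values copied from the hunk; everything else is illustrative, not the actual module):

```python
from enum import Enum, unique
from typing import List


class EntityTypes(str, Enum):
    """Base class holding the helpers shared by all entity-type enums."""

    # Note: no members are defined here, so subclassing is legal;
    # Python only forbids subclassing an Enum that already has members.

    @classmethod
    def value_list(cls) -> List[str]:
        # Full string values, e.g. ["sample-block", "sample-suspension", ...]
        return [entity_type.value for entity_type in cls]

    @classmethod
    def key_list(cls) -> List[str]:
        # Lower-cased member names, e.g. ["block", "suspension", ...]
        return [entity_type.name.lower() for entity_type in cls]

    @classmethod
    def get_enum_from_val(cls, val: str):
        # Returns the matching member, or "" when nothing matches.
        match = [entity_type for entity_type in cls if entity_type.value == val]
        if not match:
            return ""
        return match[0]


@unique
class Sample(EntityTypes):
    BLOCK = "sample-block"
    SUSPENSION = "sample-suspension"
    SECTION = "sample-section"
    ORGAN = "organ"
```

As the TODO in the diff notes, the `StrEnum` class added in Python 3.11 could streamline this further, since its members compare equal to their string values without the `str` mixin.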
14 changes: 9 additions & 5 deletions src/ingest_validation_tools/error_report.py
@@ -21,19 +21,23 @@ def __init__(self, error: str):

@dataclass
class InfoDict:
time: datetime
git: str
dir: str
tsvs: Dict[str, Dict[str, str]]
time: Optional[datetime] = None
git: Optional[str] = None
dir: Optional[str] = None
tsvs: Dict[str, Dict[str, str]] = field(default_factory=dict)
successful_plugins: list[str] = field(default_factory=list)

def as_dict(self):
return {
as_dict = {
"Time": self.time,
"Git version": self.git,
"Directory": self.dir,
# "Directory schema version": self.dir_schema,
"TSVs": self.tsvs,
}
if self.successful_plugins:
as_dict["Successful Plugins"] = self.successful_plugins
return as_dict


@dataclass
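The `InfoDict` change makes every field optional with safe defaults and only includes a "Successful Plugins" entry when at least one plugin actually ran. A runnable sketch of the new behavior (field set reduced to what the hunk shows):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class InfoDict:
    time: Optional[datetime] = None
    git: Optional[str] = None
    dir: Optional[str] = None
    tsvs: Dict[str, Dict[str, str]] = field(default_factory=dict)
    successful_plugins: List[str] = field(default_factory=list)

    def as_dict(self) -> dict:
        as_dict = {
            "Time": self.time,
            "Git version": self.git,
            "Directory": self.dir,
            "TSVs": self.tsvs,
        }
        # Only advertise plugins when some actually reported success.
        if self.successful_plugins:
            as_dict["Successful Plugins"] = self.successful_plugins
        return as_dict
```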
14 changes: 8 additions & 6 deletions src/ingest_validation_tools/plugin_validator.py
@@ -1,15 +1,14 @@
import inspect
import sys
from collections.abc import Iterator
from importlib import util
from pathlib import Path
from typing import Iterator, List, Tuple, Type, Union
from typing import List, Optional, Tuple, Type, Union

from ingest_validation_tools.schema_loader import SchemaVersion

PathOrStr = Union[str, Path]

KeyValuePair = Tuple[str, str]


class add_path:
"""
@@ -79,7 +78,7 @@ def _log(self, message):
print(message)
return message

def collect_errors(self) -> List[str]:
def collect_errors(self, **kwargs) -> List[str]:
"""
Returns a list of strings, each of which is a
human-readable error message.
@@ -91,6 +90,9 @@ def collect_errors(self) -> List[str]:
raise NotImplementedError()


KeyValuePair = Tuple[Type[Validator], Optional[str]]


def run_plugin_validators_iter(
metadata_path: PathOrStr,
sv: SchemaVersion,
@@ -194,5 +196,5 @@ def validation_error_iter(
"""
for cls in validation_class_iter(plugin_dir):
validator = cls(paths, assay_type, contains, verbose)
for err in validator.collect_errors(**kwargs): # type: ignore
yield cls.description, err
for err in validator.collect_errors(**kwargs):
yield cls, err
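The last hunk means `validation_error_iter` now yields the validator class itself rather than `cls.description`, so callers can recover both the description and the plugin name, and `collect_errors` accepts arbitrary keyword arguments that callers can thread through to plugins. A sketch of that flow (`DemoValidator` and the `skip` kwarg are invented for illustration; only the shapes mirror the diff):

```python
from typing import Iterator, List, Tuple, Type


class Validator:
    description = "base"

    def __init__(self, paths, assay_type, contains=None, verbose=False):
        self.paths = paths
        self.assay_type = assay_type

    def collect_errors(self, **kwargs) -> List[str]:
        raise NotImplementedError()


class DemoValidator(Validator):
    description = "demo check"

    def collect_errors(self, **kwargs) -> List[str]:
        # A real plugin would inspect self.paths here.
        return [] if kwargs.get("skip") else ["something failed"]


def validation_error_iter(
    classes, paths, assay_type, **kwargs
) -> Iterator[Tuple[Type[Validator], str]]:
    # Yields the class itself (not cls.description), so the caller
    # can report which plugin produced each error.
    for cls in classes:
        validator = cls(paths, assay_type)
        for err in validator.collect_errors(**kwargs):
            yield cls, err


errors = list(validation_error_iter([DemoValidator], ["./upload"], "h-and-e"))
```

Yielding the class rather than a bare string is what enables the "report names of successfully run plugins" feature from PR #1336.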
65 changes: 60 additions & 5 deletions src/ingest_validation_tools/schema_loader.py
@@ -7,7 +7,13 @@
from pathlib import Path
from typing import Dict, List, Optional, Sequence, Set, Union

from ingest_validation_tools.enums import OtherTypes, shared_enums
from ingest_validation_tools.enums import (
DatasetType,
EntityTypes,
OtherTypes,
Sample,
shared_enums,
)
from ingest_validation_tools.yaml_include_loader import load_yaml

_table_schemas_path = Path(__file__).parent / "table-schemas"
@@ -50,10 +56,8 @@ class SchemaVersion:
dir_schema: str = ""
metadata_type: str = "assays"
contains: List = field(default_factory=list)
ancestor_entities: Dict = field(default_factory=dict)
entity_type_info: Dict = field(
default_factory=dict
) # entity_type, entity_sub_type, entity_sub_type_val; for constraint checking
entity_type_info: Optional[EntityTypeInfo] = None
ancestor_entities: List[AncestorTypeInfo] = field(default_factory=list)

def __post_init__(self):
if type(self.path) is str:
@@ -106,6 +110,57 @@ def get_assayclassifier_data(self):
self.contains = [schema.lower() for schema in contains]


@dataclass
class EntityTypeInfo:
entity_type: EntityTypes
entity_sub_type: str = ""
entity_sub_type_val: str = ""

def __post_init__(self):
if (
self.entity_type in [OtherTypes.SAMPLE, DatasetType.DATASET]
and not self.entity_sub_type
):
raise Exception(f"Entity of type {self.entity_type} must have a sub_type.")
# If a member of the Sample enum is passed in as the entity_type,
# this extracts the entity_type and entity_sub_type from that value
# and mutates the instance accordingly
# e.g. self.entity_type == <Sample.BLOCK: "sample-block">
if isinstance(self.entity_type, Sample):
self.entity_sub_type = self.entity_type.name.lower()
self.entity_type = OtherTypes.SAMPLE
if self.entity_sub_type == Sample.ORGAN and not self.entity_sub_type_val:
raise Exception(
f"Entity of type {self.entity_type}/{self.entity_sub_type} must have a sub_type_val."
)

def format_constraint_check_data(self) -> Dict:
"""
Formats data about an entity so that it can be sent as
part of the payload to the constraints endpoint.
"""
return {
"entity_type": self.entity_type.value,
"sub_type": [self.entity_sub_type if self.entity_sub_type else ""],
"sub_type_val": [self.entity_sub_type_val] if self.entity_sub_type_val else None,
}

def format_constraint_check_error(self):
data_entity_sub_type = f"/{self.entity_sub_type.lower()}" if self.entity_sub_type else ""
data_entity_sub_type_val = (
f"/{self.entity_sub_type_val.lower()}" if self.entity_sub_type_val else ""
)
return self.entity_type + data_entity_sub_type + data_entity_sub_type_val


@dataclass
class AncestorTypeInfo(EntityTypeInfo):
entity_id: Optional[str] = None
source_schema: Optional[SchemaVersion] = None
row: Optional[int] = None
column: Optional[str] = None


@dataclass
class DirSchemaVersion:
dir_schema_name: str
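`format_constraint_check_data` shapes each entity into the per-entity payload sent to the constraints endpoint. A simplified stand-in using plain strings instead of the enum members (payload keys taken from the diff; the example values are hypothetical):

```python
def format_constraint_check_data(
    entity_type: str, sub_type: str = "", sub_type_val: str = ""
) -> dict:
    # Mirrors EntityTypeInfo.format_constraint_check_data above, with
    # plain strings standing in for the EntityTypes enum members.
    # Note the asymmetry in the diff: sub_type is always a one-element
    # list (possibly [""]), while sub_type_val becomes None when empty.
    return {
        "entity_type": entity_type,
        "sub_type": [sub_type if sub_type else ""],
        "sub_type_val": [sub_type_val] if sub_type_val else None,
    }


payload = format_constraint_check_data("sample", "block")
```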