Propagate input annotations to primary.cwlprov files #1678

Draft: wants to merge 19 commits into base: main

RenskeW (Contributor) commented Jun 7, 2022

Propagate structured annotations for files and directories (e.g. Schema.org, but also CWL-specific metadata fields) to the provenance graph. In addition, annotations under the cwlprov:prov field should be incorporated in the RDF graph as well.

Step 1: If label is specified in the input object, it is added as an annotation to the corresponding file entity.

Example:

file: 
  class: File
  path: ./file.tsv
  label: file_label

This would be represented in the cwlprov provenance like this:

id:92fc295f-f6cb-444b-8ffe-0f604b127858 a wf4ever:File,
        wfprov:Artifact,
        prov:Entity ;
    rdfs:label "file_label"^^xsd:string ;
    prov:specializationOf data:86ec86fdbbe70863a78453f69349568f9d1f14a1 ;
    cwlprov:basename "file.tsv"^^xsd:string ;
    cwlprov:nameext ".tsv"^^xsd:string ;
    cwlprov:nameroot "file"^^xsd:string .

See here for the full example RO.
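
As a rough illustration of Step 1 (not the code in this PR), the sketch below derives the annotation pairs seen in the Turtle snippet above from a File input object. The function name file_annotations and the list-of-tuples representation are hypothetical; the PR itself appears to work on provenance entity objects instead.

```python
import os.path
from typing import Any, Dict, List, Tuple


def file_annotations(file_obj: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (predicate, literal) pairs for a CWL File input object.

    Hypothetical helper for illustration only; predicate names are
    copied from the Turtle example in this PR description.
    """
    basename = os.path.basename(file_obj["path"])
    nameroot, nameext = os.path.splitext(basename)
    pairs = [
        ("cwlprov:basename", basename),
        ("cwlprov:nameext", nameext),
        ("cwlprov:nameroot", nameroot),
    ]
    # Only emit rdfs:label when the input object actually carries a label.
    if "label" in file_obj:
        pairs.insert(0, ("rdfs:label", file_obj["label"]))
    return pairs


example = {"class": "File", "path": "./file.tsv", "label": "file_label"}
for predicate, value in file_annotations(example):
    print(f'{predicate} "{value}"^^xsd:string')
```

An input object without a label yields only the three cwlprov attributes, so unannotated inputs keep their current representation in this sketch.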

Step 2: The values of doc, format, label, and intent are propagated to the RDF provenance graph.

EXAMPLE RO: https://github.com/RenskeW/cwlprov-provenance/tree/930cc268da77c5aa3739a2f7f87b94e076b144e6/cwlprov_rdf_examples


Mapping to Schema.org terms:

TODO:

  • Propagate annotations for File
  • Propagate annotations for Directory
  • Add support for annotations of type CommentedSeq
  • Propagate annotations under cwlprov:prov
  • Propagate values of doc, format, label, and intent fields to RDF
  • inputs should be copied into data/, not be broken symlinks
  • improve code coverage
  • Fix double entries in provenance (optionally)

@RenskeW RenskeW requested a review from mr-c June 7, 2022 16:06
codecov bot commented Jun 7, 2022

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (bcdbeaf) 83.80% compared to head (dea0b8e) 83.92%.

Files                                   Patch %   Lines
cwltool/cwlprov/provenance_profile.py   71.69%    11 Missing and 4 partials ⚠️
cwltool/singularity.py                  0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1678      +/-   ##
==========================================
+ Coverage   83.80%   83.92%   +0.11%     
==========================================
  Files          46       46              
  Lines        8221     8266      +45     
  Branches     2182     2201      +19     
==========================================
+ Hits         6890     6937      +47     
+ Misses        854      850       -4     
- Partials      477      479       +2     


mr-c (Member) commented Jun 8, 2022

Thank you for this, @RenskeW! Can you run make format and push the changes?

mr-c (Member) commented Jun 9, 2022

@RenskeW Thanks!

Checking https://www.commonwl.org/v1.2/CommandLineTool.html#File, we see that label is not one of the File fields defined in the CWL standards. So we'll need to use a namespaced field name. I recommend choosing one of the schema.org properties, such as https://schema.org/name, since it is linked to rdfs:label and is modeled as equivalent to dcterms:title.

After you make that change, can you add an s:name to one of the input objects used in the cwlprov tests? You'll need to add a $namespaces dictionary to document the s prefix if the corresponding CWL description lacks it. Then you can add a check to one of the cwlprov tests to look for the rdfs:label in the resulting CWLProv RO.
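
A minimal sketch of the namespaced input object suggested above, written as a Python dict that can be dumped as a JSON job file. The prefix IRI https://schema.org/ and the expand helper are assumptions for illustration; cwltool resolves prefixes through its own document loader rather than a helper like this.

```python
import json
from typing import Dict

# Input object with a $namespaces dictionary documenting the "s" prefix.
# The IRI chosen for "s" is an assumption; use whatever the tests expect.
job = {
    "$namespaces": {"s": "https://schema.org/"},
    "file": {
        "class": "File",
        "path": "./file.tsv",
        "s:name": "file_label",
    },
}


def expand(key: str, namespaces: Dict[str, str]) -> str:
    """Expand a prefixed key such as 's:name' to a full IRI.

    Hypothetical helper: keys without a known prefix are returned as-is.
    """
    prefix, sep, local = key.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return key


print(json.dumps(job, indent=2))
print(expand("s:name", job["$namespaces"]))
```

Plain CWL fields such as path carry no prefix and pass through unchanged, which is one way to tell standard fields apart from annotations.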

mr-c (Member) commented Jun 9, 2022

(maybe in CWL v1.3 we will add a label field to File)

lgtm-com bot commented Jun 20, 2022

This pull request introduces 1 alert when merging 99fc196 into c23a9ed - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

lgtm-com bot commented Jun 22, 2022

This pull request introduces 1 alert when merging c6e7385 into b21b0c1 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times


def _add_nested_annotations(dataset, e: ProvEntity) -> ProvEntity:
    for annotation in dataset:
        if isinstance(dataset[annotation], (str, bool, int, float)):  # check if these are all allowed types
RenskeW (Contributor, Author):

isinstance(dataset[annotation], list), isinstance(dataset[annotation], MutableMapping)

RenskeW (Contributor, Author):

MutableSequence = list
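
The type checks discussed in these review comments can be sketched as a small recursive walk over a nested annotation structure. This is an illustration under assumptions, not the PR's _add_nested_annotations: the name iter_annotations and the slash-joined key paths are hypothetical, and the sketch yields plain tuples rather than adding attributes to a provenance entity.

```python
from collections.abc import MutableMapping, MutableSequence
from typing import Any, Iterator, Tuple

SCALARS = (str, bool, int, float)


def iter_annotations(dataset: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (key_path, scalar_value) pairs from nested annotations."""
    if isinstance(dataset, SCALARS):
        # Leaf value: report it under the path collected so far.
        yield prefix, dataset
    elif isinstance(dataset, MutableSequence):
        # Lists (and list subclasses) repeat the same key for each item.
        for item in dataset:
            yield from iter_annotations(item, prefix)
    elif isinstance(dataset, MutableMapping):
        for key, value in dataset.items():
            path = f"{prefix}/{key}" if prefix else key
            yield from iter_annotations(value, path)
    # Any other type (e.g. None) is skipped in this sketch.


annotations = {
    "s:author": [{"s:name": "A"}, {"s:name": "B"}],
    "s:version": 2,
}
for path, value in iter_annotations(annotations):
    print(path, value)
```

Checking against the abstract MutableSequence and MutableMapping types, rather than list and dict directly, should also cover list-like and dict-like containers produced by a YAML loader, provided they are registered as such.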

lgtm-com bot commented Jul 4, 2022

This pull request introduces 2 alerts when merging bf34ca6 into bb41504 - view on LGTM.com

new alerts:

  • 2 for Incomplete URL substring sanitization

# # how do we identify the correct file to write to? self.workflow_run_uri?
# #
# pass

RenskeW (Contributor, Author):

First step in propagating metadata under cwlprov:prov to provenance as well.

lgtm-com bot commented Jul 5, 2022

This pull request introduces 2 alerts when merging 70d184f into bb41504 - view on LGTM.com

new alerts:

  • 2 for Incomplete URL substring sanitization

cwl-bot commented Mar 25, 2023

This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/equivalent-to-file-metadata-for-local-environment/782/2

@mr-c mr-c force-pushed the input_annotations branch 4 times, most recently from 25b3ddb to fad5e8e Compare December 18, 2023 13:34
TODO: fix intent list

add/amend tests
RenskeW (Contributor, Author) commented Jan 22, 2024

@mr-c here is a hand-annotated example of the new RDF provenance graph, including the CWL metadata fields:

https://github.com/RenskeW/cwlprov-provenance/blob/d0b69c24e57b6db9e7d061f9512ba35e842c9cc5/cwlprov_rdf_examples/scenario1/ro/metadata/provenance/primary.cwlprov.ttl

A separate example, showing how annotations on input parameter values (e.g. an input File) are propagated to the CWLProv RDF, is here:

https://github.com/RenskeW/cwlprov-provenance/tree/930cc268da77c5aa3739a2f7f87b94e076b144e6/prov_data_annotations/example1
