
🍀 Document the FOUR Data Pipelines #4433

Closed
5 tasks done
nickumia-reisys opened this issue Aug 25, 2023 · 8 comments
@nickumia-reisys commented Aug 25, 2023

User Story

In order to inform existing and new harvesting processes and procedures, the Data.gov Architect Team wants to document the FOUR pipelines that all harvesting travels through. These pipelines will either be (1) optimized in the current system or (2) used to inform building a better new system from the start.

Acceptance Criteria

  • GIVEN research and analysis has been done surrounding the pipelines
    WHEN I look at this ticket
    THEN there is documentation about how the pipelines are structured, with supporting details and insight into their complex intricacies

Background

Security Considerations (required)

...

Sketch

  • Work through the data lifecycle and pipeline structure for each of the following pipelines
    • file json DCAT
      • Classes of concern: DataJsonHarvester, DatasetHarvesterBase, HarvesterBase
    • file xml FGDC/ISO
      • Classes of concern: GeoDataGovDocHarvester, DocHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
    • file xml waf FGDC
      • Classes of concern: GeoDataGovWAFHarvester, WAFHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
    • api json ARCGIS
      • Classes of concern: ArcGISHarvester, SpatialHarvester, HarvesterBase
    • Note: Each pipeline consists of a source file interface (file, api, waf), a file type association (json, xml), and a schema specification (dcat, fgdc, arcgis). The idea behind specifying these is to highlight areas of abstraction that make the code more relevant and reusable. A proposed NEW pipeline, api json DCAT, would handle large DCAT data.json files that are unwieldy when processed as a single entity.
    • api json DCAT
      • PROPOSED
  • Data lifecycle consists of: (1) Source creation, (2) Source execution (job run) schedule, (3) Gather Stage, (4) Fetch Stage, (5) Import Stage
    • Note: the current harvesting methodology performs the create/update/delete/etc. at non-sequential points across the gather/fetch/import stages. Be sure to document the real sequence as it stands today; any optimization or change should be follow-on work. (A condensed sketch of the class layering and the three stages follows this list.)
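
To make the class chains above easier to compare at a glance, here is a minimal Python sketch of how the four pipelines appear to layer over the shared three-stage interface from ckanext-harvest. The gather/fetch/import method names follow CKAN's IHarvester interface; the inheritance order and the stub bodies are illustrative assumptions, not the actual code.

```python
# Condensed sketch only: class names come from the list above
# (ckanext-harvest, ckanext-spatial, ckanext-geodatagov, ckanext-datajson).
# Inheritance order and stub bodies are assumptions, not the real code.

class HarvesterBase:
    """Shared bottom layer (ckanext-harvest)."""

    def gather_stage(self, harvest_job):
        """Enumerate remote records; return a list of HarvestObject ids."""
        raise NotImplementedError

    def fetch_stage(self, harvest_object):
        """Retrieve content for one record (may be a no-op if gather did it)."""
        raise NotImplementedError

    def import_stage(self, harvest_object):
        """Validate/translate the record and create or update the dataset."""
        raise NotImplementedError


# file json DCAT
class DatasetHarvesterBase(HarvesterBase): ...
class DataJsonHarvester(DatasetHarvesterBase): ...

# shared spatial layer
class SpatialHarvester(HarvesterBase): ...
class GeoDataGovHarvester(SpatialHarvester): ...

# file xml FGDC/ISO
class DocHarvester(SpatialHarvester): ...
class GeoDataGovDocHarvester(GeoDataGovHarvester, DocHarvester): ...

# file xml waf FGDC
class WAFHarvester(SpatialHarvester): ...
class GeoDataGovWAFHarvester(GeoDataGovHarvester, WAFHarvester): ...

# api json ARCGIS
class ArcGISHarvester(SpatialHarvester): ...
```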
@nickumia-reisys

Note: TWO harvesting pipelines have been deprecated (I believe both of these are FGDC/ISO, but not sure):

  • api csw ??? ???
  • api cms ???

nickumia-reisys self-assigned this Aug 25, 2023
@nickumia-reisys commented Aug 26, 2023

My earlier comment is in the edit history; I deleted it to keep the ticket clean. See the diagrams below for the most up-to-date information.

@btylerburton

Link to MD Translator spike: #4200

@nickumia-reisys commented Sep 19, 2023

DCAT Pipeline initial pass complete.

[diagram: file json DCAT pipeline]

... Moving on to file xml FGDC/ISO

Just as a random note, our DCAT code is much less organized than the upstream spatial code...
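
For readers without the diagram, a rough sketch of the shape this gather stage takes: download the data.json catalog, diff dataset identifiers and content hashes against the previous run, and queue create/update/delete work for the later stages. All names below are illustrative assumptions, not the actual DataJsonHarvester code.

```python
# Rough sketch of gather-stage diffing for a DCAT catalog. The helper
# names (download_catalog, gather) and the (action, identifier) work-item
# shape are hypothetical, not the actual DataJsonHarvester internals.
import hashlib
import json
from urllib.request import urlopen


def download_catalog(url: str) -> list[dict]:
    """Fetch a data.json catalog and return its dataset entries."""
    with urlopen(url) as resp:
        return json.load(resp).get("dataset", [])


def gather(url: str, existing: dict[str, str]) -> list[tuple[str, str]]:
    """Diff remote datasets against the previous run.

    `existing` maps dataset identifier -> content hash from the last run.
    Returns (action, identifier) work items for the fetch/import stages.
    """
    work = []
    seen = set()
    for ds in download_catalog(url):
        ident = ds["identifier"]
        seen.add(ident)
        digest = hashlib.sha256(
            json.dumps(ds, sort_keys=True).encode()
        ).hexdigest()
        if ident not in existing:
            work.append(("create", ident))
        elif existing[ident] != digest:
            work.append(("update", ident))
    # Previously harvested datasets absent from the catalog get deleted --
    # note the delete *decision* is made here, while the actual deletion
    # runs later, which is part of the non-sequential behavior flagged above.
    work.extend(("delete", ident) for ident in existing.keys() - seen)
    return work
```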

@nickumia-reisys commented Sep 19, 2023

Single XML Pipeline initial pass complete.

[diagram: single-file XML pipeline]

... Moving on to file xml waf FGDC/ISO tomorrow

@nickumia-reisys

WAF XML Pipeline initial pass complete.

[diagram: WAF XML pipeline]

... Moving on to api json ARCGIS next

@nickumia-reisys

ArcGIS Pipeline initial pass complete.

[diagram: ArcGIS pipeline]

... I'm done? 🎉

@nickumia-reisys

The diagrams in the comments above capture the core of the harvesting optimization problem: what happens when, which errors are not being captured, which assumptions fail to hold. The next step is reviewing the code, abstracting it into meaningful chunks, testing the functionality, preserving the best parts, and fixing the broken parts. One of the cornerstones of implementing a new version of this code is the following requirement:

1.2.3 Data.gov should be able to adapt to new data formats not originally accounted for in its design.

The controller diagram highlights some high-level abstractions for input/output definitions. Within the extract component, for example, we want to support multiple file types (and possibly new ones, e.g. RDF), so creating an abstraction between "data download" and "data parsing/input" would let us hook in a new file type (see the sketch below). Further discussion will follow and should answer many of the open questions; from there, it will be easier to start implementing and executing this work.
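
A minimal sketch of that seam, assuming a hypothetical Parser protocol and registry (none of these names exist in the current code): the download side hands raw bytes to extract(), and the parser is chosen by file type.

```python
# Hypothetical seam between "data download" and "data parsing/input".
# Parser, PARSERS, register_parser, and extract are illustrative names;
# nothing here exists in the current harvester code.
import json
from typing import Callable, Protocol


class Parser(Protocol):
    def parse(self, raw: bytes) -> list[dict]:
        """Turn raw bytes into normalized dataset records."""
        ...


PARSERS: dict[str, Parser] = {}


def register_parser(file_type: str) -> Callable[[type], type]:
    """Class decorator: register a parser under a file-type key."""
    def wrap(cls: type) -> type:
        PARSERS[file_type] = cls()
        return cls
    return wrap


@register_parser("json")
class JsonParser:
    def parse(self, raw: bytes) -> list[dict]:
        return json.loads(raw).get("dataset", [])


def extract(raw: bytes, file_type: str) -> list[dict]:
    """The seam: download produced `raw`; parsing is selected by file type."""
    parser = PARSERS.get(file_type)
    if parser is None:
        raise ValueError(f"no parser registered for {file_type!r}")
    return parser.parse(raw)
```

Supporting RDF would then mean registering one new parser class without touching the download side, which is the kind of hook requirement 1.2.3 asks for.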
