
🍀 Document the FOUR Data Pipelines #4433

Closed
5 tasks done
nickumia-reisys opened this issue Aug 25, 2023 · 8 comments
@nickumia-reisys commented Aug 25, 2023

User Story

In order to inform existing and new harvesting processes and procedures, the Data.gov Architect Team wants to document the FOUR pipelines that all harvesting travels through. These pipelines will either be (1) optimized in the current system or (2) used to inform building a better new system from the start.

Acceptance Criteria

  • GIVEN research and analysis has been done surrounding the pipelines
    WHEN I look at this ticket
    THEN there is documentation about how the pipelines are structured, with supporting details and insight into their complex intricacies

Background

Security Considerations (required)

...

Sketch

  • Work through the data lifecycle and pipeline structure for each of the following pipelines
    • file json DCAT
      • Classes of concern: DataJsonHarvester, DatasetHarvesterBase, HarvesterBase
    • file xml FGDC/ISO
      • Classes of concern: GeoDataGovDocHarvester, DocHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
    • file xml waf FGDC
      • Classes of concern: GeoDataGovWAFHarvester, WAFHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
    • api json ARCGIS
      • Classes of concern: ArcGISHarvester, SpatialHarvester, HarvesterBase
    • Note: Each pipeline consists of a source file interface (file, api, waf), a file type association (json, xml), and a schema specification (dcat, fgdc, arcgis). The idea behind specifying these is to highlight areas of abstraction that make the code more relevant and reusable. A proposed NEW pipeline, api json DCAT, would handle large DCAT data.json files that are unwieldy when processed as a single entity.
    • api json DCAT
      • PROPOSED
  • Data lifecycle consists of: (1) Source creation, (2) Source execution (job run) schedule, (3) Gather Stage, (4) Fetch Stage, (5) Import Stage
    • Note: the current harvesting methodology performs the create/update/delete/etc. at non-sequential points across the gather/fetch/import stages. Be sure to document the real sequence as it stands today; any optimization or change should be follow-on work. (A condensed sketch of the class layering and the three stages follows this list.)
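
To make the class chains above easier to compare at a glance, here is a minimal Python sketch of how the four pipelines appear to layer over the shared three-stage interface from ckanext-harvest. The gather/fetch/import method names follow CKAN's IHarvester interface; the inheritance order and the stub bodies are illustrative assumptions, not the actual code.

```python
# Condensed sketch only: class names come from the list above
# (ckanext-harvest, ckanext-spatial, ckanext-geodatagov, ckanext-datajson).
# Inheritance order and stub bodies are assumptions, not the real code.

class HarvesterBase:
    """Shared bottom layer (ckanext-harvest)."""

    def gather_stage(self, harvest_job):
        """Enumerate remote records; return a list of HarvestObject ids."""
        raise NotImplementedError

    def fetch_stage(self, harvest_object):
        """Retrieve content for one record (may be a no-op if gather did it)."""
        raise NotImplementedError

    def import_stage(self, harvest_object):
        """Validate/translate the record and create or update the dataset."""
        raise NotImplementedError


# file json DCAT
class DatasetHarvesterBase(HarvesterBase): ...
class DataJsonHarvester(DatasetHarvesterBase): ...

# shared spatial layer
class SpatialHarvester(HarvesterBase): ...
class GeoDataGovHarvester(SpatialHarvester): ...

# file xml FGDC/ISO
class DocHarvester(SpatialHarvester): ...
class GeoDataGovDocHarvester(GeoDataGovHarvester, DocHarvester): ...

# file xml waf FGDC
class WAFHarvester(SpatialHarvester): ...
class GeoDataGovWAFHarvester(GeoDataGovHarvester, WAFHarvester): ...

# api json ARCGIS
class ArcGISHarvester(SpatialHarvester): ...
```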
@nickumia-reisys

Note: TWO harvesting pipelines have been deprecated (I believe both of these are FGDC/ISO, but not sure):

  • api csw ??? ???
  • api cms ???

nickumia-reisys self-assigned this Aug 25, 2023
@nickumia-reisys commented Aug 26, 2023

My earlier comment is in the edit history; I deleted it to keep the ticket clean. See the diagrams below for the most up-to-date information.

@btylerburton

Link to MD Translator spike: #4200

@nickumia-reisys commented Sep 19, 2023

DCAT Pipeline initial pass complete.

[diagram: file json DCAT pipeline]

... Moving on to file xml FGDC/ISO

Just as a random note, our DCAT code is much less organized than the upstream spatial code...
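
For readers without the diagram, a rough sketch of the shape this gather stage takes: download the data.json catalog, diff dataset identifiers and content hashes against the previous run, and queue create/update/delete work for the later stages. All names below are illustrative assumptions, not the actual DataJsonHarvester code.

```python
# Rough sketch of gather-stage diffing for a DCAT catalog. The helper
# names (download_catalog, gather) and the (action, identifier) work-item
# shape are hypothetical, not the actual DataJsonHarvester internals.
import hashlib
import json
from urllib.request import urlopen


def download_catalog(url: str) -> list[dict]:
    """Fetch a data.json catalog and return its dataset entries."""
    with urlopen(url) as resp:
        return json.load(resp).get("dataset", [])


def gather(url: str, existing: dict[str, str]) -> list[tuple[str, str]]:
    """Diff remote datasets against the previous run.

    `existing` maps dataset identifier -> content hash from the last run.
    Returns (action, identifier) work items for the fetch/import stages.
    """
    work = []
    seen = set()
    for ds in download_catalog(url):
        ident = ds["identifier"]
        seen.add(ident)
        digest = hashlib.sha256(
            json.dumps(ds, sort_keys=True).encode()
        ).hexdigest()
        if ident not in existing:
            work.append(("create", ident))
        elif existing[ident] != digest:
            work.append(("update", ident))
    # Previously harvested datasets absent from the catalog get deleted --
    # note the delete *decision* is made here, while the actual deletion
    # runs later, which is part of the non-sequential behavior flagged above.
    work.extend(("delete", ident) for ident in existing.keys() - seen)
    return work
```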

@nickumia-reisys commented Sep 19, 2023

Single XML Pipeline initial pass complete.

[diagram: single-file XML pipeline]

... Moving on to file xml waf FGDC/ISO tomorrow

@nickumia-reisys

WAF XML Pipeline initial pass complete.

[diagram: WAF XML pipeline]

... Moving on to api json ARCGIS next

@nickumia-reisys

ArcGIS Pipeline initial pass complete.

[diagram: ArcGIS pipeline]

... I'm done? 🎉

@nickumia-reisys

The diagrams in the comments above capture the core of the harvesting optimization problem: what happens when, which errors are not being captured, which assumptions fail to hold. The next step is reviewing the code, abstracting it into meaningful chunks, testing the functionality, preserving the best parts, and fixing the broken parts. One of the cornerstones of implementing a new version of this code is the following requirement:

1.2.3 Data.gov should be able to adapt to new data formats not originally accounted for in its design.

The controller diagram highlights some high-level abstractions for input/output definitions. Within the extract component, for example, we want to support multiple file types (and possibly new ones, e.g. RDF), so creating an abstraction between "data download" and "data parsing/input" would let us hook in a new file type (see the sketch below). Further discussion will follow and should answer many of the open questions; from there, it will be easier to start implementing and executing this work.
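
A minimal sketch of that seam, assuming a hypothetical Parser protocol and registry (none of these names exist in the current code): the download side hands raw bytes to extract(), and the parser is chosen by file type.

```python
# Hypothetical seam between "data download" and "data parsing/input".
# Parser, PARSERS, register_parser, and extract are illustrative names;
# nothing here exists in the current harvester code.
import json
from typing import Callable, Protocol


class Parser(Protocol):
    def parse(self, raw: bytes) -> list[dict]:
        """Turn raw bytes into normalized dataset records."""
        ...


PARSERS: dict[str, Parser] = {}


def register_parser(file_type: str) -> Callable[[type], type]:
    """Class decorator: register a parser under a file-type key."""
    def wrap(cls: type) -> type:
        PARSERS[file_type] = cls()
        return cls
    return wrap


@register_parser("json")
class JsonParser:
    def parse(self, raw: bytes) -> list[dict]:
        return json.loads(raw).get("dataset", [])


def extract(raw: bytes, file_type: str) -> list[dict]:
    """The seam: download produced `raw`; parsing is selected by file type."""
    parser = PARSERS.get(file_type)
    if parser is None:
        raise ValueError(f"no parser registered for {file_type!r}")
    return parser.parse(raw)
```

Supporting RDF would then mean registering one new parser class without touching the download side, which is the kind of hook requirement 1.2.3 asks for.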
