Schema v1 to Aardvark migrator #143

Merged
merged 1 commit into from
Jan 29, 2024

thatbudakguy
Member

  • Handle elements without crosswalk (via lookup tables)
  • Support migrating collections in dct_isPartOf_sm
  • Convert single to multivalued fields where appropriate
  • Retain custom fields and remove deprecated fields

Closes #121
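
A minimal sketch of how the renaming, multivalued conversion, and field retention listed above could fit together. This is not the code from this PR: the `migrate` method, the `CROSSWALK` subset, and the `DEPRECATED` list are assumptions chosen for illustration, and the lookup-table and collection steps are left out.

```ruby
require 'json'

# Hypothetical subset of a v1-to-Aardvark crosswalk (illustration only).
CROSSWALK = {
  'dc_title_s' => 'dct_title_s',
  'dc_description_s' => 'dct_description_sm',
  'dc_rights_s' => 'dct_accessRights_s',
  'dct_provenance_s' => 'schema_provider_s',
  'layer_id_s' => 'gbl_wxsIdentifier_s',
  'layer_modified_dt' => 'gbl_mdModified_dt'
}.freeze

# Assumed list of v1 fields that are dropped rather than renamed.
DEPRECATED = %w[dc_type_s layer_geom_type_s].freeze

def migrate(v1_record)
  migrated = v1_record.each_with_object({}) do |(key, value), out|
    next if DEPRECATED.include?(key)

    # Keys missing from the crosswalk (custom fields) pass through unchanged.
    new_key = CROSSWALK.fetch(key, key)
    # Fields ending in _sm or _im are multivalued, so wrap scalars in an array.
    value = Array(value) if new_key.end_with?('_sm', '_im')
    out[new_key] = value
  end
  migrated.merge('gbl_mdVersion_s' => 'Aardvark')
end

v1 = {
  'dc_title_s' => 'Land Use Milwaukee County, WI 1963',
  'dc_description_s' => 'This data layer represents land use for Milwaukee County, Wisconsin in 1963.',
  'layer_geom_type_s' => 'Polygon',
  'uw_notice_s' => ''
}
puts JSON.pretty_generate(migrate(v1))
```

In this sketch a local field such as `uw_notice_s` comes through untouched because it has no crosswalk entry, while `layer_geom_type_s` is stripped.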

@thatbudakguy force-pushed the v1-aardvark branch 2 times, most recently from e0b8659 to c051c55 (March 1, 2023 23:10)
@thatbudakguy marked this pull request as ready for review (March 1, 2023 23:11)
@thatbudakguy added a commit that referenced this pull request (Mar 27, 2023):
This fixes the error about a pending spec without a reason. #143 will un-pend the test when it is merged.
@thatbudakguy marked this pull request as draft (March 30, 2023 22:14)
@thatbudakguy
Member Author

Making this a draft again pending discussion of behavior for some fields; see OpenGeoMetadata/metadata-issues#50

@the-codetrane

The path in lib/geo_combine/geoblacklight.rb:16 changed to "https://raw.githubusercontent.com/OpenGeoMetadata/opengeometadata.github.io/main/docs/schema/geoblacklight-schema-#{GEOBLACKLIGHT_VERSION}.json"

@the-codetrane

@thatbudakguy any chance we can get the solr_geom to dcat_bbox conversion added to this? Otherwise this is working.

@thatbudakguy
Member Author

thatbudakguy commented Sep 13, 2023

@the-codetrane thx for pointing that out; I added a step to handle dcat_bbox. This PR is now blocked by #162.
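
For readers following along, a rough sketch of what a solr_geom to dcat_bbox step might look like. The `migrate_bbox` method and the `ENVELOPE_PATTERN` check are hypothetical and not taken from this PR; the sketch assumes the v1 value is an `ENVELOPE(...)` string with four numbers and leaves anything else untouched.

```ruby
# Hypothetical helper: a signed decimal number, used to build the pattern below.
NUMBER = '-?\d+(?:\.\d+)?'
# Matches strings such as "ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)".
ENVELOPE_PATTERN = /\AENVELOPE\(\s*#{NUMBER}(?:\s*,\s*#{NUMBER}){3}\s*\)\z/

def migrate_bbox(record)
  geom = record['solr_geom']
  # Leave the record alone when there is no recognizable envelope to move.
  return record unless geom.is_a?(String) && geom.match?(ENVELOPE_PATTERN)

  migrated = record.reject { |key, _| key == 'solr_geom' }
  migrated.merge('dcat_bbox' => geom)
end

record = { 'solr_geom' => 'ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)' }
p migrate_bbox(record)
```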

@the-codetrane

the-codetrane commented Oct 12, 2023

@thatbudakguy found another key that could be migrated - layer_geom_type_s to gbl_resourceType_sm. The crosswalk documentation has them as deprecated/new fields, but it would appear they are in fact related.

@thatbudakguy
Member Author

there's code in this PR to do that – we use a lookup table to map geometry types to resource types. it's only straightforward for a few cases, imo. does it not work for you?
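
A sketch of the lookup-table idea, under stated assumptions: only the `Polygon` to `Polygon data` pair is confirmed elsewhere in this thread, the other entries and the `migrate_geom_type` method name are illustrative guesses, and the actual table in the PR may differ. Geometry types without an obvious counterpart are simply dropped here.

```ruby
# Assumed mapping from v1 layer_geom_type_s values to Aardvark resource types.
GEOM_TYPE_TO_RESOURCE_TYPE = {
  'Point' => 'Point data',
  'Line' => 'Line data',
  'Polygon' => 'Polygon data',
  'Raster' => 'Raster data'
}.freeze

def migrate_geom_type(record)
  geom_type = record['layer_geom_type_s']
  # The deprecated field is removed whether or not a mapping exists.
  migrated = record.reject { |key, _| key == 'layer_geom_type_s' }
  resource_type = GEOM_TYPE_TO_RESOURCE_TYPE[geom_type]
  return migrated if resource_type.nil?

  # gbl_resourceType_sm is multivalued, so the value goes in an array.
  migrated.merge('gbl_resourceType_sm' => [resource_type])
end

p migrate_geom_type({ 'layer_geom_type_s' => 'Polygon' })
# => {"gbl_resourceType_sm"=>["Polygon data"]}
```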

@the-codetrane

This is what comes out when I run the migrator on a GBL 1.0 schema record:

{
  "dct_description_sm": [
    "This polygon shapefile represents the 1964 County Boundaries for China. The layer includes population census data and was primarily based on the \"Historical Administrative Maps of the People's Republic of China,\" published by China Map Press, and some other yearly administrative maps. See the documentation for more information and a list of the layer variables."
  ],
  "dct_format_s": "Shapefile",
  "dct_identifier_sm": [
    "http://hdl.handle.net/2451/34626"
  ],
  "dct_language_sm": [
    "English"
  ],
  "dct_publisher_sm": [
    "Beijing Hua tong ren shi chang xin xi you xian ze ren gong si"
  ],
  "dc_relation_sm": [
    "http://sws.geonames.org/1814991/about/rdf"
  ],
  "dct_accessRights_s": "Restricted",
  "dct_subject_sm": [
    "Boundaries",
    "Demographic surveys",
    "Population"
  ],
  "dct_title_s": "1964 County Boundaries of China with Population Census Data",
  "dc_type_s": "Dataset",
  "dct_isPartOf_sm": [
    "Historical China County Population Census Data"
  ],
  "dct_issued_s": "2005",
  "schema_provider_s": "NYU",
  "dct_references_s": "{\"http://schema.org/url\":\"http://hdl.handle.net/2451/34626\",\"http://www.opengis.net/def/serviceType/ogc/wfs\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wfs\",\"http://www.opengis.net/def/serviceType/ogc/wms\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wms\",\"http://schema.org/downloadUrl\":\"https://archive.nyu.edu/retrieve/74851/nyu_2451_34626.zip\",\"http://lccn.loc.gov/sh85035852\":\"https://archive.nyu.edu/retrieve/74896/nyu_2451_34626_doc.zip\"}",
  "dct_spatial_sm": [
    "People's Republic of China, China"
  ],
  "dct_temporal_sm": [
    "1964"
  ],
  "gbl_mdVersion_s": "Aardvark",
  "layer_geom_type_s": "Polygon", // I'M GUESSING THIS IS SUPPOSED TO BE SOMETHING ELSE?
  "gbl_wxsIdentifier_s": "sdr:nyu_2451_34626",
  "gbl_mdModified_dt": "2016-11-10T15:51:38Z",
  "id": "nyu-2451-34626",
  "nyu_addl_dspace_s": "35559",
  "locn_geometry": "ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)",
  "gbl_indexYear_im": [
    1964
  ],
  "nyu_addl_format_sm": [
    "Shapefile"
  ],
  "_version_": 1779481613907787776,
  "timestamp": "2023-10-11T17:38:31.500Z"
}

@srappel

srappel commented Oct 12, 2023

"layer_geom_type_s": "Polygon", // I'M GUESSING THIS IS SUPPOSED TO BE SOMETHING ELSE?

I would expect "gbl_resourceType_sm": "Polygon Data" according to the controlled vocab

@thatbudakguy
Member Author

@the-codetrane can you share the record that you transformed to get that output?

@the-codetrane

@thatbudakguy My contract at NYU ended, so I'm outside the walled garden. @mnyrop should be able to help you with this.

@thatbudakguy
Member Author

thatbudakguy commented Jan 19, 2024

OK, I found the record. I ran it through the migrator myself and got:

{
  "dct_creator_sm": [],
  "dct_description_sm": [
    "This polygon shapefile represents the 1964 County Boundaries for China. The layer includes population census data and was primarily based on the \"Historical Administrative Maps of the People's Republic of China,\" published by China Map Press, and some other yearly administrative maps. See the documentation for more information and a list of the layer variables."
  ],
  "dct_format_s": "Shapefile",
  "dct_identifier_sm": ["http://hdl.handle.net/2451/34626"],
  "dct_language_sm": ["English"],
  "dct_publisher_sm": [
    "Beijing Hua tong ren shi chang xin xi you xian ze ren gong si"
  ],
  "dc_relation_sm": ["http://sws.geonames.org/1814991/about/rdf"],
  "dct_accessRights_s": "Restricted",
  "dct_subject_sm": ["Boundaries", "Demographic surveys", "Population"],
  "dct_title_s": "1964 County Boundaries of China with Population Census Data",
  "dct_issued_s": "2005",
  "schema_provider_s": "NYU",
  "dct_references_s": "{\"http://schema.org/url\":\"http://hdl.handle.net/2451/34626\",\"http://www.opengis.net/def/serviceType/ogc/wfs\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wfs\",\"http://www.opengis.net/def/serviceType/ogc/wms\":\"https://maps-restricted.geo.nyu.edu/geoserver/sdr/wms\",\"http://schema.org/downloadUrl\":\"https://archive.nyu.edu/retrieve/74851/nyu_2451_34626.zip\",\"http://lccn.loc.gov/sh85035852\":\"https://archive.nyu.edu/retrieve/74896/nyu_2451_34626_doc.zip\"}",
  "dct_spatial_sm": ["People's Republic of China, China"],
  "dct_temporal_sm": ["1964"],
  "gbl_mdVersion_s": "Aardvark",
  "gbl_wxsIdentifier_s": "sdr:nyu_2451_34626",
  "gbl_mdModified_dt": "2016-11-10T15:51:38Z",
  "id": "nyu-2451-34626",
  "nyu_addl_dspace_s": "35559",
  "dcat_bbox": "ENVELOPE(73.557693, 134.773911, 53.56086, 10.175472)",
  "gbl_indexYear_im": [1964],
  "gbl_resourceClass_s": ["Datasets"],
  "gbl_resourceType_s": ["Polygon data"]
}

It turned out there was just a typo; the new field is gbl_resourceType_sm (not gbl_resourceType_s), as it's multi-valued. Otherwise, the conversion works as expected (it outputs Polygon data and the original field is stripped).

I've corrected the mistake.

@karenmajewicz
Contributor

Resource Class is also multivalued: gbl_resourceClass_sm
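
Suffix slips like these can be caught mechanically. The following is a sketch only, assuming the usual convention that `_sm` and `_im` fields hold arrays while `_s`, `_i`, `_b`, and `_dt` fields hold single values; `suffix_mismatches` is a hypothetical helper, not part of this PR.

```ruby
MULTIVALUED_SUFFIXES = %w[_sm _im].freeze
SINGLE_VALUED_SUFFIXES = %w[_s _i _b _dt].freeze

# Returns the keys whose value shape disagrees with their suffix.
def suffix_mismatches(record)
  mismatched = record.select do |key, value|
    if key.end_with?(*MULTIVALUED_SUFFIXES)
      !value.is_a?(Array)
    elsif key.end_with?(*SINGLE_VALUED_SUFFIXES)
      value.is_a?(Array)
    else
      false # keys without a recognized suffix (id, dcat_bbox, ...) are not checked
    end
  end
  mismatched.keys
end

record = {
  'dct_title_s' => '1964 County Boundaries of China with Population Census Data',
  'gbl_indexYear_im' => [1964],
  'gbl_resourceClass_s' => ['Datasets'],
  'gbl_resourceType_s' => ['Polygon data']
}
p suffix_mismatches(record)
# => ["gbl_resourceClass_s", "gbl_resourceType_s"]
```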


@srappel left a comment

I tested it out with this record and got the following output:

{"gbl_mdVersion_s":"Aardvark",
 "dct_identifier_sm":["930D4EA3-442E-4A28-AEC7-830F1A6CB5F8"],
 "dct_title_s":"Land Use Milwaukee County, WI 1963",
 "dct_description_sm":["This data layer represents land use for Milwaukee County, Wisconsin in 1963."],
 "dct_accessRights_s":"Public",
 "schema_provider_s":"UW-Madison Robinson Map Library",
 "gbl_wxsIdentifier_s":"",
 "id":"930D4EA3-442E-4A28-AEC7-830F1A6CB5F8",
 "gbl_mdModified_dt":"2022-01-22T20:12:43Z",
 "dct_format_s":"Shapefile",
 "dct_language_sm":["English"],
 "dct_creator_sm":["Southeastern Wisconsin Regional Planning Commission"],
 "dc_publisher_sm":[""],
 "dct_subject_sm":["Planning and Cadastral"],
 "dct_spatial_sm":[],
 "dct_issued_s":"",
 "dct_temporal_sm":["1963"],
 "gbl_indexYear_im":[1963],
 "dct_references_s":
  "{\"http://schema.org/downloadUrl\":\"https://gisdata.wisc.edu/public/Milwaukee_LandUse_1963.zip\",\"http://www.isotc211.org/schemas/2005/gmd/\":\"https://gisdata.wisc.edu/public/metadata/Milwaukee_LandUse_1963.xml\"}",
 "dcat_bbox":"ENVELOPE(-88.074273, -87.812986, 43.195098, 42.83888)",
 "uw_supplemental_s":"For more information: http://www.sewrpc.org/SEWRPC/LandUse.htm",
 "uw_notice_s":"",
 "gbl_resourceClass_s":["Datasets"],
 "gbl_resourceType_sm":["Polygon data"]}

I looked through it pretty carefully and don't see anything unusual. Note the local fields uw_notice_s and uw_supplemental_s both seem to have just come through as-is, which I assume is the default behavior.

@thatbudakguy force-pushed the v1-aardvark branch 2 times, most recently from 9338a89 to 52db14d (January 29, 2024 22:34)
Successfully merging this pull request may close these issues.

Create class to convert from schema version 1 to Aardvark