Ingestion Type Handling
The ingestion process reads strings from a data source such as CSV or ODBC, then maps, combines and transforms them into values for a SPARQL INSERT statement. Two types of checks are done at this stage:
- type checking
- values needed for URILookup must not be empty
This page explains how the final values are converted to the proper type as specified by the model, and how type-checking is performed.
Data conversion is focused on URIs and W3C xsd primitive data types. The following describes how each is parsed and ingested.
To be ingested as a URI, a string must pass these validations:
- the Java URL(uri) constructor must succeed without throwing an exception
- the first character of the local fragment must match [a-zA-Z0-9]
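The two validations above can be sketched as follows. This is a minimal illustration, not the SemTK implementation: the class and method names are hypothetical, and it assumes that the "local fragment" is the text after the last "#" and that a missing or empty fragment fails the check.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UriCheckSketch {

    /** Returns true if the string passes both checks described above. */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still available
    static boolean isIngestibleUri(String uri) {
        try {
            new URL(uri); // check 1: the constructor must not throw
        } catch (MalformedURLException e) {
            return false;
        }
        // Assumption: the local fragment is the text after the last '#',
        // and a missing or empty fragment is treated as a failure.
        int hash = uri.lastIndexOf('#');
        if (hash < 0 || hash == uri.length() - 1) {
            return false;
        }
        char c = uri.charAt(hash + 1);
        // check 2: the first character must match [a-zA-Z0-9]
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        System.out.println(isIngestibleUri("http://example.org/data#item1"));
        System.out.println(isIngestibleUri("http://example.org/data#_item1"));
    }
}
```

The character test is written out by hand rather than with Character.isLetterOrDigit, because that method also accepts non-ASCII letters and digits, which [a-zA-Z0-9] does not.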
First note that an ingestion template may use a text field ending in "#" to add a prefix to ingested data.
After the entire URI value is built, if it contains no prefix (i.e. the URI contains neither "#" nor "://"), a prefix is added by the first of these rules that applies:
- Enumerated - if the URI is being assigned to an instance of a Class that is specified by "owl:oneOf" (SADL "must be one of"), then input strings may be either the entire URL or the case-sensitive local prefix. Local prefixes will be changed to the matching full URI by the ingestion service.
- BaseURI - otherwise, if the "Base URI" field is specified, it is prepended along with a "#".
- Default - otherwise, "http://semtk.research.ge.com/generated#" will be prepended.
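The prefix rules above can be read as a short decision chain. The sketch below is illustrative only: the class name, method name, and the map used to represent an "owl:oneOf" enumeration are hypothetical, and rejecting an unknown enumerated value with an exception is an assumption rather than documented behavior.

```java
import java.util.Collections;
import java.util.Map;

final class UriPrefixSketch {
    static final String DEFAULT_PREFIX = "http://semtk.research.ge.com/generated#";

    /**
     * @param value      the fully built URI value from the template
     * @param enumerated hypothetical map of local name to full URI for an
     *                   owl:oneOf class, or null if the class is not enumerated
     * @param baseUri    the "Base URI" field, or null/empty if not specified
     */
    static String addPrefix(String value, Map<String, String> enumerated, String baseUri) {
        if (value.contains("#") || value.contains("://")) {
            return value; // already carries a prefix, nothing to add
        }
        if (enumerated != null) {
            String full = enumerated.get(value); // lookup is case-sensitive
            if (full == null) {
                // Assumption: an unknown enumerated value is an error.
                throw new IllegalArgumentException("Not a member of the enumeration: " + value);
            }
            return full;
        }
        if (baseUri != null && !baseUri.isEmpty()) {
            return baseUri + "#" + value;
        }
        return DEFAULT_PREFIX + value;
    }

    public static void main(String[] args) {
        Map<String, String> colors =
                Collections.singletonMap("red", "http://example.org/colors#red");
        System.out.println(addPrefix("red", colors, null));
        System.out.println(addPrefix("item1", null, "http://example.org/base"));
        System.out.println(addPrefix("item1", null, null));
    }
}
```

Note that the sketch does not verify that a full URI supplied for an enumerated class is actually a member of the enumeration.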
During ingestion, all xsd:date values are translated into a SPARQL INSERT query using the [ISO_LOCAL_DATE](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE) format: "2011-12-03".
The following formatters are tried in order, using Java LocalDate.parse(date, formatter):
- "MM/dd/yyyy"
- "MM-dd-yyyy"
- "yyyy-MM-dd"
- "dd-MMM-yyyy", case insensitive (e.g. 12-Jun-2008 or 12-JUN-2008)
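Trying the formatters in order might look like the sketch below. The class and method names are hypothetical, and emitting the result through DateTimeFormatter.ISO_LOCAL_DATE (yyyy-MM-dd) is an assumption about the output form. The case-insensitive pattern is built with DateTimeFormatterBuilder.parseCaseInsensitive(), and Locale.ENGLISH is assumed for the month abbreviations.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class DateParseSketch {

    // Formatters in the order listed above.
    private static final List<DateTimeFormatter> FORMATTERS = Arrays.asList(
            DateTimeFormatter.ofPattern("MM/dd/yyyy"),
            DateTimeFormatter.ofPattern("MM-dd-yyyy"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd"),
            new DateTimeFormatterBuilder()
                    .parseCaseInsensitive() // accepts 12-Jun-2008 and 12-JUN-2008
                    .appendPattern("dd-MMM-yyyy")
                    .toFormatter(Locale.ENGLISH));

    /** Returns the ISO date for the first formatter that parses the input. */
    static String toIsoDate(String input) {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return LocalDate.parse(input, formatter)
                        .format(DateTimeFormatter.ISO_LOCAL_DATE);
            } catch (DateTimeParseException e) {
                // fall through to the next formatter
            }
        }
        throw new IllegalArgumentException("Unrecognized xsd:date value: " + input);
    }

    public static void main(String[] args) {
        System.out.println(toIsoDate("06/12/2008"));
        System.out.println(toIsoDate("12-JUN-2008"));
    }
}
```

Order matters only when a string could match more than one pattern; the four patterns here use different separators or field orders, so each valid input matches at most one of them.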
During ingestion, all xsd:dateTime values are translated into a SPARQL INSERT query using the [ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME) format: "2011-12-03T10:15:30+01:00".
[ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME) is the only format with a time zone offset that is recognized, and it is tried first:
- "yyyy-MM-ddTHH:mm:ss+01:00"
The following formatters are then tried in order, using Java LocalDateTime.parse(date, formatter):
- "MM/dd/yyyy HH:mm:ss"
- "MM-dd-yyyy HH:mm:ss"
- "yyyy/MM/dd HH:mm:ss"
- "yyyy-MM-dd HH:mm:ss"
- "dd-MMM-yyyy HH:mm:ss", case insensitive (e.g. 12-Jun-2008 05:00:00 or 12-JUN-2008 05:00:00)
All xsd:date formats are tried next. If parsed successfully, the date is inserted as a dateTime with no timezone and hours, minutes, seconds set to 0.
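The three-stage fallback for xsd:dateTime (offset format, then local date-time patterns, then date-only patterns at midnight) can be sketched as below. The class and method names are hypothetical. As a simplification, every pattern is built case-insensitively with Locale.ENGLISH, which only affects the "MMM" patterns, and values without an offset are emitted through ISO_LOCAL_DATE_TIME as an assumption about the no-timezone output form.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;

final class DateTimeParseSketch {

    private static final String[] DATE_TIME_PATTERNS = {
        "MM/dd/yyyy HH:mm:ss", "MM-dd-yyyy HH:mm:ss", "yyyy/MM/dd HH:mm:ss",
        "yyyy-MM-dd HH:mm:ss", "dd-MMM-yyyy HH:mm:ss"
    };
    private static final String[] DATE_PATTERNS = {
        "MM/dd/yyyy", "MM-dd-yyyy", "yyyy-MM-dd", "dd-MMM-yyyy"
    };

    // Simplification: all patterns are case-insensitive; this only matters for MMM.
    private static DateTimeFormatter formatter(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);
    }

    static String toIsoDateTime(String input) {
        // Stage 1: the offset format keeps its time zone offset.
        try {
            return OffsetDateTime.parse(input, DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                    .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // not an offset date-time, keep trying
        }
        // Stage 2: local date-time patterns, in the listed order.
        for (String pattern : DATE_TIME_PATTERNS) {
            try {
                return LocalDateTime.parse(input, formatter(pattern))
                        .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next pattern
            }
        }
        // Stage 3: date-only patterns, with hours, minutes, seconds set to 0.
        for (String pattern : DATE_PATTERNS) {
            try {
                return LocalDate.parse(input, formatter(pattern)).atStartOfDay()
                        .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next pattern
            }
        }
        throw new IllegalArgumentException("Unrecognized xsd:dateTime value: " + input);
    }

    public static void main(String[] args) {
        System.out.println(toIsoDateTime("2011-12-03T10:15:30+01:00"));
        System.out.println(toIsoDateTime("12-JUN-2008"));
    }
}
```

The output is formatted with ISO_LOCAL_DATE_TIME rather than LocalDateTime.toString(), because toString() omits the seconds when they are zero.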
boolean:
- Java Boolean.parseBoolean
decimal, double:
- Java Double.parseDouble
duration:
- Java Duration.parse
float:
- Java Float.parseFloat
int, integer, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger:
- Java Integer.parseInt
long:
- Java Long.parseLong
unsignedByte:
- Java Byte.parseByte
unsignedInt:
time:
- Java Time.parse
gYearMonth:
- "YYYY-MM"
gYear:
- "YYYY"
gMonthDay:
- "MM-dd"
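For the types that map directly to a standard Java parser, a type check can amount to "call the parser and see whether it throws". The sketch below covers only a handful of the types listed above; the class and method names are hypothetical, and this is an illustration of the idea rather than the service's actual code.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

final class PrimitiveCheckSketch {

    /** Returns true if the value parses with the Java parser mapped to the type. */
    static boolean parses(String xsdType, String value) {
        try {
            switch (xsdType) {
                case "boolean":
                    Boolean.parseBoolean(value); // never throws, see note below
                    break;
                case "decimal":
                case "double":
                    Double.parseDouble(value);
                    break;
                case "float":
                    Float.parseFloat(value);
                    break;
                case "int":
                case "integer":
                    Integer.parseInt(value);
                    break;
                case "duration":
                    Duration.parse(value); // ISO-8601 form, e.g. PT15M
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Type not covered by this sketch: " + xsdType);
            }
            return true;
        } catch (NumberFormatException | DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("int", "42"));
        System.out.println(parses("int", "4.2"));
        System.out.println(parses("duration", "PT15M"));
    }
}
```

One consequence worth knowing: Boolean.parseBoolean never throws. It returns false for any string other than "true" (ignoring case), so a check built this way cannot reject a malformed boolean.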
During ingestion, SemTK creates literals following RDF 1.1:
- strings are quoted and untyped: "example"
- numeric values are untyped and unquoted: 42
- dates and times are quoted and typed: "2012-02-02T02:00:00"^^XMLSchema:dateTime
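The three literal shapes above can be produced with simple string building, as in this sketch. The class and method names are hypothetical, the "XMLSchema:" prefix is copied from the example above, and escaping backslashes and double quotes inside strings is an assumption that the source does not discuss.

```java
final class LiteralSketch {

    /** Strings are quoted and untyped: "example" */
    static String stringLiteral(String s) {
        // Assumption: backslashes and double quotes are escaped.
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Numeric values are untyped and unquoted: 42 */
    static String numericLiteral(String lexical) {
        return lexical;
    }

    /** Dates and times are quoted and typed: "..."^^XMLSchema:dateTime */
    static String typedLiteral(String lexical, String xsdType) {
        return "\"" + lexical + "\"^^XMLSchema:" + xsdType;
    }

    public static void main(String[] args) {
        System.out.println(stringLiteral("example"));
        System.out.println(numericLiteral("42"));
        System.out.println(typedLiteral("2012-02-02T02:00:00", "dateTime"));
    }
}
```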