Ingestion Type Handling
The ingestion process reads strings from a data source such as CSV or ODBC, then maps, combines and transforms them into values for a SPARQL INSERT statement. Two types of checks are done at this stage:
- type checking
- values needed for URILookup must not be empty
This page explains how the final values are converted to the proper type as specified by the model, and how type-checking is performed.
Data conversion focuses on URIs and the W3C XSD primitive data types. The sections below describe how each is parsed and ingested.
Strings are handled in conformance with the CSV specification, with an attempt to handle newlines inside strings in a way that is compatible with Excel. (Excel has been observed to handle embedded newlines in a hard-to-specify way when the CRLF characters are not as it expects.)
Below is a sample CSV which demonstrates many basic principles of escaping strings in CSV:
- line 1 is the column headers
- line 2 column 1: shows quoted string with escaped quotes "" and comma inside the quotes
- line 2 column 2: shows one embedded line return, plus the escape sequences \n and \\n
- line 3: shows an embedded tab and the escape sequence \t, and the use of unquoted strings that contain neither commas nor quotes
str1,str2
"notepad what is ""this,"" here","notepad line one
backslash-n-follows\nrest2 double-backslash-n-follows\\nrest3"
tab tab,back-t\tback-t
Inside the triplestore, additional backslashes are added to differentiate between special characters and their escape sequences. Querying them back through SemTK will produce a CSV that is equivalent to the one ingested.
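The quoting rules shown in the sample can be illustrated with a minimal sketch of a CSV reader. This is not SemTK's parser; the class name `CsvSketch` and its `parse` method are hypothetical, and the sketch only covers doubled quotes, commas inside quotes, and embedded newlines. Backslash sequences such as \n and \t are left untouched, as plain characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not the SemTK implementation.
class CsvSketch {
    // Splits CSV text into rows of fields, honoring quoted fields.
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // "" inside quotes is one literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and newlines are kept as data
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                rows.add(row);
                row = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);           // CR outside quotes is dropped
            }
        }
        // Flush a final record that has no trailing newline.
        if (field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }
}
```

Calling `CsvSketch.parse` on text shaped like the sample yields one row per logical record, so a quoted field that spans two physical lines still ends up in a single cell.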
To be ingested as a URI, a string must pass these validations:
- the Java URL(uri) constructor must succeed without throwing an exception
- the first character of the local fragment must match [a-zA-Z0-9]
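The two validations can be sketched as follows. The class and method names are hypothetical, and the definition of "local fragment" is an assumption: here it is taken to be the text after the last "#", or after the last "/" when there is no "#". The actual SemTK code may differ.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration only, not the SemTK implementation.
class UriCheckSketch {
    static boolean isValidUri(String uri) {
        try {
            new URL(uri); // throws when the string has no known protocol
        } catch (MalformedURLException e) {
            return false;
        }
        // Assumption: the local fragment follows the last '#',
        // or the last '/' if the URI has no '#'.
        int sep = uri.lastIndexOf('#');
        if (sep < 0) {
            sep = uri.lastIndexOf('/');
        }
        if (sep < 0 || sep == uri.length() - 1) {
            return false; // no local fragment to check
        }
        return String.valueOf(uri.charAt(sep + 1)).matches("[a-zA-Z0-9]");
    }
}
```

With this sketch, `http://example.org/data#item1` passes, while `http://example.org/data#_item` fails the first-character check and a bare string such as `item1` fails the constructor check.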
First note that an ingestion template may use a text field ending in "#" to add a prefix to ingested data.
After the entire URI value is built, if it contains no prefix (i.e. if it contains neither "#" nor "://"), one of the following rules is applied:
- Enumerated - if the URI is being assigned to an instance of a class that is specified by "owl:oneOf" (SADL "must be one of"), the input string may be either the entire URI or the case-sensitive local fragment. Local fragments are replaced with the matching full URI by the ingestion service.
- BaseURI - otherwise, if the "Base URI" field is specified, it is prepended along with a "#".
- Default - otherwise, "http://semtk.research.ge.com/generated#" is prepended.
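A minimal sketch of this precedence is shown below. The names `UriPrefixSketch`, `resolve`, and `enumByLocalName` are hypothetical. Passing a non-null map stands in for "the class is enumerated". The text above does not say what happens when an enumerated value matches nothing, so rejecting it is purely a choice made for this sketch.

```java
import java.util.Map;

// Hypothetical illustration only, not the SemTK implementation.
class UriPrefixSketch {
    static final String DEFAULT_PREFIX = "http://semtk.research.ge.com/generated#";

    // enumByLocalName: local name -> full URI for an enumerated class, or null.
    // baseUri: value of the "Base URI" field, or null/empty when not specified.
    static String resolve(String value, Map<String, String> enumByLocalName, String baseUri) {
        if (value.contains("#") || value.contains("://")) {
            return value; // already carries a prefix
        }
        if (enumByLocalName != null) {
            String full = enumByLocalName.get(value); // case-sensitive lookup
            if (full == null) {
                // Assumption: unmatched enumerated values are rejected.
                throw new IllegalArgumentException("not an enumerated value: " + value);
            }
            return full;
        }
        if (baseUri != null && !baseUri.isEmpty()) {
            return baseUri + "#" + value;
        }
        return DEFAULT_PREFIX + value;
    }
}
```

For example, `resolve("item1", null, "http://example.org/base")` returns `http://example.org/base#item1`, and the same call with a null base URI falls back to the default prefix.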
During ingestion, all xsd:date values are translated into a SPARQL INSERT query using the [ISO_LOCAL_DATE](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE) format: "2011-12-03".
The following formatters are tried in order, using Java LocalDate.parse(date, formatter):
- "MM/dd/yyyy"
- "MM-dd-yyyy"
- "yyyy-MM-dd"
- "dd-MMM-yyyy", case insensitive (e.g. 12-Jun-2008 or 12-JUN-2008)
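The try-in-order approach can be sketched with `java.time` as below. The class and method names are hypothetical. Building each formatter with `parseCaseInsensitive()` and `Locale.ENGLISH` is one way to accept both "Jun" and "JUN", and emitting the result as an ISO date (yyyy-MM-dd) is an assumption about the output step.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not the SemTK implementation.
class DateParseSketch {
    private static DateTimeFormatter fmt(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()        // lets "JUN" match "Jun"
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);  // fixed locale for month names
    }

    // Same order as the list above.
    static final List<DateTimeFormatter> FORMATS = List.of(
            fmt("MM/dd/yyyy"), fmt("MM-dd-yyyy"), fmt("yyyy-MM-dd"), fmt("dd-MMM-yyyy"));

    static String toIsoDate(String input) {
        for (DateTimeFormatter f : FORMATS) {
            try {
                return LocalDate.parse(input, f).format(DateTimeFormatter.ISO_LOCAL_DATE);
            } catch (DateTimeParseException e) {
                // fall through to the next formatter
            }
        }
        throw new IllegalArgumentException("unrecognized date: " + input);
    }
}
```

Because the formatters are tried in order, an ambiguous input such as "01-02-2020" is read by the first matching pattern, "MM-dd-yyyy", as January 2.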
During ingestion, all xsd:dateTime values are translated into a SPARQL INSERT query using the [ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME) format: "2011-12-03T10:15:30+01:00".
The following formats, which carry an offset or time zone, are tried first:
- "yyyy-MM-ddTHH:mm:ss+01:00" - [ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME), e.g. "2020-03-23T23:59:59-04:00"
- "EEE MMM dd HH:mm:ss zzz yyyy" - produced by SADL, e.g. "Wed Mar 22 20:00:00 EST 2017"
The following formatters are then tried in order, using Java LocalDateTime.parse(date, formatter):
- "MM/dd/yyyy HH:mm:ss"
- "MM-dd-yyyy HH:mm:ss"
- "yyyy/MM/dd HH:mm:ss"
- "yyyy-MM-dd HH:mm:ss"
- "dd-MMM-yyyy HH:mm:ss", case insensitive (e.g. 12-Jun-2008 05:00:00 or 12-JUN-2008 05:00:00)
Finally, all xsd:date formats are tried. If one parses successfully, the value is inserted as a dateTime with no time zone and with hours, minutes, and seconds set to 0.
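The fallback chain for dateTime values can be sketched as below. All names are hypothetical. Using `OffsetDateTime` for the offset form is an assumption, and the SADL-style "EEE MMM dd HH:mm:ss zzz yyyy" pattern is left out of this sketch because zone-abbreviation parsing depends on the JDK's zone data.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not the SemTK implementation.
class DateTimeParseSketch {
    private static DateTimeFormatter fmt(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);
    }

    static final List<DateTimeFormatter> DATE_TIME_FORMATS = List.of(
            fmt("MM/dd/yyyy HH:mm:ss"), fmt("MM-dd-yyyy HH:mm:ss"),
            fmt("yyyy/MM/dd HH:mm:ss"), fmt("yyyy-MM-dd HH:mm:ss"),
            fmt("dd-MMM-yyyy HH:mm:ss"));

    static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            fmt("MM/dd/yyyy"), fmt("MM-dd-yyyy"), fmt("yyyy-MM-dd"), fmt("dd-MMM-yyyy"));

    static String toIsoDateTime(String input) {
        // 1. Offset form, kept with its offset.
        try {
            return OffsetDateTime.parse(input, DateTimeFormatter.ISO_OFFSET_DATE_TIME)
                    .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
        } catch (DateTimeParseException e) {
            // try the local date-time patterns next
        }
        // 2. Local date-time patterns, in order.
        for (DateTimeFormatter f : DATE_TIME_FORMATS) {
            try {
                return LocalDateTime.parse(input, f).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // next pattern
            }
        }
        // 3. Date-only patterns: midnight, no time zone.
        for (DateTimeFormatter f : DATE_FORMATS) {
            try {
                return LocalDate.parse(input, f).atStartOfDay()
                        .format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // next pattern
            }
        }
        throw new IllegalArgumentException("unrecognized dateTime: " + input);
    }
}
```

In this sketch a date-only input such as "06/12/2008" comes back as "2008-06-12T00:00:00", matching the rule that hours, minutes, and seconds are set to 0.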
boolean:
- Java Boolean.parseBoolean
decimal, double:
- Java Double.parseDouble
duration:
- Java Duration.parse
float:
- Java Float.parseFloat
int, integer, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger:
- Java Integer.parseInt
long:
- Java Long.parseLong
unsignedByte:
- Java Byte.parseByte
unsignedInt:
time:
- Java Time.parse
gYearMonth:
- "YYYY-MM"
gYear:
- "YYYY"
gMonthDay:
- "MM-dd"
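Type checking with the standard Java parse methods can be sketched as a single dispatch on the XSD type name. The class and method names are hypothetical, only a subset of the types listed above is covered, and `long` is checked with `Long.parseLong` in this sketch. Note that `Boolean.parseBoolean` never throws: it returns false for any string other than "true" (ignoring case), so it cannot reject a value on its own.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

// Hypothetical illustration only, not the SemTK implementation.
class TypeCheckSketch {
    // Returns true when the value parses as the given XSD type.
    static boolean parses(String xsdType, String value) {
        try {
            switch (xsdType) {
                case "boolean":
                    Boolean.parseBoolean(value); // never throws
                    break;
                case "decimal":
                case "double":
                    Double.parseDouble(value);
                    break;
                case "duration":
                    Duration.parse(value);       // ISO-8601 form such as PT15M
                    break;
                case "float":
                    Float.parseFloat(value);
                    break;
                case "int":
                case "integer":
                case "negativeInteger":
                case "nonNegativeInteger":
                case "positiveInteger":
                case "nonPositiveInteger":
                    Integer.parseInt(value);
                    break;
                case "long":
                    Long.parseLong(value);
                    break;
                case "unsignedByte":
                    Byte.parseByte(value);
                    break;
                default:
                    throw new IllegalArgumentException("type not covered by this sketch: " + xsdType);
            }
            return true;
        } catch (NumberFormatException | DateTimeParseException e) {
            return false; // the parse method rejected the value
        }
    }
}
```

A consequence visible in the sketch is that range limits come from the Java type rather than from XSD: `Integer.parseInt` rejects values beyond 32 bits even for `integer`, which XSD defines as unbounded.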
During ingestion, SemTK creates literals following RDF 1.1:
- strings are quoted and untyped: "example"
- numeric values are untyped and unquoted: 42
- dates and times are quoted and typed: "2012-02-02T02:00:00"^^XMLSchema:dateTime
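The three literal shapes can be sketched as simple string builders. The names are hypothetical, and the escaping of backslashes and double quotes in `stringLiteral` is an assumption about what a SPARQL string needs, not a description of SemTK's exact escaping.

```java
// Hypothetical illustration only, not the SemTK implementation.
class LiteralSketch {
    // Quoted, untyped string literal.
    static String stringLiteral(String s) {
        // Assumption: backslashes and double quotes are backslash-escaped.
        String escaped = s.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Unquoted, untyped numeric literal: the validated text is used as-is.
    static String numericLiteral(String validatedNumber) {
        return validatedNumber;
    }

    // Quoted literal with an XMLSchema datatype suffix.
    static String typedLiteral(String lexicalForm, String xsdType) {
        return "\"" + lexicalForm + "\"^^XMLSchema:" + xsdType;
    }
}
```

For instance, `typedLiteral("2012-02-02T02:00:00", "dateTime")` produces the dateTime form shown in the last bullet.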