
Ingestion Type Handling

Paul Cuddihy GE Research edited this page Feb 6, 2022 · 12 revisions

The ingestion process reads strings from a data source such as CSV or ODBC, then maps, combines and transforms them into values for a SPARQL INSERT statement. Two types of checks are done at this stage:

  1. type checking
  2. checking that values needed for URILookup are not empty

This page explains how the final values are converted to the proper type as specified by the model, and how type-checking is performed.

Data conversion is focused on URIs and W3C xsd primitive data types. The following sections describe how each is parsed and ingested.

Strings

Strings are handled in conformance with the CSV specification, with an attempt to handle newlines embedded in strings in a way that is compatible with Excel. (Excel has been observed to handle embedded newlines in a hard-to-specify way when the CRLF characters are not what it expects.)

Below is a sample CSV which demonstrates many basic principles of escaping strings in CSV:

  • line 1 is the column headers
  • line 2 column 1: shows quoted string with escaped quotes "" and comma inside the quotes
  • line 2 column 2: shows one embedded line return, plus the escape sequences \n and \\n
  • line 3: shows an embedded tab and the escape sequence \t, and the use of unquoted strings which contain neither commas nor quotes

str1,str2
"notepad what is ""this,"" here","notepad line one
backslash-n-follows\nrest2 double-backslash-n-follows\\nrest3"
tab	tab,back-t\tback-t

Inside the triplestore, additional backslashes are added to differentiate between special characters and their escape sequences. Querying them back through SemTK will produce a CSV that is equivalent to the one ingested.
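The extra backslashes mentioned above can be pictured with a small sketch. It assumes the standard SPARQL string-literal escapes (\\, \", \n, \r, \t); the class and method names are hypothetical and this is not SemTK's actual code.

```java
/** Hypothetical helper illustrating SPARQL string escaping; not SemTK's implementation. */
final class SparqlStringEscaper {

    /** Escapes a raw string so it can sit inside a double-quoted SPARQL literal. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break; // one backslash becomes two
                case '"':  out.append("\\\""); break; // quote becomes backslash + quote
                case '\n': out.append("\\n");  break; // real newline becomes backslash + n
                case '\r': out.append("\\r");  break; // real carriage return becomes backslash + r
                case '\t': out.append("\\t");  break; // real tab becomes backslash + t
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A real tab character and the two-character sequence backslash + t
        // end up as different escaped forms, so they stay distinguishable.
        System.out.println(escape("tab\ttab"));
        System.out.println(escape("back-t\\tback-t"));
    }
}
```

Because a real tab and the typed sequence \t map to different escaped forms, the original cell contents can be recovered unambiguously when queried back.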

URIs

Valid URIs

To be ingested as a URI, a string must pass these validations:

  • the Java URL(uri) constructor must succeed without throwing an exception

  • the first character of the local fragment must match [a-zA-Z0-9]
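A minimal sketch of the two checks above. The class name is hypothetical, and the sketch assumes the "local fragment" is the text after the last "#", or after the last "/" when there is no "#"; the page does not define it precisely.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical illustration of the two URI checks; not SemTK's implementation. */
final class UriCheck {

    static boolean isValidUri(String uri) {
        // Check 1: the URL constructor must accept the string.
        // (This constructor is deprecated in recent JDKs but still available.)
        try {
            new URL(uri);
        } catch (MalformedURLException e) {
            return false;
        }

        // Check 2: first character of the local fragment must match [a-zA-Z0-9].
        // Assumption: the fragment starts after the last '#', else after the last '/'.
        int cut = uri.lastIndexOf('#');
        if (cut < 0) {
            cut = uri.lastIndexOf('/');
        }
        if (cut < 0 || cut == uri.length() - 1) {
            return false; // no local fragment to inspect
        }
        char first = uri.charAt(cut + 1);
        return (first >= 'a' && first <= 'z')
            || (first >= 'A' && first <= 'Z')
            || (first >= '0' && first <= '9');
    }

    public static void main(String[] args) {
        System.out.println(isValidUri("http://example.org/data#item1"));  // true
        System.out.println(isValidUri("http://example.org/data#_item1")); // false
    }
}
```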

Prefixing

First note that an ingestion template may use a text field ending in "#" to add a prefix to ingested data.

After the entire URI value is built, if it contains no prefix (i.e. it contains neither "#" nor "://"), a prefix is added according to the first of these rules that applies:

  1. Enumerated - if a URI is being assigned to an instance of a Class that is specified by "owl:oneOf" (SADL "must be one of"), then input strings may be either the entire URI or the case-sensitive local fragment. Local fragments will be changed to the matching full URI by the ingestion service.

  2. BaseURI - otherwise if the "Base URI" field is specified, it is prepended along with a "#"

  3. Default - otherwise "http://semtk.research.ge.com/generated#" will be prepended.
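The three rules can be sketched as a single function. The class, method, and parameter names are hypothetical; how the service obtains the enumerated URIs, and what it does when an enumerated class receives a non-matching value, are not described here, so this sketch simply falls through to the next rule.

```java
import java.util.List;

/** Hypothetical illustration of the prefixing rules; not SemTK's implementation. */
final class UriPrefixer {

    static final String DEFAULT_PREFIX = "http://semtk.research.ge.com/generated#";

    /**
     * @param value          the fully built URI value from the ingestion template
     * @param enumeratedUris full URIs of the owl:oneOf members, or null if the class is not enumerated
     * @param baseUri        the "Base URI" field, or null/empty if not specified
     */
    static String applyPrefix(String value, List<String> enumeratedUris, String baseUri) {
        // Already prefixed: leave untouched.
        if (value.contains("#") || value.contains("://")) {
            return value;
        }
        // 1. Enumerated: case-sensitive match on the part after '#'.
        if (enumeratedUris != null) {
            for (String full : enumeratedUris) {
                if (full.endsWith("#" + value)) {
                    return full;
                }
            }
            // Assumption: no match falls through; the real service may reject the value instead.
        }
        // 2. BaseURI: prepend it along with a '#'.
        if (baseUri != null && !baseUri.isEmpty()) {
            return baseUri + "#" + value;
        }
        // 3. Default prefix.
        return DEFAULT_PREFIX + value;
    }

    public static void main(String[] args) {
        List<String> colors = List.of("http://example.org/colors#Red", "http://example.org/colors#Blue");
        System.out.println(applyPrefix("Red", colors, null));
        System.out.println(applyPrefix("item1", null, "http://example.org/base"));
        System.out.println(applyPrefix("item1", null, null));
    }
}
```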

xsd:date

During ingestion, all xsd:date values are written into the SPARQL INSERT query using the [ISO_LOCAL_DATE](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_LOCAL_DATE) format: "2011-12-03".

The following formatters are tried in order, using Java LocalDate.parse(date, formatter):

  • "MM/dd/yyyy"
  • "MM-dd-yyyy"
  • "yyyy-MM-dd"
  • "dd-MMM-yyyy", case insensitive (e.g. 12-Jun-2008 or 12-JUN-2008)
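The try-in-order approach can be sketched as follows. The class and method names are hypothetical, the English locale for month abbreviations is an assumption, and the output is assumed to be the plain yyyy-MM-dd form (ISO_LOCAL_DATE).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of trying date formatters in order; not SemTK's implementation. */
final class XsdDateParser {

    // Case-insensitive parsing only matters for the MMM pattern; it is harmless for the numeric ones.
    private static DateTimeFormatter fmt(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH); // assumption: English month abbreviations
    }

    private static final List<DateTimeFormatter> FORMATTERS = List.of(
            fmt("MM/dd/yyyy"),
            fmt("MM-dd-yyyy"),
            fmt("yyyy-MM-dd"),
            fmt("dd-MMM-yyyy"));

    /** Returns the date in ISO_LOCAL_DATE form, or throws if no formatter matches. */
    static String toXsdDate(String input) {
        for (DateTimeFormatter f : FORMATTERS) {
            try {
                return LocalDate.parse(input, f).format(DateTimeFormatter.ISO_LOCAL_DATE);
            } catch (DateTimeParseException e) {
                // fall through to the next formatter
            }
        }
        throw new IllegalArgumentException("Unrecognized date: " + input);
    }

    public static void main(String[] args) {
        System.out.println(toXsdDate("02/06/2022"));  // prints 2022-02-06
        System.out.println(toXsdDate("12-JUN-2008")); // prints 2008-06-12
    }
}
```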

xsd:dateTime

During ingestion, all xsd:dateTime values are translated into a SPARQL INSERT query using the [ISO_OFFSET_DATE_TIME](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_OFFSET_DATE_TIME) format: "2011-12-03T10:15:30+01:00".

To ingest dateTimes with timezone

To ingest dateTimes without timezone

The following formatters are then tried in order, using Java LocalDateTime.parse(date, formatter):

  • "MM/dd/yyyy HH:mm:ss"
  • "MM-dd-yyyy HH:mm:ss"
  • "yyyy/MM/dd HH:mm:ss"
  • "yyyy-MM-dd HH:mm:ss"
  • "dd-MMM-yyyy HH:mm:ss", case insensitive (e.g. 12-Jun-2008 05:00:00 or 12-JUN-2008 05:00:00)

To ingest plain date as a dateTime

All xsd:date formats are tried next. If parsed successfully, the date is inserted as a dateTime with no timezone and hours, minutes, seconds set to 0.
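A sketch of the no-timezone path and the plain-date fallback described above. The class and method names are hypothetical; timezone handling is not covered, and the output is assumed to be the offset-free form "yyyy-MM-ddTHH:mm:ss".

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of dateTime parsing without timezone; not SemTK's implementation. */
final class XsdDateTimeParser {

    private static DateTimeFormatter fmt(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()           // needed for the MMM patterns
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);     // assumption: English month abbreviations
    }

    private static final List<DateTimeFormatter> DATE_TIME_FORMATTERS = List.of(
            fmt("MM/dd/yyyy HH:mm:ss"),
            fmt("MM-dd-yyyy HH:mm:ss"),
            fmt("yyyy/MM/dd HH:mm:ss"),
            fmt("yyyy-MM-dd HH:mm:ss"),
            fmt("dd-MMM-yyyy HH:mm:ss"));

    private static final List<DateTimeFormatter> DATE_FORMATTERS = List.of(
            fmt("MM/dd/yyyy"), fmt("MM-dd-yyyy"), fmt("yyyy-MM-dd"), fmt("dd-MMM-yyyy"));

    static String toXsdDateTime(String input) {
        // First try the dateTime formatters in order.
        for (DateTimeFormatter f : DATE_TIME_FORMATTERS) {
            try {
                return LocalDateTime.parse(input, f).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next formatter
            }
        }
        // Then fall back to plain dates, with hours, minutes, seconds set to 0.
        for (DateTimeFormatter f : DATE_FORMATTERS) {
            try {
                LocalDateTime midnight = LocalDate.parse(input, f).atStartOfDay();
                return midnight.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException e) {
                // try the next formatter
            }
        }
        throw new IllegalArgumentException("Unrecognized dateTime: " + input);
    }

    public static void main(String[] args) {
        System.out.println(toXsdDateTime("12-Jun-2008 05:00:00")); // prints 2008-06-12T05:00:00
        System.out.println(toXsdDateTime("06/12/2008"));           // prints 2008-06-12T00:00:00
    }
}
```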

Non-date primitive types

boolean:

decimal, double:

duration:

float:

int, integer, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger:

long:

unsignedByte:

unsignedInt:

Simpler date and time related types

time:

  • Java Time.parse

gYearMonth:

  • "YYYY-MM"

gYear:

  • "YYYY"

gMonthDay:

  • "MM-dd"
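For the three "g" types, a rough Java equivalent could use the java.time value classes. This is a sketch under assumptions: the class name is hypothetical, and it is a guess that the listed patterns correspond to the default Year and YearMonth parsers plus an explicit "MM-dd" formatter. The time type is left out because the exact parse call is not spelled out above.

```java
import java.time.MonthDay;
import java.time.Year;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

/** Hypothetical illustration of parsing gYearMonth, gYear and gMonthDay; not SemTK's implementation. */
final class SimpleDateTypes {

    // MonthDay's default parser expects "--MM-dd", so an explicit pattern is used here.
    private static final DateTimeFormatter MONTH_DAY = DateTimeFormatter.ofPattern("MM-dd");

    static YearMonth parseGYearMonth(String s) {
        return YearMonth.parse(s);            // e.g. "2008-06"
    }

    static Year parseGYear(String s) {
        return Year.parse(s);                 // e.g. "2008"
    }

    static MonthDay parseGMonthDay(String s) {
        return MonthDay.parse(s, MONTH_DAY);  // e.g. "06-12"
    }

    public static void main(String[] args) {
        System.out.println(parseGYearMonth("2008-06"));
        System.out.println(parseGYear("2008"));
        System.out.println(parseGMonthDay("06-12").getDayOfMonth());
    }
}
```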

Creating Literals

During ingestion, SemTK creates literals following RDF 1.1:

  • strings are quoted and untyped: "example"
  • numeric values are untyped and unquoted: 42
  • dates and times are quoted and typed: "2012-02-02T02:00:00"^^XMLSchema:dateTime
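The three literal shapes above can be sketched as simple string builders. The class and method names are hypothetical, the string value is assumed to be already escaped, and the XMLSchema: prefix is taken from the example above and assumed to be declared in the query.

```java
/** Hypothetical illustration of the three literal shapes; not SemTK's implementation. */
final class LiteralBuilder {

    /** Strings: quoted and untyped. Assumes the value is already escaped. */
    static String stringLiteral(String escapedValue) {
        return "\"" + escapedValue + "\"";
    }

    /** Numeric values: unquoted and untyped. */
    static String numericLiteral(Number value) {
        return value.toString();
    }

    /** Dates and times: quoted and typed, e.g. xsdType = "dateTime". */
    static String typedLiteral(String lexicalValue, String xsdType) {
        return "\"" + lexicalValue + "\"^^XMLSchema:" + xsdType;
    }

    public static void main(String[] args) {
        System.out.println(stringLiteral("example"));                         // prints "example"
        System.out.println(numericLiteral(42));                               // prints 42
        System.out.println(typedLiteral("2012-02-02T02:00:00", "dateTime"));  // prints "2012-02-02T02:00:00"^^XMLSchema:dateTime
    }
}
```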