Components

IS4 edited this page Jul 18, 2023 · 1 revision

Configuration

The application understands various configurable objects, such as analyzers, file formats, and hash algorithms. These are stored in component collections, and can be included in or excluded from them via the -i and -x options.

When present in a collection, a component receives a compound identifier, formed from the name of the collection and the base name of the component, joined using :. For example, components in the analyzer collection are identified using the analyzer: prefix, and can be collectively removed using -x analyzer:*.
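The naming scheme can be sketched in Python as follows. This is only an illustration, not the application's own code: compound_id and matches are hypothetical helpers, and treating * as a shell-style wildcard (via fnmatch) is an assumption about how patterns are matched.

```python
from fnmatch import fnmatchcase

def compound_id(collection: str, name: str) -> str:
    """Join a collection name and a component base name with ':'."""
    return f"{collection}:{name}"

def matches(pattern: str, identifier: str) -> bool:
    """Assumed shell-style matching: '*' stands for any run of characters."""
    return fnmatchcase(identifier, pattern)

ident = compound_id("analyzer", "xml-reader")
print(ident)                          # analyzer:xml-reader
print(matches("analyzer:*", ident))   # True
print(matches("data-hash:*", ident))  # False
```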

A single component may be included in multiple collections at once. For example, data hash algorithms may be used by the data analyzer for hashing arbitrary bytes, using the data-hash: prefix, or by the image analyzer for analyzing raw pixel data, using the pixel-hash: prefix. With the proper pattern, a particular hash algorithm can be configured to be present in only one collection, in both, or in neither.
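A rough model of per-collection configuration is sketched below. It rests on several assumptions that this page does not confirm: components are included by default, the -i and -x options are applied left to right with the last matching pattern deciding, and * is a shell-style wildcard. The hash name sha1 and the helper is_included are hypothetical, used only for illustration.

```python
from fnmatch import fnmatchcase

def is_included(identifier, rules, default=True):
    """Apply ('i' | 'x', pattern) rules in order; the last match wins (assumed)."""
    included = default
    for action, pattern in rules:
        if fnmatchcase(identifier, pattern):
            included = (action == "i")
    return included

# Hypothetical hash named 'sha1', present in both collections.
ids = ["data-hash:sha1", "pixel-hash:sha1"]

# Modelled on: -x pixel-hash:sha1  (keep it for bytes, drop it for pixels)
rules = [("x", "pixel-hash:sha1")]
print({i: is_included(i, rules) for i in ids})
# {'data-hash:sha1': True, 'pixel-hash:sha1': False}

# Modelled on: -x *-hash:sha1  (drop it from every collection ending in -hash)
rules = [("x", "*-hash:sha1")]
print({i: is_included(i, rules) for i in ids})
# {'data-hash:sha1': False, 'pixel-hash:sha1': False}
```

Under the same assumptions, the rule list [("x", "*-format:*"), ("i", "*-format:html")] leaves only the html formats included.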

Some components may have configurable properties that can be browsed by running the application in the list mode. The identifier of a component's property is formed by joining the compound identifier of the component and the name of the property by :, and can be used as an option by prefixing it with --. For example, the estimate of the base size of a single triple in bytes in a triple store can be configured using --analyzer:stream-factory:triple-size-estimate as an option.
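The way a property option is assembled can be illustrated with a short Python sketch. The helpers property_option and split_property_option are hypothetical, and the reverse split assumes that property names never contain a : character.

```python
def property_option(component_id: str, prop: str) -> str:
    """Build the command-line option for a component property."""
    return f"--{component_id}:{prop}"

def split_property_option(option: str) -> tuple:
    """Split an option back into (component identifier, property name).

    Assumes the property name itself contains no ':'.
    """
    if not option.startswith("--"):
        raise ValueError("expected an option starting with '--'")
    component_id, prop = option[2:].rsplit(":", 1)
    return component_id, prop

opt = property_option("analyzer:stream-factory", "triple-size-estimate")
print(opt)
# --analyzer:stream-factory:triple-size-estimate
print(split_property_option(opt))
# ('analyzer:stream-factory', 'triple-size-estimate')
```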

Collections

The component collections are the primary way to group individual components in the application, serving varied purposes.

analyzer

The analyzers are the components whose purpose is to describe individual objects they encounter using RDF. Each analyzer has a primary type of entities it supports, which also forms its name.

data-format

The data formats are components that are a part of the data analyzer, used to determine the format of arbitrary pieces of data, and parse them.

xml-format

Similar to data-format, but only for formats based on XML, used by the XML analyzer.

container-format

Used for formats which are determined based on a certain system of nodes, such as package formats that arise from a particular file system hierarchy.

data-hash

The hash algorithms (cryptographic and non-cryptographic) that work purely with data, as a sequence of bytes.

file-hash

The hash algorithms that also include metadata about a file when producing its hash.

pixel-hash

A mirror of data-hash, used by the image analyzer to hash the raw pixel data of an image.

image-hash

Used by the image analyzer, these algorithms produce hashes from bitmap graphics in a way that treats the image as a whole, such as dHash.

rdf-handler

These components are used when storing RDF data while --ugly and --buffered are both unset. In that case, a custom handler for the output format is used, if one exists. If no handler exists, rdf-formatter is used instead.

rdf-formatter

These components are used when RDF data is to be stored as plain triples streamed sequentially (--buffered is unset and either --ugly is specified, or no rdf-handler matching the output format is found).

rdf-writer

These components are used when RDF data is to be stored as a graph, saved at once (with --buffered).
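The choice between rdf-handler, rdf-formatter, and rdf-writer described above can be summarized as a small decision function. This is a sketch of the rules as stated on this page, not the application's implementation; select_rdf_collection and its parameters are hypothetical names.

```python
def select_rdf_collection(buffered: bool, ugly: bool, handler_available: bool) -> str:
    """Return the collection that supplies the RDF output component."""
    if buffered:
        # The graph is saved at once.
        return "rdf-writer"
    if not ugly and handler_available:
        # A custom handler exists for the output format.
        return "rdf-handler"
    # Plain triples streamed sequentially.
    return "rdf-formatter"

print(select_rdf_collection(buffered=False, ugly=False, handler_available=True))
# rdf-handler
print(select_rdf_collection(buffered=False, ugly=False, handler_available=False))
# rdf-formatter
print(select_rdf_collection(buffered=False, ugly=True, handler_available=True))
# rdf-formatter
print(select_rdf_collection(buffered=True, ugly=False, handler_available=True))
# rdf-writer
```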

sparql-writer

These components are used in the search mode, when the SPARQL Results output is to be saved.

Individual components

The purpose of individual components as displayed by list is generally self-descriptive. Since the analyzers form the basis of the application, they are listed here explicitly:

analyzer:object
This analyzer is included as a fallback if an arbitrary object is encountered. It does not do anything by default, but if `accept-everything` is enabled, it at least obtains the label of encountered entities.
analyzer:file-node-info
This is the file analyzer, accepting arbitrary files and directories. This component holds the collection of file hash algorithms, prefixed file-hash:.
analyzer:stream-factory
This is the data analyzer, accepting any source of data or sequence of bytes. This component holds the collection of binary formats, prefixed data-format:, and data hash algorithms, prefixed data-hash:.
analyzer:data-object
This is the data object analyzer. Excluding this component will disable describing data, but individual media objects will still be analyzed.
analyzer:format-object
This is the format object analyzer. Excluding this component will prevent any format analysis.
analyzer:content-type
This is the analyzer of individual encountered media types (MIME types). Excluding this will prevent media types from being identified in the output as URIs.
analyzer:path-object
This is the analyzer of individual paths. Excluding this will stop paths from being described.
analyzer:extension-object
This is the analyzer of individual filename extensions. Excluding this will stop extensions from being identified.
analyzer:xml-reader
The analyzer of XML documents. This component holds the collection of XML formats, prefixed xml-format:.
analyzer:x509-certificate
The analyzer of X.509 certificates.
analyzer:assembly
The analyzer of .NET assemblies.
analyzer:read-only-list.metadata-extractor.directory
The analyzer of metadata directories, produced by MetadataExtractor from image files.
analyzer:metadata-extractor.exif-directory-base
The analyzer of EXIF metadata in MetadataExtractor directories.
analyzer:metadata-extractor.xmp-directory
The analyzer of XMP metadata in MetadataExtractor directories.
analyzer:tag-lib-sharp.file
The analyzer of TagLibSharp files, as containers of tags.
analyzer:tag-lib-sharp.xmp-tag
The analyzer of XMP TagLibSharp tags.
analyzer:svg.svg-document
The analyzer of SVG documents from SVG.NET.
analyzer:swf-dot-net.io.swf
The analyzer of Shockwave Flash animations from SwfDotNet.IO.
analyzer:npoi.poi-document
The analyzer of OLE-based documents from NPOI.
analyzer:npoi.ooxml.poixml-document
The analyzer of OOXML-based documents from NPOI.
analyzer:pdf-sharp-core.pdf-document
The analyzer of PDF documents from PdfSharpCore.
analyzer:html-agility-pack.html-document
The analyzer of HTML documents from the Html Agility Pack.
analyzer:archive-file
The analyzer of whole archives from SharpCompress.
analyzer:archive-reader
The analyzer of archives from SharpCompress, read sequentially.
analyzer:cabinet-archive
The analyzer of Cabinet archives. The analyzed archive is wrapped and handed to analyzer:archive-reader.
analyzer:disc-utils.core.file-system
The analyzer of file systems from DiscUtils.
analyzer:module
The analyzer of Windows executable or resource modules.
analyzer:win-version-info
The analyzer of Windows version resources, in the VS_VERSIONINFO format.
analyzer:dos-module
The analyzer of DOS executables, using Aeon to execute them.
analyzer:delphi-object
The analyzer of Delphi DFM objects, stored as resources in executables.
analyzer:open-mcdf.compound-file
The analyzer of OLE compound files from OpenMcdf.
analyzer:package-description
The analyzer of packages using the FILE_ID.DIZ file for description.
analyzer:rdf-xml-analyzer.document
The analyzer of RDF/XML documents.
analyzer:async-enumerable.toimik.warc-protocol.record
The analyzer of WARC files from WarcProtocol.

Examples

-x *-format:* -i *-format:html
Excludes all file formats from the list of components, but keeps the HTML format.
-x * -i analyzer:stream-factory -i analyzer:data-object
Only allows for the analysis of actual data, not files.
--analyzer:stream-factory:max-depth-for-formats ""
Sets this property value to null, disabling depth checks.