

IS4 edited this page Jul 18, 2023 · 1 revision


Various configurable objects understood by the application, such as analyzers, file formats, or hash algorithms, are stored in component collections and can be included in or excluded from them via the -i and -x options.

When present in a collection, a component receives a compound identifier, formed from the name of the collection and the base name of the component, joined using :. For example, components in the analyzer collection are identified using the analyzer: prefix, and can be collectively removed using -x analyzer:*.

A single component may be included in multiple collections at once. For example, data hash algorithms may be used by the data analyzer for hashing arbitrary bytes, using the data-hash: prefix, or by the image analyzer for hashing raw pixel data, using the pixel-hash: prefix. Using the proper pattern, a particular hash algorithm may be configured to be present in only one of the collections, in both, or in neither.
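The following Python sketch illustrates how compound identifiers and wildcard patterns could interact. It is not the application's actual implementation: it assumes that components are included by default, that -i and -x rules are applied in the order given with the last matching rule winning, and that patterns behave like simple shell-style globs. The function `is_enabled` and the algorithm name `sha1` are hypothetical and used only for illustration.

```python
from fnmatch import fnmatchcase


def compound_id(collection, name):
    """Join the collection name and the component base name with ':'."""
    return f"{collection}:{name}"


def is_enabled(identifier, rules, default=True):
    """Apply (option, pattern) rules in order; the last matching rule wins.

    ASSUMPTION: this ordering and the glob semantics are illustrative,
    not taken from the application itself.
    """
    enabled = default
    for option, pattern in rules:
        if fnmatchcase(identifier, pattern):
            enabled = (option == "-i")
    return enabled


# The same (hypothetical) algorithm seen through two collections.
in_data = compound_id("data-hash", "sha1")
in_pixel = compound_id("pixel-hash", "sha1")

# Exclude it from pixel hashing only.
rules = [("-x", "pixel-hash:sha1")]
print(in_data, is_enabled(in_data, rules))
print(in_pixel, is_enabled(in_pixel, rules))

# Exclude it from both collections with one pattern.
rules = [("-x", "*-hash:sha1")]
print(in_data, is_enabled(in_data, rules))
print(in_pixel, is_enabled(in_pixel, rules))
```

Under these assumptions, the first rule set leaves the algorithm available to the data analyzer while removing it from the image analyzer, and the second removes it from both.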

Some components may have configurable properties that can be browsed by running the application in the list mode. The identifier of a component's property is formed by joining the compound identifier of the component and the name of the property by :, and can be used as an option by prefixing it with --. For example, the estimate of the base size of a single triple in bytes in a triple store can be configured using --analyzer:stream-factory:triple-size-estimate as an option.
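As a small illustration of the naming scheme, the sketch below builds the option name for a property from its parts. The helper `property_option` is hypothetical; only the resulting option string comes from the text above.

```python
def property_option(collection, component, prop):
    """Build the command-line option name for a component property.

    The compound identifier is '<collection>:<component>'; the property
    name is appended with ':' and the whole string is prefixed with '--'.
    """
    component_id = f"{collection}:{component}"
    return f"--{component_id}:{prop}"


option = property_option("analyzer", "stream-factory", "triple-size-estimate")
print(option)  # --analyzer:stream-factory:triple-size-estimate
```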


The component collections are the primary way to group individual components in the application, serving varied purposes.


The analyzers are the components whose purpose is to describe individual objects they encounter using RDF. Each analyzer has a primary type of entities it supports, which also forms its name.


The data formats are components that are a part of the data analyzer, used to determine the format of arbitrary pieces of data, and parse them.


Similar to data-format, but only for formats based on XML, used by the XML analyzer.


Used for formats which are determined based on a certain system of nodes, such as package formats that arise from a particular file system hierarchy.


The hash algorithms (cryptographic and non-cryptographic) that work purely with data, as a sequence of bytes.


The hash algorithms that also include metadata about a file when producing its hash.


A mirror of data-hash, used by the image analyzer to hash the raw pixel data of an image.


Used by the image analyzer, these algorithms produce hashes from bitmap graphics in a way that treats the image as a whole, such as dHash.


These components are used when storing RDF data and --ugly and --buffered are both unset. In such a case, a custom handler is used for the output format, if one exists. If no handler exists, rdf-formatter is used instead.


These components are used when RDF data is to be stored as plain triples streamed sequentially (--buffered is unset and either --ugly is specified, or no rdf-handler matching the output format is found).


These components are used when RDF data is to be stored as a graph, saved at once (with --buffered).
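The three output paths described above can be summarized as a decision function. This is a sketch of the stated conditions only, not the application's code: the function name `output_path` is hypothetical, and the string "buffered-graph" is a placeholder label for the graph-based collection rather than its real name.

```python
def output_path(buffered, ugly, handler_exists):
    """Pick which kind of component stores the RDF output.

    buffered       -- whether --buffered is specified
    ugly           -- whether --ugly is specified
    handler_exists -- whether a custom handler matches the output format
    """
    if buffered:
        # The data is stored as a graph and saved at once.
        return "buffered-graph"  # placeholder label, not a real collection name
    if not ugly and handler_exists:
        # Both flags unset and a custom handler exists for the format.
        return "rdf-handler"
    # --ugly was given, or no matching handler was found:
    # plain triples are streamed sequentially.
    return "rdf-formatter"


for flags in [(False, False, True), (False, False, False),
              (False, True, True), (True, False, True)]:
    print(flags, "->", output_path(*flags))
```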


These components are used in the search mode, when the SPARQL Results output is to be saved.

Individual components

The purpose of individual components, as displayed in the list mode, is generally self-explanatory. Since the analyzers form the basis of the application, they are listed here explicitly:

This analyzer is included as a fallback if an arbitrary object is encountered. It does not do anything by default, but if `accept-everything` is enabled, it at least obtains the label of encountered entities.
This is the file analyzer, accepting arbitrary files and directories. This component holds the collection of file hash algorithms, prefixed file-hash:.
This is the data analyzer, accepting any source of data or sequence of bytes. This component holds the collection of binary formats, prefixed data-format:, and data hash algorithms, prefixed data-hash:.
This is the data object analyzer. Excluding this component will disable describing data, but individual media objects will still be analyzed.
This is the format object analyzer. Excluding this component will prevent any format analysis.
This is the analyzer of individual encountered media types (MIME types). Excluding this will prevent media types from being identified in the output as URIs.
This is the analyzer of individual paths. Excluding this will stop paths from being described.
This is the analyzer of individual filename extensions. Excluding this will stop extensions from being identified.
The analyzer of XML documents. This component holds the collection of XML formats, prefixed xml-format:.
The analyzer of X.509 certificates.
The analyzer of .NET assemblies.
The analyzer of metadata directories, produced by MetadataExtractor from image files.
The analyzer of EXIF metadata in MetadataExtractor directories.
The analyzer of XMP metadata in MetadataExtractor directories.
The analyzer of TagLibSharp files, as containers of tags.
The analyzer of XMP TagLibSharp tags.
The analyzer of SVG documents from SVG.NET.
The analyzer of Shockwave Flash animations from SwfDotNet.IO.
The analyzer of OLE-based documents from NPOI.
The analyzer of OOXML-based documents from NPOI.
The analyzer of PDF documents from PdfSharpCore.
The analyzer of HTML documents from the Html Agility Pack.
The analyzer of whole archives from SharpCompress.
The analyzer of archives from SharpCompress, read sequentially.
The analyzer of Cabinet archives. The analyzed archive is wrapped and handed to analyzer:archive-reader.
The analyzer of file systems from DiscUtils.
The analyzer of Windows executable or resource modules.
The analyzer of Windows version resources, in the VS_VERSIONINFO format.
The analyzer of DOS executables, using Aeon to execute them.
The analyzer of Delphi DFM objects, stored as resources in executables.
The analyzer of OLE compound files from OpenMcdf.
The analyzer of packages using the FILE_ID.DIZ file for description.
The analyzer of RDF/XML documents.
The analyzer of WARC files from WarcProtocol.


-x *-format:* -i *-format:html
Excludes all file formats from the list of components, but keeps the HTML format.
-x * -i analyzer:stream-factory -i analyzer:data-object
Only allows for the analysis of actual data, not files.
--analyzer:stream-factory:max-depth-for-formats ""
Sets this property value to null, disabling depth checks.