Data format for various linguistically-annotated corpora

Metadata

  • Status: Complete
  • Type: Generic
  • Work Package: WP3
  • Research Coordinators: (various)
  • Coordinators for CLARIAH: Maarten van Gompel
  • Participating Institutes: (various)
  • End-users: Corpus builders and their users
  • Developers: Maarten van Gompel, Ko van der Sloot
  • Interest Groups: Annotation, Text
  • Task IDs: T108 (FoLiA)

Description

This use case groups several projects that aimed to deliver a corpus with various kinds of linguistic enrichment, produced either automatically by NLP tools or manually.

What is the research about?

This use case is an abstraction over several research projects that all needed a way to encode their corpora:

  • SoNaR500 - A 500 million word reference corpus for contemporary written Dutch, which is delivered in FoLiA format. Other corpora in which FoLiA is used are listed below.
  • VU-DNC - a 2 million word diachronic corpus for Dutch offering both sentiment annotations and a gold standard for OCR post-correction.
  • DutchSemCor - A lexical semantic sense annotated corpus (superset of SoNaR500)
  • Basilex - A corpus consisting of Dutch texts young children would typically be exposed to
  • Basiscript - A corpus of contemporary Dutch texts written by primary school children
  • Nederlab - Established a search environment for a large number of Dutch text collections, including historical ones. However, the project does not disseminate the corpus that it compiled, due to licensing restrictions.
  • Political Mashup - Parliamentary corpus

What problem was hindering the research?

These projects needed to encode texts with one or more types of linguistic annotation. Across the projects, the annotation types to be encoded were quite diverse: Part-of-Speech tags, lemmas, named entities, dependency relations, and semantic roles, among others. To avoid encoding each type with an ad-hoc scheme, a more general solution was proposed and adopted by these projects.

In addition to encoding linguistic annotation, some projects also needed a format that can encode document structure (paragraphs, sentences, lists, etc.) and even text markup.

What is needed to do the research?

Data

  • A clear data format specification for linguistic annotation. FoLiA was adopted as the solution by these projects. (Political Mashup is a notable exception: it technically did not adopt FoLiA but merely embedded parts of it in its own format.) FoLiA provides an integrated XML-based solution. It has its own document-based generic paradigm and strictly defines various linguistic and structural annotation types, but leaves the definition of the actual linguistic (or other) vocabulary up to the user. FoLiA is intended as both a corpus storage format and a language-resource interchange format between tools and services. It shares certain similarities with initiatives such as TEI, TCF, TiGeR XML, NAF, and various others.
  • Formal schemas (RelaxNG)
  • Independent vocabularies offered through FoLiA Set Definitions (nowadays SKOS-based).
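To make the idea of annotation inlined in document structure more concrete, here is a minimal sketch in Python using only the standard library. The XML fragment is a simplified, hand-written illustration in the FoLiA namespace: it omits the metadata and annotation declarations a real document carries, so it is not expected to validate against the RelaxNG schema. The tag classes (`LID`, `N`, `WW`) and the helper name `read_tokens` are assumptions made for this example; in FoLiA the actual class vocabulary is defined by the user's set definitions, not by the format.

```python
import xml.etree.ElementTree as ET

# FoLiA elements live in this XML namespace.
NS = {"f": "http://ilk.uvt.nl/folia"}
# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, illustrative fragment (NOT schema-valid: the metadata and
# annotation declarations are left out, and the class values are made up).
SAMPLE = """
<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <s xml:id="example.p.1.s.1">
        <w xml:id="example.p.1.s.1.w.1">
          <t>De</t><pos class="LID"/><lemma class="de"/>
        </w>
        <w xml:id="example.p.1.s.1.w.2">
          <t>kat</t><pos class="N"/><lemma class="kat"/>
        </w>
        <w xml:id="example.p.1.s.1.w.3">
          <t>slaapt</t><pos class="WW"/><lemma class="slapen"/>
        </w>
      </s>
    </p>
  </text>
</FoLiA>
"""


def read_tokens(xml_text):
    """Hypothetical helper: collect id, text, PoS class and lemma per word."""
    root = ET.fromstring(xml_text)
    tokens = []
    # Structure (paragraph > sentence > word) and annotation (pos, lemma)
    # sit in the same tree, so one traversal yields both.
    for sentence in root.iterfind(".//f:s", NS):
        for word in sentence.iterfind("f:w", NS):
            pos = word.find("f:pos", NS)
            lemma = word.find("f:lemma", NS)
            tokens.append({
                "id": word.get(XML_ID),
                "text": word.findtext("f:t", default="", namespaces=NS),
                "pos": pos.get("class") if pos is not None else None,
                "lemma": lemma.get("class") if lemma is not None else None,
            })
    return tokens


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token["id"], token["text"], token["pos"], token["lemma"])
```

For real corpora, the dedicated libraries listed under "What software and services are involved?" are the better choice, since they handle the parts of the format this sketch ignores.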

Tools

  • Validation tools
  • Programming libraries to work with the format
  • NLP tools that can handle the format
  • Tools for visualisation

What software and services are involved?

  • FoLiA - Format for Linguistic Annotation; data format
  • foliapy - Python library for working with FoLiA (previously part of pynlpl)
  • libfolia - C++ library for working with FoLiA
  • foliatools - Command-line tools for working with FoLiA (contains validators and converters etc)
  • foliautils - Another set of command-line tools for working with FoLiA (contains validators and converters etc)
  • Ucto - A tokeniser with built-in FoLiA support that has been used in several of these projects.
  • Frog - An NLP tool for Dutch with built-in FoLiA support. It was used in several of these projects and has probably been a factor in their choice for FoLiA.

References

Related use cases that use FoLiA:

Publications: