
proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.

Home Page: http://proycon.github.io/folia/

License: GNU General Public License v3.0

Python 82.89% Shell 17.11%
nlp computational-linguistics xml file-format linguistics corpus language library python folia

folia's Introduction

FoLiA: Format for Linguistic Annotation

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Documentation | Examples | Python Library | Python Library Documentation | C++ Library | Rust Library | FoLiA-Tools | FoLiA Utilities | FLAT: Web-based Annotation environment

by Maarten van Gompel, CLST/Radboud University Nijmegen & KNAW Humanities Cluster

https://proycon.github.io/folia

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA's intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory. This is always left to the developer of the language resource, and provides maximum flexibility.

XML is an inherently hierarchic format. FoLiA does justice to this by maximally utilising a hierarchic, inline setup. We inherit from the D-Coi format, which claims to be loosely based on a minimal subset of TEI. Because of the introduction of a new and much broader paradigm, FoLiA is not backwards-compatible with D-Coi, i.e. validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or verbose formats such as the D-Coi format, or plain text. Converters are provided.

The main characteristics of FoLiA are:

  • Generalised paradigm - We use a generalised paradigm, with as few ad-hoc provisions for annotation types as possible.
  • Expressivity - The format is highly expressive: annotations can be expressed in great detail and with flexibility to the user's needs, without forcing unwanted details. Moreover, FoLiA has generalised support for representing annotation alternatives, and annotation metadata such as information on annotator, time of annotation, and annotation confidence.
  • Extensible - Due to the generalised paradigm and the fact that the format does not commit to any label set, FoLiA is fairly easily extensible.
  • Formalised - The format is formalised and can be validated on both a shallow and a deep level (the latter including tagset validation). It is also easily machine-parsable, for which tools are provided.
  • Practical - FoLiA has been developed in a bottom-up fashion right alongside applications, libraries, and other toolkits and converters. Whilst the format is rich, we try to maintain it as simple and straightforward as possible, minimising the learning curve and making it easy to adopt FoLiA in practical applications.

The FoLiA format makes mixed use of inline and stand-off annotation. Inline annotation is used for annotations pertaining to single tokens, whilst stand-off annotation in separate annotation layers is adopted for annotation types that span multiple tokens. This provides FoLiA with the necessary flexibility and extensibility to deal with various kinds of annotations.

Notable features are:

  • XML-based, UTF-8 encoded
  • Language and tagset independent
  • Can encode both tokenised as well as untokenised text + partial reconstructability of untokenised form even after tokenisation.
  • Generalised paradigm, extensible and flexible
  • Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
  • Used by various software projects and corpora, especially in the Dutch-Flemish NLP community

Paradigm Schema

Resources

A more extensive list of FoLiA-capable software is maintained on the FoLiA website

Publications

See the FoLiA website for more publications and full text links.

  • Maarten van Gompel (2019). FoLiA: Format for Linguistic Annotation - Documentation. Language and Speech Technology Technical Report Series. Radboud University Nijmegen.
  • Maarten van Gompel, Ko van der Sloot, Martin Reynaert, Antal van den Bosch (2017). FoLiA in Practice: The Infrastructure of a Linguistic Annotation Format. In: CLARIN in the Low Countries. Eds: Jan Odijk and Arjan van Hessen. Pp. 71-81. PDF
  • Maarten van Gompel & Martin Reynaert (2014). FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study; Computational Linguistics in the Netherlands Journal; 3:63-81; 2013. PDF
  • Maarten van Gompel (2014). FoLiA: Format for Linguistic Annotation. Documentation. Language and Speech Technology Technical Report Series LST-14-01. Radboud University Nijmegen.

folia's People

Contributors

kosloot, larsmans, martinreynaert, proycon

folia's Issues

Relation between annotations and text classes is not explicit

FoLiA allows for multiple text content elements on the same structural element; these other text content elements must carry a different class. This indicates an alternative text for the same element and is used, for instance, for pre-OCR vs. post-OCR or pre-normalisation vs. post-normalisation distinctions (e.g. by Ticcl), or for transliterations (e.g. a text in Chinese characters as well as pinyin). The standard text class is always current (the only case in which FoLiA predefines a class).
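As a minimal sketch of what selecting a text class looks like in practice, the following uses only Python's standard library rather than a FoLiA library. The fragment is hypothetical: it assumes the text class is carried in a class attribute on <t>, that an absent class means current, and the pinyin class name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one word with two text content elements.
# Namespaces are omitted for brevity; "pinyin" is an invented class name.
FRAGMENT = """
<w xml:id="w1">
  <t>你好</t>
  <t class="pinyin">nǐ hǎo</t>
</w>
"""


def text_by_class(element, textclass="current"):
    """Return the text of the <t> child whose class matches, or None."""
    for t in element.findall("t"):
        # A <t> without an explicit class is treated as class "current".
        if t.get("class", "current") == textclass:
            return t.text
    return None


word = ET.fromstring(FRAGMENT)
print(text_by_class(word))            # text of the default "current" class
print(text_by_class(word, "pinyin"))  # text of the alternative class
```

Note that nothing in this lookup says which of the two texts an annotation on the word was derived from.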

In Nederlab, historical text is modernised: the modernised text is stored in the contemporary text class and the original historical text is in the default current class. The issue is that they want to annotate both spelling variants. Software such as Frog allows you to specify which text class to use as input, and it is viable to run Frog multiple times, with some post-processing, and add alternative annotations that are based on a different text class input. This is what I have currently implemented, and it works adequately.

The problem with this approach, however, is that the relation between annotations and text classes is not explicit. It is now merely a convention in my Nederlab pipeline that the alternatives are based on the historical text, whilst the authoritative annotations are based on the contemporary variant.

This is a limitation in FoLiA that should be thought about and remedied. In FoLiA, annotations are tied to structural elements (e.g. words/tokens) rather than to any particular text surface form (all textual forms are equally valid and describe the same thing). How do we establish a link with a text class?

For morphology/phonology and corrections this issue does not occur, as those explicitly use text content elements; but for normal token annotation and span annotation (wref) the link is not explicit, and an elegant solution needs to be devised. A symptom of this problem is also apparent in the serialisation of the wref/@t attribute, which currently always contains the current text layer, even if the span annotation was derived from another text layer.

This issue also encroaches upon another (deliberate) limitation in FoLiA: the general inability to have multiple tokenisations (though there are already some ways around this).

Ensure example.xml examples are "sensible"

Try to clean up some of example.xml so the examples are semantically sensible. It's not just used by our test suites. The whitespace occurrence for example led to some confusion.

[new annotation proposal] Sentiment Analysis

FoLiA currently has the token annotation subjectivity for limited sentiment analysis or other subjectivity annotation; it is used by the VU-DNC corpus, for instance. This, however, is not sufficient for more complex expressions of sentiment. A strong span annotation element is needed. The following proposal is inspired by NAF's opinion layer:

<s>
 <w xml:id="w1"><t>He</t></w>
 <w xml:id="w2"><t>is</t></w>
 <w xml:id="w3"><t>happy</t></w>
 <w xml:id="w4"><t>to</t></w>
 <w xml:id="w5"><t>see</t></w>
 <w xml:id="w6"><t>him</t></w>
 <w xml:id="w7"><t>.</t></w>
 <sentiments>
  <sentiment class="emotion.joy" polarity="positive" strength="moderate">
    <source>
      <wref id="w1" t="he" />
    </source>
    <target>
      <wref id="w6" t="him" />
    </target>
    <hd>
      <wref id="w3" t="happy" />
    </hd>
  </sentiment>
 </sentiments>
</s>

This predefines the following feature subsets. Whether they are actually used, and which class values they take, is defined by the set.

  • polarity
  • strength

The following span role elements are introduced and used (will be reused in another upcoming proposal as well):

  • source - The source/holder of the sentiment (optional)
  • target - The target/recipient of the sentiment (optional)
  • hd - The head contains the sentiment itself (required)
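To make the proposed structure concrete, here is a rough sketch of how a consumer might read such a sentiment layer with Python's standard library. The function name read_sentiments is hypothetical, namespaces are omitted, and the fragment is a shortened copy of the proposal's example; this is not how any FoLiA library implements it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SENTENCE = """
<s>
 <w xml:id="w1"><t>He</t></w>
 <w xml:id="w3"><t>happy</t></w>
 <w xml:id="w6"><t>him</t></w>
 <sentiments>
  <sentiment class="emotion.joy" polarity="positive" strength="moderate">
    <source><wref id="w1" t="he" /></source>
    <target><wref id="w6" t="him" /></target>
    <hd><wref id="w3" t="happy" /></hd>
  </sentiment>
 </sentiments>
</s>
"""


def read_sentiments(sentence):
    """Resolve each span role of every sentiment to the words it refers to."""
    words = {w.get(XML_ID): w.findtext("t") for w in sentence.findall("w")}
    results = []
    for sentiment in sentence.iter("sentiment"):
        entry = {
            "class": sentiment.get("class"),
            "polarity": sentiment.get("polarity"),
            "strength": sentiment.get("strength"),
        }
        for role in ("source", "target", "hd"):
            node = sentiment.find(role)
            refs = [] if node is None else node.findall("wref")
            entry[role] = [words[ref.get("id")] for ref in refs]
        if not entry["hd"]:
            # hd is the only required span role in the proposal.
            raise ValueError("sentiment without the required hd role")
        results.append(entry)
    return results


sentiments = read_sentiments(ET.fromstring(SENTENCE))
print(sentiments)
```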

Set definition frog-mbma-nl doesn't comply with Frog's actual output

In normal mode, Frog does not seem to output any classes at all for morphemes (which is fine, but not according to the set definition). In deep-morph mode, I see classes like complex or stem, which is fine too, but they don't correspond with what's in the set. We'll need to adapt the set at some point.

Revise FoLiA documentation, turn into more formal specification

The FoLiA documentation is currently a LaTeX document containing 157 pages that has grown over the years. Though it has been revised to keep up with the latest FoLiA standard, at certain places discrepancies may have arisen with the yaml specification (folia.yml) that acts as the source for the libraries (pynlpl.formats.folia and libfolia). A more integrative revision of the documentation might be desirable. By this I mean that parts of the documentation are generated from the specification, giving the documentation a more formal character and ensuring everything is in sync.

This also allows for documentation to be publishable in various forms, rather than just the PDF which it is now.

allow 'alien' attributes

In libfolia, 'alien' attributes (in a non-FoLiA namespace) are discarded.
(How does folia.py handle those?)

It might be better to store them, serialize them and somehow give the FoLiA user a way to retrieve them (NOT to add or modify them).

each AnnotationLayer should have children in one set only

At the moment it is possible to add children from different sets to an annotation-layer.
This is undesirable.
A layer has 'per definition' the set of its children, and multiple sets would violate this.
Simplest solution is to let the append method for layers check every child on its set.
The first child would determine the set of the layer (if it was not set on creation).
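The check described above can be sketched as follows. The class and attribute names are hypothetical stand-ins, not the actual libfolia or folia.py API; the sketch only illustrates the rule that the first child fixes the layer's set and later children must match it.

```python
class Annotation:
    """Hypothetical stand-in for a span annotation with a class and a set."""

    def __init__(self, cls, set_):
        self.cls = cls
        self.set = set_


class AnnotationLayer:
    """Hypothetical layer whose children must all belong to one set."""

    def __init__(self, set_=None):
        self.set = set_
        self.children = []

    def append(self, child):
        if self.set is None:
            # The first child determines the set if none was given on creation.
            self.set = child.set
        elif child.set != self.set:
            raise ValueError(
                f"layer has set {self.set!r}, child has set {child.set!r}"
            )
        self.children.append(child)
        return child


layer = AnnotationLayer()
layer.append(Annotation("per", "http://example.org/set-a"))
try:
    layer.append(Annotation("loc", "http://example.org/set-b"))
except ValueError as error:
    print("rejected:", error)
```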

[new annotation proposal] observations

The observation element is a span annotation element that makes an observation pertaining to one or more word tokens. It is embedded in an observations layer. Observations offer an external qualification of part of a text. The qualification is expressed by the class, which is in turn defined by a set. The precise semantics of the observation depend on the user-defined set.

The element may for example act as a more generic replacement for the errordetection element, or to encapsulate observations from teachers/proofreaders on a text, in which case it is often used with the desc element. The following example shows observations from two fictitious sets:

<s>
  <w xml:id="w1"><t>The</t></w>
  <w xml:id="w2"><t>Dalai</t></w>
  <w xml:id="w3"><t>Lama</t></w>
  <w xml:id="w4"><t>greets</t></w>
  <w xml:id="w5"><t>himm</t></w>
  <w xml:id="w6"><t>.</t></w>
 <observations>
  <observation class="typo" set="http://somewhere/errordetection.set.xml"> 
   <wref id="w5"/>
  </observation>
 </observations>
 <observations>
  <observation class="encouragement" set="http://somewhere/teacherobservations.set.xml" annotator="teacher234" annotatortype="manual">
   <wref id="w1" />
   <wref id="w2" />
   <wref id="w3" />
   <wref id="w4" />
   <wref id="w5" />
   <wref id="w6" />
   <desc>Almost a good sentence, only one mistake. Keep up the good work!</desc>
  </observation>
 </observations>
</s>

As always, further attributes can be associated with any observation using FoLiA's feature mechanism.

(proposal inspired by Revisely's solution)

add aliases (short names) for set definitions.

At the moment, having more than one annotation set in scope leads to a lot of bloat. Example:

<w xml:id="WR-P-E-J-0000000001.p.1.s.2.w.16">
  <t>genealogie</t>
  <pos class="N(soort,ev,basis,zijd,stan)" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"/>
  <lemma class="genealogie"/>
  <morphology>
    <morpheme class="complex">
	<t>genealogie</t>
	<feat class="[[genealogisch]adjective[ie]]noun/singular" subset="structure"/>
	<pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
	<morpheme class="complex">
           <feat class="N_A*" subset="applied_rule"/>
           <feat class="[[genealogisch]adjective[ie]]noun" subset="structure"/>
           <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
           <morpheme class="stem">
             <t>genealogisch</t>
             <pos class="A" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
           </morpheme>
           <morpheme class="affix">
             <t>ie</t>
             <feat class="[ie]" subset="structure"/>
          </morpheme>
	</morpheme>
	<morpheme class="inflection">
        <feat class="singular" subset="inflection"/>
      </morpheme>
    </morpheme>
  </morphology>
</w>

The attributes
set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"
and especially
set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"
are repeated many times.

Maybe we could introduce shorthand labels, like cgn-set and celex-set, to avoid all the bloat.

Something like this:

<pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn" annotator="frog" annotatortype="auto" label="cgn"/>
<pos-annotation annotator="frog-mbma-1.0" annotatortype="auto" datetime="2017-04-20T16:48:45" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex" label="celex"/>

Everywhere a set is used, you may use the label instead. When serializing, the label is preferred if one is provided.
Labels must be unique, of course.
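A rough sketch of the bookkeeping this would need is shown below. SetAliases and its methods are hypothetical names invented for illustration; the sketch covers the three stated rules: labels resolve to sets, serialisation prefers the label, and labels must be unique.

```python
class SetAliases:
    """Hypothetical registry mapping short labels to set URLs and back."""

    def __init__(self):
        self._by_label = {}
        self._by_set = {}

    def declare(self, set_url, label=None):
        """Register a declaration; the label is optional."""
        if label is None:
            return
        known = self._by_label.get(label)
        if known is not None and known != set_url:
            raise ValueError(f"label {label!r} is already used for {known!r}")
        self._by_label[label] = set_url
        self._by_set[set_url] = label

    def resolve(self, value):
        """Turn a label into its set URL; full URLs pass through unchanged."""
        return self._by_label.get(value, value)

    def serialise(self, set_url):
        """Prefer the label when serialising, falling back to the URL."""
        return self._by_set.get(set_url, set_url)


CGN = "https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"
CELEX = "http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"

aliases = SetAliases()
aliases.declare(CGN, label="cgn")
aliases.declare(CELEX, label="celex")
print(aliases.resolve("celex"))
print(aliases.serialise(CGN))
```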

foliavalidator: processdir raises UnboundLocalError instead of returning false

On validation of an entire dir, folialint returns a failure while foliavalidator raises an Error (see output below).
Behaviour should be consolidated (i.e. foliavalidator should return false)?

folialint
(lamachine)tvoets@applejack:/vol/tensusers/proycon/clin_spellingstaak/annotated_docs$ folialint --nooutput nnota/*.xml
nnota/page1341.tested.tagged.folia.xml failed:
XML error: Unresolvable id page1341.text.div.1.p.1.s.7.w.24in WordReference

foliavalidator
(lamachine)tvoets@applejack:/vol/tensusers/proycon/clin_spellingstaak/annotated_docs$ foliavalidator nnota/
Searching in nnota/
Traceback (most recent call last):
File "/vol/customopt/lamachine/bin/foliavalidator", line 11, in
sys.exit(main())
File "/vol/customopt/lamachine/lib/python3.4/site-packages/foliatools/foliavalidator.py", line 145, in main
r = processdir(x,schema,quick,settings.deep, settings.stricttextvalidation,settings.debug)
File "/vol/customopt/lamachine/lib/python3.4/site-packages/foliatools/foliavalidator.py", line 87, in processdir
if not r: success = False
UnboundLocalError: local variable 'r' referenced before assignment
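The traceback shows r being read before it was assigned. One common way this happens is when r is only bound in some branches of the loop body; whether that is the actual cause in foliavalidator is an assumption. The sketch below uses a hypothetical validate function and a simplified processdir signature to show a version that always binds r and returns False on failure instead of raising.

```python
import os
import tempfile


def validate(path):
    """Hypothetical stand-in validator: fails on files containing 'BAD'."""
    with open(path, encoding="utf-8") as handle:
        return "BAD" not in handle.read()


def processdir(directory):
    """Validate all .xml files below directory; return False if any fails."""
    success = True
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Bind r on every iteration so the check below can never hit an
        # unbound local, even for entries that are skipped.
        r = True
        if os.path.isdir(path):
            r = processdir(path)
        elif name.endswith(".xml"):
            r = validate(path)
        if not r:
            success = False
    return success


with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("good.xml", "<FoLiA/>"), ("notes.txt", "not XML")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    only_good = processdir(tmp)
    with open(os.path.join(tmp, "bad.xml"), "w", encoding="utf-8") as handle:
        handle.write("BAD")
    with_bad = processdir(tmp)

print(only_good, with_bad)
```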

[proposal] token annotations on multi-word spans (group annotations) and discussion of other multi-word issues.

Since its inception, FoLiA has made a distinction between annotations on single tokens (or other single structural elements) and annotations made on spans of tokens. These are called token annotations and span annotations respectively; the former are implemented inline, using the natural hierarchy in XML, whereas the latter use a stand-off layer. Each particular annotation type (e.g. lemma/pos/entities/syntax etc.) is implemented as one of these forms. Whether a particular annotation type is implemented as a token or span annotation depends on the nature of the annotation type.

FoLiA is, by design, limited to a single tokenisation, or no tokenisation at all, in which case actual linguistic annotation abilities are limited. Tokens are represented as <w> (word) elements. How tokenisation should be performed is not prescribed by FoLiA but left to the tokeniser. Whitespace in a token is not prohibited (as long as the token contains more than just whitespace) so the notion of a word or token is a flexible one and the two concepts are not strongly distinguished.

However, it appears that more expressive flexibility is needed as challenges appear in the situations where:

  1. a token annotation (e.g. pos, lemma) cannot be assigned to a single token but only to multiple tokens, an extra complication being if the tokens are discontinuous. Consider separable verbs in Dutch, for instance: in the sentence "ik hou mijn adem in", we may want to tag hou in with lemma inhouden and part-of-speech verb. This is currently not possible in FoLiA.
  2. a token annotation (e.g. pos, lemma) cannot be assigned to a single token but only to a part of it, for instance in the case of a contraction (e.g. "it's"). This is already largely solved by the morphology layer in FoLiA.

Both are symptoms of the same underlying theme: the lack of atomicity of the token/word. The most straightforward solution would seem to be to retokenize the document, but this is too rigid and not always feasible or desirable. Sometimes maintaining an explicit distinction between tokens/words/groups of words is needed.

Multi-token words

Consider the following FoLiA mock-example of three tokens which together form a compound noun:

<w xml:id="w.1">
    <t>dry</t>
</w>
<w xml:id="w.2" space="no">
    <t>-</t>
</w>
<w xml:id="w.3" space="no">
    <t>cleaning</t>
</w>

FoLiA already has facilities to express that a group of tokens forms some type of entity (named or otherwise), or to correct the tokens to a single new one (<correction>). But in cases where all of this is undesirable, where you want to keep the tokens as-is because they were expressed that way in the original, but still express that they form a single word with a single part-of-speech tag and lemma, new facilities are needed to use token annotations with spans.

Looking at other formats: NAF makes an explicit distinction between tokens (wordforms) and what it calls terms, and then proceeds to annotate largely (not always consistently so) on the terms rather than the wordforms. An extension of FoLiA is therefore also needed for the NAF-FoLiA converter (see issue cltl/NAFFoLiAPy#4; as such, this may also be relevant for @antske).

I propose the following: adding a facility to FoLiA that can group words (like any normal span annotation element, no news here) but that allows for token annotations within its scope.

I think the simplest and least intrusive way to do this is to expand the existing entity annotation, example:

<entities>
    <entity xml:id="wg.1" class="compoundnoun">
        <wref id="w.1" />
        <wref id="w.2" />
        <wref id="w.3" />
        <pos class="N" />
        <lemma class="dry-cleaning" />
    </entity>
</entities>

This would cover non-contiguous spans just as well.
Such an annotation would be declared in the header as follows:

<entity-annotation set="...." type="complex" />

The type attribute is new here and would default to simple, the current behaviour. The value complex is used for the proposed extension, to explicitly denote that we are allowing token annotations on entities. I want this attribute so we can explicitly distinguish the two: documents with the new complex entities pose extra challenges for FoLiA tools, so we want to know from the declaration already whether this will happen.
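As a sketch of what the extension implies for tools, the following reads token annotations from an entity using Python's standard library. The function name group_annotations is hypothetical and namespaces are omitted; the fragment mirrors the entity example of this proposal, with every element closed.

```python
import xml.etree.ElementTree as ET

ENTITIES = """
<entities>
    <entity xml:id="wg.1" class="compoundnoun">
        <wref id="w.1" />
        <wref id="w.2" />
        <wref id="w.3" />
        <pos class="N" />
        <lemma class="dry-cleaning" />
    </entity>
</entities>
"""


def group_annotations(layer):
    """Collect the referenced words and any token annotations per entity."""
    groups = []
    for entity in layer.findall("entity"):
        pos = entity.find("pos")
        lemma = entity.find("lemma")
        groups.append({
            "class": entity.get("class"),
            "words": [ref.get("id") for ref in entity.findall("wref")],
            # In a "simple" entity these would both be absent.
            "pos": None if pos is None else pos.get("class"),
            "lemma": None if lemma is None else lemma.get("class"),
        })
    return groups


groups = group_annotations(ET.fromstring(ENTITIES))
print(groups)
```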

Alternatives to this solution would be:

  • Introducing a new annotation type for this entirely (e.g wgroup). The disadvantage is that it may be a bit too similar to entity and the two could be used interchangeably when no further annotations are added.
  • Adding span-annotation variants of all token annotation types.

The motivation for the proposed solution is to keep changes as minimal and simple as possible and not introduce too many new things. Despite the simplicity of the change, it does have quite some implications for the tools and libraries.

I do not propose that other span elements can in turn refer (wref) to entities rather than tokens/words (there are already facilities for doing that anyway), and it would add unnecessary ambiguity.

Non-atomic tokens

In cases where we have a token annotation (e.g. pos, lemma) that can not be assigned to a single token but only a part of it, we can use the already existing morphology layer:

Consider the example of the English contraction it's:

<w xml:id="w.1">
    <t>it's</t>
    <morphology>
        <morpheme>
            <t>it</t>
            <pos class="pron" />
            <lemma class="it" />
        </morpheme>
        <morpheme>
            <t>'s</t>
            <pos class="v" />
            <lemma class="is" />
        </morpheme>
    </morphology>
</w>

Here I want to stress that this is not the only possible representation for this contraction, as we can just as well express it with two tokens as shown in the next example. It's not FoLiA's job to favour one over the other, but that is a decision of the creator/researcher/tokeniser, FoLiA just has to provide the facilities that make both models possible:

<w xml:id="w.1" space="no">
    <t>it</t>
    <pos class="pron" />
    <lemma class="it" />
</w>
<w xml:id="w.2">
    <t>'s</t>
    <pos class="v" />
    <lemma class="is" />
</w>

The morphology notation in FoLiA is very powerful and nestable. Consider the Arabic token فيبيتك. This consists of three words meaning "in your house", transliterated in the example below for ease of reading:

<w xml:id="w.1">
    <t>fiybaytika</t>
    <morphology>
        <morpheme class="prefix" function="lexical">
            <t>fiy</t>
            <pos class="PREP" />
            <lemma class="fiy" />
        </morpheme>
        <morpheme class="stem" function="lexical">
            <t>bayti</t>
            <pos class="N">
                <feat subset="case" class="prep" />
            </pos>
            <lemma class="bayt" />
            <morpheme class="stem">
                <t>bayt</t>
                <pos class="N" />
            </morpheme>
            <morpheme class="suffix" function="inflectional">
                <t>i</t>
                <desc>prepositional marker</desc>
            </morpheme>
        </morpheme>
        <morpheme class="suffix" function="lexical">
            <t>ka</t>
            <pos class="PRON" />
            <lemma class="anta" />
        </morpheme>
    </morphology>
</w>

Morphemes (and phonemes) can be referred to explicitly (like words/tokens) from any span annotation (wref).

My question (mainly for @kdepuydt, @JessedeDoes) is whether this solution is sufficient (it can capture contractions, clitics, etc.) and linguistically accurate enough (e.g. grouping it all under morphology). If there are counter-examples, I'd be very interested.

Compound classes

One point that arises from current annotations in the CRM and Gysseling corpora (historical Dutch) is the use of what I call compound PoS tags and lemmas. Take the Arabic example above: the token itself does not have a PoS tag, but one may want to force a tag anyway and assign something like prep+n+pron. Recall that FoLiA itself does not define the tagset, so this would be valid. However, the semantics of it being some kind of compound class would not be formalised in any way. The question arises whether we need facilities for explicitly representing compound classes. Perhaps we should allow FoLiA set definitions to define operators such as +, allowing for more expressivity in classes. This is as opposed to defining operators in FoLiA itself, because that raises the question of which operators are needed, and that is more a property of the vocabulary in question. In categorial grammars, for instance, one would want to define / and \. In other vocabularies, perhaps more set-theoretic operators make sense. If operators are introduced, then bracketing and operator precedence of course become factors to take into account as well. The class would cease to be a simple reference and would allow for a mini-language in its own right, although for many tools this is of no consequence.

I'm not yet including a specific proposal for this, but would very much like to hear your thoughts on this direction.

Add a 'comment' element (higher-order annotation)

XML comments are not sufficient; we need proper comment facilities in FoLiA that are allowed everywhere (analogous to descriptions). We should add a comment element to this end. The text of the element holds the comment and is free-form.

[documentation] newlines and whitespace in FoLiA text content (<t>)

This issue documents a fundamental issue with FoLiA's text content (<t>) that may lead to misunderstanding and requires more extensive documentation. It is especially relevant now that FoLiA v1.5 introduces mandatory text validation, which may identify problems caused by this.

A FoLiA text content block (<t>) is an XML mixed content node. Such a node may consist of both text and elements, the latter being FoLiA text markup elements in this case (t-style, t-gap, br etc.). In practice it's often just text. When associated with a structural element that is not a word or morpheme, the text content expresses untokenised text. This means that spaces and newlines are significant.

Consider the following snippets:

A:

<s><t>This is a sentence</t></s>

B:

<s><t>This is
a sentence</t></s>

C:

<s><t>This is<br/>a sentence</t></s>

The text of sentence A is not equivalent to that of B or C; the texts of B and C are equivalent.

Special caution is in order when spreading text content over multiple lines; this usually does not mean what you might assume:

D:

<s>
    <t>This is
         a sentence</t>
</s>

Sentence D is not equivalent to B or C; its text is This is\n\s\s\s\s\s\s\s\s\sa sentence (where \s denotes a space).

This is in line with XML behaviour (quoting http://usingxml.com/Basics/XmlSpace):

.., if the element is declared as having mixed content, both text and element child nodes, then the XML parser must pass on all the white space found within the element.

It does differ from what people are accustomed to in HTML (hence some of the confusion perhaps), which considers whitespace insignificant far more frequently.

FoLiA v1.5 introduced mandatory text validation (#24), which checks if any text redundancy is consistent. This may bring to light issues such as described here. This text validation, however, still proceeds in a more flexible manner as it is insensitive to multiple spaces/newlines and operates on a normalised form. Explicit text offsets (if used), on the other hand, do not operate on a normalised form and are thus very strict, they are also validated as part of text validation.
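The difference between the raw and the normalised view can be illustrated with Python's standard library. This is a sketch under stated assumptions: raw_text and normalised are hypothetical helpers, br is mapped to a newline to reflect the equivalence of B and C, other markup elements are ignored, and collapsing all whitespace runs to a single space is only an approximation of what a validator's normalised form might be.

```python
import xml.etree.ElementTree as ET


def raw_text(xml):
    """Text of the <t> child as the parser delivers it, with <br/> as newline."""
    t = ET.fromstring(xml).find("t")
    parts = [t.text or ""]
    for child in t:
        if child.tag == "br":
            parts.append("\n")
        # Text following a child element is stored in its tail.
        parts.append(child.tail or "")
    return "".join(parts)


def normalised(text):
    """Assumed normalisation: collapse every whitespace run to one space."""
    return " ".join(text.split())


A = "<s><t>This is a sentence</t></s>"
B = "<s><t>This is\na sentence</t></s>"
C = "<s><t>This is<br/>a sentence</t></s>"
D = "<s>\n    <t>This is\n         a sentence</t>\n</s>"

print(repr(raw_text(B)), repr(raw_text(C)))
print(repr(raw_text(D)))
print(normalised(raw_text(D)) == normalised(raw_text(B)))
```

With explicit offsets no such normalisation applies, so the nine indentation spaces in D would count towards every offset after the newline.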

Note for completeness that this discussion is limited to text content (<t>) and the text markup elements therein; whitespace and newlines in most other contexts, such as within structural elements, are not significant.

Problem with text offset and Linebreak

example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia2html.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="WR-P-E-J-0000000001" generator="libfolia-v1.14" version="1.5">
  <text xml:id="WR-P-E-J-0000000001.text">
    <div>
      <head xml:id="sandbox.3">
        <t>De <br/><br/><br/><br/>FoLiA developers zijn:</t>
        <str xml:id="sandbox.3.str">
          <t offset="7">FoLiA</t>
        </str>
      </head>
    </div>
  </text>
</FoLiA>

C++'s libfolia accepts this, as it counts every <br/> as one character, so the offset of FoLiA is 7.

Python's folia.py rejects this, as it ignores all <br/> elements and requires an offset of 3.

I think libfolia is right here, but this is very tricky indeed.
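The two readings can be reproduced with a few lines of standard-library Python. This is only an illustration of the arithmetic, not the code of either library; flatten is a hypothetical helper and the fragment is a reduced, namespace-free copy of the example.

```python
import xml.etree.ElementTree as ET

HEAD = (
    "<head>"
    "<t>De <br/><br/><br/><br/>FoLiA developers zijn:</t>"
    '<str><t offset="7">FoLiA</t></str>'
    "</head>"
)


def flatten(t, br_as):
    """Flatten a <t> element, substituting br_as for every <br/>."""
    parts = [t.text or ""]
    for child in t:
        if child.tag == "br":
            parts.append(br_as)
        parts.append(child.tail or "")
    return "".join(parts)


head = ET.fromstring(HEAD)
parent = head.find("t")
substring = head.find("str/t").text

# Counting every <br/> as one character: "De " (3) + 4 line breaks.
counting = flatten(parent, "\n").find(substring)
# Ignoring <br/> entirely: only "De " precedes the substring.
ignoring = flatten(parent, "").find(substring)

print(counting, ignoring)  # 7 3
```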

Coreference set is missing

It seems sets were originally hosted at ILK (ilk.uvt.nl), but that server now redirects to this repository:
http://ilk.uvt.nl/folia/sets/ --> https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/.
However, the coref set (http://ilk.uvt.nl/folia/sets/coref) that is referred to in the official documentation (PDF) is not in the repo here. Is there any chance it can still be retrieved from somewhere? (I tried the Internet Archive but of course they didn't have it, and Google doesn't come up with much either...)

do we need textclass attribute on structure nodes?

consider the following FoLiA fragment:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>

Clearly the sentence is tokenized by ucto on the 'other' textclass, but there is no way to express this.
A simple solution would be to allow for textclass on the word level:

<s id="s1">
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t>Zo</t>
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t>probleem</t>
   <t textclass="other">probleem</t>
  </w>
</s>

But this raises some questions about the 'orphaned' current text. Wouldn't it be better to have these connected to another word? Like this:

<s id="s1">
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This could also be raised to the sentence level then:

<s id="s1.1">
  <t>Zo probleem</t>
  <w id="w1.1">
   <t>Zo</t>
  </w>
  <w id="w2.1">
   <t>probleem</t>
  </w>
</s>
<s id="s1" textclass="other">
  <t textclass="other">Zo'n probleem</t>
  <w id="w1" class="WORD-WITHSUFFIX" textclass="other">
   <t textclass="other">Zo'n</t>
  </w>
  <w id="w2" class="WORD">
   <t textclass="other">probleem</t>
  </w>
</s>

This might be a solution for the problem of multiple/different tokenizations in one FoLiA document.
But again it raises questions:

  • Should we then disallow or discourage multiple <t> nodes per structure?
  • Would this make textclass redundant on <t> nodes, or implicit?

Migrate FoLiA Set Definition scheme to RDF

The role of FoLiA Set Definitions is:

  • to define which classes are valid in a set
  • to define which subsets and classes are valid in "features" in a set
  • to constrain which subsets+classes may co-occur in an annotation of the set
  • to allow enumeration over classes and subsets
  • to assign human-readable labels to symbolic classes
  • to relate classes to external resources defining them (data category registries)
  • to define a hierarchy/taxonomy of classes

Using set definitions a FoLiA document can be validated on a deep level, i.e.
the validity of the used classes can be tested. Set definitions provide
semantics to the FoLiA documents that use them and are an integral part of FoLiA.
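
As an illustration of what such deep validation amounts to, here is a minimal sketch in which a set definition is modelled as a plain Python dictionary. The structure, the set contents and the function name validate are all hypothetical; this is not the actual set definition format or any library API.

```python
# Hypothetical, simplified set definition: valid classes with human-readable
# labels, plus the valid classes per subset (used by features).
POS_SET = {
    "classes": {"N": "Noun", "WW": "Verb", "TW": "Numeral"},
    "subsets": {"number": {"sg", "pl"}},
}

def validate(setdef, cls, features=None):
    """Return a list of problems for one annotation; empty means valid."""
    problems = []
    if cls not in setdef["classes"]:
        problems.append(f"unknown class: {cls}")
    for subset, value in (features or {}).items():
        if subset not in setdef["subsets"]:
            problems.append(f"unknown subset: {subset}")
        elif value not in setdef["subsets"][subset]:
            problems.append(f"unknown class {value} in subset {subset}")
    return problems

print(validate(POS_SET, "N", {"number": "pl"}))
print(validate(POS_SET, "X"))
```

The sketch covers only the first two roles in the list above (valid classes, and valid subsets and classes in features); co-occurrence constraints and hierarchy would need a richer structure.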

Set definitions are not in widespread use yet; most people simply don't bother
with, or care for, such a level of abstraction and formality. One tool, FLAT,
does rely heavily on set definitions to populate options in selection fields.

Set definitions are currently described in a simple XML format, distinct from
FoLiA itself. The format is limited and not strongly established.

Considering the highly semantic nature of set definitions, the binding role
they play between the FoLiA document on one hand and data category registries
on the other hand, and the advent of linked open data, I propose describing
the set definitions themselves in RDF in future versions of FoLiA. I'm working
on a scheme for this.

The current set definitions will remain supported for backwards compatibility
of course, and may also act as an intermediate step in producing the RDF data.

Add predicates for semantic roles.

Right now, semantic roles are not grouped into predicates in FoLiA, whilst semantic roles are typically grouped as such. We need a new span annotation element to remedy this.

Proposal:

<semroles>
  <predicate>
    <semrole class="agent">
      <wref />...
    </semrole>
    <semrole class="theme">
      <wref />...
    </semrole>
  </predicate>
</semroles>

semrole would then require predicate (i.e. semrole would become a span role element rather than a first-order span annotation element). This breaks backwards compatibility, but is fairly easy to remedy automatically by grouping all semroles in a layer into one predicate (assuming layers are per sentence, as is conventional). Also, I am not aware of anyone already using this annotation type.
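
The automatic remedy could look roughly like the sketch below, which uses Python's standard xml.etree.ElementTree rather than any FoLiA library. It assumes one layer per sentence and wraps every semrole that sits directly under the layer in a single new predicate element; the input layer, the wref ids and the function name group_semroles are made up for illustration, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical old-style layer: semroles directly under <semroles>.
OLD_LAYER = """
<semroles>
  <semrole class="agent"><wref id="w1" /></semrole>
  <semrole class="theme"><wref id="w3" /></semrole>
</semroles>
"""

def group_semroles(layer):
    """Wrap all direct <semrole> children of a layer in one <predicate>."""
    roles = layer.findall("semrole")
    if not roles:
        return layer  # nothing to migrate
    predicate = ET.Element("predicate")
    for role in roles:
        layer.remove(role)
        predicate.append(role)
    layer.append(predicate)
    return layer

layer = group_semroles(ET.fromstring(OLD_LAYER))
print(ET.tostring(layer, encoding="unicode"))
```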

Generate RDF/OWL ontology of FoLiA

Now that FoLiA uses a generic external specification (folia.yml), as it has since v1.0, we can augment this and generate a formal OWL ontology from it. Requested by the OpenMinTeD project.

Add proper support for provenance logging in FoLiA

It is often desirable to know exactly which tools (which versions of them, and even with which parameters) were invoked, and in which order, to produce a FoLiA document. The annotation declaration section already covers this to a certain extent, but does not go far enough yet.

space attribute not implemented according to documentation

The FoLiA documentation states:
Allowed values for space are:

  • “yes” or “ ” (a space) – This is the default and says that the token is followed by a single space.
  • “no” or “” (empty) – This states that the token is not followed by a space.
  • any other character or string – This states that the token is followed by another character or string that acts as a token separator.

But both the Python version and the C++ version only implement the boolean value.

I suggest removing the 'any other character or string' case from the documentation, as that has never occurred in this universe
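
For reference, the three documented cases could be interpreted as in the sketch below. This is only an illustration of the documentation text quoted above, not the behaviour of the Python or C++ library (which, per this issue, handle only the boolean case); the function names separator and detokenize are hypothetical.

```python
def separator(space="yes"):
    """Map a space attribute value to the string that follows the token."""
    if space in ("yes", " "):
        return " "
    if space in ("no", ""):
        return ""
    # The documented third case: any other string is itself the separator.
    return space

def detokenize(tokens):
    """Join (text, space) pairs; the separator after the last token is dropped."""
    parts = []
    for i, (text, space) in enumerate(tokens):
        parts.append(text)
        if i < len(tokens) - 1:
            parts.append(separator(space))
    return "".join(parts)

print(detokenize([("Hello", "no"), (",", "yes"), ("world", "no"), ("!", "yes")]))
```

Dropping the third case, as suggested, would reduce separator to a two-way choice between a single space and the empty string.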

FoLiA set definitions currently can't express constraints

Continuation of INL/nederlab-linguistic-enrichment#17, @JessedeDoes wrote:

I see no obvious way in SkoS of declaring that feature f can be combined with PoS p, etc.
(You could express that "Masculine" is a narrower class than "having a gender feature", etc, how would one express, eg, that "having a number feature" is a subset of the union of TW,WW,N,VNW. It would be possible in OWL)

The old more ad-hoc XML format had facilities for this but we need modern RDF ones now and don't have any yet. Parsing these constraints would be needed for e.g. Frog (see LanguageMachines/frog#51), preferably without needing a complete OWL logic parser in Frog I'd say..

revise class hierarchy considering paragraphs and sentences

At the moment, the rules for which elements accept sentences and/or paragraphs are a bit rigid and unclear.

For example:
A Caption may contain a Sentence, but not a Paragraph.
But that means that it also may contain several Sentences.
But isn't that just a Paragraph?

An Entry may NOT contain Sentence or Paragraph, but it may contain a Term, which may contain both.

An Utterance may contain a Sentence, but not a Paragraph, which somehow makes sense, but it may contain a Quote too, which accepts both.

The rationale is unclear to me.

Add facilities for metadata on sub-parts of a document

Allow metadata on structural elements (e.g. a part of the document rather than the whole document).

This is not explicitly possible yet (although it could be handled with a foreign-data block); a more explicit solution is desired.

Two proposed approaches:

  1. In the main metadata section with references to elements (favoured by Katrien), this prevents duplication and keeps all metadata in one place but requires a referencing mechanism.
  2. Allow <metadata> blocks in the structural elements themselves (e.g. in paragraphs, sentences, whatever structural unit is desired)

FoLiA's native metadata scheme is deliberately very simple; approach 1 would add some complexity to it. The idea behind keeping things simple is that we focus on annotations rather than metadata, as there are other schemes already in existence which handle metadata (e.g. CMDI, Dublin Core) and we didn't want to reinvent that wheel. However, referring to sub-parts is a valid FoLiA matter and existing metadata schemes won't have facilities for it (as they're independent of FoLiA). Such a solution would have to be usable in combination with whatever metadata scheme is used.

Approach 2 would perhaps be the most straightforward and would tie in easily with other metadata schemes (as the metadata block may simply contain foreign-data and use CMDI, DC, or whatever). The more inline solution fits the FoLiA paradigm better at first glance, but raises the important issue of unnecessary duplication in case a block of metadata applies to various sub-parts.

The two approaches need not even be mutually exclusive: both could be implemented and the choice deferred to the user. But this would introduce extra complexity for tools that don't know what to expect, and FoLiA aims to be rather specific to prevent that.

The boundary between metadata and annotation is not always a clear-cut one; whatever mechanism we introduce should not be used for linguistic annotation.

allow a foreign-annotation node

Sometimes you want to add some 'raw XML' to a FoLiA document.
This then can be serialized. Or you can request a pointer to it.
FoLiA itself will NOT parse or use this XML.

suggestion: add a node to store raw XML.
the raw XML should NOT be in the FoLiA name-space.

'alien' nodes that are NOT under such an annotation node are completely ignored and discarded
(this is the current behaviour)

QUESTIONS:

  • should we actively check that the name-space != FoLiA ? probably yes.

  • Is the foreign name-space to be defined on the foreign-annotation node or on its immediate child?

    <foreign-annotation xmlns:pm="http://...">
           <pm:members pm:status="present">
    
versus:
   <foreign-annotation>
         <pm:members pm:status="present" xmlns:pm="http://...">
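
A minimal sketch of the namespace check from the first question, using Python's standard xml.etree.ElementTree. The FoLiA namespace URI is assumed to be http://ilk.uvt.nl/folia, the pm namespace URI is a placeholder (the real one is elided above), and the function names are hypothetical. The sketch also shows that a namespace-aware parser yields the same expanded element name for both placements of the xmlns:pm declaration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # assumed FoLiA namespace URI
PM_NS = "http://example.org/pm"       # placeholder, not the real namespace

# Variant 1: foreign namespace declared on the foreign-annotation node.
ON_PARENT = f"""
<foreign-annotation xmlns="{FOLIA_NS}" xmlns:pm="{PM_NS}">
  <pm:members pm:status="present" />
</foreign-annotation>
"""

# Variant 2: foreign namespace declared on the immediate child.
ON_CHILD = f"""
<foreign-annotation xmlns="{FOLIA_NS}">
  <pm:members pm:status="present" xmlns:pm="{PM_NS}" />
</foreign-annotation>
"""

# Invalid: the child is itself in the FoLiA namespace.
FOLIA_CHILD = f'<foreign-annotation xmlns="{FOLIA_NS}"><w /></foreign-annotation>'

def child_tags(xml_text):
    """Expanded names ({namespace}local) of the direct children."""
    return [child.tag for child in ET.fromstring(xml_text)]

def foreign_children_ok(xml_text):
    """True if no direct child of the node is in the FoLiA namespace."""
    prefix = "{" + FOLIA_NS + "}"
    return all(not tag.startswith(prefix) for tag in child_tags(xml_text))

print(child_tags(ON_PARENT) == child_tags(ON_CHILD))
print(foreign_children_ok(FOLIA_CHILD))
```

Only direct children are checked here; checking deeper descendants as well would mean walking the subtree with iter().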

Pagebreaks: add linenr/pagenr/newpage attributes on br

Extend the existing linebreak element with a newpage="yes" attribute (a page break implies a line break, after all). The br element is already special in that it can be used both as a structure element and as a text markup element (within t). We want to keep that function for page breaks as well, because they can occur anywhere: in the middle of a sentence and, with a bit of bad luck, even in the middle of a word (hyphenation).

Support more than one ForeignData element

At the moment a FoLiA document can have only one ForeignData element, which contains arbitrary non-FoLiA XML.
This is a hard-coded property.
It is desirable that several ForeignData blocks can occur.
In libfolia this is already implemented.

missing set in annotation declaration

Both the C++ and the Python implementation seem to accept annotation declarations without a set,
from example.xml:
<token-annotation annotator="ilktok" annotatortype="auto" />

The documentation states:
The set attribute is mandatory

with a footnote:
Technically, it can be omitted, but then the set defaults to “undefined”. This is allowed for flexibility and less explicit usage of FoLiA in limited settings, but not recommended!

I think this is too lax, and set names should be mandatory unconditionally.
For instance: we run into trouble when a module would like to add another token-annotation.
By definition there is no default set anymore then, but it is rather complicated or impossible to assign a set to the already existing tokens, to distinguish those from the newly added ones.

AFAIK, these nameless declarations are quite rare, probably only in test files?
We could investigate this, but NOT allowing this is important.
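
Investigating how often this occurs could start from something like the sketch below, which lists declarations without a set attribute using Python's standard xml.etree.ElementTree. The namespace URI, the document skeleton (FoLiA > metadata > annotations), the pos-annotation set name and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # assumed FoLiA namespace URI

# Hypothetical minimal document; only the declaration section matters here.
DOC = f"""
<FoLiA xmlns="{FOLIA_NS}">
  <metadata>
    <annotations>
      <token-annotation annotator="ilktok" annotatortype="auto" />
      <pos-annotation set="hypothetical-pos-set" />
    </annotations>
  </metadata>
</FoLiA>
"""

def declarations_without_set(root):
    """Return the local names of annotation declarations lacking a set."""
    missing = []
    for annotations in root.iter("{" + FOLIA_NS + "}annotations"):
        for decl in annotations:
            if decl.get("set") is None:
                # Strip the "{namespace}" part of the expanded tag name.
                missing.append(decl.tag.split("}", 1)[-1])
    return missing

missing = declarations_without_set(ET.fromstring(DOC))
print(missing)
```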

Question: why can't corrections have features?

Corrections can't have features (folia.Feature) at the moment. I'm not entirely sure why this isn't allowed, would you care to elaborate?

The reason I'm asking is that we're compiling a learner corpus which is annotated with corrections that are classified with a specific unit, problem and part-of-speech; e.g. [roed/roet]SPE*I*N signals there's a spelling error (SPE) in the word 'roed' (correctly: roet), which is an incorrectly (I) spelled noun (N). I'd like to split these features so that I'll later be able to query on them. So e.g.

    <w xml:id="example.text.1.p.1.s.5.w.6">
      <correction class="SPE">
        <new>
          <t>roet</t>
        </new>
        <original auth="no">
          <t>roed</t>
        </original>
        <feat class="I" subset="problem"/>
        <feat class="N" subset="pos"/>
      </correction>
    </w>

But this is not allowed by the foliavalidator currently :-) Other ideas for a proper solution are also welcome. Thanks very much in advance!
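
The kind of query this would enable can be sketched with Python's standard xml.etree.ElementTree, run on the proposed (currently invalid) markup above. This is not the FoLiA library API; the function names are hypothetical and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

WORD = """
<w xml:id="example.text.1.p.1.s.5.w.6">
  <correction class="SPE">
    <new><t>roet</t></new>
    <original auth="no"><t>roed</t></original>
    <feat class="I" subset="problem"/>
    <feat class="N" subset="pos"/>
  </correction>
</w>
"""

def correction_features(correction):
    """Map each subset to its class for the features of one correction."""
    return {f.get("subset"): f.get("class") for f in correction.findall("feat")}

def find_corrections(root, **wanted):
    """Return (original, new) text pairs for corrections matching all features."""
    hits = []
    for corr in root.iter("correction"):
        feats = correction_features(corr)
        if all(feats.get(k) == v for k, v in wanted.items()):
            hits.append((corr.findtext("original/t"), corr.findtext("new/t")))
    return hits

hits = find_corrections(ET.fromstring(WORD), problem="I", pos="N")
print(hits)
```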

[new annotation proposal] Statements aka attribution

A span annotation type is needed to encode who says what about what/whom. This proposal is inspired by NAF's attribution layer:

<s>
 <w xml:id="w1"><t>They</t></w>
 <w xml:id="w2"><t>said</t></w>
 <w xml:id="w3"><t>the</t></w>
 <w xml:id="w4"><t>hotel</t></w>
 <w xml:id="w5"><t>was</t></w>
 <w xml:id="w6"><t>a</t></w>
 <w xml:id="w7"><t>nightmare</t></w>
 <w xml:id="w8"><t>.</t></w>
 <statements>
  <statement class="said">
   <source>
    <wref id="w1" />
   </source>
   <hd>
    <wref id="w3" />
    <wref id="w4" />
    <wref id="w5" />
    <wref id="w6" />
    <wref id="w7" />
   </hd>
   <relation>
     <wref id="w2" />
   </relation>
  </statement>
 </statements>
</s>

This introduces/uses the following span roles:

  • hd - The statement itself (required)
  • source - The source/holder of the statement (optional), used also in issue #16
  • relation - The relationship between source and the head (optional)

Whether the statement's class expresses the relationship like in the example, or has more direct bearing on the statement itself, is of course up to the set used.
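
To show how a consumer might read the proposed layer, here is a minimal sketch with Python's standard xml.etree.ElementTree that resolves the wref ids of each span role back to the word texts of the example sentence. It is not the FoLiA library API; the function name resolve_statements is hypothetical and namespaces other than xml: are omitted.

```python
import xml.etree.ElementTree as ET

# ElementTree expands xml:id to this fully qualified attribute name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SENTENCE = """
<s>
 <w xml:id="w1"><t>They</t></w>
 <w xml:id="w2"><t>said</t></w>
 <w xml:id="w3"><t>the</t></w>
 <w xml:id="w4"><t>hotel</t></w>
 <w xml:id="w5"><t>was</t></w>
 <w xml:id="w6"><t>a</t></w>
 <w xml:id="w7"><t>nightmare</t></w>
 <w xml:id="w8"><t>.</t></w>
 <statements>
  <statement class="said">
   <source><wref id="w1" /></source>
   <hd>
    <wref id="w3" /><wref id="w4" /><wref id="w5" />
    <wref id="w6" /><wref id="w7" />
   </hd>
   <relation><wref id="w2" /></relation>
  </statement>
 </statements>
</s>
"""

def resolve_statements(sentence):
    """Return one dict per statement: its class plus the text of each span role."""
    words = {w.get(XML_ID): w.findtext("t") for w in sentence.findall("w")}
    results = []
    for statement in sentence.iter("statement"):
        entry = {"class": statement.get("class")}
        for role in statement:  # source, hd, relation
            refs = role.findall("wref")
            entry[role.tag] = " ".join(words[r.get("id")] for r in refs)
        results.append(entry)
    return results

statements = resolve_statements(ET.fromstring(SENTENCE))
print(statements)
```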
