Poliglot is a Scala library for parsing common language resources such as corpora and tagsets. It was created to facilitate working with bilingual corpora.
Currently, it has language support for:
- German: RFTagger tagset
- Polish: NKJP tagset, TCP bindings for concraft-pl
The language-specific tagsets are translated to a generic class-based hierarchy.
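As an illustration of what such a translation could look like, here is a minimal sketch. It is not Poliglot's actual API: the names `TagsetSketch`, `GrammaticalCase`, `Noun`, `Adposition`, `fromRfTagger` and `fromNkjp` are hypothetical, only two word classes and four cases are covered, and the tag strings are assumed to be dot-separated for RFTagger and colon-separated for NKJP.

```scala
object TagsetSketch {
  // Generic, language-independent categories (hypothetical names).
  sealed trait GrammaticalCase
  object GrammaticalCase {
    case object Nominative extends GrammaticalCase
    case object Genitive   extends GrammaticalCase
    case object Dative     extends GrammaticalCase
    case object Accusative extends GrammaticalCase
  }

  sealed trait Tag
  final case class Noun(grammaticalCase: Option[GrammaticalCase]) extends Tag
  final case class Adposition(grammaticalCase: Option[GrammaticalCase]) extends Tag
  final case class Other(raw: String) extends Tag // anything this sketch does not model

  import GrammaticalCase._

  // Assumed case labels; only a subset of each tagset is listed here.
  private val germanCases: Map[String, GrammaticalCase] =
    Map("Nom" -> Nominative, "Gen" -> Genitive, "Dat" -> Dative, "Acc" -> Accusative)
  private val polishCases: Map[String, GrammaticalCase] =
    Map("nom" -> Nominative, "gen" -> Genitive, "dat" -> Dative, "acc" -> Accusative)

  // Returns the first field that names a known case, if any.
  private def firstCase(fields: Seq[String],
                        table: Map[String, GrammaticalCase]): Option[GrammaticalCase] =
    fields.collectFirst { case f if table.contains(f) => table(f) }

  // German: dot-separated fields, e.g. "APPR.Dat".
  def fromRfTagger(tag: String): Tag =
    tag.split('.').toList match {
      case "N" :: rest    => Noun(firstCase(rest, germanCases))
      case "APPR" :: rest => Adposition(firstCase(rest, germanCases))
      case _              => Other(tag)
    }

  // Polish: colon-separated fields, e.g. "prep:gen".
  def fromNkjp(tag: String): Tag =
    tag.split(':').toList match {
      case "subst" :: rest => Noun(firstCase(rest, polishCases))
      case "prep" :: rest  => Adposition(firstCase(rest, polishCases))
      case _               => Other(tag)
    }
}
```

With a shared hierarchy like this, tags from both languages can be compared directly: under the assumptions above, `fromRfTagger("N.Reg.Nom.Sg.Fem")` and `fromNkjp("subst:sg:nom:f")` both yield a `Noun` in the nominative.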
Poliglot also ships the following tools for creating bilingual corpora and analysing annotations:
Analyses the semantics of adpositions.
Prints statistics on the bilingual corpus.
Annotates selected sentences morphosyntactically. To keep tokenisation consistent, the German sentences are tokenised by concraft-pl. The annotated alignments are written to alignments-import.xml. This dump can then be imported into the existing database (alignments.xml) using the corpus editor.
Trains a model for alignment entities.
Reads the alignment corpus and extracts potential lemmas that are missing in the lemma corpus.
Lets you manually select alignments from the provided .tmx file. It can be obtained from here. For now, the dump is expected to provide German-Polish alignments. The purpose is to select viable sentences for each of the German adpositions defined in German.Adpositions. Sentences are tagged with a flag indicating their fitness. The tool can be run several times; previously tagged sentences are skipped on later runs.
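The resumable behaviour of the selection tool can be sketched as follows. This is an assumption-laden illustration rather than the tool's real code: `SelectionSketch`, `Candidate`, `Fitness`, `pending` and `review` are hypothetical names, and the set of already tagged sentences is modelled as an in-memory map keyed by sentence ID instead of whatever storage the tool actually uses.

```scala
object SelectionSketch {
  // Flag stored with each reviewed sentence (hypothetical values).
  sealed trait Fitness
  case object Fit   extends Fitness
  case object Unfit extends Fitness

  // One German-Polish sentence pair from the dump (hypothetical shape).
  final case class Candidate(id: String, german: String, polish: String)

  // Sentences still awaiting a decision: everything without a flag so far.
  def pending(candidates: Seq[Candidate],
              tagged: Map[String, Fitness]): Seq[Candidate] =
    candidates.filterNot(c => tagged.contains(c.id))

  // Runs one session: `decide` is only asked about sentences not yet flagged,
  // and earlier decisions are carried over unchanged.
  def review(candidates: Seq[Candidate], tagged: Map[String, Fitness])
            (decide: Candidate => Fitness): Map[String, Fitness] =
    pending(candidates, tagged).foldLeft(tagged) { (acc, c) =>
      acc + (c.id -> decide(c))
    }
}
```

Feeding the result of one `review` call into the next as `tagged` is what makes repeated runs cheap: a second session over the same candidates only visits the sentences that are still unflagged.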
This document explains the underlying concepts and provides an analysis of the German-Polish corpus with regard to the annotation of adpositions.
Poliglot is licensed under the terms of the Apache v2.0 license.
- Tim Nieradzik