languagemachines / ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto

Home Page: https://languagemachines.github.io/ucto

License: GNU General Public License v3.0

natural-language-processing language nlp computational-linguistics tokeniser punctuation folia

ucto's Introduction

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Ucto - A rule-based tokeniser

KNAW Humanities Cluster
Centre for Language and Speech technology, Radboud University Nijmegen
Induction of Linguistic Knowledge Research Group, Tilburg University

Website: https://languagemachines.github.io/ucto/

Ucto tokenizes text files: it separates words from punctuation, and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation.

Ucto comes with tokenisation rules for several languages (packaged separately) and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog (https://languagemachines.github.io/frog), our Dutch morpho-syntactic processor.

The software is intended to be used from the command line by researchers in Natural Language Processing or related areas, as well as software developers. A Python binding for Ucto is also available separately.

Features:

  • Comes with tokenization rules for English, Dutch, French, Italian, Turkish, Spanish, Portuguese and Swedish; easily extensible to other languages. Rules consist of regular expressions and lists. They are packaged separately as uctodata.
  • Recognizes units, currencies, abbreviations, and simple dates and times like dd-mm-yyyy
  • Recognizes paired quote spans, sentences, and paragraphs.
  • Produces UTF-8 encoded, NFC-normalized output, optionally accepting other input encodings as well.
  • Ligature normalization (can, for instance, undo ligatures such as fi and fl that are encoded as single codepoints).
  • Optional conversion to all lowercase or uppercase.
  • Supports FoLiA XML
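
The ligature normalization mentioned in the feature list can be illustrated outside of Ucto with Python's standard library. The sketch below is not Ucto's implementation; the function name `expand_ligatures` and the choice to restrict the expansion to the Latin ligature codepoints U+FB00 to U+FB06 are assumptions made for illustration.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06).
# Restricting the expansion to this range is an assumption of this sketch.
LIGATURE_FIRST = "\ufb00"
LIGATURE_LAST = "\ufb06"


def expand_ligatures(text: str) -> str:
    """Replace single-codepoint Latin ligatures by their letters, then apply NFC."""
    pieces = []
    for ch in text:
        if LIGATURE_FIRST <= ch <= LIGATURE_LAST:
            # Compatibility decomposition turns e.g. U+FB01 into "f" + "i".
            pieces.append(unicodedata.normalize("NFKD", ch))
        else:
            pieces.append(ch)
    return unicodedata.normalize("NFC", "".join(pieces))


print(expand_ligatures("\ufb01nal \ufb02ow"))  # prints "final flow"
```

Only the ligature codepoints are decomposed here, so other compatibility characters (superscripts, full-width forms) are left untouched.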

Ucto was written by Maarten van Gompel and Ko van der Sloot. Work on Ucto was funded by NWO, the Netherlands Organisation for Scientific Research, under the Implicit Linguistics project, the CLARIN-NL program, and the CLARIAH project.

This software is available under the GNU General Public License v3 (see the file COPYING).

Installation

To install Ucto, first check whether your distribution's package manager has an up-to-date package:

  • Alpine Linux users can do apk add ucto.
  • Debian/Ubuntu users can do apt install ucto but this version will likely be significantly out of date!
  • Arch Linux users can install ucto via the AUR.
  • macOS users with homebrew can do: brew tap fbkarsdorp/homebrew-lamachine && brew install ucto
  • An OCI container image is also available and can be used with Docker: docker pull proycon/ucto. Alternatively, you can build an OCI container image yourself using the provided Dockerfile in this repository.

To compile and install manually from source:

$ bash bootstrap.sh
$ ./configure
$ make
$ sudo make install

If you want to automatically download, compile and install the latest stable versions of the required dependencies, then run ./build-deps.sh prior to the above. You can pass a target directory prefix as first argument and you may need to prepend sudo to ensure you can install there. The dependencies are:

  • ticcutils - A shared utility library
  • libfolia - A library for the FoLiA format.
  • uctodata - Data files for ucto, packaged separately

If you already have these dependencies, e.g. through a package manager or manually installed, then you should skip this step.

You will still need to take care to install the following 3rd party dependencies through your distribution's package manager, as they are not provided by our script:

  • icu - A C++ library for Unicode and Globalization support. On Debian/Ubuntu systems, install the package libicu-dev.
  • libxml2 - An XML library. On Debian/Ubuntu systems install the package libxml2-dev.
  • libexttextcat - A language detection package.
  • A sane build environment with a C++ compiler (e.g. gcc 4.9 or above or clang), make, autotools, libtool, pkg-config

Usage

Tokenize an English text file to standard output; tokens will be space-separated and sentences delimited by <utt>:

$ ucto -L eng yourfile.txt

The -L flag specifies the language (as a three letter iso-639-3 code), provided a configuration file exists for that language. The configurations are provided separately, for various languages, in the uctodata package. Note that older versions of ucto used different two-letter codes, so you may need to update the way you invoke ucto.

To output to file instead of standard output, just add another positional argument with the desired output filename.

If you want each sentence on a separate line (i.e. newline delimited rather than delimited by <utt>), then pass the -n flag. If each sentence is already on one line in the input and you want to leave it at that, pass the -m flag.
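
If you post-process the default output in a script, the <utt> delimiter described above can be used to recover sentences. The following is a minimal sketch; the function name `split_sentences` and the sample string are made up for illustration and are not actual Ucto output.

```python
from typing import List


def split_sentences(tokenized: str, delimiter: str = "<utt>") -> List[List[str]]:
    """Group space-separated tokens into sentences, splitting on the delimiter token."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokenized.split():
        if token == delimiter:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens without a closing delimiter
        sentences.append(current)
    return sentences


# Hypothetical sample in the space-separated, <utt>-delimited format.
sample = "This is a test . <utt> Another one ! <utt>"
for sentence in split_sentences(sample):
    print(sentence)
```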

Tokenize plaintext to FoLiA XML using the -X flag. You can specify an ID for the FoLiA document using the --id= flag.

$ ucto -L eng -X --id=hamlet hamlet.txt hamlet.folia.xml

Note that in the FoLiA XML output, ucto encodes the class of the token (date, url, smiley, etc.) based on the rule that matched.

For further documentation consult the ucto documentation.

Container Usage

A pre-made container image can be obtained from Docker Hub as follows:

docker pull proycon/ucto

You can build a Docker container as follows. Make sure you are in the root of this repository:

docker build -t proycon/ucto .

This builds the latest stable release. If you want to use the latest development version from the git repository instead, do:

docker build -t proycon/ucto --build-arg VERSION=development .

Run the container interactively as follows. You can pass any additional arguments that ucto takes.

docker run -t -i proycon/ucto

Add the -v /path/to/your/data:/data parameter (before -t) if you want to mount your data volume into the container at /data.

Webservice

If you are looking to run Ucto as a webservice yourself, please see https://github.com/proycon/ucto_webservice . It is not included in this repository.

ucto's People

Contributors

irishx, kosloot, proycon

ucto's Issues

cannot configure package

I am trying to install ucto from source but I am stuck with configure which keeps throwing:

checking that generated files are newer than configure... done
configure: error: conditional "OLD_LM" was never defined.
Usually this means the macro was only invoked conditionally.

I remember being able to compile it in the past. Should I perhaps check out a particular stable version?

unsupported language 'eng'

I've installed ucto following the manual instructions. I also installed uctodata. I used the prefix /usr/local for all dependencies, including ucto and uctodata itself.

When I run ucto I get

ucto: unsupported language 'eng'
ucto: Available Languages: 

Any clues?

best

combinations of words/numbers with abbreviations are incorrectly handled

When using lamachine (frog) I noticed that sometimes characters disappear (in tokenization?). A few examples are:

met een kreatininespiegel tussen de 120 en de 140mmol/L.
Becomes:
met een kreatininespiegel tussen de 120 en de 140mmol L.

Berkhof, arts-assistente longziekten Berkhof arts-assistente longziekten/dr.
Becomes:
Berkhof , arts-assistente longziekten Berkhof arts-assistente longziekten dr.

Lengte 1,85m.
Becomes:
Lengte 1,8 m.

ECG: sinusritme 70 slagen per minuut, smal QRS, oud voorwandinfarct, QTc 426ms.
Becomes:
ECG : sinusritme 70 slagen per minuut , smal QRS , oud voorwandinfarct , QTc 42 ms.

cyste, astma of mengbeeld met COPD, FEV1 0.98L.
Becomes:
cyste , astma of mengbeeld met COPD , FEV1 0.9 L.

ucto slow on very long lines?

There is some weak evidence that ucto slows down more than linearly for very long input lines (more than 2000 words or so).
This should be investigated.
Goal: ucto should be O(n).

Handling of abbreviations followed by punctuation goes awry

Ucto does great overall, but it seems to have some trouble with abbreviations before other punctuation:

Wil je mij er even langs laten a.j.b., ik heb haast.
In het interview zei hij o.a.: "Ucto is cool."

With the standard tokconfig-nl this leads to more sentence splits than necessary.

Multi label rules

For rules with more than one capture group, it would be nice to have multiple labels, one for each group.

E.g.:
In English you want to split 'won't' into two tokens: 'wo' and 'n't'.
The first is labeled as a WORD (??), the second as a SUFFIX.
The rule now is:
SUFFIX = ((?:\p{L})+)( %SUFFIXES% )(?:\Z|\P{L})
where n't is one of the possible SUFFIXES

How to accomplish what we want? Not sure.

For rules with more than one capture group you could think along the lines of:
WORD+SUFFIX = ....some regexp with 2 capture groups
group 1 is assigned WORD
group 2 is assigned SUFFIX
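
To make the proposal concrete, here is a rough sketch in Python of how one label per capture group could work. It is not Ucto code: the suffix list, the names `RULE`, `GROUP_LABELS` and `split_with_labels` are hypothetical, and because Python's `re` module has no `\p{L}`, the class `[^\W\d_]` is used as an approximation of "letter".

```python
import re

# Hypothetical stand-in for %SUFFIXES%.
SUFFIXES = ["n't", "'s", "'ll"]

LETTER = r"[^\W\d_]"       # approximates \p{L}
NON_LETTER = r"[\W\d_]"    # approximates \P{L}
SUFFIX_ALTERNATION = "|".join(re.escape(s) for s in SUFFIXES)

# Same shape as the SUFFIX rule in the issue: two capture groups.
RULE = re.compile(
    "((?:" + LETTER + ")+)(" + SUFFIX_ALTERNATION + r")(?:\Z|" + NON_LETTER + ")"
)

# "WORD+SUFFIX": one label per capture group, in group order.
GROUP_LABELS = ("WORD", "SUFFIX")


def split_with_labels(token):
    """Return (text, label) pairs, one per capture group, or the token as a WORD."""
    match = RULE.match(token)
    if match is None:
        return [(token, "WORD")]
    return [(match.group(i + 1), label) for i, label in enumerate(GROUP_LABELS)]


print(split_with_labels("won't"))  # [('wo', 'WORD'), ("n't", 'SUFFIX')]
```

The rule name itself ("WORD+SUFFIX") could be split on "+" to obtain the label tuple, which would keep the configuration syntax close to the existing single-label rules.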

Tokenization of bracketed abbreviations is problematic

ucto correctly states that A.F.K. is an ABBREVIATION:

ucto> een A.F.K.
een	WORD	BEGINOFSENTENCE NEWPARAGRAPH 
A.F.K.	ABBREVIATION	ENDOFSENTENCE 

But putting this inside brackets () {} [] <> fails:

ucto> een <A.F.K.>
een	WORD	BEGINOFSENTENCE 
<	SYMBOL	NOSPACE 
A	WORD	NOSPACE 
.	PUNCTUATION	NOSPACE 
F	WORD	NOSPACE 
.	PUNCTUATION	NOSPACE 
K	WORD	NOSPACE 
.	PUNCTUATION	NOSPACE 
>	SYMBOL	ENDOFSENTENCE 

Autoconf template has errors (with autoconf 2.69)

I'm at dd2f374 on origin/master.

sander@Yoga:/opt/installers/ucto/ucto/  autoconf
configure.ac:8: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:31: error: possibly undefined macro: AC_PROG_LIBTOOL
configure.ac:74: error: possibly undefined macro: AC_MSG_FAILURE
sander@Yoga:/opt/installers/ucto/ucto/  autoconf -V
autoconf (GNU Autoconf) 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+/Autoconf: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>, <http://gnu.org/licenses/exceptions.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David J. MacKenzie and Akim Demaille.

separate data from code

It would be convenient to have the config data in a separate package, just like frogdata for frog.
This would allow for updates in the rules/languages without disturbing other software depending on Ucto

Problem with labeled lists

I have some files which have labels preceding list items.

I have some difficulty encoding this in a way that ucto and foliavalidator will accept; cf. the attached files and error messages.

Neither
<item>
<label>a</label>
<t> .... text </t>
</item>

nor
<item>
<label>a</label>
<part><t> .... text </t></part>
</item>

works for me. (The last option is accepted by foliavalidator.)

error.log
metPart.mini.xml.txt
zonderPart.mini.xml.txt

turning off sentence detection fails

Goal: run the tokenizer without changing sentence boundaries: number of input lines and output lines need to be the same.

There seem to be several parameters for this:
both -n and -m do this, and -S too?

running
ucto -S -n -m -L deu < bla > DE.trusted.clean.test.pt2.de.tok2

fails to skip the 2 empty lines in the file.

(data is at: /vol/bigdata2/datasets2/TraMOOC/Data/Wikifier2017/TraMOOCtest)

Date tagging can be improved

Context

Versions

Ucto - Unicode Tokenizer - version 0.9.5
(c) ILK 2009 - 2014, Induction of Linguistic Knowledge Research Group, Tilburg University
Licensed under the GNU General Public License v3
based on [libfolia 1.6]

Input

19 mei 2014
19.05.2014
19-05-2014
19-05-'14
19-5-'14
19 mei '14

Invocation

ucto -X -m uctotest.txt out.xml

Expected

All are recognized as dates.

Actual

Only ‘19-05-2014’ is recognized as a date.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="untitleddoc" generator="libfolia-v1.6" version="1.4.0">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-01-15T16:45:12" set="tokconfig-generic"/>
    </annotations>
    <meta id="language">default</meta>
  </metadata>
  <text xml:id="untitleddoc.text">
    <p xml:id="untitleddoc.p.1">
      <t>19 mei 2014 19.05.2014 19-05-2014 19-05-'14 19-5-'14 19 mei '14</t>
      <s xml:id="untitleddoc.p.1.s.1">
        <t>19 mei 2014</t>
        <w xml:id="untitleddoc.p.1.s.1.w.1" class="NUMBER">
          <t>19</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.1.w.2" class="WORD">
          <t>mei</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.1.w.3" class="NUMBER">
          <t>2014</t>
        </w>
      </s>
      <s xml:id="untitleddoc.p.1.s.2">
        <t>19.05.2014</t>
        <w xml:id="untitleddoc.p.1.s.2.w.1" class="NUMBER">
          <t>19.05.2014</t>
        </w>
      </s>
      <s xml:id="untitleddoc.p.1.s.3">
        <t>19-05-2014</t>
        <w xml:id="untitleddoc.p.1.s.3.w.1" class="DATE">
          <t>19-05-2014</t>
        </w>
      </s>
      <s xml:id="untitleddoc.p.1.s.4">
        <t>19-05-'14</t>
        <w xml:id="untitleddoc.p.1.s.4.w.1" class="NUMBER" space="no">
          <t>19</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.4.w.2" class="NUMBER" space="no">
          <t>-05</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.4.w.3" class="PUNCTUATION" space="no">
          <t>-</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.4.w.4" class="NUMBER-YEAR">
          <t>'14</t>
        </w>
      </s>
      <s xml:id="untitleddoc.p.1.s.5">
        <t>19-5-'14</t>
        <w xml:id="untitleddoc.p.1.s.5.w.1" class="NUMBER" space="no">
          <t>19</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.5.w.2" class="NUMBER" space="no">
          <t>-5</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.5.w.3" class="PUNCTUATION" space="no">
          <t>-</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.5.w.4" class="NUMBER-YEAR">
          <t>'14</t>
        </w>
      </s>
      <s xml:id="untitleddoc.p.1.s.6">
        <t>19 mei '14</t>
        <w xml:id="untitleddoc.p.1.s.6.w.1" class="NUMBER">
          <t>19</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.6.w.2" class="WORD">
          <t>mei</t>
        </w>
        <w xml:id="untitleddoc.p.1.s.6.w.3" class="NUMBER-YEAR">
          <t>'14</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>
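
As a starting point for discussion, the six inputs above can all be covered by a single regular expression. The sketch below uses Python rather than Ucto's rule syntax, and the pattern, the Dutch month list and the helper name `is_date` are assumptions for illustration, not the rules shipped in the Ucto configuration.

```python
import re

# Day, then either a numeric month between two identical separators,
# or a Dutch month name between whitespace, then a 4-digit or 'yy year.
DATE = re.compile(
    r"""
    (?P<day>\d{1,2})
    (?:
        (?P<sep>[-.])(?P<month>\d{1,2})(?P=sep)
      |
        \s+(?P<month_name>januari|februari|maart|april|mei|juni|juli
                         |augustus|september|oktober|november|december)\s+
    )
    (?P<year>\d{4}|'\d{2})
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_date(text: str) -> bool:
    """True if the whole string looks like one of the date formats above."""
    return DATE.fullmatch(text.strip()) is not None


for line in ["19 mei 2014", "19.05.2014", "19-05-2014",
             "19-05-'14", "19-5-'14", "19 mei '14"]:
    print(line, is_date(line))
```

Note that a date with a month name spans several whitespace-separated tokens, so a real rule would also need to operate across token boundaries.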

Difficulties with complex <t> contents

Some (not always very meaningful) internally complex t-elements give problems.

Typical exception:
ucto: no such text: s::textcontent(default)

Example:

voorbeeld_NoSuchText.xml.txt

<head xml:id="TEI.1.text.1.front.1.div1.4.head.1"><t class="default"><t-style class="italic">Op zijnen</t-style> YSTROOM.<br/></t></head>

Refactor `Setting::read`

This variable is only used once, in a much narrower scope.
This variable is written to and read from in a place while uninitialized.

The whole method is quite lengthy and complicated, and could be decomposed. Do you need help with any of this?

Unable to load shared libraries

I compiled the source code:

$ bash bootstrap.sh
$ ./configure && make && make install

I had previously installed libfolia from source. When I try to run ucto, I get:

$ ucto
/usr/local/bin/ucto: error while loading shared libraries: libfolia.so.6:

My system is Ubuntu Server 16.04 LTS

Retain --with-icu

It seems to be needed on Mac OS X (at least LaMachine invokes it).

REGEXP support not available

I successfully installed ucto following the instructions. However, whenever I try to run it, I get that warning.

I've also noticed that no paragraphs are being detected when using python-ucto (token.isnewparagraph() is always False) even though text looks like this:

I was almost startled into the water from my perch on the alder roots by
a voice saying:

"Well, what is there to look at?" My friend was a young farmer, stoutly
built, brown eyed, with a naturally fair skin burned dark and freckled
in patches. He laughed, seeing me start, and looked down at me with lazy
curiosity.

"I was thinking the place seemed old, brooding over its past."

He looked at me with a lazy indulgent smile, and lay down on his back on
the bank, saying: "It's all right for a doss here."

"Your life is nothing else but a doss. I shall laugh when somebody jerks
you awake," I replied.

He smiled comfortably and put his hands over his eyes because of the
light.

but I am not sure if that's related.

Ucto attempts to double-append the same paragraph when processing tables

Source document (minimalish example): https://download.anaproy.nl/cell.xml

Error:

$ ucto -Lnld cell.xml                                                                                                                                                                                         
ucto: inputfile = cell.xml
ucto: outputfile = 
ucto:tokconfig-nld: version=0.2
ucto:ucto: --filter=NO is automatically set. inputclass equals outputclass!
ucto: duplicate ID : cast005cons01_01.p.1

That paragraph is something ucto creates, it does not exist yet in the input.

Expected behaviour: <s> under <cell> (or <p> with <s> under cell would be acceptable too.)

Release ucto v0.9.7?

Last release was in January and there has been considerable work since (almost 100 commits). Normal LaMachine users will still not benefit from all of this yet until release. I recommend releasing as soon as the version is deemed stable. (along with uctodata if applicable)

parsing very long integers takes exponential time

When ucto needs to tokenize a sequence like 123456789 all seems well, but
the longer the sequence, the more time it takes, and the growth is exponential!

So 123456789012345789012 takes 4 minutes,
while 12345678901234578901 takes only 2
(still WAY too long).

Abbreviations at the end of sentence not handled correctly

In line with issue #3, abbreviations at the end of a sentence seem not to be handled correctly, some examples follow. This might be harder to fix than the previous issue; as it's not obvious when an abbreviation really closes the sentence...

De beurzen stegen met 3 Pct. Dat was een prima resultaat.
Met GitHub kan je issues taggen, pull requests mergen, e.d. Dat is super!

Greek encoding

Here is a puzzling one:

antikeimenou.txt
error.log
HuygensING-epistolarium-1-1_0005f254-51d2-4dca-aeea-6abbf37711e4.xml.txt

The two versions of the text (before and after tokenization) seem the same, but actually they differ in the word ἀντιϰειμένου!

ἀντιϰειμένου
ἀντιϰειμένου

ἀ 1f00
ν 3bd
τ 3c4
ι 3b9
ϰ 3f0
ε 3b5
ι 3b9
μ 3bc
έ 1f73 Unicode Character 'GREEK SMALL LETTER EPSILON WITH OXIA' (U+1F73)
ν 3bd
ο 3bf
υ 3c5

ἀ 1f00
ν 3bd
τ 3c4
ι 3b9
ϰ 3f0
ε 3b5
ι 3b9
μ 3bc
έ 3ad Unicode Character 'GREEK SMALL LETTER EPSILON WITH TONOS' (U+03AD)
ν 3bd
ο 3bf
υ 3c5

This probably has something to do with these two representing the same 'abstract character'?? (https://en.wikipedia.org/wiki/Unicode#Abstract_characters)
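
The difference can be reproduced with Python's standard library. U+1F73 has a canonical singleton decomposition to U+03AD, so NFC normalization (which the feature list says Ucto applies to its output) replaces one with the other. The sketch below only demonstrates the Unicode behaviour; whether this is the exact code path inside Ucto is an assumption.

```python
import unicodedata

# The word as listed in the first table (with U+1F73, EPSILON WITH OXIA).
before = "\u1f00\u03bd\u03c4\u03b9\u03f0\u03b5\u03b9\u03bc\u1f73\u03bd\u03bf\u03c5"

# NFC maps U+1F73 to U+03AD (EPSILON WITH TONOS); the other characters are unchanged.
after = unicodedata.normalize("NFC", before)

# Collect the positions where the two strings differ.
changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]

for i in changed:
    print(i, hex(ord(before[i])), "->", hex(ord(after[i])))  # 8 0x1f73 -> 0x3ad
```

Comparing the input and output after normalizing both to NFC makes the two versions equal, which is usually the right way to test for this kind of difference.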

Ucto fails to tokenise certain folia input?

Unexpected failure in tokenisation due to clashing text content elements. Input document is /vol/tensusers/proycon/_gen001gent01_01.folia.xml.

Command error:                                                
  ucto: --filter=NO is automaticly set. inputclass equals outputclass!                                                      
  ucto: inputfile = _gen001gent01_01.folia.xml                
  ucto: outputfile = _gen001gent01_01.tok.folia.xml           
  ucto:tokconfig-nld: version=0.2                             
  ucto: XML error: attempt to add <t> with class=current and text 'De Zuidelijke Nederlanden in de zestiende eeuw en de     
  negentien kamers die deelnamen aan het Gentse rederijkersfeest van                                                        
  1539.' to element: _gen001gent01_01.TEI.2.text.body.div.div.p.418.s.1 with parent _gen001gent01_01.TEI.2.text.body.div.div.p.418 which already has a <t> with that class and text: 'De Zuidelijke Nederlanden in de zestiende eeuw en de              
                                         negentien kamers die deelnamen aan het Gentse rederijkersfeest van                 
                                         1539.' 

enable alternative search paths for uctodata

Not all configurations allow installing uctodata stuff in $datadir/ucto (notably MacOSX)
where $datadir is something like $prefix/share
The alternative would be to install the ucto datafiles in $datadir/uctodata (that IS possible @fbkarsdorp ?)
ucto could then use a search path: first $datadir/ucto, and then $datadir/uctodata

Is this an option to try?
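
The lookup order proposed here could be sketched as follows. This is an illustration in Python, not Ucto's C++ code; the function name `find_config`, the file name `tokconfig-eng` and the `/usr/local/share` location standing in for `$datadir` are assumptions.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_config(name: str, search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing file called `name` in the given directories, in order."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Hypothetical $datadir; the real value depends on the configured prefix.
datadir = Path("/usr/local/share")
search_path = [datadir / "ucto", datadir / "uctodata"]

# Result depends on what is installed on the machine running this.
print(find_config("tokconfig-eng", search_path))
```

Because the first hit wins, files in `$datadir/ucto` would shadow files of the same name in `$datadir/uctodata`.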

Tokenize ALL FoLiA elements that carry text

At the moment ucto ignores <t> elements on FoLiA tags like <head>, <note> etc.
These should be tokenized too (assuming that they don't already have deeper structure like <s>)

A point of discussion might be whether this is done unconditionally or not.

Some edge cases for nld rules

I think I may have found a few edge cases where the rules for Dutch split words incorrectly:

  • *a's has the 's chopped off, which seems inconsistent at least (oma's -> oma + 's, drama's -> drama + 's, but opa's -> opa's); seems the pattern for this to occur is -a's*, with * some non-word, non-whitespace character.
  • Iraki's does have its 's cut off for some reason, just like neonazi's. Seems the closest pattern for this might be -i's*, with * some non-word, non-whitespace character.
  • 37.501ste has the period treated like a regular one to get 37 + . + 501ste, so ordinals greater than 999 are probably not dealt with correctly.
  • 16- en 17-jarige has the - separated from 16 while with words, such a hyphen is left attached
  • SP.A is a terrible one, but it does have to stick together and is split up right now (maybe keep a list of such exceptions hardcoded somewhere?)
  • ' after s is unconditionally seen as possessive, but may be simply because a word is in quotes, e.g. " 'Chaos' is het woord dat het vaakst voorkomt. "
  • several abbreviations from nld-afk are also common nouns: fa (F note), pers (the press), var (an animal) and a verb, verg (~require).
  • colons (:) seem to stick to any preceding punctuation, not sure that's supposed to happen.

Ucto fails on XML comments

I have an input FoLiA document containing XML comments (retained in conversion from TEI):

<utt xml:id="tuin005oors02_01.TEI.2.text.body.div.div.lg.l.9691"> <t>Haec benè si serves, tu longo tempore vives.</t>                                                 
    <!--[/Lat]--></utt>  

Ucto chokes on this:

$ ucto -L nld -X -F tuin005oors02_01.folia.xml tuin005oors02_01.tok.folia.xml
....
Command error:
  ucto: --filter=NO is automaticly set. inputclass equals outputclass!
  ucto: inputfile = tuin005oors02_01.folia.xml
  ucto: outputfile = tuin005oors02_01.tok.folia.xml
  ucto:tokconfig-nld: version=0.2
  ucto:ucto: --filter=NO is automaticly set. inputclass equals outputclass!
  ucto: no such text: NON printable element: _XmlComment

Input file is in /vol/tensusers/proyon/tuin005oors02_01.folia.xml

Feature request: Rule type applied prior to whitespace tokenization, to allow protecting token sequences

I'm trying to protect token sequences enclosed in tags, e.g. <PER>John Smith</PER>. This is easily covered by the following regex:

\<.*?\>[\w\s]+?\<\/.*?\> (where \w\s may be . , if it also matches whitespace)

However, this appears to be impossible due to the fact that before considering any rules, the tokenizer splits on whitespaces. Would it be possible to add a type of rules that are evaluated prior to whitespace tokenization? Otherwise, token sequences that should be kept together have to be concatenated using special characters, reconstructed after tokenization or substituted with placeholders prior to tokenization, all of which cause overhead.
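
What such a pre-whitespace rule would do can be sketched in a few lines of Python. This is not Ucto behaviour and not an existing option; the pattern (a stricter variant of the regex above that requires matching open and close tag names) and the function name `split_protecting_tags` are assumptions for illustration.

```python
import re

# A tagged span such as <PER>John Smith</PER>; the closing tag must match the opening one.
PROTECTED = re.compile(r"<(\w+)>.*?</\1>")


def split_protecting_tags(text):
    """Split on whitespace, but keep every tagged span together as a single token."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        tokens.extend(text[pos:match.start()].split())  # ordinary text before the span
        tokens.append(match.group(0))                   # the protected span, untouched
        pos = match.end()
    tokens.extend(text[pos:].split())                   # ordinary text after the last span
    return tokens


print(split_protecting_tags("Yesterday <PER>John Smith</PER> left ."))
# ['Yesterday', '<PER>John Smith</PER>', 'left', '.']
```

The regular tokenization rules would then be applied only to the tokens that did not come from a protected span.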

misplaced uctodata warning for tokconfig-generic configuration

On ucto 0.9.3 with uctodata 0.2, and also reproduces on unreleased ucto 0.9.4:

configfile = tokconfig-generic
inputfile =
outputfile =
Initiating tokeniser...
skipping META rule: 'NUMBER-ORDINAL'
skipping META rule: 'ABBREVIATION-KNOWN'
skipping META rule: 'WORD-TOKEN'
skipping META rule: 'PREFIX'
skipping META rule: 'SUFFIX'
WARNING: your datafile seems out of date!
         for best results, you should use uctodata version >=0.2

Ucto sentence splitting can cause FoLiA text redundancy errors

This is a pretty tough one I'm afraid. Take the following FoLiA paragraph from a real example:

<p xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1" class="p">                                                                                                                                                                                  
 <t>Zoo ooit, dan heb ik in de laatste maanden betreurd dat ik mijn ‘dagboek’ zoo geheel en al verwaarloosd heb. ’t Gaat mooi met de ‘Katholieke Sociale Actie’ (<t-style class="italic">K.S.A.</t-style>). Wat zou ’t me nu niet waard zijn in dit dagboek eens een precies verhaal te bezitten, hoe die zaak is tot stand gekomen.</t>
</p>

Ucto tokenises this wrongly as it inserts a sentence split inside the K.S.A abbreviation:

              <w xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1.s.2.w.12" class="WORD" set="tokconfig-nld" space="no">                                                                                                                                          
                <t>K</t>                                                                                                                                                                                                                                                                                                                                                                                                                                                               
              </w>                                                                                                                                                                                                                                             
              <w xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1.s.2.w.13" class="PUNCTUATION" set="tokconfig-nld" space="no">                                                                                                                                   
                <t>.</t>                                                                                                                                                                                                                                                                                                                                                                                                                                                               
              </w>                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
            </s>                                                                                                                                                                                                                                               
            <s xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1.s.3">                                                                                                                                                                                             
              <w xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1.s.3.w.1" class="WORD" set="tokconfig-nld" space="no">                                                                                                                                           
                <t>S</t>                                                                                                                                                                                                                                                                                                                                                                                                                                                           
              </w>                                                                                                                                                                                                                                             
              <w xml:id="TEI.1.text.1.body.1.div1.1.div2.1.p.1.s.3.w.2" class="PUNCTUATION" set="tokconfig-nld" space="no">                                                                                                                                    
                <t>.</t>                                                                                                                                                                                                                                       
                <pos class="LET()" confidence="1" head="LET" />                                                                                                                                                                                                
                <lemma class="." />                                                                                                                                                                                                                            
              </w>                                                                                                                                                                                                                                             
            </s>                                                                                                                                                                                                                                               

Of course, wrong tokenisation can and will occur; though not ideal, that is not the subject of this bug report. The problem is that this produces a text consistency error, so ucto/frog generates invalid FoLiA! Because of the sentence split we get "K.\sS.A" instead of "K.S.A", and FoLiA v1.5 trips over this, which is pretty serious.
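
To illustrate the consistency problem, here is a minimal Python sketch. It is not FoLiA's or ucto's actual validation code: the helper names (`sentence_text`, `paragraph_text`) and the rule that sentences are joined with a single space are assumptions made for illustration only.

```python
def sentence_text(tokens):
    """Rebuild a sentence string from (text, space_after) pairs."""
    out = []
    for text, space_after in tokens:
        out.append(text)
        if space_after:
            out.append(" ")
    return "".join(out).strip()


def paragraph_text(sentences):
    """Join sentence strings with one space (assumed delimiter)."""
    return " ".join(sentence_text(s) for s in sentences)


original = "K.S.A"

# One sentence: every token is marked space="no", so the text round-trips.
unsplit = [[("K", False), (".", False), ("S", False), (".", False), ("A", False)]]

# A sentence break after "K." inserts a delimiter the source never had.
split = [[("K", False), (".", False)],
         [("S", False), (".", False), ("A", False)]]

print(paragraph_text(unsplit) == original)  # True
print(paragraph_text(split))                # K. S.A
```

Under these assumptions, the space="no" attribute on the last token of a sentence cannot prevent the delimiter between sentences, which is why the split alone is enough to make the paragraph text inconsistent.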

assigning paragraphs to FoLiA structure elements, yes, no, maybe?

The code that assigns higher-level FoLiA structure tags to tokenized text from FoLiA documents is rather messy.
An attempt is made to determine whether a 'root' bearing the text is a structure element or not.
But this check is not exhaustive (recently we added Cell to the list).
A more generic solution would be preferable.
I tried such an approach, but it raises a question: do we always want to generate a Paragraph, even when only one sentence is present? That might be overkill.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

The current implementation generates the following tokenization:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:14:58" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Word</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>one</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

With the more generic approach, the cell would also get a paragraph:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:17:01" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Word one</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Word</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>one</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

This redundancy seems a bit of overkill, but now consider this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA generator="teiExtractText.pl" version="1.4" xml:id="doc" xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

After tokenization we get:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <s xml:id="cell.1.s.1">
              <w xml:id="cell.1.s.1.w.1" class="WORD">
                <t>Een</t>
              </w>
              <w xml:id="cell.1.s.1.w.2" class="WORD">
                <t>lange</t>
              </w>
              <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                <t>zin</t>
              </w>
              <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.2">
              <w xml:id="cell.1.s.2.w.1" class="WORD">
                <t>Gevolgde</t>
              </w>
              <w xml:id="cell.1.s.2.w.2" class="WORD">
                <t>door</t>
              </w>
              <w xml:id="cell.1.s.2.w.3" class="WORD">
                <t>nog</t>
              </w>
              <w xml:id="cell.1.s.2.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                <t>Zin</t>
              </w>
              <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                <t>.</t>
              </w>
            </s>
            <s xml:id="cell.1.s.3">
              <w xml:id="cell.1.s.3.w.1" class="WORD">
                <t>Dit</t>
              </w>
              <w xml:id="cell.1.s.3.w.2" class="WORD">
                <t>is</t>
              </w>
              <w xml:id="cell.1.s.3.w.3" class="WORD">
                <t>dus</t>
              </w>
              <w xml:id="cell.1.s.3.w.4" class="WORD">
                <t>een</t>
              </w>
              <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                <t>paragraaf</t>
              </w>
              <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                <t>?</t>
              </w>
            </s>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

And I think this is WRONG, or at least questionable.
Shouldn't it rather be:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="1.4">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-12-05T15:21:48" set="tokconfig-nld"/>
    </annotations>
    <meta id="language">nld</meta>
  </metadata>
  <text xml:id="text">
    <div xml:id="div.1">
      <table xml:id="table.1">
        <row xml:id="row.1">
          <cell xml:id="cell.1">
            <t>Een lange zin. Gevolgde door nog een Zin. Dit is dus een paragraaf?</t>
            <p xml:id="cell.1.p.1">
              <s xml:id="cell.1.s.1">
                <w xml:id="cell.1.s.1.w.1" class="WORD">
                  <t>Een</t>
                </w>
                <w xml:id="cell.1.s.1.w.2" class="WORD">
                  <t>lange</t>
                </w>
                <w xml:id="cell.1.s.1.w.3" class="WORD" space="no">
                  <t>zin</t>
                </w>
                <w xml:id="cell.1.s.1.w.4" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.2">
                <w xml:id="cell.1.s.2.w.1" class="WORD">
                  <t>Gevolgde</t>
                </w>
                <w xml:id="cell.1.s.2.w.2" class="WORD">
                  <t>door</t>
                </w>
                <w xml:id="cell.1.s.2.w.3" class="WORD">
                  <t>nog</t>
                </w>
                <w xml:id="cell.1.s.2.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.2.w.5" class="WORD" space="no">
                  <t>Zin</t>
                </w>
                <w xml:id="cell.1.s.2.w.6" class="PUNCTUATION">
                  <t>.</t>
                </w>
              </s>
              <s xml:id="cell.1.s.3">
                <w xml:id="cell.1.s.3.w.1" class="WORD">
                  <t>Dit</t>
                </w>
                <w xml:id="cell.1.s.3.w.2" class="WORD">
                  <t>is</t>
                </w>
                <w xml:id="cell.1.s.3.w.3" class="WORD">
                  <t>dus</t>
                </w>
                <w xml:id="cell.1.s.3.w.4" class="WORD">
                  <t>een</t>
                </w>
                <w xml:id="cell.1.s.3.w.5" class="WORD" space="no">
                  <t>paragraaf</t>
                </w>
                <w xml:id="cell.1.s.3.w.6" class="PUNCTUATION">
                  <t>?</t>
                </w>
              </s>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

A quick fix is 'easy': always add a paragraph level.
Alternatively, we could count the sentences and leave the paragraph out when only one sentence is present.
That would require exceptions again, I guess for 'div' and 'text' nodes at least. Maybe 'head' and others too?
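
The counting idea could look roughly like the following Python sketch. The function name `needs_paragraph`, the set `ALWAYS_WRAP`, and the reading that 'div' and 'text' always get a paragraph are assumptions for illustration, not ucto's actual behaviour.

```python
# Parents that always get a <p>, even for a single sentence (assumed set).
ALWAYS_WRAP = {"div", "text"}


def needs_paragraph(parent_tag: str, sentence_count: int) -> bool:
    """Decide whether tokenised sentences should be wrapped in a <p>."""
    if sentence_count == 0:
        return False  # nothing to wrap
    if parent_tag in ALWAYS_WRAP:
        return True   # the exception cases
    return sentence_count > 1


print(needs_paragraph("cell", 1))  # False
print(needs_paragraph("cell", 3))  # True
print(needs_paragraph("div", 1))   # True
```

The open question from above shows up directly here: every extra exception ('head', and so on) becomes another entry in the set, so the rule is only as generic as that list.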

detectlanguages should detect languages in FoLiA

The --detectlanguages option is confusing:
On plain text it means: detect the language, tokenize according to that language, and assign it to the FoLiA output.
On FoLiA input it means: check the language tag of each element and, when it is in the provided list, tokenize the element according to that language.

I think that for FoLiA input it should also be possible to really detect the language.
This will probably only work correctly on input documents without any language information, but that is still useful.
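
The proposed behaviour can be sketched as a small decision function in Python. Everything here is hypothetical: `language_to_tokenize`, its parameters, and the toy `stub_detect` detector are illustration only and do not correspond to ucto's internals or to any real language-detection library.

```python
from typing import Callable, Optional, Sequence


def language_to_tokenize(
    text: str,
    tagged_language: Optional[str],
    allowed: Sequence[str],
    detect: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the language to tokenize with, or None to skip the element."""
    if tagged_language is not None:
        # Behaviour described for FoLiA input: trust the tag, filter by the list.
        return tagged_language if tagged_language in allowed else None
    # Proposed extension: no tag present, so fall back to real detection.
    guess = detect(text)
    return guess if guess in allowed else None


def stub_detect(text: str) -> Optional[str]:
    """Toy detector: calls anything containing the word 'een' Dutch."""
    return "nld" if " een " in f" {text.lower()} " else None


print(language_to_tokenize("Dit is een zin.", None, ["nld", "eng"], stub_detect))
print(language_to_tokenize("Une phrase.", "fra", ["nld", "eng"], stub_detect))
```

In this sketch an existing language tag always wins over detection, which matches the remark that real detection is mainly useful for documents without any language information.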

TEXT VALIDATION ERROR (consistency)

Problem when tokenizing:

Type of message from foliavalidator:

TEXT VALIDATION ERROR: Text for Paragraph, ID TEI.1.text.1.body.1.div1.1.div2.2.div3.25.p.2, class default, is inconsistent: expected (after normalization):
...
got .
....

The problem appears to be caused by cases where ucto introduces a split inside a form that contains no spaces: seggen.te

Sijn ter vergaderinge gecompareert d’E. broederen Jacobus Laurentius ende Mattheus Meursius 3), dienaars des goddelijcken woorts, verthonende dat de predikant Hoornhovius 4), tegenwoordigh staande binnen Emmenes, sigh aan dese te werden als predikant, sonder sigh alvoren aan 't classis deser stede bekent gemaeckt te hebben, daer nogtans volgens voorgaende gebruyck ende goede correspondentie, dewelcke het voorn, classis altijt heeft gehouden met de kercken van Indien ende dese Camer, alle predikanten, proponenten en sieckentroosters gewoon sijn haar te addresseren aan 't meergenoemde classis deser stede; oversulcx dat 't selve alsnogh soude mogen werden gecontinueert en gepractiseert, sonder dat daarmede bedenckinge op de persoon van de voorn. Hoornhovius werde genomen, alsoo sij verklaeren de testimonia der kercken, die voorn. Hoornhovius voor hem braghte, gesien is, daarop niet te seggen.te hébben. Waarop bij de vergaderingh gedelibereert sijnde, goetgevonden is de voorn, broederen voor antwoort toe te voegen, dat de voorn. Hoornhoven sigh sal mogen aan 't gemelte classis addresseren, indien Sijn E. 't selve goetvint ende anders niet, alsoo de vergaderinge verstaat van het reght, hetwelck sij heeft omme predikanten aan te mogen nemen, geensints te willen cederen, maar 't selve aan haar te behouden.
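
When chasing such a report it helps to see where the two strings diverge. The following Python helper is only a debugging sketch, not part of foliavalidator; the whitespace normalisation it applies is an assumption about what "after normalization" means, and the two example strings are hypothetical reconstructions of the mismatch around "seggen.te".

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces (assumed normalisation)."""
    return " ".join(text.split())


def first_mismatch(expected: str, got: str, context: int = 10):
    """Return (index, expected_snippet, got_snippet), or None if equal."""
    a, b = normalize(expected), normalize(got)
    if a == b:
        return None
    limit = min(len(a), len(b))
    # First differing position, or the end of the shorter string.
    i = next((k for k in range(limit) if a[k] != b[k]), limit)
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]


# Hypothetical pair: the source has no space after the full stop.
expected = "daarop niet te seggen.te hébben."
got = "daarop niet te seggen. te hébben."
print(first_mismatch(expected, got))
```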

Sentence detection breaks structure

Hmm, I found a pretty serious text-altering bug. Consider the following input:

      <event class="poem" xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794">
        <utt xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2609"> <t>O! Philemon naar gaat,</t></utt>
        <utt xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610"> <t>In 't wout, seer droevigh, dwaalen,</t></utt>
     </event>

Ucto erroneously strips part of the first utterance text and appends a sentence at the end:

    <event xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794" class="poem">
      <utt xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2609">
        <t>O! Philemon naar gaat,</t>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2609.w.1" class="WORD" space="no">
          <t>O</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2609.w.2" class="PUNCTUATION">
          <t>!</t>
        </w>
      </utt>
      <utt xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610">
        <t>In 't wout, seer droevigh, dwaalen,</t>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.1" class="WORD">
          <t>In</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.2" class="WORD-TOKEN">
          <t>'t</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.3" class="WORD" space="no">
          <t>wout</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.4" class="PUNCTUATION">
          <t>,</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.5" class="WORD">
          <t>seer</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.6" class="WORD" space="no">
          <t>droevigh</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.7" class="PUNCTUATION">
          <t>,</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.8" class="WORD" space="no">
          <t>dwaalen</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.l.2610.w.9" class="PUNCTUATION">
          <t>,</t>
        </w>
      </utt>
      <s xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794.s.1">
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794.s.1.w.1" class="WORD">
          <t>Philemon</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794.s.1.w.2" class="WORD">
          <t>naar</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794.s.1.w.3" class="WORD" space="no">
          <t>gaat</t>
        </w>
        <w xml:id="west049zeed01_01.TEI.2.text.body.div.lg.2794.s.1.w.4" class="PUNCTUATION">
          <t>,</t>
        </w>
      </s>
    </event>

I caught this using the new text validation. If sentence detection is disabled, the problem does not occur. I would suggest disabling sentence detection under <utt> anyway, as it often does not work there. However, I don't know whether this problem might manifest itself in other situations as well (I'm thinking of 'part', for instance).
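
A check along these lines can be sketched in Python with the standard library. This is not the FoLiA text validation itself: the fragment is a shortened, namespace-free version of the output above with abbreviated ids, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened version of the faulty output (ids abbreviated, no FoLiA namespace).
FRAGMENT = """
<event class="poem">
  <utt xml:id="l.2609">
    <t>O! Philemon naar gaat,</t>
    <w space="no"><t>O</t></w>
    <w><t>!</t></w>
  </utt>
  <s>
    <w><t>Philemon</t></w>
    <w><t>naar</t></w>
    <w space="no"><t>gaat</t></w>
    <w><t>,</t></w>
  </s>
</event>
"""


def words_text(element):
    """Concatenate the <w> tokens directly under element, honouring space="no"."""
    parts = []
    for w in element.findall("w"):
        parts.append(w.findtext("t"))
        if w.get("space") != "no":
            parts.append(" ")
    return "".join(parts).strip()


def inconsistent_utterances(root):
    """Yield (id, declared text, token text) for each <utt> whose tokens do not match its <t>."""
    for utt in root.iter("utt"):
        declared = utt.findtext("t")
        rebuilt = words_text(utt)
        if declared != rebuilt:
            yield utt.get(XML_ID), declared, rebuilt


for item in inconsistent_utterances(ET.fromstring(FRAGMENT)):
    print(item)
```

On this fragment the first utterance is reported because its tokens only cover "O!", while the rest of the line has ended up in the stray <s> outside the <utt>.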

Improve the include mechanism for uctodata files

At the moment the include mechanism is a bit messy.
It is context-dependent and relies on a lot of implicit knowledge.

It would be convenient to be able to include files at any position and have ucto do 'the right thing'.

This is related to a more generic solution for #47

add possibility to add extra user-defined rules on startup

Recently an --add-tokens option was introduced in Ucto to add extra 'TOKENS' to the configuration.
We might consider extending this, so that a user could add extra, non-default rules/items to the tokenizer.

Some caveats to consider:

  • are the extra rules additional, or do they override?
  • make it possible to disable a certain rule
  • are the additions language-specific? How do we express that?
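
The first two caveats can be made concrete with a toy Python model. This is a sketch under stated assumptions, not a proposal for ucto's configuration format: the `RuleSet` class, the `override` and `disable` parameters, the HASHTAG rule, and all regular expressions are hypothetical, and language-specific additions are not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class RuleSet:
    """Name -> pattern mapping; a toy stand-in for a tokeniser rule list."""
    rules: dict = field(default_factory=dict)

    def apply_user_rules(self, extra, override=False, disable=()):
        """Return a new RuleSet with the user's additions applied."""
        # Drop explicitly disabled default rules first.
        merged = {k: v for k, v in self.rules.items() if k not in disable}
        for name, pattern in extra.items():
            if name in merged and not override:
                continue  # additive mode: keep the default rule
            merged[name] = pattern
        return RuleSet(merged)


defaults = RuleSet({"SMILEY": r"[:;]-?[)(]", "NUMBER": r"\d+"})
user = {"NUMBER": r"\d+(?:[.,]\d+)?", "HASHTAG": r"#\w+"}

additive = defaults.apply_user_rules(user)
overriding = defaults.apply_user_rules(user, override=True, disable={"SMILEY"})

print(sorted(additive.rules))
print(sorted(overriding.rules))
```

In additive mode the default NUMBER rule survives and only HASHTAG is new; with override=True the user's NUMBER pattern replaces the default, and SMILEY is removed via disable.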

Disable word tokenization

Is it possible to disable word tokenization and only do sentence tokenization with the current version of ucto?
Given the current state of the art of sentence tokenization tools, I think this would be a pretty good contribution (and probably not too hard to implement).

best

Ucto crashes on overly long word strings

[2]+ Segmentation fault nohup /exp/sloot/usr/local/bin/ucto -m -n --normalize=CURRENCY,DATE,DATE-REVERSE,E-MAIL,FRACNUMBER,NUMBER,NUMBER-COMPOUND,NUMBER-ORDINAL,NUMBER-STRING,NUMBER-YEAR,PUNCTUATION,PUNCTUATION-MULTI,QUOTE-COMPOUND,REVERSE-SMILEY,SMILEY,TIME,URL,URL-DOMAIN,URL-WWW --filterpunct lol.longest.txt > lol.longest.stdout 2> lol.longest.stderr
mre@black:/opensonar/SoNaRCurated/OUT2$

I tried to process SoNaR-500 text in this fashion and ucto crashed. The problem was reduced to a single very long line of '::lol::' (x 13106).
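
For anyone wanting to reproduce this, an input file of that shape can be generated with a few lines of Python. The repeat count and the repeated string come from the report; that the repetitions are concatenated without any separator is an assumption based on the issue title.

```python
from pathlib import Path

REPEATS = 13106   # count taken from the report
UNIT = "::lol::"  # repeated string taken from the report


def build_line(unit: str = UNIT, repeats: int = REPEATS) -> str:
    """Return the long single-line input (no separators: an assumption)."""
    return unit * repeats


def write_repro(path: str = "lol.longest.txt") -> int:
    """Write the line plus a trailing newline; return the number of characters written."""
    return Path(path).write_text(build_line() + "\n", encoding="utf-8")


if __name__ == "__main__":
    print(write_repro())
```

The resulting lol.longest.txt can then be fed to the ucto command line shown above.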
