Comments (9)
Ok, this sounds like a feasible request.
I would have to dive into this to find a solution.
from ucto.
Thank you kosloot!
Also for the further explanations about how this works.
We have now run this on several thousands of files: not a single one failed.
I consider this matter closed.
'unk' is an existing ISO 639-3 code: https://iso639-3.sil.org/code/unk , we shouldn't abuse it for something else. If a language can't be identified (or not with enough confidence), it'd be better simply not to output the <lang> element at all.
I am aware of the 'unk' code, but fortunately there is also an 'und' code we could use. That is exactly where I am heading.
Ha, right, I was already wondering if there was something like that. Good idea.
Sounds good!
Ok, I added code to handle 'und' languages.
When 'und' is added to the --detectlanguages option, the 'default' language will be 'undefined'; those sentences will remain untokenized and will be added 'as is' to the FoLiA output.
@martinreynaert please test and comment.
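To make the 'und' behaviour concrete, here is a minimal Python sketch of the idea, not Ucto's actual code. The stopword lists, `guess_language` and `tokenize_or_pass` are hypothetical stand-ins: a sentence that matches none of the known languages is labelled 'und' and passed through untokenized.

```python
UND = "und"  # ISO 639-3 code for an undetermined language

# Hypothetical stopword lists standing in for a real language guesser.
STOPWORDS = {
    "eng": {"the", "and", "is", "of"},
    "nld": {"de", "het", "en", "een"},
}


def guess_language(text, min_hits=1):
    """Return the best-scoring language, or 'und' if nothing matches."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else UND


def tokenize_or_pass(sentence):
    """Tokenize known languages; keep 'und' sentences as one untouched unit."""
    lang = guess_language(sentence)
    if lang == UND:
        return lang, [sentence]  # left untokenized, kept "as is"
    return lang, sentence.split()  # whitespace split as a stand-in tokenizer


print(tokenize_or_pass("the cat is here"))
print(tokenize_or_pass("Καλημέρα κόσμε"))
```

The second call has no stopword hits, so the text comes back whole under the 'und' label instead of being dropped.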
Thank you!
I have tested 'und' in Ucto's language detection mode. Results appear much more reliable than before! See this remarkable example: JSTOR.music.00656.p.1.s.7
I had 1477 input files for testing. However, 25 of these gave empty output and a message on stderr. I attach them for your convenience.
I saw at least one that contains only non-Latin script (file: JSTOR.music.01437). I think that text should nevertheless be incorporated in the FoLiA output. I have yet to check what happens when there is mixed Latin and non-Latin text.
Another case is less clear: the text seems to be just plain English to me, but Ucto complains "ucto: ucto: conflicting language(s) assigned" and returns an empty file (see: JSTOR.music.00072 and 17 more files).
UCTO.FailLangDetect.20220407.MRE.tar.gz
Again: thanks! Ucto has already been greatly improved for our purposes!
I added some code to avoid the 'ucto: ucto: conflicting language(s)' message.
Beware that this stems from an unsolvable "chicken-and-egg" problem:
To detect languages, we need to detect sentence bounds, which requires tokenization. But to tokenize, we need to know the language.
At the moment we guess some sentence bounds, use the resulting fragments to detect the language, and then tokenize the longest utterance within the same language. This works quite well, but not always, as libtextcat sometimes makes strange decisions.
Example:
'Educated at the famous monastery in St. Gallen, he went as a wandering student in search of learning as'
This is first split at the '.' in 'St.'. The first part, 'Educated at the famous monastery in St.', is detected as English, but the second part, 'Gallen, he went as a wandering student in search of learning as', is somehow detected as Dutch.
As a consequence, this utterance will be tokenized as 2 sentences.
Had both parts been detected as English, it would correctly have been seen as 1 sentence.
This problem is NOT resolvable by Ucto.
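The effect can be reproduced in a toy Python sketch. This is an illustration under stated assumptions, not Ucto's implementation: `guess_fragments`, `merge_by_language` and both detectors are hypothetical, and the `flaky` detector is hard-coded to mimic the misdetection reported above rather than calling libtextcat.

```python
import re
from itertools import groupby


def guess_fragments(text):
    """Naive sentence-bound guess: split after every '.', keeping the dot."""
    return [f.strip() for f in re.findall(r"[^.]+\.?", text) if f.strip()]


def merge_by_language(fragments, detect):
    """Join adjacent fragments that were assigned the same language."""
    labelled = [(detect(f), f) for f in fragments]
    return [
        (lang, " ".join(fragment for _, fragment in group))
        for lang, group in groupby(labelled, key=lambda pair: pair[0])
    ]


def flaky(fragment):
    # Hard-coded to mimic the reported misdetection; not real libtextcat output.
    return "nld" if fragment.startswith("Gallen") else "eng"


def ideal(fragment):
    # What a perfect detector would say for this example.
    return "eng"


text = ("Educated at the famous monastery in St. Gallen, he went as a "
        "wandering student in search of learning as")

fragments = guess_fragments(text)
print(merge_by_language(fragments, flaky))  # two utterances: eng, then nld
print(merge_by_language(fragments, ideal))  # one utterance: eng
```

With the `flaky` detector the guessed split at 'St.' survives as a language boundary, so two utterances reach the tokenizer. With the `ideal` detector the fragments are rejoined and the tokenizer sees the whole sentence.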