Hi, We have thousands of article abstracts. A lot are in mixed langu

Language detection default for 'unknown' language about ucto HOT 9 CLOSED

martinreynaert commented on June 19, 2024

Language detection default for 'unknown' language

from ucto.

Comments (9)

kosloot commented on June 19, 2024

Ok, this sounds like a feasible request.
I will have to dive into this to find a solution.

martinreynaert commented on June 19, 2024

Thank you kosloot!

Also for the further explanations about how this works.

We have now run this on several thousands of files: not a single one failed.

I consider this matter closed.

proycon commented on June 19, 2024

unk is an existing ISO 639-3 code (https://iso639-3.sil.org/code/unk), so we shouldn't abuse it for something else. If a language can't be identified (or not with enough confidence), it would be better simply not to output the <lang> element at all.

kosloot commented on June 19, 2024

I am aware of the unk code, but fortunately there is also an und code we could use, which is exactly where I am heading.

proycon commented on June 19, 2024

Ha, right, I was already wondering if there was something like that. Good idea.

martinreynaert commented on June 19, 2024

Sounds good!

kosloot commented on June 19, 2024

Ok, I added code to handle 'und' languages.
When 'und' is added to the --detectlanguages option, the 'default' language will be 'undefined'. Sentences assigned to it will remain untokenized and are added 'as is' to the FoLiA output.
@martinreynaert please test and comment.
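
For readers following along, the behaviour described above can be sketched roughly as below. This is only an illustration, not Ucto's actual code: the function names, the `detect` callback and the `tokenize` callback are all hypothetical, and the rule that the first listed language is the default when 'und' is absent is an assumption.

```python
from typing import Callable, List, Optional

UND = "und"  # ISO 639-3 code for "undetermined"


def default_language(languages: List[str]) -> str:
    """Pick the fallback language for a --detectlanguages-style list.

    If 'und' is listed, it becomes the default (as described above).
    Otherwise the first listed language is used (an assumption here).
    """
    return UND if UND in languages else languages[0]


def assign_language(sentence: str,
                    detect: Callable[[str], Optional[str]],
                    languages: List[str]) -> str:
    """Return the detected language if it is one of the requested ones,
    else the default. `detect` is a hypothetical stand-in for a real
    language detector and may return None when it has no answer."""
    guess = detect(sentence)
    if guess is not None and guess != UND and guess in languages:
        return guess
    return default_language(languages)


def render(sentence: str, lang: str,
           tokenize: Callable[[str], List[str]]) -> List[str]:
    """Leave 'und' sentences untokenized ('as is'); tokenize the rest."""
    if lang == UND:
        return [sentence]
    return tokenize(sentence)
```

With a list such as `["eng", "nld", "und"]`, a sentence the detector cannot place ends up labelled `und` and is passed through unchanged, rather than being forced into English.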

martinreynaert commented on June 19, 2024

Thank you!

I have tested 'und' in Ucto's language detection mode. Results appear much more reliable than before! See this remarkable example: JSTOR.music.00656.p.1.s.7

I had 1477 input files for testing. However, 25 of these gave empty output and a message on stderr. I attach them for your convenience.

I saw at least one where there is only non-Latin script (file: JSTOR.music.01437). I think that text should nevertheless be incorporated in the FoLiA output. I have yet to check what happens when there is mixed Latin and non-Latin text.

Another case is less clear: the text seems to be plain English to me, but Ucto complains "ucto: ucto: conflicting language(s) assigned" and returns an empty file (see JSTOR.music.00072 and 17 more files).

UCTO.FailLangDetect.20220407.MRE.tar.gz

Again: thanks! Ucto has already been greatly improved for our purposes!

kosloot commented on June 19, 2024

I added some code to avoid the "ucto: ucto: conflicting language(s)" message.
Beware that this stems from an unsolvable chicken-and-egg problem: to detect languages, we need to detect sentence bounds, which requires tokenization, but to tokenize, we need to know the language.

At the moment we guess some sentence bounds, use the resulting fragments to detect the language, and then tokenize the longest utterance within the same language. This works quite well, but not always, as libtextcat sometimes makes strange decisions.

Example:
Educated at the famous monastery in St. Gallen, he went as a wandering student in search of learning as'

This is first split at the '.' in St. The first part, "Educated at the famous monastery in St.", is detected as English, but the second part, "Gallen, he went as a wandering student in search of learning as'", is somehow detected as Dutch.
As a consequence, this utterance will be tokenized as 2 sentences.

If both parts had been detected as English, it would correctly have been treated as 1 sentence.

This problem is NOT resolvable by Ucto.
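
To make the effect concrete, here is a minimal sketch of the "guess bounds, detect per fragment, merge neighbours with the same language" idea. It is not Ucto's implementation: the regex split is a deliberately naive guess at sentence bounds, and the two `detect_*` functions are hypothetical stand-ins for libtextcat, one of which reproduces the misdetection from the example.

```python
import re
from itertools import groupby
from typing import Callable, List, Tuple

TEXT = ("Educated at the famous monastery in St. Gallen, he went as a "
        "wandering student in search of learning as'")


def guess_fragments(text: str) -> List[str]:
    """Provisional split after '.', '!' or '?' followed by whitespace."""
    return [f for f in re.split(r"(?<=[.!?])\s+", text.strip()) if f]


def group_by_language(text: str,
                      detect: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Label each guessed fragment, then merge adjacent fragments that
    share a language into one utterance to be tokenized together."""
    labelled = [(detect(frag), frag) for frag in guess_fragments(text)]
    return [(lang, " ".join(frag for _, frag in run))
            for lang, run in groupby(labelled, key=lambda pair: pair[0])]


def detect_flaky(fragment: str) -> str:
    """Stand-in detector that mimics the reported misdetection."""
    return "nld" if fragment.startswith("Gallen") else "eng"


def detect_consistent(fragment: str) -> str:
    """Stand-in detector that labels everything as English."""
    return "eng"
```

With `detect_flaky`, `group_by_language(TEXT, detect_flaky)` yields two utterances (one English, one Dutch), so the text would end up as 2 sentences. With `detect_consistent`, the fragments are merged back into a single English utterance, and a tokenizer that knows the abbreviation "St." could then keep it as 1 sentence.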
