aliok / trnltk
Turkish Natural Language Toolkit
License: Apache License 2.0
For example, it might make sense to mark the word "bilimsel" as "contains 'Related' suffix".
We don't want to break that word into its subparts, but need to reduce the number of parse results produced for that word.
bilimsel+Adj
bilim+A3sg+Pnon+Nom+Adj+Related
Can be applied to nouns ending with a consonant.
erkeksi
cocuksu
raporsu
but not
masa-m-si
yasli-m-si
-msi is another suffix
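The consonant rule above can be sketched as a small check. This is only an illustration, not trnltk code: the function name `add_si_suffix` is hypothetical, input is assumed to be lowercase, and the suffix vowel is assumed to follow ordinary four-way vowel harmony, which matches the examples erkeksi, cocuksu and raporsu.

```python
VOWELS = "aeıioöuü"

# Four-way vowel harmony: last vowel of the stem -> vowel of the suffix.
HARMONY = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def add_si_suffix(noun):
    """Return noun + -sI, or None if the noun ends with a vowel.

    Hypothetical helper: nouns ending with a vowel are rejected here
    because they would take the separate -msi suffix instead.
    """
    if not noun or noun[-1] in VOWELS:
        return None
    # Pick the suffix vowel from the last vowel of the noun.
    for ch in reversed(noun):
        if ch in HARMONY:
            return noun + "s" + HARMONY[ch]
    return None  # no vowel at all, nothing to harmonize with


if __name__ == "__main__":
    for word in ["erkek", "cocuk", "rapor", "masa"]:
        print(word, "->", add_si_suffix(word))
```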
Z3's solution: https://code.google.com/p/zemberek3/source/detail?r=321
Seems like a very hard problem.
For example, with brute force, the parse results for the word "yapıyordum" are as follows:
Nr 2, nr 3 and nr 4 are false positives.
For nr 3 and nr 4 we can have a look at the similarity with other verb roots.
There are a lot of verbs (kapa, ...) in the form of nr 2, so this doesn't solve the problem completely. But it is a start.
1.'yi
2'nci
10000'er
biner
bininci (supported already)
Instead of adding these words to the dictionary, it may be better to add support for them in the suffix graph.
masmavi, sımsıcak, yapayalnız, ipince, küpküçük etc.
There are no strict rules, so the intensifying syllables for these need to be collected from a big corpus.
If word ends with a dictionary adverb/adjective, and if it starts with the beginning of the same adjective/adverb, we can suggest that it is intensified.
Try to find as much as possible and add them to a dictionary!
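A minimal sketch of the heuristic above, assuming a plain set of dictionary adjectives/adverbs. The names `suggest_intensified` and `leading_part` are hypothetical, and "starts with the beginning of the same adjective" is interpreted here as "shares everything up to and including the first vowel", which is only one possible reading.

```python
VOWELS = set("aeıioöuü")


def leading_part(word):
    """Return the word up to and including its first vowel, or None."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i + 1]
    return None


def suggest_intensified(word, adjectives):
    """Return the base adjective if `word` looks intensified, else None.

    `adjectives` stands in for the dictionary lookup (an assumption).
    """
    # Try longer bases first so the result does not depend on set order.
    for adj in sorted(adjectives, key=len, reverse=True):
        if len(word) <= len(adj) or not word.endswith(adj):
            continue
        prefix = word[:-len(adj)]
        head = leading_part(adj)
        # The prefix must repeat the start of the adjective and then add
        # at least one extra character (the intensifying part).
        if head and prefix.startswith(head) and len(prefix) > len(head):
            return adj
    return None


if __name__ == "__main__":
    dictionary = {"mavi", "yalnız", "ince", "küçük"}
    for w in ["masmavi", "yapayalnız", "ipince", "küpküçük", "kitap"]:
        print(w, "->", suggest_intensified(w, dictionary))
```

A match is only a suggestion: candidates found this way would still need to be reviewed before being added to a dictionary.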
Brute-force root extractors already exist. However, they produce too many results, so it is better to do this statistically.
This might be useful for finding roots that don't exist in the dictionary (e.g. local words) and proper nouns.
For proper noun recognition
For e.g. verb recognition: for non-dictionary word 'kıvışlıyordu' find the root as 'kıvışlamak'
That would help a lot.
Another example:
hıllandım -> 'hıllandı(hıllandımak)+Verb+Neg(m[m])+Imp+A2sg',
Duplications:
Some of them can be done during tokenization.
Some of them need to be done after parsing, such as 13
_R+Verb+Pos+Aor+A3sg _R+Verb+Neg+Aor+A3sg %1 %2+Adverb+When
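As an illustration of applying such a rule after parsing, here is a rough sketch. It assumes parse results are plain strings such as `yap+Verb+Pos+Aor+A3sg`, that `_R` means "the same root in both words", and that `%1` and `%2` stand for the two surface forms. None of this is confirmed by the rule line itself, and `merge_aorist_duplication` is a hypothetical name.

```python
def merge_aorist_duplication(first, second):
    """Merge a Pos+Aor / Neg+Aor pair of the same verb root.

    `first` and `second` are (surface, parse) pairs. Returns the merged
    parse string, or None when the rule does not apply.
    """
    (surface1, parse1), (surface2, parse2) = first, second
    root1, _, tags1 = parse1.partition("+")
    root2, _, tags2 = parse2.partition("+")
    if (
        root1 == root2
        and tags1 == "Verb+Pos+Aor+A3sg"
        and tags2 == "Verb+Neg+Aor+A3sg"
    ):
        return f"{surface1} {surface2}+Adverb+When"
    return None


if __name__ == "__main__":
    print(merge_aorist_duplication(
        ("yapar", "yap+Verb+Pos+Aor+A3sg"),
        ("yapmaz", "yap+Verb+Neg+Aor+A3sg"),
    ))
```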
Should accept:
Some suffixes which can be applied to these kinds of roots:
etc.
Check the rules from TDK: http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=187:Noktalama-Isaretleri-Aciklamalar&catid=50:yazm-kurallar&Itemid=132
It doesn't make sense to apply some suffixes to proper nouns:
The really problematic part is that sometimes an apostrophe is used and sometimes it is not.
Transitive verbs (verbs that accept an object) could be found from an annotated corpus.
An intransitive verb can be made transitive by adding the Causative suffix.
** A very advanced issue. Related to POS tagging.
Finding reciprocal verbs is easy. It can be rule based (the verb ends with "ş"), with no need for POS tagging.
Reflexive verbs (such as giyinmek) are similar to reciprocal ones.
For:
Think about it; this needs investigation. However, it seems it is not necessary.
-vermek
-durmak
etc.
In order to use it in the tokenizer (sentence to words), we need something like that.
Can be done statistically with some rules, with the support of Issue #25
hafta sonu => hafta_sonu
Turkiye Cumhuriyeti => Turkiye_Cumhuriyeti
ilan etmek => ilan_etmek
Doesn't make sense to parse "ilan" and "etmek" separately.
Zemberek already has a small database of these.
Issue #32 is related
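The joining step in the examples above could look roughly like the following sketch. It assumes the multiword expressions are already available as a set of token tuples (for example from a small database like the one mentioned), and it only handles exact matches, so inflected forms such as a conjugated "etmek" would need the statistical support discussed above. `join_multiword_expressions` and `max_len` are hypothetical names.

```python
def join_multiword_expressions(tokens, expressions, max_len=3):
    """Greedily join known multiword expressions with underscores.

    `expressions` is a set of token tuples; the longest match starting
    at each position wins.
    """
    joined = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in expressions:
                joined.append("_".join(candidate))
                i += n
                break
        else:
            # No expression starts here; keep the token as it is.
            joined.append(tokens[i])
            i += 1
    return joined


if __name__ == "__main__":
    known = {("hafta", "sonu"), ("Turkiye", "Cumhuriyeti"), ("ilan", "etmek")}
    print(join_multiword_expressions(["bu", "hafta", "sonu", "geliyorum"], known))
```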
Use statistical parsing + context information (n-gram probability)
The rule-based part is already available: https://github.com/aliok/trnltk/blob/master/trnltk/tokenizer/texttokenizer.py
It doesn't work well with:
Ideas:
During tokenization:
for 5 and 6, see http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132
This would be good for deciding what to do when a dot char is seen.
If it makes sense:
don't separate it.
Same would go with other ambiguous points.
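One way to read the idea above is as a check at the point where the tokenizer sees a dot: keep the dot attached if the token including the dot "makes sense", otherwise split it off. The sketch below assumes a caller-supplied predicate `is_parsable` standing in for whatever check is used (a parser call, a lookup, and so on). Both `split_trailing_dot` and `is_parsable` are hypothetical names and not part of the existing tokenizer.

```python
def split_trailing_dot(token, is_parsable):
    """Split a trailing dot off `token` unless the dotted form makes sense.

    `is_parsable` is any callable returning True when the token,
    including its dot, should stay in one piece.
    """
    if token.endswith(".") and len(token) > 1 and not is_parsable(token):
        return [token[:-1], "."]
    return [token]


if __name__ == "__main__":
    # Toy predicate: only these dotted tokens are considered meaningful.
    meaningful = {"1.", "Dr."}
    check = lambda t: t in meaningful
    for tok in ["1.", "Dr.", "geldi.", "."]:
        print(tok, "->", split_trailing_dot(tok, check))
```

The same shape would apply to other ambiguous characters by swapping the character being tested.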