aliok / trnltk Goto Github PK

For nr 3 and nr 4 we can have a look at the similarity with other verb roots.
There are a lot of verbs (kapa, ...) in the form of nr 2, so this doesn't solve the problem completely. But it is a start.

Support more numerals

1.'yi
2'nci
10000'er
biner
bininci (supported already)

Instead of adding these words to the dictionary, it may be better to support them through the suffix graph.
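
Before the suffix graph can take over, the numeral has to be separated from its suffix. A minimal sketch of that split, assuming a plain ASCII apostrophe; `NUMERAL_WITH_SUFFIX` and `split_numeral` are hypothetical names, not part of TRNLTK:

```python
import re

# Hypothetical pattern: digits, an optional ordinal dot, an apostrophe, then the suffix.
NUMERAL_WITH_SUFFIX = re.compile(r"^(\d+)(\.?)'(\w+)$")


def split_numeral(surface):
    """Return (digits, is_ordinal, suffix), or None if the surface does not match."""
    match = NUMERAL_WITH_SUFFIX.match(surface)
    if match is None:
        return None
    digits, dot, suffix = match.groups()
    return digits, dot == ".", suffix
```

Written-out forms such as "biner" do not match and would still need the dictionary or the suffix graph.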

Rename context free parser to context free morphological parser

Make use of Mahout/Hadoop

Formulate probability coefficients in matrices

Extract intensifying syllables for adjectives and adverbs

masmavi, sımsıcak, yapayalnız, ipince, küpküçük etc.

There are no strict rules for these, so the intensifying syllables need to be found from a big corpus.

If word ends with a dictionary adverb/adjective, and if it starts with the beginning of the same adjective/adverb, we can suggest that it is intensified.

Try to find as much as possible and add them to a dictionary!
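
The heuristic above could be sketched roughly as follows. Assumptions: the input is lowercased, "the beginning of the same adjective" is taken to mean everything up to and including its first vowel, and the prefix is at most four letters long. All names here are hypothetical.

```python
VOWELS = set("aeıioöuü")


def first_syllable_onset(word):
    """Return the word up to and including its first vowel, e.g. 'mavi' -> 'ma'."""
    for index, char in enumerate(word):
        if char in VOWELS:
            return word[: index + 1]
    return word


def find_intensified_base(word, adjectives, max_prefix_len=4):
    """Return (prefix, base) if `word` looks like an intensified form, else None."""
    # Try longer bases first so the result does not depend on set ordering.
    for base in sorted(adjectives, key=len, reverse=True):
        if word == base or not word.endswith(base):
            continue
        prefix = word[: len(word) - len(base)]
        if len(prefix) <= max_prefix_len and prefix.startswith(first_syllable_onset(base)):
            return prefix, base
    return None
```

Running this over the corpus vocabulary and collecting the returned prefixes would give the candidate list to review by hand before adding anything to a dictionary.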

Add MongoDB indexes

Add syntactic categories of the words to parsesets

Use Z3 style compound information in master dictionary

Statistical sentence tokenizer (text to sentences)

Unsupervised statistical root extraction without a dictionary

Brute force root extractors already exist. However, they produce too many candidates, so it is better to do the extraction statistically.

This might be useful for finding roots that don't exist in the dictionary (e.g. local words) and proper nouns.

Save the possible roots for a big corpus (10M words) in a file
...

For proper noun recognition

Check if the root has been used with an apostrophe in the corpus
Or check if the word starts with an uppercase letter in the middle of a sentence in the corpus
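
Both checks can be counted in one pass over the corpus. A rough sketch, assuming the corpus is already split into sentences of tokens, the apostrophe is the ASCII one, and the root is given in its capitalized form; `proper_noun_evidence` is a hypothetical helper:

```python
def proper_noun_evidence(root, sentences):
    """Count apostrophe uses and mid-sentence capitalized uses of `root`.

    `sentences` is a list of token lists; `root` is matched case-sensitively.
    """
    apostrophe_hits = 0
    capitalized_hits = 0
    for tokens in sentences:
        for position, token in enumerate(tokens):
            stem, apostrophe, _ = token.partition("'")
            if apostrophe and stem == root:
                apostrophe_hits += 1
            elif position > 0 and token.startswith(root) and token[:1].isupper():
                capitalized_hits += 1
    return apostrophe_hits, capitalized_hits
```

Sentence-initial tokens are skipped in the second count because they are capitalized anyway.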

For e.g. verb recognition: for non-dictionary word 'kıvışlıyordu' find the root as 'kıvışlamak'

Check if there are other surfaces whose root candidates include "kıvışlamak", such as 'kıvışladım' 'kıvışla' 'kıvışlarsa'
Then we could eliminate some of the candidates: 'kıvışlımak' 'kıvışlıyormak' 'kıvışlıyomak' etc.
However, this doesn't eliminate roots such as "kıvış+Noun" 'kıvmak' 'kıvımak' etc.
For those, check if there are other surfaces such as 'kıvışımı' 'kıvdım' 'kıvıyorum' etc.
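
The elimination step amounts to ranking candidates by how many of their other expected surfaces are attested in the corpus. A minimal sketch, assuming the expected surfaces per candidate come from somewhere else (e.g. the brute force extractor run in reverse); the function name and data layout are hypothetical:

```python
def rank_root_candidates(expected_surfaces, corpus_vocabulary):
    """Sort candidate roots by how many of their expected surfaces occur in the corpus.

    `expected_surfaces` maps a candidate root to surfaces it would produce;
    `corpus_vocabulary` is the set of surfaces seen in the corpus.
    """
    support = {
        root: sum(1 for surface in surfaces if surface in corpus_vocabulary)
        for root, surfaces in expected_surfaces.items()
    }
    # Highest support first; ties are broken by the root string for determinism.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))
```

Candidates with zero support can be dropped, or kept with a low score if the corpus is small.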

That would help a lot.

Implement context free parse in playground

Morphotactics bug : "yapm" is considered as imperative

Another example:
hıllandım -> 'hıllandı(hıllandımak)+Verb+Neg(m[m])+Imp+A2sg',

Concept of auxiliary verbs

Is it better to use Kneser-Ney discounting instead of SGT alone?

Make current disambiguator work on file, not database!

Document the differences of Zemberek morphological parser and TRNLTK one

Duplication recognition in tokenization

Duplications:

1. abur cubur: neither word makes sense on its own
2. yemek memek: the second word doesn't make sense on its own
3. iyi kotu: opposites
4. zırıl zırıl: called "sound reflection" in Turkish
5. sıcak sıcak: two adjectives turn into an adverb
6. gide gide
7. kırk elli kişi
8. uc bes kurus
9. bata cika
10. enine boyuna
11. ev bark
12. bas basa, daldan dala, ucu ucuna
13. gelir gelmez, yapar yapmaz --> Adverbs

Some of them can be done during tokenization.
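
As a rough illustration of the tokenization-time part, the sketch below merges only two of the patterns: exact repetition (sıcak sıcak) and the "m" echo form (yemek memek). The underscore join and all names are assumptions, not existing TRNLTK behaviour.

```python
VOWELS = set("aeıioöuüAEIİOÖUÜ")


def m_reduplicate(word):
    """Build the 'm' echo form: 'yemek' -> 'memek', 'ev' -> 'mev'."""
    return "m" + (word if word[0] in VOWELS else word[1:])


def merge_duplications(tokens):
    """Join adjacent tokens that form an exact or 'm' echo duplication."""
    merged = []
    index = 0
    while index < len(tokens):
        current = tokens[index]
        following = tokens[index + 1] if index + 1 < len(tokens) else None
        if following is not None and current and (
            following == current or following == m_reduplicate(current)
        ):
            merged.append(current + "_" + following)
            index += 2
        else:
            merged.append(current)
            index += 1
    return merged
```

Pairs like "ev bark" or "abur cubur" are not predictable from the surface and would need a word list instead.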

Some of them need to be done after parsing, such as item 13 (gelir gelmez, yapar yapmaz):

_R+Verb+Pos+Aor+A3sg _R+Verb+Neg+Aor+A3sg %1 %2+Adverb+When
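
One possible reading of this pattern as a check over two adjacent parse results. The `(root, tags)` tuple layout is an assumption made for the example and is not how TRNLTK represents parses:

```python
def match_aorist_pair(first_parse, second_parse):
    """Return 'Adverb+When' if the two parses fit the pattern, else None.

    Each parse is a hypothetical (root, tags) pair such as ('gel', 'Verb+Pos+Aor+A3sg').
    """
    first_root, first_tags = first_parse
    second_root, second_tags = second_parse
    if (
        first_root == second_root
        and first_tags == "Verb+Pos+Aor+A3sg"
        and second_tags == "Verb+Neg+Aor+A3sg"
    ):
        return "Adverb+When"
    return None
```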

Brute force proper noun root finder and a special suffix graph

Should accept:

Surfaces starting with uppercase letter
Surfaces containing no apostrophe (since with an apostrophe, the root is obvious)

Some suffixes which can be applied to these kinds of roots:

-ler :
- Turkler,
- Alilere gidiyorum
-gil
- Ahmetgildeyim
-li
- Kayserili
-lik
- Turkluk

etc.

Check the rules from TDK: http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=187:Noktalama-Isaretleri-Aciklamalar&catid=50:yazm-kurallar&Itemid=132

Some suffixes don't make sense to apply to proper nouns:

- len

The really problematic part is that the apostrophe is sometimes used and sometimes not.
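
A minimal sketch of the brute force part, assuming the special suffix graph is approximated by a flat list of suffix beginnings. The list is built from the examples above plus their vowel-harmony variants and is certainly incomplete; the names are hypothetical.

```python
# Assumed suffix beginnings: -ler, -gil, -li, -lik and their harmony variants.
SUFFIX_STARTS = ("ler", "lar", "gil", "li", "lı", "lu", "lü", "lik", "lık", "luk", "lük")


def proper_noun_root_candidates(surface, min_root_len=2):
    """List prefixes of `surface` that are followed by one of the assumed suffixes."""
    # Only uppercase-initial surfaces without an apostrophe are accepted.
    if not surface[:1].isupper() or "'" in surface:
        return []
    candidates = []
    for split in range(min_root_len, len(surface)):
        root, rest = surface[:split], surface[split:]
        if rest.startswith(SUFFIX_STARTS):
            candidates.append(root)
    return candidates
```

A real suffix graph would also validate the rest of the surface (e.g. "deyim" in "Ahmetgildeyim"), which this sketch does not attempt.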

Find if a verb is transitive, reciprocal, reflexive or not

Transitive verbs (verbs that accept an object) could be found from an annotated corpus.
An intransitive verb can be converted to a transitive one by adding the Causative suffix.
** A very advanced issue. Related to POS tagging.

Finding reciprocal verbs is easy: it is rule based (the verb ends with "ş") and needs no POS tagging.

Reflexive (such as giyinmek) is similar to reciprocal
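
A sketch of the "ends with ş" rule for reciprocal verbs. The extra check that the remaining base is itself a known verb root is an assumption added here to cut down false positives; it is not part of the rule as stated above.

```python
VOWELS = set("aeıioöuü")


def reciprocal_base(root, verb_roots):
    """Return the base verb root if `root` looks reciprocal, else None."""
    if len(root) < 3 or not root.endswith("ş"):
        return None
    stem = root[:-1]
    # Try dropping a linking vowel first: 'dövüş' -> 'döv'; otherwise 'anlaş' -> 'anla'.
    candidates = [stem[:-1], stem] if stem[-1] in VOWELS else [stem]
    for candidate in candidates:
        if candidate in verb_roots:
            return candidate
    return None
```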

Corpus explorer

Morpheme container context free probability generator

For:

roots
lexemes
suffixes

Doesn't make sense to parse "ilan" and "etmek" separately.

Zemberek already has a small database about these.

Issue #32 is related

See http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132

1. Abbreviations like M.Ö. or ing.
2. Ordinals like 3.
3. Roman numerals like III and III.
4. Parentheses such as "(abc"
5. Some phrases which are multiple words but should be considered as one: "hafta sonu" => "hafta_sonu"
6. Proper nouns which are multiple words but should be considered as one: "İç Anadolu" => "İç_Anadolu"
7. Duplications

Ideas:
during tokenization:

1. Check if M.Ö. is used as an abbreviation
2. This is rule based I think. A sentence almost never ends with a cardinal number.
3. Need morphological support for that first.
4. Seems rule based
5. After tokenization, we can have a look if there is a phrase like that. If so, the words could be merged
6. Same as idea 5
7. Issue #32 is related

For items 5 and 6, see http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132
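
Merging multi-word phrases and proper nouns after tokenization could look roughly like this. The sketch assumes the known phrases are available as a set of token tuples (e.g. built from the TDK list); `merge_phrases` is a hypothetical helper.

```python
def merge_phrases(tokens, phrases):
    """Join known multi-word phrases with '_'; `phrases` is a set of token tuples."""
    longest = max((len(phrase) for phrase in phrases), default=0)
    merged = []
    index = 0
    while index < len(tokens):
        # Try the longest possible window first so longer phrases win.
        for size in range(min(longest, len(tokens) - index), 1, -1):
            window = tuple(tokens[index:index + size])
            if window in phrases:
                merged.append("_".join(window))
                index += size
                break
        else:
            merged.append(tokens[index])
            index += 1
    return merged
```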

numerals
roman numerals
etc.

don't separate it.

Same would go with other ambiguous points.


trnltk's Issues
