
buda-base / tibetan-stemming-data


Data for testing the Tibetan Lucene analyzers

License: Apache License 2.0

Python 100.00%
tibetan nlp-resources stemming


Generated resources

output/total_lexicon.txt

A general-purpose Tibetan word list. Each line is formatted as follows: inflected<space>operation

Affixed particles (འི, འོ, -ས and -ར) and the dadrag (ད་དྲག) are appended to each processed word following the syllable-formation rules.

operation (to reconstruct the lemma) can have the following values:

  • /lemma: the lemma is given explicitly when it cannot be recovered from the inflected form with one of the operations below
  • =: the inflected form and the lemma are identical
  • >A: remove one character
  • >B: remove two characters
  • >C: remove three characters
  • >D: remove one character and add "འ"
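
As an illustration of how these operations could be applied, here is a minimal sketch that turns one line of the word list back into an (inflected, lemma) pair. It assumes that "character" means a single Unicode code point and that the two fields are separated by one ASCII space. The function name `reconstruct_lemma` is hypothetical and is not part of this repository.

```python
# Sketch only. Assumptions (not confirmed by the repository):
# - one "character" is one Unicode code point
# - the inflected form and the operation are separated by one ASCII space
SUFFIX_LENGTHS = {">A": 1, ">B": 2, ">C": 3}
ACHUNG = "\u0f60"  # the letter འ


def reconstruct_lemma(line):
    """Return (inflected, lemma) for one `inflected<space>operation` line."""
    inflected, operation = line.rstrip("\n").split(" ", 1)
    if operation == "=":
        # Inflected form and lemma are identical.
        return inflected, inflected
    if operation.startswith("/"):
        # The lemma is spelled out after the slash.
        return inflected, operation[1:]
    if operation in SUFFIX_LENGTHS:
        # Drop one, two or three trailing characters.
        return inflected, inflected[:-SUFFIX_LENGTHS[operation]]
    if operation == ">D":
        # Drop one trailing character, then restore the final འ.
        return inflected, inflected[:-1] + ACHUNG
    raise ValueError(f"unknown operation: {operation!r}")


if __name__ == "__main__":
    for sample in ["པོའི >B", "མཁས >D"]:
        print(reconstruct_lemma(sample))
```

Unknown operations raise an error rather than being silently passed through, so a malformed line is noticed early.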
Minimal testing-set
  • test_sentence.txt: the beginning of a sutra (བཀྲ་ཤིས་ཆེན་པོའི་མདོ།), split into words.
  • test_vocab.txt: the words from the sentence and all their inflected forms.

affixify.py

Input
  • input/dadrag_syllables.txt (from here. All syllables until GT are included)
  • input/vocabs/TDC.txt (from here)
Action
  • To every entry of TDC.txt:
    • Appends /C to every syllable that is in dadrag_syllables.txt
    • To the final syllable:
      • nothing added if the syllable can't host any affixed particle,
      • /A added if the syllable can host an affixed particle and requires a final འ to be valid,
      • /B added if the syllable can host an affixed particle but doesn't require a འ.
Output
  • output/lexicon_with_markers.txt
Issues
  • The Sanskrit (sskrt) syllables marked with /B were processed manually. Implementing the Sanskrit syllable-formation rules would make it possible to automate this step.
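
The marking step described under Action could be sketched as follows. This is not the code of affixify.py. It assumes that syllables are separated by a tsek (་), that markers are written directly after the syllable they apply to, and that the decision between no marker, /A and /B is delegated to a caller-supplied function (`final_marker`, a hypothetical name), since that decision depends on the syllable-formation rules.

```python
# Sketch only; the real affixify.py may format its output differently.
TSEK = "\u0f0b"  # the syllable delimiter ་


def add_markers(entry, dadrag_syllables, final_marker):
    """Mark one word-list entry.

    dadrag_syllables: set of syllables read from dadrag_syllables.txt.
    final_marker: hypothetical callable returning "", "/A" or "/B"
                  for the last syllable of the entry.
    """
    # Split on the tsek and ignore the empty piece left by a trailing tsek.
    syllables = [s for s in entry.strip().split(TSEK) if s]
    if not syllables:
        return entry
    # /C on every syllable listed as taking a dadrag.
    marked = [s + "/C" if s in dadrag_syllables else s for s in syllables]
    # /A, /B or nothing on the final syllable.
    marked[-1] += final_marker(syllables[-1])
    return TSEK.join(marked)
```

Passing the classification in as a function keeps the sketch independent of any particular implementation of the syllable rules; a lookup table of manually checked syllables would work equally well.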

prepare_verbs.py

Input
  • input/monlam_verbs.json (from Esukhia's canon_notes project)
  • input/dadrag_syllables.txt
Action
  • for every inflected form:
    • find all the lemmas (citation forms)
    • create a second inflected form if the verb is in dadrag_syllables.txt
    • add (inflected, /lemma) to the output list (= instead of /lemma if the inflected form and the lemma are identical)
Output
  • output/parsed_verbs.txt
Issues
  • a few entries for which Monlam doesn't give any information about conjugation are ignored. (ex: ལྷོགས་ | ༡བྱ་ཚིག 1. ༡བརྡ་རྙིང་། རློགས། 2. ཀློགས། is parsed into "ལྷོགས": [])
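
A minimal sketch of the pairing logic described under Action is given below. It is not the code of prepare_verbs.py. It assumes a simplified input, a mapping from each inflected form to the list of its lemmas (the actual layout of monlam_verbs.json is not shown here), and it assumes that the second form is built by appending the dadrag letter ད to the inflected form. The function name `parse_verbs` is hypothetical.

```python
# Sketch only. Assumptions: simplified {inflected: [lemma, ...]} input and
# a dadrag variant built by appending ད to the inflected form.
DADRAG = "\u0f51"  # the letter ད


def parse_verbs(conjugations, dadrag_syllables):
    """Return sorted, de-duplicated (inflected, operation) pairs."""
    entries = set()
    for inflected, lemmas in conjugations.items():
        forms = [inflected]
        if inflected in dadrag_syllables:
            # Second inflected form carrying the dadrag.
            forms.append(inflected + DADRAG)
        for form in forms:
            # An empty lemma list yields nothing, so such entries are skipped.
            for lemma in lemmas:
                operation = "=" if form == lemma else "/" + lemma
                entries.add((form, operation))
    return sorted(entries)
```

With this layout, an entry such as "ལྷོགས": [] produces no output at all, which matches the behaviour described in the issue above.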

compile_total_lexicon.py

Input
  • output/parsed_verbs.txt
  • input/particles.txt (an adaptation of this list)
  • output/lexicon_with_markers.txt
Action
  • expands every entry in lexicon_with_markers.txt:
    • /C : create a new entry with a dadrag on the marked syllable
    • for the entry (or entries if there is one with dadrag):
      • /A : remove the ending འ
      • apply all affixes (['འི', 'འོ', 'ས', 'ར'])
  • de-duplicate the generated entries and the content of parsed_verbs.txt and particles.txt
  • write the sorted entries.
Output
  • output/total_lexicon.txt
Issues
  • Applying the particle over the last syllable of some words might generate an ambiguous inflected form. Ex: སྡེ་པར་ where པར་ can be both the particle and the compressed form of པར་མ་.
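
The affix expansion and de-duplication steps could look roughly like the sketch below. It is not the code of compile_total_lexicon.py. It assumes that /A and /B appear as a literal suffix at the very end of an entry, that a "character" is one Unicode code point, and that the lines taken from parsed_verbs.txt and particles.txt are already in the `inflected<space>operation` format. The /C (dadrag) expansion is left out. All function names are hypothetical.

```python
# Sketch only; /C handling is deliberately omitted.
ACHUNG = "\u0f60"  # the letter འ
AFFIXES = ["\u0f60\u0f72", "\u0f60\u0f7c", "\u0f66", "\u0f62"]  # འི འོ ས ར


def operation_for(inflected, lemma):
    """Pick the operation that rebuilds `lemma` from `inflected`."""
    if inflected == lemma:
        return "="
    for operation, length in ((">A", 1), (">B", 2), (">C", 3)):
        if len(inflected) > length and inflected[:-length] == lemma:
            return operation
    if inflected[:-1] + ACHUNG == lemma:
        return ">D"
    return "/" + lemma  # fall back to spelling out the lemma


def expand_entry(marked):
    """Return (inflected, operation) pairs for one marked entry."""
    if marked.endswith("/A"):
        lemma = marked[:-2]
        # /A: remove the ending འ before attaching the affixes.
        stem = lemma[:-1] if lemma.endswith(ACHUNG) else lemma
    elif marked.endswith("/B"):
        lemma = stem = marked[:-2]
    else:
        return [(marked, "=")]  # cannot host an affixed particle
    forms = [lemma] + [stem + affix for affix in AFFIXES]
    return [(form, operation_for(form, lemma)) for form in forms]


def compile_lexicon(marked_entries, extra_lines=()):
    """De-duplicate and sort generated lines plus already formatted ones."""
    lines = set(extra_lines)
    for marked in marked_entries:
        lines.update(f"{form} {op}" for form, op in expand_entry(marked))
    return sorted(lines)
```

Deriving the operation by comparing each generated form with its lemma, rather than hard-coding one operation per affix, keeps the output consistent with the operation table of total_lexicon.txt even if the affix list changes.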
