timarkh / tsakorpus

Yet another search platform for linguistic corpora.

License: MIT License

Python 52.18% Batchfile 0.07% JavaScript 11.64% CSS 5.64% HTML 30.45% Shell 0.02%
corpus corpus-linguistics corpus-tools elasticsearch flask language-documentation linguistic-corpora linguistics media-aligned-corpora parallel-corpora

Contributors: codemurt, kategerasimenko, maryszmary, timarkh, torilov

tsakorpus's Issues

Using "multiple_choice_fields" option for sentence-level metadata

The documentation, in the Corpus configuration section, states that

multiple_choice_fields (dictionary) – describes tag selection tables for word-level fields other than Grammar or Gloss and sentence-level metadata fields. Keys are field names, values are structured in the same way as gramm_selection above.

This is ambiguous: is it possible after all to attach tag selection tables to the search fields for sentence-level metadata (for example, in the case of span annotations)?

Bug in a parallel corpus

So, when I do this:

  1. click on the "more fields" button
  2. then click on the "add word" button

Then the drop-down list with language tiers is placed right over the text box, like this:
[Screenshot 2021-12-21 at 13 57 20]

Make additional fields work when loading a query

As of now, additional fields (e.g. in INEL: Lex. gloss, Borrowing) are included when sharing a query, but discarded when loading it.

UPD: it actually works ok when loading a query with the load button, but doesn't work when pasting into the address bar.
E.g. when opening this link:

https://inel.corpora.uni-hamburg.de/KamasCorpus/search?n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any&lang1=kamas&precise=on&page_size=10&sort=random&distance_strict=on&input_method=normal&

only the Gram. tag=n is kept in the query but the Borrowing field=RUS:cult is blank.
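
For reference, the shared link itself does carry the Borrowing value. A minimal sketch using only Python's standard library (this is not tsakorpus code) that decodes the query string of the link above:

```python
from urllib.parse import urlsplit, parse_qs

# The link from this issue, split across lines for readability.
url = ("https://inel.corpora.uni-hamburg.de/KamasCorpus/search?"
       "n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any"
       "&lang1=kamas&precise=on&page_size=10&sort=random"
       "&distance_strict=on&input_method=normal&")

# parse_qs percent-decodes values and skips the empty segment after the trailing "&".
params = parse_qs(urlsplit(url).query)

print(params["gr1"])   # ['n']
print(params["BOR1"])  # ['RUS:cult']
```

So both gr1 and BOR1 are present in the URL; the value appears to be lost somewhere after the request is made, not when the link is generated.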

Variables for multiword queries

A wish: introduce variables for multiword queries, to say "same value as in word N" or "different value than in word N".

E.g. to search for two consecutive forms of the same lemma, or pair words with repeating inflectional affixes.
Or to find adjectives agreeing with the noun in case or gender.
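
To make the wish concrete, here is a rough sketch of what "same value as in word N" would mean for two consecutive words. The token dictionaries and the function name are hypothetical illustrations, not tsakorpus's JSON format or query syntax:

```python
def same_value_pairs(tokens, field):
    """Return index pairs (i, i + 1) of consecutive tokens sharing `field`."""
    pairs = []
    for i in range(len(tokens) - 1):
        first = tokens[i].get(field)
        second = tokens[i + 1].get(field)
        # Tokens lacking the field never match.
        if first is not None and first == second:
            pairs.append((i, i + 1))
    return pairs

# Hypothetical sentence: two consecutive forms of the lemma "go".
sentence = [
    {"wf": "went", "lemma": "go"},
    {"wf": "going", "lemma": "go"},
    {"wf": "home", "lemma": "home"},
]
pairs = same_value_pairs(sentence, "lemma")
print(pairs)  # [(0, 1)]
```

The "different value than in word N" case would be the same comparison with `!=`.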

Add parameter for a regex to tokenize kw fields

We've got customized tokenization for text fields like lemma, and it works fine (e.g. no tokenization by dot).
Now a similar setting is needed for keyword fields like INEL SeR, SyF etc. (e.g. to force tokenizing by space).
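
A sketch of the requested behaviour, assuming the regex would describe the separator. The function name, the default pattern and the sample value are made up for illustration and are not existing tsakorpus settings:

```python
import re

def tokenize_keyword(value, pattern=r"\s+"):
    """Split a keyword field value on a configurable separator regex."""
    return [tok for tok in re.split(pattern, value.strip()) if tok]

# Dots and colons stay inside the tokens; only whitespace separates them.
tokens = tokenize_keyword("np.h:S  v:pred")
print(tokens)  # ['np.h:S', 'v:pred']
```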

Add an option to un-randomize results

For smaller corpora / low number of results, i.e. when the user can have an exhaustive list of hits in the corpus, it is sometimes very useful to have them presented in an ordered sequence.
I.e. sort by text ID and, within a single text, by order of appearance.
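
The requested ordering amounts to a two-level sort key. A minimal sketch, with hypothetical hit records (the field names text_id and sent_index are assumptions, not tsakorpus's actual field names):

```python
# Hypothetical hits: each one knows its text and its position within that text.
hits = [
    {"text_id": 2, "sent_index": 0},
    {"text_id": 1, "sent_index": 5},
    {"text_id": 1, "sent_index": 2},
]

# Sort by text first, then by order of appearance within the text.
ordered = sorted(hits, key=lambda h: (h["text_id"], h["sent_index"]))

for hit in ordered:
    print(hit["text_id"], hit["sent_index"])
```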

Interaction of 'position in a sentence' and 'distance to word' parameters

If you specify a position in the sentence for the first word form and define the second word's distance from the first, the distance is no longer taken into account. The position in the sentence still works.
We checked this on the Adyghe and Chukchi corpora.

Regular expressions for 'gramm_shortcuts'

Regular expressions do not work for the 'gramm_shortcuts' parameter in corpus.json.
They were probably never meant to work there, but this is set up very conveniently for the 'gloss_shortcuts' parameter, and it would be great if regexes could be used for grammatical tags as well.

For example, we would like the tag "A" to find only those A that do not carry the NtoV label, while a tag such as "A+" would find all A without restrictions (so NtoV would be included in those results). The first rule in the screenshot below does not work, while the second one works fine.

[screenshot]

Rules with separators do not work either, for example this one:

[screenshot]

'language_codes'. xml_flex2json.py error

Hi! I've run into a problem:

% python xml_flex2json.py
Traceback (most recent call last):
  File "xml_flex2json.py", line 445, in <module>
    x2j.process_corpus()
  File "../src_convertors/txt2json.py", line 314, in process_corpus
    curTokens, curWords, curAnalyzed = self.convert_file(fnameSrc, fnameTarget)
  File "xml_flex2json.py", line 430, in convert_file
    textJSON['sentences'] = [s for sNode in interlinear.xpath('./paragraphs/paragraph/phrases/phrase | '
  File "xml_flex2json.py", line 432, in <listcomp>
    for s in self.process_se_node(sNode)]
  File "xml_flex2json.py", line 387, in process_se_node
    re.sub('-.*', '', element.attrib['lang']) in self.corpusSettings['language_codes']:
KeyError: 'language_codes'

What can be wrong?
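
The last frame of the traceback shows the lookup self.corpusSettings['language_codes'] failing, so the settings the converter loaded apparently contain no 'language_codes' entry. A small sketch of checking for the key before running the conversion. The helper name and the sample settings are hypothetical, and the issue does not show which settings file the converter reads or what the value should contain:

```python
def missing_settings(settings, required=("language_codes",)):
    """Return the required keys that are absent from the settings dict."""
    return [key for key in required if key not in settings]

# Hypothetical settings that lack the key the converter looks up.
corpus_settings = {"corpus_name": "example"}

missing = missing_settings(corpus_settings)
if missing:
    print("Missing settings:", ", ".join(missing))
```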

Listing lemmas with gram. features in corpora without alt. analyses

When searching for a lemma with specified gram. tags, e.g. part of speech ("v"), the result table currently lists frequency counts for all word forms with the given lemma, disregarding the features.

E.g. imagine real frequencies "kaja" 'v' 120, "kaja" 'n' 15.
Results now:
Searching 'v' > "kaja" 135
Searching 'n' > "kaja" 135
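
The difference between the current and the expected counts can be reproduced with the numbers from this example. This is only an illustration of the counting logic, not of how tsakorpus computes frequencies:

```python
from collections import Counter

# The example frequencies from above: 120 verbal and 15 nominal tokens of "kaja".
tokens = [("kaja", "v")] * 120 + [("kaja", "n")] * 15

# Counting by lemma alone merges both parts of speech (current behaviour).
by_lemma = Counter(lemma for lemma, _pos in tokens)

# Counting by (lemma, part of speech) keeps them apart (expected behaviour).
by_lemma_pos = Counter(tokens)

print(by_lemma["kaja"])             # 135
print(by_lemma_pos[("kaja", "v")])  # 120
print(by_lemma_pos[("kaja", "n")])  # 15
```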
