timarkh / tsakorpus

Yet another search platform for linguistic corpora.

License: MIT License

Python 52.18% Batchfile 0.07% JavaScript 11.64% CSS 5.64% HTML 30.45% Shell 0.02%

corpus corpus-linguistics corpus-tools elasticsearch flask language-documentation linguistic-corpora linguistics media-aligned-corpora parallel-corpora

tsakorpus's Issues

Using "multiple_choice_fields" option for sentence-level metadata

The Corpus configuration section of the documentation says:

multiple_choice_fields (dictionary) – describes tag selection tables for word-level fields other than Grammar or Gloss and sentence-level metadata fields. Keys are field names, values are structured in the same way as gramm_selection above.

This is ambiguous: is it possible after all to attach tag selection tables to the search fields for sentence-level metadata (for example, in the case of span annotations)?

Bug in a parallel corpus

When I do the following:

click on the "more fields" button
then click on the "add word" button

the drop-down list with language tiers is placed right over the text box.

Add "Copy" button when sharing a query

Would be nice to have a "Copy" button to copy the query code to clipboard.

Make additional fields work when loading a query

As of now, additional fields (e.g. in INEL: Lex. gloss, Borrowing) are included when sharing a query, but discarded when loading it.

UPD: it actually works ok when loading a query with the load button, but doesn't work when pasting into the address bar.
E.g. when opening this link:

https://inel.corpora.uni-hamburg.de/KamasCorpus/search?n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any&lang1=kamas&precise=on&page_size=10&sort=random&distance_strict=on&input_method=normal&

only the Gram. tag value (n) is kept in the query, while the Borrowing field (RUS:cult) is left blank.
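For reference, the link above does carry the Borrowing value in its query string. The sketch below, using only Python's standard `urllib.parse`, shows what a loader would have to pick up. The helper names `parse_query` and `extra_fields` and the `STANDARD` set of parameter names are assumptions made for illustration, not tsakorpus code.

```python
from urllib.parse import urlsplit, parse_qs

URL = ("https://inel.corpora.uni-hamburg.de/KamasCorpus/search?"
       "n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any"
       "&lang1=kamas&precise=on&page_size=10&sort=random"
       "&distance_strict=on&input_method=normal&")

# Assumption: base names of the parameters a loader already handles.
STANDARD = {"n_words", "random_seed", "gr", "n_ana", "lang", "precise",
            "page_size", "sort", "distance_strict", "input_method"}


def parse_query(url):
    """Decode the query string into a flat {name: value} dictionary."""
    params = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in params.items()}


def extra_fields(params, standard=STANDARD):
    """Return parameters whose name, minus the word number, is not standard."""
    return {name: value for name, value in params.items()
            if name.rstrip("0123456789") not in standard}


if __name__ == "__main__":
    print(extra_fields(parse_query(URL)))
```

Any parameter returned by `extra_fields` is a candidate additional field that would need to be restored in the query form when the page is opened from the address bar.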

Add "(de)select all / invert selection" buttons to Subcorpus selection

Add "(de)select all / invert selection" buttons to Subcorpus selection / Choose from a list dialogue.

Include additional fields in query history

Include additional fields (e.g. in INEL: Lex. gloss, Borrowing, etc.) in the query history window; without them the history is not very helpful.

Include parallel translations when copying an example

When copying a single example, include all parallel fragments (in INEL: sentence translations, comments, alt. orthographies).

Example of a web interface running on Tsakorpus

Hello, Timofey! Is there an online example running on the Tsakorpus platform where one could test the search capabilities?

UPD: Found it: https://tsakorpus.readthedocs.io/en/latest/examples.html

Include corpus URL when sharing a query

When sharing a query, add an option to also include the corpus URL.
(in our case, prepend e.g. "https://inel.corpora.uni-hamburg.de/KamasCorpus/search?")

Variables for multiword queries

A wish: introduce variables for multiword queries, to say "same value as in word N" or "different value than in word N".

E.g. to search for two consecutive forms of the same lemma, or pair words with repeating inflectional affixes.
Or to find adjectives agreeing with the noun in case or gender.
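To make the request concrete, here is a minimal Python sketch of the "same value as in word N" constraint for two consecutive words. It works on an in-memory list of tokens rather than on a real search index; the function name `same_value_pairs` and the toy token dictionaries are hypothetical.

```python
def same_value_pairs(tokens, field):
    """Return index pairs (i, i + 1) of adjacent tokens that share `field`.

    Tokens that lack the field are never matched.
    """
    pairs = []
    for i in range(len(tokens) - 1):
        left = tokens[i].get(field)
        right = tokens[i + 1].get(field)
        if left is not None and left == right:
            pairs.append((i, i + 1))
    return pairs


if __name__ == "__main__":
    # Toy sentence, invented for illustration only.
    sentence = [
        {"wf": "go", "lemma": "go", "case": None},
        {"wf": "went", "lemma": "go", "case": None},
        {"wf": "home", "lemma": "home", "case": "nom"},
    ]
    print(same_value_pairs(sentence, "lemma"))
```

The "different value than in word N" variant would be the same loop with the equality test negated.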

Add parameter for a regex to tokenize kw fields

We've got customized tokenization for text fields like lemma, and it works fine (e.g. no tokenization by dot).
Now a similar setting is needed for keyword fields like INEL SeR, SyF etc. (e.g. to force tokenizing by space).
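The requested behaviour might look roughly like the following Python sketch: a keyword value is split by a configurable regular expression, here whitespace, so that dots inside a value are left alone. The function name `tokenize_keyword`, the `split_regex` parameter and the sample value are hypothetical and do not correspond to an existing tsakorpus setting.

```python
import re


def tokenize_keyword(value, split_regex=r"\s+"):
    """Split a keyword field value on `split_regex`, dropping empty pieces."""
    return [piece for piece in re.split(split_regex, value.strip()) if piece]


if __name__ == "__main__":
    # Made-up value: the dots must survive, the spaces must split.
    print(tokenize_keyword("a.b  c.d"))
```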

Add an option to un-randomize results

For smaller corpora / low number of results, i.e. when the user can have an exhaustive list of hits in the corpus, it is sometimes very useful to have them presented in an ordered sequence.
I.e. sort by text ID and, within a single text, by order of appearance.
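The ordering described above amounts to sorting on a two-part key. A minimal Python sketch, assuming each hit exposes a text identifier and its position within that text; the field names `text_id` and `sent_index` are hypothetical:

```python
def ordered_hits(hits):
    """Sort hits by text ID, then by order of appearance within the text."""
    return sorted(hits, key=lambda hit: (hit["text_id"], hit["sent_index"]))


if __name__ == "__main__":
    # Toy hits in the arbitrary order a randomized search might return.
    hits = [
        {"text_id": "T2", "sent_index": 5},
        {"text_id": "T1", "sent_index": 9},
        {"text_id": "T2", "sent_index": 1},
        {"text_id": "T1", "sent_index": 3},
    ]
    for hit in ordered_hits(hits):
        print(hit["text_id"], hit["sent_index"])
```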

Interaction of 'position in a sentence' and 'distance to word' parameters

If a position in the sentence is specified for the first word form and a distance from the first word is set for the second one, the distance is no longer taken into account. The position in the sentence still works.
We checked this on the Adyghe and Chukchi corpora.

Regular expressions for 'gramm_shortcuts'

Regular expressions do not work for the 'gramm_shortcuts' parameter in corpus.json. They were probably never meant to work there, but this is set up very conveniently for the 'gloss_shortcuts' parameter, and it would be great if regexes could be used for grammatical tags as well.

For example, we would like the tag "A" to find only those A that do not carry the NtoV label, and a tag such as "A+" to find all A without restrictions (so NtoV would be included in those results). The first rule in the screenshot below does not work, while the second one works fine.

Rules with separators do not work either, for example this one:

'language_codes'. xml_flex2json.py error

Hi! I've run into a problem:

% python xml_flex2json.py
Traceback (most recent call last):
  File "xml_flex2json.py", line 445, in <module>
    x2j.process_corpus()
  File "../src_convertors/txt2json.py", line 314, in process_corpus
    curTokens, curWords, curAnalyzed = self.convert_file(fnameSrc, fnameTarget)
  File "xml_flex2json.py", line 430, in convert_file
    textJSON['sentences'] = [s for sNode in interlinear.xpath('./paragraphs/paragraph/phrases/phrase | '
  File "xml_flex2json.py", line 432, in <listcomp>
    for s in self.process_se_node(sNode)]
  File "xml_flex2json.py", line 387, in process_se_node
    re.sub('-.*', '', element.attrib['lang']) in self.corpusSettings['language_codes']:
KeyError: 'language_codes'

What can be wrong?
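The last frame of the traceback shows the converter reading `self.corpusSettings['language_codes']`, so the key appears to be absent from the settings the converter loaded. A small Python sketch of a check one could run on the settings before converting; the function name `missing_settings` and the sample dictionaries are hypothetical and not part of tsakorpus, and the value the key should hold is not covered here.

```python
def missing_settings(settings, required=("language_codes",)):
    """Return the required keys that are absent from the settings dictionary."""
    return [key for key in required if key not in settings]


if __name__ == "__main__":
    # Invented settings dictionary lacking the key named in the KeyError.
    settings = {"corpus_name": "example"}
    print(missing_settings(settings))
```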

Listing lemmas with gram. features in corpora without alt. analyses

When searching for a lemma with specified gram. tags, e.g. part of speech ("v"), the result table currently lists frequency counts for all word forms of the given lemma, disregarding the features.

E.g. imagine real frequencies "kaja" 'v' 120, "kaja" 'n' 15.
Results now:
Searching 'v' > "kaja" 135
Searching 'n' > "kaja" 135
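The expected counting can be sketched in a few lines of Python: frequencies are computed only over tokens whose analysis matches the requested tag. The function name `lemma_freq` and the token structure are assumptions for illustration; the numbers are the imagined frequencies from the example above.

```python
from collections import Counter


def lemma_freq(tokens, pos=None):
    """Count lemmas, optionally restricted to tokens with the given POS tag."""
    return Counter(tok["lemma"] for tok in tokens
                   if pos is None or tok["pos"] == pos)


if __name__ == "__main__":
    # Imagined corpus: "kaja" occurs 120 times as 'v' and 15 times as 'n'.
    tokens = ([{"lemma": "kaja", "pos": "v"}] * 120
              + [{"lemma": "kaja", "pos": "n"}] * 15)
    print(lemma_freq(tokens, "v")["kaja"])  # count restricted to 'v'
    print(lemma_freq(tokens, "n")["kaja"])  # count restricted to 'n'
    print(lemma_freq(tokens)["kaja"])       # unrestricted count
```

With the filter applied, a search for 'v' would report 120 and a search for 'n' would report 15, instead of 135 in both cases.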
