timarkh / tsakorpus
Yet another search platform for linguistic corpora.
License: MIT License
The Corpus configuration section of the documentation says:
multiple_choice_fields (dictionary) – describes tag selection tables for word-level fields other than Grammar or Gloss and sentence-level metadata fields. Keys are field names, values are structured in the same way as gramm_selection above.
This is ambiguous: is it actually possible to attach tag selection tables to the sentence-level metadata search fields (for example, in the case of span annotations)?
Would be nice to have a "Copy" button to copy the query code to clipboard.
As of now, additional fields (e.g. in INEL: Lex. gloss, Borrowing) are included when sharing a query, but discarded when loading it.
UPD: it actually works ok when loading a query with the load button, but doesn't work when pasting into the address bar.
E.g. when opening this link:
https://inel.corpora.uni-hamburg.de/KamasCorpus/search?n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any&lang1=kamas&precise=on&page_size=10&sort=random&distance_strict=on&input_method=normal&
only the Gram. tag=n is kept in the query but the Borrowing field=RUS:cult is blank.
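To see what the pasted link actually carries, its query string can be decoded with Python's standard library. This is only a sketch for inspecting the URL above; it does not touch Tsakorpus itself and does not show where the field gets dropped.

```python
from urllib.parse import urlsplit, parse_qs

# The link from the report, split across lines for readability.
url = (
    "https://inel.corpora.uni-hamburg.de/KamasCorpus/search?"
    "n_words=1&random_seed=441808&gr1=n&BOR1=RUS%3Acult&n_ana1=any"
    "&lang1=kamas&precise=on&page_size=10&sort=random"
    "&distance_strict=on&input_method=normal&"
)

# parse_qs percent-decodes the values and maps each name to a list of values.
params = parse_qs(urlsplit(url).query)

print(params["gr1"])   # the Gram. tag field
print(params["BOR1"])  # the Borrowing field
```

Both fields are present in the URL, so the Borrowing value is lost somewhere after the address is opened rather than when the link is built.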
Add "(de)select all / invert selection" buttons to Subcorpus selection / Choose from a list dialogue.
When copying a single example, include all parallel fragments (in INEL: sentence translations, comments, alt. orthographies).
Hello, Timofey! Is there an online example running on the Tsakorpus platform where the search features can be tried out?
UPD: Found it: https://tsakorpus.readthedocs.io/en/latest/examples.html
When sharing a query, add an option to also include the corpus URL.
(in our case, prepend e.g. "https://inel.corpora.uni-hamburg.de/KamasCorpus/search?")
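A minimal sketch of what the requested option would produce, assuming the shared query is available as a mapping of field names to values. The function name `share_url` and the dictionary are hypothetical and are not part of Tsakorpus.

```python
from urllib.parse import urlencode

def share_url(corpus_search_url, query_fields):
    """Prepend the corpus search URL to an encoded query (hypothetical helper)."""
    # Drop a trailing "?" so that exactly one separator ends up in the result.
    return corpus_search_url.rstrip("?") + "?" + urlencode(query_fields)

full_link = share_url(
    "https://inel.corpora.uni-hamburg.de/KamasCorpus/search?",
    {"gr1": "n", "BOR1": "RUS:cult"},
)
print(full_link)
```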
A wish: introduce variables for multiword queries, to say "same value as in word N" or "different value than in word N".
E.g. to search for two consecutive forms of the same lemma, or pair words with repeating inflectional affixes.
Or to find adjectives agreeing with the noun in case or gender.
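The "same value as in word N" constraint can be illustrated outside the search engine with a small sketch. The token dictionaries, the field names and the function `adjacent_pairs_with_same` are made up for illustration; this is not how Tsakorpus stores or queries words.

```python
def adjacent_pairs_with_same(tokens, field):
    """Return index pairs (i, i + 1) of consecutive tokens sharing a value of `field`."""
    pairs = []
    for i in range(len(tokens) - 1):
        left = tokens[i].get(field)
        right = tokens[i + 1].get(field)
        # Tokens that lack the field never match each other.
        if left is not None and left == right:
            pairs.append((i, i + 1))
    return pairs

# Toy sentence: two consecutive forms of the same lemma, then a different word.
sentence = [
    {"wf": "ran", "lemma": "run"},
    {"wf": "runs", "lemma": "run"},
    {"wf": "home", "lemma": "home"},
]
same_lemma = adjacent_pairs_with_same(sentence, "lemma")
print(same_lemma)
```

The same comparison on a field such as case or gender would cover the agreement example, and negating it would give "different value than in word N".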
We've got customized tokenization for text fields like lemma and it works fine (e.g. no tokenization by dot)
Now a similar setting is needed for keyword fields like INEL SeR, SyF etc. (e.g. to force tokenizing by space)
For smaller corpora / low number of results, i.e. when the user can have an exhaustive list of hits in the corpus, it is sometimes very useful to have them presented in an ordered sequence.
I.e. sort by text ID and, within a single text, by order of appearance.
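The requested ordering amounts to a two-level sort key. A sketch, assuming each hit exposes a text identifier and the position of the sentence within its text; the field names `text_id` and `sent_index` are hypothetical.

```python
# Made-up hits; only the two sort fields are shown.
hits = [
    {"text_id": "t2", "sent_index": 5},
    {"text_id": "t1", "sent_index": 12},
    {"text_id": "t1", "sent_index": 3},
]

# Sort by text first, then by order of appearance inside the text.
ordered = sorted(hits, key=lambda h: (h["text_id"], h["sent_index"]))

for hit in ordered:
    print(hit["text_id"], hit["sent_index"])
```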
If a position in the sentence is specified for the first word form and a distance from the first word is set for the second one, the distance is no longer taken into account. The position in the sentence still works.
We checked this on the Adyghe and Chukchi corpora.
Regular expressions do not work for the 'gramm_shortcuts' parameter in corpus.json. They were probably never meant to work there, but this is implemented very conveniently for the 'gloss_shortcuts' parameter, and it would be great if regular expressions could be used for grammatical tags as well.
For example, we would like the tag "A" to find only those A that do not carry the NtoV label, while a tag such as "A+" would find all A without restrictions (i.e. NtoV would be included in those results). The first rule in the screenshot below does not work, while the second one works fine.
Rules with separators do not work either, for example this one:
Hi! I've run into a problem:
% python xml_flex2json.py
Traceback (most recent call last):
  File "xml_flex2json.py", line 445, in <module>
    x2j.process_corpus()
  File "../src_convertors/txt2json.py", line 314, in process_corpus
    curTokens, curWords, curAnalyzed = self.convert_file(fnameSrc, fnameTarget)
  File "xml_flex2json.py", line 430, in convert_file
    textJSON['sentences'] = [s for sNode in interlinear.xpath('./paragraphs/paragraph/phrases/phrase | '
  File "xml_flex2json.py", line 432, in <listcomp>
    for s in self.process_se_node(sNode)]
  File "xml_flex2json.py", line 387, in process_se_node
    re.sub('-.*', '', element.attrib['lang']) in self.corpusSettings['language_codes']:
KeyError: 'language_codes'
What can be wrong?
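The last frame of the traceback shows that the dictionary `self.corpusSettings` has no `language_codes` key. A quick way to confirm this before running the converter is to check the loaded settings for the key. The sketch below assumes the settings are plain JSON loaded into a dictionary; the helper `missing_settings` and the sample JSON string are made up, and the expected shape of the `language_codes` value is not covered here.

```python
import json

def missing_settings(settings, required_keys):
    """Return the required keys that are absent from the settings dictionary."""
    return [key for key in required_keys if key not in settings]

# Stand-in for the real settings file, which is not shown in the report.
settings = json.loads('{"corpus_name": "demo"}')

missing = missing_settings(settings, ["language_codes"])
if missing:
    print("Missing settings:", ", ".join(missing))
```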
When searching for a lemma with specified gram. tags, e.g. part of speech ("v"), the result table currently lists frequency counts for all word forms with the given lemma, regardless of the tags specified in the query.
E.g. imagine real frequencies "kaja" 'v' 120, "kaja" 'n' 15.
Results now:
Searching 'v' > "kaja" 135
Searching 'n' > "kaja" 135
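The difference between the reported and the expected counts comes down to the counting key. A sketch using the imagined frequencies from above; the data layout is an assumption for illustration and does not reflect how Tsakorpus computes its frequency tables.

```python
from collections import Counter

# Imagined occurrences: "kaja" as a verb 120 times, as a noun 15 times.
occurrences = [("kaja", "v")] * 120 + [("kaja", "n")] * 15

# Counting by lemma alone merges both parts of speech (current behaviour).
by_lemma = Counter(lemma for lemma, pos in occurrences)

# Counting by (lemma, part of speech) keeps them apart (expected behaviour).
by_lemma_and_pos = Counter(occurrences)

print(by_lemma["kaja"])
print(by_lemma_and_pos[("kaja", "v")], by_lemma_and_pos[("kaja", "n")])
```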