lexedata / lexedata Goto Github PK

View Code? Open in Web Editor NEW

3.0 4.0 5.0 43.7 MB

Lexical Data Editing tools

Home Page: https://lexedata.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 99.70% TeX 0.08% Makefile 0.10% Batchfile 0.12%

cldf wordlist comparative-linguistics lexicon cognates2nexus phylogenetics dataset-interface

lexedata's Introduction

lexedata

Lexedata is an open source Python package that includes a collection of command line tools to edit comparative lexical data in the CLDF format (https://doi.org/10.1038/sdata.2018.205), which adds an ontology of terms from comparative linguistics (https://cldf.clld.org/v1.0/terms.html) to the W3C “CSV for the Web” (csvw, https://w3c.github.io/csvw/) standard. The package includes tools to run batch-editing operations as well as tools to convert CLDF to and from other data formats in use in computational historical linguistics.

The documentation for Lexedata can be found on ReadTheDocs.IO.

The package is available from PyPI. If you have a recent Python installation, you can easily install it using

pip install lexedata

More detailed installation instructions can be found in the documentation.

Lexedata was developed from the practical experience of editing several specific different lexical data sets, but generalizes beyond the specific datasets. The aim is to be helpful for editors of other lexical datasets, and as such we consider lacking documentation and unexpected, unexplained behaviour a bug; feel free to raise an issue on Github if you encounter any.

lexedata's People

Contributors

Stargazers

Watchers

Forkers

sellisd ujlbu4 xrotwang tpellard szhorvat

lexedata's Issues

non-uniform behavior of enrich.segment_using_clts

This is for the some-hacks-branch so maybe it is already fixed in melvin's branch. I just don't dare switch right now, sorry!
after the importation of all new data in the Arawak dataset, I run the segmenter and I got the following warnings. From the warning it is not clear what is being done. I expected that the segmentation failed, but instead, when I checked there were a number of different outcomes.
Segments
WARNING:root:Unknown sound '' in form '-p/ po' (segment #3) - the segment / was skipped (I am asking Lev what has happened here, probably alternative forms)
WARNING:root:Unknown sound ''' in form 'kasa'' (segment #5) - this and the following one have a glottal stop at the end? (again I am asking Lev). However, here the segment is not skipped but added as a separate segment to the segments.
WARNING:root:Unknown sound ''' in form 'úku'u' (segment #4)
WARNING:root:Unknown sound 'áʰ' in form 'ipitáʰnasi' (segment #5) - this and all the following ones have a superscript h. It has been segmented with the preceding vowel in the segments column.
WARNING:root:Unknown sound 'áʰ' in form 'uráʰnasi' (segment #3)
WARNING:root:Unknown sound 'eʰ' in form 'hiaseʰni' (segment #5)
WARNING:root:Unknown sound 'áʰ' in form 'dáʰmikasi' (segment #2)
WARNING:root:Unknown sound 'uʰ' in form 'haɻuʰmétata' (segment #4)
WARNING:root:Unknown sound 'iʰ' in form 'nipiʰɻe' (segment #4)
WARNING:root:Unknown sound 'uʰ' in form 'buʰɻaka' (segment #2)
WARNING:root:Unknown sound 'uʰ' in form 'puʰmátakasi' (segment #2)
WARNING:root:Unknown sound 'aʰ' in form 'saʰmétakasi' (segment #2)
WARNING:root:Unknown sound 'eʰ' in form 'háseʰnisi' (segment #4)
WARNING:root:Unknown sound 'iʰ' in form 'uníjapiʰɻe' (segment #7)
WARNING:root:Unknown sound 'éʰ' in form 'mabéʰniɻi' (segment #4)
WARNING:root:Unknown sound 'aʰ' in form 'jaʰnébúɻu' (segment #2)
WARNING:root:Unknown sound 'áʰ' in form 'pabuáʰɻeɻi' (segment #5)
WARNING:root:Unknown sound 'iʰ' in form 'kabiʰɻimi' (segment #4)
WARNING:root:Unknown sound 'aʰ' in form 'daʰmitukáɻusi' (segment #2)
WARNING:root:Unknown sound 'aʰ' in form 'daʰmítukasi' (segment #2)
WARNING:root:Unknown sound 'iʰ' in form 'siʰmákasi' (segment #2)
WARNING:root:Unknown sound 'aʰ' in form 'kitaʰwasi' (segment #4)
WARNING:root:Unknown sound 'aʰ' in form 'nitaʰni' (segment #4)
WARNING:root:Unknown sound 'aʰ' in form 'jaʰmáɻakasi' (segment #2)
WARNING:root:Unknown sound 'aʰ' in form 'jaʰne' (segment #2)
WARNING:root:Unknown sound 'aʰ' in form 'jaʰne' (segment #2)

Give error message and refer to segmenter script, instead of traceback, in edictor exporter

python -m lexedata.exporter.to_edictor
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gereon/Develop/lexedata/src/lexedata/exporter/to_edictor.py", line 211, in <module>
    forms_to_tsv(
  File "/home/gereon/Develop/lexedata/src/lexedata/exporter/to_edictor.py", line 60, in forms_to_tsv
    c_form_segments = dataset["FormTable", "segments"].name
  File "/home/gereon/.local/etc/lexedata/lib/python3.9/site-packages/pycldf/dataset.py", line 610, in __getitem__
    raise KeyError(column)
KeyError: 'segments'

error message question

I am not sure why this error is popping up. the form in question is present in the same language and source, but in a different concept (few). Maybe it is just that the version you imported is older?

Failed to find object {'cldf_languageReference': 'guaja', 'cldf_value': '<makamututuhũ>(para ref. contáveis){2}(FEW)', 'orthographic': 'makamututuhũ', 'cldf_comment': 'para ref. contáveis\t(FEW)', 'variants': []} in the database. Skipped. In cell: <Cell 'Numbers, Body Parts, Food, Anim'.AP7>.

concepts for cognatesets overwrite existing

the functionality to add concepts to cognatesets needs a switch to overwrite existing concept references

Test both #parameterReference with a separator and without a separator.

We have some tests for this in places already, but we should explicitly test that every one of our scripts works with single-concept forms and with multi-concept forms, and add understandable error messages where a script doesn't.

This is one of the cases of #63

simplify names when running exporter.phylogenetics

It would be good to strip diacritics, spaces, brackets and stuff like that from language names when exporting matrices. many phylogenetic programs have problems with those.

When guessing core concepts, attach concept IDs (i.e. valid foreign keys)

When guessing the core concept for a cognateset, there are several issues.

This functionality should also, in the minimum possible, work when the CLICS subgraph centrality cannot be used for whatever reason.
All cognatesets with missing concepticon are currently glued to the concept without concepticon link encountered last. Probably a side-effect of my quick fix of that script to work around the previous point, but @PaprikaSteiger can easily do something better and less hackish.
The concepts added to the cognatesets should be identified by ID, not by name.
There should be a test for that behaviour.
The script needs a command line switch whether to override existing concepts, in the same manner as #78

Open usage issues

Originally posted in ##64

excelsinglewordlist.py: (...) I am not sure here if this script should directly make the singleton cognatesets or if the exporter should as we have it now. The reason I am thinking of this is because we want cldf to be valid after a script. So, do we allow forms that are not in cognate sets (within the "policed" concepts) to go without a warning?
guess_concepticon.py: (...) I would also like to add here that based on my concept linking with MG, I think for the verification to work with minimal looking up of concepticon, it needs to pull in three things: the concept set ID, the name and the definition. A follow up we could think about would be an option to update name and definition based on the ID for updating the dataset (I don't even know if such updates are common enough, Gereon?)
segments_using_clts.py: I am not sure that we want an update here by default. I am imagining a case where you import a bunch of stuff at different times, then you segment so you can export, and then everything becomes "segmented". Assuming the segmenter is solid, I think segmentation is largely hidden from the user and should remain so. I think I would just add a thing in the manual saying if you have weird stuff in your unified forms, then check that the segmentation is correct but that's all. Or maybe this one should be additive and not replace a previous status.
Currently, no other scripts calls segment_using_clts.py. When should another script check if a Segments column exists and possibly create one?

segment slice with different separator?

I don't know if we can, but the : separator for the segment slice is messed up by excel and google docs. They think it's time or something like that.

comments accumulate signatures -lexedata.exporter

Due to the handling of comments by xlsx (it adds as a signature the name of the person who wrote the comment), comments over multiple loops accumulate signatures.
examples ALOUATTA SP.9 in Garifuna
ANACARDIUM SP.21 in Achagua

concepticon guess enhancement

It would be nice to add an extra column with the concepticon concept set in addition to the concepticon ID. This would help with manual inspection for errors in the matching.

Write lists of segments to file, not dot-separated string.

In addition to overriding existing segments (#78), segment_using_clts currently puts separator dots into the segment column.

Instead, it should hand the CLDF writer a list of segments (in their string representation, of course) and let the writer worry about putting the metadata-specified separator (probably " ") between the segments.

Test with non-standard datasets

We need to test various sub-tools with valid, but non-standard datasets as well as with invalid, but human-comprehensible datasets, probably constructing tiny additional test datasets.

lexedata.exporter.cognates expects that the CognatesetTable has a #parameterReference column. Test with a dataset without that.
Test a metadata-free dataset, concepts are concepticon IDs, languages are glottocodes. How many functions can we get to work with that?
Test with a dataset that has #cognatesetReference in the FormTable.
Test both #parameterReference with a separator and without a separator.
Test dataset with missing sources.bib and dataset with invalid sources.bib
Test with dataset where a inconsequential column ('ignored', let's say) is “required’ according to the metadata, but missing an entry.
Test with a ridiculously column-rich (including different data types – int/float/str, separators) and a ridiculously column-starved CognateTable (maybe even #id missing?), to see how they interact with Excel.

More robust IPA-Tokenizer

I have been working with some other transcription stuff in the last few days and I have made my version of the segmenter more robust, in particular with respect to pre-nasalization and pre-aspiration. It could maybe replace the current segmenter in segment_using_clts.py

Originally posted by @Anaphory in #70 (comment)

treatment of superscript segments

The original Arawak data have a number of superscript segments in the unified form that have become regular segments in the dataset. Why is this? How could we recover them?
Also, is the segmentation of somehting like tʰ supposed to be t h ?
an example is the form for bow in Tariana. (form id 29411)
the newly imported data however have no problem (e.g. Bare new data have superscript h, while the old data don't). However, Warekena Velha (Guainia) has vowels followed by superscript h that throw a warning. They are segmented as a unit though in the segments.

lexedata validate some more

In addition to cldf validate, there are some more assumptions we make, which a validate script should check:

The primary key of a table is its #id.
Every #reference (#parameterReference, #languageReference, #cognatesetReference) corresponds to a foreign key relationship
#segments, #segmentSlice and #alignment correspond to each other in the obvious relationship. #segmentSlice is a valid python slice (i.e. "t e s t" has indices 1:4) into segments.
Empty forms may exist, but only if there is no actual form for the concept and the language, and probably given some other constraints.
All files should be in NFC normalized unicode

There are probably a few others, let's collect them here.
Many of them are not just tests, but could even be fixed automatically.

Regression test: Check that a cell with ‘kʲa’ is not imported as ‘kja’.

In #30, I said:

This should probably get a regression test, checking that a 'kʲa' is not mangled to a 'kja' upon import.

That test has not been added yet.

Change segment indexing

In #81, we suggest that indices should be zero-based non-inclusive. This is not as intended by cldf:

https://github.com/cldf/cldf/pull/105/files/78a24b8841ba4827e71be0d6f92daa2c35449df8#diff-991442876801ef87377a21e16bb9027d7b46142c730e4cdd2ce27619ba485f36R29-R32

states that implicitly indices should be 1-based and inclusive, so t e s t should have indices 1:4, not 0:4.

phylogenetics exporter needs to take into account primary concept

at least for cases where cross-concept cognate sets are present. But it may be useful also to verify root-concept sets as well.

Code Coverage

Get one of the code coverage (line & branch)

bug when searching for a form in the database?

Failed to find object {'cldf_languageReference': 'tocantins_asurini', 'cldf_value': '<omónem>(put on (clothes))', 'orthographic': 'omónem', 'cldf_comment': 'put on (clothes)', 'variants': []} in the database. Skipped. In cell: <Cell 'Verbs'.Z536>.

I have found a number now of these errors that seem wrong. The form in question seems to be identical and present in both spreadsheets in the correct language and source. I am pretty sure that this one cannot have been fixed in the meantime as it is in the verb tab and the few things i fixed were in the beginning.

segment_using_clts: Override only when requested.

The lexedata.enrich.segment_using_clts script should overwrite existing segments only when explicitly requested to do so.

The command line switch to do so should be named exactly as the one for the same functionality in other scripts.

LOAN tag is converted to L;O;A;N after the loop

I think this is due to the possibility of having multiple tags separated by ;. Somehow it thinks that LOAN is multiple tags. Maybe because it is in all caps?

segmenter warning log

it would be nice for the segmenter warning log to have more info to locate the form, such as line number or id

more info when importing new data

It would be nice for the importer of the single wordlist to give a report including:
how many new rows it read and what it did:

X were new and imported
Y were already existed and skipped
Z were new concepts of existing forms and added
A had problems and would correspond to warnings above

Create get_row_header function

This bunch

    row_header = []
    for (header,) in ws_test.iter_cols(
        min_row=1,
        max_row=1,
        max_col=len(dataset["CognatesetTable"].tableSchema.columns),
    ):
        column_name = header.value
        if column_name is None:
            column_name = dataset["CognatesetTable", "id"].name
        elif column_name == "CogSet":
            column_name = dataset["CognatesetTable", "id"].name
        try:
            column_name = dataset["CognatesetTable", column_name].name
        except KeyError:
            break
        row_header.append(column_name)

from the cognates importer and the corresponding loop tests should be a function.

Existing concepticon connections should not be overidden

If a concept has a concepticon connection already, we should make sure that it's not overridden when the guesser is run again.

plus signs for morpheme boundaries are problematic in the beginning of forms in google sheets

bow (n.)12.2 in Baniwa is an example of this. it becomes an error in googlesheets (probably thinks it is a formula of sorts)

Vague idea: Write metadata file for metadata-free forms.csv

Maybe we can at some point write a script that does some educated guesses to create the ‘free’ metadata file for a metadata-free forms.csv.

Code duplication in central concept guesser for cognatesets

lexedata.enrich.guess_concept_for_cognateset duplicates code between its ConceptGuesser and its __main__ functionality. What still needs to be changed can be changed internally to that file I hope, and the functionality can also be obtained by the new exporter functionality, so this is a code beauty thing, not a functionality thing.

concepticon connection script unexpected behavior

after linking the MG list and also Vilacy's Tupian wordlist to concepticon, we have a couple of observations:

it seems that the matching is based on the concept IDs (is this MG specific because there are no english glosses?) I had to make Vilacy's glosses into an ID (without spaces or diacritics) in order for the script to work. Maybe the default option should be something different?
it seems that two word concepts are not mapped. e.g. be born, be able to, be pregnant etc
would it be possible to return candidate mappings when there is not a unique match? right now it seems too strict.

Untested dead code

There is currently no code going to this case, and no use case to reach it either.
https://github.com/Anaphory/lexedata/blob/ff533fed6d01da5c2439fd38e48de6f29ee02a58/src/lexedata/importer/fromexcel.py#L380

Match function can match objects missing a property that we require.

https://github.com/Anaphory/lexedata/blob/6c278c18e80575355f61e226f128d67bdad17e99/src/lexedata/importer/fromexcel.py#L154-L157

Restructured match function potentially matches objects missing a property that we require.
If our cell parser was slightly dodgy, <or> (with no phonemic transcription given) would match /and/. That sounds very bug-prone to me.
Did you ever encounter a KeyError ln line 157? Where would that come from, and can we complain about that error earlier? Because if there is a KeyError, something is fundamentally wrong with the dataset to be imported, and we should not sweep that error under the rug.

Originally posted by @Anaphory in #39 (comment)

Yes, there were KeyErrors, the test script doesn't run with the old structure. I get:

>   all(match(properties[p], object[p]) for p in properties)]
E   KeyError: 'ID'
..\src\lexedata\importer\fromexcel.py:154: KeyError

Which I don't really understand.
Originally posted by @PaprikaSteiger in #39 (comment)

help of enrich.segment_using_clts

the syntax of the command seems different from the others
It mentions linking to concepticon, this is wrong I think
it mentions wordlist, while it should be metadata file or something like that

concepts for cognatesets: overwrite existing

the functionality to add concepts to cognatesets needs a switch to overwrite existing concept references

comments not showing up after re-export of cognate sets

Only 5 comments show up in the latest Arawak export and all of them were initially notes in the google doc.
I checked that both google notes and comments make it to the xlsx file that is downloaded, but then only original notes make it out from the exporter.
so it should be the thing reading xlsx comments that is not working on all of them.
A good test case is ARM1 that has both a note and a comment:
In the google spreadsheet: the note is "an ana~tana set. -- LDM". The comment has essentially identical text "AR O: an ana~tana set. -- LDM", it has been resolved and then reopened (all by Lev).
In the downloaded xlsx file: both note and comment are merged as following:

In the xlsx file coming out of the exporter, only the original note "an ana~tana set. -- LDM" is present, without any trace of the comment and its history.

enable NA forms

Lokono has a bunch of them in the Arawak dataset. right now they are exported in singleton cognate sets with { } forms.

Be less forgiving about mismatching delimiters

Do not import a form when the brackets mismatch, and say that also in the warning. https://github.com/Anaphory/lexedata/blob/b814a35095991fd333adf18f79fd45f57d2d8977/src/lexedata/importer/cellparser.py#L126-L131

Add functionality to add new data to existing dataset

We need functionality to add new data to an existing dataset.
It should add new forms, report the overlap, and optionally remove old forms that match some condition if there is a new form in their slot (eg. our '...' forms)

Test whether format guesser and other readline dependent scripts now run on Mac

If necessary, wrap imports in try.

triplicate error messages

a mismatching delimiter error seems to cause three separate error messages:

one without sheet/cell information
one with sheet/cell info
one that is about the source (the mismatching delimiter in question was on the source)

Is this expected behavior, or we could simplify?

Here is an example:
In form <che sy'y>(MZ){Pereira1994:68): Encountered mismatched closing delimiter )
Lexical List.J299: In form <che sy'y>(MZ){Pereira1994:68): Element {Pereira1994:68) had mismatching delimiters
In source {Pereira1994:68): Closing bracket '}' is missing, split into source and page/context may be wrong

Unknown sound report misses brackets

We just ran the CLTS segmenter on some Arawak data, and we found that in the report, parentheses are not counted:

WARNING:root:In form 506 (line 38): Unknown sound V́ː encountered in -V́:jada
WARNING:root:In form yavitero_current_in_water (line 337): Unknown sound ) encountered in kawi (weni)
WARNING:root:In form yavitero_current_in_water (line 337): Unknown sound ( encountered in kawi (weni)
WARNING:root:In form yavitero_fish_with_hook_and_line (line 783): Unknown sound ) encountered in muta(ta)
WARNING:root:In form yavitero_fish_with_hook_and_line (line 783): Unknown sound ( encountered in muta(ta)
WARNING:root:In form mehinaku_eye (line 2665): Unknown sound ' encountered in utɨ 'tai
WARNING:root:In form yavitero_evening (line 4284): Unknown sound ) encountered in jaɺ̥i(na)
WARNING:root:In form yavitero_evening (line 4284): Unknown sound ( encountered in jaɺ̥i(na)
WARNING:root:In form 221 (line 4910): Unknown sound V́ː encountered in -V́:ja-mi
WARNING:root:In form yavitero_itch_stat_v_n (line 6004): Unknown sound ) encountered in weha(hi)
WARNING:root:In form yavitero_itch_stat_v_n (line 6004): Unknown sound ( encountered in weha(hi)
WARNING:root:In form 518 (line 6264): Unknown sound V́ː encountered in -V́:mana:
WARNING:root:In form 36617 (line 6548): Unknown sound ; encountered in h̃iph̃ɯta;ɾɯph̃ɯçewna
WARNING:root:In form yavitero_sharpen_v_t (line 7710): Unknown sound ) encountered in kamenata(hi)
WARNING:root:In form yavitero_sharpen_v_t (line 7710): Unknown sound ( encountered in kamenata(hi)
WARNING:root:In form 315 (line 7767): Unknown sound V́ː encountered in -V́:ni-ɻi
WARNING:root:In form mehinaku_ear (line 8509): Unknown sound ' encountered in tulũ'ĩ
WARNING:root:In form yavitero_thirsty_stat_v_or_adj (line 10455): Unknown sound ) encountered in makaɺe(hi)
WARNING:root:In form yavitero_thirsty_stat_v_or_adj (line 10455): Unknown sound ( encountered in makaɺe(hi)
WARNING:root:In form bahuana_girl (line 10690): Unknown sound ) encountered in -teɲawɨ (h)akiʝi
WARNING:root:In form bahuana_girl (line 10690): Unknown sound ( encountered in -teɲawɨ (h)akiʝi
WARNING:root:In form yavitero_break_v_t (line 10878): Unknown sound ) encountered in kaɺia(hi)
WARNING:root:In form yavitero_break_v_t (line 10878): Unknown sound ( encountered in kaɺia(hi)
WARNING:root:In form mehinaku_mazama_americana_red_brocket_odocoileus_virginianus_white_tailed_deer_1 (line 11976): Unknown sound ' encountered in ju'ta
WARNING:root:In form mehinaku_armpit (line 13365): Unknown sound ˈ encountered in piˈ t͡sanũˈnã:ku
WARNING:root:In form 310 (line 14813): Unknown sound V̀ː encountered in -V̀:bana
WARNING:root:In form mehinaku_chin (line 16278): Unknown sound ˈ encountered in ˈpiumaˈ ʂaku
WARNING:root:In form mehinaku_split_vi_or_vt (line 20750): Unknown sound ˈ encountered in ˈ ɨ̃katɨˈwait͡sa
WARNING:root:In form mehinaku_capsicum_sp_pepper_1 (line 20930): Unknown sound ' encountered in kata'mutɨ
WARNING:root:In form yavitero_drunk_stat_v_or_adj (line 21636): Unknown sound ) encountered in kama(hi)
WARNING:root:In form yavitero_drunk_stat_v_or_adj (line 21636): Unknown sound ( encountered in kama(hi)
WARNING:root:In form mehinaku_hat (line 22039): Unknown sound ' encountered in tɨwɨ'nãi
WARNING:root:In form 324 (line 23329): Unknown sound V́ː encountered in -V́:ja-kua
WARNING:root:In form yavitero_hug (line 24416): Unknown sound ) encountered in tuja(hi)
WARNING:root:In form yavitero_hug (line 24416): Unknown sound ( encountered in tuja(hi)
WARNING:root:In form yavitero_dawn_v_i_see_the_dawn_v_t (line 26035): Unknown sound ) encountered in kahaɺi(hi)
WARNING:root:In form yavitero_dawn_v_i_see_the_dawn_v_t (line 26035): Unknown sound ( encountered in kahaɺi(hi)
WARNING:root:In form mehinaku_young_adj_or_stat_v (line 29940): Unknown sound ' encountered in jamu'kuhĩ
WARNING:root:In form mehinaku_lagenaria_siceraria_bottle_gourd_calabash_vine_opo_squash_long_melon_calabash_gourd_syn_lagenaria_vulgaris (line 31072): Unknown sound ' encountered in mɨ̃'mã
WARNING:root:In form 431 (line 31420): Unknown sound V encountered in á:hV-ba:
WARNING:root:In form yavitero_door (line 34501): Unknown sound ) encountered in huta(hi)
WARNING:root:In form yavitero_door (line 34501): Unknown sound ( encountered in huta(hi)
WARNING:root:In form yavitero_smooth_stat_v_or_adj (line 34936): Unknown sound ) encountered in kahit͡θi(hi)
WARNING:root:In form yavitero_smooth_stat_v_or_adj (line 34936): Unknown sound ( encountered in kahit͡θi(hi)
WARNING:root:In form yavitero_wait (line 35235): Unknown sound ) encountered in nai(n)ta
WARNING:root:In form yavitero_wait (line 35235): Unknown sound ( encountered in nai(n)ta
WARNING:root:In form 32937 (line 35671): Unknown sound V encountered in -Vka
WARNING:root:In form yavitero_delicious_stat_v_or_adj (line 35833): Unknown sound ) encountered in kunehe(hi)
WARNING:root:In form yavitero_delicious_stat_v_or_adj (line 35833): Unknown sound ( encountered in kunehe(hi)
WARNING:root:In form yavitero_dark_stat_v_or_adj (line 36458): Unknown sound ) encountered in maɺ̥ete(hi)
WARNING:root:In form yavitero_dark_stat_v_or_adj (line 36458): Unknown sound ( encountered in maɺ̥ete(hi)
WARNING:root:In form 198 (line 36588): Unknown sound V̀ː encountered in -V̀:ja
WARNING:root:In form mehinaku_child (line 37136): Unknown sound ' encountered in eˈnɨʂa'tai
WARNING:root:In form 33025 (line 37619): Unknown sound V encountered in -Vma
WARNING:root:In form mehinaku_axe (line 39596): Unknown sound ' encountered in ja'wai
WARNING:root:In form yavitero_carry_on_back (line 40515): Unknown sound ) encountered in nahi(a)
WARNING:root:In form yavitero_carry_on_back (line 40515): Unknown sound ( encountered in nahi(a)
WARNING:root:In form yavitero_bored (line 40555): Unknown sound ) encountered in teɺeka(hi)
WARNING:root:In form yavitero_bored (line 40555): Unknown sound ( encountered in teɺeka(hi)
WARNING:root:In form mehinaku_jaw (line 40665): Unknown sound ˈ encountered in piˈ t͡sapaˈkat ɨ
| LanguageID       | Sound   |   Occurrences | Comment                |
|------------------+---------+---------------+------------------------|
| Achagua          | .       |             1 | '.' replaced by '.'    |
| Achagua          | V́ː     |             1 | unknown sound          |
| Achagua          | V       |             1 | unknown sound          |
| Achagua          | V̀ː     |             1 | unknown sound          |
| Yawalapiti       | .       |             3 | '.' replaced by '.'    |
| Bare             | .       |             1 | '.' replaced by '.'    |
| Baure            | .       |             1 | '.' replaced by '.'    |
| Mehinaku         | Ɂ       |            16 | 'Ɂ' replaced by 'ʔ'    |
| Mehinaku         | '       |             8 | unknown sound          |
| Mehinaku         | .       |             5 | '.' replaced by '.'    |
| Mehinaku         | ˈ       |             4 | unknown sound          |
| Resigaro         | oː́     |             1 | 'oː́' replaced by 'oó' |
| Enawene Nawe     | .       |             8 | '.' replaced by '.'    |
| Baniwa (Central) | .       |             1 | '.' replaced by '.'    |

Status Column

We edit datasets. It would be nice to have a column in many places (all the places?) that comment on the status of those objects, with inter-operable pre-defined values. Sometimes a status needs to be set by an importer, such as adding them to lexedata.exporter.cognates around L. 195 for automatically created singleton cognate sets.

adding new concepts to existing forms doesn't work

I am not sure if this is a bug or it hasn't been implemented yet, but right now the single wordlist importer doesn't add new concepts to existing forms if it encounters them, but adds a new form altogether with the additional concept.
e.g. in Yawalapiti mapi was listed both as fur and body hair in the new data to import and it was imported twice (with distinct concepts).
I know this will be caught by the homophony/polysemy script, but I just wanted to make sure it is intended behavior.

forms in cognate set headers

this error was raised by a form that was in a cognate set header (i.e. not in a cognate set row such as HAIR1).
Apart from the fact that indeed this form didn't exist in the wordlist spreadsheet, I think that this is something we want to check for. There shouldn't be any forms hanging out in the header rows.

Failed to find object {'cldf_languageReference': 'xeta', 'cldf_value': "/'wadʒo/(pêlo amarelo)", 'phonemic': "'wadʒo", 'cldf_comment': 'pêlo amarelo', 'variants': []} in the database. Skipped. In cell: <Cell 'Numbers, Body Parts, Food, Anim'.M63>.

Related to the error above, here is another one:
Numbers, Body Parts, Food, Anim.AA24: In form ... <rogatoetey'ym> (naõ é quatro): Element ... could not be parsed, ignored

In this case, the form in question was present in the database, it seems that the program didn't expect ... and anything else in a cell (which makes sense). However, there may be cases where in a header row in the same cell there are tags in all caps and then a form (which shouldn't be there).

Is there a way to make sure that the cells of the header row can be only ... or things in all caps (separated by commas or ;) or both (i.e. ... HEAD)?
If this is too much work, I guess we can find forms that are not in a cognate set by searching the cognate spreadsheet for every form in the database.

bit to add to segmenter report

there is an automatic correction from tç to cç (which is totally fine). It would be nice that the user is warned, so they can change the forms as well.

lexedata installation doesn't work on a mac

We (Gereon and Natalia) think that it is due to a dependency read.line that doesn't work on macs. I am going to try and run it again so I keep the exact error message.

segment report

I had an enhancement idea. a segment report could be per language or for the whole dataset to find extremely rare segments that are probably errors (even though they are IPA characters)

importer.cognates inserts a row without ID in the cognateset.csv

Add Matrix exporter

@nataliacp mentioned that it might be nice to have an exporter that turns a long table into matrix format.
If we build such a thing, it should be able to filter by a column in the ParameterTable that distinguishes core concepts (which are expected to be present in every language) from peripheral concepts.

lexedata / lexedata Goto Github PK

lexedata's Introduction

lexedata

lexedata's People

Contributors

Stargazers

Watchers

Forkers

lexedata's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs