morphdiv / teddi_sample

Text Data Diversity Sample (TeDDi Sample)

License: Other

Python 60.74% R 1.52% TeX 18.63% Jupyter Notebook 19.11%

teddi_sample's Introduction

TeDDi

This is the repository for the Text Data Diversity Sample (TeDDi Sample), a part of the Swiss National Science Foundation funded project: Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora.

This repository contains the corpus data and code that processes and analyzes it. This is currently a work in progress.

If you use TeDDi, please cite as:

Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, and Tanja Samardzic. 2022. TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1150–1158, Marseille, France. European Language Resources Association. Online: https://aclanthology.org/2022.lrec-1.123/

To contribute code or data to the repository, please first refer to our guidelines on contributing.

Different data formats are available for direct download.

Main Contributors (alphabetical order):

Bentz, Christian
Gutierrez-Vasques, Ximena
Moran, Steven
Samardžić, Tanja
Sozinova, Olga

Language-specific contributors (alphabetical order):

Kalessa, Jule (Paiwan)
Mächler, Alina
Rood, David S. (Wichita)
Roth, Rainer (Wari')

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

teddi_sample's Issues

Rama text by Craig

Transcribe the text in Craig () The Rama language; a text with grammatical notes.

Wari Grammar by Everett and Kern (1997)

Transcribe texts given in the Appendix of Everett and Kern (1997).

Example sentences in this grammar might also be transcribed in a separate file.

Warao Texts by Osborn (1960)

Transcribe the texts with Spanish translations in Osborn (1960) Textos folcloricos Guarao. The first text is already transcribed in wba_nfi_1.txt.

UDHR white spaces

Relevant Texts: All UDHR (Universal Declaration of Human Rights) translations

ToDo: Add white spaces around punctuation. Probably a semi-automatic task.
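A minimal sketch of the automatic part of this task, assuming that "punctuation" means a small ASCII set and that texts are processed one line at a time. The function name `space_punctuation` and the character set are illustrative assumptions, not part of the repository. Cases such as decimal numbers or abbreviations would still need a manual check, which is why the task is described as semi-automatic.

```python
import re

# Assumption: this ASCII set stands in for "punctuation"; scripts with their
# own punctuation marks would need an extended character class.
PUNCT = re.compile(r'[ ]*([.,;:!?()\[\]"])[ ]*')


def space_punctuation(line: str) -> str:
    """Put exactly one space on each side of every punctuation mark.

    Only plain spaces are touched, so a tab after a line tag is preserved.
    """
    spaced = PUNCT.sub(r" \1 ", line)
    # Collapse runs of spaces created by adjacent punctuation marks.
    return re.sub(r" {2,}", " ", spaced).strip(" ")
```

The function is idempotent, so it can be re-run on lines that already have spaces around punctuation.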

Bagirmi Lexicon by Keegan and Djibrine (2016)

This lexicon contains a short grammatical description in English, with example sentences that can be extracted.

In the Bagirmi-French lexicon there are example sentences in Bagirmi with French translations that can also be extracted (the genre is then also example sentences, i.e. gre)

Asmat grammar by Voorhoeve (1980)

Transcribe example sentences of Voorhoeve (1980) The Asmat languages of Irian Jaya.

Missing LangData directory

The following line references the file WALS_languages.csv:

https://github.com/uzling/100LC/blob/84704ea008554853b730bb12b98390135ecbece8/Rcode/langInfo_100_merge.R#L16

in a LangData/WALS/ directory which does not exist.

ToDo: Update the Rcode directory to point to the relevant files in:

https://github.com/uzling/100LC/tree/master/LangInfo

Kiowa Texts by Harrington (1946)

Transcribe the three Kiowa Texts in Harrington (1946). The first text is transcribed in kio_nfi_1.txt.

Acoma (kjq) texts in Boas (1928) Keresan texts

The first ten pages of Boas (1928), corresponding to the story "The Emergence" have been transcribed using Transkribus and can be found in kjq_nfi_1. The transcription model is ready to use to transcribe the other stories in Boas (1928).

This transcription is currently organized by the pages and lines of Boas's original hand-written text. Line breaks with hyphens are currently also left as in the original. Eventually it might be better to establish one sentence per line, or one paragraph per line.

Potentially add the English translation as given in the other Volume by Boas (1928), though this is complicated by the fact that the sentence segmentation by Boas is hard to disentangle, and hence establishing the correspondence between hand-written Acoma and printed English translation is non-trivial.

Oneida Teaching Grammar by Abbott (2006)

Transcribe example sentences of Abbott (2006) Oneida teaching grammar.

Duplicates in OPUS subtitles

@avonizos commented on Aug 2, 2019

TODO:
Find a way to automatically identify duplicates, delete them, and rename the remaining files.
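One possible starting point for the identification step is to group files by a hash of their content. The sketch below assumes that duplicates are byte-identical texts and that the caller has already read the files into a mapping from file name to content; `find_duplicates` is a hypothetical helper, and deletion and renaming are deliberately left out.

```python
import hashlib
from collections import defaultdict


def find_duplicates(texts: dict[str, str]) -> list[list[str]]:
    """Group file names whose content is exactly identical.

    Assumption: `texts` maps file name -> file content. Near-duplicates
    (e.g. differing only in metadata or whitespace) are not detected.
    """
    groups = defaultdict(list)
    for name in sorted(texts):
        digest = hashlib.sha256(texts[name].encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only the groups that contain more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

If the files carry differing metadata headers, hashing only the text body would be the more useful variant.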

Burushaski Lexicon by Lorimer (1938)

Digitize example sentences in Lorimer (1938) The Burushaski language.

Write tests for data input

Some of these should go into CI tests.

Canela-Kraho Chapter by Popjes and Popjes (1986)

Transcribe example sentences of the Chapter on Canela-Kraho by Popjes and Popjes (1986)

Bible in Hixkaryana

The first Book (Book of Matthew) is already ported to hix_nfi_1.txt. The other books found in the pdf should be ported to the same file. The structure of the text is by chapters and verses. So line tags need to be added corresponding to the chapter numbers and the numbers given for verses.

Sometimes several verses are collated in the original. This needs to be kept in the transcription, e.g. <verse_12,13> or <verse_17-19>. Make sure that each verse (or collation of verses) runs over only one line. Line breaks indicate a new verse.

White spaces around punctuation need to be added.
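The verse conventions above can be checked mechanically. The following sketch assumes that every verse line starts with a verse tag followed by a tab, as in the other line-tag formats of the corpus; the format of chapter tags is not specified here, so chapter lines are not covered. The names `VERSE_LINE` and `bad_verse_lines` are illustrative.

```python
import re

# Accepts <verse_12>, <verse_12,13> and <verse_17-19>, then a tab and text.
# Assumption: the tab after the tag follows the corpus-wide tag convention.
VERSE_LINE = re.compile(r"^<verse_\d+(?:[,-]\d+)*>\t\S.*$")


def bad_verse_lines(lines):
    """Return the 1-based numbers of non-empty lines that are not tagged verses."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() and not VERSE_LINE.match(line)
    ]
```

A verse that accidentally runs over two lines shows up here because its second line has no tag.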

UDHR formatting

Relevant Texts: All UDHR (Universal Declaration of Human Rights) translations

ToDo: Bring all of these into the same format. The UDHR is organized by articles, and this structure should be kept by using the line tag <article_x> followed by a tab and the text of the article. The whole of the preamble can follow a line tag . This formatting of UDHR translations should probably be done automatically, since there are several dozen languages.
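As a rough sketch of the target format, the function below writes one article per line. It assumes the articles have already been split and numbered (how to do that depends on each source file) and that each article should occupy a single line; `format_udhr` is a hypothetical name, and the preamble is left out.

```python
def format_udhr(articles: dict[int, str]) -> str:
    """Render articles as '<article_x>' + tab + text, one article per line.

    Assumption: `articles` maps article number -> article text.
    """
    lines = []
    for number in sorted(articles):
        # Collapse internal line breaks and repeated spaces into single spaces.
        text = " ".join(articles[number].split())
        lines.append(f"<article_{number}>\t{text}")
    return "\n".join(lines) + "\n"
```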

Meithei Grammar by Chelliah (1997)

Transcribe the texts in the Appendix of Chelliah (1997) A grammar of Meithei.

Grammar examples from this grammar can also be transcribed in a separate file.

Open Subtitles Texts

Minor adjustments to be made in the Open Subtitles Corpora:

Replace the underscore in "prepared_speeches" with a white space.
Put the reference to Lison and Tiedemann (2016) under "source", followed by the link, rather than under "copyright long".
All the texts have "whole" as sample type, but I thought this should be the number of tokens that were used for random sampling?
There should be the line tag <line_x> (with x being a running number giving the line number) at the beginning of each line. This line tag should be followed by a tab.
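The line-tag requirement could be met with a few lines of code. This sketch assumes that numbering starts at 1 and that empty lines are skipped rather than numbered; both are assumptions, and `add_line_tags` is an illustrative name.

```python
def add_line_tags(lines):
    """Prefix each non-empty line with '<line_x>' and a tab, x counting from 1."""
    kept = [line.strip() for line in lines if line.strip()]
    return [f"<line_{number}>\t{text}" for number, text in enumerate(kept, start=1)]
```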

Zoque Grammar by Harrison (1984)

Transcribe example sentences in Harrison (1984) Survey of morphology and syntax for Zoque of Copainala.

Add contributing information

Add CONTRIBUTING.md

Describe F & PR model for adding data / code
Describe who OK's the code / data

Canela-Kraho Bible

Transcribe the Canela-Kraho bible (pdf found in Sources/Canela-Kraho (ram)/Text). The first chapter is transcribed in ram_nfi_1.txt. Extend on this file.

Write XML output

create the XML schema
dump the data from the database into XML files
update/finish Database/README.md

Maybrat Grammar by Dol 2007

Relevant Texts: Grammatical examples in the Maybrat grammar by Dol (2007). Only transcribe full sentences, not phrases or words. Also, there are three texts given as appendices which can be transcribed as separate stories.

Imonda Grammar

The text of the appendix of Seiler (1985) "Imonda, a Papuan language" is already transcribed in imn_nfi_1.txt. Use this as a reference to transcribe in a separate file the example sentences in that grammar (only full sentences, no phrases or words).

Rapa Nui Grammar by Kieviet (2017)

Transcribe the example sentences given in the grammar. The texts in the Appendix are already transcribed in rap_nfi_1-3.

Dani grammar by Bromley (1981)

Transcribe example sentences in Bromley (1981) A grammar of Lower Grand Valley Dani.

The text in the appendix is transcribed in dni_con_1.txt.

Rapa Nui grammar by Kieviet (2017)

Transcribe the interlinear texts (Appendix) in Kieviet (2017) A grammar of Rapa Nui. The first two texts are already transcribed in rap_nfi_1.txt and rap_nfi_2.txt.

The example sentences of this grammar might also be transcribed in a separate file.

Lezgian Grammar

Transcribe the Lezgian texts in Haspelmath's (2013) Lezgian grammar.

Example sentences in this grammar can also be transcribed in a separate file.

PBC line tags

Relevant Texts: All PBC texts (Parallel Bible Corpus)

ToDo: The line tags currently consist of the numbers described in Mayer and Cysouw (2014) followed by a tab. These represent verses, and should be kept, but embedded in our line tag format, i.e. <verse_x>, where x is the original number assigned to the verse. This needs to be done automatically, since there are around 50 bible translations with many thousands of lines.
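A sketch of the automatic conversion, assuming each verse line starts with the numeric verse ID followed by a tab, as described above. Lines that do not match (for example header or comment lines) are returned unchanged. `tag_pbc_line` is a hypothetical helper name.

```python
import re

# A numeric verse ID at the start of the line, followed by a tab.
PBC_LINE = re.compile(r"^(\d+)\t")


def tag_pbc_line(line: str) -> str:
    """Wrap a leading numeric verse ID as <verse_x>, keeping the tab and text."""
    return PBC_LINE.sub(r"<verse_\1>\t", line, count=1)
```

Because already converted lines start with `<`, running the function twice does not add a second tag.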

Skip

@avonizos @christianbentz -- this pull request removes all the .DS_Store files

https://en.wikipedia.org/wiki/.DS_Store

that were checked in and adds a .gitignore file:

https://help.github.com/en/github/using-git/ignoring-files

so they won't be checked in in the future (I try to avoid git add * in general)

White spaces between orthographic words

Relevant texts: all Mandarin Chinese (cmn), Japanese (jpn), Burmese (mya), Thai (tha), and Plains Cree (crk) texts.

ToDo: Check whether white spaces between orthographic words are given. If not, these can be added by using the software packages mentioned for each language and respective writing system in "Report (Transcription and Annotation)" on Overleaf. For Plains Cree, this is only a problem if it is written in Canadian Syllabary.

Create database schema and load the corpus data

create database schema
parse the input files (reorganize the data input into ini (metadata) vs data files?)
- [ ] test the input data for consistent metadata categories
- [ ] test the input data for bad characters
- [ ] test the input data for proper alignments when splitting IGT (!)
load the data
write round trip

Kiowa grammar by Watkins (1984)

Transcribe the texts in the Appendix of Watkins (1984) A grammar of Kiowa.

Example sentences in the grammar might also be transcribed.

Zoque text by Harrison (1952)

Transcribe the text with English translations by Harrison (1952) The Mason: A Zoque text. The transcription is started in zoc_nfi_1.txt.

MIT Piraha Corpus

The MIT Piraha corpus can be found online at https://github.com/languageMIT/piraha. It consists of more than 1000 sentences that can be ported into our format. The structure of the example sentences with syntactic glossing seems very complicated, but it should be possible to automatically extract the Piraha words, glosses, and translation sentence by sentence.

Lavukaleve Grammar by Terrill (1999)

Transcribe the texts in the Appendix of Terrill (1999) Lavukaleve grammar. Use the 1999 Doctoral thesis rather than the later Mouton de Gruyter publication. The first text on the Monggo people is transcribed in luv_nfi_1.txt already.

Grammar examples from this grammar can also be transcribed in a separate file.

Kutenai tales by Boas (1928)

Transcribe Kutenai tales by Boas (1928). The first tale has been digitized in kut_nfi_1.txt.

Asmat grammar by Voorhoeve (1965)

Transcribe texts in Voorhoeve (1965) The Flamingo Bay Dialect of the Asmat language. The first two texts in the appendix are already transcribed in tml_nfi_1 and tml_nfi_2

Grammatical examples might also be transcribed in a separate file.

Kayardild Grammar Examples in Round (2013)

Relevant Text: Round (2013) grammar of Kayardild

The first six example sentences (Chapter 1 to 4) are digitized in gyd_gre_1. This file should be completed by adding all the other example sentences from Chapter 5 onwards. Note that when copying and pasting from the pdf, upper case letters in the glosses are sometimes rendered as lower case and the mu character (μ) is sometimes rendered as m, so these need to be carefully checked. Additionally, letters that are in index position in the original are given in normal position when copying and pasting; the original index position should be indicated by an underscore, e.g. child_NL.

Grammar of Kuniyanti (Gooniyandi) by McGregor (1984)

Transcribe the three texts given in the Kuniyanti (Gooniyandi) Grammar by McGregor (1984). The first 20 lines of the first text are transcribed in gni_con_1. Make sure that the tone indications (arrows) and the indications of pauses are correct.

Later, the example sentences in this grammar might also be transcribed in a separate file gni_gre_1.

McGregor suggested in an e-mail that all dd consonant clusters should be replaced by rr.

Martuthunira grammar by Dench (1994)

Transcribe the texts and songs given in the Appendix of Dench's Martuthunira grammar. The first text is already transcribed in vma_con_1.txt

Example sentences might also be transcribed in a separate file.

Ngiyambaa grammar by Donaldson (1977)

Transcribe texts and songs in appendix of Donaldson (1977) Ngiyambaa grammar. The first text has been started already in wyb_con_1.txt.

Automatically generating progress file

Write code to automatically generate a .csv file which gives a summary of the current status of the 100LC. The columns should be at least: ISO code of the language, language name (both can be taken from the highest level folder), number of texts, number of genres (broad is enough I guess), number of characters, and number of tokens (count by removing punctuation and then splitting strings by white spaces; this does not have to be perfectly accurate at this point).
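A minimal sketch of such a summary, under several assumptions: top-level folders are named like "Canela-Kraho (ram)", file names follow the iso_genre_number pattern (e.g. ram_nfi_1.txt), leading line tags are not counted, and the corpus has already been read into a nested mapping of folder name to file name to text. All function and column names are illustrative.

```python
import csv
import io
import re
import unicodedata

FIELDS = ["iso", "language", "texts", "genres", "characters", "tokens"]
FOLDER = re.compile(r"^(?P<name>.+) \((?P<iso>[a-z]{3})\)$")
LINE_TAG = re.compile(r"^<[^>\t]+>\t", re.MULTILINE)


def count_tokens(text: str) -> int:
    """Remove punctuation (Unicode category P*), then split on white space."""
    kept = "".join(c for c in text if not unicodedata.category(c).startswith("P"))
    return len(kept.split())


def summarise(corpus: dict[str, dict[str, str]]) -> list[dict]:
    """Build one summary row per language folder."""
    rows = []
    for folder, files in sorted(corpus.items()):
        match = FOLDER.match(folder)
        if match is None:
            continue  # folder name does not follow the assumed pattern
        genres = {name.split("_")[1] for name in files if name.count("_") >= 2}
        texts = [LINE_TAG.sub("", text) for text in files.values()]
        rows.append({
            "iso": match["iso"],
            "language": match["name"],
            "texts": len(files),
            "genres": len(genres),
            "characters": sum(len(text) for text in texts),  # includes white space
            "tokens": sum(count_tokens(text) for text in texts),
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialise the summary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Reading the folders from disk and writing the result to a file are left to the caller, which keeps the counting logic easy to test.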

Create bibliography

We should have a BibTeX file with each resource in the project and an additional sources index file that links each resource and its metadata to its doculect in the BibTeX file.

Mangarayi grammar by Merlan (1989)

Transcribe grammar examples from the Mangarayi grammar by Merlan (1989)

Lakhota Grammar by Ingham (2003)

Transcribe the texts in the Appendix of Ingham (2003) Lakhota grammar.

Grammar examples from this grammar can also be transcribed in a separate file.

Plains Cree Texts by Bloomfield

Relevant Texts: All the Plains Cree (crk) stories in Bloomfield (1934). Plains Cree Texts.

ToDo: Transcribe these stories into separate files (one file per story). The first story is given in crk_nfi_1.txt as an example. This is potentially a task to use Transkribus for. Copying and pasting from the pdf is cumbersome, since some special characters are not rendered correctly.

Rama grammar by Craig/Grinevald (1990)

Transcribe grammatical example sentences in Craig/Grinevald (1990) A grammar of Rama.
128 sentences (first 100 pages of the grammar) have already been transcribed in file rma_gre_1.txt. Complete this file by adding the rest of the example sentences found in the grammar (probably another 200 or so).
