
morphdiv / teddi_sample


Text Data Diversity Sample (TeDDi Sample)

License: Other

Python 60.74% R 1.52% TeX 18.63% Jupyter Notebook 19.11%
corpus-linguistics typology

teddi_sample's Introduction

TeDDi

This is the repository for the Text Data Diversity Sample (TeDDi Sample), a part of the Swiss National Science Foundation funded project: Non-randomness in Morphological Diversity: A Computational Approach Based on Multilingual Corpora.

This repository contains the corpus data and code that processes and analyzes it. This is currently a work in progress.

If you use TeDDi, please cite as:

Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, and Tanja Samardzic. 2022. TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1150–1158, Marseille, France. European Language Resources Association. Online: https://aclanthology.org/2022.lrec-1.123/

To contribute code or data to the repository, please first refer to our guidelines on contributing.

Different data formats are available for direct download.

Main Contributors (alphabetical order):

  • Bentz, Christian
  • Gutierrez-Vasques, Ximena
  • Moran, Steven
  • Samardžić, Tanja
  • Sozinova, Olga

Language-specific contributors (alphabetical order):

  • Kalessa, Jule (Paiwan)
  • Mächler, Alina
  • Rood, David S. (Wichita)
  • Roth, Rainer (Wari')

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

teddi_sample's People

Contributors

alinamaechler, bambooforest, christianbentz, tsamardzic, veljkovic, ximenina

teddi_sample's Issues

Rama text by Craig

Transcribe the text in Craig () The Rama language; a text with grammatical notes.

Warao Texts by Osborn (1960)

Transcribe the texts with Spanish translations in Osborn (1960) Textos folcloricos Guarao. The first text is already transcribed in wba_nfi_1.txt.

UDHR white spaces

Relevant Texts: All UDHR (Universal Declaration of Human Rights) translations

ToDo: Add white spaces around punctuation. Probably a semi-automatic task.
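
A minimal sketch of the automatic part of this task, assuming that "punctuation" means any character in a Unicode punctuation category (P*). The function name `space_punctuation` is hypothetical. It should be applied only to the text after the line tag, because the underscore inside a tag also counts as punctuation in Unicode. Word-internal apostrophes and hyphens are split off as well, which is why the result still needs a manual pass.

```python
import re
import unicodedata


def space_punctuation(text: str) -> str:
    """Surround every Unicode punctuation character with single spaces."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            pieces.append(f" {ch} ")
        else:
            pieces.append(ch)
    # Adjacent punctuation marks produce double spaces; collapse them.
    spaced = re.sub(r" {2,}", " ", "".join(pieces))
    # Remove the spaces added at the very start or end of the line.
    return spaced.strip(" ")


print(space_punctuation("Hello, world!"))
```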

Bagirmi Lexicon by Keegan and Djibrine (2016)

This lexicon contains a short grammatical description in English, with example sentences that can be extracted.

In the Bagirmi-French lexicon there are example sentences in Bagirmi with French translations that can also be extracted (the genre is then also example sentences, i.e. gre)

Acoma (kjq) texts in Boas (1928) Keresan texts

The first ten pages of Boas (1928), corresponding to the story "The Emergence" have been transcribed using Transkribus and can be found in kjq_nfi_1. The transcription model is ready to use to transcribe the other stories in Boas (1928).

This transcription is currently organized by pages and lines of the original hand-written text by Boas. Line breaks with hyphens are currently also left as in the original. Eventually it might be better to establish one sentence per line, or one paragraph per line.

Potentially add the English translation as given in the other Volume by Boas (1928), though this is complicated by the fact that the sentence segmentation by Boas is hard to disentangle, and hence establishing the correspondence between hand-written Acoma and printed English translation is non-trivial.

Duplicates in OPUS subtitles

@avonizos commented on Aug 2, 2019

TODO:
Find a way to automatically identify duplicates, delete them, and rename the remaining files.
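
One possible starting point for the identification step, sketched under the assumption that a duplicate is a file whose text is identical to another file's text once runs of white space are collapsed. Near-duplicates (for example, two subtitle versions of the same film) are not detected by this approach. The function name `group_duplicates` and the file names in the example are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_duplicates(texts):
    """Group file names whose whitespace-normalized content is identical.

    `texts` maps a file name to the file's content as a string.
    Returns a list of groups; each group has at least two names.
    """
    by_digest = defaultdict(list)
    for name in sorted(texts):
        normalized = " ".join(texts[name].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_digest[digest].append(name)
    return [names for names in by_digest.values() if len(names) > 1]


example = {"a.txt": "x  y\n", "b.txt": "x y", "c.txt": "z"}
print(group_duplicates(example))
```

Deleting and renaming can then be driven by the returned groups, for instance by keeping the first name in each group.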

Write tests for data input

Some of these should go into CI tests.

Bible in Hixkaryana

The first Book (Book of Matthew) is already ported to hix_nfi_1.txt. The other books found in the pdf should be ported to the same file. The structure of the text is by chapters and verses. So line tags need to be added corresponding to the chapter numbers and the numbers given for verses.

Sometimes several verses are collated in the original. This needs to be kept in the transcription, e.g. <verse_12,13> or <verse_17-19>. Make sure that each verse (or collation of verses) runs over one line only. Line breaks indicate a new verse.

White spaces around punctuation need to be added.

UDHR formatting

Relevant Texts: All UDHR (Universal Declaration of Human Rights) translations

ToDo: Bring all of these into the same format. The UDHR is organized by articles, and this structure should be kept by using the line tag <article_x>, followed by a tab and the text of the article. The whole of the preamble can follow a line tag . This formatting of UDHR translations should probably be done automatically, since there are several dozen languages.
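
A short sketch of the target format for the articles, assuming the articles have already been split into (number, text) pairs. The function name `format_udhr_articles` is hypothetical, and the preamble is left out because its tag name is not given here.

```python
def format_udhr_articles(articles):
    """Render (number, text) pairs as '<article_x>' + tab + text, one per line."""
    lines = []
    for number, text in articles:
        # Keep each article on a single line by collapsing internal line breaks.
        one_line = " ".join(text.split())
        lines.append(f"<article_{number}>\t{one_line}")
    return "\n".join(lines)


print(format_udhr_articles([(1, "All human beings\nare born free.")]))
```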

Meithei Grammar by Chelliah (1997)

Transcribe the texts in the Appendix of Chelliah (1997) A grammar of Meithei.

Grammar examples from this grammar can also be transcribed in a separate file.

Open Subtitles Texts

Minor adjustments to be made in the Open Subtitles Corpora:

  • replace the underscore in "prepared_speeches" by a white space.
  • put the reference to Lison and Tiedemann (2016) under "source" followed by the link, rather than under "copyright long"
  • all the texts have "whole" as sample type, but I thought this should be the number of tokens that were used for random sampling?
  • There should be the line tag <line_x> (with x being a running number giving the line number) at the beginning of each line. This line tag should be followed by a tab.
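
The last item in the list above could be automated roughly as follows. This is a sketch that assumes numbering starts at 1 and that empty lines are dropped rather than numbered; the function name `add_line_tags` is hypothetical.

```python
def add_line_tags(text):
    """Prefix every non-empty line with '<line_x>' and a tab, x counting from 1."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(
        f"<line_{number}>\t{line}" for number, line in enumerate(lines, start=1)
    )


print(add_line_tags("First subtitle line\n\nSecond subtitle line\n"))
```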

Canela-Kraho Bible

Transcribe the Canela-Kraho bible (pdf found in Sources/Canela-Kraho (ram)/Text). The first chapter is transcribed in ram_nfi_1.txt. Extend this file.

Write XML output

  • create the XML schema
  • dump the data from the database into XML files
  • update/finish Database/README.md

Maybrat Grammar by Dol (2007)

Relevant Texts: Grammatical examples in the Maybrat grammar by Dol (2007). Only transcribe full sentences, not phrases or words. Also, there are three texts given as appendices which can be transcribed as separate stories.

Imonda Grammar

The text of the appendix of Seiler (1985) "Imonda, a Papuan language" is already transcribed in imn_nfi_1.txt. Use this as a reference to transcribe in a separate file the example sentences in that grammar (only full sentences, no phrases or words).

Dani grammar by Bromley (1981)

Transcribe example sentences in Bromley (1981) A grammar of Lower Grand Valley Dani.

The text in the appendix is transcribed in dni_con_1.txt.

Rapa Nui grammar by Kieviet (2017)

Transcribe the interlinear texts (Appendix) in Kieviet (2017) A grammar of Rapa Nui. The first two texts are already transcribed in rap_nfi_1.txt and rap_nfi_2.txt.

The example sentences of this grammar might also be transcribed in a separate file.

Lezgian Grammar

Transcribe the Lezgian texts in Haspelmath's (2013) Lezgian grammar.

Example sentences in this grammar can also be transcribed in a separate file.

PBC line tags

Relevant Texts: All PBC texts (Parallel Bible Corpus)

ToDo: The line tags currently consist of the numbers described in Mayer and Cysouw (2014) followed by a tab. These represent verses, and should be kept, but embedded in our line tag format, i.e. <verse_x>, where x is the original number assigned to the verse. This needs to be done automatically, since there are around 50 bible translations with many thousands of lines.
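
A minimal sketch of the conversion, assuming each verse line has the shape "number, tab, text" and that any other line (for example a header or comment) should pass through unchanged. The function name `tag_verse` and the verse number in the example are made up for illustration.

```python
import re

# A verse line: digits, a tab, then the verse text.
VERSE_LINE = re.compile(r"^(\d+)\t(.*)$")


def tag_verse(line):
    """Turn 'number<TAB>text' into '<verse_number><TAB>text'."""
    line = line.rstrip("\n")
    match = VERSE_LINE.match(line)
    if match is None:
        return line  # not a verse line; leave it untouched
    number, text = match.groups()
    return f"<verse_{number}>\t{text}"


print(tag_verse("40001001\tExample verse text"))
```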

White spaces between orthographic words

Relevant texts: all Mandarin Chinese (cmn), Japanese (jpn), Burmese (mya), Thai (tha), and Plains Cree (crk) texts.

ToDo: Check whether white spaces between orthographic words are given. If not, these can be added by using the software packages mentioned for each language and respective writing system in "Report (Transcription and Annotation)" on Overleaf. For Plains Cree, this is only a problem if it is written in Canadian Syllabary.

Create database schema and load the corpus data

  • create database schema
  • parse the input files (reorganize the data input into ini (metadata) vs data files?)
    • [ ] test the input data for consistent metadata categories
    • [ ] test the input data for bad characters
    • [ ] test the input data for proper alignments when splitting IGT (!)
  • load the data
  • write round trip

Kiowa grammar by Watkins (1984)

Transcribe the texts in the Appendix of Watkins (1984) A grammar of Kiowa.

Example sentences in the grammar might also be transcribed.

Zoque text by Harrison (1952)

Transcribe the text with English translations by Harrison (1952) The Mason: A Zoque text. The transcription has been started in zoc_nfi_1.txt.

MIT Piraha Corpus

The MIT Piraha corpus can be found online at https://github.com/languageMIT/piraha. It consists of more than 1000 sentences that can be ported into our format. The structure of the example sentences with syntactic glossing seems very complicated, but it should be possible to automatically extract the Piraha words, glosses, and translation sentence by sentence.

Lavukaleve Grammar by Terrill (1999)

Transcribe the texts in the Appendix of Terrill (1999) Lavukaleve grammar. Use the 1999 Doctoral thesis rather than the later Mouton de Gruyter publication. The first text on the Monggo people is transcribed in luv_nfi_1.txt already.

Grammar examples from this grammar can also be transcribed in a separate file.

Asmat grammar by Voorhoeve (1965)

Transcribe the texts in Voorhoeve (1965) The Flamingo Bay Dialect of the Asmat language. The first two texts in the appendix are already transcribed in tml_nfi_1 and tml_nfi_2.

Grammatical examples might also be transcribed in a separate file.

Kayardild Grammar Examples in Round (2013)

Relevant Text: Round (2013) grammar of Kayardild

The first six example sentences (Chapters 1 to 4) are digitized in gyd_gre_1. This file should be completed by adding all the other example sentences from Chapter 5 onwards. Note that when copying and pasting from the pdf, upper case letters in the glosses are sometimes rendered as lower case, and the mu character (μ) is sometimes rendered as m, so these need to be carefully checked. Additionally, letters that are in index position in the original are given in normal position when copying and pasting. The original index position should be indicated by an underscore, e.g. child_NL.

Grammar of Kuniyanti (Gooniyandi) by McGregor (1984)

Transcribe the three texts given in the Kuniyanti (Gooniyandi) Grammar by McGregor (1984). The first 20 lines of the first text are transcribed in gni_con_1. Make sure that the tone indications (arrows) and the indications of pauses are correct.

Later, the example sentences in this grammar might also be transcribed in a separate file gni_gre_1.

McGregor suggested in an e-mail that all dd consonant clusters should be replaced by rr.

Martuthunira grammar by Dench (1994)

Transcribe the texts and songs given in the Appendix of Dench's Martuthunira grammar. The first text is already transcribed in vma_con_1.txt.

Example sentences might also be transcribed in a separate file.

Automatically generating progress file

Write code to automatically generate a .csv file which gives a summary of the current status of the 100LC. The columns should be at least: ISO code of the language, language name (both can be taken from the highest-level folder), number of texts, number of genres (broad is enough I guess), number of characters, and number of tokens (count by removing punctuation and then splitting strings by white spaces; this does not have to be perfectly accurate at this point).
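
A rough sketch of the counting and CSV-writing parts, not tied to the actual folder layout. It assumes the texts of one language have already been read into a mapping from genre to a list of strings, that "characters" includes white space, and that "punctuation" means the Unicode P* categories. The column names and the function names `count_tokens`, `summarize`, and `write_progress` are hypothetical.

```python
import csv
import io
import unicodedata

FIELDS = ["iso_code", "language", "n_texts", "n_genres", "n_characters", "n_tokens"]


def count_tokens(text):
    """Delete punctuation characters, then count whitespace-separated strings."""
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return len(kept.split())


def summarize(iso_code, language, texts_by_genre):
    """Build one summary row; texts_by_genre maps genre -> list of texts."""
    all_texts = [text for group in texts_by_genre.values() for text in group]
    return {
        "iso_code": iso_code,
        "language": language,
        "n_texts": len(all_texts),
        "n_genres": len(texts_by_genre),
        "n_characters": sum(len(text) for text in all_texts),
        "n_tokens": sum(count_tokens(text) for text in all_texts),
    }


def write_progress(rows, handle):
    """Write the summary rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Usage with made-up data; a real run would pass an open .csv file instead.
row = summarize("xxx", "Example", {"nfi": ["Hello, world!"], "gre": ["One two."]})
buffer = io.StringIO()
write_progress([row], buffer)
print(buffer.getvalue())
```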

Create bibliography

We should have a BibTeX file with each resource in the project and an additional sources index file that links each resource and its metadata to its doculect in the BibTeX file.

Lakhota Grammar by Ingham (2003)

Transcribe the texts in the Appendix of Ingham (2003) Lakhota grammar.

Grammar examples from this grammar can also be transcribed in a separate file.

Plains Cree Texts by Bloomfield

Relevant Texts: All the Plains Cree (crk) stories in Bloomfield (1934). Plains Cree Texts.

ToDo: Transcribe these stories into separate files (one file per story). The first story is given in crk_nfi_1.txt as an example. This is potentially a task to use Transkribus for. Copying and pasting from the pdf is cumbersome, since some special characters are not rendered correctly.

Rama grammar by Craig/Grinevald (1990)

Transcribe grammatical example sentences in Craig/Grinevald (1990) A grammar of Rama.
128 sentences (first 100 pages of the grammar) have already been transcribed in file rma_gre_1.txt. Complete this file by adding the rest of the example sentences found in the grammar (probably another 200 or so).
