yanglalo's Introduction

CLDF dataset derived from Yang's "Lalo Regional Varieties" from 2011

How to cite

If you use these data, please cite both

the original source:

Yang, Cathryn (2011): Lalo regional varieties: Phylogeny, dialectometry and sociolinguistics. Bundoora: La Trobe University.

the derived dataset, using the DOI of the particular released version you used

Description

This dataset is licensed under CC-BY-4.0.

Conceptlists in Concepticon:

Yang-2011-1014

Statistics

Varieties: 8
Concepts: 1,000
Lexemes: 8,505
Sources: 1
Synonymy: 1.13
Cognacy: 8,505 cognates in 1,222 cognate sets (10 singletons)
Cognate Diversity: 0.03
Invalid lexemes: 0
Tokens: 53,082
Segments: 209 (0 BIPA errors, 0 CLTS sound class errors, 209 CLTS modified)
Inventory size (avg): 91.38
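
As a rough sanity check, the cognate diversity figure can be recomputed from the other counts in this list. The sketch below assumes the formula (cognate sets minus concepts) divided by (cognates minus concepts); this README does not state the formula, so treat it as an assumption that happens to reproduce the reported value.

```python
# Counts copied from the Statistics list above.
concepts = 1000
cognates = 8505
cognate_sets = 1222


def cognate_diversity(n_sets, n_cognates, n_concepts):
    """Assumed formula, not taken from this README.

    Gives 0 when every concept has exactly one cognate set and
    1 when every cognate forms its own set.
    """
    return (n_sets - n_concepts) / (n_cognates - n_concepts)


diversity = cognate_diversity(cognate_sets, cognates, concepts)
print(f"{diversity:.2f}")  # prints 0.03
```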

Contributors

Name	GitHub user	Description	Role
Cathryn Yang		provided data in digital form	Author, DataCollector
Steve Pepper		did initial concept and glottolog mapping	Other
Tiago Tresoldi	@tresoldi	maintainer	Other
Johann-Mattis List	@LinguList	maintainer	Other

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF Wordlist at cldf/cldf-metadata.json
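
A minimal sketch of how the wordlist might be inspected with the Python standard library alone. It assumes the form table sits at `cldf/forms.csv` and has a `Language_ID` column, which are the usual CLDF defaults but are not confirmed by this README; the authoritative file names are listed in `cldf/cldf-metadata.json`. The helper name `forms_per_language` is made up for this example.

```python
import csv
from collections import Counter


def forms_per_language(lines):
    """Count rows per Language_ID in a CLDF form table given as CSV lines."""
    reader = csv.DictReader(lines)
    return Counter(row["Language_ID"] for row in reader)


if __name__ == "__main__":
    # Assumed path: the default form table next to cldf-metadata.json.
    try:
        with open("cldf/forms.csv", encoding="utf-8", newline="") as handle:
            for language, count in sorted(forms_per_language(handle).items()):
                print(language, count)
    except FileNotFoundError:
        print("cldf/forms.csv not found; run this from the repository root")
```

For anything beyond quick counts, a CLDF-aware reader such as the `pycldf` package is the better choice, because it resolves table and column names from the metadata file instead of relying on defaults.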

yanglalo's Issues

Concepticon change

PR 872 over on Concepticon changed a concept mapping, so the list needs to be re-run for the next release.

Switch to online supplement for raw data

See lexibank/lexibank-analysed#49 (comment).

The raw data in https://github.com/lexibank/yanglalo/blob/master/raw/raw_data.tsv are corrupted and do not represent the source data faithfully (see https://www.sil.org/system/files/reapdata/45/63/47/45634767234331504727329755903316120678/Yang_Lalo_Sept_2011.pdf).

Switch to the online supplement and verify with the SIL publication's wordlist.

Re-run code to account for new title and contributors

Status of the data

The data are based on a glossary in PDF form, which is not easy to digitize. The author shared a spreadsheet with extended data with me in the past, but we should not share that spreadsheet online.

My idea is to take only the languages explicitly mentioned in the official publication (about 8) and extract them from the spreadsheet. Anything more would be unfair to the author, and we would have a hard time finding sources for the additional data.
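
The extraction step could be sketched roughly as follows. Everything specific here is hypothetical: the column name `variety`, the variety names in `PUBLISHED`, and the sample rows are placeholders, since the layout of the spreadsheet is not described in this issue.

```python
import csv
import io

# Hypothetical allow-list; the real names would come from the publication.
PUBLISHED = {"VarietyA", "VarietyB"}


def keep_published(rows, allowed, column="variety"):
    """Yield only the rows whose variety is in the allow-list."""
    for row in rows:
        if row[column] in allowed:
            yield row


# Tiny stand-in for a tab-separated spreadsheet export (placeholder values).
sample = (
    "variety\tconcept\tform\n"
    "VarietyA\thand\tform1\n"
    "VarietyX\thand\tform2\n"
    "VarietyB\thand\tform3\n"
)
rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
kept = list(keep_published(rows, PUBLISHED))
print([row["variety"] for row in kept])  # prints ['VarietyA', 'VarietyB']
```

Filtering by an explicit allow-list keeps the unpublished varieties out of the repository by construction, rather than relying on removing them afterwards.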