sillsdev / cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques.

Home Page: http://sillsdev.github.io/cog/

License: MIT License

C# 95.65% Python 0.24% Shell 0.27% PowerShell 0.04% R 0.15% Rich Text Format 3.66%

cog's Introduction

Cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties.

Features

  • IPA-based segmentation: automatically splits words into segments
  • Stem identification: identifies prefixes and suffixes so that they can be ignored during comparison
  • Word alignment: aligns segments between word pairs
  • Sound correspondence identification: automatically identifies sound correspondences and the environments in which they occur
  • Likely cognate identification: provides various methods for identifying likely cognates
  • Lexical/phonetic similarity: calculates lexical/phonetic similarity for multiple language varieties
  • Visualization: generates similarity matrices, hierarchical graphs (UPGMA, Neighbor-joining), and network graphs

Experimentation

The goal of Cog is to provide a framework for experimenting with different techniques for language variety comparison. It is intended to be used iteratively: run a comparison, analyze the results, refine the process, run the comparison again, and so on. Most steps in the process can be tailored. It currently only supports a few comparison techniques, but we hope to include many more in the future.

cog's People

Contributors

andrew-polk, ddaspit, megahirt, rmunn

cog's Issues

Similarity Matrix Controls Suggestion

The Similarity Matrix is intriguing. If I understand correctly, the numbers and analyses generated are a direct result of the cognate identification method. I think it would be helpful to at least put the method on the screen, so I as a user know what I'm looking at.

But I think even more helpful would be to move the Likely cognate identification box out of Settings, where it's currently hiding, and into the Similarity Matrix screen. That way I can manipulate settings and see the results immediately upon refresh without leaving the page. For a similar design, see: http://bigcharts.marketwatch.com/advchart/frames/frames.asp?symb=GNTX&insttype=&time=8&freq=1. Here the controls are on the left and the chart is on the right.

I am aware that the setting affects the Variety Pairs. An alternative to the above design could be to provide a popup window for these settings that could be called from the Similarity Matrix, or from the Settings. I personally don't think it's as elegant a solution, but I'm guessing it's doable.

Import word list data from WordSurv 7

There is currently no easy way to import data from WordSurv 7. WordSurv 7 can export data into Excel or CSV. The ability to import the specific CSV format that WordSurv 7 exports to should be added to Cog.

Can't Widen Wordlist Column

Hi Damien. Thanks for doing this project.

In the Input tab, Word lists, there is the Word lists table. The header of each column has a single meaning. I can hover over the divider between columns, and the cursor changes into a double-headed horizontal arrow, indicating that I can widen the column. That's what I want, because some of the meanings are too wide right now, and I can't tell the difference between a couple of them. So I click and hold to widen the column. The column won't widen for me.

If you need me to add a screen shot, let me know.

Add feature to automatically compute if a segment correspondence is significant

Currently, the Blair method (as it is implemented in Cog) is hard-coded to identify any correspondence that occurs three or more times as significant. This is a good guess, but a bit arbitrary, especially given that the sizes of word lists vary greatly. It would be better to automatically determine if a correspondence is significant using a statistical method. Cog could implement the idea outlined in the paper, "Assessing the Significance of Correspondences in Word Lists", by Ramzi Nahhas.
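
As a rough illustration of replacing the fixed threshold with a statistical test, here is a minimal sketch using a one-sided binomial test. This is an assumption made for illustration only: it is not necessarily the method from the Nahhas paper, and nothing here is taken from Cog. The function names, the chance probability `p_chance`, and the 0.05 cutoff are all hypothetical.

```python
from math import comb


def correspondence_p_value(count, n_pairs, p_chance):
    """One-sided binomial tail: probability of seeing `count` or more
    occurrences of a correspondence in `n_pairs` aligned word pairs if each
    pair shows it by chance with probability `p_chance` (hypothetical)."""
    return sum(
        comb(n_pairs, k) * p_chance**k * (1 - p_chance) ** (n_pairs - k)
        for k in range(count, n_pairs + 1)
    )


def is_significant(count, n_pairs, p_chance, alpha=0.05):
    """Flag a correspondence when its chance probability falls below alpha."""
    return correspondence_p_value(count, n_pairs, p_chance) < alpha


# Three occurrences mean different things for different list sizes.
print(is_significant(3, 50, 0.01))   # short word list
print(is_significant(3, 500, 0.01))  # long word list
```

With these assumed numbers, three occurrences pass the test on a 50-item list but not on a 500-item list, which is the kind of list-size sensitivity the fixed threshold cannot express.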

Add support for COMPASS method

The COMPASS method is used to identify cognates. It is an approximation of the comparative method and is currently implemented in WordSurv.

Tone and Syllabification

Tone marks seem to be messing with Cog's syllabification. For example, I have in my data ['ʔu˦ɾu]. Note the high tone ˦ halfway through the word. Cog syllabifies this with two syllables: 'ʔ|u˦ru|. It seems to me this should be: |'ʔu˦.ru|. In other words, Cog misses that the high tone most likely should be included with the syllable preceding it.
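
The behavior requested here can be sketched as a small post-processing step that moves any tone letters at the start of a syllable back onto the preceding syllable. This is only an illustration under stated assumptions: the function name and the list-of-strings representation are hypothetical and do not reflect how Cog stores syllables.

```python
# Chao tone letters U+02E5..U+02E9
TONE_LETTERS = set("˥˦˧˨˩")


def attach_tones(syllables):
    """Move tone letters that begin a syllable onto the previous syllable.

    `syllables` is a list of strings, one per syllable (hypothetical format).
    Tone letters with no preceding syllable are left where they are.
    """
    result = []
    for syl in syllables:
        i = 0
        while i < len(syl) and syl[i] in TONE_LETTERS:
            i += 1
        leading, rest = syl[:i], syl[i:]
        if leading and result:
            result[-1] += leading      # tone belongs to the syllable before it
        elif leading:
            rest = leading + rest      # nothing precedes: keep in place
        if rest:
            result.append(rest)
    return result


print(".".join(attach_tones(["ʔu", "˦ɾu"])))
```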

The distance between two complex segments is too large

When Cog performs alignment, it calculates a phonetic distance between segment pairs. This can also be used to determine similar segments for the Blair method. The computed distance between complex segments is too large.

Global similar segment charts

We realized that it would be nice to be able to see global (over all the data) PSS (phonetically-similar segments) charts showing frequency of occurrence. Something like this:

[Image: pss]

It would be nice to have the frequency listed next to the bar connecting the two sounds and when the user clicks on the bar, to pull up a list of the comparisons that involve that correspondence. It would be nice to have separate charts for initial (simple and cluster) consonants, final consonants, simple vowels, and poly-phthongs.

Add a New Meaning Dialog Broken

v.1.2.0.10002 (10/13/2015)
I click on Add a new Meaning. The dialog comes up. I add a gloss and a category. Whether I click on OK or tab over to the OK button and hit enter, nothing happens. I can't add a new meaning. All I can do is Cancel the dialog.

Word List Export Improvement

The word list export is currently in tab-delimited form. I suspect this is intended for import into other software. However, I may well need to include the word list in a document relatively soon. The current format is not really useful for that purpose. It'd be much better to turn the list 90 degrees. For instance:

            Variety 1   Variety 2   Variety 3
meaning 1   word 1-1    word 1-2    word 1-3
meaning 2   word 2-1    word 2-2    word 2-3
meaning 3   (etc.)
meaning 4
meaning 5
meaning 6
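
Until such an export exists, the rotation can be done outside Cog. The sketch below transposes a tab-delimited table so that rows become columns. It assumes the current export has one variety per row and one meaning per column, which is inferred from the request above rather than checked against Cog's actual output; `transpose_tsv` is a hypothetical helper name.

```python
import csv
import io


def transpose_tsv(text):
    """Swap rows and columns of tab-delimited text.

    Short rows are padded with empty cells so no data is dropped.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(zip(*padded))
    return out.getvalue()


exported = (
    "\tmeaning 1\tmeaning 2\n"
    "Variety 1\tword 1-1\tword 2-1\n"
    "Variety 2\tword 1-2\n"
)
print(transpose_tsv(exported))
```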

Rearrange meanings list

I am working off a recording done a dozen years ago by someone else. The recording was done on a cassette, later digitized, and uploaded online. I don't have the original word list the collector used, so I had to recreate it based on the recording. The problem is that the collector is hard to hear at times. She gave the word list to three different language groups, and I more or less pieced the word list together, but still missed some words. It would be really handy to be able to shuffle words around.

Other possible auto syllabification issues

The automatic syllabification feature is nice. But it seems to be missing some easy breaks, and I haven't been able to determine why. (Either that, or I'm misunderstanding what | is supposed to signify.) Some seem to have to do with the glottal stop, and others with the stress marker, but I haven't found a pattern to it yet. Some examples from Input / Varieties / Words:

'bi|ta
'?|u.ru
|mamaː˦ ˨|
|'ka|mi
|ɾuɾu'naʲ˦|

Add support for the SCA alignment algorithm

Cog currently only supports an ALINE-based word alignment algorithm. Johann-Mattis List outlines an alignment algorithm that uses sound classes similar to those used in the Dolgopolsky cognate identification method. The algorithm is implemented in the software package LingPy. The LingPy library could be integrated into Cog.

Citations:
List, J. M. 2012b. SCA: Phonetic alignment based on sound classes. In New Directions in logic, language, and computation. Slavkovik, M. and D Lassiter (Eds.). Springer.

List, J. M., Moran S. 2013. An open source toolkit for quantitative historical linguistics. In Proceedings of the 51st Conference of the Association for Computational Linguistics, pp. 13–18. Stroudsburg, PA: Assoc. Comput. Linguist.

Catastrophic failure with Edit Similar Consonants Chart

In my current database, I am getting a catastrophic failure when I go to Compare, Settings, and click on the Edit Similar Consonants Chart. This produces, at one count, (exactly) 50 error messages. I'll attempt to attach one of them to this error report. [Image: cog-crash]

I moved all of the error messages aside on the screen to reach a dialog that said something about sending more information. You may have that already. When I did that, the other error messages went away. When I didn't do that, I tried to kill Cog with Task Manager and couldn't. Cog only went away when I killed all the error messages with Task Manager.

I've edited the Similar Consonants Chart before without problem. I've changed some words and meanings, but I don't think that would cause the problem. The only distinctively different thing I can think of is that I just added affixes to each of the three varieties under Input, Varieties.

Edit: Seeing that the Cog file is quite small, I tried to attach that as well, but I couldn't. You should find it in your email box.

Edit: Note: After successfully closing down Cog without Task Manager, I brought it up again, only to have a moment of panic. I thought all my data was gone. I had to open the database again.

Compute percentage of similar segments used when comparing a variety pair

Here is the request from Wyn Owen:

"Say I specified 20 similar segments, could Cog count the number of these that were used in a particular comparison, eg 15 out of the 20 were used when comparing Variety 2 with Variety 5? The results could be put into a matrix in percentage form. This would give a quick view of differences between pairs that might point to varieties that need further investigation."

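The requested count could be computed roughly as follows. This is a sketch under stated assumptions, not Cog's data model: segment pairs are represented as two-element tuples, the pairs observed in each variety-pair comparison are supplied by the caller, and `usage_percentages` is a hypothetical name.

```python
def usage_percentages(specified, observed):
    """Percentage of user-specified similar segment pairs seen per comparison.

    specified: iterable of (segment, segment) pairs entered by the user
    observed:  dict mapping (variety, variety) -> iterable of segment pairs
               that actually occurred in that comparison
    Pairs are unordered, so ("p", "b") and ("b", "p") count as the same pair.
    """
    spec = {frozenset(pair) for pair in specified}
    if not spec:
        return {varieties: 0.0 for varieties in observed}
    return {
        varieties: 100.0 * len(spec & {frozenset(p) for p in pairs}) / len(spec)
        for varieties, pairs in observed.items()
    }


specified = [("p", "b"), ("t", "d"), ("k", "g"), ("s", "z")]
observed = {("Variety 2", "Variety 5"): [("b", "p"), ("t", "d"), ("m", "n")]}
print(usage_percentages(specified, observed))
```

The resulting dictionary is keyed by variety pair, so it can be laid out as a matrix in the same way as the similarity matrix.
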
Data Input/Storage Recommendation

From reading through some of the Cog documentation, it appears that Cog was meant to interface with WordSurv. WordSurv to Cog is certainly one obvious data path. If I'm reading the documentation correctly, there is also speculation that Cog data might be stored in WordSurv at some future date.

There are two or three issues with this: 1) I am a linguist working with a recording that is a dozen years old. I didn't start with WordSurv. I started with Cog. I'm not sure, but I get the feeling others might take the same route. 2) The phonetic transcription UX in both Cog and WordSurv is less than optimal, particularly if you compare them with software such as ELAN. 3) If I remember correctly, WordSurv stores data in MS Access, a proprietary data format.

I recommend taking the approach that the SayMore team took to address these issues. SayMore stores data in ELAN's .eaf format. This allows users to transcribe data using SayMore or ELAN. That is, the same data can be edited using ELAN or SayMore as a front end.

I can even right-click on the data file in SayMore, and this will pop up a menu offering me to "Open in Program Associated with this File..." This defaults to ELAN, but I get the impression that I could change it to something else if I preferred. So, for example, if a user prefers Praat for some reason, he could use that instead of ELAN.

If Cog used this approach, it would: 1) Give users a refined, open source input method for audio or video phonetic transcription, if desired; 2) Store Cog data in an established, open format that is already used in software data exchange. This should be better than storing data in a unique format that no one else yet knows about.

I recognize that this would require some reworking of Cog's word input UX and data storage mechanisms. I don't have the code, and so I don't know how extensive such a rework would be (even if I have suspicions). Even so, I expect the payoff would be significant enough to justify the expense.

Word List Export: Enhancement Request

Currently Cog exports tab-delimited text. This is a good start, but I'm not aware of any software that can import that data in its present state. I have spent many hours doing phonetic transcription on roughly 200 words for 3 languages, totaling about 600 words. That would be a great base of data to export elsewhere. In my case, I would like to import it into FLEx.

FLEx currently imports LIFT, standard format, and LinguaLinks data. I've been out of the FLEx team for a while, but my assumption at the moment is that LIFT might be the best format for Cog export. That assumption would need to be verified.

Given that FLEx is based on one writing system, I expect that Cog would need a dialog asking which variety to export.
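
To make the idea concrete, here is a minimal sketch of writing one variety's words as LIFT-style XML with Python's standard library. It is untested against FLEx and makes several assumptions: the input is a list of (gloss, form) pairs for a single variety, the `cog-N` entry ids and the `qaa-x-var1` writing-system tag are hypothetical placeholders, and a real export would likely need more (for example GUIDs and a LIFT ranges file).

```python
import xml.etree.ElementTree as ET


def words_to_lift(words, lang, gloss_lang="en"):
    """Build a bare-bones LIFT document from (gloss, form) pairs.

    lang is the writing-system tag for the exported variety (placeholder);
    each word becomes one entry with a lexical-unit and a single sense.
    """
    root = ET.Element("lift", version="0.13")
    for number, (gloss, form) in enumerate(words, start=1):
        entry = ET.SubElement(root, "entry", id=f"cog-{number}")
        lexical_unit = ET.SubElement(entry, "lexical-unit")
        form_el = ET.SubElement(lexical_unit, "form", lang=lang)
        ET.SubElement(form_el, "text").text = form
        sense = ET.SubElement(entry, "sense")
        gloss_el = ET.SubElement(sense, "gloss", lang=gloss_lang)
        ET.SubElement(gloss_el, "text").text = gloss
    return ET.tostring(root, encoding="unicode")


print(words_to_lift([("sit", "sindaun")], lang="qaa-x-var1"))
```

Because each form carries a single `lang` attribute, this layout matches the point above that the user would have to pick one variety per export.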

Add Meaning: Invalid Duplicate Meaning Bug

I have v. 1.2.3. now.

  1. I added a meaning 'sindaun-3PM' (which is 'sindaun'-third-plural-masculine). Then I added 'sindaun-3PF'.
  2. Since I don't have the ability to rearrange meanings (yet), and I wanted to put the singular forms before the plural forms, I merely changed these to 'sindaun-3SM' and 'sindaun-3SF', respectively.
  3. I added a new meaning, 'sindaun-3PM'.

Cog flags the new 'sindaun-3PM' as a duplicate. This is not true. I did have a 'sindaun-3PM' before, but I just changed it to 'sindaun-3SM'. 'sindaun-3PM' does not exist anymore, even though Cog says it does.

Figuring that Cog had the meaning stuck in memory, I hit save and tried again. That made no difference. Cog still flags 'sindaun-3PM' as a duplicate.

I then exited Cog and started Cog again. That flushed the memory sufficiently for me to add the new meaning. But I shouldn't have to exit Cog to do that.

Add New Meaning

This is another UX issue that got to me a little.

There are two ways to go to Add Meaning: 1) Input, Word lists, Add a new meaning; or 2) Input, Meanings, Add a new meaning. Either way brings up a simple Edit Meaning dialog.

Therein lies the problem. I had a couple hundred meanings to add. I had to add them one at a time by: 1) Click on Add a new meaning; 2) Type the gloss; 3) press tab; 4) type the category; 5) press tab; 6) press OK; 7) Take the hands off the keyboard for the mouse to Click on Add a new meaning again.

It would be so nice to offer another method, especially one that would not require me to take my hands off the keyboard. (Try entering several dozen meanings sometime with the current dialog to see what I mean.) One possibility is: Go to the table in Input, Meanings and be able to hit Enter for a new row. Type the gloss, tab, type the category, and hit Enter for a new row.

Hybrid Alignment Mode Hangs Variety Pair Comparison

To reproduce:

  1. Click on Compare. Go to Settings.
  2. In the Alignment mode at the top of the screen, select "Hybrid (semi-global)" from the drop-down combo box.
  3. Go way down to the bottom of the screen and press "Apply".
  4. Still under the Compare tab, select Similarity Matrix.
  5. On the left side of the screen, click on "Compare all variety pairs".

Observed:

A small dialog appears showing 0% completed. The Cancel button is apparently active, but the process remains stuck at 0% completed. The red X at the top right of the little dialog is fully active, and this is the only way to stop an otherwise hung process. If this isn't used, Cog must be closed using Task Manager.

Note:

I'm seeing the same problem with Partial and Beginning Alignment modes. Only Full alignment mode is working correctly at the moment.

Automated syllabification

It would improve the IPA segmentation capability if Cog could automatically identify syllables. The syllabification capability should work well enough out-of-the-box, but should be customizable to improve accuracy.

Add better support for editing and viewing similar segments

Currently, there is no way to easily view which segment pairs are considered similar. This is especially true of correspondences that are considered similar because of the distance threshold. For segment pairs that are entered manually, there is only a list. I think the best way to deal with this is to provide a table with all segments listed on the x and y axes and checkmarks in the cells of segment pairs that are considered similar. If all segments are displayed on a graph similar to the "Global Correspondences Chart" view, it would be too confusing.
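
The proposed table can be mocked up in plain text as below. This is a sketch of the layout only, under assumptions: the segment list and similar pairs are passed in by the caller, each segment is a single character so the columns line up, and `similarity_table` is a hypothetical name. An "x" stands in for the checkmark.

```python
def similarity_table(segments, similar_pairs):
    """Render a segment-by-segment grid with 'x' where a pair is similar."""
    similar = {frozenset(pair) for pair in similar_pairs}
    lines = ["  " + " ".join(segments)]          # header row
    for row in segments:
        cells = [
            "x" if frozenset((row, col)) in similar else "."
            for col in segments
        ]
        lines.append(row + " " + " ".join(cells))
    return "\n".join(lines)


print(similarity_table(["p", "b", "m"], [("p", "b")]))
```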

Add ability to export the similarity matrix as a distance matrix in NEXUS format

Currently, the cognate set information can be exported to NEXUS format. It would be useful to be able to export the similarity matrix as well. It would look something like this:

#NEXUS

BEGIN taxa;
DIMENSIONS ntax=5;
TAXLABELS
[1] 'VarietyA'
[2] 'VarietyB'
[3] 'VarietyC'
[4] 'VarietyD'
[5] 'VarietyE'
;
END;
BEGIN distances;
DIMENSIONS ntax=5;
FORMAT
triangle=LOWER
diagonal
labels
missing=?
;
MATRIX

[1]'VarietyA' 0
[2]'VarietyB' 0.05 0
[3]'VarietyC' 0.05 0.02 0
[4]'VarietyD' 0.05 0.02 0.03 0
[5]'VarietyE' 0.07 0.05 0.05 0.03 0

;
END;
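
A writer for this layout could look roughly like the sketch below. It assumes the similarity matrix is a square list of lists with values between 0 and 1 and that distance is simply 1 minus similarity; both are assumptions for illustration, not a description of Cog's internals. `nexus_distances` is a hypothetical name, and variety names containing quotes would need escaping that is not handled here.

```python
def nexus_distances(names, similarity):
    """Format a similarity matrix as a NEXUS file with a lower-triangle
    distances block, using distance = 1 - similarity (assumed)."""
    n = len(names)
    lines = ["#NEXUS", "", "BEGIN taxa;", f"DIMENSIONS ntax={n};", "TAXLABELS"]
    lines += [f"[{i}] '{name}'" for i, name in enumerate(names, start=1)]
    lines += [";", "END;", "BEGIN distances;", f"DIMENSIONS ntax={n};",
              "FORMAT", "triangle=LOWER", "diagonal", "labels", "missing=?",
              ";", "MATRIX"]
    for i, name in enumerate(names):
        # Off-diagonal cells to the left of the diagonal, then the 0 diagonal.
        cells = [f"{1.0 - similarity[i][j]:.2f}" for j in range(i)] + ["0"]
        lines.append(f"[{i + 1}]'{name}' " + " ".join(cells))
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"


sim = [[1.0, 0.95, 0.95],
       [0.95, 1.0, 0.98],
       [0.95, 0.98, 1.0]]
print(nexus_distances(["VarietyA", "VarietyB", "VarietyC"], sim))
```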

Add airstream mechanism phonological feature

Cog cannot currently differentiate between regular consonants, ejectives, implosives, and clicks, because it does not have an airstream mechanism feature. The list of phonological features and the segment inventory should be updated to include this feature.

Multiple sequence alignment

Cog only aligns word pairs. It might be useful to align words from all varieties that have the same gloss.

Add support for LexStat cognate identification algorithm

LexStat is a statistical algorithm for detecting cognates. It would be good to add this as another cognate identification method. It is implemented in the LingPy library.

List, J.-M. (2012). LexStat: Automatic Detection of Cognates in Multilingual Wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH (pp. 117–125).

Edit Similar Consonants Chart Window is Modal

I'm looking at the Edit Similar Consonants Window.

I noted that the chart marked <ŋ> as similar to . That is the obvious assumption, but I think there's only one <ŋ> in the whole data set, and that is probably the result of preceding a < k >. I wanted to check to be sure, but I can't. The window is modal. I would prefer that it not be, so I can leave the chart up and go look at the data set.

Of course, I have to eyeball the data to look for it. There's no other way. But I believe that is a separate issue.

Nonexistent Words

This is more a question than an issue, I suspect, but it could turn into one or more issues. I don't have your email address anymore, nor could I find a way to contact you here, so I couldn't ask this off-line.

In certain cases, one variety has a word, but the other doesn't. For example, one variety has pronouns for first-person-singular-masculine and first-person-singular-feminine. It does not have a generic first-person-singular that both men and women use. In contrast, the other variety recognizes a generic first-person-singular only. It does not distinguish whether a man or a woman is speaking. This is a major contrast, but I'm not sure how to designate it in Cog. Leave it blank?

In other cases, I don't have a word in one variety. The language most likely does have the word, but the recording wasn't clear, or maybe the collector simply forgot to ask it in one variety. I don't want Cog flagging this as a major contrast like I would in the pronoun example above. How do I differentiate between a missing word and a non-existent word?

Generate correspondence sets across all varieties

Currently, sound correspondences are only identified between variety pairs. It would be a good step in the direction of the comparative method, if Cog could identify sets of correspondences that align across multiple varieties.

Word lists next word

This isn't a bug per se, but a UX issue that's getting to me.

I'm in the input tab, Word lists. I have three varieties set up with hundreds of meanings, and I'm entering words now. I enter a word into a cell and hit tab to go to the next cell for the variety. This works fine; the cell is highlighted. But I can't type in it.

If I am using Excel or LibreOffice Calc, I can tab and type in the next cell. I can't do that with Cog. To edit the cell, I must click on it. This might not seem like much, but after entering a couple dozen words, I have found that it slows me way down, and I keep thinking it should act like a spreadsheet table and it doesn't.

Affix Look up in Words

So I ran the tool that removes affixes from words in the variety. I think the doc calls it the stemmer. This came up with about 10 affixes under Input / Varieties. Most of them were a surprise, and I don't think most of them are correct, but I wanted to check them.

How?

I have the Find Words utility. So I click on that, enter 'ti' as the search text, and tell it to look in Form, because 'ti' has been identified as a suffix in my V (verb) category. This dutifully goes out and finds all instances of 'ti' in words, including the first part of [tiβi], which is an N (noun). This is not very helpful.

What I would really like is a list of verbs that have been identified by the utility as having [-ti] as a suffix.

One suggestion for how to do this: Paratext does something similar with its key term look up. If the same principle is applied here, I would double-click on 'ti' on the affix list. This would give me a list of verbs with that suffix.
