sillsdev / cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques.

Home Page: http://sillsdev.github.io/cog/

License: MIT License

C# 95.65% Python 0.24% Shell 0.27% PowerShell 0.04% R 0.15% Rich Text Format 3.66%

cog's Introduction

Cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties.

Features

IPA-based segmentation: automatically splits words into segments
Stem identification: identifies prefixes and suffixes so that they can be ignored during comparison
Word alignment: aligns segments between word pairs
Sound correspondence identification: automatically identifies sound correspondences and the environments in which they occur
Likely cognate identification: provides various methods for identifying likely cognates
Lexical/phonetic similarity: calculates lexical/phonetic similarity for multiple language varieties
Visualization: generates similarity matrices, hierarchical graphs (UPGMA, Neighbor-joining), and network graphs

Experimentation

The goal of Cog is to provide a framework for experimenting with different techniques for language variety comparison. It is intended to be used iteratively: run a comparison, analyze the results, refine the process, run the comparison again, and so on. Most steps in the process can be tailored. It currently only supports a few comparison techniques, but we hope to include many more in the future.

cog's Issues

Generate correspondence sets across all varieties

Currently, sound correspondences are only identified between variety pairs. It would be a good step toward the comparative method if Cog could identify sets of correspondences that align across multiple varieties.

Add find and replace capability to word lists view

Currently, users can only search word lists. Users often need to fix transcription problems in bulk, and a replace feature would make that possible.

Multiple sequence alignment

Cog only aligns word pairs. It might be useful to align words from all varieties that have the same gloss.

Migrate application settings when assembly version is updated

Currently, the application settings are lost when the assembly version changes. The settings should be migrated in this case. Also, the assembly version should not include the build number.

Affix Look up in Words

So I ran the tool that removes affixes from words in the variety. I think the doc calls it the stemmer. This came up with about 10 affixes under Input / Varieties. Most of them were a surprise, and I don't think most of them are correct, but I wanted to check them.

How?

I have the Find Words utility. So I click on that, enter 'ti' as the text to look for, and tell it to look in Form, because 'ti' has been identified as a suffix in my V (verb) category. This dutifully goes out and finds all instances of 'ti' in words, including the first part of [tiβi], which is an N (noun). This is not very helpful.

What I would really like is a list of verbs that have been identified by the utility as having [-ti] as a suffix.

One suggestion for how to do this: Paratext does something similar with its key term look up. If the same principle is applied here, I would double-click on 'ti' on the affix list. This would give me a list of verbs with that suffix.

The distance between two complex segments is too large

When Cog performs alignment, it calculates a phonetic distance between segment pairs. This can also be used to determine similar segments for the Blair method. The computed distance between complex segments is too large.

Add ability to set explicit exceptions to threshold-based similar segments

Normally, you can either specify similar segments explicitly as a list or by a phonetic threshold. It would be helpful to be able to mix these two approaches. If a correspondence meets the threshold, it will be considered similar. If it doesn't, it will be checked against a list.
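The two-step check described above could be sketched roughly as follows. This is only an illustration: the function names, the toy distance table, and the threshold value are hypothetical and are not taken from Cog's code.

```python
from typing import Callable, FrozenSet, Set


def make_similarity_check(
    distance: Callable[[str, str], float],
    threshold: float,
    explicit_pairs: Set[FrozenSet[str]],
) -> Callable[[str, str], bool]:
    """Combine a phonetic distance threshold with an explicit list of pairs."""

    def is_similar(a: str, b: str) -> bool:
        # Step 1: a pair within the threshold counts as similar.
        if distance(a, b) <= threshold:
            return True
        # Step 2: otherwise, check the hand-specified list of exceptions.
        return frozenset((a, b)) in explicit_pairs

    return is_similar


# Hypothetical distances for illustration only (not Cog's phonetic distances).
TOY_DISTANCES = {
    frozenset(("p", "b")): 0.1,
    frozenset(("p", "k")): 0.6,
    frozenset(("s", "h")): 0.8,
}


def toy_distance(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return TOY_DISTANCES.get(frozenset((a, b)), 1.0)


is_similar = make_similarity_check(
    toy_distance, threshold=0.3, explicit_pairs={frozenset(("s", "h"))}
)
```

Here p/b passes on the threshold alone, s/h fails the threshold but is accepted through the explicit list, and p/k is rejected by both.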

Jump to invalid word when a user clicks on red "x" in word lists view

Currently, the user has to scroll through all words in a variety to find invalid words. This will allow users to quickly find invalid words.

Crash occurs when syllabifying a word with only a syllable break marker in it

Cog should gracefully handle words that only contain a syllable break marker (period character).
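A minimal sketch of the kind of guard this implies, assuming syllables are obtained by splitting on the break marker. The function name is hypothetical and Cog's actual syllabifier is not shown here.

```python
from typing import List


def split_syllables(word: str, marker: str = ".") -> List[str]:
    """Split a word on the syllable break marker, ignoring empty pieces."""
    # A word such as "." yields only empty strings, which are dropped,
    # so the caller receives an empty list instead of empty syllables.
    return [piece for piece in word.split(marker) if piece]
```

With this approach a word consisting only of the marker produces no syllables, which the caller can then treat as an invalid or empty word.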

Add better support for editing and viewing similar segments

Currently, there is no way to easily view which segment pairs are considered similar. This is especially true of correspondences that are considered similar because of the distance threshold. For segment pairs that are entered manually, there is only a list. I think the best way to deal with this is to provide a table with all segments listed on the x and y axes and checkmarks in the cells of segment pairs that are considered similar. If all segments are displayed on a graph similar to the "Global Correspondences Chart" view, it would be too confusing.

Add feature to automatically compute if a segment correspondence is significant

Currently, the Blair method (as it is implemented in Cog) is hard-coded to identify any correspondence that occurs three or more times as significant. This is a good guess, but a bit arbitrary, especially given the fact that the size of word lists varies greatly. It would be better to automatically determine if a correspondence is significant using a statistical method. Cog could implement the idea outlined in the paper, "Assessing the Significance of Correspondences in Word Lists", by Ramzi Nahhas.
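For reference, the fixed-count rule described above amounts to something like the following sketch. It only illustrates the "three or more occurrences" rule; it does not implement the statistical method from the Nahhas paper, and the function name and input shape are assumptions.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def significant_correspondences(
    aligned_segments: Iterable[Tuple[str, str]], min_count: int = 3
) -> Set[Tuple[str, str]]:
    """Return correspondences that occur at least min_count times."""
    # Each item is one aligned segment pair (segment in variety 1,
    # segment in variety 2) taken from the aligned word pairs.
    counts = Counter(aligned_segments)
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical aligned segment pairs for illustration.
observed = [("p", "b"), ("p", "b"), ("p", "b"), ("t", "d"), ("t", "d")]
result = significant_correspondences(observed)
```

A statistical test would replace the fixed `min_count` with a cutoff that depends on the word list size and segment frequencies.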

Nonexistent Words

This is more a question than an issue, I suspect, but it could turn into one or more issues. I don't have your email address anymore, nor could I find a way to contact you here, so I couldn't ask this off-line.

In certain cases, one variety has a word, but the other doesn't. For example, one variety has pronouns for first-person-singular-masculine and first-person-singular-feminine. It does not have a generic first-person-singular that both men and women use. In contrast, the other variety recognizes a generic first-person-singular only. It does not distinguish whether a man or a woman is speaking. This is a major contrast, but I'm not sure how to designate it in Cog. Leave it blank?

In other cases, I don't have a word in one variety. The language most likely does have the word, but the recording wasn't clear, or maybe the collector simply forgot to ask it in one variety. I don't want Cog flagging this as a major contrast like I would in the pronoun example above. How do I differentiate between a missing word and a non-existent word?

Navigate to a specific word pair from the Multiple Word Alignment view

Allow the user to select two words in the Multiple Word Alignment view and navigate to that word pair in the Variety Pair view.

Add support for specifying which variety pairs a set of similar segments apply to

Currently, any specified similar segments apply to all variety pairs in a Cog project. Some users would like to have more fine-grained control of this, so they can specify which variety pairs a particular set applies to. This would help to deal with outlier varieties in a project.

Add support for the SCA alignment algorithm

Cog currently only supports an ALINE-based word alignment algorithm. Johann-Mattis List outlines an alignment algorithm that uses sound classes similar to those used in the Dolgopolsky cognate identification method. The algorithm is implemented in the software package LingPy. The LingPy library could be integrated into Cog.

Citations:
List, J. M. 2012b. SCA: Phonetic alignment based on sound classes. In New Directions in logic, language, and computation. Slavkovik, M. and D Lassiter (Eds.). Springer.

List, J. M., Moran S. 2013. An open source toolkit for quantitative historical linguistics. In Proceedings of the 51st Conference of the Association for Computational Linguistics, pp. 13–18. Stroudsburg, PA: Assoc. Comput. Linguist.

Similarity Matrix Controls Suggestion

The Similarity Matrix is intriguing. If I understand correctly, the numbers and analyses generated are a direct result of the cognate identification method. I think it would be helpful to at least put the method on the screen, so I as a user know what I'm looking at.

But I think even more helpful would be to move the Likely cognate identification box out of settings where it's currently hiding and move it into the Similarity matrix screen. That way I can manipulate settings and see the results immediately upon refresh without leaving the page. For a similar design, see: http://bigcharts.marketwatch.com/advchart/frames/frames.asp?symb=GNTX&insttype=&time=8&freq=1. Here the controls are on the left and the chart is on the right.

I am aware that the setting affects the Variety Pairs. An alternative to the above design could be to provide a popup window for these settings that could be called from the Similarity Matrix, or from the Settings. I personally don't think it's as elegant a solution, but I'm guessing it's doable.

Tone and Syllabification

Tone marks seem to be messing with Cog's syllabification. For example, I have in my data ['ʔu˦ɾu]. Note the high tone ˦ halfway through the word. Cog syllabifies this with two syllables 'ʔ|u˦ru|. Seems to me this should be: |'ʔu˦.ru| . In other words, Cog misses that the high tone most likely should be included with the syllable preceding it.
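One way to express the expectation that a tone letter stays with the preceding syllable is a post-processing step like the sketch below. This is an assumption about how such a fix might look, not Cog's syllabifier; the function name is hypothetical, and only the Unicode tone letters U+02E5 to U+02E9 are handled.

```python
from typing import List

# Chao tone letters, extra-high (U+02E5) through extra-low (U+02E9).
TONE_LETTERS = {chr(cp) for cp in range(0x02E5, 0x02EA)}


def attach_tone_letters(syllables: List[str]) -> List[str]:
    """Move tone letters that start a syllable onto the previous syllable."""
    result: List[str] = []
    for syllable in syllables:
        # Count how many tone letters open this syllable.
        i = 0
        while i < len(syllable) and syllable[i] in TONE_LETTERS:
            i += 1
        leading, rest = syllable[:i], syllable[i:]
        if leading and result:
            # Attach the tone letters to the syllable that precedes them.
            result[-1] += leading
        else:
            # No leading tone, or no previous syllable: keep unchanged.
            rest = syllable
        if rest:
            result.append(rest)
    return result
```

For example, a split of `["ʔu", "˦ɾu"]` would be regrouped as `["ʔu˦", "ɾu"]`.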

Add support for LexStat cognate identification algorithm

LexStat is a statistical algorithm for detecting cognates. It would be good to add this as another cognate identification method. It is implemented in the LingPy library.

List, J.-M. (2012). LexStat: Automatic Detection of Cognates in Multilingual Wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH (pp. 117–125).

Data Input/Storage Recommendation

Reading through some of the Cog documentation, Cog was apparently meant to interface with WordSurv. WordSurv to Cog is certainly one obvious data path. If I'm reading the documentation correctly, there is also speculation that Cog data might be stored in WordSurv at some future date.

There are two or three issues with this: 1) I am a linguist working with a recording that is a dozen years old. I didn't start with WordSurv. I started with Cog. I'm not sure, but I get the feeling others might take the same route. 2) The phonetic transcription UX in both Cog and WordSurv is less than optimal, particularly if you compare them with software such as ELAN. 3) If I remember correctly, WordSurv stores data in MS Access, a proprietary data format.

I recommend taking the approach that the SayMore team took to address these issues. SayMore stores data in ELAN's .eaf format. This allows users to transcribe data using SayMore or ELAN. That is, the same data can be edited using ELAN or SayMore as a front end.

I can even right-click on the data file in SayMore, and this will pop up a menu offering me to "Open in Program Associated with this File..." This defaults to ELAN, but I get the impression that I could change it to something else if I preferred. So, for example, if a user prefers Praat for some reason, he could use that instead of ELAN.

If Cog used this approach, it would: 1) Give users a refined, open source input method for audio or video phonetic transcription, if desired; 2) Store Cog data in an established, open format that is already used in software data exchange. This should be better than storing data in a unique format that no one else yet knows about.

I recognize that this would require some reworking of Cog's word input UX and data storage mechanisms. I don't have the code, and so I don't know how extensive such a rework would be (even if I have suspicions). Even so, I expect the payoff would be significant enough to justify the expense.

Rows in segments table in variety view resize while scrolling horizontally

While scrolling horizontally, the rows in the segments table resize weirdly.

Add New Meaning

This is another UX issue that got to me a little.

There are two ways to go to Add Meaning: 1) Input, Word lists, Add a new meaning; or 2) Input, Meanings, Add a new meaning. Either way brings up a simple Edit Meaning dialog.

Therein lies the problem. I had a couple hundred meanings to add. I had to add them one at a time by: 1) Click on Add a new meaning; 2) Type the gloss; 3) press tab; 4) type the category; 5) press tab; 6) press OK; 7) Take the hands off the keyboard for the mouse to Click on Add a new meaning again.

It would be so nice to offer another method. Especially one that would not require me to take my hands off the keyboard. (Try entering several dozen meanings sometime with the current dialog to see what I mean.) One possibility is: Go to the table in Input, Meanings and be able to hit Enter for a new row. Type the gloss, tab, type the meaning, and hit enter for a new row.

Display a scale on dendrograms

A scale of some type would be helpful in interpreting UPGMA and neighbor-joining dendrograms.

Add Meaning: Invalid Duplicate Meaning Bug

I have v. 1.2.3. now.

I added a meaning 'sindaun-3PM' (which is 'sindaun'-third-plural-masculine). Then I added 'sindaun-3PF'.
Since I don't have the ability to rearrange meanings (yet), and I wanted to put the singular forms before the plural forms, I merely changed these to 'sindaun-3SM' and 'sindaun-3SF', respectively.
I added a new meaning, 'sindaun-3PM'.

Cog flags the new 'sindaun-3PM' as a duplicate. This is not true. I did have a 'sindaun-3PM' before, but I just changed it to 'sindaun-3SM'. 'sindaun-3PM' does not exist anymore, even though Cog says it does.

Figuring that Cog had the meaning stuck in memory, I hit save and tried again. That made no difference. Cog still flags 'sindaun-3PM' as a duplicate.

I then exited Cog and started Cog again. That flushed the memory sufficiently for me to add the new meaning. But I shouldn't have to exit Cog to do that.

Rearrange meanings list

I am working off a recording done a dozen years ago by someone else. The recording was done on a cassette, later digitized, and uploaded online. I don't have the original word list the collector used, so I had to recreate it based on the recording. The problem is that the collector is hard to hear at times. She gave the word list to three different language groups, and I more or less pieced the word list together, but still missed some words. It would be really handy to be able to shuffle words around.

Provide a way to manually determine that two words are cognates

Currently, all cognate decisions are completely automatic. Allow users to override these decisions manually.

Expand ability to format the similarity matrix

Currently, the similarity matrix is only displayed as a full-matrix with a set color scheme. Some users prefer alternative similarity matrix formats.

Global similar segment charts

We were realizing that it would be nice to be able to see global (over all the data) PSS (phonetically-similar segments) charts showing frequency of occurrence. Something like this:

It would be nice to have the frequency listed next to the bar connecting the two sounds and, when the user clicks on the bar, to pull up a list of the comparisons that involve that correspondence. It would be nice to have separate charts for initial (simple and cluster) consonants, final consonants, simple vowels, and polyphthongs.

Add support for importing KML files

There is currently no way to import geographic region data. Add the ability to import region data in the KML format.

Word lists next word

This isn't a bug per se, but a UX issue that's getting to me.

I'm in the input tab, Word lists. I have three varieties set up with hundreds of meanings, and I'm entering words now. I enter a word into a cell and hit tab to go to the next cell for the variety. This works fine; the cell is highlighted. But I can't type in it.

If I am using Excel or LibreOffice Calc, I can tab and type in the next cell. I can't do that with Cog. To edit the cell, I must click on it. This might not seem like much, but after entering a couple dozen words, I have found that it slows me way down, and I keep thinking it should act like a spreadsheet table and it doesn't.

Add capability to set phonological features for a segment based on environment

There are cases where a user might want to treat a consonant as a vowel or vice-versa in a certain phonological environment. A UI for this is probably not currently necessary, since it is a rare use case.

Edit Similar Consonants Chart Window is Modal

I'm looking at the Edit Similar Consonants Window.

I noted that the chart marked <ŋ> as similar to . That is the obvious assumption, but I think there's only one <ŋ> in the whole data set, and that is probably the result of preceding a < k >. I wanted to check to be sure, but I can't. The window is modal. I would prefer that it not be, so I can leave the chart up and go look at the data set.

Of course, I have to eyeball the data to look for it. There's no other way. But I believe that is a separate issue.

Add a New Meaning Dialog Broken

v.1.2.0.10002 (10/13/2015)
I click on Add a new Meaning. The dialog comes up. I add a gloss and a category. Whether I click on OK or tab over to the OK button and hit enter, nothing happens. I can't add a new meaning. All I can do is Cancel the dialog.

Catastrophic failure with Edit Similar Consonants Chart

In my current database, I am getting a catastrophic failure when I go to Compare, Settings, and click on the Edit Similar Consonants Chart. This produces, at one count, (exactly) 50 error messages. I'll attempt to attach one of them to this error report.

I moved all of the error messages aside on the screen to reach a dialog that said something about sending more information. You may have that already. When I did that, the other error messages went away. If I didn't do that, I tried to kill Cog with Task Manager and couldn't. Cog only went away when I killed all the error messages with Task Manager.

I've edited the Similar Consonants Chart before without problem. I've changed some words and meanings, but I don't think that would cause the problem. The only distinctively different thing I can think of is that I just added affixes to each of the three varieties under Input, Varieties.

Edit: Seeing that the Cog file is quite small, I tried to attach that as well, but I couldn't. You should find it in your email box.

Edit: Note: After successfully closing down Cog without Task Manager, I brought it up again, only to have a moment of panic. I thought all my data was gone. I had to open the database again.

Provide a way to ignore certain correspondences when using the Blair method

Add an editable list of correspondences under the Blair method settings that specifies correspondences that are ignored. The UI can be similar to the similar segments control.

Add subscript number to variety name in multiple word alignment view

Currently, there is no easy way to tell if there are multiple words for a variety in the multiple word alignment view.

Can't Widen Wordlist Column

Hi Damien. Thanks for doing this project.

In the Input tab, Word lists, there is the Word lists table. The header of each column has a single meaning. I can hover over the divider between columns, and the cursor changes into a double-headed horizontal arrow, indicating that I can widen the column. That's what I want, because some of the meanings are too wide right now, and I can't tell the difference between a couple of them. So I click and hold to widen the column. The column won't widen for me.

If you need me to add a screen shot, let me know.

Word List Export Improvement

The word list export is currently in tab-delimited form. I suspect this is intended for import into other software. However, I may well need to include the word list in a document relatively soon. The current format is not really useful for that purpose. It'd be much better to turn the list 90 degrees. For instance:

	Variety 1	Variety 2	Variety 3
meaning 1	word 1-1	word 1-2	word 1-3
meaning 2	word 2-1	word 2-2	word 2-3
meaning 3	(etc.)
meaning 4
meaning 5
meaning 6
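As a stopgap outside of Cog, a tab-delimited export can be rotated with a few lines of Python. This sketch assumes the existing export has one variety per row and one meaning per column, which is only inferred from the request above; the function name is hypothetical.

```python
from itertools import zip_longest


def transpose_tsv(text: str) -> str:
    """Swap the rows and columns of tab-delimited text."""
    rows = [line.split("\t") for line in text.splitlines()]
    # zip_longest pads short rows with empty cells so no data is dropped.
    columns = zip_longest(*rows, fillvalue="")
    return "\n".join("\t".join(column) for column in columns) + "\n"


# Hypothetical export with varieties as rows and meanings as columns.
exported = (
    "\tmeaning 1\tmeaning 2\n"
    "Variety 1\tword 1-1\tword 2-1\n"
    "Variety 2\tword 1-2\tword 2-2\n"
)
rotated = transpose_tsv(exported)
```

The rotated text has one meaning per row and one variety per column, matching the layout shown above.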

Add support for COMPASS method

The COMPASS method is used to identify cognates. It is an approximation of the comparative method, as currently implemented in WordSurv.

Automated syllabification

It would improve the IPA segmentation capability if Cog could automatically identify syllables. The syllabification capability should work well enough out-of-the-box, but should be customizable to improve accuracy.

Other possible auto syllabification issues

The automatic syllabification feature is nice. But it seems to be missing some easy breaks, and I haven't been able to determine why. (Either that, or I'm misunderstanding what | is supposed to signify.) Some seem to have to do with the glottal stop, and others with the stress marker, but I haven't found a pattern to it yet. Some examples from Input / Varieties / Words:

'bi|ta
'?|u.ru
|mamaː˦ ˨|
|'ka|mi
|ɾuɾu'naʲ˦|

Import word lists from clipboard

Provide the capability to import word lists that have been copied to the clipboard from Excel.

Add option to display all segments in global correspondences chart

Currently, Cog only displays segments that have corresponded to other segments. Segments that only correspond to themselves are filtered out. Some users might want to see all of the segments that occur in the word lists.

Add airstream mechanism phonological feature

Cog cannot currently differentiate between regular consonants, ejectives, implosives, and clicks, because it does not have an airstream mechanism feature. The list of phonological features and the segment inventory should be updated to include this feature.

Import word list data from WordSurv 7

There is currently no easy way to import data from WordSurv 7. WordSurv 7 can export data into Excel or CSV. The ability to import the specific CSV format that WordSurv 7 exports to should be added to Cog.

Hybrid Alignment Mode Hangs Variety Pair Comparison

To reproduce:

Click on Compare. Go to Settings.
In the Alignment mode at the top of the screen, select "Hybrid (semi-global)" from the drop-down combo box.
Go way down to the bottom of the screen and press "Apply".
Still under the Compare tab, select Similarity Matrix.
On the left side of the screen, click on "Compare all variety pairs".

Observed:

A small dialog appears showing 0% completed. The Cancel button is apparently active, but the process remains stuck at 0% completed. The red X at the top right of the little dialog is fully active, and this is the only way to stop an otherwise hung process. If this isn't used, Cog must be closed using Task Manager.

Note:

I'm seeing the same problem with Partial and Beginning Alignment modes. Only Full alignment mode is working correctly at the moment.

Export list of cognates and non-cognates

In the variety pair view, Cog lists the word pairs that it has determined to be cognate and non-cognate. Provide a way to export these lists.

Word List Export: Enhancement Request

Currently Cog exports a tab-delimited text. This is a good start, but I'm not aware of any software that can import that data in its present state. I have spent many hours doing phonetic transcription on roughly 200 words for 3 languages, totaling about 600 words. That would be a great base of data to export elsewhere. In my case, I would like to import it into FLEx.

FLEx currently imports LIFT, standard format, and LinguaLinks data. I've been out of the FLEx team for a while, but my assumption at the moment is that LIFT might be the best format for Cog export. That assumption would need to be verified.

Given that FLEx is based on one writing system, I expect that Cog would need a dialog asking which variety to export.

Compute percentage of similar segments used when comparing a variety pair

Here is the request from Wyn Owen:

"Say I specified 20 similar segments, could Cog count the number of these that were used in a particular comparison, eg 15 out of the 20 were used when comparing Variety 2 with Variety 5? The results could be put into a matrix in percentage form. This would give a quick view of differences between pairs that might point to varieties that need further investigation."
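The calculation requested above could be sketched as follows, assuming that the specified similar segments and the ones actually used in a comparison are both available as sets of unordered pairs. The function name and data shapes are hypothetical.

```python
from typing import FrozenSet, Iterable


def similar_segment_usage(
    specified: Iterable[FrozenSet[str]], used: Iterable[FrozenSet[str]]
) -> float:
    """Percentage of the specified similar segment pairs used in a comparison."""
    specified_set = set(specified)
    if not specified_set:
        return 0.0
    # Only count pairs that were both specified and actually used.
    used_count = len(specified_set & set(used))
    return 100.0 * used_count / len(specified_set)


# Hypothetical data: 20 specified pairs, 15 of them used for one variety pair.
specified_pairs = [frozenset((f"a{i}", f"b{i}")) for i in range(20)]
used_pairs = specified_pairs[:15]
usage = similar_segment_usage(specified_pairs, used_pairs)
```

Running this for every variety pair and storing the results keyed by pair would give the matrix of percentages described in the request.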

Add ability to export the similarity matrix as a distance matrix in NEXUS format

Currently, the cognate set information can be exported to NEXUS format. It would be useful to be able to export the similarity matrix as well. It would look something like this:

#NEXUS

BEGIN taxa;
DIMENSIONS ntax=5;
TAXLABELS
[1] 'VarietyA'
[2] 'VarietyB'
[3] 'VarietyC'
[4] 'VarietyD'
[5] 'VarietyE'
;
END;
BEGIN distances;
DIMENSIONS ntax=5;
FORMAT
triangle=LOWER
diagonal
labels
missing=?
;
MATRIX

[1]'VarietyA' 0
[2]'VarietyB' 0.05 0
[3]'VarietyC' 0.05 0.02 0
[4]'VarietyD' 0.05 0.02 0.03 0
[5]'VarietyE' 0.07 0.05 0.05 0.03 0

;
END;
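A rough sketch of how such an export might be generated is shown below. It assumes similarities are stored as fractions between 0 and 1 and that distance is simply 1 minus similarity; both are assumptions, as are the function and parameter names. NEXUS files conventionally begin with a `#NEXUS` line, so the sketch writes one.

```python
from typing import List, Sequence


def similarity_to_nexus(names: Sequence[str], similarity: Sequence[Sequence[float]]) -> str:
    """Render a similarity matrix as a lower-triangle NEXUS distances block."""
    n = len(names)
    lines: List[str] = ["#NEXUS", "", "BEGIN taxa;", f"DIMENSIONS ntax={n};", "TAXLABELS"]
    lines += [f"[{i}] '{name}'" for i, name in enumerate(names, start=1)]
    lines += [";", "END;", "BEGIN distances;", f"DIMENSIONS ntax={n};", "FORMAT",
              "triangle=LOWER", "diagonal", "labels", "missing=?", ";", "MATRIX", ""]
    for i, name in enumerate(names):
        # Lower triangle including the diagonal; distance = 1 - similarity (assumed).
        cells = " ".join(f"{1.0 - similarity[i][j]:.2f}" for j in range(i + 1))
        lines.append(f"[{i + 1}]'{name}' {cells}")
    lines += ["", ";", "END;"]
    return "\n".join(lines) + "\n"


# Hypothetical two-variety project for illustration.
nexus_text = similarity_to_nexus(
    ["VarietyA", "VarietyB"],
    [[1.0, 0.95], [0.95, 1.0]],
)
```

Variety names containing a single quote would need escaping, which this sketch does not handle.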

Fix incorrect cognate set clustering when all words are cognate

If all words for a meaning are cognate, the cognate set clustering algorithm will use the wrong threshold, which is the average of the highest and lowest cognicity scores. The lowest cognicity score should always be 0.
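The threshold problem can be illustrated with a small sketch. It reads the description above as "the threshold is the midpoint between the highest score and the lowest score, where the lowest should be fixed at 0". That reading, the function name, and the sample scores are assumptions rather than Cog's actual code.

```python
from typing import Sequence


def clustering_threshold(scores: Sequence[float], floor_at_zero: bool = True) -> float:
    """Midpoint between the highest and lowest cognicity scores."""
    highest = max(scores)
    # Reported behaviour: the observed minimum was used as the lowest score.
    # Proposed fix: always treat the lowest possible score as 0.
    lowest = 0.0 if floor_at_zero else min(scores)
    return (highest + lowest) / 2


# Hypothetical scores for a meaning where every word pair is cognate.
scores = [0.5, 0.75, 1.0]
reported = clustering_threshold(scores, floor_at_zero=False)
proposed = clustering_threshold(scores)
```

With the observed minimum, the threshold rises to 0.75 and the pair scoring 0.5 falls below it even though all words are cognate; with the floor at 0, the threshold is 0.5.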