The jyutping-table-parser from chaklim

jyutping-table-parser's Issues

Any upcoming schedule of this repository?

Hi CL, I am Jeremy :)
I am extremely interested in Chinese phonetics. I know of an online resource (unfortunately the service is now down) that also collects historical pronunciations. I can provide the copy of that resource that I backed up.

Upstream data, `jyutping` should not be an array.

I've spent today reviewing https://github.com/lshk-org/jyutping-table and I believe that there are issues in the upstream source data.

First, I believe that jyutping should be a single field, not an array. All of the items which appear as an array are listed here:
lshk-org/jyutping-table#3 (comment)

With the exception of these, which I believe are actually multi-syllable:

ch	ucs2	pronunciation
𠯢	U+E064	saa1 aa6
〺	U+5345	saa1 aa6
卌	U+534C	sei3 aa6

I believe that the remainder of the list should be interpreted as this (does not include the entire list):

ch	ucs2	descriptor	pronunciation
籿	U+7C7F	fan1	mai5
粀	U+7C80	sap6	mai5
粁	U+7C81	cin1	mai5
粌	U+7C8C	baak3	mai5
粍	U+7C8D	hou4	mai5
粨	U+7CA8	baak3	mai5
糎	U+7CCE	lei4	mai5

㖊	U+358A	jing1	cam4
吋	U+540B	jing1	cyun3
呎	U+544E	jing1	cek3
哩	U+54E9	jing1	lei5
唡	U+5521	jing1	loeng2
啢	U+5562	jing1	loeng2
噚	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

Of the full list, only 6 are marked as having different phonetics, and in every case the marked value matches the second Jyutping component.

ch	ucs2	pronunciation
吋	U+540B	cyun3
呎	U+544E	cek3
哩	U+54E9	(li1, le1, lei5)
浬	U+6D6C	lei5
𠺖	U+F45A	mau5
𠰴	U+F4C0	sek6
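The match between the marked phonetics and the second Jyutping component can be checked mechanically. The following is a minimal sketch that uses only the rows shown in the two tables above; the dictionary names are made up for illustration, and 浬 (U+6D6C) is left out because its two-value row is not part of the partial list quoted here.

```python
# Two-value rows copied from the partial table above: (first value, second value).
two_value_rows = {
    "U+540B": ("jing1", "cyun3"),  # 吋
    "U+544E": ("jing1", "cek3"),   # 呎
    "U+54E9": ("jing1", "lei5"),   # 哩
    "U+F45A": ("jing1", "mau5"),   # 𠺖
    "U+F4C0": ("jing1", "sek6"),   # 𠰴
}

# Phonetics marked for the same characters, copied from the table above.
marked_phonetics = {
    "U+540B": {"cyun3"},
    "U+544E": {"cek3"},
    "U+54E9": {"li1", "le1", "lei5"},
    "U+F45A": {"mau5"},
    "U+F4C0": {"sek6"},
}

# Collect every codepoint whose second value is NOT among its marked phonetics.
mismatches = [
    codepoint
    for codepoint, phonetics in marked_phonetics.items()
    if two_value_rows[codepoint][1] not in phonetics
]

print(mismatches)
```

An empty `mismatches` list means every marked phonetic agrees with the second value for the rows that were checked.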

Given that the characters' construction matches the description field, and after a conversation with a native speaker and another with a non-native-speaker Cantonese linguist (who also consulted a native speaker), I believe:

That the first Jyutping value represents how the character would be described.
The second Jyutping value represents the actual pronunciation.

It's also possible that the descriptor field pronunciation is wrong in a couple of cases, as I mention here: lshk-org/jyutping-table#4.

I propose that the object shape for this library be modified to account for this finding, and that post-processing be added to correct the data.
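One possible shape for that post-processing is sketched below. It is only an illustration under stated assumptions: that the upstream `jyutping` field arrives as a list of strings, that the three multi-syllable characters are identified by the codepoints listed in the table above, and that the names `Entry`, `normalize`, and `MULTI_SYLLABLE` are hypothetical rather than part of this library.

```python
from dataclasses import dataclass
from typing import List, Optional

# Codepoints the issue identifies as genuinely multi-syllable (as listed above).
MULTI_SYLLABLE = {"U+E064", "U+5345", "U+534C"}


@dataclass
class Entry:
    """Hypothetical output shape: one pronunciation plus an optional descriptor."""
    ucs2: str
    pronunciation: str
    descriptor: Optional[str] = None


def normalize(ucs2: str, jyutping: List[str]) -> Entry:
    """Turn an upstream jyutping list into a single-pronunciation entry."""
    if len(jyutping) == 1:
        # Ordinary case: a single Jyutping value.
        return Entry(ucs2, jyutping[0])
    if ucs2 in MULTI_SYLLABLE:
        # Multi-syllable reading: keep all syllables as one pronunciation.
        return Entry(ucs2, " ".join(jyutping))
    if len(jyutping) == 2:
        # First value describes the character, second is how it is pronounced.
        descriptor, pronunciation = jyutping
        return Entry(ucs2, pronunciation, descriptor)
    raise ValueError(f"unexpected jyutping value for {ucs2}: {jyutping!r}")


print(normalize("U+540B", ["jing1", "cyun3"]))
print(normalize("U+534C", ["sei3", "aa6"]))
```

Keeping the descriptor as a separate optional field preserves the upstream information while leaving `pronunciation` as a single string for every character.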

Incorrectly Grouped References

Hello @chaklim! I spent the last day building a project designed to produce output identical to this one's, so that correctness can be checked by comparing two separate implementations. My approach extracts values directly from the source PDF, which makes it independent of any particular application's PDF text extraction algorithm.

You can see the project here:
https://github.com/nathanhammond/parse-jyutping-table-full

The underlying code is still a complete mess, but it works. In checking for output correctness I discovered an issue where your parser treats some characters as references instead of as top-level characters:
["5002", "5225", "5294", "5A7E", "5DD3", "5ECF", "5F54", "609E", "60B3", "6231", "672E", "69E9", "6DE8", "7522", "75F2", "7D55", "7DA0", "7DD6", "7E15", "860A", "8AAA", "8F40", "9115", "919E", "9292", "92B3", "9304", "93AD", "95B1", "984F", "985A", "98EE", "9920", "9A08", "9C2E"]
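A comparison of this kind can be sketched as follows. The actual output format of either parser is not shown in this thread, so the entry shape (a `codepoint` key and a `references` list), the function name `find_misgrouped`, and the sample data are all hypothetical placeholders.

```python
from typing import Dict, List


def find_misgrouped(expected: List[Dict], actual: List[Dict]) -> List[str]:
    """Codepoints that are top-level in `expected` but only a reference in `actual`."""
    actual_top_level = {entry["codepoint"] for entry in actual}
    actual_references = {
        ref for entry in actual for ref in entry.get("references", [])
    }
    return sorted(
        entry["codepoint"]
        for entry in expected
        if entry["codepoint"] not in actual_top_level
        and entry["codepoint"] in actual_references
    )


# Made-up placeholder data, not taken from either parser's real output.
expected_output = [
    {"codepoint": "AAAA", "references": []},
    {"codepoint": "BBBB", "references": []},
]
actual_output = [
    {"codepoint": "AAAA", "references": ["BBBB"]},
]

print(find_misgrouped(expected_output, actual_output))
```

With the placeholder data, `BBBB` is reported because it is a top-level entry in one output but appears only as a reference in the other.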

Apart from this issue, all other output is identical, implying that the two independent implementations are correctly extracting the contents from the source PDF.

My next comparison is with LSHK's list, again to check for correctness: https://github.com/lshk-org/jyutping-table
