jyutping-table-parser's People

Contributors

chaklim, dependabot[bot]

jyutping-table-parser's Issues

Any upcoming schedule of this repository?

Hi CL, I am Jeremy :)
I am extremely interested in Chinese phonetics. I know of an online resource that also collects historical pronunciations (unfortunately the service is down). I can provide the copy of that resource that I backed up.

Upstream data: `jyutping` should not be an array.

I've spent today reviewing https://github.com/lshk-org/jyutping-table and I believe that there are issues in the upstream source data.

First, I believe that jyutping should be a single field, not an array. All of the items which appear as an array are listed here:
lshk-org/jyutping-table#3 (comment)

The exceptions are these, which I believe are actually multi-syllable:

ch	ucs2	pronunciation
𠯢	U+E064	saa1 aa6
	U+5345	saa1 aa6
	U+534C	sei3 aa6

I believe that the remainder of the list should be interpreted as follows (only part of the list is shown):

ch	ucs2	descriptor	pronunciation
籿	U+7C7F	fan1	mai5
	U+7C80	sap6	mai5
	U+7C81	cin1	mai5
	U+7C8C	baak3	mai5
	U+7C8D	hou4	mai5
	U+7CA8	baak3	mai5
	U+7CCE	lei4	mai5

	U+358A	jing1	cam4
	U+540B	jing1	cyun3
	U+544E	jing1	cek3
	U+54E9	jing1	lei5
	U+5521	jing1	loeng2
	U+5562	jing1	loeng2
	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

Of the full list, only 6 entries are marked as having different phonetics, and in all of them the marked pronunciation matches the second Jyutping component.

ch	ucs2	pronunciation
	U+540B	cyun3
	U+544E	cek3
	U+54E9	(li1, le1, lei5)
	U+6D6C	lei5
𠺖	U+F45A	mau5
𠰴	U+F4C0	sek6

Given that the characters' construction matches the descriptor field, and based on a conversation with a native speaker and a conversation with a non-native-speaker Cantonese linguist (who also consulted a native speaker), I believe:

  • The first Jyutping value represents how the character would be described.
  • The second Jyutping value represents the actual pronunciation.

Further, it's also possible that the descriptor field pronunciation is wrong in a couple of cases, as I mention here: lshk-org/jyutping-table#4.


I propose that the object shape for this library be modified to account for this finding, and that post-processing be added to adjust the data for correctness.
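As a rough illustration of that proposal, here is a minimal post-processing sketch in Python. It assumes the upstream `jyutping` value arrives as a list of strings keyed by the hexadecimal code point from the `ucs2` column. The `Reading` class, its `descriptor` and `pronunciation` fields, and the `normalize` function are hypothetical names chosen for this example, not part of the library.

```python
from dataclasses import dataclass
from typing import List, Optional

# Code points this issue identifies as genuinely multi-syllable readings.
MULTI_SYLLABLE = {"E064", "5345", "534C"}


@dataclass
class Reading:
    # Hypothetical shape: how the character would be described, if recorded.
    descriptor: Optional[str]
    # The reading that is actually pronounced.
    pronunciation: str


def normalize(ucs2: str, jyutping: List[str]) -> Reading:
    """Turn an upstream jyutping array into a single-pronunciation record."""
    if len(jyutping) == 1:
        return Reading(None, jyutping[0])
    if ucs2 in MULTI_SYLLABLE:
        # Keep both syllables together as one pronunciation.
        return Reading(None, " ".join(jyutping))
    if len(jyutping) == 2:
        # Interpretation argued above: first value describes, second is spoken.
        descriptor, pronunciation = jyutping
        return Reading(descriptor, pronunciation)
    raise ValueError(f"unexpected jyutping array for U+{ucs2}: {jyutping!r}")
```

For example, `normalize("7C7F", ["fan1", "mai5"])` would yield a record with descriptor `fan1` and pronunciation `mai5`, while `normalize("5345", ["saa1", "aa6"])` keeps `saa1 aa6` as a single multi-syllable pronunciation.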

Incorrectly Grouped References

Hello @chaklim! I spent the last day building a project that is meant to produce output identical to this one's, so that output correctness can be checked against two separate implementations. My approach extracts values directly from the source PDF, which makes it independent of any particular application's PDF text extraction algorithm.

You can see the project here:
https://github.com/nathanhammond/parse-jyutping-table-full

The underlying code is still a complete mess, but it works. While checking for output correctness I discovered an issue where your parser treats some characters as references instead of as top-level characters:
["5002", "5225", "5294", "5A7E", "5DD3", "5ECF", "5F54", "609E", "60B3", "6231", "672E", "69E9", "6DE8", "7522", "75F2", "7D55", "7DA0", "7DD6", "7E15", "860A", "8AAA", "8F40", "9115", "919E", "9292", "92B3", "9304", "93AD", "95B1", "984F", "985A", "98EE", "9920", "9A08", "9C2E"]

Apart from this issue, all other output is identical, implying that the two independent implementations are correctly extracting the contents from the source PDF.
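The kind of cross-check described here can be sketched as follows. This is only an illustration under an assumed data shape: each parser's output is modelled as a mapping from a top-level code point to the list of code points it references. That shape and the function name `misgrouped` are hypothetical and may not match either project's real output format.

```python
from typing import Dict, List, Set


def misgrouped(ours: Dict[str, List[str]], theirs: Dict[str, List[str]]) -> List[str]:
    """Code points that are top-level in `ours` but appear only as references in `theirs`."""
    referenced_in_theirs: Set[str] = {
        ref for refs in theirs.values() for ref in refs
    }
    return sorted(
        cp for cp in ours
        if cp not in theirs and cp in referenced_in_theirs
    )


# Toy data with placeholder code points, not taken from the Jyutping table.
ours = {"0001": [], "0002": []}
theirs = {"0001": ["0002"]}
print(misgrouped(ours, theirs))
```

With the toy data, `0002` is reported because it is a top-level entry in one output and only a reference in the other.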

My next comparison is with LSHK's list, again to check for correctness: https://github.com/lshk-org/jyutping-table
