kakaobrain / g2pm
A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset
License: Apache License 2.0
Hi,
I found that the corpus contains 623 polyphonic characters, while cedict contains over 700. What is the reason for the difference?
I mean, at prediction time a polyphonic character in a sentence may not be in the set of 623 but still be among the 700+. How will the model predict its pinyin in that case?
model('今天来的目的是什么?')
model = G2pM()
model('今天来的目的是什么?')
output: ['jin1', 'tian1', 'lai2', 'de5', 'mu4', 'de5', 'shi4', 'shen2', 'me5', '?']
Is this an installation problem?
The g2pm version is 0.1.2.4.
How do I add new data?
About the test results: which Chinese BERT model did you use, and from which repo?
Hi, I find that for any polyphone the network's output is an id, and you then get the pinyin from idx2class. I want to know how you ensure that the output id is one of that character's valid classes. Thank you.
Hello, I have trained the BERT model according to your code. How can I use the trained BERT model for pinyin annotation? :)
If you open the .lb file, each line contains only one pinyin, while the corresponding line in the .sent file has a string of characters. Shouldn't the .lb file also have a string of pronunciations?
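For readers puzzling over the same file layout, here is a minimal sketch of how one label per line can line up with a whole sentence. It assumes that each .sent line marks exactly one target character with the "▁" symbol (the marker that appears in a regex elsewhere in these issues) and that the matching .lb line holds the pinyin of that character only. The function name `pair_examples` and the sample sentence are hypothetical and not taken from the repository.

```python
import re

# Assumption: the target character is wrapped in "▁" (U+2581),
# optionally with spaces between the marker and the character.
MARKER = "\u2581"
TARGET_RE = re.compile(MARKER + r"\s*(.)\s*" + MARKER)


def pair_examples(sent_lines, label_lines):
    """Pair the marked character of each sentence with its single label."""
    if len(sent_lines) != len(label_lines):
        raise ValueError("sentence and label files must have the same length")
    examples = []
    for sent, label in zip(sent_lines, label_lines):
        match = TARGET_RE.search(sent)
        if match is None:
            raise ValueError("no marked target character in: " + sent)
        examples.append((match.group(1), label.strip()))
    return examples


# Made-up sample data in the assumed format.
sents = ["今天来的目 ▁ 的 ▁ 是什么"]
labels = ["di4\n"]
examples = pair_examples(sents, labels)
```

Under these assumptions, each example is a (character, pinyin) pair, which would explain why a label line carries a single pronunciation rather than one per character.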
Hi,
Thanks for the good work.
I used Chinese BERT for this task with your dataset, but I couldn't get results as good as yours. So I want to study your code and learn how to train the model on this dataset. But the code you currently offer only does prediction. Could you provide the complete code for your model, or for the BERT model in your paper?
Thank you for your help.
I made a pull request.
I want to compare the performance of several g2p systems, so I downloaded the CPP dataset and tried to reproduce the results shown in this repo. But I got much worse accuracy.
For g2pM v0.1.2.5, I got 92.9% on the train set, 92.1% on the dev set, and 91.6% on the test set. Even ignoring tone information, the accuracies are 96.6%, 96.1%, and 96.0% for the train, dev, and test sets.
For pypinyin v0.36.0, I got 79.2%, 78.7%, 79.1% with tone, and 89.4%, 89.1%, 89.3% without tone.
To be more clear:
re.findall(r'▁ ([a-z0-9:]+) ▁', pinyin)[0]
np.array([i == j for i, j in zip(pred, gt)])
I'd like to know how you compute the accuracy values.
The attachment is the prediction for the test set.
If there is any mistake in the computation, please point it out. Thanks.
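To make the comparison concrete, here is a minimal sketch of the with-tone and without-tone accuracy described in this issue. It is not the maintainers' evaluation script; the helper names `strip_tone` and `accuracy` and the sample labels are made up for illustration, and it assumes the tone is written as a single trailing digit.

```python
import re


def strip_tone(pinyin):
    """Drop a trailing tone digit, e.g. 'de5' -> 'de' (assumed label style)."""
    return re.sub(r"[0-9]$", "", pinyin)


def accuracy(pred, gold, with_tone=True):
    """Fraction of positions where the predicted pinyin equals the gold pinyin."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    if not with_tone:
        pred = [strip_tone(p) for p in pred]
        gold = [strip_tone(g) for g in gold]
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)


# Made-up labels: one wrong syllable and one wrong tone.
gold = ["di4", "de5", "le4", "hao3"]
pred = ["de5", "de5", "le4", "hao4"]
acc_tone = accuracy(pred, gold)                      # 0.5
acc_no_tone = accuracy(pred, gold, with_tone=False)  # 0.75
```

The gap between the two numbers comes only from predictions that have the right syllable but the wrong tone, so reporting both helps separate tone errors from syllable errors.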
Hi, all,
Thanks for the good work. I found a special pinyin, "xx5", in class2idx, but no example in the corpus is labeled with it. What is this pinyin class used for? Is there anything special about it?
Hello, I used G2PM to convert some Chinese sentences, and I found that some polyphonic characters are missing from cedict. For example, “一” only has "yi1" in cedict, but it can actually be pronounced "yi1", "yi2", or "yi4". Does this mean that the dict, dataset, and model should be generalized and updated to solve this problem?
Hi,
Here is a suggestion: some pinyin styles in CPP should be changed, like this:
女: [nu:3] ----> [nv3]
略: [lu:e4] ----> [lve4]
The latter is the pinyin style commonly used in China now. Every "u:" in the pinyin can be changed to "v".
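Until the labels themselves change, the suggested style can be applied as a post-processing step. The sketch below is one possible way to do it; the function name `to_v_style` is hypothetical and not part of g2pM.

```python
def to_v_style(pinyin):
    """Rewrite the 'u:' spelling of ü as 'v', e.g. 'nu:3' -> 'nv3'."""
    return pinyin.replace("u:", "v")


converted = [to_v_style(p) for p in ["nu:3", "lu:e4", "lu4"]]
# converted == ["nv3", "lve4", "lu4"]
```

Syllables without "u:" pass through unchanged, so the function can be mapped over a whole output list.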