kakaobrain / g2pm

A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset

License: Apache License 2.0

Python 100.00%

g2pm's Issues

The project cannot beat pypinyin!

Pronunciation of "A"

The wiki and g2pc transcribe "诶" as ēi and ei2,
but g2pm outputs pan1. Is this an error?

Why is the count of polyphonic characters in cedict larger than that in the corpus?

Hi,
I found that the corpus contains 623 polyphonic characters, while cedict contains over 700. What is the reason?
In other words, at prediction time a polyphonic character in a sentence may not be in the set of 623 but still be in the set of 700+. How will the model predict its pinyin in that case?

The results of the g2pM model seem worse than those of pypinyin

Hi, guys! I tested some common Mandarin Chinese texts. g2pM got every one of them wrong, while pypinyin got every one right. Here are the examples I tested.

Incorrect output for the example from the paper

After model = G2pM(), calling model('今天来的目的是什么？') gives the output: ['jin1', 'tian1', 'lai2', 'de5', 'mu4', 'de5', 'shi4', 'shen2', 'me5', '？']
Is this an installation problem?
The g2pm version is 0.1.2.4.

How to add new data?

Which Chinese BERT model? Which repo?

For the test results, which Chinese BERT model did you use, and from which repo?

Polyphone classification

Hi, I see that for any polyphone the network's output is an id, from which you get the pinyin via idx2class. How do you
ensure that the output id is one of that character's candidate classes? Thank you.
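
One common way to guarantee this, shown here only as a hedged sketch and not confirmed to be what g2pM does, is to take the argmax over the candidate ids of the character instead of over all classes. The tables class2idx and char2candidates, the logits, and the helper constrained_argmax below are all made up for illustration.

```python
def constrained_argmax(logits, candidate_ids):
    """Return the highest-scoring class id among candidate_ids only."""
    return max(candidate_ids, key=lambda i: logits[i])


# Hypothetical class table and candidate list, for illustration only.
class2idx = {'de5': 0, 'di4': 1, 'di2': 2, 'hang2': 3}
idx2class = {i: c for c, i in class2idx.items()}
char2candidates = {'的': ['de5', 'di4', 'di2']}

# Made-up scores: the global maximum (index 3, 'hang2') is not a valid reading of '的'.
logits = [0.1, 0.7, 0.2, 1.5]

candidate_ids = [class2idx[c] for c in char2candidates['的']]
pred = idx2class[constrained_argmax(logits, candidate_ids)]
print(pred)
```

With these made-up scores an unconstrained argmax would pick 'hang2', whereas the constrained version can only return one of the readings listed for the character.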

How to use the BERT model?

Hello, I have trained the BERT model according to your code. How do I use the trained BERT model for pinyin annotation? :)

Training Data Explanation

If you open the .lb file, each line contains only one pinyin, while the corresponding line in the .sent file has a string of characters. Shouldn't the .lb file also contain a sequence of pronunciations?

Can you provide the complete code for training?

Hi,
Thanks for the good work.
I used Chinese BERT on this task with your dataset, but I couldn't get results as good as yours. I would like to study your code to learn how to train the model on this dataset, but the code you currently provide only covers prediction. Could you provide the complete code for your model, or for the BERT model in your paper?
Thank you for your help.

NumPy and PyTorch prediction logits don't match

I made a pull request.

Can not reproduce the result.

I want to compare the performance of several g2p systems, so I downloaded the CPP dataset and tried to reproduce the results shown in this repo. But I got much worse accuracy.

For g2pM v0.1.2.5, I got 92.9% on the train set, 92.1% on the dev set, and 91.6% on the test set. Even ignoring tone, the accuracies are 96.6%, 96.1%, and 96.0% on the train, dev, and test sets.

For pypinyin v0.36.0, I got 79.2%, 78.7%, and 79.1% with tone, and 89.4%, 89.1%, and 89.3% without tone.

To be clearer:

The full sentence was fed to each system to get the pinyin result.
Then the prediction was extracted as re.findall(r'▁ ([a-z0-9:]+) ▁', pinyin)[0].
Finally, the accuracy was calculated from np.array([i == j for i, j in zip(pred, gt)]).
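
The three steps above can be sketched roughly as follows. This is only an illustration of the procedure as described, not the repository's evaluation script. The helpers extract_target, strip_tone and accuracy and the sample strings are hypothetical, and the sketch assumes each system's output has already been joined with spaces so that the target syllable sits between two ▁ markers. It uses plain Python rather than NumPy to stay dependency-free.

```python
import re

# Pattern quoted in the issue: the target syllable sits between two "▁" markers.
TARGET = re.compile(r'▁ ([a-z0-9:]+) ▁')


def extract_target(pinyin):
    """Return the first syllable found between two markers."""
    return TARGET.findall(pinyin)[0]


def strip_tone(syllable):
    """Drop a trailing tone digit, e.g. 'de5' -> 'de'."""
    return syllable.rstrip('0123456789')


def accuracy(pred, gt, ignore_tone=False):
    """Fraction of positions where prediction and label agree."""
    if ignore_tone:
        pred = [strip_tone(p) for p in pred]
        gt = [strip_tone(g) for g in gt]
    matches = [p == g for p, g in zip(pred, gt)]
    return sum(matches) / len(matches)


# Made-up system outputs and labels, for illustration only.
outputs = ['mu4 ▁ di4 ▁ shi4', 'wo3 ▁ de2 ▁ shu1']
labels = ['di4', 'de5']

pred = [extract_target(o) for o in outputs]
print(accuracy(pred, labels))
print(accuracy(pred, labels, ignore_tone=True))
```

Reporting both figures side by side makes it easier to see how much of the gap between systems comes from tone errors alone.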

I'd like to know how you computed the accuracy values.

Attached is the prediction for the test set.

If there is any mistake in my computation, please point it out. Thanks.

Two suggestions

SOS (BOS) and EOS have no effect on the results and can be removed.
There are some label errors in the data ('儿' r5 -> er5, '樘' cheng3 -> cheng1, '骑' ji4 -> qi2). Also, after excluding monosyllabic words, the actual number of effective polyphonic sentences is 94857 lines.

What is the special pinyin "xx5" used for?

Hi, all,
Thanks for the good work. I found a special pinyin "xx5" in class2idx, but nothing in the corpus is labeled with it. What is this pinyin class used for? Is there anything special about it?

Some polyphonic characters are missing

Hello, I used g2pM to convert some Chinese sentences and found that some polyphonic characters are missing from cedict. For example, "一" has only "yi1" in cedict, but it can actually be pronounced "yi1", "yi2", or "yi4". Does this mean that the dictionary, dataset, and model should be generalized and updated to solve this problem?

Can you open the training data? Thanks

Suggestion to change some pinyin style

Hi,
I suggest changing the style of some pinyin in the CPP dataset, like this:
女: [nu:3] -> [nv3]
略: [lu:e4] -> [lve4]
The latter is the pinyin style commonly used in China now. Every "u:" in the pinyin can be changed to "v".
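
As a hedged sketch of this suggestion, the conversion can be done as a post-processing step on the predicted syllables, leaving the dataset labels untouched. The helper name to_v_style is hypothetical and is not part of g2pM.

```python
def to_v_style(syllable):
    """Rewrite the 'u:' spelling of ü as 'v', e.g. 'nu:3' -> 'nv3'."""
    return syllable.replace('u:', 'v')


# The two examples from the suggestion, plus one syllable that must stay unchanged.
examples = ['nu:3', 'lu:e4', 'lu4']
converted = [to_v_style(s) for s in examples]
print(converted)
```

Because only the two-character sequence "u:" is replaced, ordinary "u" syllables such as lu4 are left as they are.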
