justhalf / bpe_analysis
Analysis of BPE on four languages: English, Indonesian, Chinese, Japanese
Here is some feedback we got in class yesterday.
My thoughts:
tion_to. (This is one of our non-trivial findings, though.)
Normalizing allomorphs (?) so that BPE can find identical morphemes across words.
cats -> cat+s
boxes -> box+s
usefulness -> useful+ness
happiness -> happy+ness
食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta
I expect our results would be closer to the UDPipe segmentation if we normalized Japanese morphemes.
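To make the normalization idea above concrete, here is a minimal sketch in Python. The lookup table `NORMALIZED` and the function `normalize` are hypothetical names introduced only for illustration; the table simply copies the examples listed above. A real pipeline would presumably get these analyses from a morphological analyzer rather than a hand-written dictionary.

```python
# Hypothetical lookup table copied from the examples above (an assumption,
# not the project's actual normalization resource).
NORMALIZED = {
    "cats": ["cat", "s"],
    "boxes": ["box", "s"],
    "usefulness": ["useful", "ness"],
    "happiness": ["happy", "ness"],
    "食べた": ["食べる", "た"],
    "食べなかった": ["食べる", "ない", "た"],
}


def normalize(tokens):
    """Replace each known token by its normalized morphemes joined with '+'.

    Tokens missing from the table are passed through unchanged.
    """
    return ["+".join(NORMALIZED.get(tok, [tok])) for tok in tokens]


result = normalize(["boxes", "食べなかった", "dog"])
```

After this step, `box` and `s` appear in the same surface form in every word that contains them, which is what lets BPE count them as the same unit.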
Hi,
I've updated the LM scores for UD-English, using a 4-gram LM trained on Wiki.
The training script can be found here: score_lm.py
The outputs are at outputs/ud2/, where en_ud2.tok.score contains the corresponding average log-prob/token scores from KenLM (order 4, default params). We can use these scores to bucket sentences and analyze the results per bucket.
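The bucketing step mentioned above could look roughly like this. It is only a sketch: it assumes the scores have already been loaded into a list of floats, one per sentence (the layout of en_ud2.tok.score is not described here, so loading is left out), and `bucket_by_score` is a hypothetical helper, not part of score_lm.py.

```python
def bucket_by_score(scores, n_buckets=4):
    """Split sentence indices into n_buckets groups of near-equal size.

    Sentences are ranked by score (lowest first), so bucket 0 holds the
    sentences the LM finds least likely and the last bucket the most likely.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [[] for _ in range(n_buckets)]
    for rank, idx in enumerate(order):
        buckets[rank * n_buckets // len(scores)].append(idx)
    return buckets


# Toy average log-prob/token values, invented for illustration.
example = bucket_by_score([-3.0, -1.0, -2.0, -4.0], n_buckets=2)
```

Rank-based buckets keep the group sizes balanced; fixed score thresholds would be the alternative if the buckets should mean the same thing across languages.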
Quantitative analysis (coverage of affixes):
affixes/*/{prefixes,suffixes}.txt exist in our corpora.
Qualitative analysis
Suggestions are welcome.
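One possible reading of the affix coverage check, sketched in Python: count what fraction of the listed affixes occur on at least one word of a corpus. The function `affix_coverage` and the toy inputs are assumptions for illustration; the actual definition of coverage used in the analysis may differ.

```python
def affix_coverage(affixes, words, kind="suffix"):
    """Return the fraction of affixes attached to at least one longer word.

    kind is "suffix" (match at the end of a word) or "prefix" (match at
    the start). A word equal to the affix itself does not count.
    """
    matches = str.endswith if kind == "suffix" else str.startswith
    found = [
        a for a in affixes
        if any(matches(w, a) and len(w) > len(a) for w in words)
    ]
    return len(found) / len(affixes) if affixes else 0.0


# Toy data, invented for illustration.
suffix_cov = affix_coverage(["ness", "ing"], ["happiness", "cat"])
prefix_cov = affix_coverage(["un"], ["undo", "un"], kind="prefix")
```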
Hi, I've collected some statistics of the corpus:
Are we ready to train BPE on UD or WIKI? I think we can fix one series of vocab sizes for all languages, since the numbers of word types under UD tokenization are similar. (For WIKI, the number of word types is much larger because of the very long tail.)
For example, something like [5k, 10k, 20k, 30k, 50k] or [4k, 8k, 16k, 32k, 64k] ...?
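As a sketch of what training at a series of sizes involves, here is a minimal pure-Python BPE learner. It is not the tool used in this project. `learn_bpe`, `toy_counts` and `MERGE_SERIES` are hypothetical names, and the number of merge operations is used as a stand-in for vocab size, which is an assumption.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dict."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges


# Toy corpus, invented for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
MERGE_SERIES = [4000, 8000, 16000, 32000, 64000]

# Greedy BPE merges are nested, so one run at the largest size gives
# every smaller model as a prefix of the merge list.
merges = learn_bpe(toy_counts, max(MERGE_SERIES))
models = {k: merges[:k] for k in MERGE_SERIES}
```

Because the merge list for a smaller size is a prefix of the list for a larger one, fixing the series up front costs only one training run per corpus in this sketch.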
Thank you @zzsfornlp for updating the slides with MWEs.
I've also updated the slides, adding BPE examples at the beginning and the results we got yesterday.
For reference, the slides are here: https://docs.google.com/presentation/d/1CoHRkqqLaI5dihQl8ZxD7UbsJeBP2hN7AM6NbEroDKE/edit#slide=id.p
Hi, I'm really sorry that the BPE training on WIKI-data is behind schedule; the server I use has been rather crowded recently.
Currently there are still several instances remaining to train:
Ja-Wiki: 90k
Zh-Wiki: 10k 30k 60k 90k
Ja-Norm-Wiki: 60k 90k
But I expect these to finish in another 2 to 3 days.
I've uploaded what I have so far to the server. Here is the structure of the files under the home dir /home/zhisong/:
data_old/: previous data, deprecated
data_ud2/: models trained on merged-UD (GSD/EWT+PUD) and the outputs
data_ud2_norm/: models trained on normed merged-UD
data_wiki/: models trained on WIKI-data (full id/zh, 10% en, 50% ja) and the outputs for both WIKI and merged-UD
data_wiki_norm/: models trained on normed WIKI-data (same sample rate as in WIKI-data) and the outputs for both normed WIKI and normed merged-UD
The models dir contains the models and the outputs dir contains the outputs; file names indicate how they were generated.
I'll wait another day to collect more results before doing the Comparison-to-UD analysis. By the way, I've also added some scripts for extracting examples of Affix (here; oh, I just saw Naoki's update, and I think his script is more efficient than this one) or MWE-type (here), which may be helpful.