bpe_analysis's Issues

Discussion for the final report

Here is some feedback we got in class yesterday.

  1. Chinese and Japanese don't use whitespace, and their characters are logograms (the number of unique characters is large). What happens if we transcribe them in a Roman alphabet (few character types)? Or what about languages that don't use whitespace but use a phonographic script (maybe Thai)?
    • [Naoki] We could use Hiragana/Katakana (50+50 characters) for Japanese (see the transcription sketch after this list).
  2. Indonesian: Is Indonesian high in synthesis? (Check WALS.)
  3. Why are the vocab sizes that maximize F1 for Chinese and Japanese smaller than those for English and Indonesian? (Slides 11-14)
    • [Naoki] The Japanese reference segmentation is morpheme-based, and Chinese words contain only a few characters => those reference tokens contain fewer characters than English and Indonesian reference tokens.
  4. Why does the Chinese 了一 become one token?
  5. Core arguments of verbs could affect which verbs (and substrings of verbs) and following tokens are combined by BPE (Slide 28).
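
For point 1, a minimal sketch of what the transcription experiment could look like, assuming the pykakasi package for kana/romaji conversion (the actual transcription scheme and tooling are still to be decided):

import pykakasi

# Convert Japanese text to Hiragana and Hepburn romanization; BPE could then be
# retrained on either transcription and compared against the logographic original.
kks = pykakasi.kakasi()
for item in kks.convert("食べなかった"):
    print(item["orig"], item["hira"], item["hepburn"])
# expected output along the lines of: 食べなかった たべなかった tabenakatta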

My thoughts:

  1. We found some potentially general patterns (e.g. [zh] 了一, [id] prefix + first char of root, [en,ja] verbs + fragments of core arguments). How can we argue that these are not dataset-dependent?
  2. Treating whitespace as one character doesn't seem like a good idea; BPE generates many meaningless multi-token units like tion_to. (This is one of our non-trivial findings, though.)

Running morphological analyzers in English and Japanese

Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

cats -> cat+s
boxes -> box+s

usefulness -> useful+ness
happiness -> happy+ness
食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta

I expect we will get results more similar to the UDPipe segmentation if we normalize Japanese morphemes.
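
As a rough illustration of the normalization above, here is a minimal rule-based sketch for the English examples, using a small hand-written suffix table (the real pipeline would use proper morphological analyzers, e.g. MeCab/UniDic lemmas for Japanese):

import re

# Hand-written allomorph rules covering the examples above; order matters.
SUFFIX_RULES = [
    (re.compile(r"(\w+)iness$"), r"\1y +ness"),    # happiness  -> happy +ness
    (re.compile(r"(\w+)ness$"), r"\1 +ness"),      # usefulness -> useful +ness
    (re.compile(r"(\w+[sxz])es$"), r"\1 +s"),      # boxes      -> box +s
    (re.compile(r"(\w+)s$"), r"\1 +s"),            # cats       -> cat +s
]

def normalize(token):
    for pattern, repl in SUFFIX_RULES:
        if pattern.match(token):
            return pattern.sub(repl, token)
    return token

print(normalize("cats"), normalize("boxes"), normalize("happiness"))
# cat +s box +s happy +ness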

LM-scores for UD-En

Hi,
I have updated the LM scores for UD-English with a 4-gram LM trained on Wiki.
The training script can be found here: score_lm.py
The outputs are at outputs/ud2/, where en_ud2.tok.score contains the corresponding average log-prob/token scores from KenLM (order 4, default params), with which we can bucket sentences and analyze results per bucket.
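
For reference, a minimal sketch of how such per-sentence scores can be computed with the kenlm Python bindings; the model path and input file below are placeholders, and score_lm.py may normalize the scores differently:

import kenlm

model = kenlm.Model("wiki.4gram.bin")  # placeholder path to the trained LM

def avg_logprob_per_token(sentence):
    # model.score returns the total log10 probability, including </s>
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence symbol
    return model.score(sentence, bos=True, eos=True) / n_tokens

with open("en_ud2.tok") as f:  # placeholder tokenized input
    scores = [avg_logprob_per_token(line.strip()) for line in f]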

Analysis on affixes

Quantitative analysis (coverage of affixes):

  1. Check how many of the affixes in affixes/*/{prefixes,suffixes}.txt exist in our corpora.
  2. Count the frequencies of those affixes.
  3. Calculate the coverage of those affixes in the BPE vocabulary file (see the sketch after this list).
  4. Plot a binned coverage curve (horizontal = frequency bins of an affix, vertical = coverage of affixes).
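
A minimal sketch of steps 1-3, assuming one affix per line in the affix lists and a vocabulary file with one entry per line; all paths and the handling of BPE markers (@@ / ▁) are placeholders:

from collections import Counter
import glob

# 1. Collect the listed affixes.
affixes = set()
for path in glob.glob("affixes/*/prefixes.txt") + glob.glob("affixes/*/suffixes.txt"):
    with open(path) as f:
        affixes.update(line.strip() for line in f if line.strip())

# 2. Count how often each affix appears at the edge of a corpus token.
freq = Counter()
with open("corpus.tok") as f:  # placeholder tokenized corpus
    for line in f:
        for tok in line.split():
            for a in affixes:
                if tok.startswith(a) or tok.endswith(a):
                    freq[a] += 1

# 3. Coverage: which of these affixes appear as units in the BPE vocabulary.
with open("bpe.vocab") as f:  # placeholder vocabulary file
    bpe_vocab = {line.split()[0].strip("@@▁") for line in f if line.strip()}

covered = {a for a in freq if a in bpe_vocab}
print("coverage:", len(covered), "/", len(freq))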

Qualitative analysis

  1. Sample some examples from high/med/low-frequency affixes.

Suggestions are welcome.

Statistics of the Corpus

Hi, I've collected some statistics of the corpus:

(image: corpus statistics table)

Are we ready to train BPE on UD or WIKI? I think we can fix one series of vocab sizes for all languages, since the type counts under the UD tokenization are similar. (For WIKI, the number of word types is much larger because of the very long tail.)

For example, something like [5k, 10k, 20k, 30k, 50k] or [4k, 8k, 16k, 32k, 64k]?
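
A minimal sketch of running one such series with SentencePiece-style BPE; the toolkit, file names, and options below are assumptions, not what we have actually agreed to use:

import sentencepiece as spm

VOCAB_SIZES = [5000, 10000, 20000, 30000, 50000]  # one of the proposed series

for lang in ["en", "id", "ja", "zh"]:
    for vs in VOCAB_SIZES:
        spm.SentencePieceTrainer.train(
            input="data/%s_wiki.txt" % lang,        # placeholder training file
            model_prefix="models/%s_bpe_%d" % (lang, vs),
            vocab_size=vs,
            model_type="bpe",
            character_coverage=0.9995,              # helps with large zh/ja charsets
        )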

BPE training status on WIKI-data

Hi, I'm really sorry that the training of BPE on WIKI-data is behind schedule; the server I use has been somewhat crowded recently...

Currently there are still several instances remaining to train:
Ja-Wiki: 90k
Zh-Wiki: 10k 30k 60k 90k
Ja-Norm-Wiki: 60k 90k
But I expect these can be finished within another 2 to 3 days.

I've uploaded what I've got to the server; here is the structure of the files under the home dir /home/zhisong/:
data_old/: previous data, deprecated
data_ud2/: models trained on merged-UD (GSD/EWT+PUD) and the outputs
data_ud2_norm/: models trained on normed merged-UD
data_wiki/: models trained on WIKI-data (full id/zh, 10% en, 50% ja) and the outputs for both WIKI and merged-UD
data_wiki_norm/: models trained on normed WIKI-data (same sample rate as in WIKI-data) and the outputs for both normed WIKI and normed merged-UD

  • In each dir, the models dir contains the models and the outputs dir contains the outputs; the file names indicate how they were generated.

I'll wait another day to collect more results before doing the Comparison-to-UD analysis. By the way, I've also added some scripts for extracting examples of affixes (here; oh, I just saw Naoki's update, and I think his script is more efficient than this one) and MWE types (here), which may be helpful.
