bpe_analysis's Issues

Discussion for the final report

Here is some feedback we got in class yesterday.

  1. Chinese and Japanese don't use whitespace, and their characters are logograms (the number of unique characters is large). What happens if we transcribe them in a Roman alphabet (few character types)? Or what about languages that don't use whitespace but use a phonographic script (maybe Thai)?
    • [Naoki] We could use Hiragana/Katakana (50+50 characters) for Japanese (see the transcription sketch after this list).
  2. Indonesian: Is Indonesian high in synthesis? (Check WALS.)
  3. Why are the vocab sizes that maximize F1 for Chinese and Japanese smaller than those for English and Indonesian? (Slides 11-14)
    • [Naoki] The Japanese reference segmentation is morpheme-based, and Chinese words contain only a few characters => those reference tokens contain fewer characters than English and Indonesian reference tokens.
  4. Why does the Chinese 了一 become one token?
  5. Core arguments of verbs could affect which verbs (and substrings of verbs) and following tokens are combined by BPE (Slide 28).
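
For point 1, a minimal sketch of what the transcription experiment could look like, assuming the pykakasi package for kana/romaji conversion (the actual transcription scheme and tooling are still to be decided):

import pykakasi

# Convert Japanese text to Hiragana and Hepburn romanization; BPE could then be
# retrained on either transcription and compared against the logographic original.
kks = pykakasi.kakasi()
for item in kks.convert("食べなかった"):
    print(item["orig"], item["hira"], item["hepburn"])
# expected output along the lines of: 食べなかった たべなかった tabenakatta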

My thoughts:

  1. We found some potentially general patterns (e.g. [zh] 了一, [id] prefix + first char of root, [en,ja] verbs + fragments of core arguments). How can we argue that these are not dataset-dependent?
  2. Treating whitespace as one character doesn't seem like a good idea; BPE generates many meaningless multi-token units like tion_to. (This is one of our non-trivial findings, though.)

Running morphological analyzers in English and Japanese

Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

cats -> cat+s
boxes -> box+s

usefulness -> useful+ness
happiness -> happy+ness
食べる taberu = eat
食べた tabeta = eat+past -> 食べる+た taberu+ta
食べなかった tabenakatta = did not eat -> 食べる+ない+た taberu+nai+ta

I expect we will get results more similar to the UDPipe segmentation if we normalize Japanese morphemes.
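
As a rough illustration of the normalization above, here is a minimal rule-based sketch for the English examples, using a small hand-written suffix table (the real pipeline would use proper morphological analyzers, e.g. MeCab/UniDic lemmas for Japanese):

import re

# Hand-written allomorph rules covering the examples above; order matters.
SUFFIX_RULES = [
    (re.compile(r"(\w+)iness$"), r"\1y +ness"),    # happiness  -> happy +ness
    (re.compile(r"(\w+)ness$"), r"\1 +ness"),      # usefulness -> useful +ness
    (re.compile(r"(\w+[sxz])es$"), r"\1 +s"),      # boxes      -> box +s
    (re.compile(r"(\w+)s$"), r"\1 +s"),            # cats       -> cat +s
]

def normalize(token):
    for pattern, repl in SUFFIX_RULES:
        if pattern.match(token):
            return pattern.sub(repl, token)
    return token

print(normalize("cats"), normalize("boxes"), normalize("happiness"))
# cat +s box +s happy +ness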

LM-scores for UD-En

Hi,
I have updated the LM scores for UD-English with a 4-gram LM trained on Wiki.
The training script can be found here: score_lm.py
The outputs are at outputs/ud2/, where en_ud2.tok.score contains the corresponding average log-prob/token scores from KenLM (order 4, default params), with which we can bucket sentences and analyze results per bucket.
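
For reference, a minimal sketch of how such per-sentence scores can be computed with the kenlm Python bindings; the model path and input file below are placeholders, and score_lm.py may normalize the scores differently:

import kenlm

model = kenlm.Model("wiki.4gram.bin")  # placeholder path to the trained LM

def avg_logprob_per_token(sentence):
    # model.score returns the total log10 probability, including </s>
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence symbol
    return model.score(sentence, bos=True, eos=True) / n_tokens

with open("en_ud2.tok") as f:  # placeholder tokenized input
    scores = [avg_logprob_per_token(line.strip()) for line in f]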

Analysis on affixes

Quantitative analysis (coverage of affixes):

  1. Check how many of the affixes in affixes/*/{prefixes,suffixes}.txt exist in our corpora.
  2. Count the frequencies of those affixes.
  3. Calculate the coverage of those affixes in the BPE vocabulary file (see the sketch after this list).
  4. Plot a binned coverage curve (horizontal = frequency bins of an affix, vertical = coverage of affixes).
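
A minimal sketch of steps 1-3, assuming one affix per line in the affix lists and a vocabulary file with one entry per line; all paths and the handling of BPE markers (@@ / ▁) are placeholders:

from collections import Counter
import glob

# 1. Collect the listed affixes.
affixes = set()
for path in glob.glob("affixes/*/prefixes.txt") + glob.glob("affixes/*/suffixes.txt"):
    with open(path) as f:
        affixes.update(line.strip() for line in f if line.strip())

# 2. Count how often each affix appears at the edge of a corpus token.
freq = Counter()
with open("corpus.tok") as f:  # placeholder tokenized corpus
    for line in f:
        for tok in line.split():
            for a in affixes:
                if tok.startswith(a) or tok.endswith(a):
                    freq[a] += 1

# 3. Coverage: which of these affixes appear as units in the BPE vocabulary.
with open("bpe.vocab") as f:  # placeholder vocabulary file
    bpe_vocab = {line.split()[0].strip("@@▁") for line in f if line.strip()}

covered = {a for a in freq if a in bpe_vocab}
print("coverage:", len(covered), "/", len(freq))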

Qualitative analysis

  1. Sample some examples from high/med/low-frequency affixes.

Suggestions are welcome.

Statistics of the Corpus

Hi, I've collected some statistics of the corpus:

(image: corpus statistics table)

Are we ready to train BPE on UD or WIKI? I think we can fix one series of vocab sizes for all languages, since the type counts under the UD tokenization are similar. (For WIKI, the number of word types is much larger because of the very long tail.)

For example, something like [5k, 10k, 20k, 30k, 50k] or [4k, 8k, 16k, 32k, 64k]?
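
A minimal sketch of running one such series with SentencePiece-style BPE; the toolkit, file names, and options below are assumptions, not what we have actually agreed to use:

import sentencepiece as spm

VOCAB_SIZES = [5000, 10000, 20000, 30000, 50000]  # one of the proposed series

for lang in ["en", "id", "ja", "zh"]:
    for vs in VOCAB_SIZES:
        spm.SentencePieceTrainer.train(
            input="data/%s_wiki.txt" % lang,        # placeholder training file
            model_prefix="models/%s_bpe_%d" % (lang, vs),
            vocab_size=vs,
            model_type="bpe",
            character_coverage=0.9995,              # helps with large zh/ja charsets
        )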

BPE training status on WIKI-data

Hi, I'm really sorry that the training of BPE on WIKI-data is behind schedule; the server I use has been somewhat crowded recently...

Currently there are still several instances remaining to train:
Ja-Wiki: 90k
Zh-Wiki: 10k 30k 60k 90k
Ja-Norm-Wiki: 60k 90k
But I expect these can be finished within another 2 to 3 days.

I've uploaded what I've got to the server; here is the structure of the files under the home dir /home/zhisong/:
data_old/: previous data, deprecated
data_ud2/: models trained on merged-UD (GSD/EWT+PUD) and the outputs
data_ud2_norm/: models trained on normed merged-UD
data_wiki/: models trained on WIKI-data (full id/zh, 10% en, 50% ja) and the outputs for both WIKI and merged-UD
data_wiki_norm/: models trained on normed WIKI-data (same sample rate as in WIKI-data) and the outputs for both normed WIKI and normed merged-UD

  • In each dir, the models dir contains the models and the outputs dir contains the outputs; the file names indicate how they were generated.

I'll wait another day to collect more results before doing the Comparison-to-UD analysis. By the way, I've also added some scripts for extracting examples of affixes (here; oh, I just saw Naoki's update, and I think his script is more efficient than this one) and MWE types (here), which may be helpful.
