
jonasknobloch / mbpe


Morphologically biased byte-pair encoding

Rust 81.73% Python 18.27%
byte-pair-encoding tokenizer morphological-analysis morphology nlp segmentation


Morphologically Biased Byte-Pair Encoding

mBPE extends the huggingface/tokenizers library and is designed to improve the segmentations produced by the byte-pair encoding (BPE) tokenization algorithm [1]. BPE has been shown to approximate morphological boundaries poorly [2], which is especially problematic for morphologically rich languages. By incorporating morphological knowledge into the pre-tokenization process, we aim to improve the quality of the produced segmentations by inducing a bias towards morphologically motivated sub-word boundaries.

Pre-trained tokenizers and models are available on Hugging Face.

Pre-Tokenizers

External

The external pre-tokenizer enables the integration of custom pre-tokenization algorithms via a socket connection. Tokenization parallelism should be disabled by setting TOKENIZERS_PARALLELISM=false. Note that disabling parallelism will slow down tokenization significantly. See jonasknobloch/unimorph for a reference server implementation.
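As a minimal sketch, parallelism can be switched off from Python by setting the environment variable before the tokenizer is first used. This assumes the tokenizer is driven from a Python process; the same variable can also be exported in the shell instead.

```python
import os

# Disable the tokenizers thread pool so that requests to the external
# pre-tokenizer are issued sequentially. Set this before the tokenizer
# is first used in the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```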

Tree-Split

The tree-split pre-tokenizer introduces additional boundaries by clustering inflected word forms retrieved from UniMorph [3] dictionaries. The forms in each cluster are aligned by constructing a suffix tree for that cluster. The tree is then traversed, and a boundary is placed at every node with multiple children.
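The branching idea can be illustrated with a simplified sketch. Note the assumptions: it uses a plain character trie over the word forms as a stand-in for the suffix tree, the cluster is a hypothetical hand-picked example rather than UniMorph data, and the function names are illustrative only, not part of mBPE.

```python
from typing import Dict, List


def build_trie(forms: List[str]) -> dict:
    """Insert every form into a character trie; "$" marks the end of a form."""
    root: dict = {}
    for form in forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def split_at_branches(form: str, trie: dict) -> List[str]:
    """Split `form` after every trie node that has more than one child."""
    pieces: List[str] = []
    start, node = 0, trie
    for i, ch in enumerate(form):
        node = node[ch]
        # A node with several children is a point where the cluster's forms
        # diverge (the end marker counts as a child). No boundary is placed
        # after the final character.
        if len(node) > 1 and i + 1 < len(form):
            pieces.append(form[start:i + 1])
            start = i + 1
    pieces.append(form[start:])
    return pieces


# Hypothetical cluster of inflected forms for one lemma.
cluster = ["walk", "walks", "walked", "walking"]
trie = build_trie(cluster)
segmented: Dict[str, List[str]] = {
    form: split_at_branches(form, trie) for form in cluster
}
```

For this cluster, all forms share the path `w-a-l-k` and diverge after `k`, so that is the only position where a boundary is placed.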

Morfessor

The Morfessor pre-tokenizer introduces additional boundaries retrieved using an arbitrary Morfessor [4][5] model. Trained Morfessor models need to be converted using the provided protobuf definition and conversion script.
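The sketch below shows how a morphological segmentation of a word might be turned into pre-token pieces with character offsets. It is an illustration only: `toy_segmenter` is a hypothetical lookup-table stand-in for a trained Morfessor model, and `split_with_offsets` is not a function from mBPE or Morfessor.

```python
from typing import Callable, List, Tuple

Piece = Tuple[str, Tuple[int, int]]


def split_with_offsets(word: str, segment: Callable[[str], List[str]]) -> List[Piece]:
    """Turn a segmentation of `word` into (piece, (start, end)) pairs."""
    pieces = segment(word)
    if "".join(pieces) != word:
        raise ValueError("segments must concatenate back to the input word")
    result: List[Piece] = []
    start = 0
    for piece in pieces:
        end = start + len(piece)
        result.append((piece, (start, end)))
        start = end
    return result


def toy_segmenter(word: str) -> List[str]:
    """Hypothetical stand-in for a trained model: looks up a fixed table."""
    table = {"unbelievable": ["un", "believ", "able"]}
    return table.get(word, [word])


pieces = split_with_offsets("unbelievable", toy_segmenter)
```

Words the segmenter does not know are returned as a single piece, so no extra boundaries are introduced for them.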

Intrinsic Metrics

Tokenizer Fertility

| tokenizer | compounds | fertility |
| --- | --- | --- |
| gpt2_cx-en_00000-00000_50k | 4992469 | 1.32 |
| gpt2+ts_cx-en_00000-00000_50k | 4923123 | 1.40 |
| gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k | 3630703 | 1.42 |
| gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k | 99191 | 1.69 |
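Fertility is commonly defined as the average number of tokens a tokenizer produces per word. The sketch below assumes that definition; the exact procedure behind the numbers above is not stated here, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, List


def fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per word (assumed definition of fertility)."""
    if not words:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: cuts a word into chunks of at most 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


# "the" -> 1 token, "tokenizer" -> 3 tokens, "splits" -> 2 tokens
score = fertility(["the", "tokenizer", "splits"], toy_tokenize)
```

Under this definition a fertility of 1.0 means every word maps to exactly one token, and higher values mean words are split into more pieces.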

Boundary Precision and Recall

| tokenizer | P | R | F1 |
| --- | --- | --- | --- |
| gpt2_cx-en_00000-00000_50k | 0.33 | 0.56 | 0.42 |
| gpt2+ts_cx-en_00000-00000_50k | 0.40 | 0.58 | 0.47 |
| gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k | 0.45 | 0.61 | 0.52 |
| gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k | 0.56 | 0.59 | 0.57 |
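Boundary precision, recall and F1 compare the boundary positions a tokenizer predicts inside a word with those of a gold morphological segmentation. The sketch below assumes micro-averaging over all words; the evaluation protocol behind the numbers above is not specified here, and the example segmentations are hypothetical.

```python
from typing import List, Set, Tuple

Segmentation = List[str]


def boundary_positions(segments: Segmentation) -> Set[int]:
    """Character offsets of the word-internal boundaries of a segmentation."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_prf(pairs: List[Tuple[Segmentation, Segmentation]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over (predicted, gold) pairs."""
    tp = fp = fn = 0
    for predicted, gold in pairs:
        p, g = boundary_positions(predicted), boundary_positions(gold)
        tp += len(p & g)  # boundaries found in both
        fp += len(p - g)  # predicted but not in gold
        fn += len(g - p)  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical (predicted, gold) segmentations.
pairs = [
    (["walk", "ing"], ["walk", "ing"]),
    (["un", "bel", "ievable"], ["un", "believ", "able"]),
]
precision, recall, f1 = boundary_prf(pairs)
```

F1 is the harmonic mean of precision and recall, which matches the table rows: for example, P = 0.33 and R = 0.56 give 2 · 0.33 · 0.56 / (0.33 + 0.56) ≈ 0.42.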

Footnotes

  1. Neural Machine Translation of Rare Words with Subword Units

  2. Byte Pair Encoding is Suboptimal for Language Model Pretraining

  3. UniMorph 4.0: Universal Morphology

  4. Unsupervised Discovery of Morphemes

  5. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline

