An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference

This repository contains the official code for the paper "An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative LLM Inference." To reproduce the results, see the Reproduction section.

Requirements

Python 3.10 or later
PyTorch v2.1.0 or later
transformers==4.35.0.dev0
peft==0.6.2
datasets==2.15.0
evaluate==0.4.1
bitsandbytes==0.41.2.post2
scipy==1.11.4
scikit-learn==1.3.2
sentencepiece
seaborn==0.13.0
fasttext: Please visit https://github.com/facebookresearch/fastText to install this package.
jupyterlab
sumeval
janome
protobuf==4.25.1
entmax==1.1
fastdist==1.1.6
dynamic_embedding_pruning==0.0.1
rouge-score==0.1.2
numba==0.58.1
tensorboardX==2.6.2.2
pyarabic==0.6.15
rouge==1.0.1

Installation

After manually installing PyTorch, transformers, and fasttext, run the following command:

pip install -r requirements.txt
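Because most packages in the Requirements list are pinned to exact versions, it can help to confirm that the active environment matches them before running anything. The following is a minimal sketch and not part of this repository: `PINS` copies only a few of the pins listed under Requirements, and `find_mismatches` is a hypothetical helper name. It relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata

# A few exact pins copied from the Requirements list (not exhaustive).
PINS = {
    "peft": "0.6.2",
    "datasets": "2.15.0",
    "evaluate": "0.4.1",
    "scipy": "1.11.4",
    "scikit-learn": "1.3.2",
    "protobuf": "4.25.1",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from its pin.
    for name, (expected, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pin in `PINS` is satisfied. Packages without a pinned version (for example sentencepiece) are not covered by this check.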

Reproduction

Adapted Models

All models are available on the Hugging Face Model Hub.

Approach	BLOOM-1B	BLOOM-7B	TigerBot-7B	Mistral-7B
LAPT	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw
Random	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw
CLP	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw
Heuristics	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw
FOCUS	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw
CLP+	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw	de/ja/ar/sw

+ Output projection layer initialization

We also release a subset of TigerBot-7B and Mistral-7B models whose output projection layer is initialized with the corresponding vocabulary initialization method rather than randomly.

Approach	TigerBot-7B	Mistral-7B
Heuristics	de/ja/ar/sw	de/ja/ar/sw
CLP+	de/ja/ar/sw	de/ja/ar/sw

fastText weights

Pre-trained fastText weights, used for FOCUS initialization, are uploaded with BLOOM-1B FOCUS models.

License

MIT License

Adapted Tokenizer

The adapted tokenizer for each language was obtained from the following source:

German: https://huggingface.co/malteos/gpt2-xl-wechsel-german
Japanese: https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo
Arabic: https://huggingface.co/aubmindlab/aragpt2-base
Swahili: https://huggingface.co/benjamin/gpt2-wechsel-swahili

Because of the license restrictions on the Arabic tokenizer, it is not included in the Arabic adapted models. To use those models, download the tokenizer from the link above beforehand.
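In practice, this means an Arabic adapted model and its tokenizer are loaded from two different places. A minimal sketch of that lookup follows. It assumes the tokenizers for the other languages ship with their adapted models and load through the standard `transformers` auto classes, which this README does not state explicitly. The helper names `tokenizer_source` and `load_tokenizer` are hypothetical, and `adapted_model_id` stands for whichever adapted model ID or local path is being used.

```python
# Original tokenizer repositories on the Hugging Face Hub, from the list above.
TOKENIZER_SOURCES = {
    "de": "malteos/gpt2-xl-wechsel-german",
    "ja": "rinna/japanese-gpt-neox-3.6b-instruction-ppo",
    "ar": "aubmindlab/aragpt2-base",
    "sw": "benjamin/gpt2-wechsel-swahili",
}

# Languages whose tokenizer is not shipped with the adapted models.
EXCLUDED_FROM_ADAPTED_MODELS = {"ar"}


def tokenizer_source(lang, adapted_model_id):
    """Return the Hub ID or local path the tokenizer should be loaded from."""
    if lang not in TOKENIZER_SOURCES:
        raise ValueError(f"unsupported language code: {lang!r}")
    if lang in EXCLUDED_FROM_ADAPTED_MODELS:
        # Assumption: fall back to the original tokenizer repository.
        return TOKENIZER_SOURCES[lang]
    # Assumption: the tokenizer is stored alongside the adapted model.
    return adapted_model_id


def load_tokenizer(lang, adapted_model_id):
    """Load the tokenizer for an adapted model (requires network access)."""
    # Imported lazily so the lookup above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(tokenizer_source(lang, adapted_model_id))
```

Keeping the lookup separate from the loading call makes the Arabic special case easy to check without downloading anything.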

Citation

If you find this work useful, please cite the following:

@article{yamaguchi2024empirical,
  title={An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Generative {LLM} Inference}, 
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.10712},
  url={https://arxiv.org/abs/2402.10712}
}
