microsoft / unicoder

Unicoder model for understanding and generation.

License: MIT License


unicoder's Introduction

Unicoder

This repo provides the code for reproducing the experiments in XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation (leaderboard).

We provide three models: Unicoder for understanding tasks, Unicoder for generation tasks (pre-trained with xDAE), and Unicoder for generation tasks (pre-trained with xFNP).

Updates

Code for XLM-K: we add the code for XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge here.

Unicoder for understanding tasks

We share a 12-layer model pre-trained on 100 languages.

This code can reproduce the experiments on 9 understanding XGLUE tasks: NER, POS Tagging (POS), News Classification (NC), MLQA, XNLI, PAWS-X, Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM).

For more details, see the understanding README.
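
The shared understanding checkpoint is distributed in Hugging Face Transformers format (pytorch_model.bin plus config.json; see the issues below), so loading it could look roughly like the sketch below. This is only an illustration, not code from this repo; the local directory name and the assumption that the checkpoint is architecture-compatible with AutoModel are hypothetical.

# Minimal sketch (not from this repo): load the downloaded checkpoint
# directory with Hugging Face Transformers and run one forward pass.
# "./unicoder-understanding" is a hypothetical local path.
from transformers import AutoModel, AutoTokenizer

checkpoint_dir = "./unicoder-understanding"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModel.from_pretrained(checkpoint_dir)

inputs = tokenizer("Unicoder is a cross-lingual encoder.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)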

Unicoder for generation tasks (pre-trained with xDAE)

We share a model with a 12-layer encoder and a 12-layer decoder, pre-trained on 100 languages.

The code can reproduce the experiments on 2 generation XGLUE tasks: News Title Generation (NTG) and Question Generation (QG).

For more details, see the generation README.

Unicoder for generation tasks (pre-trained with xFNP)

We share a model with a 12-layer encoder and a 12-layer decoder, pre-trained on 100 languages.

The code can reproduce the experiments on 2 generation XGLUE tasks: News Title Generation (NTG) and Question Generation (QG).

For more details, see ProphetNet.

How to cite

If you extend or use this work, please cite our paper.

@inproceedings{huang2019unicoder,
  title={Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks},
  author={Huang, Haoyang and Liang, Yaobo and Duan, Nan and Gong, Ming and Shou, Linjun and Jiang, Daxin and Zhou, Ming},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={2485--2494},
  year={2019}
}
@article{Liang2020XGLUEAN,
  title={XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation},
  author={Yaobo Liang and Nan Duan and Yeyun Gong and Ning Wu and Fenfei Guo and Weizhen Qi and Ming Gong and Linjun Shou and Daxin Jiang and Guihong Cao and Xiaodong Fan and Ruofei Zhang and Rahul Agrawal and Edward Cui and Sining Wei and Taroon Bharti and Ying Qiao and Jiun-Hung Chen and Winnie Wu and Shuguang Liu and Fan Yang and Daniel Campos and Rangan Majumder and Ming Zhou},
  journal={arXiv},
  year={2020},
  volume={abs/2004.01401}
}

More Models in the Unicoder Family

Unicoder-VL (image): a monolingual (English) pre-trained model for image-language understanding tasks.
Unicoder-VL (video): a monolingual (English) pre-trained model for video-language understanding and generation tasks.
XGPT (image): a monolingual (English) pre-trained model for image captioning.
M^3P (image): a multilingual (100 languages) pre-trained model for image-language understanding and generation tasks.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

unicoder's People

Contributors

dependabot[bot], eureka6174, microsoftopensource, nanduan, yiming1013


unicoder's Issues

decoder parameter initialization and dictionary during fine-tuning

Hi:

Good evening!

I read in your paper that Unicoder uses an XLM/Transformer structure.

I understand that after pre-training you keep the parameters of the Transformer encoder, but I am a little confused about how you deal with the output/decoder part of the model.

I will use POS tagging as an example.

Since the pre-training task is masked token recovery, language modeling, etc., the model ends up with a fairly large output layer.

  1. If we keep the original output layer parameters, the output vocabulary will be much larger than the 17 legitimate POS-tag labels, so the model may produce a token outside the legitimate label set.

  2. The POS-tag labels did not appear during pre-training, so they create an OOV problem.

From what I understand, XLM attaches an NMT decoder when the pre-training task is machine translation, and keeps only the encoder when the pre-training task is masked token recovery. During fine-tuning, it attaches a new classifier to the pre-trained encoder for tasks such as XNLI, and re-attaches the original NMT decoder if the evaluation task is machine translation.

So during Unicoder fine-tuning, do I need to cut off the original output/decoder after pre-training and re-initialize an output layer based on the task at hand? Using POS tagging as an example, do I need to connect a new decoder/output layer to the encoder and shrink the output from the 50K-word vocabulary to the 17 labels of the POS-tagging task? If so, what is the optimal structure of the decoder/output layer for the POS-tagging or NER task: should I connect the encoder to a complex decoder as in NMT, or is a simple classifier layer on top of the encoder enough?

What happens for other tasks like NER? Do we fine-tune all the tasks in one model, or do we fine-tune a new model for each task?

Should I fix the encoder parameters and just fine-tune the decoder part?
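
For context, the common pattern (an assumption about usual practice, not a statement of what this repo does) is to discard the pre-training output layer entirely and attach a small, randomly initialized task head to the pre-trained encoder; for POS tagging that head is typically a single linear layer over the 17 labels, a separate copy of the model is fine-tuned per task, and the encoder is updated together with the new head rather than frozen. A minimal sketch under those assumptions:

# Minimal sketch of fine-tuning for POS tagging: keep the pre-trained
# encoder, discard the pre-training vocabulary head, and add a fresh
# linear classifier over the 17 POS labels. The encoder name is a
# stand-in, not the checkpoint released by this repo.
import torch.nn as nn
from transformers import AutoModel

class PosTagger(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base", num_labels=17):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)        # pre-trained weights kept
        self.classifier = nn.Linear(self.encoder.config.hidden_size,  # new, randomly initialized
                                    num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # one label distribution per token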

#=======================================================================
Question on the BPE dictionary that splits the original words:

I am a little puzzled about how the BPE dictionary is used for POS tagging in your paper (the Unicoder baseline).

For example, in a sentence like:
"although , as you will have seen , the dre@@ aded ' millennium bu@@ g ' failed to materi@@ alise , "
the BPE dictionary splits the word "materialise" into "materi@@" and "alise".

That is fine for the translation task; however, for the POS-tagging task, wouldn't it be better to keep the word "materialise" together, so that we have one POS tag per word instead of one POS tag for each word piece, or two word pieces sharing a single POS tag?

So do you let "materi@@" and "alise" each produce a "verb" label, or do they together produce a single "verb" label (in which case the lengths of the input and output would differ)?
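
A common convention for this (an assumption here, not necessarily what the paper does) is to predict one label per word by labeling only the first word piece of each word and giving continuation pieces an index that the loss ignores. A minimal sketch:

# Sketch: align word-level POS labels to fastBPE word pieces, where a
# trailing "@@" marks a non-final piece. Only the first piece of each
# word receives the real label; continuation pieces get an ignore index
# that e.g. torch.nn.CrossEntropyLoss(ignore_index=-100) will skip.
IGNORE_INDEX = -100

def align_labels(word_pieces, word_labels):
    aligned = []
    word_idx = 0
    new_word = True
    for piece in word_pieces:
        aligned.append(word_labels[word_idx] if new_word else IGNORE_INDEX)
        new_word = not piece.endswith("@@")  # a piece without "@@" ends the word
        if new_word:
            word_idx += 1
    return aligned

print(align_labels(["materi@@", "alise", ",", "failed"], ["VERB", "PUNCT", "VERB"]))
# ['VERB', -100, 'PUNCT', 'VERB']  (in practice the labels would be ids, not strings)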

Regards!
Wei

dict.txt in generation model and translation model

Good Morning!
I am sorry, but I have run into some issues that need further clarification.

Q1:
I want to know how your dict.txt is generated and how it is used in fairseq-preprocess.

dict.txt in your pre-trained model
(
generation model:
https://onedrive.live.com/?authkey=%21AOkzIYo8pYONMb4&id=5C2C1309D09F7C6B%212278&cid=5C2C1309D09F7C6B
understanding model:
https://onedrive.live.com/?authkey=%21AHp7WdcNCamlkVw&id=5C2C1309D09F7C6B%212249&cid=5C2C1309D09F7C6B
):

, 1
. 1
▁ 1
s 1
▁de 1
- 1
▁a 1
a 1
: 1
e 1
i 1
▁( 1
) 1
▁i 1
t 1
n 1
▁- 1
▁la 1
▁en 1
▁in 1
▁na 1
' 1
’ 1
... 1
▁e 1
▁на 1
。 1
o 1
? 1
en 1

The second column is always 1.

However, when I follow prepare-iwslt17-multilingual.sh
https://github.com/pytorch/fairseq/tree/master/examples/translation
the dictionary I get from fairseq-preprocess looks like this:

▁the 3488934
, 2538615
. 1863447
▁of 1773456
▁to 1645036
▁and 1395057
▁in 1078123
▁that 856757
▁is 824487
▁a 817413
▁for 556440
▁I 542653
▁on 487723
▁be 442718
▁this 439733
▁we 424269
▁are 365974
▁have 348747
▁it 345349
▁not 332216
▁as 317747
▁with 311021
▁which 303192
▁European 285795
▁The 261132
▁will 256812
▁by 254216
- 243721
▁has 227060
s 195653

The second column is the frequency of the tokens.

So how do you generate the dict.txt for your model? Why are the token frequencies all 1 (I assume you also generate dict.txt with fairseq-preprocess)?

BTW: it seems to me that the default fairseq examples all use fastBPE instead of a SentencePiece model. With fastBPE you obtain a vocab file when you learn the BPE codes, and you can provide this vocab file to fairseq-preprocess because it is also in token/frequency format. In this way you can always get a unified joint dictionary for all languages by setting --tgtdict to that vocab file:

fairseq-preprocess --source-lang $src --target-lang $tgt \
    --trainpref $bpe/train.$src-$tgt \
    --joined-dictionary --tgtdict $bpe/vocab \
    --destdir $data_bin \
    --workers 40

However, the sentencepiece.bpe.vocab learned by SentencePiece is different from the vocab learned by fastBPE. sentencepiece.bpe.vocab looks like:

<unk>	0
<s>	0
</s>	0
en	-0
▁d	-1
er	-2
es	-3
on	-4
▁a	-5
in	-6
▁p	-7
▁l	-8
▁s	-9
▁c	-10
ti	-11
▁t	-12
re	-13
▁de	-14
is	-15

where the second column is actually the negative id index rather than a frequency, so it cannot be passed to fairseq-preprocess as input for --tgtdict. So I am wondering:

  1. How did you get your dict.txt (SentencePiece does not automatically produce such a file, while fastBPE produces a similar file but with frequencies in the second column rather than 1)?
  2. How do you use this dict.txt with fairseq-preprocess when binarizing the training text? Do you provide dict.txt as input for --tgtdict, as a unified joint dictionary for all languages?
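
For what it's worth, one plausible way to obtain such a file (a guess, not confirmed by the authors) is to take the SentencePiece vocabulary, drop the special tokens that fairseq adds on its own, and write every remaining token with a dummy count of 1, which would match the dict.txt shown above:

# Sketch: derive a fairseq-style dict.txt ("token count" per line) from
# sentencepiece.bpe.vocab by replacing the SentencePiece score column
# with a dummy count of 1. This is a guess at how the released dict.txt
# could have been produced, not the authors' actual recipe.
SPECIALS = {"<unk>", "<s>", "</s>"}  # fairseq adds its own special symbols

with open("sentencepiece.bpe.vocab", encoding="utf-8") as fin, \
     open("dict.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        token = line.rstrip("\n").split("\t")[0]
        if token in SPECIALS:
            continue
        fout.write(f"{token} 1\n")

# The resulting dict.txt could then be passed to fairseq-preprocess via
# --srcdict/--tgtdict as a joint dictionary shared across languages.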

Q2:
As I mentioned earlier:

I see that in the "generation" folder the Unicoder xDAE model is fairseq-based.
In the "understanding" folder, the pre-trained model is based on Hugging Face Transformers.
It seems that the fairseq-trained model is saved as xxx.pt, while the Hugging Face Transformers-based model is saved as pytorch_model.bin and config.json. So I am puzzled about how to use one encoder for both generation and understanding tasks.

Is it correct that I need to train a RoBERTa model in fairseq first and then use the conversion code (https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py) to convert it to the Transformers format?
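
For illustration, the two checkpoint formats are loaded through different APIs, which is why a conversion step is needed at all. A rough sketch of the difference (the paths are hypothetical, and whether conversion is actually required for this repo would need confirmation from the authors):

# Sketch contrasting the two checkpoint formats (hypothetical paths).
import torch
from transformers import AutoModel

# fairseq-style checkpoint: a single .pt file whose "model" entry holds a
# state dict keyed by fairseq module names (e.g. "encoder.layers.0....").
fairseq_state = torch.load("checkpoint.pt", map_location="cpu")
print(list(fairseq_state["model"].keys())[:5])

# Hugging Face Transformers-style checkpoint: a directory containing
# config.json + pytorch_model.bin, loaded via from_pretrained.
hf_model = AutoModel.from_pretrained("./understanding_model")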

On the learning rate of BERT and linear classifiers in tasks

Hello, I was reading your code run_xnli.py. During parameter tuning, it seems that the same learning rate (a single "learning_rate" parameter) is set in the optimizer for both the BERT module and the linear module. However, in a previous research internship we set these two modules to different learning rates. I'm puzzled: which one is right, or are both fine? I'm looking forward to your reply.
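
For reference, giving the encoder and the classification head different learning rates is done with optimizer parameter groups; a minimal sketch (the model name, label count, and learning rates below are illustrative placeholders, not values from run_xnli.py):

# Sketch: separate learning rates for the pre-trained encoder and the
# freshly initialized linear classifier, via optimizer parameter groups.
import torch.nn as nn
from torch.optim import AdamW
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
classifier = nn.Linear(encoder.config.hidden_size, 3)  # e.g. 3 XNLI labels

optimizer = AdamW([
    {"params": encoder.parameters(),    "lr": 2e-5},  # small LR for pre-trained weights
    {"params": classifier.parameters(), "lr": 1e-3},  # larger LR for the new head
])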

Error of 'ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1011CPUTensorIdEv'

Error of 'ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1011CPUTensorIdEv'. I guess the issue is caused by the torch or torchvision version.
What versions of PyTorch and torchvision are recommended? Thanks.

2020-06-22 13:19:55 | INFO | fairseq.tasks.generation_from_pretrained_bart | /home/xuewa/data/dataset_xglue/QG/bin/en valid src-tgt 10000 examples
Traceback (most recent call last):
  File "./train.py", line 11, in <module>
    cli_main()
  File "/home/xuewa/code/Unicoder/generation/fairseq_cli/train.py", line 343, in cli_main
    main(args)
  File "/home/xuewa/code/Unicoder/generation/fairseq_cli/train.py", line 64, in main
    model = task.build_model(args)
  File "/home/xuewa/code/Unicoder/generation/fairseq/tasks/translation.py", line 264, in build_model
    model = super().build_model(args)
  File "/home/xuewa/code/Unicoder/generation/fairseq/tasks/fairseq_task.py", line 211, in build_model
    return models.build_model(args, self)
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/__init__.py", line 48, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/transformer.py", line 212, in build_model
    encoder = cls.build_encoder(args, src_dict, encoder_embed_tokens)
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/transformer.py", line 230, in build_encoder
    return TransformerEncoder(args, src_dict, embed_tokens)
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/transformer.py", line 401, in __init__
    [self.build_encoder_layer(args) for i in range(args.encoder_layers)]
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/transformer.py", line 401, in <listcomp>
    [self.build_encoder_layer(args) for i in range(args.encoder_layers)]
  File "/home/xuewa/code/Unicoder/generation/fairseq/models/transformer.py", line 415, in build_encoder_layer
    return TransformerEncoderLayer(args)
  File "/home/xuewa/code/Unicoder/generation/fairseq/modules/transformer_layer.py", line 35, in __init__
    self.self_attn_layer_norm = LayerNorm(self.embed_dim)
  File "/home/xuewa/code/Unicoder/generation/fairseq/modules/layer_norm.py", line 31, in LayerNorm
    return FusedLayerNorm(normalized_shape, eps, elementwise_affine)
  File "/usr/local/lib/python3.6/dist-packages/apex/normalization/fused_layer_norm.py", line 133, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.6/dist-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c1011CPUTensorIdEv
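
An undefined-symbol error from fused_layer_norm_cuda usually means the apex CUDA extension was compiled against a different PyTorch build than the one currently installed (an interpretation of the traceback, not an official answer from the authors); a quick diagnostic sketch:

# Quick check: print the installed torch/CUDA versions and try importing
# the apex extension that fails in the traceback above. If the import
# fails with "undefined symbol", rebuilding apex from source against the
# currently installed torch usually resolves it.
import torch
print("torch:", torch.__version__, "cuda:", torch.version.cuda)

try:
    import fused_layer_norm_cuda  # the compiled apex extension
    print("apex fused layer norm extension imports cleanly")
except ImportError as e:
    print("apex extension is broken:", e)
    print("note: torch.nn.LayerNorm works as a fallback without apex")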
