
microsoft / unilm

18.3K stars · 293 watchers · 2.4K forks · 57.88 MB

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

License: MIT License

Python 83.17% Shell 2.06% Makefile 0.01% Batchfile 0.01% C++ 0.33% Cython 0.19% Cuda 0.63% Lua 0.07% Jupyter Notebook 13.50% C 0.01% HTML 0.02% Perl 0.01% Mako 0.01% Dockerfile 0.01%
nlp pre-trained-model unilm minilm layoutlm layoutxlm beit document-ai trocr beit-3

unilm's Introduction

Hiring

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (aka large-scale pre-trained models) and General AI, NLP, MT, Speech, Document AI and Multimodal AI, please send your resume to [email protected].

Foundation Architecture

TorchScale - A Library of Foundation Architectures (repo)

Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency.

Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond

Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)

Capability - A Length-Extrapolatable Transformer

Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

The Revolution of Model Architecture

BitNet: 1-bit Transformers for Large Language Models

RetNet: Retentive Network: A Successor to Transformer for Large Language Models

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Foundation Models

The Evolution of (M)LLM (Multimodal LLM)

Kosmos-2.5: A Multimodal Literate Model

Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-1: A Multimodal Large Language Model (MLLM)

MetaLM: Language Models are General-Purpose Interfaces

The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

Language & Multilingual

UniLM: unified pre-training for language understanding and generation

InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages

DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages

MiniLM: small and fast pre-trained models for language understanding and generation

AdaLM: domain, language, and task adaptation of pre-trained models

EdgeLM(NEW): small pre-trained models on edge/client devices

SimLM (NEW): large-scale pre-training for similarity matching

E5 (NEW): text embeddings

MiniLLM (NEW): Knowledge Distillation of Large Language Models

Vision

BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers

DiT: self-supervised pre-training for Document Image Transformers

TextDiffuser/TextDiffuser-2 (NEW): Diffusion Models as Text Painters

Speech

WavLM: speech pre-training for full stack tasks

VALL-E: a neural codec language model for TTS

Multimodal (X + Language)

LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Model for Document AI (e.g. scanned documents, PDF, etc.)

LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI

MarkupLM: markup language model pre-training for visually-rich document understanding

XDoc: unified pre-training for cross-format document understanding

UniSpeech: unified pre-training for self-supervised learning and supervised learning for ASR

UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training

SpeechT5: encoder-decoder pre-training for spoken language processing

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

VLMo: Unified vision-language pre-training

VL-BEiT (NEW): Generative Vision-Language Pre-training - evolution of BEiT to multimodal

BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.

Toolkits

s2s-ft: sequence-to-sequence fine-tuning toolkit

Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm

Applications

TrOCR: transformer-based OCR w/ pre-trained models

LayoutReader: pre-training of text and layout for reading order detection

XLM-T: multilingual NMT w/ pretrained cross-lingual encoders

Links

LLMOps (repo)

General technology for enabling AI capabilities w/ LLMs and MLLMs.

News

  • December, 2023: LongNet and LongViT released
  • [Model Release] Dec, 2023: TextDiffuser-2 models, code and demo.
  • Sep, 2023: Kosmos-2.5 - a multimodal literate model for machine reading of text-intensive images.
  • [Model Release] May, 2023: TextDiffuser models and code.
  • [Model Release] March, 2023: BEiT-3 pretrained models and code.
  • March, 2023: Kosmos-1 - a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot).
  • January, 2023: VALL-E a language modeling approach for text to speech synthesis (TTS), which achieves state-of-the-art zero-shot TTS performance. See https://aka.ms/valle for demos of our work.
  • [Model Release] January, 2023: E5 - Text Embeddings by Weakly-Supervised Contrastive Pre-training.
  • November, 2022: TorchScale 0.1.1 was released!
  • November, 2022: TrOCR was accepted by AAAI 2023.
  • [Model Release] November, 2022: XDoc BASE models for cross-format document understanding.
  • [Model Release] September, 2022: TrOCR BASE and LARGE models for Scene Text Recognition (STR).
  • [Model Release] September, 2022: BEiT v2 code and pretrained models.
  • August, 2022: BEiT-3 - a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks
  • July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
  • June, 2022: DiT and LayoutLMv3 were accepted by ACM Multimedia 2022.
  • June, 2022: MetaLM - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
  • June, 2022: VL-BEiT - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
  • [Model Release] June, 2022: LayoutLMv3 Chinese - Chinese version of LayoutLMv3
  • [Code Release] May, 2022: Aggressive Decoding - Lossless Speedup for Seq2seq Generation
  • April, 2022: Transformers at Scale = DeepNet + X-MoE
  • [Model Release] April, 2022: LayoutLMv3 - Pre-training for Document AI with Unified Text and Image Masking
  • [Model Release] March, 2022: EdgeFormer - Parameter-efficient Transformer for On-device Seq2seq Generation
  • [Model Release] March, 2022: DiT - Self-supervised Document Image Transformer. Demos: Document Layout Analysis, Document Image Classification
  • January, 2022: BEiT was accepted by ICLR 2022 as Oral presentation (54 out of 3391).
  • [Model Release] December 16th, 2021: TrOCR small models for handwritten and printed texts, with 3x inference speedup.
  • November 24th, 2021: VLMo as the new SOTA on the VQA Challenge
  • November, 2021: Multilingual translation at scale: 10000 language pairs and beyond
  • [Model Release] November, 2021: MarkupLM - Pre-training for text and markup language (e.g. HTML/XML)
  • [Model Release] November, 2021: VLMo - Unified vision-language pre-training w/ BEiT
  • October, 2021: WavLM Large achieves state-of-the-art performance on the SUPERB benchmark
  • [Model Release] October, 2021: WavLM - Large-scale self-supervised pre-trained models for speech.
  • [Model Release] October 2021: TrOCR is on HuggingFace
  • September 28th, 2021: T-ULRv5 (aka XLM-E/InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • [Model Release] September, 2021: LayoutLM-cased are on HuggingFace
  • [Model Release] September, 2021: TrOCR - Transformer-based OCR w/ pre-trained BEiT and RoBERTa models.
  • August 2021: LayoutLMv2 and LayoutXLM are on HuggingFace
  • [Model Release] August, 2021: LayoutReader - Built with LayoutLM to improve general reading order detection.
  • [Model Release] August, 2021: DeltaLM - Encoder-decoder pre-training for language generation and translation.
  • August 2021: BEiT is on HuggingFace
  • [Model Release] July, 2021: BEiT - Towards BERT moment for CV
  • [Model Release] June, 2021: LayoutLMv2, LayoutXLM, MiniLMv2, and AdaLM.
  • May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
  • April, 2021: LayoutXLM is coming, extending LayoutLM with multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced, which includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
  • March, 2021: InfoXLM was accepted by NAACL 2021.
  • December 29th, 2020: LayoutLMv2 is coming with new SOTA results on a wide variety of Document AI tasks, including the DocVQA and SROIE leaderboards.
  • October 8th, 2020: T-ULRv2 (aka InfoXLM) as the SOTA on the XTREME leaderboard. // Blog
  • September, 2020: MiniLM was accepted by NeurIPS 2020.
  • July 16, 2020: InfoXLM (Multilingual UniLM) arXiv
  • June, 2020: UniLMv2 was accepted by ICML 2020; LayoutLM was accepted by KDD 2020.
  • April 5, 2020: Multilingual MiniLM released!
  • September, 2019: UniLMv1 was accepted by NeurIPS 2019.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact Furu Wei ([email protected]).

unilm's People

Contributors

addf400, buaahsh, cclauss, dependabot[bot], dod-o, donglixp, eltociear, getao, gitnlp, hypjudy, intfloat, j4ckl1u, jingyechen, liminghao1630, lockon-n, markwunlp, microsoftopensource, muennighoff, pengzhiliang, piskunow, ranpox, sanyuan-chen, shumingma, twak, wenhui0924, wolfshow, wszlong, xichenpan, xingxingzhang, yuchenlin


unilm's Issues

Reproducibility issue

I'm having trouble reproducing the results on the CNN/DM dataset.

I downloaded the data and the fine-tuned model provided in the README, and I followed the commands to predict the test set.

Everything runs fine, but at the end I get the following results:

1 ROUGE-1 Average_R: 0.62689 (95%-conf.int. 0.62269 - 0.63111)
1 ROUGE-1 Average_P: 0.13695 (95%-conf.int. 0.13561 - 0.13828)
1 ROUGE-1 Average_F: 0.22101 (95%-conf.int. 0.21918 - 0.22288)

1 ROUGE-2 Average_R: 0.33142 (95%-conf.int. 0.32673 - 0.33603)
1 ROUGE-2 Average_P: 0.06949 (95%-conf.int. 0.06832 - 0.07078)
1 ROUGE-2 Average_F: 0.11266 (95%-conf.int. 0.11089 - 0.11456)

1 ROUGE-L Average_R: 0.52624 (95%-conf.int. 0.52179 - 0.53061)
1 ROUGE-L Average_P: 0.11465 (95%-conf.int. 0.11345 - 0.11598)
1 ROUGE-L Average_F: 0.18509 (95%-conf.int. 0.18333 - 0.18698)

/root/code/unilm/src/cnndm_model/cnndm_model.bin.test.alp1.0
ROUGE-F(1/2/l): 22.10/11.27/18.51
ROUGE-R(1/2/3/l): 62.69/33.14/52.62


It's weird because I checked the prediction file (cnndm_model.bin.test.alp1.0.post) and compared it with the one provided in the README, and most of the time there are only a few differences.

Here is a comparison of the last few lines of the file (left is the 'official' one, right is mine)

Error on Windows 10 with Git Bash

Environment

Windows

(screenshot omitted)

Git Bash

git version 2.23.0.windows.1

Docker Version:

(screenshot omitted)

Error Reproduction

After running:

alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel bash

Returns:

C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: Mount denied:
The source path "C:/Program Files/Git/home/zzj04;C"
doesn't exist and is not known to Docker.
See 'C:\Program Files\Docker\Docker\Resources\bin\docker.exe run --help'.

CPU based pre-trained model

I am guessing that the model provided is for machines with a CUDA-capable device.
Do you happen to have a pre-trained CPU version of cnndm_model.bin?

@@ -165,7 +165,7 @@ def main():
     print(args.model_recover_path)
     for model_recover_path in glob.glob(args.model_recover_path.strip()):
         logger.info("***** Recover model: %s *****", model_recover_path)
-        model_recover = torch.load(model_recover_path)
+        model_recover = torch.load(model_recover_path, map_location="cpu")
DATA_DIR=../cnndm_data
MODEL_RECOVER_PATH=../cnndm_model.bin
EVAL_SPLIT=test
export PYTORCH_PRETRAINED_BERT_CACHE=/tmp/bert-cased-pretrained-cache
# run decoding
python biunilm/decode_seq2seq.py --fp16 --amp --bert_model bert-large-cased --new_segment_ids --mode s2s --need_score_traces \
  --input_file ${DATA_DIR}/${EVAL_SPLIT}.src --split ${EVAL_SPLIT} --tokenized_input \
  --model_recover_path ${MODEL_RECOVER_PATH} \
  --max_seq_length 768 --max_tgt_length 128 \
  --batch_size 64 --beam_size 5 --length_penalty 0 \
  --forbid_duplicate_ngrams --forbid_ignore_word ".|[X_SEP]"
11/04/2019 15:55:06 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /tmp/bert-cased-pretrained-cache/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "biunilm/decode_seq2seq.py", line 254, in <module>
    main()
  File "biunilm/decode_seq2seq.py", line 147, in main
    amp_handle = amp.init(enable_caching=True)
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/amp.py", line 65, in init
    handle = AmpHandle(enable_caching, verbose)
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/handle.py", line 14, in __init__
    self._default_scaler = LossScaler()
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/scaler.py", line 35, in __init__
    self._overflow_buf = torch.cuda.IntTensor([0])
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:51
[1]    72305 exit 1     python biunilm/decode_seq2seq.py --fp16 --amp --bert_model bert-large-cased

without --amp:

Traceback (most recent call last):
  File "biunilm/decode_seq2seq.py", line 254, in <module>
    main()
  File "biunilm/decode_seq2seq.py", line 216, in main
    position_ids, input_mask, task_idx=task_idx, mask_qkv=mask_qkv)
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1409, in forward
    return self.beam_search(input_ids, token_type_ids, position_ids, attention_mask, task_idx=task_idx, mask_qkv=mask_qkv)
  File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1528, in beam_search
    output_all_encoded_layers=True, prev_embedding=prev_embedding, prev_encoded_layers=prev_encoded_layers, mask_qkv=mask_qkv)
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1062, in forward
    input_ids, token_type_ids, attention_mask)
  File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1037, in get_extended_attention_mask
    extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
  File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/tensor.py", line 371, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
RuntimeError: "add_cpu" not implemented for 'Half'

Packages:

pytorch-pretrained-bert 0.4.0    
torch                   1.1.0 
tensorboardX            1.9
apex                     0.1
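
For readers hitting the same wall, a hedged recap of the workaround implied by the diff and the tracebacks in this issue (not an official answer): load the checkpoint onto CPU with map_location and run decode_seq2seq.py without --fp16/--amp, since half-precision kernels are not implemented on CPU. The checkpoint path below is illustrative.

import torch

# Minimal sketch of CPU-only loading; the path is illustrative.
state_dict = torch.load("../cnndm_model.bin", map_location="cpu")
# Keep the weights in fp32 and drop --fp16/--amp from the decoding command,
# otherwise CPU ops fail with: "add_cpu" not implemented for 'Half' (see above).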

Timeline and information for V2?

Hello,

Thank you very much for sharing this codebase together with good documentation, much appreciated! :-)

Is there any timeline or some information for the upcoming V2 release?

Regards,
Fabian

Results of Question Generation on SQuAD

Thank you for making the code open-sourced.

It seems that there is an issue with the question generation evaluation script. The evaluation script produces nearly the same results as reported in the paper, but it contains some post-processing steps which are the main source of the improvement.

Most authors in this area use nlg-eval to report performance. With nlg-eval, the scores on the test set with the released generation output and the gold questions (test.q.tok.txt) are as follows:
Bleu_1: 0.407580
Bleu_2: 0.275720
Bleu_3: 0.201373
Bleu_4: 0.151140
METEOR: 0.161781
ROUGE_L: 0.436765
Could you please clarify the difference?

How is the result in the paper evaluated?

I would like to know the evaluation method for the generation tasks in the paper. Do you report the result of the checkpoint with the best validation score, or the result after the last epoch? If it is the latter, is the random seed fixed for each training run? Thank you so much~

fine-tune time and decode part

Hello, I am trying to use your model on the QG task. However, it takes a lot of time to fine-tune.
It takes 8 hours per epoch on the SQuAD dataset (no fp16, since my GPU does not support it).
The GPU is a 1080 Ti (11 GB) and I need to set the batch size to 1.

I wonder why this model is so slow to fine-tune compared to other pre-trained models like GPT and BERT.

Another question is about decoding. At each step the model predicts a word, and the updated sequence is fed back in to predict the next word. Is this right? (Sorry, the code is too complex for me to follow.)

Replacing 1 by #

In this line (code for evaluation of CNNDM)

sentence = fix_tokenization(l.strip()).replace('1', '#')

1 is replaced by #.

I don't understand it. Can someone explain the reason for this post-processing?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1, Please help

I was installing a particular commit of NVIDIA/apex by doing this on Colab.

%%writefile setup.sh
git clone -q https://github.com/NVIDIA/apex.git
cd apex
git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
cd ..
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

I'm getting the whole error below.
The problem only occurs with this particular commit; if I install the master branch, it installs fine.

.....
csrc/scale_check_overflow.cpp:14:3: note: in expansion of macro ‘AT_CHECK’
AT_CHECK(grads.type().is_cuda(), "grads must be a CUDA tensor");
^
.....
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Running setup.py install for apex ... error
Cleaning up...
Removing source in /tmp/pip-req-build-wmagyiis
Removed build tracker '/tmp/pip-req-tracker-9l2sbkhe'
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-wmagyiis/setup.py'"'"'; file='"'"'/tmp/pip-req-build-wmagyiis/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-dmtu3t6t/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/base_command.py", line 153, in _main
status = self.run(options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/install.py", line 455, in run
use_user_site=options.use_user_site,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/req/init.py", line 62, in install_given_reqs
**kwargs
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/req/req_install.py", line 888, in install
cwd=self.unpacked_source_directory,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/subprocess.py", line 275, in runner
spinner=spinner,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools,

Truncating predictions

Thanks for open-sourcing the repo; the code is great and the results are really easy to reproduce thanks to Docker, the detailed explanations in the README, and your fine-tuned checkpoints!


I have a question about predictions post-processing.

I could reproduce the paper's results on the CNN/DM dataset with the command provided in the README. My results are:

R-1 R-2 R-L
43.06 20.42 40.32

But if I run the same command without truncation (--trunc_len 0 instead of --trunc_len 70), the results are much lower:

R-1 R-2 R-L
42.05 19.90 39.44

Is this normal?

In other codebases, I've never seen predictions being truncated. I'm wondering why it is necessary with UniLM.

I'm also curious to hear your opinion about the reason why the score is lower without truncation.

do not manage to run with --fp16 & --amp

Hi
Thanks for sharing this.
I've tried to run the question generation part.
I can manage to make it work, but only with a batch_size of 8 because of the 16 GB limit on my GPU.
So I wanted to switch to FP16 to increase the batch_size and speed up the training.

I'm getting this error:

10/16/2019 09:32:57 - INFO - main - device: cuda n_gpu: 1, distributed training: False, 16-bits training: True
10/16/2019 09:32:57 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /tmp/bert-cased-pretrained-cache/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
Loading Train Dataset /root/unilm/data/train
Load 75722 documents
10/16/2019 09:33:02 - INFO - main - enable fp16 with amp
10/16/2019 09:33:02 - INFO - main - ***** Recover model: /root/unilm/models/unilmv1-large-cased.bin *****
10/16/2019 09:33:03 - INFO - pytorch_pretrained_bert.modeling - loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz from cache at /tmp/bert-cased-pretrained-cache/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233
10/16/2019 09:33:03 - INFO - pytorch_pretrained_bert.modeling - extracting archive file /tmp/bert-cased-pretrained-cache/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233 to temp dir /tmp/tmppli0_vk5
10/16/2019 09:33:14 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"ffn_type": 0,
"fp32_embedding": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"label_smoothing": 0.1,
"max_position_embeddings": 512,
"new_pos_ids": false,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"num_qkv": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"relax_projection": 0,
"seg_emb": false,
"task_idx": 3,
"type_vocab_size": 6,
"vocab_size": 28996
}

^[[A10/16/2019 09:33:49 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForPreTrainingLossMask not initialized from pretrained model: ['crit_mask_lm_smoothed.one_hot']
10/16/2019 09:33:50 - INFO - main - ***** CUDA.empty_cache() *****
10/16/2019 09:33:50 - INFO - main - ***** Running training *****
10/16/2019 09:33:50 - INFO - main - Batch size = 4
10/16/2019 09:33:50 - INFO - main - Num steps = 9465
Epoch: 0%| | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File "biunilm/run_seq2seq.py", line 483, in
main()
File "biunilm/run_seq2seq.py", line 461, in main
optimizer.step()
File "/root/.local/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/optimizers/fp16_optimizer.py", line 157, in step
grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 192, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 192, in
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
AttributeError: 'NoneType' object has no attribute 'contiguous'
Iter (loss=5.700): 0%|

when running this command line:

python3 biunilm/run_seq2seq.py --do_train --num_workers 0 \
--bert_model bert-large-cased --new_segment_ids --tokenized_input \
--data_dir ${DATA_DIR} --src_file train.pa.tok.txt --tgt_file train.q.tok.txt \
--output_dir ${OUTPUT_DIR}/bert_save \
--log_dir ${OUTPUT_DIR}/bert_log \
--model_recover_path ${MODEL_RECOVER_PATH} \
--max_seq_length 512 --max_position_embeddings 512 \
--mask_prob 0.7 --max_pred 48 \
--train_batch_size 8 --gradient_accumulation_steps 2 \
--learning_rate 0.00002 --warmup_proportion 0.1 --label_smoothing 0.1 \
--num_train_epochs 1 \
--amp \
--fp16

Any clue where this could come from?
Thanks in advance
Philippe

Regarding Mask Matrix M

I am a bit confused about the implementation of Equation 3 of the paper, where the mask matrix M is used. Could you please describe how the mask matrix M is implemented and which part of the code corresponds to Equation 3?
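
For readers with the same question, here is a minimal hedged sketch (not the repository's actual code) of how a seq2seq attention mask of the kind in Equation 3 can be built: source positions attend bidirectionally within the source segment, while each target position attends to the whole source plus its own left context in the target. The 0/1 matrix is then turned into an additive bias before the softmax, matching the (1.0 - extended_attention_mask) * -10000.0 line visible in a traceback earlier on this page. Function and variable names are illustrative.

import torch

def seq2seq_attention_mask(src_len, tgt_len):
    """Return a (src_len+tgt_len) x (src_len+tgt_len) 0/1 matrix where
    entry [i, j] == 1 means position i may attend to position j."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.long)
    # Every position (source or target) may attend to the whole source segment.
    mask[:, :src_len] = 1
    # Within the target segment, attention is causal: left context only.
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.long))
    return mask

m = seq2seq_attention_mask(4, 3)        # 4 source tokens, 3 target tokens
bias = (1.0 - m.float()) * -10000.0     # additive bias added to the attention scores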

How to reproduce Long Text Generation examples from paper?

I would appreciate it if another user could share the sample commands that reproduce the interesting results included in the arXiv paper. No success so far with the original MS repo, and only limited, repetitious output using the current Hugging Face unilm fork with the same prompt.

Cheers

Seed line from 1984 used in original paper:

Winston sat back. A sense of complete helplessness had descended upon him.

Published output:

He was now helpless and unmoving. What would that
mean? To become a monk under her mother’s rule. What would it
mean for him to be put at the center of the most dangerous of all of
demons, the demon known as Azrael? He would never know exactly
what that was, or even how it would end, unless he could sense the
strength of Azrael’s mind and soul. It wasn’t only the existence of
his mother, it was all he knew, and every moment. He was trapped
within her embrace, a trapped, torn, battered, crippling presence he
only knew it would mean was eternity. And in the instant that he
was lost his life changed forever, when Azraelis’ spirit touched his
mind, he was engulfed in this dark and mystical, so enchanting and
beautiful force.

question about seq2seq pre-training

For pre-training the seq2seq LM, how are the training examples constructed? In particular, given an unannotated corpus, what are the source segment and the target segment?

Any suggestions on how to create [SEP] hints for Neural Question Generation ...

UniLM team, awesome work!! I am able to generate very good quality questions and am thoroughly impressed; some of the generated questions are simply amazing. Generative models are the future and carry insane potential!!

I have one question though:

I followed your approach of appending a [SEP] tag after passages to provide hints for drawing out meaningful questions. However, I am not sure how I can scale this to the dataset I have. I am thinking of applying NER to my passages and, piggybacking on selected entities, generating [SEP] hints for every passage.

Is there a better and faster approach? Obviously NER does not always pick the hints I would wish to capture, so I may lose a lot of information with the NER approach. I am just not able to think of any better alternative. Manually curating [SEP] hints is not feasible for my research work.

Thank you & keep doing the amazing work!
Anshoo
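
One hedged way to prototype the NER-based hint idea described above (an illustration only, not a recommendation from the UniLM authors): use an off-the-shelf tagger such as spaCy to pick a candidate answer span per passage and append it after a [SEP] marker. The model name en_core_web_sm and the choice of the first entity are assumptions made for this sketch.

import spacy

# Assumes the small English model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def add_sep_hint(passage):
    """Append the first named entity (if any) as a [SEP] hint, mirroring the
    passage [SEP] answer format used for question generation."""
    doc = nlp(passage)
    if not doc.ents:
        return passage                      # no entity found; leave the passage unchanged
    answer_hint = doc.ents[0].text          # picking the first entity is arbitrary
    return passage + " [SEP] " + answer_hint

print(add_sep_hint("Marie Curie won the Nobel Prize in Physics in 1903."))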

Hardware configuration

What is the hardware setup used for training here?
We have people here with V100, RTX 2080, GTX 1080 Ti, GTX 1060, and 960M GPUs. Hopefully it works on diverse setups.

Cannot use a custom sentence for QG

Problem

Hi. I want to try QG using decode_seq2seq.py. It works when I use the sample data, but when I use other data it raises KeyError: 'H.E.'.

Note

  • I use BERT-LARGE-CASED
  • It will succeed if I remove that word, and then it errors on another 'weird' word.

Question

  1. Does decode_seq2seq match each input word against the BERT-LARGE-CASED vocabulary?
  2. How should the text be preprocessed before decode_seq2seq? Is there any guidance for preprocessing?
  3. I also read a similar issue in huggingface/transformers#63.

Terminal Output

File "/root/code/unilm/src/pytorch_pretrained_bert/tokenization.py", line 117, in convert_tokens_to_ids 
    ids.append(self.vocab[token])
KeyError: 'H.E.' # or another weird word
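
Not an official answer, but the KeyError suggests raw, untokenized words are being looked up directly in the vocabulary. Below is a hedged sketch of pre-tokenizing the input with the standard pytorch-pretrained-bert WordPiece tokenizer, so that a string like 'H.E.' is split into in-vocabulary pieces before being passed to decode_seq2seq.py with --tokenized_input; the file names are illustrative.

from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased", do_lower_case=False)

# Illustrative file names: write one whitespace-joined, WordPiece-tokenized
# line per input passage, then point --input_file at the output file.
with open("my_input.raw.txt") as fin, open("my_input.src", "w") as fout:
    for line in fin:
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")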

How to pretrain UniLM for an abstractive summarization task?

If I want to train UniLM from scratch for an abstractive summarization task in another language (not English), how do I do it?

I guess the fine-tuning and inference code from the README can be reused, but I'm not sure how to do the pre-training. Could you share the pre-training code used for CNN summarization? Thanks!

CNN/DM : data preprocessing

The link to the CNN/DM data points to an already preprocessed dataset.

How can we reproduce a similar dataset from the official .story files?

UniLM model in other languages?

Thanks for answering my previous issue.
I have a new (easy) one: do you plan to release UniLM models in languages other than English?
Thanks in advance for your response.
Philippe

Empty predicted sequences being generated.

I have a very limited dataset of 225 samples. The task is similar to Gigaword headline generation. The statistics for my source and target sequences look like this:

Source:
(After BERT tokenizer)

  • Max token length: 380
  • Min token length: 5
  • Average token length: 97.53

Target:
(After BERT tokenizer)

  • Max token length : 85
  • Min token length: 5
  • Average token length: 29.8

I used an 80-20 split and trained on 180 samples and tested on 45. I tried running decoding with different values for max_tgt_length, and got the following results:

  • 40 tokens: output sequences generated for 37 samples out of 45.
  • 60 tokens: output sequences generated for 35 samples.
  • 100 tokens: output sequences generated for 34 samples.

What's happening here, and what is a good workaround given the variation in my data?

No Module Named Bleu

Hi, where does the Bleu module come from?

I encounter this when running eval:

File "src/qg/eval_on_unilm_tokenized_ref.py", line 4, in <module>
    from bleu.bleu import Bleu
ImportError: No module named bleu.bleu

Could you please release the dev results?

It's very nice to have the test results, but in fact they are of little use for research because we shouldn't tune against the test set.

So if it is convenient, please also release the dev results. I know this is a big request, since providing a useful guide is already very nice, but I make it because of the high quality of the repo and its maintainers.

Thanks a lot.

Regarding Abstractive Summaries

As mentioned in the paper, during fine-tuning you mask some tokens in the summaries and then predict those tokens. But during inference (only test data is given), you don't have the summaries. So how does prediction work at inference time? I mean, what inputs do you give for inference, and how does the decoding work?
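
For readers wondering about the same thing: as the paper describes it, decoding appends a [MASK] token to the (initially empty) target, predicts it, replaces the mask with the predicted word, and repeats until an end-of-sequence token is produced. A heavily simplified, hedged sketch of this greedy variant follows; the real decoder additionally uses beam search and cached hidden states, and the exact special tokens differ.

def greedy_mask_decode(source_tokens, predict_masked, max_tgt_length=128, eos_token="[SEP]"):
    """Sketch of the mask-then-predict decoding loop. `predict_masked` is a
    stand-in for a forward pass of the fine-tuned model: given the current
    sequence ending in [MASK], it returns the word predicted for that mask."""
    target = []
    for _ in range(max_tgt_length):
        sequence = source_tokens + ["[SEP]"] + target + ["[MASK]"]
        next_word = predict_masked(sequence)
        if next_word == eos_token:
            break
        target.append(next_word)
    return target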

Segmentation Fault Core dump

After installing the environment, I run run_seq2seq.py. When the model is loaded, a segmentation fault (core dump) occurs. My environment is PyTorch 1.1.0, CUDA 10.1, torchvision 0.3.0. Why does the core dump occur?

Whole Word Masking

It seems you don't use Whole Word Masking in pre-training.
Whole Word Masking has been shown to be useful for BERT, so will you try it with UniLM (and release the pre-trained model)?

Thanks!

Is there a way to optimize while loop within BertForSeq2SeqDecoder?

Hello,

I noticed that BertForSeq2SeqDecoder is slow on CPU, mainly due to the while loop inside the forward method: it iterates N times, where N is the difference between output_length and next_pos, and next_pos increments by 1 at each iteration.

Can you explain what these are:

  • curr_length
  • next_pos and input_length
  • output_length
  • token_type_ids

Do you have any idea on how to optimize the code in order to get rid of the while loop?

Thanks a lot; any help explaining those variables would be appreciated.
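
Not a maintainers' answer, but as a general pattern the loop itself cannot be removed: generation is inherently sequential. What the cached prev_embedding / prev_encoded_layers arguments visible in an earlier traceback suggest is that each iteration should only run the network over the newly generated position and reuse cached hidden states for earlier ones. A generic, hedged sketch of that pattern (not the repository's actual implementation) follows.

def incremental_decode(first_step, next_step, max_new_tokens, eos_id):
    """Generic incremental decoding: the first call runs a full forward pass
    over the prompt and returns (token, cache); every later call feeds only
    the newest token plus the cache of earlier hidden states, so per-step
    cost stays roughly constant instead of growing with sequence length.
    All names here are illustrative."""
    token, cache = first_step()
    generated = [token]
    while len(generated) < max_new_tokens and token != eos_id:
        token, cache = next_step(token, cache)
        generated.append(token)
    return generated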

A QG task problem

Hi, I have a problem with the QG task when I evaluate performance. With your released evaluation scripts I can match your reported numbers, but I get a different result when I use https://github.com/xinyadu/nqg.

(screenshot omitted)

Why do I get this result?
Thank you for your help.

❓ Question: Training/evaluation discrepancy in Abstractive Summarization

Thanks for open-sourcing the code !

After reading your paper, I have a question about the finetuning procedure for Abstractive summarization (and more generally any Seq2Seq task).


(screenshot omitted)

I understand the idea: similarly to BERT and to UniLM pre-training, fine-tuning for abstractive summarization masks some tokens and predicts them in order to learn a bidirectional representation of the tokens.

But at inference time, since we don't have access to the whole summary (it is yet to be generated), we can only apply a left-to-right LM.

This seems like a pretty big discrepancy between training and testing.


What I don't understand is that people have already tried to use BERT (trained as a bidirectional encoder) as a left-to-right LM, but the results were quite poor.

And in your case, the results are very strong!

So my questions are:

  • Did I miss something? Did I misunderstand, and is there in fact no discrepancy?

  • If I understood correctly, why do you fine-tune the seq2seq model with the bidirectional LM objective rather than a left-to-right LM?

Fine-tuned models for Generative QA

Excellent work!

Would it be possible to provide fine-tuned models for Generative QA along with training and inference instructions similar to those provided for Abstractive Summarization and Question Generation?

Segmentation fault (core dumped)

When running run_seq2seq.py to fine-tune the model on the summarization dataset, the program always crashes with "Segmentation fault (core dumped)". The command is as follows:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python biunilm/run_seq2seq.py \
--do_train --fp16 --amp --num_workers 0 \
--bert_model ../bert-large-cased/ --new_segment_ids --tokenized_input \
--output_dir ../summ_model/bert_save \
--log_dir ../summ_model/bert_log \
--model_recover_path ../storage/unilmv1-large-cased.bin \
--max_seq_length 768 --max_position_embeddings 768 \
--trunc_seg a --always_truncate_tail \
--max_len_a 568 --max_len_b 200 \
--mask_prob 0.7 --max_pred 140 \
--train_batch_size 48 --gradient_accumulation_steps 2 \
--learning_rate 0.00003 --warmup_proportion 0.1 --label_smoothing 0.1 \
--num_train_epochs 30

What causes the Segmentation fault error? Thanks for your help!

Where is bert pretrained cache?

export PYTORCH_PRETRAINED_BERT_CACHE=/{tmp_folder}/bert-cased-pretrained-cache
From this command I can't tell where to find bert-cased-pretrained-cache. I tried to pip install it separately, but there is no bert-cased-pretrained-cache package.

For fine-tuning on NLG tasks, why not use the standard language model objective with teacher forcing?

Hi, I would like to know more about the comparison between standard language model fine-tuning (teacher forcing) and masked language model fine-tuning (the same objective as pre-training) in your paper. In my opinion, for text generation tasks the most popular fine-tuning approach would be teacher forcing for language modeling, as it mimics the generation process at test time. Thanks!
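
For readers comparing the two objectives, here is a schematic, hedged contrast (names and shapes are illustrative; this is not the repository's loss code): teacher forcing supervises every target position, whereas UniLM-style fine-tuning supervises only the randomly masked target positions, matching the pre-training objective.

import torch
import torch.nn.functional as F

def teacher_forcing_loss(logits, target_ids):
    # Standard left-to-right LM fine-tuning: every target position is supervised.
    # logits: (tgt_len, vocab_size), target_ids: (tgt_len,)
    return F.cross_entropy(logits, target_ids)

def masked_lm_loss(logits, target_ids, mask):
    # UniLM-style fine-tuning: only the masked target positions are supervised.
    # mask: boolean tensor of shape (tgt_len,) marking the masked positions.
    return F.cross_entropy(logits[mask], target_ids[mask])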
