
indic-bart's Introduction

IndicBART

Pre-trained, multilingual sequence-to-sequence models for Indian languages

You can read more about IndicBART here and in this paper. IndicBART is part of the AI4Bharat tools for Indian languages.

Installation

  1. Install the YANMTT toolkit. Check out the v1.0 release via "git checkout v1.0". Make sure to create a new conda or virtual environment so that things work smoothly (a setup sketch follows this list).

  2. Download the following: the IndicBART model checkpoint (indicbart_model.ckpt) and the vocabulary archive (albert-indicunified64k.zip).

  3. Decompress the vocabulary zip: unzip albert-indicunified64k.zip
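
Putting these steps together, a minimal setup sketch might look like the following; the environment name and Python version are illustrative, and the checkpoint and vocabulary are assumed to have been downloaded into the current directory (follow the YANMTT README for the exact dependency installation):

    conda create -n indicbart python=3.6 -y && conda activate indicbart  # fresh environment
    git clone https://github.com/prajdabre/yanmtt && cd yanmtt
    git checkout v1.0                  # pin the v1.0 release
    pip install -r requirements.txt    # or whatever the YANMTT README prescribes
    cd ..
    unzip albert-indicunified64k.zip   # decompress the downloaded vocabulary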

Finetuning IndicBART for NMT

Sample training corpora

Sample development set

3-way parallel: en hi bn

Script conversion

  • The Indic side of the data needs to be converted to the Devanagari script. You may use the indic_scriptmap.py script (a batch-conversion sketch follows this list).
    • This script depends on Indic NLP Library and Indic NLP Resources which should be manually installed.
    • Following this, change the paths in lines 13 and 16 in indic_scriptmap.py.
    • Usage: python indic_scriptmap.py <input_file> <output_file> <source_language> <target_language>
      • Example: python indic_scriptmap.py input.txt output.txt ta hi
      • This converts the text in input.txt from the Tamil script to the Devanagari (Hindi) script.
  • The sample data provided above has already been converted to the Devanagari script, so you can use it as is.
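
For your own corpora, the same command can simply be applied to each Indic-side file. A sketch, using the sample file names from the fine-tuning command below (the .deva output names are illustrative; the sample data itself needs no conversion):

    python indic_scriptmap.py train.en-bn.bn train.en-bn.bn.deva bn hi   # Bengali -> Devanagari
    python indic_scriptmap.py dev.bn dev.bn.deva bn hi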

Fine-tuning command

python PATH-TO-YANMTT/train_nmt.py --train_slang hi,bn --train_tlang en,en  \
    --dev_slang hi,bn --dev_tlang en,en --train_src train.en-hi.hi,train.en-bn.bn \
    --train_tgt train.en-hi.en,train.en-bn.en --dev_src dev.hi,dev.bn --dev_tgt dev.en,dev.en \
    --model_path model.ft --encoder_layers 6 --decoder_layers 6 --label_smoothing 0.1 \
    --dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 --encoder_attention_heads 16 \
    --decoder_attention_heads 16 --encoder_ffn_dim 4096 --decoder_ffn_dim 4096 \
    --d_model 1024 --tokenizer_name_or_path albert-indicunified64k --warmup_steps 16000 \
    --weight_decay 0.00001 --lr 0.001 --max_gradient_clip_value 1.0 --dev_batch_size 128 \
    --port 22222 --shard_files --hard_truncate_length 256 --pretrained_model indicbart_model.ckpt &> log

At the end of training, you should find the model with the highest BLEU score for a given language pair. This will be model.ft.best_dev_bleu.<language>-en, where <language> can be hi or bn. The model training log will tell you the iteration at which the best-performing checkpoint was last saved.
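
For example, a sketch of locating and selecting a checkpoint for decoding (the names follow the --model_path prefix used above):

    ls model.ft.best_dev_bleu.*            # e.g. model.ft.best_dev_bleu.hi-en, model.ft.best_dev_bleu.bn-en
    decmod=model.ft.best_dev_bleu.hi-en    # choose the pair you want to decode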

Decoding command

decmod=BEST-CHECKPOINT-NAME
    
python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang en \
    --test_src dev.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 --decoder_layers 6 \
    --encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
    --decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
    --beam_size 4 --length_penalty 0.8
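
Assuming dev.trans holds the decoded translations (the --test_tgt file above), you can get a quick corpus-level BLEU score against the reference with sacrebleu. This is only a sketch; sacrebleu is not part of the toolkit setup above and must be installed separately:

    pip install sacrebleu
    sacrebleu dev.en < dev.trans   # BLEU of the decoded output against the English reference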

Notes:

  1. If you want to use an IndicBART model with language specific scripts, we provide that variant as well: (Vocabulary) (Model)

  2. If you want to perform additional pre-training of IndicBART or train your own model, follow the instructions in: https://github.com/prajdabre/yanmtt/blob/main/examples/train_mbart_model.sh

  3. For advanced training options, look at the examples in: https://github.com/prajdabre/yanmtt/blob/main/examples

Finetuning IndicBART for Summarization

Sample Corpus

Fine-tuning command

python PATH-TO-YANMTT/train_nmt.py --train_slang hi --train_tlang hi --dev_slang hi --dev_tlang hi \
    --train_src train.text.hi --train_tgt train.summary.hi --dev_src dev.text.hi \
    --dev_tgt dev.summary.hi --model_path model.ft --encoder_layers 6 --decoder_layers 6 \
    --label_smoothing 0.1 --dropout 0.1 --attention_dropout 0.1 --activation_dropout 0.1 \
    --encoder_attention_heads 16 --decoder_attention_heads 16 --encoder_ffn_dim 4096 \
    --decoder_ffn_dim 4096 --d_model 1024 --tokenizer_name_or_path albert-indicunified64k \
    --warmup_steps 16000 --weight_decay 0.00001 --lr 0.0003 --max_gradient_clip_value 1.0 \
    --dev_batch_size 64 --port 22222 --shard_files --hard_truncate_length 512 \
    --pretrained_model indicbart_model.ckpt --max_src_length 384 --max_tgt_length 40 \
    --is_summarization --max_decode_length_multiplier -60 \
    --min_decode_length_multiplier -10 --no_repeat_ngram_size 4 --length_penalty 1.0 \
    --max_eval_batches 20

Decoding command

decmod=BEST-CHECKPOINT-NAME
    
python PATH-TO-YANMTT/decode_nmt.py --model_path $decmod --slang hi --tlang hi \
    --test_src dev.text.hi --test_tgt dev.trans --port 23352 --encoder_layers 6 \
    --decoder_layers 6 --encoder_attention_heads 16 --decoder_attention_heads 16 \
    --encoder_ffn_dim 4096 --decoder_ffn_dim 4096 --d_model 1024 \
    --tokenizer_name_or_path albert-indicunified64k --beam_size 4 \
    --max_src_length 384 --max_decode_length_multiplier -60 --min_decode_length_multiplier -10 \
    --no_repeat_ngram_size 4 --length_penalty 1.0 --hard_truncate_length 512 

Contributors

Citing

If you use IndicBART, please cite the following paper:

@misc{dabre2021indicbart,
      title={IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages},
      author={Raj Dabre and Himani Shrotriya and Anoop Kunchukuttan and Ratish Puduppully and Mitesh M. Khapra and Pratyush Kumar},
      year={2021},
      eprint={2109.02903},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
      publisher={Association for Computational Linguistics}
}

License

IndicBART is licensed under the MIT License.

indic-bart's People

Contributors

anoopkunchukuttan, ayush1702, prajdabre, ratishsp, shivprasadsagare


indic-bart's Issues

Difference between IndicBARTSS and IndicBART-XLSum?

With reference to the model cards on HuggingFace for ai4bharat/IndicBARTSS and ai4bharat/IndicBART-XLSum, other than support for additional Indic languages, is there any difference between the two?

IndicBART-XLSum training is detailed in the IndicBART paper (https://arxiv.org/abs/2109.02903). Is there any literature detailing the IndicBARTSS training?

Clarification regarding samanantar model

As mentioned in the paper, a model was trained on Samanantar from scratch for comparison purposes. May I know whether that trained model is bilingual or multilingual?

Import error while running decoding_nmt.py

I have installed all the prerequisites as stated in the README. I have also installed transformers (4.3.2) using the modified code made available in the yanmtt repo. The following error occurs when running the decoding command as stated here: https://github.com/AI4Bharat/indic-bart#decoding-command

Traceback (most recent call last):
  File "decode_nmt.py", line 34, in <module>
    from transformers import AutoTokenizer, MBartTokenizer, MBart50Tokenizer, BartTokenizer, AlbertTokenizer
ImportError: cannot import name 'MBart50Tokenizer'

Summarization does not return a summary using the pretrained model

from transformers import AutoTokenizer, AlbertTokenizer
from transformers import AutoModelForSeq2SeqLM, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained('/content/IndicBART')
tokenizer = AlbertTokenizer.from_pretrained('/content/IndicBART')

marathi_text_sample = """
काल रात्री मी जे काही पाहिलं ते एका कलाकाराने दिलेले तेजस्वी सुख होतं. तू ज्या पद्धतीने चंद्रमुखी साकारली आहेस, ती मनाचा खोलवर ठाव घेणारी आहे. एक प्रेक्षक म्हणून मी त्यावेळी त्या चंद्रमुखीच्या प्रेमात पडलो. त्याक्षणी मला तिला माझ्यासोबत घेऊन जावंस वाटलं. तिचे संरक्षण करावंस वाटलं.
मला त्यावेळी तिच्यातील वेदना जाणवल्या आणि मी तिच्यातील एक भाग आहे, असंही क्षणार्धात मला वाटलं. मनाचा घाव घेणारी, निष्पाप, खोलवर जखम झालेली असली तरीही तिचे शुद्ध अंतकरण पाहून मी भरुन पावलो आहे. तिच्या प्रवासाचे साक्षीदार झालेल्या सर्वांनाच तिने मोहिनी घातली आहे.
अमृता तुझ्या माध्यमातून तू आम्हा सर्वांना चंद्रमुखीच्या प्रवासाचे साक्षीदार बनवल्याबद्दल तुझे धन्यवाद.
कलाकाराच्या आयुष्यात वेगवेगळ्या टप्प्यावर येणारी प्रत्येक भूमिका ही एका खास हेतूसाठी येते आणि तो हेतू म्हणजे पुन्हा स्वत:शीच कनेक्ट होणे. कलाकार करत असलेली प्रत्येक भूमिका ही त्याच्या आत्मावर खोलवर ठसा उमटवते. चंद्रमुखी ही तुझ्यासाठी तेच करेल.
या प्रवासात तुला स्वत:चा खरा शोध घेण्यासाठी, पुढे जाण्यासाठी आणि पुन्हा मागच्या गोष्टींसोबत रि-कनेक्ट होण्यासाठी नक्कीच मदत करतील. अमृता मी तुझ्यासाठी खूप आनंदी आहे आणि मला तुझा खूप अभिमान वाटतो.
"""

inputs = tokenizer.encode(marathi_text_sample, return_tensors="pt", padding=True)

# summarize
generated_ids = model.generate(inputs, min_length=30, max_length=70, no_repeat_ngram_size=2, num_beams=4, early_stopping=True)

# generate summary
tokenizer.decode(generated_ids[0], skip_special_tokens=True)

I have been trying to perform a summarization task on a Marathi text using the above piece of code as an example; however, the generated summary is simply the first couple of lines of the input text.
Is there something I'm missing? I have tried passing "summarize: " at the beginning, as done in MBart, but with no results.

Error in download link of IndicBart model checkpoint and vocabulary files

Dear Authors,

Thanks for sharing the code and model weights for running IndicBART experiments. Unfortunately, the download links for the checkpoint and vocabulary files were not accessible at the following URLs:

checkpoint_link: https://storage.googleapis.com/ai4bharat-indicnlg-public/indic-bart-v1/indicbart_model.ckpt
vocab_link: https://storage.googleapis.com/ai4bharat-indicnlg-public/indic-bart-v1/albert-indicunified64k.zip

I am facing the following error upon opening the link mentioned on the readme page:

<Error>
<Code>UserProjectAccountProblem</Code>
<Message>The project to be billed is associated with a delinquent billing account.</Message>
<Details>The billing account for the owning project is disabled in state delinquent</Details>
</Error>

Please help with the above issue or kindly share an alternative link.

Thanks in advance.

Doubt: Where is the final model checkpoint (for all languages combined) saved for multilingual training?

Hi, I had a doubt about the training process.

After running the training on the (hi, en) and (bn, en) pairs at the same time, I can see the respective model checkpoints, e.g. model.ft.best_dev_bleu.hi-en. I wish to ask whether two separate models are trained for the two language pairs passed, or whether it is the same multilingual model that is trained, with checkpoints stored per language pair based on the individual dev BLEU. In the latter case, where is the combined model checkpoint for all languages saved, if it is saved at all? Is it model.ft? Or is it in the model.ft_deploy directory?

Thank you in advance.

Error in running decoding_nmt.py

The following error occurs when running the command as stated here: https://github.com/AI4Bharat/indic-bart/blob/main/README.md#decoding-command

Traceback (most recent call last):
  File "decode_nmt.py", line 510, in <module>
    run_demo()
  File "decode_nmt.py", line 506, in run_demo
    mp.spawn(model_create_load_decode, nprocs=args.gpus, args=(args,))         #
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home2/shivprasad.sagare/stuff/copernicus/indicbart/yanmtt/decode_nmt.py", line 171, in model_create_load_decode
    translations = model.module.generate(input_ids.to(gpu), use_cache=True, num_beams=args.beam_size, max_length=int((len(input_ids[0])*args.max_decode_length_multiplier) if args.max_decode_length_multiplier > 0 else -args.max_decode_length_multiplier), min_length=int((len(input_ids[0])*args.min_decode_length_multiplier) if args.min_decode_length_multiplier > 0 else -args.min_decode_length_multiplier), early_stopping=True, attention_mask=input_masks.to(gpu), pad_token_id=tok.pad_token_id, eos_token_id=tok(["</s>"], add_special_tokens=False).input_ids[0][0], decoder_start_token_id=tok([args.tlang if args.use_official_pretrained else "<2"+args.tlang+">"], add_special_tokens=False).input_ids[0][0], bos_token_id=tok(["<s>"], add_special_tokens=False).input_ids[0][0], length_penalty=args.length_penalty, repetition_penalty=args.repetition_penalty, encoder_no_repeat_ngram_size=args.encoder_no_repeat_ngram_size, no_repeat_ngram_size=args.no_repeat_ngram_size, num_return_sequences=args.beam_size if args.return_all_sequences else 1, additional_input_ids=input_ids_parent.to(gpu) if args.multi_source else None, additional_input_ids_mask=input_masks_parent.to(gpu) if args.multi_source else None) ## We translate the batch.
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/transformers/generation_utils.py", line 847, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/transformers/generation_utils.py", line 379, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home2/shivprasad.sagare/miniconda3/envs/indicbart/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'additional_input_ids'
