kathir-ks / indictranstokenizer

This project forked from varungumma/indictranstokenizer

A simple, consistent and extendable module for IndicTrans2 tokenizer

License: MIT License

Python 100.00%

indictranstokenizer's Introduction

IndicTransTokenizer

The goal of this repository is to provide a simple, modular, and extendable toolkit for IndicTrans2 that is compatible with the released HuggingFace models.

Changelog

Major Update (v1.0.0)

  • The PreTrainedTokenizer for IndicTrans2 is now available on HuggingFace (HF) 🎉🎉 Note that you still need the IndicProcessor to pre-process the sentences before tokenization.
  • In favor of the standard PreTrainedTokenizer, we have deprecated the custom tokenizer. However, this custom tokenizer will still be available here for backward compatibility, but no further updates or bug fixes will be provided.
  • The indic_evaluate function is now consolidated into a concrete IndicEvaluator class.
  • The data collation function for training is consolidated into a concrete IndicDataCollator class.
  • A simple batching method is now available in the IndicProcessor.

Update (v1.0.1)

  • Added an argument to show a progress bar during preprocessing (show_progress_bar=True).
  • Added an argument to prepend additional tags such as __bt__ and __ft__, similar to the IT2 BT/FT data preprocessing (additional_tag="__bt__").

Pre-requisites

Configuration

  • Editable installation (Note, this may take a while):
git clone https://github.com/VarunGumma/IndicTransTokenizer
cd IndicTransTokenizer

pip install --editable ./

Examples

For the training use case, please refer here. Do not use the custom tokenizer to train or fine-tune models: training with the custom tokenizer is untested and can lead to unexpected results.

PreTrainedTokenizer

import torch
from IndicTransTokenizer import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
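
The tokenizer call above passes padding="longest", truncation=True and max_length=256. As a rough illustration of what those options mean, here is a minimal pure-Python sketch on toy token ids. The helper name pad_batch, the pad id and the padding side are assumptions made for illustration only; the real tokenizer also handles special tokens and may pad on the other side.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]],
    pad_id: int = 0,
    max_length: int = 256,
    side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: truncate to max_length, then pad to the longest sequence."""
    # Truncation: keep at most max_length ids per sequence.
    truncated = [seq[:max_length] for seq in sequences]
    # padding="longest": the target length is the longest sequence in this batch.
    longest = max((len(seq) for seq in truncated), default=0)

    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in truncated:
        pad = [pad_id] * (longest - len(seq))
        pad_mask = [0] * len(pad)   # 0 marks padding positions
        real_mask = [1] * len(seq)  # 1 marks real tokens
        if side == "left":
            input_ids.append(pad + seq)
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(seq + pad)
            attention_mask.append(real_mask + pad_mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch of two "sentences" with made-up token ids.
toy = pad_batch([[5, 6, 7], [8, 9]], pad_id=1)
print(toy["input_ids"])
print(toy["attention_mask"])
```

Padding to the longest sequence in the batch, rather than to a fixed max_length, keeps the tensors as small as each batch allows.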

Custom Tokenizer (DEPRECATED)

import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

tokenizer = IndicTransTokenizer(direction="en-indic")
ip = IndicProcessor(inference=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, src=True, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

outputs = tokenizer.batch_decode(outputs, src=False)
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']

Evaluation

  • IndicEvaluator is a Python implementation of compute_metrics.sh.
  • We have found that this Python implementation gives slightly lower scores than the original compute_metrics.sh, so please use this class cautiously, and feel free to raise a PR if you find the bug or a fix.
from IndicTransTokenizer import IndicEvaluator

evaluator = IndicEvaluator()
# evaluate() returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
# tgt_lang, pred_file and ref_file are placeholders for your own values
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file)

# alternatively, you can pass the list of predictions and references instead of files 
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)
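
The snippet above accepts predictions and references either as files or as lists. The sketch below shows one way such dual input could be normalised before scoring. It is not the IndicEvaluator implementation: read_lines and check_parallel are hypothetical helpers, and the one-sentence-per-line file layout is an assumption.

```python
from pathlib import Path
from typing import List, Tuple, Union

Source = Union[str, Path, List[str]]


def read_lines(source: Source) -> List[str]:
    """Hypothetical helper: return sentences from a list or from a text file."""
    if isinstance(source, (str, Path)):
        # Assumption: the file holds one sentence per line, UTF-8 encoded.
        with open(source, encoding="utf-8") as handle:
            return [line.strip() for line in handle]
    return [sentence.strip() for sentence in source]


def check_parallel(preds: Source, refs: Source) -> Tuple[List[str], List[str]]:
    """Load both sides and make sure they line up sentence by sentence."""
    pred_lines, ref_lines = read_lines(preds), read_lines(refs)
    if len(pred_lines) != len(ref_lines):
        raise ValueError(
            f"{len(pred_lines)} predictions but {len(ref_lines)} references"
        )
    return pred_lines, ref_lines


preds, refs = check_parallel(["a b c ", "d e"], ["a b c", "d f"])
print(preds, refs)
```

Checking that both sides have the same number of lines before scoring catches misaligned files early, since corpus-level metrics compare predictions and references position by position.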

Batching

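from IndicTransTokenizer import IndicProcessor

ip = IndicProcessor(inference=True)

# source_sentences is a placeholder for your own list of input sentences
for batch in ip.get_batches(source_sentences, batch_size=32):
    # perform the necessary operations on the batch:
    # ... pre-processing
    # ... tokenization
    # ... generation
    # ... decoding
    pass  # replace with the steps listed above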
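
For intuition, fixed-size batching of this kind can be sketched in a few lines of plain Python. This is not the IndicProcessor implementation: iter_batches is a hypothetical stand-in, and the upper-casing step merely takes the place of the real pre-processing, tokenization, generation and decoding.

```python
from typing import Iterator, List


def iter_batches(sentences: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size sentences."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


translations: List[str] = []
for batch in iter_batches(["a", "b", "c", "d", "e"], batch_size=2):
    # Stand-in for the real per-batch work; results are collected in input order.
    translations.extend(sentence.upper() for sentence in batch)

print(translations)
```

The last batch may be smaller than batch_size, so downstream code should not assume a fixed batch length.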

Authors

Bugs and Contribution

Since this is a bleeding-edge module, you may occasionally encounter breakage and import issues. If you encounter any bugs or want additional functionality, please feel free to raise Issues/Pull Requests or contact the authors.

Citation

If you use our codebase, models, or tokenizer, please cite the following paper:

@article{
    gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA},
    note={}
}

indictranstokenizer's People

Contributors

kathir-ks, pranjalchitale, varungumma
