calbert's Introduction

CalBERT - Code-mixed Adaptive Language representations using BERT

This repository contains the source code for CalBERT - Code-mixed Adaptive Language representations using BERT, published at AAAI-MAKE 2022, Stanford University. The code is written in Python and released under the MIT License.

CalBERT can be used to adapt existing Transformer language representations to another, similar language by minimising the semantic space between equivalent sentences in the two languages, thus allowing the Transformer to learn representations for words across both. It relies on a novel pre-training architecture called Siamese Pre-training to learn task-agnostic and language-agnostic representations. For more information, please refer to the paper.
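
As an intuition only, Siamese Pre-training can be thought of as pulling together the embeddings of a sentence and its code-mixed equivalent. Below is a minimal sketch of such an objective, assuming mean-squared error as the distance measure; CalBERT's actual loss may differ, so see the paper for the exact formulation.

import torch

def siamese_pretraining_loss(base_embedding, target_embedding):
    # Illustrative only: minimise the distance between the embeddings of
    # translationally equivalent sentences in the base and target
    # languages. See the paper for CalBERT's actual objective.
    return torch.nn.functional.mse_loss(base_embedding, target_embedding)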

This framework allows you to perform CalBERT's Siamese Pre-training to learn representations for your own data, and can be used to obtain dense vector representations for words, sentences or paragraphs. The base models used to train CalBERT are BERT-style Transformer models such as BERT, RoBERTa, XLM, XLNet, DistilBERT and so on. CalBERT achieves state-of-the-art results on the SAIL and IIT-P Product Reviews datasets. It is also one of the few models able to learn code-mixed language representations without traditional pre-training methods, and currently one of the few models available for Indian code-mixed languages such as Hinglish.

Installation

We recommend Python 3.9 or higher for CalBERT.

Install PyTorch

Follow PyTorch - Get Started for further details on how to install PyTorch with or without CUDA.
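
For example, a typical installation via pip (check the PyTorch - Get Started page for the exact command matching your platform and CUDA setup):

pip install torch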

Install CalBERT

Install with pip

pip install calbert

Install from source

You can also clone the repository and install the package directly from source.

git clone https://github.com/aditeyabaral/calbert
cd calbert
pip install -e .

Getting Started

You can read the docs to learn more about how to train CalBERT for your own use case.

The following example shows you how to use CalBERT to obtain sentence embeddings.
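
A minimal sketch follows; the `embed_sentences` method name is an assumption for illustration, so consult the docs for the exact API.

from calbert import CalBERT

model = CalBERT('bert-base-uncased')
# NOTE: embed_sentences is a hypothetical method name used for
# illustration -- check the CalBERT docs for the actual API
embeddings = model.embed_sentences([
    "This movie is awesome!",
    "Mujhe yeh movie bahut awesome lagi!"
])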

Training

This framework also allows you to train your own CalBERT models on your own code-mixed data, so you can learn embeddings for your custom code-mixed languages. There are various options to choose from in order to get the best embeddings for your language.

First, initialise a model with the base Transformer:

from calbert import CalBERT
model = CalBERT('bert-base-uncased')

Create a CalBERTDataset using your sentences:

from calbert import CalBERTDataset
base_language_sentences = [
   "I am going to Delhi today via flight",
   "This movie is awesome!"
]
target_language_sentences = [
   "Main aaj flight lekar Delhi ja raha hoon.",
   "Mujhe yeh movie bahut awesome lagi!"
]
dataset = CalBERTDataset(base_language_sentences, target_language_sentences)

Then create a trainer and train the model:

from calbert import SiamesePreTrainer
trainer = SiamesePreTrainer(model, dataset)
trainer.train()
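
SiamesePreTrainer and train() also accept training options such as batch size, epochs and learning rate. The keyword names in this sketch are assumptions for illustration rather than the exact signature; check the docs for the parameters actually accepted:

# Keyword names below are hypothetical -- consult the documentation
# for the parameters SiamesePreTrainer and train() actually accept
trainer = SiamesePreTrainer(model, dataset, learning_rate=2e-5)
trainer.train(batch_size=16, epochs=10)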

Performance

Our models achieve state-of-the-art results on the SAIL and IIT-P Product Reviews datasets.

More information will be added soon.

Application and Uses

This framework can be used for:

  • Computing code-mixed as well as plain sentence embeddings
  • Obtaining the semantic similarity between any two sentences (see the sketch below)
  • Other textual tasks such as clustering, text summarization, semantic search and more
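
Since CalBERT produces dense vectors, semantic similarity reduces to comparing two embeddings, typically with cosine similarity. A minimal sketch, again assuming a hypothetical `embed_sentences` method that returns a tensor of sentence embeddings:

import torch
from calbert import CalBERT

model = CalBERT('bert-base-uncased')
# embed_sentences is a hypothetical method name -- see the docs
embeddings = model.embed_sentences([
    "This movie is awesome!",
    "Mujhe yeh movie bahut awesome lagi!"
])
# Cosine similarity between the two sentence vectors
similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(similarity.item())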

Citing and Authors

If you find this repository useful, please cite our publication CalBERT - Code-mixed Adaptive Language representations using BERT.

@inproceedings{calbert-baral-et-al-2022,
  author    = {Aditeya Baral and
               Aronya Baksy and
               Ansh Sarkar and
               Deeksha D and
               Ashwini M. Joshi},
  editor    = {Andreas Martin and
               Knut Hinkelmann and
               Hans{-}Georg Fill and
               Aurona Gerber and
               Doug Lenat and
               Reinhard Stolle and
               Frank van Harmelen},
  title     = {CalBERT - Code-Mixed Adaptive Language Representations Using {BERT}},
  booktitle = {Proceedings of the {AAAI} 2022 Spring Symposium on Machine Learning
               and Knowledge Engineering for Hybrid Intelligence {(AAAI-MAKE} 2022),
               Stanford University, Palo Alto, California, USA, March 21-23, 2022},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {3121},
  publisher = {CEUR-WS.org},
  year      = {2022},
  url       = {http://ceur-ws.org/Vol-3121/short3.pdf},
  timestamp = {Fri, 22 Apr 2022 14:55:37 +0200}
}

Contact

Please feel free to email us to report any issues or suggestions, or if you have any further questions.

Contact: Aditeya Baral, [email protected]

You can also contact the other maintainers listed below.

calbert's People

Contributors

abaksy, aditeyabaral, anshsarkar

calbert's Issues

Add `NextSentencePrediction` as objective to `SiamesePreTrainer`

Add Next Sentence Prediction (NSP) as an objective to the SiamesePreTrainer class. This will also require changes to the CalBERT class to allow a Transformer model to be passed as input for model_path. The NSP head can be extracted from the AutoModel classes available on Hugging Face.

Updated documentation on README

The README requires documentation updates covering model initialisation, the loss functions, and so on. Extensive documentation is expected for each function and class in the readthedocs documentation, and a few items, such as performance metrics, need to be added from the paper as well.

Add MLM as an objective to `SiamesePreTrainer`

Add Masked Language Modelling (MLM) as an objective to the SiamesePreTrainer class. This will also require changes to the CalBERT class to allow a Transformer model to be passed as input for model_path. The MLM head can be extracted from the AutoModel classes available on Hugging Face.
