This project forked from adamshamsudeen/vaaku2vec

Language Modeling and Text Classification in Malayalam Language using ULMFiT

License: GNU General Public License v3.0

Vaaku2Vec

State-of-the-Art Language Modeling and Text Classification in Malayalam Language

Results

We trained a Malayalam language model on the Wikipedia article dump from October 2018, which contained more than 55k articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. The current model uses the nltk tokenizer (we plan to try better alternatives in the future) with a vocabulary size of 30k. The language model was then used to train a classifier that assigns a news article to one of 5 categories (India, Kerala, Sports, Business, Entertainment). Our classifier reached a whopping 92% accuracy on this classification task.
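
To illustrate why a fixed vocabulary is a challenge for an agglutinative language, the sketch below builds a frequency-capped vocabulary and measures how many tokens of new text fall outside it. It is a minimal illustration, not the project's code: the whitespace tokenizer is a stand-in for the nltk tokenizer, and `tokenize`, `build_vocab`, `oov_rate`, and the `_unk_` marker are hypothetical names chosen for this example.

```python
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): plain whitespace split.
    # The project itself uses an nltk tokenizer.
    return text.split()


def build_vocab(docs, max_vocab=30000):
    # Keep only the max_vocab most frequent tokens.
    # Index 0 is reserved for unknown tokens.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    itos = ["_unk_"] + [tok for tok, _ in counts.most_common(max_vocab)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def oov_rate(docs, stoi):
    # Fraction of tokens that are not covered by the vocabulary.
    toks = [tok for doc in docs for tok in tokenize(doc)]
    if not toks:
        return 0.0
    unknown = sum(1 for tok in toks if tok not in stoi)
    return unknown / len(toks)


if __name__ == "__main__":
    train = ["കേരളം ഇന്ത്യയിലെ ഒരു സംസ്ഥാനം", "കേരളം മനോഹരം"]
    itos, stoi = build_vocab(train, max_vocab=3)
    print(len(itos), oov_rate(["കേരളം പുതിയ വാക്ക്"], stoi))
```

Because inflected and compounded word forms each count as a separate token, a whitespace-style tokenizer tends to leave many rare forms outside the cap, which is one motivation for the subword tokenizers listed in the TODO section.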

Releases

  • Processed Wikipedia dump of articles, split into test and train sets.
  • Script and weights for Malayalam Language model.
  • Malayalam text classifier with pretrained weights.
  • Inference code for text classifier.

Downloads

Requirements

Installing dependencies

Python 3.6 or later

If you are using virtualenvwrapper use the following steps:

  1. git clone https://github.com/adamshamsudeen/Vaaku2Vec.git
  2. mkvirtualenv -p python3.6 venv
  3. workon venv
  4. cd Vaaku2Vec
  5. pip install -r requirements.txt

Usage

Training language model with preprocessed data:

  1. Download the pretrained language model folder; it contains the preprocessed test and train CSV files. If you would like to preprocess the latest article dump and retrain the LM yourself, use the scripts provided here.
  2. Create tokens:
    python lm/create_toks.py <path_to_processed_wiki_dump>
    eg: python lm/create_toks.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
  3. Create a token to id mapping:
    python lm/tok2id.py <path_to_processed_wiki_dump>
    eg: python lm/tok2id.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/
  4. Train language model:
    python lm/pretrain_lm.py <path_to_processed_wiki_dump> 0 --lr 1e-3 --cl 40
    eg: python lm/pretrain_lm.py /home/adamshamsudeen/mal/Vaaku2Vec/wiki/ml/ 0 --lr 1e-3 --cl 40
    Here lr is the learning rate and cl is the number of epochs.
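
The three steps above can also be chained from Python. The sketch below only assembles and runs the commands exactly as listed; `build_commands` and `run_pipeline` are hypothetical helper names, it assumes it is run from the repository root, and the positional `0` is copied from the documented command without interpreting it.

```python
import subprocess
import sys


def build_commands(wiki_dir, positional="0", lr="1e-3", cl="40"):
    # Mirror the three documented commands, in order.
    # sys.executable keeps the calls inside the active virtualenv.
    # `positional` is the bare 0 from the documented pretrain command;
    # its meaning is not stated in this README.
    return [
        [sys.executable, "lm/create_toks.py", wiki_dir],
        [sys.executable, "lm/tok2id.py", wiki_dir],
        [sys.executable, "lm/pretrain_lm.py", wiki_dir, positional,
         "--lr", lr, "--cl", cl],
    ]


def run_pipeline(wiki_dir):
    for cmd in build_commands(wiki_dir):
        # check=True stops the pipeline as soon as one step fails.
        subprocess.run(cmd, check=True)


# Example call (replace the placeholder with your own path):
# run_pipeline("<path_to_processed_wiki_dump>")
```

Stopping at the first failing step avoids starting a long training run on missing or incomplete token files.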

Training the classifier:

  1. Use train_classifier.ipynb to train a Malayalam text classifier.
  2. We have not released the news dataset; raise a request if you want to experiment with it.

Testing the classifier:

  1. To test the classifier trained on Manorama news, download the Pretrained Malayalam Text Classifier mentioned in the Downloads section.
  2. Use prediction.ipynb and test out your input.

We manually tested the model on news from other leading newspapers, and the model performed pretty well.

Word2Vec:

  1. We also trained a word2vec model using gensim with the Wikipedia dump.
  2. You can also use the word2vec model to train a text classifier (see News Classifier).
  3. You can see the word2vec demo at the link below.

Demo
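
One common way to feed word vectors into a text classifier is to average the vectors of the words in a document. The sketch below shows only that averaging step, under stated assumptions: it is not taken from the linked News Classifier, `average_vector` is a hypothetical helper, and the toy vectors are made up rather than produced by the gensim model.

```python
def average_vector(tokens, vectors, dim):
    # Mean of the vectors of known tokens.
    # Returns a zero vector if no token is in the vocabulary.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


if __name__ == "__main__":
    # Toy 2-dimensional vectors (made up for illustration only).
    toy = {"cricket": [1.0, 0.0], "match": [0.0, 1.0]}
    print(average_vector(["cricket", "match", "unknown"], toy, dim=2))
```

The resulting fixed-length vector can then be passed to any standard classifier as the feature representation of the document.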

TODO

  • Malayalam language modeling based on Wikipedia articles.
  • Release Trained Language Models weights.
  • Malayalam Text classifier script.
  • Benchmark with mlmorph for tokenization.
  • Benchmark with byte pair encoding for tokenization.
  • UI to train and test classifier.
  • Basic Chatbot using this implementation.

Thanks

  1. Special thanks to Sebastian Ruder, Jeremy Howard, and the other contributors to fastai and ULMFiT.
  2. Logo base design
  3. Raeesa for designing the logo.

Contributors

  1. Kamal K Raj
  2. Adam Shamsudeen

