alexgidiotis / document-classifier-lstm

A bidirectional LSTM with attention for multiclass/multilabel text classification.

License: MIT License

Languages: Python 100.00%

Topics: keras, tensorflow, multilabel-multiclass, lstm, arxiv, text-classification, recurrent-neural-networks, attention-mechanism, hierarchical-attention-networks

document-classifier-lstm's Introduction

Document-Classifier-LSTM

Recurrent neural networks for multiclass, multilabel classification of texts. The models learn to tag small texts with 169 different tags from arXiv.

classifier.py implements a standard BLSTM network with attention.
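
As a rough illustration (not the repository's exact code), a BLSTM with a simple additive attention pooling might look like the sketch below; the sequence length, vocabulary size, and layer widths are illustrative assumptions:

```python
# Minimal sketch of a bidirectional LSTM with additive attention pooling.
# All hyperparameters (MAX_LEN, VOCAB_SIZE, ...) are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB_SIZE, EMBED_DIM, NUM_TAGS = 200, 50000, 300, 169

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)  # (batch, time, 256)

# Attention: score each timestep, softmax over time, then weighted sum.
scores = layers.Dense(1, activation="tanh")(h)   # (batch, time, 1)
weights = layers.Softmax(axis=1)(scores)         # attention weights over time
context = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])

# One sigmoid per tag: several tags can fire at once (multilabel).
outputs = layers.Dense(NUM_TAGS, activation="sigmoid")(context)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```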

In hatt_classifier.py you can find the implementation of Hierarchical Attention Networks for Document Classification.

The neural networks were built using Keras and TensorFlow.

The best-performing model is the attention BLSTM, which achieves a micro F1 score of 0.67 on the test set.

The Hierarchical Attention Network achieves only a 0.65 micro F1 score.

I am using 500k paper abstracts from arXiv. In order to download your own data, refer to the arXiv OAI API.

Pretrained word embeddings can be used; the embeddings can be either GloVe or Word2Vec. You can download GoogleNews-vectors-negative300.bin or the GloVe embeddings.
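
A hedged sketch of wiring pretrained vectors into an embedding matrix follows; the file name and the word_index mapping are placeholders, since in practice word_index comes from the tokenizer used during preprocessing:

```python
# Sketch: build an embedding matrix from a GloVe text file. The file name
# and word_index below are placeholders, not the repository's variables.
import numpy as np

EMBED_DIM = 300
word_index = {"neural": 1, "network": 2}  # example; normally from the tokenizer

embeddings = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i holds the vector for the word with id i; unknown words stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    vec = embeddings.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
```

The matrix can then be handed to the Embedding layer as its initial weights and, if desired, frozen with trainable=False.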

Usage:

  1. In order to train your own model, you must first prepare your dataset using the data_prep.py script. Preprocessing converts the text to lower case, tokenizes it, and removes very short words. The preprocessed files and label files should be saved in a /data folder.

  2. You can now run classifier.py or hatt_classifier.py to build and train the models.

  3. The trained models are exported to JSON and the weights to h5 for later use (see the sketch after this list).

  4. You can use utils.visualize_attention to visualize the attention weights.
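
Here is a sketch of the export/reload round trip from step 3, with illustrative file names and a tiny stand-in model so it runs end to end; a model containing a custom attention layer would additionally need that class passed via custom_objects when reloading:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.models import model_from_json

# Tiny stand-in model so the round trip below is runnable end to end.
inp = layers.Input(shape=(10,))
model = Model(inp, layers.Dense(1)(inp))

# Export: architecture to JSON, weights to HDF5.
with open("model.json", "w") as f:
    f.write(model.to_json())
model.save_weights("model.weights.h5")

# Reload later (a custom attention layer would need custom_objects={...}).
with open("model.json") as f:
    model = model_from_json(f.read())
model.load_weights("model.weights.h5")
```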

Requirements

Run pip install -r requirements.txt to install the requirements.

document-classifier-lstm's People

Contributors

alexgidiotis, dependabot[bot], shuttle1987


document-classifier-lstm's Issues

Saving the trained model and then running it on the test dataset

Hello,
As I understand it, there usually has to be a training set, a validation set, and a test set. The machine learning model trains on the training set and its parameters are tuned on the validation set. The model is then saved and applied to the test set. However, going through hatt.py, I saw the model being saved as a checkpoint, but I did not find any code that applies that model to a test set. Am I right, or have I missed something in the code? The model should be saved and then applied to the test set. In your case, did you not use a test set?
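
For what it's worth, a hedged sketch of applying a saved checkpoint to a held-out test set (the checkpoint path and test data path are assumptions; a custom attention layer would still need custom_objects):

```python
import numpy as np
from tensorflow.keras.models import load_model

# compile=False skips restoring custom metrics such as f1_score, which is
# fine for inference. Paths below are placeholders.
model = load_model("checkpoints/weights.best.h5", compile=False)

X_test = np.load("data/X_test.npy")   # padded test sequences (assumed)
probs = model.predict(X_test)         # one probability per tag
preds = (probs > 0.5).astype(int)     # threshold each tag independently
```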

Example use cases

Hi,

This library seems perfect for my needs. However, I'm struggling to figure out how to use the various files.

I would greatly appreciate it if someone could create an example.py that imports these files and trains the BLSTM on a basic CSV of the form:

label, "... all my text here ..."
other_label, "other labels represent different text. Not all text is the same length."
label, "also, text may contain commas. That's the point of these speech marks."

Thanks in advance. I am not great at Python and have limited experience with TensorFlow, so if someone could even just provide me with an implementation they used on a project to help me figure it out, I would be very grateful.
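
For reference, a minimal, hedged example.py sketch for a CSV of that shape (not part of the repository; the file name, column handling, and hyperparameters are assumptions):

```python
# Sketch: read a `label, "text"` CSV, vectorize, and train a small BLSTM.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, Model

# skipinitialspace handles the space after the comma in `label, "text"`.
df = pd.read_csv("train.csv", header=None, names=["label", "text"],
                 skipinitialspace=True)

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(df["text"])
X = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=200)

mlb = MultiLabelBinarizer()                   # one 0/1 column per distinct label
y = mlb.fit_transform(df["label"].str.split())

inputs = layers.Input(shape=(200,))
h = layers.Embedding(20000, 128)(inputs)
h = layers.Bidirectional(layers.LSTM(64))(h)
outputs = layers.Dense(y.shape[1], activation="sigmoid")(h)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.1)
```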

Regarding precision and recall

Hello,

In the hatt_classifier.py file you have defined a function called f1_score. I want to calculate precision and recall separately. When calculating them as in f1_score, do I have to apply the tf.reduce_mean function to them?

For f1_score you applied the reduce_mean function at the end. Do I have to do the same for precision and recall?
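
A hedged sketch of precision and recall written in the same style, where K.mean plays the role of tf.reduce_mean by averaging the per-sample scores over the batch (this mirrors the shape of the repo's f1_score but is not copied from it):

```python
from tensorflow.keras import backend as K

def precision(y_true, y_pred):
    # Round probabilities to 0/1 predictions, compute per-sample precision,
    # then average over the batch (the reduce_mean step).
    y_pred = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred, axis=-1)
    predicted_pos = K.sum(y_pred, axis=-1)
    return K.mean(tp / (predicted_pos + K.epsilon()))

def recall(y_true, y_pred):
    # Same pattern, normalizing by the number of actual positives instead.
    y_pred = K.round(K.clip(y_pred, 0, 1))
    tp = K.sum(y_true * y_pred, axis=-1)
    actual_pos = K.sum(y_true, axis=-1)
    return K.mean(tp / (actual_pos + K.epsilon()))
```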

Regarding Multilabel classification

Hello,

I wanted to know about the multi-label classification implementation. Did you implement the hierarchical attention mechanism with multi-label classification? The reason I am asking is that I searched a lot on the internet and did not find any implementation. I have not looked through the code, but if you give a positive reply I will start looking at it, because this is exactly what I want to achieve.

My next question: I looked at your data folder. There is a CSV file with labels and another with data. Right now I have an Excel file, which I am attaching here. The Excel file has multiple columns, with 1 indicating that the article belongs to the topic and 0 indicating that it does not. Could you please suggest how I can implement this with your code?
train.xlsx
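
A hedged sketch of converting such a one-hot spreadsheet into the label-list form used by the repo's CSVs (the column names are assumptions about the attached file):

```python
import pandas as pd

# Assumed layout: one "text" column plus one 0/1 column per topic.
df = pd.read_excel("train.xlsx")
topic_cols = [c for c in df.columns if c != "text"]

# Collect, for each row, the names of the topics marked with 1.
df["labels"] = df[topic_cols].apply(
    lambda row: [c for c in topic_cols if row[c] == 1], axis=1)
```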

Regarding better accuracy with BILSTM Attention

You mentioned in your README file that you achieved better accuracy with BLSTM attention than with HAN. What do you mean by BLSTM attention? I want to know the architecture. Are you referring to any specific research paper? How is the architecture different from HAN?

Regarding Visualize Attention

Where should I call visualize_attention from? I am going through your code and have not been able to decide where to call it. Can you tell me where to call the visualize-attention code, as in hatt.py? I have my own dataset and I am trying to visualize which words caused the labels to be assigned.

Regarding HAN implementation

The HAN paper considers only words that appear more than 5 times. I don't think this is implemented in the code. Also, does stop-word removal take place in the paper? As I mentioned, if stop words are repeated 5 or more times, then even they would have to be considered. What are your views on this?
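
A sketch of the paper's vocabulary rule (keep only words seen more than 5 times, map the rest to an unknown token); the variable names and example input are illustrative:

```python
from collections import Counter

tokenized_docs = [["deep", "learning", "model"],
                  ["deep", "attention", "network"]]  # example input

# Count word frequencies over the whole corpus, keep words seen > 5 times,
# and replace everything else with an <unk> placeholder.
counts = Counter(w for doc in tokenized_docs for w in doc)
vocab = {w for w, c in counts.items() if c > 5}
docs = [[w if w in vocab else "<unk>" for w in doc] for doc in tokenized_docs]
```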

Regarding custom dataset

I have my own custom dataset. The dataset is in exactly the same form as the label file and sample data you uploaded. I assume this means I don't have to run data_prep.py now. Am I right?

Is there anything that I should skip in hatt_classifier.py as well, like some lines in the load_data function?
