sbdsubjectclassifier: Scholarly Big Data Subject Category Classifier

In this project, we classify research papers into their subject areas. We split each abstract into words and convert each word into an n-dimensional word embedding, which projects the text into a vector space where each word is one time step. To select the words, we compute TF-IDF weights, keep the most important words in the abstract, and sort them by TF-IDF value. To classify the data, we use two flavors of recurrent neural networks (RNNs): LSTM and GRU. We tried several word embedding (WE) models, including GloVe, SciBERT, and FastText. For comparison with these models, we also evaluate the Universal Sentence Encoder (USE) with a multi-layer perceptron (MLP) and a character-level CNN.
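As a rough illustration of the projection described above, the following sketch maps a list of words to a fixed-length sequence of embedding vectors, one vector per time step. The toy embedding table, the zero vector used for unknown words and padding, and the function name `embed_sequence` are assumptions made for this example; they are not taken from this project's code.

```python
from typing import Dict, List


def embed_sequence(words: List[str],
                   embeddings: Dict[str, List[float]],
                   max_len: int,
                   dim: int) -> List[List[float]]:
    """Map words to vectors, truncating or zero-padding to max_len time steps."""
    zero = [0.0] * dim
    # Unknown words fall back to the zero vector (an assumption of this sketch).
    rows = [list(embeddings.get(w, zero)) for w in words[:max_len]]
    # Pad short sequences so every abstract has the same number of time steps.
    rows.extend(list(zero) for _ in range(max_len - len(rows)))
    return rows


# Toy 3-dimensional embedding table; real tables such as GloVe are far larger.
toy_embeddings = {
    "neural": [0.1, 0.2, 0.3],
    "network": [0.4, 0.5, 0.6],
}

matrix = embed_sequence(["neural", "network", "xyz"], toy_embeddings,
                        max_len=4, dim=3)
print(len(matrix), len(matrix[0]))
```

The result is a max_len x dim matrix per abstract, which is the shape of input a recurrent layer typically consumes.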

Requirements:

Keras
tensorflow (tensorflow_gpu recommended)
nltk
pandas
sklearn
tensorflow_hub (for USE)

Models

RNNs with WE: This model takes the path to the text data (a .csv file with data and labels) and the path to the WE file as inputs, and classifies the data using RNNs.

Step 1: Clean the data.

To clean and order the data, use the tfidf_ordering module. It splits the sentences into words and removes stopwords, punctuation, and numbers. The remaining words are lemmatized and then sorted by TF-IDF value. The module takes 3 arguments:

data_path : path to the text data.
max_len (optional) : maximum number of words to retain in each text sequence (in our case, each abstract).
tfidf_sorting (optional) : boolean value. Set to True to sort the words by TF-IDF value; False retains the original word order instead of sorting.

from tfidf_ordering import tfidf_ordering
tfidf_ordering(data_path, tfidf_sorting=True, max_len=80)

The cleaned data will be saved in a csv file named 'final_tfidf_ordered_data.csv'.
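To make the ordering step more concrete, here is a minimal sketch of TF-IDF sorting over already-tokenized documents, written with the standard library only. It is not the tfidf_ordering module itself: the function name `tfidf_order`, the smoothed IDF formula, the choice to keep each word once, and the alphabetical tie-breaking are all assumptions of this example.

```python
import math
from collections import Counter
from typing import List


def tfidf_order(docs: List[List[str]], max_len: int = 80) -> List[List[str]]:
    """For each tokenized document, keep the max_len unique words with the highest TF-IDF."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    ordered = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times a smoothed IDF; other variants exist.
        score = {w: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                 for w, c in tf.items()}
        # Sort by descending score; ties are broken alphabetically for determinism.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        ordered.append(ranked[:max_len])
    return ordered


docs = [
    ["graph", "neural", "network", "graph"],
    ["neural", "network", "training"],
    ["protein", "folding", "network"],
]
for words in tfidf_order(docs, max_len=3):
    print(words)
```

Words that appear in every document (here "network") receive the lowest IDF and therefore sink to the end of each list.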

Step 2: Run the model.

After cleaning the data, build and run the model to classify the data. Arguments for the model are:

abstracts_path : path to the cleaned data, i.e. 'final_tfidf_ordered_data.csv'.
WE_path : path to the WE file.
max_len (optional) : maximum number of words to retain in each text sequence (in our case, each abstract). Default: 80.
nodes (optional) : number of RNN cells in each layer. Default: 128.
layers (optional) : number of RNN layers. Default: 2.
loss (optional) : loss function. Default: 'categorical_crossentropy'.
optimizer (optional) : Default: 'Adam'.
activation (optional) : Default: 'tanh'.
dropout (optional) : dropout fraction. Default: 0.2.
batch_size (optional) : size of each batch for stochastic gradient descent. Default: 1000.
epochs (optional) : number of epochs for training.
gpus (optional) : number of GPUs for multi-GPU training. Default: None, which runs the model on the CPU.
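The sketch below shows one way arguments like these could map onto a stacked recurrent model in Keras. It is an illustration under stated assumptions, not the code in RnnModelMain: the function `build_rnn`, the parameters `emb_dim`, `num_classes`, and `cell`, and the choice to pass dropout directly to each recurrent layer are all hypothetical.

```python
from tensorflow import keras


def build_rnn(max_len=80, emb_dim=300, num_classes=10, cell="lstm",
              nodes=128, layers=2, activation="tanh", dropout=0.2,
              loss="categorical_crossentropy", optimizer="adam"):
    """Stack `layers` recurrent layers of `nodes` units over embedded word sequences."""
    rnn_cls = keras.layers.LSTM if cell == "lstm" else keras.layers.GRU
    model = keras.Sequential()
    # Each abstract is a (max_len, emb_dim) matrix: one embedding per time step.
    model.add(keras.Input(shape=(max_len, emb_dim)))
    for i in range(layers):
        # Every layer except the last returns the full sequence so that the
        # next recurrent layer receives one vector per time step.
        model.add(rnn_cls(nodes, activation=activation, dropout=dropout,
                          return_sequences=(i < layers - 1)))
    # One softmax output per subject category (num_classes is hypothetical).
    model.add(keras.layers.Dense(num_classes, activation="softmax"))
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    return model


model = build_rnn()
model.summary()
```

Passing `cell="gru"` swaps the LSTM layers for GRU layers without changing anything else in the sketch.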

Run the following to create a Model object, which takes abstracts_path and WE_path as input (add optional arguments as needed) and prints the accuracy and F1-score of the classification.

from RnnModelMain import Model
Model(abstracts_path, WE_path)
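For readers unfamiliar with the two reported metrics, the following sketch computes accuracy and a macro-averaged F1-score from lists of true and predicted labels. The macro averaging and the example labels are assumptions; the source does not say which F1 averaging the model prints.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical subject labels for four abstracts.
y_true = ["cs", "cs", "bio", "math"]
y_pred = ["cs", "bio", "bio", "math"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```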

Character-level CNN: Similarly, to run the character-level CNN model, use the following:

from char_cnn import char_cnn
char_cnn(abstracts_path)

Optional arguments are: batch_size, epochs, and gpus.
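A character-level CNN reads text as a sequence of character indices rather than words. The sketch below shows one plausible encoding step; the alphabet, the use of index 0 for both padding and unknown characters, and the function name `encode_chars` are assumptions of this example and may differ from what char_cnn does.

```python
import string
from typing import List

# Assumed alphabet: lowercase letters, digits, punctuation, and the space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_chars(text: str, max_chars: int) -> List[int]:
    """Lowercase the text and map it to a fixed-length list of character indices."""
    ids = [CHAR_TO_INDEX.get(ch, 0) for ch in text.lower()[:max_chars]]
    # Right-pad with zeros so every input has exactly max_chars entries.
    ids.extend([0] * (max_chars - len(ids)))
    return ids


print(encode_chars("Deep CNN", max_chars=10))
```

The resulting integer sequences can then be one-hot encoded or fed to an embedding layer ahead of the convolutional layers.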

USE with MLP: To run the USE with MLP model, use the following:

from use_with_mlp import MlpModelWithUSE
MlpModelWithUSE(abstracts_path)

Optional arguments are: nodes, layers, loss, optimizer, activation, dropout, batch_size, epochs, and gpus.

Repository: seerlabs/sbdsubjectclassifier (GitHub)