dheeraj7596 / scdv

Text classification with Sparse Composite Document Vectors.

Home Page: https://dheeraj7596.github.io/SDV/

License: MIT License

scdv's Introduction

Text Classification with Sparse Composite Document Vectors (SCDV)

Introduction

Citation

If you find SCDV useful in your research, please consider citing:

@inproceedings{mekala2017scdv,
  title={SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations},
  author={Mekala, Dheeraj and Gupta, Vivek and Paranjape, Bhargavi and Karnick, Harish},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  pages={659--669},
  year={2017}
}

New Features

  • Python 3.7. As Python 2 is deprecated, the whole repository has been moved to Python 3.7.
  • FastText support. Word vectors can be trained with either Word2Vec or FastText.

Testing

There are two folders, 20news and Reuters, which contain the code for multi-class classification on the 20Newsgroup dataset and multi-label classification on the Reuters dataset, respectively.

20Newsgroup

Change directory to 20news for experimenting on 20Newsgroup dataset and create train and test tsv files as follows:

$ cd 20news
$ python create_tsv.py

Get word vectors for all words in vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes word vector dimension as an argument. We took it as 200.

Get word vectors for all words in vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes word vector dimension as an argument. We took it as 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters and model_type as arguments. Here model_type is the model used to train the word vectors, so it is one of "word2vec" or "fasttext". We took the word vector dimension as 200 and the number of clusters as 60.
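
As a rough illustration of what this step computes, the sketch below builds a word-topic vector in the same way as the line quoted in the traceback under scdv's Issues (word vector times cluster probability times idf, written into one slot per cluster), sums those vectors over a document, and zeroes small entries. It is not the repository's code: the function names, the plain-list representation and the fixed `threshold` argument are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def word_topic_vector(word_vec: Sequence[float],
                      cluster_probs: Sequence[float],
                      idf: float) -> List[float]:
    """Concatenate idf * P(cluster k | word) * word_vec over all clusters k."""
    out: List[float] = []
    for prob in cluster_probs:
        out.extend(idf * prob * x for x in word_vec)
    return out


def document_vector(tokens: Sequence[str],
                    word_topic_vecs: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Sum the word-topic vectors of the tokens; unknown tokens are skipped."""
    doc = [0.0] * dim
    for tok in tokens:
        vec = word_topic_vecs.get(tok)
        if vec is None:
            continue  # out-of-vocabulary token (skipping it is an assumption)
        for i, x in enumerate(vec):
            doc[i] += x
    return doc


def sparsify(doc: Sequence[float], threshold: float) -> List[float]:
    """Zero every entry whose magnitude is below the (assumed) threshold."""
    return [x if abs(x) >= threshold else 0.0 for x in doc]
```

With a word vector dimension of 200 and 60 clusters, each word-topic vector and document vector in this sketch would have 200 × 60 = 12000 entries.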

Get Topic coherence for documents in train set:

$ python TopicCoherence.py 200 60 10 model_type
# TopicCoherence.py takes the word vector dimension, the number of clusters, the number of top words and model_type as arguments. Here model_type is the model used to train the word vectors, so it is one of "word2vec" or "fasttext". We took the word vector dimension as 200, the number of clusters as 60 and the number of top words as 10.

Reuters

Change directory to Reuters for experimenting on the Reuters-21578 dataset. As the Reuters data is in SGML format, parse it and create a pickle file of the parsed data as follows:

$ python create_data.py
# We don't save train and test files locally. We split data into train and test whenever needed.

Get word vectors for all words in vocabulary through Word2Vec:

$ python Word2Vec.py 200
# Word2Vec.py takes word vector dimension as an argument. We took it as 200.

Get word vectors for all words in vocabulary through FastText:

$ python FastText.py 200
# FastText.py takes word vector dimension as an argument. We took it as 200.

Get Sparse Composite Document Vectors (SCDV) for the documents in the train and test sets, along with the prediction accuracy on the test set:

$ python SCDV.py 200 60 model_type
# SCDV.py takes the word vector dimension, the number of clusters and model_type as arguments. Here model_type is the model used to train the word vectors, so it is one of "word2vec" or "fasttext". We took the word vector dimension as 200 and the number of clusters as 60.

Get performance metrics on test set:

$ python metrics.py 200 60
# metrics.py takes word vector dimension and number of clusters as arguments. We took word vector dimension as 200 and number of clusters as 60.

Information Retrieval

Change directory to IR for experimenting on the information retrieval task. The IR datasets mentioned in the paper can be downloaded from the TREC website.

You will need to run the documents and queries through a full-fledged IR pipeline system such as Apache Lucene or Project Lemur in order to:

  • Tokenize the data, remove stop words and pass tokens through a Porter Stemmer.
  • Build inverted and forward index.
  • Build a basic language model retrieval system with Dirichlet smoothing.
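
For orientation only, the last item can be sketched as a query-likelihood scorer with Dirichlet smoothing, where p(w|d) = (tf(w,d) + mu * p(w|C)) / (|d| + mu). This is a toy stand-in for what Lucene or Lemur would provide, not part of the repository. The function name, the default `mu=2000.0` and the choice to skip query terms that never occur in the collection are assumptions.

```python
import math
from collections import Counter
from typing import Mapping, Sequence


def dirichlet_log_likelihood(query_tokens: Sequence[str],
                             doc_tokens: Sequence[str],
                             collection_tf: Mapping[str, int],
                             collection_len: int,
                             mu: float = 2000.0) -> float:
    """Log query likelihood of a document under Dirichlet smoothing."""
    doc_tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_tokens:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # term unseen in the collection: skipped (an assumption)
        p_term = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score
```

Higher (less negative) scores indicate a better match between the query and the document.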

Data Format

  • The IR data folder must contain a file called "queries.txt" and a folder called raw that holds all the documents.
  • Each file in raw should be a single document containing space-separated processed tokens. Each file must be named doc_ID.txt.
  • Each line in queries.txt should be a single query containing space-separated processed words.
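
A small loader for this layout might look like the following sketch. The function name `load_ir_data`, the use of the file name without `.txt` as the document key, and the UTF-8 encoding are assumptions, not taken from the repository.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def load_ir_data(data_dir: str) -> Tuple[Dict[str, List[str]], List[List[str]]]:
    """Read raw/*.txt as token lists and queries.txt as one query per line."""
    root = Path(data_dir)

    # One document per file; tokens are separated by whitespace.
    docs: Dict[str, List[str]] = {}
    for path in sorted((root / "raw").glob("*.txt")):
        docs[path.stem] = path.read_text(encoding="utf-8").split()

    # One query per non-empty line of queries.txt.
    query_text = (root / "queries.txt").read_text(encoding="utf-8")
    queries = [line.split() for line in query_text.splitlines() if line.strip()]
    return docs, queries
```

For the folder used in the commands below, the call would be `docs, queries = load_ir_data("sjm")`.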

To interpolate the language model retrieval system with the query-document scores obtained from SCDV:

Get word vectors for all terms in vocabulary through Word2Vec:

$ python Word2Vec.py 300 sjm
# Word2Vec.py takes word vector dimension and folder containing IR dataset as arguments. We took 300 and sjm (San Jose Mercury).

Get word vectors for all terms in vocabulary through FastText:

$ python FastText.py 300 sjm
# FastText.py takes word vector dimension and folder containing IR dataset as arguments. We took 300 and sjm (San Jose Mercury).

Create Sparse Composite Document Vectors (SCDV) for all documents and queries and compute similarity scores for all query-document pairs:

$ python SCDV.py 300 100 sjm model_type
# SCDV.py takes the word vector dimension, the number of clusters, the folder containing the IR dataset and model_type as arguments. Here model_type is the model used to train the word vectors, so it is one of "word2vec" or "fasttext". We took the word vector dimension as 300, the number of clusters as 100 and the folder as sjm.
# Change the code to store these scores in a format that can be used by the IR system.

Interpolate these scores with the language model scores, using an interpolation parameter of 0.5.
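
A minimal sketch of such an interpolation for a single query is shown below. It assumes a linear combination, that both score sets are already on comparable scales, and that a document without an SCDV score contributes 0.0. None of these details are stated above, and the function names are hypothetical.

```python
from typing import Dict, List


def interpolate(lm_score: float, scdv_score: float, lam: float = 0.5) -> float:
    """Linear interpolation of the two scores; lam weights the language model."""
    return lam * lm_score + (1.0 - lam) * scdv_score


def rerank(lm_scores: Dict[str, float],
           scdv_scores: Dict[str, float],
           lam: float = 0.5) -> List[str]:
    """Return document IDs sorted by interpolated score, best first."""
    combined = {
        doc_id: interpolate(lm, scdv_scores.get(doc_id, 0.0), lam)
        for doc_id, lm in lm_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)
```

With the parameter at 0.5 both scores carry equal weight, so it does not matter which side `lam` is attached to.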

Requirements

Minimum requirements:

  • Python 3.7
  • NumPy 1.17.2
  • Scikit-learn 0.23.1
  • Pandas 0.25.1
  • Gensim 3.8.1
  • sgmllib3k

For the theory and explanation of SCDV, please see our EMNLP 2017 paper and blog.

Note: You do not need to download the 20Newsgroup or Reuters-21578 datasets. Both are present in their respective directories. We used an SGML parser to parse the Reuters-21578 dataset.

scdv's Issues

KeyError when running SCDV.py 200 6 on Reuters

Good afternoon,

I am running python SCDV.py 200 60 on the Reuters set and the below error is occurring, can you kindly assist?

Traceback (most recent call last):
  File "SCDV.py", line 161, in <module>
    prob_wordvecs = get_probability_word_vectors(featurenames, word_centroid_map, num_clusters, word_idf_dict)
  File "SCDV.py", line 61, in get_probability_word_vectors
    prob_wordvecs[word][index*num_features:(index+1)*num_features] = model[word] * word_centroid_prob_map[word][index] * word_idf_dict[word]
KeyError: u'here'
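
One possible reading of this traceback, not confirmed here, is that featurenames contains a word ('here') that is missing from at least one of the three lookups on line 61: the word vector model, word_centroid_prob_map or word_idf_dict. A defensive sketch would keep only the words present in every lookup before building the vectors. The helper name `known_words` is hypothetical, and the lookups are assumed to support the `in` operator.

```python
from typing import Container, Iterable, List


def known_words(featurenames: Iterable[str],
                *lookups: Container[str]) -> List[str]:
    """Keep only the words that appear in every lookup, preserving order."""
    return [w for w in featurenames if all(w in lookup for lookup in lookups)]
```

Comparing `len(featurenames)` with the length of the filtered list would also show how many words are affected.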
