sandy4321 / lda-topic-modeling

This project forked from jasjeetim/lda-topic-modeling

Latent Dirichlet Allocation for Topic Modeling

Python 98.64% Shell 1.36%

lda-topic-modeling's Introduction

LDA-Topic-Modeling

Latent Dirichlet Allocation for Topic Modeling

Python implementation of LDA topic modeling using gensim.

Implementation

We implement the model in Python and use the Gensim library for LDA modeling. The library provides a simple API to train and test the model. The scripts used by the system are summarized below:

  1. database.py:
    This script provides the Database object. The Database object accepts a data directory to read documents from and abstracts away the process of:
      (a) Reading in multiple documents from a given directory
      (b) Tokenizing the text
      (c) Removing stop words from the text (for better training and inference)
      (d) Reducing words to their stems
    It also splits the data into a training set and a test set. During training, it provides mini-batches of data.

  2. lda.py:
    This script provides the LDA object. The LDA object accepts an instance of the Database object to train on. The LDA object abstracts away the process of:
      (a) Getting data from the database
      (b) Training on mini-batches of data
      (c) Saving and reloading trained models
      (d) Visualizing training and testing results

  3. main.py:
    This script creates a Database object for every directory that has .txt files in it. It then creates an LDA object that is trained on all the Database objects one by one. Finally, it saves the trained model.

  4. split_data.sh:
    This is a shell script that takes in the data directory formed from the primary dataset and splits the data into subdirectories of 10,000 .txt files each. The directory containing these subdirectories is then fed into main.py, which creates a Database object for each subdirectory (one at a time) and trains the LDA object on it. This script was written so that an LDA model can be trained on all the data without running out of memory.
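
The flow described above (preprocess documents, feed them in mini-batches, update the model incrementally, save it) can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the stop-word list, the `crude_stem` helper (a stand-in for a real stemmer), the function names, and the default parameters are all assumptions made for the example. It also assumes the vocabulary is built up front from the documents passed in, which may differ from what lda.py does.

```python
import re
from typing import Iterator, List, Optional

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(word: str) -> str:
    """Strip a few common suffixes. Hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Tokenize, drop stop words, and stem (steps (b) to (d) of database.py)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def mini_batches(docs: List[List[str]], batch_size: int) -> Iterator[List[List[str]]]:
    """Yield consecutive slices of at most batch_size tokenized documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def train_lda(docs: List[List[str]], num_topics: int = 2, batch_size: int = 2,
              model_path: Optional[str] = None):
    """Train a gensim LdaModel incrementally, one mini-batch at a time."""
    # Imported here so the helpers above remain usable without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(docs)  # maps each token to an integer id
    lda = LdaModel(id2word=dictionary, num_topics=num_topics, random_state=0)
    for batch in mini_batches(docs, batch_size):
        bow = [dictionary.doc2bow(doc) for doc in batch]  # bag-of-words vectors
        lda.update(bow)  # online update on this batch only
    if model_path is not None:
        lda.save(model_path)  # reload later with LdaModel.load(model_path)
    return lda, dictionary
```

With gensim installed, calling `train_lda([preprocess(t) for t in texts])` on a list of raw strings returns the model and its dictionary, and `lda.print_topics()` lists the learned topics. Because each call to `update` sees only one batch, memory use is bounded by the batch size rather than the corpus size, which is the same motivation given for splitting the data into subdirectories.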

lda-topic-modeling's People

Contributors

jasjeetim
