Latent Dirichlet Allocation for Topic Modeling
A Python implementation of LDA topic modeling using Gensim.
\textbf{Implementation}
\tab We implement the model in Python and use the Gensim \cite{gensim} library for the LDA modeling. The library provides a simple API to train and test the model. Below is a summary of the scripts that make up the system:
\textbf{database.py}:
\tab This script provides the Database object. The Database object accepts a data directory from which to read documents and abstracts away the process of:
\tab \tab (a) Reading in multiple documents from a given directory
\tab \tab (b) Tokenizing the text
\tab \tab (c) Removing stop words from the text (for better training and inference)
\tab \tab (d) Reducing words to their stems
\tab Further, it splits the data into a training set and a test set, and during training it provides mini-batches of data.
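A minimal sketch of what such a Database object might look like is given below. The class and method names (\texttt{Database}, \texttt{minibatches}) are assumptions based on the description above, and the stop-word list and suffix-stripping stemmer are deliberately simplified stand-ins for the real preprocessing (which would typically use NLTK or Gensim utilities):

```python
import os
import random

class Database:
    """Simplified sketch of the Database object (names are assumptions)."""

    # A tiny illustrative stop-word list; a real implementation would
    # use a full list, e.g. from NLTK or Gensim.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

    def __init__(self, data_dir, test_fraction=0.1, seed=0):
        docs = []
        # (a) Read every .txt document in the given directory.
        for name in sorted(os.listdir(data_dir)):
            if name.endswith(".txt"):
                with open(os.path.join(data_dir, name), encoding="utf-8") as f:
                    docs.append(self._preprocess(f.read()))
        # Split the documents into a training set and a test set.
        random.Random(seed).shuffle(docs)
        n_test = int(len(docs) * test_fraction)
        self.test_docs = docs[:n_test]
        self.train_docs = docs[n_test:]

    def _preprocess(self, text):
        # (b) Tokenize, then (c) drop stop words, then (d) stem.
        tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
        tokens = [t for t in tokens if t and t not in self.STOP_WORDS]
        return [self._stem(t) for t in tokens]

    @staticmethod
    def _stem(token):
        # Crude suffix stripping as a stand-in for a real stemmer
        # (e.g. NLTK's PorterStemmer).
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def minibatches(self, batch_size):
        # Yield the training documents in mini-batches.
        for i in range(0, len(self.train_docs), batch_size):
            yield self.train_docs[i : i + batch_size]
```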
\textbf{lda.py}:
\tab This script provides the LDA object. The LDA object accepts an instance of the Database object to train on. The LDA object abstracts away the process of:
\tab \tab (a) Getting data from the database
\tab \tab (b) Training on mini-batches of data
\tab \tab (c) Saving and reloading trained models
\tab \tab (d) Visualizing training and testing results
\textbf{main.py}:
\tab This script creates a Database object for every directory that contains .txt files. It then creates an LDA object and trains it on each Database object in turn. Finally, it saves the trained model.
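The directory-discovery step of this script can be sketched as follows. The \texttt{find\_txt\_dirs} helper name is hypothetical, and the training loop is shown only as comments since the \texttt{Database} and \texttt{LDA} objects come from the other scripts:

```python
import os

def find_txt_dirs(root):
    """Return every directory under root that directly contains .txt
    files, in sorted order (mirroring how directories are selected
    for training)."""
    txt_dirs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith(".txt") for name in filenames):
            txt_dirs.append(dirpath)
    return sorted(txt_dirs)

# The script then roughly does (names assumed from the descriptions above):
#   lda = LDA()
#   for d in find_txt_dirs(root):
#       lda.train(Database(d))   # one Database object per directory
#   lda.save("model.lda")
```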
\textbf{split\_data.sh}:
\tab This shell script takes the data directory formed from the primary dataset and splits it into subdirectories of 10,000 .txt files each. The directory containing these subdirectories is then fed to main.py, which creates a Database object for each subdirectory (one at a time) and trains the LDA object. This script was written to train an LDA model on all the data without running out of memory.
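The splitting logic can be sketched as a small POSIX shell function. The function name, the \texttt{chunk\_$n$} directory naming, and the chunk-size argument are assumptions; only the 10,000-files-per-subdirectory default comes from the description above:

```shell
# Sketch of the split_data.sh logic: move the .txt files in a data
# directory into numbered subdirectories of at most $2 files each
# (10,000 by default), so main.py can train on one chunk at a time.
split_into_chunks() {
    data_dir=$1
    chunk=${2:-10000}
    i=0
    n=0
    mkdir -p "$data_dir/chunk_$n"
    for f in "$data_dir"/*.txt; do
        [ -e "$f" ] || continue
        if [ "$i" -ge "$chunk" ]; then
            # Current chunk is full: start the next subdirectory.
            i=0
            n=$((n + 1))
            mkdir -p "$data_dir/chunk_$n"
        fi
        mv "$f" "$data_dir/chunk_$n/"
        i=$((i + 1))
    done
}
```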