hossam-h22 / topic-modeling-project-nlp

A topic modeling project in NLP using clustering and topic modeling algorithms

Topic-Modeling-Project-NLP

Topic modeling is an NLP technique for discovering the topics in a collection of documents, so that each group of similar documents can be given a topic title based on its content. This project uses both clustering and topic modeling algorithms, five in total: LDA, K-means, Mini-batch K-means, NMF, and LSA. The following sections describe each of them.


Preprocessing

  • Data reading and cleaning

    This step handles minor issues in the dataset, including reformatting the data and removing null values.

  • Feature extraction

    Feature extraction converts the cleaned text into numerical features that the algorithms can work with. It makes learning more efficient by removing redundant and unnecessary data, so the models can focus on the most relevant information.
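
The two preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's notebook code: the sample records, the helper names `clean_documents` and `tfidf_vectors`, and the choice of TF-IDF as the feature representation are all assumptions, since the text above does not say which features were used.

```python
import math
import re
from collections import Counter

def clean_documents(records):
    """Drop null/blank records and turn the rest into lowercase token lists."""
    cleaned = []
    for text in records:
        if text is None:
            continue  # remove nulls
        tokens = re.findall(r"[a-z]+", text.lower())
        if tokens:  # skip entries with no usable words
            cleaned.append(tokens)
    return cleaned

def tfidf_vectors(token_lists):
    """Return (vocabulary, one TF-IDF vector per document)."""
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical sample data, including a null and a blank entry.
records = ["The cat sat on the mat", None, "   ", "The dog chased the cat"]
docs = clean_documents(records)
vocabulary, vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document (here "the") gets a weight of zero, which is one concrete way in which feature extraction discards uninformative data.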


Clustering and Topic Modeling algorithms

  • Mini-batch K-means

    Mini-batch K-means is a variation of the traditional K-means algorithm that updates the cluster centers at each iteration using a small random subset (a batch) of the data points instead of the whole dataset. This makes the algorithm computationally efficient, particularly for large datasets. It approximates the results of K-means while reducing time and memory requirements, which makes it suitable for real-time and online clustering tasks.

  • LSA Model

    LSA (Latent Semantic Analysis) is a technique for analyzing the relationships between documents and terms in a large corpus. It represents documents and terms as vectors in a high-dimensional space, then reduces the dimensionality (typically with a truncated singular value decomposition) to capture latent semantic meaning. By exploiting patterns of word co-occurrence, LSA uncovers underlying semantic relationships in textual data.

  • NMF Model

    NMF (Non-Negative Matrix Factorization) is a matrix factorization technique used for dimensionality reduction and feature extraction. It assumes that the input matrix contains only non-negative values and decomposes it into two non-negative matrices whose product is a low-rank approximation of the original data. Because of the non-negativity constraints, the extracted features tend to be interpretable, which makes NMF useful for tasks such as text mining and image processing.

  • LDA Model

    LDA (Latent Dirichlet Allocation) is a probabilistic generative model used for topic modeling. It assumes that each document is a mixture of multiple topics and that each word in a document is generated from one of these topics. The LDA algorithm infers the latent topic structure of a set of documents and assigns the most probable topic to each word.

  • K-means Model

    The K-means algorithm is an iterative clustering algorithm that partitions a dataset into k distinct clusters. It assigns each data point to the cluster with the nearest mean and then recomputes the cluster centers, repeating both steps until convergence. The algorithm aims to minimize the within-cluster sum of squared distances, which favors compact clusters.
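
The difference between K-means and Mini-batch K-means described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the project's implementation (which may well use a library): the function names, the toy 2-D points, and the hand-picked starting centers are all hypothetical. The starting centers are passed in explicitly because K-means can settle in a poor local optimum from an unlucky start.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, init_centers, n_iter=20):
    """Full-batch K-means: every iteration looks at all points."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: each center moves to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def minibatch_kmeans(points, init_centers, batch_size=4, n_iter=50, seed=0):
    """Mini-batch K-means: every iteration looks at a small random batch."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)  # points absorbed by each center so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        assigned = [nearest(p, centers) for p in batch]
        for p, i in zip(batch, assigned):
            counts[i] += 1
            rate = 1.0 / counts[i]  # per-center learning rate
            centers[i] = [(1 - rate) * c + rate * x
                          for c, x in zip(centers[i], p)]
    return centers

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
start = [(0, 0), (11, 11)]
full_centers = kmeans(points, start)
mini_centers = minibatch_kmeans(points, start)
```

The full-batch version recomputes exact means, while the mini-batch version keeps a running mean per center, so its centers land close to, but not necessarily exactly on, the full-batch result.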

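The dimensionality reduction behind LSA can be illustrated with a small sketch. It assumes a tiny hypothetical document-term count matrix and finds the leading singular directions with power iteration and deflation, chosen here only to keep the example dependency-free; a real project would more likely call a library SVD routine. The names `top_eigenvector` and `lsa` are illustrative.

```python
import math
import random

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_eigenvector(sym, n_iter=200, seed=0):
    """Power iteration: dominant eigenpair of a symmetric PSD matrix."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in sym]
    for _ in range(n_iter):
        vec = mat_vec(sym, vec)
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(sym, vec)))
    return eigenvalue, vec

def lsa(doc_term, n_components):
    """Project documents onto the top right singular vectors of doc_term.

    n_components must not exceed the rank of doc_term in this simple sketch.
    """
    n_terms = len(doc_term[0])
    # Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
    gram = [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]
    components = []
    for _ in range(n_components):
        value, vec = top_eigenvector(gram)
        components.append(vec)
        # Deflation: remove the direction just found before searching again.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n_terms)]
                for i in range(n_terms)]
    # Each document becomes a short vector of latent coordinates.
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in doc_term]

# Hypothetical counts: documents 0-1 share terms 0-1, documents 2-3 share terms 2-3.
doc_term = [[2, 1, 0, 0],
            [1, 2, 0, 0],
            [0, 0, 1, 1],
            [0, 0, 1, 1]]
embedded = lsa(doc_term, n_components=2)
```

In the reduced two-dimensional space, documents that use the same terms end up pointing in the same direction, while documents with disjoint vocabularies are orthogonal.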

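The NMF decomposition into two non-negative matrices can be sketched with the classic multiplicative update rules. This is an illustrative assumption rather than the project's method: the text above does not say which solver was used, and the function names and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative v (docs x terms) into w (docs x rank) and h (rank x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H); eps guards against division by zero.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(h[0]))] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(rank)] for i in range(len(w))]
    return w, h

# Hypothetical document-term counts with two obvious topics.
v = [[2, 2, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
w, h = nmf(v, rank=2)
```

Because every update multiplies non-negative numbers, `w` and `h` stay non-negative throughout. Each row of `h` can be read as a topic (weights over terms) and each row of `w` as a document's mix of those topics.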

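The way LDA assigns a topic to each word can be illustrated with a small collapsed Gibbs sampler. This is a sketch under stated assumptions, not the project's code: the inference method, the hyperparameters `alpha` and `beta`, the function name, and the toy documents (lists of integer word ids) are all chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_topic = [[0] * n_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)

    def count(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic for every word occurrence.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            count(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                # Take the word out, then resample its topic given all others.
                count(d, w, assignments[d][pos], -1)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][pos] = k
                count(d, w, k, +1)

    # Smoothed per-document topic proportions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
             for counts, doc in zip(doc_topic, docs)]
    return theta, assignments

# Hypothetical corpus: documents 0-1 use word ids 0-2, documents 2-3 use ids 3-4.
docs = [[0, 1, 0, 1, 2], [1, 0, 0, 1], [3, 4, 3, 4, 4], [4, 3, 3, 4]]
theta, assignments = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a probability distribution over topics for one document, and `assignments` holds the sampled topic of every word occurrence. The sampler is stochastic, so the exact values depend on the seed.
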
Conclusion

We started this project to determine a title for each cluster of articles and then to assign an article used as test data to its most suitable cluster. Having implemented these requirements, our results indicate that the topic modeling algorithms (LSA, LDA, and NMF) achieve approximately similar accuracy, which is higher than that of the general clustering algorithms (K-means and Mini-batch K-means).



Project GUI

