
CZ4032 Data Analytics and Mining - Topic Modelling using LDA with business reviews from the Yelp dataset.

CZ4032 Project - Topic Modelling via Latent Dirichlet Allocation (LDA)

This project derives a score profile for each restaurant across a set of aspects, in the hope of providing insight into how each restaurant can improve its business. The aspects are derived via topic modelling with LDA, and the LDA model is trained on business reviews from the Yelp dataset.

Because the dataset contains a large number of reviews, we restrict ourselves to reviews of restaurants located in Las Vegas for computational efficiency. This subset accounts for the greatest proportion of data points in the dataset.
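To make the idea of deriving aspects via LDA more concrete, here is a minimal sketch of a collapsed Gibbs sampler written in plain Python. It is not the notebook's implementation, which is not shown here and most likely relies on a library. The function name `fit_lda`, the hyperparameter values and the toy reviews are all assumptions chosen for illustration.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents.

    Returns (doc_topic, topic_word, vocab); every row of doc_topic and
    topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][wid] -= 1
                topic_totals[k] -= 1
                # Unnormalised probability of each topic given all other tokens.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][wid] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][wid] += 1
                topic_totals[k] += 1

    # Smoothed estimates of the per-document and per-topic distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[t] + vocab_size * beta) for c in topic_word_counts[t]]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical, already tokenised reviews (not taken from the Yelp dataset).
reviews = [
    "food tasty food fresh tasty".split(),
    "service slow waiter rude service".split(),
    "fresh food tasty portion food".split(),
    "waiter slow service rude waiter".split(),
]
doc_topic, topic_word, vocab = fit_lda(reviews, num_topics=2)
```

Each row of `doc_topic` gives the topic mixture of one review, and each row of `topic_word` gives a topic's distribution over the vocabulary. Reading off the highest-probability words of a topic is one way the aspects could be named.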

Requirements

The Yelp dataset can be obtained from here. Experiments were conducted primarily on Google Colab. Only two files from the Yelp dataset are required: business.json and review.json.

Set up

The project consists of three notebooks, each handling a different step:

  • yelp_business_filtering.ipynb: Filters businesses to extract restaurants in Las Vegas, based on the city and categories properties.

  • yelp_review_filtering.ipynb: Filters reviews to extract all reviews of restaurants in Las Vegas, based on business_id.

  • yelp_lda_model.ipynb: Handles the preprocessing and the actual training of the LDA model on the filtered reviews, as well as some analysis of the results.
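The two filtering steps above can be sketched with the standard library alone. This is an illustration rather than the notebooks' actual code: the helper names are hypothetical, the sample records are made up, and the sketch assumes business.json holds one JSON object per line with `categories` stored either as a comma-separated string or as a list.

```python
import json

def is_las_vegas_restaurant(business):
    """Return True if a business record looks like a Las Vegas restaurant."""
    city = (business.get("city") or "").strip().lower()
    categories = business.get("categories") or []
    # Assumption: categories may be a comma-separated string or a list.
    if isinstance(categories, str):
        categories = categories.split(",")
    names = {c.strip().lower() for c in categories}
    return city == "las vegas" and "restaurants" in names

def filter_business_ids(lines):
    """Collect the business_id of every matching record in JSON-lines input."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if is_las_vegas_restaurant(record):
            ids.add(record["business_id"])
    return ids

def filter_reviews(reviews, business_ids):
    """Keep only the reviews whose business_id is in the filtered set."""
    return [r for r in reviews if r.get("business_id") in business_ids]

# Made-up records standing in for lines of business.json and rows of reviews.
sample_businesses = [
    '{"business_id": "b1", "city": "Las Vegas", "categories": "Restaurants, Thai"}',
    '{"business_id": "b2", "city": "Phoenix", "categories": "Restaurants, Pizza"}',
    '{"business_id": "b3", "city": "Las Vegas", "categories": "Hair Salons"}',
    '{"business_id": "b4", "city": "Las Vegas", "categories": null}',
]
sample_reviews = [
    {"business_id": "b1", "text": "Great pad thai."},
    {"business_id": "b2", "text": "Decent pizza."},
    {"business_id": "b3", "text": "Nice haircut."},
]

vegas_ids = filter_business_ids(sample_businesses)
vegas_reviews = filter_reviews(sample_reviews, vegas_ids)
```

Keeping the matching IDs in a set makes the per-review membership check cheap, which matters when the review file is much larger than the business file.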

Running the program

  • yelp_business_filtering.ipynb: This notebook was written and executed on a local machine. Ensure that business.json is located in the same directory as the notebook, and rename the file to yelp_academic_dataset_business.json.

  • yelp_review_filtering.ipynb: This notebook was written and executed on a local machine. Ensure that review.json is located in the same directory as the notebook, convert it to CSV, and rename the resulting file to review_tester.csv.

  • yelp_lda_model.ipynb: This notebook was written and executed on Google Colab, which uses Google Drive as its primary storage location. Ensure that the notebook and the filtered reviews dataset (yelp_review.csv) are both located in a folder named CZ4032 (Local) in the root of your Google Drive.
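The README does not say how review.json should be converted to CSV. One possible approach, assuming the file holds one JSON object per line, is sketched below with the standard library. The function name `json_lines_to_csv` and the chosen columns are assumptions, and the demo runs on in-memory text rather than on the real file.

```python
import csv
import io
import json

def json_lines_to_csv(src, dst, fields):
    """Write one CSV row per JSON record read from src; return the row count."""
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({f: record.get(f, "") for f in fields})
        count += 1
    return count

# In-memory demo with made-up reviews.
src = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great, fresh food."}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 2, "text": "Slow service."}\n'
)
dst = io.StringIO(newline="")
rows_written = json_lines_to_csv(src, dst, ["review_id", "business_id", "stars", "text"])
csv_text = dst.getvalue()
```

For the real files, `src` would be `open("review.json", encoding="utf-8")` and `dst` would be `open("review_tester.csv", "w", newline="", encoding="utf-8")`. Reading line by line avoids loading the whole review file into memory.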
