taka-wang / nlp_yelp_review_unsupervised

This project forked from marcmuon/nlp_yelp_review_unsupervised

Train unsupervised LDA Topic Model on raw Yelp review text, use topic distributions as feature inputs to supervised classifier of review sentiment

Home Page: https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28

Jupyter Notebook 98.06% Python 1.94%

nlp_yelp_review_unsupervised's Introduction

LDA Topic Models as Supervised Classification Inputs

The Data

This experiment uses Yelp's publicly available restaurant review data (6,685,900 reviews across 192,609 businesses).

I've written instructions below for setting up your own DB and loading the Yelp data. However, the output of preprocess.py was small enough that I could include the rev_train.pkl and rev_test.pkl files in the /data directory. You can therefore skip the DB setup sections and use those files directly, then explore the LDA experiment using Notebooks #2 (train corpus) and #3 (test corpus).
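
If you take the shortcut, loading the two included files might look like the sketch below. It assumes the pickles hold pandas DataFrames and that you run it from the repo root, so that the data directory resolves; load_split is a hypothetical helper, not something defined in this repo.

```python
from pathlib import Path

import pandas as pd


def load_split(split, data_dir="data"):
    """Load rev_train.pkl or rev_test.pkl from data_dir (hypothetical helper)."""
    if split not in ("train", "test"):
        raise ValueError(f"split must be 'train' or 'test', got {split!r}")
    # read_pickle returns whatever object was pickled; a DataFrame is assumed here.
    return pd.read_pickle(Path(data_dir) / f"rev_{split}.pkl")
```

For example, load_split("train") would return the training corpus. Pickles written by a much older pandas version may need a compatible pandas install to load.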

Download JSON and Setup Mongo

  1. Yelp data is in raw JSON here: https://www.yelp.com/dataset
  2. Install Mongo locally if needed via instructions here: https://docs.mongodb.com/manual/tutorial/

Mongo Creation

  1. You'll need a running mongod instance. Generally this can be done via mongod --config /usr/local/etc/mongod.conf, which by default runs in the foreground of your terminal. If you installed Mongo via Homebrew on Mac, you can alternatively use brew services start mongodb, which runs it as a background service.
  2. From the directory where you extracted the Yelp JSON, run the following two commands: mongoimport --db yelp --collection review review.json and mongoimport --db yelp --collection business business.json. Those are the only two portions of the Yelp dataset I used for this experiment.
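
The two import commands can also be scripted. This is a minimal sketch, assuming mongoimport is on your PATH, mongod is already running, and the extracted files are named review.json and business.json as in the commands above. The helper names build_import_command and import_all are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Collections used in this experiment, mapped to the JSON file each is loaded from.
COLLECTIONS = {"review": "review.json", "business": "business.json"}


def build_import_command(collection, json_dir=".", db="yelp"):
    """Return the mongoimport argument list for one collection."""
    json_path = Path(json_dir) / COLLECTIONS[collection]
    return ["mongoimport", "--db", db, "--collection", collection, str(json_path)]


def import_all(json_dir="."):
    """Run mongoimport once per collection; requires a running mongod."""
    if shutil.which("mongoimport") is None:
        raise RuntimeError("mongoimport not found on PATH")
    for collection in COLLECTIONS:
        # check=True raises CalledProcessError if an import fails.
        subprocess.run(build_import_command(collection, json_dir), check=True)
```

Calling import_all() from the directory that holds the extracted JSON runs the same two commands as step 2.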

Mongo Load Script

I've created two helper scripts that load the data from Mongo into pandas DataFrame objects and pickle them. If you want to follow along with the LDA experiments or fork your own, just run the following two scripts from a terminal, assuming you're in the mongo_load directory of this repo:

  1. python business_load.py
  2. python reviews_load.py

That will create two pickled DataFrame files (.pkl) within the mongo_load directory, which we'll use as the basis for the rest of the project. Each is filtered to a specific subset of columns.
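
For readers who want to see the general shape of such a loader, here is a rough sketch rather than the actual contents of reviews_load.py. It assumes pymongo is installed, mongod is listening on the default localhost:27017, and the data sits in the yelp database created above. The column subset, the output filename, and the function names are all illustrative assumptions.

```python
import pandas as pd

# Assumed subset of review fields; the repo's scripts may keep different columns.
REVIEW_FIELDS = ["review_id", "business_id", "stars", "text"]


def docs_to_frame(docs, fields):
    """Build a DataFrame holding only `fields` from an iterable of Mongo documents."""
    rows = [{field: doc.get(field) for field in fields} for doc in docs]
    return pd.DataFrame(rows, columns=fields)


def load_reviews(out_path="reviews.pkl", limit=0):
    """Read the yelp.review collection and pickle it as a DataFrame."""
    # Imported here so docs_to_frame can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)  # default local mongod address
    projection = {field: 1 for field in REVIEW_FIELDS}
    projection["_id"] = 0  # drop Mongo's internal ObjectId
    # limit=0 means "no limit" in pymongo.
    cursor = client["yelp"]["review"].find({}, projection).limit(limit)
    frame = docs_to_frame(cursor, REVIEW_FIELDS)
    frame.to_pickle(out_path)
    return frame
```

Projecting only the needed fields in the query keeps the text-heavy review collection from being pulled into memory in full.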

Mongo Load Script - Alternate

In lieu of using the two helper scripts above, you could likely just use the pandas read_json function outlined here to create the DataFrames: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html.

However, I haven't tested that, and if you're an experienced Mongo user there's likely more flexibility in just running your own DB for this data.
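
As a starting point for that untested route, here is a minimal sketch. It assumes the Yelp files are line-delimited JSON (one object per line), which is why lines=True is passed. Reading in chunks is my own choice to limit peak memory on the large review file. The function name and the example column list are hypothetical.

```python
import pandas as pd


def read_yelp_json(path, columns=None, chunksize=100_000):
    """Read a line-delimited JSON file in chunks and return one DataFrame."""
    chunks = []
    # lines=True parses one JSON object per line; with chunksize set,
    # read_json returns an iterator of DataFrames instead of one frame.
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        chunks.append(chunk[columns] if columns else chunk)
    return pd.concat(chunks, ignore_index=True)
```

For example, read_yelp_json("review.json", columns=["business_id", "stars", "text"]) would keep only three columns while reading.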

Contributors

marcmuon
