Removes unwanted columns and rows from the Yelp businesses CSV file.
Yelp data contains other kinds of businesses in addition to restaurants,
so we filter the categories column for these words: ['RESTAURANTS','BARS','FOOD','BREAKFAST & BRUNCH','DESSERTS','BAKERIES, DELIS, SANDWICHES', 'COFFEE & TEA', 'DINERS', 'CAFES']
Can be run with --Small True to use the small dataset so it does not take as long
Imports: pandas
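The category filter described above could look roughly like the sketch below. It is an illustration rather than the script itself: the column names ("name", "categories", "hours"), the helper names, and the toy rows are assumptions, and case-insensitive substring matching is only one possible reading of "filter the categories column for these words".

```python
import pandas as pd

# Keywords copied from the description above. "BAKERIES, DELIS, SANDWICHES"
# is kept as a single entry exactly as written there, so with substring
# matching it only matches that exact phrase.
KEYWORDS = [
    "RESTAURANTS", "BARS", "FOOD", "BREAKFAST & BRUNCH", "DESSERTS",
    "BAKERIES, DELIS, SANDWICHES", "COFFEE & TEA", "DINERS", "CAFES",
]


def is_food_business(categories, keywords=KEYWORDS):
    """Return True if any keyword occurs in the categories string."""
    if not isinstance(categories, str):
        return False  # missing categories (NaN) never match
    upper = categories.upper()
    return any(keyword in upper for keyword in keywords)


def filter_businesses(df, keep_columns):
    """Keep only food-related rows and only the requested columns."""
    mask = df["categories"].apply(is_food_business)
    return df.loc[mask, keep_columns].reset_index(drop=True)


# Toy data standing in for the Yelp businesses file (hypothetical columns).
businesses = pd.DataFrame({
    "name": ["Joe's Diner", "Quick Lube", "Bean There"],
    "categories": ["Diners, Breakfast & Brunch",
                   "Automotive, Oil Change",
                   "Coffee & Tea"],
    "hours": ["8-5", "9-6", None],
})

restaurants = filter_businesses(businesses, ["name", "categories"])
print(restaurants)
```

Substring matching is permissive: "BARS" also matches categories such as "Sushi Bars", which may or may not be what the real script intends.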
Creates a green rating for each restaurant based on whether its reviews contain “environmental” terms.
Examples of environmental terms: compost, recycle, green, local, vegan, vegetarian
If 1% or more of the total words in a restaurant's reviews were environmental terms, the restaurant got a score of 3. The rest got scores of 0, 1, or 2, but in our final dataset we only counted restaurants with a score of 3 as “green”.
files required: data/helper_files/environmentalTerms.txt,
data/big_restaurants_and_reviews.csv or
data/small_restaurants_and_reviews.csv
file generated: data/big_term_based_green_rating_results.csv or
data/small_term_based_green_rating_results.csv
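A minimal sketch of the 1% rule described above. It assumes simple lowercase tokenization and exact word matching. The cutoffs for scores 0, 1, and 2 are not stated here, so only the top threshold is shown. The function names are hypothetical, and the term set holds just the example terms listed above, not the contents of environmentalTerms.txt.

```python
import re

# Example terms quoted from the description; the real list lives in
# data/helper_files/environmentalTerms.txt.
EXAMPLE_TERMS = {"compost", "recycle", "green", "local", "vegan", "vegetarian"}


def environmental_fraction(review_text, terms):
    """Fraction of words in review_text that are environmental terms."""
    words = re.findall(r"[a-z']+", review_text.lower())
    if not words:
        return 0.0  # avoid dividing by zero for empty reviews
    hits = sum(1 for word in words if word in terms)
    return hits / len(words)


def meets_top_threshold(review_text, terms, threshold=0.01):
    """True when at least 1% of the words are environmental terms."""
    return environmental_fraction(review_text, terms) >= threshold


green_review = "Great vegan options and they compost everything"
plain_review = "The burger was fine"

print(meets_top_threshold(green_review, EXAMPLE_TERMS))
print(meets_top_threshold(plain_review, EXAMPLE_TERMS))
```

Because matching is exact, a word like "recycles" would not count as "recycle" in this sketch; handling such variants is what stemming or lemmatization would address.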
Can be run with --Small True to use the small dataset so it does not take as long
Imports: pandas
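One way the --Small True option could be parsed is sketched below. This is an assumption about the implementation, not a copy of the scripts; the helper names (str_to_bool, build_parser, input_path) are hypothetical. A custom converter is used because argparse's type=bool would treat any non-empty string, including "False", as True.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings into real booleans."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Compute term-based green ratings.")
    parser.add_argument("--Small", type=str_to_bool, default=False,
                        help="use the small dataset for a faster run")
    return parser


def input_path(use_small):
    """Pick between the small and big input files named above."""
    size = "small" if use_small else "big"
    return f"data/{size}_restaurants_and_reviews.csv"


# Parse an explicit argument list here; a script would call parse_args()
# with no arguments to read the real command line.
args = build_parser().parse_args(["--Small", "True"])
print(input_path(args.Small))
```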
Creates the final dataset, which contains a row for each restaurant
columns: name, review text, Yelp stars, GRA rating, Seafood Watch rating, term based rating, overall green rating
files required: data/small_restaurants_and_reviews.csv or data/big_restaurants_and_reviews.csv,
data/small_term_based_green_rating_results.csv or data/big_term_based_green_rating_results.csv
file generated: data/small_restaurants_reviews_ratings.csv or data/big_restaurants_reviews_ratings.csv
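The merge step could be sketched with pandas as below. The join key ("name"), the column names, and the toy rows are assumptions made for illustration. The GRA and Seafood Watch ratings and the rule for the overall green rating are left out because their details are not given here.

```python
import pandas as pd

# Toy stand-ins for the two required CSV files (hypothetical columns).
restaurants = pd.DataFrame({
    "name": ["Joe's Diner", "Bean There"],
    "review_text": ["Great vegan options", "Decent coffee"],
    "stars": [4.0, 3.5],
})
term_ratings = pd.DataFrame({
    "name": ["Joe's Diner", "Bean There"],
    "term_based_rating": [3, 1],
})

# A left join keeps one row per restaurant even if a rating were missing.
merged = restaurants.merge(term_ratings, on="name", how="left")

# Only a term-based score of 3 counts as "green", as described earlier.
merged["term_based_green"] = merged["term_based_rating"] == 3

print(merged)
```

Joining on the restaurant name alone would be fragile when names repeat across locations; a unique business identifier, if one is available in the data, would be a safer key.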
Analysis
In the Jupyter notebook notebooks/restaurants_reviews_ratings_analysis.ipynb, I regenerate the dataset produced by pythonScripts/merge_all_ratings.py (the notebook cells are copied from that script)
and do some basic data exploration and analysis. The visualization and graph above come from this notebook. The analysis did not yield any striking conclusions.
Future Work
Run better stemming and lemmatization algorithms on the reviews and determine topics for green restaurant reviews and non-green restaurant reviews.
Expand the list of green restaurants by using more alternative data sources (such as blogs or web scraping).
Use NYC Open Data in addition to Yelp stars to measure whether green restaurants are more successful.
This project stems partially from the NYU Big Data Science course project by Nellie Spektor, Valerie Angulo, and Andrea Waxman.
Nellie Spektor has since continued working on this project under Professor Anasse Bari.