GithubHelp home page GithubHelp logo

namsto / movie_rating_prediction Goto Github PK

View Code? Open in Web Editor NEW

This project forked from sundeepblue/movie_rating_prediction

0.0 2.0 0.0 5.71 MB

Predict movie's IMDB rating

Python 72.05% R 27.95%

movie_rating_prediction's Introduction

=================================================================================== Predict IMDB movie rating by Chuan Sun (sundeepblue at gmail dot com) https://twitter.com/sundeepblue Scrapy project @ NYC Data Science Academy 8/14/2016

===================================================================================

STEP 1:

Fetch a list of 5000 movie titles and budgets from www.the-numbers.com

This step will generate a JSON file 'movie_budget.json'

$ scrapy crawl movie_budget -o movie_budget.json

===================================================================================

STEP 2:

Load 5000+ movie titles from the JSON file 'movie_budget.json'

Then search those titles from IMDB website to get the real IMDB movie links

It will generate a JSON file 'fetch_imdb_url.json' containing movie-link pairs

$ scrapy crawl fetch_imdb_url -o fetch_imdb_url.json

===================================================================================

STEP 3:

Scrape 5000+ IMDB movie information

This step will load the JSON file 'fetch_imdb_url.json', go into each movie page, and grab data

This step will generate a JSON file 'imdb_output.json' (20M) containing detailed info of 5000+ movies

It will also download all available posters for all movies.

A total of 4907 posters can be downloaded (998MB). Note that I am not sure if I can upload all those posters into github, so I only uploaded a few. You can see from my code how to use scrapy to grab them all.

$ scrapy crawl imdb -o imdb_output.json

===================================================================================

STEP 4:

Perform face recognition to count face numbers from all posters

This step will save result into JSON file 'image_and_facenumber_pair_list.json'

$ python detect_faces_from_posters.py

===================================================================================

STEP 5:

Load two JSON files 'imdb_output.json' and 'image_and_facenumber_pair_list.json'

Parse all variables into valid format.

Generate a final CSV table containing 28 variables that can be loaded in R or Pandas

The output will be a CSV file 'movie_metadata.csv' (1.5MB)

"movie_title"
"color"
"num_critic_for_reviews"
"movie_facebook_likes" "duration"
"director_name"
"director_facebook_likes"
"actor_3_name" "actor_3_facebook_likes"
"actor_2_name"
"actor_2_facebook_likes"
"actor_1_name" "actor_1_facebook_likes"
"gross"
"genres"
"num_voted_users"
"cast_total_facebook_likes" "facenumber_in_poster"
"plot_keywords"
"movie_imdb_link"
"num_user_for_reviews"
"language"
"country"
"content_rating"
"budget"
"title_year"
"imdb_score"
"aspect_ratio"

$ python parse_scraped_data.py

===================================================================================

STEP 6:

Load the 'movie_metadata.csv' file in RStudio, and perform EDA and LASSO regression

$ > run the RStudio

$ > load the file 'movie_rating_prediction.R'

movie_rating_prediction's People

Contributors

sundeepblue avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.