Edureka Python Project Documentation

Problem Statement

IMDB provides a list of celebrities born on the current date. Below is the link: http://m.imdb.com/feature/bornondate

Get the list of these celebrities from this webpage using web scraping (the ones that are displayed i.e top 10). You have to extract the below information:

Name of the celebrity
Celebrity Image
Profession
Best Work

Once you have this list, run a sentiment analysis on twitter for each celebrity and finally the output should be in the below format

Name of the celebrity:
Celebrity Image:
Profession:
Best Work:
Overall Sentiment on Twitter: Positive, Negative or Neutral

Hint: Use IMDB scrapping sample example as reference for scraping the mentioned web page. For sentiment analysis use the Twitter sentiment code as reference.

Please Note That I Am Using Python 3.4

Tools and Packages Used

• Version: Python 3.4 [VERY IMPORTANT] • Tweepy  Tweepy is an open-sourced, hosted on GitHub, and enables Python to communicate with the Twitter platform and use its API. Here's the documentation.

• Codecs  The codecs module provides stream and file interfaces for transcoding data in your program. In this project I use the module for storing the tweets as Unicode text. Here's the documentation.

• String (punctuation)  To strip the tweets of all punctuations.

• BeautifulSoup  Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree using Python parsers like lxml and html5lib. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. Here's the documentation.

• Selenium  The webdriver kit emulates a web-browser (I chose FireFox) and executes the JS scripts to load the dynamic content.

Challenges Faced during the project

Tweepy has an issue with Python 3

Error message: TypeError: Can't convert 'bytes' object to str implicitly inside: tweepy\streaming.py

Solution:
Can be found at tweepy/tweepy#615. In streaming.py: I changed line 161 to

self._buffer += self._stream.read(read_len).decode('ascii')

and line 171 to

self._buffer += self._stream.read(self._chunk_size).decode('ascii')

and then reinstalled.

The IMDB website has dynamic content:

Reference: http://fruchter.co/post/53164489086/python-headless-web-browser-scraping-on-amazon

Description: Had to use the Selenium’s webdriver to emulate a Firefox browser and execute the JS functions which dynamically fetches the details of celebrities born on the current day.

sandy4321 / webscraping_twittersentimentanalysis Goto Github PK

webscraping_twittersentimentanalysis's Introduction

Edureka Python Project Documentation

Problem Statement

Tools and Packages Used

Challenges Faced during the project

Tweepy has an issue with Python 3

The IMDB website has dynamic content:

webscraping_twittersentimentanalysis's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs