
twitter_scraping's Introduction

Twitter Scraper

Twitter makes it hard to get all of a user's tweets once they have more than 3,200, because the API's user timeline only returns the most recent 3,200. This is a way to get around that using Python, Selenium, and Tweepy.

Essentially, we will use Selenium to open a browser and automatically visit Twitter's search page, searching for a single user's tweets on a single day. If we want all tweets from 2015, we will check all 365 days, one search page per day. This would be a nightmare to do manually, so the scrape.py script does it all for you. All you have to do is enter a date range and a Twitter user handle, and wait for it to finish.
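
To make the one-page-per-day idea concrete, here is a minimal sketch of how the daily search URLs could be generated. It is not copied from scrape.py: the function name daily_search_urls, the base URL, and the f=live parameter are assumptions for illustration, and it assumes the until: operator excludes the day it names.

```python
from datetime import date, timedelta
from urllib.parse import quote


def daily_search_urls(user, start, end):
    """Yield (day, url) pairs, one search page per day in [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        # from:, since: and until: are Twitter search operators.
        # Assumption: until: excludes the day it names.
        query = f"from:{user} since:{day.isoformat()} until:{next_day.isoformat()}"
        # Assumption: the base URL and f=live are illustrative, not taken from scrape.py.
        yield day, "https://twitter.com/search?f=live&q=" + quote(query)
        day = next_day


if __name__ == "__main__":
    # "example" is a placeholder handle.
    for day, url in daily_search_urls("example", date(2015, 1, 1), date(2015, 1, 4)):
        print(day, url)
```

Selenium would then load each URL in turn, wait for the page to render, and collect the tweet ids it finds.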

The scrape.py script collects tweet ids. If you know a tweet's id number, you can get all the information available about that tweet using Tweepy - text, timestamp, number of retweets / replies / favorites, geolocation, etc. Tweepy uses Twitter's API, so you will need to get API keys. Once you have them, you can run the get_metadata.py script.
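
As a rough illustration of the lookup step, the sketch below fetches tweets by id in batches. It assumes the tweepy 3.x method name statuses_lookup (later tweepy releases renamed it) and a limit of 100 ids per request; fetch_statuses is a hypothetical helper, not a function from get_metadata.py.

```python
def fetch_statuses(api, tweet_ids, batch_size=100):
    """Look up tweets by id in batches and return the raw JSON dict for each."""
    results = []
    for i in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[i:i + batch_size]
        # statuses_lookup is the tweepy 3.x name. Ids that cannot be
        # returned (for example deleted tweets) are simply missing.
        for status in api.statuses_lookup(batch):
            results.append(status._json)
    return results


# Usage, assuming tweepy 3.5.0 and placeholder credentials:
# import tweepy
# auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
# auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
# api = tweepy.API(auth)
# tweets = fetch_statuses(api, ["1234567890"])
```

Passing the api object in keeps the batching logic separate from authentication, so it can be tried out with a stub before any API keys are available.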

My updates 2023/05/08

The two biggest updates I made to the original are extra output files and a manual login step. After all tweets are grabbed, more output files are generated; specifically, a .bat file is created for both Twitter images and videos so you can download those as well if desired. For scraping to work, you need to manually log into a valid Twitter account within 60 seconds of the Chrome window opening. Otherwise, each subsequent page load will prompt you to log in instead of showing new tweets to scrape. Don't forget to add your own API keys as well.

Requirements

  • basic knowledge of how to use a terminal
  • Safari 10+ with 'Allow Remote Automation' option enabled in Safari's Develop menu to control Safari via WebDriver.
  • python3
    • to check, in your terminal, enter python3
    • if you don't have it, check YouTube for installation instructions
  • pip or pip3
    • to check, in your terminal, enter pip or pip3
    • if you don't have it, again, check YouTube for installation instructions
  • selenium (3.0.1)
    • pip3 install selenium
  • tweepy (3.5.0)
    • pip3 install tweepy

Running the scraper

  • open up scrape.py and edit the user, start, and end variables (and save the file)
  • run python3 scrape.py
  • you'll see a browser pop up and output in the terminal
  • do some fun other task until it finishes
  • once it's done, it outputs all the tweet ids it found into all_ids.json
  • every time you run the scraper with different dates, it will add the new ids to the same file
    • it automatically removes duplicates so don't worry about small date overlaps
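
The append-and-deduplicate behaviour described above could look roughly like this. The sketch assumes all_ids.json holds a flat JSON list of tweet id strings; merge_ids is a hypothetical helper name, not a function from scrape.py.

```python
import json
import os


def merge_ids(new_ids, path="all_ids.json"):
    """Add new_ids to the JSON list at path, drop duplicates, return the merged list."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)
    # dict.fromkeys keeps first-seen order while removing repeats.
    merged = list(dict.fromkeys(existing + list(new_ids)))
    with open(path, "w") as f:
        json.dump(merged, f)
    return merged
```

Because duplicates are dropped on every write, re-running the scraper over an overlapping date range only adds ids that were not already in the file.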

Troubleshooting the scraper

  • do you get a no such file error? you need to cd to the directory of scrape.py
  • do you get a driver error when you try to run the script?
    • open scrape.py and change the driver to use Chrome() or Firefox()
      • if neither works, google the error (you probably need to install a new driver)
  • does it seem like it's not collecting tweets for days that have tweets?
    • open scrape.py and change the delay variable to 2 or 3
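
If you would rather not edit the driver line by hand, one option is to try several browsers in order and keep the first one that starts. This is only a sketch: first_working_driver is a hypothetical helper, and scrape.py itself does not do this.

```python
def first_working_driver(factories):
    """Return the first driver that starts; factories is a list of (name, callable)."""
    errors = []
    for name, factory in factories:
        try:
            return factory()
        except Exception as exc:  # e.g. the browser or its driver is not installed
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no browser driver could be started:\n" + "\n".join(errors))


# Usage with Selenium (webdriver.Safari, webdriver.Chrome and
# webdriver.Firefox are the standard Selenium driver classes):
# from selenium import webdriver
# driver = first_working_driver([
#     ("Safari", webdriver.Safari),
#     ("Chrome", webdriver.Chrome),
#     ("Firefox", webdriver.Firefox),
# ])
```

The collected error messages are included in the final exception, which makes it easier to search for the specific driver problem.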

Getting the metadata

  • first you'll need to get twitter API keys
  • put your keys into the sample_api_keys.json file
  • change the name of sample_api_keys.json to api_keys.json
  • open up get_metadata.py and edit the user variable (and save the file)
  • run python3 get_metadata.py
  • this will get metadata for every tweet id in all_ids.json
  • it will create 4 files
    • username.json (master file with all metadata)
    • username.zip (a zipped file of the master file with all metadata)
    • username_short.json (smaller master file with relevant metadata fields)
    • username.csv (csv version of the smaller master file)
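
For orientation, the four output files could be produced with the standard library along these lines. This is a sketch under assumptions, not the code in get_metadata.py: write_outputs is a hypothetical helper, and the fields kept in the short version (id_str, created_at, text) are illustrative guesses at "relevant metadata fields".

```python
import csv
import json
import os
import zipfile


def write_outputs(user, tweets, fields=("id_str", "created_at", "text"), out_dir="."):
    """Write <user>.json, <user>.zip, <user>_short.json and <user>.csv; return their paths."""
    paths = {
        "master": os.path.join(out_dir, f"{user}.json"),
        "zip": os.path.join(out_dir, f"{user}.zip"),
        "short": os.path.join(out_dir, f"{user}_short.json"),
        "csv": os.path.join(out_dir, f"{user}.csv"),
    }
    # Master file: every metadata field for every tweet.
    with open(paths["master"], "w") as f:
        json.dump(tweets, f)
    # Zipped copy of the master file.
    with zipfile.ZipFile(paths["zip"], "w", zipfile.ZIP_DEFLATED) as z:
        z.write(paths["master"], arcname=f"{user}.json")
    # Short version: only the chosen fields (assumed here, see lead-in).
    short = [{k: t.get(k) for k in fields} for t in tweets]
    with open(paths["short"], "w") as f:
        json.dump(short, f)
    # CSV version of the short file, one row per tweet.
    with open(paths["csv"], "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(short)
    return paths
```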

twitter_scraping's People

Contributors

bpb27, nmjohnson, aboutaaron
