GithubHelp home page GithubHelp logo

jmallone / birdwatch Goto Github PK

View Code? Open in Web Editor NEW

This project forked from programmingincluded/tb-watcher

0.0 0.0 0.0 4.2 MB

A Twitter profile archival tool.

License: GNU General Public License v3.0

Python 100.00%

birdwatch's Introduction

Birdwatch Logo

License: GPL v3

Birdwatch: A Twitter Profile Archival Tool

Birdwatch snapshots a profile page when given a URL (or an exported .js list of following from the official Twitter exporter.) Supports UTF-8 text JSON files and image snapshots of each Twitter post!

This script is purely for the purposes of archival use only.

Birdwatch

Features

  • Stores metadata in json format for each specified twitter profile.
  • Neatly organizes tweets by user and takes a snapshot of each tweet.
  • Marks potential tweets that are self-retweeted.
  • Removes Tweet Ads.
  • Allows for manual login (use at own risk.)

Usage

# Install the requirements. Once only.
python -m pip install -r requirements.txt
python birdwatch.py --url www.twitter.com/<profile>

# For more help use:
python birdwatch.py --help

Tested on Python 3.10.

Output

Birdwatch generates the following in the snapshots folder:

└───snapshots
    └───<user_id>           # Username
        │   metadata.json   # profile metadata
        │   profile.png     # snapshot of profile page
        │
        └───tweets
                0.png       # snapshot of latest tweet
                1.png
                ...
                9.png
                tweets.json # Metadata of each screen-capped tweet.

Self Boosted Tweet Detection

A self-boosted tweet is a tweet where the original author retweets. These types of tweets are marked with potential_boost as true in tweets.json. The script detects these by matching exact meta-datas e.g. duplicate posts.

Schemas

Assume all data is UTF-8 compliant.

Input File

These files are what the Twitter exporter should generate (.js file) from the users you are following:

window.* = [
    {
        "following": {
            "accountId": <id>,
            "userLink": <url>
        }
        ...
    }
]

You can rename as json or specify via input flags to parse the file. window.* = is automatically removed by the script and is default generated by Twitter. However, you can also manually remove it to parse the file as JSON directly.

tweets.json

[
    {
        "id": int,
        "tag_text": str,
        "name": str,
        "handle" str,
        "timestamp": str,
        "tweet_text": str,
        "retweet_count": str,
        "like_count": str,
        "reply_count": str,
        "potential_boost":  bool
    }
]

Invalid string entries will be marked as "NULL".

metadata.json

{
    "bio": str,
    "name": str,
    "username": str,
    "location": str,
    "website": str,
    "join_date": str,
    "following": str,
    "followers": str
}

Invalid string entries will be marked as "NULL".

Troubleshoot

  • My scraper terminates early?

It is possible that your images are taking sometime to load, Consider using -s to adjust load-time. Or your scrolling height is too low / too high. Consider using --scroll-algorithm to adjust the type of algorithm Then passing in a value to the algorithm --scroll-value.

"--help" has more information as to what --scroll-value encodes.

Future Updates and Goals

  • Support Running Multiple Sessions to Resume Per-Profile Fetching
  • Save and Expand Post Attachments

birdwatch's People

Contributors

programmingincluded avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.