ziruihao / twitter-stream-algorithms

Testing streaming algorithms on Twitter and Shakespeare

Twitter Stream Algorithms Demo

After learning a few stream algorithms in COSC 35 (taught by Prof. Chakrabarti at Dartmouth College), I wanted to implement some of them and see how they perform on real data.

The first question was what kind of stream data a college student can get access to. The easiest source was a platform that is essentially nothing but streams of data: Twitter.

[Image: Twitter hashtag streams]

Get Started

Create a .env file in the root directory and paste in the variables from here.

$ pip install -r requirements.txt
$ python __init__.py

Results

| Twitter words | Actual | Estimate (algorithm output) |
| --- | --- | --- |
| Total tokens | 20,000 | 16,383 |
| Distinct tokens | 3,195 | 3,586 |
| Heavy hitters* | See data/exact-twitter.json | See data/misra-twitter.json |

| Shakespeare words | Actual | Estimate (algorithm output) |
| --- | --- | --- |
| Total tokens | 50,000 | 32,767 |
| Distinct tokens | 7,589 | 8,192 |
| Heavy hitters* | See data/exact-shakespeare.json | See data/misra-shakespeare.json |

These estimates have high variance, and they sometimes land on exactly the same number because the set of possible outputs is sparse. I will implement some new methods we just learned in class to reduce this variance and make the space of possible estimates denser.

* The heavy hitters approximation is not complete yet.
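
To illustrate why the total-token estimates are so coarse, here is a minimal sketch of a Morris counter. It assumes the textbook version (increment an exponent x with probability 2^-x, report 2^x - 1) and takes the median of t independent copies. The names `MorrisCounter` and `morris_median` are hypothetical and not taken from this repo, which may differ in detail. Each counter can only output values of the form 2^x - 1, which is consistent with the estimates in the tables (16,383 = 2^14 - 1 and 32,767 = 2^15 - 1) and with separate runs often agreeing exactly.

```python
import random
import statistics


class MorrisCounter:
    """Approximate counter that stores only an exponent x (textbook version)."""

    def __init__(self, rng):
        self.x = 0
        self.rng = rng

    def update(self):
        # Increment the exponent with probability 2^-x.
        if self.rng.random() < 2.0 ** -self.x:
            self.x += 1

    def estimate(self):
        # Estimate of the number of updates seen so far.
        return 2 ** self.x - 1


def morris_median(n_tokens, t, seed=0):
    """Feed n_tokens updates to t counters and return the (low) median estimate."""
    rng = random.Random(seed)
    counters = [MorrisCounter(rng) for _ in range(t)]
    for _ in range(n_tokens):
        for counter in counters:
            counter.update()
    # median_low always returns one of the estimates, so the result keeps
    # the form 2^x - 1 even when t is even.
    return statistics.median_low([c.estimate() for c in counters])


if __name__ == "__main__":
    print(morris_median(20_000, t=5))
```

Taking a median of several copies reduces the chance of a wildly wrong answer, but it does not add new possible output values. Averaging groups of counters before taking the median is one standard way to fill in the gaps.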

Streaming

Algorithms

I implement the following stream algorithms:

  1. Misra-Gries - estimates token frequencies to find heavy hitters
    1. k: number of bins, scoring_method: either Levenshtein or SequenceMatcher for word similarity
  2. Morris - counts total tokens
    1. t: number of parallel estimators, whose median is taken
  3. BJKST - counts distinct tokens
    1. k: number of bins, t: number of parallel estimators, whose median is taken
  4. CountSketch - estimates token frequencies (work in progress)
    1. k: number of bins, t: number of parallel estimators, whose median is taken
  5. Exact - not a streaming algorithm; it counts the exact number of tokens, distinct tokens, and per-token frequencies to provide a baseline for comparison with the other algorithms
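
As a rough illustration of the first item, here is a minimal Misra-Gries sketch with k bins. It matches tokens by exact equality only and leaves out the `scoring_method` similarity matching described above. The function name `misra_gries` is an assumption for this example, not the repo's actual interface.

```python
def misra_gries(stream, k):
    """Return approximate token counts using at most k bins."""
    bins = {}
    for token in stream:
        if token in bins:
            bins[token] += 1
        elif len(bins) < k:
            bins[token] = 1
        else:
            # No free bin: decrement every bin and drop those that reach zero.
            # The incoming token is discarded as part of the same step.
            for key in list(bins):
                bins[key] -= 1
                if bins[key] == 0:
                    del bins[key]
    return bins


if __name__ == "__main__":
    tokens = "a b a c a b a d a".split()
    print(misra_gries(tokens, k=2))  # → {'a': 3}
```

With k bins and a stream of m tokens, each reported count never exceeds the true count and falls short of it by at most m / (k + 1). In the example, "a" truly appears 5 times out of 9 tokens, and the reported 3 is within 9 / 3 = 3 of that.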

Data

Each of these algorithms is fed a steady stream of data from one of two sources.

Twitter API

I'm leveraging Twitter's Stream API via the Python Tweepy library.

Once a stream is initiated, we receive a continuous sample of data from Twitter. This is not the entirety of Twitter's streams, but rather a percentage (controlled by Twitter based on allocations for free developer users).

Shakespeare's Works

A more offline source simulates a stream using words from 100 of Shakespeare's works. The raw text was downloaded from Project Gutenberg, cleaned, and extracted into 219 separate works, each containing a few thousand words. The stream picks one work and feeds its words to the algorithms the same way the Twitter stream does.
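
The simulated stream can be pictured roughly as below. This is only a sketch: the function names, the tokenizing regular expression, and passing the works as in-memory strings are all assumptions made for illustration, not the repo's actual loader.

```python
import random
import re


def word_stream(text):
    """Yield lowercase word tokens from a text, one at a time."""
    for match in re.finditer(r"[a-z']+", text.lower()):
        yield match.group()


def simulate_stream(works, seed=None):
    """Pick one work at random and stream its words."""
    rng = random.Random(seed)
    return word_stream(rng.choice(works))


if __name__ == "__main__":
    works = ["To be, or not to be", "All the world's a stage"]
    for token in simulate_stream(works, seed=0):
        print(token)
```

Because `word_stream` is a generator, an algorithm consuming it sees one token at a time and never needs the whole text in memory, which mirrors how tokens arrive from the live Twitter stream.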

Next Steps

Web Interface

I am planning to build an interactive web app for working with the algorithms and visualizing how these streaming algorithms manipulate data in real time. For example, for Misra-Gries, I want to animate the size of incoming tokens (words) to show the accumulated counts the algorithm currently holds for them.

Improvements

I want to improve the word matching process to better condense words based on similarity and grammatical connections. I also want to devise a better method for hashing word strings to integers, one more suited to this particular application. This would involve factoring in the domain of possible strings (length, arrangement of characters) to tailor the hashing to English words.
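
For reference, a generic baseline that a tailored scheme could be compared against is a polynomial string hash. The sketch below is not the hashing currently used in this repo. The function name, the base of 31, and the modulus are illustrative assumptions.

```python
MERSENNE_PRIME = (1 << 61) - 1  # 2^61 - 1, a large prime modulus


def word_hash(word, base=31, modulus=MERSENNE_PRIME):
    """Map a string to an integer in [0, modulus) with a polynomial hash."""
    h = 0
    for ch in word:
        # Treat the characters as coefficients of a polynomial in `base`.
        h = (h * base + ord(ch)) % modulus
    return h


if __name__ == "__main__":
    print(word_hash("ab"))  # → 3105, that is 97 * 31 + 98
```

This hash takes character order into account, so anagrams such as "ab" and "ba" get different values, but it knows nothing about English word structure, which is the part the planned improvement would address.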
