benedeki / nba_enricher

Gather NBA players and their stats, then scan tweets mentioning top players and add extra info to them.

NBA Enricher

version 1.0 RC

A small (mostly) Python project to gather NBA-related statistics, find the best players (based on chosen criteria) and eventually scan Twitter for mentions of these players. The tweets are then enriched with additional info about the players.

Required packages and other requirements

Postgres DB server (tested with 9.5, should work with 9+)

Twitter developer account to be able to use Twitter API

Python (tested with 3.6, effort made to make it 2.7 compatible)

Python Packages

  • requests
  • tweepy
  • psycopg2

Configuration

  1. Install or choose Postgres Server
  2. Create or choose existing database (let's call it nbadb)
  3. Create user nba on the server and grant it CONNECT and CREATE privileges on the database from step 2 (nbadb)
  4. Run the deploy.sh script from the DB directory (deploy.sh --host=localhost --dbname=nbadb --username=nba --password=???); on Windows, WSL can be used
  5. Change DB_CONNECTION in src/configuration.py to reflect the database setup
  6. Add Twitter API keys and secrets into TWITTER_CONNECTION in src/configuration.py
  7. Change any other configuration in src/configuration.py according to your preferences
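
For steps 5 and 6, the settings might look roughly like the sketch below. The key names inside DB_CONNECTION and TWITTER_CONNECTION, the placeholder values, and the build_dsn helper are all assumptions made for illustration; the actual layout of src/configuration.py may differ.

```python
# Hypothetical layout; the real src/configuration.py may use different keys.
DB_CONNECTION = {
    "host": "localhost",
    "dbname": "nbadb",
    "user": "nba",
    "password": "???",  # placeholder, as in the deploy.sh example above
}

# Placeholders only; use the credentials from your Twitter developer account.
TWITTER_CONNECTION = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}


def build_dsn(connection):
    """Illustrative helper: build a libpq keyword/value connection string.

    Values containing spaces or quotes would need escaping, which this
    sketch does not handle.
    """
    return " ".join(
        "{}={}".format(key, connection[key]) for key in sorted(connection)
    )
```

A string built this way can be passed to psycopg2.connect(); whether the project connects via a DSN string or keyword arguments is not stated here.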

Run

Logical steps

  1. Get players
  2. Get players' stats
  3. Identify the top players
  4. Gather the tweets
  5. Enrich the matching tweets
  6. Output the enriched tweets
  • Execute run_01_get_players.py (can be run repeatedly)
  • Execute run_02_get_player_stats.py (can be run repeatedly)
  • Start run_03_enrich_tweets.py

Tests

  • Execute runt_tests.py

Highlights

  • Both the statistics-gathering and the tweet-scanning steps are multi-threaded
  • the threads communicate via command queue(s)
  • Tweet scanning for occurrences of multiple different strings is done using the Aho-Corasick algorithm, which searches a text in one pass; replacement is then a second pass, so two passes in total (when there is at least one hit)
  • The database part is implemented as a service, not tightly sewn into the program
  • Key parts that can be expected to change or be enhanced are clearly separated to allow easy alteration (statistics gathering, best players criteria, enriching rules, output of the enriched tweets, ...)
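
The two-pass idea behind the tweet scanning (one Aho-Corasick search pass, then one replacement pass) can be sketched as follows. This is a minimal illustration written for this README, not the project's own implementation: the function names, the overlap rule (leftmost match wins, longer match preferred) and the sample replacement texts are assumptions. It matches case-sensitively and does not check word boundaries.

```python
from collections import deque


def build_automaton(words):
    """Build the Aho-Corasick trie, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # words recognised when a state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal, so shallower states are finished first.
    pending = deque(goto[0].values())
    while pending:
        state = pending.popleft()
        for ch, nxt in goto[state].items():
            pending.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Pass 1: scan the text once and collect (start, word) hits."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


def enrich(text, replacements):
    """Pass 2: rebuild the text with replacements (only if there were hits)."""
    hits = find_all(text, build_automaton(replacements))
    if not hits:
        return text
    # Sort by position; on ties prefer the longer word (assumed rule).
    hits.sort(key=lambda hit: (hit[0], -len(hit[1])))
    parts, pos = [], 0
    for start, word in hits:
        if start < pos:  # overlaps a replacement already made
            continue
        parts.append(text[pos:start])
        parts.append(replacements[word])
        pos = start + len(word)
    parts.append(text[pos:])
    return "".join(parts)
```

For example, with replacements {"James": "James [info]", "LeBron James": "LeBron James [info]"}, the text "LeBron James and James Harden" becomes "LeBron James [info] and James [info] Harden". The sort before the replacement pass is the step that the Aho-Corasick item under Known issues and TODOs proposes to eliminate.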

Known issues and TODOs

  • Aho-Corasick: use smarter result accumulation, so that no sorting is needed (use a cache of size equal to the longest searched word)
  • Some stats are back-computed from per-game stats; a better source would be more precise (points, time played, shots)
  • Increased robustness in calling NBA API (retries, thread recreation)
  • More tests
  • Components are tied together somewhat closely, which complicates testing
  • Deploy script can be much more sophisticated
  • Tweets scanning/enriching could be multi-processed instead of multi-threaded
  • Add threads to initial players load (step 1)

... and Beyond

The enriching part (03) was designed for flexibility in three respects: first, the output (method _output in class TweetEnricher); second, the actual enriching rules (the replacement dictionary produced by the players_to_enriching function in the enriching.py file); and finally, the possibility of switching relatively easily to a multi-process or even multi-service design, in case the enriching becomes CPU intensive or simply for scaling reasons.
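
A rough sketch of how these pieces could fit together is shown below: a worker thread takes commands from a queue, applies a replacement dictionary, and hands the result to an overridable _output method. The class names, the STOP sentinel and the use of str.replace (a stand-in for the Aho-Corasick scan) are assumptions for illustration; only the _output method name is taken from the description above.

```python
import queue
import threading

STOP = object()  # hypothetical "stop" command placed on the queue


class TweetEnricherSketch(threading.Thread):
    """Worker that enriches tweets received through a command queue."""

    def __init__(self, commands, replacements):
        super().__init__()
        self.commands = commands          # queue.Queue of tweets or STOP
        self.replacements = replacements  # {player name: enriched text}

    def run(self):
        while True:
            command = self.commands.get()
            if command is STOP:
                break
            self._output(self._enrich(command))

    def _enrich(self, tweet):
        # Simplification: plain str.replace instead of Aho-Corasick.
        for name, enriched in self.replacements.items():
            tweet = tweet.replace(name, enriched)
        return tweet

    def _output(self, tweet):
        # Default output; subclasses override this to change the destination.
        print(tweet)


class CollectingEnricher(TweetEnricherSketch):
    """Variant that stores enriched tweets instead of printing them."""

    def __init__(self, commands, replacements):
        super().__init__(commands, replacements)
        self.collected = []

    def _output(self, tweet):
        self.collected.append(tweet)
```

Because the worker only sees a queue and a dictionary, replacing queue.Queue and threading.Thread with their multiprocessing counterparts would be a comparatively local change, which is the kind of switch the paragraph above has in mind.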

The database is also used as just another service in the multi-service architecture of this small application, together with NBA and Twitter. The Python code communicates with the database via its API only, with no direct access to the DB (SELECTs, INSERTs, etc.). This hides the DB implementation details from the application: the actual data structures can morph, and the database can scale and offer redundancy without any changes to the application. Thanks to this approach the database can, like any other service, expand its API to offer a richer service, with just some care not to break past contracts.
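
One way such an API-only contract can look from the Python side is sketched below, assuming the database API is exposed as stored functions. The call_db_function helper and the get_top_players function name are hypothetical and used only for illustration; the project's real function names and calling convention are not listed here.

```python
def call_db_function(cursor, function_name, *args):
    """Call a database function and return all resulting rows.

    `cursor` is any DB-API cursor (for example one from psycopg2).
    `function_name` must be a trusted identifier from the code itself,
    never user input, because it is interpolated into the SQL text.
    """
    placeholders = ", ".join(["%s"] * len(args))
    query = "SELECT * FROM {}({})".format(function_name, placeholders)
    cursor.execute(query, args)
    return cursor.fetchall()
```

A call such as call_db_function(cursor, "get_top_players", 10) touches no table directly, so the tables behind the function can change as long as its signature and result shape stay the same.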
