bdilday / pybbda

Python Baseball Data and Analysis

Home Page: https://pybbda.readthedocs.io/en/stable/

License: GNU General Public License v2.0

Python 98.98% Makefile 0.87% Dockerfile 0.15%
baseball baseball-statistics baseball-analytics baseball-analysis-packages python

pybbda's Introduction

pybbda

pybbda is a package for Python Baseball Data and Analysis.

data

pybbda aims to provide a uniform framework for accessing baseball data from various sources. The data are exposed as pandas DataFrames.

The data sources it currently supports are:

  • Lahman data

  • Baseball Reference WAR

  • Fangraphs leaderboards and park factors

  • Retrosheet event data

  • Statcast pitch-by-pitch data

analysis

pybbda also provides analysis tools.

It currently supports:

  • Marcel projections

  • Batted ball trajectories

  • Run expectancy via Markov chains

The following are planned for a future release:

  • Simulations

  • and more...!

Installation

This package is available on PyPI, so you can install it with pip,

$ pip install -U pybbda

Or you can install the latest master branch directly from the GitHub repo using pip,

$ pip install git+https://github.com/bdilday/pybbda.git

or download the source,

$ git clone git@github.com:bdilday/pybbda.git
$ cd pybbda
$ pip install .

Requirements

This package explicitly supports Python 3.6 and Python 3.7. It aims to support Python 3.8, but this is not guaranteed. It explicitly does not support any versions prior to Python 3.6, including Python 2.7.

Installing data

This package ships without any data. Instead it provides tools to fetch and store data from a variety of sources.

To install data you can use the update tool in the pybbda.data.tools sub-module.

Example,

$ python -m pybbda.data.tools.update -h
usage: update.py [-h] [--data-root DATA_ROOT] --data-source
                 {Lahman,BaseballReference,Fangraphs,retrosheet,all} [--make-dirs]
                 [--overwrite] [--min-year MIN_YEAR] [--max-year MAX_YEAR]
                 [--num-threads NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --data-root DATA_ROOT
                        Root directory for data storage
  --data-source {Lahman,BaseballReference,Fangraphs,retrosheet,all}
                        Update source
  --make-dirs           Make root dir if does not exist
  --overwrite           Overwrite files if they exist
  --min-year MIN_YEAR   Min year to download
  --max-year MAX_YEAR   Max year to download
  --num-threads NUM_THREADS
                        Number of threads to use for downloads

The data will be downloaded to --data-root, which defaults to the value of the PYBBDA_DATA_ROOT environment variable.
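As a rough illustration of that lookup order, the sketch below resolves a data root from an explicit value, then from PYBBDA_DATA_ROOT, then from a fallback. The function name resolve_data_root and the fallback path are hypothetical, chosen for this example; they are not part of pybbda's API.

```python
import os
from pathlib import Path


def resolve_data_root(cli_value=None, fallback="~/.pybbda/data"):
    """Pick a data root: explicit value first, then the environment, then a fallback.

    Both this function and the fallback path are illustrative assumptions,
    not pybbda's actual implementation.
    """
    if cli_value:
        # An explicit --data-root style value wins over everything else.
        return Path(cli_value).expanduser()
    env_value = os.environ.get("PYBBDA_DATA_ROOT")
    if env_value:
        return Path(env_value).expanduser()
    return Path(fallback).expanduser()


if __name__ == "__main__":
    print(resolve_data_root())
```

Setting the variable once in your shell profile avoids passing --data-root on every invocation.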

Detailed instructions are provided in the documentation.

Example Usage

After installing some or all of the data, you can start using the package.

The following is an example of accessing Lahman data. More examples are included in the documentation.

Lahman data

>>> from pybbda.data import LahmanData
>>> lahman_data = LahmanData()
>>> batting_df = lahman_data.batting
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/.pybbda/data/Lahman/Batting.csv
>>> batting_df.head()
    playerID  yearID  stint teamID lgID   G   AB   R   H  2B  3B  HR   RBI   SB   CS  BB   SO  IBB  HBP  SH  SF  GIDP
0  abercda01    1871      1    TRO  NaN   1    4   0   0   0   0   0   0.0  0.0  0.0   0  0.0  NaN  NaN NaN NaN   0.0
1   addybo01    1871      1    RC1  NaN  25  118  30  32   6   0   0  13.0  8.0  1.0   4  0.0  NaN  NaN NaN NaN   0.0
2  allisar01    1871      1    CL1  NaN  29  137  28  40   4   5   0  19.0  3.0  1.0   2  5.0  NaN  NaN NaN NaN   1.0
3  allisdo01    1871      1    WS3  NaN  27  133  28  44  10   2   2  27.0  1.0  1.0   0  2.0  NaN  NaN NaN NaN   0.0
4  ansonca01    1871      1    RC1  NaN  25  120  29  39  11   3   0  16.0  6.0  2.0   2  1.0  NaN  NaN NaN NaN   0.0
>>> batting_df.groupby("playerID").HR.sum().sort_values(ascending=False)
playerID
bondsba01    762
aaronha01    755
ruthba01     714
rodrial01    696
mayswi01     660
            ... 
mcconra01      0
mccolal01      0
mccluse01      0
mcclula01      0
aardsda01      0
Name: HR, Length: 19689, dtype: int64

CLI tools

Run expectancies

There is a CLI tool for computing run expectancies from Markov chains.

$ python -m pybbda.analysis.run_expectancy.markov.cli --help

This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.

You can assign batting-event probabilities using a sequence of probabilities, or by referencing a player-season with the format {playerID}_{season}, where playerID is the Lahman ID and season is a 4-digit year. For example, to refer to Rickey Henderson's 1982 season, use henderi01_1982.
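The player-season format can be split apart with a few lines of Python. This is a minimal sketch with a hypothetical helper name (parse_player_season); it only illustrates the {playerID}_{season} convention and is not how pybbda itself parses the argument.

```python
def parse_player_season(text):
    """Split '{playerID}_{season}' into (player_id, season).

    Hypothetical helper for illustration; splits on the last underscore
    and requires a 4-digit year.
    """
    player_id, sep, season = text.rpartition("_")
    if not sep or not player_id:
        raise ValueError(f"expected '{{playerID}}_{{season}}', got {text!r}")
    if len(season) != 4 or not season.isdigit():
        raise ValueError(f"season must be a 4-digit year, got {season!r}")
    return player_id, int(season)


if __name__ == "__main__":
    print(parse_player_season("henderi01_1982"))
```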

The lineup is assigned by giving the lineup slot followed by either 5 probabilities or a player-season ID. Lineup slot 0 is a special code that assigns the given value to all nine batters. Any other slots given explicitly (1 through 9) are then filled in with their own values.
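The slot rules can be pictured as follows. This is a sketch under the assumption that the slot-0 default is applied first and specific slots override it regardless of the order in which they are given; build_lineup is a hypothetical name, not a pybbda function.

```python
def build_lineup(assignments):
    """Build a 9-slot lineup from (slot, value) pairs.

    Slot 0 fills all nine slots; slots 1-9 set a single batter.
    Illustrative only: the override order is an assumption.
    """
    lineup = [None] * 9
    # Stable sort puts slot-0 entries first so specific slots override them.
    for slot, value in sorted(assignments, key=lambda item: item[0] != 0):
        if slot == 0:
            lineup = [value] * 9
        elif 1 <= slot <= 9:
            lineup[slot - 1] = value
        else:
            raise ValueError(f"lineup slot must be 0-9, got {slot}")
    return lineup


if __name__ == "__main__":
    default = (0.08, 0.15, 0.05, 0.005, 0.03)
    lineup = build_lineup([(0, default), (1, "henderi01_1982"), (4, "ruthba01_1927")])
    for slot, batter in enumerate(lineup, start=1):
        print(slot, batter)
```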

The number of outs to model is 3 by default. It can be changed by setting the environment variable PYBBDA_MAX_OUTS.
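To give a feel for the kind of calculation involved, here is a much-simplified sketch of a base-out Markov chain for a single 3-out half-inning. It assumes one shared set of probabilities for every batter, assumes the five probabilities mean walk, single, double, triple, and home run in that order, and assumes runners never take an extra base. It is not pybbda's implementation and is not expected to reproduce the CLI's numbers.

```python
from collections import defaultdict

# Assumed event order, for illustration only: walk, single, double, triple, home run.
EVENTS = ("BB", "1B", "2B", "3B", "HR")
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def advance(bases, event):
    """Return (new_bases, runs); bases is a (first, second, third) tuple of 0/1."""
    first, second, third = bases
    if event == "BB":
        # A walk only moves runners who are forced.
        if not first:
            return (1, second, third), 0
        if not second:
            return (1, 1, third), 0
        if not third:
            return (1, 1, 1), 0
        return (1, 1, 1), 1
    # Hits: the batter (position 0) and every runner advance the same number of bases.
    moved = [pos + HIT_BASES[event] for pos in (0, 1, 2, 3) if pos == 0 or bases[pos - 1]]
    runs = sum(1 for pos in moved if pos >= 4)
    return tuple(int(base in moved) for base in (1, 2, 3)), runs


def expected_runs(probs, tol=1e-12):
    """Expected runs in one 3-out half-inning, by propagating state probabilities."""
    p_out = 1.0 - sum(probs)
    if p_out <= 0:
        raise ValueError("event probabilities must sum to less than 1")
    dist = {((0, 0, 0), 0): 1.0}  # (bases, outs) -> probability mass
    total = 0.0
    while dist and sum(dist.values()) > tol:
        nxt = defaultdict(float)
        for (bases, outs), mass in dist.items():
            if outs + 1 < 3:  # the third out ends the half-inning
                nxt[(bases, outs + 1)] += mass * p_out
            for event, p in zip(EVENTS, probs):
                if p == 0:
                    continue
                new_bases, runs = advance(bases, event)
                total += mass * p * runs
                nxt[(new_bases, outs)] += mass * p
        dist = nxt
    return total


if __name__ == "__main__":
    print(round(expected_runs((0.08, 0.15, 0.05, 0.005, 0.03)), 4))
```

The loop tracks the probability of being in each base-out state after each plate appearance and stops once almost all of the probability mass has reached three outs.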

Example: Use a default set of probabilities for all 9 slots, with no extra bases taken

$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 --running-probs 0 0 0 0 
mean score per 27 outs = 3.5227
std. score per 27 outs = 2.8009

Example: Use a default set of probabilities for all 9 slots with default probabilities for taking extra bases

$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03
mean score per 27 outs = 4.2242
std. score per 27 outs = 3.0161

Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff (using 27 outs, instead of 3)

$ PYBBDA_MAX_OUTS=27  python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 4.3628
std. score per 27 outs = 3.0999

Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs, instead of 3)

$ PYBBDA_MAX_OUTS=27  python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927 
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 5.1420
std. score per 27 outs = 3.3996

Contributing

Contributions from the community are welcome. See the contributing guide.

License

GPLv2

pybbda's People

Contributors

bdilday, dependabot[bot]

Forkers

epistemikpython

pybbda's Issues

Statcast data

The package should provide access to Statcast pitch-level data

Refactor DataSource

The DataSource class needs to be refactored to make the interface and the update methods more consistent between different sources (Lahman, Statcast, Fangraphs, etc.) and to make it more straightforward to add new sources.

Linting

The code is currently totally ignoring linting warnings. I don't want this code base to be overly dependent on linting conventions, and I don't want to litter it with #pylint: disable statements, however, a lot of the more egregious linting issues should be fixed. CI should do some additional checks.

Caching in Markov simulations

The lru_cache decorator is not working properly in the event classes. The cached logic should be pulled out into a function so that lru_cache can be used. This will be especially important for tractable runtimes when the Markov simulations are extended to 27 outs. See base_out_state_evolve_fun for an example implementation.
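The issue does not say why the decorator misbehaves. One common cause, offered here only as a guess, is that lru_cache on a method includes self in the cache key, so results are not shared between instances. The sketch below shows the general pattern of moving the work into a module-level function with hashable arguments; advance_runners and HitEvent are hypothetical names, not the package's code.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def advance_runners(bases, n_bases):
    """Pure function of hashable arguments, so cached results are shared by all callers."""
    # Position 0 is the batter; positions 1-3 are occupied bases.
    moved = [pos + n_bases for pos in (0, 1, 2, 3) if pos == 0 or bases[pos - 1]]
    runs = sum(1 for pos in moved if pos >= 4)
    return tuple(int(base in moved) for base in (1, 2, 3)), runs


class HitEvent:
    """Hypothetical event class that delegates to the cached module-level function."""

    def __init__(self, n_bases):
        self.n_bases = n_bases

    def evolve(self, bases):
        # Convert to a tuple so the argument is hashable and usable as a cache key.
        return advance_runners(tuple(bases), self.n_bases)


if __name__ == "__main__":
    print(HitEvent(2).evolve((1, 0, 0)))
    print(HitEvent(2).evolve([1, 0, 0]))  # different instance, same cache entry
    print(advance_runners.cache_info())
```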

install issues

On a new Linux machine I had install issues due to the scikit-build and psycopg2 dependencies. See bdilday/pychadwick#18.

The psycopg issue is,

  • psycopg2 can't be built and installed from pip unless

    • the postgres server or dev libraries are installed, e.g. libpq-dev for the client or postgresql-server-dev-X for the server
    • the python dev libraries, which provide Python.h, are installed, e.g. python3.7-dev
