keberwein / mlbgameday

Multi-core processing of 'Gameday' data from Major League Baseball Advanced Media. Additional tools to parallelize large data sets and write them to a database.

License: Other

Language: R

Topics: parallel-processing, database, baseball, statistics, mlb-gameday, mlbam, etl

mlbgameday's Introduction

mlbgameday

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

Why mlbgameday?

The package is designed to facilitate extract, transform, and load (ETL) operations for MLBAM "Gameday" data. It is optimized for parallel processing of data that may be larger than memory. There are other packages in the R universe that were built to perform statistics and visualizations on these data, but mlbgameday is concerned primarily with data collection. More uses of these data can be found in the pitchRx, openWAR, and baseballr packages.

Install

  • Stable version from CRAN:
install.packages("mlbgameday")
  • The latest development version from GitHub:
devtools::install_github("keberwein/mlbgameday")

Basic Usage

Although the package is optimized for parallel processing, it will also work without registering a parallel backend. When querying only a single day's data, a parallel backend may not provide much additional performance. However, a parallel backend is recommended for larger data sets, as it can make the process several times faster.

library(mlbgameday)

innings_df <- get_payload(start = "2017-04-03", end = "2017-04-04")

Take a peek at the data.

head(innings_df$atbat, 1)
#>   num b s o start_tfs       start_tfs_zulu batter stand b_height pitcher
#> 1   1 2 2 1    170552 2017-04-03T17:05:52Z 543829     L     5-11  544931
#>   p_throws                                                  des
#> 1        R Dee Gordon lines out to left fielder Jayson Werth.  
#>                                                                des_es
#> 1 Dee Gordon batea línea de out a jardinero izquierdo Jayson Werth.  
#>   event_num   event     event_es home_team_runs away_team_runs inning
#> 1        11 Lineout Línea de Out              0              0      1
#>   next_ inning_side
#> 1     Y         top
#>                                                                                                                      url
#> 1 http://gd2.mlb.com/components/game/mlb//year_2017/month_04/day_03/gid_2017_04_03_miamlb_wasmlb_1/inning/inning_all.xml
#>         date                    gameday_link score
#> 1 2017-04-03 /gid_2017_04_03_miamlb_wasmlb_1  <NA>
#>                              play_guid event2 event2_es event3 event3_es
#> 1 76e23666-26f1-4339-967f-c6f759d864f4   <NA>      <NA>   <NA>      <NA>
#>      batter_name      pitcher_name
#> 1 Devaris Gordon Stephen Strasburg
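
The payload is returned as a named list of data frames, one per Gameday table. A quick, minimal way to see which tables came back and how many rows each contains (base R only):

# The payload is a named list of data frames; list the tables and their row counts.
names(innings_df)
sapply(innings_df, nrow)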

Parallel Processing

The package's internal functions are optimized to work with the doParallel package. By default, R uses a single CPU core; the doParallel package lets us use several cores, which execute tasks simultaneously. In a standard regular season for all teams, the function has to process more than 2,400 individual files, which, depending on your system, can take quite some time. Parallel processing speeds this up by several times, depending on how many processor cores we choose to use.

library(mlbgameday)
library(doParallel)

# First we need to register our parallel cluster.
# Set the number of cores to use as the machine's maximum number of cores minus 1 for background processes.
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)  
registerDoParallel(cl)

# Then run the get_payload function as normal.
innings_df <- get_payload(start = "2017-04-03", end = "2017-04-10")

# Don't forget to stop the cluster when finished.
stopImplicitCluster()
rm(cl)

Note: The mlbgameday package is intended for use on a single machine with multiple cores. However, it may be possible to use a cluster of multiple machines as well, as sketched below. For more on parallel processing, please see the package vignettes.
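
For illustration, registering a multi-machine cluster might look like the following. This is only a minimal sketch under assumed conditions: the host names are hypothetical, and each worker would need SSH access plus R and mlbgameday installed; whether this works will depend on your environment.

library(mlbgameday)
library(doParallel)

# Hypothetical host names; each machine needs SSH access with R and mlbgameday installed.
hosts <- c("localhost", "worker1.example.com", "worker2.example.com")
cl <- makePSOCKcluster(hosts)
registerDoParallel(cl)

innings_df <- get_payload(start = "2017-04-03", end = "2017-04-10")

# Stop the cluster when finished.
stopCluster(cl)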

Databases

When collecting several seasons' worth of data, the data may become larger than memory. In that case, the mlbgameday package includes functionality to break the data into "chunks" and load them into a database. Database connections are provided by the DBI package, which supports most modern relational databases. Below is an example that creates a SQLite database in the working directory and populates it with MLBAM Gameday data. Although this technique is fast, it is also system intensive. The authors of mlbgameday suggest loading no more than a single season per R session.

library(mlbgameday)
library(doParallel)
library(DBI)
library(RSQLite)

# First we need to register our parallel cluster.
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)  
registerDoParallel(cl)

# Create the database in our working directory.
con <- dbConnect(RSQLite::SQLite(), dbname = "gameday.sqlite3")

# Collect all games, including pre- and post-season, for the 2016 season.
get_payload(start = "2016-01-01", end = "2017-01-01", db_con = con)

# Don't forget to stop the cluster when finished.
stopImplicitCluster()
rm(cl)

For a more in-depth look at reading and writing to databases, please see the package vignettes.
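
Once loaded, the data can be read back with DBI, or queried lazily with dplyr (via the dbplyr backend). A minimal sketch, reusing the connection from the example above; the table name "atbat" is an assumption, so check dbListTables() first:

library(DBI)
library(dplyr)

# List the tables that were written to the database.
dbListTables(con)

# Lazily query one of them (table name assumed to be "atbat"), then pull it into memory.
atbat_db <- tbl(con, "atbat") %>%
    filter(inning == 1) %>%
    collect()

# Close the connection when finished.
dbDisconnect(con)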

Gameday Data Sets

Those familiar with Carson Sievert's pitchRx package will probably recognize the default data format returned by the get_payload() function. The format was intentionally designed to be similar to the data returned by the pitchRx package, for those who may be keeping persistent databases. The default data set returned is "inning_all"; however, there are several other options, including:

  • inning_hit

  • bis_boxscore

  • game_events

  • linescore

For example, the following will query the linescore data set.

library(mlbgameday)

linescore_df <- get_payload(start = "2017-04-03", end = "2017-04-04", dataset = "linescore")
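
As with the default data set, an alternate data set can be written directly to a database by passing a db_con connection. A minimal sketch, assuming a SQLite connection like the one created in the Databases section above:

library(mlbgameday)
library(DBI)
library(RSQLite)

# Create (or reuse) a SQLite database in the working directory.
con <- dbConnect(RSQLite::SQLite(), dbname = "gameday.sqlite3")

# Write the linescore data set to the database instead of returning it in memory.
get_payload(start = "2017-04-03", end = "2017-04-04", dataset = "linescore", db_con = con)

dbDisconnect(con)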

Visualization

The mlbgameday package is data-centric and does not provide any built-in visualization tools. However, there are several excellent visualization packages available for the R language. Below is a short example of what can be done with ggplot2. For more examples, please see the package vignettes.

First, get the data.

library(mlbgameday)
library(dplyr)

# Grab some Gameday data. We're specifically looking for Jake Arrieta's no-hitter.
gamedat <- get_payload(start = "2016-04-21", end = "2016-04-21")

# Join the pitch table with the atbat table and subset to only Arrieta's pitches.
pitches <- inner_join(gamedat$pitch, gamedat$atbat, by = c("num", "url")) %>%
    subset(pitcher_name == "Jake Arrieta")
library(ggplot2)

# basic example
ggplot() +
    geom_point(data=pitches, aes(x=px, y=pz, shape=type, col=pitch_type)) +
    coord_equal() + geom_path(aes(x, y), data = mlbgameday::kzone)

library(ggplot2)

# basic example with stand.
ggplot() +
    geom_point(data=pitches, aes(x=px, y=pz, shape=type, col=pitch_type)) +
    facet_grid(. ~ stand) + coord_equal() +
    geom_path(aes(x, y), data = mlbgameday::kzone)

Acknowledgements

This package was inspired by the mlbgame Python library by Zach Panzarino, the pitchRx package by Carson Sievert and the openWAR package by Ben Baumer and Gregory Matthews.

mlbgameday's People

Contributors

atroiano, keberwein


mlbgameday's Issues

2019 season data download failed

I tried to download data for the 2019 season, but the download failed. Other years work fine (I tested 2013 through 2018).

This is the error message and my code.

> con2019 <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), dbname = "mlb_2019",
+                       host = "172.28.1.2", port = 5432,
+                       user = "postgres", password = "postgres")

> get_payload(start = "2019-04-01", end = "2019-05-01", async = TRUE, db_con=con2019)

Gathering Gameday data, please be patient...
Processing data chunk 1 of 2
Error: `by` can't contain join column `tfs_zulu`, `inning`, `inning_side`, `des` which is missing from LHS
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
`by` can't contain join column `tfs_zulu`, `inning`, `inning_side`, `des` which is missing from LHS
Backtrace:
 1. mlbgameday::get_payload(...)
 2. mlbgameday::payload.gd_inning_all(urlz)
 4. dplyr:::left_join.tbl_df(...)
 6. dplyr:::common_by.character(by, x, y)
 7. dplyr:::common_by.list(by, x, y)
 8. dplyr:::bad_args(...)
 9. dplyr:::glubort(fmt_args(args), ..., .envir = .envir)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
`by` can't contain join column `tfs_zulu`, `inning`, `inning_side`, `des` which is missing from LHS
Backtrace:
    █
 1. └─mlbgameday::get_payload(...)
 2.   └─mlbgameday::payload.gd_inning_all(urlz)
 3.     ├─dplyr::left_join(...)
 4.     └─dplyr:::left_join.tbl_df(...)
 5.       ├─dplyr::common_by(by, x, y)
 6.       └─dplyr:::common_by.character(by, x, y)
 7.         └─dplyr:::common_by.list(by, x, y)
 8.           └─dplyr:::bad_args(...)
 9.             └─dplyr:::glubort(fmt_args(args), ..., .envir = .envir)

error using get_payload

Using get_payload(start = "2019-07-01", end = "2019-07-10") I'm getting the following error:
Error: by can't contain join column tfs_zulu, inning, inning_side, des which is missing from LHS

Any idea how to solve this issue?
Regards

Updated GIDs / Data refresh

Hi, thanks for pulling this together.
Just curious, do you have code that pulls the updated gids? It looks like you load them into a data file. The 2018 data (shells) are out on the MLB Gameday site, but neither pitchRx nor your code has the game listing pulled in. I was wondering whether you had that before I build code to pull them in myself. Thanks!

Data download Error.

I'm getting an error message from my get_payload command.

innings_df <- get_payload(start = "2017-04-03", end = "2017-04-04")
Gathering Gameday data, please be patient...
Error: by can't contain join column batter which is missing from LHS

Umpire IDs

Hi @keberwein, in which table are the umpire IDs stored? In other words, is the home plate umpire who made the calls described in df$pitch$des not available in the data? I understand you have the script for updating umpire IDs, but I do not see the umpire ID in any of the tables from the get_payload() call.

Hope it's clear what I'm asking.

Table names

Need to line up table names with those output by pitchRx, in case some users want to append to an existing database.

Get_Payload

There appears to be an issue pulling data from seasons 2014 and earlier. Below are two examples of error messages returned when trying to pull data.

Events_14 <- get_payload(start = "2014-04-04", end = "2014-04-05", dataset = "inning_all")
Gathering Gameday data, please be patient...
Error: tfs_zulu = NULL must be a column name or position, not NULL

Events_10 <- get_payload(start = "2010-04-04", end = "2010-04-05", dataset = "inning_all")
Gathering Gameday data, please be patient...
Error: tfs_zulu = NULL must be a column name or position, not NULL

BIS_BOXSCORE Issue

The "gameday_link" returned when using the BIS_BOXSCORE dataset seems to be cut off.

A typical link/id returned using other datasets has the following format:
gid_X_X_X_Xmlb_Xmlb_X

The Gameday_Link returned is missing the last 5 characters of the normal format, which is a problem for double-headers, which end with _2.

Double-Header Example:
gid_2018_06_19_lanmlb_chnmlb_1
gid_2018_06_19_lanmlb_chnmlb_2
Link in BIS_BOXSCORE --->>>> gid_2018_06_19_lanmlb_chn

There's no way to differentiate the two games on this date.

Appending database and obtaining wins

Kris, what commands do I need to use if I want to append my 2017 database to also include 2016 (and other historical data)?

Would I need to separately scrape the linescore dataset in order to merge wins and saves to the play by play data?

Minor league data

When I place league = "aaa" into the get_payload command, only MLB data is pulled. The syntax I use is below:
events <- get_payload(start = "2018-04-01", end = "2018-04-07", league = "aaa")

Is there a different way to get just AAA data? Is it possible that get_payload needs to be modified to access the gdx server?

Linescore Dataset error

I'm getting the following error below.

innings_df <- get_payload(start = "2017-01-01", end = "2018-01-01", dataset = "linescore", db_con = con)
Gathering Gameday data, please be patient...
Processing data chunk 1 of 7
Processing data chunk 2 of 7
Error: Column name mismatch.

No Team ID in get_payload function

I can't seem to find any information about the teams playing in the get_payload function call. It would be nice to have some ID indicating which team each player is on.

2019 season get_payload() Error: Column `on_1b` must be length 4055 (the number of rows) or one, not 0

Expected Behavior

Expected to get a data frame of pitch data for games on 2019-03-28
http://gd2.mlb.com/components/game/mlb/year_2019/month_03/day_28

df = get_payload(start = '2019-03-28', end = '2019-03-28')

Games listed here
https://www.mlb.com/scores/2019-03-28

Current Behavior

The call produces an error message and the download fails.

Gathering Gameday data, please be patient...
Error: Column `on_1b` must be length 4055 (the number of rows) or one, not 0
In addition: Warning messages:
1: NAs introduced by coercion 
2: NAs introduced by coercion 
3: NAs introduced by coercion 
4: NAs introduced by coercion 
5: NAs introduced by coercion 
6: NAs introduced by coercion 

However, the data download succeeds when the date is on or before 2018-10-28.

df = get_payload(start = '2018-10-28', end = '2018-10-28')

Attempted Solution

Tried reinstalling the latest development version of the package from GitHub; this did not solve the issue.

devtools::install_github("keberwein/mlbgameday", force = TRUE)

Context

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doParallel_1.0.14 iterators_1.0.10  foreach_1.4.4     mlbgameday_0.1.4  jsonlite_1.5     
 [6] stringi_1.4.3     RSQLite_2.1.1     DBI_1.0.0         dbplyr_1.2.2      dplyr_0.8.0.1    
[11] config_0.3       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       rstudioapi_0.8   xml2_1.2.0       magrittr_1.5     tidyselect_0.2.5
 [6] bit_1.1-14       R6_2.4.0         rlang_0.3.1      stringr_1.4.0    blob_1.1.1      
[11] tools_3.5.1      yaml_2.2.0       bit64_0.9-7      assertthat_0.2.0 digest_0.6.17   
[16] tibble_2.1.1     crayon_1.3.4     tidyr_0.8.3      purrr_0.3.2      codetools_0.2-15
[21] curl_3.2         memoise_1.1.0    glue_1.3.1       compiler_3.5.1   pillar_1.3.1    
[26] pkgconfig_2.0.2 

Subscript errors

Getting a "subscript out of bounds" error on the following line of code, probably due to problems with the make_gids() function. A fix for the next patch release should be a priority.

innings_df <- get_payload(start = "2017-09-21", end = Sys.Date()-1)
