beanumber / baseball_r

This project forked from maxtoki/baseball_r

Companion to Analyzing Baseball Data with R, 2nd edition

R 100.00%

baseball_r's Introduction

beanumber

personal stuff

Rhythmbox

Install file-organizer
https://lachlandewaard.org/organise-your-music-with-rhythmbox-file-organizer/
https://answers.launchpad.net/rb-fileorganizer/+question/182442
Set a customized file organization scheme using tokens:

gsettings set org.gnome.rhythmbox.library layout-filename '%aN_%tN_%tt'

baseball_r's Issues

errata

jboardman found a couple of small errors:

Top line on page 17: Manny Alexander (not Chris Hoiles) played second base
On page 18, the description for Table 1.6 says September 9, 1995. It should be September 6, 1995.

Chapter 8 lm function error

Getting this error from chapter 8.4.1:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases

I copied the code exactly from this website:

midcareers <- batting_2000 %>%
  group_by(playerID) %>%
  summarize(Midyear = (min(yearID) + max(yearID)) / 2,
            AB.total = first(Career.AB))
batting_2000 %>%
  inner_join(midcareers, by = "playerID") -> batting_2000

------------------------------------------------------------------------

models <- batting_2000 %>%
  split(pull(., playerID)) %>%
  map(~lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data = .)) %>%
  map_df(tidy, .id = "playerID")

I don't know why this error is appearing, everything else from the chapter has worked perfectly so far.
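
One possible cause, offered as a guess rather than a confirmed diagnosis: `lm()` stops with "0 (non-NA) cases" when every row it receives has `NA` in one of the model variables. That would happen here if some player's `OPS` or `Age` values are all missing. The sketch below uses base R and a made-up data frame `toy` standing in for `batting_2000`; `has_complete_rows` is a hypothetical helper, not something from the book.

```r
# Toy stand-in for batting_2000 (values are made up for illustration).
toy <- data.frame(
  playerID = rep(c("a", "b"), each = 4),
  Age      = c(25, 28, 31, 34, 26, 29, 32, 35),
  OPS      = c(0.70, 0.80, 0.82, 0.75, NA, NA, NA, NA)
)

# Hypothetical helper: TRUE when a player has at least one usable row.
has_complete_rows <- function(d) any(complete.cases(d[, c("OPS", "Age")]))

# Split by player and drop players with no usable rows.
by_player <- split(toy, toy$playerID)
usable    <- Filter(has_complete_rows, by_player)

# Fit the quadratic only for players that pass the check.
fits <- lapply(usable, function(d) {
  lm(OPS ~ I(Age - 30) + I((Age - 30)^2), data = d)
})
names(fits)
```

Comparing `names(by_player)` with `names(usable)` shows which players would have triggered the error.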

5.3 Creating the Matrix - getting NAs

Having issues with creating the run expectancy matrix.

Here's my code

data2016 %>%
  mutate(BASES =
           paste(ifelse(BASE1_RUN_ID > '', 1, 0),
                 ifelse(BASE2_RUN_ID > '', 1, 0),
                 ifelse(BASE3_RUN_ID > '', 1, 0), sep = ""),
         STATE = paste(BASES, OUTS_CT)) ->
  data2016

This is the warning message that I think is the root of my issue:
Warning messages:
1: Problem with mutate() column BASES.
ℹ BASES = paste(...).
ℹ ‘>’ not meaningful for factors
2: Problem with mutate() column BASES.
ℹ BASES = paste(...).
ℹ ‘>’ not meaningful for factors
3: Problem with mutate() column BASES.
ℹ BASES = paste(...).
ℹ ‘>’ not meaningful for factors

I have attached a screenshot of what it looks like in the data frame

I have also attached what the output looks like in 5.5 when we analyze Jose Altuve
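
The warning "'>' not meaningful for factors" suggests the runner ID columns were read as factors, in which case each comparison returns `NA` and the pasted `BASES` value fills with NAs. A minimal sketch of one workaround, assuming that diagnosis is right: convert the columns to character before comparing. The data frame `toy` and its values are made up for illustration, and base R is used in place of `mutate()` to keep the example self-contained.

```r
# Toy stand-in for data2016 with runner IDs stored as factors.
toy <- data.frame(
  BASE1_RUN_ID = factor(c("altuj001", "", "")),
  BASE2_RUN_ID = factor(c("", "", "corrc001")),
  BASE3_RUN_ID = factor(c("", "", "")),
  OUTS_CT      = c(0, 1, 2)
)

# Convert each runner column to character before comparing with "".
runner_cols <- c("BASE1_RUN_ID", "BASE2_RUN_ID", "BASE3_RUN_ID")
toy[runner_cols] <- lapply(toy[runner_cols], as.character)

# Same logic as the issue's code: 1 if a runner ID is present, else 0.
toy$BASES <- paste0(
  ifelse(toy$BASE1_RUN_ID > "", 1, 0),
  ifelse(toy$BASE2_RUN_ID > "", 1, 0),
  ifelse(toy$BASE3_RUN_ID > "", 1, 0)
)
toy$STATE <- paste(toy$BASES, toy$OUTS_CT)
toy$STATE
```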

Chapter 5 code

The following code for creating a run expectancy matrix in chapter 5 is giving me an error saying that "> not meaningful for factors" and failing to create the BASES and STATE variables

data2016 %>%
  mutate(BASES =
           paste(ifelse(BASE1_RUN_ID > ' ', 1, 0),
                 ifelse(BASE2_RUN_ID > ' ', 1, 0),
                 ifelse(BASE3_RUN_ID > ' ', 1, 0), sep = ' '),
         STATE = paste(BASES, OUTS_CT)) ->
  data2016
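
If the data were loaded with `read.csv()` in a version of R older than 4.0, text columns became factors by default, which would explain this message. A hedged sketch of avoiding that at read time follows. The inline CSV is made up and stands in for the real event file, and the comparison uses an empty string with `sep = ""` rather than the single spaces in the code above.

```r
# Inline CSV standing in for the real event file (contents are made up).
csv_text <- paste(
  "BASE1_RUN_ID,BASE2_RUN_ID,BASE3_RUN_ID,OUTS_CT",
  "altuj001,,,0",
  ",corrc001,,1",
  ",,sprig001,2",
  sep = "\n"
)

# stringsAsFactors = FALSE keeps the ID columns as character vectors.
events <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With character columns, comparing against "" works without warnings.
events$BASES <- paste(
  ifelse(events$BASE1_RUN_ID > "", 1, 0),
  ifelse(events$BASE2_RUN_ID > "", 1, 0),
  ifelse(events$BASE3_RUN_ID > "", 1, 0),
  sep = ""
)
events$BASES
```

`readr::read_csv()`, which the book's second edition generally uses, does not create factors unless asked to.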

Chapter 3 code

I'm a beginner here, so please excuse my naivety.

It appears that the book relies on outdated files from the Lahman database? The Lahman database that I've downloaded contains only a 'HallOfFame.csv' file and not the separate hofbatting.csv and hofpitching.csv files that the book implies.

In order to get around this I read in the data through:
hof <- read_csv("Documents/R Project/Baseball/Lahman/core/HallOfFame.csv")

But the next code in the sequence I can't seem to navigate around. The message I keep receiving is "Error: object 'From' not found" for this code:
hof$MidCareer <- with(hof, (From + To) / 2)
hof$Era <- cut(hof$MidCareer,
               breaks = c(1800, 1900, 1919, 1941, 1960, 1976, 1993, 2050),
               labels = c("19th Century", "Dead Ball", "Lively Ball", "Integration", "Expansion", "Free Agency", "Long Ball"))

I've tried to find answers online about the 'From + To' function of the code but can't seem to find anything relevant. There must be an easy solution to this, but I'm a beginner so I'm unaware of any easy fixes. Any help with this problem would be much appreciated. Thanks.
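
`From` and `To` are not functions: `with(hof, ...)` looks them up as columns of `hof`, and Lahman's HallOfFame.csv holds voting results rather than career start and end years, so the lookup fails. One way around it, sketched here under assumptions, is to derive the first and last season per player from batting records. The `batting` data frame and its values are made up, with column names following Lahman's `playerID` and `yearID`.

```r
# Toy stand-in for Lahman batting records (values are made up).
batting <- data.frame(
  playerID = c("p1", "p1", "p1", "p2", "p2"),
  yearID   = c(1914, 1920, 1935, 1954, 1976)
)

# First and last season per player, named From and To to match the book's code.
first_year <- tapply(batting$yearID, batting$playerID, min)
last_year  <- tapply(batting$yearID, batting$playerID, max)
career <- data.frame(
  playerID = names(first_year),
  From     = as.vector(first_year),
  To       = as.vector(last_year)
)

# The book's code then runs unchanged on the derived columns.
career$MidCareer <- with(career, (From + To) / 2)
career$Era <- cut(career$MidCareer,
                  breaks = c(1800, 1900, 1919, 1941, 1960, 1976, 1993, 2050),
                  labels = c("19th Century", "Dead Ball", "Lively Ball",
                             "Integration", "Expansion", "Free Agency",
                             "Long Ball"))
career
```

Running `names(hof)` on the real data is a quick way to confirm which columns are actually present.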

parse_retrosheet_pbp could not open file

Please see the attached screenshot.

I cannot get the parse_retrosheet_pbp function to create an "all" csv for downloaded data. I have "retrosheet > unzipped > cwevent" in place. However, the csv generated contains no data.

Please advise on the steps needed to resolve this.
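
An empty CSV is consistent with, though not proof of, R being unable to find the `cwevent` program: the shell creates the redirected output file first, then the command fails and nothing is written to it. A minimal check, assuming `cwevent` is meant to be reachable through the system PATH:

```r
# Sys.which() returns "" when the program cannot be found on the PATH.
cwevent_path <- Sys.which("cwevent")

if (nzchar(cwevent_path)) {
  message("cwevent found at: ", cwevent_path)
} else {
  message("cwevent not found on the PATH; the shell command would fail ",
          "and the redirected CSV would be left empty.")
}
```

Placing the executable inside the unzipped folder does not by itself put it on the PATH on every platform.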

Chapter 7.2 - pitchRx Data Issue

Having issues with the first step in 7.2 where we create an empty SQLite database using src_sqlite()

db <- src_sqlite("~/Desktop/Data Analysis/Analyzing Baseball Data with R/baseball_R/data/pitchrx.sqlite", create = TRUE)
this code leads to the error:
Error in (function (cond) :
error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': there is no package called ‘RSQLite’

I then use the tbl() function which was prompted since src_sqlite is not available anymore:
db <- tbl("~/Desktop/Data Analysis/Analyzing Baseball Data with R/baseball_R/data/pitchrx.sqlite", create = TRUE)
which then leads to the error:
Error in UseMethod("tbl") :
no applicable method for 'tbl' applied to an object of class "character"

I'm curious if I entered the wrong arguments in the tbl() function, or if there is another way to get through this.
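
The first error says the RSQLite package is not installed. The second arises because `tbl()` expects a data source such as a database connection, not a file path, and it has no `create` argument. A hedged sketch of opening (and thereby creating) a SQLite file through DBI follows. It assumes DBI and RSQLite are installed, and uses a temporary file as a stand-in for the real pitchrx.sqlite path.

```r
# Assumes the DBI and RSQLite packages are installed, e.g. via
# install.packages(c("DBI", "RSQLite")).
library(DBI)

# Hypothetical path; a temporary file stands in for pitchrx.sqlite.
db_path <- tempfile(fileext = ".sqlite")

# Connecting to a path that does not exist yet creates an empty database.
con <- dbConnect(RSQLite::SQLite(), dbname = db_path)

# Write a tiny made-up table so there is something to list.
dbWriteTable(con, "demo", data.frame(x = 1:3))
tables <- dbListTables(con)
tables

# With dplyr loaded, dplyr::tbl(con, "demo") would return a lazy table.
dbDisconnect(con)
```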

fix errata from 1st edition

1. Page 3, line 6 from bottom. count should be counts.
2. Page 4, second sentence in the first paragraph in Section 1.2.3 should be changed to:

It contains bibliographical information on every player and manager who have appeared at the Major League Baseball level and of all people who have been inducted in the Baseball Hall of Fame.

3. Page 8, second line in last paragraph should be changed to:

Table 1.4 displays statistics from the data file Pitching.csv for the seasons where Ruth was a pitcher.

4. Page 13, first line in Section 1.2.8. Change to

The following questions can be answered with Lahman's database.

5. Page 13, first Answer paragraph in Section 1.2.8.

Replace "per games" with "per game" (two times).

6. Page 16, replace this paragraph:

This table displays team statistics \footnote{Some of the less important statistics, such as Catcher Interference, have been omitted in Table \ref{tab:gamelog}} as well as the players' identities and fielding positions for the home team; similar statistics and player information are available for the visitor team.

with this paragraph:

This table displays team statistics \footnote{Some other team statistics, such as Stolen Bases and Caught Stealings, omitted in Table \ref{tab:gamelog}, are reported in Game log files.} as well as the players' identities and fielding positions for the home team; similar statistics and player information are available for the visitor team.

7. Page 30, Section 2.2, second paragraph.

rstudio.org should be rstudio.com

8. Page 42, Section 2.5.2

Change "features of R" to "feature of R"

9. Page 49, line 3.

"ball in play" should be "balls in play"

10. Page 57, Exercises 3 title and 4 title and part (a).

Change "350 Wins" to "350-Wins" (three times)

11. Page 60. The code on this page should read

hof <- read.csv("hofbatting.csv")
hof$MidCareer <- with(hof, (From + To) / 2)
hof$Era <- cut(hof$MidCareer,
               breaks = c(1800, 1900, 1919, 1941, 1960, 1976, 1993, 2050),
               labels = c("19th Century", "Dead Ball", "Lively Ball",
                          "Integration", "Expansion", "Free Agency",
                          "Long Ball"))

12. Page 64, line 8.

"dotplot" should be "dot plot"

13. Chapter 5

In many places, "runs expectancy" should be replaced with "run expectancy". Similarly replace "runs value" with "run value" throughout this chapter.

14. Page 113, Figure 5.1.

Replace "Dotplot" with "Stripchart".

15. Page 134, Section 6.2.5, line 4

Replace "group argument" with "groups argument"

16. Page 165, line 10.

Replace "appearance of the line" with "appearance of the line,"

17. Page 187, Section 8.1, line 3.

Replace "pitching statistics as from his MLB" with "pitching statistics from his MLB"

18. Page 195, last sentence.

Replace "In addition, one adds the difference between the fielding position values of the two players." with "In addition, one subtracts the absolute value of the difference between the fielding position values of the two players."

19. Page 302, Section A.1.4, line 3.

Replace "field.csv" with "fields.csv"

Career Trajectory Chart

The career trajectory chart and line fit should exclude active players. You can see that the curve descends surprisingly rapidly. This is explained by the inclusion of players who are still in the middle of their careers and have not yet reached their peak.

I'm using the 2021 data, but the concept should be the same, and my "before" plot looks similar to the one in the book.

Original plot, from L247 of trajectories.R

After removing players whose final year was earlier than 2018

This can be seen by calculating the final year for each playerID and plotting that per year and observing the large spike in the most recent year, which makes sense intuitively.
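
The filtering described above can be sketched as follows. This is an illustration under assumptions, not the code used for the plots: the `toy` data and the `cutoff` of 2018 are stand-ins, and a player is treated as possibly active when their final season is at or after the cutoff.

```r
# Toy stand-in for the batting data (values are made up).
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b", "c", "c"),
  yearID   = c(2001, 2005, 2010, 2015, 2021, 2012, 2017)
)

# Assumption: a final season of 2018 or later means "possibly still active".
cutoff <- 2018

# Final season for each player.
final_year <- tapply(toy$yearID, toy$playerID, max)

# Keep only players whose careers ended before the cutoff.
finished_ids <- names(final_year)[final_year < cutoff]
finished     <- toy[toy$playerID %in% finished_ids, ]

# Number of careers ending in each year; active players pile up in the latest year.
table(final_year)
```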

error message I don't understand

Hi, I'm working my way through Analyzing Baseball Data with R. When working on the first exercise in Chapter 2, I'm getting the following error message even after typing the code straight from the answer given here (which is what I had done in the first place). What am I doing wrong?

SB.Attempt = SB + CS
Error in SB + CS : non-numeric argument to binary operator
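
This error usually means that at least one of `SB` or `CS` is not numeric, for example a character vector created by quoting the numbers or by reading them in as text. A minimal sketch of checking and converting, using made-up values:

```r
# Toy vectors as they might look if entered or read as text (made-up values).
SB <- factor(c("10", "25", "3"))
CS <- c("2", "5", "1")

# Inspect the types first; neither is numeric here.
class(SB)
class(CS)

# as.character() comes first so factor levels are not turned into level codes.
SB <- as.numeric(as.character(SB))
CS <- as.numeric(as.character(CS))

SB.Attempt <- SB + CS
SB.Attempt
```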

Issues with parse_retrosheet_pbp() files

Hi Ben,

When I tried to duplicate the script at the beginning of Chapter 5, it seems that the "all2016.csv" file has no data in it. The "ros2016.csv" works fine. I tried this for other years as well (1950, as mentioned in the appendix, and 2018); the "all1950.csv" and "all2018.csv" files have no data in them, but the roster files do.

Attached is a screenshot for your reference. Thanks!

Chapter 3: Graph of the 1998 home run race

Hey man,

I am working through the lessons and homework and I am having trouble running the graph in section 3.8.3. I keep getting the following error.

ggplot(hr_ytd, aes(Date, cumHR, linetype = nameLast)) +
  geom_line() +
  geom_hline(yintercept = 62, color = crcblue) +
  annotate("text", ymd("1998-04-15"), 65,
           label = "62", color = crcblue) +
  ylab("Home Runs in the Season")

Error in xj[i] : object of type 'closure' is not subsettable

I believe it has something to do with R not being able to subset a function. I am thinking the error is being caused by this line of code.

library(lubridate)
cum_hr <- function(d) {
  d %>%
    mutate(Date = ymd(str_sub(GAME_ID, 4, 11))) %>%
    arrange(Date) %>%
    mutate(HR = ifelse(EVENT_CD == 23, 1, 0),
           cumHR = cumsum(HR)) %>%
    select(Date, cumHR)
}

I am brand new to R so if I am making an obvious mistake I apologize in advance :D
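
"Closure" is R's term for an ordinary function, so this error appears when code tries to index a function as if it were data. One way that can happen in a plot, offered as a possibility rather than a diagnosis, is a name inside `aes()` that is missing from the data frame and resolves to a function of the same name instead. The sketch below checks for missing columns; `missing_cols` is a hypothetical helper and `hr_ytd_toy` is made up.

```r
# Hypothetical helper: which of the expected columns are absent from a data frame?
missing_cols <- function(d, expected) setdiff(expected, names(d))

# Toy stand-in for hr_ytd that lacks the nameLast column (values are made up).
hr_ytd_toy <- data.frame(
  Date  = as.Date(c("1998-04-01", "1998-04-02")),
  cumHR = c(1, 2)
)

# Any name returned here would not be found as a column by aes().
absent <- missing_cols(hr_ytd_toy, c("Date", "cumHR", "nameLast"))
absent

# The colour used in the plot should also exist and be a value, not a function.
exists("crcblue") && !is.function(get("crcblue"))
```

Running the same check on the real `hr_ytd` shows whether the earlier steps of section 3.8.3 produced all three columns.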

Appendix A - Analyzing Baseball Data with R

Greetings,

I am having a problem with the following code:

library(tidyverse)
library(retrosheet)

download_retrosheet <- function(season) {
  # get zip file from retrosheet website
  download.file(
    url = paste0(
      "http://www.retrosheet.org/events/", season, "eve.zip"),
    destfile = file.path("retrosheet", "zipped",
                         paste0(season, "eve.zip"))
  )
}

unzip_retrosheet <- function(season) {
  # unzip retrosheet files
  unzip(file.path("retrosheet", "zipped",
                  paste0(season, "eve.zip")),
        exdir = file.path("retrosheet", "unzipped"))
}

create_csv_file <- function(season) {
  # http://chadwick.sourceforge.net/doc/cwevent.html
  # shell("cwevent -y 2000 2000TOR.EVA > 2000TOR.bev")
  wd <- getwd()
  setwd("retrosheet/unzipped")
  # the wildcards match every event file for the season
  cmd <- paste0("cwevent -y ", season, " -f 0-96 ",
                season, "*.EV*", " > all", season, ".csv")
  message(cmd)
  if (.Platform$OS.type == "unix") {
    system(cmd)
  } else {
    shell(cmd)
  }
  setwd(wd)
}

create_csv_roster <- function(season) {
  # creates a CSV file of the rosters
  rosters <- list.files(
    path = file.path("retrosheet", "unzipped"),
    pattern = paste0(season, ".ROS"),
    full.names = TRUE)

  rosters %>%
    map_df(read_csv,
           col_names = c("PlayerID", "LastName", "FirstName",
                         "Bats", "Pitches", "Team")) %>%
    write_csv(path = file.path("retrosheet",
                               "unzipped",
                               paste0("roster", season, ".csv")))
}

cleanup <- function() {
  # removes retrosheet files not needed
  files <- list.files(
    path = file.path("retrosheet", "unzipped"),
    pattern = "(.EV|.ROS|TEAM*)",
    full.names = TRUE
  )
  unlink(files)

  zips <- list.files(
    path = file.path("retrosheet", "zipped"),
    pattern = "*.zip",
    full.names = TRUE
  )
  unlink(zips)
}

parse_retrosheet_pbp <- function(season) {
  download_retrosheet(season)
  unzip_retrosheet(season)
  create_csv_file(season)
  create_csv_roster(season)
  cleanup()
}

parse_retrosheet_pbp(1950)

After running the function parse_retrosheet_pbp(1950), RStudio gives me the following message:

cwevent -y 1950 -f 0-96 1950*.EV* > all1950.csv
'cwevent' is not recognized as an internal or external command,
operable program or batch file.
Warning messages:
1: In download.file(url = paste0("http://www.retrosheet.org/events/", :
URL http://www.retrosheet.org/events/1950eve.zip: cannot open destfile 'retrosheet/zipped/1950eve.zip', reason 'No such file or directory'
2: In download.file(url = paste0("http://www.retrosheet.org/events/", :
download had nonzero exit status
3: In unzip(file.path("retrosheet", "zipped", paste0(season, "eve.zip")), :
error 1 in extracting from zip file
4: In shell(cmd) :
'cwevent -y 1950 -f 0-96 1950*.EV* > all1950.csv' execution failed with error code 1
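
Two separate problems appear in this output. The first warning reports that the destination file cannot be opened ("No such file or directory"), which suggests the retrosheet/zipped folder does not exist under the working directory. The "'cwevent' is not recognized" line indicates the Chadwick `cwevent` program is not on the PATH. A minimal sketch of creating the folders the functions above expect; a temporary directory stands in for the real project folder, which is an assumption made to keep the example safe to run.

```r
# Assumption: a temporary directory stands in for the project folder.
base_dir <- tempdir()

# The functions above expect these two folders to exist already.
needed <- file.path(base_dir, "retrosheet", c("zipped", "unzipped"))

for (d in needed) {
  # recursive = TRUE also creates the parent "retrosheet" folder.
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

dir.exists(needed)
```

In a real session the same `dir.create()` calls would be run relative to `getwd()` before calling `parse_retrosheet_pbp()`.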

Chapter 3, all 1998 retrosheet data file

Hi, I have been working through Chapter 3 on graphics. I am on page 86, which provides code to read the all1998 retrosheet data file, but that file doesn't exist in the data folder. Am I missing that file somewhere?

Framing, PitchRx

Working on chapter 7 and pitch framing. I and a few others have had issues scraping pitchrx. Is there a workaround?