jrnold / r4ds-exercise-solutions

318.0 17.0 229.0 112.54 MB

Exercise solutions to "R for Data Science"

Home Page: https://jrnold.github.io/r4ds-exercise-solutions

License: Creative Commons Attribution 4.0 International

R 59.94% CSS 6.18% TeX 12.40% Dockerfile 2.35% Shell 18.27% HTML 0.86%

data-science exercise-solutions tidyverse r ggplot2 dplyr r4ds tidyr rmarkdown bookdown

r4ds-exercise-solutions's Introduction

Exercise Solutions to R for Data Science

These are solutions to the 1st edition of R for Data Science. The solutions to the 2nd edition of R for Data Science are available at R for Data Science (2e) - Solutions to Exercises.

This repository contains the code and text behind the Solutions for R for Data Science, which, as its name suggests, has solutions to the exercises in R for Data Science by Garrett Grolemund and Hadley Wickham.

The R packages used in this book can be installed via

devtools::install_github("jrnold/r4ds-exercise-solutions")

Contributing

Work on this repo has effectively stopped since the 2nd edition of R for Data Science has been published. Please direct your contributions to R for Data Science (2e) - Solutions to Exercises.

Build

The site is built using the bookdown package and pandoc.

r4ds-exercise-solutions's Issues

Badly-formed div

On page https://jrnold.github.io/r4ds-exercise-solutions/data-import.html, at the tail of Exercise 11.3.5, the text div> appears - I presume this is a badly-formed <div> in the source.

Incorrect chapter numbering

This solutions book is missing one chapter from R for Data Science: 9 - Introduction. The chapter numberings following this chapter are off by one as a result.

Great repo! Thanks for making this available to everyone

fixing Exercise 20.3.4

semantic error

See the value of looking at the value of
the code chunk
1. the first line x <- seq(-10, 10, by = 0.5) is redundant
2. round2() doesn't use the to_even parameter.

Formatting glitch in Exercise 21.5.3.4

There appears to be an extra > in one part:

and an extra line break in another:

Dangling sentence fragment

On https://jrnold.github.io/r4ds-exercise-solutions/tidy-data.html, in the description of Exercise 12.3.4, there is a dangling sentence fragment "I will" in it is often preferable to store them as logical vectors. I will

Should describe purpose as "string has prefix" not "function has prefix"

In https://jrnold.github.io/r4ds-exercise-solutions/functions.html, Exercise 19.3.1.1, "The function f1 returns whether a function has a common prefix." <- should be "a vector of character strings" rather than "function".
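To illustrate the corrected wording, here is a minimal sketch of a prefix check along the lines of the book's f1, applied to a character vector. The name has_prefix and the sample strings are illustrative assumptions, not taken from the solutions.

```r
# Hypothetical stand-in for f1: for each element of `string`,
# report whether it starts with `prefix`.
has_prefix <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

words <- c("str_c", "str_detect", "paste")  # made-up sample input
result <- has_prefix(words, "str_")
print(result)
```

The result is a logical vector with one element per input string, which is why "a vector of character strings" describes the argument better than "function".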

3.9.4 answer typo

the text reads as:
"If we didn’t include geom_point, then the line is no longer at 45 degrees:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline()"

geom_point() was not excluded, coord_fixed() was.

Better way to supply packages needed in the answer

It's good practice to explicitly load the needed packages with library() at the beginning of a chapter. But sometimes that might be too far away from where they are used; for example, we need lubridate for Exercise 19.4.2.

print source code of `rlang::set_names_impl()` for Exercise 20.4.3

User can't learn that

set_names adds a few sanity checks: x has to be a vector, and the lengths of the object and the names have to be the same.

from merely

purrr::set_names
#> function (x, nm = x, ...) 
#> {
#>     set_names_impl(x, x, nm, ...)
#> }
#> <bytecode: 0x7fc3439f4bf0>
#> <environment: namespace:rlang>

Missing closing back quote

In https://jrnold.github.io/r4ds-exercise-solutions/relational-data.html, Exercise 13.1.4, there is a missing back quote in:

It would match to the year, month, day columns of `flights.

Question asks for sum, solution uses mean

In https://jrnold.github.io/r4ds-exercise-solutions/vectors.html, exercise 20.4.6.1, the question asks, "What about sum(!is.finite(x))?", but the solution says, "The expression mean(!is.finite(x))..." (i.e., uses mean instead of sum).
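The difference between the two expressions can be seen on a small vector. This is only a sketch; the vector x below is an assumed example, not the one used in the solutions.

```r
# Assumed example vector with finite, missing, and infinite values.
x <- c(0, NA, NaN, Inf, -Inf, 1)

# sum() counts the elements that are not finite.
n_not_finite <- sum(!is.finite(x))

# mean() gives the proportion of elements that are not finite.
prop_not_finite <- mean(!is.finite(x))

print(n_not_finite)
print(prop_not_finite)
```

Here NA, NaN, Inf, and -Inf are all reported as not finite, so the count is 4 and the proportion is 4 out of 6.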

Complete Ch 27 (R Rmarkdown) exercises

Diagrams not rendering in vector solutions

In https://jrnold.github.io/r4ds-exercise-solutions/vectors.html, Exercise 20.5.4.1, Firefox displays this:

add an additional explanation for Exercise 11.2.4

Your answer is pretty good. But you might add that read_csv() also has a quote parameter

read_csv("x,y\n1,'a,b'", quote = "'")

# A tibble: 1 x 2
      x y    
  <int> <chr>
1     1 a,b

add answer to 3.5 - Exercise 4

Numbering mismatch in chapter 13

The exercises under 13.2.1 in http://r4ds.had.co.nz/relational-data.html show up under 13.1.1 in https://jrnold.github.io/r4ds-exercise-solutions/relational-data.html - subsequent sections of this chapter are similarly offset (e.g., 13.3.1 in R4DS is 13.2.1 in solutions).

Complete the answer for Exercise 20.3.3

https://jrnold.github.io/r4ds-exercise-solutions/vectors.html#exercise-20.3.3

Badly-formatted footnote (maybe) in vector chapter

The footer of https://jrnold.github.io/r4ds-exercise-solutions/vectors.html displays this in Firefox:

Is this an unclosed bracket for a footnote?

Complete Ch 19 Exercises

add answer for Exercise 27.3.1

Typo: "an" should be "a"

In https://jrnold.github.io/r4ds-exercise-solutions/vectors.html, exercise 20.4.6.3, "You can name an vector with itself" should be "a vector" (not "an vector").

Typo: "messing" instead of "missing"

In https://jrnold.github.io/r4ds-exercise-solutions/tidy-data.html, Exercise 12.6.2, the word "messing" in we see that sexage is messing should be "missing".

5.2.4 Exercise 1.1

Question 1: Had an arrival delay of two or more hours
provided answer:
filter(flights, arr_delay > 120)

"Two or more hours" includes flights delayed exactly two hours, so the comparison should be inclusive:

filter(flights, arr_delay >=120)
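The boundary case can be checked on a toy table. This is a sketch under the assumption that dplyr is installed; the tibble toy and its values are made up for illustration and are not rows from flights.

```r
library(dplyr)

# Made-up delays straddling the two-hour (120 minute) boundary.
toy <- tibble(flight = 1:3, arr_delay = c(119, 120, 121))

strict <- filter(toy, arr_delay > 120)      # drops the flight at exactly 120
inclusive <- filter(toy, arr_delay >= 120)  # keeps it

print(strict)
print(inclusive)
```

Only the inclusive version keeps the flight delayed by exactly 120 minutes.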

where is the graph for Exercise 20.5.1

https://jrnold.github.io/r4ds-exercise-solutions/vectors.html#recursive-vectors-lists#exercise-20.5.1

5.2.4 Exercise 1.5 and 1.6

Question: Arrived more than two hours late, but didn’t leave late
answer provided:
filter(flights, !is.na(dep_delay), dep_delay <= 0, arr_delay > 120)

This works properly, but the !is.na(dep_delay) is not needed. filter only includes rows for which the condition is true.

From the R4DS text: "filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values."

question 1.6 has the same issue.
question: Were delayed by at least an hour, but made up over 30 minutes in flight
provided answer:
filter(flights, !is.na(dep_delay),
dep_delay >= 60, dep_delay - arr_delay > 30)
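A small check of the claim that the !is.na() test is redundant, assuming dplyr is installed. The tibble toy is a made-up example that includes one missing dep_delay.

```r
library(dplyr)

# Made-up rows; the second has a missing departure delay.
toy <- tibble(
  dep_delay = c(-2, NA, 0, 90),
  arr_delay = c(130, 200, 150, 125)
)

with_check <- filter(toy, !is.na(dep_delay), dep_delay <= 0, arr_delay > 120)
without_check <- filter(toy, dep_delay <= 0, arr_delay > 120)

# filter() drops rows where the condition is NA, so both give the same rows.
print(identical(with_check, without_check))
```

The row with the missing dep_delay is excluded either way, because NA <= 0 evaluates to NA and filter() keeps only rows where the condition is TRUE.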

Complete Ch 21 Exercises

Add Dockerfile

Add Dockerfile to help reproducibly build the book.

improve Exercise 20.4.5

the explanation for x[x <= 0] is quite good, but I suppose we should go into more details for x[-which(x > 0)] , like

> x <- c(-5:5, Inf, -Inf, NaN, NA)

> x > 0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE    NA    NA

> which(x > 0)
[1]  7  8  9 10 11 12

> -which(x > 0)
[1]  -7  -8  -9 -10 -11 -12
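Continuing from the steps above, the two subsetting expressions can be compared directly. This sketch uses the same x as the issue; the variable names by_logical, by_which, and y are illustrative.

```r
x <- c(-5:5, Inf, -Inf, NaN, NA)

# Logical subsetting: NaN <= 0 is NA, and an NA index yields NA.
by_logical <- x[x <= 0]

# which() ignores NA, so negative indexing only drops the positive values
# and leaves NaN and NA untouched.
by_which <- x[-which(x > 0)]

print(by_logical)
print(by_which)

# Edge case: with no positive values, which() returns integer(0),
# and negative indexing with it returns an empty vector.
y <- c(-1, -2)
empty <- y[-which(y > 0)]
print(empty)
```

The two results have the same length, but the NaN survives only in the which() version; in the logical version it is replaced by NA.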

answer for 5.7 - Exercise 5

I think the alternative code should be

flights %>%
    filter(!is.na(arr_delay), arr_delay > 0) %>%  
    group_by(dest) %>%
    mutate(total_delay = sum(arr_delay - min(arr_delay)),
           prop_delay = (arr_delay - min(arr_delay)) / total_delay) %>%
    select(
        (year:day),
        dep_delay,
        total_delay,
        prop_delay
    )

Complete Ch 28 (Graphics for Communication) Exercises

answer for 5.7 - Exercise 6

Although the question asks us to "Look at each destination", I think it is odd not to take origin into account.

flights %>%
    filter(!is.na(air_time)) %>%
    group_by(dest, origin) %>%
    mutate(med_time = median(air_time),
           fast = (air_time - med_time) / med_time) %>%
    arrange(fast) %>%
    select(dest, origin, air_time, med_time, fast, dep_time, sched_dep_time, arr_time, sched_arr_time) %>%
    head(15)

Fortunately, the results do not seem to be affected much.

# A tibble: 15 x 9
# Groups:   dest, origin [13]
   dest  origin air_time med_time   fast dep_time sched_dep_time arr_time sched_arr_time
   <chr> <chr>     <dbl>    <dbl>  <dbl>    <int>          <int>    <int>          <int>
 1 BOS   LGA          21       37 -0.432     1450           1500     1547           1608
 2 ATL   LGA          65      112 -0.420     1709           1700     1923           1937
 3 GSP   EWR          55       92 -0.402     2040           2025     2225           2226
 4 BNA   EWR          70      113 -0.381     1914           1910     2045           2043
 5 BOS   LGA          23       37 -0.378     1954           2000     2131           2114
 6 MSP   EWR          93      149 -0.376     1558           1513     1745           1719
 7 CVG   EWR          62       95 -0.347     1359           1343     1523           1545
 8 RIC   EWR          35       53 -0.340     1812           1639     1942           1812
 9 BUF   JFK          38       57 -0.333     2307           2250       34              8
10 BOS   JFK          26       38 -0.316     1200           1200     1254           1313
11 ROC   JFK          35       51 -0.314     2340           2250      120              5
12 ORF   JFK          36       52 -0.308     1720           1645     1820           1820
13 PIT   LGA          40       57 -0.298     1557           1610     1723           1755
14 BOS   LGA          26       37 -0.297     1711           1700     1827           1813
15 DCA   JFK          34       48 -0.292     1104           1105     1158           1215

correct exercise number for chapter 28-30

In some chapters, like https://jrnold.github.io/r4ds-exercise-solutions/functions.html, you use "Exercise 19.2.1", in other chapters, like https://jrnold.github.io/r4ds-exercise-solutions/r-markdown.html, you use "Exercise 27.1.1.1".

I suppose that chapter 28-30 should be modified (but I might have missed some other chapters)

Duplicated "a"

In https://jrnold.github.io/r4ds-exercise-solutions/functions.html, Exercise 19.2.1.1., "note that by a a single missing value" <- duplicated "a".

3.9.3 typo in answer

The answer provided is:
coord_map() uses map projection to project 3-dimensional Earth onto a 2-dimensional plane. By default, coord_map() uses the Mercator projection. However, this projection must be applied to all geoms in the plot. coord_quickmap() uses a faster, but approximate map projection. This approximation ignores the curvature of Earth and adjusts the map for the latitude/longitude ratio. This transformation is quicker than the because the shapes do not need to be transformed.

There is a typo in the last sentence. It should read something more like: "This transformation is quicker than coord_map() because the shapes do not need to be transformed."

answer for 5.7 - Exercise 7

we need to "Find all destinations that are flown by at least two carriers"

(two_more_carriers <- flights %>%
    group_by(dest) %>% 
    mutate(n_carrier = n_distinct(carrier)) %>% 
    filter(n_carrier >= 2) %>% 
    select(-n_carrier) %>%
    ungroup())

# A tibble: 325,397 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138
# ... with 325,387 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

then "rank the carriers"

two_more_carriers %>% 
    group_by(carrier) %>%
    summarise(n_dest = n_distinct(dest)) %>%
    arrange(desc(n_dest))

# A tibble: 16 x 2
   carrier n_dest
   <chr>    <int>
 1 EV          51
 2 9E          48
 3 UA          42
 4 DL          39
 5 B6          35
 6 AA          19
 7 MQ          19
 8 WN          10
 9 OO           5
10 US           5
11 VX           4
12 YV           3
13 FL           2
14 AS           1
15 F9           1
16 HA           1

Actually, it took me a while to understand your code. I found that count() can be quite confusing when you have to group by two variables. Using n_distinct() and naming the new variable myself inside summarise() makes the process clear and straightforward.

Exercise 4.3.4

If x = Inf, then x * 0 = NaN. That's why NA * 0 is missing, I believe.
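The reasoning can be checked at the console. A minimal sketch; the variable names are illustrative only.

```r
# Inf * 0 is not 0, so an unknown value times 0 cannot be assumed to be 0.
inf_times_zero <- Inf * 0
na_times_zero <- NA * 0

# By contrast, anything to the power 0 is 1, so NA^0 is not missing.
na_pow_zero <- NA^0

print(inf_times_zero)
print(na_times_zero)
print(na_pow_zero)
```

Because the missing value could stand for Inf, R cannot settle on a single answer for NA * 0 and returns NA.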

answer for 7.5.3 - Exercise 1

There are two

Plotting the density instead of counts will make the distributions comparable, although the bins with few observations will still be hard to interpret.

I think the second one should be talking about using cut_number() vs. cut_width()

improve answer for 20.3.3

In https://jrnold.github.io/r4ds-exercise-solutions/vectors.html#exercise-20.3.3, if you want to show

However, you can represent that value (exactly) with a numeric vector at the cost of about two times the memory.

you should show

> as.numeric(.Machine$integer.max) + 1
[1] 2147483648

rather than

as.numeric(.Machine$integer.max) + 1
#> [1] 2.15e+09
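One way to display the exact value regardless of the session's print settings is sketched below. The short form 2.15e+09 is presumably a consequence of a reduced digits option in the book's build, which is an assumption on my part.

```r
# One more than the largest integer, stored as a double.
big <- as.numeric(.Machine$integer.max) + 1

# format() with scientific = FALSE shows every digit.
formatted <- format(big, scientific = FALSE)
print(formatted)

# Alternatively, ask print() for more significant digits.
print(big, digits = 10)

# The same sum in integer arithmetic overflows to NA (with a warning).
overflow <- suppressWarnings(.Machine$integer.max + 1L)
print(overflow)
```

Showing the integer overflow next to the double result makes the contrast in the exercise explicit.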

Reorganize code

Ensure that all notebooks compile. Consider switching to a website or bookdown. Is there a better way to publish notebooks?

Correction for Ex 3.3.5

From an email

I have one small correction for exercise 3.3.5, "What does the stroke
aesthetic do?". You have "Stroke changes the color of the border for
shapes (22-24)", but it should be "Stroke changes the thickness of the
border for shapes (22-24)."

Exercises use concepts not yet introduced in the book

Brief description of the problem

Multiple exercises use concepts, such as pipes, that are yet to be covered in the book. Here is a list of solutions with this issue.

Also, one of the below code examples does not produce the correct (at least as I think it should be, I could be wrong) answer due to mathematical error

5.5.3 {unnumbered exercise} - uses pipes which isn't introduced until 5.6

arrange(flights, air_time) %>%
select(origin, dest, air_time) %>%
head()

Suggested replacement code:

# sort by air time, keep only the three columns, then show the first rows
new <- arrange(flights, air_time)
new <- select(new, origin, dest, air_time)
head(new)

Exercise 5.4.1 {.unnumbered .exercise} - uses regular expressions which isn't introduced until 5.6, I only know this because I asked a friend who is much more experienced than I in R. I do not know how to fix this code since I have just started chapter 9! I wil

Exercise 5.5.1 - Uses pipes and code does not compute the correct values and I find the use of %% confusing and seems to add more code.

mutate(flights,
  dep_time_mins = dep_time %/% 100 * 60 + dep_time %% 100,
  sched_dep_time_mins = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %>%
  select(dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)

This is the resulting tibble

# A tibble: 336,776 x 4
  dep_time dep_time_mins sched_dep_time sched_dep_time_mins
1      517           317            515                 315
2      533           333            529                 329
3      542           342            540                 340

dep_time_mins should be 310 not 317
(517/100)*60
[1] 310.2
and sched_dep_time_mins should be 309 not 315
(515/100)*60
[1] 309

I suggest this code which does not use pipes and skips the %% notation.

new2<- mutate(flights,
dep_time_mins = (dep_time / 100)*60,
sched_dep_time_mins = (sched_dep_time / 100)*60)
select(new2, dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)

Which results in this tibble with the correct values

# A tibble: 336,776 x 4
  dep_time dep_time_mins sched_dep_time sched_dep_time_mins
1      517          310.            515                309
2      533          320.            529                317.

Here is the code in the portion which creates a function first

time2mins <- function(x) {
  (x / 100) * 60
}

new3 <- mutate(flights,
dep_time_mins = time2mins(dep_time),
sched_dep_time_mins = time2mins(sched_dep_time))

select(new3,dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)
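For comparison, here is a small sketch of what the two formulas compute on a single value. It assumes dep_time stores clock time as HHMM, as the nycflights13 documentation describes, so 517 means 5:17 am. The function names hhmm_to_mins and decimal_scaling are hypothetical, introduced only for this illustration.

```r
# Assumption: x is a clock time written as HHMM (e.g. 517 means 5:17).
# Split it into hours and minutes, then count minutes since midnight.
hhmm_to_mins <- function(x) {
  (x %/% 100) * 60 + x %% 100
}

# The alternative proposed in the issue: treat x / 100 as decimal hours.
decimal_scaling <- function(x) {
  (x / 100) * 60
}

mins_clock <- hhmm_to_mins(517)       # 5 hours * 60 + 17 minutes
mins_decimal <- decimal_scaling(517)  # 5.17 * 60

print(mins_clock)
print(mins_decimal)
```

Under the HHMM reading, 5:17 am is 5 * 60 + 17 = 317 minutes after midnight, which matches the output of the %/% and %% version. Dividing by 100 treats the 17 minutes as 0.17 of an hour, which is why it gives 310.2 instead.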

Exercises 5.5.2 & 5.5.3 - solution doesn't seem to work and uses pipes

I do not understand how this code results in something that shows the problem is due to a time zone or next day departure. Maybe the solutions are switched around, because 1440 minutes is a 24 hour period, so that would account for departing the next day? To me, if it were due to a time zone difference, the differences between air time and `arr_time - dep_time` would always be in 60 min increments, which the results are not. More clarification is needed, at least for me!

The comment in 5.5.2 "As with the previous question, we will need to Since arr_time and dep_time may be in different time zones," suggests these 2 solutions are inverted. I can't quite reconcile this in my mind how to fix it as I am muddled going back and forth!

Exercise 5.5.6 uses Tibble in solution which has not been introduced. I learn from this but would not have used it on my own.

Chapter index is off because an "Introduction" chapter was added to "Wrangle".

Exercise 3.8.2. What parameters to geom_jitter() control the amount of jittering?

This exercise asks about geom_jitter but your solutions use geom_point. While this is a way to achieve the same plots it is not the answer to the question and causes confusion as there is no clear way to see how you got from geom_jitter -> position_jitter. I propose the following edited code as more clear examples for this exercise.

Exercise 3.8.2. {.unnumbered .exercise}
What parameters to geom_jitter() control the amount of jittering?
From the position_jitter documentation, there are two arguments to jitter: width and height, which control the amount of horizontal and vertical jitter, respectively.

No horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0)

Way too much vertical jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0, height = 15)

Only horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 0)

Way too much horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 0, width = 20)

$ grep -nH 'class="question"' *.Rmd | wc -l
304
$ grep -nH 'class="answer"' *.Rmd | wc -l
299
