jrnold / r4ds-exercise-solutions

Exercise solutions to "R for Data Science"

Home Page: https://jrnold.github.io/r4ds-exercise-solutions

License: Creative Commons Attribution 4.0 International

R 59.94% CSS 6.18% TeX 12.40% Dockerfile 2.35% Shell 18.27% HTML 0.86%
data-science exercise-solutions tidyverse r ggplot2 dplyr r4ds tidyr rmarkdown bookdown

r4ds-exercise-solutions's Introduction

Lifecycle: superseded

Exercise Solutions to R for Data Science

These are solutions to the 1st edition of R for Data Science. The solutions to the 2nd edition of R for Data Science are available at R for Data Science (2e) - Solutions to Exercises.

This repository contains the code and text behind the Solutions for R for Data Science, which, as its name suggests, has solutions to the exercises in R for Data Science by Garrett Grolemund and Hadley Wickham.

The R packages used in this book can be installed via

devtools::install_github("jrnold/r4ds-exercise-solutions")

Contributing

Work on this repo has effectively stopped since the 2nd edition of R for Data Science has been published. Please direct your contributions to R for Data Science (2e) - Solutions to Exercises.

Build

The site is built using the bookdown package and pandoc.

r4ds-exercise-solutions's People

Contributors

adamblake, benherbertson, chrisyeh96, decoursin, dependabot[bot], dvanic, edavishydro, goldbergdata, henrikmidtiby, jdblischak, jmclawson, jrnold, liuminzhao, matthewlock91, mugpeng, nielsenmarkus11, nzxwang, rbjanis, shurakai, tinhb92, xiaoouwang

r4ds-exercise-solutions's Issues

Incorrect chapter numbering

This solutions book is missing one chapter from R for Data Science: 9 - Introduction. The chapter numberings following this chapter are off by one as a result.

Great repo! Thanks for making this available to everyone

fixing Exercise 20.3.4

  1. semantic error

    See the value of looking at the value of

  2. the code chunk
    1. the first line x <- seq(-10, 10, by = 0.5) is redundant
    2. round2() doesn't use the to_even parameter.

3.9.4 answer typo

the text reads as:
"If we didn’t include geom_point, then the line is no longer at 45 degrees:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline()"

geom_point() was not excluded, coord_fixed() was.

print source code of `rlang::set_names_impl()` for Exercise 20.4.3

User can't learn that

set_names adds a few sanity checks: x has to be a vector, and the lengths of the object and the names have to be the same.

from merely

purrr::set_names
#> function (x, nm = x, ...) 
#> {
#>     set_names_impl(x, x, nm, ...)
#> }
#> <bytecode: 0x7fc3439f4bf0>
#> <environment: namespace:rlang>
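
The two checks named in the quoted sentence can be illustrated with a small base R sketch. The function name `set_names_checked` is hypothetical, and this is not the source of `rlang::set_names_impl()`; it only mimics the vector check and the length check described above.

```r
# Hypothetical stand-in, NOT the rlang implementation: it only mirrors the
# two sanity checks named in the quoted sentence.
set_names_checked <- function(x, nm = x) {
  if (!is.vector(x)) {
    stop("`x` must be a vector", call. = FALSE)
  }
  if (length(nm) != length(x)) {
    stop("`nm` must have the same length as `x`", call. = FALSE)
  }
  names(x) <- as.character(nm)
  x
}

named <- set_names_checked(1:3, c("a", "b", "c"))
names(named)

# A length mismatch is rejected; a bare `names(x) <- nm` would instead
# pad the missing names with NA.
mismatch <- try(set_names_checked(1:3, c("a", "b")), silent = TRUE)
inherits(mismatch, "try-error")
```

Printing the real helper from an interactive session is still the only way to see what rlang actually does; the sketch just shows the kind of checks the sentence refers to.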

5.2.4 Exercise 1.1

Question 1: Had an arrival delay of two or more hours
provided answer:
filter(flights, arr_delay > 120)

"Two or more hours" should be inclusive, so flights delayed by exactly 120 minutes belong in the result:

filter(flights, arr_delay >=120)
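
The difference only shows up at the boundary. A minimal base R sketch, using made-up delay values rather than the flights data, makes the point:

```r
# Hypothetical arrival delays in minutes, chosen to straddle the boundary.
arr_delay <- c(119, 120, 121)

strict    <- arr_delay > 120   # excludes a delay of exactly two hours
inclusive <- arr_delay >= 120  # includes it

arr_delay[strict]
arr_delay[inclusive]
```

The same comparison operators apply unchanged inside `filter()`.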

5.2.4 Exercise 1.5 and 1.6

Question: Arrived more than two hours late, but didn’t leave late
answer provided:
filter(flights, !is.na(dep_delay), dep_delay <= 0, arr_delay > 120)

This works properly, but the !is.na(dep_delay) is not needed. filter only includes rows for which the condition is true.

From the R4DS text: "filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values."

question 1.6 has the same issue.
question: Were delayed by at least an hour, but made up over 30 minutes in flight
provided answer:
filter(flights, !is.na(dep_delay),
dep_delay >= 60, dep_delay - arr_delay > 30)
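
The redundancy can be checked on a toy data frame. This sketch uses base R `subset()` as a stand-in because it also drops rows whose condition evaluates to NA, which is the behaviour the quoted R4DS sentence describes for `filter()`. The data values are invented for illustration.

```r
# Hypothetical delays; the second row has a missing departure delay.
delays <- data.frame(
  dep_delay = c(-3, NA, 75),
  arr_delay = c(130, 140, 20)
)

# A comparison against NA yields NA, not FALSE.
cond <- delays$dep_delay <= 0 & delays$arr_delay > 120
cond

# subset() keeps only rows where the condition is TRUE, so the explicit
# is.na() guard does not change the result.
without_guard <- subset(delays, dep_delay <= 0 & arr_delay > 120)
with_guard <- subset(delays, !is.na(dep_delay) & dep_delay <= 0 & arr_delay > 120)
identical(without_guard, with_guard)
```

The guard is harmless, so keeping it is a matter of style; it just is not required for correctness here.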

Complete Ch 21 Exercises

  • 21.2 For Loops
  • 21.3 For Loop variants
  • 21.4 For loops vs. functionals
  • 21.5 The map functions
  • 21.5 Exercises
  • 21.9 Other patterns of for loops

improve Exercise 20.4.5

The explanation for x[x <= 0] is quite good, but I think we should go into more detail for x[-which(x > 0)], like:

> x <- c(-5:5, Inf, -Inf, NaN, NA)

> x > 0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE    NA    NA

> which(x > 0)
[1]  7  8  9 10 11 12

> -which(x > 0)
[1]  -7  -8  -9 -10 -11 -12
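
Carrying the steps above through to the final subsetting shows where the two approaches differ. This is a small base R sketch; the vector `y` is an extra illustration that is not part of the exercise.

```r
x <- c(-5:5, Inf, -Inf, NaN, NA)

by_logical <- x[x <= 0]        # NA in the index yields NA in the result
by_which   <- x[-which(x > 0)] # drops positions 7..12, keeps the rest as-is

by_logical
by_which

# The NaN element survives only in the which() version; logical indexing
# replaces it with NA.
is.nan(by_logical[8])
is.nan(by_which[8])

# Caveat: if no element is positive, which() returns integer(0) and the
# negative index selects nothing at all.
y <- c(-2, -1)
y[-which(y > 0)]
y[y <= 0]
```

The empty-result case is a reason to treat the `-which()` idiom with care even though it preserves NaN.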

answer for 5.7 - Exercise 5

I think the alternative code should be

flights %>%
    filter(!is.na(arr_delay), arr_delay > 0) %>%  
    group_by(dest) %>%
    mutate(total_delay = sum(arr_delay - min(arr_delay)),
           prop_delay = (arr_delay - min(arr_delay)) / total_delay) %>%
    select(
        (year:day),
        dep_delay,
        total_delay,
        prop_delay
    )

answer for 5.7 - Exercise 6

Although the question asks us to "Look at each destination", I think it is odd not to take origin into account.

flights %>%
    filter(!is.na(air_time)) %>%
    group_by(dest, origin) %>%
    mutate(med_time = median(air_time),
           fast = (air_time - med_time) / med_time) %>%
    arrange(fast) %>%
    select(dest, origin, air_time, med_time, fast, dep_time, sched_dep_time, arr_time, sched_arr_time) %>%
    head(15)

Fortunately, the results do not seem to be affected much.

# A tibble: 15 x 9
# Groups:   dest, origin [13]
   dest  origin air_time med_time   fast dep_time sched_dep_time arr_time sched_arr_time
   <chr> <chr>     <dbl>    <dbl>  <dbl>    <int>          <int>    <int>          <int>
 1 BOS   LGA          21       37 -0.432     1450           1500     1547           1608
 2 ATL   LGA          65      112 -0.420     1709           1700     1923           1937
 3 GSP   EWR          55       92 -0.402     2040           2025     2225           2226
 4 BNA   EWR          70      113 -0.381     1914           1910     2045           2043
 5 BOS   LGA          23       37 -0.378     1954           2000     2131           2114
 6 MSP   EWR          93      149 -0.376     1558           1513     1745           1719
 7 CVG   EWR          62       95 -0.347     1359           1343     1523           1545
 8 RIC   EWR          35       53 -0.340     1812           1639     1942           1812
 9 BUF   JFK          38       57 -0.333     2307           2250       34              8
10 BOS   JFK          26       38 -0.316     1200           1200     1254           1313
11 ROC   JFK          35       51 -0.314     2340           2250      120              5
12 ORF   JFK          36       52 -0.308     1720           1645     1820           1820
13 PIT   LGA          40       57 -0.298     1557           1610     1723           1755
14 BOS   LGA          26       37 -0.297     1711           1700     1827           1813
15 DCA   JFK          34       48 -0.292     1104           1105     1158           1215

3.9.3 typo in answer

The answer provided is:
coord_map() uses map projection to project 3-dimensional Earth onto a 2-dimensional plane. By default, coord_map() uses the Mercator projection. However, this projection must be applied to all geoms in the plot. coord_quickmap() uses a faster, but approximate map projection. This approximation ignores the curvature of Earth and adjusts the map for the latitude/longitude ratio. This transformation is quicker than the because the shapes do not need to be transformed.

There is a typo in the last sentence. It should read something more like: "This transformation is quicker than coord_map() because the shapes do not need to be transformed."

answer for 5.7 - Exercise 7

we need to "Find all destinations that are flown by at least two carriers"

(two_more_carriers <- flights %>%
    group_by(dest) %>% 
    mutate(n_carrier = n_distinct(carrier)) %>% 
    filter(n_carrier >= 2) %>% 
    select(-n_carrier) %>%
    ungroup())
# A tibble: 325,397 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138
# ... with 325,387 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

then "rank the carriers"

two_more_carriers %>% 
    group_by(carrier) %>%
    summarise(n_dest = n_distinct(dest)) %>%
    arrange(desc(n_dest))
# A tibble: 16 x 2
   carrier n_dest
   <chr>    <int>
 1 EV          51
 2 9E          48
 3 UA          42
 4 DL          39
 5 B6          35
 6 AA          19
 7 MQ          19
 8 WN          10
 9 OO           5
10 US           5
11 VX           4
12 YV           3
13 FL           2
14 AS           1
15 F9           1
16 HA           1

Actually, it took me a while to understand your code. I found that count() can be quite confusing when you have to group by two variables. Using n_distinct() and naming the new variable myself in summarise() makes the process clearer and more straightforward.

Exercise 4.3.4

If x = Inf, then x * 0 = NaN. That's why NA * 0 is missing, I believe.
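
The reasoning can be checked directly in base R: an operation on NA returns a definite value only when the result would be the same for every possible value of the missing element.

```r
inf_times_zero <- Inf * 0   # NaN: not zero for every value of x
na_times_zero  <- NA * 0    # NA: the result depends on the unknown value

na_pow_zero  <- NA ^ 0      # 1: x^0 is 1 for any x
na_or_true   <- NA | TRUE   # TRUE: the result is TRUE whatever NA stands for
false_and_na <- FALSE & NA  # FALSE: the result is FALSE whatever NA stands for

inf_times_zero
na_times_zero
na_pow_zero
na_or_true
false_and_na
```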

answer for 7.5.3 - Exercise 1

There are two

Plotting the density instead of counts will make the distributions comparable, although the bins with few observations will still be hard to interpret.

I think the second one should be about using cut_number() vs. cut_width().

Reorganize code

Ensure that all notebooks compile. Consider switching to a website or bookdown. Is there a better way to publish notebooks?

Correction for Ex 3.3.5

From an email

I have one small correction for exercise 3.3.5, "What does the stroke
aesthetic do?". You have "Stroke changes the color of the border for
shapes (22-24)", but it should be "Stroke changes the thickness of the
border for shapes (22-24)."

Exercises use concepts not yet introduced in the book

Brief description of the problem

Multiple exercises use concepts, such as pipes, that are yet to be covered in the book. Here is a list of solutions with this issue.

Also, one of the code examples below does not produce the correct answer (at least as I think it should be; I could be wrong) due to a mathematical error.

5.5.3 {unnumbered exercise} - uses pipes, which aren't introduced until 5.6

arrange(flights, air_time) %>%
select(origin, dest, air_time) %>%
head()

Suggested replacement code-

new <- arrange(flights, air_time)
new <- select(new, origin, dest, air_time)
head(new)

Exercise 5.4.1 {.unnumbered .exercise} - uses regular expressions which isn't introduced until 5.6, I only know this because I asked a friend who is much more experienced than I in R. I do not know how to fix this code since I have just started chapter 9! I wil

Exercise 5.5.1 - Uses pipes, and the code does not seem to compute the correct values. I also find the use of %% confusing, and it seems to add more code.

mutate(flights,
  dep_time_mins = dep_time %/% 100 * 60 + dep_time %% 100,
  sched_dep_time_mins = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %>%
  select(dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)

This is the resulting tibble:

# A tibble: 336,776 x 4
  dep_time dep_time_mins sched_dep_time sched_dep_time_mins
1      517           317            515                 315
2      533           333            529                 329
3      542           342            540                 340

dep_time_mins should be 310 not 317
(517/100)*60
[1] 310.2
and sched_dep_time_mins should be 309 not 315
(515/100)*60
[1] 309
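
For context, here is a small sketch of what the two calculations do to the same input. It assumes dep_time is a clock time encoded as HHMM, which is how the nycflights13 documentation describes the column. Under that reading 517 means 5:17, so the two formulas answer different questions.

```r
dep_time <- c(517, 533, 542)  # the three values shown in the tibble

# Integer division and remainder split HHMM into hours and minutes.
hours   <- dep_time %/% 100   # 5 5 5
minutes <- dep_time %% 100    # 17 33 42
since_midnight <- hours * 60 + minutes

# Plain division treats the last two digits as hundredths of an hour.
scaled <- (dep_time / 100) * 60

since_midnight
scaled
```

If the HHMM assumption holds, 5 hours and 17 minutes is 5 * 60 + 17 = 317 minutes, whereas (517 / 100) * 60 converts 5.17 hours to 310.2 minutes.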

I suggest this code which does not use pipes and skips the %% notation.

new2 <- mutate(flights,
  dep_time_mins = (dep_time / 100) * 60,
  sched_dep_time_mins = (sched_dep_time / 100) * 60)
select(new2, dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)

Which results in this tibble with the correct values

# A tibble: 336,776 x 4
  dep_time dep_time_mins sched_dep_time sched_dep_time_mins
1      517          310.            515                309
2      533          320.            529                317.

Here is the code in the portion which creates a function first:

time2mins <- function(x) {
  (x / 100) * 60
}

new3 <- mutate(flights,
  dep_time_mins = time2mins(dep_time),
  sched_dep_time_mins = time2mins(sched_dep_time))

select(new3, dep_time, dep_time_mins, sched_dep_time, sched_dep_time_mins)

Exercises 5.5.2 & 5.5.3 - solution doesn't seem to work and uses pipes

I do not understand how this code results in something that shows the problem is due to a time zone or next day departure. Maybe the solutions are switched around, because I see that 1440 minutes is a 24 hour period, so that would account for departing the next day? To me, if it was due to a time zone difference, the differences between air time and `arr_time - dep_time` would always be in 60 min increments, which the results are not. More clarification is needed, at least for me!

The comment in 5.5.2 "As with the previous question, we will need to Since arr_time and dep_time may be in different time zones," suggests these 2 solutions are inverted. I can't quite reconcile this in my mind how to fix it as I am muddled going back and forth!
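
One way to see where 1440 could enter is to work through a flight that lands after midnight. This sketch assumes times are encoded as HHMM; the helper name `hhmm_to_mins` and the two clock times are illustrative. It only handles the overnight wrap-around, not time zone differences.

```r
# Hypothetical helper: convert an HHMM clock time to minutes since midnight.
hhmm_to_mins <- function(x) (x %/% 100) * 60 + x %% 100

dep <- hhmm_to_mins(2307)  # departs 23:07
arr <- hhmm_to_mins(34)    # arrives 00:34 the next day

raw_diff <- arr - dep            # negative, because the clock wrapped
wrapped  <- (arr - dep) %% 1440  # 1440 minutes = 24 hours

# A same-day flight is unaffected by the modulo.
same_day <- (hhmm_to_mins(830) - hhmm_to_mins(517)) %% 1440

raw_diff
wrapped
same_day
```

Any remaining gap between this elapsed time and air_time would then have to come from other sources, such as time zones or time spent on the ground.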

Exercise 5.5.6 uses Tibble in solution which has not been introduced. I learn from this but would not have used it on my own.

Exercise 3.8.2. What parameters to geom_jitter() control the amount of jittering?

This exercise asks about geom_jitter but your solutions use geom_point. While this is a way to achieve the same plots it is not the answer to the question and causes confusion as there is no clear way to see how you got from geom_jitter -> position_jitter. I propose the following edited code as more clear examples for this exercise.

Exercise 3.8.2. {.unnumbered .exercise}
What parameters to geom_jitter() control the amount of jittering?
From the position_jitter documentation, there are two arguments to jitter: width and height, which control the amount of horizontal and vertical jitter, respectively.

No horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0)

Way too much vertical jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0, height = 15)

Only horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 0)

Way too much horizontal jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(height = 0, width = 20)

5.3.3 Exercise 3 small typo

last sentence reads "The fastest flights area couple of flights between EWR and BDL with an air time of 20 minutes."

"area" should be "are a"

Find missing answer divs

There are some missing answer divs.

$ grep -nH 'class="question"' *.Rmd | wc -l
304
$ grep -nH 'class="answer"' *.Rmd | wc -l
299
