tidyverse / design

Tidyverse design principles

Home Page: https://design.tidyverse.org

License: Other

HTML 2.22% R 68.89% SCSS 28.89%

r book design

design's Issues

NULL vs zero-length vectors

For vectorised arguments I think we should consistently treat NULL in the same way as if the argument had not been supplied. This is symmetric with our use of NULL for the default value of optional arguments that need complex calculations.

library(vctrs)
vec_c(TRUE, double())
#> [1] 1

vec_c(TRUE, NULL)
#> [1] TRUE
# Same as 
vec_c(TRUE, )
#> [1] TRUE
# Same as
vec_c(TRUE)
#> [1] TRUE

Verbosity

Should you provide control over verbosity?

At the function level or via some big kill switch?

Example of function level: devtools::check(quiet = TRUE)
Example of higher level kill switch: option usethis.quiet affects all of the ui_*() functions

Context: I'm calling check() and install() and test() in .Rmd for R Packages. All of these functions make pretty strong assumptions that they're being run interactively. Such functions are easier to "write prose around" if there are ways to muffle, capture, and redact their output.

I think this might be analogous to format() and print() methods.
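One way to combine the two levels is to let a function-level `quiet` argument default to a package-wide option. A minimal sketch, assuming a hypothetical package `mypkg`, a hypothetical option `mypkg.quiet`, and helper names invented for illustration:

```r
# Hypothetical kill switch: TRUE only when the option is explicitly set to TRUE
is_quiet <- function() {
  isTRUE(getOption("mypkg.quiet", default = FALSE))
}

# Function-level control that defaults to the global switch
check_thing <- function(x, quiet = is_quiet()) {
  if (!quiet) {
    message("Checking ", length(x), " value(s)")
  }
  invisible(x)
}

check_thing(1:3)                # emits a message
check_thing(1:3, quiet = TRUE)  # silent for this call only

old <- options(mypkg.quiet = TRUE)
check_thing(1:3)                # silent wherever the default is used
options(old)                    # restore the previous (unset) state
```

Because the output goes through `message()`, callers can also muffle or capture it with `suppressMessages()` or `withCallingHandlers()`.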

Row names

Why you should not rely on tidyverse functions preserving row names (because metadata is data).

Update netlify bits of travis.yml

The .travis.yml file here needs an update for the new netlify CLI.

Before/current:

deploy:
  provider: script
  script: netlify deploy -t $NETLIFY_PAT
  skip_cleanup: true

After needs to look more like Advanced R:

deploy:
  provider: script
  script: netlify deploy --prod --dir _book
  skip_cleanup: true

There's also the hidden assumption that the Netlify PAT has been stored on Travis as an encrypted env var named NETLIFY_AUTH_TOKEN.

Auto-build-and-deploy will be broken until we do this.

Collect design principles from other langs

https://package.elm-lang.org/help/design-guidelines
https://en.m.wikipedia.org/wiki/Unix_philosophy
https://python-patterns.guide
https://martinfowler.com/bliki/BeckDesignRules.html — Passes the tests; Reveals intention; No duplication; Fewest elements

Encoding practices

UTF-8 wherever possible and tips for actually achieving that

Return types

How do you decide the return type of your function?

Think about invariants. e.g. is vec_type(f(x)) a constant? Is it vec_type(x)? The less information needed to predict the return type, the better.
Pick the "smallest"/most constrained type that the returned data fits into.
If you return the same type of output from multiple functions, you should create a function that consistently creates exactly the same format (to avoid accidental inconsistency), and consider making it an S3 class (so you can have a custom print method).

Example

When a function returns two vectors of the same size, as a general rule you should return a tibble:

A matrix would only work if the vectors were the same type (and not factor or Date), doesn't make it easy to extract the individual values, and is not easily input to other tidyverse functions.
A list doesn't capture the constraint that both vectors are the same length.
A data frame is ok if you don't want to take a dependency on tibble, but you need to remember the drawbacks: if the columns are character vectors you'll need to remember to use stringsAsFactors = FALSE, and the print method is confusing for list- and df-cols (and you have to create by modifying an existing data frame, not by calling data.frame()). (Example: it would be weird if glue returned tibbles from a function.)

e.g. str_locate(), str_locate_all()
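The "one constructor, shared by several functions" idea can be sketched in base R. Everything below is hypothetical: `new_locations()`, the `locations` class, and `locate_digits()` are illustrative names, and a plain data frame stands in for a tibble to keep the sketch dependency-free.

```r
# Single constructor: every function that returns locations goes through it,
# so the column names, types, and class are always the same.
new_locations <- function(start = integer(), end = integer()) {
  stopifnot(
    is.integer(start),
    is.integer(end),
    length(start) == length(end)
  )
  out <- data.frame(start = start, end = end)
  class(out) <- c("locations", class(out))
  out
}

# One of possibly many functions that return the shared format
locate_digits <- function(x) {
  m <- regexpr("[0-9]+", x)
  start <- as.integer(m)
  start[start == -1L] <- NA_integer_           # no match
  end <- start + attr(m, "match.length") - 1L  # NA propagates for no match
  new_locations(start, end)
}

locate_digits(c("a1", "bc", "x234"))
```

The return type is predictable from the input alone: one row per element of `x`, two integer columns.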

Minimize Global Options

From Advanced R:

global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave.

Pattern: serve internal data in a humane way

Placeholder for another concrete pattern that recurs in tidyverse/r-lib packages, like #53 Pattern: Cross that Bridge When you Come To It

Sometimes internal data is also useful to users. Write a helper to give it to them in a usable way.

Examples:

readr::readr_example(), readxl::readxl_example() for accessing example files that ship with the packages
googledrive::drive_endpoints(), googledrive::drive_mime_type() for accessing pre-processed knowledge about the Drive API

NULL input means "show me everything"

character or integer input is for selective access
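A minimal sketch of that convention, using a small named vector as a stand-in for a package's internal data. The object `.mime_types` and the function `mime_type()` are hypothetical and are not the googledrive API.

```r
# Stand-in for internal package data
.mime_types <- c(
  csv  = "text/csv",
  json = "application/json",
  png  = "image/png"
)

mime_type <- function(ext = NULL) {
  # NULL means "show me everything"
  if (is.null(ext)) {
    return(.mime_types)
  }
  # Character input selects by name; fail loudly on unknown names
  if (is.character(ext)) {
    unknown <- setdiff(ext, names(.mime_types))
    if (length(unknown) > 0) {
      stop("Unknown extension(s): ", paste(unknown, collapse = ", "), call. = FALSE)
    }
  }
  # Integer input selects by position
  .mime_types[ext]
}

mime_type()       # everything
mime_type("csv")  # selective access by name
mime_type(2)      # selective access by position
```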

Default values of function arguments

moved manually from hadley/adv-r#1248

abind::abind() has this signature:

abind(..., along=N, <more args>)

where along is "(optional) The dimension along which to bind the arrays. The default is the last dimension, i.e., the maximum length of the dim attribute of the supplied arrays."

Good example for exploring the relative merits of different default strategies. Seems that along = NULL would have been easier to program around / wrap & expose.
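To illustrate why a `NULL` default is easy to wrap, here is a sketch with hypothetical functions `resolve_along()` and `bind_wrapper()`. It only resolves the dimension and does not reimplement abind::abind().

```r
# NULL default: the "complex calculation" happens inside the function
resolve_along <- function(..., along = NULL) {
  arrays <- list(...)
  if (is.null(along)) {
    # Default is the last dimension, i.e. the longest dim attribute supplied
    along <- max(vapply(arrays, function(a) length(dim(a)), integer(1)))
  }
  along
}

# A wrapper can expose the same default and pass it straight through,
# without needing to know how the default is computed
bind_wrapper <- function(..., along = NULL) {
  resolve_along(..., along = along)
}

m <- matrix(1:4, nrow = 2)
a <- array(1:8, dim = c(2, 2, 2))
bind_wrapper(m, a)             # default: last dimension
bind_wrapper(m, a, along = 1)  # explicit choice
```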

Keep amount of code needed for analysis as small as possible

Documentation next to code
Reason for skip_if() should be a comment next to the code
Test and R file pairing
If you recursively unlink a directory, you should be able to see that you created the directory
If you're creating a global cache (env) in a package, it should be near the function that uses it.

Analysis of a function

From https://edwinth.github.io/multiperson-project/, with permission

save_as_rds <- function(file, 
                        filename) {
  
  node <- Sys.info()["nodename"]
  user <- Sys.info()['user']
  
  if (node == "server_node_name") {
    path <- "path/to_the_data/on_the/server"
  } else if (user == 'user1') {
    path <- "path/for/user1"
  } else if (user2) {
    path <- "user2/has_data/stored/here"
  }
  
  file_path <- file.path(path, filename)
  saveRDS(file, file_path)
}

User has a constrained set of values, so reflect that with a switch():

save_as_rds <- function(file, 
                        filename) {
  
  node <- Sys.info()["nodename"]
  user <- Sys.info()['user']
  
  if (node == "server_node_name") {
    path <- "path/to_the_data/on_the/server"
  } else {
    path <- switch(user,
      user1 = "path/for/user1",
      user2 = "user2/has_data/stored/here",
      stop(glue::glue("Unknown user '{user}'"), call.= FALSE)
    )
  }
  
  file_path <- file.path(path, filename)
  saveRDS(file, file_path)
}

This also fixes a small bug in the original function (if (user2)), which would soon be discovered but can't happen with switch(). It now also gives a clear error.

Currently the function is hard to test because it has hidden inputs and mingles computation and side-effects. We can fix this by pulling out a path generation function that has node and user as arguments, which makes it easier to experiment with and test:

user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]) {
    
  if (node == "server_node_name") {
    "path/to_the_data/on_the/server"
  } else {
    switch(user,
      user1 = "path/for/user1",
      user2 = "user2/has_data/stored/here",
      stop(glue::glue("Unknown user '{user}'"), call.= FALSE)
    )
  }
}

I'd then simplify to use early return:

user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]
                      ) {
    
  if (identical(node, "server_node_name")) {
    return("path/to_the_data/on_the/server")
  }
    
  switch(user,
    user1 = "path/for/user1",
    user2 = "user2/has_data/stored/here",
    stop(glue::glue("Unknown user '{user}'"), call.= FALSE)
  )
}

(also notice switch from vectorised == to identical())
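Because node and user are now ordinary arguments, the function can be exercised without touching Sys.info(). A small sketch of such checks; user_home() is restated here with base stop() instead of glue so the block is self-contained, which is a substitution made for this example only.

```r
user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]) {
  if (identical(node, "server_node_name")) {
    return("path/to_the_data/on_the/server")
  }

  switch(user,
    user1 = "path/for/user1",
    user2 = "user2/has_data/stored/here",
    stop("Unknown user '", user, "'", call. = FALSE)
  )
}

# Each branch can be checked by supplying the inputs explicitly
stopifnot(
  identical(user_home("server_node_name", "anyone"), "path/to_the_data/on_the/server"),
  identical(user_home("laptop", "user1"), "path/for/user1"),
  identical(user_home("laptop", "user2"), "user2/has_data/stored/here")
)

# The unknown-user case gives a clear error
msg <- tryCatch(user_home("laptop", "user3"), error = function(e) conditionMessage(e))
msg
```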

It would also be worth considering whether it's better to swap configuration for convention:

user_home <- function(node = Sys.info()[["nodename"]]) {
    
  if (identical(node, "server_node_name")) {
    path <- "path/to_the_data/on_the/server"
  } else {
    path <- file.path("~/project_name/data")
  }
  
  if (!file.exists(path)) {
    stop(glue::glue("Data must live at '{path}'"), call. = FALSE)
  }
  path
}

Functions that fabricate or repair column names

Inspired by this list made by @hadley in early vctrs work, regarding type coercion across the tidyverse.

Functions where we create column names out of thin air or from inputs and could / should do so according to common principles. This list will grow as we stumble across more.

tidyr::spread()
dplyr::inner_join() and friends
dplyr::summarise()
dplyr::mutate()
tidyr::gather()
tidyr::unnest()

Side-effect-y functions should return first value invisibly
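A minimal sketch of the idea, with a hypothetical `save_lines()` helper: the function is called for its side effect (writing a file) and returns its first argument invisibly so it can still be used in a pipeline or assigned.

```r
save_lines <- function(x, path) {
  writeLines(as.character(x), path)  # the side effect
  invisible(x)                       # first argument, returned invisibly
}

tmp <- tempfile(fileext = ".txt")
save_lines(c("a", "b"), tmp)       # prints nothing at the console
y <- save_lines(c("a", "b"), tmp)  # but the value is still available
```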

Put package principles in principles.md

The superset principle

Only provide functions with the same name as base R functions if they follow the superset principle, i.e. they only provide additional functionality (e.g. turning into a generic or adding additional arguments)

Scope of effects

Examples:

A function shouldn't modify objects in the global environment.
A package shouldn't attach other packages (or otherwise mess with the search path).
A script shouldn't install packages, or change the working directory, or rm(list = ls()).

(But this includes anything stateful: collation order, working directory, env vars, library paths, locales, makevars, options, graphics parameters, path, random seed, ...)

Why not? Because these are actions outside the usual scope of effects — if the scope of effects is constrained/contained then you have a simpler model of computation that makes analysing/understanding code easier. Imagine each function/package/script creates a sort of nested tree: it's ok to affect your children, but not your parents.

There's one big generic exception to this rule: a function/package/script can have actions outside of its usual scope if that is its explicit and specific purpose:

It's ok for <- to modify the global environment, because that is its one job. It's ok for save_output(path) to create files in path because it's clear from the name.
It's fine for library(conflicted) to mess with your search path; that is its one purpose.
It's ok for source("class-setup.R") to install packages because the intent of a setup script is to get your computer into the same state as someone else's (but be aware by doing this, you might break other projects).

(i.e. it's ok if the user explicitly requests that your code do these things, but you should avoid doing it automatically, or as a side-effect of something unrelated)

It's also generally ok to do things temporarily, i.e. it's ok for your function to change global options, as long as you change them back. And it's ok for anyone to write into the temporary directory.
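A sketch of the "change it, then change it back" pattern in base R. The function name `format_short()` is made up for illustration.

```r
format_short <- function(x) {
  old <- options(digits = 3)      # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restored even if format() errors
  format(x)
}

before <- getOption("digits")
out <- format_short(pi)
after <- getOption("digits")

identical(before, after)  # the global option is unchanged after the call
```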

Talking points:

library(usethis): all the functions in usethis are specifically for modifying your computing environment. They are designed to be used interactively, but shouldn't be called automatically (i.e. it's fine to wrap them in a function that is then called by the user, but you shouldn't generally run them in a script)
Assigning multiple objects in a for loop — generally this pattern does not set you up for success because once you have the objects in the environment, how do you work with them? It's better to put them in a list and then you can use the same techniques you would for iterating over values in a vector or columns in a data frame.
library(reprex): the ultimate example of where you want to make small completely self-contained code because you want someone to help you.
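For the second talking point, a short sketch of the list-based alternative. The data and names are made up.

```r
inputs <- c(a = 1, b = 2, c = 3)

# Harder to work with afterwards: one free-floating object per input
# for (nm in names(inputs)) assign(paste0("sq_", nm), inputs[[nm]]^2)

# Easier: keep the results together in a named list
squares <- lapply(inputs, function(x) x^2)

# The usual tools for iterating over a vector now apply
squares$b
total <- sum(unlist(squares))
```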

All arguments should be prefixed with a `.` when `...` are present and used

Not just the ones to the right of the dots as mentioned here:
https://principles.tidyverse.org/dots.html#avoiding-false-matches
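A sketch of the false-match problem this guards against. Both wrappers are hypothetical; the only difference between them is the dot prefix.

```r
# Without the prefix, a named argument meant for `f` can be captured
# by the wrapper itself
map_plain <- function(x, f, ...) {
  stopifnot(is.function(f))
  lapply(x, f, ...)
}

# With the prefix, `x = 10` travels through `...` to `.f` as intended
map_dotted <- function(.x, .f, ...) {
  stopifnot(is.function(.f))
  lapply(.x, .f, ...)
}

add <- function(i, x) i + x

res <- map_dotted(1:3, add, x = 10)

# map_plain(1:3, add, x = 10) fails: `x = 10` matches the wrapper's own `x`,
# so 1:3 is matched to `f` and `add` falls into `...`
```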

Use of argument names as values

Recent tweets here and here pointed out inconsistencies in the use of argument names as function inputs. It may be worth touching on best practices around the use of named arguments in the tidyverse principles. (If not, then please feel free to close this issue.)

library(tidyverse)
(x <- data_frame(
  v1 = c("a", "b"),
  v2 = factor(c("c", "d"), levels = c("c", "d"))
))
#> # A tibble: 2 x 2
#>   v1    v2   
#>   <chr> <fct>
#> 1 a     c    
#> 2 b     d
mutate(x, v1 = recode(v2, "a" = "zzz")) # current = new
#> # A tibble: 2 x 2
#>   v1    v2   
#>   <fct> <fct>
#> 1 c     c    
#> 2 d     d
mutate(x, v2 = fct_recode(v2, "yyy" = "c")) # new = current
#> # A tibble: 2 x 2
#>   v1    v2   
#>   <chr> <fct>
#> 1 a     yyy  
#> 2 b     d
rename(x, xxx = v1) # new = current
#> # A tibble: 2 x 2
#>   xxx   v2   
#>   <chr> <fct>
#> 1 a     c    
#> 2 b     d

Created on 2018-10-09 by the reprex package (v0.2.0).

Recommendations for tests

@jennybc commented on Feb 5, 2017, 5:15 PM UTC:

It would be interesting to formalize certain cross-package consistency expectations into tests that run every day or week.

Example: For some challenging csv's make sure readr and readxl (csv -> xls(x) -> df) produce same data frame.

These tests might be a useful complement to the ingest conventions (#34). I think this package would be the natural home for this? Although maybe you wouldn't want tidyverse to show build failing whenever one of these tests fails.

This issue was moved by jennybc from tidyverse/tidyverse/issues/39.

Pattern: Cross that Bridge When you Come To It

The first in a proposed set of concrete patterns that recur in tidyverse/r-lib packages. Interesting to think about collecting several of these and record them in a way that it's easier to see shared qualities.

When this comes up:
There is a value that might be needed by your package, in many places,
but also might not come up at all. To set this value properly, you need
user input. Once you have that, you want to remember and reuse the
value.

Examples from real life:

The way httr stores the filepath to the file where it caches OAuth
tokens.
- Storage unit = the option httr_oauth_cache
The way googledrive, etc. handle OAuth tokens themselves.
- Storage unit = a field of an AuthState, which is an R6 object
  held in the package’s namespace

Sketch of implementation via an option:

User may express their wishes by setting an option.
- At startup, in user- or project-level .Rprofile
- For the current session, with code in a script
- How to set the option, in either case:
```
options(PACKAGE_THINGY = VALUE)
```
Package may set the option on load, while deferring to any value the
user may have set. (Discussion below suggests deprecating this bit.)
```
.onLoad <- function(libname, pkgname) {
  op <- options()
  op.PACKAGE <- list(
    PACKAGE_THINGY = NA
  )
  toset <- !(names(op.PACKAGE) %in% names(op))
  if (any(toset)) options(op.PACKAGE[toset])

  invisible()
}
```
- This value set here could be either a valid value or a sentinel,
  such as NA, that signals we need the user to do something upon
  first need.
- Alternatively a default could be enforced via the default
  argument of getOption().

Have a function to summon the value and store it for the remainder
of the session. If the value is unset, trigger whatever needs to
happen to set it.

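get_thingy <- function() {
  # getOption() takes the option name as a string
  thingy <- getOption("PACKAGE_THINGY", default = NA)
  if (is.na(thingy)) {
    # make_user_decide() is a placeholder for the interactive prompt
    thingy <- make_user_decide()
    ## in the above interaction, tell user how to set this option at startup
    ## and never see the interactive prompt again
    options(PACKAGE_THINGY = thingy)
  }
  # thingy_values is a placeholder for the vector of allowed values
  match.arg(thingy, choices = thingy_values)
}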

If it’s likely that user might want to provide thingy in an ad
hoc manner, expose it as an argument of relevant functions, with a
default value thingy = get_thingy().
If thingy is the sort of thing most people either don’t think
about or have strong opinions about, just use get_thingy() behind
the scenes. (Also applies if thingy is not something users should
manage with bare hands, e.g. an OAuth token). Those who don’t want
to think about it will be forced to provide permissions or details once
per session. Those who care deeply and find this irritating will be
motivated to set the PACKAGE_THINGY option in a startup file.

I wrote this while working on usethis::use_git_protocol() so it will
be interesting to consult that once it’s done as another concrete
example.

Describe standard set of vector classes that we support

e.g. int64 via bit64 package

Information about custom conditions

When to use (always?), and basic code template (following abort_bad_arg() from adv-r)
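A minimal base-R sketch of what such a template could look like. The function and class names are illustrative, and this is an approximation written for this note, not the Advanced R code itself.

```r
# Signal an error that carries a class and structured data,
# so callers can handle it selectively
abort_bad_argument <- function(arg, must, not = NULL) {
  msg <- paste0("`", arg, "` must ", must)
  if (!is.null(not)) {
    msg <- paste0(msg, "; not ", not)
  }
  cond <- structure(
    class = c("error_bad_argument", "error", "condition"),
    list(message = msg, call = NULL, arg = arg)
  )
  stop(cond)
}

my_log <- function(x) {
  if (!is.numeric(x)) {
    abort_bad_argument("x", must = "be numeric", not = typeof(x))
  }
  log(x)
}

# Handlers can target the specific class and use the attached data
caught <- tryCatch(
  my_log("a"),
  error_bad_argument = function(e) e$arg
)
```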

Extract repeated error messages

If you use the same form of an error message in multiple places, extract it out into a function.

Use a custom condition. Consider using glue.

Prefer data frames to matrices

Even if a matrix is sufficient to model the return value, a data frame is more convenient since there are few places in the tidyverse that force you to use matrices. The primary exception is stringr.

Glossary chapter

Containing commonly used words

Each word would be a h3 (so we can link to) followed by a paragraph definition

cc @batpigandme

Context setters should return previous values invisibly

Like options() and par().

Also need to ensure that first argument has same type as output.

Together this allows you to use a nice on.exit() pattern.
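A sketch of a custom context setter that follows both rules, using a hypothetical package-level environment `state` and a hypothetical setter `set_verbosity()`.

```r
# Hypothetical package state
state <- new.env(parent = emptyenv())
state$verbosity <- "normal"

# Returns the previous value invisibly; input and output have the same type,
# so the old value can be fed straight back in
set_verbosity <- function(level) {
  old <- state$verbosity
  state$verbosity <- level
  invisible(old)
}

# The on.exit() pattern this enables
with_quiet <- function(code) {
  old <- set_verbosity("quiet")
  on.exit(set_verbosity(old), add = TRUE)
  force(code)
}

inside <- with_quiet(state$verbosity)  # evaluated while "quiet" is active
outside <- state$verbosity             # restored afterwards
```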

Make it clear what you are branching over

is_foofy <- function(x) x %in% c("a", "b", "c")

# How do you make clear which cases need to be handled?
if (is_foofy(x)) {
  if (x == "a") {
    
  } else if (x == "b") {
    
  } else if (x == "c") {
    
  }
}

# How do you make the 3 cases clear?
n_unnamed <- sum(!named)
if (n_unnamed == 0) {
  # do nothing
} else if (n_unnamed == 1) {
  clauses <- c(clauses, build_sql("ELSE ", input[!named]))
} else {
  stop("Can only have one unnamed (ELSE) input", call. = FALSE)
}

Duplicated code

What are obvious patterns of duplicated code in R? How do you fix them? (e.g. make a for loop, use a functional, make a functional)

What are less obvious patterns? How do you identify them? How do you fix them?

Prefer code to documentation

i.e.

readLines <- function(...) stop("Use read_lines() instead")

Because you can't forget

Meet them more than halfway and speak their language

Something about writing functions and packages that meet the user halfway -- or more!

This is about default behaviour. It's related to humane defaults for arguments but on a larger scale. Be willing to take instructions in the user's preferred terminology and translate it to what the computer requires, internally, without pedantry or drama.

An example of the principle I have in mind is how ggplot2::ggsave() can infer figure file format from the extension of the putative file name. There are some other really humane touches to this function. This was one of my most delightful discoveries when I switched to ggplot2 from base/lattice

A good base R example is the ability to specify legend location from the keywords "bottomright", "bottom", "bottomleft", "left", etc. instead of, e.g., 1, 2, 3, or 4 (par(mar)). The whole notion of a formula interface may also qualify.

My plan for this issue is to use it to collect more positive and negative examples of this principle.

[ vs [[

Use [ for selecting multiple things
Use [[ for selecting one thing
Mention purrr::pluck()?
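A short base-R illustration of the difference, with a made-up list:

```r
x <- list(a = 1:3, b = "two", c = list(d = 4))

# `[` keeps the container: the result is always a list
one_as_list <- x["a"]
several <- x[c("a", "b")]

# `[[` extracts a single element
one <- x[["a"]]

# Nested extraction of one thing; purrr::pluck(x, "c", "d") is the analogue
nested <- x[["c"]][["d"]]
```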

Recycling rules

i.e. only recycle vectors of length 1 to length of longest.

Describe the rules for recycling vectors of length 0. @jimhester did we discuss those rules for glue?
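A sketch of the length-1-only rule in base R. `recycle_to_common()` is a hypothetical helper; since the zero-length rules are still an open question here, the sketch simply treats a zero-length input alongside longer ones as an error rather than picking a rule.

```r
recycle_to_common <- function(...) {
  args <- list(...)
  if (length(args) == 0) {
    return(list())
  }
  lens <- lengths(args)
  n <- max(lens)

  # Only length-1 vectors are recycled; anything else must already match
  bad <- lens != 1L & lens != n
  if (any(bad)) {
    stop(
      "Can't recycle lengths ", paste(lens[bad], collapse = ", "),
      " to length ", n,
      call. = FALSE
    )
  }

  lapply(args, function(x) if (length(x) == 1L) rep(x, n) else x)
}

ok <- recycle_to_common(1:3, 0)
# recycle_to_common(1:3, 1:2) is an error: length 2 is not recycled to 3
```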

Sentinel objects

e.g. NULL, ggplot2::waiver(), rlang::done()

ignoring trailing arguments

We ignore trailing empty arguments by default in all dots-collecting functions ...

Responsible use of `...`

Some topics:

Implications re: its placement in the signature
How to handle it inside the function
Documentation concerns

tidyverse ingest conventions

@jennybc commented on Jan 24, 2017, 7:11 AM UTC:

The requested brain dump to get this started.

Function that reads a thingy should be named read_thingy(). The opposite, when it exists, should be named write_thingy().
If it makes sense, return a tibble.
- Any package that creates tibbles should import tibble, to reduce gotchas around, e.g., the drop = FALSE behaviour of [ or the lack of partial matching on $.
- If thingy has no data, return a tibble with 0 rows and 0 columns.
Don't have row names.
col_names is the tidyverse answer to header = TRUE. Either logical indicating that first row gives variable names or character vector of names.
Don't mess with column names, i.e. don't modify non-syntactic names. Exceptions:
- Fill in missing column names.
- De-duplicate column names.
Don't coerce character to factor, stringsAsFactors = FALSE.
Do guess column types under what circumstances? according to what rules?
- Provide control over how much data to use for the guessing, guess_max = min(1000, n_max)*
- Recognize dates and date-times, for some universe of default formats. And always convert to POSIXct?
col_types is the tidyverse answer to colClasses. There is an entire system for type specification, with short codes or more general "collectors".
- consider readr's "problems" and printed/returned colspec
something about locale
something about UTF-8 encoding
Explicit control over rows to skip, unrelated to the data: skip = 0, n_max = Inf.
Control over rows (or parts thereof) to skip, based on the data
- comment
- a vector of NA values. Use quoted_na to specify what happens to "NA".
Something about empty rows/columns. Proposal: include and fill with NA when leading (and not explicitly skipped) or embedded. Treatment of trailing empty rows/columns will depend on context (e.g. possible to include with readr, impossible with readxl).
something about reading from file, compressed file, URL, connection, memory
something about chunked reading
whitespace trimming?
progress?
Recommendations for implementation? Or a package?

This issue was moved by jennybc from tidyverse/tidyverse/issues/34.

Common suffixes vs common prefixes

Think about autocomplete.

Roundtrips

When possible, if you have read_thingy(), it is good to have write_thingy() and for x to be identical to read_thingy(write_thingy(x)) and to test for that. Find positive (and negative?) examples.
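A sketch of such a roundtrip test. `read_thingy()` and `write_thingy()` are placeholders implemented here with RDS files purely for illustration; the default `path` and the invisible return of the path are assumptions that make `read_thingy(write_thingy(x))` work as written.

```r
# Writer returns the path invisibly so it can be handed to the reader
write_thingy <- function(x, path = tempfile(fileext = ".rds")) {
  saveRDS(x, path)
  invisible(path)
}

read_thingy <- function(path) {
  readRDS(path)
}

x <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# The roundtrip property, stated as a test
roundtrip_ok <- identical(x, read_thingy(write_thingy(x)))
```

Formats that lose information (for example plain csv, which does not store column types) are natural candidates for the negative examples.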

People don't read (especially repeated output)

i.e. when you start R it prints a bunch of information, but most people have never read it (judging by how few people have heard of citation())

Communicating many problems

What are the conventions for communicating multiple non-fatal problems to the user (i.e. in readr, and with warning and storing data in attribute)
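A base-R sketch of the "single warning plus details in an attribute" convention. `parse_numbers()` and the `problems` attribute layout are hypothetical, loosely modelled on the readr behaviour mentioned above rather than copied from it.

```r
parse_numbers <- function(x) {
  out <- suppressWarnings(as.numeric(x))

  # Values that were present but could not be parsed
  bad <- which(is.na(out) & !is.na(x))

  if (length(bad) > 0) {
    # Store the details as data the user can inspect later
    attr(out, "problems") <- data.frame(
      index = bad,
      value = x[bad],
      stringsAsFactors = FALSE
    )
    # One summary warning instead of one warning per problem
    warning(
      length(bad), " parsing failure(s); see attr(x, \"problems\")",
      call. = FALSE
    )
  }
  out
}

res <- suppressWarnings(parse_numbers(c("1", "x", "3", "y")))
attr(res, "problems")
```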

Avoiding temporal dependence

(notes extracted out of spooky action)

I think this is where poor process leaks into artefacts. By constantly running different scripts or chunks in a long-running shared R process, you create wormholes that are not explicit anywhere in the code.

when running several scripts or knitting several Rmds, each should get its own fresh process
they should not be communicating with each other in the global workspace
they should not be communicating at all, or it should be in some very obvious way, probably through the file system and perhaps a choreographing tool like make or drake
And so working in projects, restarting often, etc. are a method of constantly checking for these unplanned communication channels and eradicating them.

Column renaming conventions

Under what circumstances are column names changed, and how are those changes communicated to the user.

readr:::standardize_path()

Inspired by tidyverse/readr#863

Reusable logic for flexible reading from a (generalized) path:

https://github.com/tidyverse/readr/blob/c6a059a4865041397a48ae6e3ef3d61c629840fe/R/source.R#L111-L151

Possible home to facilitate reuse: fs

User engagement

Marketing matters, and isn't about who shouts the loudest.
Warm and welcoming community
Beginner's mind when thinking about documentation
Thank contributors
Logos/stickers
Give talks

Naming functions: auto-completion and links

It can be nice to think about auto-completion when naming functions. The str_ prefix of stringr makes a great example.
If you care to link out, two favourites:

* [I Shall Call It.. SomethingManager](https://blog.codinghorror.com/i-shall-call-it-somethingmanager/)
* [The Poetry of Function Naming](http://blog.stephenwolfram.com/2010/10/the-poetry-of-function-naming/)