tidyverse / design
Tidyverse design principles
Home Page: https://design.tidyverse.org
License: Other
For vectorised arguments I think we should consistently treat NULL in the same way as if the argument had not been supplied. This is symmetric with our use of NULL
for the default value of optional arguments that need complex calculations.
library(vctrs)
vec_c(TRUE, double())
#> [1] 1
vec_c(TRUE, NULL)
#> [1] TRUE
# Same as
vec_c(TRUE, )
#> [1] TRUE
# Same as
vec_c(TRUE)
#> [1] TRUE
Should you provide control over verbosity?
At the function level or via some big kill switch?
devtools::check(quiet = TRUE)
usethis.quiet affects all of the ui_*() functions
Context: I'm calling check() and install() and test() in .Rmd for R Packages. All of these functions make pretty strong assumptions that they're being run interactively. Such functions are easier to "write prose around" if there are ways to muffle, capture, and redact their output.
I think this might be analogous to format() and print() methods.
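One way to offer both levels of control is a per-function quiet argument whose default reads a package-wide option. The following is a minimal base R sketch under that assumption; announce(), capture_messages() and the option name mypkg.quiet are hypothetical and are not part of devtools or usethis.

```r
# Hypothetical helper: `quiet` can be set per call, and its default
# falls back to a package-wide option (the "big kill switch").
announce <- function(..., quiet = getOption("mypkg.quiet", default = FALSE)) {
  if (!isTRUE(quiet)) {
    message(...)
  }
  invisible(NULL)
}

# Hypothetical helper: collect the messages a call emits instead of
# printing them, so they can be shown or redacted later.
capture_messages <- function(expr) {
  msgs <- character()
  withCallingHandlers(
    expr,
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))
      invokeRestart("muffleMessage")
    }
  )
  msgs
}

captured <- capture_messages(announce("Checking package"))
```

Because the output goes through message(), callers who need neither switch can still use suppressMessages().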
Why you should not rely on tidyverse functions preserving row names (because metadata is data).
The .travis.yml file here needs an update for the new netlify CLI.
Before/current:
deploy:
  provider: script
  script: netlify deploy -t $NETLIFY_PAT
  skip_cleanup: true
After needs to look more like Advanced R:
deploy:
  provider: script
  script: netlify deploy --prod --dir _book
  skip_cleanup: true
and there's the hidden assumption that the netlify PAT has been stored as an encrypted env var named NETLIFY_AUTH_TOKEN on travis.
Auto-build-and-deploy will be broken until we do this.
UTF-8 wherever possible and tips for actually achieving that
How do you decide the return type of your function?
Think about invariants. e.g. is vec_type(f(x)) a constant? Is it vec_type(x)? The less information needed to predict the return type, the better.
Pick the "smallest"/most constrained type that the returned data fits into.
If you return the same type of output from multiple functions, you should create a function that consistently creates exactly the same format (to avoid accidental inconsistency), and consider making it an S3 class (so you can have a custom print method).
When a function returns two vectors of the same size, as a general rule you should return a tibble:
A matrix would only work if the vectors were the same type (and not factor or Date), doesn't make it easy to extract the individual values, and is not easily input to other tidyverse functions.
A list doesn't capture the constraint that both vectors are the same length.
A data frame is ok if you don't want to take a dependency on tibble, but you need to remember the drawbacks: if the columns are character vectors you'll need to remember to use stringsAsFactors = FALSE, and the print method is confusing for list- and df-cols (and you have to create them by modifying an existing data frame, not by calling data.frame()). (Example: it would be weird if glue returned tibbles from a function.)
e.g. str_locate(), str_locate_all()
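As an illustration of returning two same-length vectors as columns, here is a small sketch that uses a base data frame (the "no tibble dependency" option described above). locate_first() is a hypothetical helper, not stringr's implementation; with tibble available, the last line could build a tibble instead.

```r
# Hypothetical helper: locate the first match of `pattern` in each string
# and return start/end as two columns of one data frame.
locate_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  # regexpr() reports "no match" as -1; use NA in both columns instead
  no_match <- start == -1L
  start[no_match] <- NA_integer_
  end[no_match] <- NA_integer_
  data.frame(start = start, end = end)
}

loc <- locate_first(c("banana", "candy", "xyz"), "an")
```

The data frame guarantees that start and end have the same length, which a list would not, and the columns are easy to extract by name.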
From Advanced R:
global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave.
Placeholder for another concrete pattern that recurs in tidyverse/r-lib packages, like #53 Pattern: Cross that Bridge When you Come To It
Sometimes internal data is also useful to users. Write a helper to give it to them in a usable way.
Examples:
readr::readr_example(), readxl::readxl_example() for accessing example files that ship with the packages
googledrive::drive_endpoints(), googledrive::drive_mime_type() for accessing pre-processed knowledge about the Drive API
NULL input means "show me everything"
character or integer input is for selective access
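The access convention in the last two points can be sketched in a few lines of base R. The lookup table .mime_types and the helper mime_type() are hypothetical stand-ins for a package's internal data; they do not reproduce what googledrive does.

```r
# Hypothetical internal lookup table that a package might ship
.mime_types <- c(
  csv = "text/csv",
  json = "application/json",
  png = "image/png"
)

# NULL means "show me everything"; character or integer input selects entries
mime_type <- function(which = NULL) {
  if (is.null(which)) {
    return(.mime_types)
  }
  stopifnot(is.character(which) || is.numeric(which))
  .mime_types[which]
}

everything <- mime_type()
one <- mime_type("json")
some <- mime_type(c(1, 3))
```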
moved manually from hadley/adv-r#1248
abind::abind() has this signature:
abind(..., along=N, <more args>)
where along is "(optional) The dimension along which to bind the arrays. The default is the last dimension, i.e., the maximum length of the dim attribute of the supplied arrays."
Good example for exploring the relative merits of different default strategies. Seems that along = NULL would have been easier to program around / wrap & expose.
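To make the wrapping argument concrete, here is a sketch of the along = NULL strategy. It does not call abind; resolve_along() and bind_info() are hypothetical functions that only compute which dimension would be used, following the documented default (the maximum length of the dim attribute).

```r
# Hypothetical: NULL stands for "use the last dimension"
resolve_along <- function(arrays, along = NULL) {
  if (is.null(along)) {
    n_dims <- vapply(arrays, function(a) length(dim(a)), integer(1))
    along <- max(n_dims)
  }
  along
}

# A wrapper can expose `along` with the same NULL default and pass it through
# without having to recompute the default itself
bind_info <- function(..., along = NULL) {
  resolve_along(list(...), along = along)
}

default_dim <- bind_info(matrix(1:4, 2), array(1:8, c(2, 2, 2)))
explicit_dim <- bind_info(matrix(1:4, 2), along = 1)
```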
skip_if() should be a comment next to the code
From https://edwinth.github.io/multiperson-project/, with permission
save_as_rds <- function(file,
                        filename) {
  node <- Sys.info()["nodename"]
  user <- Sys.info()['user']
  if (node == "server_node_name") {
    path <- "path/to_the_data/on_the/server"
  } else if (user == 'user1') {
    path <- "path/for/user1"
  } else if (user2) {
    path <- "user2/has_data/stored/here"
  }
  file_path <- file.path(path, filename)
  saveRDS(file, file_path)
}
User has a constrained set of values, so reflect that with a switch():
save_as_rds <- function(file,
                        filename) {
  node <- Sys.info()["nodename"]
  user <- Sys.info()['user']
  if (node == "server_node_name") {
    path <- "path/to_the_data/on_the/server"
  } else {
    path <- switch(user,
      user1 = "path/for/user1",
      user2 = "user2/has_data/stored/here",
      stop(glue::glue("Unknown user '{user}'"), call. = FALSE)
    )
  }
  file_path <- file.path(path, filename)
  saveRDS(file, file_path)
}
This also fixes a small bug in the original function (if (user2)), which would soon be discovered but can't happen with switch(). An unknown user now also gets a clear error.
Currently the function is hard to test because it has hidden inputs and mingles computation and side-effects. We can fix this by pulling out a path generation function that has node and user as arguments; this makes it easier to experiment with and test:
user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]) {
  if (node == "server_node_name") {
    "path/to_the_data/on_the/server"
  } else {
    switch(user,
      user1 = "path/for/user1",
      user2 = "user2/has_data/stored/here",
      stop(glue::glue("Unknown user '{user}'"), call. = FALSE)
    )
  }
}
I'd then simplify to use early return:
user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]) {
  if (identical(node, "server_node_name")) {
    return("path/to_the_data/on_the/server")
  }
  switch(user,
    user1 = "path/for/user1",
    user2 = "user2/has_data/stored/here",
    stop(glue::glue("Unknown user '{user}'"), call. = FALSE)
  )
}
(also notice the switch from vectorised == to identical())
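To show what "easier to test" can look like, the sketch below exercises each branch with explicit inputs. It repeats the early-return version of user_home() so that it runs on its own, with one assumption-driven change: paste0() replaces glue::glue() to avoid the package dependency.

```r
# Self-contained copy of the early-return user_home(), using paste0()
# instead of glue so that no package is needed
user_home <- function(node = Sys.info()[["nodename"]],
                      user = Sys.info()[["user"]]) {
  if (identical(node, "server_node_name")) {
    return("path/to_the_data/on_the/server")
  }
  switch(user,
    user1 = "path/for/user1",
    user2 = "user2/has_data/stored/here",
    stop(paste0("Unknown user '", user, "'"), call. = FALSE)
  )
}

# With explicit inputs, every branch can be checked on any machine
on_server <- user_home(node = "server_node_name", user = "anyone")
on_laptop <- user_home(node = "laptop", user = "user1")
unknown <- tryCatch(
  user_home(node = "laptop", user = "user3"),
  error = function(e) conditionMessage(e)
)
```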
It would also be worth considering whether it's better to swap configuration for convention:
user_home <- function(node = Sys.info()[["nodename"]]) {
  if (identical(node, "server_node_name")) {
    path <- "path/to_the_data/on_the/server"
  } else {
    path <- file.path("~/project_name/data")
  }
  if (!file.exists(path)) {
    stop(glue::glue("Data must live at '{path}'"), call. = FALSE)
  }
  path
}
Inspired by this list made by @hadley in early vctrs work, regarding type coercion across the tidyverse.
Functions where we create column names out of thin air or from inputs and could / should do so according to common principles. This list will grow as we stumble across these:
tidyr::spread()
dplyr::inner_join() and friends
dplyr::summarise()
dplyr::mutate()
tidyr::gather()
tidyr::unnest()
Only provide functions with the same name as base R functions if they follow the superset principle, i.e. they only provide additional functionality (e.g. turning into a generic or adding additional arguments)
Examples:
rm(list = ls()).
(But includes anything stateful: collation order, working directory, env vars, library paths, locales, makevars, options, graphics parameters, path, random seed, ...)
Why not? Because these are actions outside the usual scope of effects — if the scope of effects is constrained/contained then you have a simpler model of computation that makes analysing/understanding code easier. Imagine each function/package/script creates a sort of nested tree: it's ok to affect your children, but not your parents.
There's one big generic exception to this rule: a function/package/script can have actions outside of its usual scope if that is its explicit and specific purpose:
It's ok for <- to modify the global environment, because that is its one job. It's ok for save_output(path) to create files in path because it's clear from the name.
It's fine for library(conflicted) to mess with your search path; that is its one purpose.
It's ok for source("class-setup.R") to install packages because the intent of a setup script is to get your computer into the same state as someone else's (but be aware that by doing this, you might break other projects).
(i.e. it's ok if the user explicitly requests that your code do these things, but you should avoid doing it automatically, or as a side-effect of something unrelated)
It's also generally ok to do things temporarily, i.e. it's ok for your function to change global options, as long as you change them back. And it's ok for anyone to write into the temporary directory.
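A temporary change of a global option might look like the sketch below. format_short() is a hypothetical function; options() and on.exit() are base R, and options() returns the previous values so they can be restored.

```r
# Hypothetical function that needs a different `digits` setting while it runs
format_short <- function(x) {
  old <- options(digits = 3)          # options() returns the previous values
  on.exit(options(old), add = TRUE)   # restore them however the function exits
  format(x)
}

before <- getOption("digits")
short_pi <- format_short(pi)
after <- getOption("digits")
```

After the call, before and after are the same, so the caller's session is left as it was found.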
Talking points:
library(usethis): all the functions in usethis are specifically for modifying your computing environment. They are designed to be used interactively, but shouldn't be called automatically (i.e. it's fine to wrap them in a function that is then called by the user, but you shouldn't generally run them in a script)
Assigning multiple objects in a for loop: generally this pattern does not set you up for success because once you have the objects in the environment, how do you work with them? It's better to put them in a list and then you can use the same techniques you would for iterating over values in a vector or columns in a data frame.
library(reprex): the ultimate example of where you want to make small completely self-contained code because you want someone to help you.
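For the for-loop talking point, a short sketch of the list alternative; the data here are made up purely for illustration.

```r
# Instead of creating sample_small, sample_medium, ... with assign() inside
# a for loop, collect the results in one named list
sizes <- c(small = 2, medium = 4, large = 8)
samples <- lapply(sizes, function(n) seq_len(n))

# The same tools that iterate over a vector now work on the results
sample_lengths <- lengths(samples)
sample_max <- vapply(samples, max, integer(1))
```

The names of sizes carry over to samples, so each result can still be reached by name, e.g. samples$medium.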
Not just the ones to the right of the dots as mentioned here:
https://principles.tidyverse.org/dots.html#avoiding-false-matches
Recent tweets here and here pointed out inconsistencies in the use of argument names as function inputs. It may be worth touching on best practices around the use of named arguments in the tidyverse principles. (If not, then please feel free to close this issue.)
library(tidyverse)
(x <- data_frame(
  v1 = c("a", "b"),
  v2 = factor(c("c", "d"), levels = c("c", "d"))
))
#> # A tibble: 2 x 2
#> v1 v2
#> <chr> <fct>
#> 1 a c
#> 2 b d
mutate(x, v1 = recode(v2, "a" = "zzz")) # current = new
#> # A tibble: 2 x 2
#> v1 v2
#> <fct> <fct>
#> 1 c c
#> 2 d d
mutate(x, v2 = fct_recode(v2, "yyy" = "c")) # new = current
#> # A tibble: 2 x 2
#> v1 v2
#> <chr> <fct>
#> 1 a yyy
#> 2 b d
rename(x, xxx = v1) # new = current
#> # A tibble: 2 x 2
#> xxx v2
#> <chr> <fct>
#> 1 a c
#> 2 b d
Created on 2018-10-09 by the reprex package (v0.2.0).
@jennybc commented on Feb 5, 2017, 5:15 PM UTC:
It would be interesting to formalize certain cross-package consistency expectations into tests that run every day or week.
Example: For some challenging csv's make sure readr and readxl (csv -> xls(x) -> df) produce same data frame.
These tests might be a useful complement to the ingest conventions (#34). I think this package would be the natural home for this? Although maybe you wouldn't want tidyverse to show build failing whenever one of these tests fails.
This issue was moved by jennybc from tidyverse/tidyverse/issues/39.
The first in a proposed set of concrete patterns that recur in tidyverse/r-lib packages. Interesting to think about collecting several of these and record them in a way that it's easier to see shared qualities.
When this comes up:
There is a value that might be needed by your package, in many places,
but also might not come up at all. To set this value properly, you need
user input. Once you have that, you want to remember and reuse the
value.
Examples from real life:
httr_oauth_cache
AuthState, which is an R6 object
Sketch of implementation via an option:
User may express their wishes by setting an option.
.Rprofile
options(PACKAGE_THINGY = VALUE)
Package may set the option on load, but deferring to any value the
user may have set. (Discussion below suggests deprecating this bit.)
.onLoad <- function(libname, pkgname) {
  op <- options()
  op.PACKAGE <- list(
    PACKAGE_THINGY = NA
  )
  toset <- !(names(op.PACKAGE) %in% names(op))
  if (any(toset)) options(op.PACKAGE[toset])
  invisible()
}
NA is a default that signals we need the user to do something.
See also the default argument of getOption().
.Have a function to summon the value and store it for the remainder
of the session. If the value is unset, trigger whatever needs to
happen to set it.
get_thingy <- function() {
  ## the option name must be passed to getOption() as a string
  thingy <- getOption("PACKAGE_THINGY", default = NA)
  if (is.na(thingy)) {
    thingy <- make_user_decide()
    ## in the above interaction, tell user how to set this option at startup
    ## and never see the interactive prompt again
    options(PACKAGE_THINGY = thingy)
  }
  match.arg(thingy, choices = thingy_values)
}
If it’s likely that user might want to provide thingy in an ad hoc manner, expose it as an argument of relevant functions, with a default value thingy = get_thingy().
If thingy is the sort of thing most people either don’t think about or have strong opinions about, just use get_thingy() behind the scenes. (Also applies if thingy is not something users should manage with bare hands, e.g. an OAuth token). Those who don’t want to think about it will be forced to provide permissions or details once per session. Those who care deeply and find this irritating will be motivated to set the PACKAGE_THINGY option in a startup file.
I wrote this while working on usethis::use_git_protocol() so it will be interesting to consult that once it’s done as another concrete example.
e.g. int64 via bit64 package
When to use (always?), and basic code template (following abort_bad_arg() from adv-r)
If you use the same form of an error message in multiple places, extract it out into a function.
Use a custom condition. Consider using glue.
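A minimal base R sketch of that advice follows. It is not the Advanced R template: abort_bad_type(), safe_log() and the class name error_bad_type are hypothetical, and sprintf() stands in for glue to keep the example dependency-free.

```r
# Hypothetical constructor for one recurring error message, signalled as a
# custom condition so callers can catch it by class
abort_bad_type <- function(arg, expected, actual) {
  msg <- sprintf("`%s` must be %s, not %s.", arg, expected, actual)
  cond <- structure(
    list(message = msg, call = NULL, arg = arg),
    class = c("error_bad_type", "error", "condition")
  )
  stop(cond)
}

# Every function that needs this check reuses the same wording
safe_log <- function(x) {
  if (!is.numeric(x)) {
    abort_bad_type("x", "numeric", typeof(x))
  }
  log(x)
}

caught <- tryCatch(safe_log("a"), error_bad_type = function(e) e)
```

Catching by class means tests and callers do not need to match on the message text, which is then free to change.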
Even if a matrix is sufficient to model the return value, a data frame is more convenient since there are few places in the tidyverse that force you to use matrices. The primary exception is stringr.
Containing commonly used words
Each word would be an h3 (so we can link to it) followed by a paragraph definition
cc @batpigandme
Like options() and par().
Also need to ensure that first argument has same type as output.
Together this allows you to use a nice on.exit() pattern.
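A sketch of a setter designed this way, assuming the point is that a state-changing function should invisibly return the previous value. The environment .settings and the functions set_verbose() and with_verbose() are hypothetical.

```r
# Hypothetical package-level state with an options()-like setter
.settings <- new.env()
assign("verbose", FALSE, envir = .settings)

set_verbose <- function(value) {
  old <- get("verbose", envir = .settings)
  assign("verbose", value, envir = .settings)
  invisible(old)  # same type as the input, so it can be passed straight back
}

# The returned old value makes set-and-restore a two-line idiom
with_verbose <- function(code) {
  old <- set_verbose(TRUE)
  on.exit(set_verbose(old), add = TRUE)
  code
}

inside <- with_verbose(get("verbose", envir = .settings))
outside <- get("verbose", envir = .settings)
```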
is_foofy <- function(x) x %in% c("a", "b", "c")
# How do you make clear which cases need to be handled?
if (is_foofy(x)) {
  if (x == "a") {
  } else if (x == "b") {
  } else if (x == "c") {
  }
}
# How do you make the 3 cases clear?
n_unnamed <- sum(!named)
if (n_unnamed == 0) {
  # do nothing
} else if (n_unnamed == 1) {
  clauses <- c(clauses, build_sql("ELSE ", input[!named]))
} else {
  stop("Can only have one unnamed (ELSE) input", call. = FALSE)
}
What are obvious patterns of duplicated code in R? How do you fix them? (e.g. make a for loop, use a functional, make a functional)
What are less obvious patterns? How do you identify them? How do you fix them?
i.e.
readLines <- function(...) stop("Use read_lines() instead")
Because you can't forget
Something about writing functions and packages that meet the user halfway -- or more!
This is about default behaviour. It's related to humane defaults for arguments but on a larger scale. Be willing to take instructions in the user's preferred terminology and translate it to what the computer requires, internally, without pedantry or drama.
An example of the principle I have in mind is how ggplot2::ggsave() can infer figure file format from the extension of the putative file name. There are some other really humane touches to this function. This was one of my most delightful discoveries when I switched to ggplot2 from base/lattice.
A good base R example is the ability to specify legend location from the keywords "bottomright", "bottom", "bottomleft", "left", etc. instead of, e.g., 1, 2, 3, or 4 (par(mar)). The whole notion of a formula interface may also qualify.
My plan for this issue is to use it to collect more positive and negative examples of this principle.
[ for selecting multiple things
[[ for selecting one thing
purrr::pluck()?
i.e. only recycle vectors of length 1 to length of longest.
Describe the rules for recycling vectors of length 0. @jimhester did we discuss those rules for glue?
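The length-1 rule can be sketched as below. recycle_common() is a hypothetical helper written for this illustration. Zero-length inputs get no special treatment here, because those rules are exactly the open question: in this sketch they only pass when every input has length 0.

```r
# Hypothetical helper: recycle length-1 vectors to the common size and
# signal an error for any other mismatch
recycle_common <- function(...) {
  args <- list(...)
  if (length(args) == 0) {
    return(list())
  }
  sizes <- lengths(args)
  size <- max(sizes)
  bad <- sizes != size & sizes != 1L
  if (any(bad)) {
    stop("Can only recycle vectors of length 1", call. = FALSE)
  }
  lapply(args, function(x) if (length(x) == 1L) rep(x, size) else x)
}

recycled <- recycle_common(1:3, 0L)
mismatch <- tryCatch(recycle_common(1:3, 1:2), error = function(e) e)
```

Unlike base R's recycling, a length-2 vector is never silently stretched to length 3 or 4.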
e.g. NULL, ggplot2::waiver(), rlang::done()
We ignore trailing empty arguments by default in all dots-collecting functions ...
Some topics:
@jennybc commented on Jan 24, 2017, 7:11 AM UTC:
The requested brain dump to get this started.
The function that reads thingy should be named read_thingy(). The opposite, when it exists, should be named write_thingy().
drop = FALSE behaviour of [ or the lack of partial matching on $.
If thingy has no data, return a tibble with 0 rows and 0 columns.
col_names is the tidyverse answer to header = TRUE. Either logical indicating that first row gives variable names or character vector of names.
stringsAsFactors = FALSE.
guess_max = min(1000, n_max)
col_types is the tidyverse answer to colClasses. There is an entire system for type specification, with short codes or more general "collectors".
skip = 0, n_max = Inf.
comment
NA values. Use quoted_na to specify what happens to "NA".
This issue was moved by jennybc from tidyverse/tidyverse/issues/34.
Think about autocomplete.
When possible, if you have read_thingy(), it is good to have write_thingy() and for x to be identical to read_thingy(write_thingy(x)) and to test for that. Find positive (and negative?) examples.
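A round-trip check of this kind might look like the sketch below. read_thingy() and write_thingy() are the placeholder names from the text, implemented here with base R's saveRDS() and readRDS() purely as an assumption for illustration.

```r
# Hypothetical read/write pair; write_thingy() returns the path invisibly
# so that the two functions compose
write_thingy <- function(x, path = tempfile(fileext = ".rds")) {
  saveRDS(x, path)
  invisible(path)
}

read_thingy <- function(path) {
  readRDS(path)
}

x <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
round_trip <- read_thingy(write_thingy(x))
same <- identical(x, round_trip)
```

Text formats are a harder case than this one, since column types have to be re-inferred on reading, which is where a test like this earns its keep.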
i.e. when you start R it prints a bunch of information, but most people have never read it (judging by how few people have heard of citation())
What are the conventions for communicating multiple non-fatal problems to the user (e.g. as in readr, with a warning and the details stored in an attribute)?
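One possible shape for that convention, sketched in base R: a single warning summarises the failures and the details travel with the result. parse_integers() and the attribute name "problems" are hypothetical and only loosely modelled on the readr approach mentioned above; this is not readr's implementation.

```r
# Hypothetical parser: values that fail to parse become NA, one warning
# summarises them, and the details are stored in a "problems" attribute
parse_integers <- function(x) {
  out <- suppressWarnings(as.integer(x))
  bad <- which(is.na(out) & !is.na(x))
  if (length(bad) > 0) {
    attr(out, "problems") <- data.frame(
      position = bad,
      value = x[bad],
      stringsAsFactors = FALSE
    )
    warning(
      length(bad), " parsing failure(s); see the \"problems\" attribute",
      call. = FALSE
    )
  }
  out
}

res <- suppressWarnings(parse_integers(c("1", "x", "3", "y")))
problems <- attr(res, "problems")
```

The user sees one warning rather than one per bad value, and can inspect the data frame of problems when they choose to.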
(notes extracted out of spooky action)
I think this is where poor process leaks into artefacts. By constantly running different scripts or chunks in a long-running shared R process, you create wormholes that are not explicit anywhere in the code.
when running several scripts or knitting several Rmds, each should get its own fresh process
they should not be communicating with each other in the global workspace
they should not be communicating at all, or it should be in some very obvious way, probably through the file system and perhaps a choreographing tool like make or drake
And so working in projects, restarting often, etc. are a method of constantly checking for these unplanned communication channels and eradicating them.
Under what circumstances are column names changed, and how are those changes communicated to the user?
Inspired by tidyverse/readr#863
Reusable logic for flexible reading from a (generalized) path:
Possible home to facilitate reuse: fs
It can be nice to think about auto-completion when naming functions. The str_ prefix of stringr makes a great example.
If you care to link out, two favourites:
* [I Shall Call It.. SomethingManager](https://blog.codinghorror.com/i-shall-call-it-somethingmanager/)
* [The Poetry of Function Naming](http://blog.stephenwolfram.com/2010/10/the-poetry-of-function-naming/)
When we munge names (or refrain from doing so) and exactly how.
Most recent group discussion of name repair:
Which led to tibble::set_tidy_names(), tibble::tidy_names() in this PR:
Discuss.