tinylabels

Variable labels are useful in many data-analysis contexts, but base R does not provide them. Several R packages have introduced (sometimes conflicting) implementations (e.g., Hmisc, haven, sjlabelled), but these packages come with extensive dependencies. Following the tinyverse philosophy of keeping the dependency graph tiny, tinylabels sets out to provide variable labels without depending on any non-base R packages, while remaining as compatible as possible with other implementations. Another deliberate choice is that tinylabels implements only variable labels, not value labels.

With a tiny dependency graph and a small code base, we hope that tinylabels may become an excellent choice for package developers, as well as for data analysts who care about dependencies.

Function overview:

Assign variable label: variable_label()<-
Extract variable label: variable_label()
Remove labels and tiny_labelled class: unlabel()
Assign labels in a piped workflow: label_variables()

Basic Usage

Assign a variable label to a vector:

library(tinylabels)

x <- rnorm(6)
variable_label(x) <- "Values randomly drawn from a standard-normal distribution"
x
#> Variable label     : Values randomly drawn from a standard-normal distribution
#> [1] -2.1576383  1.3916812  1.1212565 -0.2966887  0.6191351 -0.0413536

# Extract the variable label from a vector (e.g., a numeric vector)
variable_label(x)
#> [1] "Values randomly drawn from a standard-normal distribution"

It is also possible to assign variable labels to (all or a subset of) the columns of a data frame. For instance, consider the built-in data set npk:

# View original data set ----
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...

# Assign labels to the built-in data set 'npk' ----
variable_label(npk) <- c(
  N = "Nitrogen"
  , P = "Phosphate"
  , yield = "Pea yield"
)

# View the altered data set ----
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>   ..- attr(*, "label")= chr "Nitrogen"
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>   ..- attr(*, "label")= chr "Phosphate"
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: 'tiny_labelled' num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
#>   ..- attr(*, "label")= chr "Pea yield"

Each labelled column now has an attribute label that contains the respective variable label, and is of class tiny_labelled, which extends the simple vector classes (e.g., logical, integer, double).
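
The same attribute and class can be inspected with base R tools. The following is a minimal sketch; the vector reaction_time and its label are made up for illustration:

```r
library(tinylabels)

# Illustrative data (not from the package documentation)
reaction_time <- c(512, 430, 655)
variable_label(reaction_time) <- "Reaction time (ms)"

# The label is stored as a plain attribute named "label"
attr(reaction_time, "label")
#> [1] "Reaction time (ms)"

# The vector carries the additional S3 class 'tiny_labelled'
inherits(reaction_time, "tiny_labelled")
#> [1] TRUE
```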

You can access the variable labels by using

variable_labels(npk)
#> $block
#> NULL
#> 
#> $N
#> [1] "Nitrogen"
#> 
#> $P
#> [1] "Phosphate"
#> 
#> $K
#> NULL
#> 
#> $yield
#> [1] "Pea yield"

The return value of variable_label() applied to a data frame is a named list, where each list element corresponds to a column of the data frame. For columns that do not have a variable label, the corresponding list element is NULL.
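
Because unlabelled columns are represented by NULL, the list can be filtered with base R to find out which columns carry a label. A minimal sketch, using a made-up data frame dat:

```r
library(tinylabels)

# Illustrative data frame; only columns a and c receive labels
dat <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(TRUE, FALSE, TRUE))
variable_label(dat) <- c(a = "First column", c = "Third column")

# Drop the NULL entries to keep only the labelled columns
labelled <- Filter(Negate(is.null), variable_label(dat))

names(labelled)
#> [1] "a" "c"
```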

For data frames, the right-hand side of the assignment has to be a named list or vector. If you try to label columns that are not present in the data frame, an informative error is raised:

variable_label(npk) <- c(wrong_column_name = "A supposedly terrific label")
#> Error: While trying to set variable labels, some requested columns could not be found in data.frame:
#> 'wrong_column_name'
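
If labels come from an external source (e.g., a codebook), it may be preferable to check the names before assigning rather than to rely on the error. The following sketch uses only base R for the check; the data frame dat and the vector requested are hypothetical:

```r
library(tinylabels)

# Hypothetical data and label specification; "respnse" is a deliberate typo
dat <- data.frame(dose = c(1, 2, 3), response = c(0.2, 0.5, 0.9))
requested <- c(dose = "Dose (mg)", respnse = "Response rate")

# Split the requested names into columns that exist and columns that do not
known   <- intersect(names(requested), names(dat))
unknown <- setdiff(names(requested), names(dat))

if (length(unknown) > 0) {
  message("Skipping labels for unknown columns: ", paste(unknown, collapse = ", "))
}

# Assign only the labels whose columns are present
variable_label(dat) <- requested[known]
variable_label(dat)$dose
```

Whether to skip unknown names or to stop is a design choice; the package's own behaviour, as shown above, is to stop with an error.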

If you want to remove labels (and the corresponding S3 class) from a vector or from all columns of a data frame, you can use unlabel():

# Return as a simple factor ----
unlabel(npk$N)
#>  [1] 0 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 0
#> Levels: 0 1

# Remove all labels (and class 'tiny_labelled') from all columns ----
npk <- unlabel(npk)
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
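
To confirm that both the label and the class are gone, the result of unlabel() can be checked with base R. A small sketch with a made-up vector scores:

```r
library(tinylabels)

# Illustrative vector and label
scores <- c(3, 5, 4)
variable_label(scores) <- "Satisfaction rating"

plain <- unlabel(scores)

# Neither the label attribute nor the class remains
is.null(attr(plain, "label"))
#> [1] TRUE
inherits(plain, "tiny_labelled")
#> [1] FALSE
```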

Supporting the tidyverse

Developing tinylabels, we aimed at supporting the tidyverse while also keeping the package’s dependency graph as tiny as possible. For this reason, tidyverse packages are not imported. However, if you have already installed the vctrs package on your computer, tinylabels will dynamically register methods for vctrs::vec_ptype2() and vctrs::vec_cast() that are necessary to play well with the tidyverse. With these methods, the tidyverse’s data manipulation functions become rather explicit about how they handle labels.

For instance, if you try to combine two vectors that have different labels, only the label of the first vector is retained, and you will receive an informative warning. Consider dplyr::bind_rows() applied to two data frames with non-matching variable labels:

data_1 <- data_2 <- data.frame(
  x = rnorm(10)
  , y = rnorm(10)
)

variable_label(data_1) <- c(x = "Label for x", y = "Label for y")
variable_label(data_2) <- c(x = "Label for x", y = "Another label for y")

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
combined_data <- bind_rows(data_1, data_2)
#> Warning in vec_ptype2.tiny_labelled.tiny_labelled(x = x, y = y, x_arg = x_arg, :
#> While combining two labelled vectors, variable label 'Another label for y' was
#> dropped and variable label 'Label for y' was retained.
variable_label(combined_data)
#> $x
#> [1] "Label for x"
#> 
#> $y
#> [1] "Label for y"
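
One way to avoid the warning is to harmonize the labels before combining. The sketch below copies the labels of one data frame onto the other; the object names first and second are illustrative, and the dplyr call is only run if dplyr is installed:

```r
library(tinylabels)

# Two illustrative data frames whose label for y differs
first  <- data.frame(x = c(1, 2, 3), y = c(0.1, 0.2, 0.3))
second <- data.frame(x = c(4, 5, 6), y = c(0.4, 0.5, 0.6))
variable_label(first)  <- c(x = "Label for x", y = "Label for y")
variable_label(second) <- c(x = "Label for x", y = "Another label for y")

# Copy the labels of `first` onto `second`, skipping unlabelled columns
reference <- Filter(Negate(is.null), variable_label(first))
variable_label(second) <- reference

# With matching labels, there is no conflict left to resolve when combining
if (requireNamespace("dplyr", quietly = TRUE)) {
  combined <- dplyr::bind_rows(first, second)
}
```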

To further support tidyverse-ish code, we also wrote the function label_variables() that is intended to be used in conjunction with the tidyverse’s pipe operator:

test <- npk %>%
  group_by(N, P) %>%
  summarize(yield = mean(yield), .groups = "keep") %>%
  label_variables(N = "Nitrogen", P = "Phosphate", yield = "Average yield")

variable_labels(test)
#> $N
#> [1] "Nitrogen"
#> 
#> $P
#> [1] "Phosphate"
#> 
#> $yield
#> [1] "Average yield"
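
label_variables() does not require the tidyverse itself. As a sketch, assuming R 4.1 or later for the native pipe, the following assigns the same labels once in a pipe and once with the replacement function, and then compares the results:

```r
library(tinylabels)

# Piped style, using the native pipe and the built-in data set npk
piped <- datasets::npk |>
  label_variables(N = "Nitrogen", yield = "Pea yield")

# Replacement-function style
assigned <- datasets::npk
variable_label(assigned) <- c(N = "Nitrogen", yield = "Pea yield")

# Compare the labels produced by both styles
unlist(variable_label(piped))
unlist(variable_label(assigned))
```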

Use variable labels in ggplots

Hi Marius,

I think it would be great if the labels attached to a data.frame could be used in ggplots. I suppose that making this the default would require some changes to the internals of ggplot2, but I think there is a convenient way to achieve this "on demand".

Example plot:

library("ggplot2")
library("tinylabels")

# Set up data & labels
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V-shaped", "Straight"))
  am <- factor(am, labels = c("Automatic", "Manual"))
  cyl  <- factor(cyl)
  gear <- factor(gear)
}) |>
  label_variables(
    wt = "Weight (1000 lbs)",
    mpg = "Fuel economy (mpg)",
    gear = "Gears"
  )

# Create plot
p1 <- ggplot(mtcars2, aes(x = wt, y = mpg, colour = gear)) +
  geom_point()

To make this work, we need a dummy object with a defined class, so that we can add a new method to ggplot2::ggplot_add(). The dummy object could, in principle, add more control, but I saw no need here. In this example, I call the class tinylabels and define the method such that it simply uses the variable labels, if available, and falls back on the column names otherwise. This currently relies on an unexported papaja function.

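# Constructor for a dummy object whose only purpose is to trigger the
# ggplot_add() method below when it is added to a plot with `+`
variable_labels_as_labs <- function(x) {
  structure(list(), class = "tinylabels")
}

# ggplot_add() method: replace the default axis and legend titles with variable labels
ggplot_add.tinylabels <- function(object, plot, object_name) {
  # Names of the columns mapped to each aesthetic (assumes plain column names)
  aesthetics <- lapply(plot$mapping, function(x) rlang::as_name(rlang::f_rhs(x)))
  # Unexported papaja helper; gives unlabelled columns their column name as label
  dat <- papaja:::default_label.data.frame(plot$data)
  # Look up the label of each mapped column and name it by its aesthetic
  object <- variable_labels(dat)[unlist(aesthetics)]
  names(object) <- names(plot$mapping)
  ggplot2::update_labels(plot, object)
}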

Now, all we need to do is this:

p1 + variable_labels_as_labs()

Of course, this can subsequently be updated or overwritten as needed:

p1 + 
  variable_labels_as_labs() +
  labs(y = "Fuel economy (mpg)")
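
For comparison, a similar result can be sketched without the unexported papaja function and without a ggplot_add() method, by resolving the labels up front and passing them to ggplot2::labs(). This is a different technique from the one proposed above: the helper resolve_axis_labels() is hypothetical (it is part of neither tinylabels nor papaja), and the aesthetic-to-column mapping has to be stated explicitly:

```r
library(tinylabels)

# Hypothetical helper: for each aesthetic, look up the variable label of the
# mapped column and fall back on the column name when no label is set.
# `columns` is a named character vector (aesthetic name = column name).
resolve_axis_labels <- function(data, columns) {
  labels <- variable_label(data)[columns]
  resolved <- vapply(
    seq_along(columns),
    function(i) {
      if (is.null(labels[[i]])) columns[[i]] else as.character(labels[[i]])
    },
    character(1)
  )
  names(resolved) <- names(columns)
  resolved
}

cars <- mtcars
variable_label(cars) <- c(wt = "Weight (1000 lbs)", mpg = "Fuel economy (mpg)")

# gear has no label, so its column name is used instead
axis_labels <- resolve_axis_labels(cars, c(x = "wt", y = "mpg", colour = "gear"))
axis_labels

# Only build the plot if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p2 <- ggplot2::ggplot(cars, ggplot2::aes(x = wt, y = mpg, colour = gear)) +
    ggplot2::geom_point() +
    do.call(ggplot2::labs, as.list(axis_labels))
}
```

The trade-off is that the mapping is written twice (once in aes(), once for the helper), which the ggplot_add() approach avoids.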

All of this seems like a small addition in terms of code to maintain (I think it would only require additional entries in Suggests), but it would be a nice additional use for the labels. What do you think?

Repository: mariusbarth / tinylabels
