tinylabels's Introduction

tinylabels

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Variable labels are useful in many data-analysis contexts, but R does not provide variable labels in its base distribution. Several R packages introduced (sometimes conflicting) implementations (e.g., Hmisc, haven, sjlabelled), but these packages come with extensive dependencies. Following the philosophy of a tiny dependency graph (the tinyverse philosophy), tinylabels set out to provide functionality for variable labels without depending on any non-base R packages, while also being as compatible as possible with other implementations. Another deliberate choice is that tinylabels only implements variable labels, not value labels.

With a tiny dependency graph and a small code base, we hope that tinylabels may become an excellent choice for package developers, as well as for data analysts who care about dependencies.

Function overview:

  • Assign variable label: variable_label()<-
  • Extract variable label: variable_label()
  • Remove labels and tiny_labelled class: unlabel()
  • Assign labels in a piped workflow: label_variables()

Basic Usage

Assign a variable label to a vector:

library(tinylabels)

x <- rnorm(6)
variable_label(x) <- "Values randomly drawn from a standard-normal distribution"
x
#> Variable label     : Values randomly drawn from a standard-normal distribution
#> [1] -2.1576383  1.3916812  1.1212565 -0.2966887  0.6191351 -0.0413536
# Extract the variable label from a vector (e.g., a numeric vector)
variable_label(x)
#> [1] "Values randomly drawn from a standard-normal distribution"

It is also possible to assign variable labels to (all or a subset of) the columns of a data frame. For instance, consider the built-in data set npk:

# View original data set ----
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
# Assign labels to the built-in data set 'npk' ----
variable_label(npk) <- c(
  N = "Nitrogen"
  , P = "Phosphate"
  , yield = "Pea yield"
)

# View the altered data set ----
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>   ..- attr(*, "label")= chr "Nitrogen"
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>   ..- attr(*, "label")= chr "Phosphate"
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: 'tiny_labelled' num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...
#>   ..- attr(*, "label")= chr "Pea yield"

Each labelled column now has an attribute label that contains the respective variable label. It is also now of class tiny_labelled, which extends the simple vector classes (e.g., logical, integer, double).
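Both pieces can be inspected with base R. The following is a minimal sketch that assumes tinylabels is installed; it works on a local copy of npk, and the name npk_labelled is chosen here only for illustration:

```r
library(tinylabels)

# Work on a copy so that the built-in data set stays untouched
npk_labelled <- datasets::npk
variable_label(npk_labelled) <- c(yield = "Pea yield")

# The label is stored as a plain attribute named "label"
attr(npk_labelled$yield, "label")

# The column additionally carries the S3 class 'tiny_labelled'
inherits(npk_labelled$yield, "tiny_labelled")

# Columns that were not labelled have no such attribute
attr(npk_labelled$K, "label")
```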

You can access the variable labels of all columns at once:

variable_labels(npk)
#> $block
#> NULL
#> 
#> $N
#> [1] "Nitrogen"
#> 
#> $P
#> [1] "Phosphate"
#> 
#> $K
#> NULL
#> 
#> $yield
#> [1] "Pea yield"

The return value of variable_label() applied to a data frame is a named list, where each list element corresponds to a column of the data frame. For columns that do not have a variable label, the corresponding list element is NULL.
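Because unlabelled columns yield NULL, code that needs exactly one string per column has to handle that case explicitly. Here is a minimal base-R sketch that uses a hand-written list mirroring the output above; the helper labels_or_names() is hypothetical and not part of tinylabels:

```r
# Hypothetical helper: return one string per column, falling back to the
# column name wherever the label is NULL
labels_or_names <- function(labels) {
  has_label <- !vapply(labels, is.null, logical(1))
  out <- names(labels)
  out[has_label] <- unlist(labels[has_label], use.names = FALSE)
  stats::setNames(out, names(labels))
}

# Hand-written stand-in for the list returned for the labelled npk data
npk_labels <- list(
  block = NULL
  , N = "Nitrogen"
  , P = "Phosphate"
  , K = NULL
  , yield = "Pea yield"
)

labels_or_names(npk_labels)
```

The fallback to column names is only one possible choice; an empty string or NA_character_ would work just as well, depending on what the downstream code expects.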

For data frames, the right-hand side of the assignment has to be a named list or vector. When trying to set columns that are not present in the data frame, a meaningful error message is thrown:

variable_label(npk) <- c(wrong_column_name = "A supposedly terrific label")
#> Error: While trying to set variable labels, some requested columns could not be found in data.frame:
#> 'wrong_column_name'
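
If the labels come from an external source and may mention columns that are absent, one option is to subset them before assigning. This is a sketch under the assumption that silently skipping unknown names is acceptable; all_labels is made-up illustration data:

```r
library(tinylabels)

# Work on a copy so that the built-in data set stays untouched
npk_copy <- datasets::npk

# Made-up labels, one of which refers to a column that npk does not have
all_labels <- c(
  N = "Nitrogen"
  , wrong_column_name = "A supposedly terrific label"
)

# Keep only those labels whose names match existing columns
known <- all_labels[names(all_labels) %in% colnames(npk_copy)]
variable_label(npk_copy) <- known

variable_label(npk_copy$N)
```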

If you want to remove labels (and the corresponding S3 class) from a vector or from all columns of a data frame, you may use the function unlabel():

# Return as a simple factor ----
unlabel(npk$N)
#>  [1] 0 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 0
#> Levels: 0 1

# Remove all labels (and class 'tiny_labelled') from all columns ----
npk <- unlabel(npk)
str(npk)
#> 'data.frame':    24 obs. of  5 variables:
#>  $ block: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
#>  $ N    : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 1 2 ...
#>  $ P    : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 2 ...
#>  $ K    : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 2 1 1 2 ...
#>  $ yield: num  49.5 62.8 46.8 57 59.8 58.5 55.5 56 62.8 55.8 ...

Supporting the tidyverse

Developing tinylabels, we aimed at supporting the tidyverse while also keeping the package’s dependency graph as tiny as possible. For this reason, tidyverse packages are not imported. However, if you have already installed the vctrs package on your computer, tinylabels will dynamically register methods for vctrs::vec_ptype2() and vctrs::vec_cast() that are necessary to play well with the tidyverse. With these methods, the tidyverse’s data manipulation functions become rather explicit about how they handle labels.

For instance, if you try to combine two vectors that have different labels, only the label of the first vector is retained, and you will receive a meaningful warning message. Consider dplyr::bind_rows() applied to two data frames with non-matching variable labels:

data_1 <- data_2 <- data.frame(
  x = rnorm(10)
  , y = rnorm(10)
)

variable_label(data_1) <- c(x = "Label for x", y = "Label for y")
variable_label(data_2) <- c(x = "Label for x", y = "Another label for y")
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
combined_data <- bind_rows(data_1, data_2)
#> Warning in vec_ptype2.tiny_labelled.tiny_labelled(x = x, y = y, x_arg = x_arg, :
#> While combining two labelled vectors, variable label 'Another label for y' was
#> dropped and variable label 'Label for y' was retained.
variable_label(combined_data)
#> $x
#> [1] "Label for x"
#> 
#> $y
#> [1] "Label for y"

To further support tidyverse-ish code, we also wrote the function label_variables() that is intended to be used in conjunction with the tidyverse’s pipe operator:

test <- npk %>%
  group_by(N, P) %>%
  summarize(yield = mean(yield), .groups = "keep") %>%
  label_variables(N = "Nitrogen", P = "Phosphate", yield = "Average yield")

variable_labels(test)
#> $N
#> [1] "Nitrogen"
#> 
#> $P
#> [1] "Phosphate"
#> 
#> $yield
#> [1] "Average yield"

tinylabels's Issues

Use variable labels in ggplots

Hi Marius,

I think it would be great if the labels attached to a data.frame could be used in ggplots. I suppose that making this the default would require some changes to the internals of ggplot2. But I think there is a convenient way to achieve this "on demand".

Example plot:

library("ggplot2")
library("tinylabels")

# Set up data & labels
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V-shaped", "Straight"))
  am <- factor(am, labels = c("Automatic", "Manual"))
  cyl  <- factor(cyl)
  gear <- factor(gear)
}) |>
  label_variables(
    wt = "Weight (1000 lbs)",
    mpg = "Fuel economy (mpg)",
    gear = "Gears"
  )

# Create plot
p1 <- ggplot(mtcars2, aes(x = wt, y = mpg, colour = gear)) +
  geom_point()

To make this work, we need a dummy object with a defined class, so that we can add a new method to ggplot2::ggplot_add(). The dummy object could, in principle, add more control, but I saw no need here. In this example, I call the class tinylabels and define the method such that it simply uses the variable labels, if available, and falls back on the column names otherwise. This currently relies on an unexported papaja function.

variable_labels_as_labs <- function(x) {
  structure(list(), class = "tinylabels")
}

ggplot_add.tinylabels <- function (object, plot, object_name) {
  aesthetics <- lapply(plot$mapping, function(x) rlang::as_name(rlang::f_rhs(x)))
  dat <- papaja:::default_label.data.frame(plot$data)
  object <- variable_labels(dat)[unlist(aesthetics)]
  names(object) <- names(plot$mapping)
  ggplot2::update_labels(plot, object)
}

Now, all we need to do is this:

p1 + variable_labels_as_labs()

Of course, this can subsequently be updated or overwritten as needed:

p1 + 
  variable_labels_as_labs() +
  labs(y = "Fuel economy (mpg)")

All this seems like a minor addition in terms of to-be-maintained code (and I think it would just require additional suggests), but a nice additional use for the labels. What do you think?

Data documentation with roxygen2

Hi @mariusbarth,

I would like to use assigned variable labels to programmatically generate documentation for a dataset to be included in {roxygen2} documentation. It is possible to execute code inline, and I would like to generate an itemized list of column names and variable labels.

Consider the following example:

labelled_df <- cars |>
  tinylabels::label_variables(
    speed = "Speed [mph]"
    , dist = "Distance [ft]"
  )

setGeneric("document_data", function(object) {
  standardGeneric("document_data")
})

setMethod("document_data", signature(object = "data.frame"), function(object) {
  label_df <- data.frame(
    colname = colnames(object)
    , variable_label = tinylabels::variable_labels(object) |>
      as.character() |>
      unlist()
  )

  items <- apply(
    label_df
    , 1
    , \(x) paste0("  \\item{", x["colname"], "}{", x["variable_label"], "}")
  )

  documentation <- paste(
    "\\describe{"
    , paste(items, collapse = "\n")
    , "}"
    , sep = "\n"
  )
  cat(documentation)
  invisible(label_df)
})

document_data(object = labelled_df)
\describe{
  \item{speed}{Speed [mph]}
  \item{dist}{Distance [ft]}
}

This approach could be extended to provide nicer typesetting for labels containing expressions or for other object types.

Is this something you would be willing to include in {tinylabels}? If so, I'd be happy to open a PR.

Dplyr joins fail when tinylabels is not loaded

I've run into an interesting bug when working with datasets that are loaded from RDS format. I prepared the data in one program and used tinylabels to label the variables before saving the RDS file so that others could use it. I assumed that people would be able to read and work with these datasets without issue, since tinylabels just sets an attribute (much like the way other libraries handle labels).

However, it seems that if you don't load the tinylabels package, dplyr won't understand what to do with the labelled columns. For example, if you try to join a labelled dataset with an unlabelled dataset, it will fail with an "incompatible types" error.

Here's an attempt at a reproducible example. First, run this code in a fresh R session (set the working directory as needed):

library(dplyr)
library(tinylabels)

dfA = tibble(id = as.character(1:5),
             valA = 1:5 * 2)
dfB = tibble(id = as.character(1:5),
             valB = 1:5 * 3) %>%
  label_variables(id='Identifier',
                  valB = 'Value')

test1 = dfA %>% left_join(dfB)

saveRDS(dfA,'A.RDS')
saveRDS(dfB,'B.RDS')

Then restart R and run this code (again with the same working directory):

library(dplyr)

dfA = readRDS('A.RDS')
dfB = readRDS('B.RDS')
test2 = dfA %>% left_join(dfB)

It will fail like this:

> test2 = dfA %>% left_join(dfB)
Joining with `by = join_by(id)`
Error in `left_join()`:
! Can't join `x$id` with `y$id` due to incompatible types.
ℹ `x$id` is a <character>.
ℹ `y$id` is a <tiny_labelled>.

I'm new to this library so I don't have any idea of how to best fix this, or if it even should be fixed, but, in my view, it would be nice if the labelling were able to persist even when the library is not loaded. I plan to do some further testing with other methods of labelling variables in R, as I don't recall ever having a similar issue in the past when working with labelled data using other libraries.
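
To check whether a data set read from disk is affected, the column classes can be inspected with base R alone, without tinylabels attached. This is a minimal sketch: the data frame is built by hand to imitate a labelled column (the class vector is an assumption based on the error message above), and find_labelled() is a hypothetical helper:

```r
# Hypothetical helper: names of all columns that carry class 'tiny_labelled'
find_labelled <- function(df) {
  is_labelled <- vapply(df, inherits, logical(1), what = "tiny_labelled")
  names(df)[is_labelled]
}

# Hand-built stand-in for a data set read with readRDS(); the exact class
# vector that tinylabels stores is assumed here
id <- as.character(1:5)
attr(id, "label") <- "Identifier"
class(id) <- c("tiny_labelled", "character")

dfB <- data.frame(valB = 1:5 * 3)
dfB$id <- id

# Lists the columns that may trigger the "incompatible types" error
find_labelled(dfB)
```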
