larmarange / labelled

Manipulating labelled vectors in R

Home Page: https://larmarange.github.io/labelled/

License: GNU General Public License v3.0

R 100.00%
r metadata labels cran haven spss sas stata

labelled's Issues

Strange behavior when creating labelled variables in data frames

I want to create a variable named foo, but a variable named x is created instead (with the value labels):

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data.frame(
  foo = labelled::labelled(1:5, c(a=1, b=2))
) %>%
  str()
#> 'data.frame':    5 obs. of  1 variable:
#>  $ x:Class 'labelled'  atomic [1:5] 1 2 3 4 5
#>   .. ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. .. ..- attr(*, "names")= chr [1:2] "a" "b"

But for tibbles it is OK:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tibble(
  foo = labelled::labelled(1:5, c(a=1, b=2))
) %>%
  str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  1 variable:
#>  $ foo:Class 'labelled'  atomic [1:5] 1 2 3 4 5
#>   .. ..- attr(*, "labels")= Named num [1:2] 1 2
#>   .. .. ..- attr(*, "names")= chr [1:2] "a" "b"

I'm not yet sure why it happens.

Release labelled 2.0.1

Prepare for release:

  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Polish NEWS

Perform release:

  • Bump version (in DESCRIPTION and NEWS)
  • devtools::check_win_devel() (again!)
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Tag release
  • Bump dev version

Template from r-lib/usethis#338

Release labelled 2.1.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_check(num_workers = 4)
  • update revdep\email.yml
  • revdepcheck::revdep_email()
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • Remove file CRAN-RELEASE
  • usethis::use_dev_version()

bind_rows(), list columns and var_labels

I need to bind_rows() two tibbles that contain labelled data and list columns. dplyr drops the labels with a warning about "Vectorizing 'labelled' elements". To work around this, I am trying to extract the variable labels and value labels and re-apply them to the bound tibble. Unfortunately, this does not work for variable labels when list columns are present (see the reprex below).

What do you think about:

  1. var_label() does not actually need to check whether x is atomic, does it? The test could be dropped IMHO.
  2. A more general solution for binding rows of labelled data. The question, of course, is what to do if the value labels for the same variable differ between the two data frames. Perhaps it could be handled similarly to factor levels. This would be a dplyr issue anyway...

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(haven)
library(labelled)

d <- data_frame(
  x = labelled(1:5, c(a=1, b=5)),
  lc = as.list(1:5)
)
var_label(d$x) <- "This is x"

# Can't apply variable label to a list column
var_label(d$lc) <- "This is lc" # Why not actually?
#> Error: `x` should be atomic



# Extract value labels
vl <- val_labels(d)

# Bind rows and re-apply value labels
dd <- bind_rows(d, d, .id="copy") 
#> Warning in bind_rows_(x, .id): Vectorizing 'labelled' elements may not
#> preserve their attributes

#> Warning in bind_rows_(x, .id): Vectorizing 'labelled' elements may not
#> preserve their attributes
val_labels(dd) <- vl
dd$x # OK!
#> <Labelled integer>
#>  [1] 1 2 3 4 5 1 2 3 4 5
#> 
#> Labels:
#>  value label
#>      1     a
#>      5     b


# Can't extract variable labels
# because d$lc is not atomic
var_label(d)
#> Error: `x` should be atomic
# This can be done "manually" along the following lines
varlabs <- lapply(d, attr, "label")
var_label(dd) <- varlabs[1] # skip the list column
lapply(dd, attr, "label")
#> $copy
#> NULL
#> 
#> $x
#> [1] "This is x"
#> 
#> $lc
#> NULL
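The "manual" route above can be sketched in base R, assuming (as the reprex shows) that variable labels live in each column's "label" attribute. The helper names get_labels() and set_labels() are hypothetical and not part of labelled; since they never test whether a column is atomic, they also handle list columns. Row indexing stands in for bind_rows() here, as it likewise loses the attributes.

```r
# Toy data frame with an atomic column and a list column (base R only).
d <- data.frame(x = 1:3)
d$lc <- as.list(1:3)
attr(d$x, "label") <- "This is x"
attr(d$lc, "label") <- "This is lc"

# Hypothetical helper: collect the "label" attribute of every column.
get_labels <- function(df) lapply(df, attr, "label", exact = TRUE)

# Hypothetical helper: re-apply saved labels by column name, skipping NULLs.
set_labels <- function(df, labels) {
  for (nm in intersect(names(df), names(labels))) {
    if (!is.null(labels[[nm]])) {
      attr(df[[nm]], "label") <- labels[[nm]]
    }
  }
  df
}

labs <- get_labels(d)

# Stand-in for bind_rows(): subsetting with `[` drops non-name attributes.
dd <- d[rep(seq_len(nrow(d)), 2), ]
dd <- set_labels(dd, labs)
```

After the last line, both dd$x and dd$lc carry their original variable labels again.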

Release labelled 2.0.2

Prepare for release:

  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS

Perform release:

  • Bump version (in DESCRIPTION and NEWS)
  • devtools::check_win_devel() (again!)
  • Check cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Tag release
  • Remove file CRAN-RELEASE
  • Bump dev version

Template from r-lib/usethis#338

Release labelled 2.3.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_reset()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • Remove file CRAN-RELEASE
  • usethis::use_dev_version()

Release labelled 2.0.0

Prepare for release:

  • haven 2.0.0 released on CRAN (required for the different tests)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • If new failures, update email.yml then revdepcheck::revdep_email_maintainers()

Perform release:

  • Bump version (in DESCRIPTION and NEWS)
  • devtools::check_win_devel() (again!)
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Tag release
  • Bump dev version

Template from r-lib/usethis#338

Trimming "format.*" attributes

Would it be interesting to have a function (or an option) to trim out the "format.*" (e.g. format.stata, etc...) attributes of the variables?

to_factor for data.frame doesn't recognise haven's new haven_labelled class

Seems to be caused by this line checking for the old class name.

if (inherits(x, "labelled"))

Maybe it should be a call to is.labelled() instead.

Here's a minimal example illustrating the unexpected behaviour:

> library(labelled)
[...]
> df <- data.frame(x=labelled(1:3, labels=c(a=1, b=2, c=3)))
> str(df)
'data.frame':	3 obs. of  1 variable:
 $ x: 'haven_labelled' int  1 2 3
  ..- attr(*, "labels")= Named num  1 2 3
  .. ..- attr(*, "names")= chr  "a" "b" "c"
> to_factor(df)  # Unexpected: makes no change to labelled column `x`
  x
1 1
2 2
3 3
> to_factor(df$x)  # Expected: changes levels to factor
[1] a b c
Levels: a b c
> as.data.frame(lapply(df, to_factor))  # Expected behaviour of calling to_factor() on a data.frame
  x
1 a
2 b
3 c

The following snippet shows that changing the line to call is.labelled() instead fixes the behaviour.

> # Patch suspect function with call to `is.labelled()` instead
> utils::assignInNamespace(
+   '.to_factor_col_data_frame',
+   function (x, levels = c("labels", "values", "prefixed"), ordered = FALSE, 
+     nolabel_to_na = FALSE, sort_levels = c("auto", "none", "labels", 
+       "values"), decreasing = FALSE, labelled_only = TRUE, 
+     drop_unused_labels = FALSE, strict = FALSE, ...) 
+   {
+     if (is.labelled(x)) # <-- Change is here
+       x <- to_factor(x, levels = levels, ordered = ordered, 
+         nolabel_to_na = nolabel_to_na, sort_levels = sort_levels, 
+         decreasing = decreasing, drop_unused_labels = drop_unused_labels, 
+         strict = strict, ...)
+     else if (!labelled_only) 
+       x <- to_factor(x)
+     x
+   },
+   'labelled'
+ )
> to_factor(df)  # Now follows expected behaviour
  x
1 a
2 b
3 c

Tested with package labelled version 2.0.1
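The class-name mismatch behind this issue can be illustrated without haven or labelled installed. The sketch below builds an object that carries only the new class name, as in the str() output above, and compares a check on the old name alone with a check that accepts either name. It assumes nothing about how is.labelled() is implemented internally.

```r
# Build an object carrying only the new class name (no haven needed).
x <- structure(1:3, labels = c(a = 1, b = 2, c = 3), class = "haven_labelled")

# Checking the old class name alone misses the object.
old_check <- inherits(x, "labelled")

# Accepting either class name finds it.
new_check <- inherits(x, c("labelled", "haven_labelled"))
```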

Other forms of attributes

The labelled and sjlabelled packages are especially useful when automating the production of tables, graphs, and other results from real data. However, I have a few ideas for in-house (possibly worthy of sharing with others) functions that are analogous to the labels, but instead specify whether the columns in a data set are dependent, mediator, or independent variables; and whether they are ordinal or nominal (if categorical/labelled). With some easy-to-use functions that keep these attributes when performing e.g. tidyverse-operations, it would be very easy to produce large amounts of graphs where some functions down the pipeline "understand" what should go on the x-axis, y-axis, caption, etc.
Sure, it is easy to add regular attributes with attr(df$my_var, "type_of_variable") <- "independent", but the ecosystem of functions in these packages seems convenient for the same purpose. I am not sure, though, whether there should be a fixed attribute called "vartype" or just some generic functions that let users define their own attributes.
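As a rough illustration of the "generic functions" option, here is a base R sketch. The names set_var_attr() and get_var_attr() are hypothetical and exist in neither package; they only wrap attr() so that an arbitrary attribute can be set and read for several columns at once.

```r
# Hypothetical helper: set attribute `attr_name` on the columns named in `...`.
set_var_attr <- function(df, attr_name, ...) {
  values <- list(...)
  for (col in names(values)) {
    if (!col %in% names(df)) stop("unknown column: ", col)
    attr(df[[col]], attr_name) <- values[[col]]
  }
  df
}

# Hypothetical helper: read attribute `attr_name` from every column.
get_var_attr <- function(df, attr_name) {
  lapply(df, attr, attr_name, exact = TRUE)
}

df <- data.frame(age = c(30, 41), outcome = c(1, 0))
df <- set_var_attr(df, "vartype", age = "independent", outcome = "dependent")
vartypes <- get_var_attr(df, "vartype")
```

Note that, like any plain attribute, these values can be dropped by operations that subset or rebuild the columns, so such helpers would not by themselves make the attributes survive a pipeline.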

Release labelled version 2.2.2

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_reset()
  • revdepcheck::revdep_check(num_workers = 4)
  • update revdep\email.yml
  • revdepcheck::revdep_email()
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • Remove file CRAN-RELEASE
  • usethis::use_dev_version()

plans for extending support of formats

Hi,

I recently started to use labelled more frequently and find it a bit difficult to switch between the labelled-vector and factor representations. Internally, many functions require factors, but factors are not as flexible as labelled vectors. So it would be nice to define a new format class for converting between the two, i.e., to have something like

x <- labelled(c(1, 2, 2), labels = c(x = 1, y = 2))
fmt <- format(x)               # proposed: capture the value/label mapping
x_fct <- as_factor(x)
xx <- as_labelled(x_fct, fmt)  # proposed: restore the original values

where xx == x holds. This would allow keeping the data as labelled vectors with the values specified in the database, and switching back and forth between factors and labelled vectors as needed. Are there any plans in this direction, or would you accept a pull request?

Best,

Kevin
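The round trip requested above can be sketched in base R, with a plain named vector playing the role of the proposed format object. The function names to_fct() and from_fct() are hypothetical; the sketch assumes every value has a label and that labels are unique.

```r
values <- c(1, 2, 2)
labels <- c(x = 1, y = 2)  # label names map to stored values

# Hypothetical: values -> factor, using the mapping as levels and labels.
to_fct <- function(values, labels) {
  factor(values, levels = unname(labels), labels = names(labels))
}

# Hypothetical: factor -> values, by looking each level up in the mapping.
from_fct <- function(f, labels) {
  unname(labels[as.character(f)])
}

f <- to_fct(values, labels)
back <- from_fct(f, labels)
```

With unlabelled values or duplicated labels the mapping is no longer invertible, which is probably the main design question for such a feature.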

Non unique value labels?

Below d is a tibble read from an SPSS file with haven::read_spss(). I am getting:

print(d)
Error in `levels<-`(`*tmp*`, value = as.character(levels)) :
  factor level [9] is duplicated
> traceback()
19: factor(x, labs, ordered = ordered)
18: as_factor.haven_labelled(x, "labels")
17: as_factor(x, "labels")
16: lbl_pillar_info(x)
15: pillar_shaft.haven_labelled(X[[i]], ...)
14: FUN(X[[i]], ...)
13: lapply(.x, .f, ...)
12: map(x[pillar_shown], pillar_shaft)
11: colonnade_get_width(x, width, rowid_width)
10: pillar::squeeze(x$mcf, width = width)
9: format.trunc_mat(mat)
8: format(mat)
7: format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
6: format(x, ..., n = n, width = width, n_extra = n_extra)
5: paste0(..., "\n")
4: cat(paste0(..., "\n"), sep = "")
3: cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
2: print.tbl(x)
1: (function (x, ...)
   UseMethod("print"))(x)

I suppose the print method converts the labelled variable to a factor for printing, assuming that value labels are unique. There is no such restriction in, say, SPSS, and people sometimes take advantage of that.
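One possible workaround, sketched in base R under the assumption that the labels are available as a named vector (names are labels, elements are values): append the value to every label that occurs more than once, so that the labels become unique before any conversion to a factor. The label set below is made up for illustration.

```r
# Made-up value labels with one label used for two different values.
labels <- c(Yes = 1, No = 2, Refused = 8, Refused = 9)

nms <- names(labels)
is_dup <- nms %in% nms[duplicated(nms)]

# Disambiguate only the duplicated labels by appending their value.
unique_nms <- ifelse(is_dup, paste0(nms, " [", labels, "]"), nms)
names(labels) <- unique_nms

# The conversion to a factor now has one distinct level per value.
f <- factor(c(1, 9, 8), levels = unname(labels), labels = names(labels))
```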

Value Label in Machine Readable Format (lookfor details = TRUE)

Currently levels, value_labels, na_values, and na_range are converted to a string e.g.: https://github.com/larmarange/labelled/blob/master/R/lookfor.R#L96

The current functionality is useful for viewing but less useful when the labels are needed for further processing (e.g. to display labels in a chart or graphic).

Could we add the option to use a machine readable format like json, or to preserve the original vectors by storing them in a column of type <list>?

jsonlite::toJSON could be imported lazily by listing jsonlite under Suggests in the DESCRIPTION file; alternatively, no additional dependency is needed if a flag is added to preserve the original vectors.
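The list-column variant can be sketched in base R as follows. This is not how lookfor() is implemented; it only shows that a dictionary can keep the original named vectors, assuming value labels are stored in each column's "labels" attribute. The data and the name dict are made up for illustration.

```r
# Toy data: one column with value labels, one without.
df <- data.frame(sex = c(1, 2, 1), age = c(20, 35, 50))
attr(df$sex, "labels") <- c(male = 1, female = 2)

# Dictionary with one row per variable and a list column of label vectors.
dict <- data.frame(variable = names(df), stringsAsFactors = FALSE)
dict$value_labels <- unname(lapply(df, attr, "labels", exact = TRUE))

# The original named vector is available for further processing.
sex_labels <- dict$value_labels[[1]]
```

Variables without value labels simply get a NULL entry in the list column.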

variable labels are not preserved with dplyr functions

If I set a variable label with set_variable_labels and later apply dplyr::filter, the variable label is removed. Here's a small example:

library(dplyr)
library(labelled)

df <- tibble(id = 1:2, can = factor(c('yes', 'no'))) %>% 
  set_variable_labels(can = 'Cannabis use')

#variable label is there
df$can

#variable label is not there
filter(df, id == 1)$can

I'm not sure if this is a bug in dplyr or in the labelled package. It seems to have been introduced with dplyr version 0.8.

Different classes for lbl+num and lbl+chr?

@elinw suggested here ropensci/skimr#296 that it might be beneficial for different labelled classes to exist, for the different underlying types. This seems to make a lot of sense to me, because this would make it easier (possible?) to write appropriate summary, skim, print methods etc.
Do you agree or is it actually possible now too?

Release labelled 2.2.1

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_reset()
  • revdepcheck::revdep_check(num_workers = 4)
  • update revdep\email.yml
  • revdepcheck::revdep_email()
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • Remove file CRAN-RELEASE
  • usethis::use_dev_version()

Inverted names of contributors

Hey Joseph,

I think the first and family names are inverted here:

labelled/DESCRIPTION

Lines 10 to 11 in 5ae5354

person("Bojanowski", "Michal", role = "ctb"),
person("Briatte", "François", role = "ctb")

Don't ask me why I'm noticing that now and here!

Hope you're good :)

Include lookfor in labelled

Hey @juba and @larmarange,

I have been working with labelled survey data lately. Every time I do so, I find Stata far superior to R when it comes to doing some of the most basic things that we need to do when exploring that kind of data…

Problem

Take variable labels, which are essential to get a grip of any new survey dataset. How is the user supposed to list them all? Variable labels being in the attributes, users might want to do this:

for each column:
  list variable name and label

Unless I am mistaken, this is not easily doable. The user might try this, which won't work:

attr(labelled_data, "label")
apply(labelled_data, 2, attr, "label")

What the user actually needs is:

vapply(labelled_data, attr, character(1), "label", exact = TRUE) # or...
sapply(labelled_data, attr, "label") # ... but non-strict and risky: might return partial matches

So, to get all variable labels in an easily searchable format like a data frame, the user needs, at the very least (and these examples do not even preclude partial matching):

data.frame(vapply(labelled_data, attr, character(1), "label"))
tibble::enframe(vapply(labelled_data, attr, character(1), "label"))

In all cases above, the user needs to be fairly familiar with R to get the labels. Furthermore, a single missing variable label will kill the function with a cryptic message:

Error in vapply(labelled_data, attr, character(1), "label") : 
  values must be length 1,
 but FUN(X[[4]]) result is length 0

Here, [[4]] is the column (variable) that has no variable label (NULL).

Solution

I wrote a short function to list and search variable labels.

It is named var_labels in the spirit of the labelled package by Joseph, from which I took some code to write the show_values argument, and it is similar to the lookfor function that I wrote for questionr many years ago (thanks for improving it, Julien!):

#' @param data a labelled data frame
#' @param show_values add a column showing labelled values
#' @param ... character string(s) to match in the variable names or labels
#' @param ignore.case whether to ignore case when matching variable names or labels
#' @return a tibble
var_labels <- function(data, show_values = FALSE, ..., ignore.case = TRUE) {
  
  require(magrittr) # can easily be removed if need be
  require(tibble)   # preferable in my view to returning a data.frame
  
  # variable labels -> tibble
  vars <- names(data)
  lbls <- tibble::tibble(variable = vars) %>% 
    tibble::add_column(
      label = vapply(vars, function(x) {
        # similar to labelled:::var_label.default
        x <- attr(data[[ x ]], "label", exact = TRUE)
        # handle missing variable labels
        ifelse(is.null(x), NA_character_, x)
      }, character(1))
    )
  
  # add labelled values
  if (show_values) {
    # similar to labelled:::val_labels.haven_labelled
    lbls <- tibble::add_column(
      lbls,
      values = vapply(vars, function(x) {
        x <- attr(data[[ x ]], "labels", exact = TRUE)
        # handle missing value labels
        if (is.null(x)) {
          NA_character_
        } else {
          x <- paste0("[", x, "] ", names(x))
          paste(x, collapse = " ")
        }
      }, character(1))
    )
  }
  
  # subset to matching rows (a more complex option would be to use `tidyselect`)
  find <- c(...)
  if (length(find)) {
    find <- paste(find, collapse = "|")
    find <- grepl(find, lbls$variable, ignore.case = ignore.case) |
      grepl(find, lbls$label, ignore.case = ignore.case)
    lbls[ find, ]
  } else {
    lbls
  }
  
}

(The vapply part cannot be written more efficiently due to the possibility of missing values. Using purrr::attr_getter does not solve the issue, as attr_getter simply wraps around attr.)

Example, using some labelled data included in questionr:

library(questionr)

data(fertility)
women$unlabelled_test_variable <- 1L

var_labels(women)
var_labels(women, show_values = TRUE)
var_labels(women, "weight", "child") # Stata equivalent: lookfor weight child
var_labels(women, "hiv", show_values = TRUE)

Now, I do not know where to submit that function: are any of you interested in including it in questionr or labelled?

I also submitted a simpler function to haven, and opened another issue to discuss its search support.

Add a set_variable_labels_all function to create labels from column names using a specified function

It would be very useful to have a function that automatically sets all data.frame labels as transformed versions of the column names. Similar to the janitor package's clean_names() function that creates and sets snakecase column names, I would like to be able to set all column labels to a readable version of the column names from within a pipe. (Usually I am transforming from snakecase to title case and replacing "_" with spaces).

I could see two approaches to this:

  1. The more straightforward but less flexible approach would be to offer the user a limited set of pre-defined transformation options (e.g. title case, all caps, replace "_" with " ").

  2. Allow the user to supply any function for the transformation. I'm not sure of the best way to do this, but perhaps it could employ some of the tools underlying rename_all() in dplyr: (https://github.com/tidyverse/dplyr/blob/master/R/funs.R)
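The second approach can be sketched in base R. The names label_from_names() and snake_to_title() are hypothetical, and the sketch writes the "label" attribute directly rather than going through labelled. Because the data frame is the first argument and is returned, the helper can be used inside a pipe.

```r
# Hypothetical helper: derive each column's "label" attribute from its name.
label_from_names <- function(df, .fun) {
  for (nm in names(df)) {
    attr(df[[nm]], "label") <- .fun(nm)
  }
  df
}

# Example transformation: snake_case to Title Case.
snake_to_title <- function(nm) {
  spaced <- gsub("_", " ", nm)
  # Uppercase the first character of every word.
  gsub("(^|\\s)(\\w)", "\\1\\U\\2", spaced, perl = TRUE)
}

df <- data.frame(first_name = c(1, 2), birth_year = c(1990, 1985))
df <- label_from_names(df, snake_to_title)
```

Any function that maps a single column name to a single string could be passed as .fun, for example toupper.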

Support for the `Date` class

I'd like to include tagged missing values (NAs) in a Date variable. But when I do the following

library(haven)  # provides labelled() and tagged_na()
x <- rep(c(1,2),5)
x[[4]] <- tagged_na('a')
y <- as.Date(x, origin = '1992-01-01')
class(y)
#[1] "Date"
labelled(y, c("NA"=tagged_na('a')))
#Error: 'y' must be a numeric or a character vector

Is this behavior by design, or is this a valid feature request? :-)

Release labelled 2.2.0

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_reset()
  • revdepcheck::revdep_check(num_workers = 4)
  • update revdep\email.yml
  • revdepcheck::revdep_email()
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • pkgdown::build_site()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • Remove file CRAN-RELEASE
  • usethis::use_dev_version()

Sort labels

Function sort_val_labels to sort labels according to value or according to label.

set_variable_labels list problem

Hi!

If I provide a named list of values to set_variable_labels(), it does not work because the function wraps the list in another list. This makes it impossible to follow the efficient workflow of...

  1. get variable labels via var_label(data)
  2. manipulate data, lose the variable labels in the process
  3. reapply variable labels via set_variable_labels

The problem is the first line in set_variable_labels: values <- list(...)
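The effect of that first line can be reproduced with a toy function. In the base R sketch below, collect() is a hypothetical stand-in that does nothing but `list(...)`; passing a list directly nests it one level deeper, while do.call() splices its elements in as named arguments.

```r
# Toy stand-in for a function that starts with `values <- list(...)`.
collect <- function(...) list(...)

labs <- list(id = "Identifier", can = "Cannabis use")

# Passing the list directly: the whole list becomes one unnamed element.
wrapped <- collect(labs)

# Splicing with do.call(): each list element becomes a named argument.
spliced <- do.call(collect, labs)
```

Applied to the real function, the same idea would presumably read do.call(set_variable_labels, c(list(data), labs)); that call is an untested assumption here.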

Check matching type

This is a little annoying, especially because most haven functions do not complain. Maybe you could check the type in val_labels and cast it correctly if possible or throw an error if not?

x = 1L:5L
labelled::val_labels(x) <-  c("low" = 1)
haven::na_tag(x)

Error: x must be a double vector
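One reading of the request is that the labels should be cast to the type of x when that is lossless, and rejected otherwise. A base R sketch of such a check follows; cast_labels() is a hypothetical name and not part of labelled or haven.

```r
# Hypothetical helper: make the type of `labels` match the type of `x`.
cast_labels <- function(x, labels) {
  if (typeof(x) == typeof(labels)) {
    return(labels)
  }
  if (is.integer(x) && is.double(labels)) {
    if (any(labels != trunc(labels), na.rm = TRUE)) {
      stop("labels contain non-integer values but `x` is an integer vector")
    }
    storage.mode(labels) <- "integer"  # changes the type, keeps the names
    return(labels)
  }
  if (is.double(x) && is.integer(labels)) {
    storage.mode(labels) <- "double"
    return(labels)
  }
  stop("labels of type ", typeof(labels), " cannot label `x` of type ", typeof(x))
}

ok <- cast_labels(1L:5L, c(low = 1))
bad <- tryCatch(cast_labels(1L:5L, c(low = 1.5)), error = function(e) "error")
```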

Working with old data labels saved with labelled class

I was using remove_val_label to remove labels from some data saved some months ago under the labelled class, but since the val_label.labelled method was deleted, it no longer works. I do not know whether it would be a good option to include this method again, or whether there is another way to remove the labels.
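Two base R workarounds can be sketched, assuming the old objects differ from current ones only in their class name and that value labels sit in the "labels" attribute. Both helper names are hypothetical, and current haven objects may carry further class names that this sketch does not add.

```r
# An object as it might have been saved under the old class name.
old <- structure(c(1, 2, 1), labels = c(yes = 1, no = 2), class = "labelled")

# Hypothetical helper: swap the old class name for the new one.
rename_labelled_class <- function(x) {
  cls <- class(x)
  cls[cls == "labelled"] <- "haven_labelled"
  class(x) <- unique(cls)
  x
}

# Hypothetical helper: drop the value labels and the class entirely.
strip_labels <- function(x) {
  attr(x, "labels") <- NULL
  unclass(x)
}

new <- rename_labelled_class(old)
plain <- strip_labels(old)
```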
