rubenarslan / codebook
Cook rmarkdown codebooks from metadata on R data frames
Home Page: https://rubenarslan.github.io/codebook/
License: Other
Hi,
first of all: I LOVE the codebook package you created, thank you for this.
Secondly, a minor improvement request pertaining to conditional headings: it would help to make the printing of the headings conditional. With
detailed_variables = FALSE
the knitted HTML codebook currently prints the heading "Variables" with no content beneath it. Alternatively, the possibility to add the content back would work.
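A minimal sketch of the call in question (assuming the `detailed_variables` argument behaves as documented; the toy data frame is mine):

```r
library(codebook)

# With detailed_variables = FALSE the "Variables" heading is still
# printed even though no per-variable sections follow it.
df <- data.frame(x = 1:3)
codebook(df,
         detailed_variables = FALSE,  # suppresses per-variable sections
         metadata_table = FALSE)
```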
References might be included, with DOIs/links? That would be especially useful for metadata. On the other hand, most items don't have a unique source link and are pretty much fully described by their item text.
Love this idea - sorely needed to have good metadata, and I will play with this and try it out. I tried to create a codebook for a large, rather messy dataset, namely the GSS data (I checked with Release 2 and 3), which can be downloaded here: GSS_data
This is a rather large data file. I ran the code below and received an error (pasted below the code). I don't find it unreasonable that codebook does not work on this data file by default, but I wanted to let you know in case the error is something worthwhile to fix.
library(haven)
library(codebook)
results <- haven::read_dta("GSS7216_R2.DTA", encoding = "windows-1252")
codebook(results)
The error I get when generating a html markdown file is:
Quitting from lines 58-71 (codebook.Rmd)
Error in stringr::str_match(names(stats::na.omit(choices)), "\[?([0-9-]+)(\]|:)")[, :
subscript out of bounds
Calls: ... withCallingHandlers -> withVisible -> eval -> eval -> plot_labelled
Thank you for the package. I usually use codebook_table(); this error was generated using codebook_browser(), but I receive the above skimr error either way. Scrolling through the issues, there was mention of this, but I wasn't certain of the conclusion, if any. Thank you again; helpful package.
---
title: "Codebook: Donations to JustGiving Fundraiser pages "
author: "David Reinstein"
output:
html_document:
toc: true
code_folding: 'hide'
self_contained:
pdf_document:
toc: yes
toc_depth: 4
latex_engine: xelatex
---
library(tidyverse)
library(codebook)
df <- tibble(x = 1:2, y = c("hello, i", "john"))
metadata(df)$name <- "donation data"
codebook(df, survey_repetition = "single", metadata_table = FALSE)
Knitting the above code (an Rmd file) in RStudio throws this error:
Quitting from lines 25-76 (codebook_reprex.Rmd)
Quitting from lines 41-46 (codebook_reprex.Rmd)
Error in value[[3L]](cond) :
Could not summarise item y. Error in as.environment(where): using 'as.environment(NULL)' is defunct
Calls: <Anonymous> ... eval -> value -> value.Future -> resignalConditions
Execution halted
However, removing the comma from the first element of the y character vector in the tibble ...
df <- tibble(x = 1:2, y = c("hello i", "john"))
... does not throw this error.
So that it might be easier to find the URL without scrolling down 😉
In codebook 0.5.8, using the dev version of mice, I get:
library(codebook)
data("bfi")
codebook_missingness(bfi)
Error in `rownames<-`(`*tmp*`, value = table(pat)) :
attempt to set 'rownames' on an object with no dimensions
Could you check?
e.g. in cognit.dta
better plotting of csv values
Maybe this exists already. Find out:
@format A data frame with NNNN rows and NN variables:
\describe{
\item{subject}{Anonymized Mechanical Turk Worker ID}
\item{trial}{Trial number, from 1..NNN}
}
Hi,
knitting the Rmd file below (copied from the tutorial) to HTML yields an empty HTML file without errors. The empty HTML occurs only with the codebook argument
metadata_table = TRUE
and not otherwise.
Specs:
Windows 10
codebook_0.8.2 as well as the github version
R version 3.6.3 (2020-02-29)
Sublime Text (not RStudio)
The R Markdown file that I render:
---
title: "Test"
author: "JBJ"
date: "yyyy-mm-dd"
output:
html_document:
toc: true
toc_depth: 4
toc_float: true
code_folding: 'hide'
---
```{r setup, include=FALSE}
library(codebook)
knitr::opts_chunk$set(warning = FALSE, message = TRUE, error = FALSE, echo = FALSE)
```
## Demonstrating is.prime
```{r test-this, echo = FALSE}
old_base_dir <- knitr::opts_knit$get("base.dir")
knitr::opts_knit$set(base.dir = tempdir())
on.exit(knitr::opts_knit$set(base.dir = old_base_dir))
data("bfi")
bfi <- bfi[, c("BFIK_open_1", "BFIK_open_1")]
```
```{r codebook}
codebook(bfi,
survey_repetition = "single",
metadata_table = TRUE # <---- causes the empty HTML
)
```
This is the console log. Nothing seems wrong here, as far as I can see:
output file: test.knit.md
"DIR:/Users/your_user_name/AppData/Local/Pandoc/pandoc" +RTS -K512m -RTS test.utf8.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash+smart --output test.html --email-obfuscation none --self-contained --standalone --section-divs --table-of-contents --toc-depth 4 --variable toc_float=1 --variable toc_selectors=h1,h2,h3,h4 --variable toc_collapsed=1 --variable toc_smooth_scroll=1 --variable toc_print=1 --template
"DIR:\Users\your_user_name\Documents\R\win-library\3.6\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable "theme:bootstrap" --include-in-header
"DIR:\Users\your_user_name\AppData\Local\Temp\RtmpIjYheO\rmarkdown-str64e045c43f64.html" --mathjax --variable "mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --lua-filter
"DIR:/Users/your_user_name/Documents/R/win-library/3.6/rmarkdown/rmd/lua/pagebreak.lua" --lua-filter
"DIR:/Users/your_user_name/Documents/R/win-library/3.6/rmarkdown/rmd/lua/latex-div.lua" --variable code_folding=hide --variable code_menu=1
Output created: test.html
>
>
[Finished in 4.3s]
Maybe basically skimr merged with a data frame taken from the attributes? It's not clear how to deal with nested attributes in this case, but whatever?
My current recommendation is to set
opts_chunk$set(error = TRUE)
in the knitr chunk preceding the codebook call, to find out which variables the error happens with and make it easier to generate a reproducible example.
Unfortunately, if people don't set this, the error message they get will be highly unspecific and hard to trace except by divide-and-conquer. Plan: try to find out how to put the current variable name into the trace.
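In an Rmd, that recommendation amounts to something like the following (a minimal sketch; bfi is the example dataset shipped with codebook):

```r
# In a chunk that runs *before* the codebook chunk:
knitr::opts_chunk$set(error = TRUE)

# With error = TRUE, knitting continues past per-variable failures,
# so the rendered codebook shows near which variable the error occurred.
library(codebook)
data("bfi")
codebook(bfi)
```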
I was trying to use detect_missing to clean the missing data in a dataset of which a few columns are integers. detect_missing cannot correctly label the missing values.
Consider the following dataset rd1:
rd1 <- tibble(
x1 = haven::labelled(x = c(32L, 996L, 40L),
labels = c("Refused to answer" = 996), label = "x1 variable (integer)"),
x2 = haven::labelled(x = c(32, 996, 40),
labels = c("Refused to answer" = 996), label = "x1 variable (double)")
)
# Here is the output of `rd1`:
# A tibble: 3 x 2
# x1 x2
# <int+lbl> <dbl+lbl>
# 32 32
# 996 [Refused to answer] 996 [Refused to answer]
# 40 40
The only difference between x1 and x2 is that x1 has only integers. Applying detect_missing will only affect x2; 996 in x1 remains unchanged.
detect_missing(rd1, missing = c(996))
# # A tibble: 3 x 2
# x1 x2
# <int+lbl> <dbl+lbl>
# 32 32
# 996 [Refused to answer] NA(a) [[996] Refused to answer]
# 40 40
# Warning message:
# In detect_missing(rd1, missing = c(996)) :
# Cannot label missings for integers in variable x1
I looked into the code of detect_missing and found that the problem is that the function haven::tagged_na does not work with integer vectors, which is why you include the condition is.double in a few if statements.
If these lines (below) are modified by removing the check for is.double, detect_missing will work for integer columns by converting them to columns of double. I understand that converting integer to double could cause problems later, but it might not be a bad idea to add an option letting users allow the conversion, so that missing values can be labelled correctly for integer columns.
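As a workaround until such an option exists, the integer columns can be converted to double before calling detect_missing, so that haven::tagged_na() can tag them (a sketch; the int_to_dbl helper is mine, not part of codebook):

```r
library(codebook)
library(tibble)

rd1 <- tibble(
  x1 = haven::labelled(c(32L, 996L, 40L),
                       labels = c("Refused to answer" = 996),
                       label = "x1 variable (integer)")
)

# Helper (mine): switch an integer column's storage to double while
# restoring its labelled attributes (as.double alone would drop them)
int_to_dbl <- function(x) {
  if (!is.integer(x)) return(x)
  attrs <- attributes(x)
  x <- as.double(x)
  attributes(x) <- attrs
  x
}

rd1[] <- lapply(rd1, int_to_dbl)
detect_missing(rd1, missing = c(996))  # 996 in x1 can now be tagged
```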
The detect_missing2 function below is a simple modification of the current detect_missing, adding an extra option force_integer = TRUE or FALSE. (The changes are highlighted.) When force_integer = TRUE, the integer columns will be converted to double and missing values will be labelled.
detect_missing2 <- function (data, only_labelled = TRUE, negative_values_are_missing = TRUE,
ninety_nine_problems = TRUE, learn_from_labels = TRUE, missing = c(),
non_missing = c(), vars = names(data), use_labelled_spss = FALSE, force_integer = FALSE)
{
for (i in seq_along(vars)) {
var <- vars[i]
if (is.numeric(data[[var]]) && any(!is.na(data[[var]]))) {
potential_missing_values <- c()
if (negative_values_are_missing) {
potential_missing_values <- unique(data[[var]][data[[var]] <
0])
}
labels <- attributes(data[[var]])$labels
if (learn_from_labels && length(labels)) {
numeric_representations <- as.numeric(stringr::str_match(names(labels),
"\\[([0-9-]+)\\]")[, 2])
potentially_untagged <- numeric_representations[is.na(labels)]
potential_tags <- labels[is.na(labels)]
if (is.double(data[[var]]) && !all(is.na(haven::na_tag(data[[var]]))) &&
length(intersect(potentially_untagged, data[[var]]))) {
# For integer vectors, their missing values cannot be tagged,
# so we don't need to modify the above if condition for
# integer vectors.
warning("Missing values were already tagged in ",
var, ". Although", "there were further potential missing values as indicated by",
"missing labels, this was not changed.")
} else {
for (e in seq_along(potentially_untagged)) {
pot <- potentially_untagged[e]
data[[var]][data[[var]] == pot] <- potential_tags[e]
}
}
}
if (ninety_nine_problems) {
if (any(!is.na(data[[var]])) && (stats::median(data[[var]],
na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) *
5) < 99) {
potential_missing_values <- c(potential_missing_values,
99)
}
if (any(!is.na(data[[var]])) && (stats::median(data[[var]],
na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) *
5) < 999) {
potential_missing_values <- c(potential_missing_values,
999)
}
}
potential_missing_values <- union(setdiff(potential_missing_values,
non_missing), missing)
if ((!only_labelled || haven::is.labelled(data[[var]])) &&
length(potential_missing_values) > 0) {
if (only_labelled) {
potential_missing_values <- potential_missing_values[potential_missing_values %in%
labels]
potential_missing_values <- union(potential_missing_values,
setdiff(labels[is.na(labels)], data[[var]]))
}
potential_missing_values <- sort(potential_missing_values)
with_tagged_na <- data[[var]]
if (is.double(data[[var]])) {
free_na_tags <- setdiff(letters, haven::na_tag(with_tagged_na))
} else {
free_na_tags <- letters
}
for (i in seq_along(potential_missing_values)) {
miss <- potential_missing_values[i]
if (!use_labelled_spss && !all(potential_missing_values %in%
free_na_tags)) {
new_miss <- free_na_tags[i]
} else {
new_miss <- potential_missing_values[i]
}
that_label <- which(labels == miss)
################################################################################
# I replaced `is.double(data[[var]])` with `(force_integer |
# is.double(data[[var]]))` below
if (length(which(with_tagged_na == miss)) &&
(force_integer | is.double(data[[var]])) && !use_labelled_spss) {
with_tagged_na[which(with_tagged_na == miss)] <- haven::tagged_na(new_miss)
} else if (!force_integer & is.integer(data[[var]])) {
warning("Cannot label missings for integers in variable ",
var, "; set force_integer = TRUE if you want to label missings for integers.")
}
if ((force_integer | is.double(data[[var]])) &&
length(that_label) && !use_labelled_spss) {
labels[that_label] <- haven::tagged_na(new_miss)
names(labels)[that_label] <- paste0("[",
potential_missing_values[i], "] ", names(labels)[that_label])
}
################################################################################
}
if (use_labelled_spss) {
labels <- attributes(data[[var]])$labels
if (is.null(labels)) {
labels <- potential_missing_values
names(labels) <- "autodetected unlabelled missing"
}
data[[var]] <- haven::labelled_spss(data[[var]],
label = attr(data[[var]], "label", TRUE),
labels = labels, na_values = potential_missing_values,
na_range = attr(data[[var]], "na_range",
TRUE))
} else if (haven::is.labelled(data[[var]])) {
data[[var]] <- haven::labelled(with_tagged_na,
label = attr(data[[var]], "label", TRUE),
labels = labels)
} else {
data[[var]] <- with_tagged_na
}
}
}
}
data
}
to decrease number of dependencies
Hi, it's a great package, but I have the following problem installing it. I have used both Windows and Mac computers, get the message below, and it doesn't install.
MacOS High Sierra and Windows 10, during installation via remotes:
Error in utils::download.file(url, path, method = method, quiet = quiet, :
cannot open URL 'https://api.github.com/repos/rubenarslan/codebook/tarball/master'
I really need to install this package.
My session info:
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] tr_TR.UTF-8/tr_TR.UTF-8/tr_TR.UTF-8/C/tr_TR.UTF-8/tr_TR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] remotes_2.1.0 latex2exp_0.4.0 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.3
[7] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.3.0 tidyverse_1.3.0 gt_0.2.0.5
loaded via a namespace (and not attached):
[1] tinytex_0.18 kispaddins_0.1.0 tidyselect_1.0.0 xfun_0.12 haven_2.2.0
[6] lattice_0.20-38 colorspace_1.4-1 generics_0.0.2 vctrs_0.2.4 htmltools_0.4.0
[11] yaml_2.2.1 rlang_0.4.5 pillar_1.4.3 withr_2.1.2 glue_1.3.2
[16] DBI_1.1.0 dbplyr_1.4.2 modelr_0.1.5 readxl_1.3.1 lifecycle_0.2.0
[21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 evaluate_0.14
[26] knitr_1.28 curl_4.3 fansi_0.4.1 broom_0.5.3 Rcpp_1.0.3
[31] checkmate_2.0.0 scales_1.1.0 backports_1.1.5 jsonlite_1.6 fs_1.4.1
[36] hms_0.5.3 packrat_0.5.0 digest_0.6.25 stringi_1.4.6 bookdown_0.17.2
[41] grid_3.5.3 cli_2.0.2 tools_3.5.3 magrittr_1.5 crayon_1.3.4
[46] pkgconfig_2.0.3 xml2_1.2.2 reprex_0.3.0 lubridate_1.7.4 assertthat_0.2.1
[51] rmarkdown_2.0 httr_1.4.1 rstudioapi_0.11 R6_2.4.1 nlme_3.1-143
[56] compiler_3.5.3
The JSON-LD generated for the vignette lists several terms as being in the context of pending.schema.org that do not actually appear to be part of pending. Perhaps you meant to define a custom context outside of schema.org to extend these terms? e.g.
"type": "http://pending.schema.org/propertyValue",
"http://pending.schema.org/data_summary": {
"type": "http://formr.org/codebook/SummaryStatistics",
"http://pending.schema.org/complete": "28",
"http://pending.schema.org/missing": "0",
"http://pending.schema.org/n": "28",
"http://pending.schema.org/n_unique": "4",
"http://pending.schema.org/ordered": "FALSE",
"http://pending.schema.org/top_counts": "4: 15, 5: 10, 3: 2, 2: 1"
},
(Also note some logicals and integers being typed as characters).
See on the playground
When I knit a codebook with the codebook() function I get the following error message (shown in the HTML output):
This seems to be an issue with the new version of rlang (I have 0.4.0 installed).
Can you automatically add alt-text to the distribution plots to meet WCAG accessibility standards? An alt-text of "Distribution of var" should be sufficient.
FYI, The other violations per WAVE are missing form labels in the codebook table, document language missing, and an empty table header in the missingness table.
library(codebook)
packageVersion("haven")
packageVersion("codebook")
data("bfi")
bfi <- bfi[,c("BFIK_open", paste0("BFIK_open_", 1:4))]
codebook_component_scale(bfi[,1], "BFIK_open", bfi[,-1],
reliabilities = list(BFIK_open = psych::alpha(bfi[,-1])))
#> Error: C stack usage 7969280 is too close to the limit
I'm not sure exactly what is causing this, but it's probably related to the changes to the labelled class.
Could you please take a look? I'm planning on submitting haven 2.0 to CRAN on November 7
Small issue: it would be nice to change the R documentation for survey_overview to make it explicit that all the specified variables need to be present in the data for a summary to be printed. (For example, I usually have created and ended in my data, but session only in multi-session surveys, and I erroneously expected the summary to be printed.)
Line 23 in 0cf7c55
#' @param survey_overview whether to print an overview of survey entries and durations (only printed if the data contains five variables named session, created, modified, ended, expired)
Alternatively it would be nice to condition the summary on the columns present?
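The suggested conditioning could be sketched like this (variable names taken from the documentation line quoted above; the print_survey_overview wrapper is hypothetical):

```r
overview_vars <- c("session", "created", "modified", "ended", "expired")

# Only print the overview of survey entries and durations when all
# five expected formr columns are actually present in the data
if (all(overview_vars %in% names(results))) {
  print_survey_overview(results)  # hypothetical helper
}
```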
Hey, great package! Thank you for all your work!
I would like to select the columns of the table generated by the function codebook_table, e.g. in this order:
order <- c("name", "label", "type", "type_options", "data_type", "ordered",
"value_labels", "optional", "showif",
"scale_item_names",
"value", "item_order", "block_order", "class",
"missing", "complete", "n", "empty", "n_unique",
"top_counts", "count", "median", "min", "max",
"mean", "sd", "p0", "p25", "p50", "p75", "p100", "hist")
Is there already a way to do it?
Thanks again!
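Until codebook_table gains such an argument, one workaround is to post-process its return value, which is a data frame (a sketch; dplyr::any_of() silently skips columns that are absent for a given dataset):

```r
library(codebook)
library(dplyr)

data("bfi")
ct <- codebook_table(bfi)

# Select and reorder only the columns of interest
ct_ordered <- select(ct, any_of(c("name", "label", "data_type",
                                  "missing", "complete", "n_unique",
                                  "mean", "sd", "hist")))
```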
We discussed this issue on Twitter. That thread is here.
I am trying to compile a codebook for a relatively large data file (511 variables, 23,000+ observations). There is a lot of missing data due to the design of the study.
I am trying to follow the instructions from the github page. I receive the following error.
dput() for that variable gives me a screen full of NA and the following attributes associated with the variable.
I hope this helps. Thanks!
doesn't seem to work as expected
Is DDI on your radar? https://www.ddialliance.org
Over 10K databases on ICPSR alone. https://www.icpsr.umich.edu/icpsrweb/
Maybe just generate a whole document if it's not a knit child right now.
Otherwise, we can't use this for fig/cache paths if e.g. bfi %>% select(1:3) is the df name.
Hello @rubenarslan
Hitherto the function codebook worked well for me, but now I'm getting an error. I suppose it must be linked to a package dependency, since I recently updated my project. The error message is as follows:
Error: No common type for `..1$by_variable$numeric.min` <labelled> and `..2$by_variable$numeric.min` <labelled>.
I'm not sure, but it looks like it's having difficulty dealing with variables of type numeric.
Here's the traceback:
33. stop(cnd)
32. abort(message, .subclass = c(.subclass, "vctrs_error"), ...)
31. stop_vctrs(message, .subclass = c(.subclass, "vctrs_error_incompatible"), x = x, y = y, details = details, ...)
30. stop_incompatible(x, y, x_arg = x_arg, y_arg = y_arg, details = details, ..., message = message, .subclass = c(.subclass, "vctrs_error_incompatible_type"))
29. stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
28. vec_ptype2.default(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
27. vec_type2_dispatch(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
26. vec_rbind(!!!x, .ptype = ptype)
25. unchop(data, !!cols, keep_empty = keep_empty, ptype = ptype)
24. unnest.data.frame(out, .data$by_variable)
23. tidyr::unnest(out, .data$by_variable)
22. build_results(skimmed, variable_names, NULL)
21. skim_by_type.data.frame(.x[[1L]], .y[[1L]], ...)
20. .f(.x[[1L]], .y[[1L]], ...)
19. purrr::map2(.data$skimmers, .data$skim_variable, skim_by_type, data)
18. summarise_impl(.data, dots, environment(), caller_env())
17. summarise.tbl_df(grouped, skimmed = purrr::map2(.data$skimmers, .data$skim_variable, skim_by_type, data))
16. dplyr::summarize(grouped, skimmed = purrr::map2(.data$skimmers, .data$skim_variable, skim_by_type, data))
15. skim_codebook(x)
14. "skim_type" %in% names(object)
13. has_type_column(object)
12. stopifnot(has_type_column(object), has_variable_column(object), has_skimr_attributes(object), nrow(object) > 0)
11. assert_is_skim_df(data)
10. skimr::partition(skim_codebook(x))
9. exists("POSIXct", df)
8. coerce_skimmed_summary_to_character(skimr::partition(skim_codebook(x)))
7. dots_values(...)
6. flatten_bindable(dots_values(...))
5. dplyr::bind_rows(coerce_skimmed_summary_to_character(skimr::partition(skim_codebook(x))), .id = "data_type")
4. skim_to_wide_labelled(results)
3. codebook_table(results)
2. codebook_items(results, indent = indent)
1. codebook(codebook_data)
Thanks in advance!
Hi Ruben,
when I use codebook() on my formr data I get the following error:
Error in .f(.x[[i]], ...) :
Names missing from the following functions: top_counts
This seems to be a problem with factors with no label (which should perhaps be read as strings instead?), but I can't figure out why, which variables are affected, or how to approach this issue.
Skimr v2 is going to be released very soon. You use skim_to_wide(), but with the new API the object defaults to wide. However, there are other changes that might break things for codebook. Please take a look and let us know if there are problems you can't solve.
aggregate_and_document_scale allows users to aggregate scales but also mark them up in a codebook-friendly way.
@rubenarslan - hey dude - I know you have a lot going on right now, but just a note that the web interface is giving me this error:
Also, I can't currently generate a codebook at all - I think it's the skimr issue mentioned in #40.
> meta_data = codebook(mtcars)
No missing values.
Error: 'skim_with_defaults' is not an exported object from 'namespace:skimr'
In addition: Warning message:
'skimr::skim_to_wide' is deprecated.
Use 'skim()' instead.
See help("Deprecated")
Allow the user to define different colors for different values within the variable (gradients, solid colors, palettes, etc.).
Currently using placeholder contexts for the data summary and item schemas.
Get rid of my own auto-detection of file formats, for a consistent format. Only snag: rio has many dependencies...
https://github.com/leeper/rio
Could md_pattern do more with tagged missings, i.e. patterns for skipped vs. structural missings?
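haven::na_tag() recovers the tag letter from a tagged missing, so a pattern table per tag is at least possible (a sketch; the mapping of tags to "skipped" vs. "structural" is an assumption about how the data were coded):

```r
library(haven)

x <- c(1, tagged_na("a"), tagged_na("b"), NA)

# na_tag() returns "a"/"b" for tagged NAs and NA otherwise; tabulating it
# separates, say, skipped ("a") from structural ("b") missings
table(na_tag(x), useNA = "ifany")
```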