tidyverse / haven
Read SPSS, Stata and SAS files from R
Home Page: https://haven.tidyverse.org
License: Other
In case it's helpful...
Matt Shotwell has a package on github called sas7bdat.parso. It's a wrapper for the GGA Software Parso Java library (ask Google).
I've used Parso to read many dozens of .sas7bdat files without problems (well, the only problem I've found is unescaped quotes in their csv writer - but that's the csv writer, not the sas7bdat reader). I know someone who is working on having Parso turn a bunch of SAS datasets directly into a SQLite db to avoid the whole csv nonsense.
For info, Parso was written because they couldn't get Sassy Reader to do what they wanted. Sassy Reader was based on Matt Shotwell's package.
Again, I hope it helps.
Harry
When variable labels contain ä, ö or ü, these characters are not imported correctly.
as_factor works as expected for a variable, where both levels are present:
var <- as.numeric(rep(1:2, 10))
var <- labelled(var, c(female = 1, male = 2))
as_factor(var)
For a variable where one level doesn't occur, it throws an error:
var <- as.numeric(rep(0, 10))
var <- labelled(var, c(female = 0, male = 1))
as_factor(var)
Error in factor(match(x, attr(x, "labels")), labels = names(attr(x, "labels"))) :
invalid 'labels'; length 2 should be 1 or 1
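The error message hints at the cause: `factor(match(x, labels), labels = names(labels))` derives the levels from the values that actually occur, so a single observed value meets two labels. Below is a minimal base-R sketch of the failure and one possible fix, which passes the full label set as `levels`. It illustrates the idea only and is not haven's actual implementation.

```r
# Value labels in the shape haven stores them: a named vector of codes.
labels <- c(female = 0, male = 1)
x <- rep(0, 10)  # only "female" occurs in the data

# Reproduces the failure: one observed level, but two labels supplied.
msg <- tryCatch(
  {
    factor(match(x, labels), labels = names(labels))
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

# Possible fix: take the levels from the labels, not from the data.
f <- factor(x, levels = unname(labels), labels = names(labels))
levels(f)
```

With explicit `levels`, unused categories such as "male" remain as empty levels instead of triggering an error.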
When missing values span a larger range of values, SPSS lets you declare the whole value range as "missing".
This differs from declaring specific individual values as missing.
When loading an SPSS file with a "missing range", read_spss throws an error:
> test <- read_spss("spss_missing_range.sav")
Error: Failed to parse C:\Users\Luedeke\Desktop\spss_missing_range.sav: Invalid file, or file has unsupported features.
I've uploaded two sample files.
SPSS-file that causes trouble (missing range):
https://www.dropbox.com/s/nkkk5mg45xknbeo/spss_missing_range.sav?dl=0
SPSS-file that works (missing values):
https://www.dropbox.com/s/dzdn38f81lk0rbx/spss_missing_value.sav?dl=0
I am unable to write dates using the haven package. I've tried formatting the dates as dates, POSIX, and character types. Any ideas?
library("haven")
id <- c(1, 2, 3)
date <- as.Date(c("2014-09-23", "2014-09-24", "2014-09-25"))
date.pos <- as.POSIXct(date)
date.char <- as.character(date)
write_sav(data.frame(id, date, date.pos, date.char), "test.sav")
Even though they apparently aren't in the data view. Sorry, I don't have a reproducible example, but that's what the colleague I prepared the file for told me.
Great package. However, I have some issues reading Stata files. All the missing observations are somehow ignored and not correctly identified.
My example data is here: https://www.dropbox.com/s/msbz5f7d5p84k2d/BFIR21FL.zip?dl=0
Code replicating my issue:
bfir21fl <- read_dta(path = "BFIR21FL.DTA")
str(bfir21fl$v121)
Class 'labelled' atomic [1:6354] 1 1 1 1 ...
..- attr(*, "label")= chr "has television"
..- attr(*, "labels")= Named int [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "no" "yes"
sum(is.na(bfir21fl$v121))
[1] 0
sum(is.na(as_factor(bfir21fl$v121)))
[1] 0
Using the read.dta from foreign results in:
bfir21fl_1 <- read.dta("BFIR21FL.DTA")
str(bfir21fl_1$v121)
Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 1 1 1 1 ...
length(bfir21fl_1$v121)
[1] 6354
sum(is.na(bfir21fl_1$v121))
[1] 61
61 is the correct result. Either I am doing something wrong, or the conversion is not working correctly, or it does not match my expectations.
The data read in fine, but when I went to look at it I got this error message:
head(p7)
Error in sprintf("%02d", m) : invalid format '%02d'; use format %f, %e, %g or %a for numeric objects
On further investigation I found the variable that was causing the problem. If I remove this variable, then head works fine. In SAS the format and informat are both TIME8.
attributes(p7[,42])
$label
[1] "Time of highest X within 3 yrs of 7/1/2005"
$class
[1] "hms"
summary(p7[,42])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1800 29280 31200 31020 33420 35940 32987
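The `sprintf("%02d", m)` message is what base R raises when `%d` receives a double that is not a whole number. Assuming the column holds seconds since midnight (the summary above suggests this, but it is an assumption), a hedged sketch of formatting such values with explicit integer conversion could look like this. `format_seconds` is a hypothetical helper, not the code used by the `hms` class.

```r
# Hypothetical helper: format seconds since midnight as "HH:MM:SS".
format_seconds <- function(secs) {
  secs <- as.integer(round(secs))   # make sure %d receives integers
  h <- secs %/% 3600L
  m <- (secs %% 3600L) %/% 60L
  s <- secs %% 60L
  out <- sprintf("%02d:%02d:%02d", h, m, s)
  out[is.na(secs)] <- NA_character_ # keep missing values missing
  out
}

format_seconds(c(1800, 29280, 35940, NA))
```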
testsas <- group_by(read_sas("http://crn.cancer.gov/resources/ctcodes-procedures.sas7bdat"), code_type)
# Error: column 'px' of type character has unsupported attributes: label
By using as_factor to change all columns with label attribute I can fix things so that group_by works.
testsas <- read_sas("http://crn.cancer.gov/resources/ctcodes-procedures.sas7bdat")
testsas$px <- as_factor(testsas$px)
testsas$code_type <- as_factor(testsas$code_type)
testsas$description <- as_factor(testsas$description)
testsas$chemo_type <- as_factor(testsas$chemo_type)
testsas$comments <- as_factor(testsas$comments)
group_by(testsas, code_type)
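Since the error complains only about the `label` attribute, a lighter workaround than converting every column to a factor may be to strip that attribute before grouping. This is a base-R sketch under that assumption; `strip_var_labels` is a hypothetical helper name and the demo data is made up.

```r
# Hypothetical helper: remove the "label" attribute from every column.
strip_var_labels <- function(df) {
  df[] <- lapply(df, function(col) {
    attr(col, "label") <- NULL
    col
  })
  df
}

# Made-up stand-in for an imported SAS table.
demo <- data.frame(px = c("A1", "B2"), code_type = c("C4", "C4"),
                   stringsAsFactors = FALSE)
attr(demo$px, "label") <- "Procedure code"

clean <- strip_var_labels(demo)
attributes(clean$px)  # the label is gone; the values are unchanged
```

The column types stay as they were, which avoids turning free-text columns such as comments into factors.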
A really great feature would be a write function that writes back a data frame (with variable and value label attributes) to an SPSS/SAS/Stata file, where the value and variable fields are automatically set.
This would be a great benefit for all people who collaborate in teams where some use SPSS and others use R. Imagine you have a base data set (in SPSS format, because majority uses it in your department) and do some recodings and data cleaning with R, and you want to save the changes / added variables back to the SPSS data set.
I'm not sure whether this is possible at all, or if it's possible with low effort?
I'm not seeing any obvious problems with the data - there are missing values scattered throughout. Is there something specific I should look at that might sort out what is going on?
d0 <- read_sas('sasdata/s114640.sas7bdat')
Error: Failed to parse sasdata/s114640.sas7bdat: A row in the file was not the expected length.
d1 <- sas.get('sasdata','s114640')
Read 112 records
Read 40067 records
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grDevices datasets splines graphics utils stats methods
[8] base
other attached packages:
[1] haven_0.0.0.9000 chron_2.3-45 gam_1.09.1 survival_2.37-7
[5] rlocal_2.0.2
loaded via a namespace (and not attached):
[1] Rcpp_0.11.4
When I import data with haven, the object returned by the read functions has the following class attributes:
> class(x)
[1] "tbl_df" "tbl" "data.frame"
When accessing a table column, a data frame is still returned:
str(x[, 6])
Classes ‘tbl_df’ and 'data.frame': 1567 obs. of 1 variable:
$ sex:Class 'labelled' atomic [1:1567] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "label")= chr "Geschlecht der antwortenden Person"
.. ..- attr(*, "labels")= Named int [1:2] 0 1
.. .. ..- attr(*, "names")= chr [1:2] "männlich" "weiblich"
I would assume that I either retrieve an atomic vector with attributes or a vector of class labelled
...
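For readers hitting the same surprise: with a plain data.frame, `x[, 6]` drops to a vector by default, whereas a `tbl_df` keeps the data frame shape, much like `drop = FALSE`. `[[` (or `$`) returns the column itself in both cases. The sketch below uses base R only, with `drop = FALSE` standing in for the `tbl_df` behaviour (an assumption made to keep the example dependency-free); the data is made up.

```r
x <- data.frame(id = 1:3, sex = c(1, 1, 0))
attr(x$sex, "label") <- "Sex of respondent"

as_frame  <- x[, "sex", drop = FALSE]  # one-column data frame, as with a tbl_df
as_vector <- x[["sex"]]                # the column vector itself

class(as_frame)
attr(as_vector, "label")
```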
Not a bug - just a comment that I found the labelled function a bit confusing because of the name. So “labelled” is similar to a SAS format, versus a SAS “label”, which I was happy to see does still come across. Both are useful concepts and it would really be nice to have both supported.
Also, I noticed with subsetting the data that the “label” attribute is lost but not the “labels” attribute. Any chance of adding in the “label” concept somehow? It would be nice to not lose that information and it would be nice to have a wrapper function that adds/extracts the information.
Thanks for the consideration.
d2$death <- labelled(d2$status, labels=c(Censored=0, Dead=1))
attributes(d2$death)
$label
[1] "Vital Status"
$labels
Censored Dead
0 1
$class
[1] "labelled"
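A sketch of the kind of wrapper being requested. `get_label()` and `set_label()` are hypothetical helper names, not haven functions; they simply read and write the "label" attribute, which lets you carry it across a subset by hand.

```r
# Hypothetical helpers around the "label" attribute.
get_label <- function(x) attr(x, "label", exact = TRUE)
set_label <- function(x, value) {
  attr(x, "label") <- value
  x
}

status <- set_label(c(0, 1, 1, 0), "Vital Status")

# Base subsetting of a plain vector drops the attribute ...
subset_status <- status[1:2]
lost <- is.null(get_label(subset_status))

# ... so it has to be copied back explicitly.
subset_status <- set_label(subset_status, get_label(status))
get_label(subset_status)
```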
I got the following error after importing the Social Diagnosis survey dataset (link to data - http://www.diagnoza.com/data/database/2000_2013/SOCIAL_DIAGNOSIS_H_2000_2013_SAV.zip)
> library(dplyr)
> library(haven)
> fname <- '~/Downloads/SOCIAL_DIAGNOSIS_H_2000_2013.SAV'
> dsin <- read_spss(path = fname)
> class(dsin)
[1] "tbl_df" "tbl" "data.frame"
> head(dsin)
Error: `x` and `labels` must be same type
I also get an error after applying the count function from dplyr, but this is probably because the labelled class is not supported in dplyr?
> d <- dsin %>% count(gdtyp_11)
Error: column 'NUMER_2000_2013' of type numeric has unsupported attributes: label
Hello, I have a SAS dataset that we got from an external source and it has Encoding = shift-jis Japanese (SJIS). Haven generates this error when trying to read it: Error: Failed to parse C:\Temp\lb.sas7bdat: Invalid file, or file has unsupported features.
The problem goes away if I use PC SAS to convert the file to a local encoding, e.g, Encoding = wlatin1 Western (Windows).
Is it possible for haven to support Shift-JIS? On many occasions I will not have access to PC SAS to do the conversion.
Regards,
David
A simple request: could you add an option to turn off labelling for labeled Stata files?
I'd be fine with one that dropped labels altogether, returning only the numeric values underlying the Stata labels.
The main concern is that dplyr still doesn't support labelled data.frames, but I'm worried about compatibility with other packages, too.
In SPSS, you can assign multiple missing values, e.g. "8" as "not applicable" or "9" for "real missing". read_spss() would set both 8 and 9 to NA in the imported data.frame; however, if a value label for "8" is set, it is also imported, so you end up with one more value label than values.
Example:
"Do you live in partnership?"
1 - yes
2 - no
8 - not applicable (was asked before if married) -> declared as "8" missing
9 - missing -> declared as "9" missing, but not labelled
Now read_spss() imports this as vector with two values (1/2) and NA's, but three value labels (yes/no/not applicable).
This is no serious bug, but I wonder if it would be possible to distinguish whether an SPSS-declared "missing value" is either a "real" missing or a "not applicable" declared missing. In SPSS you do this e.g. to get valid frequency counts for yes/no answers, but still having the information how much "real" missings you have...
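To make the request concrete, here is a base-R sketch of what keeping the two kinds of missing apart could look like: the raw codes are kept, and the declared missing codes are tracked in a separate vector instead of being collapsed to NA on import. The data and variable names are made up, and this is not how read_spss() behaves.

```r
# Made-up answers to "Do you live in partnership?"
answers <- c(1, 2, 8, 9, 1, 8)
value_labels <- c(yes = 1, no = 2, "not applicable" = 8)
declared_missing <- c(8, 9)  # codes SPSS treats as missing

is_missing <- answers %in% declared_missing

# Valid frequencies use only the substantive categories.
valid_counts <- table(factor(answers[!is_missing],
                             levels = c(1, 2), labels = c("yes", "no")))

# The missing codes can still be tabulated separately.
missing_counts <- table(answers[is_missing])

valid_counts
missing_counts
```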
With the latest build, the issue linked here persists. Basically, using df <- read_dta('LIAB_lm_9310_v1_pers.dta'), where the data file is included in this zip, if I try to head(df) or data.table(df), I get an error
Error: `x` and `labels` must be same type
and in the latter case, R subsequently crashes.
as_factor is broken for version 0.2.0.
> var <- labelled(c(0, 1), c(female = 0, male = 1))
> as_factor(var)
[1] male <NA>
Levels: female male
SessionInfo:
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] de_AT.UTF-8/de_AT.UTF-8/de_AT.UTF-8/C/de_AT.UTF-8/de_AT.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] haven_0.2.0
loaded via a namespace (and not attached):
[1] Rcpp_0.11.5 tools_3.1.2
I created a data frame using simulated data and tried to save it as a dta file; this caused R to crash with a segfault. Using R version 3.1.2 on a Mac with OS X 10.10.2.
Example code:
library(haven)
library(plyr)
set.seed(12345)
N=300
var1 = rnorm(N,0,1.5)
var2 = rnorm(N,2,5)
var3 = runif(N,-10,10)
dataForStata = data.frame(var1,var2,var3)
names(dataForStata) = c("var1","var2","var3")
write_dta(dataForStata,path="~/Desktop/")
Running the code produces the following error:
> write_dta(dataForStata,path="~/Desktop/")
*** caught segfault ***
address 0x68, cause 'memory not mapped'
Traceback:
1: .Call("haven_write_dta", PACKAGE = "haven", data, path)
2: write_dta(dataForStata, path = "~/Desktop/")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
R session info:
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin14.0.0 (64-bit)
locale:
[1] en_US/en_US/en_US/C/en_US/en_US
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8.1 haven_0.1.1.9000
loaded via a namespace (and not attached):
[1] Rcpp_0.11.4
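One detail in the example worth noting: `path` points at a directory ("~/Desktop/") rather than a file name. Whether that triggers the segfault is an assumption, not something confirmed here, but a guard like the hypothetical `check_output_path()` below would at least rule it out before calling `write_dta()`.

```r
# Hypothetical guard: refuse directory paths before handing them to a writer.
check_output_path <- function(path) {
  if (dir.exists(path) || grepl("[/\\\\]$", path)) {
    stop("`path` must name a file, not a directory: ", path)
  }
  if (!dir.exists(dirname(path))) {
    stop("Directory does not exist: ", dirname(path))
  }
  invisible(path)
}

# A file inside an existing directory passes the check.
target <- file.path(tempdir(), "dataForStata.dta")
check_output_path(target)

# A bare directory is rejected with an ordinary R error.
dir_result <- tryCatch(check_output_path(tempdir()), error = function(e) e)
```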
I am getting this error when I try to open a SPSS por file:
Error in df_parse_por(clean_path(path)) :
attempt to set index 0/0 in SET_STRING_ELT
Below is code that downloads the file. It works with read.spss.
library("haven")
library("foreign")
# download and unzip file to temporary folder
url <- "http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2006_sqf.zip"
p1 <- file.path(tempdir(), basename(url))
download.file(url, p1, quiet = TRUE)
filename <- unzip(p1, list = TRUE)$Name[1]
unzip(p1, files = filename, exdir = tempdir())
# open file
p2 <- file.path(tempdir(), filename)
DF <- read_por(p2)
# Error in df_parse_por(clean_path(path)) :
# attempt to set index 0/0 in SET_STRING_ELT
DF <- foreign::read.spss(p2, use.value.labels = FALSE, to.data.frame = TRUE)
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-62 haven_0.1.1.9000 Defaults_1.1-1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.4
It would be nice if value and variable labels would be imported as well, e.g. by setting them as data frame attributes (like read.spss from the foreign package does):
_Example of read.spss_
'data.frame': 908 obs. of 26 variables:
$ c12hour : num 16 148 70 168 168 16 161 110 28 40 ...
$ e15relat: atomic 2 2 1 1 2 2 1 4 2 2 ...
..- attr(*, "value.labels")= Named chr "8" "7" "6" "5" ...
.. ..- attr(*, "names")= chr "other, specify" "cousin" "nephew/niece" "ancle/aunt" ...
$ e16sex : atomic 2 2 2 2 2 2 1 2 2 2 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "female" "male"
$ e17age : num 83 88 82 67 84 85 74 87 79 83 ...
$ e42dep : atomic 3 3 3 4 4 4 4 4 4 4 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "severely dependent" "moderately dependent" "slightly dependent" "independent"
$ c82cop1 : atomic 3 3 2 4 3 2 4 3 3 3 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c83cop2 : atomic 2 3 2 1 2 2 2 2 2 2 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c84cop3 : atomic 2 3 1 3 1 3 4 2 3 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c85cop4 : atomic 2 3 4 1 2 3 1 1 2 2 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c86cop5 : atomic 1 4 1 1 2 3 1 1 2 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c87cop6 : atomic 1 1 1 1 2 2 2 1 1 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c88cop7 : atomic 2 3 1 1 1 2 4 2 3 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c89cop8 : atomic 3 2 4 2 4 1 1 3 1 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c90cop9 : atomic 3 2 3 4 4 1 4 3 3 3 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c160age : num 56 54 80 69 47 56 61 67 59 49 ...
$ c161sex : atomic 2 2 1 1 2 1 2 2 2 2 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "Female" "Male"
$ c172code: atomic 2 2 1 2 2 2 2 2 NA 2 ...
..- attr(*, "value.labels")= Named chr "3" "2" "1"
.. ..- attr(*, "names")= chr "high level of education" "intermediate level of education" "low level of education"
$ c175empl: atomic 1 1 0 0 0 1 0 0 0 0 ...
..- attr(*, "value.labels")= Named chr "1" "0"
.. ..- attr(*, "names")= chr "yes" "no"
$ barthtot: num 75 75 35 0 25 60 5 35 15 0 ...
$ neg_c_7 : num 12 20 11 10 12 19 15 11 15 10 ...
$ pos_v_4 : num 12 11 13 15 15 9 13 14 13 13 ...
$ quol_5 : num 14 10 7 12 19 8 20 20 8 15 ...
$ resttotn: num 0 4 0 2 2 1 0 0 0 1 ...
$ tot_sc_e: num 4 0 1 0 1 3 0 1 2 1 ...
$ n4pstu : atomic 0 0 2 3 2 2 3 1 3 3 ...
..- attr(*, "value.labels")= Named chr "8" "7" "6" "5" ...
.. ..- attr(*, "names")= chr "other, specify" "cousin" "nephew/niece" "ancle/aunt" ...
$ nur_pst : atomic NA NA 2 3 2 2 3 1 3 3 ...
..- attr(*, "value.labels")= Named chr "89" "88" "87" "86" ...
.. ..- attr(*, "names")= chr "other" "co-religionist" "volunteer" "neighbour" ...
_Same with haven_
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 908 obs. of 26 variables:
$ c12hour : num 16 148 70 168 168 16 161 110 28 40 ...
$ e15relat: num 2 2 1 1 2 2 1 4 2 2 ...
$ e16sex : num 2 2 2 2 2 2 1 2 2 2 ...
$ e17age : num 83 88 82 67 84 85 74 87 79 83 ...
$ e42dep : num 3 3 3 4 4 4 4 4 4 4 ...
$ c82cop1 : num 3 3 2 4 3 2 4 3 3 3 ...
$ c83cop2 : num 2 3 2 1 2 2 2 2 2 2 ...
$ c84cop3 : num 2 3 1 3 1 3 4 2 3 1 ...
$ c85cop4 : num 2 3 4 1 2 3 1 1 2 2 ...
$ c86cop5 : num 1 4 1 1 2 3 1 1 2 1 ...
$ c87cop6 : num 1 1 1 1 2 2 2 1 1 1 ...
$ c88cop7 : num 2 3 1 1 1 2 4 2 3 1 ...
$ c89cop8 : num 3 2 4 2 4 1 1 3 1 1 ...
$ c90cop9 : num 3 2 3 4 4 1 4 3 3 3 ...
$ c160age : num 56 54 80 69 47 56 61 67 59 49 ...
$ c161sex : num 2 2 1 1 2 1 2 2 2 2 ...
$ c172code: num 2 2 1 2 2 2 2 2 NaN 2 ...
$ c175empl: num 1 1 0 0 0 1 0 0 0 0 ...
$ barthtot: num 75 75 35 0 25 60 5 35 15 0 ...
$ neg_c_7 : num 12 20 11 10 12 19 15 11 15 10 ...
$ pos_v_4 : num 12 11 13 15 15 9 13 14 13 13 ...
$ quol_5 : num 14 10 7 12 19 8 20 20 8 15 ...
$ resttotn: num 0 4 0 2 2 1 0 0 0 1 ...
$ tot_sc_e: num 4 0 1 0 1 3 0 1 2 1 ...
$ n4pstu : num 0 0 2 3 2 2 3 1 3 3 ...
$ nur_pst : num NaN NaN 2 3 2 2 3 1 3 3 ...
read_dta() seems to convert Stata byte variables to character strings, instead of to integers. If I create a simple Stata file, with variables of type float, double, long, int and byte, here is the result of importing it using read_dta:
> d = read_dta("stata-datatypes.dta")
> sapply(d, class)
vfloat vdouble vlong vint vbyte
"numeric" "numeric" "integer" "integer" "character"
Version info:
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 LC_CTYPE=Norwegian-Nynorsk_Norway.1252
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252 LC_NUMERIC=C
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] haven_0.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.4 tools_3.1.1
functionaity (in readme), should be: functionality
From @evanmiller
Quick API update you should be aware of. The "error_handler" now receives a second argument: the user_ctx variable that is passed to all the other callbacks.
I've also added another callback for progress indicators, may or may not be useful to you:
typedef int (*readstat_progress_handler)(double progress, void *ctx);
readstat_error_t readstat_set_progress_handler(
readstat_parser_t *parser,
readstat_progress_handler progress_handler);
This callback periodically receives a double between 0.0 and 1.0 indicating the % progress through reading a file. I implemented it mainly so I could get a progress indicator on POR files, since as we've discussed the row count is not available in advance. But the progress handler should work on all file types.
I read a sas7bdat file with the read_sas() function; it is represented as a tbl_df in R. When I then call names() on it, the R console in RStudio goes into a "does not respond" state. RStudio itself keeps working, but I cannot close it without killing it. I don't have such problems with data.frames, so I guess it is a problem with haven.
I use R 3.1.2 on Windows with recent development version of RStudio and haven 0.2
Updating R to 3.1.3 resolved the issue
When writing a sav file using
data <- data.frame(Var = c("One", "Two", "Three"), stringsAsFactors = FALSE)
haven::write_sav(data, "string_haven.sav")
I get the following error when opening with SPSS (DATA1205): http://www-01.ibm.com/support/docview.wss?uid=swg21512643
File written with haven:
https://github.com/dgromer/misc/blob/master/string_haven.sav
File written with SPSS:
https://github.com/dgromer/misc/blob/master/string_spss.sav
http://blog.rstudio.org/2015/03/04/haven-0-1-0/ says, "as_factor(): turns labelled integers into factors".
I tried as_factor() on labelled characters (which happened to be characters "1", "2", "3", etc.), and it seemed to convert those into factors just fine.
When loading an SPSS file with read_spss or read_sav, I can't view the data frame in the RStudio Viewer.
> View(x)
Error: `x` and `labels` must be same type
Using RStudio Version 0.99.283 on Win 7.
When writing a sav file using
data <- data.frame(Var = c(1, 2, 3))
haven::write_sav(data, "numeric_haven.sav")
in the resulting file, the variable does not have any type (should be numeric). Maybe the reason is me using a German version of SPSS, that expects "," as decimal separator instead of "."?
Here's the file written with haven:
https://github.com/dgromer/misc/blob/master/numeric_haven.sav
and how it should look (created with SPSS):
https://github.com/dgromer/misc/blob/master/numeric_spss.sav
Hi,
I'm having an issue with importing labelled Stata variables. It appears to affect the labelled numeric(long) Stata type but not numeric(byte). Calling summary() fails with the error below and as_factor gives a length error.
>havtest$cav_n_stage
<Labelled>
[1] 1 2 1 1 1 1 2 4 1 4 4 1 4 1 1 1 1 1 2 1 1 4 1 1 1 1 1 2 4 1 2
[32] 2 3 1 2 1 2 2 2 4 2 4 1 1 2 4 1 1 1 4 2 2 1 1 1 1 1 1 2 2 3 1
[63] 1 2 4 1
Labels:
N0 N1 N2 NX Nx
1 2 3 4 5
>summary(havtest$cav_n_stage)
Error: `x` and `labels` must be same type
>as_factor(havtest$cav_n_stage)
Error in factor(match(x, attr(x, "labels")), labels = names(attr(x, "labels"))) : invalid 'labels'; length 5 should be 1 or 4
With that error, I thought it may be that there are no "Nx" (=5) in this set. Indeed with a different numeric(long) var that has no redundant labels, I get the same error on using summary(), but as_factor() is successful. I can understand the as_factor() failing with redundant labels (although maybe it shouldn't...) but not sure why summary() fails?
By contrast, the numeric(byte) class seems fine:
>havtest$abort_n2
<Labelled>
[1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[16] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[31] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[46] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[61] "0" "0" "0" "0" "0" "0"
Labels:
no yes
0 1
>summary(havtest$abort_n2)
Length Class Mode
66 labelled character
> as_factor(havtest$abort_n2)
[1] no no no no no no no no no no no no no no no no no no no no no
[22] no no no no no no no no no no no no no no no no no no no no no
[43] no no no no no no no no no no no no no no no no no no no no no
[64] no no no
Levels: no yes
Not sure if this is just me not using things correctly!
Thanks,
David.
As discussed over IM, may make a nice first test for setting up a drat repo.
I'm getting the following error when I include my format catalog:
data<-read_sas("M:/.../alldata.sas7bdat", b7cat = "L:/.../formats.sas7bcat")
Error: Failed to parse L:\...\formats.sas7bcat: Invalid file, or file has unsupported features.
The data itself loads fine if I don't specify b7cat. If the dataset has the value labels already applied, should they be loaded with only the sas7bdat argument? They seem to not be loading. That is, the data all come out as regular numeric data.
More a question than an issue, but which format is required for vectors or the data.frame so that variable and value labels are saved to the SPSS file using write_sav? Value labels in the created SPSS files were all "invalid" (I tried to save vectors with attached label values as well as factors with labels (no attributes)).
hey boss, thanks for all your hard work. here's a reproducible example
# read_sas() fails on compressed .sas7bdat files
# and gives an unhelpful error message
# read.sas7bdat() also fails, but explains why
# read.sas7bdat.parso succeeds
# load a few packages to demonstrate successes and failures
library(haven)
library(sas7bdat)
library(devtools)
install_github( "biostatmatt/sas7bdat.parso" )
library(sas7bdat.parso)
# initiate some temporary files
tfp <- tempfile() ; tff <- tempfile() ; tfh <- tempfile()
# three of the latest files from the us census bureau's current population survey
# the current population survey is the major federal benchmark for employment, poverty, and health insurance in the united states
download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/pppub14_redes.sas7bdat" , tfp , mode = 'wb' )
download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/ffpub14_redes.sas7bdat" , tff , mode = 'wb' )
download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/hhpub14_redes.sas7bdat" , tfh , mode = 'wb' )
# breaks
havenp <- read_sas( tfp )
# works
havenf <- read_sas( tff )
# breaks
havenh <- read_sas( tfh )
# breaks
sbdp <- read.sas7bdat( tfp )
# breaks
sbdf <- read.sas7bdat( tff )
# breaks
sbdh <- read.sas7bdat( tfh )
# works
parsop <- read.sas7bdat.parso( tfp )
# works
parsof <- read.sas7bdat.parso( tff )
# works
parsoh <- read.sas7bdat.parso( tfh )
# more reading about sas7bdat.parso here
# http://biostatmatt.com/archives/2618
I have an example SPSS file with a few non-ASCII characters. When I import this using read_spss(), the non-ASCII characters in variable values are handled properly, but the same characters in variable names are not converted. They look like UTF-8 byte sequences interpreted as ISO-8859-1. If you supply an e-mail address, I can mail you the example file (the GitHub issue tracker doesn’t seem to support attachments).
Example R session:
> library(haven)
> d=read_spss("spsstest.sav")
> d # Wrong characters in the column header
abc abcæøå testµ
1 foo 1 10
2 bår 2 20
3 æøå 3 30
It seems easy enough to fix:
> Encoding(names(d))
[1] "unknown" "unknown" "unknown"
> Encoding(names(d))="UTF-8"
> d # Correct characters in the column header
abc abcæøå testµ
1 foo 1 10
2 bår 2 20
3 æøå 3 30
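The fix above works because `Encoding<-` only changes how R interprets the existing bytes; it does not convert them. A small base-R sketch of the same mechanism, building the UTF-8 bytes of "abcæøå" by hand so the example does not depend on the source file's encoding:

```r
# UTF-8 bytes for "abcæøå", created without an encoding mark.
raw_name <- rawToChar(as.raw(c(0x61, 0x62, 0x63,
                               0xc3, 0xa6, 0xc3, 0xb8, 0xc3, 0xa5)))
Encoding(raw_name)  # unmarked strings report "unknown"

# Declare the bytes as UTF-8; the bytes themselves stay the same.
fixed <- raw_name
Encoding(fixed) <- "UTF-8"

nchar(fixed, type = "chars")  # three ASCII letters plus three two-byte letters
```

If the bytes were in a different encoding (for example Windows-1252, as in the second file below), marking them as UTF-8 would be wrong and a real conversion with `iconv()` would be needed instead.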
The above example was for a SPSS file saved as ‘Unicode’. If I instead save it in the ‘native’ encoding (which seems to be Windows-1252), I get this error message:
> d=read_spss("spsstest2.sav")
Failed to find ABCÆØÅ
Failed to find TESTµ
The resulting data.frame looks like this:
> d
abc ABCÆØÅ TESTµ
1 foo 1 10
2 bår 2 20
3 æøå 3 30
Note that all the variable names are lowercase in the original SPSS file (i.e., abc, abcæøå and testµ), but two of them have been converted to uppercase in the data.frame.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252
[2] LC_CTYPE=Norwegian-Nynorsk_Norway.1252
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] haven_0.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.4 tools_3.1.1
I encountered an error when trying to read a SAS dataset that was compressed. I was wondering if there are plans to address this? I had to use PC SAS to uncompress the file before I could get it to work. Relevant log is below:
library(haven)
library(sas7bdat)
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
other attached packages:
[1] sas7bdat_0.5 dplyr_0.4.1 haven_0.0.0.9000
ds.unix.fmt <- read_sas("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat")
Error: Failed to parse C:\0_data\GOES\ORACLE\rd_frmae.sas7bdat: A row in the file was not the expected length.
ds_unix.fmt1 <- read.sas7bdat("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat")
Error in read.sas7bdat("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat") :
file contains compressed data
ds.pc.fmt <- read_sas("C:/0_data/GOES/DATASETS/rd_frmae.sas7bdat")
I think I've found a bug in read_sav which only presents when I try to View the resultant data frame
temp <- tempfile()
download.file("http://www.electionstudies.org/studypages/data/anes_panel_2013_inetrecontact/anes_panel_2013_inetrecontactsav.zip",temp)
d1 <- read_sav(unzip(temp, "anes_panel_2013_inetrecontact.sav"))
which all works fine, until I try
View(d1)
which returns a blank data frame view, and this error message
Error: `x` and `labels` must be same type
With the latest CRAN and GitHub versions, I cannot filter using dplyr after importing this .dta file with haven.
df <- read_dta("Authoritarian Shadow/Test.dta") %>%
filter(year > 1996)
I get the error
Error: column 'country_name' of type character has unsupported attributes: label
I got the following error when I tried to install the latest snapshot of haven:
> library(devtools)
> devtools::install_github("hadley/haven")
Downloading github repo hadley/haven@master
Installing haven
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \
'/private/var/folders/k4/qfl_sg2s12d9z7p2c3qlrvqh0000gn/T/Rtmp6Em6Gr/devtoolsa1f23674344/hadley-haven-ac66b3d' \
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests
* installing *source* package ‘haven’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c CKHashTable.c -o CKHashTable.o
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c DfBuilder.cpp -o DfBuilder.o
DfBuilder.cpp:188:7: error: use of undeclared identifier 'warning'
warning("Unsupported label type: %s", type);
^
1 error generated.
make: *** [DfBuilder.o] Error 1
ERROR: compilation failed for package ‘haven’
* removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/haven’
Error: Command failed (1)
>
Hi,
I tried to install from github, this is what I got:
> devtools::install_github("hadley/haven")
Downloading github repo hadley/haven@master
Installing haven
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \
'/private/var/folders/r_/fm5p9qsx519fxh8cvr2vbtt80000gp/T/Rtmp3Bpad8/devtools9be559edde1/hadley-haven-432ad52' \
--library='/Users/dprice/Library/R/3.1/library' --install-tests
* installing *source* package ‘haven’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/dprice/Library/R/3.1/library/Rcpp/include" -I"/Users/dprice/Library/R/3.1/library/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c CKHashTable.c -o CKHashTable.o
clang++ -arch x86_64 -ftemplate-depth-256 -I/Library/Frameworks/R.framework/Resources/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/dprice/Library/R/3.1/library/Rcpp/include" -I"/Users/dprice/Library/R/3.1/library/BH/include" -fPIC -Wall -mtune=core2 -O3 -c DfReader.cpp -o DfReader.o
In file included from DfReader.cpp:1:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/Rcpp.h:27:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/RcppCommon.h:29:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/Rcpp/platform/compiler.h:171:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/map:423:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tree:15:
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/iterator:341:10: fatal error: '__debug' file not found
#include <__debug>
^
1 error generated.
make: *** [DfReader.o] Error 1
ERROR: compilation failed for package ‘haven’
* removing ‘/Users/dprice/Library/R/3.1/library/haven’
* restoring previous ‘/Users/dprice/Library/R/3.1/library/haven’
Error: Command failed (1)
> Sys.info()
sysname
"Darwin"
release
"14.3.0"
version
"Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64"
nodename
"UPEI-MacBookAir.local"
machine
"x86_64"
login
"dprice"
user
"dprice"
effective_user
"dprice"
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] bitops_1.0-6 devtools_1.7.0 evaluate_0.5.5 formatR_1.1 httr_0.6.1 knitr_1.9 RCurl_1.95-4.5 stringr_0.6.2 tools_3.1.3
Installing via install.packages worked though.
Is there a specific reason why label attributes in Haven have other names than attributes from foreign-package? Would be easier for other packages that access value and variable labels to have the same attribute names.
I'll send you example files via mail.