tidyverse / haven

Read SPSS, Stata and SAS files from R

Home Page: https://haven.tidyverse.org

License: Other

R 10.52% C 79.84% C++ 4.78% Ragel 4.85% Shell 0.01%

spss stata r sas

haven's Introduction

haven

Overview

Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller. Haven is part of the tidyverse. Currently it supports:

SAS: read_sas() reads .sas7bdat + .sas7bcat files and read_xpt() reads SAS transport files (versions 5 and 8). write_xpt() writes SAS transport files (versions 5 and 8).
SPSS: read_sav() reads .sav files and read_por() reads the older .por files. write_sav() writes .sav files.
Stata: read_dta() reads .dta files (up to version 15). write_dta() writes .dta files (versions 8-15).

The output objects:

Are tibbles, which have a better print method for very long and very wide files.
Translate value labels into a new labelled() class, which preserves the original semantics and can easily be coerced to factors with as_factor(). Special missing values are preserved. See vignette("semantics") for more details.
Convert dates and times to R date/time classes. Character vectors are not converted to factors.
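
As a small illustration of the labelled class described above, the following sketch builds a labelled vector by hand and coerces it to a factor. The vector and its labels are invented for the example, and the behaviour shown assumes a reasonably recent haven release.

```r
library(haven)

# Hypothetical survey answers coded 1/2, with value labels attached
x <- labelled(c(1, 2, 1), labels = c(yes = 1, no = 2))

# as_factor() replaces the codes with their labels
f <- as_factor(x)
print(levels(f))
```

The underlying codes stay available on x itself, so the conversion to a factor can be deferred until it is actually needed.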

Installation

# The easiest way to get haven is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just haven:
install.packages("haven")

Usage

library(haven)

# SAS
read_sas("mtcars.sas7bdat")
write_xpt(mtcars, "mtcars.xpt")

# SPSS
read_sav("mtcars.sav")
write_sav(mtcars, "mtcars.sav")

# Stata
read_dta("mtcars.dta")
write_dta(mtcars, "mtcars.dta")
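
The calls above assume the files already exist in the working directory. A self-contained round trip, sketched here with a temporary file and the built-in mtcars data, may be easier to try out:

```r
library(haven)

# Write to a temporary file so the working directory stays untouched
path <- tempfile(fileext = ".dta")
write_dta(mtcars, path)

# Read it back; the result is a tibble, and row names are not kept
roundtrip <- read_dta(path)
print(dim(roundtrip))
```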

Related work

foreign reads from SAS XPORT, SPSS, and Stata (up to version 12) files.
readstata13 reads from and writes to all Stata file format versions.
sas7bdat reads from SAS7BDAT files.

Code of Conduct

Please note that the haven project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

haven's Issues

How does write_sav write labels to SPSS files?

More a question than an issue: what format do vectors or the data.frame need to have so that variable and value labels are saved to the SPSS file by write_sav? Value labels in the SPSS files I created were all "invalid". I tried saving vectors with attached value labels as well as factors with labels (no attributes).
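
For later readers: in more recent haven releases, the usual way to carry labels into a .sav file appears to be to store the column as a labelled() vector and keep the variable label in its "label" attribute. The sketch below assumes such a release and does not describe the version this question was asked about; the column name and labels are made up.

```r
library(haven)

# Value labels via labelled(); variable label via the "label" attribute
sex <- labelled(c(1, 2, 1), labels = c(female = 1, male = 2))
attr(sex, "label") <- "Respondent sex"
df <- tibble::tibble(sex = sex)

path <- tempfile(fileext = ".sav")
write_sav(df, path)

# Reading the file back should show both kinds of label
back <- read_sav(path)
print(attr(back$sex, "label"))
print(attr(back$sex, "labels"))
```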

read_spss() imports value labels for multiple declared missings

In SPSS, you can assign multiple missing values, e.g. "8" as "not applicable" or "9" for "real missing". read_spss() would set both 8 and 9 to NA in the imported data.frame, however, if a value label for "8" is set, it is also imported - thus, you have one more value label than values.

Example:

"Do you live in partnership?"

1 - yes
2 - no
8 - not applicable (was asked before if married) -> declared as "8" missing
9 - missing -> declared as "9" missing, but not labelled

Now read_spss() imports this as vector with two values (1/2) and NA's, but three value labels (yes/no/not applicable).

This is not a serious bug, but I wonder whether it would be possible to distinguish an SPSS-declared "missing value" that is a "real" missing from one that is a "not applicable" declared missing. In SPSS you do this, for example, to get valid frequency counts for yes/no answers while still keeping the information on how many "real" missings you have.
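
Later haven releases added a labelled_spss() class, together with a user_na argument on read_sav(), that keeps user-defined missing codes instead of collapsing them to NA. The sketch below rebuilds the partnership example by hand under that assumption; it does not reflect the version discussed in this issue.

```r
library(haven)

# 8 and 9 are user-defined missing values; only 8 carries a label
x <- labelled_spss(
  c(1, 2, 8, 9),
  labels = c(yes = 1, no = 2, "not applicable" = 8),
  na_values = c(8, 9)
)

# is.na() treats the declared codes as missing, but the codes are still stored
print(is.na(x))
print(as.numeric(x))
```

Because the original codes survive, "not applicable" and plain missing answers can still be counted separately.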

write_sav() fails to export umlauts in labels

I'll send you example files via mail.

Compatibility of label attributes to foreign-package

Is there a specific reason why label attributes in haven have different names from those used by the foreign package? It would be easier for other packages that access value and variable labels if the attribute names were the same.

Error after importing data

I got the following error after importing the Social Diagnosis survey dataset (link to data: http://www.diagnoza.com/data/database/2000_2013/SOCIAL_DIAGNOSIS_H_2000_2013_SAV.zip)

> library(dplyr)
> library(haven)
> fname <- '~/Downloads/SOCIAL_DIAGNOSIS_H_2000_2013.SAV'
> dsin <- read_spss(path = fname)
> class(dsin)
[1] "tbl_df"     "tbl"        "data.frame"
> head(dsin)
Error: `x` and `labels` must be same type

I also got an error after applying the count function from dplyr, but this is probably because the labelled class is not supported in dplyr?

> d <- dsin %>% count(gdtyp_11)
Error: column 'NUMER_2000_2013' of type numeric has unsupported attributes: label

read_spss can't deal with range of missing values in SPSS file

When missing values span a larger range of values, you can declare this value range as "missing" values.

This differs from declaring specific values as missing.

When loading an SPSS file with a "missing range", read_spss throws an error:

> test <- read_spss("spss_missing_range.sav")
Error: Failed to parse C:\Users\Luedeke\Desktop\spss_missing_range.sav: Invalid file, or file has unsupported features.

I've uploaded two sample files.

SPSS-file that causes trouble (missing range):
https://www.dropbox.com/s/nkkk5mg45xknbeo/spss_missing_range.sav?dl=0

SPSS-file that works (missing values):
https://www.dropbox.com/s/dzdn38f81lk0rbx/spss_missing_value.sav?dl=0

read_spss should switch on file ext
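
One way such a switch could look is sketched below. pick_spss_reader is a hypothetical helper that only returns the name of the reader to use; it is not part of haven and not necessarily how the package implements this.

```r
# Hypothetical helper: choose a reader name from the file extension
pick_spss_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    sav = "read_sav",
    por = "read_por",
    stop("Unsupported SPSS file extension: ", ext)
  )
}

print(pick_spss_reader("survey.POR"))
```

Lower-casing the extension first means files named .SAV or .POR are handled the same way as their lower-case counterparts.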

Haven API updates

From @evanmiller

Quick API update you should be aware of. The "error_handler" now receives a second argument: the user_ctx variable that is passed to all the other callbacks.

I've also added another callback for progress indicators, may or may not be useful to you:

typedef int (*readstat_progress_handler)(double progress, void *ctx);

readstat_error_t readstat_set_progress_handler(
    readstat_parser_t *parser,
    readstat_progress_handler progress_handler);

This callback periodically receives a double between 0.0 and 1.0 indicating the % progress through reading a file. I implemented it mainly so I could get a progress indicator on POR files, since as we've discussed the row count is not available in advance. But the progress handler should work on all file types.

Error in sprintf("%02d", m)

The data read in fine, but when I went to look at it I got this error message:

head(p7)
Error in sprintf("%02d", m) : 
  invalid format '%02d'; use format %f, %e, %g or %a for numeric objects

On further investigation I found the variable that was causing the problem. If I remove this variable, then head works fine. In SAS, the format and informat are both TIME8.

attributes(p7[,42])
$label
[1] "Time of highest X within 3 yrs of 7/1/2005"

$class
[1] "hms"

summary(p7[,42])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1800   29280   31200   31020   33420   35940   32987
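
The message itself comes from base R: sprintf() rejects "%d" when it is given a double that is not a whole number. A minimal sketch of one way to format seconds since midnight safely, by coercing each component to integer first, is shown below. format_hms_safe is a hypothetical helper, not a function from haven or hms, and this is not necessarily how the bug was addressed in the package.

```r
# Hypothetical helper: format seconds since midnight as HH:MM:SS
format_hms_safe <- function(secs) {
  secs <- floor(secs)                       # drop fractional seconds
  h <- as.integer(secs %/% 3600)            # whole hours
  m <- as.integer((secs %% 3600) %/% 60)    # remaining whole minutes
  s <- as.integer(secs %% 60)               # remaining seconds
  sprintf("%02d:%02d:%02d", h, m, s)
}

print(format_hms_safe(c(1800, 31020.5, 35940)))
```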

Parse date/time formats

Place in drat repo?

As discussed over IM, may make a nice first test for setting up a drat repo.

Error after importing Stata data

With the latest build, the issue linked here persists. Basically, using df <- read_dta('LIAB_lm_9310_v1_pers.dta'), where the data file is included in this zip, if I try to head(df) or data.table(df), I get an error

Error: `x` and `labels` must be same type

and in the latter case, R crashes subsequently.

read_sas cannot load my .sas7bcat file

I'm getting the following error when I include my format catalog:

data<-read_sas("M:/.../alldata.sas7bdat", b7cat = "L:/.../formats.sas7bcat")

Error: Failed to parse L:\...\formats.sas7bcat: Invalid file, or file has unsupported features.

The data itself loads fine if I don't specify b7cat. If the dataset has the value labels already applied, should they be loaded with only the sas7bdat argument? They seem to not be loading. That is, the data all come out as regular numeric data.

writing dates to sav

I am unable to write dates using the haven package. I've tried formatting the dates as dates, POSIX, and character types. Any ideas?

library("haven")

id <- c(1, 2, 3)
date <- as.Date(c("2014-09-23", "2014-09-24", "2014-09-25"))
date.pos <- as.POSIXct(date)
date.char <- as.character(date)
write_sav(data.frame(id, date, date.pos, date.char), "test.sav")

Missing values problem read_dta()

Great package. However, I have some issues reading Stata files. Missing observations are not correctly identified; they all seem to be ignored.
My example data is here: https://www.dropbox.com/s/msbz5f7d5p84k2d/BFIR21FL.zip?dl=0

Code replicating my issue:

bfir21fl <- read_dta(path = "BFIR21FL.DTA")
str(bfir21fl$v121)
Class 'labelled' atomic [1:6354] 1 1 1 1 ...
..- attr(*, "label")= chr "has television"
..- attr(*, "labels")= Named int [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "no" "yes"
sum(is.na(bfir21fl$v121))
[1] 0
sum(is.na(as_factor(bfir21fl$v121)))
[1] 0

Using the read.dta from foreign results in:

bfir21fl_1 <- read.dta("BFIR21FL.DTA")
str(bfir21fl_1$v121)
Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 1 1 1 1 ...
length(bfir21fl_1$v121)
[1] 6354
sum(is.na(bfir21fl_1$v121))
[1] 61

61 is the correct result. My question: am I doing something wrong, or is the conversion not working correctly, or at least not as I expect?

Error when importing .por datafile (read.spss works)

I am getting this error when I try to open a SPSS por file:

Error in df_parse_por(clean_path(path)) : 
attempt to set index 0/0 in SET_STRING_ELT

Below is code that downloads the file. It works with read.spss.

library("haven")
library("foreign")
# download and unzip file to temporary folder
url <- "http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2006_sqf.zip"
p1 <- file.path(tempdir(), basename(url))
download.file(url, p1, quiet = TRUE)
filename <- unzip(p1, list = TRUE)$Name[1]
unzip(p1, files = filename, exdir = tempdir())
# open file
p2 <- file.path(tempdir(), filename)
DF <- read_por(p2)
# Error in df_parse_por(clean_path(path)) : 
#  attempt to set index 0/0 in SET_STRING_ELT
DF <- foreign::read.spss(p2, use.value.labels = FALSE, to.data.frame = TRUE)

sessionInfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] foreign_0.8-62   haven_0.1.1.9000 Defaults_1.1-1  

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4

Parse sas should have optional sas7bcat argument

SAV import fails to import umlauts in labels

When variable labels contain ä, ö or ü, these characters are not imported correctly.

Option for turning on/off labelled columns (particularly Stata files)

A simple request: could you add an option to turn off labelling for labeled Stata files?

I'd be fine with one that dropped labels altogether, returning only the numeric values underlying the Stata labels.

The main concern is that dplyr still doesn't support labelled data.frames, but I'm worried about compatibility with other packages, too.
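
For readers arriving later: haven provides zap_labels(), which strips value labels and returns the underlying values. The sketch below applies it to a hand-made labelled vector and assumes a reasonably recent haven release; it is a workaround after import rather than the import-time option requested here.

```r
library(haven)

# Hypothetical labelled column, as it might come out of a Stata import
x <- labelled(c(1, 2, 2), labels = c(low = 1, high = 2))

# Remove the value labels, keeping only the numeric codes
plain <- zap_labels(x)
print(is.labelled(plain))
print(as.numeric(plain))
```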

read_sas fails on compressed sas7bdat files

hey boss, thanks for all your hard work. here's a reproducible example

# read_sas() fails on compressed .sas7bdat files
    # and gives an unhelpful error message
# read.sas7bdat() also fails, but explains why
# read.sas7bdat.parso succeeds


# load a few packages to demonstrate successes and failures
library(haven)
library(sas7bdat)
library(devtools)
install_github( "biostatmatt/sas7bdat.parso" )
library(sas7bdat.parso)

# initiate some temporary files
tfp <- tempfile() ; tff <- tempfile() ; tfh <- tempfile()

# three of the latest files from the us census bureau's current population survey
# the current population survey is the major federal benchmark for employment, poverty, and health insurance in the united states

download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/pppub14_redes.sas7bdat" , tfp , mode = 'wb' )
download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/ffpub14_redes.sas7bdat" , tff , mode = 'wb' )
download.file( "http://www.census.gov/housing/extract_files/data%20extracts/cpsasec14/hhpub14_redes.sas7bdat" , tfh , mode = 'wb' )

# breaks
havenp <- read_sas( tfp )
# works
havenf <- read_sas( tff )
# breaks
havenh <- read_sas( tfh )

# breaks
sbdp <- read.sas7bdat( tfp )
# breaks
sbdf <- read.sas7bdat( tff )
# breaks
sbdh <- read.sas7bdat( tfh )

# works
parsop <- read.sas7bdat.parso( tfp )
# works
parsof <- read.sas7bdat.parso( tff )
# works
parsoh <- read.sas7bdat.parso( tfh )

# more reading about sas7bdat.parso here
# http://biostatmatt.com/archives/2618

write-functions

A really great feature would be a write function that writes back a data frame (with variable and value label attributes) to an SPSS/SAS/Stata file, where the value and variable fields are automatically set.

This would be a great benefit for all people who collaborate in teams where some use SPSS and others use R. Imagine you have a base data set (in SPSS format, because majority uses it in your department) and do some recodings and data cleaning with R, and you want to save the changes / added variables back to the SPSS data set.

I'm not sure whether this is possible at all, or if it's possible with low effort?

Missing variable type in sav files

When writing a sav file using

data <- data.frame(Var = c(1, 2, 3))
haven::write_sav(data, "numeric_haven.sav")

in the resulting file, the variable does not have any type (should be numeric). Maybe the reason is me using a German version of SPSS, that expects "," as decimal separator instead of "."?

Here's the file written with haven:
https://github.com/dgromer/misc/blob/master/numeric_haven.sav

and how it should look (created with SPSS):
https://github.com/dgromer/misc/blob/master/numeric_spss.sav

Problem with names() after read_sas()

I read a sas7bdat file with the read_sas() function; it is represented as a tbl_df in R. Then I call names() on it, which puts the R console in RStudio into a "not responding" state. RStudio itself keeps working, but I cannot close it without killing it. I don't have such problems with data.frames, so I guess it is a problem with haven.

I use R 3.1.2 on Windows with recent development version of RStudio and haven 0.2

Updating R to 3.1.3 resolved the issue

read_sav returns "Error: `x` and `labels` must be same type" when trying to view data frame

I think I've found a bug in read_sav which only presents when I try to View the resultant data frame

temp <- tempfile()
download.file("http://www.electionstudies.org/studypages/data/anes_panel_2013_inetrecontact/anes_panel_2013_inetrecontactsav.zip",temp)
d1 <- read_sav(unzip(temp, "anes_panel_2013_inetrecontact.sav"))

which all works fine, until I try

View(d1)

which returns a blank data frame view, and this error message

Error: `x` and `labels` must be same type

Error when installing snapshot

I got the following error when I tried to install the latest snapshot of haven:

> library(devtools)
> devtools::install_github("hadley/haven")
Downloading github repo hadley/haven@master
Installing haven
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
'/private/var/folders/k4/qfl_sg2s12d9z7p2c3qlrvqh0000gn/T/Rtmp6Em6Gr/devtoolsa1f23674344/hadley-haven-ac66b3d'  \
  --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests 

* installing *source* package ‘haven’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/BH/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c CKHashTable.c -o CKHashTable.o
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/BH/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c DfBuilder.cpp -o DfBuilder.o
DfBuilder.cpp:188:7: error: use of undeclared identifier 'warning'
      warning("Unsupported label type: %s", type);
      ^
1 error generated.
make: *** [DfBuilder.o] Error 1
ERROR: compilation failed for package ‘haven’
* removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/haven’
Fehler: Command failed (1)
>

Problem with dplyr filter after using haven to import Stata data

With the latest CRAN and GitHub versions, I cannot filter using dplyr after importing this .dta file with haven.

df <- read_dta("Authoritarian Shadow/Test.dta") %>%
  filter(year > 1996)

I get the error

Error: column 'country_name' of type character has unsupported attributes: label
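
Since the complaint is about the "label" attribute, one possible workaround is to strip that attribute from every column before handing the data to dplyr. The base R sketch below uses a made-up data frame in place of the imported file; it discards the variable labels, so it is only suitable when they are not needed afterwards.

```r
# Hypothetical data frame mimicking an import with a variable label
df <- data.frame(
  country_name = c("A", "B"),
  year = c(1995, 2000),
  stringsAsFactors = FALSE
)
attr(df$country_name, "label") <- "Country name"

# Drop the "label" attribute from every column, leaving values untouched
df[] <- lapply(df, function(col) {
  attr(col, "label") <- NULL
  col
})

print(attributes(df$country_name))
```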

read_dta() reads byte variables as character strings

read_dta() seems to convert Stata byte variables to character strings, instead of to integers. If I create a simple Stata file, with variables of type float, double, long, int and byte, here is the result of importing it using read_dta:

> d = read_dta("stata-datatypes.dta")
> sapply(d, class)
     vfloat     vdouble       vlong        vint       vbyte 
  "numeric"   "numeric"   "integer"   "integer" "character"

Version info:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252   
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252 LC_NUMERIC=C                             
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_0.1.1

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4 tools_3.1.1

Fails to load SAS dataset with Shift-JIS encoding

Hello, I have a SAS dataset that we got from an external source and it has Encoding = shift-jis Japanese (SJIS). Haven generates this error when trying to read it: Error: Failed to parse C:\Temp\lb.sas7bdat: Invalid file, or file has unsupported features.

The problem goes away if I use PC SAS to convert the file to a local encoding, e.g, Encoding = wlatin1 Western (Windows).

Is it possible for Haven to support Shift-JIS? On many occasions I will not have access to PC SAS to do the conversion.

Regards,
David

Add read_stata alias

unable to use group_by if using label attribute from sas file

testsas <- group_by(read_sas("http://crn.cancer.gov/resources/ctcodes-procedures.sas7bdat"), code_type)
#   Error: column 'px' of type character has unsupported attributes: label

By using as_factor to change all columns with label attribute I can fix things so that group_by works.

testsas <- read_sas("http://crn.cancer.gov/resources/ctcodes-procedures.sas7bdat")
testsas$px <- as_factor(testsas$px)
testsas$code_type <- as_factor(testsas$code_type)
testsas$description <- as_factor(testsas$description)
testsas$chemo_type <- as_factor(testsas$chemo_type)
testsas$comments <- as_factor(testsas$comments)
group_by(testsas, code_type)

as_factor broken

as_factor is broken for version 0.2.0.

> var <- labelled(c(0, 1), c(female = 0, male = 1))
> as_factor(var)
[1] male <NA>
Levels: female male

SessionInfo:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] de_AT.UTF-8/de_AT.UTF-8/de_AT.UTF-8/C/de_AT.UTF-8/de_AT.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_0.2.0

loaded via a namespace (and not attached):
[1] Rcpp_0.11.5 tools_3.1.2

Include variable and value labels

It would be nice if value and variable labels would be imported as well, e.g. by setting them as data frame attributes (like read.spss from foreign package does):

Example of read.spss:

'data.frame': 908 obs. of 26 variables:
$ c12hour : num 16 148 70 168 168 16 161 110 28 40 ...
$ e15relat: atomic 2 2 1 1 2 2 1 4 2 2 ...
..- attr(*, "value.labels")= Named chr "8" "7" "6" "5" ...
.. ..- attr(*, "names")= chr "other, specify" "cousin" "nephew/niece" "ancle/aunt" ...
$ e16sex : atomic 2 2 2 2 2 2 1 2 2 2 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "female" "male"
$ e17age : num 83 88 82 67 84 85 74 87 79 83 ...
$ e42dep : atomic 3 3 3 4 4 4 4 4 4 4 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "severely dependent" "moderately dependent" "slightly dependent" "independent"
$ c82cop1 : atomic 3 3 2 4 3 2 4 3 3 3 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c83cop2 : atomic 2 3 2 1 2 2 2 2 2 2 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c84cop3 : atomic 2 3 1 3 1 3 4 2 3 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c85cop4 : atomic 2 3 4 1 2 3 1 1 2 2 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c86cop5 : atomic 1 4 1 1 2 3 1 1 2 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c87cop6 : atomic 1 1 1 1 2 2 2 1 1 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c88cop7 : atomic 2 3 1 1 1 2 4 2 3 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "Always" "Often" "Sometimes" "Never"
$ c89cop8 : atomic 3 2 4 2 4 1 1 3 1 1 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c90cop9 : atomic 3 2 3 4 4 1 4 3 3 3 ...
..- attr(*, "value.labels")= Named chr "4" "3" "2" "1"
.. ..- attr(*, "names")= chr "always" "often" "sometimes" "never"
$ c160age : num 56 54 80 69 47 56 61 67 59 49 ...
$ c161sex : atomic 2 2 1 1 2 1 2 2 2 2 ...
..- attr(*, "value.labels")= Named chr "2" "1"
.. ..- attr(*, "names")= chr "Female" "Male"
$ c172code: atomic 2 2 1 2 2 2 2 2 NA 2 ...
..- attr(*, "value.labels")= Named chr "3" "2" "1"
.. ..- attr(*, "names")= chr "high level of education" "intermediate level of education" "low level of education"
$ c175empl: atomic 1 1 0 0 0 1 0 0 0 0 ...
..- attr(*, "value.labels")= Named chr "1" "0"
.. ..- attr(*, "names")= chr "yes" "no"
$ barthtot: num 75 75 35 0 25 60 5 35 15 0 ...
$ neg_c_7 : num 12 20 11 10 12 19 15 11 15 10 ...
$ pos_v_4 : num 12 11 13 15 15 9 13 14 13 13 ...
$ quol_5 : num 14 10 7 12 19 8 20 20 8 15 ...
$ resttotn: num 0 4 0 2 2 1 0 0 0 1 ...
$ tot_sc_e: num 4 0 1 0 1 3 0 1 2 1 ...
$ n4pstu : atomic 0 0 2 3 2 2 3 1 3 3 ...
..- attr(*, "value.labels")= Named chr "8" "7" "6" "5" ...
.. ..- attr(*, "names")= chr "other, specify" "cousin" "nephew/niece" "ancle/aunt" ...
$ nur_pst : atomic NA NA 2 3 2 2 3 1 3 3 ...
..- attr(*, "value.labels")= Named chr "89" "88" "87" "86" ...
.. ..- attr(*, "names")= chr "other" "co-religionist" "volunteer" "neighbour" ...

attr(*, "variable.labels")= Named chr "average number of hours of care for the elder in a week" "relationship to elder" "elder's gender" "elder' age" ...
..- attr(*, "names")= chr "c12hour" "e15relat" "e16sex" "e17age" ...
attr(*, "codepage")= int 65001

Same with haven:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 908 obs. of 26 variables:
$ c12hour : num 16 148 70 168 168 16 161 110 28 40 ...
$ e15relat: num 2 2 1 1 2 2 1 4 2 2 ...
$ e16sex : num 2 2 2 2 2 2 1 2 2 2 ...
$ e17age : num 83 88 82 67 84 85 74 87 79 83 ...
$ e42dep : num 3 3 3 4 4 4 4 4 4 4 ...
$ c82cop1 : num 3 3 2 4 3 2 4 3 3 3 ...
$ c83cop2 : num 2 3 2 1 2 2 2 2 2 2 ...
$ c84cop3 : num 2 3 1 3 1 3 4 2 3 1 ...
$ c85cop4 : num 2 3 4 1 2 3 1 1 2 2 ...
$ c86cop5 : num 1 4 1 1 2 3 1 1 2 1 ...
$ c87cop6 : num 1 1 1 1 2 2 2 1 1 1 ...
$ c88cop7 : num 2 3 1 1 1 2 4 2 3 1 ...
$ c89cop8 : num 3 2 4 2 4 1 1 3 1 1 ...
$ c90cop9 : num 3 2 3 4 4 1 4 3 3 3 ...
$ c160age : num 56 54 80 69 47 56 61 67 59 49 ...
$ c161sex : num 2 2 1 1 2 1 2 2 2 2 ...
$ c172code: num 2 2 1 2 2 2 2 2 NaN 2 ...
$ c175empl: num 1 1 0 0 0 1 0 0 0 0 ...
$ barthtot: num 75 75 35 0 25 60 5 35 15 0 ...
$ neg_c_7 : num 12 20 11 10 12 19 15 11 15 10 ...
$ pos_v_4 : num 12 11 13 15 15 9 13 14 13 13 ...
$ quol_5 : num 14 10 7 12 19 8 20 20 8 15 ...
$ resttotn: num 0 4 0 2 2 1 0 0 0 1 ...
$ tot_sc_e: num 4 0 1 0 1 3 0 1 2 1 ...
$ n4pstu : num 0 0 2 3 2 2 3 1 3 3 ...
$ nur_pst : num NaN NaN 2 3 2 2 3 1 3 3 ...

install from github not working in OS X

Hi,
I tried to install from github, this is what I got:

> devtools::install_github("hadley/haven")
Downloading github repo hadley/haven@master
Installing haven
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
  '/private/var/folders/r_/fm5p9qsx519fxh8cvr2vbtt80000gp/T/Rtmp3Bpad8/devtools9be559edde1/hadley-haven-432ad52'  \
  --library='/Users/dprice/Library/R/3.1/library' --install-tests 

* installing *source* package ‘haven’ ...
** libs
clang -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/dprice/Library/R/3.1/library/Rcpp/include" -I"/Users/dprice/Library/R/3.1/library/BH/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c CKHashTable.c -o CKHashTable.o
clang++ -arch x86_64 -ftemplate-depth-256 -I/Library/Frameworks/R.framework/Resources/include     -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/dprice/Library/R/3.1/library/Rcpp/include" -I"/Users/dprice/Library/R/3.1/library/BH/include"   -fPIC  -Wall -mtune=core2 -O3    -c DfReader.cpp -o DfReader.o
In file included from DfReader.cpp:1:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/Rcpp.h:27:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/RcppCommon.h:29:
In file included from /Users/dprice/Library/R/3.1/library/Rcpp/include/Rcpp/platform/compiler.h:171:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/map:423:
In file included from /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/__tree:15:
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/iterator:341:10: fatal error: '__debug' file not found
#include <__debug>
         ^
1 error generated.
make: *** [DfReader.o] Error 1
ERROR: compilation failed for package ‘haven’
* removing ‘/Users/dprice/Library/R/3.1/library/haven’
* restoring previous ‘/Users/dprice/Library/R/3.1/library/haven’
Error: Command failed (1)
> Sys.info()
                                                                                           sysname 
                                                                                          "Darwin" 
                                                                                           release 
                                                                                          "14.3.0" 
                                                                                           version 
"Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64" 
                                                                                          nodename 
                                                                           "UPEI-MacBookAir.local" 
                                                                                           machine 
                                                                                          "x86_64" 
                                                                                             login 
                                                                                          "dprice" 
                                                                                              user 
                                                                                          "dprice" 
                                                                                    effective_user 
                                                                                          "dprice" 
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bitops_1.0-6   devtools_1.7.0 evaluate_0.5.5 formatR_1.1    httr_0.6.1     knitr_1.9      RCurl_1.95-4.5 stringr_0.6.2  tools_3.1.3

Installing via install.packages worked though.

read_sas: A row in the file was not the expected length

I'm not seeing any obvious problems with the data - there are missing values scattered throughout. Is there something specific I should look at that might sort out what is going on?


d0 <- read_sas('sasdata/s114640.sas7bdat')
Error: Failed to parse sasdata/s114640.sas7bdat: A row in the file was not the expected length.

d1 <- sas.get('sasdata','s114640') 
Read 112 records
Read 40067 records

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C  
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C  
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C  
 [9] LC_ADDRESS=C               LC_TELEPHONE=C  
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grDevices datasets  splines   graphics  utils     stats     methods  
[8] base     

other attached packages:
[1] haven_0.0.0.9000 chron_2.3-45     gam_1.09.1       survival_2.37-7 
[5] rlocal_2.0.2    

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4

as_factor fails for variables where just one level is present

as_factor works as expected for a variable, where both levels are present:

var <- as.numeric(rep(1:2, 10))
var <- labelled(var, c(female = 1, male = 2))
as_factor(var)

For a variable where one level doesn't occur, it throws an error:

var <- as.numeric(rep(0, 10))
var <- labelled(var, c(female = 0, male = 1))
as_factor(var)

Fehler in factor(match(x, attr(x, "labels")), labels = names(attr(x, "labels"))) : 
  invalid 'labels'; length 2 should be 1 or 1
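
The error message points at a factor() call that passes labels without levels, so the number of labels no longer matches the number of distinct values once a level is absent. A base R sketch of the same call with explicit levels is below; it only illustrates the mechanism and is not necessarily the fix adopted in haven.

```r
x <- rep(0, 10)
labels <- c(female = 0, male = 1)

# Passing levels explicitly keeps unused labels as empty levels
f <- factor(
  match(x, labels),
  levels = seq_along(labels),
  labels = names(labels)
)

print(table(f))
```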

typo

functionaity (in readme), should be: functionality

Problem with "labelled" import - Stata 11

Hi,

Having an issue with importing labelled Stata variables. It appears to affect the labelled numeric(long) Stata type but not numeric(byte). Calling summary() fails with the error below, and as_factor gives a length error.

>havtest$cav_n_stage
<Labelled>
 [1] 1 2 1 1 1 1 2 4 1 4 4 1 4 1 1 1 1 1 2 1 1 4 1 1 1 1 1 2 4 1 2
[32] 2 3 1 2 1 2 2 2 4 2 4 1 1 2 4 1 1 1 4 2 2 1 1 1 1 1 1 2 2 3 1
[63] 1 2 4 1
Labels:
N0 N1 N2 NX Nx 
 1  2  3  4  5 

>summary(havtest$cav_n_stage)
Error: `x` and `labels` must be same type

>as_factor(havtest$cav_n_stage)
Error in factor(match(x, attr(x, "labels")), labels = names(attr(x, "labels"))) : invalid 'labels'; length 5 should be 1 or 4

With that error, I thought it may be that there are no "Nx" (=5) in this set. Indeed with a different numeric(long) var that has no redundant labels, I get the same error on using summary(), but as_factor() is successful. I can understand the as_factor() failing with redundant labels (although maybe it shouldn't...) but not sure why summary() fails?

By contrast, the numeric(byte) class seems fine:

>havtest$abort_n2
<Labelled>
 [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[16] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[31] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[46] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[61] "0" "0" "0" "0" "0" "0"
Labels:
 no yes 
  0   1 

>summary(havtest$abort_n2)
   Length     Class      Mode 
       66  labelled character 

> as_factor(havtest$abort_n2)
 [1] no no no no no no no no no no no no no no no no no no no no no
[22] no no no no no no no no no no no no no no no no no no no no no
[43] no no no no no no no no no no no no no no no no no no no no no
[64] no no no
Levels: no yes

Not sure if this is just me not using things correctly!

Thanks,
David.
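
One way to narrow down the summary() error is to compare the storage type of the data with that of the labels. The sketch below assumes the mismatch is double data with integer labels; the report does not show the actual types, so that is a guess. The helper align_label_type is hypothetical, not a haven function.

```r
# Mimic the reported variable: double data with integer labels.
# (Assumed cause; the original report does not show the types.)
x <- structure(c(1, 2, 1, 4),
               labels = c(N0 = 1L, N1 = 2L, N2 = 3L, NX = 4L, Nx = 5L))

typeof(unclass(x))         # storage type of the data
typeof(attr(x, "labels"))  # storage type of the labels

# Hypothetical helper: coerce the labels to the storage type of the data.
align_label_type <- function(x) {
  labels <- attr(x, "labels")
  storage.mode(labels) <- typeof(unclass(x))  # names are preserved
  attr(x, "labels") <- labels
  x
}

y <- align_label_type(x)
identical(typeof(unclass(y)), typeof(attr(y, "labels")))
```

If the two types differ on the real data, that would be consistent with the "`x` and `labels` must be same type" message.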

What types does as_factor() accept?

http://blog.rstudio.org/2015/03/04/haven-0-1-0/ says, "as_factor(): turns labelled integers into factors".

I tried as_factor() on labelled characters (which happened to be characters "1", "2", "3", etc.), and it seemed to convert those into factors just fine.

SAS datasets that are compressed generate an error using haven (and also sas7bdat)

I encountered an error when trying to read a SAS dataset that was compressed. I was wondering if there are plans to address this? I had to use PC SAS to uncompress the file before I could get it to work. Relevant log is below:

library(haven)
library(sas7bdat)

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

other attached packages:
[1] sas7bdat_0.5 dplyr_0.4.1 haven_0.0.0.9000

ds.unix.fmt <- read_sas("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat")
Error: Failed to parse C:\0_data\GOES\ORACLE\rd_frmae.sas7bdat: A row in the file was not the expected length.

ds_unix.fmt1 <- read.sas7bdat("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat")
Error in read.sas7bdat("C:/0_data/GOES/ORACLE/rd_frmae.sas7bdat") :
file contains compressed data

ds.pc.fmt <- read_sas("C:/0_data/GOES/DATASETS/rd_frmae.sas7bdat")

write_sav() should set "Missings" to "none" in SPSS data file

When writing data with missing values to SPSS, the "Missing" field in the SPSS data file is empty. It should instead be set to "None" (i.e., no specific numeric value of the variable is declared as missing).

See screenshots. Sample data set will be sent via mail...

labels versus format

Not a bug - just a comment that I found the labelled function a bit confusing because of the name. So “labelled” is similar to a SAS format, versus a SAS “label”, which I was happy to see does still come across. Both are useful concepts and it would really be nice to have both supported.

Also, I noticed with subsetting the data that the “label” attribute is lost but not the “labels” attribute. Any chance of adding in the “label” concept somehow? It would be nice to not lose that information and it would be nice to have a wrapper function that adds/extracts the information.

Thanks for the consideration.


d2$death <- labelled(d2$status, labels=c(Censored=0, Dead=1))
attributes(d2$death)
$label
[1] "Vital Status"
$labels
Censored     Dead 
       0        1 
$class
[1] "labelled"
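
The wrapper idea could look roughly like the following base R sketch. The names get_label, set_label and subset_keep_label are hypothetical and not part of haven. Note the exact = TRUE argument: without it, attr(x, "label") can partially match the "labels" attribute.

```r
# Hypothetical accessors for the variable-level "label" attribute.
get_label <- function(x) attr(x, "label", exact = TRUE)
set_label <- function(x, value) {
  attr(x, "label") <- value
  x
}

# Hypothetical subsetting helper that restores the label afterwards.
subset_keep_label <- function(x, i) {
  set_label(x[i], get_label(x))
}

status <- set_label(c(0, 1, 1, 0), "Vital Status")
get_label(status[1:2])                     # NULL: plain `[` drops the attribute
get_label(subset_keep_label(status, 1:2))  # "Vital Status"
```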

Add read_spss as alias for read_sav

Can't properly access data frame columns from imported files.

When I import data with haven, the returned object from read functions has following class-attributes:

> class(x)
[1] "tbl_df"     "tbl"        "data.frame"

When accessing a table column, a data frame is still returned:

str(x[, 6])
Classes ‘tbl_df’ and 'data.frame':  1567 obs. of  1 variable:
 $ sex:Class 'labelled'  atomic [1:1567] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "label")= chr "Geschlecht der antwortenden Person"
  .. ..- attr(*, "labels")= Named int [1:2] 0 1
  .. .. ..- attr(*, "names")= chr [1:2] "männlich" "weiblich"

I would expect to get either an atomic vector with attributes or a vector of class labelled...
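
For what it's worth, `[[` and `$` extract the column itself regardless of how `[` behaves for a given data frame class. A small sketch with a plain data.frame (the data here is made up for illustration); drop = FALSE is used to reproduce the "still a data frame" result that the report shows for tbl_df:

```r
x <- data.frame(id = 1:3, sex = c(1, 1, 0))

col_vec <- x[["sex"]]                # the column itself, as a vector
col_df  <- x[, "sex", drop = FALSE]  # a one-column data frame

is.atomic(col_vec)
is.data.frame(col_df)
```

So x[[6]] or x$sex should give the labelled vector directly, with its attributes intact.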

Writing string variable to sav results in data1205 error

When writing a sav file using

data <- data.frame(Var = c("One", "Two", "Three"), stringsAsFactors = FALSE)
haven::write_sav(data, "string_haven.sav")

I get the following error when opening with SPSS (DATA1205): http://www-01.ibm.com/support/docview.wss?uid=swg21512643

File written with haven:
https://github.com/dgromer/misc/blob/master/string_haven.sav

File written with SPSS:
https://github.com/dgromer/misc/blob/master/string_spss.sav

parso Java library (not an issue)

In case it's helpful...

Matt Shotwell has a package on github called sas7bdat.parso. It's a wrapper for the GGA Software Parso Java library (ask Google).

I've used Parso to read many dozens of .sas7bdat files without problems (well, the only problem I've found is unescaped quotes in their csv writer - but that's the csv writer, not the sas7bdat reader). I know someone who is working on having Parso turn a bunch of SAS datasets directly into a SQLite db to avoid the whole csv nonsense.

For info, Parso was written because they couldn't get Sassy Reader to do what they wanted. Sassy Reader was based on Matt Shotwell's package.

Again, I hope it helps.
Harry

Convert empty strings to missing values
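
Until something like this is built in, the conversion can be done after import. A minimal sketch in base R; the helper name empty_to_na is hypothetical and not part of haven:

```r
# Hypothetical helper: replace "" with NA in every character column.
empty_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[!is.na(col) & col == ""] <- NA_character_
    col
  })
  df
}

d <- data.frame(name = c("a", "", "c"), n = 1:3, stringsAsFactors = FALSE)
empty_to_na(d)$name
```

Non-character columns are left untouched.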

RStudio Viewer does not print imported data frames

When loading an SPSS file with read_spss or read_sav, I can't view the data frame in the RStudio Viewer.

> View(x)
Error: `x` and `labels` must be same type

Using RStudio Version 0.99.283 on Win 7.

write_dta leads to R crashing

I created a data frame from simulated data and tried to save it as a .dta file; this caused R to crash with a segfault. I'm using R version 3.1.2 on a Mac with OS X 10.10.2.

Example code:

library(haven)
library(plyr)

set.seed(12345)
N=300
var1 = rnorm(N,0,1.5)
var2 = rnorm(N,2,5)
var3 = runif(N,-10,10)

dataForStata = data.frame(var1,var2,var3)
names(dataForStata) = c("var1","var2","var3")
write_dta(dataForStata,path="~/Desktop/")

Running the code produces the following error:

> write_dta(dataForStata,path="~/Desktop/")

 *** caught segfault ***
address 0x68, cause 'memory not mapped'

Traceback:
 1: .Call("haven_write_dta", PACKAGE = "haven", data, path)
 2: write_dta(dataForStata, path = "~/Desktop/")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

R session info:

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin14.0.0 (64-bit)

locale:
[1] en_US/en_US/en_US/C/en_US/en_US

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.1       haven_0.1.1.9000

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4
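
Note that in the example above, path is "~/Desktop/", which is a directory rather than a file name. Whether that is what triggers the segfault is an assumption, but it is easy to rule out with a check before writing. The helper check_output_path below is hypothetical, not part of haven:

```r
# Hypothetical guard: stop early if `path` is an existing directory.
check_output_path <- function(path) {
  path <- path.expand(path)
  if (dir.exists(path)) {
    stop("`path` is a directory, not a file: ", path, call. = FALSE)
  }
  invisible(path)
}

out_dir <- tempdir()
ok_path <- file.path(out_dir, "dataForStata.dta")

check_output_path(ok_path)       # passes silently
try(check_output_path(out_dir))  # reports that the path is a directory
```

The checked path (for example "~/Desktop/dataForStata.dta") could then be passed on to write_dta().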

SPSS complains that columns from file made using write_sav are constant

...even though they apparently aren't constant in the data view. Sorry, I don't have a reproducible example, but that's what the colleague I prepared the file for told me.

read_spss() doesn’t handle non-ASCII characters in variable names properly

I have an example SPSS file with a few non-ASCII characters. When I import this using read_spss(), the non-ASCII characters in variable values are handled properly, but the same characters in variable names are not converted. They look like UTF-8 byte sequences that have been interpreted as ISO-8859-1. If you supply an e-mail address, I can mail you the example file (the GitHub issue tracker doesn’t seem to support attachments).

Example R session:

> library(haven)
> d=read_spss("spsstest.sav")
> d # Wrong characters in the column header
  abc abcÃ¦Ã¸Ã¥ testÂµ
1 foo         1     10
2 bår         2     20
3 æøå         3     30

It seems easy enough to fix:

> Encoding(names(d))
[1] "unknown" "unknown" "unknown"
> Encoding(names(d))="UTF-8"
> d  # Correct characters in the column header
  abc abcæøå testµ
1 foo      1    10
2 bår      2    20
3 æøå      3    30

The above example was for an SPSS file saved as ‘Unicode’. If I instead save it in the ‘native’ encoding (which seems to be Windows-1252), I get this error message:

> d=read_spss("spsstest2.sav")
Failed to find ABCÃ†Ã˜Ã…

Failed to find TESTÂµ

The resulting data.frame looks like this:

> d
  abc ABCÆØÅ TESTµ
1 foo      1    10
2 bår      2    20
3 æøå      3    30

Note that all the variable names are lowercase in the original SPSS file (i.e., abc, abcæøå and testµ), but two of them have been converted to uppercase in the data.frame.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 
[2] LC_CTYPE=Norwegian-Nynorsk_Norway.1252   
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C                             
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_0.1.1

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4 tools_3.1.1
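
The fix for the ‘Unicode’ file shown earlier can be wrapped in a small function as a stopgap. This is only a sketch: it assumes the names really are UTF-8 bytes with no encoding mark, as in the first example, and it does not address the ‘native’ encoding case. The helper name mark_names_utf8 is hypothetical.

```r
# Hypothetical helper: mark column names as UTF-8 without changing the bytes.
mark_names_utf8 <- function(d) {
  Encoding(names(d)) <- "UTF-8"
  d
}

# "abcæøå" as raw UTF-8 bytes with no encoding mark, as in the report.
nm <- rawToChar(as.raw(c(0x61, 0x62, 0x63,
                         0xc3, 0xa6, 0xc3, 0xb8, 0xc3, 0xa5)))
d <- data.frame(1:3)
names(d) <- nm

Encoding(names(d))                   # "unknown"
Encoding(names(mark_names_utf8(d)))  # "UTF-8"
```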