wizardmac / readstat Goto Github PK

Command-line tool (+ C library) for converting SAS, Stata, and SPSS files 💾

License: MIT License

C 94.26% Makefile 1.07% Shell 0.01% M4 0.30% Ragel 4.36%

readstat's Introduction

ReadStat: Read (and write) data sets from SAS, Stata, and SPSS

Originally developed for Wizard, ReadStat is a command-line tool and MIT-licensed C library for reading files from popular stats packages. Supported data formats include:

SAS: SAS7BDAT (binary file) and XPORT (transport file)
Stata: DTA (binary file) versions 104-119
SPSS: POR (portable file), SAV (binary file), and ZSAV (compressed binary)

Supported metadata formats include:

SAS: SAS7BCAT (catalog file) and .sas (command file)
Stata: .dct (dictionary file)
SPSS: .sps (command file)

There is also write support for all the data formats, but not the metadata formats. The produced SAS7BDAT files still cannot be read by SAS, but feel free to contribute your binary-format expertise here.

For reading in R data files, please see the related librdata project.

Installation on Unix / macOS

Grab the latest release and then proceed as usual:

./configure
make
sudo make install

If you're cloning the repository, first make sure you have autotools installed, and then run ./autogen.sh to generate the configure file.

If you're on Mac and see errors about AM_ICONV when you run ./autogen.sh, you'll need to install gettext.

Installation on Windows

ReadStat now includes a Microsoft Visual Studio project file that includes build targets for the library and tests. See the VS17 folder in the downloaded release for a "one-click" Windows build.

Alternatively, you can build ReadStat on the command line using an msys2 environment. After installing msys2, download some other packages:

pacman -S autoconf automake libtool make mingw-w64-x86_64-toolchain mingw-w64-x86_64-cmake mingw-w64-x86_64-libiconv

Then start a MINGW command line (not the msys2 prompt!) and follow the UNIX install instructions above for this package.

Language Bindings

Julia: ReadStat.jl
Perl 6: ReadStat.pm6
Python: pyreadstat
R: haven

Docker

A dockerized version is available here

Command-line Usage

Standard usage:

readstat [-f] <input file> <output file>

Where:

<input file> ends with .dta, .por, .sav, .sas7bdat, or .xptand
<output file> ends with .dta, .por, .sav, .sas7bdat, .xpt or .csv

If libxlsxwriter is found at compile-time, an XLSX file (ending in .xlsx) can be written instead.

If zlib is found at compile-time, compressed SPSS files (.zsav) can be read and written as well.

Use the -f option to overwrite an existing output file.

If you have a plain-text file described by a Stata dictionary file, a SAS command file, or an SPSS command file, a second invocation style is supported:

readstat <input file> <dictionary file> <output file>

Where:

<input file> can be anything
<dictionary file> ends with .dct, .sas, or .sps
<output file> ends with .dta, .por, .sav, .xpt, or .csv

If you have a SAS catalog file containing the data set's value labels, you can use the same invocation:

readstat <input file> <catalog file> <output file>

Except where:

<input file> ends with .sas7bdat
<catalog file> ends with .sas7bcat
<output file> ends with .dta, .por, .sav, .xpt, or .csv

If the file conversion succeeds, ReadStat will report the number of rows and variables converted, e.g.

Converted 111 variables and 160851 rows in 12.36 seconds

At the moment value labels are supported, but the finer nuances of converting format strings (e.g. %8.2g) are not.

Command-line Usage with CSV input

A prerequisite for CSV input is that the libcsv library is found at compile time.

CSV input is supported together with a metadata file describing the data:

readstat <input file.csv> <input metadata.json> <output file>

The <output file> should end with .dta, .sav, or .csv.

The <input file.csv> is a regular CSV file.

The <input metadata.json> is a JSON file describing column types, value labels and missing values. The easiest way to create such a metadata file is to use the provided extract_metadata program on an existing file:

$ extract_metadata <input file.(dta|sav|sas7bcat)>

The schema of this JSON file is fully described in variablemetadata_schema.json using JSON Schema.

The following is an example of a valid metadata file:

{
    "type": "SPSS",
    "variables": [
        {
            "type": "NUMERIC",
            "name": "citizenship",
            "label": "Citizenship of respondent",
            "categories": [
                {
                    "code": 1,
                    "label": "Afghanistan"
                },
                {
                    "code": 2,
                    "label": "Albania"
                },
                {
                    "code": 98,
                    "label": "No answer"
                },
                {
                    "code": 99,
                    "label": "Not applicable"
                }
            ],
            "missing": {
                "type": "DISCRETE",
                "values": [
                    98,
                    99
                ]
            }
        }
    ]
}

Here the column citizenship is a numeric column with four possible values 1, 2, 98, and 99. 1 has the label Afghanistan, 2 has Albania, 98 has No answer and 99 has Not applicable. 98 and 99 are defined as missing values.

Other column types are STRING and DATE. All values in DATE columns are expected to conform to ISO 8601 date. Here is an example of DATE metadata:

{
    "type": "SPSS",
    "variables": [
        {
            "type": "DATE",
            "name": "startdate",
            "label": "Start date",
            "categories": [
                {
                    "code": "6666-01-01",
                    "label": "no date available"
                }
            ],
            "missing": {
                "type": "DISCRETE",
                "values": [
                    "6666-01-01",
                    "9999-01-01"
                ]
            }
        }
    ]
}

Value labels are supported for DATE.

The last column type is STRING:

{
    "type": "SPSS",
    "variables": [
        {
            "type": "STRING",
            "name": "somestring",
            "label": "Label of column",
            "missing": {
                "type": "DISCRETE",
                "values": [
                    "NA",
                    "N/A"
                ]
            }
        }
    ]
}

Value labels are not supported for STRING.

Library Usage: Reading Files

The ReadStat API is callback-based. It uses very little memory, and is suitable for programs with progress bars. ReadStat uses iconv to automatically transcode text data into UTF-8, so you don't have to worry about character encodings.

See src/readstat.h for the complete API. In general you'll provide a filename and a set of optional callback functions for handling various information and data found in the file. It's up to the user to store this information in an appropriate data structure. If a context pointer is passed to the parse_* functions, it will be made available to the various callback functions.

Callback functions should return READSTAT_HANDLER_OK (zero) on success. Returning READSTAT_HANDLER_ABORT will abort the parsing process.

Example: Return the number of records in a DTA file.

#include "readstat.h"

int handle_metadata(readstat_metadata_t *metadata, void *ctx) {
    int *my_count = (int *)ctx;

    *my_count = readstat_get_row_count(metadata);

    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    int my_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);

    error = readstat_parse_dta(parser, argv[1], &my_count);

    readstat_parser_free(parser);

    if (error != READSTAT_OK) {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    printf("Found %d records\n", my_count);
    return 0;
}

Example: Convert a DTA to a tab-separated file.

#include "readstat.h"

int handle_metadata(readstat_metadata_t *metadata, void *ctx) {
    int *my_var_count = (int *)ctx;
    
    *my_var_count = readstat_get_var_count(metadata);

    return READSTAT_HANDLER_OK;
}

int handle_variable(int index, readstat_variable_t *variable, 
    const char *val_labels, void *ctx) {
    int *my_var_count = (int *)ctx;

    printf("%s", readstat_variable_get_name(variable));
    if (index == *my_var_count - 1) {
        printf("\n");
    } else {
        printf("\t");
    }

    return READSTAT_HANDLER_OK;
}

int handle_value(int obs_index, readstat_variable_t *variable, readstat_value_t value, void *ctx) {
    int *my_var_count = (int *)ctx;
    int var_index = readstat_variable_get_index(variable);
    readstat_type_t type = readstat_value_type(value);
    if (!readstat_value_is_system_missing(value)) {
        if (type == READSTAT_TYPE_STRING) {
            printf("%s", readstat_string_value(value));
        } else if (type == READSTAT_TYPE_INT8) {
            printf("%hhd", readstat_int8_value(value));
        } else if (type == READSTAT_TYPE_INT16) {
            printf("%hd", readstat_int16_value(value));
        } else if (type == READSTAT_TYPE_INT32) {
            printf("%d", readstat_int32_value(value));
        } else if (type == READSTAT_TYPE_FLOAT) {
            printf("%f", readstat_float_value(value));
        } else if (type == READSTAT_TYPE_DOUBLE) {
            printf("%lf", readstat_double_value(value));
        }
    }
    if (var_index == *my_var_count - 1) {
        printf("\n");
    } else {
        printf("\t");
    }

    return READSTAT_HANDLER_OK;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    int my_var_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_metadata_handler(parser, &handle_metadata);
    readstat_set_variable_handler(parser, &handle_variable);
    readstat_set_value_handler(parser, &handle_value);

    error = readstat_parse_dta(parser, argv[1], &my_var_count);

    readstat_parser_free(parser);

    if (error != READSTAT_OK) {
        printf("Error processing %s: %d\n", argv[1], error);
        return 1;
    }
    return 0;
}

Library Usage: Writing Files

ReadStat can write data sets to a number of file formats, and uses largely the same API for each of them. Files are written incrementally, with the header written first, followed by individual rows of data, and ending with some kind of trailer. (So the full data file never resides in memory.) Unlike like the callback-based API for reading files, the writer API consists of function that the developer must call in a particular order. The complete API can be found in readstat.h.

Basic usage:

#include "readstat.h"

/* A callback for writing bytes to your file descriptor of choice */
/* The ctx argument comes from the readstat_begin_writing_xxx function */
static ssize_t write_bytes(const void *data, size_t len, void *ctx) {
    int fd = *(int *)ctx;
    return write(fd, data, len);
}

int main(int argc, char *argv[]) {
    readstat_writer_t *writer = readstat_writer_init();
    readstat_set_data_writer(writer, &write_bytes);
    readstat_writer_set_file_label(writer, "My data set");

    int row_count = 1;

    readstat_variable_t *variable = readstat_add_variable(writer, "Var1", READSTAT_TYPE_DOUBLE, 0);
    readstat_variable_set_label(variable, "First variable");

    /* Call one of:
     *   readstat_begin_writing_dta
     *   readstat_begin_writing_por
     *   readstat_begin_writing_sas7bdat
     *   readstat_begin_writing_sav
     *   readstat_begin_writing_xport
     */

    int fd = open("something.dta", O_CREAT | O_WRONLY);
    readstat_begin_writing_dta(writer, &fd, row_count);

    int i;
    for (i=0; i<row_count; i++) {
        readstat_begin_row(writer);
        readstat_insert_double_value(writer, variable, 1.0 * i);
        readstat_end_row(writer);
    }

    readstat_end_writing(writer);
    readstat_writer_free(writer);
    close(fd);

    return 0;
}

Fuzz Testing

To assist in fuzz testing, ReadStat ships with target files designed to work with libFuzzer. Clang 6 or later is required.

./configure --enable-fuzz-testing turns on useful sanitizer and sanitizer-coverage flags
make will create a new binary called generate_corpus. Running this program will use the ReadStat test suite to create a corpus of test files in corpus/. There is a subdirectory for each sub-format (dta104, dta105, etc.). Currently a total of 468 files are created.
If fuzz-testing has been enabled, make will also create fourteen fuzzer targets, one for each of seven file formats, six for internally used grammars, and two fuzzers for testing the compression routines.
- fuzz_format_dta
- fuzz_format_por
- fuzz_format_sas7bcat
- fuzz_format_sas7bdat
- fuzz_format_sav
- fuzz_format_xport
- fuzz_format_stata_dictionary
- fuzz_grammar_dta_timestamp
- fuzz_grammar_por_double
- fuzz_grammar_sav_date
- fuzz_grammar_sav_time
- fuzz_grammar_spss_format
- fuzz_grammar_xport_format
- fuzz_compression_sas_rle
- fuzz_compression_sav

For best results, each sub-directory of the corpus should be passed to the relevant fuzzer, e.g.:

./fuzz_format_dta corpus/dta104
./fuzz_format_dta corpus/dta110
...
./fuzz_format_sav corpus/sav
./fuzz_format_sav corpus/zsav
./fuzz_format_xport corpus/xpt5
./fuzz_format_xport corpus/xpt8

Finally, the compression fuzzers can be invoked without a corpus:

./fuzz_compression_sas_rle
./fuzz_compression_sav

readstat's People

Contributors

Stargazers

Watchers

readstat's Issues

How should I tweak the Makefile on Windows?

Hello,

I am using Cygwin on Windows 7 to install and compile all the necessary files (installing ragel, for example) to compile ReadStat. However, I am not exactly sure how I should "tweak" the Makefile. Any advice will be very helpful!

Thanks

Mingw compile warning

Mingw says:

readstat_sav_write.c: In function 'sav_emit_header':
readstat_sav_write.c:43:14: warning: unknown conversion type character 'T' in format [-Wformat]

Version number

It might be useful to have a version number so that it's easy to check when a binding is out of date

Missing values

How does readstat process them? By analogy to database apis, I'd expect a type like "READSTAT_TYPE_MISSING"

Value labels from SAS catalog files are mixed up

Sometimes value labels from SAS catalog files are off by 1, or completely out of order. For previous discussion see tidyverse/haven#116 and #38

Value wrappers

It would be convenient to have something like readstat_value_char, readstat_value_string etc... rather than having to do the casting yourself (that seems a little risky to me)

Windows binaries

This is not an issue. I was just wondering if there is any Windows binaries that I an download.

Thanks

extern "C"

It would be useful to include #ifdef cpp extern "C" etc to make it easier to bind from C++

Hangs for large sas7bdat

Parses smallish files (on the order of 10mb) fine. Simply hangs for larger (eg. 22gb) files. Any idea what might be going on internally? Data compresses nicely and I have a sample to share if you'd like.

Support Stata 14 files

Sample file:

https://www.dropbox.com/s/t89lbnlhlm3nslc/ct_race_chd.dta?dl=0

Fail to parse compressed sas7bdat file containing uncompressed data

Certain valid sas7bdat files produce the error "Invalid file, or file has unsupported feature".

The test file I am working with (combined_all_sub_00.sas7bdat) is compressed, and appears to have a mixed-use page (metadata + data) even though the signature indicates the page is metadata only. Thus ReadStat looks for metadata where data is found and produces an error.

This issue was surfaced by @vinhdizzo in the discussion about #35.

Parser functions should return readstat_error_t

Currently they return int but readstat_error_message expects a readstat_error_t

Writing to Stata Formatted Files

Not sure if this issue was in the R wrapper (haven) or in the C code itself, but I wanted to post here just to be safe.

library(magrittr)
colnm <- names(d3urls)
d3x <- xml2::read_html(d3urls[[1]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
d3x <- d3x[grepl("^#.*", d3x)]
d3x <- gsub("# ", "", d3x)
r <- c(1:length(d3x))
d3x <- as.data.frame(cbind(r, d3x), stringsAsFactors = FALSE)
names(d3x) <- c("id", colnm[1])

for (i in c(2:40)) {
x <- xml2::read_html(d3urls[[i]]) %>% rvest::html_nodes("p") %>% rvest::html_text()
x <- x[grepl("^#.*", x)]
x <- gsub("# ", "", x)
r <- c(1:length(x))
x <- as.data.frame(cbind(r, x), stringsAsFactors = FALSE)
names(x) <- c("id", colnm[i])
d3x <- dplyr::full_join(d3x, x, by = "id")
}

rm(x, r)

haven::write_dta(d3x, "~/Desktop/d3Methods.dta")
Then I load the file in Stata 14.1MP8 using:

use ~/Desktop/d3Methods.dta, clear
The problem occurs when using the Stata command 'compress', which is used to optimize storage on disk of the file (e.g., downcasts types to the smallest type possible without loosing precision so things like 1.00000000000000000000000 would be cast as a 1-byte integer value rather than a float/double). In this case, I think there is a problem with the writing functions and how they insert binary zeros around the strings in the data frame (Stata uses binary zeros for padding a column so each record for a string column reserves the same number of bits for storage).

If I write the same data out to a csv:

write.csv(d3x, "~/Desktop/d3Methods.csv", row.names = FALSE)
Then load the same data in Stata:

. import delimited using ~/Desktop/d3Methods.csv, delim(",") varn(1) clear
(41 vars, 102 obs)

. compress
(0 bytes saved)

The issue goes away. I couldn't capture the other error since it crashed Stata each time. I can post the .dta files in version 13 and 14 if you'd like to compare it to the output from Haven.

Compilation failure on linux

https://travis-ci.org/hadley/haven#L680

readstat_sav.c: In function ‘sav_read_value_label_record’:
readstat_sav.c:369:9: warning: implicit declaration of function ‘bsearch_b’ [-Wimplicit-function-declaration]
readstat_sav.c:369:106: error: expected expression before ‘^’ token
make: *** [readstat_sav.o] Error 1

More pedantic warnings when building on windows (mingw)

$ make
[ -x /usr/local/bin/ragel ] && /usr/local/bin/ragel src/readstat_por_parse.rl -G2
[ -x /usr/local/bin/ragel ] && /usr/local/bin/ragel src/readstat_sav_parse.rl -G2
[ -x /usr/local/bin/ragel ] && /usr/local/bin/ragel src/readstat_spss_parse.rl -G2
cc -Os src/*.c -o obj/libreadstat.dylib -llzma -lz -liconv -Wall -Wno-multichar -Werror -pedantic -DHAVE_LZMA -std=c99
src/readstat_sav.c: In function ‘sav_ctx_init’:
src/readstat_sav.c:34:40: error: overflow in implicit constant conversion [-Werror=overflow]
         ctx->machine_needs_byte_swap = 1;
                                        ^
cc1: all warnings being treated as errors
src/readstat_sav_read.c: In function ‘sav_read_compressed_data’:
src/readstat_sav_read.c:742:47: error: overflow in implicit constant conversion [-Werror=overflow]
                     value.is_system_missing = 1;
                                               ^
cc1: all warnings being treated as errors
src/readstat_spss.c: In function ‘spss_tag_missing_double’:
src/readstat_spss.c:76:44: error: overflow in implicit constant conversion [-Werror=overflow]
             value->is_considered_missing = 1;
                                            ^
src/readstat_spss.c:83:36: error: overflow in implicit constant conversion [-Werror=overflow]
         value->is_system_missing = 1;
                                    ^
src/readstat_spss.c:85:36: error: overflow in implicit constant conversion [-Werror=overflow]
         value->is_system_missing = 1;
                                    ^
src/readstat_spss.c:87:36: error: overflow in implicit constant conversion [-Werror=overflow]
         value->is_system_missing = 1;
                                    ^
src/readstat_spss.c: In function ‘spss_boxed_value’:
src/readstat_spss.c:109:35: error: overflow in implicit constant conversion [-Werror=overflow]
         value.is_system_missing = 1;
                                   ^
cc1: all warnings being treated as errors
src/readstat_variable.c: In function ‘make_blank_value’:
src/readstat_variable.c:35:5: error: overflow in implicit constant conversion [-Werror=overflow]
     readstat_value_t value = { .is_system_missing = 1, .v = { .double_value = NAN }, .type = READSTAT_TYPE_DOUBLE };
     ^
cc1: all warnings being treated as errors
Makefile:8: recipe for target 'all' failed
make: *** [all] Error 1

Use of stderr

R CMD check complains:

* checking compiled code ... NOTE
File ‘haven/libs/haven.so’:
  Found ‘___stderrp’, possibly from ‘stderr’ (C)
    Objects: ‘readstat_por.o’, ‘readstat_por_parse.o’,
      ‘readstat_sas.o’, ‘readstat_sav.o’, ‘readstat_sav_parse.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the C RNG.

this isn't a huge problem, but it would be nice if there was some alternative way to capture error messages, rather than sending them to stderr.

SAV parsing performance is bad

Sample benchmark: https://gist.github.com/dickoa/42c04feeac21b46eb2e1

The bottleneck (I'm 99% sure) is that ReadStat reads in the data records 8 bytes at a time, which results in a lot of overhead from syscalls. This should be relatively easy to fix.

SAS transport files?

The foreign package does them, but was wondering if they might be added at some point to this and then to haven.
Thanks for writing such a useful library.

Fail to parse >2GB SAS files on AIX (big endian)

Hi,

We have sas on AIX at work, and I'd like to used Haven (R) to read sas data sets. It currently errors out, and I suspect it's because of AIX's big endian architecture. Are there plans to include support for this? Thanks.

Compilation Issue

Hey Evan,
I am very new to this but I was trying to compile the source using cygwin and I get errors like
src/readstat_bits.c:1:0: error: The C parser does not support -dy, option ignored
cc1.exe: warning: unrecognized gcc debugging option: n
cc1.exe: warning: unrecognized gcc debugging option: m
cc1.exe: warning: unrecognized gcc debugging option: i
cc1.exe: warning: unrecognized gcc debugging option: c
cc1.exe: warning: unrecognized gcc debugging option: l
cc1.exe: warning: unrecognized gcc debugging option: i
cc1.exe: warning: unrecognized gcc debugging option: b

etc.

I have attached an image with some more errors. I needed this for your ReadData module for Julia. Is it a gcc version issue?

Any help would be much appreciated.

thanks

I want to read .sas and .xpt, which is the package I can use?

Building on mingw-w64 (ANSI C portability)

I'm trying to build this library with mingw-w64 (the windows build of gcc) but there are some non portable pieces. First attempt:

gcc -m64 -I"C:/PROGRA~1/R/R-31~1.2/include" -DNDEBUG    -I"C:/Users/Jeroen/Documents/R/win-library/3.1/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include"     -O2 -Wall  -std=gnu99 -mtune=core2 -c readstat_convert.c -o readstat_convert.o
readstat_bits.c: In function 'byteswap_float':
readstat_bits.c:59:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
readstat_bits.c:60:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
readstat_bits.c: In function 'byteswap_double':
readstat_bits.c:64:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
readstat_bits.c:65:5: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
gcc -m64 -I"C:/PROGRA~1/R/R-31~1.2/include" -DNDEBUG    -I"C:/Users/Jeroen/Documents/R/win-library/3.1/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include"     -O2 -Wall  -std=gnu99 -mtune=core2 -c readstat_dta.c -o readstat_dta.o
readstat_convert.c: In function 'readstat_convert':
readstat_convert.c:23:9: warning: passing argument 2 of 'libiconv' from incompatible pointer type [enabled by default]
C:/PROGRA~1/R/R-31~1.2/include/iconv.h:53:8: note: expected 'const char **' but argument is of type 'char **'
readstat_dta.c:6:21: fatal error: sys/uio.h: No such file or directory
compilation terminated.

I tried using:

#ifndef WIN32
#include <sys/uio.h>
#endif

But that makes things worse:

gcc -m64 -I"C:/PROGRA~1/R/R-31~1.2/include" -DNDEBUG    -I"C:/Users/Jeroen/Documents/R/win-library/3.1/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include"     -O2 -Wall  -std=gnu99 -mtune=core2 -c readstat_por_parse.c -o readstat_por_parse.o
src/readstat_por_parse.rl: In function 'readstat_por_parse_double':
src/readstat_por_parse.rl:21:5: error: unknown type name 'u_char'
src/readstat_por_parse.rl:23:5: error: unknown type name 'u_char'
src/readstat_por_parse.rl:23:24: error: 'u_char' undeclared (first use in this function)
src/readstat_por_parse.rl:23:24: note: each undeclared identifier is reported only once for each function it appears in
src/readstat_por_parse.rl:23:32: error: expected expression before ')' token
src/readstat_por_parse.rl:24:18: error: expected '=', ',', ';', 'asm' or '__attribute__' before '*' token
src/readstat_por_parse.rl:24:34: error: expected expression before ')' token
src/readstat_por_parse.c:353:9: warning: comparison of distinct pointer types lacks a cast [enabled by default]
src/readstat_por_parse.rl:69:9: warning: implicit declaration of function 'dprintf' [-Wimplicit-function-declaration]
src/readstat_por_parse.rl:69:84: error: expected expression before ')' token
src/readstat_por_parse.rl:76:32: error: expected expression before ')' token
make: *** [readstat_por_parse.o] Error 1

I think the problem is that u_char is not part of the ANSI C standard. The typical alternative is to use <stdint.h> for the uint8_t type, or unsigned char.

I'm no expert on this so unfortunately I can't further narrow down the problem or provide a PR. Some other data intense libraries that I have worked with that do build on windows are yajl, libgit2 and mongo-c-driver, perhaps we can have a peek.

What I can do is help with testing, or provide a compiler for you to test.

readstat_value_is_considered_missing() confused

Steps to reproduce:

Import https://dl.dropboxusercontent.com/u/239966/spss_missing_bug2.sav
Use readstat_value_is_considered_missing(value) in a value_label() handler
The value 9 of the first column (Missing4) is misread as non missing

The value is correctly read as missing once the second column is removed.

Compilation warnings

src/readstat_sav_parse.c:30:18: warning: unused variable 'sav_long_variable_parse_first_final' [-Wunused-const-variable]
static const int sav_long_variable_parse_first_final = 227;
                 ^
src/readstat_sav_parse.c:31:18: warning: unused variable 'sav_long_variable_parse_error' [-Wunused-const-variable]
static const int sav_long_variable_parse_error = 0;
                 ^
src/readstat_sav_parse.c:33:18: warning: unused variable 'sav_long_variable_parse_en_main' [-Wunused-const-variable]
static const int sav_long_variable_parse_en_main = 1;
                 ^
src/readstat_sav_parse.c:3457:18: warning: unused variable 'sav_very_long_string_parse_first_final' [-Wunused-const-variable]
static const int sav_very_long_string_parse_first_final = 36;
                 ^
src/readstat_sav_parse.c:3458:18: warning: unused variable 'sav_very_long_string_parse_error' [-Wunused-const-variable]
static const int sav_very_long_string_parse_error = 0;
                 ^
src/readstat_sav_parse.c:3460:18: warning: unused variable 'sav_very_long_string_parse_en_main' [-Wunused-const-variable]
static const int sav_very_long_string_parse_en_main = 1;

(feel free to ignore)

Support writing Stata strings with more than 244 characters

Stata 13 (format 117) supports strings with up to 2045 bytes.
See http://www.stata.com/help.cgi?dta_117#strings.

Compiling error

Hi, like many on here I apologize if this is a basic question, I'm new to this...
I'm trying to compile on Cygwin and I get the following error:

$ make
[ -x /usr/local/bin/ragel ] && /usr/local/bin/ragel src/readstat_por_parse.rl -G2
Makefile:9: recipe for target 'all' failed
make: *** [all] Error 1

I assume it's because the makefile isn't tweaked. Can anyone help provide the makefile to compile it on windows?

Fail to parse big-endian >2GB SAS files on Windows

See discussion here: #33

Build fail

Can't build on CentOs 6.6 with gcc 4.4.7

Error 5 on certain sas7bdat files

The library gives an error 5 for certain (very large) sas7bdat files. I haven't been able to pin down the problem myself. Files can be read with https://pypi.python.org/pypi/sas7bdat

Any tips on the kinds of issues I should be thinking about?

Generate error message from readstat_error_t

It would be useful if there was a function that turned the error code into a string for reporting messages to the user.

Seg fault

Can you update the sample implementation for the new API? In sample implementation, handle_value takes 5 inputs. In readstat.h readstat_value_handler takes 4 inputs. (There is an additional readstat_types_t in sample implementation. ) I assume there are other changes as well.

Revamp API to use setHandler pattern

Currently each callback must be passed as a separate argument to the parse_* functions. This limits future flexibility for adding other handlers, e.g. an error handler as discussed in #14. A better API would look something like

readstat_parser_t parser = readstat_parser_init();
readstat_set_info_handler(parser, &my_info_handler);
readstat_set_variable_handler(parser, &my_variable_handler);
readstat_error_t status = readstat_parse(parser, "/path/to/file");
readstat_parser_free(parser);

Fix copyright notice

"MyCompanyName" does not exist.

readstat.h should document return value of callbacks

I assume it's 0 for success and anything else for failure? (This just needs to be a one line comment)

SPSS can't open files by haven

Even very basic ones

haven::write_sav(mtcars, "mtcars.sav")`

GET
FILE='mtcars.sav'.

Error. Command name: GET FILE
Invalid SPSS Statistics data file: mtcars.sav (DATA1204)
Execution of this command stops.

Error # 1405 in column 8. Text: mtcars.sav
Error when attempting to get a data file.
DATASET NAME DataSet1 WINDOW=FRONT.

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

API Example

Error parsing >4.2GB file on Windows

I'm getting the follow error for a file with 4855062528 bytes (errors out at about 30sec-1min). I am using RRO 3.2.2 (64 bit) with the latest haven and ReadStat from this issue.

> system.time(d1 <- read_sas('bar.sas7bdat')) / 60
Error: Failed to parse bar.sas7bdat: Invalid file, or file has unsupported features.
Timing stopped at: 0.07 0.28 35.03

Data set could be generated from

data bar ;
    call streaminit(123);
    array x{10000} x1-x10000 ;
    do i=1 to 60000 ;
        do j=1 to dim(x) ;
            x(j) = rand('Uniform') ;
        end ;
        output ;
    end ;
run ;

Feature request: parallelize sas7bdat import

Thanks so much for all your work on this library - it is extremely helpful for my work!

I recently discovered a Spark package that parallelizes importation of SAS files with the Parso library (link below). I don't know if it's feasible to implement this approach in ReadStat, but some datasets only available from the government in sas7bdat format are huge and speedier import would be excellent. Unfortunately, I have no C skills to offer for assistance...

https://github.com/saurfang/spark-sas7bdat

encoding problem for sas7bdat file

Great facility for reading sas7bdat files, nearly perfect I think.

But with https://github.com/reikoch/testfiles/blob/master/extr.sas7bdat, which is Shift-JIS encoded, I get on Linux:

yy <- read_sas('extr.sas7bdat')
ReadStat: Error parsing page 0, bytes 65536-131071
Error: Failed to parse extr.sas7bdat: Invalid file, or file has unsupported features.

:-(

Fail to parse sas7bcat file

On loading the SAS files for the 2014 NYTS I get an error about the associated cat file:

"Error: Failed to parse /Users/jared/Dropbox/nyts/nyts2014_formats.sas7bcat: Invalid file, or file has unsupported features."

The data itself appears to load fine, and I have every reason to think the cat file is valid and works in SAS. The files are available at:

http://www.cdc.gov/tobacco/data_statistics/surveys/nyts/

Validating variable names when writing data

Was trying to use @hadley's haven package to convert some SPSS files that some colleagues created a while ago and ran into some issues with a variable name in the original SPSS file that contained an invalid character for variable names in Stata. The variable name in question was filter_$, but the $ is an illegal character for a variable name in Stata. It might be good to standardize variable names to the set that is typically valid [a-zA-Z0-9_] across all file formats and provide an option to either substitute illegal characters with an underscore or to remove them from the variable name. When the file is read into Stata is causes all sorts of issues when trying to reference variable names.

Support file-level compression in SAV files (i.e. ZSAV files)

Newer versions of SPSS allow writing files in "ZSAV" format, which looks a lot like SAV except that the data (rows) portion of the file is compressed.

defining custom missingness conditions when writing to sav (and others?)

Just a reminder, see discussion / pos
ting here. If it would be possible to do that, it's a nice feature to have.

Support binary compression in sas7bdat

Binary (aka Ross) compression is currently not supported. It looks like we can use the Python sas7bdat package as a template for implementing the decompression algorithm:

https://pypi.python.org/pypi/sas7bdat

(The library is MIT licensed.)

At present, compressed files fail with the error message "File has unsupported compression scheme".

Large data set: A row in the file was not the expected length.

I'm getting the follow error for a file with 9704906752 bytes (errors out after a long while trying to load the data). I am using RRO 3.2.2 (64 bit) with the latest haven and ReadStat from this issue.

> d1 = read_sas('combined_all.sas7bdat')
Error: Failed to parse combined_all.sas7bdat: A row in the file was not the expected length.

SAS catalog: large value label sets not being imported

Very long variable labels are sometimes stored in SAS catalog files rather than the data file.

Technical note: they appear to be in blocks where (lsp[8] & 0x80) == 0

Write SAS files

I have completely forgotten if we've talked about this before, but how hard would it be?

problem with CHAR compressed SAS datasets

moved from [https://github.com/tidyverse/haven/issues/131].

Not all char compressed SAS datasets can be read; possibly when a certain size is exceeded.

An example dataset with 16 MB random generated data you find as "bar.sas7bdat" in [https://github.com/reikoch/testfiles].

read_sas refuses to read it with the message

read_sas('~/xx/bar.sas7bdat')
Error: Failed to parse /opt/BIOSTAT/home/kochr4/xx/bar.sas7bdat: Invalid file, or file has unsupported features.

Subsets of this file bar.sas7bdat, like barrows (first 1000 rows of bar) and barcols (last 15 columns dropped), work fine.

Stata C API support

For the sake of consistency, wouldn't it make sense to incorporate Stata's C API header file to define the data types?

http://www.stata.com/plugins/stplugin.h
http://www.stata.com/plugins/stplugin.c

I would think it could help with future releases and make the data types a bit easier to handle.