More documentation about readstat (closed)

wizardmac commented on June 15, 2024
More documentation

from readstat.

Comments (6)

evanmiller commented on June 15, 2024

Briefly:

READSTAT_TYPE_LONG_STRING corresponds to Stata's new "long string" type, which contains either ASCII or binary data. Perhaps this should be renamed READSTAT_TYPE_BLOB or similar.

var_format is the format string associated with a variable. (E.g. date formatting.) These are often file-type-specific but I try to convert them to Stata format where possible.

val_labels is a string identifier for a set of value labels. It's up to the client to build a dictionary for each set with handle_value_label.
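As a rough illustration of the client-side dictionary described above, here is a minimal Python sketch. The function name `handle_value_label` is borrowed from the comment, but its signature here is an assumption made for illustration and is not ReadStat's actual C callback; `label_for` is a hypothetical helper.

```python
from collections import defaultdict

# One dictionary per label set, keyed by the set's string identifier
# (the val_labels string attached to a variable).
label_sets = defaultdict(dict)

def handle_value_label(val_labels, value, label):
    """Hypothetical callback: record one (value, label) pair for a set."""
    label_sets[val_labels][value] = label

def label_for(val_labels, value):
    """Look up the label for a value, falling back to the raw value."""
    return label_sets.get(val_labels, {}).get(value, value)

# Simulate the parser reporting two label sets.
handle_value_label("yesno", 0, "No")
handle_value_label("yesno", 1, "Yes")
handle_value_label("sex", 1, "Male")
```

A variable whose val_labels is "yesno" would then be decoded with `label_for("yesno", 1)`; values without a label pass through unchanged.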

max_len isn't especially helpful for non-string types, but will return the storage size indicated in the file. (8 bytes for doubles etc.) This is inconsistent though and some parsers always return 0. It might be worth just getting rid of it altogether, since I imagine very few applications today will use fixed-width string storage.

hadley commented on June 15, 2024

That brings up another question: what happens with date/time variables?

evanmiller commented on June 15, 2024

Date/times are a hack in all of these file formats. They are numeric variables that must be deciphered and presented using the format string. Underlying storage:

SAS: seconds since January 1, 1960.

Stata: milliseconds, seconds, days, weeks, months, quarters, or half-years since January 1, 1960; or years stored as an integer

SPSS: seconds since October 14, 1582
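For the two seconds-based encodings, shifting to Unix time is a fixed offset. The sketch below is only an illustration, not ReadStat code: it assumes the stored values can be treated as UTC (the files carry no time zone), and the function names are hypothetical.

```python
from datetime import datetime, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SAS_EPOCH = datetime(1960, 1, 1, tzinfo=timezone.utc)
# Python's datetime uses the proleptic Gregorian calendar, so this is
# the day before the first Gregorian date, 1582-10-15.
SPSS_EPOCH = datetime(1582, 10, 14, tzinfo=timezone.utc)

# Seconds between each format's epoch and the Unix epoch.
SAS_OFFSET = int((UNIX_EPOCH - SAS_EPOCH).total_seconds())
SPSS_OFFSET = int((UNIX_EPOCH - SPSS_EPOCH).total_seconds())

def sas_to_posix(seconds):
    """Convert SAS datetime seconds to seconds since 1970-01-01."""
    return seconds - SAS_OFFSET

def spss_to_posix(seconds):
    """Convert SPSS datetime seconds to seconds since 1970-01-01."""
    return seconds - SPSS_OFFSET
```

The offsets are computed rather than hard-coded so the epochs stay visible in the code.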

I have some business logic to make sense of all these. It might be worth moving this logic into ReadStat itself. One caveat is the Stata values aren't straight timestamps; many of them have an associated duration (day, week, month, etc.), so we'll probably end up with a struct and two accessor functions (readstat_timestamp_value and readstat_duration_value).
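To make the "timestamp plus duration" idea concrete, here is a small Python sketch that turns a Stata period value into a start date and an exclusive end date. It is not the proposed readstat_timestamp_value / readstat_duration_value API; `stata_period` is a hypothetical name, it assumes the format string begins with the three-character Stata type prefix, and it deliberately leaves out weekly values, millisecond datetimes, and the seconds format.

```python
from datetime import date, timedelta

STATA_EPOCH = date(1960, 1, 1)

# Period types that span a whole number of months.
MONTHS_PER_UNIT = {"%tm": 1, "%tq": 3, "%th": 6}

def _add_months(d, n):
    """Return the first day of the month n months after d's month."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def stata_period(value, fmt):
    """Return (start, end) for a Stata period value; end is exclusive."""
    prefix = fmt[:3]
    if prefix == "%td":  # days since 1960-01-01
        start = STATA_EPOCH + timedelta(days=value)
        return start, start + timedelta(days=1)
    if prefix == "%ty":  # the year itself, stored as an integer
        return date(value, 1, 1), date(value + 1, 1, 1)
    if prefix in MONTHS_PER_UNIT:  # months, quarters, half-years
        step = MONTHS_PER_UNIT[prefix]
        start = _add_months(STATA_EPOCH, value * step)
        return start, _add_months(start, step)
    raise ValueError(f"unsupported format in this sketch: {fmt!r}")
```

Returning an interval rather than a single instant keeps the duration explicit, which is the distinction the two accessor functions are meant to capture.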

On a related note, I maintain a separate library for parsing time format strings:

https://github.com/WizardMac/TimeFormatStrings

I use this library to infer the intended duration for timestamp values. (E.g. if the shortest unit in the format string is a day I assume that the observation refers to a 24-hour period.)

hadley commented on June 15, 2024

So if you're reading in (e.g.) a SAS data file, you can determine if a variable is a date/time because it's an INT32 and var_format is non-null?

It would definitely be useful to have a way of returning a date-time as POSIX time (seconds since 1970-01-01). I suspect time zones will be an additional hassle.

evanmiller commented on June 15, 2024

As I mentioned elsewhere, I try to convert the time format strings to Stata format, so in the case of SAS the format string is "%ts". (And it's a DOUBLE type, as SAS files don't have native integer types.)

wbuchanan commented on June 15, 2024

@hadley The other difficulty with casting to a POSIX variable is that Stata uses a different epoch (1960-01-01 00:00:00) and has separate types for values that are adjusted for leap seconds (double precision with format "%tC") and values that are not (double precision with format "%tc"); both count elapsed time in milliseconds. The date types are integer-valued days since the Stata epoch date (integer or float value with format "%td"). Unlike some other systems, there's no 'time' type (e.g., time of day regardless of the date). There are also different aggregate versions of the date values (e.g., weeks, months, years, quarters, etc.). I'm not sure the time zone issue would be a factor, since there isn't a time zone concept in Stata. I'm also not sure how one could infer the time zone from the file unless the user explicitly added it in a data set characteristic or something like that.
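For the two simplest Stata cases above, the epoch shift is plain arithmetic. This is a hedged Python sketch with hypothetical function names. It covers only "%tc" (milliseconds, no leap-second adjustment) and "%td" (days), and it assumes the values are read as UTC; "%tC" is left out because undoing its leap-second adjustment needs a leap-second table.

```python
SECONDS_PER_DAY = 86_400
# 1960-01-01 to 1970-01-01 is 3653 days (three leap years: 1960, 1964, 1968).
STATA_TO_UNIX_SECONDS = 3_653 * SECONDS_PER_DAY

def stata_tc_to_posix(ms):
    """Milliseconds since 1960-01-01 (%tc) to seconds since 1970-01-01."""
    return ms / 1000.0 - STATA_TO_UNIX_SECONDS

def stata_td_to_posix(days):
    """Days since 1960-01-01 (%td) to seconds since 1970-01-01 (midnight)."""
    return days * SECONDS_PER_DAY - STATA_TO_UNIX_SECONDS
```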