Comments (6)
Briefly:
`READSTAT_TYPE_LONG_STRING` corresponds to Stata's new "long string" type, which contains either ASCII or binary data. Perhaps this should be renamed `READSTAT_TYPE_BLOB` or similar.
`var_format` is the format string associated with a variable (e.g. date formatting). These are often file-type-specific, but I try to convert them to Stata format where possible.
`val_labels` is a string identifier for a set of value labels. It's up to the client to build a dictionary for each set with `handle_value_label`.
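As a rough illustration of the client-side dictionary mentioned above, here is a minimal sketch in plain C. It does not use ReadStat itself: the callback signature, the fixed-size tables, and every `sketch_` name are assumptions made for the example, and numeric values only are handled.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_SETS   16  /* hypothetical limits, chosen for brevity */
#define SKETCH_MAX_LABELS 32
#define SKETCH_NAME_LEN   64

typedef struct {
    double value;
    char label[SKETCH_NAME_LEN];
} sketch_label_t;

typedef struct {
    char name[SKETCH_NAME_LEN];  /* the val_labels identifier */
    sketch_label_t labels[SKETCH_MAX_LABELS];
    int count;
} sketch_label_set_t;

typedef struct {
    sketch_label_set_t sets[SKETCH_MAX_SETS];
    int count;
} sketch_label_dict_t;

/* Find a label set by name; optionally create it when missing. */
static sketch_label_set_t *sketch_find_set(sketch_label_dict_t *d,
                                           const char *name, int create) {
    for (int i = 0; i < d->count; i++) {
        if (strcmp(d->sets[i].name, name) == 0) return &d->sets[i];
    }
    if (!create || d->count >= SKETCH_MAX_SETS) return NULL;
    sketch_label_set_t *s = &d->sets[d->count++];
    snprintf(s->name, sizeof s->name, "%s", name);
    s->count = 0;
    return s;
}

/* Hypothetical callback, invoked once per (set, value, label) triple.
 * Returns 0 to continue and nonzero when the tables are full. */
int sketch_handle_value_label(const char *val_labels, double value,
                              const char *label, void *ctx) {
    sketch_label_set_t *s =
        sketch_find_set((sketch_label_dict_t *)ctx, val_labels, 1);
    if (s == NULL || s->count >= SKETCH_MAX_LABELS) return 1;
    s->labels[s->count].value = value;
    snprintf(s->labels[s->count].label, sizeof s->labels[s->count].label,
             "%s", label);
    s->count++;
    return 0;
}

/* Look up the label for a value in a named set; NULL when absent. */
const char *sketch_lookup_label(sketch_label_dict_t *d,
                                const char *val_labels, double value) {
    sketch_label_set_t *s = sketch_find_set(d, val_labels, 0);
    if (s == NULL) return NULL;
    for (int i = 0; i < s->count; i++) {
        if (s->labels[i].value == value) return s->labels[i].label;
    }
    return NULL;
}
```

A client would zero-initialize one `sketch_label_dict_t`, pass it as the context pointer, and later resolve each variable's values through the set named by that variable's `val_labels` string.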
`max_len` isn't especially helpful for non-string types, but will return the storage size indicated in the file (8 bytes for doubles, etc.). This is inconsistent, though, and some parsers always return 0. It might be worth just getting rid of it altogether, since I imagine very few applications today will use fixed-width string storage.
from readstat.
That brings up another question: what happens with date/time variables?
Date/times are a hack in all of these file formats. They are numeric variables that must be deciphered and presented using the format string. Underlying storage:
SAS: seconds since January 1, 1960.
Stata: milliseconds, seconds, days, weeks, months, quarters, or half-years since January 1, 1960; or years stored as an integer
SPSS: seconds since October 14, 1582
I have some business logic to make sense of all these. It might be worth moving this logic into ReadStat itself. One caveat is that the Stata values aren't straight timestamps; many of them have an associated duration (day, week, month, etc.), so we'll probably end up with a struct and two accessor functions (`readstat_timestamp_value` and `readstat_duration_value`).
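To make the struct-plus-accessors idea concrete, here is a minimal sketch in plain C. Nothing in it is ReadStat API: the `sketch_` names are hypothetical stand-ins for the proposed functions, the values are treated as UTC (an assumption, since none of the formats store a time zone), and only the simple cases from the list above are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Seconds from 1960-01-01 to 1970-01-01: 3653 days. */
#define SKETCH_SECS_1960_TO_1970 315619200.0
/* Seconds from 1582-10-14 to 1970-01-01: 141428 days. */
#define SKETCH_SECS_1582_TO_1970 12219379200.0
#define SKETCH_SECS_PER_DAY 86400.0

/* Hypothetical struct: where the observation starts and how long it spans. */
typedef struct {
    double posix_seconds;    /* start, in seconds since 1970-01-01 */
    double duration_seconds; /* 0 when the value is a plain instant */
} sketch_time_t;

/* SAS datetime: seconds since 1960-01-01, treated as an instant. */
sketch_time_t sketch_from_sas_seconds(double v) {
    sketch_time_t t = { v - SKETCH_SECS_1960_TO_1970, 0.0 };
    return t;
}

/* SPSS datetime: seconds since 1582-10-14, treated as an instant. */
sketch_time_t sketch_from_spss_seconds(double v) {
    sketch_time_t t = { v - SKETCH_SECS_1582_TO_1970, 0.0 };
    return t;
}

/* Stata daily date: days since 1960-01-01, spanning one 24-hour day. */
sketch_time_t sketch_from_stata_days(double v) {
    sketch_time_t t = { v * SKETCH_SECS_PER_DAY - SKETCH_SECS_1960_TO_1970,
                        SKETCH_SECS_PER_DAY };
    return t;
}

/* The two accessors, mirroring the proposed function names. */
double sketch_timestamp_value(const sketch_time_t *t) {
    assert(t != NULL);
    return t->posix_seconds;
}

double sketch_duration_value(const sketch_time_t *t) {
    assert(t != NULL);
    return t->duration_seconds;
}
```

Weeks, months, quarters, and half-years are left out on purpose, because their durations vary and need calendar arithmetic rather than a fixed multiplier.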
On a related note, I maintain a separate library for parsing time format strings:
https://github.com/WizardMac/TimeFormatStrings
I use this library to infer the intended duration for timestamp values. (E.g. if the shortest unit in the format string is a day I assume that the observation refers to a 24-hour period.)
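The inference step can be sketched for the simplest case, where the format has already been converted to a Stata-style `%t` format and the unit is given by the letter after `%t`. This is not how TimeFormatStrings works internally; the enum and the function name below are hypothetical, and custom or business-calendar formats are simply reported as unknown.

```c
#include <stddef.h>

/* Hypothetical unit codes for the shortest unit a format can display. */
typedef enum {
    SKETCH_UNIT_UNKNOWN = 0,
    SKETCH_UNIT_MILLISECOND,
    SKETCH_UNIT_DAY,
    SKETCH_UNIT_WEEK,
    SKETCH_UNIT_MONTH,
    SKETCH_UNIT_QUARTER,
    SKETCH_UNIT_HALF_YEAR,
    SKETCH_UNIT_YEAR
} sketch_unit_t;

/* Map a Stata-style format such as "%td" or "%-tcHH:MM" to its unit.
 * Anything that does not start with "%t" or "%-t" is reported as unknown. */
sketch_unit_t sketch_stata_format_unit(const char *fmt) {
    if (fmt == NULL || fmt[0] != '%') return SKETCH_UNIT_UNKNOWN;
    const char *p = fmt + 1;
    if (*p == '-') p++;              /* optional left-justify flag */
    if (*p != 't') return SKETCH_UNIT_UNKNOWN;
    switch (p[1]) {
        case 'c':                    /* ms, leap seconds ignored */
        case 'C': return SKETCH_UNIT_MILLISECOND; /* ms, leap seconds counted */
        case 'd': return SKETCH_UNIT_DAY;
        case 'w': return SKETCH_UNIT_WEEK;
        case 'm': return SKETCH_UNIT_MONTH;
        case 'q': return SKETCH_UNIT_QUARTER;
        case 'h': return SKETCH_UNIT_HALF_YEAR;
        case 'y': return SKETCH_UNIT_YEAR;
        default:  return SKETCH_UNIT_UNKNOWN;
    }
}
```

A caller could then treat `SKETCH_UNIT_DAY` as "this observation covers a 24-hour period", in line with the assumption described above.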
So if you're reading in (e.g.) a SAS data file, you can determine if a variable is a date/time because it's an `INT32` and `var_format` is non-null?
It would definitely be useful to have a way of returning a date/time as POSIX (seconds since 1970-01-01). I suspect time zones will be an additional hassle.
As I mentioned elsewhere I try to convert the time format strings to Stata format, so in the case of SAS the format string is "%ts". (And it's a DOUBLE type, as SAS files don't have native integer types.)
@hadley The other difficulty with casting to a POSIX variable is that Stata uses a different epoch (1960-01-01 00:00:00) and has separate types for values that are adjusted for leap seconds (double precision with format "%tC") and values that are not (double precision with format "%tc"); both count elapsed milliseconds. Dates are integer-valued days since the Stata epoch (integer or float storage with format "%td"). Unlike some other systems, there is no 'time' type (e.g., time of day regardless of the date). There are also aggregate versions of the date values (weeks, months, quarters, years, etc.). I'm not sure the time zone issue would be a factor, since Stata has no time zone concept. I'm also not sure how one could infer the time zone from the file unless the user explicitly recorded it in a data set characteristic or something similar.
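A minimal sketch of the epoch shift for the two straightforward Stata cases follows. The function name and return convention are hypothetical, values are treated as UTC because the file carries no time zone, and "%tC" is deliberately rejected: undoing its leap-second adjustment needs a leap-second table that this sketch does not include.

```c
#include <stddef.h>
#include <string.h>

/* Seconds from the Stata epoch (1960-01-01) to the POSIX epoch (1970-01-01). */
#define SKETCH_STATA_EPOCH_OFFSET 315619200.0

/* Convert a Stata value to POSIX seconds based on its format prefix.
 * Returns 0 on success, -1 when the format is not handled here. */
int sketch_stata_to_posix(double value, const char *fmt, double *out_seconds) {
    if (fmt == NULL || out_seconds == NULL) return -1;
    if (strncmp(fmt, "%tc", 3) == 0) {
        /* milliseconds since 1960-01-01, leap seconds ignored */
        *out_seconds = value / 1000.0 - SKETCH_STATA_EPOCH_OFFSET;
        return 0;
    }
    if (strncmp(fmt, "%td", 3) == 0) {
        /* whole days since 1960-01-01; result is midnight of that day */
        *out_seconds = value * 86400.0 - SKETCH_STATA_EPOCH_OFFSET;
        return 0;
    }
    /* "%tC" and the aggregate formats (weeks, months, ...) are not instants
     * that a constant shift can convert, so they are left to the caller. */
    return -1;
}
```

Returning an error for the unhandled formats, rather than guessing, keeps the caller aware that those values need either leap-second data or calendar arithmetic.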