Comments (6)
Here are some ideas I have for an API. All the functions (except gisid, which executes steps 1-4 but differs from there on) share a common sequence of steps:
1. Read the data
2. Determine hashing strategy (this includes an "is sorted" check)
3. Hash
4. Sort hash (keeping track of how it maps to Stata row numbers)
5. Panel setup
6. Check for collisions
7. Sort the groups (with index)
8. Map sorted group index to sorted hash index
9. Function-specific stuff
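To make the hash sort and panel setup steps concrete, here is a minimal sketch in C. It is not the gtools implementation: the `hash_row` type and the `sort_hashes` and `panel_setup` names are hypothetical, a single 64-bit hash stands in for whatever the hashing step produces, and `qsort` stands in for whatever sort the plugin actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: a hash paired with the 0-based row it came from. */
typedef struct {
    uint64_t hash;
    size_t   row;
} hash_row;

/* Order by hash, then by row so the result is deterministic
 * (qsort is not guaranteed to be stable). */
static int cmp_hash_row(const void *a, const void *b)
{
    const hash_row *x = a;
    const hash_row *y = b;
    if (x->hash != y->hash) return (x->hash < y->hash) ? -1 : 1;
    return (x->row > y->row) - (x->row < y->row);
}

/* Hash sort: sort the hashes while carrying the row index along. */
static void sort_hashes(hash_row *hr, size_t n)
{
    qsort(hr, n, sizeof *hr, cmp_hash_row);
}

/* Panel setup: record where each run of equal hashes starts.
 * starts must have room for n + 1 entries. Returns the number of
 * groups J; starts[J] is set to n so that group j spans
 * [starts[j], starts[j + 1]). */
static size_t panel_setup(const hash_row *hr, size_t n, size_t *starts)
{
    assert(starts != NULL);
    size_t J = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || hr[i].hash != hr[i - 1].hash) starts[J++] = i;
    }
    starts[J] = n;
    return J;
}
```

After both calls, `hr[i].row` is the map from sorted position back to the original row, and the first J entries of `starts` are the group start points. Equal hashes are treated as one group here, which is why a separate collision check is still needed afterwards.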
Steps 2, 3, 6, and 7 require that a copy of the data be available in memory for C. Saving the results of steps 3, 4, 5, or 6-8 would require creating variables in Stata in addition to allocating memory in C. Interacting with Stata also carries an inefficiency throughout: doubles have to be cast to and from 64-bit integers. To call these from C directly, there would have to be a generic way to load the data into memory. Some things I could write:
- Is the data sorted? Executes step 1 and checks whether the data is sorted. Returns yes or no.
- Is the bijection OK? Executes step 1 and checks whether a bijection is possible. Returns yes or no.
- Hash. Creates 3 variables from Stata and executes steps 1-3. The first two variables need to be double and store either the bijection and an empty variable, or the two parts of the spookyhash. The third variable can be long or double and is the index of the hash to the Stata observations (in case the user passes [if] [in] or drops missing rows).
- Hash sort. Either creates 3 variables from Stata and executes steps 1-4, or picks up from the hash step above and executes step 4. It sorts the hash and stores the sorted hash along with the index.
- Panel setup. Either creates 2 variables from Stata and executes steps 1-5, or creates 1 variable, picks up from the hash sort step above, and executes step 5. This step creates the index to Stata observations, if it does not exist, and stores the start points of the grouped data in the first J observations.
- Check for collisions + sort groups + map sorted group index to hash index. This can pick up from the panel setup step by creating one extra variable (which will be the sort order of the groups); it would re-read the observations into memory, check for collisions, sort them, and store the sort index. It can also do steps 1-8 directly after creating 3 variables from Stata.
- Various mathematical functions that I use internally (e.g. various functions to compute quantiles).
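The first two checks in the list could look roughly like the sketch below. It rests on assumptions that are not spelled out above: the keys are integers held in a row-major n-by-k array, and "bijection" means packing each key tuple into a single 64-bit integer by mixed-radix encoding over each column's [min, max] range. The names `is_sorted`, `can_biject`, and `biject` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the n rows (k integer keys each, row-major) are already
 * in non-decreasing lexicographic order. */
static bool is_sorted(const int64_t *keys, size_t n, size_t k)
{
    for (size_t i = 1; i < n; i++) {
        const int64_t *prev = keys + (i - 1) * k;
        const int64_t *cur  = keys + i * k;
        for (size_t j = 0; j < k; j++) {
            if (prev[j] < cur[j]) break;        /* strictly increasing here */
            if (prev[j] > cur[j]) return false; /* out of order */
        }
    }
    return true;
}

/* True if the product of the column ranges (max - min + 1) fits in
 * 64 bits, so every key tuple can get its own integer code. */
static bool can_biject(const int64_t *mins, const int64_t *maxs, size_t k)
{
    uint64_t total = 1;
    for (size_t j = 0; j < k; j++) {
        /* Unsigned subtraction avoids signed overflow. */
        uint64_t range = (uint64_t)maxs[j] - (uint64_t)mins[j];
        if (range == UINT64_MAX) return false;  /* range + 1 would wrap */
        range += 1;
        if (total > UINT64_MAX / range) return false;
        total *= range;
    }
    return true;
}

/* Mixed-radix code for one row; only valid if can_biject() is true. */
static uint64_t biject(const int64_t *row, const int64_t *mins,
                       const int64_t *maxs, size_t k)
{
    uint64_t code = 0;
    for (size_t j = 0; j < k; j++) {
        assert(row[j] >= mins[j] && row[j] <= maxs[j]);
        uint64_t range = (uint64_t)maxs[j] - (uint64_t)mins[j] + 1;
        code = code * range + ((uint64_t)row[j] - (uint64_t)mins[j]);
    }
    return code;
}
```

For example, with two keys each ranging over 0 to 9, the row (3, 4) maps to 3 * 10 + 4 = 34. When `can_biject` returns false, the fallback would be a real hash such as the spookyhash mentioned above.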
from stata-gtools.
Have you checked out the ReadStat library? It is the underlying C library used for the haven package in R that reads/writes R, SPSS, SAS, and Stata datasets. Perhaps that would be a way to load data into memory? I’m not sure how garbage collection works with the C API, but if the objects can persist beyond a single call it might make it possible to load multiple datasets simultaneously. I’m not familiar with C at all or I would offer to try helping when I can.
I have this on my list of things to check out. Not sure if it will drastically improve gcollapse or greshape (the main issue there is the inability to create/drop observations and variables in memory). However, I am planning to implement gmerge at some point, and I think the way to go is to try to read the using data via ReadStat, if I can manage.
EDIT: Actually, it should improve it a lot, now that I think about it. If I can save the characteristics of the dataset in memory, save the results from gcollapse/greshape to disk, then do `use results, clear` and apply the chars/labels/etc., it should be way faster than calling the C API twice.
If you were using Java I might be able to help a bit more since that is what I’m more familiar with, but I’ve also been experimenting with trying to do some of this directly in Mata.
@wbuchanan Do you know if it is possible to read data directly from disk when using Java?
@mcaceresb
I had started working on some Java-based dta parsers a while ago but didn’t get too far, and I haven’t been able to put much work into it since then. I do know of a project at Harvard, which I’ve starred, that has Java parsers they use for their data repository project (I think IQSS is the user account).