Comments (6)
- I'm pretty sure that repo's table has a typo (I opened an issue about it here). Nevertheless, for the time being I have implemented a fallback option, oncollision(fallback|fail). The default is fallback, which simply runs the regular collapse if there is a collision. This negates the speed gains whenever a collision occurs, but collisions should be really rare, so I think this is OK for now. Ultimately I would implement a way to resolve collisions internally in C, but it would take some work (my idea is to create a multi-dimensional array with the keys from the bad hash, assign new hashes, and replace the bad hash with the new hashes).
- Has there been an overflow so far? The prior crashes were both memory crashes. Internally, all computations are done in double precision, and in Stata I take some care to upgrade the variable data type when required.
- That is interesting; if sum() is done at quad precision, that would also be the case with means and standard deviations, right? This would add complexity to the code... To save memory, the best thing would be to switch to quad precision only if max() * _N would overflow (max()^2 * _N in the case of sd). I'll mull it over.
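The check described in the last point can be sketched in C. This is only an illustration, not the code gtools uses: the helper names `sum_may_overflow` and `sumsq_may_overflow` are hypothetical, and the bound is deliberately conservative (it assumes every one of the `n` observations could be as large in magnitude as the maximum).

```c
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of gtools): returns true if summing n
 * values, each bounded in magnitude by max_abs, could exceed DBL_MAX.
 * The test divides instead of multiplying so the check itself never
 * overflows. */
bool sum_may_overflow(double max_abs, uint64_t n)
{
    double a = max_abs < 0 ? -max_abs : max_abs;
    if (n == 0 || a == 0.0) return false;
    return a > DBL_MAX / (double) n;
}

/* Same idea for a sum of squares, as needed for a standard deviation:
 * a * a * n > DBL_MAX  is tested as  a > (DBL_MAX / n) / a,
 * so a * a is never formed. */
bool sumsq_may_overflow(double max_abs, uint64_t n)
{
    double a = max_abs < 0 ? -max_abs : max_abs;
    if (n == 0 || a == 0.0) return false;
    return a > (DBL_MAX / (double) n) / a;
}
```

Under this sketch, a caller would stay in double precision whenever both functions return false and only pay for a wider accumulator otherwise.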
from stata-gtools.
Nothing for now, really. I want to add examples to the documentation I wrote. I hadn't done it in sthlp files because they're a pain to write, but markdown is fine.
Now that I have an OSX version I'll focus on bug fixes and stability: check corner cases, improve test coverage, minimize memory use, etc. I don't really have any new features planned.
TBH I wouldn't worry that much about collisions right now: I looked for collisions by creating large datasets with unique identifiers and looking for duplicate IDs coming from gegen, and couldn't find anything with integer values (didn't test for strings or multiple variables). So at least in practice it seems that collisions are extremely unlikely.
What would be really useful, as discussed, would be a program that does egen group+tag+count, as it would allow others to speed up their own commands based on this (kinda how ftools can be used to build fmerge, etc.).
Regarding 2 and 3, my guess is that the best thing would be to create large random test files, collapse them both ways, and check that the numbers match within some epsilon.
Something like this do file perhaps, but with more obs.
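As a rough illustration of "match within some epsilon", here is a minimal C sketch of the comparison step. The function names and the choice of a mixed absolute/relative tolerance are assumptions made for this example; the do file mentioned above may compare values differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a and b agree within eps, using a relative
 * tolerance for large magnitudes and an absolute one near zero. */
bool nearly_equal(double a, double b, double eps)
{
    if (a == b) return true;
    double diff  = a > b ? a - b : b - a;
    double abs_a = a < 0 ? -a : a;
    double abs_b = b < 0 ? -b : b;
    double scale = abs_a > abs_b ? abs_a : abs_b;
    if (scale < 1.0) scale = 1.0;  /* fall back to absolute tolerance */
    return diff <= eps * scale;
}

/* Returns the index of the first pair that differs by more than eps,
 * or -1 if the two result vectors agree everywhere. */
int64_t first_mismatch(const double *x, const double *y, uint64_t n, double eps)
{
    for (uint64_t i = 0; i < n; i++) {
        if (!nearly_equal(x[i], y[i], eps)) return (int64_t) i;
    }
    return -1;
}
```

Here `x` would hold one command's output and `y` the other's for the same statistic, in the same group order.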
I agree, but for the sake of completeness I think it's OK (and the concern about SpookyHash was confirmed to be a typo anyway; see this commit).
I already have these checks (see here; I compare gtools to collapse and egen for all the functions). So there are two concerns: whether sums of very small numbers will lose precision (and by extension, means and sds), and whether very large numbers will overflow (ibid.). Do you know how to check this? When I try to push Stata to its limits in those directions, I get missing values. I will search empirically when I get a chance, but if you already have something in mind let me know.
Re: Are there any overflow risks remaining?
I have normalized the code base to use 64-bit integers (signed and unsigned as applicable) everywhere. That is, I have moved away from C's int and size_t, which can be as low as 16-bit and 32-bit on some systems. I have also improved internal checks to make sure the bijection does not overflow, etc.
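To make the bijection check concrete, here is a minimal sketch under the assumption that the bijection maps k integer by-variables to a single index via the product of their ranges (max - min + 1). The function name `bijection_size` is hypothetical, and the actual gtools checks may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch (not the gtools source): computes the number of
 * cells needed to index k integer variables with the given minima and
 * maxima. Returns false, leaving *size_out untouched, if the product of
 * ranges does not fit in an unsigned 64-bit integer. */
bool bijection_size(const int64_t *mins, const int64_t *maxs,
                    uint64_t k, uint64_t *size_out)
{
    uint64_t size = 1;
    for (uint64_t i = 0; i < k; i++) {
        if (maxs[i] < mins[i]) return false;
        /* max - min in unsigned arithmetic avoids signed overflow */
        uint64_t range = (uint64_t) maxs[i] - (uint64_t) mins[i];
        if (range == UINT64_MAX) return false;  /* range + 1 would wrap */
        range += 1;
        /* size * range > UINT64_MAX, tested without multiplying */
        if (size > UINT64_MAX / range) return false;
        size *= range;
    }
    *size_out = size;
    return true;
}
```

When a check like this fails, a caller would fall back to hashing the by-variables rather than indexing them directly.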
That's quite useful. Btw and just out of curiosity, what do you see as the next steps?