Comments (27)
I'm working with a dataset that has ~90 million observations, all measured over the same daily window, and this was a massive speedup, so thank you both!
from anytime.
Fixed with 0.3.7 Release
Apparently match() is the way to go ... here is a function in base R ...
anytime_base_unique <- function(x, ...) {
  u_x <- unique(x)           # distinct inputs only
  u_y <- anytime(u_x, ...)   # parse each distinct value once
  u_y[match(x, u_x)]         # map parsed values back to the original positions
}
I would indeed like to learn to do it correctly ... but I want to test the code first ... and that will likely be tomorrow or Wednesday as I need to concentrate on actually getting some conclusions out of the data today.
Hadn't thought of that as a speedup. It could work, especially as a simple wrapper.
But ideally without extra depends, so dplyr and magrittr are a bit of a no-no there. If you wanted to implement this using Rcpp and BH at the C++ level, or in base R without new dependencies, we might have something here.
For me it would be base R ... the one thing I don't know how to do is to do a fast n*log(n) merge in base R ... I learned to do that in data.table. Let me research it and see what I can come up with.
Same -- I so love data.table. Maybe we can get by with a Suggests: if we make the function (or the functionality) an option and then test for data.table being present ... but while I wrote this you came up with match(). Nice!
The result is quite quick ...
> nrow(per_modem_1_min)
[1] 110980140
> length(unique(per_modem_1_min$time_stamp))
[1] 43201
> system.time(per_modem_1_min[, time_stamp := anytime_base_unique(time_stamp)])
   user  system elapsed
  2.777   0.008   2.785
What percentage of values are non-unique in that case?
(Reason I am asking is that for most of my cases data was chronologically ordered and hence with unique timestamps -- but your suggestion has clear merit ...)
99.96% are non-unique. (1- 43201 / 110980140)
This is a sample of data from a ~10,000 modem panel (that was the number we requested), so we would expect about 1 in 10k values to be unique, or 99.99% non-unique. In other words, it is exceedingly common to get several interwoven time series in one input file when you're asking for a time series for a statistically significant number of different cases.
For a real-world case from an operating company I would expect 5, 6, or 7 9's. (100k, 1M, 10M)
That these time stamps came in in text form that wasn't always perfect (i.e. as.POSIXct() failed and I couldn't easily figure out why) isn't exceedingly common, but it's not at all unheard of.
So I'm definitely thankful for anytime in these contexts, but this simple step should allow it to scale to a wide range of common situations.
My intuition is that the share of non-unique values at which this technique becomes slower is below 50%, given how fast match() and unique() are.
--Stephen
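The break-even intuition above can be checked empirically. Here is a minimal sketch in base R, not the code from the thread: it assumes as.POSIXct() with a fixed format as a stand-in for anytime(), and parse_ts() and parse_unique() are hypothetical helper names. Timings depend on the machine and on how expensive the parser is, so they are printed rather than asserted.

```r
# Stand-in parser: the thread uses anytime(); as.POSIXct() with a fixed
# format is an assumption made here to keep the sketch dependency-free.
parse_ts <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Hypothetical helper mirroring anytime_base_unique(): parse distinct values
# once, then map the results back to the original positions with match().
parse_unique <- function(x, parser = parse_ts) {
  u_x <- unique(x)
  parser(u_x)[match(x, u_x)]
}

# Synthetic input: 1,000 distinct timestamps, each repeated 100 times, shuffled.
set.seed(42)
stamps <- format(as.POSIXct("2017-01-01", tz = "UTC") + 60 * (0:999),
                 "%Y-%m-%d %H:%M:%S")
x <- sample(rep(stamps, 100))

# Share of non-unique values, computed the same way as in the comment above.
dup_share <- 1 - length(unique(x)) / length(x)

direct  <- parse_ts(x)       # parse every element
deduped <- parse_unique(x)   # parse distinct elements only

print(dup_share)
print(isTRUE(all.equal(direct, deduped)))
print(system.time(parse_ts(x)))
print(system.time(parse_unique(x)))
```

Lowering the repeat count in rep() shrinks the share of duplicates, which is one way to look for the point where the unique()/match() overhead stops paying off.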
That's what they call Yuge. We should support that.
(The only other optimization I had been thinking about, but did not need myself, was 'memoizing' which parsing expression was used last so as not to search again. Then again, I have the feeling I made that 'index' variable static and enabled it many moons ago ...)
I only just glanced through your C code to see if I could find you boiling it down to unique instances, so I can't tell, but I did notice the large number of potential formats you hand-coded ... thanks again for creating this package and your work on behalf of the community.
Yes. See how it 'holds' a vector of format candidates? It then loops til it finds one that does not generate NA. I meant to double-check that the index j of that position is the starting value for the next attempt in a possible outer loop (of vectorized input).
As for the 'cache for unique', what are you thinking? Make it an option for (any|utc)(time|date), so add it somewhere to the process code? Or make it a wrapper (potentially four times :-/)?
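The candidate-format loop with a remembered starting index, as described above, can be illustrated roughly as follows. This is an R sketch of the idea only, not the package's C++ code: the three formats are hypothetical examples, and parse_with_memory() is an invented name.

```r
# Hypothetical candidate formats; the package's own list is much longer.
formats <- c("%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%d.%m.%Y %H:%M:%S")

parse_with_memory <- function(x, formats) {
  out <- rep(as.POSIXct(NA, tz = "UTC"), length(x))
  start <- 1L   # index of the format that succeeded most recently
  for (i in seq_along(x)) {
    # Try the remembered format first, then the remaining ones in order.
    candidates <- c(start, setdiff(seq_along(formats), start))
    for (j in candidates) {
      val <- as.POSIXct(x[i], format = formats[j], tz = "UTC")
      if (!is.na(val)) {
        out[i] <- val
        start <- j   # remember j as the starting point for the next element
        break
      }
    }
  }
  out
}

x <- c("2017-01-01 10:00:00", "2017/01/02 11:00:00",
       "2017/01/03 12:00:00", "not a date")
res <- parse_with_memory(x, formats)
print(res)
```

Inputs that match none of the candidates stay NA, and consecutive inputs in the same format hit on the first attempt.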
Honestly it's just a super simple refactoring (more speed) with no other changes. While I'm not nearly as wise in creating user experiences as you are, I don't see any reason to do anything but put it "under the hood" when the time is right and in the right place ... which I think would be in the R (S3?) methods just before the Rcpp call.
I will do a binary search to see where the tradeoff is for anytime() and anydate() ... but of course it must include calculating unique(x) in the first place, so that is half of the overhead. Then one can add a test to see if it's worth doing.
I don't even think it's worth adding a 'calc_unique = TRUE' flag to the function call ... That's the kind of complication that one only cares about when one is parallelizing something.
I am mostly worried about overall behaviour and consistency.
This is a functional change, so my preference would be opt-in, that is to have an option defaulting to FALSE or off which one needs to turn on to get the behaviour you seek.
We could also keep it simple and just make it a demo and/or a new vignette for now.
In that context, I think calc_unique = FALSE is likely the way to go ... and then a short vignette and/or blog post to publicize the speedup.
I'm getting the hint that you'd like me to clone or fork the code and put in the changes and also at least draft the blog post in blogdown?
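An opt-in flag along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the package's implementation: parse_with_option() and its parse_fun argument are hypothetical, and as.Date() stands in for the real parser so the example runs in base R.

```r
# Hypothetical opt-in wrapper; `parse_fun` stands in for the real parser
# (anytime() in the thread) so the sketch runs without extra packages.
parse_with_option <- function(x, parse_fun, calc_unique = FALSE, ...) {
  if (!calc_unique) {
    return(parse_fun(x, ...))          # default: behaviour unchanged
  }
  u_x <- unique(x)                     # distinct inputs only
  parse_fun(u_x, ...)[match(x, u_x)]   # parse once, map back to all positions
}

x <- c("2017-03-01", "2017-03-02", "2017-03-01", NA, "2017-03-02")
plain  <- parse_with_option(x, as.Date)
cached <- parse_with_option(x, as.Date, calc_unique = TRUE)
print(isTRUE(all.equal(plain, cached)))
```

Because the default is FALSE, existing callers see no change; only callers who pass calc_unique = TRUE take the unique()/match() path.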
:-)
I am still trying to think through how to best approach this, and it is not obvious. If you look at anytime.R, many/most/all of the paths end by eventually calling anytime_cpp. So we could do it there and hash etc. in C++. But then we'd reinvent match(), and that is silly.
Or one just wraps an outer function around it that takes a vector, finds the unique values, and maps them back to the original ones. Basically three or so data.table statements. Maybe we just do that?
It feels like you're thinking yourself in circles at this point ... give it a few days ... something will decide it for you.
A little later today, I'll present what I'm thinking in terms of code. Given the speed of unique() and match() (the base R team has been working hard on making these fast, methinks), there is little reason to recreate them in C ... plus I am incapable of doing that within reasonable time.
If you want to do it in C, I would take a look at the underlying source for those commands and see what C (hopefully not Fortran) libraries they call upon in the first place.
Well, I don't want to add code to each S3 method. Maybe it will work at the respective .default methods.
But yes, your itch to scratch so please do make a proposal in code and then we see where we are at.
Here is the diff I propose for anytime.R ... I will add that my knowledge of S3 is just what I see here as an example, so take that with a grain of salt.
However, the only modification to each class is adding the calcUnique variable to the function definition. As we discussed, only the calls to anytime_cpp (or the internal call to anytime) are changed.
I haven't tested it just yet, and I can't commit to any testing until well into next week, but this is the outline of my proposal.
That's pretty good and concise. I was musing for a moment if one could get by with just one call to anytime_cpp(), but obviously one cannot: working on a reduced set is the whole point of this exercise :)
Do you want to fork, commit your change and propose it as a pull-request?
It's my first rodeo and your repository ... so whichever you want.
I am easy. If you want to "earn your first stripes" with an actual commit and PR, I am happy to walk you through. If that is more 'bah humbug' to you, I can also take your (clear enough) diff and apply it, and still give you credit in the ChangeLog etc. Happy to help either way.
That's part of a PR too :) The existing unit tests help, but it is considered best style to add new tests for new functionality. I.e. we could mock something by having a vector with c(rep(x, 4), rep(y, 3)) and ensure that the first four results are identical, as are the last three, and that they are different from each other etc pp.
No rush. anytime is in a good spot, but your suggested change will make it clearly better for a subset of users such as yourself who encounter lots of 'dupes' in the input.
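The test idea above, a vector built as c(rep(x, 4), rep(y, 3)), could be sketched like this. It is a base R illustration with assumptions: parse_unique() is a hypothetical stand-in for the deduplicating code path, as.Date() replaces the real parser, and plain stopifnot() is used instead of whichever test framework the package relies on.

```r
# Hypothetical deduplicating parser; as.Date() stands in for the real one.
parse_unique <- function(x, parse_fun = as.Date, ...) {
  u_x <- unique(x)
  parse_fun(u_x, ...)[match(x, u_x)]
}

x <- "2017-06-01"
y <- "2017-06-02"
input <- c(rep(x, 4), rep(y, 3))
res <- parse_unique(input)

stopifnot(
  length(res) == 7L,
  length(unique(res[1:4])) == 1L,            # first four results identical
  length(unique(res[5:7])) == 1L,            # last three results identical
  res[1] != res[5],                          # the two groups differ
  isTRUE(all.equal(res, as.Date(input)))     # same answer as parsing everything
)
```

The final check compares against the non-deduplicated path, which guards against the mapping step reordering results.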
Thanks for catching that. I usually remember to add a tag (fixes #xyz) to the commit message but forgot today.
And thanks again for the PR.
I'm happy to ... it was a great learning experience and demystified a lot for me.