Comments (27)

be-green avatar be-green commented on June 2, 2024 3

I'm working with a dataset that has ~90million observations all measured over the same daily window and this was a massive speedup so thank you both!

from anytime.

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024 2

Fixed with 0.3.7 Release

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024 1

Apparently match() is the way to go ... here is a function in base R ...

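library(anytime)

# Parse each distinct value once, then map the results back to every position.
anytime_base_unique <- function(x, ...) {
    u_x <- unique(x)          # distinct inputs
    u_y <- anytime(u_x, ...)  # parse only the distinct inputs
    u_y[match(x, u_x)]        # expand back to the original length and order
}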

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024 1

I would indeed like to learn to do it correctly ... but I want to test the code first ... and that will likely be tomorrow or Wednesday as I need to concentrate on actually getting some conclusions out of the data today.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Hadn't thought of that as a speedup. It could work, especially as a simple wrapper.

But ideally without extra dependencies, so dplyr and magrittr are a bit of a no-no there. If you wanted to implement this using Rcpp and BH at the C++ level, or in base R without new dependencies, we might have something here.

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

For me it would be base R ... the one thing I don't know how to do is to do a fast n*log(n) merge in base R ... I learned to do that in data.table. Let me research it and see what I can come up with.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Same -- I so love data.table. Maybe we can get by with a Suggests: if we make the function an option or functionality an option and then test for data.table being present ... but while I wrote this you came up with match(). Nice!

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

The result is quite quick ...

> nrow(per_modem_1_min)
[1] 110980140
> length(unique(per_modem_1_min$time_stamp))
[1] 43201
> system.time(per_modem_1_min[, time_stamp := anytime_base_unique(time_stamp)])
   user  system elapsed 
  2.777   0.008   2.785 
> 

eddelbuettel avatar eddelbuettel commented on June 2, 2024

What percentage of values are non-unique in that case?

(Reason I am asking is that for most of my cases data was chronologically ordered and hence with unique timestamps -- but your suggestion has clear merit ...)

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

99.96% are non-unique. (1- 43201 / 110980140)

This is a sample of data from a ~10,000 modem panel (that was the number we requested), so we would expect 1 in 10k or 99.99%. In other words, it is exceedingly common to get several interwoven time series in one input file when you're asking for a time series for a statistically significant number of different cases.

For a real-world case from an operating company I would expect 5, 6, or 7 nines (100k, 1M, 10M).

That these time stamps came in text form that wasn't always perfect (i.e. as.POSIXct() failed and I couldn't easily figure out why) isn't exceedingly common, but it's not at all unheard of.

So I'm definitely thankful for anytime in these contexts, but this simple step should allow it to scale to a wide range of common situations.

My intuition as to what percentage of non-unique values you get to before this technique is slower is that it's below 50%, given how fast match() and unique() are.

--Stephen
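
The arithmetic in the comment above can be reproduced in a few lines of base R. This is only a sketch using the figures quoted in the thread; the panel-size calculation assumes an idealized case in which every series reports exactly once per time stamp, which real data will only approximate.

```r
# Figures quoted in the thread
n_total  <- 110980140   # rows in per_modem_1_min
n_unique <- 43201       # distinct time stamps

dup_share <- 1 - n_unique / n_total
round(100 * dup_share, 2)    # 99.96

# Idealized expectation for a panel of k interleaved series: every time
# stamp appears k times, so a share of 1 - 1/k of the values are repeats.
panel_sizes    <- c(1e4, 1e5, 1e6, 1e7)
expected_share <- 1 - 1 / panel_sizes
```

With these assumptions, `expected_share` gives the "four to seven nines" mentioned above for panels of 10k to 10M series.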

eddelbuettel avatar eddelbuettel commented on June 2, 2024

That's what they call Yuge. We should support that.

(The only other optimization I had been thinking about, but did not need myself, was 'memorizing' which parsing expression was used last so as not to search again. Then again, I have the feeling I made that 'index' variable static and enabled it many moons ago ...)

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

I only just glanced through your C code to see if I could find you boiling it down to unique instances, so I can't tell, but I did notice the large number of potential formats you hand-coded ... thanks again for creating this package and your work on behalf of the community.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Yes. See how it 'holds' a vector of format candidates? It then loops til it finds one that does not generate NA. I meant to double-check that the index j of that position is the starting value for the next attempt in a possible outer loop (of vectorized input).

As for the 'cache for unique', what are you thinking? Make it an option for (any|utc)(time|date), so add it somewhere to the process code? Or make it a wrapper (potentially four times :-/)?
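
The idea of looping over format candidates and starting the next search at the last successful index can be illustrated in plain R. This is a hypothetical sketch, not the package's C++ code: `parse_with_memory` and its `formats` argument are invented for illustration, and base `as.POSIXct()` stands in for the real parser.

```r
# Hypothetical sketch: try candidate formats in turn, starting with the
# format that succeeded for the previous element.
parse_with_memory <- function(x, formats, tz = "UTC") {
    out <- rep(as.POSIXct(NA, tz = tz), length(x))
    j <- 1L                                   # index of the last successful format
    for (i in seq_along(x)) {
        # Rotate the candidate order so that format j is tried first
        try_order <- c(j:length(formats), seq_len(j - 1L))
        for (k in try_order) {
            val <- as.POSIXct(x[i], format = formats[k], tz = tz)
            if (!is.na(val)) {
                out[i] <- val
                j <- k                        # remember where we succeeded
                break
            }
        }
    }
    out
}

formats <- c("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S")
stamps  <- c("2024-06-02 10:00:00", "02/06/2024 11:30:00", "02/06/2024 11:31:00")
parsed  <- parse_with_memory(stamps, formats)
```

For homogeneous input the inner loop then succeeds on its first attempt for every element after the first; elements that match no candidate stay NA.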

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

Honestly it's just a super simple refactoring (more speed) with no other changes. While I'm not nearly as wise in creating user experiences as you are, I don't see any reason to do anything but put it "under the hood" when the time is right and in the right place ... which I think would be in the R (S3?) methods just before the Rcpp call.

I will do a binary search to see where the tradeoff is for anytime() and anydate() ... but of course it must include calculating unique(x) in the first place, so that is half of the overhead. Then one can add a test to see if it's worth doing.

I don't even think it's worth adding a `calc_unique = TRUE` flag to the function call ... That's the kind of complication that one only cares about when one is parallelizing something.
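
One way to probe the tradeoff described above is to time both approaches on synthetic input with a chosen number of distinct values. The helper below is a hypothetical sketch: `compare_timings` is not part of anytime, `parser` is whatever function you want to measure (base `as.POSIXct()` is used here as a stand-in), and the timings depend entirely on the machine and the parser.

```r
# Hypothetical helper: time direct parsing against unique-then-match parsing
# for n elements drawn from (at most) n_distinct distinct time stamps.
compare_timings <- function(n, n_distinct, parser) {
    base <- format(as.POSIXct("2024-06-02", tz = "UTC") + seq_len(n_distinct),
                   "%Y-%m-%d %H:%M:%S", tz = "UTC")
    x <- sample(base, n, replace = TRUE)

    direct <- system.time(r1 <- parser(x))[["elapsed"]]
    cached <- system.time({
        u_x <- unique(x)                      # this cost counts against the cache
        r2  <- parser(u_x)[match(x, u_x)]
    })[["elapsed"]]

    stopifnot(isTRUE(all.equal(r1, r2)))      # both paths must agree
    c(direct = direct, cached = cached)
}

timings <- compare_timings(1000, 10, function(v) as.POSIXct(v, tz = "UTC"))
```

Calling it over a grid of `n_distinct` values (or bisecting on it) gives a rough picture of where the cached path stops paying off.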

eddelbuettel avatar eddelbuettel commented on June 2, 2024

I am mostly worried about overall behaviour and consistency.

This is a functional change, so my preference would be opt-in, that is to have an option defaulting to FALSE or off which one needs to turn on to get the behaviour you seek.

We could also keep it simple and just make it a demo and/or a new vignette for now.

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

In that context, I think calc_unique = FALSE is likely the way to go ... and then a short vignette and/or blog post to publicize the speedup.

I'm getting the hint that you'd like me to clone or fork the code and put in the changes and also at least draft the blog post in blogdown?
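
An opt-in flag of the kind discussed here could look roughly like the following. This is a minimal sketch under stated assumptions: `parse_maybe_unique` and its `parser` argument are hypothetical names, and base `as.POSIXct()` stands in for anytime() so the example runs without extra packages.

```r
# Hypothetical opt-in wrapper; `parser` stands in for anytime() or anydate().
parse_maybe_unique <- function(x, parser, calc_unique = FALSE, ...) {
    if (!calc_unique) {
        return(parser(x, ...))   # default: behaviour is unchanged
    }
    u_x <- unique(x)             # parse each distinct value once
    u_y <- parser(u_x, ...)
    u_y[match(x, u_x)]           # expand back to the original order
}

to_utc <- function(v) as.POSIXct(v, tz = "UTC")
stamps <- rep(c("2024-06-02 10:00:00", "2024-06-02 10:01:00"), times = c(4, 3))

plain  <- parse_maybe_unique(stamps, to_utc)
cached <- parse_maybe_unique(stamps, to_utc, calc_unique = TRUE)
```

Because the flag defaults to FALSE, existing callers see no functional change, which is the consistency concern raised earlier in the thread.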

eddelbuettel avatar eddelbuettel commented on June 2, 2024

:-)

I am still trying to think through how to best approach this, and it is not obvious. If you look at anytime.R, many/most/all of the paths end by eventually calling anytime_cpp. So we could do it there and hash etc. in C++. But then we'd reinvent match() and that is silly.

Or one just wraps an outer function around it that takes a vector, finds the unique values and maps them back to the original ones. Basically three or so data.table statements. Maybe we just do that?

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

It feels like you're thinking yourself in circles at this point ... give it a few days ... something will decide it for you.

A little later today, I'll present what I'm thinking in terms of code. Given the speed of unique() and match() (the base R team has been working hard on making these fast, methinks), there is little reason to recreate them in C ... plus I am incapable of doing that within reasonable time.

If you want to do it in C, I would take a look at the underlying source for those commands and see what C (hopefully not fortran) libraries they call upon in the first place.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Well, I don't want to add code to each S3 method. Maybe it will work at the respective .default methods.

But yes, your itch to scratch so please do make a proposal in code and then we see where we are at.

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

Here is the diff I propose for anytime.R ... I will add that my knowledge of S3 is just what I see here as an example, so take that with a grain of salt.

However, the only modification to each class is adding the calcUnique variable to the function definition. As we discussed, only the calls to anytime_cpp (or the internal call to anytime) are changed.

I haven't tested it just yet, and I can't commit to any testing until well into next week, but this is the outline of my proposal.

anytime_r_diff_calcUnique.txt

eddelbuettel avatar eddelbuettel commented on June 2, 2024

That's pretty good and concise. I was musing for a moment if one could get by with just one call to anytime_cpp(), but obviously one cannot: working on a reduced set is the whole point of this exercise :)

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Do you want to fork, commit your change and propose it as a pull-request?

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

It's my first rodeo and your repository ... so whichever you want.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

I am easy. If you want to "earn your first stripes" with an actual commit and PR, I am happy to walk you through. If that is more 'bah humbug' to you, I can also take your (clear enough) diff and apply it, and still give you credit in the ChangeLog etc. Happy to help either way.

eddelbuettel avatar eddelbuettel commented on June 2, 2024

That's part of a PR too :) The existing unit tests help, but it is considered very best style to add new tests for new functionality. I.e. we could mock something by having a vector with c(rep(x, 4), rep(y, 3)) and ensure that the first four results are identical, as are the last three, and that they are different from each other etc. pp.

No rush. anytime is in a good spot, but your suggested change will make it clearly better for a subset of users such as yourself who encounter lots of 'dupes' in the input.
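
The test idea sketched above can be written with base R assertions. This is only an illustration: `parse_unique` is a hypothetical stand-in built on `as.POSIXct()` rather than the package's own function, and the thread does not say which test framework the package uses, so plain `stopifnot()` is used here.

```r
# Hypothetical stand-in for the de-duplicating code path under test
parse_unique <- function(v) {
    u <- unique(v)
    as.POSIXct(u, tz = "UTC")[match(v, u)]
}

x <- "2024-06-02 10:00:00"
y <- "2024-06-03 11:30:00"
input <- c(rep(x, 4), rep(y, 3))
res   <- parse_unique(input)

stopifnot(
    length(res) == length(input),       # nothing dropped or added
    length(unique(res[1:4])) == 1L,     # first four results identical
    length(unique(res[5:7])) == 1L,     # last three results identical
    res[1] != res[5],                   # the two groups differ
    isTRUE(all.equal(res, as.POSIXct(input, tz = "UTC")))  # matches direct parsing
)
```

The last assertion is the important one: the cached path should be indistinguishable from parsing every element directly.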

eddelbuettel avatar eddelbuettel commented on June 2, 2024

Thanks for catching that. I usually remember to add a tag (fixes #xyz) to the commit message but forgot today.

And thanks again for the PR.

stephenbfroehlich avatar stephenbfroehlich commented on June 2, 2024

I'm happy to ... it was a great learning experience and demystified a lot for me.
