CRAN submission · uwot · 29 comments · closed

jlmelville avatar jlmelville commented on May 13, 2024 1
CRAN submission

Comments (29)

jlmelville avatar jlmelville commented on May 13, 2024 7

CRAN submission was just accepted: https://cran.r-project.org/package=uwot

Thanks to all in this thread for their contributions to getting this done.

jlmelville avatar jlmelville commented on May 13, 2024 4

Thanks to the pointer from @rstub, I have adopted PCG (linking via dqrng) as the default PRNG. Although it is faster than Boost, it is noticeably slower than what we had before (optimization is 10-15% slower with default settings on the MNIST digits), so I have added a parameter to get back the old behavior (pcg_rand = FALSE).

I have also added fast_sgd which, if set to TRUE, brings back the combination of approximations and thread settings that gives a substantially faster optimization; this works perfectly well for visualizations.
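
For reference, a minimal usage sketch of the two options mentioned above (assuming pcg_rand and fast_sgd are arguments to uwot::umap, as described; the toy data is just for illustration):

library(uwot)
X <- as.matrix(iris[, 1:4])               # toy data, stands in for real input

res_default <- umap(X)                    # default: PCG PRNG
res_old_rng <- umap(X, pcg_rand = FALSE)  # previous, faster PRNG behavior
res_fast    <- umap(X, fast_sgd = TRUE)   # approximations + extra threads, fine for visualization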

Next: my experiments with nearest neighbor descent have been disappointing so far: you get a better improvement in accuracy by using more accurate Annoy settings than by spending the same time refining nearest neighbor descent. Probably I am doing something wrong. I should probably remove this code and save it for a triumphant return at a later date.

Finally: I found and fixed a bug in the spectral initialization while generating the example images. Having now churned through those datasets successfully, #16 is probably all that prevents me from finally getting round to a CRAN submission. However, I haven't tested the cosine and Hamming distances as thoroughly. Example datasets (especially ones that can be used on the examples page and/or will uncover horrible bugs) would be welcome.

rstub avatar rstub commented on May 13, 2024 2

Note that PCG offers a nice convenience function as a fast replacement for uniform_int_distribution:
http://www.pcg-random.org/using-pcg-cpp.html#result-type-pcg32-operator-result-type-upper-bound

In principle I also plan to add something along these lines to dqrng itself (c.f.
http://www.pcg-random.org/posts/bounded-rands.html and https://github.com/daqana/dqrng/tree/feature/sample), but I haven't had the time to finish that.
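
As an R-level aside, here is a sketch of what that looks like from R, assuming a dqrng version in which the sampling functions from that branch have landed (dqsample.int is an assumption about the eventual API):

library(dqrng)
dqset.seed(42)
# Bounded draws without the bias a plain "% upper_bound" reduction can introduce
idx <- dqsample.int(1000, size = 10, replace = TRUE)
idx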

Concerning serialization you might look into https://cran.r-project.org/package=Rcereal and this example using an XPtr: https://stackoverflow.com/a/53157233/8416610.

LTLA avatar LTLA commented on May 13, 2024 1

Just to motivate this discussion: https://github.com/davismcc/scater/tree/uwot. So once uwot gets on CRAN, it'll have at least one reverse dependency straight away.

jlmelville avatar jlmelville commented on May 13, 2024 1

Hmm, this may not be related, but the AppVeyor Windows CI builds have started sporadically failing on a umap_transform test. I can't reproduce it on my machine or on rhub, and it works again whenever I restart the build, so I don't know what's going on there.

jlmelville avatar jlmelville commented on May 13, 2024 1

The CRAN checks have all updated to 0.1.3. There is an UBSAN issue, but it seems to be due to RcppParallel. The same issue occurs in the RcppParallel checks, so I don't think I can do anything about it.

I consider uwot successfully submitted to CRAN at last. Closing.

LTLA avatar LTLA commented on May 13, 2024

This package looks great. I can't believe it's not on CRAN yet. Is there a hold-up, and can I help?

jlmelville avatar jlmelville commented on May 13, 2024

Thanks for the kind words.

There's no hold-up on CRAN submission; I just don't have any plans to do so at the moment. Here's the roadmap/list of objections:

  1. There's already a UMAP package on CRAN: https://cran.r-project.org/package=umap.
  2. My previous experience with submitting to CRAN suggests that I will have to do a bit of justifying the package's existence (especially due to point 1).
  3. Packages with C++ code in them get extra attention (rightfully so). In good conscience, I don't want to submit without first having read and acted upon the Writing R Extensions guide, which -- spoiler alert -- I have not done.
  4. Relatedly, the largeVis package was removed from CRAN for memory violation issues. I don't know what the problem was exactly, but as UMAP uses a very similar SGD strategy, it would benefit from extra pairs of eyes from people with Rcpp experience, to check that uwot isn't doing something egregious, especially in how I pass around references to matrices. I may have just got lucky so far, in that I have enough RAM for the datasets I look at and/or no garbage collection has kicked in at an inopportune moment.

Basically, I've been very lazy, but I also want to be respectful of the CRAN maintainers' time if the submission runs into issues. However, this sort of reminder is a useful kick in the pants. If having to install via devtools is inconveniencing people, I will think more about CRAN submission. I'll leave this issue open.

LTLA avatar LTLA commented on May 13, 2024

1. and 2. I'm glad you brought this up. I don't think you'll have a problem there:

vals <- readRDS("pbmc68k.rds") # first 50 PCs of the 10X PBMC 68K data set
test <- vals[1:10000,]

library(umap)
system.time(X <- umap::umap(test, n_neighbors=15))
##    user  system elapsed 
##  74.370   3.949  79.067 

library(uwot)
system.time(Y <- uwot::umap(test, n_neighbors=15))
##    user  system elapsed 
##  22.079   0.560  12.822 

umap::umap does most of its heavy lifting in R, which explains the difference in speed.

3. and 4. I can probably help with that. I looked at your source code a while ago and it seemed alright. Does it go through the various platforms on rhub okay?

My motivation is to depend on uwot in my own packages, which requires uwot to be in CRAN (or Bioconductor); I can't pull it in via devtools.

jlmelville avatar jlmelville commented on May 13, 2024

I didn't know about rhub, thanks for the pointer! I have a lot of platform checking in my future, it seems.
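
For anyone following along, a hedged sketch of that workflow (the function names validate_email, platforms and check_for_cran are assumed from the rhub package; the exact API may differ between versions):

library(rhub)
validate_email()       # one-time token setup for the builder service
platforms()            # list the available check platforms
check_for_cran(".")    # run CRAN-like checks on the package in the working directory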

I'm sure I've committed several other crimes against good coding hygiene, but the two major bits of dirty laundry I need to air (and ask for opinions on) are:

  1. In uwot::uwot (in uwot.R), the optimization C++ code is entered with a call a bit like:
result <- optimize_layout_umap(
      head_embedding = embedding,
      tail_embedding = embedding,
      positive_head = positive_head,
      positive_tail = positive_tail,
      epochs_per_sample = epochs_per_sample, ...)

head_embedding and tail_embedding are passed by reference. So when optimize_layout_umap returns, embedding is modified, which is completely against the pass-by-value semantics of R function calls. I've done this because I want to avoid storing a copy of embedding, especially as we don't need the initial coordinates any more: I assign the result of optimize_layout_umap to embedding anyway.

I've got away with it so far (in the sense that it hasn't generated a segfault), but it seems like a complete no-no to me. Do I need to bite the bullet and ensure a copy is passed? (A toy illustration of the underlying hazard follows at the end of this comment.)

  2. The other thing I do that seems dodgy is that positive_head, positive_tail and epochs_per_sample end up on the C++ side (in optimize_layout_umap in optimize.cpp) as arma::uvec and arma::vec, passed by copy. These are then passed by reference into the SGDWorker where they are read from (but not written to) by multiple threads, without being wrapped in the thread-safe RcppParallel classes.

As far as I recall from some initial experiments comparing how RcppArmadillo vectors and Rcpp NumericVectors behave, this seemed safe, in that the data appeared to be genuinely copied and hence moved out of R's clutches, so it couldn't get garbage collected or moved under me. But I could be wrong. The alternatives I tried were real performance killers, unfortunately.

It feels good to have confessed to these sins.
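
As an aside on point 1, here is a self-contained toy (not uwot's actual code; the function is purely illustrative) showing how a C++ routine that works directly on an Rcpp::NumericMatrix writes through to the caller's matrix, breaking R's usual copy semantics. Whether uwot's Armadillo-based entry points behave the same way depends on how the conversion is set up, which is exactly the uncertainty above.

library(Rcpp)

cppFunction('
void scale_in_place(NumericMatrix x) {
  // NumericMatrix is a thin proxy over the memory of the R object,
  // so these writes land in the caller object, not in a copy
  for (int i = 0; i < x.size(); i++) x[i] = 2.0 * x[i];
}
')

m <- matrix(1.0, nrow = 2, ncol = 2)
scale_in_place(m)
m   # every entry is now 2: the R-level pass-by-value illusion is broken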

LTLA avatar LTLA commented on May 13, 2024

Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, I wouldn't worry about it given that the optimize_layout_umap function is only called once anyway.

2 is harder for me to tell, I usually use OpenMP to do my parallelization. My guess is that you should be fine for read-only, you should only need thread safety for writes. It shouldn't even compile if the objects were being modified... in theory, at least, who knows what's going on in there.

More generally, I wonder if you really need Armadillo for what you're doing, it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo so I might have missed them.)

I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16). I don't see any parallelReduce so you should be okay... but I wonder whether or not it would be simpler to use OpenMP instead.

Both of these last two points are mostly a matter of taste, so ignore them as you please; that's just where my experience tends to be concentrated.

jlmelville avatar jlmelville commented on May 13, 2024

> Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, I wouldn't worry about it given that the optimize_layout_umap function is only called once anyway.

1 is now fixed. It wasn't the speed of allocation that bothered me, but making extra copies of potentially large matrices. This is at best unnecessary, and at worst, for standard UMAP, the results come out wrong if the code ends up working on two separate copies of the input coordinates. But despite the best efforts of the compiler to conspire against me, I have emerged victorious. Probably.

> More generally, I wonder if you really need Armadillo for what you're doing, it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo so I might have missed them.)

I don't use Armadillo for linear algebra, just for its convenient row and column subsetting, vectorized functions, and so on. I assume you are suggesting that I can just do what Rtsne does and work with double* and for loops directly? This certainly has an appeal to me (and makes me nostalgic for my Fortran 77 days), but I find the multidimensional pointer arithmetic error-prone, hard to debug and hard to understand, so it might take me a while to muster the enthusiasm.

> I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16)

That is an interesting read. Looks like the reduce occurs during the normalization to form the Q matrix? Fortunately UMAP is un-normalized so that never has to happen.

jlmelville avatar jlmelville commented on May 13, 2024

Number 3 on my TODO list: for the spectral initialization, with what I assume are very poorly conditioned input matrices, RSpectra sometimes seems to hang (or at least take an unacceptably long time; I've never stuck around long enough to find out how long that is). I need to see if there's a way to get it to bail out earlier, or whether there's a better way to find the small eigenvalues needed for the normalized Laplacian.
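
A hedged sketch (a toy graph, not uwot's actual initialization code) of one way to make RSpectra fail fast instead of apparently hanging: use shift-invert to target the smallest eigenvalues of the normalized Laplacian and cap the iteration count via opts, then check how many eigenpairs actually converged.

library(RSpectra)
set.seed(42)
n <- 500
A <- matrix(runif(n * n), n, n)
A <- (A + t(A)) / 2                                           # toy symmetric affinity matrix
d <- rowSums(A)
L <- diag(n) - diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))  # normalized Laplacian

res <- eigs_sym(L, k = 3, which = "LM", sigma = -1e-3,        # shift-invert just below zero
                opts = list(maxitr = 500, tol = 1e-4))        # bail out early if it struggles
res$nconv   # fewer than k converged eigenpairs signals trouble instead of a hang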

LTLA avatar LTLA commented on May 13, 2024

Though I guess CRAN doesn't track Bioc revdeps, so you wouldn't see them on the landing page. But @jlmelville can use that as an argument for getting CRAN to accept the package.

jlmelville avatar jlmelville commented on May 13, 2024

This is probably blocked until I at least understand what's causing the Mac issues with #1.

jlmelville avatar jlmelville commented on May 13, 2024

Other things to fix:

- [ ] Avoid writing the Annoy index to disk during multi-threaded NN search.
      CRAN policy allows writing to the R session temp directory, so this is OK (see the sketch below): https://github.com/eddelbuettel/crp/blob/4dec483eadf5b642726a54099322a6e161392e49/texi/CRAN_policies.texi#L255
- [ ] Can't use the C++11 RNG.
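
For the first item above, a minimal sketch of the policy-compliant approach: keep the Annoy index under the session temp directory and clean up afterwards (the actual saving and loading of the index is elided here).

index_file <- tempfile(fileext = ".annoy")   # lives under tempdir(), session-specific
# ... write the Annoy index to index_file, let each thread open it read-only ...
unlink(index_file)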

LTLA avatar LTLA commented on May 13, 2024

I was going to say that if you can't use <random>, you could use boost/random.hpp or something from sitmo or dqrng, which would save you from having to bundle your own PRNG with the package. But I guess it doesn't really matter if you're happy with the current PRNG.

I should also add that:

some_rng() % tail_nvert

... is technically not correct (in terms of producing a uniform range in [0, tail_nvert)) unless the largest unsigned integer that some_rng() can produce is N*tail_nvert - 1 for some integer N. A more correct approach would be to use std::uniform_int_distribution (or its boost equivalent). This may or may not matter depending on the range of some_rng() and the size of tail_nvert, but why take the chance?
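
A tiny worked example of that bias: for a hypothetical RNG whose outputs are uniform on 0..7, reducing with "% 3" makes 0 and 1 more likely than 2, because 8 is not a multiple of 3.

table(0:7 %% 3)
## 0 1 2 
## 3 3 2 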

jlmelville avatar jlmelville commented on May 13, 2024

For now, I have tried to stick to copying how the Python UMAP implementation works. There is a slight complication that the Python % operator works differently to the C++ version, but it doesn't seem to make a big difference in practice.
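
For what it's worth, R's %% follows the Python convention (the result takes the sign of the divisor), while C++'s % truncates toward zero, so the two only differ when an operand is negative:

-7 %% 3   # gives 2 in R and Python; the C++ expression -7 % 3 gives -1
## [1] 2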

At some point I may take a look at pulling in Boost (although does anyone know why the Boost RNG is OK to use if the C++11 RNGs aren't?) to generate uniformly distributed integers the correct way, and see what effect it has on the speed of UMAP's optimization. It's not very high on my priority list at the moment, unless something demonstrably horrible is occurring with the current way of doing things.

khughitt avatar khughitt commented on May 13, 2024

Package may also be used in dimRed once it's available on CRAN.

LTLA avatar LTLA commented on May 13, 2024

I was just thinking about this - we should probably get the ball rolling again.

@jlmelville, regarding Boost's random headers: the major difference (for me) is that the C++ <random> header does not use the same algorithm in its distribution functions across platforms. The standard only mandates the output of the PRNGs themselves, not how the random number stream is converted into the variates that we actually use. For example, std::normal_distribution has to produce normally distributed values, but it doesn't have to produce the same series of normally distributed values under Clang and GCC.

This is a real problem, because I ran into it! I don't understand why the C++ standard was written like that; it basically makes <random> unusable for scientific computing, where a result must be reproducible across systems. That's why I only use boost/random.hpp rather than <random>.

If we were to follow the spirit of the law, I would guess that we should not use Boost's random headers either. But there are some real limitations of R's C API for random number generation (see, for example, my comments in daqana/dqrng#11), and the best solution in such cases would be Boost.

jlmelville avatar jlmelville commented on May 13, 2024

Thanks for explaining the difference between C++11 and Boost's random headers, @LTLA, that makes sense. I am open to considering either Boost or https://github.com/daqana/dqrng.

I have not abandoned preparing this for CRAN, I have just got distracted with other things. Mainly, I am still a bit dissatisfied with the speed of the nearest neighbor calculations. I have created https://github.com/jlmelville/rnndescent to implement the nearest neighbor descent method used by the Python UMAP (admittedly it also does a different initialization) and by LargeVis, but that is still a work in progress.

Also, #16 (spectral initialization basically taking forever) is still not satisfactorily resolved to my liking. Will I have to implement a LOBPCG routine for R? I haven't checked how painful that will be.

Finally, #19 required adding custom save/load functions. This feels like the sort of thing that shouldn't be necessary, but I don't know whether you can write custom serialization hooks for C++ objects that are wrapping external data. This might just be the way it is, and it's a non-issue.
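
For the record, a hedged sketch of the resulting save/load round trip (the names ret_model, save_uwot, load_uwot and umap_transform are taken to be the package's API; treat the details as illustrative):

library(uwot)
model <- umap(as.matrix(iris[, 1:4]), ret_model = TRUE)

# saveRDS alone is not enough: the Annoy index lives outside R's own serialization
model_file <- tempfile("uwot_model_")
save_uwot(model, file = model_file)

model2 <- load_uwot(model_file)
emb <- umap_transform(as.matrix(iris[1:10, 1:4]), model2)
unlink(model_file)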

LTLA avatar LTLA commented on May 13, 2024

Cool - I'll check out the compatibility with the uwot branch at https://github.com/davismcc/scater.

LTLA avatar LTLA commented on May 13, 2024

You'll get a revdep as soon as scater re-builds on the BioC machines (probably in a day or two).

jlmelville avatar jlmelville commented on May 13, 2024

Does anyone else think the 'hooray' emoji looks like a taco shell with some lettuce sticking out the end?

Anyway, don't get too carried away with those celebratory salad tacos, because of course I managed to crash the R session mere moments after I received the email from CRAN. Fundamentally, the issue is because of spotify/annoy#378 (can't read Annoy indexes > 2GB), but the real blame lies with me for failing to check that Annoy actually returns k neighbors when you ask for them.
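
A toy check of the failure mode described above, using RcppAnnoy directly rather than uwot's internal search code: verify that the index really hands back the k neighbors you asked for instead of assuming it.

library(RcppAnnoy)
set.seed(1)
ann <- new(AnnoyEuclidean, 10)                   # 10-dimensional items
for (i in 1:100) ann$addItem(i - 1, rnorm(10))   # Annoy items are 0-indexed
ann$build(5)                                     # 5 trees

nn <- ann$getNNsByItem(0, 15)
length(nn) == 15   # FALSE here would be the silent under-delivery uwot ran into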

There will therefore be a patch release (hopefully 0.1.3) coming ASAP, assuming the maintainers don't ban me or throw uwot off CRAN. It also attempts to fix an ASAN/UBSAN/valgrind issue, although I think most of it seems to be due to internal code used by RcppParallel.

LTLA avatar LTLA commented on May 13, 2024

Well, if it's any consolation, I also have an inexplicable failure on one of the UMAP-related tests in scater on the Windows 32 build machines. Probably something to do with differences in numerical precision between FNN and BiocNeighbors during the nearest neighbor calculations ... whatever, I'm skipping it.

jlmelville avatar jlmelville commented on May 13, 2024

Patch release 0.1.3 has hit CRAN, so the immediate panic is over.

The valgrind and related checks haven't been updated yet. If they aren't clean, I have been told they will need fixing "quickly", so I have that to look forward to also.
