jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.

Home Page: https://jlmelville.github.io/uwot/

License: GNU General Public License v3.0

Languages: R 80.66%, C++ 19.34%

Topics: r, umap, dimensionality-reduction

uwot's Introduction

uwot

An R implementation of the Uniform Manifold Approximation and Projection (UMAP) method for dimensionality reduction of McInnes et al. (2018). Also included are the supervised and metric (out-of-sample) learning extensions to the basic method. Translated from the Python implementation.

News

April 18 2024: Version 0.2.1 of uwot has been released to CRAN. Some features to be aware of: RcppHNSW and rnndescent are now supported as optional dependencies. If you install and load them, you can use them as alternatives to RcppAnnoy for the nearest neighbor search, and they should be faster. Also, a new umap2 function has been added, with updated defaults compared to umap. Please see the updated and new articles on HNSW, rnndescent, working with sparse data and umap2. I consider this worthy of moving from 0.1.x to 0.2.x, but in the interests of full disclosure, ongoing irlba problems have caused a CRAN check failure, so we might be onto 0.2.2 sooner than I'd like.

Installing

From CRAN

install.packages("uwot")

From GitHub

uwot makes use of C++ code which must be compiled. You may have to carry out a few extra steps before being able to build this package:

Windows: install Rtools and ensure C:\Rtools\bin is on your path.

Mac OS X: using a custom ~/.R/Makevars may cause linking errors. This sort of thing is a potential problem on all platforms but seems to bite Mac owners more. The R for Mac OS X FAQ may be helpful here to work out what you can get away with. To be on the safe side, I would advise building uwot without a custom Makevars.

install.packages("devtools")
devtools::install_github("jlmelville/uwot")

Example

library(uwot)

# umap2 is a version of the umap() function with better defaults
iris_umap <- umap2(iris)

# but you can still use the umap function (which most of the existing 
# documentation does)
iris_umap <- umap(iris)

# Load mnist from somewhere, e.g.
# devtools::install_github("jlmelville/snedata")
# mnist <- snedata::download_mnist()

mnist_umap <- umap(mnist, n_neighbors = 15, min_dist = 0.001, verbose = TRUE)
plot(
  mnist_umap,
  cex = 0.1,
  col = grDevices::rainbow(n = length(levels(mnist$Label)))[as.integer(mnist$Label)] |>
    grDevices::adjustcolor(alpha.f = 0.1),
  main = "R uwot::umap",
  xlab = "",
  ylab = ""
)

# I recommend the following optional packages
# for faster or more flexible nearest neighbor search:
install.packages(c("RcppHNSW", "rnndescent"))
# for faster spectral initialization
install.packages("RSpectra")

# Installing RcppHNSW will allow the use of the usually faster HNSW method:
mnist_umap_hnsw <- umap(mnist, n_neighbors = 15, min_dist = 0.001, 
                        nn_method = "hnsw")
# nndescent is also available
mnist_umap_nnd <- umap(mnist, n_neighbors = 15, min_dist = 0.001, 
                       nn_method = "nndescent")
# umap2 will choose HNSW by default if available
mnist_umap2 <- umap2(mnist)

[Figure: MNIST UMAP]

Documentation

https://jlmelville.github.io/uwot/. For more examples see the get started doc. There are plenty of articles describing various aspects of the package.

License

GPLv3 or later.

Citation

If you want to cite the use of uwot, then use the output of running citation("uwot") (you can do this with any R package).

uwot's People

Contributors

eddelbuettel, james-vivodyne, jlmelville, khughitt, ltla, sirusb, ttriche

uwot's Issues

Sample weights

I am curious whether there is a way to give individual observations different weights in the UMAP objective function. For instance, I have data from 2 conditions, one with 100 observations and one with 1000. I would like both conditions to contribute equally to the embedding. Perhaps naively, I would expect observations from each condition to take up the same amount of real estate in this balanced analysis. I would appreciate any thoughts on how feasible this would be. Thanks in advance!

fail to install uwot on linux server due to path fault

Hi James,

I can install it without any problem on my workstation (Windows 10). However, I can't on our Linux server running R 3.6.0. It seems the LIB path isn't defined in the "Makevars" file and the default path "/usr/lib/R/lib" can't be found.

I'd appreciate any feedback!

Yu

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] RcppAnnoy_0.0.12 RcppParallel_4.4.3 devtools_2.1.0 usethis_1.5.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 rstudioapi_0.10 magrittr_1.5 pkgload_1.0.2 R6_2.4.0 rlang_0.4.0 tools_3.6.0 pkgbuild_1.0.5 sessioninfo_1.1.1 cli_1.1.0 withr_2.1.2 remotes_2.1.0
[13] assertthat_0.2.1 digest_0.6.20 rprojroot_1.3-2 crayon_1.3.4 processx_3.4.1 callr_3.3.1 codetools_0.2-16 fs_1.3.1 ps_1.3.0 curl_4.0 testthat_2.2.1 memoise_1.1.0
[25] glue_1.3.1 compiler_3.6.0 desc_1.2.0 backports_1.1.4 prettyunits_1.0.2

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

devtools::install_github("jlmelville/uwot")
Downloading GitHub repo jlmelville/uwot@master
✔ checking for file ‘/tmp/Rtmpr1KisN/remotes81ee6859d9ce/jlmelville-uwot-7418141/DESCRIPTION’ ...
─ preparing ‘uwot’:
✔ checking DESCRIPTION meta-information ...
─ cleaning src
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
─ building ‘uwot_0.1.3.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'

Installing package into ‘/scratch/TBI/Softwares/R-3.6/Packages’
(as ‘lib’ is unspecified)
[1] "Working in R Studio, setting library path for R 3.6.0"

  • installing source package ‘uwot’ ...
    ** using staged installation
    ** libs
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c connected_components.cpp -o connected_components.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c gradient.cpp -o gradient.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c nn_parallel.cpp -o nn_parallel.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c optimize.cpp -o optimize.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c perplexity.cpp -o perplexity.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c sampler.cpp -o sampler.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c smooth_knn.cpp -o smooth_knn.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c supervised.cpp -o supervised.o
    g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I"/scratch/TBI/Softwares/R-3.6/Packages/Rcpp/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppProgress/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppParallel/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/RcppAnnoy/include" -I"/scratch/TBI/Softwares/R-3.6/Packages/dqrng/include" -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c transform.cpp -o transform.o
    g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o [1] Working in R Studio, setting library path for R 3.6.0 -L/usr/lib/R/lib -lR
    g++: error: [1]: No such file or directory
    g++: error: Working in R Studio, setting library path for R 3.6.0: No such file or directory
    /usr/share/R/share/make/shlib.mk:6: recipe for target 'uwot.so' failed
    make: *** [uwot.so] Error 1
    ERROR: compilation failed for package ‘uwot’
  • removing ‘/scratch/TBI/Softwares/R-3.6/Packages/uwot’
  • restoring previous ‘/scratch/TBI/Softwares/R-3.6/Packages/uwot’
    Error: Failed to install 'uwot' from GitHub:
    (converted from warning) installation of package ‘/tmp/Rtmpr1KisN/file81ee6e7324fc/uwot_0.1.3.tar.gz’ had non-zero exit status
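
The failing link line above contains the text "[1] Working in R Studio, setting library path for R 3.6.0", which is the form R's print() writes to standard output. One plausible cause (an assumption, since the server's startup files are not shown in the report) is a print() call in a site or user profile such as .Rprofile, whose output gets captured along with the linker flags during the build. A minimal sketch of a quieter startup notice follows; startup_note is a hypothetical helper name, not part of R or uwot:

```r
# Hypothetical helper for a startup file such as ~/.Rprofile.
# message() writes to stderr rather than stdout, so the text is not
# picked up by tools that capture standard output. The interactive()
# guard additionally keeps non-interactive runs (such as a package
# installation) silent.
startup_note <- function(text) {
  if (interactive()) {
    message(text)
  }
  invisible(NULL)
}

startup_note("Working in R Studio, setting library path for R 3.6.0")
```

If this is indeed the cause, temporarily renaming the profile file and retrying the install would be a quick way to confirm it.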

Error Loading

Thanks for this implementation, really looking forward to having a native R/Rcpp implementation to use on my big datasets!

The package seems to install fine but then there is a problem loading the shared object. I am running this on R3.5. Do you know what this could be?

Error: package or namespace load failed for ‘uwot’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so, 6): Symbol not found: __ZN13umap_gradient8clip_maxE
Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so
Expected in: flat namespace
in /Library/Frameworks/R.framework/Versions/3.5/Resources/library/uwot/libs/uwot.so
Error: loading failed
Execution halted
ERROR: loading failed

Integrating HSNE?

Hi,
Thanks for your invaluable work proposing umap in a native R package. Thanks also for the smallvis package.
As you have already integrated largeVis and proposed tumap, I am wondering if you plan to integrate HSNE someday. HDI is already integrated in interactive exploration tools such as cytosplore, but no R package is available. If you plan to, let me know.
Best.

Unable to run umap with n_components = 1

Hello!

I'm having some trouble running UMAP with one dimension due to an error that pops up saying

Error in optimize_layout_umap(head_embedding = embedding, tail_embedding = embedding,  : 
  Not a matrix.

I'm assuming that this is because the data being passed into the function is an atomic vector rather than a matrix due to it being one dimension? Perhaps this would be solved by using something like drop = FALSE or something similar?
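
The drop = FALSE suggestion refers to a base R behaviour: subsetting a matrix down to a single column returns a plain vector unless drop = FALSE is given. A small base R illustration (this is not uwot's internal code, and whether this is the actual cause of the error is the reporter's guess):

```r
# A 3 x 2 matrix standing in for an embedding.
embedding <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

# Selecting one column drops the dim attribute: the result is a vector.
dropped <- embedding[, 1]
print(is.matrix(dropped))

# drop = FALSE keeps the result as a one-column matrix.
kept <- embedding[, 1, drop = FALSE]
print(is.matrix(kept))
print(dim(kept))
```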

Thank you.

Return UMAP graph?

Hello! Thank you for writing such a useful package. It is great not to have to switch between python and R to use umap :).

I was wondering: is it possible to output the graph (i.e. the fuzzy simplicial set) that is an intermediate step in the UMAP projection?

In the original python implementation, I obtained this using the function:

umap.umap_.fuzzy_simplicial_set

I have found that this graph has several nice properties, and can be used to cluster data directly using graphical clustering methods.

Tom

Reproducibility issues

First of all, thanks for this package. It is very convenient to have a pure R implementation of UMAP which is fast and reliable!

I've run into a small and slightly strange reproducibility problem. To get the same UMAP results twice, I call set.seed before calling umap, and locally on my machine it works well:

 set.seed(82223)
 umap <- uwot::umap(USArrests)

The problem is, sometimes if I run the same code on another machine, for example during tests for a package CRAN check, the test fails because the results are different.

I've tried several things: checking that the uwot versions are the same, and even checking that set.seed gives the same sequence of random numbers on every machine. It does, but when I compare the umap results they are different.

I'm not sure I'm completely clear here... But if you have any idea on why this could happen, I'd be glad to hear it :-)

segfault cause 'memory not mapped' when using option n_sgd_threads>1 on Conda install R

Hello,

I'm getting *** caught segfault *** address 0xfffffffffffffff7, cause 'memory not mapped' from each thread when I use umap() with the n_sgd_threads option, as soon as the log reaches the "Commencing optimization" step.

Here is the command

sentence_umap <- umap(X = corp_sentence_nda, pca=150, n_neighbors = 15, n_components = 3, ret_model = TRUE, verbose = TRUE, n_threads = 40, approx_pow = TRUE, n_sgd_threads=2)

Here is the log

09:58:17 UMAP embedding parameters a = 1.896 b = 0.8006
09:58:27 Read 215213 rows and found 768 numeric columns
09:58:27 Reducing X column dimension to 150 via PCA
10:01:53 PCA: 150 components explained 85.29% variance
10:01:53 Using Annoy for neighbor search, n_neighbors = 15
10:01:54 Building Annoy index with metric = euclidean, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
10:03:06 Writing NN index file to temp file /tmp/RtmpJ0QbYw/file81c79b6196a
10:03:07 Searching Annoy index using 40 threads, search_k = 1500
10:04:42 Annoy recall = 67.79%
10:04:42 Commencing smooth kNN distance calibration using 40 threads
10:04:42 103918 smooth knn distance failures
10:04:50 Found 498 connected components, falling back to 'spca' initialization with init_sdev = 1
10:04:50 Initializing from scaled PCA
10:04:51 Commencing optimization for 200 epochs, with 4880866 positive edges using 2 threads

 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: RcppParallel::setThreadOptions(numThreads = n_sgd_threads)
 2: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,     metric = metric, n_epochs = n_epochs, alpha = learning_rate,     scale = scale, init = init, init_sdev = init_sdev, spread = spread,     min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,     local_connectivity = local_connectivity, bandwidth = bandwidth,     gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,     a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,     method = "umap", approx_pow =approx_pow, n_threads = n_threads,     n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,     target_n_neighbors = target_n_neighbors, target_weight = target_weight,     target_metric = target_metric, pca = pca, pca_center = pca_center,     pcg_rand = pcg_rand, fast_sgd = fast_sgd, ret_model = ret_model,     ret_nn = ret_nn, tmpdir = tempdir(), verbose = verbose)
 3: umap(X = corp_sentence_nda, pca = 150, n_neighbors = 15, n_components = 3,     ret_model = TRUE, verbose = TRUE, n_threads = 40,  approx_pow = TRUE,     n_sgd_threads = 2)
 4: system.time(sentence_umap <- umap(X = corp_sentence_nda, pca = 150,     n_neighbors = 15, n_components = 3, ret_model = TRUE, verbose = TRUE,     n_threads = 40, approx_pow= TRUE, n_sgd_threads = 2))

The machine I'm running on provides the following cores to Rcpp :

> RcppParallel::defaultNumThreads()
[1] 48

(that's the reason why I would love to benefit from n_sgd_threads > 1)

Note that changing from approx_pow = TRUE to approx_pow = FALSE has no effect and produces the same segfault.

Here is the sessionInfo() I'm using (note that uwot is the GitHub master version, not the CRAN one; both behave the same):

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] reticulate_1.14 uwot_0.1.6      Matrix_1.2-18

loaded via a namespace (and not attached):
[1] compiler_3.6.1     rappdirs_0.3.1     Rcpp_1.0.3         grid_3.6.1
[5] jsonlite_1.6.1     RcppParallel_4.4.4 lattice_0.20-38

Thanks for your help, and for this fantastic package.

Error installing uwot under Microsoft R

Hello,

I'm trying to install in this environment:
CentOS 7.6.1810 - gcc 4.8.5
Microsoft R 3.5.1
but I receive this error:

devtools::install_github("jlmelville/uwot")
Downloading GitHub repo jlmelville/uwot@master
from URL https://api.github.com/repos/jlmelville/uwot/zipball/master
Installing uwot
'/opt/microsoft/ropen/3.5.1/lib64/R/bin/R' --no-site-file --no-environ
--no-save --no-restore --quiet CMD INSTALL
'/tmp/RtmpWTJkQy/devtools1bf444ad5fb01/jlmelville-uwot-05e3d4e'
--library='/opt/microsoft/ropen/3.5.1/lib64/R/library' --install-tests

  • installing source package ‘uwot’ ...
    ** libs
    g++ -std=gnu++11 -I/opt/microsoft/ropen/3.5.1/lib64/R/include -DNDEBUG -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/Rcpp/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppProgress/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppAnnoy/include" -DU_STATIC_IMPLEMENTATION -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -DU_STATIC_IMPLEMENTATION -g -O2 -c RcppExports.cpp -o RcppExports.o
    g++ -std=gnu++11 -I/opt/microsoft/ropen/3.5.1/lib64/R/include -DNDEBUG -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/Rcpp/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppProgress/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppAnnoy/include" -DU_STATIC_IMPLEMENTATION -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -DU_STATIC_IMPLEMENTATION -g -O2 -c connected_components.cpp -o connected_components.o
    g++ -std=gnu++11 -I/opt/microsoft/ropen/3.5.1/lib64/R/include -DNDEBUG -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/Rcpp/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppProgress/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppAnnoy/include" -DU_STATIC_IMPLEMENTATION -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -DU_STATIC_IMPLEMENTATION -g -O2 -c gradient.cpp -o gradient.o
    g++ -std=gnu++11 -I/opt/microsoft/ropen/3.5.1/lib64/R/include -DNDEBUG -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/Rcpp/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppProgress/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppAnnoy/include" -DU_STATIC_IMPLEMENTATION -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -DU_STATIC_IMPLEMENTATION -g -O2 -c nn_descent.cpp -o nn_descent.o
    g++ -std=gnu++11 -I/opt/microsoft/ropen/3.5.1/lib64/R/include -DNDEBUG -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/Rcpp/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppProgress/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include" -I"/opt/microsoft/ropen/3.5.1/lib64/R/library/RcppAnnoy/include" -DU_STATIC_IMPLEMENTATION -DRCPP_PARALLEL_USE_TBB=1 -DSTRICT_R_HEADERS -fpic -DU_STATIC_IMPLEMENTATION -g -O2 -c nn_parallel.cpp -o nn_parallel.o
    nn_parallel.cpp: In function ‘Rcpp::List annoy_hamming_nns(const string&, const NumericMatrix&, std::size_t, std::size_t, std::size_t, bool)’:
    nn_parallel.cpp:129:31: error: ‘Hamming’ was not declared in this scope
    NNWorker<int32_t, uint64_t, Hamming, Kiss64Random>
    ^
    nn_parallel.cpp:129:52: error: template argument 3 is invalid
    NNWorker<int32_t, uint64_t, Hamming, Kiss64Random>
    ^
    nn_parallel.cpp:130:11: error: invalid type in declaration before ‘(’ token
    worker(index_name, mat, dist, idx, ncol, n, search_k);
    ^
    nn_parallel.cpp:130:57: error: expression list treated as compound expression in initializer [-fpermissive]
    worker(index_name, mat, dist, idx, ncol, n, search_k);
    ^
    nn_parallel.cpp:132:56: error: invalid initialization of reference of type ‘RcppParallel::Worker&’ from expression of type ‘int’
    RcppParallel::parallelFor(0, nrow, worker, grain_size);
    ^
    In file included from nn_parallel.cpp:3:0:
    /opt/microsoft/ropen/3.5.1/lib64/R/library/RcppParallel/include/RcppParallel.h:30:13: error: in passing argument 3 of ‘void RcppParallel::parallelFor(std::size_t, std::size_t, RcppParallel::Worker&, std::size_t)’
    inline void parallelFor(std::size_t begin, std::size_t end,
    ^
    make: *** [nn_parallel.o] Error 1
    ERROR: compilation failed for package ‘uwot’
  • removing ‘/opt/microsoft/ropen/3.5.1/lib64/R/library/uwot’
    Installation failed: Command failed (1)

I've googled but couldn't find any solutions. I hope you can help with my issue.
Regards,
Fabio

Potential optimization opportunity in supervised.cpp

Looking at:

uwot/src/supervised.cpp

Lines 71 to 75 in 3e359a3

for (auto k = indptr1[i]; k < indptr1[i + 1]; k++) {
  if (indices1[k] == j) {
    left_val = data1[k];
  }
}

This seems like it could be replaced by a binary search, assuming that the inputs represent slots from a dgCMatrix; entries of i should always be sorted within each column specified by p.

    auto left_end=indices1.begin() + indptr1[i + 1];
    auto left_it=std::lower_bound(indices1.begin() + indptr1[i], left_end, j);
    double left_val = (left_it!=left_end && *left_it==j ? data1[left_it - indices1.begin()] : left_min);

This should be faster for any decently sized input matrix where you're getting >100 non-zero entries in each column (I don't know if this is particularly common?), and saves two lines as well.

library(uwot)
library(Matrix)

X <- rsparsematrix(10000, 10000, 0.1)
Y <- rsparsematrix(10000, 10000, 0.1)
Z <- as(X + Y, 'dgTMatrix')

system.time({
uwot:::general_sset_intersection_cpp(
    X@p, X@i, X@x,
    Y@p, Y@i, Y@x, 
    Z@i, Z@j, Z@x)
})
##    user  system elapsed 
##  19.838   0.000  19.843 

system.time({
uwot:::general_sset_intersection_cpp2( # modified as above
    X@p, X@i, X@x,
    Y@p, Y@i, Y@x, 
    Z@i, Z@j, Z@x)
})
##    user  system elapsed 
##    1.43    0.00    1.43 

(Not that I have any concerns about speed; I was trawling through the code for other reasons and just happened to notice this. Just something to consider.)
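
For readers more at home in R than C++, the same idea can be sketched in plain R: within one column of a dgCMatrix the row indices are sorted, so a lookup can halve the search range at each step instead of scanning every entry. The function names and toy data below are made up for illustration; this is not the code in supervised.cpp.

```r
# Linear scan: inspect every stored row index in the column.
lookup_linear <- function(indices, values, j, default) {
  result <- default
  for (k in seq_along(indices)) {
    if (indices[k] == j) {
      result <- values[k]
    }
  }
  result
}

# Binary search: relies on `indices` being sorted in increasing order.
lookup_binary <- function(indices, values, j, default) {
  lo <- 1L
  hi <- length(indices)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (indices[mid] == j) {
      return(values[mid])
    } else if (indices[mid] < j) {
      lo <- mid + 1L
    } else {
      hi <- mid - 1L
    }
  }
  default
}

# Toy column: sorted zero-based row indices and their stored values.
idx <- c(0L, 3L, 7L, 12L)
vals <- c(0.5, 0.25, 1.0, 0.75)

print(lookup_binary(idx, vals, 7L, default = -1)) # stored entry
print(lookup_binary(idx, vals, 5L, default = -1)) # missing entry, falls back to default
```

Both helpers return the same answer for any query; the binary version needs on the order of log2(n) comparisons per lookup rather than n, which is where a speed-up on dense columns would come from.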

umap_transform function

Could you point me to something that explains in general terms how umap_transform embeds new data? For example, what information from the embedding of the initial set is used, and what is the nature of the objective function being minimized? I have gone through the R code for the function but I'm not getting it.

I have read (and generally understand) the "How UMAP works" description at https://umap-learn.readthedocs.io/en/latest/how_umap_works.html, so I have the basic idea of how the embedding of the initial set of data is done.

Any help would be appreciated!

metric = "cosine" bug

Thanks for developing this!

The umap function appears to have a bug when the 'metric = "cosine"' option is invoked. I get the following error:

Error in search_nn_func(index_file, X, k, search_k, grain_size = grain_size, :
vector::_M_range_insert

However, if I use 'metric = "manhattan"' or leave it at the default, it works just fine.

Best.

Set random seed for umap_transform()

There is a discussion over at the python UMAP repo where it is mentioned that the python equivalent of uwot::umap_transform includes an option to set a random number seed internally:

The transform function should now be consistent in the transformation (via a fixed transform seed which you can pick on instantiation if you wish).

The current behavior of umap_transform is, for example:

iris_umap <- umap(iris, ret_model = TRUE)
iris_umap2 <- umap_transform(iris, model = iris_umap)

Which will yield something like:

head(iris_umap$embedding)
[,1] [,2]
[1,] 4.200771 8.554187
[2,] 3.545428 6.722146
[3,] 3.172449 7.146079
[4,] 3.298905 7.069255
[5,] 4.084401 8.499889
[6,] 5.120863 9.330461

head(iris_umap2)
[,1] [,2]
[1,] 4.069650 8.825538
[2,] 3.723976 6.704192
[3,] 3.110911 7.511822
[4,] 3.286992 6.673110
[5,] 3.893319 8.772491
[6,] 5.126285 9.693930

The results are close but not exactly the same, due to the stochastic nature of the UMAP algorithm.

However, a relevant (and, I think, potentially common) use case is when umap() is used to reduce the dimensionality of a set of X predictor variables and a model is then trained on the embedding. If new observations (X2) become available for which we wish to make predictions, those values need to be deterministically translated into the embedding space (using the same random number generation as the original UMAP calculations). Even relatively small differences in how the "natural" X values are translated to the embedded space could cause identical X and X2 observations to generate different predictions from subsequent models.

How difficult would it be to enable a random seed to be set (and returned) for umap and then later passed to umap_transform?
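
Independent of any package support, the pattern being asked for can be sketched in base R: fix the global seed immediately before each stochastic call, then restore the caller's RNG state. In the sketch below, noisy_projection is a stand-in for a call such as umap_transform, and with_fixed_seed is a hypothetical helper. Whether umap_transform is fully determined by R's seed (for instance when several threads are used) is an assumption that would need checking.

```r
# Run fn() with a fixed seed, then restore the previous RNG state.
with_fixed_seed <- function(seed, fn) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (has_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(list = ".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  fn()
}

# Stand-in for a stochastic transform of new observations.
noisy_projection <- function() rnorm(3)

first <- with_fixed_seed(42, noisy_projection)
second <- with_fixed_seed(42, noisy_projection)
print(identical(first, second))
```

Storing the seed value alongside the fitted model would then be enough to repeat the same transformation later.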

Installation error

Hello,

I am trying to install uwot on our RStudio Server and I am getting an error I cannot decipher (google cannot do that either...).

In file included from gradient.cpp:20:
gradient.h:30: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:30: error: expected ‘;’ before ‘double’
gradient.h:56: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:56: error: expected ‘;’ before ‘double’
gradient.h:64: error: ISO C++ forbids declaration of ‘constexpr’ with no type
gradient.h:64: error: expected ‘;’ before ‘double’
make: *** [gradient.o] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/home/myuser/R/x86_64-pc-linux-gnu-library/3.5/uwot’
Error in i.p(...) : 
  (converted from warning) installation of package ‘/tmp/Rtmp9wtvCO/file2accb4660569f/uwot_0.0.0.9008.tar.gz’ had non-zero exit status

Any idea of what can cause this?

RStudio is running R 3.5.1. I installed Rcpp and RcppParallel without any issue. We might have an old-ish version of GCC (4.4.7), so that might be the issue, but I want to make sure before I go to war against our sysadmin.
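
The suspicion about the compiler is plausible: constexpr is a C++11 keyword, and GCC only gained support for it in the 4.6 series, so GCC 4.4.7 would be expected to reject it. Below is a small base R sketch for comparing a reported compiler version against that threshold. gcc_supports_constexpr is a hypothetical helper, and the version string is assumed to be copied from the output of g++ --version on the server.

```r
# TRUE if the given GCC version is at least 4.6, the first series
# with constexpr support.
gcc_supports_constexpr <- function(version) {
  numeric_version(version) >= numeric_version("4.6")
}

print(gcc_supports_constexpr("4.4.7")) # the version mentioned in this report
print(gcc_supports_constexpr("4.8.5"))
```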

To run umap using external pca components

I tried to run umap with the init parameter set to a matrix of PCA components generated by other software. But the initialization failed and umap ran with random initialization instead.
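
The report does not say why the matrix was rejected. As a rough pre-flight check, the sketch below tests some generic expectations for a matrix of initial coordinates: it is a numeric matrix with one row per observation, one column per output dimension, and no missing values. These conditions are assumptions for illustration, and check_init is a hypothetical helper, not part of uwot.

```r
# Returns "ok" or a short description of the first problem found.
check_init <- function(init, n_obs, n_components = 2) {
  if (!is.matrix(init) || !is.numeric(init)) {
    return("init is not a numeric matrix (try as.matrix() on a data frame)")
  }
  if (nrow(init) != n_obs) {
    return("init must have one row per observation")
  }
  if (ncol(init) != n_components) {
    return("init must have n_components columns")
  }
  if (anyNA(init)) {
    return("init contains missing values")
  }
  "ok"
}

# PCA scores computed here stand in for components exported by other software.
pcs <- prcomp(iris[, 1:4])$x[, 1:2]
print(check_init(pcs, n_obs = nrow(iris)))
print(check_init(as.data.frame(pcs), n_obs = nrow(iris)))
```

A data frame read from a file is a common way for externally computed components to arrive in R, which is why the first check is listed separately.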

Uwot/umap crashes when run twice ....

Dear all,

I can't get umap to run twice in an R session without crashing. I initially observed this using RunUMAP from Seurat 3.1.1, but even the very basic code (see below) does not work.
I have tried to resolve this using the hints in cole-trapnell-lab/monocle3#186 and satijalab/seurat#2256, but without any success.

Any suggestions?

Session and info:

library(uwot)
iris_umap <- umap(iris, pca = 50)

# And a second time
iris_umap2 <- umap(iris, pca = 50)

# Crash ....
#
# Bioconductor version [1] ‘3.10’
#
# R Under development (unstable) (2019-11-05 r77375)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] uwot_0.1.5 Matrix_1.2-17
#
# loaded via a namespace (and not attached):
# [1] compiler_4.0.0 tools_4.0.0 yaml_2.2.0 Rcpp_1.0.3 grid_4.0.0 FNN_1.1.3 RcppParallel_4.4.4
# [8] lattice_0.20-38

thanks in advance and with kind regards,
Aldo

Fixes for 1.0

Things I should fix, but which may need a major version change. To be edited and updated as I discover more hidden horrors.

  • min_dist default is 0.01, but should be 0.1 for consistency with Python UMAP. Fortunately, this has no discernible effect on the output.
  • should pca be set by default? If users attempt to throw very high dimensional data at uwot at the moment, they are in for a miserable time, because at best Annoy will take hours to complete. At worst, if they are using multi-threading (also a default), Annoy will fail on large datasets due to not being able to read back in an index larger in size than 2GB. I must get back to rnndescent and add rp tree support to provide a replacement/alternative.
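
Until the defaults change, both settings can be passed explicitly. A short sketch on a hypothetical wide matrix (the data and the choice of pca = 50 are illustrative, not recommendations):

```r
library(uwot)

set.seed(1)
# Hypothetical wide data set: 200 observations, 300 columns
X <- matrix(rnorm(200 * 300), nrow = 200)

# Set both values explicitly instead of relying on the defaults:
# min_dist as in Python UMAP, and a PCA reduction before the neighbor search
emb <- umap(X, min_dist = 0.1, pca = 50)
dim(emb)
```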

umap_transform and tumap

Hi,
If I run umap_transform after tumap, I get the following error:

16:25:52 Writing NN index file to temp file /tmp/RtmpAAapHI/file3b610733a05

Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>, :
NULL value passed as symbol address

whereas everything works if I use umap and then umap_transform, using the same data for both methods.

uwot 0.1.8 doesn't install on MacOS 10.14.6

I have 0.1.5 installed and it works fine. I tried to upgrade it to the most recent version and I get an error:

> install.packages("uwot")

  There is a binary version available but the source version is later:
     binary source needs_compilation
uwot  0.1.5  0.1.8              TRUE

Do you want to install from sources the package which needs compilation? (Yes/no/cancel) y
installing the source package ‘uwot’

trying URL 'https://cran.rstudio.com/src/contrib/uwot_0.1.8.tar.gz'
Content type 'application/x-gzip' length 90032 bytes (87 KB)
==================================================
downloaded 87 KB

* installing *source* package ‘uwot’ ...
** package ‘uwot’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC  -Wall -g -O2  -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC  -Wall -g -O2  -c connected_components.cpp -o connected_components.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppProgress/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include" -I"/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dqrng/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -DSTRICT_R_HEADERS -DRCPP_NO_RTTI -fPIC  -Wall -g -O2  -c nn_parallel.cpp -o nn_parallel.o
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:658:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/gethostuuid.h:39:17: error: C++ requires a type specifier for all declarations
int gethostuuid(uuid_t, const struct timespec *) __OSX_AVAILABLE_STARTING(__MAC_10_5, __IPHONE_NA);
                ^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:665:27: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int      getsgroups_np(int *, uuid_t);
                              ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t        uid_t;
                              ^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:667:27: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int      getwgroups_np(int *, uuid_t);
                              ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t        uid_t;
                              ^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:730:31: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int      setsgroups_np(int, const uuid_t);
                                  ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t        uid_t;
                              ^
In file included from nn_parallel.cpp:6:
In file included from ./nn_parallel.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/RcppAnnoy/include/annoylib.h:22:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/unistd.h:732:31: error: unknown type name 'uuid_t'; did you mean 'uid_t'?
int      setwgroups_np(int, const uuid_t);
                                  ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/_types/_uid_t.h:31:31: note: 'uid_t' declared here
typedef __darwin_uid_t        uid_t;
                              ^
5 errors generated.
make: *** [nn_parallel.o] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/uwot’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/uwot’
Warning in install.packages :
  installation of package ‘uwot’ had non-zero exit status

The downloaded source packages are in
	‘/private/var/folders/l5/b5l0kyfd46780f7qdl5hm9d4cncpc2/T/RtmpLXr669/downloaded_packages’

This is my SessionInfo:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2   

uwot::umap_transform not producing same results as those in python implementation

I've noticed that I'm not getting similar results from the uwot as the python implementation.

Goal: Translate a pipeline from Python to R.
Problem: uwot behaves differently from umap (Python)

For reproducibility, I'm following this workflow in python.

Prior to the supervised clustering with umap there are two steps 1) simulating data with sklearn.datasets.make_classification() and 2) scaling with StandardScaler.fit_transform().

Rather than simulating data and scaling with R functions lets use python so the input to uwot and Python's umap are identical.

First we simulate the data and scale it with Python. Let's enter the Python interpreter with repl_python(), then enter the following:

# importing relevant libraries
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier 
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from tqdm import tqdm
from umap import UMAP
from pynndescent import NNDescent
from fastcluster import single
from scipy.cluster.hierarchy import cut_tree, fcluster, dendrogram
from scipy.spatial.distance import squareform
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# let us generate some data with 10 clusters per class
X, y = make_classification(n_samples=500000, n_features=200, n_informative=5, 
                           n_redundant=0, n_clusters_per_class=10, weights=[0.80],
                           flip_y=0.05, class_sep=3.5, random_state=42)

# normalizing to eliminate scaling differences
X = pd.DataFrame(StandardScaler().fit_transform(X))

We're going to run Python's umap first, but we will do the plotting in ggplot2 to show that the difference is not a visualization issue.

# building supervised embedding with UMAP
sup_embed_umap = UMAP().fit_transform(X, y=y)

exit # exit the python interpreter

Now let's plot this in R:

library(ggplot2)

# pull the supervised embedding created above (sup_embed_umap) into R
sup_embed_python <- py$sup_embed_umap
sup_embed_python <- as.data.frame(sup_embed_python)
sup_embed_python$labels <- py$y

ggplot(sup_embed_python, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))

Here's what it looks like:
[image: python-result]

Now let's do the same thing using the default for supervised dimension reduction with uwot:

sup_embed_R <- umap(py$X, y = py$y)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y
ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))

The resulting image looks identical to the one from calling uwot without supervision (that is, without y = py$y):

[image: crappo]

I read the "Python Comparison" document which suggests using pca = 100 and min_dist = 0.1 within umap(). So I also tried this but don't see a similar result.

sup_embed_R <- umap(py$X, y = py$y, pca = 100, min_dist = 0.1)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y
ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))

[image: crap2]

Maybe the issue is that Python is calling the fit_transform() function from umap. Therefore, I tried using the ret_model = TRUE with uwot::umap_transform() but I don't get the same result as python either.

sup_embed_model <- uwot::umap(py$X, y = py$y, ret_model = TRUE)
sup_embed_R <- uwot::umap_transform(py$X, sup_embed_model, verbose = TRUE)
sup_embed_R <- as.data.frame(sup_embed_R)
sup_embed_R$labels <- py$y

ggplot(sup_embed_R, aes(V1, V2, color = as.character(labels))) + geom_point() + scale_color_manual(values = c("#0000FF", "#ff0000"))

This looks more like a donut than two separate clusters:

[image: weird_transform]

Is there something I'm doing wrong?
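
One difference worth ruling out, sketched below on small stand-in data: the uwot documentation describes a factor y as class labels and a numeric y as a numeric target, and labels coming from Python through py$y arrive as numbers rather than as a factor. The simulated data here is hypothetical and much smaller than the 500,000-row set above, and this is a sketch of what to compare, not a confirmed explanation of the plots.

```r
library(uwot)

set.seed(42)
# Small stand-in for the simulated data: two classes with shifted means
n <- 300
y_num <- rep(c(0, 1), each = n / 2)   # numeric labels, as they arrive from Python
X <- matrix(rnorm(n * 20), nrow = n)
X[y_num == 1, 1:5] <- X[y_num == 1, 1:5] + 2

# Numeric y: treated as a numeric target
emb_numeric <- umap(X, y = y_num)

# Factor y: treated as class labels
emb_factor <- umap(X, y = factor(y_num))
```

Plotting emb_numeric and emb_factor side by side with the same ggplot call as above would show whether the label type changes the result.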

Error in FNN::get.knn(X, k) : Data include NAs

Hi,

I have a matrix of size 174 x 76, and the last column contains 4 NAs. I thought uwot had tolerance for NAs in X, but I get the error message "FNN::get.knn(X, k) : Data include NAs". I am using the following command:
umap1<-umap(DT[,28:102], n_neighbors = 10, learning_rate = 0.5, init = "lvrandom", scale = "Z", a=1, b=0.5, min_dist = 1, spread = 4) %>% as.data.table()

Is there any way to get around this NA issue?

Thanks.
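
Two generic ways to remove the NAs before calling umap, sketched in base R on a hypothetical matrix of the same shape. Which one is appropriate depends on the data; median imputation is an assumption made here for illustration only.

```r
set.seed(1)
# Hypothetical 174 x 76 numeric matrix with 4 missing values in the last column
X <- matrix(rnorm(174 * 76), nrow = 174)
X[sample(174, 4), 76] <- NA

# Option 1: drop rows that contain any NA
X_complete <- X[complete.cases(X), , drop = FALSE]

# Option 2: replace each NA with the median of its column
X_imputed <- X
for (j in seq_len(ncol(X_imputed))) {
  missing <- is.na(X_imputed[, j])
  X_imputed[missing, j] <- median(X_imputed[, j], na.rm = TRUE)
}

dim(X_complete)    # 170 rows remain
anyNA(X_imputed)   # FALSE

# Either matrix can then be passed on, e.g. uwot::umap(X_imputed, n_neighbors = 10)
```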

Metric = "precomputed" is not implemented

I would like to run uwot::umap() with metric = 'pearson'. However, 'pearson' is not an option within this package and I got the following error:

Error in match.arg(metric, c("euclidean", "cosine", "manhattan", "hamming", : 'arg' should be one of “euclidean”, “cosine”, “manhattan”, “hamming”, “precomputed”

This error suggests that I can use a "precomputed" distance matrix. So I tried to run uwot::umap() with metric = 'precomputed' and got the following error:

Error in create_ann(metric, nc) : BUG: unknown Annoy metric 'precomputed'

This error suggests precomputed is not implemented within this package.

PS. The original umap package allows metric = 'pearson'. It would be nice to see this added to this package!
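
A possible workaround, sketched in base R: the Pearson correlation between two observations equals the cosine similarity of the same observations after each row has been centered on its own mean. The data below is hypothetical. The two uwot calls in the comments rely on metric = "cosine" and on umap accepting a dist object as input, and they are left unrun here.

```r
set.seed(1)
X <- matrix(rnorm(50 * 10), nrow = 50)   # hypothetical data: 50 observations, 10 features

# Pearson correlation distance between observations (rows)
d_pearson <- 1 - cor(t(X))

# Same quantity via cosine similarity of row-centered data
Xc <- X - rowMeans(X)
norms <- sqrt(rowSums(Xc^2))
d_cosine <- 1 - (Xc %*% t(Xc)) / outer(norms, norms)

all.equal(d_pearson, d_cosine, check.attributes = FALSE)   # TRUE

# So either of these should express a Pearson-based embedding (not run here):
# uwot::umap(Xc, metric = "cosine")
# uwot::umap(as.dist(d_pearson))
```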

Incorrect number of arguments for '_uwot_smooth_knn_distances_parallel'

Hi James,

Great package, appreciate the efforts! It looks like the latest version is broken though?

umap(iris[,1:4])

results in the error:
Error in .Call(_uwot_smooth_knn_distances_parallel, nn_dist, nn_idx, : Incorrect number of arguments (10), expecting 9 for '_uwot_smooth_knn_distances_parallel'

I didn't have issues until today. Any idea about what might be causing this?

Cheers

Problem of test data features extraction

I reduced the dimensionality of the training data using the supervised method, and then used metric learning (umap_transform) to reduce the dimensionality of the test data. However, the accuracy on the training features is 99%, whereas the accuracy on the testing data is very low. Why? Any ideas? I don't want to use PCA or other tools for dimension reduction. Thanks.
The code is below:
library(uwot)
library(tidyverse)
library(ggplot2)

iris_umap <- umap(iris, n_neighbors = 40, alpha = 0.6, init = "random",,n_threads =12,n_components =

#perform dimension reduction for object detection features
data_train <- read.table('d://augmented_features/raw_training_features.csv',sep = ',',header = F)
data_test <- read.table('d://augmented_features/raw_testing_features.csv',sep = ',',header = F)
umap_test <- data_test[,-ncol(data_test)]
umap_train <- data_train[,-ncol(data_train)]
train_label <- data_train$V513
#set.seed(1337)
reduced_umap_train <- umap(umap_train,ret_model = TRUE,y=train_label,n_components = 2)
inria_train <- as.data.frame(reduced_umap_train$embedding)
inria_train %>%
mutate(Categories = data_train$V513) %>%
ggplot(aes(V1, V2, color = Categories)) + geom_point(cex=1.5)
#write this file for final feature extraction
#class(inria_umap)
#write.table(inria_umap,'G://testing.csv',sep = ',',row.names = F,col.names = F)
set.seed(1337)
reduced_umap_test <- umap_transform(umap_test,reduced_umap_train)
inria_test <- as.data.frame(reduced_umap_test)
inria_test %>%
mutate(Categories = data_test$V513) %>%
ggplot(aes(V1, V2, color = Categories)) + geom_point(cex=1.5)

#make final reduced datasets that is used for SVM training
category <- data_train$V513
train_data <- cbind(inria_train,category)
category <- data_test$V513
test_data <- cbind(inria_test,category)

#write data into the file
write.table(train_data,'d://augmented_features/reduced_training.csv',sep = ',',row.names = F,col.names = F)
write.table(test_data,'d:/augmented_features/reduced_testing.csv',sep = ',',row.names = F,col.names = F)

Are NA values allowed in 'y' for supervised reduction?

UMAP supports partial labeling (i.e. NA values) of a target array when performing supervised reduction.

And this is what uwot::umap() does when the 'y' argument is a factor:

x <- iris
x$Species[sample.int(nrow(iris), 50)] <- NA
iris_umap <- umap(x[,-5], n_neighbors = 50, alpha = 0.5, init = "random", y = x$Species)

However, this fails in the case of a numeric 'y' argument that contains NA values:

x <- mtcars
x$mpg[sample.int(nrow(mtcars), 10)] <- NA
mtcars_umap <- umap(x[,-1], n_neighbors = 10, alpha = 0.5, init = "random", y = x$mpg)

Error in result[n_samples > 0] <- n_epochs/n_samples[n_samples > 0] :
NAs are not allowed in subscripted assignments

Perhaps this is the expected behavior (I admit I am not familiar with the details of the algorithm), but I wanted to confirm if this is the case or not.

Writing NN index file to temp file is slow

Why does writing NN index file to temp take so long? Is it possible to speed it up?

merged is a large numeric matrix.

Input:

markers <- c(19:33,36:51,53,62)
sub <- merged[,]
library(uwot)
threads <- 32
umap_data <- umap(
  sub[,markers],
  n_neighbors = 15,
  n_components = 2,
  metric = "euclidean",
  n_epochs = 1000,
  learning_rate = 1,
  scale = "z",
  init = "spca",
  init_sdev = NULL,
  # spread = 5,
  min_dist = 0.01,
  set_op_mix_ratio = 1,
  local_connectivity = 1,
  bandwidth = 1,
  repulsion_strength = 1,
  negative_sample_rate = 5,
  nn_method = "annoy",
  # n_trees = 50,
  approx_pow = FALSE,
  pca = NULL,
  pca_center = TRUE,
  pcg_rand = TRUE,
  fast_sgd = FALSE,
  ret_model = FALSE,
  ret_nn = FALSE,
  n_threads = threads,
  n_sgd_threads = threads,
  grain_size = 1,
  verbose = TRUE
)

Output:

19:37:09 Read 13869323 rows and found 33 numeric columns
19:37:09 Scaling to zero mean and unit variance
19:37:16 Kept 33 non-zero-variance columns
19:37:38 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:29:09 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmp2D6YGc\file47d421456b98

It has been stuck on the last step for more than 15 hours. File size is about 1.8GB.

With 1e5 rows:

markers <- c(19:33,36:51,53,62)
sub <- merged[1:1e5,]
library(uwot)
threads = 64
umap_data <- umap(
  sub[,markers],
  n_neighbors = 15,
  n_components = 2,
  metric = "euclidean",
  n_epochs = 500,
  learning_rate = 1,
  scale = "z",
  init = "spca",
  init_sdev = NULL,
  # spread = 5,
  min_dist = 0.01,
  set_op_mix_ratio = 1,
  local_connectivity = 1,
  bandwidth = 1,
  repulsion_strength = 1,
  negative_sample_rate = 5,
  nn_method = "annoy",
  # n_trees = 50,
  approx_pow = FALSE,
  pca = NULL,
  pca_center = TRUE,
  pcg_rand = TRUE,
  fast_sgd = FALSE,
  ret_model = FALSE,
  ret_nn = FALSE,
  n_threads = threads,
  n_sgd_threads = threads,
  grain_size = 1,
  verbose = TRUE
)

11:57:16 Read 100000 rows and found 33 numeric columns
11:57:16 Scaling to zero mean and unit variance
11:57:16 Kept 33 non-zero-variance columns
11:57:17 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
11:57:30 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmpcf6MHg\file1c05eb14ee7
11:57:30 Searching Annoy index using 64 threads, search_k = 1500
12:00:22 Annoy recall = 100%
12:00:23 Commencing smooth kNN distance calibration using 64 threads
12:00:25 Initializing from PCA
12:00:25 PCA: 2 components explained 27.42% variance
12:00:25 Commencing optimization for 500 epochs, with 2325534 positive edges using 64 threads
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
12:00:53 Optimization finished

So with 1e5 rows, writing the NN index takes less than a second and the file size is 73MB.
With 1e6 rows:

12:02:21 Read 1000000 rows and found 33 numeric columns
12:02:21 Scaling to zero mean and unit variance
12:02:22 Kept 33 non-zero-variance columns
12:02:23 Building Annoy index with metric = euclidean, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
12:05:16 Writing NN index file to temp file C:\Users*\AppData\Local\Temp\Rtmpcf6MHg\file1c01733238f
12:05:17 Searching Annoy index using 64 threads, search_k = 1500

With 1e6 rows, writing NN index takes 1 second and file size is about 738MB.
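
A rough extrapolation from the two measurements above, assuming the index size grows linearly with the number of rows (an assumption, although the two data points are consistent with it). It suggests the finished index for the full data set would be far larger than the 1.8GB file observed while the run was stuck.

```r
# Observed index sizes reported above (rows -> size in MB)
rows    <- c(1e5, 1e6)
size_mb <- c(73, 738)

# Size per row is close to constant, so extrapolate linearly (an assumption)
mb_per_row <- size_mb / rows
full_rows  <- 13869323
est_mb     <- max(mb_per_row) * full_rows

est_gb <- est_mb / 1024
est_gb              # roughly 10 GB
est_mb > 2 * 1024   # TRUE: well above 2 GB
```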

Specs:
Samsung EVO SSD 1 TB
128 GB ECC RAM
AMD 2990WX

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] uwot_0.1.3 Matrix_1.2-17 foreach_1.4.4 flowCore_1.50.0

loaded via a namespace (and not attached):
[1] graph_1.62.0 Rcpp_1.0.1 cluster_2.0.8 BiocGenerics_0.30.0
[5] MASS_7.3-51.4 lattice_0.20-38 rrcov_1.4-7 pcaPP_1.9-73
[9] vizier_0.3 tools_3.6.0 parallel_3.6.0 grid_3.6.0
[13] Biobase_2.44.0 snow_0.4-3 corpcor_1.6.9 iterators_1.0.10
[17] matrixStats_0.54.0 RcppParallel_4.4.3 doSNOW_1.0.16 codetools_0.2-16
[21] robustbase_0.93-5 compiler_3.6.0 DEoptimR_1.0-8 stats4_3.6.0
[25] mvtnorm_1.0-10

Reproducibility issue with cosine metric

This is a followup to issue #46.

The reproducibility issues described there have been fixed for me in 0.1.8 by using approx_pow = TRUE with the euclidean or manhattan metric, but I still face problems when using cosine.

Here's a result on my laptop (Ubuntu 18.04, R 3.6.3, uwot 0.1.8) :

> set.seed(13); head(uwot::umap(iris, metric = "cosine", init="spca", a=1, b=1, approx_pow=TRUE), 5)
         [,1]      [,2]
[1,] 2.190465 -14.45460
[2,] 2.153269 -11.64510
[3,] 2.337686 -14.14382
[4,] 1.191009 -12.59075
[5,] 1.472325 -15.06042

And here's the same thing on a server (CentOS 7, R 3.6.1, uwot 0.1.8) :

> set.seed(13); head(uwot::umap(iris, metric = "cosine", init="spca", a=1, b=1, approx_pow=TRUE), 5)                                                                
          [,1]      [,2]                                                                                                                                            
[1,] -15.45597 -4.156313
[2,] -17.59474 -4.357967
[3,] -15.25843 -4.456960
[4,] -17.01195 -2.813276
[5,] -14.92331 -3.548293

The results are the same when run with metric = "euclidean".

UBSAN issues: RcppParallel

Running rhub::check_with_sanitizers() has confirmed that the UBSAN issues reported for RcppAnnoy in #50 are fixed with RcppAnnoy 0.0.15. Unfortunately, there are lots of UBSAN complaints originating with RcppParallel. I don't think this is due to me using the package incorrectly, because the RcppParallel CRAN checks give the same messages (see https://www.stats.ox.ac.uk/pub/bdr/memtests/gcc-UBSAN/RcppParallel/RcppParallel-Ex.Rout).

They seem to originate with the Intel tbb library and are well known by the RcppParallel maintainers (see e.g. RcppCore/RcppParallel#36), but they can't do anything about it. The risk here is that the strategy of saying that the UBSAN issues are harmless and originate from a package uwot is using is exactly the strategy that stopped working with RcppAnnoy.

A possible alternative is to look at RcppThread which has a parallel for construct and is not currently showing any check problems.

Failed Installation OSX 10.14.6

Hi,

Installation of uwot, by install_github() or install() failed with the same error. Even R --vanilla failed to install it with the same error;

clang++ -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib -L/usr/local/lib -L/usr/local/Cellar/r/3.6.1_1/lib/R/lib -L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib -L/usr/local/lib -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o All my packages loaded Tue Aug 27 23:05:12 2019 -L/usr/local/Cellar/r/3.6.1_1/lib/R/lib -lR -lintl -Wl,-framework -Wl,CoreFoundation
clang-8: error: no such file or directory: 'All'
clang-8: error: no such file or directory: 'my'
clang-8: error: no such file or directory: 'packages'
clang-8: error: no such file or directory: 'loaded'
clang-8: error: no such file or directory: 'Tue'
clang-8: error: no such file or directory: 'Aug'
clang-8: error: no such file or directory: '27'
clang-8: error: no such file or directory: '23:05:14'
clang-8: error: no such file or directory: '2019'
make: *** [uwot.so] Error 1

I am using OSX Mojave,

sw_vers
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G87

gcc version (not Apple's):

gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/9.2.0/libexec/gcc/x86_64-apple-darwin18/9.2.0/lto-wrapper
Target: x86_64-apple-darwin18
Configured with: ../configure --build=x86_64-apple-darwin18 --prefix=/usr/local/Cellar/gcc/9.2.0 --libdir=/usr/local/Cellar/gcc/9.2.0/lib/gcc/9 --disable-nls --enable-checking=release --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-9 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --with-pkgversion='Homebrew GCC 9.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-multilib --with-native-system-header-dir=/usr/include --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
Thread model: posix
gcc version 9.2.0 (Homebrew GCC 9.2.0)
clang -v
clang version 8.0.1 (tags/RELEASE_801/final)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /usr/local/Cellar/llvm/8.0.1/bin

Any help or pointer to fix this is really appreciated.
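
The stray words in the link command ("All my packages loaded Tue Aug 27 ...") look like text printed when R starts up, which would point to a startup file printing a message that ends up in the linker flags. That is an assumption, not a confirmed diagnosis. If it applies, a guard like the following keeps such messages out of non-interactive runs. The function name and message are hypothetical.

```r
# Hypothetical startup hook, e.g. in ~/.Rprofile: only announce in interactive
# sessions, so non-interactive R calls made during package builds stay silent
announce_startup <- function() {
  if (interactive()) {
    message("All my packages loaded ", date())
  }
  invisible(NULL)
}

announce_startup()
```

Using message() rather than cat() or print() also sends the text to standard error instead of standard output.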

Work around Annoy UBSAN issues

The latest submission of uwot to CRAN has been rejected due to the UBSAN issues inherited from RcppAnnoy (the UBSAN check is currently accessible via a link on https://cran.r-project.org/web/checks/check_results_uwot.html):

Thanks, it is your choice to use RcppAnnoy, so you have to work around the issues. The use of undefined behaviour is not compatible with the CRAN policy.

Please fix and resubmit.

If this decision isn't reconsidered, I imagine that this is likely to see uwot being removed from CRAN shortly.

The UBSAN issue is also present in RcppAnnoy itself, not uwot's specific use of the package (as far as I can tell anyway): https://cran.r-project.org/web/checks/check_results_RcppAnnoy.html and is due to how the underlying Annoy library is written. It's not going to get fixed because it's Annoy working as designed. It's not clear to me at the moment if this means RcppAnnoy will also be removed from CRAN or what has changed in policy since the last submission of uwot (or indeed of RcppAnnoy).

At any rate, I grow weary of the ban-hammer lottery uwot enters every time I want to update the package on CRAN. The obvious solution is to stop using Annoy. The upside would be:

  • no more UBSAN issues hanging over uwot.
  • Annoy is able to write indices that are too large for it to read back in, so that would remove an error path that I am unable to detect until it's too late (and lots of time and computation has been expended).

The obvious downsides are:

  • there isn't a good replacement for Annoy.
  • it will break backwards compatibility with any previously saved models.

RcppHNSW is a possible alternative, but it supports fewer metrics than Annoy and is a lot slower.

I do want to get on with rnndescent, the upsides of which are:

  • as a translation of pynndescent, it would be closer to the behavior of the nearest neighbor routines used in UMAP.
  • I can add more metrics support.
  • I can eventually add sparse matrix support.

A big downside is:

  • Nearest neighbor descent doesn't build an index, so the saved model needs to include the entire dataset.

Other downsides emerge from the fact that I am writing the package, so inevitably:

  • it'll take ages to get done.
  • it'll be slower than Annoy.
  • it will have C++ issues that crash your session.
  • it will have multi-threading/R API issues that crash your session.
  • other bugs.

Error when input data is not of a right-numbered column?

Hi there,
I'm getting a weird error when I try to run umap() with my data:
Error: index size 79464 is not a multiple of vector size 16
After playing around trying to figure out why (the iris data was working), I discovered that it has something to do with the number of columns in the input data for umap().

As an example, when you eliminate one of the numeric columns of the iris dataset (from 4 to 3), I also get a similar error.

iris_train <- data.table(iris[1:100, 1:3])
iris_test <- data.table(iris[101:150, 1:3])

iris_train_umap <- umap(iris_train, n_components = 3, ret_model = T)
save_uwot(iris_train_umap, "~/iris.uwot")
Warning messages:
1: invalid uid value replaced by that for user 'nobody' 
2: invalid gid value replaced by that for user 'nobody' 
iris_umap = load_uwot("~/iris.uwot")
Error: index size 77420 is not a multiple of vector size 16

iris_test_umap <- umap_transform(iris_test, iris_umap)

Rstudio crashes.

Any help would be appreciated!
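
For comparison, a minimal sketch of the same save/load round trip using plain matrices, all four numeric iris columns and a temporary file. It shows the intended sequence of calls only and does not reproduce or fix the error reported above.

```r
library(uwot)

train <- as.matrix(iris[1:100, 1:4])
test  <- as.matrix(iris[101:150, 1:4])

set.seed(1)
model <- umap(train, n_components = 2, ret_model = TRUE)

# Save to a temporary file rather than a fixed path
model_file <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_file)

# Reload and project the held-out rows
restored <- load_uwot(model_file)
test_emb <- umap_transform(test, restored)
dim(test_emb)
```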

C++ (and other) code review

I thought I'd go one by one over each of the C++ files, starting with nn_parallel.h. It seems you're doing a nearest-neighbor search on X in the R annoy_nn function. Here are some observations:

  • Annoy will return the observation itself as its own nearest neighbor. I usually have to search k+1 neighbors and check if the observation is not its own neighbor, see here.
  • On that note, you can also use the get_nns_by_item method to get the NNs for a particular item in the index. This avoids the need to repass mat into the NNworker, and eliminates the row-by-row accesses that are not cache optimal in NNworker::operator().
  • You may have some small-to-moderate efficiency gains by saving output in columns of idx and dists, which would be more cache optimal. That is, create transposed matrices for NNWorker to dump results in, and untranspose them just before or after you return to R.
  • Probably the default grain size should be larger, to avoid false sharing and the like?
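
The first point (search for k + 1 neighbors, then drop the query point itself) can be sketched in base R. This uses a brute-force distance matrix as a stand-in for the Annoy index, so it illustrates the bookkeeping only and not the Annoy API. The function name is hypothetical.

```r
# Brute-force stand-in for an approximate index (not Annoy itself):
# return the k nearest neighbors of every row, excluding the row itself
knn_without_self <- function(X, k) {
  D <- as.matrix(dist(X))                    # pairwise Euclidean distances
  n <- nrow(X)
  idx <- matrix(NA_integer_, nrow = n, ncol = k)
  for (i in seq_len(n)) {
    cand <- order(D[i, ])[seq_len(k + 1)]    # ask for k + 1 candidates
    cand <- cand[cand != i]                  # drop the query point if it is present
    idx[i, ] <- cand[seq_len(k)]             # keep exactly k neighbors
  }
  idx
}

set.seed(1)
X <- matrix(rnorm(20 * 3), nrow = 20)
nn <- knn_without_self(X, k = 5)
any(nn == row(nn))   # FALSE: no row lists itself
```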

inverse transform?

Any plan to add a way to perform an inverse transform (from embedding to data space)?

Thanks for the great work!

uwot 0.1.4 is failing to compile on RStudio Server Pro (RHEL 7)

Hi @jlmelville, I'm hitting a compilation issue with uwot 0.1.4 on my RStudio Server Pro installation, running on Azure / RHEL 7.

Specifically with RStudio Server Pro, I'm seeing this failure:

install.packages("uwot")
g++ -std=gnu++11 -shared -L/usr/local/lib64/R/lib -L/usr/local/lib64 -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o WARNING: ignoring environment value of R_HOME -L/usr/local/lib64/R/lib -lR
g++: error: WARNING:: No such file or directory
g++: error: ignoring: No such file or directory
g++: error: environment: No such file or directory
g++: error: value: No such file or directory
g++: error: of: No such file or directory
g++: error: R_HOME: No such file or directory
make: *** [uwot.so] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/data00/R/site-library/3.6/uwot’
* restoring previous ‘/data00/R/site-library/3.6/uwot’
Warning in install.packages(pkgs = doing, lib = lib, repos = repos, ...) :
  installation of package ‘uwot’ had non-zero exit status
Calls: install ... <Anonymous> -> .install -> .install_repos -> install.packages

However, on another virtual machine running regular RStudio Server (RHEL 7), the package installs fine. The failure is due to the R_HOME variable not being detected as expected. Any tips on how to fix this?

Hamming and Manhattan metrics produce different embeddings in spaces where they should produce the same

In the space {0, 1}^n, Hamming and Manhattan metrics are equivalent.
However, if I calculate embeddings for such a binary dataset with uwot's metric='hamming' and metric='manhattan', I get distinct results.

For example:

library("uwot")
set.seed(42)

frequencies <- c(0.1, 0.2)
size <- c(1000, 1000)

samples <- lapply(frequencies, function(f) matrix(rbinom(prod(size), 1, f), nrow=size[1], ncol=size[2]))
str(samples)
mat <- do.call(rbind, samples)

mat.umap_hamming <- umap(mat, metric='hamming')
mat.umap_manhattan <- umap(mat, metric='manhattan')

par(mfrow=c(2,1))
plot(mat.umap_hamming, main="Hamming metric", xlab="UMAP1", ylab="UMAP2")
plot(mat.umap_manhattan, main="Manhattan metric", xlab="UMAP1", ylab="UMAP2")

[Image: uwot]

And here is an example from real data I was working with:

[Image: real_data]
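
The equivalence claimed at the top of this issue can be checked directly in base R, independently of uwot. The sketch below assumes the unnormalized Hamming distance (a count of mismatching coordinates); implementations that divide by the number of columns would differ from Manhattan by a constant factor only, which leaves neighbor rankings unchanged.

```r
set.seed(42)
# Small binary matrix: 20 observations in {0, 1}^50.
m <- matrix(rbinom(20 * 50, 1, 0.2), nrow = 20, ncol = 50)

# Manhattan distance: sum of absolute coordinate differences.
manhattan <- unname(as.matrix(dist(m, method = "manhattan")))

# Hamming distance: number of coordinates at which two rows differ.
n <- nrow(m)
hamming <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    hamming[i, j] <- sum(m[i, ] != m[j, ])
  }
}

# On 0/1 data, |x - y| is 1 exactly where x != y, so the two agree.
distances_agree <- all(manhattan == hamming)
```

If exact distances agree like this, any difference between the two embeddings has to come from somewhere other than the metric definition itself, for example the approximate nearest neighbor search.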

Retrieving results from saved R object

Hi,

This R implementation is very useful for me since I only know R. Thank you for making this package.

I was trying to run a series of UMAP analyses with different parameters. I saved the results with saveRDS() for later use, in particular so I could pass them to umap_transform() with my test data set. However, after retrieving a result with readRDS(), I couldn't use the object as the model for umap_transform(). The error message reads:

Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>,  : 
  NULL value passed as symbol address

I work on RStudio Server. Not sure if the information helps to solve the problem.

Thanks again for making this package.
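
The nil pointer in the error suggests that part of the model does not survive saveRDS()/readRDS(). A minimal sketch of the alternative, using uwot's own save_uwot() and load_uwot() on a model created with ret_model = TRUE, is below. The iris split and file name are made up for illustration, and the whole block is skipped if uwot is not installed.

```r
if (requireNamespace("uwot", quietly = TRUE)) {
  library(uwot)
  set.seed(42)

  # Illustrative train/test split of the numeric iris columns.
  train_idx <- sample(nrow(iris), 100)
  iris_train <- iris[train_idx, 1:4]
  iris_test <- iris[-train_idx, 1:4]

  # ret_model = TRUE keeps what umap_transform() needs later.
  model <- umap(iris_train, ret_model = TRUE)

  # Save with save_uwot() rather than saveRDS().
  model_file <- tempfile(fileext = ".uwot")
  save_uwot(model, file = model_file)

  # Later, possibly in a new session: restore and transform new data.
  restored <- load_uwot(model_file)
  test_embedding <- umap_transform(iris_test, restored)
}
```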

iris example

Thanks for this great implementation.
To be fair, the species column should be removed from the example using iris, as it is the ground truth.

List of metrics not allowed if X is a matrix.

I tried the option metric=list("cosine"=1:27, "categorical"=28) on my data (matrix with dimnames) and got this error:

Error in match.arg(metric, c("euclidean", "cosine", "manhattan", "hamming", : 'arg' should be one of “euclidean”, “cosine”, “manhattan”, “hamming”, “precomputed”

If I set X=as.data.frame(mydata) the error is gone.

Thanks.

Spectral initialization seems to get stuck sometimes

Sometimes the spectral initialization takes a long time; on some occasions, it's got so stuck I've had to terminate the calculation (which can sometimes require terminating the R session).

Probably the input matrix is very poorly conditioned, so that finding the smallest eigenvalues is an exercise in numerical futility.

Recent versions of UMAP detect connected components and initialize them separately, see e.g.
https://github.com/lmcinnes/umap/blob/43cf1a820cea8d5b3218627d047dd78e4a152dd4/umap/spectral.py

This might solve the problem.
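
To make the idea concrete, here is a base R sketch of the detection step only: labeling the connected components of an undirected graph with a breadth-first search. It is not the code from the linked spectral.py or from uwot, and the toy adjacency matrix is invented for the example. Each component could then be initialized on its own.

```r
# Label connected components of an undirected graph given as a symmetric
# adjacency matrix (non-zero entry = edge), using breadth-first search.
connected_components <- function(adj) {
  n <- nrow(adj)
  labels <- integer(n)          # 0 means "not yet visited"
  current <- 0L
  for (start in seq_len(n)) {
    if (labels[start] != 0L) next
    current <- current + 1L
    labels[start] <- current
    queue <- start
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      # Unvisited neighbors of v join the current component.
      nbrs <- which(adj[v, ] != 0 & labels == 0L)
      labels[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  labels
}

# Toy graph: vertices 1-3 form a chain, vertices 4-5 form a separate pair.
adj <- matrix(0, 5, 5)
adj[cbind(c(1, 2, 4), c(2, 3, 5))] <- 1
adj <- pmax(adj, t(adj))        # symmetrize
comp <- connected_components(adj)
```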

Installation error

I'm trying to install uwot in R 3.6.1 and getting an error message I can't debug. I was wondering if anyone has seen something like this before and can give me a pointer:

install.packages("uwot")
Installing package into ‘/usr/local/lib/R/host-site-library’
(as ‘lib’ is unspecified)
trying URL 'https://mran.microsoft.com/snapshot/2019-11-05/src/contrib/uwot_0.1.4.tar.gz'
Content type 'application/octet-stream' length 81262 bytes (79 KB)
==================================================
downloaded 79 KB
...
/bin/bash: -c: line 1: syntax error near unexpected token `('
/bin/bash: -c: line 1: `  echo g++ -std=gnu++11 -shared -L"/usr/local/lib/R/lib" -L/usr/local/lib -o uwot.so RcppExports.o connected_components.o gradient.o nn_parallel.o optimize.o perplexity.o sampler.o smooth_knn.o supervised.o transform.o > RcppParallel::RcppParallelLibs() >  >   -L"/usr/local/lib/R/lib" -lR; \'
/usr/local/lib/R/share/make/shlib.mk:6: recipe for target 'uwot.so' failed
make: *** [uwot.so] Error 1
ERROR: compilation failed for package ‘uwot’
* removing ‘/usr/local/lib/R/host-site-library/uwot’

Here's my sessionInfo:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocManager_1.30.10

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1   

Model loading is incompatible with newer RcppAnnoy

Dirk Eddelbuettel, the RcppAnnoy author, has reached out to inform me that a new version of RcppAnnoy is coming, backed by an updated version of Annoy.

Unfortunately, uwot's ability to load previously saved models is broken by these changes. It used to be possible to specify an arbitrary dimensionality when creating an index, and then loading a serialized Annoy model would overwrite that dimensionality to whatever the serialized index was supposed to have, e.g.:

ann <- methods::new(RcppAnnoy::AnnoyEuclidean, 1)
ann$load(index_path)

Annoy author Erik Bernhardsson further pointed out that this should never have worked, which adds a little extra urgency to making a fix.

Fortunately, the information needed is readily to hand at load time, so this isn't a difficult (or backwards-compatibility breaking) fix.
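
A minimal sketch of the direction of the fix, assuming the number of columns is known at load time: create the index with the real dimensionality before calling load. The round trip through a temporary file and the variable names are only for illustration, and the block is skipped if RcppAnnoy is not installed.

```r
if (requireNamespace("RcppAnnoy", quietly = TRUE)) {
  X <- as.matrix(iris[, 1:4])
  ndim <- ncol(X)

  # Build and save a small index so there is something to load.
  index <- methods::new(RcppAnnoy::AnnoyEuclidean, ndim)
  for (i in seq_len(nrow(X))) {
    index$addItem(i - 1L, X[i, ])   # Annoy items are 0-indexed
  }
  index$build(10L)
  index_path <- tempfile()
  index$save(index_path)

  # Reload using the true dimensionality instead of a placeholder such as 1.
  ann <- methods::new(RcppAnnoy::AnnoyEuclidean, ndim)
  ann$load(index_path)
  n_items <- ann$getNItems()
}
```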

Distance metric match argument error

Command used:
I have separated my samples according to the column Combined and would like to run uwot.
Distancemetric='euclidean'
uwot=umap(data,n_neighbors=n_neighbors,metric=list(Distancemetric=1:n_genes,"categorical"="Combined"),min_dist=mindist,init=init,n_epochs=epochs)

Error:
uwot=umap(data,n_neighbors=n_neighbors,metric=list(Distancemetric=1:n_genes,"categorical"="Combined"),min_dist=mindist,init=init,n_epochs=epochs)

Idea: DimReader for umap?

Thanks for your work on uwot: really love it (including the very clear vignettes explaining umap and tsne).

I was wondering if you came across this paper: https://arxiv.org/abs/1710.00992, which tries to visualize, for each of the original variables, its dependency in the resulting t-SNE/LLE mapping by plotting contour lines. Maybe an idea for uwot? (Or a separate package that does that?)

Add github contributors to DESCRIPTION

@sirusb, @ttriche: as contributors of PRs to this package, would you like to be acknowledged as such in the Authors@R field of the DESCRIPTION? You don't need to provide an email address, just a suitable identifier, e.g. first name and last name. For reference, the field currently looks like:

    c(person("James", "Melville", email = "[email protected]", role = c("aut", "cre")),
    person("Aaron", "Lun", role="ctb"))

nn_method = 'FNN' is incompatible with ret_model = TRUE

Hi,

Maybe this is a "no issue".
I'm trying FNN as the method for the kNN search and need ret_model = TRUE to do metric learning, but I get an error:

Error in x2nn(X, n_neighbors, metric, nn_method, n_trees, search_k, n_refine_iters, :
nn_method = 'FNN' is incompatible with ret_model = TRUE

Do you think there could be a workaround?

Thanks.

About feature extraction after reduced dimension of components

I have training and testing datasets. I reduced the dimensionality of each one separately, and SVM accuracy dropped. What should I do? Is it necessary to keep all the training data in a single matrix and reduce its dimensionality there in order to extract the reduced features? Thanks
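
One possible approach, sketched under assumptions: fit UMAP on the training data only, then project the test data into that same embedding with umap_transform(), so both sets share one coordinate system. Two independent UMAP runs produce unrelated coordinates, which may be behind the accuracy drop. The iris split and n_components value are illustrative, and the block is skipped if uwot is not installed.

```r
if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(123)
  train_rows <- sample(nrow(iris), 120)
  x_train <- iris[train_rows, 1:4]
  x_test <- iris[-train_rows, 1:4]

  # Fit once on the training data and keep the model.
  fit <- uwot::umap(x_train, n_components = 3, ret_model = TRUE)
  train_features <- fit$embedding

  # Project the held-out data into the same embedding.
  test_features <- uwot::umap_transform(x_test, fit)
}
```

The classifier would then be trained on train_features and evaluated on test_features.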
