
umap's Introduction

umap

R implementation of Uniform Manifold Approximation and Projection


Uniform manifold approximation and projection (UMAP) is a technique for dimensional reduction. The original algorithm is described by McInnes, Healy, and Melville and is implemented in a python package, umap. This package provides an interface to the UMAP algorithm in R, including a translation of the original algorithm into R with minimal dependencies.

Examples

The figure below shows dimensional reduction on the MNIST digits dataset. This dataset consists of 70,000 observations in a 784-dimensional space, labeled by ten distinct classes. The output of this package's umap() function provides the plot layout, i.e. the arrangement of dots on the plane. The coloring, added to visualize how the known labels are positioned within the layout, demonstrates separation of the underlying data groups.

A UMAP visualization of the MNIST digits dataset

The package also makes it possible to project new data onto an existing embedding. Below, the first figure shows a map created from a subset of 60,000 observations from the MNIST data. The second figure is a projection of the held-out 10,000 observations onto the layout defined by the training data.

Two UMAP visualizations of the MNIST digits dataset: the layout of the training subset, and the projection of held-out observations onto that layout

More information on usage can be found in the package vignettes.
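
As a quick illustration of the workflow described above, here is a minimal sketch using the iris data (the same dataset used in the issue reports further down); it assumes default settings and shows both creating an embedding and projecting new observations onto it with predict():

library(umap)

# embed the four numeric columns of the iris dataset
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.umap = umap(iris.data)
head(iris.umap$layout)    # two-column layout, one row per observation

# project new (here: noisy) observations onto the existing embedding
set.seed(19)
iris.wnoise = iris.data + matrix(rnorm(150*4, 0, 0.1), ncol=4)
colnames(iris.wnoise) = colnames(iris.data)
iris.wnoise.umap = predict(iris.umap, iris.wnoise)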

Implementations

The package provides two implementations of the UMAP algorithm.

The default implementation is written in R and Rcpp. It follows the original python code, but any bugs or errors should be regarded as arising from this implementation rather than from the original. The implementation has minimal dependencies and should work on most platforms. (The MNIST graphics above were generated with this default implementation.)

A second implementation is a wrapper for the python package. It offers functionality similar to the existing package umapr. Using this implementation requires additional installation steps; see the documentation of the python package for details.
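
The implementation is selected through the method argument of umap(). A brief sketch; the method name "umap-learn" is the one referenced in the issues and vignette excerpts below, so treat the exact call as an assumption rather than full documentation:

library(umap)

# default: the native R/Rcpp implementation
embedding.r = umap(iris[, 1:4])

# wrapper around the python implementation (needs python plus the umap-learn module)
embedding.py = umap(iris[, 1:4], method = "umap-learn")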

Note: an independent R implementation of UMAP is available in the package uwot, which is also on CRAN.

Acknowledgments

Many thanks to the R and GitHub communities for comments, corrections, and bug reports.

References

The original UMAP algorithm is described in the following article:

McInnes, Leland, John Healy, and James Melville. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426.

License

MIT License.


umap's Issues

missing value where TRUE/FALSE needed

Hi,
I have a dataset that works, but it fails when log-transformed (there are no NA values in the data):

uml <- umap(t(log10(data))) 
Error in if (abs(val - target) < tolerance) { : 
  missing value where TRUE/FALSE needed
> traceback()
4: smooth.knn.dist(knn$distance, nk, local.connectivity = connectivity, 
       bandwidth = bandwidth)
3: naive.fuzzy.simplicial.set(knn, config)
2: implementations[[method]](d, config)
1: umap(t(log10(data)))

This seems related to smooth.knn.dist, as if the knn graph is not being created correctly.
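
One thing worth checking (a guess, not a confirmed diagnosis): log10() turns zeros and negative values into -Inf or NaN, which are not NA but are non-finite and can propagate into the knn distances used by smooth.knn.dist. A quick sanity check on the transformed matrix (assuming 'data' is the user's numeric matrix):

ld <- t(log10(data))
sum(!is.finite(ld))         # counts -Inf/NaN entries, e.g. from log10(0)
range(data)                 # any zeros or negative values before the transform?

# one crude workaround: add a small pseudocount before log-transforming
uml <- umap(t(log10(data + 1)))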

umap() produces matrix instead of S3 object

I have a data frame, b, with 21 sample rows and 1789 uniquely-named gene columns.

> class(b)
[1] "data.frame"
> ncol(b)
[1] 1789
> nrow(b)
[1] 21

umap runs without error or warning. I think that umap is supposed to generate an S3 object, but I only get a 2-column matrix of coordinates. This is fine for generating the plot, but I wanted to extract other information from the layout object. Any suggestions? Thanks!

> um<-umap(b,n_neighbors = 10)
> class(um)
[1] "matrix"
> nrow(um)
[1] 21
> ncol(um)
[1] 2
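
A possible explanation (an assumption, not a confirmed cause): another attached package, such as uwot, also exports a umap() function that returns a plain coordinate matrix, and it may be masking this package's function. Calling the function with an explicit namespace makes the difference visible:

environment(umap)                    # which package does the bound 'umap' come from?
um <- umap::umap(b, n_neighbors = 10)
class(um)                            # expected: "umap", with components such as um$layout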

transforming new data to an embedding

Hi Tomasz, thanks for this package! Following the UMAP docs here, one nice advantage over tSNE is the ability to transform new data to an existing embedding for use in feature preprocessing. Do you know if/how this might be done through your reticulate wrapper?

predict() on a umap object with n_components=1 gets two errors -- looks like missing drop=FALSE

Based on the example in the vignette:

iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]
custom.config = umap.defaults
custom.config$n_components = 1
iris.umap = umap(iris.data, config=custom.config)

set.seed(19)
iris.wnoise = iris.data + matrix(rnorm(150*4, 0, 0.1), ncol=4)
colnames(iris.wnoise) = colnames(iris.data)
iris.wnoise.umap = predict(iris.umap, iris.wnoise)

Error in colMeans(embedding[knn.indexes[i, ], ]) :
'x' must be an array of at least two dimensions

traceback()
6: stop("'x' must be an array of at least two dimensions")
5: colMeans(embedding[knn.indexes[i, ], ])
4: make.initial.spectator.embedding(umap$layout, spectator.knn$indexes)
3: implementations[[method]](object, data)
2: predict.umap(iris.umap, iris.wnoise)
1: predict(iris.umap, iris.wnoise)

Looking at make.initial.spectator.embedding, it looks like a drop=FALSE is missing (line marked with ## <----- below):

trace(umap:::make.initial.spectator.embedding, edit=TRUE)

function (embedding, knn.indexes)
{
    result = matrix(0, nrow = nrow(knn.indexes), ncol = ncol(embedding))
    rownames(result) = rownames(knn.indexes)
    knn.indexes = knn.indexes[, 2:ncol(knn.indexes), drop = FALSE]
    for (i in 1:nrow(result)) {
        result[i, ] = colMeans(embedding[knn.indexes[i, ], , drop = FALSE])  ## <------- added drop = FALSE
    }
    result
}

This change leads to a new error:

iris.wnoise.umap = predict(iris.umap, iris.wnoise)
Error in temp.embedding[, temp.index] <- result[, indeces[i]] :
incorrect number of subscripts on matrix

traceback()
4: naive.simplicial.set.embedding(graph, embedding, config,
fix.observations = V)
3: implementations[[method]](object, data)
2: predict.umap(iris.umap, iris.wnoise)
1: predict(iris.umap, iris.wnoise)

And it also looks like a drop=FALSE is missing in naive.simplicial.set.embedding:

naive.simplicial.set.embedding
function (g, embedding, config, fix.observations = NULL)
{
    if (config$n_epochs == 0) {
        return(embedding)
    }
    result = t(embedding)
    gmax = max(g$coo[, "value"])
    g$coo[g$coo[, "value"] < gmax/config$n_epochs, "value"] = 0
    g = reduce.coo(g)
    eps = cbind(g$coo, eps = make.epochs.per.sample(g$coo[, "value"], config$n_epochs))
    if (is.null(fix.observations)) {
        result = naive.optimize.embedding(result, config, eps)
    }
    else {
        eps = eps[eps[, "from"] > fix.observations, ]
        indeces = seq(fix.observations + 1, ncol(result))
        seeds = column.seeds(result[, indeces, drop = FALSE], key = config$transform_state)
        temp.index = fix.observations + 1
        temp.embedding = result[, seq_len(fix.observations + 1), drop = FALSE]  ## <----- added drop=FALSE
        temp.eps = split.data.frame(eps, eps[, "from"])
        for (i in seq_along(indeces)) {
            temp.embedding[, temp.index] = result[, indeces[i]]
            set.seed(seeds[i])
            i.eps = temp.eps[[as.character(indeces[i])]]
            if (!is.null(i.eps)) {
                i.eps[, "from"] = temp.index
                temp.result = naive.optimize.embedding(temp.embedding, config, i.eps)
            }
            result[, indeces[i]] = temp.result[, temp.index]
        }
    }
    colnames(result) = g$names
    t(result)
}

With these two changes predict() now runs without error and returns values. I am not sure if there are deeper issues with predicting with n_components=1, or if these two changes are sufficient.

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /mnt/drive2/r-project/R-3.6.1/lib/libRblas.so
LAPACK: /mnt/drive2/r-project/R-3.6.1/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics utils datasets grDevices methods base

other attached packages:
[1] umap_0.2.3 colorspace_1.4-1

loaded via a namespace (and not attached):
[1] compiler_3.6.1 Matrix_1.2-17 tools_3.6.1 reticulate_1.13
[5] Rcpp_1.0.2 RSpectra_0.15-0 grid_3.6.1 jsonlite_1.6
[9] openssl_1.4.1 lattice_0.20-38 askpass_1.1

Type error in optimize_embedding

When running umap, I get the following error:

Error in eval(ei, envir) : 
  Not compatible with requested type: [type=character; target=double].

I've traced this to umap:::optimize_embedding, but because I don't speak C well I can't follow the .Call inside any further. I've checked all the arguments that are getting passed on to the C code, though, and none of them are character. The only thing I can see that might be out of place is that abg = c(NA, NA, 1, 0), and I don't know if those NAs are expected.

Running umap with its default settings does not produce the error; the settings I have are as follows:

umap configuration parameters
           n_neighbors: 15
          n_components: 2
                metric: euclidean
              n_epochs: 200
                 input: data
                  init: spectral
              min_dist: 0.1
      set_op_mix_ratio: 1
    local_connectivity: 1
             bandwidth: 1
                 alpha: 1
                 gamma: 1
  negative_sample_rate: 5
                     a: NA
                     b: NA
                spread: 1
          random_state: NA
       transform_state: NA
                   knn: NA
           knn_repeats: 1
               verbose: FALSE
       umap_learn_args: NA

Is this happening because a and b are NA? The function does warn me that

1: umap: parameters 'spread', 'a', 'b' are set to non-default values;
 parameter 'spread' will be ignored.
 (Embedding will be controlled via 'a' and 'b')

Which now that I write this seems suspicious. But umap.defaults also has a and b set to NA.
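
A quick way to test whether the NA values of a and b are what reaches the C code (a hypothetical workaround, not a confirmed fix) is to supply explicit numeric values for both. The numbers below are only illustrative, roughly corresponding to spread=1 and min_dist=0.1, and 'data' stands for the user's input:

custom.config <- umap.defaults
custom.config$spread <- 1
custom.config$min_dist <- 0.1
custom.config$a <- 1.58    # illustrative; normally derived internally from spread/min_dist
custom.config$b <- 0.9
result <- umap(data, config = custom.config)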

Sparse Matrix support

First of all, thanks for your excellent work.

I was wondering whether this version supports a sparseMatrix as input when using the method "umap-learn", and, if it does, how to use it.

Thanks

CRAN

Are you planning to put this on CRAN?

method = "python" does not work

I installed reticulate in R and umap with pip; now I get the following error:

> umap:::python.umap
Module(umap)
> umap(iris[1:4], method = "python")
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: module 'umap' has no attribute 'UMAP'
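
A few checks that may help narrow this down (assumptions about the environment, not package documentation): this AttributeError usually means reticulate is importing a python module named umap that is not the UMAP library, for example because a different python environment is active or another package shadows the name. The UMAP library itself is installed under the name umap-learn:

reticulate::py_config()                    # which python interpreter is reticulate using?
reticulate::py_module_available("umap")    # is a module named 'umap' importable at all?
# in that python environment, the UMAP library is installed as:
#   pip install umap-learn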

Intel MKL FATAL ERROR

Hi,

The package was working just fine, but now whenever I try to load library(umap) it throws this error:

INTEL MKL ERROR: dlopen(/Users/nguyenb/Library/r-miniconda/envs/r-reticulate/lib/libmkl_intel_thread.dylib, 9): Library not loaded: @rpath/libiomp5.dylib Referenced from: /Users/nguyenb/Library/r-miniconda/envs/r-reticulate/lib/libmkl_intel_thread.dylib Reason: image not found. Intel MKL FATAL ERROR: Cannot load libmkl_intel_thread.dylib.

I have already tried reinstalling reticulate and umap...
Many thanks for your help!

Bastien

Citing the package

Hi

I used the umap package in one of my analyses, but I am unsure how to cite it in my manuscript. Should I just cite the UMAP paper on arXiv?

Thanks
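
Independent of this particular package, R can print the citation information bundled with any installed package; a minimal check:

citation("umap")    # prints the citation entry shipped with the installed package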

Problem with using custom metric

Hello, I am trying to run UMAP with a pre-computed "custom metric" as the input distance matrix. My custom metric is the Pearson distance. I know that there is a built-in "pearson" metric available, but I wanted to check whether the results match if I use the pre-computed Pearson distance as the input distance matrix to the umap() function. Even after setting the same random_state in both cases, I got different results.

Case 1: (Using the in-built Pearson metric)
inp_n_neighbors <- 200
inp_min_dist <- 0.001
inp_spread <- 0.2
n_comp <- 2
custom.config <- umap.defaults
custom.config$random_state <- 123
custom.config$n_neighbors <- inp_n_neighbors
custom.config$min_dist <- inp_min_dist
custom.config$spread <- inp_spread
custom.config$metric <- "pearson"
custom.config$n_components <- n_comp
res.umap <- umap(data, config=custom.config, preserve.seed=TRUE)

Case 2: (Using the custom Pearson metric as input distance matrix)
inp_n_neighbors <- 200
inp_min_dist <- 0.001
inp_spread <- 0.2
n_comp <- 2
custom.config <- umap.defaults
custom.config$random_state <- 123
custom.config$input <- "dist"
custom.config$n_neighbors <- inp_n_neighbors
custom.config$min_dist <- inp_min_dist
custom.config$spread <- inp_spread
custom.config$n_components <- n_comp
data_corr <- cor(t(data), method="pearson")
data_dist <- (1 - data_corr)/2
res.umap2<- umap(data_dist, config=custom.config, preserve.seed=TRUE)

The results of res.umap and res.umap2 are different

I was curious about what was happening, played around with the settings, and realized that even with a pre-computed custom distance as input, the value assigned to the custom.config$metric parameter changes the results. For example, see Case 3.

Case 3: (Using the custom Pearson metric as input distance matrix)
inp_n_neighbors <- 200
inp_min_dist <- 0.001
inp_spread <- 0.2
n_comp <- 2
custom.config <- umap.defaults
custom.config$random_state <- 123
custom.config$input <- "dist"
custom.config$n_neighbors <- inp_n_neighbors
custom.config$min_dist <- inp_min_dist
custom.config$spread <- inp_spread
custom.config$n_components <- n_comp
custom.config$metric <- "pearson" #### THE DEFAULT IS EUCLIDEAN DISTANCE BUT I CHANGED IT TO PEARSON
data_corr <- cor(t(data), method="pearson")
data_dist <- (1 - data_corr)/2
res.umap3<- umap(data_dist, config=custom.config, preserve.seed=TRUE)

The results of res.umap2 and res.umap3 are different

When I use a pre-computed custom metric as the input distance, why does the value assigned to custom.config$metric change the results? Where is the problem with my understanding?

Thanks

Failed creating initial embedding; using random embedding instead

Hello,

I would like to highlight the following warning:

Warning message: failed creating initial embedding; using random embedding instead

It appears when running the example in your vignette:

iris.umap = umap(iris.data)

By default, the initialization is spectral (init = "spectral" in umap.defaults), but with your example (and with my data as well) it seems not to work as expected and switches to random.

How can I overcome this and perform a UMAP analysis using init = "spectral" rather than init = "random"? I cannot figure out why this is happening...

Can you please help me understand this?

Best regards,
Lea

Error with n_components=1

Hello,
I am trying to use umap in R and wanted to try setting the configuration option n_components to 1. When I do, I get an error:

custom.config = umap.defaults
custom.config$n_components=1
aa.umap = umap(aa, config = custom.config)
Error in result[, 1:d, drop = FALSE] : incorrect number of dimensions

Is there a way I can extract a reduction to only one component?

Thanks for the info
Mattia

predict() generates different predictions if called with multiple points at once versus called with each point individually

If I call predict() with a data.frame of points and then call predict() with each point separately, I get different results, sometimes substantially different. For my application I need to get the same result for the same input data. Unfortunately, the obvious solution of running each point through separately is about 100x slower. I'm not sure if this is a bug or just a characteristic of the algorithm.

Here is an example taken from the vignette:
iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]
iris.umap = umap(iris.data)

set.seed(19)
iris.wnoise = iris.data + matrix(rnorm(150*4, 0, 0.1), ncol=4)
colnames(iris.wnoise) = colnames(iris.data)
iris.wnoise.umap = predict(iris.umap, iris.wnoise)

iris.wnoise.umap.1 = t(apply(iris.wnoise, 1, function(x) {
    x <- t(matrix(x))
    predict(iris.umap, x)
}))

head(cbind(iris.wnoise.umap, iris.wnoise.umap.1))
[,1] [,2] [,3] [,4]
1 16.28199 -0.70549906 15.92311 -0.47269093
2 14.69319 -2.22523708 14.16301 -2.27630185
3 14.78369 -3.54772123 15.00407 -3.04123797
4 14.36916 -2.48618578 14.13746 -2.43722985
5 16.04017 -0.44320957 16.37358 -0.64862115
6 15.90486 0.05976971 16.48200 0.09985856

Differences with Python version?

Hi,
Thanks for developing this package first and foremost!
My data look intuitive and nice using your version in R, but when I try to reproduce these results in Python using McInnes' version, weirdly enough no structure whatsoever is apparent in the data.
I compared the parameter settings and all the similarly named parameters are identical. Do you know how to best approximate the default settings of your version in Python?
Thanks! Maarten

Number of threads

Hi, is it possible to specify the number of threads to use? I have been running umap on our multi-core server, and it seems to simply use all available cores. It would be useful to provide a user argument to specify the number of threads. Thanks!

min_dist not updating with Python backend

Hello, I found out this pretty strange bug.

In short, when I use the umap-learn backend, the min_dist setting is not respected. The Python backend complains if min_dist is higher than spread, the arguments change, and min_dist appears in config$umap_learn_args, but the UMAP coordinates do not change.

Other settings (e.g. n_neighbors) are good.

When the same code is called directly from a Python session, min_dist is correctly applied.

Repro and more comments in this repository:
https://github.com/lgaborini/umap-bug/blob/master/umap_compare_iris.md

Source (R markdown + reticulate)

Thank you!

when random_state is set automatically in config, it is not sufficient for reproducibility

A user reported a bug via email.

Embeddings are reproducible when a seed is set manually.

result_1 <- umap(dataset, random_state=123)
result_1$config$random_state == 123 # TRUE, the seed is recorded in config

result_2 <- umap(dataset, random_state=result_1$config$random_state)
identical(result_1, result_2) # TRUE, which is correct

When an embedding is created without a seed, the package creates and sets its own seed. The intention is to be able to recreate the same result if needed, even if the first run did not set a specific seed.

result_3 <- umap(dataset)
result_3$config$random_state > 0 # TRUE, signals intention for reproducibility

result_4 <- umap(dataset, random_state=result_3$config$random_state)
identical(result_3, result_4) # FALSE, but should actually be TRUE

The bug affects version 0.2.9 and probably older versions as well.
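
Given the behaviour reported above, a practical workaround (a sketch, following the same pattern as the reproducible case) is to choose and record a seed explicitly before the first run, so that the value stored in config$random_state is the one that was actually used:

seed <- 123
result_1 <- umap(dataset, random_state = seed)
result_2 <- umap(dataset, random_state = seed)
identical(result_1$layout, result_2$layout)    # TRUE when the seed is supplied explicitly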
