dillondaudert / umap.jl

Uniform Manifold Approximation and Projection (UMAP) implementation in Julia

License: MIT License

Julia 93.81% Jupyter Notebook 6.19%

umap julia dimensionality-reduction visualization machine-learning topological-data-analysis

umap.jl's Introduction

UMAP.jl

A pure Julia implementation of the Uniform Manifold Approximation and Projection dimension reduction algorithm

McInnes, L., Healy, J., Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426, 2018

Usage

embedding = umap(X, n_components; n_neighbors, metric, min_dist, ...)

The umap function takes two positional arguments, X (a column-major matrix of shape (n_features, n_samples)) and n_components (the number of dimensions in the output embedding), plus various keyword arguments. Several important ones are:

n_neighbors::Int=15: This controls how many neighbors around each point are considered to be part of its local neighborhood. Larger values will result in embeddings that capture more global structure, while smaller values will preserve more local structures.
metric::SemiMetric=Euclidean(): The (semi)metric to use when calculating distances between points. This can be any subtype of the SemiMetric type from the Distances.jl package, including user-defined types.
min_dist::Float=0.1: This controls the minimum spacing of points in the embedding. Larger values will cause points to be more evenly distributed, while smaller values will preserve more local structure.

The returned embedding will be a matrix of shape (n_components, n_samples).

Using precomputed distances

UMAP can use a precomputed distance matrix instead of finding the nearest neighbors itself. In this case, the distance matrix is passed as X and the metric keyword argument should be :precomputed. Example:

embedding = umap(distances, n_components; metric=:precomputed)
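As a minimal sketch of how such a matrix might be prepared, the following computes pairwise Euclidean distances between the columns of a small matrix using only the standard library. The helper name pairwise_euclidean is hypothetical and not part of UMAP.jl; the final umap call is left as a comment and simply repeats the call shown above.

```julia
using LinearAlgebra

# Hypothetical helper: pairwise Euclidean distances between the
# columns (samples) of X. Not part of UMAP.jl.
function pairwise_euclidean(X::AbstractMatrix{<:Real})
    n = size(X, 2)
    D = zeros(Float64, n, n)
    for j in 1:n, i in (j + 1):n
        d = norm(view(X, :, i) .- view(X, :, j))
        D[i, j] = d   # fill both triangles so D is symmetric
        D[j, i] = d
    end
    return D
end

X = [0.0 3.0 0.0;
     0.0 4.0 1.0]     # 2 features, 3 samples
distances = pairwise_euclidean(X)

# With UMAP.jl loaded, the matrix would then be passed as described above:
# embedding = umap(distances, 2; metric=:precomputed)
```

Any other symmetric matrix of pairwise distances could be used in the same way.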

Fitting a UMAP model to a dataset and transforming new data

Constructing a model

To construct a model to use for embedding new data, use the constructor:

model = UMAP_(X, n_components; <kwargs>)

where the constructor takes the same keyword arguments (kwargs) as umap. The returned object has the following fields:

model.graph     # The graph of fuzzy simplicial set membership strengths of each point in the dataset
model.embedding # The embedding of the dataset
model.data      # A reference to the original dataset
model.knns      # A matrix of indices of nearest neighbors of points in the dataset,
                # as determined on the original manifold (may be approximate)
model.dists     # The distances of the neighbors indicated by model.knns

Embedding new data

To transform new data into the existing embedding of a UMAP model, use the transform function:

Q_embedding = transform(model, Q; <kwargs>)

where Q is a matrix of new query data to embed into the existing embedding, and model is the object obtained from the UMAP_ call above. Q must come from a space of the same dimensionality as model.data (i.e., X in the UMAP_ call above).

The remaining keyword arguments (kwargs) are the same as for the functions above.

Implementation Details

There are two main steps involved in UMAP: building a weighted graph with edges connecting points to their nearest neighbors, and optimizing the low-dimensional embedding of that graph. The first step is accomplished either by an exact kNN search (for datasets with < 4096 points) or by the approximate kNN search algorithm, NNDescent. This step is also usually the most costly.
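To illustrate the first step, here is a rough brute-force sketch of an exact kNN search over the columns of a matrix. It is not the package's implementation: the function name exact_knn_sketch is hypothetical, and the (k, n_samples) layout of the two outputs is an assumption chosen to resemble the knns and dists fields described earlier.

```julia
# Hypothetical brute-force kNN over the columns of X (requires k < n_samples).
function exact_knn_sketch(X::AbstractMatrix{<:Real}, k::Integer)
    n = size(X, 2)
    knns = Matrix{Int}(undef, k, n)
    dists = Matrix{Float64}(undef, k, n)
    for j in 1:n
        # Euclidean distance from sample j to every sample.
        d = [sqrt(sum(abs2, view(X, :, i) .- view(X, :, j))) for i in 1:n]
        d[j] = Inf                        # exclude the point itself
        order = partialsortperm(d, 1:k)   # indices of the k smallest distances
        knns[:, j] = order
        dists[:, j] = d[order]
    end
    return knns, dists
end

X = [0.0 1.0 3.0 7.0]    # 1 feature, 4 samples
knns, dists = exact_knn_sketch(X, 2)
```

This costs O(n^2) distance evaluations, which is why an approximate search becomes attractive for larger datasets.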

The low-dimensional embedding is initialized (by default) with the eigenvectors of the normalized Laplacian of the kNN graph. These are found using ARPACK (via Arpack.jl).
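The idea can be sketched on a tiny graph with the standard library alone. This sketch uses a dense eigen decomposition rather than ARPACK, assumes a connected graph with a symmetric weight matrix, and the name spectral_init_sketch is hypothetical.

```julia
using LinearAlgebra

# Symmetric weighted adjacency matrix of a small connected graph
# (a path over 4 nodes).
A = [0.0 1.0 0.0 0.0;
     1.0 0.0 1.0 0.0;
     0.0 1.0 0.0 1.0;
     0.0 0.0 1.0 0.0]

# Hypothetical sketch: initial coordinates from the normalized Laplacian
# L = I - D^(-1/2) * A * D^(-1/2).
function spectral_init_sketch(A::AbstractMatrix{<:Real}, n_components::Integer)
    deg = vec(sum(A, dims=2))
    Dinvsqrt = Diagonal(1 ./ sqrt.(deg))
    L = Symmetric(Matrix(I - Dinvsqrt * A * Dinvsqrt))
    F = eigen(L)    # eigenvalues are returned in ascending order
    # Skip the first (trivial) eigenvector and keep the next n_components,
    # returned with shape (n_components, n_samples).
    return permutedims(F.vectors[:, 2:(n_components + 1)]), F.values
end

embedding0, eigenvalues = spectral_init_sketch(A, 2)
```

For large sparse graphs a dense decomposition is impractical, which is where an iterative solver such as ARPACK comes in.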

Current Limitations

Input data types: Only data points that are represented by vectors of numbers (passed in as a matrix) are valid inputs. This is mostly due to a lack of support for other formats in NNDescent. Support for other formats, e.g. string datasets, may be added in the future.
Sequential: This implementation does not take advantage of any parallelism.

External Resources

Understanding UMAP
For a great description of how UMAP works, see this page from the Python UMAP documentation
If you're familiar with t-SNE, then this page describes UMAP with similar vocabulary to that dimension reduction algorithm

Examples

The full MNIST and FMNIST datasets are plotted below using both this implementation and the Python implementation for comparison. These were generated by this notebook.

Note that the memory allocation for the Python UMAP is unreliable, as Julia's benchmarking doesn't count memory allocated within Python itself.

MNIST

FMNIST

Disclaimer

This implementation is a work-in-progress. If you encounter any issues, please create an issue or make a pull request.


umap.jl's Issues

Project status?

Hello,

It seems like the project hasn't been updated in a while, and I see quite a few untouched issues and PRs.

Is this project still alive, or is it in some sort of limbo?

UMAP performance

It's great to have a Julia implementation of UMAP. I have been using the Python one quite a bit and am very impressed with its performance thus far. (https://github.com/lmcinnes/umap)

Since I am somewhat new to Julia, I am wondering how much faster can the Julia version be?

Currently, the Python UMAP takes about 3.2 seconds to run on a randomized 2000 by 2000 matrix.

import numpy as np
import umap
test = np.random.rand(2000, 2000)
UMAP = umap.UMAP(n_components=2)
%timeit UMAP.fit_transform(test)

Julia UMAP, by comparison, takes about 4.3 seconds to run on a randomized matrix of the same size.

using BenchmarkTools
using UMAP
test = rand(2000, 2000)
@btime umap(test)

I'd love to get your take on this, @dillondaudert.

Example Notebook not working - Julia 1.6.1

Hey guys, thanks for the UMAP implementation. I was trying to run the example notebook, and I'm getting an error right in the beginning in the following line:

mnist_x = MNIST.convert2features(MNIST.traintensor(Float64))

The error message is DimensionMismatch("new dimensions () must be consistent with array size 47040000").

As pointed out, I'm working with Julia 1.6. Don't know if it's related, or if the issue is with some new version for MLDatasets.

Inverse transformation

Is there a way to do the inverse transformation (like https://umap-learn.readthedocs.io/en/latest/inverse_transform.html) from coordinates in the UMAP projection back to the original space?

Final initial implementation pass

This is a list of functionalities, optimizations, and tests that need to be completed for the initial UMAP implementation to be "complete". Roughly, these things represent the blockers before this package gets added to the registry.

Functionality

In fuzzy_simplicial_set, implement the set_operation_ratio argument which interpolates between fuzzy set union and fuzzy set intersection when creating the UMAP graph
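A rough sketch of what this interpolation could look like on a dense membership matrix follows. It assumes the probabilistic fuzzy union (A + A' - A .* A') and product intersection (A .* A') used by the reference Python implementation; the function name combine_sets_sketch is hypothetical and this is not the package's own code.

```julia
# Hypothetical sketch: interpolate between fuzzy union and fuzzy intersection
# of a directed membership matrix A with its transpose.
function combine_sets_sketch(A::AbstractMatrix{<:Real}, set_operation_ratio::Real)
    At = permutedims(A)
    fuzzy_union = A .+ At .- A .* At     # probabilistic t-conorm
    fuzzy_intersection = A .* At         # product t-norm
    return set_operation_ratio .* fuzzy_union .+
           (1 - set_operation_ratio) .* fuzzy_intersection
end

A = [0.0 0.5;
     0.2 0.0]

pure_union = combine_sets_sketch(A, 1.0)
pure_intersection = combine_sets_sketch(A, 0.0)
halfway = combine_sets_sketch(A, 0.5)
```

A ratio of 1 gives the pure union, 0 gives the pure intersection, and the result is always symmetric.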

The following are improvements originally included in this list but later removed. They won't be completed before this package is registered, and are only kept in this issue so they can be referenced later.

Performance

Improve memory allocation performance of pairwise_knn

Testing

Add tests for type stability of all / most functions
Add tests to spectral_layout to check for correct eigenvectors in simple cases
Add tests checking fit_phi finds appropriate parameters a, b

ERROR: MethodError: Cannot `convert` an object of type

ERROR: MethodError: Cannot `convert` an object of type 
  DataStructures.BinaryHeap{NearestNeighborDescent.KNNGraphs.HeapKNNGraphEdge{Int64{},Float64{}},Base.Order.ReverseOrdering{Base.Order.ForwardOrdering}} to an object of type
  DataStructures.BinaryHeap{NearestNeighborDescent.KNNGraphs.HeapKNNGraphEdge{Int64{},Float64{}},Base.Order.ReverseOrdering}

my data looks like this:

julia> X
10×5503 Array{Float64,2}:
 237.0  237.0  180.0  192.0  192.0  192.0  192.0  822.0  822.0  822.0  561.0  844.0  144.0  564.0  144.0  730.0  …  688.0  834.0  185.0  519.0  376.0  710.0  710.0  171.0  511.0  502.0  511.0  171.0  685.0  166.0  166.0
 141.0  141.0  377.0  375.0  375.0  375.0  375.0  151.0  151.0  151.0  443.0  386.0  394.0  378.0  394.0  391.0     228.0  374.0  304.0  245.0  406.0  240.0  240.0  272.0  232.0  238.0  232.0  300.0  253.0  310.0  310.0      
 516.0  516.0  834.0  834.0  834.0  834.0  834.0  511.0  511.0  193.0  834.0  162.0  834.0  834.0  834.0  834.0     820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0  820.0      
 384.0  384.0  159.0  159.0  159.0  159.0  159.0  334.0  334.0  287.0  132.0  379.0  371.0  371.0  371.0  371.0     180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0  180.0      
 258.0  250.0  411.0  411.0  411.0  411.0  411.0  499.0  499.0  368.0  260.0  339.0  416.0  423.0  416.0  339.0     521.0  577.0  497.0  322.0  167.0  510.0  465.0  571.0  182.0  233.0  182.0  538.0  538.0  513.0  513.0
 381.0  396.0  384.0  384.0  384.0  384.0  384.0   44.0   44.0  421.0  277.0  378.0  378.0  380.0  378.0  378.0  …  247.0  217.0  218.0  398.0  377.0  239.0  262.0  224.0  292.0  303.0  292.0  220.0  220.0  256.0  256.0      
 818.0  818.0  728.0  728.0  728.0  728.0  728.0  210.0  210.0  530.0  530.0  526.0  570.0  170.0  570.0  511.0     317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0  317.0
 366.0  366.0  349.0  349.0  349.0  349.0  349.0  270.0  270.0  213.0  213.0  370.0  372.0  371.0  372.0  370.0     222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0  222.0
   0.0    0.0   37.0   37.0   37.0   37.0   37.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0       0.0    0.0   -7.0    0.0    0.0  -27.0    0.0    0.0  -13.0    0.0  -13.0  -30.0  -30.0  -20.0  -20.0
 -23.0  -23.0  -20.0  -20.0  -20.0  -20.0  -20.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0       0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

and my call looks like

julia> embedding = umap(X, 2)

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_EDITOR = "C:\Users\bdeon\AppData\Local\Programs\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS =

julia> @time e = UMAP.umap(X, 2); ERROR: InexactError: Int64(...

Most of the time umap works just fine, but occasionally I am getting an InexactError (example below).

julia> @time e = UMAP.umap(X, 2)
ERROR: InexactError: Int64(2260.3232069772675)

Is it a bug, or am I doing something wrong?

Seg fault upon @pyimport umap as py_umap;

Hi,
I am getting a seg fault when importing the python version via PyCall (@pyimport or pyimport() calls to other python modules are working as expected though).

multiThreads?

Hi really great work!
I like UMAP.jl a lot. But when I ran it on a matrix with dimensions [400,000 x 30], it took about an hour to complete. Do you have any suggestions for speeding up the analysis? I wonder if it is possible to use multiple threads to run umap?
Many thanks in advance!

Random Seed

Hi! I've had great success with this package so far. I was wondering if it is possible to add a random seed for reproducibility. Perhaps there is an alternative, existing way to do this that I haven't thought of. Thanks!
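One possible approach, sketched below, is to seed Julia's global random number generator before the call. Whether this makes umap fully reproducible is an assumption: it only helps if the package draws all of its randomness from the global RNG. The snippet itself only demonstrates the seeding behaviour with the standard library.

```julia
using Random

Random.seed!(1234)    # seed the global RNG
a = rand(3)
Random.seed!(1234)    # reseed with the same value
b = rand(3)
same_draws = (a == b) # identical seeds give identical draws

# Hypothetical use with UMAP.jl (assumes it uses the global RNG):
# Random.seed!(1234)
# embedding = umap(X, 2)
```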

DescentGraph not defined

weird issue when adding:

julia> using UMAP
[ Info: Precompiling UMAP [c4f8c510-2410-5be4-91d7-4fbaeb39457e]
ERROR: LoadError: UndefVarError: DescentGraph not defined
Stacktrace:
[1] include(::Module, ::String) at ./Base.jl:377
[2] top-level scope at none:2
[3] eval at ./boot.jl:331 [inlined]
[4] eval(::Expr) at ./client.jl:449
[5] top-level scope at ./none:3
in expression starting at /home/myhome/.julia/packages/UMAP/Eq5Hc/src/UMAP.jl:4

Implement local_connectivity

In smooth_knn_dists, implement the local_connectivity argument which interpolates between the distances of the nearest neighbors around each point.
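For context, a sketch of the interpolation this describes is given below, modeled on the logic of the reference Python implementation. The function name rho_sketch, the tolerance default, and the handling of the empty case are assumptions for illustration, not the package's code.

```julia
# Hypothetical sketch: distance to the "local_connectivity-th" nearest
# neighbor of one point, interpolating between neighbors when
# local_connectivity is fractional. `dists` is sorted in ascending order.
function rho_sketch(dists::AbstractVector{<:Real}, local_connectivity::Real; tol=1e-5)
    nz = filter(>(0), dists)          # ignore zero distances (duplicates)
    isempty(nz) && return 0.0
    if length(nz) >= local_connectivity
        idx = floor(Int, local_connectivity)
        frac = local_connectivity - idx
        if idx == 0
            return frac * nz[1]
        elseif frac > tol
            # Linear interpolation between neighbors idx and idx + 1.
            return nz[idx] + frac * (nz[idx + 1] - nz[idx])
        else
            return nz[idx]
        end
    else
        return maximum(nz)
    end
end

dists = [0.0, 1.0, 2.0, 4.0]
rho_one = rho_sketch(dists, 1)
rho_frac = rho_sketch(dists, 1.5)
rho_half = rho_sketch(dists, 0.5)
```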

(originally in #1)

Error in spectral layout on master

On Julia 1.3 with the master branch, I'm seeing spectral layout errors. Julia 1.2 with the latest release (1.2) still works fine.

Arbitrary data support

It would be nice if UMAP.jl didn't require the input to be formatted as X (a column-major matrix of shape (n_features, n_samples)). For example, I have samples that I store as structs with some metadata and a data field, which is stored as a matrix itself. Then I define my distance (a subtype of Distances.SemiMetric) between these structs by grabbing their data fields and calculating a distance measure. It would be nice to be able to just pass this vector of samples and my distance to UMAP. In fact, NearestNeighborDescent already supports this (as long as you also define Distances.result_type(::MyMetric, a, b) = Float32 or such), so I can get it working with UMAP just by removing some type annotations. It seems like UMAP only needs to know the number of original features to check that the output dimension is smaller, and they don't actually need to be formatted as a matrix.

(Of course, I could vectorize all the data matrices and hcat them into a big matrix, and then define my distance measure to reshape them, etc, but it seems unnecessary and complicated).

Does this seem like a reasonable feature for UMAP.jl?

Feature request: densMAP

Lovely work on this package. I noticed that the python UMAP implementation (https://github.com/lmcinnes/umap) has added the densMAP algorithm (https://www.nature.com/articles/s41587-020-00801-7). In my opinion this is a very nice solution to one of the issues with UMAP.

Any interest in adding this to the Julia implementation?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Directly specify connectivity network

The python version of UMAP allows you to specify the connectivity network directly, which allows UMAP to be used in situations where you might have a network structure derived from some other means than KNN. I imagine this could be useful for UMAP.jl as well, and wouldn't be too much additional work to implement.

Thanks!

UMAP from distance matrix?

Having read the UMAP paper, it seems to me that one should be able to use UMAP even for data that is not represented as a vector of floats, e.g. when one only has a full distance matrix. Any plans to support this in UMAP.jl?

upgrade dependencies

Currently, many dependencies are outdated and conflict with newer versions of these packages, specifically Distributions, JLD2, LsqFit and Reexport.

link from python FAQ

The Python UMAP FAQ mentions a pure R version. You might submit a PR there to point to this pure Julia version too.

Different embeddings for identical data

Why does transform return different embeddings for the same data?

model = UMAP_(transpose(data), 2; n_neighbors=15, min_dist=0.1)
UMAP.transform(model, transpose(data[1:20,:]))

results in

2×20 Matrix{Float32}:
-7.9815   8.59674  -7.18658  7.90244  …  -2.41209   2.69043   4.15093
-2.66961  0.89672  -3.72272  6.20439      7.94941  -7.61004  -8.79426

Transforming the same data a second time results in a different embedding:

UMAP.transform(model, transpose(data[1:20,:]))

results in

2×20 Matrix{Float32}:
 -7.99711  8.49432  -7.27412  7.844    …  -2.32314   2.682     4.33971
 -2.75653  1.0201   -3.63024  5.85082      8.06509  -7.60372  -8.87176

Do I have to fix a certain random seed to get identical embeddings?

Is UMAP sensitive to floating point microarchitecture?

I'm trying to run UMAP.jl on two different machines (one is a MacBook Pro and the other is a computational cluster at my institute, both using Julia 1.0), and when I run the same dataset on each of them, I get drastically different results. My MacBook shows clear groupings by communities previously detected in the data (community id was excluded when embeddings were calculated), but when I run the same data on the computational cluster, UMAP returns a similar overall shape with no clear groupings by community id. I was wondering if UMAP.jl could be sensitive to floating point microarchitecture differences between my MacBook and the computational cluster, and if there is anything I could do to fix this. I have large-ish datasets (20-100k x 12) that take hours to run on my machine and the cluster, so I want to make sure results are consistent between the computational resources. I have two smaller datasets (1.5k x 12, which take < 1 min to run), and I've reproduced this discrepancy multiple times across these smaller examples.

Landmark-Based Spectral Embedding

The spectral embedding step of UMAP is really really slow. I experimented with replacing UMAP's typical spectral embedding step with the Variational Nystrom approximation: http://proceedings.mlr.press/v48/vladymyrov16.pdf. My results show that (at least on MNIST) this can speed up the embedding step by 10x while not affecting the final embedding quality very much. I was wondering if it might make sense to upstream this into UMAP.jl? One of the cons is that it adds another hyper-parameter for the number of landmarks.

I'm happy to provide more details and my implementation (the core algorithm is really small), if people are interested in this extension.

Tag a release?

Hi,

There are updates to the LsqFit compat in the master Project.toml that have yet to be released. Could you pretty please tag a release?

Cheers!

Float32

I just got the following error when using an Array{Float32}:

MethodError: no method matching combine_fuzzy_sets(::SparseArrays.SparseMatrixCSC{Float32,Int64}, ::Float64)
Closest candidates are:
  combine_fuzzy_sets(::AbstractArray{T<:AbstractFloat,2}, !Matched::T<:AbstractFloat) where T<:AbstractFloat at /home/me/.julia/packages/UMAP/7orf6/src/utils.jl:61

Isn't the input data row-major?

The docs describe the input X as "a column-major matrix of shape (n_features, n_samples)". Isn't (n_features, n_samples) row-major?
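For reference, a small sketch of the layout in question, using only Base Julia. Julia arrays are stored column-major, so with shape (n_features, n_samples) each sample is one column and its features are contiguous in memory. The variable names here are illustrative only.

```julia
X = [1.0 2.0 3.0;
     4.0 5.0 6.0]          # 2 features (rows), 3 samples (columns)

n_features, n_samples = size(X)
first_sample = X[:, 1]     # the two features of sample 1
memory_order = vec(X)      # elements in storage order: column by column

# Data stored with one row per sample (as is common in Python)
# can be converted with permutedims.
table = [1.0 4.0;
         2.0 5.0;
         3.0 6.0]          # 3 samples (rows), 2 features (columns)
X_from_table = permutedims(table)
```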

memory mapped / out-of-core support

The original Python UMAP is working on supporting datasets that won't fit in memory. Is that possible with this Julia version?