pavlin-policar / opentsne

Extensible, parallel implementations of t-SNE

Home Page: https://opentsne.rtfd.io

License: BSD 3-Clause "New" or "Revised" License

tsne visualization machine-learning dimensionality-reduction embedding

opentsne's Introduction

openTSNE


openTSNE is a modular Python implementation of t-Distributed Stochastic Neighbor Embedding (t-SNE) [1], a popular dimensionality-reduction algorithm for visualizing high-dimensional data sets. openTSNE incorporates the latest improvements to the t-SNE algorithm, including the ability to add new data points to existing embeddings [2], massive speed improvements [3][4][5] that enable t-SNE to scale to millions of data points, and various tricks to improve the global alignment of the resulting visualizations [6].


A visualization of 44,808 single-cell transcriptomes obtained from the mouse retina [7], embedded using the multiscale kernel trick to better preserve the global alignment of the clusters.

Installation

openTSNE requires Python 3.8 or higher in order to run.

Conda

openTSNE can be easily installed from conda-forge with

conda install --channel conda-forge opentsne


PyPI

openTSNE is also available through pip and can be installed with

pip install opentsne


Installing from source

If you wish to install openTSNE from source, please run

pip install .

in the root directory to install the appropriate dependencies and compile the necessary binary files.

Please note that openTSNE requires a C/C++ compiler to be available on the system.

In order for openTSNE to utilize multiple threads, the C/C++ compiler must support OpenMP. In practice, almost all compilers implement this, with the exception of older versions of clang on OSX systems.

To squeeze the most out of openTSNE, you may also consider installing FFTW3 prior to installation. FFTW3 implements the Fast Fourier Transform, which is heavily used in openTSNE. If FFTW3 is not available, openTSNE will use numpy’s implementation of the FFT, which is slightly slower than FFTW. The difference is only noticeable with large data sets containing millions of data points.

A hello world example

Getting started with openTSNE is very simple. First, we'll load up some data using scikit-learn:
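
A minimal sketch, using the iris data set as an example (any numeric matrix works):

from sklearn import datasets

iris = datasets.load_iris()
x, y = iris["data"], iris["target"]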

Then, we'll import openTSNE and run it:
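
A minimal sketch of the fit call, assuming x is the data matrix loaded above:

from openTSNE import TSNE

# Parameters such as perplexity, metric, n_jobs and random_state (see the API
# listing elsewhere on this page) can be passed here; defaults are fine to start.
embedding = TSNE(n_jobs=-1, random_state=42).fit(x)
# The result behaves like an (n_samples, 2) numpy array of coordinates.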

Citation

If you make use of openTSNE for your work, we would appreciate it if you would cite the paper:

@article {Poli{\v c}ar731877,
    author = {Poli{\v c}ar, Pavlin G. and Stra{\v z}ar, Martin and Zupan, Bla{\v z}},
    title = {openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding},
    year = {2019},
    doi = {10.1101/731877},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2019/08/13/731877},
    eprint = {https://www.biorxiv.org/content/early/2019/08/13/731877.full.pdf},
    journal = {bioRxiv}
}

openTSNE implements two efficient algorithms for t-SNE. Please consider citing the original authors of the algorithm that you use. If you use FIt-SNE (the default), please cite [8]; if you use Barnes-Hut, please cite [9] and [10].

References


  1. Van Der Maaten, Laurens, and Hinton, Geoffrey. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.Nov (2008): 2579-2605.

  2. Poličar, Pavlin G., Martin Stražar, and Blaž Zupan. “Embedding to Reference t-SNE Space Addresses Batch Effects in Single-Cell Classification.” Machine Learning (2021): 1-20.

  3. Van Der Maaten, Laurens. “Accelerating t-SNE using tree-based algorithms.” Journal of Machine Learning Research 15.1 (2014): 3221-3245.

  4. Yang, Zhirong, Jaakko Peltonen, and Samuel Kaski. "Scalable optimization of neighbor embedding for visualization." International Conference on Machine Learning. PMLR, 2013.

  5. Linderman, George C., et al. "Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data." Nature Methods 16.3 (2019): 243.

  6. Kobak, Dmitry, and Berens, Philipp. “The art of using t-SNE for single-cell transcriptomics.” Nature Communications 10, 5416 (2019).

  7. Macosko, Evan Z., et al. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161.5 (2015): 1202-1214.

  8. Linderman, George C., et al. "Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data." Nature Methods 16.3 (2019): 243.

  9. Van Der Maaten, Laurens. “Accelerating t-SNE using tree-based algorithms.” Journal of Machine Learning Research 15.1 (2014): 3221-3245.

  10. Yang, Zhirong, Jaakko Peltonen, and Samuel Kaski. "Scalable optimization of neighbor embedding for visualization." International Conference on Machine Learning. PMLR, 2013.

opentsne's People

Contributors

ales-erjavec, blazzupan, dkobak, inejc, jgraving, mstrazar, pavlin-policar, primozgodec, timrepke, toddrme2178


opentsne's Issues

pynndescent new metrics

With opentsne-0.3.12 and pynndescent 0.3.3 (both installed from conda-forge right now):

/home/localadmin/anaconda3/lib/python3.7/site-packages/openTSNE/nearest_neighbors.py:196: UserWarning: `pynndescent` has recently changed which distance metrics are supported, and `openTSNE.nearest_neighbors` has not been updated. Please notify the developers of this change.
  "`pynndescent` has recently changed which distance metrics are supported, "

Sparse matrix support

Does fastTSNE currently support sparse matrices as input (i.e. scipy.sparse objects)? I remember reading that UMAP does support them, so I assume pynndescent should, but I did not look deeper into that.

I also wonder how efficient pynndescent is for huge sparse matrices, compared e.g. to https://github.com/facebookresearch/pysparnn, which I found by googling "approximate nearest neighbour sparse". I also came across https://github.com/RUSH-LAB/Flash, described in https://arxiv.org/pdf/1709.01190.pdf. I don't know what other libraries are out there.

This isn't very useful for single cell transcriptomics, but is relevant for text-based datasets with millions of samples and millions (!) of features.

Add `verbose` input parameter

What about having a verbose input parameter, such that verbose=1 would print the progress using the ErrorLogger() callback, but also show what happens beforehand (kNN index construction, affinity calculation, etc.) and give the time taken by each stage?
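
In the meantime, iteration-level progress can already be printed with the callback mechanism; a minimal sketch (ErrorLogger and callbacks_every_iters appear in the API listed elsewhere on this page):

from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger

# Calling tsne.fit(X) would print the KL divergence and timing every 50
# iterations during optimization; the proposed verbose flag would additionally
# cover the kNN index construction and affinity calculation stages.
tsne = TSNE(callbacks=ErrorLogger(), callbacks_every_iters=50, n_jobs=-1)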

Why is KL decreasing faster with default parameters compared to the FIt-SNE implementation?

I was under the impression that all important optimization parameters used in openTSNE are the same as in FIt-SNE:

class openTSNE.TSNE(n_components=2, perplexity=30, learning_rate=200, 
early_exaggeration_iter=250, early_exaggeration=12, n_iter=750, exaggeration=None, theta=0.5, 
n_interpolation_points=3, min_num_intervals=50, ints_in_interval=1, initialization='pca', 
metric='euclidean', metric_params=None, initial_momentum=0.5, final_momentum=0.8, 
min_grad_norm=1e-08, max_grad_norm=None, n_jobs=1, neighbors='approx', 
negative_gradient_method='fft', callbacks=None, callbacks_every_iters=50, random_state=None)

and

def fast_tsne(X, theta=.5, perplexity=30, map_dims=2, max_iter=1000, 
              stop_early_exag_iter=250, K=-1, sigma=-1, nbody_algo='FFT', knn_algo='annoy',
              mom_switch_iter=250, momentum=.5, final_momentum=.8, learning_rate=200,
              early_exag_coeff=12, no_momentum_during_exag=False, n_trees=50, 
              search_k=None, start_late_exag_iter=-1, late_exag_coeff=-1,
              nterms=3, intervals_per_integer=1, min_num_intervals=50,            
              seed=-1, initialization=None, load_affinities=None,
              perplexity_list=None, df=1, return_loss=False, nthreads=None)

However, I have just noticed when using one particular dataset that KL decreases faster when I use openTSNE. When I run this:

Z = fast_tsne(X, perplexity=30, seed=42)

I get

Iteration 50 (50 iterations in 0.91 seconds), cost 6.444297
Iteration 100 (50 iterations in 0.87 seconds), cost 6.270150
Iteration 150 (50 iterations in 0.90 seconds), cost 4.691937
Iteration 200 (50 iterations in 0.88 seconds), cost 4.266439
Iteration 250 (50 iterations in 0.86 seconds), cost 4.019699
Unexaggerating Ps by 12.000000
Iteration 300 (50 iterations in 0.89 seconds), cost 3.404685
Iteration 350 (50 iterations in 0.91 seconds), cost 3.043569
Iteration 400 (50 iterations in 0.87 seconds), cost 2.794367
Iteration 450 (50 iterations in 0.92 seconds), cost 2.613639
Iteration 500 (50 iterations in 0.93 seconds), cost 2.471313
Iteration 550 (50 iterations in 1.14 seconds), cost 2.356686
Iteration 600 (50 iterations in 1.28 seconds), cost 2.271862
Iteration 650 (50 iterations in 1.47 seconds), cost 2.190770
Iteration 700 (50 iterations in 1.68 seconds), cost 2.127095
Iteration 750 (50 iterations in 1.98 seconds), cost 2.074458
Iteration 800 (50 iterations in 2.50 seconds), cost 2.027436
Iteration 850 (50 iterations in 2.42 seconds), cost 1.981508
Iteration 900 (50 iterations in 2.63 seconds), cost 1.940266
Iteration 950 (50 iterations in 3.39 seconds), cost 1.915046
Iteration 1000 (50 iterations in 3.72 seconds), cost 1.875520
Wrote the 23822 x 2 data matrix successfully.

whereas when I run

Zo = TSNE(initialization='random', random_state=42, callbacks=ErrorLogger(), n_jobs=-1).fit(X)

I get

Iteration   50, KL divergence  6.0437, 50 iterations in 1.6216 sec
Iteration  100, KL divergence  4.5617, 50 iterations in 1.6222 sec
Iteration  150, KL divergence  4.2162, 50 iterations in 1.6578 sec
Iteration  200, KL divergence  4.0487, 50 iterations in 1.7035 sec
Iteration  250, KL divergence  3.9388, 50 iterations in 1.6815 sec
Iteration   50, KL divergence  3.1818, 50 iterations in 1.6840 sec
Iteration  100, KL divergence  2.7344, 50 iterations in 1.8720 sec
Iteration  150, KL divergence  2.4542, 50 iterations in 1.9890 sec
Iteration  200, KL divergence  2.2686, 50 iterations in 2.1730 sec
Iteration  250, KL divergence  2.1360, 50 iterations in 2.3945 sec
Iteration  300, KL divergence  2.0377, 50 iterations in 2.6713 sec
Iteration  350, KL divergence  1.9628, 50 iterations in 2.9563 sec
Iteration  400, KL divergence  1.9047, 50 iterations in 3.4022 sec
Iteration  450, KL divergence  1.8599, 50 iterations in 3.6570 sec
Iteration  500, KL divergence  1.8257, 50 iterations in 3.9738 sec
Iteration  550, KL divergence  1.7997, 50 iterations in 4.6501 sec
Iteration  600, KL divergence  1.7799, 50 iterations in 5.0196 sec
Iteration  650, KL divergence  1.7647, 50 iterations in 5.3889 sec
Iteration  700, KL divergence  1.7527, 50 iterations in 5.7517 sec
Iteration  750, KL divergence  1.7420, 50 iterations in 5.6339 sec

and indeed with FIt-SNE I get an embedding that spans roughly from -50 to 50, whereas with openTSNE it spans roughly from -80 to 80. I would expect this to happen if openTSNE used a higher learning rate, but it's set to 200 in both implementations.

embedding.optimize() does not have default negative_gradient_method

I tried to reproduce

Z = TSNE(perplexity=30, n_jobs=-1, random_state=42, learning_rate=X.shape[0]/12).fit(X)

with the step-by-step commands and managed to get the same result with

aff = affinity.PerplexityBasedNN(X, perplexity=30, n_jobs=-1, random_state=42)
init = initialization.pca(X, random_state=42)
embedding = TSNEEmbedding(init, aff, n_jobs=-1, learning_rate=X.shape[0]/12)
embedding = embedding.optimize(n_iter=250, negative_gradient_method='fft', exaggeration=12, momentum=0.5)
embedding = embedding.optimize(n_iter=750, negative_gradient_method='fft', momentum=0.8)

However, the latter code would not work without negative_gradient_method='fft', even though the former code ran fine without explicitly specifying the FFT method. Shouldn't optimize have the same default as TSNE?

pynndescent metric warnings

Great work! However, I run into a problem when trying to use it in my PyTorch project. I don't know why there is always a warning: "UserWarning: pynndescent has recently changed which distance metrics are supported, and openTSNE.nearest_neighbors has not been updated. Please notify the developers of this change."

Add support for callable metrics

Expected behaviour

pynndescent supports passing callable metrics compiled with numba, so I would expect openTSNE to be able to support this

Actual behaviour

When I pass a numba-compiled callable metric, openTSNE throws a ValueError

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-b188b4c74e96> in <module>()
----> 1 embedding_train = tsne.fit(X_train)

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in fit(self, X)
   1040 
   1041         """
-> 1042         embedding = self.prepare_initial(X)
   1043 
   1044         try:

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in prepare_initial(self, X)
   1118             metric_params=self.metric_params,
   1119             n_jobs=self.n_jobs,
-> 1120             random_state=self.random_state,
   1121         )
   1122 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in __init__(self, data, perplexity, method, metric, metric_params, symmetrize, n_jobs, random_state)
    124         k_neighbors = min(self.n_samples - 1, int(3 * self.perplexity))
    125         self.knn_index, self.__neighbors, self.__distances = build_knn_index(
--> 126             data, method, k_neighbors, metric, metric_params, n_jobs, random_state
    127         )
    128 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in build_knn_index(data, method, k, metric, metric_params, n_jobs, random_state)
    274             metric_params=metric_params,
    275             n_jobs=n_jobs,
--> 276             random_state=random_state,
    277         )
    278 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in __init__(self, metric, metric_params, n_jobs, random_state)
     11     def __init__(self, metric, metric_params=None, n_jobs=1, random_state=None):
     12         self.index = None
---> 13         self.metric = self.check_metric(metric)
     14         self.metric_params = metric_params
     15         self.n_jobs = n_jobs

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in check_metric(self, *args, **kwargs)
    184             )
    185 
--> 186         return super().check_metric(*args, **kwargs)
    187 
    188     def build(self, data, k):

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in check_metric(self, metric)
     58         if metric not in self.VALID_METRICS:
     59             raise ValueError(
---> 60                 f"`{self.__class__.__name__}` does not support the `{metric}` "
     61                 f"metric. Please choose one of the supported metrics: "
     62                 f"{', '.join(self.VALID_METRICS)}."

ValueError: `NNDescent` does not support the `CPUDispatcher(<function kld at 0x7f7fe2311158>)` metric. Please choose one of the supported metrics: euclidean, l2, manhattan, taxicab, l1, chebyshev, linfinity, linfty, linf, minkowski, seuclidean, standardised_euclidean, wminkowski, weighted_minkowski, mahalanobis, canberra, cosine, correlation, haversine, braycurtis, hamming, jaccard, dice, matching, kulsinski, rogerstanimoto, russellrao, sokalsneath, sokalmichener, yule.
Steps to reproduce the behavior

Here is a minimal working example:

import numpy as np
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger
from numba import njit

@njit(fastmath=True)
def kld(p, q):
    result = 0.0
    for i in range(p.shape[0]):
        logp = np.log(p[i]) if p[i] > 0 else 0
        logq = np.log(q[i]) if q[i] > 0 else 0
        result += p[i] * (logq - logp)
    return -result / np.log(2.)

tsne = TSNE(
    perplexity=30,
    metric=kld,
    callbacks=ErrorLogger(),
    n_jobs=1
)

x_train = np.random.beta(0.5, 0.5, size=(500,10))
embedding_train = tsne.fit(x_train)

Allow building against FFTW if available

FFTW is still probably faster than numpy's FFT, so we can detect whether FFTW is available and build against it. We can swap out which extension gets built depending on whether the library is available.

This can be done similarly to what pyFFTW does (see has_library function in setup.py).
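
A rough sketch of what such a detection helper might look like (illustrative only, not the library's actual setup.py): try to compile and link a trivial program against fftw3 and fall back to the numpy FFT extension if that fails.

import os
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError, LinkError

def has_c_library(library, extension=".c"):
    """Return True if a trivial program including <library.h> and linking
    against the library compiles successfully."""
    with tempfile.TemporaryDirectory(dir=".") as directory:
        source = os.path.join(directory, f"check_{library}{extension}")
        with open(source, "w") as f:
            f.write(f"#include <{library}.h>\nint main(void) {{ return 0; }}\n")
        compiler = new_compiler()
        try:
            objects = compiler.compile([source], output_dir=directory)
            compiler.link_executable(
                objects, os.path.join(directory, "check"), libraries=[library]
            )
        except (CompileError, LinkError):
            return False
    return True

# e.g. in setup.py: pick which Cython extension to build
USE_FFTW = has_c_library("fftw3")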

Untangling a line in openTSNE vs FIt-SNE

I've been experimenting with extreme early exaggeration that should approximate Laplacian eigenmaps as shown in https://epubs.siam.org/doi/abs/10.1137/18M1216134. My simple toy example (straight line in 3D) works fine in FIt-SNE, but fails in openTSNE, and I struggle to understand why.

Here is a reproducible example:

n = 10000
X = np.zeros((n,3))
X[:,0] = np.arange(n)

Z1 = fast_tsne(X, seed=40, learning_rate=1, early_exag_coeff=n/10, stop_early_exag_iter=2000, max_iter=2000)
Z2 = fast_tsne(X, seed=41, learning_rate=1, early_exag_coeff=n/10, stop_early_exag_iter=2000, max_iter=2000)
Z3 = fast_tsne(X, seed=42, learning_rate=1, early_exag_coeff=n/10, stop_early_exag_iter=2000, max_iter=2000)

Z4 = TSNE(n_jobs=-1, initialization='random', random_state=40, learning_rate=1, early_exaggeration=n/10, 
           early_exaggeration_iter=2000, n_iter=0).fit(X)
Z5 = TSNE(n_jobs=-1, initialization='random', random_state=41, learning_rate=1, early_exaggeration=n/10, 
           early_exaggeration_iter=2000, n_iter=0).fit(X)
Z6 = TSNE(n_jobs=-1, initialization='random', random_state=42, learning_rate=1, early_exaggeration=n/10, 
           early_exaggeration_iter=2000, n_iter=0).fit(X)

plt.figure(figsize=(9,3))
for i,Z in enumerate([Z1,Z2,Z3], 1):
    plt.subplot(1,3,i)
    plt.scatter(Z[:,0], Z[:,1], s=1, c=np.arange(n))
sns.despine()
plt.tight_layout()
plt.savefig('line_fitsne.png')

plt.figure(figsize=(9,3))
for i,Z in enumerate([Z4,Z5,Z6], 1):
    plt.subplot(1,3,i)
    plt.scatter(Z[:,0], Z[:,1], s=1, c=np.arange(n))
sns.despine()
plt.tight_layout()
plt.savefig('line_opentsne.png')

I set learning rate to 1 and exaggeration coefficient to n/10, as in the paper above. I do early exaggeration for 2000 iterations and nothing else. Other parameters are default. FIt-SNE successfully unwraps the line for every random seed, as it should.

line_fitsne

But openTSNE fails to unwrap the line for every random seed:

line_opentsne

It does not look like it's due to convergence: I tried running it for another 1000 iterations (so 3000 in total) and it looks basically the same.
It does not look like it's due to initialization either: I tried manually initializing with random initializations with std=0.0001, and it still does not succeed.

Weird. I wanted to use openTSNE to make animations of how the line gets untangled with extreme exaggeration, but it only seems to work in FIt-SNE :-(

Relation to FIt-SNE?

How is this related or how does it compare to FIt-SNE on which you have also done work?

`pynndescent` has recently changed

Expected behaviour

Return the embedding

Actual behaviour

Return the embedding with one warning :
.../miniconda3/lib/python3.7/site-packages/openTSNE/nearest_neighbors.py:181: UserWarning: pynndescent has recently changed which distance metrics are supported, and openTSNE.nearest_neighbors has not been updated. Please notify the developers of this change.
"pynndescent has recently changed which distance metrics are supported, "

Steps to reproduce the behavior

Hello World steps

Sdist does not include *.pxd

The source distribution openTSNE-0.3.9.tar.gz on PyPI is missing .pxd files, such that installing from source fails:

Compiling openTSNE/_matrix_mul/matrix_mul_numpy.pyx because it changed.
[1/4] Cythonizing openTSNE/_matrix_mul/matrix_mul_numpy.pyx

Error compiling Cython file:
------------------------------------------------------------
...
# cython: wraparound=False
# cython: cdivision=True
# cython: initializedcheck=False
# cython: warn.undeclared=True
# cython: language_level=3
cimport openTSNE._matrix_mul.matrix_mul
       ^
------------------------------------------------------------

openTSNE\_matrix_mul\matrix_mul_numpy.pyx:7:8: 'openTSNE\_matrix_mul\matrix_mul.pxd' not found

Import Aborted ( Double free or corruption)

Expected behaviour
Actual behaviour

Shows the error below.

double free or corruption (top)
Aborted

Steps to reproduce the behavior

On the shell: pip3 install openTSNE
Then, in Python: import openTSNE

Optimizer momentum terms not pickled

Expected behaviour

Everything to be pickled properly.

Actual behaviour

The optimizer is not pickled. See #78. This means that if an embedding is reloaded, the momentum terms will be reset, resulting in different visualizations than if it were all run in the same session. This will only happen if the reloaded embedding is further optimized, and has no effect on transform.
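
A small sketch of the scenario (assuming embedding is a TSNEEmbedding that has already been optimized in the current session):

import pickle

with open("embedding.pkl", "wb") as f:
    pickle.dump(embedding, f)

with open("embedding.pkl", "rb") as f:
    reloaded = pickle.load(f)

# Continuing optimization on the reloaded embedding starts with reset momentum
# terms, so the trajectory can differ from continuing with the original object.
further = reloaded.optimize(n_iter=250, negative_gradient_method="fft", momentum=0.8)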

running problem

Hi Pavlin! Sorry to bother you again! I am using 01_simple_usage.ipynb, but I run into an error:
Iteration 450, KL divergence 0.6179, 50 iterations in 0.3072 sec
Iteration 500, KL divergence 0.6173, 50 iterations in 0.3070 sec
Iteration 550, KL divergence 0.6176, 50 iterations in 0.3243 sec
Iteration 600, KL divergence 0.6177, 50 iterations in 0.3255 sec
Iteration 650, KL divergence 0.6189, 50 iterations in 0.3075 sec
Iteration 700, KL divergence 0.6181, 50 iterations in 0.3076 sec
Iteration 750, KL divergence 0.6168, 50 iterations in 0.4510 sec
Traceback (most recent call last):
File "/home/shao/PycharmProjects/tf/main1.py", line 175, in
utils.plot(embedding_train, pred_y, colors=utils.MACOSKO_COLORS)
File "/home/shao/PycharmProjects/tf/utils.py", line 332, in plot
for yi in classes
File "/home/shao/PycharmProjects/tf/utils.py", line 332, in
for yi in classes
KeyError: 0

Process finished with exit code 1

I look forward to your answer, thank you!

pynndescent throws error when n_jobs=-1

Nice work on this! Unfortunately, I ran into an error. Not sure if it's yours or pynndescent. See below...

Expected behaviour

According to your API https://opentsne.readthedocs.io/en/latest/api/index.html you should be able to set n_jobs=-1, and I would expect openTSNE to train successfully with n_jobs=-1.

Actual behaviour

pynndescent throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-6084a3963dd8> in <module>()
----> 1 embedding_train = tsne.fit(np.random.normal(size=(500,10)))

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in fit(self, X)
   1040 
   1041         """
-> 1042         embedding = self.prepare_initial(X)
   1043 
   1044         try:

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in prepare_initial(self, X)
   1118             metric_params=self.metric_params,
   1119             n_jobs=self.n_jobs,
-> 1120             random_state=self.random_state,
   1121         )
   1122 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in __init__(self, data, perplexity, method, metric, metric_params, symmetrize, n_jobs, random_state)
    124         k_neighbors = min(self.n_samples - 1, int(3 * self.perplexity))
    125         self.knn_index, self.__neighbors, self.__distances = build_knn_index(
--> 126             data, method, k_neighbors, metric, metric_params, n_jobs, random_state
    127         )
    128 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in build_knn_index(data, method, k, metric, metric_params, n_jobs, random_state)
    277         )
    278 
--> 279     neighbors, distances = knn_index.build(data, k=k)
    280 
    281     return knn_index, neighbors, distances

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in build(self, data, k)
    207             algorithm="standard",
    208             max_candidates=60,
--> 209             n_jobs=self.n_jobs,
    210         )
    211 

/home/jake/.local/lib/python3.6/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_level, tree_init, random_state, algorithm, max_candidates, n_iters, delta, rho, n_jobs, seed_per_row, verbose)
    552                 verbose=verbose,
    553                 n_jobs=n_jobs,
--> 554                 seed_per_row=seed_per_row,
    555             )
    556         elif algorithm == "standard" or leaf_array.shape[0] == 1:

/home/jake/.local/lib/python3.6/site-packages/pynndescent/threaded.py in nn_descent(data, n_neighbors, rng_state, max_candidates, dist, dist_args, n_iters, delta, rho, rp_tree_init, leaf_array, verbose, n_jobs, seed_per_row)
    583             rng_state,
    584             parallel,
--> 585             seed_per_row=seed_per_row,
    586         )
    587 

/home/jake/.local/lib/python3.6/site-packages/pynndescent/threaded.py in init_current_graph(data, dist, dist_args, n_neighbors, chunk_size, rng_state, parallel, seed_per_row)
    149     # store the updates in an array
    150     max_heap_update_count = chunk_size * n_neighbors * 2
--> 151     heap_updates = np.zeros((n_tasks, max_heap_update_count, 4))
    152     heap_update_counts = np.zeros((n_tasks,), dtype=np.int64)
    153     rng_state_threads = per_thread_rng_state(n_tasks, rng_state)

ValueError: negative dimensions are not allowed
Steps to reproduce the behavior

Here is a minimal working example (works fine with n_jobs set to a positive value):

from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger
import numpy as np

tsne = TSNE(
    perplexity=30,
    metric='euclidean',
    callbacks=ErrorLogger(),
    n_jobs=-1
)

embedding_train = tsne.fit(np.random.normal(size=(500,10)))

Please do not install examples: they conflict with other packages

Installing py36-fastTSNE-0.2.13...
pkg-static: py36-fastTSNE-0.2.13 conflicts with py36-tweepy-3.5.0 (installs files into the same place).  Problematic file: /usr/local/lib/python3.6/site-packages/examples/__init__.py
*** Error code 70

There is no need to install examples.

Is there any way to make 3D tSNEs with FIt-SNE/openTSNE?

Expected behaviour

I would expect the algorithm to be able to embed into 3 dimensions.

Actual behaviour

Documentation reports that only 2d embedding is available.

What I want to do

Either:

- embed 3 tSNE dimensions, or
- use the 2D tSNE coordinates output by FIt-SNE (openTSNE) with a third, Barnes-Hut-calculated tSNE dimension.

I'm working with some highly diverse brain data from scRNAseq snapshots and a third dimension REALLY helps to visualize how the data is structured.

Non-consistent nan value for KL divergence

Expected behaviour

When calling TSNE.fit() on the same numpy array I expect:

  1. The KL divergence to not be a nan
  2. The appearance of nan in the KL divergence to not depend on the specific array slice considered or the chosen value of perplexity.
Actual behaviour

Parameters Tested (what is not mentioned is kept at the default value):

perplexity = [100, 200, 300, 500]
distance = ["euclidean", "cosine"]
n_iter=2000
learning_rate=[200, X.shape[0] // 12]
n_jobs=4

Data Source:

X = np.random.randint(-1, 1, size=(600000, 100))
X_test_1 = X[:300000, :]
X_test_2 = X[:6000, :]

X is an embedding extracted using tensorflow.keras.LSTM.
X does not contain any "weird" value (e.g. nan or inf)

X_test_1 leads to a nan KL divergence from the first 50 iterations
X_test_2 shows normal behaviour
Random and bigger slices of X also show normal behaviour

Values of perplexity above 100 lead to a nan KL divergence from the first 50 iterations

Steps to reproduce the behavior

It is hard to outline the steps for reproducing this behaviour since it appears to be data dependent.
I have different behaviours on different slices of the input array, while a colleague of mine reported a silent exit on one data source (of shape ~(10000, 250)) and normal behaviour on another.

UPDATE
The problem apparently lies in the perplexity parameter: values above 100 lead to a nan KL divergence. Is this behaviour expected?

Pip install doesn't work on OSX

Expected behaviour

fastTSNE can be easily compiled on OSX.

Actual behaviour

fastTSNE fails.

Steps to reproduce the behavior

Run pip install fasttsne on OSX.

Installing collected packages: fastTSNE, Orange3
  Running setup.py install for fastTSNE ... error
    Complete output from command /Users/ajda/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/xn/5y86_zbd2f35vb02plps2f_r0000gn/T/pip-install-e38stk2_/fastTSNE/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/xn/5y86_zbd2f35vb02plps2f_r0000gn/T/pip-record-vmlih87z/install-record.txt --single-version-externally-managed --compile:
    ./tmptrxiymhf/fftw3..c:1:10: fatal error: 'fftw3.h' file not found
    #include <fftw3.h>
             ^~~~~~~~~
    1 error generated.
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.7-x86_64-3.6
    creating build/lib.macosx-10.7-x86_64-3.6/tests
    copying tests/test_correctness.py -> build/lib.macosx-10.7-x86_64-3.6/tests
    copying tests/test_affinities.py -> build/lib.macosx-10.7-x86_64-3.6/tests
    copying tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/tests
    copying tests/test_tsne.py -> build/lib.macosx-10.7-x86_64-3.6/tests
    creating build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/metrics.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/affinity.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/nearest_neighbors.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/initialization.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/callbacks.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    copying fastTSNE/tsne.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE
    creating build/lib.macosx-10.7-x86_64-3.6/examples
    copying examples/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/examples
    copying examples/utils.py -> build/lib.macosx-10.7-x86_64-3.6/examples
    creating build/lib.macosx-10.7-x86_64-3.6/benchmarks
    copying benchmarks/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/benchmarks
    copying benchmarks/benchmark_tsne.py -> build/lib.macosx-10.7-x86_64-3.6/benchmarks
    creating build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    copying fastTSNE/pynndescent/distances.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    copying fastTSNE/pynndescent/rp_trees.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    copying fastTSNE/pynndescent/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    copying fastTSNE/pynndescent/utils.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    copying fastTSNE/pynndescent/pynndescent_.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/pynndescent
    creating build/lib.macosx-10.7-x86_64-3.6/fastTSNE/_matrix_mul
    copying fastTSNE/_matrix_mul/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastTSNE/_matrix_mul
    running build_ext
    building 'fastTSNE.quad_tree' extension
    creating build/temp.macosx-10.7-x86_64-3.6
    creating build/temp.macosx-10.7-x86_64-3.6/fastTSNE
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/ajda/miniconda3/include -arch x86_64 -I/Users/ajda/miniconda3/include -arch x86_64 -I/Users/ajda/miniconda3/lib/python3.6/site-packages/numpy/core/include -I/Users/ajda/miniconda3/include/python3.6m -c fastTSNE/quad_tree.c -o build/temp.macosx-10.7-x86_64-3.6/fastTSNE/quad_tree.o -fopenmp -O3
    clang: error: unsupported option '-fopenmp'
    error: command 'gcc' failed with exit status 1
    
    ----------------------------------------
Command "/Users/ajda/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/xn/5y86_zbd2f35vb02plps2f_r0000gn/T/pip-install-e38stk2_/fastTSNE/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/xn/5y86_zbd2f35vb02plps2f_r0000gn/T/pip-record-vmlih87z/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/xn/5y86_zbd2f35vb02plps2f_r0000gn/T/pip-install-e38stk2_/fastTSNE/

import problem

Traceback (most recent call last):
File "", line 1, in
File "D:\annaconde\envs\tensorflow\lib\site-packages\openTSNE_init_.py", line 3, in
from .tsne import TSNE, TSNEEmbedding, PartialTSNEEmbedding, OptimizationInterrupt
File "D:\annaconde\envs\tensorflow\lib\site-packages\openTSNE\tsne.py", line 744
raise ValueError(f"Unrecognized initialization scheme {initialization}.")
^
SyntaxError: invalid syntax

Doesn't find fftw3.h which path is specified in CFLAGS/CXXFLAGS

On FreeBSD I have /usr/local/include/fftw3.h. /usr/local/include is in CFLAGS/CXXFLAGS but the build doesn't find it:

===>  Configuring for py36-fastTSNE-0.2.13
./tmp2fc_czn0/fftw3..c:1:10: fatal error: 'fftw3.h' file not found
#include <fftw3.h>
         ^~~~~~~~~
1 error generated.
running config
===>  Building for py36-fastTSNE-0.2.13
./tmpsa8l98y6/fftw3..c:1:10: fatal error: 'fftw3.h' file not found
#include <fftw3.h>
         ^~~~~~~~~
1 error generated.

Why does transform() have exaggeration=2 by default?

The parameters of the transform function are

def transform(self, X, perplexity=5, initialization="median", k=25,
learning_rate=100, n_iter=100, exaggeration=2, momentum=0, max_grad_norm=0.05):

so it has exaggeration=2 by default. Why? This looks unintuitive to me: exaggeration is a slightly "weird" trick that can arguably be very useful for huge data sets, but I would expect the out-of-sample embedding to work just fine without it. Am I missing something?

I am also curious why momentum is set to 0 (unlike in normal tSNE optimization), but here I don't have any intuition for what it should be.

Another question is: will this function work with n_iter=0 if one just wants to get an embedding using medians of k nearest neighbours? That would be handy. Or is there another way to get this? Perhaps from prepare_partial?

And lastly, when transform() is applied to points from a very different data set (imagine positioning Smart-seq2 cells onto a 10x Chromium reference), I prefer to use correlation distances because I suspect Euclidean distances might be completely off (even when the original tSNE was done using Euclidean distances). I think openTSNE currently does not support this, right? Did you have any problems with that? One could perhaps allow transform() to take a metric argument (is correlation among the supported metrics, btw?). The downside is that if this metric is different from the metric used to prepare the embedding, then the nearest neighbours object will have to be recomputed, so it will suddenly become much slower. Let me know if I should post it as a separate issue.
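
Regarding the n_iter=0 question above, a minimal sketch of what I have in mind (transform's parameters are listed at the top of this issue; whether n_iter=0 is actually supported is exactly the open question):

# Assuming `embedding` is a fitted TSNEEmbedding and X_new holds the new samples.
# With n_iter=0 the new points would simply be placed at the median of their
# k nearest neighbours in the existing embedding, with no further optimization.
new_points = embedding.transform(X_new, initialization="median", k=25, n_iter=0)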

Don't see any effect of n_jobs

Parameter n_jobs does not seem to influence the speed at all: I get exactly the same speed with n_jobs=1 and n_jobs=-1 (as well as other values) on my n=23k data set. It's weird -- what can I do to debug this?

Replace FFTW with numpy's FFT

Using FFTW3 requires FFTW3 to be installed on the machine. This makes distribution terribly difficult. While numpy's implementation is slower than FFTW, it can be faster when used with Intel MKL.


Even without MKL, being able to ship to Windows without building wheels is probably preferable.

Scanpy integration

Hi Pavlin, have you thought about integrating openTSNE into scanpy? Scanpy has a smart internal setup where the same kNN graph is used for various downstream analysis tasks such as dimensionality reduction or clustering. AFAIK it uses pynndescent for kNN construction. For dimensionality reduction it supports UMAP and t-SNE, and even supports FIt-SNE if it's installed (I think), but then of course FIt-SNE needs to rebuild its own kNN graph. It seems it would be much better to actually use openTSNE, which can directly re-use the pre-built kNN graph object.

To be honest, I haven't used scanpy that much myself, but I see it's getting more and more popular in the community. So in the interest of the ecosystem I think it'd be great to transition it to openTSNE.

I imagine that integrating the core t-SNE should be relatively straightforward. One could then think of implementing some "recipes" from our Nat Comms paper on top of that as additional functions, if the scanpy team is interested in that.

What do you think? I have zero time for it right now due to some looming deadlines but could imagine investing some time into it later in spring.

Performance issues

Hey guys,
I receive the following error when trying to install via pip and anaconda:

Microsoft Windows [Version 10.0.17134.285]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Omar>conda activate tensorflow

(tensorflow) C:\Users\Omar>pip install fasttsne
Collecting fasttsne
  Downloading https://files.pythonhosted.org/packages/32/98/ad7b5278e9fbc0c6f52a336a12440cc405ba620526e30f62a6ea5516829d/fastTSNE-0.2.6.tar.gz (439kB)
    100% |################################| 440kB 6.6MB/s
Requirement already satisfied: numpy>1.14 in c:\anaconda2\envs\tensorflow\lib\site-packages (from fasttsne) (1.15.0)
Collecting numba>=0.38.1 (from fasttsne)
  Downloading https://files.pythonhosted.org/packages/aa/ac/124fe0c1ad1ff7cbe0311352b545a142927583ffb44f57585d6a40b3f5a6/numba-0.39.0-cp35-cp35m-win_amd64.whl (1.6MB)
    100% |################################| 1.6MB 3.1MB/s
Requirement already satisfied: scikit-learn<0.19.99,>=0.19 in c:\anaconda2\envs\tensorflow\lib\site-packages (from fasttsne) (0.19.2)
Requirement already satisfied: scipy in c:\anaconda2\envs\tensorflow\lib\site-packages (from fasttsne) (1.1.0)
Collecting llvmlite>=0.24.0dev0 (from numba>=0.38.1->fasttsne)
  Downloading https://files.pythonhosted.org/packages/0b/7b/0ed27a8429d674ec82631f71d3896f23dd0caa87d5313bcb4535c58d7ab1/llvmlite-0.24.0-cp35-cp35m-win_amd64.whl (10.6MB)
    100% |################################| 10.6MB 3.3MB/s
Building wheels for collected packages: fasttsne
  Running setup.py bdist_wheel for fasttsne ... error
  Complete output from command C:\Anaconda2\envs\tensorflow\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Omar\\AppData\\Local\\Temp\\pip-install-jq2cu5np\\fasttsne\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\Omar\AppData\Local\Temp\pip-wheel-0p5n70do --python-tag cp35:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.5
  creating build\lib.win-amd64-3.5\benchmarks
  copying benchmarks\benchmark_tsne.py -> build\lib.win-amd64-3.5\benchmarks
  copying benchmarks\__init__.py -> build\lib.win-amd64-3.5\benchmarks
  creating build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\affinity.py -> build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\callbacks.py -> build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\metrics.py -> build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\nearest_neighbors.py -> build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\tsne.py -> build\lib.win-amd64-3.5\fastTSNE
  copying fastTSNE\__init__.py -> build\lib.win-amd64-3.5\fastTSNE
  creating build\lib.win-amd64-3.5\tests
  copying tests\test_correctness.py -> build\lib.win-amd64-3.5\tests
  copying tests\test_tsne.py -> build\lib.win-amd64-3.5\tests
  copying tests\__init__.py -> build\lib.win-amd64-3.5\tests
  creating build\lib.win-amd64-3.5\fastTSNE\pynndescent
  copying fastTSNE\pynndescent\distances.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
  copying fastTSNE\pynndescent\pynndescent_.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
  copying fastTSNE\pynndescent\rp_trees.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
  copying fastTSNE\pynndescent\utils.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
  copying fastTSNE\pynndescent\__init__.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
  running build_ext
  building 'fastTSNE.quad_tree' extension
  error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

  ----------------------------------------
  Failed building wheel for fasttsne
  Running setup.py clean for fasttsne
Failed to build fasttsne
Installing collected packages: llvmlite, numba, fasttsne
  Running setup.py install for fasttsne ... error
    Complete output from command C:\Anaconda2\envs\tensorflow\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Omar\\AppData\\Local\\Temp\\pip-install-jq2cu5np\\fasttsne\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Omar\AppData\Local\Temp\pip-record-wft2hl9z\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.5
    creating build\lib.win-amd64-3.5\benchmarks
    copying benchmarks\benchmark_tsne.py -> build\lib.win-amd64-3.5\benchmarks
    copying benchmarks\__init__.py -> build\lib.win-amd64-3.5\benchmarks
    creating build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\affinity.py -> build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\callbacks.py -> build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\metrics.py -> build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\nearest_neighbors.py -> build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\tsne.py -> build\lib.win-amd64-3.5\fastTSNE
    copying fastTSNE\__init__.py -> build\lib.win-amd64-3.5\fastTSNE
    creating build\lib.win-amd64-3.5\tests
    copying tests\test_correctness.py -> build\lib.win-amd64-3.5\tests
    copying tests\test_tsne.py -> build\lib.win-amd64-3.5\tests
    copying tests\__init__.py -> build\lib.win-amd64-3.5\tests
    creating build\lib.win-amd64-3.5\fastTSNE\pynndescent
    copying fastTSNE\pynndescent\distances.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
    copying fastTSNE\pynndescent\pynndescent_.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
    copying fastTSNE\pynndescent\rp_trees.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
    copying fastTSNE\pynndescent\utils.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
    copying fastTSNE\pynndescent\__init__.py -> build\lib.win-amd64-3.5\fastTSNE\pynndescent
    running build_ext
    building 'fastTSNE.quad_tree' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

    ----------------------------------------
Command "C:\Anaconda2\envs\tensorflow\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Omar\\AppData\\Local\\Temp\\pip-install-jq2cu5np\\fasttsne\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Omar\AppData\Local\Temp\pip-record-wft2hl9z\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Omar\AppData\Local\Temp\pip-install-jq2cu5np\fasttsne\
Cache entry deserialization failed, entry ignored

I have already installed visual c++ before.

Zeisel18 colors

Just in case you want to use the same colours as Zeisel et al. use in their Figure 1, then here they are (I grabbed them from their figure):

{'Astroependymal cells': '#d7abd4',
 'Cerebellum neurons': '#2d74bf',
 'Cholinergic, monoaminergic and peptidergic neurons': '#9e3d1b',
 'Di- and mesencephalon neurons': '#3b1b59',
 'Enteric neurons': '#1b5d2f',
 'Hindbrain neurons': '#51bc4c',
 'Immature neural': '#ffcb9a',
 'Immune cells': '#768281',
 'Neural crest-like glia': '#a0daaa',
 'Oligodendrocytes': '#8c7d2b',
 'Peripheral sensory neurons': '#98cc41',
 'Spinal cord neurons': '#c52d94',
 'Sympathetic neurons': '#11337d',
 'Telencephalon interneurons': '#ff9f2b',
 'Telencephalon projecting neurons': '#fea7c1',
 'Vascular cells': '#3d672d'}

If you prefer your own colours then no problem, just close this issue :-) But then maybe consider extending the palette such that you get unique colors for each type. Or use a higher level of taxonomy so that you have enough colours for it in your palette.

A bunch of comments and questions

Hi Pavlin! Great work. I did not know about Orange but I am working with scRNA-seq data myself (cf. your Zeisel2018 example) and I am using Python, so it's interesting to see developments in that direction.

I have a couple of scattered comments/questions that I will just dump here. This isn't a real "issue".

  1. You say that BH is much faster than FFT for smaller datasets. That's interesting; I did not notice this. What kind of numbers are you talking about here? I was under the impression that with n<10k both methods are so fast (I guess all 1000 iterations under 1 min?) that the exact time does not really matter...

  2. Any specific reason to use "Python/Numba implementation of nearest neighbor descent" for approximate nearest neighbours? There are some popular libraries, e.g. annoy. Is your implementation much faster than that? Because otherwise it could be easier to use a well-known established library... I think Leland McInnes is using something similar (Numba implementation of nearest neighbor descent) in his UMAP; did you follow him here?

  3. I did not look at the actual code, but from the description on the main page it sounds like you don't have a vanilla t-SNE implementation in here. Is that true? I think it would be nice to have vanilla t-SNE in here too. For datasets with n=1k-2k it's pretty fast and I guess many people would prefer to use vanilla t-SNE if possible.

  4. I noticed you writing this in one of the closed issues:

    we allow new data to be added into the existing embedding by direct optimization. To my knowledge, no other library does this. It's sometimes difficult to get nice embeddings like this, but it may have potential.

    That's interesting. How exactly are you doing this? You fix the existing embedding, compute all the affinities for the extended dataset (original data + new data) and then optimize the cost by allowing only the positions of the new points to change? Something like that?

  5. George sped up his code quite a bit by adding multithreading to the F_attr computations. He is now implementing multithreading for the repulsive forces too. See KlugerLab/FIt-SNE#32, and the discussion there. This might be interesting for you too. Or are you already using multithreading during gradient descent?

  6. I am guessing that your Zeisel2018 plot is colored using the same 16 "megaclusters" that Zeisel et al. use in Figure 1B (https://www.cell.com/cms/attachment/f1754f20-890c-42f5-aa27-bbb243127883/gr1_lrg.jpg). If so, it would be great if you used the same colors as in their figure; this would ease the comparison. Of course you are not trying to make comparisons here, but this is something that would be interesting to me personally :)

No way to set random state

There is no way to set a random state, as one can in scikit-learn. Currently, the only way to get replicable results is (probably) by passing in a fixed initial embedding. A random_state would be used both for PCA and for random initialization, so this would be helpful.
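
Until such a parameter exists, a possible workaround sketch (assuming a plain numpy array is accepted for initialization, as suggested above):

import numpy as np
from openTSNE import TSNE

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 10))                    # example data
# Deterministic random initialization so repeated runs start from the same layout
init = rng.normal(0, 1e-4, size=(X.shape[0], 2))
embedding = TSNE(initialization=init, n_jobs=1).fit(X)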

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-ICO19U/opentsne/

Expected behaviour

Installation should have happened smoothly

Actual behaviour

Collecting opentsne
Downloading https://files.pythonhosted.org/packages/d7/e3/85625dc946ef152717f8b0234740864f6b3e2d2e29cfbd429e9251e2cb9e/openTSNE-0.3.1.tar.gz (822kB)
100% |████████████████████████████████| 829kB 652kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-ICO19U/opentsne/setup.py", line 129, in
if has_c_library("fftw3"):
File "/tmp/pip-install-ICO19U/opentsne/setup.py", line 39, in has_c_library
with tempfile.TemporaryDirectory(dir=".") as directory:
AttributeError: 'module' object has no attribute 'TemporaryDirectory'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-ICO19U/opentsne/

Steps to reproduce the behavior

pip3 install opentsne
or pip3 install fasttsne

Runtime and RAM usage compared to FIt-SNE

I understand that openTSNE is expected to be slower than FIt-SNE, but I'd like to understand how much slower it is in typical situations. As I reported earlier, when I run it on 70000x50 PCA-reduced MNIST data with default parameters and n_jobs=-1, I get ~60 seconds with FIt-SNE and ~120 seconds with openTSNE. Every 50 iterations take around 2s vs around 4s.

I did not check for this specific case, but I suspect that FFT takes only a small fraction of this time, and the computational bottleneck is formed by the attractive forces. Can one profile openTSNE and see how much time is taken by different steps, such as repulsive/attractive computations?
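
A crude way to get such a breakdown is to run the fit under Python's profiler (a sketch, not an openTSNE feature; X stands for the data matrix):

import cProfile
import pstats
from openTSNE import TSNE

# Profile a full fit and print the 20 most expensive calls by cumulative time;
# the attractive- and repulsive-force routines show up as separate entries.
cProfile.run("TSNE(n_jobs=-1).fit(X)", "tsne.prof")
pstats.Stats("tsne.prof").sort_stats("cumulative").print_stats(20)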

Apart from that, and possibly even more worryingly, I replicated the data 6x and added some noise, to get a 420000x50 data matrix. It takes FIt-SNE around 1Gb of RAM to allocate the space for the kNN matrix, so it works just fine on my laptop. However, openTSNE rapidly took >7Gb of RAM and crashed the kernel (I have 16 Gb but around half was taken by other processes). This happened in the first seconds, so I assume it happens during the kNN search. Does pynndescent eat up so much memory in this case?

Allow to vary the t-distribution degree of freedom

Hi Pavlin. Have you already seen https://arxiv.org/abs/1902.05804 by any chance? I thought you might be interested in adding a df parameter to openTSNE. It was really easy to implement in FIt-SNE, so I guess it should be pretty easy here as well. I would even consider doing it myself, but alas, I am so caught up in revisions right now (including revising this paper) that I am not sure when I would have time for it...

Anyway, if you don't want to implement it (at least not until our paper is out in some reputable place haha) then it's fine. Do let me know if you have any comments on the preprint though.

Why does n_iter correspond to the number of iterations without exaggeration?

Just noticed that n_iter is (everywhere) defined as

The number of iterations to run in the normal optimization regime.

Isn't this at odds with all other t-SNE implementations? At least the FIt-SNE wrappers, scikit-learn, and the original Barnes-Hut C++ wrappers all define n_iter (or however this parameter is called) as the total number of iterations, including the early exaggeration phase.

Did you deviate from this historical "convention" on purpose?

Fix PC signs when using PCA initialization

I find it very convenient to have some convention for fixing PC signs, when using PCA initialization. I usually fix them such that the sum of eigenvector values is positive (of course some other conventions could work equally well). My suggestion is to insert this snippet

# fix PC signs
flipSigns = np.sum(pca_.components_, axis=1) < 0
embedding[:, flipSigns] *= -1

in here https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/initialization.py#L57.

By the way, why not add an svd_solver='auto' parameter to initialization.pca() and pass it into the sklearn PCA call? Could be convenient IMHO.
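
A hedged sketch of what initialization.pca() might look like with both suggestions applied (illustrative only, not the library's actual code):

import numpy as np
from sklearn.decomposition import PCA

def pca_init(X, n_components=2, svd_solver="auto", random_state=None):
    pca_ = PCA(n_components=n_components, svd_solver=svd_solver,
               random_state=random_state)
    embedding = pca_.fit_transform(X)
    # Fix PC signs: flip components whose eigenvector values sum to a negative number
    flip_signs = np.sum(pca_.components_, axis=1) < 0
    embedding[:, flip_signs] *= -1
    return embedding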

installing fftw3

Hi.

Could you provide a bit more detail about installing FFTW3, as you recommend here? How does one verify that it's installed correctly and usable by openTSNE? I followed the instructions for Ubuntu here, namely apt-get install libfftw3-dev libfftw3-doc, and I see /usr/include/fftw3.h et al. Does this mean I have fftw3 installed correctly for openTSNE?

Thanks.

Spectral initialization

I think it'd be really cool to add initialization='spectral' support. It should be very easy, because it just means taking the P matrix after it's computed and running scipy.sparse.linalg.svds on it.
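
A rough sketch of the idea (illustrative only; it assumes the affinity object exposes the symmetrized affinity matrix as .P):

import numpy as np
from scipy.sparse.linalg import svds
from openTSNE import affinity

def spectral_init(X, n_components=2, random_state=None):
    aff = affinity.PerplexityBasedNN(X, perplexity=30, random_state=random_state)
    P = aff.P  # sparse, symmetric affinity matrix
    # Leading singular vectors of P; the first one is roughly constant and is
    # usually dropped in spectral embeddings, so compute one extra component.
    U, s, _ = svds(P, k=n_components + 1)
    order = np.argsort(s)[::-1]
    return U[:, order[1:n_components + 1]]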

Numba warnings after numba update

I updated my conda packages (including numba) and now I am getting a lot of Numba warnings when running openTSNE. They seem harmless (everything still works correctly), but they are annoying. Here is the full output:

The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../../../anaconda3/lib/python3.6/site-packages/pynndescent/rp_trees.py", line 133:
@numba.njit(fastmath=True, nogil=True, parallel=True)
def euclidean_random_projection_split(data, indices, rng_state):
^

  self.func_ir.loc))
/home/localadmin/anaconda3/lib/python3.6/site-packages/pynndescent/pynndescent_.py:177: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../../../anaconda3/lib/python3.6/site-packages/pynndescent/utils.py", line 79:
@numba.njit(parallel=True)
def rejection_sample(n_samples, pool_size, rng_state):
^

  indices = rejection_sample(n_neighbors, data.shape[0], rng_state)
/home/localadmin/anaconda3/lib/python3.6/site-packages/pynndescent/pynndescent_.py:199: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../../../anaconda3/lib/python3.6/site-packages/pynndescent/utils.py", line 459:
@numba.njit(parallel=True)
def new_build_candidates(
^

  seed_per_row,
/home/localadmin/anaconda3/lib/python3.6/site-packages/numba/compiler.py:602: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../../../anaconda3/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 38:
    @numba.njit(parallel=True, fastmath=True)
    def init_from_random(n_neighbors, data, query_points, heap, rng_state):
    ^

  self.func_ir.loc))
/home/localadmin/anaconda3/lib/python3.6/site-packages/numba/compiler.py:602: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../../../anaconda3/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 49:
    @numba.njit(parallel=True, fastmath=True)
    def init_from_tree(tree, data, query_points, heap, rng_state):
    ^

  self.func_ir.loc))

Documentation hides all commands -- is this intentional?

No `__version__` in openTSNE

I tried printing version with

import openTSNE
print(openTSNE.__version__)

but unfortunately got

AttributeError: module 'openTSNE' has no attribute '__version__'
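
A workaround until a __version__ attribute is added is to query the installed distribution instead:

from importlib.metadata import version  # Python 3.8+

print(version("openTSNE"))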
