eamid / trimap
TriMap: Large-scale Dimensionality Reduction Using Triplets
License: Apache License 2.0
Glad to see this cool work from a fellow Slug.
Is there a reason that there is no support for a random seed argument?
That would be a very useful (and standard) thing to include.
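As a possible workaround until a seed argument exists, seeding NumPy's global RNG before each call may make runs repeatable. This is a sketch that assumes TriMap draws its randomness from np.random, which is not a documented guarantee; the actual fit_transform call is left commented out as a placeholder.

```python
import numpy as np

def reproducible_embedding(X, seed=42):
    """Seed NumPy's global RNG before a stochastic embedding call.

    Assumption: the library draws from np.random (not confirmed for TriMap).
    """
    np.random.seed(seed)
    # embedding = trimap.TRIMAP().fit_transform(X)  # real call would go here
    return np.random.permutation(len(X))  # stand-in for a stochastic step

X = np.zeros((10, 4))
run1 = reproducible_embedding(X, seed=0)
run2 = reproducible_embedding(X, seed=0)
assert np.array_equal(run1, run2)  # identical seeds give identical draws
```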
The shared code really helped in my research and I'd like to apply TriMap to further data analysis. It would be great if sample data could be clustered in the same way a deep learning network does.
What I mean is, for example: a 2-dimensional cluster plot for the MNIST dataset is produced after clustering, and a new image (a handwritten 0-9 digit, maybe my own) is then projected onto this plot so that it lands in the correct class region (0-9).
I've searched for saving and loading of k-means clusters but found no further information.
Actually, I'm not sure whether it is possible or not.
If there's any advice for this issue, please let me know.
Thanks in advance.
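TriMap itself offers no transform for unseen points, but a common approximation (a sketch of a general technique, not part of the library) is to place a new sample at the average embedding position of its nearest training neighbors; the class region it lands in then follows from the training labels:

```python
import numpy as np

def place_new_point(X_train, emb_train, x_new, k=5):
    """Approximate out-of-sample embedding: average the 2-D coordinates
    of the k nearest training points (Euclidean in the original space)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(d)[:k]
    return emb_train[nn].mean(axis=0)

# toy data: two well-separated clusters and their (pretend) 2-D embedding
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
emb = np.vstack([rng.normal(-1, 0.05, (20, 2)), rng.normal(1, 0.05, (20, 2))])

p = place_new_point(X, emb, X[0] + 0.01, k=3)
# p should land near the first cluster's embedding region (around -1, -1)
```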
To reproduce:
import trimap
import numpy as np
import pandas as pd
# load pairwise cosine similarities and convert them to distances
cossims = pd.read_feather("wiki_rule_cosinesimilarities.feather")
distmat = 1 - np.matrix(cossims.iloc[:, 0:cossims.shape[0]], 'double')
tmap = trimap.TRIMAP(use_dist_matrix=True)
tmap = tmap.fit_transform(distmat)
I'm attaching the data in a zipfile.
wiki_rule_cosinesimilarities.feather.zip
First, thanks for your work on this package and technique and making it available for others to study and experiment with.
The following example (using return_seq=True) raises a ValueError for me:
import trimap
from sklearn.datasets import load_digits
digits = load_digits()
embedding = trimap.TRIMAP(return_seq=True).fit_transform(digits.data, init="pca")
Omitting the init argument and letting it default to None allows the computation to finish, but the initial coordinates are exported as nan:
embedding = trimap.TRIMAP(return_seq=True).fit_transform(digits.data)
embedding[:, :, 0]
array([[nan, nan],
[nan, nan],
[nan, nan],
...,
[nan, nan],
[nan, nan],
[nan, nan]])
I think the following:
Line 591 in a7250f3
should be:
Y_all[:, :, 0] = Y
as Y_init may hold a string like "random" (which causes the ValueError) or None (hence the nans).
Happy to provide a PR for this if needed.
trimap.TRIMAP().get_params()
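For context, scikit-learn hands out get_params() for free when an estimator subclasses BaseEstimator and declares all of its arguments explicitly in __init__. A minimal sketch (the Reducer class below is hypothetical, not TriMap's code):

```python
from sklearn.base import BaseEstimator

class Reducer(BaseEstimator):
    """Toy estimator: get_params()/set_params() are inherited from
    BaseEstimator, which introspects the __init__ signature."""
    def __init__(self, n_inliers=10, n_outliers=5, lr=1000.0):
        self.n_inliers = n_inliers
        self.n_outliers = n_outliers
        self.lr = lr

params = Reducer(n_inliers=20).get_params()
# params maps each constructor argument to its current value
```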
Would be helpful to add example programs to make checking reproducibility easier.
Is it possible to use something like angle (cosine similarity) as the measure of closeness?
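One workaround, assuming Euclidean distance is supported: L2-normalize the rows first. On unit vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so nearest-neighbor rankings under cosine similarity are preserved:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

u, v = Xn[0], Xn[1]
cos_sim = u @ v
sq_euclid = np.sum((u - v) ** 2)
# identity for unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v))
assert np.isclose(sq_euclid, 2 * (1 - cos_sim))
# feeding Xn to a Euclidean embedder then ranks neighbors as cosine would
```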
For reproducibility, it should also have a transform option, so it can transform data points it hasn't been trained on.
On top of that, this functionality is also needed to be able to use it in an sklearn Pipeline.
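Until a native transform exists, a thin wrapper can approximate one: fit once with fit_transform, then map unseen points via their nearest training neighbors. This is a sketch, using PCA merely as a stand-in for any embedder that only offers fit_transform; a TRIMAP instance could be dropped in the same slot.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class FitTransformOnlyWrapper(BaseEstimator, TransformerMixin):
    """Give a fit_transform-only embedder an approximate transform(),
    so it can sit inside an sklearn Pipeline."""
    def __init__(self, embedder=None, k=5):
        self.embedder = embedder
        self.k = k

    def fit(self, X, y=None):
        self.X_train_ = np.asarray(X)
        self.emb_train_ = self.embedder.fit_transform(self.X_train_)
        return self

    def transform(self, X):
        X = np.asarray(X)
        out = np.empty((len(X), self.emb_train_.shape[1]))
        for i, x in enumerate(X):
            d = np.linalg.norm(self.X_train_ - x, axis=1)
            nn = np.argsort(d)[: self.k]  # k nearest training points
            out[i] = self.emb_train_[nn].mean(axis=0)
        return out

wrapper = FitTransformOnlyWrapper(embedder=PCA(n_components=2), k=3)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
emb = wrapper.fit(X).transform(X[:5])
```

The nearest-neighbor mapping is only an approximation; for embedders with a genuine parametric mapping a native transform would be preferable.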
Hi,
is it possible to use a precomputed distance matrix as input? If not, will it be added in the future?
Thank you for the reply.
Best regards,
Vykintas
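For reference, a full pairwise distance matrix can be built with SciPy; whether TriMap accepts it depends on the version (another report on this page passes use_dist_matrix=True, which I cannot confirm is in the released API, so the call is left commented out):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

# dense n x n matrix of pairwise Euclidean distances
D = squareform(pdist(X, metric="euclidean"))
# a precomputed-distance input is expected to be symmetric, zero diagonal
assert D.shape == (30, 30)
# tmap = trimap.TRIMAP(use_dist_matrix=True).fit_transform(D)  # hypothetical
```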
I got an error from numba when trying to use trimap on a fairly simple dataset. Any help greatly appreciated!
Here's a colab notebook with the reproduction:
https://colab.research.google.com/drive/1nhFmCGNDerz-0V3pJoL9UFGntD4OonYL
TRIMAP(n_inliers=10, n_outliers=5, n_random=5, lr=1000.0, n_iters=400, weight_adj=500.0, fast_trimap = True, opt_method = dbd, verbose=True, return_seq=False)
running TriMap on 10000 points with dimension 508
pre-processing
found nearest neighbors
sampled triplets
running TriMap with dbd
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-9fcd06511c30> in <module>()
----> 1 embedding = trimap.TRIMAP().fit_transform(vectors)
3 frames
/usr/local/lib/python3.6/dist-packages/numba/dispatcher.py in _explain_matching_error(self, *args, **kws)
461 msg = ("No matching definition for argument type(s) %s"
462 % ', '.join(map(str, args)))
--> 463 raise TypeError(msg)
464
465 def _search_new_conversions(self, *args, **kws):
TypeError: No matching definition for argument type(s) array(float32, 1d, C), int64, int64, array(int32, 2d, C), array(float32, 1d, C)
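The dispatch failure lists an int64 among the argument types, while the compiled signatures may expect narrower ones. As a guess (not a confirmed fix), forcing the input array into the dtype and layout numba compiled for sometimes sidesteps such mismatches:

```python
import numpy as np

vectors = np.random.rand(100, 508)  # placeholder for the real data
# cast to float32 and C-contiguous layout, matching the array types
# that appear in the error message
vectors32 = np.ascontiguousarray(vectors, dtype=np.float32)
# embedding = trimap.TRIMAP().fit_transform(vectors32)  # retry with this
```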
The following code makes Trimap hang forever.
import numpy as np
import trimap
x = np.array([[ 3.18987876e-01, 5.87170608e-02, -5.35221584e-02,
-2.12370202e-01, 1.44289479e-01, 1.15213081e-01,
-3.49550992e-01, -8.56188014e-02, 7.67039582e-02,
-7.87917897e-02, -2.89615601e-01, -2.38374388e-03,
-6.07468300e-02, -1.53473644e-02, 9.19963419e-02,
-1.14370733e-01, 1.21543720e-01, 1.16481416e-01,
-2.94296652e-01, -1.43486544e-01, -3.29958886e-01,
1.34309351e-01, -4.32708934e-02, 3.27159733e-01,
1.35406721e-04, 2.15839192e-01, -2.31008962e-01,
-1.53630883e-01, 1.70035616e-01, -1.03398576e-01,
-7.83967040e-03, -1.48111418e-01, 7.08103701e-02,
1.51507165e-02, -4.70302580e-03],
[ 2.70511746e-01, 2.50944565e-03, -1.01266943e-01,
-6.04593521e-03, 1.90846086e-01, -5.88433584e-03,
-3.05718035e-01, -1.63746793e-02, 8.91139284e-02,
-3.90956774e-02, -2.89017886e-01, 5.44876307e-02,
-3.34294289e-02, 5.05351350e-02, 1.19450457e-01,
-2.66644936e-02, 1.38987005e-01, 2.54748076e-01,
-2.78318554e-01, 5.58482762e-03, -4.44619954e-01,
-3.14005986e-02, -2.54096221e-02, 3.29968154e-01,
4.54740152e-02, 1.45967603e-01, -1.36808544e-01,
-1.10377215e-01, 1.64085761e-01, -2.38455474e-01,
-1.35548353e-01, -1.64852977e-01, 1.17668778e-01,
-4.60316762e-02, 4.73128930e-02],
[ 3.17780316e-01, -7.81738758e-03, -6.44788519e-02,
5.62540069e-02, 1.69442132e-01, 5.34028653e-03,
-3.56567532e-01, 9.72701795e-03, 8.40950683e-02,
-7.36852437e-02, -3.20505381e-01, 2.87447236e-02,
-8.96242410e-02, 1.10711388e-01, 3.08006257e-02,
-1.42246597e-02, 7.26564825e-02, 3.26128125e-01,
-1.96420610e-01, -8.66924319e-03, -3.05779576e-01,
-2.30795946e-02, 9.55938771e-02, 3.96909148e-01,
7.82142058e-02, 1.47577658e-01, -9.03981999e-02,
-4.88963164e-02, 1.18389614e-01, -2.15027452e-01,
-6.54470399e-02, -1.75441504e-01, 1.87194660e-01,
-5.08111436e-04, 1.35444716e-01],
[ 2.58012921e-01, -8.77735093e-02, -1.28023893e-01,
1.47463515e-01, 2.61107385e-01, -5.92785887e-02,
-2.14058936e-01, 3.41764428e-02, 4.58676219e-02,
-4.56911102e-02, -2.89655060e-01, -1.57761140e-04,
-4.51611951e-02, 7.53968805e-02, 7.84260333e-02,
5.99992424e-02, 1.10423878e-01, 3.26432049e-01,
-2.62022614e-01, 2.30244398e-02, -3.76471043e-01,
-1.13793373e-01, 1.96540896e-02, 2.30564684e-01,
6.99499100e-02, 1.44859001e-01, 5.51677980e-02,
2.79185660e-02, 7.44636357e-02, -2.78124183e-01,
-1.65953085e-01, -1.10599346e-01, 2.63543546e-01,
-8.91586766e-02, 1.93403229e-01],
[ 3.32011819e-01, -1.40174493e-01, -5.28167412e-02,
1.13800459e-01, 2.06157431e-01, -8.29892382e-02,
-2.11161330e-01, 7.94143155e-02, 4.90802489e-02,
-1.19306277e-02, -2.87060529e-01, 4.33459552e-03,
8.65805820e-02, 3.03589255e-02, 1.73449665e-01,
1.71231180e-02, 4.74411622e-02, 2.65454501e-01,
-2.75403082e-01, 2.34591905e-02, -3.79175991e-01,
-1.03660703e-01, 4.20364253e-02, 1.28694892e-01,
-8.52392241e-03, -4.99439947e-02, 1.10806182e-01,
-2.32070358e-03, 2.65163928e-02, -3.77998233e-01,
-2.85796434e-01, -7.88480118e-02, 1.74133658e-01,
-1.40881404e-01, 1.08900480e-01],
[ 1.82337701e-01, -2.11179242e-01, -1.01714216e-01,
1.31016269e-01, 4.99383882e-02, -1.59250170e-01,
-1.29212305e-01, -3.32643799e-02, 1.20454393e-01,
1.02800533e-01, -2.92455345e-01, -1.76530272e-01,
2.09684089e-01, 1.33223221e-01, 1.39211901e-02,
4.81586717e-03, -9.83966216e-02, 3.23559731e-01,
-2.28622139e-01, 3.68424207e-02, -2.63355613e-01,
-1.88473210e-01, 4.12943624e-02, 1.66466340e-01,
-1.77660301e-01, -1.06210433e-01, 2.31963158e-01,
-5.21184653e-02, 8.36717412e-02, -2.57204562e-01,
-2.26933807e-01, -1.83641464e-01, 2.42122248e-01,
-1.56716019e-01, 4.54310402e-02],
[ 2.46496126e-02, -1.26516521e-01, -2.60583401e-01,
2.04805687e-01, 1.16600819e-01, -2.23044977e-01,
-1.97046809e-02, -8.16227198e-02, 7.48965740e-02,
1.76039010e-01, -2.80806333e-01, -9.68108177e-02,
1.12287454e-01, 1.50147453e-01, -7.96348378e-02,
4.77459133e-02, 3.08816843e-02, 2.76006699e-01,
-2.06872299e-01, 1.46334590e-02, -2.49763101e-01,
-1.79324538e-01, -2.08251923e-02, 1.89528510e-01,
-9.29871425e-02, 1.07009195e-01, 2.11045280e-01,
-3.39877009e-02, 7.40684122e-02, -1.97052538e-01,
-8.61336514e-02, -2.74793237e-01, 3.87020469e-01,
-7.57661313e-02, 1.86928004e-01]], dtype=np.float32)
y = trimap.TRIMAP(verbose=True).fit_transform(x)
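One plausible cause (an assumption, not confirmed): the array has only 7 rows, while the defaults sample n_inliers=10 neighbors and n_outliers=5 non-neighbors per point, so rejection sampling can loop forever when there aren't enough distinct points left to draw. Clamping the triplet counts to the dataset size would avoid that; the trimap call itself is left commented out:

```python
n = 7  # rows in the example above

# leave room for the point itself and at least one outlier candidate
n_inliers = max(1, min(10, n - 2))
n_outliers = max(1, min(5, n - n_inliers - 1))

# y = trimap.TRIMAP(n_inliers=n_inliers, n_outliers=n_outliers,
#                   n_random=n_outliers, verbose=True).fit_transform(x)
```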
This is more like a request than a problem perhaps.
Wow - I LOVE TRIMAP!! <3
I wondered if it's possible to generate more than two embedded dimensions already and I'm just not seeing that option? If not, any chance that could be added going forward?
Thank You and Best Wishes,
Ian
Hello @eamid,
Thank you very much for your work.
When I try to use it on my data, I get an error.
My data looks like:
array([1.3200e+02, 3.0000e+00, 3.4000e+01, 4.1000e+01, 4.3000e+01,
9.0000e+02, 8.9700e+02, 1.2700e+02, 3.0000e+00, 1.7000e+01,
3.5900e+02, 5.9800e+02, 1.0000e+00, 1.0000e+00, 9.3000e+01,
3.0000e+00, 5.0000e+00, 1.8000e+01, 2.4500e+02, 4.2500e+02,
4.0600e+02, 1.2100e+02, 3.0000e+00, 5.0000e+00, 1.8000e+01,
7.8400e+02, 1.4690e+03, 1.1610e+03, 1.1000e+02, 3.0000e+00,
1.5000e+01, 2.0000e+02, 2.1200e+02, 6.7700e+02, 6.5400e+02,
1.1000e+02, 3.0000e+00, 3.4000e+01, 4.1000e+01, 4.3000e+01,
1.0940e+03, 1.0880e+03, 1.1000e+02, 4.0000e+00, 2.9000e+01,
...
])
and I get:
TRIMAP(n_inliers=20, n_outliers=10, n_random=10, distance=euclidean, lr=1000.0, n_iters=400, weight_adj=1000.0, apply_pca=True, opt_method=dbd, verbose=True, return_seq=False)
running TriMap on 1900000 points with dimension 500
pre-processing
applied PCA
found nearest neighbors
Traceback (most recent call last):
File "", line 4, in
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 827, in fit_transform
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 812, in fit
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 583, in trimap
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 318, in generate_triplets
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of type(CPUDispatcher(<function euclid_dist at 0x7f5913a577b8>)) with parameters (array(float64, 1d, C), array(float64, 1d, C))
Known signatures:
File "../../anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 94:
Just curious, have you considered hosting trimap on conda-forge, so it could be installed via conda install -c conda-forge trimap?
Can you do some comparisons between trimap and Beringresearch's ivis? I think both of you use similar techniques (triplets).
I'm curious to see what differs between the implementations.
In trimap_.py, there is a function
def rejection_sample(n_samples, max_int, rejects):
"""
Samples "n_samples" integers from a given interval [0,max_int] while
rejecting the values that are in the "rejects".
"""
result = np.empty(n_samples, dtype=np.int32)
for i in range(n_samples):
reject_sample = True
while reject_sample:
j = np.random.randint(max_int)
for k in range(i):
if j == result[k]:
break
for k in range(rejects.shape[0]):
if j == rejects[k]:
break
else:
reject_sample = False
result[i] = j
return result
and another function
def sample_knn_triplets(P, nbrs, n_inliers, n_outliers):
"""
Sample nearest neighbors triplets based on the similarity values given in P
Input
------
nbrs: Nearest neighbors indices for each point. The similarity values
are given in matrix P. Row i corresponds to the i-th point.
P: Matrix of pairwise similarities between each point and its neighbors
given in matrix nbrs
n_inliers: Number of inlier points
n_outliers: Number of outlier points
Output
------
triplets: Sampled triplets
"""
n, n_neighbors = nbrs.shape
triplets = np.empty((n * n_inliers * n_outliers, 3), dtype=np.int32)
for i in numba.prange(n):
sort_indices = np.argsort(-P[i])
for j in numba.prange(n_inliers):
sim = nbrs[i][sort_indices[j + 1]]
samples = rejection_sample(n_outliers, n, sort_indices[: j + 2])
for k in numba.prange(n_outliers):
index = i * n_inliers * n_outliers + j * n_outliers + k
out = samples[k]
triplets[index][0] = i
triplets[index][1] = sim
triplets[index][2] = out
# if sim==out :
# print("sim==out")
return triplets
The sort_indices always contains values in range(0, 150) [with n_inliers=100 set]. In the raw implementation you guarantee that out is not in range(0, 150), but those values are sort positions, not the true indices behind sim, so I have found that the index of sim and out can be equal sometimes. In my opinion, the implementation of sample_knn_triplets should be as below:
def sample_knn_triplets(P, nbrs, n_inliers, n_outliers):
"""
Sample nearest neighbors triplets based on the similarity values given in P
Input
------
nbrs: Nearest neighbors indices for each point. The similarity values
are given in matrix P. Row i corresponds to the i-th point.
P: Matrix of pairwise similarities between each point and its neighbors
given in matrix nbrs
n_inliers: Number of inlier points
n_outliers: Number of outlier points
Output
------
triplets: Sampled triplets
"""
n, n_neighbors = nbrs.shape
triplets = np.empty((n * n_inliers * n_outliers, 3), dtype=np.int32)
for i in numba.prange(n):
sort_indices = np.argsort(-P[i])
for j in numba.prange(n_inliers):
sim = nbrs[i][sort_indices[j + 1]]
# I have changed the next line compared with the raw code
samples = rejection_sample(n_outliers, n, nbrs[i][sort_indices[: j+2]])
for k in numba.prange(n_outliers):
index = i * n_inliers * n_outliers + j * n_outliers + k
out = samples[k]
triplets[index][0] = i
triplets[index][1] = sim
triplets[index][2] = out
# if sim==out :
# print("sim==out")
return triplets
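The confusion is easy to demonstrate: the rejected values are positions into the neighbor list, while the sampled outlier is a raw point index, so the two live in different index spaces. A tiny sketch, independent of the library code (the neighbor table here is fictional):

```python
import numpy as np

# one row of a (fictional) neighbor table: neighbor *indices* of point 0
nbrs_row = np.array([0, 42, 17, 99, 5], dtype=np.int32)
P_row = np.array([1.0, 0.9, 0.8, 0.7, 0.6])  # matching similarities

sort_indices = np.argsort(-P_row)   # positions 0..4, NOT point indices
sim = nbrs_row[sort_indices[1]]     # the chosen inlier: point 42

# original rejection set: sort positions {0, 1, 2} -- never contains 42,
# so an outlier drawn as raw index 42 would NOT be rejected
assert sim not in sort_indices[:3]
# proposed rejection set: the neighbors' true indices -- does contain 42
assert sim in nbrs_row[sort_indices[:3]]
```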
May be helpful to allow embedding in more than two dimensions as well.
Hi, I just installed TriMap and its dependencies (from conda). I ran the digits demo script:
import trimap
from sklearn.datasets import load_digits
digits = load_digits()
embedding = trimap.TRIMAP().fit_transform(digits.data)
TRIMAP(n_inliers=10, n_outliers=5, n_random=5, distance=euclidean, lr=1000.0, n_iters=400, weight_adj=500.0, apply_pca=True, opt_method=dbd, verbose=True, return_seq=False)
running TriMap on 1797 points with dimension 64
pre-processing
Illegal instruction (core dumped)
Can you give me some advice on how to debug this? I attached my conda environment (conda_env.txt), if that helps. I'm excited to give this a go!
Many thanks and kind regards,
Tim
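"Illegal instruction" usually means a compiled component executed a CPU instruction your processor lacks (for example AVX builds on older CPUs) rather than a Python-level bug. One way to narrow it down is to disable numba's JIT via its documented NUMBA_DISABLE_JIT environment variable before anything imports numba; if the crash disappears, the JIT-compiled path is the suspect:

```python
import os

# must be set before numba (or anything that imports it) is loaded
os.environ["NUMBA_DISABLE_JIT"] = "1"

# import trimap                                            # then re-run
# embedding = trimap.TRIMAP().fit_transform(digits.data)   # the digits demo
```

If the pure-Python path also crashes, the fault more likely lies in another compiled dependency (NumPy/BLAS builds, for instance), which conda can swap out.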
The verbose output could be improved. Here are some suggestions in no particular order.
Don't complain about the lack of PCA on high-dimensional data when the data is not high-dimensional and the warning is thus not relevant.
Be more specific about exactly what's happening. On large datasets I just see "pre-processing" early on and it can stay that way for a long time. What is it doing? The output should say exactly which step is running, and long-running steps should provide incremental output. Not sure incremental output is possible with nearest neighbors, but that would be particularly useful.
Note that when the TriMap settings are printed to stdout they do not include all the relevant settings (n_dims, for example), though I guess you are being more conservative with this argument for now since it's also undocumented and you mention it's untested.