eamid / trimap
TriMap: Large-scale Dimensionality Reduction Using Triplets
License: Apache License 2.0
Glad to see this cool work from a fellow Slug.
Is there a reason that there is no support for a random seed argument?
That would be a very useful (and standard) thing to include.
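As a possible workaround until a seed argument exists, seeding NumPy's global RNG before each call may make runs repeatable. This is a sketch that assumes TriMap draws its randomness from np.random, which is not a documented guarantee; the actual fit_transform call is left commented out as a placeholder.

```python
import numpy as np

def reproducible_embedding(X, seed=42):
    """Seed NumPy's global RNG before a stochastic embedding call.

    Assumption: the library draws from np.random (not confirmed for TriMap).
    """
    np.random.seed(seed)
    # embedding = trimap.TRIMAP().fit_transform(X)  # real call would go here
    return np.random.permutation(len(X))  # stand-in for a stochastic step

X = np.zeros((10, 4))
run1 = reproducible_embedding(X, seed=0)
run2 = reproducible_embedding(X, seed=0)
assert np.array_equal(run1, run2)  # identical seeds give identical draws
```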
The shared code really helped in my research and I'd like to apply TriMap to further data analysis. It would be great if sample data could be clustered in the same way a deep learning network does.
What I mean is, for example: a 2-dimensional cluster plot for the MNIST dataset is produced after clustering, and a new image (a handwritten 0-9 digit, maybe my own) is then projected onto this plot so that it lands in the correct class region (0-9).
I've searched for saving and loading of k-means clusters but found no further information.
Actually, I'm not sure whether it is possible or not.
If there's any advice for this issue, please let me know.
Thanks in advance.
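TriMap itself offers no transform for unseen points, but a common approximation (a sketch of a general technique, not part of the library) is to place a new sample at the average embedding position of its nearest training neighbors; the class region it lands in then follows from the training labels:

```python
import numpy as np

def place_new_point(X_train, emb_train, x_new, k=5):
    """Approximate out-of-sample embedding: average the 2-D coordinates
    of the k nearest training points (Euclidean in the original space)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(d)[:k]
    return emb_train[nn].mean(axis=0)

# toy data: two well-separated clusters and their (pretend) 2-D embedding
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
emb = np.vstack([rng.normal(-1, 0.05, (20, 2)), rng.normal(1, 0.05, (20, 2))])

p = place_new_point(X, emb, X[0] + 0.01, k=3)
# p should land near the first cluster's embedding region (around -1, -1)
```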
To reproduce:
import trimap
import numpy as np
import pandas as pd
# load pairwise cosine similarities and convert them to distances
cossims = pd.read_feather("wiki_rule_cosinesimilarities.feather")
distmat = 1 - np.matrix(cossims.iloc[:, 0:cossims.shape[0]], 'double')
tmap = trimap.TRIMAP(use_dist_matrix=True)
tmap = tmap.fit_transform(distmat)
I'm attaching the data in a zipfile.
wiki_rule_cosinesimilarities.feather.zip
First, thanks for your work on this package and technique and making it available for others to study and experiment with.
The following example (using return_seq=True) raises a ValueError for me:
import trimap
from sklearn.datasets import load_digits
digits = load_digits()
embedding = trimap.TRIMAP(return_seq=True).fit_transform(digits.data, init="pca")
Omitting the init argument and letting it default to None allows the computation to finish, but the initial coordinates are exported as nan:
embedding = trimap.TRIMAP(return_seq=True).fit_transform(digits.data)
embedding[:, :, 0]
array([[nan, nan],
[nan, nan],
[nan, nan],
...,
[nan, nan],
[nan, nan],
[nan, nan]])
I think the following:
Line 591 in a7250f3
should be:
Y_all[:, :, 0] = Y
as Y_init may hold a string like "random" (which causes the ValueError) or None (hence the nans).
Happy to provide a PR for this if needed.
trimap.TRIMAP().get_params()
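For context, scikit-learn hands out get_params() for free when an estimator subclasses BaseEstimator and declares all of its arguments explicitly in __init__. A minimal sketch (the Reducer class below is hypothetical, not TriMap's code):

```python
from sklearn.base import BaseEstimator

class Reducer(BaseEstimator):
    """Toy estimator: get_params()/set_params() are inherited from
    BaseEstimator, which introspects the __init__ signature."""
    def __init__(self, n_inliers=10, n_outliers=5, lr=1000.0):
        self.n_inliers = n_inliers
        self.n_outliers = n_outliers
        self.lr = lr

params = Reducer(n_inliers=20).get_params()
# params maps each constructor argument to its current value
```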
Would be helpful to add example programs to make checking reproducibility easier.
Is it possible to use something like angle (cosine similarity) as the measure of closeness?
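One workaround, assuming Euclidean distance is supported: L2-normalize the rows first. On unit vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so nearest-neighbor rankings under cosine similarity are preserved:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

u, v = Xn[0], Xn[1]
cos_sim = u @ v
sq_euclid = np.sum((u - v) ** 2)
# identity for unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v))
assert np.isclose(sq_euclid, 2 * (1 - cos_sim))
# feeding Xn to a Euclidean embedder then ranks neighbors as cosine would
```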
For reproducibility, it should also have a transform option, so it can transform data points it hasn't been trained on.
On top of that, this functionality is also needed to be able to use it in an sklearn Pipeline.
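Until a native transform exists, a thin wrapper can approximate one: fit once with fit_transform, then map unseen points via their nearest training neighbors. This is a sketch, using PCA merely as a stand-in for any embedder that only offers fit_transform; a TRIMAP instance could be dropped in the same slot.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class FitTransformOnlyWrapper(BaseEstimator, TransformerMixin):
    """Give a fit_transform-only embedder an approximate transform(),
    so it can sit inside an sklearn Pipeline."""
    def __init__(self, embedder=None, k=5):
        self.embedder = embedder
        self.k = k

    def fit(self, X, y=None):
        self.X_train_ = np.asarray(X)
        self.emb_train_ = self.embedder.fit_transform(self.X_train_)
        return self

    def transform(self, X):
        X = np.asarray(X)
        out = np.empty((len(X), self.emb_train_.shape[1]))
        for i, x in enumerate(X):
            d = np.linalg.norm(self.X_train_ - x, axis=1)
            nn = np.argsort(d)[: self.k]  # k nearest training points
            out[i] = self.emb_train_[nn].mean(axis=0)
        return out

wrapper = FitTransformOnlyWrapper(embedder=PCA(n_components=2), k=3)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
emb = wrapper.fit(X).transform(X[:5])
```

The nearest-neighbor mapping is only an approximation; for embedders with a genuine parametric mapping a native transform would be preferable.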
Hi,
is it possible to use a precomputed distance matrix as input? If not, will it be added in the future?
Thank you for the reply.
Best regards,
Vykintas
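For reference, a full pairwise distance matrix can be built with SciPy; whether TriMap accepts it depends on the version (another report on this page passes use_dist_matrix=True, which I cannot confirm is in the released API, so the call is left commented out):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

# dense n x n matrix of pairwise Euclidean distances
D = squareform(pdist(X, metric="euclidean"))
# a precomputed-distance input is expected to be symmetric, zero diagonal
assert D.shape == (30, 30)
# tmap = trimap.TRIMAP(use_dist_matrix=True).fit_transform(D)  # hypothetical
```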
I got an error from numba when trying to use trimap on a fairly simple dataset. Any help greatly appreciated!
Here's a colab notebook with the reproduction:
https://colab.research.google.com/drive/1nhFmCGNDerz-0V3pJoL9UFGntD4OonYL
TRIMAP(n_inliers=10, n_outliers=5, n_random=5, lr=1000.0, n_iters=400, weight_adj=500.0, fast_trimap = True, opt_method = dbd, verbose=True, return_seq=False)
running TriMap on 10000 points with dimension 508
pre-processing
found nearest neighbors
sampled triplets
running TriMap with dbd
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-9fcd06511c30> in <module>()
----> 1 embedding = trimap.TRIMAP().fit_transform(vectors)
3 frames
/usr/local/lib/python3.6/dist-packages/numba/dispatcher.py in _explain_matching_error(self, *args, **kws)
461 msg = ("No matching definition for argument type(s) %s"
462 % ', '.join(map(str, args)))
--> 463 raise TypeError(msg)
464
465 def _search_new_conversions(self, *args, **kws):
TypeError: No matching definition for argument type(s) array(float32, 1d, C), int64, int64, array(int32, 2d, C), array(float32, 1d, C)
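The dispatch failure lists an int64 among the argument types, while the compiled signatures may expect narrower ones. As a guess (not a confirmed fix), forcing the input array into the dtype and layout numba compiled for sometimes sidesteps such mismatches:

```python
import numpy as np

vectors = np.random.rand(100, 508)  # placeholder for the real data
# cast to float32 and C-contiguous layout, matching the array types
# that appear in the error message
vectors32 = np.ascontiguousarray(vectors, dtype=np.float32)
# embedding = trimap.TRIMAP().fit_transform(vectors32)  # retry with this
```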
The following code makes Trimap hang forever.
import numpy as np
import trimap
x = np.array([[ 3.18987876e-01, 5.87170608e-02, -5.35221584e-02,
-2.12370202e-01, 1.44289479e-01, 1.15213081e-01,
-3.49550992e-01, -8.56188014e-02, 7.67039582e-02,
-7.87917897e-02, -2.89615601e-01, -2.38374388e-03,
-6.07468300e-02, -1.53473644e-02, 9.19963419e-02,
-1.14370733e-01, 1.21543720e-01, 1.16481416e-01,
-2.94296652e-01, -1.43486544e-01, -3.29958886e-01,
1.34309351e-01, -4.32708934e-02, 3.27159733e-01,
1.35406721e-04, 2.15839192e-01, -2.31008962e-01,
-1.53630883e-01, 1.70035616e-01, -1.03398576e-01,
-7.83967040e-03, -1.48111418e-01, 7.08103701e-02,
1.51507165e-02, -4.70302580e-03],
[ 2.70511746e-01, 2.50944565e-03, -1.01266943e-01,
-6.04593521e-03, 1.90846086e-01, -5.88433584e-03,
-3.05718035e-01, -1.63746793e-02, 8.91139284e-02,
-3.90956774e-02, -2.89017886e-01, 5.44876307e-02,
-3.34294289e-02, 5.05351350e-02, 1.19450457e-01,
-2.66644936e-02, 1.38987005e-01, 2.54748076e-01,
-2.78318554e-01, 5.58482762e-03, -4.44619954e-01,
-3.14005986e-02, -2.54096221e-02, 3.29968154e-01,
4.54740152e-02, 1.45967603e-01, -1.36808544e-01,
-1.10377215e-01, 1.64085761e-01, -2.38455474e-01,
-1.35548353e-01, -1.64852977e-01, 1.17668778e-01,
-4.60316762e-02, 4.73128930e-02],
[ 3.17780316e-01, -7.81738758e-03, -6.44788519e-02,
5.62540069e-02, 1.69442132e-01, 5.34028653e-03,
-3.56567532e-01, 9.72701795e-03, 8.40950683e-02,
-7.36852437e-02, -3.20505381e-01, 2.87447236e-02,
-8.96242410e-02, 1.10711388e-01, 3.08006257e-02,
-1.42246597e-02, 7.26564825e-02, 3.26128125e-01,
-1.96420610e-01, -8.66924319e-03, -3.05779576e-01,
-2.30795946e-02, 9.55938771e-02, 3.96909148e-01,
7.82142058e-02, 1.47577658e-01, -9.03981999e-02,
-4.88963164e-02, 1.18389614e-01, -2.15027452e-01,
-6.54470399e-02, -1.75441504e-01, 1.87194660e-01,
-5.08111436e-04, 1.35444716e-01],
[ 2.58012921e-01, -8.77735093e-02, -1.28023893e-01,
1.47463515e-01, 2.61107385e-01, -5.92785887e-02,
-2.14058936e-01, 3.41764428e-02, 4.58676219e-02,
-4.56911102e-02, -2.89655060e-01, -1.57761140e-04,
-4.51611951e-02, 7.53968805e-02, 7.84260333e-02,
5.99992424e-02, 1.10423878e-01, 3.26432049e-01,
-2.62022614e-01, 2.30244398e-02, -3.76471043e-01,
-1.13793373e-01, 1.96540896e-02, 2.30564684e-01,
6.99499100e-02, 1.44859001e-01, 5.51677980e-02,
2.79185660e-02, 7.44636357e-02, -2.78124183e-01,
-1.65953085e-01, -1.10599346e-01, 2.63543546e-01,
-8.91586766e-02, 1.93403229e-01],
[ 3.32011819e-01, -1.40174493e-01, -5.28167412e-02,
1.13800459e-01, 2.06157431e-01, -8.29892382e-02,
-2.11161330e-01, 7.94143155e-02, 4.90802489e-02,
-1.19306277e-02, -2.87060529e-01, 4.33459552e-03,
8.65805820e-02, 3.03589255e-02, 1.73449665e-01,
1.71231180e-02, 4.74411622e-02, 2.65454501e-01,
-2.75403082e-01, 2.34591905e-02, -3.79175991e-01,
-1.03660703e-01, 4.20364253e-02, 1.28694892e-01,
-8.52392241e-03, -4.99439947e-02, 1.10806182e-01,
-2.32070358e-03, 2.65163928e-02, -3.77998233e-01,
-2.85796434e-01, -7.88480118e-02, 1.74133658e-01,
-1.40881404e-01, 1.08900480e-01],
[ 1.82337701e-01, -2.11179242e-01, -1.01714216e-01,
1.31016269e-01, 4.99383882e-02, -1.59250170e-01,
-1.29212305e-01, -3.32643799e-02, 1.20454393e-01,
1.02800533e-01, -2.92455345e-01, -1.76530272e-01,
2.09684089e-01, 1.33223221e-01, 1.39211901e-02,
4.81586717e-03, -9.83966216e-02, 3.23559731e-01,
-2.28622139e-01, 3.68424207e-02, -2.63355613e-01,
-1.88473210e-01, 4.12943624e-02, 1.66466340e-01,
-1.77660301e-01, -1.06210433e-01, 2.31963158e-01,
-5.21184653e-02, 8.36717412e-02, -2.57204562e-01,
-2.26933807e-01, -1.83641464e-01, 2.42122248e-01,
-1.56716019e-01, 4.54310402e-02],
[ 2.46496126e-02, -1.26516521e-01, -2.60583401e-01,
2.04805687e-01, 1.16600819e-01, -2.23044977e-01,
-1.97046809e-02, -8.16227198e-02, 7.48965740e-02,
1.76039010e-01, -2.80806333e-01, -9.68108177e-02,
1.12287454e-01, 1.50147453e-01, -7.96348378e-02,
4.77459133e-02, 3.08816843e-02, 2.76006699e-01,
-2.06872299e-01, 1.46334590e-02, -2.49763101e-01,
-1.79324538e-01, -2.08251923e-02, 1.89528510e-01,
-9.29871425e-02, 1.07009195e-01, 2.11045280e-01,
-3.39877009e-02, 7.40684122e-02, -1.97052538e-01,
-8.61336514e-02, -2.74793237e-01, 3.87020469e-01,
-7.57661313e-02, 1.86928004e-01]], dtype=np.float32)
y = trimap.TRIMAP(verbose=True).fit_transform(x)
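One plausible cause (an assumption, not confirmed): the array has only 7 rows, while the defaults sample n_inliers=10 neighbors and n_outliers=5 non-neighbors per point, so rejection sampling can loop forever when there aren't enough distinct points left to draw. Clamping the triplet counts to the dataset size would avoid that; the trimap call itself is left commented out:

```python
n = 7  # rows in the example above

# leave room for the point itself and at least one outlier candidate
n_inliers = max(1, min(10, n - 2))
n_outliers = max(1, min(5, n - n_inliers - 1))

# y = trimap.TRIMAP(n_inliers=n_inliers, n_outliers=n_outliers,
#                   n_random=n_outliers, verbose=True).fit_transform(x)
```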
This is more like a request than a problem perhaps.
Wow - I LOVE TRIMAP!! <3
I wondered if it's possible to generate more than two embedded dimensions already and I'm just not seeing that option? If not, any chance that could be added going forward?
Thank You and Best Wishes,
Ian
Hello @eamid,
Thank you very much for your work.
When I try to use it on my data, I get an error.
My data looks like:
array([1.3200e+02, 3.0000e+00, 3.4000e+01, 4.1000e+01, 4.3000e+01,
9.0000e+02, 8.9700e+02, 1.2700e+02, 3.0000e+00, 1.7000e+01,
3.5900e+02, 5.9800e+02, 1.0000e+00, 1.0000e+00, 9.3000e+01,
3.0000e+00, 5.0000e+00, 1.8000e+01, 2.4500e+02, 4.2500e+02,
4.0600e+02, 1.2100e+02, 3.0000e+00, 5.0000e+00, 1.8000e+01,
7.8400e+02, 1.4690e+03, 1.1610e+03, 1.1000e+02, 3.0000e+00,
1.5000e+01, 2.0000e+02, 2.1200e+02, 6.7700e+02, 6.5400e+02,
1.1000e+02, 3.0000e+00, 3.4000e+01, 4.1000e+01, 4.3000e+01,
1.0940e+03, 1.0880e+03, 1.1000e+02, 4.0000e+00, 2.9000e+01,
...
])
and I get:
TRIMAP(n_inliers=20, n_outliers=10, n_random=10, distance=euclidean, lr=1000.0, n_iters=400, weight_adj=1000.0, apply_pca=True, opt_method=dbd, verbose=True, return_seq=False)
running TriMap on 1900000 points with dimension 500
pre-processing
applied PCA
found nearest neighbors
Traceback (most recent call last):
File "", line 4, in
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 827, in fit_transform
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 812, in fit
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 583, in trimap
File "/root/anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 318, in generate_triplets
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "/root/anaconda3/lib/python3.6/site-packages/numba/core/utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of type(CPUDispatcher(<function euclid_dist at 0x7f5913a577b8>)) with parameters (array(float64, 1d, C), array(float64, 1d, C))
Known signatures:
File "../../anaconda3/lib/python3.6/site-packages/trimap-1.0.14-py3.6.egg/trimap/trimap_.py", line 94:
Just curious, have you considered hosting trimap on conda-forge, so it could be installed via conda install -c conda-forge trimap?
Can you do some comparisons between trimap and Beringresearch's ivis? I think both of you use similar techniques (triplets).
I'm curious to see what differs between the implementations.
In trimap_.py, there is a function
def rejection_sample(n_samples, max_int, rejects):
"""
Samples "n_samples" integers from a given interval [0,max_int] while
rejecting the values that are in the "rejects".
"""
result = np.empty(n_samples, dtype=np.int32)
for i in range(n_samples):
reject_sample = True
while reject_sample:
j = np.random.randint(max_int)
for k in range(i):
if j == result[k]:
break
for k in range(rejects.shape[0]):
if j == rejects[k]:
break
else:
reject_sample = False
result[i] = j
return result
and another function
def sample_knn_triplets(P, nbrs, n_inliers, n_outliers):
"""
Sample nearest neighbors triplets based on the similarity values given in P
Input
------
nbrs: Nearest neighbors indices for each point. The similarity values
are given in matrix P. Row i corresponds to the i-th point.
P: Matrix of pairwise similarities between each point and its neighbors
given in matrix nbrs
n_inliers: Number of inlier points
n_outliers: Number of outlier points
Output
------
triplets: Sampled triplets
"""
n, n_neighbors = nbrs.shape
triplets = np.empty((n * n_inliers * n_outliers, 3), dtype=np.int32)
for i in numba.prange(n):
sort_indices = np.argsort(-P[i])
for j in numba.prange(n_inliers):
sim = nbrs[i][sort_indices[j + 1]]
samples = rejection_sample(n_outliers, n, sort_indices[: j + 2])
for k in numba.prange(n_outliers):
index = i * n_inliers * n_outliers + j * n_outliers + k
out = samples[k]
triplets[index][0] = i
triplets[index][1] = sim
triplets[index][2] = out
# if sim==out :
# print("sim==out")
return triplets
The sort_indices always contains values in range(0, 150) [with n_inliers=100 set]. In the raw implementation you guarantee that out is not in range(0, 150), but those values are sort positions, not the true indices behind sim, so I have found that the index of sim and out can be equal sometimes. In my opinion, the implementation of sample_knn_triplets should be as below:
def sample_knn_triplets(P, nbrs, n_inliers, n_outliers):
"""
Sample nearest neighbors triplets based on the similarity values given in P
Input
------
nbrs: Nearest neighbors indices for each point. The similarity values
are given in matrix P. Row i corresponds to the i-th point.
P: Matrix of pairwise similarities between each point and its neighbors
given in matrix nbrs
n_inliers: Number of inlier points
n_outliers: Number of outlier points
Output
------
triplets: Sampled triplets
"""
n, n_neighbors = nbrs.shape
triplets = np.empty((n * n_inliers * n_outliers, 3), dtype=np.int32)
for i in numba.prange(n):
sort_indices = np.argsort(-P[i])
for j in numba.prange(n_inliers):
sim = nbrs[i][sort_indices[j + 1]]
# I have changed the next line compared with the raw code
samples = rejection_sample(n_outliers, n, nbrs[i][sort_indices[: j+2]])
for k in numba.prange(n_outliers):
index = i * n_inliers * n_outliers + j * n_outliers + k
out = samples[k]
triplets[index][0] = i
triplets[index][1] = sim
triplets[index][2] = out
# if sim==out :
# print("sim==out")
return triplets
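The confusion is easy to demonstrate: the rejected values are positions into the neighbor list, while the sampled outlier is a raw point index, so the two live in different index spaces. A tiny sketch, independent of the library code (the neighbor table here is fictional):

```python
import numpy as np

# one row of a (fictional) neighbor table: neighbor *indices* of point 0
nbrs_row = np.array([0, 42, 17, 99, 5], dtype=np.int32)
P_row = np.array([1.0, 0.9, 0.8, 0.7, 0.6])  # matching similarities

sort_indices = np.argsort(-P_row)   # positions 0..4, NOT point indices
sim = nbrs_row[sort_indices[1]]     # the chosen inlier: point 42

# original rejection set: sort positions {0, 1, 2} -- never contains 42,
# so an outlier drawn as raw index 42 would NOT be rejected
assert sim not in sort_indices[:3]
# proposed rejection set: the neighbors' true indices -- does contain 42
assert sim in nbrs_row[sort_indices[:3]]
```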
May be helpful to allow embedding in more than two dimensions as well.
Hi, I just installed TriMap and its dependencies (from conda). I ran the digits demo script:
import trimap
from sklearn.datasets import load_digits
digits = load_digits()
embedding = trimap.TRIMAP().fit_transform(digits.data)
TRIMAP(n_inliers=10, n_outliers=5, n_random=5, distance=euclidean, lr=1000.0, n_iters=400, weight_adj=500.0, apply_pca=True, opt_method=dbd, verbose=True, return_seq=False)
running TriMap on 1797 points with dimension 64
pre-processing
Illegal instruction (core dumped)
Can you give me some advice on how to debug this? I attached my conda environment (conda_env.txt), if that helps. I'm excited to give this a go!
Many thanks and kind regards,
Tim
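"Illegal instruction" usually means a compiled component executed a CPU instruction your processor lacks (for example AVX builds on older CPUs) rather than a Python-level bug. One way to narrow it down is to disable numba's JIT via its documented NUMBA_DISABLE_JIT environment variable before anything imports numba; if the crash disappears, the JIT-compiled path is the suspect:

```python
import os

# must be set before numba (or anything that imports it) is loaded
os.environ["NUMBA_DISABLE_JIT"] = "1"

# import trimap                                            # then re-run
# embedding = trimap.TRIMAP().fit_transform(digits.data)   # the digits demo
```

If the pure-Python path also crashes, the fault more likely lies in another compiled dependency (NumPy/BLAS builds, for instance), which conda can swap out.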
The verbose output could be improved. Here are some suggestions in no particular order.
Don't complain about the lack of PCA on high-dimensional data when the data is not high-dimensional and the warning is thus not relevant.
Be more specific about exactly what's happening. On large datasets I just see "pre-processing" early on and it can stay that way for a long time. What is it doing? The output should say exactly which step is running, and long-running steps should provide incremental output. Not sure incremental output is possible with nearest neighbors, but that would be particularly useful.
Note that when the TriMap settings are printed to stdout they do not include all the relevant settings (n_dims, for example), though I guess you are being more conservative with this argument for now since it's also undocumented and you mention it's untested.