scikit-learn-contrib / scikit-dimension

A Python package for intrinsic dimension estimation

Home Page: https://scikit-dimension.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

scikit-dimension's Introduction

scikit-dimension

scikit-dimension is a Python module for intrinsic dimension estimation built according to the scikit-learn API and distributed under the 3-Clause BSD license.

Please refer to the documentation and the paper for the detailed API, examples, and references.

Installation

Using pip:

pip install scikit-dimension

From source:

git clone https://github.com/j-bac/scikit-dimension
cd scikit-dimension
pip install .

Quick start

Local and global estimators can be used as follows:

import skdim
import numpy as np

# generate data: np.array (n_points x n_dim). Here a uniformly sampled 5-ball embedded in 10 dimensions
data = np.zeros((1000, 10))
data[:, :5] = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

# estimate global intrinsic dimension
danco = skdim.id.DANCo().fit(data)
# estimate local intrinsic dimension (dimension in k-nearest-neighborhoods around each point)
lpca = skdim.id.lPCA().fit_pw(data,
                              n_neighbors=100,
                              n_jobs=1)

# get estimated intrinsic dimension
print(danco.dimension_, np.mean(lpca.dimension_pw_))

scikit-dimension's People

Contributors

auranic, j-bac, jessecresswell

scikit-dimension's Issues

DANCo Fast

Hi,

I noticed that in the DANCo.py file some functions which seem to implement a version of FastDANCo have been commented out. I was wondering whether it is safe to uncomment and use these in my code, or whether there is a reason they are commented out?

Thanks.

P.S. Great package.

Quickstart error -- no FisherS example

Minor issue. The last line of the quickstart page throws an error:

print(danco.dimension_, fishers.dimension_, np.mean(lpca.dimension_pw_))

The error occurs because the fishers instance is never created. Adding a line like the following fixes it:

fishers = skdim.id.FisherS().fit(data)
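
For reference, a minimal sketch of the corrected quickstart tail with the suggested line added:

import numpy as np
import skdim

data = np.zeros((1000, 10))
data[:, :5] = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

danco = skdim.id.DANCo().fit(data)
fishers = skdim.id.FisherS().fit(data)  # the missing line
lpca = skdim.id.lPCA().fit_pw(data, n_neighbors=100, n_jobs=1)

print(danco.dimension_, fishers.dimension_, np.mean(lpca.dimension_pw_))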

FisherS encounters error

Hi,

I've tried the FisherS() estimation using several values of the condition number and several datasets, and all attempts come back with the same error.

See below for two examples using the tutorial dataset. Am I missing something? Do I need to specify alphas? I am running Python 3.7.3 on a MacBook.

Thanks much,
Dan Schnell CCHMC

data = np.zeros((1000,10))
data[:,:5] = skdim.datasets.hyperBall(n = 1000, d = 5, radius = 1, random_state = 0)

pca=skdim.id.lPCA()
gid1=pca.fit(data).dimension_
gid1
5
FS2 = skdim.id.FisherS(conditional_number=10, project_on_sphere=1, alphas=None, produce_plots=False, verbose=0, limit_maxdim=False).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 123, in fit
    ) = self._SeparabilityAnalysis(X)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 617, in _SeparabilityAnalysis
    separable_fraction, p_alpha = self._checkSeparabilityMultipleAlpha(Xp)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 259, in _checkSeparabilityMultipleAlpha
    counts[k:e, :] = counts[k:e, :] + self._histc(xy.T, alphas)
TypeError: expected dtype object, got 'numpy.dtype[float64]'
FS1 = skdim.id.FisherS().fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 123, in fit
    ) = self._SeparabilityAnalysis(X)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 617, in _SeparabilityAnalysis
    separable_fraction, p_alpha = self._checkSeparabilityMultipleAlpha(Xp)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 259, in _checkSeparabilityMultipleAlpha
    counts[k:e, :] = counts[k:e, :] + self._histc(xy.T, alphas)
TypeError: expected dtype object, got 'numpy.dtype[float64]'

Too many threads spawned in DANCo().fit()

Hello, I am an HPC technical consultant and one of our users is using your package. The function skdim.id.DANCo().fit(data) seems to spawn a significant number of threads, with no apparent way to control the number. In an HPC environment this is an issue. Is this intended?
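
A possible workaround, assuming the extra threads come from the underlying BLAS/OpenMP thread pools rather than from skdim itself, is to cap them with threadpoolctl for the duration of the fit:

import skdim
from threadpoolctl import threadpool_limits

data = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

# limit the native (BLAS/OpenMP) thread pools while DANCo runs
with threadpool_limits(limits=1):
    danco = skdim.id.DANCo().fit(data)
print(danco.dimension_)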

Question: Why is d = k + 1 for Kaiser and Broken Stick?

I noticed that the ID estimates provided by the Kaiser and broken_stick methods in id.lPCA are k + 1, where k is the number of components to be kept according to the most commonly used implementations of these rules (i.e. keep only components with an eigenvalue > 1 [Kaiser], or keep only components with greater than expected explained variance [broken stick]).

I'm wondering what the thinking was behind this choice, and if there are any papers I can cite justifying this modification.
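
For context, a small sketch of the conventional Kaiser count on hypothetical data, i.e. the k that the methods above reportedly return as k + 1:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, :5] += X[:, 5:]                 # introduce some correlated directions

# conventional Kaiser rule: keep components whose correlation-matrix eigenvalue exceeds 1
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
k = int(np.sum(eigvals > 1))
print(k)                             # per this issue, id.lPCA would report k + 1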

Thanks!

FisherS returns `nan` when the ID is large

import numpy as np
import skdim

def test_ID_estimate(D, name='TwoNN', *args, **kwargs):
  ids = []
  Ns = [64, 128, 256, 512, 1024, 2048]
  for N in Ns:
    data = np.zeros((N, 3*32*32))
    data[:, :D] = skdim.datasets.hyperBall(n=N, d=D, radius=2, random_state=666)
    _id = eval(f"skdim.id.{name}")(*args, **kwargs).fit_transform(X=data)
    ids.append(_id)
    print(f'{name}', N, _id)
  return ids

Results

test_ID_estimate(10, name='FisherS')
>>>
FisherS 64 11.258882790713601
FisherS 128 12.293955630520735
FisherS 256 10.300845319222297
FisherS 512 10.0622264687654
FisherS 1024 10.117248646587507
FisherS 2048 10.083245836860842
test_ID_estimate(20, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 25.11715241495022
FisherS 256 19.984501791074795
FisherS 512 20.91391435506641
FisherS 1024 19.95265685270067
FisherS 2048 19.77165730041636
test_ID_estimate(50, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 nan
FisherS 256 nan
FisherS 512 nan
FisherS 1024 nan
FisherS 2048 49.44956027057592
test_ID_estimate(100, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 nan
FisherS 256 nan
FisherS 512 nan
FisherS 1024 nan
FisherS 2048 nan

Incorrect implementation of Participation Ratio

I just noticed what I think might be a problem with your implementation of Participation Ratio!

Suppose one is given a matrix X with shape (num samples, num features). The PR is given by square(sum(eigenvalues)) / sum(square(eigenvalues)) of X^T X, but the implementation uses PCA.

pca = PCA().fit(X)

PCA first demeans the data, so its eigenvalues are those of the covariance matrix rather than of X^T X. I don't think that's correct.
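
A small sketch on hypothetical data contrasting the two computations; they agree only when the data already has zero mean:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) + 3.0          # a nonzero mean makes the difference visible

# participation ratio from the eigenvalues of the uncentered Gram matrix X^T X
eig_gram = np.linalg.eigvalsh(X.T @ X)
pr_gram = eig_gram.sum() ** 2 / np.square(eig_gram).sum()

# participation ratio from PCA eigenvalues (PCA demeans the data first)
eig_pca = PCA().fit(X).explained_variance_
pr_pca = eig_pca.sum() ** 2 / np.square(eig_pca).sum()

print(pr_gram, pr_pca)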

`asPointwise` with `n_jobs>1` does not progress

(This was tested after fixing another issue locally #17)

def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):

asPointwise works as intended when n_jobs=1. However, in my experience when n_jobs>1 the async call does not complete.

I can recommend using the joblib library instead of multiprocessing, for example:

import numpy as np
from joblib import delayed, Parallel

# `knn` is the array of k-nearest-neighbor indices computed earlier inside asPointwise
if n_jobs > 1:
    with Parallel(n_jobs=n_jobs) as parallel:
        def fit_estimator(class_instance, data, idx):
            return class_instance.fit(data[idx, :]).dimension_
        results = parallel(
            delayed(fit_estimator)(
                class_instance, data, idx
            ) for idx in knn
        )
    return np.array(results)
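
For reference, the call pattern that reproduces the problem for me, assuming asPointwise is importable from the package top level as its signature above suggests:

import skdim

data = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

ok = skdim.asPointwise(data, skdim.id.lPCA(), n_neighbors=100, n_jobs=1)    # completes
# skdim.asPointwise(data, skdim.id.lPCA(), n_neighbors=100, n_jobs=4)       # does not complete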

dimension drops to 0 when number of samples is high

Hello and thanks for the package !

I am currently running a few tests, and when I run an estimator on large datasets the returned dimension is 0. When I lower the number of samples, it returns a higher dimension. Any idea why this behavior happens?

Best,

Etienne

Computation of estimator values pointwise on kNN sets is not parallelized

I will use ESS as an example, since it is a pretty slow estimator. Since it is a LocalEstimator, when fit is called the kNNs are first computed (if not already provided):

dists, knnidx = get_nn(X, k=self.n_neighbors, n_jobs=n_jobs)

and this in turn calls into the scikit-learn library, which properly parallelizes the computation based on the n_jobs parameter we can provide.

Second, the call to self.fit in the ESS class performs a simple, single-threaded for loop over the data points:

for i in range(len(X)):
    self.dimension_pw_[i], self.essval_[i] = self._essLocalDimEst(
        X[knnidx[i, :]]
    )

The computations are "embarrassingly parallel". I have locally implemented parallelization and seen at least a 6x speedup in ESS computation on datasets of size 5,000 - 50,000.

I can open a PR with these changes as an example if you are willing to add joblib as a dependency.
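
For illustration, a self-contained sketch of the joblib pattern described above, using lPCA as a faster stand-in for ESS; the scikit-learn neighbor search stands in for the get_nn step:

import numpy as np
import skdim
from joblib import Parallel, delayed
from sklearn.neighbors import NearestNeighbors

X = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)
knnidx = NearestNeighbors(n_neighbors=100).fit(X).kneighbors(return_distance=False)

def fit_one(estimator, X, idx):
    # fit the estimator on a single k-nearest-neighborhood and return its estimate
    return estimator.fit(X[idx, :]).dimension_

# embarrassingly parallel loop over the pointwise neighborhoods
dimension_pw = np.array(
    Parallel(n_jobs=-1)(
        delayed(fit_one)(skdim.id.lPCA(), X, knnidx[i, :]) for i in range(len(X))
    )
)
print(dimension_pw.mean())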

Inconsistent results when using multiprocessing in fit_pw method

Dear authors,

thank you for your work on this project and for making the code available to the community!
I have encountered an issue with the fit_pw method in the lPCA class when using multiprocessing. The issue is that the output dimensions are not the same when using different numbers of jobs (n_jobs), while they should be consistent. Below is a code sample to reproduce the problem:

import numpy as np
import skdim

# set the random seed
np.random.seed(42)

lpca = skdim.id.lPCA(ver="FO", alphaFO=0.05, verbose=True)

# create a random dataset
X_np = np.random.randn(100, 10)

print("X_np.shape: ", X_np.shape)
print(X_np)

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=1, smooth=False)
lpca_dimensions_pw_n_jobs_1 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=2, smooth=False)
lpca_dimensions_pw_n_jobs_2 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

# Check that the results are the same
assert np.allclose(lpca_dimensions_pw_n_jobs_1, lpca_dimensions_pw_n_jobs_2)

I have identified that the problem is related to the management of class instances and their state when using multiprocessing. In the current implementation, the worker processes do not share memory with the main process, so modifications to the instances within the worker processes are not reflected in the main process. The dimension is being calculated and stored in the _dimension attribute of the instances within the worker processes, but this information is not being propagated back to the main process.

A possible solution is to change the fit_pw function to return the computed dimension values instead of storing them in the _dimension attribute of the instances. Then, use the apply_async function to asynchronously apply the fit function to each data point and collect the results in the main process. Here's an example of a modified fit_pw function that addresses this issue:

def fit_pw(self, X, precomputed_knn=None, smooth=False, n_neighbors=100, n_jobs=1):
    # ...
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(self.fit, (X[i, :],)) for i in knnidx]
            # Retrieve the computed dimensions
            self.dimension_pw_ = np.array([r.get().dimension_ for r in results])
    # ...

With this modification, the computed dimensions are correctly returned and stored in the main process, and the dimensions are consistent when running the code with different numbers of jobs.
Would it be possible to consider this change or a similar approach to address the issue with multiprocessing in the fit_pw method? This problem likely also affects the other multiprocessing dimension estimates, and not just the lPCA class.

Incorrect dimension of `hyperSphere`

The parameter d in skdim.datasets.hyperSphere is documented to be the "Dimension of the hypersphere". However, it is actually one larger than the intrinsic dimension of the hypersphere: the parameter d is the dimension of the ambient space in which the hypersphere is embedded (the extrinsic dimension).

In [1]: from skdim.datasets import hyperSphere

In [2]: X = hyperSphere(n=100,d=2)

In [3]: X.shape
Out[3]: (100, 2)

The 2-sphere (the usual sphere) can't be embedded in R^2. The dataset X above is a sampling of the circle (1-sphere), not a 2-sphere.
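
A quick check of the behavior described above; the shapes in the comments reflect the current semantics of d:

import skdim

circle = skdim.datasets.hyperSphere(n=100, d=2)   # points on the circle (1-sphere) in R^2
sphere = skdim.datasets.hyperSphere(n=100, d=3)   # points on the ordinary 2-sphere in R^3
print(circle.shape, sphere.shape)                 # (100, 2) (100, 3): d is the ambient dimension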

setup.py should contain numba in INSTALL_REQUIRES

Hi,
When numba is not installed, FisherS fails with the error

_FisherS.py, line 33, in <module>
    import numba as nb
ModuleNotFoundError: No module named 'numba'

Line 33 indeed imports numba, which is missing from INSTALL_REQUIRES.

I would suggest changing the INSTALL_REQUIRES to:

INSTALL_REQUIRES = ["numpy", "scipy", "scikit-learn", "numba"]

to prevent this.
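
Until the dependency list is updated, installing numba manually works around the error:

pip install numba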

Very good package!
