scikit-learn-contrib / scikit-dimension

A Python package for intrinsic dimension estimation

Home Page: https://scikit-dimension.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

scikit-dimension's Introduction

scikit-dimension

scikit-dimension is a Python module for intrinsic dimension estimation built according to the scikit-learn API and distributed under the 3-Clause BSD license.

Please refer to the documentation and the paper for the detailed API, examples, and references.

Installation

Using pip:

pip install scikit-dimension

From source:

git clone https://github.com/j-bac/scikit-dimension
cd scikit-dimension
pip install .

Quick start

Local and global estimators can be used as follows:

import skdim
import numpy as np

# generate data: np.array (n_points x n_dim). Here a uniformly sampled 5-ball embedded in 10 dimensions
data = np.zeros((1000, 10))
data[:, :5] = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

# estimate global intrinsic dimension
danco = skdim.id.DANCo().fit(data)
# estimate local intrinsic dimension (dimension in k-nearest-neighborhoods around each point)
lpca = skdim.id.lPCA().fit_pw(data,
                              n_neighbors=100,
                              n_jobs=1)

# get estimated intrinsic dimension
print(danco.dimension_, np.mean(lpca.dimension_pw_))

scikit-dimension's People

Contributors

auranic, j-bac, jessecresswell

scikit-dimension's Issues

DANCo Fast

Hi,

I noticed that in the DANCo.py file some functions which seem to implement a version of FastDANCo have been commented out. I was wondering whether it is safe to uncomment and use these in my code, or whether there is a reason they are commented out?

Thanks.

P.S. Great package.

Quickstart error -- no FisherS example

Minor issue. The last line of the quickstart page throws an error:

print(danco.dimension_, fishers.dimension_, np.mean(lpca.dimension_pw_))

The error occurs because the fishers instance is never created. Adding a line like the following fixes it:

fishers = skdim.id.FisherS().fit(data)
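
For reference, a minimal sketch of the corrected quickstart tail with the suggested line added:

import numpy as np
import skdim

data = np.zeros((1000, 10))
data[:, :5] = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

danco = skdim.id.DANCo().fit(data)
fishers = skdim.id.FisherS().fit(data)  # the missing line
lpca = skdim.id.lPCA().fit_pw(data, n_neighbors=100, n_jobs=1)

print(danco.dimension_, fishers.dimension_, np.mean(lpca.dimension_pw_))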

FisherS encounters error

Hi,

I've tried the FisherS() estimation using several values of the condition number and several datasets, and all attempts come back with the same error.

See below for two examples using the tutorial dataset. Am I missing something? Do I need to specify alphas? I am running Python 3.7.3 on a MacBook.

Thanks much,
Dan Schnell CCHMC

data = np.zeros((1000,10))
data[:,:5] = skdim.datasets.hyperBall(n = 1000, d = 5, radius = 1, random_state = 0)

pca=skdim.id.lPCA()
gid1=pca.fit(data).dimension_
gid1
5
FS2 = skdim.id.FisherS(conditional_number=10, project_on_sphere=1, alphas=None, produce_plots=False, verbose=0, limit_maxdim=False).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 123, in fit
    ) = self._SeparabilityAnalysis(X)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 617, in _SeparabilityAnalysis
    separable_fraction, p_alpha = self._checkSeparabilityMultipleAlpha(Xp)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 259, in _checkSeparabilityMultipleAlpha
    counts[k:e, :] = counts[k:e, :] + self._histc(xy.T, alphas)
TypeError: expected dtype object, got 'numpy.dtype[float64]'
FS1 = skdim.id.FisherS().fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 123, in fit
    ) = self._SeparabilityAnalysis(X)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 617, in _SeparabilityAnalysis
    separable_fraction, p_alpha = self._checkSeparabilityMultipleAlpha(Xp)
  File "//anaconda3/lib/python3.7/site-packages/skdim/id/_FisherS.py", line 259, in _checkSeparabilityMultipleAlpha
    counts[k:e, :] = counts[k:e, :] + self._histc(xy.T, alphas)
TypeError: expected dtype object, got 'numpy.dtype[float64]'

Too many threads spawned in DANCo().fit()

Hello, I am an HPC technical consultant and one of our users is using your package. The function skdim.id.DANCo().fit(data) seems to spawn a significant number of threads, with no apparent way to control the number. In an HPC environment this is an issue. Is this intended?
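
A possible workaround, assuming the extra threads come from the underlying BLAS/OpenMP thread pools rather than from skdim itself, is to cap them with threadpoolctl for the duration of the fit:

import skdim
from threadpoolctl import threadpool_limits

data = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

# limit the native (BLAS/OpenMP) thread pools while DANCo runs
with threadpool_limits(limits=1):
    danco = skdim.id.DANCo().fit(data)
print(danco.dimension_)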

Question: Why is d = k + 1 for Kaiser and Broken Stick?

I noticed that the ID estimates provided by the Kaiser and broken_stick methods in id.lPCA are k + 1, where k is the number of components to be kept according to the most commonly used implementations of these rules (i.e. keep only components with an eigenvalue > 1 [Kaiser], or keep only components with greater than expected explained variance [broken stick]).

I'm wondering what the thinking was behind this choice, and if there are any papers I can cite justifying this modification.
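
For context, a small sketch of the conventional Kaiser count on hypothetical data, i.e. the k that the methods above reportedly return as k + 1:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, :5] += X[:, 5:]                 # introduce some correlated directions

# conventional Kaiser rule: keep components whose correlation-matrix eigenvalue exceeds 1
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
k = int(np.sum(eigvals > 1))
print(k)                             # per this issue, id.lPCA would report k + 1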

Thanks!

FisherS returns `nan` when the ID is large

import numpy as np
import skdim

def test_ID_estimate(D, name='TwoNN', *args, **kwargs):
  ids = []
  Ns = [64, 128, 256, 512, 1024, 2048]
  for N in Ns:
    data = np.zeros((N, 3*32*32))
    data[:, :D] = skdim.datasets.hyperBall(n=N, d=D, radius=2, random_state=666)
    _id = eval(f"skdim.id.{name}")(*args, **kwargs).fit_transform(X=data)
    ids.append(_id)
    print(f'{name}', N, _id)
  return ids

Results

test_ID_estimate(10, name='FisherS')
>>>
FisherS 64 11.258882790713601
FisherS 128 12.293955630520735
FisherS 256 10.300845319222297
FisherS 512 10.0622264687654
FisherS 1024 10.117248646587507
FisherS 2048 10.083245836860842
test_ID_estimate(20, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 25.11715241495022
FisherS 256 19.984501791074795
FisherS 512 20.91391435506641
FisherS 1024 19.95265685270067
FisherS 2048 19.77165730041636
test_ID_estimate(50, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 nan
FisherS 256 nan
FisherS 512 nan
FisherS 1024 nan
FisherS 2048 49.44956027057592
test_ID_estimate(100, name='FisherS')
>>>
FisherS 64 nan
FisherS 128 nan
FisherS 256 nan
FisherS 512 nan
FisherS 1024 nan
FisherS 2048 nan

Incorrect implementation of Participation Ratio

I just noticed what I think might be a problem with your implementation of Participation Ratio!

Suppose one is given a matrix X with shape (num samples, num features). The PR is given by square(sum(eigenvalues)) / sum(square(eigenvalues)) of X^T X, but the implementation uses PCA.

pca = PCA().fit(X)

PCA first demeans the data, so its eigenvalues are those of the covariance matrix rather than of X^T X. I don't think that's correct.
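
A small sketch on hypothetical data contrasting the two computations; they agree only when the data already has zero mean:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) + 3.0          # a nonzero mean makes the difference visible

# participation ratio from the eigenvalues of the uncentered Gram matrix X^T X
eig_gram = np.linalg.eigvalsh(X.T @ X)
pr_gram = eig_gram.sum() ** 2 / np.square(eig_gram).sum()

# participation ratio from PCA eigenvalues (PCA demeans the data first)
eig_pca = PCA().fit(X).explained_variance_
pr_pca = eig_pca.sum() ** 2 / np.square(eig_pca).sum()

print(pr_gram, pr_pca)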

`asPointwise` with `n_jobs>1` does not progress

(This was tested after fixing another issue locally #17)

def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):

asPointwise works as intended when n_jobs=1. However, in my experience when n_jobs>1 the async call does not complete.

I can recommend using the joblib library instead of multiprocessing, for example:

import numpy as np
from joblib import delayed, Parallel

# `knn` is the array of k-nearest-neighbor indices computed earlier inside asPointwise
if n_jobs > 1:
    with Parallel(n_jobs=n_jobs) as parallel:
        def fit_estimator(class_instance, data, idx):
            return class_instance.fit(data[idx, :]).dimension_
        results = parallel(
            delayed(fit_estimator)(
                class_instance, data, idx
            ) for idx in knn
        )
    return np.array(results)
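
For reference, the call pattern that reproduces the problem for me, assuming asPointwise is importable from the package top level as its signature above suggests:

import skdim

data = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)

ok = skdim.asPointwise(data, skdim.id.lPCA(), n_neighbors=100, n_jobs=1)    # completes
# skdim.asPointwise(data, skdim.id.lPCA(), n_neighbors=100, n_jobs=4)       # does not complete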

dimension drops to 0 when number of samples is high

Hello and thanks for the package !

I am currently running a few tests, and when I run an estimator on large datasets the returned dimension is 0. When I lower the number of samples, it returns a higher dimension. Any idea why this behavior happens?

Best,

Etienne

Computation of estimator values pointwise on kNN sets is not parallelized

I will use ESS as an example, since it is a pretty slow estimator. Since it is a LocalEstimator, when fit is called the kNNs are first computed (if not already provided):

dists, knnidx = get_nn(X, k=self.n_neighbors, n_jobs=n_jobs)

and this in turn calls into the scikit-learn library, which properly parallelizes the computation based on the n_jobs parameter we can provide.

Second, the call to self.fit in the ESS class performs a simple, single-threaded for loop over the data points:

for i in range(len(X)):
    self.dimension_pw_[i], self.essval_[i] = self._essLocalDimEst(
        X[knnidx[i, :]]
    )

The computations are "embarrassingly parallel". I have locally implemented parallelization and seen at least a 6x speedup in ESS computation on datasets of size 5,000 - 50,000.

I can open a PR with these changes as an example if you are willing to add joblib as a dependency.
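
For illustration, a self-contained sketch of the joblib pattern described above, using lPCA as a faster stand-in for ESS; the scikit-learn neighbor search stands in for the get_nn step:

import numpy as np
import skdim
from joblib import Parallel, delayed
from sklearn.neighbors import NearestNeighbors

X = skdim.datasets.hyperBall(n=1000, d=5, radius=1, random_state=0)
knnidx = NearestNeighbors(n_neighbors=100).fit(X).kneighbors(return_distance=False)

def fit_one(estimator, X, idx):
    # fit the estimator on a single k-nearest-neighborhood and return its estimate
    return estimator.fit(X[idx, :]).dimension_

# embarrassingly parallel loop over the pointwise neighborhoods
dimension_pw = np.array(
    Parallel(n_jobs=-1)(
        delayed(fit_one)(skdim.id.lPCA(), X, knnidx[i, :]) for i in range(len(X))
    )
)
print(dimension_pw.mean())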

Inconsistent results when using multiprocessing in fit_pw method

Dear authors,

thank you for your work on this project and for making the code available to the community!
I have encountered an issue with the fit_pw method in the lPCA class when using multiprocessing. The issue is that the output dimensions are not the same when using different numbers of jobs (n_jobs), while they should be consistent. Below is a code sample to reproduce the problem:

import numpy as np
import skdim

# set the random seed
np.random.seed(42)

lpca = skdim.id.lPCA(ver="FO", alphaFO=0.05, verbose=True)

# create a random dataset
X_np = np.random.randn(100, 10)

print("X_np.shape: ", X_np.shape)
print(X_np)

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=1, smooth=False)
lpca_dimensions_pw_n_jobs_1 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=2, smooth=False)
lpca_dimensions_pw_n_jobs_2 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

# Check that the results are the same
assert np.allclose(lpca_dimensions_pw_n_jobs_1, lpca_dimensions_pw_n_jobs_2)

I have identified that the problem is related to the management of class instances and their state when using multiprocessing. In the current implementation, the worker processes do not share memory with the main process, so modifications to the instances within the worker processes are not reflected in the main process. The dimension is being calculated and stored in the _dimension attribute of the instances within the worker processes, but this information is not being propagated back to the main process.

A possible solution is to change the fit_pw function to return the computed dimension values instead of storing them in the _dimension attribute of the instances. Then, use the apply_async function to asynchronously apply the fit function to each data point and collect the results in the main process. Here's an example of a modified fit_pw function that addresses this issue:

def fit_pw(self, X, precomputed_knn=None, smooth=False, n_neighbors=100, n_jobs=1):
    # ...
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(self.fit, (X[i, :],)) for i in knnidx]
            # Retrieve the computed dimensions
            self.dimension_pw_ = np.array([r.get().dimension_ for r in results])
    # ...

With this modification, the computed dimensions are correctly returned and stored in the main process, and the dimensions are consistent when running the code with different numbers of jobs.
Would it be possible to consider this change or a similar approach to address the issue with multiprocessing in the fit_pw method? This problem likely also affects the other multiprocessing dimension estimates, and not just the lPCA class.

Incorrect dimension of `hyperSphere`

The parameter d in skdim.datasets.hyperSphere is documented to be the "Dimension of the hypersphere". However, it is actually one larger than the intrinsic dimension of the hypersphere: the parameter d is the dimension of the ambient space in which the hypersphere is embedded (the extrinsic dimension).

In [1]: from skdim.datasets import hyperSphere

In [2]: X = hyperSphere(n=100,d=2)

In [3]: X.shape
Out[3]: (100, 2)

The 2-sphere (the usual sphere) can't be embedded in R^2. The dataset X above is a sampling of the circle (1-sphere), not a 2-sphere.
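
A quick check of the behavior described above; the shapes in the comments reflect the current semantics of d:

import skdim

circle = skdim.datasets.hyperSphere(n=100, d=2)   # points on the circle (1-sphere) in R^2
sphere = skdim.datasets.hyperSphere(n=100, d=3)   # points on the ordinary 2-sphere in R^3
print(circle.shape, sphere.shape)                 # (100, 2) (100, 3): d is the ambient dimension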

setup.py should contain numba in INSTALL_REQUIRES

Hi,
When numba is not installed, FisherS fails with the error

_FisherS.py, line 33, in <module>
    import numba as nb
ModuleNotFoundError: No module named 'numba'

Line 33 indeed imports numba, which is missing from INSTALL_REQUIRES.

I would suggest changing the INSTALL_REQUIRES to:

INSTALL_REQUIRES = ["numpy", "scipy", "scikit-learn", "numba"]

to prevent this.
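
Until the dependency list is updated, installing numba manually works around the error:

pip install numba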

Very good package!
