GithubHelp home page GithubHelp logo

Comments (12)

JesseCresswell avatar JesseCresswell commented on June 1, 2024 1

I see, that makes much more sense and is simpler. I understand why Local and Global are used now. What threw me off was first seeing lPCA as a "global" estimator, since I think of using it in the local way, but it happens to be the exception to your naming scheme.

Here's a suggestion then - in the README you use DANCo as an example of a global estimator (fine), and lPCA as an example of a local estimator. Then when I look up their implementation I see they both inherit from GlobalEstimator. That's unexpected! Why not replace lPCA with ESS or another LocalEstimator in the first example that most users will see?

from scikit-dimension.

JesseCresswell avatar JesseCresswell commented on June 1, 2024 1

Can you give me access rights to make branches and push? I will make a PR switching this function joblib, and then another for adding parallelism internally for estimators like ESS.

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

Thanks, really appreciate you bringing up these issues and possible fix.

Would you have an example to reproduce this ? For example this completes with no problem for me:

import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.lPCA().fit_pw(X,n_jobs=4).dimension_pw_

edit: I managed to reproduce the error using asPointwise directly rather than fit_pw. I don't know why the latter works while the former gets stuck like this. If I don't find a solution with multiprocessing I'll try to implement your suggestion with joblib

from scikit-dimension.

JesseCresswell avatar JesseCresswell commented on June 1, 2024

Oh I was about to explain more but I see your edit. I was specifically using the ESS estimator which does not have a built-in fit_pw function, and asPointwise appeared to be the intended method for pointwise estimation of local intrinsic dimension.

import numpy as np
import skdim
import multiprocessing as mp

# Re-defining this function to fix bug in Issue 17
def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):
    """Use a global estimator as a pointwise one by creating kNN neighborhoods"""
    if precomputed_knn is not None:
        knn = precomputed_knn
    else:
        _, knn = skdim.get_nn(data, k=n_neighbors, n_jobs=n_jobs)

    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(class_instance.fit, (data[i, :],)) for i in knn]
            # Retrieve the computed dimensions
        return np.array([r.get().dimension_ for r in results])
    else:
        return np.array([class_instance.fit(data[i, :]).dimension_ for i in knn])


X = np.random.random((100, 10))

# estimator = skdim.id.lPCA() # GlobalEstimator
estimator = skdim.id.ESS() # LocalEstimator

# n_jobs = 1 # single process
n_jobs = 2 # multi process

if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid = asPointwise(X, estimator, n_neighbors=10, n_jobs=n_jobs)
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_

print(lid)

This code completes quickly for LPCA with n_jobs=1 or 2, and for [ESS, MADA, MLE, MOM, TLE] with n_jobs=1.
It completely hangs for [ESS, MADA, MLE, MOM, TLE] with n_jobs=2. These are the GlobalEstimator classes.

from scikit-dimension.

JesseCresswell avatar JesseCresswell commented on June 1, 2024

Thanks for looking into this, and for the great library!

I had a question about terminology. Estimators like LPCA inherit from GlobalEstimator, and have the fit_pw function for pointwise (local) dimension estimation. Estimators like ESS inherit from LocalEstimator, and do not have the fit_pw function implemented, only fit which is for the whole dataset.

Based on these descriptions, it seems like the class names of GlobalEstimator and LocalEstimator are reversed. Local estimators should be the ones that operate pointwise, while global estimators only act on the whole dataset by default. Did you have a different interpretation?

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

You can use ESS pointwise like so:

import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.ESS().fit(X,n_jobs=4).dimension_pw_

So you can do this in your code:

if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid =  estimator.fit(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_

print(lid)

LocalEstimator class is used for estimators that already require computation of ID estimates for each point neighborhood to provide a global ID estimate. So in principle there is no need for a .fit_pw method and .dimension_pw_ is already returned

For lPCA I made an exception - most people are used to running PCA on the entire dataset to estimate ID, so this is the default when using .fit() to prevent confusion. Accordingly it inherits from GlobalEstimator whose .fit method runs an estimator on the entire dataset.

However paper references associated with this class rather use PCA locally and aggregate estimates to obtain global ID. Indeed ID estimation using PCA on an entire dataset will fail for non-linear data. So arguably, PCA is more of a local ID estimator that tries to obtain ID of the manifold tangent space. Hence the name lPCA (local PCA) for the class to point this out while keeping default use global.

Overall I am trying to keep users from inadequately using estimators. .fit() always expects to receive an entire (possibly non-linear) dataset as input whether the estimator inherits from LocalEstimator or GlobalEstimator. This is not the case in original R ESS implementation where ESS expects a single neighborhood as input. In Python this would be equivalent to calling ESS().fit_once

I agree the terminology is confusing so suggestions are very welcome.

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

I'll put this one my to do list: the lPCA exception should probably be removed and the docs provide a clear example with a lPCA.fit_once method to apply PCA "as usual" a single time on the whole dataset

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

Maybe I need to request for you to be added to the scikit-learn-contrib org. I don't see an option to give you access rights myself

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

Actually you should be able to PR even without being member

from scikit-dimension.

JesseCresswell avatar JesseCresswell commented on June 1, 2024

No, I have no option to make a PR or even a new branch on this repo. I tried pushing a branch directly but I get a permissions error.

from scikit-dimension.

j-bac avatar j-bac commented on June 1, 2024

Could you try again following the steps at https://github.com/firstcontributions/first-contributions ?

from scikit-dimension.

JesseCresswell avatar JesseCresswell commented on June 1, 2024

Closed by PR #23

from scikit-dimension.

Related Issues (18)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.