Comments (12)
I see, that makes much more sense and is simpler. I understand why Local and Global are used now. What threw me off was first seeing lPCA as a "global" estimator, since I think of using it in the local way, but it happens to be the exception to your naming scheme.
Here's a suggestion, then: in the README you use DANCo as an example of a global estimator (fine) and lPCA as an example of a local estimator. But when I look up their implementations, I see they both inherit from GlobalEstimator. That's unexpected! Why not replace lPCA with ESS or another LocalEstimator in the first example that most users will see?
from scikit-dimension.
Can you give me access rights to make branches and push? I will make a PR switching this function to joblib, and then another adding parallelism internally for estimators like ESS.
from scikit-dimension.
Thanks, I really appreciate you bringing up these issues and a possible fix.
Would you have an example to reproduce this? For example, this completes with no problem for me:
import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.lPCA().fit_pw(X, n_jobs=4).dimension_pw_
Edit: I managed to reproduce the error using asPointwise directly rather than fit_pw. I don't know why the latter works while the former gets stuck like this. If I don't find a solution with multiprocessing, I'll try to implement your suggestion with joblib.
from scikit-dimension.
Oh, I was about to explain more, but I see your edit. I was specifically using the ESS estimator, which does not have a built-in fit_pw function, and asPointwise appeared to be the intended method for pointwise estimation of local intrinsic dimension.
import numpy as np
import skdim
import multiprocessing as mp

# Re-defining this function to fix bug in Issue 17
def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):
    """Use a global estimator as a pointwise one by creating kNN neighborhoods"""
    if precomputed_knn is not None:
        knn = precomputed_knn
    else:
        _, knn = skdim.get_nn(data, k=n_neighbors, n_jobs=n_jobs)
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(class_instance.fit, (data[i, :],)) for i in knn]
            # Retrieve the computed dimensions
            return np.array([r.get().dimension_ for r in results])
    else:
        return np.array([class_instance.fit(data[i, :]).dimension_ for i in knn])

X = np.random.random((100, 10))
# estimator = skdim.id.lPCA()  # GlobalEstimator
estimator = skdim.id.ESS()  # LocalEstimator
# n_jobs = 1  # single process
n_jobs = 2  # multi process
if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid = asPointwise(X, estimator, n_neighbors=10, n_jobs=n_jobs)
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
print(lid)
This code completes quickly for lPCA with n_jobs=1 or 2, and for [ESS, MADA, MLE, MOM, TLE] with n_jobs=1.
It hangs completely for [ESS, MADA, MLE, MOM, TLE] with n_jobs=2. These are the LocalEstimator classes.
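For reference, the joblib route mentioned in this thread could look roughly like the sketch below. It is not the library's code: `MeanNormEstimator`, `knn_indices`, and `as_pointwise_joblib` are hypothetical stand-ins (for an estimator exposing `fit(X).dimension_`, for `skdim.get_nn`, and for `asPointwise`), so it runs without skdim. Whether this avoids the hang described above is not confirmed here.

```python
import copy

import numpy as np
from joblib import Parallel, delayed


class MeanNormEstimator:
    """Hypothetical stand-in for an estimator exposing fit(X).dimension_."""

    def fit(self, X):
        self.dimension_ = float(np.linalg.norm(X, axis=1).mean())
        return self


def knn_indices(data, k):
    """Brute-force kNN indices (self excluded); stands in for skdim.get_nn."""
    d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, 1:k + 1]


def _fit_one(estimator, neighborhood):
    # Fit a private copy so tasks never share mutable estimator state,
    # whichever joblib backend ends up running them.
    return copy.deepcopy(estimator).fit(neighborhood).dimension_


def as_pointwise_joblib(data, estimator, n_neighbors=10, n_jobs=1):
    """Fit the estimator on each point's kNN neighborhood, one joblib task per point."""
    knn = knn_indices(data, n_neighbors)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_fit_one)(estimator, data[idx]) for idx in knn
    )
    return np.array(results)


if __name__ == "__main__":
    X = np.random.random((100, 10))
    print(as_pointwise_joblib(X, MeanNormEstimator(), n_neighbors=10, n_jobs=2))
```

Each task returns only the scalar estimate rather than the fitted estimator, which keeps the data sent back from the workers small.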
from scikit-dimension.
Thanks for looking into this, and for the great library!
I had a question about terminology. Estimators like lPCA inherit from GlobalEstimator and have the fit_pw function for pointwise (local) dimension estimation. Estimators like ESS inherit from LocalEstimator and do not have the fit_pw function implemented, only fit, which is for the whole dataset.
Based on these descriptions, it seems like the class names GlobalEstimator and LocalEstimator are reversed. Local estimators should be the ones that operate pointwise, while global estimators only act on the whole dataset by default. Did you have a different interpretation?
from scikit-dimension.
You can use ESS pointwise like so:
import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.ESS().fit(X, n_jobs=4).dimension_pw_
So you can do this in your code:
if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid = estimator.fit(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
print(lid)
The LocalEstimator class is used for estimators that already need to compute an ID estimate for each point's neighborhood in order to provide a global ID estimate. So in principle there is no need for a .fit_pw method, and .dimension_pw_ is already returned.
For lPCA I made an exception: most people are used to running PCA on the entire dataset to estimate ID, so this is the default when using .fit(), to prevent confusion. Accordingly, it inherits from GlobalEstimator, whose .fit method runs an estimator on the entire dataset.
However, the paper references associated with this class instead use PCA locally and aggregate the estimates to obtain a global ID. Indeed, ID estimation using PCA on an entire dataset will fail for non-linear data. So arguably PCA is more of a local ID estimator that tries to obtain the ID of the manifold's tangent space. Hence the name lPCA (local PCA) for the class, to point this out while keeping the default use global.
Overall, I am trying to keep users from misusing estimators. .fit() always expects to receive an entire (possibly non-linear) dataset as input, whether the estimator inherits from LocalEstimator or GlobalEstimator. This is not the case in the original R ESS implementation, where ESS expects a single neighborhood as input. In Python this would be equivalent to calling ESS().fit_once.
I agree the terminology is confusing, so suggestions are very welcome.
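To make the point about global PCA on non-linear data concrete, here is a minimal sketch in plain NumPy. It is not skdim's lPCA: `pca_dimension` and its 95% explained-variance threshold are hypothetical choices made only for illustration. The data is a circle embedded in 3D, so the intrinsic dimension is 1.

```python
import numpy as np


def pca_dimension(X, var_threshold=0.95):
    """Count principal components needed to explain var_threshold of the variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(explained, var_threshold) + 1)


# A circle embedded in 3D: a 1-dimensional, non-linear manifold.
t = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])

# "Global" use: a single PCA on the whole dataset.
global_dim = pca_dimension(X)

# "Local" use: PCA on each point's nearest neighbors (brute force, self included),
# then aggregate the pointwise estimates.
n_neighbors = 10
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
knn = np.argsort(d2, axis=1)[:, : n_neighbors + 1]
local_dims = np.array([pca_dimension(X[idx]) for idx in knn])

print("global:", global_dim, "local (median):", np.median(local_dims))
```

On this data the single global PCA reports 2, because the circle spans a plane, while every neighborhood reports 1, since a short arc is nearly a straight segment along the tangent.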
from scikit-dimension.
I'll put this on my to-do list: the lPCA exception should probably be removed, and the docs should provide a clear example with an lPCA.fit_once method to apply PCA "as usual" a single time on the whole dataset.
from scikit-dimension.
Maybe I need to request that you be added to the scikit-learn-contrib org. I don't see an option to give you access rights myself.
from scikit-dimension.
Actually, you should be able to open a PR even without being a member.
from scikit-dimension.
No, I have no option to make a PR or even a new branch on this repo. I tried pushing a branch directly but I get a permissions error.
from scikit-dimension.
Could you try again following the steps at https://github.com/firstcontributions/first-contributions ?
from scikit-dimension.
Closed by PR #23
from scikit-dimension.