
The documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but the point itself is not actually included

AnPananas commented on May 18, 2024
The documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but in practice the point itself is not included.

from scikit-learn.

Comments (4)

lesteve commented on May 18, 2024

As the doc says:

If metric is "precomputed", X is assumed to be a distance matrix and must be square.

A distance matrix has 0 on the diagonal. Your matrix is a similarity matrix; I am guessing this is why you find points with no neighbors.

I am going to close the issue, since at this point I feel this is more likely to be a scikit-learn usage question rather than a bug in scikit-learn.
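To illustrate the point about the diagonal, here is a minimal sketch. The 4-point matrix `D` is hypothetical (it is not the data from this issue), and the variable names are made up for the example. It fits DBSCAN with metric="precomputed" once on a matrix with a zero diagonal and once on the same matrix with 1e10 on the diagonal, so that the two label arrays can be compared.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical distances: points 0 and 1 are close, points 2 and 3 are isolated.
D = np.array([
    [0.0, 50.0, 900.0, 950.0],
    [50.0, 0.0, 920.0, 940.0],
    [900.0, 920.0, 0.0, 800.0],
    [950.0, 940.0, 800.0, 0.0],
])

# Zero diagonal: each point lies within eps of itself, so it counts toward min_samples.
labels_ok = DBSCAN(eps=100, min_samples=2, metric="precomputed").fit(D).labels_

# Large diagonal: no point is within eps of itself, so the point itself is not counted.
D_bad = D.copy()
np.fill_diagonal(D_bad, 1e10)
labels_bad = DBSCAN(eps=100, min_samples=2, metric="precomputed").fit(D_bad).labels_

print(labels_ok)   # [ 0  0 -1 -1]
print(labels_bad)  # [-1 -1 -1 -1]
```

With the zero diagonal, points 0 and 1 each have two samples in their neighborhood (themselves plus the other point), so both become core points and form a cluster. With 1e10 on the diagonal, each has only one, and everything is labeled as noise.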


lesteve commented on May 18, 2024

Thanks for opening an issue! I think the documentation is right but I have to admit, I am certainly not a DBSCAN expert.

I took min_samples=2 and the points with one neighbor did not become the core

Have you tried playing with eps? If you can provide a snippet of code showing this issue, this would be great so that a maintainer can have a closer look.


AnPananas commented on May 18, 2024

@lesteve
I use a distance matrix as input data:

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

DBSCAN_model = DBSCAN(eps=100, min_samples=2, metric='precomputed', algorithm='brute')

similarity_matrix = [
    [1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
    [2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
    [572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
    [2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
    [2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
    [2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
    [244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
    [288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
    [339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10]
]

# Create DataFrame
df_similarity = pd.DataFrame(similarity_matrix, columns=range(1, 10), index=range(1, 10))

df_similarity_numpy = df_similarity.to_numpy()

neighbors_model = NearestNeighbors(
    radius=DBSCAN_model.eps,
    algorithm=DBSCAN_model.algorithm,
    leaf_size=DBSCAN_model.leaf_size,
    metric=DBSCAN_model.metric,
    metric_params=DBSCAN_model.metric_params,
    p=DBSCAN_model.p,
    n_jobs=DBSCAN_model.n_jobs,
)

neighbors_model.fit(df_similarity_numpy)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(df_similarity_numpy, return_distance=False)

This code returns the neighborhoods as an array. In the output we see:

array([array([], dtype=int64), array([3]), array([], dtype=int64),
       array([1]), array([], dtype=int64), array([], dtype=int64),
       array([7, 8]), array([6]), array([6])], dtype=object)

Only one point has more than 1 neighbor (array([7, 8])). As a reminder, we specified 2 neighbors "including the point in question" for a point to become a core point.

Then we run this:

n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
core_samples = np.asarray(n_neighbors >= 2, dtype=np.uint8)
core_samples

Output:
array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

Only the one point that has 2 neighbors other than itself becomes a core point.
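For comparison, the same neighbor count can be sketched on a small matrix with a zero diagonal. The 3-point matrix `D` below is hypothetical (not the data above) and only shows how radius_neighbors counts the query point itself when its self-distance is 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical distance matrix with zeros on the diagonal.
D = np.array([
    [0.0, 40.0, 500.0],
    [40.0, 0.0, 450.0],
    [500.0, 450.0, 0.0],
])

nn = NearestNeighbors(radius=100, metric="precomputed").fit(D)

# Passing D explicitly as the query keeps each point in its own neighborhood,
# because its self-distance (0) is within the radius.
neighborhoods = nn.radius_neighbors(D, return_distance=False)
n_neighbors = np.array([len(n) for n in neighborhoods])

core_samples = n_neighbors >= 2
print(n_neighbors)   # [2 2 1]
print(core_samples)  # [ True  True False]
```

Here points 0 and 1 reach a count of 2 with only one other point nearby, which matches the "including the point itself" wording for min_samples.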


AnPananas commented on May 18, 2024

@lesteve Thank you very much for the prompt response, now I understand)

