The documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but the point itself is not actually counted (scikit-learn issue, closed)

AnPananas commented on May 18, 2024

The documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but the point itself is not actually counted.

Comments (4)

lesteve commented on May 18, 2024

As the doc says:

If metric is "precomputed", X is assumed to be a distance matrix and must be square.

A distance matrix means the matrix has 0 on the diagonal. Your matrix is a similarity matrix; I am guessing this is why you find points with no neighbors.

I am going to close the issue, since at this point I feel this is more likely to be a scikit-learn usage question rather than a bug in scikit-learn.
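
To illustrate the point about the diagonal, here is a minimal sketch using a made-up 3-point distance matrix (the values and the names `toy_distances`, `toy_neighborhoods` and `toy_counts` are hypothetical, not taken from the issue). With zeros on the diagonal, each point appears in its own radius neighborhood, so the count that is compared against min_samples should include the point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical pairwise distances for three points on a line at 0, 1 and 10.
# The diagonal is 0 because every point is at distance 0 from itself.
toy_distances = np.array([
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 9.0],
    [10.0, 9.0, 0.0],
])

nn = NearestNeighbors(radius=2.0, metric="precomputed")
nn.fit(toy_distances)

# Each row of the precomputed matrix is treated as one query point.
toy_neighborhoods = nn.radius_neighbors(toy_distances, return_distance=False)
toy_counts = np.array([len(n) for n in toy_neighborhoods])
print(toy_counts)  # [2 2 1]
```

Points 0 and 1 each get a count of 2 (themselves plus one neighbor), while the isolated point 2 still gets a count of 1 because it finds itself.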

lesteve commented on May 18, 2024

Thanks for opening an issue! I think the documentation is right but I have to admit, I am certainly not a DBSCAN expert.

I took min_samples=2 and the points with one neighbor did not become the core

Have you tried playing with eps? If you can provide a snippet of code showing this issue, this would be great so that a maintainer can have a closer look.

AnPananas commented on May 18, 2024

@lesteve
I use a matrix of distances as the input data:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

DBSCAN_model = DBSCAN(eps=100, min_samples=2, metric='precomputed', algorithm='brute')

similarity_matrix = [
    [1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
    [2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
    [572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
    [2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
    [2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
    [2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
    [244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
    [288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
    [339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10]
]

# Create DataFrame
df_similarity = pd.DataFrame(similarity_matrix, columns=range(1, 10), index=range(1, 10))

df_similarity_numpy = df_similarity.to_numpy()

neighbors_model = NearestNeighbors(
    radius=DBSCAN_model.eps,
    algorithm=DBSCAN_model.algorithm,
    leaf_size=DBSCAN_model.leaf_size,
    metric=DBSCAN_model.metric,
    metric_params=DBSCAN_model.metric_params,
    p=DBSCAN_model.p,
    n_jobs=DBSCAN_model.n_jobs,
)

neighbors_model.fit(df_similarity_numpy)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(df_similarity_numpy, return_distance=False)

This code returns the neighborhoods as an array. In the output we see:

array([array([], dtype=int64), array([3]), array([], dtype=int64),
       array([1]), array([], dtype=int64), array([], dtype=int64),
       array([7, 8]), array([6]), array([6])], dtype=object)

Only one point has more than 1 neighbor (array([7, 8])). I remind you that we specified 2 neighbors "including the point in question" for a point to become a core point.

Then we run this:

n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
core_samples = np.asarray(n_neighbors >= 2, dtype=np.uint8)
core_samples

The output is:
array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

Only the one point that has 2 neighbors other than itself becomes a core point.
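
For comparison, here is a sketch of the same experiment with the diagonal set to 0, as a distance matrix would have. This is an illustration under that assumption, not code from the thread; the names `distances` and `model` and the `np.fill_diagonal` step are introduced here. With a zero diagonal, every point finds itself within eps, so with min_samples=2 a point with a single other neighbor should also become a core point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The matrix from the comment above, copied with its 1e10 diagonal.
distances = np.array([
    [1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
    [2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
    [572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
    [2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
    [2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
    [2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
    [244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
    [288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
    [339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10],
])

# Assumption: treat the data as true distances, so each point is at
# distance 0 from itself instead of 1e10.
np.fill_diagonal(distances, 0.0)

model = DBSCAN(eps=100, min_samples=2, metric="precomputed").fit(distances)
print(model.core_sample_indices_)  # [1 3 6 7 8]
print(model.labels_)               # [-1  0 -1  0 -1 -1  1  1  1]
```

Points 1 and 3 (70.149573 apart) now form one cluster, and points 6, 7 and 8 form another, while the remaining points stay labeled as noise (-1).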

AnPananas commented on May 18, 2024

@lesteve Thank you very much for the prompt response, now I understand)
