hdbscansharp's People

Contributors

doxakis, pedropaulovc

hdbscansharp's Issues

Provide PDB-Files in nuget package

Would you mind providing the PDB files in the NuGet package?

Every time I need to debug something, I clone your project and build it on the command line so that I can use the resulting DLL and PDB.
Since this library is open source, it would be nice to be able to debug it easily.

I'm not a .NET expert, and maybe there are easier ways to debug.
If there is a reason not to ship PDB files in NuGet packages, I would like to learn it.

Thank you for your work :)

OutlierScore calculation problem

When getting the outlier scores for a clustering, I currently get values between -4 and slightly above one (e.g. 1.00000053). But it is my understanding (and the way the Python library behaves) that the outlier score should be between 0.0 and 1.0.

The values above 1.0 might be due to rounding errors, but the negative values seem wrong.

Looking into the implementation, the file HdbscanSharp/Hdbscanstar/HdbscanAlgorithm.cs has score = 1 - (epsilon_max / epsilon); on line 625, whereas the Python implementation in the file /hdbscan/_hdbscan_tree.pyx has result[point] = (lambda_max - lambda_array[n]) / lambda_max on line 577, which is equivalent to result[point] = 1 - (lambda_array[n] / lambda_max) (with epsilon_max and lambda_max both being the lowest death level). So it looks to me as if the C# implementation swapped the numerator and the denominator, leading to wrong results.

Or are the scores in this implementation meant to be different from the python one?
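
To make the comparison concrete, here is a small Python sketch that evaluates both expressions on the same point. It assumes that lambda is the reciprocal of the distance level epsilon (lambda = 1 / epsilon), as in the Python library's condensed tree. The function names are made up for illustration and are not part of either library. Under that assumption the two expressions agree as long as epsilon_max is not larger than epsilon; a negative score shows up when the ratio ends up inverted.

```python
def score_from_epsilon(epsilon, epsilon_max):
    """Outlier score written with distance levels (the form quoted from the C# file)."""
    return 1.0 - epsilon_max / epsilon


def score_from_lambda(lam, lambda_max):
    """Outlier score written with lambda values (the form quoted from the Python file)."""
    return (lambda_max - lam) / lambda_max


# Hypothetical point: it leaves its cluster at epsilon = 2.0, while the
# cluster (or one of its descendants) survives down to epsilon_max = 0.5.
epsilon, epsilon_max = 2.0, 0.5

# Assumption: lambda is the reciprocal of the distance level.
lam, lambda_max = 1.0 / epsilon, 1.0 / epsilon_max

same_epsilon_form = score_from_epsilon(epsilon, epsilon_max)
same_lambda_form = score_from_lambda(lam, lambda_max)

# If the two distance levels are swapped, the ratio exceeds one and the
# score becomes negative, which matches the kind of value reported above.
inverted = score_from_epsilon(epsilon_max, epsilon)
```

If the scores coming out of the C# code are negative, one thing worth checking is therefore whether the value stored as epsilon_max can be larger than the point's own epsilon.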

System.OverflowException: Value was either too large or too small for an Int32

I'm getting the following exception:

System.OverflowException occurred
HResult=-2146233066
Message=Value was either too large or too small for an Int32.
Source=mscorlib
StackTrace:
at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
InnerException:

The exception occurred in this line:

int label = int.Parse(lineContents[i]);

After debugging a bit, I found that the original double values are appended to hierarchyWriter
in this line:

I think this is a locale problem: in Germany, for example, the decimal separator is a comma.
That is, the following code produces the output "48,91321404939,":

var currentEdgeWeight = 48.91321404939;
string output = currentEdgeWeight + "" + delimiter;

A possible fix could be:

NumberFormatInfo nfi = new NumberFormatInfo();
nfi.NumberDecimalSeparator = ".";
int outputLength = 0;
string output = currentEdgeWeight.ToString(nfi) + delimiter;
hierarchyWriter.Append(output);
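
The failure mode can be reproduced outside .NET. The following Python sketch is only a simulation: it assumes a line layout of one edge weight followed by integer labels, all joined by a comma delimiter, and every function name in it is hypothetical. It shows how a comma used as the decimal separator splits the weight into two fields, so that the fractional digits are later parsed as a label and overflow the Int32 range.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parse_int32(text):
    """Mimic int.Parse for a 32-bit integer: reject values outside the Int32 range."""
    value = int(text)
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError("Value was either too large or too small for an Int32")
    return value


def format_weight(weight, decimal_separator="."):
    """Format a float, optionally simulating a locale whose decimal separator is a comma."""
    return repr(weight).replace(".", decimal_separator)


def write_line(weight, labels, delimiter=",", decimal_separator="."):
    """Join the edge weight and the labels into one delimited line (assumed layout)."""
    fields = [format_weight(weight, decimal_separator)]
    fields += [str(label) for label in labels]
    return delimiter.join(fields)


def read_labels(line, delimiter=","):
    """Skip the first field (the weight) and parse the remaining fields as Int32 labels."""
    fields = line.split(delimiter)
    return [parse_int32(field) for field in fields[1:]]
```

Writing with the default separator and reading the line back returns the labels unchanged, whereas writing with decimal_separator="," makes read_labels raise OverflowError on the field "91321404939". Formatting the weight in a culture-independent way, as the proposed fix does, avoids the extra field.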

.NET Standard?

This is an awesome library!

But are there any plans to make it support .NET Standard? Or is that blocked by the dependency on Accord?

Result differs from python variant

Hi,
I have another question regarding your library.
Doing the following in Python:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=100, core_dist_n_jobs=1)
labels = clusterer.fit_predict(data)

should be equivalent to the following call to your library in C#. Is this assumption correct?

var parameters = new HdbscanParameters()
{
  DataSet = dataToCluster.RawData,
  MinClusterSize = 100,
  MinPoints = 100,
  DistanceFunction = new EuclideanDistance(),
};
var hdbscanResult = HdbscanRunner.Run(parameters);

When visualizing the cluster results using t-SNE, the result of the Python library mentioned in your README looks like this:

figure_4-1

Doing the same with the result of the call to your library in the snippet above looks like this:

figure_2-2

I am puzzled by the three (in my opinion) obvious clusters.
Is my call to your library wrong?
I didn't really understand what the parameter "MinPoints" is used for.
When does it make sense to set this parameter to a value different from MinClusterSize?
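
As background for the MinPoints question, here is a minimal Python sketch of how a neighbour-count parameter typically enters HDBSCAN: through the core distance and the mutual reachability distance. It assumes that MinPoints plays the same role as min_samples in the Python library (which defaults to min_cluster_size when it is not set), and it counts the k-th nearest neighbour without the point itself; implementations differ on that detail. The function names are illustrative only.

```python
import math


def core_distances(points, min_points):
    """Distance from each point to its min_points-th nearest neighbour (self excluded)."""
    result = []
    for i, p in enumerate(points):
        neighbour_distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(neighbour_distances[min_points - 1])
    return result


def mutual_reachability(points, min_points):
    """Pairwise max(core(a), core(b), d(a, b)); the diagonal is left at zero."""
    core = core_distances(points, min_points)
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = max(core[i], core[j], math.dist(points[i], points[j]))
    return matrix


# Three nearby points and one far away point on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
core_k1 = core_distances(points, 1)
core_k2 = core_distances(points, 2)
```

A larger neighbour count raises the core distances, which pushes sparse points further apart in the mutual reachability graph and tends to make the clustering more conservative. MinClusterSize, by contrast, only decides how large a group must be to count as a cluster when the hierarchy is condensed.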

Weights for data points

Is it possible to use weights for data points?
I currently add duplicates to create "weights" for my data points. This seems very inefficient. Is there a better way to do it?

How to read data from a database?

Thank you for the implementation.
I have a database (SQL Server, PostgreSQL, etc.) in which one table contains a text field.
I'd like to read the texts from this table rather than from a set of files (as in the sample), and then analyze them and detect clusters.
Can you advise on how to do this? Thanks!
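
One possible shape for this, sketched in Python under several assumptions: sqlite3 stands in for SQL Server or PostgreSQL, the table documents and its column body are hypothetical, and a plain bag-of-words count stands in for whatever text features the sample actually uses. The point is only that the texts can be selected from a table and turned into numeric rows, which is the kind of input a clustering call expects.

```python
import sqlite3


def load_texts(connection):
    """Read the text column from the (hypothetical) documents table, in id order."""
    cursor = connection.execute("SELECT body FROM documents ORDER BY id")
    return [row[0] for row in cursor.fetchall()]


def bag_of_words(texts):
    """Turn each text into a vector of word counts over a shared, sorted vocabulary."""
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    index = {word: position for position, word in enumerate(vocabulary)}
    vectors = []
    for text in texts:
        vector = [0.0] * len(vocabulary)
        for word in text.lower().split():
            vector[index[word]] += 1.0
        vectors.append(vector)
    return vocabulary, vectors


# In-memory database with a few example rows, so the sketch is self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
connection.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("red apple",), ("green apple",), ("fast car",)],
)

texts = load_texts(connection)
vocabulary, vectors = bag_of_words(texts)
connection.close()
```

With a real server, only the connection and the query would change; the resulting rows of numbers are what would then be handed to the clustering step.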

Assembly is not strong-named

Repro steps:

  1. Download the latest version from NuGet
  2. Unpack the .nupkg file
  3. Run sn -vf HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll

Expected:
The sn.exe tool would output

Microsoft (R) .NET Framework Strong Name Utility  Version 4.0.30319.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Assembly 'HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll' is valid

Actual:
The sn.exe tool outputs

Microsoft (R) .NET Framework Strong Name Utility  Version 4.0.30319.0
Copyright (c) Microsoft Corporation.  All rights reserved.

HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll does not represent a strongly named assembly

Observation:
This blocks strong-named .NET assemblies from depending on HdbscanSharp. See Why strong-name your assemblies?

Add more distance functions

It would be great to have more distance functions ready to use.

The Python library implements a lot of metrics:
http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#distance-matrices

The following file exposes its distance functions:
https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/dist_metrics.pyx

Some distance functions:

  • Chebyshev Distance
  • Minkowski Distance
  • W-Minkowski Distance
  • Mahalanobis Distance
  • Hamming Distance
  • Canberra Distance
  • Bray-Curtis Distance
  • Jaccard Distance
  • Matching Distance
  • Dice Distance
  • Kulsinski Distance
  • Rogers-Tanimoto Distance
  • Russell-Rao Distance
  • Sokal-Michener Distance
  • Sokal-Sneath Distance
  • Haversine Distance (2 dimensional)
  • Yule Distance

There are some examples in the folder HdbscanSharp/Distance/ (for example, see ManhattanDistance).
Each new distance function must implement IDistanceCalculator.

If a function has a limitation, it can throw a meaningful exception that explains it.
For example, if the distance works only in 2 dimensions, throw an exception for any other input.

I suggest adding one class per new distance function, with minimal documentation (description, formula, and suggested use case if any; check Wikipedia).
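
As a reference for the formulas, here is a hedged Python sketch of three of the listed functions. It is not the C# code that would go into the library, where each function would be a class implementing IDistanceCalculator, an interface whose signature is not shown here. The sketch also illustrates the suggested behaviour for limitations: the haversine function rejects input that is not 2-dimensional. It assumes coordinates given as (latitude, longitude) in radians and returns the angle on the unit sphere.

```python
import math


def chebyshev_distance(a, b):
    """Largest absolute difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski_distance(a, b, p):
    """Generalised distance: p = 1 is Manhattan, p = 2 is Euclidean."""
    if p < 1:
        raise ValueError("Minkowski distance requires p >= 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def haversine_distance(a, b):
    """Great-circle angle between two (latitude, longitude) points given in radians."""
    if len(a) != 2 or len(b) != 2:
        raise ValueError("Haversine distance is defined only for 2-dimensional points")
    lat1, lon1 = a
    lat2, lon2 = b
    half_chord_squared = (
        math.sin((lat2 - lat1) / 2.0) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2
    )
    return 2.0 * math.asin(math.sqrt(half_chord_squared))
```

For instance, chebyshev_distance([0, 0], [3, 4]) picks the larger of the two coordinate differences, while minkowski_distance with p = 2 on the same points gives the Euclidean length of the 3-4-5 triangle.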
