doxakis / hdbscansharp
HDBSCAN in C#
License: MIT License
Would you mind providing the PDB files in the NuGet package?
Every time I need to debug something, I clone your project and build it on the command line to use the resulting DLL and PDB.
Since this library is open source, it would be nice to be able to debug it easily.
I'm not a .NET expert, and maybe there are easier ways to debug.
If there is a reason not to ship PDB files in NuGet packages, I would like to learn it.
Thank you for your work :)
When getting the outlier scores for a clustering, I'm currently getting values between -4 and slightly above one (e.g. 1.00000053). But it is my understanding (and the way the Python library behaves) that the outlier score should be between 0.0 and 1.0.
The values above 1.0 might be due to rounding errors, but the negative values seem wrong.
Looking into the implementation, in the file HdbscanSharp/Hdbscanstar/HdbscanAlgorithm.cs on line 625, you have score = 1 - (epsilon_max / epsilon); whereas the Python implementation in the file /hdbscan/_hdbscan_tree.pyx on line 577 has result[point] = (lambda_max - lambda_array[n]) / lambda_max, which is the equivalent of result[point] = 1 - (lambda_array[n] / lambda_max) (with epsilon_max and lambda_max both being the lowest death level). So to me it looks like the C# implementation switched the denominator and numerator, leading to wrong results.
Or are the scores in this implementation meant to be different from the python one?
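To make the comparison concrete, here is a minimal Python sketch of the two expressions quoted above. It assumes the usual HDBSCAN* convention lambda = 1 / epsilon, which this issue does not state, and the function names are hypothetical (they belong to neither library). Under that assumption the two expressions give the same value, and a negative score appears whenever a point's epsilon is smaller than epsilon_max, so checking how those two values are obtained may help narrow the problem down.

```python
def score_from_epsilon(epsilon_max, epsilon):
    """Expression quoted from the C# file: 1 - (epsilon_max / epsilon)."""
    return 1.0 - (epsilon_max / epsilon)


def score_from_lambda(lambda_max, lambda_point):
    """Expression quoted from the Python file: (lambda_max - lambda) / lambda_max."""
    return (lambda_max - lambda_point) / lambda_max


def compare(epsilon_max, epsilon):
    # Assumption (not stated in the issue): lambda = 1 / epsilon.
    lambda_max = 1.0 / epsilon_max
    lambda_point = 1.0 / epsilon
    return (score_from_epsilon(epsilon_max, epsilon),
            score_from_lambda(lambda_max, lambda_point))


# epsilon >= epsilon_max: both expressions land in [0, 1].
in_range = compare(epsilon_max=0.5, epsilon=2.0)
# epsilon < epsilon_max: both expressions go negative.
negative = compare(epsilon_max=2.0, epsilon=0.4)
```

The second call uses made-up values chosen only to show how a score as low as -4 can arise; they are not taken from any real dataset.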
I'm getting the following exception:
System.OverflowException occurred
HResult=-2146233066
Message=Value was either too large or too small for an Int32.
Source=mscorlib
StackTrace:
at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
InnerException:
The exception occurred in this line:
int label = int.Parse(lineContents[i]);
After debugging a bit, I found out that raw double values are appended to hierarchyWriter in this line:
I think this is a locale problem, because in Germany, for example, the decimal separator is a comma.
I.e., the following code results in output being "48,91321404939,":
var currentEdgeWeight = 48.91321404939;
string output = currentEdgeWeight + "" + delimiter;
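A possible fix could be:
// Requires: using System.Globalization;
// Force "." as the decimal separator regardless of the current culture.
NumberFormatInfo nfi = new NumberFormatInfo();
nfi.NumberDecimalSeparator = ".";
int outputLength = 0;
string output = currentEdgeWeight.ToString(nfi) + delimiter;
hierarchyWriter.Append(output);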
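Why this surfaces as an OverflowException rather than a format error can be sketched in a few lines of Python. The sketch assumes the written line is later split on the same "," delimiter and that the resulting tokens are parsed as 32-bit integer labels, as in the int.Parse line above; split_tokens and fits_int32 are hypothetical helper names, not part of the library.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def split_tokens(line, delimiter=","):
    """Split a written line on the delimiter and drop empty tokens."""
    return [token for token in line.split(delimiter) if token]


def fits_int32(token):
    """Return True if the token is an integer inside the Int32 range."""
    return INT32_MIN <= int(token) <= INT32_MAX


german_line = "48,91321404939,"     # edge weight written with a decimal comma
invariant_line = "48.91321404939,"  # edge weight written with a decimal point

# The decimal comma turns one field into two, so the fractional digits
# end up in a position where an integer label is expected.
german_tokens = split_tokens(german_line)
invariant_tokens = split_tokens(invariant_line)
overflowing = [t for t in german_tokens if not fits_int32(t)]
```

With the decimal comma, the fractional digits 91321404939 become a separate token that is larger than the Int32 maximum of 2147483647, which matches the exception message.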
This is an awesome library!
But are there any plans to make it support .NET Standard? Or is it a dependency on Accord?
Hi,
I've another question regarding your library.
Doing the following in python:
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, core_dist_n_jobs=1)
labels = clusterer.fit_predict(data)
should be equivalent to the following call to your library in C#. Is this assumption correct?
var parameters = new HdbscanParameters()
{
DataSet = dataToCluster.RawData,
MinClusterSize = 100,
MinPoints = 100,
DistanceFunction = new EuclideanDistance(),
};
var hdbscanResult = HdbscanRunner.Run(parameters);
When visualizing the cluster results using t-SNE, the result of the Python library you mentioned in your README looks like this:
When doing the same with the result of the call to your library in the snippet above, it looks like this:
I wonder about the three (in my opinion) obvious clusters.
Is my call to your library wrong?
I didn't really understand what the parameter "MinPoints" is used for.
When does it make sense to set this parameter to a value different from MinClusterSize?
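As background for the MinPoints question: in HDBSCAN*, a neighbour count k is used to compute each point's core distance, while the minimum cluster size only affects how the cluster hierarchy is condensed. The Python sketch below assumes MinPoints plays the role of k (called min_samples in the Python library, where it defaults to min_cluster_size when not set). Whether the point itself counts among the k neighbours differs between implementations; this sketch counts it. core_distance is a hypothetical helper, not part of either library.

```python
import math


def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbour,
    counting the point itself as the first neighbour (an assumption)."""
    distances = sorted(math.dist(points[index], other) for other in points)
    return distances[k - 1]


# Three points close together and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]

small_k = [core_distance(points, i, 2) for i in range(len(points))]
large_k = [core_distance(points, i, 4) for i in range(len(points))]
```

With a small k, the three nearby points get a small core distance and only the far point looks sparse. With a large k, every core distance grows, so the clustering becomes more conservative. Setting the two parameters differently is a way to tune that behaviour independently of the smallest group that should count as a cluster.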
Is it possible to use weights for data points?
I currently add duplicates to create "weights" for my data points. Seems very inefficient. Is there a better way to do it?
Thank you for the implementation.
I have a database (SQL Server, PostgreSQL, etc.) with a table that contains a text field.
I'd like to read the texts from this table rather than from a set of files (as in the sample) and then analyze them and detect clusters.
Can you advise something? Thanks!
Repro steps:
.nupkg file
sn -vf HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll
Expected:
The sn.exe tool would output
Microsoft (R) .NET Framework Strong Name Utility Version 4.0.30319.0
Copyright (c) Microsoft Corporation. All rights reserved.
Assembly 'HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll' is valid
Actual:
The sn.exe tool outputs
Microsoft (R) .NET Framework Strong Name Utility Version 4.0.30319.0
Copyright (c) Microsoft Corporation. All rights reserved.
HdbscanSharp.1.0.15\lib\net45\HdbscanSharp.dll does not represent a strongly named assembly
Observation:
This blocks strong-named .NET assemblies from depending on HdbscanSharp. See Why strong-name your assemblies?
It would be great to have more distance functions ready to use.
The Python library implements a lot of metrics.
http://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#distance-matrices
The following file exposes the distance functions:
https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/dist_metrics.pyx
Some distance functions:
There are some examples in the folder HdbscanSharp/Distance/ (for example, see ManhattanDistance).
Each distance function must implement IDistanceCalculator.
If there is a limitation, it can throw a meaningful exception with an explanation.
For example, if the distance works only for 2 dimensions, throw an exception.
I suggest adding one class for each new distance function, with minimal documentation (description, formula, and suggested use case, if any; check Wikipedia).
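To illustrate the shape of such a contribution, here is a Python sketch of one metric (Chebyshev distance) with a descriptive error for invalid input. It only illustrates the formula and the kind of validation suggested above. It does not reproduce the IDistanceCalculator signature, and chebyshev_distance is a hypothetical name.

```python
def chebyshev_distance(a, b):
    """Chebyshev (L-infinity) distance: the largest absolute difference
    between corresponding coordinates, max_i |a_i - b_i|."""
    if len(a) != len(b):
        raise ValueError(
            f"Chebyshev distance needs vectors of equal length, "
            f"got {len(a)} and {len(b)}."
        )
    if len(a) == 0:
        raise ValueError("Chebyshev distance is undefined for empty vectors.")
    return max(abs(x - y) for x, y in zip(a, b))


example = chebyshev_distance([1.0, 2.0, 3.0], [4.0, 0.0, 3.0])
```

A C# version would put the same checks and formula behind the library's interface, one class per metric.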