GithubHelp home page GithubHelp logo

klopfer-bachelor's Introduction

Label/Property/Concept Hierarchy/Taxonomy Inference

This repository contains code implementing the following functionality:

  • A survey on hierarchical clustering
  • a synthetic dataset generator
  • an implementation of the COBWEB algortihm for conceptual clustering as an APOC procedure
  • several importing and preprocessing tools found in the cypher and python sub-directories

The key ideas are presented in the following presentations and thesis:

Survey

The survey uses all applicable algorithms from scikit-learn, some from pyclustering that are not in sklearn. It further uses hdbscan from scikit-contrib (slightly adapted, so it needs to be installed from lib, see below). Further it uses TRESTLE algorithm, and a combination of the scikit-learn & pyclustering algorithms with tandard clustering techniques followed by hierarchical clustering.

Setup:
On Linux do the following (macOS should be similar)

cd lib/hdbscan
python setup.py install
pip install -r requirements.txt

Run it using python clustering_survey.py

Data Generator

The data generator generates a hierarchy consisting of labels to be clustered. Width and depth of the hierarchy are configurable. Width adds nodes per level incrementing a counter per level. Depth adds more levels and appends a counter per level. Labels are "stacked" i.e. a node at the lowest level has |depth| labels while the root node has one. At the lowest level a couple of identical instances are created so that they MUST be in a cluster without a doubt. Everything above depends on the algorithm but the goal is essentially to reconstruct the ground truth it can be called from the command line using

java -jar lib/synthetic_data_generator.jar

If you're building via maven, that jar is in the target folder target/project-dataset-generator-jar-with-dependencies.jar

COBWEB implementation

During the survey it turned out, that conceptual clustering has several desirable traits that the other algorithms dont have, such that it was chosen to be implemented with Neo4j as graph database for the purpose of extracting graph-based features and use them for clustering. This was done in order to evaluate if concept hierarchies capture correlations among properties and features well enough to potentially use them for cardinality estimation, that is for estimating the selectivities of certain combinations of properties and graph-based features. For example if a node is queried with a relation of a certain type to another node of a certain type having a property with a certain value, one can look up in the concept tree which concept fits best to this query and how many values are approximated to be in this category. When constructing the tree with sample then one gets an estimation of the share of data instances belonging to the concept under consideration. If no concept matches well enough one can still use the uniformity assumption. With some modifications to the conceptual clustering framework it should also be possible to retrieve strongly correlated selection criteria in order to build up a look-up table for non-uniform selection criteria, such that if it is not found the probability that the selectivity is uniform is high.

Setup Before maven can be run to test or package the code one needs to initialize an additional dependency by hand (used for testing):

mvn initialize
mvn package

The longest running tests are disabled by default as a dataset with 100M edges is used and it can take up to 30 minutes.

For usage with Neo4j put the jar into /plugins:
From The Neo4J manual

Procedures and functions are written in Java and compiled into jar files. They are deployed to the database by dropping that jar file into the plugins directory on each standalone or clustered server. The database must be re-started on each server to pick up new procedures and functions. Running CALL dbms.procedures() will display the full list of procedures available in a particular database instance, including user-defined procedures.

I.e. to check if the plugin was registered correctly run CALL dbms.procedures() and see if kn.uni.dbis.neo4j.conceptual.PropertyGraphCobwebStream is shown.

Example call of the procedure:

MATCH (n)
WITH COLLECT(n) AS nodes
CALL kn.uni.dbis.neo4j.conceptual.PropertyGraphCobwebStream(nodes)
YIELD node, partitionID
RETURN node, partitionID

Currently the return value is not properly accessible trough Cypher. Actually using the algorithm - or rather a modification or reimplementation that fits the purpose better - within the database kernel, that is in the cardinality estimator would make more sense. Due to a lack of time I haven't done this, neither is there concrete code to hook this up to the estimator (also because N4j is sparse on internal docs and information), nor are there results on the improvements this gives. Another point for future work is to improve the algorithm such that it returns the qunatities of interest, provide a more robust optimization criterion than the category utility especially for mixed data, implement advances in the field e.g. LABYRINTH. Especially interesting would be the bayesian extension AutoClass or learning Bayesian networks to capture the conditional selectivities. A survery of different extensions is specified in Unsupervised Learning of Probabilistic Concept Hierarchies.

klopfer-bachelor's People

Contributors

someusername1 avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.