domain-discovery-d4's People

Contributors

dependabot[bot], heikomuller

domain-discovery-d4's Issues

Modify Strong Domain Discovery

The strong domain discovery step can be modified in the following way:

  • Cluster local domains that support each other. Each cluster forms a strong domain.
  • For each strong domain, rank terms by the number of columns (in the strong domain) in which they occur.
  • Use steepest drop to group terms based on their weights.
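
The ranking and steepest-drop steps might look roughly like the sketch below. It is only an illustration under assumptions: a strong domain is modeled as a list of columns, each a set of terms, and "steepest drop" is read as cutting the ranked list at the largest gap between consecutive weights. The function names `rank_terms` and `steepest_drop_split` are hypothetical and do not come from the D4 code base. The clustering step is omitted because the issue does not define how mutual support is measured.

```python
from collections import Counter


def rank_terms(strong_domain):
    """Rank terms by the number of columns of the strong domain they occur in.

    strong_domain: list of columns, each an iterable of terms (assumed shape).
    Returns a list of (term, column_count) pairs, highest count first.
    """
    counts = Counter(term for column in strong_domain for term in set(column))
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def steepest_drop_split(ranked):
    """Split a ranked list at the largest drop between consecutive weights.

    Returns (head, tail). If there is no drop at all, everything stays in head.
    """
    if len(ranked) < 2:
        return list(ranked), []
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    if drops[steepest] == 0:
        return list(ranked), []
    return ranked[: steepest + 1], ranked[steepest + 1:]


columns = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}]
head, tail = steepest_drop_split(rank_terms(columns))
```

Applying the split repeatedly to the tail would yield more than two groups; whether that is desired is not stated in the issue.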

Frequencies for Equivalence Classes

Add an option to compute the frequency of an equivalence class (EQ) in each column C as one of:

  • minimum column frequency of all EQ terms in C
  • maximum column frequency of all EQ terms in C
  • sum of the frequencies of all EQ terms in C (default)
  • average (rounded) of the frequencies of all EQ terms in C
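
The four options could be sketched as a single aggregation function. This is an illustration, not the project's API: the name `eq_frequency`, the mode strings, and the choice of round-half-up for the average are all assumptions, since the issue only says "rounded".

```python
def eq_frequency(term_frequencies, mode="sum"):
    """Aggregate the frequencies of an equivalence class's terms in one column.

    term_frequencies: non-negative integer frequencies, one per EQ term in C.
    mode: "min", "max", "sum" (default), or "avg" (hypothetical option names).
    """
    values = list(term_frequencies)
    if not values:
        raise ValueError("equivalence class has no terms in this column")
    if mode == "min":
        return min(values)
    if mode == "max":
        return max(values)
    if mode == "sum":
        return sum(values)
    if mode == "avg":
        # Integer round-half-up; the rounding rule is an assumption.
        return (2 * sum(values) + len(values)) // (2 * len(values))
    raise ValueError(f"unknown mode: {mode}")
```

For example, with term frequencies 2, 3 and 7 in a column, the four modes give 2, 7, 12 and 4 respectively.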

Keep track of term frequencies

We should keep track of term frequencies in the column files (and the terms index and compressed term index). This would allow us to use similarity measures for terms/equivalence classes that are based on some notion of tf-idf.
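
To make "some notion of tf-idf" concrete, here is one possible reading, offered only as a sketch: each column is treated as a document, tf is a term's share of the column's total frequency, and idf is the log of the number of columns divided by the number of columns containing the term. The input layout and the name `tf_idf` are assumptions; the issue does not fix a formula.

```python
import math


def tf_idf(column_term_counts):
    """Compute tf-idf weights per column.

    column_term_counts: dict mapping column id -> dict of term -> frequency
    (assumed layout; frequencies are expected to be positive).
    Returns a dict mapping column id -> dict of term -> weight.
    """
    n_columns = len(column_term_counts)
    # Document frequency: number of columns in which each term occurs.
    doc_freq = {}
    for counts in column_term_counts.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = {}
    for column, counts in column_term_counts.items():
        total = sum(counts.values())
        weights[column] = {
            term: (count / total) * math.log(n_columns / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

Under this reading, a term that occurs in every column gets weight 0, which is why the per-column frequencies need to be stored alongside the terms.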

Improve Algorithm for Merging Similar Equivalence Classes

The current implementation of SimilarTermIndexGenerator is rather naive. It merges all equivalence classes in a connected component, where components are formed from the similarity between pairs of equivalence classes. Because similarity is not transitive, this approach can merge dissimilar equivalence classes: A may be similar to B and B to C while A and C have nothing in common.

One improvement could be to pick equivalence classes as strong seeds and then merge each seed with all other equivalence classes that are similar to it. While this could still merge dissimilar equivalence classes, it guarantees that every merged class at least satisfies the similarity threshold with the seed equivalence class.
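
A minimal sketch of the seed-based idea follows. Several things here are assumptions rather than part of the proposal: equivalence classes are represented by the set of column ids they occur in, similarity is Jaccard, and seeds are simply taken in input order because the issue does not say how strong seeds are chosen. The names `jaccard` and `merge_by_seed` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_by_seed(eq_classes, similarity, threshold):
    """Group equivalence classes around seeds.

    Each unassigned class, taken in input order, becomes a seed. Every other
    unassigned class whose similarity to the seed meets the threshold joins
    the seed's group. Returns a list of groups of indices into eq_classes.
    """
    assigned = set()
    groups = []
    for i, seed in enumerate(eq_classes):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j, other in enumerate(eq_classes):
            if j not in assigned and similarity(seed, other) >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups


# A~B and B~C exceed the threshold, but A and C share no columns.
classes = [{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}]
groups = merge_by_seed(classes, jaccard, threshold=0.3)
```

With these inputs a connected-component merge would put all three classes together, whereas the seed-based version keeps the third class apart from the first. The result depends on seed order, so a real implementation would need a deliberate seed selection rule.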

add ref to paper

Heiko,

Can you please add a reference to our VLDB paper in the readme? This may be useful for people that try to use the tool.

Thanks,
Juliana
