Scheminer

An attempt to write a tool to infer the schema of an arbitrary database.

Tested on the Brazilian E-Commerce Public Dataset by Olist. The rules and heuristics should generalize to other datasets as well, but no guarantees.

Actual schema (incomplete) vs. inferred schema: [images: example schema, mined example schema]

Usage

The project uses PDM as its package manager, but it can be installed with any PEP 517-compliant package manager. Using the UI, however, requires the dev dependencies, which are defined in a PDM-specific section of pyproject.toml.

To install using PDM, it's sufficient to just type:

pdm install -d # For dev

With Pip, first type:

pip install -e .

Then separately install the following dependencies to use the UI:

streamlit>=1.32.2
pyvis>=0.3.2
plotly>=5.20.0
matplotlib>=3.8.4
scipy>=1.13.0
streamlit-agraph>=0.0.45

Now the UI can be run using the following command:

streamlit run debug-ui.py

Currently Scheminer only supports databases in the form of a list of CSV tables, which can be uploaded to the Streamlit UI.

Debug UI Screenshot

How it works

Scheminer relies on two basic assumptions for resolving the schema of a database:

  • If column A is a subset of column B, it is likely a foreign key to column B.
  • If column A is a subset of both column B and column C, and column B is itself a subset of column C, column C is likely the actual parent column.
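As a rough illustration of these two assumptions (not Scheminer's actual code), the sketch below works on plain Python lists. The function names and the `columns` dictionary are hypothetical, and the choice to only drop a candidate when it is a proper subset of another candidate is an assumption made here so that two identical columns are both kept for manual review.

```python
def is_subset(child, parent):
    """True if every distinct value in `child` also occurs in `parent`."""
    return set(child) <= set(parent)


def likely_parents(name, columns):
    """Return the candidate parent columns of `name`.

    `columns` maps a column name to its list of values (hypothetical layout).
    """
    # Assumption 1: any column that contains all values of `name` is a candidate.
    candidates = [
        other for other in columns
        if other != name and is_subset(columns[name], columns[other])
    ]
    # Assumption 2: drop a candidate that is a proper subset of another
    # candidate, since the larger column is the more likely parent.
    return [
        c for c in candidates
        if not any(set(columns[c]) < set(columns[d]) for d in candidates)
    ]


columns = {
    "A": [1, 1, 2],
    "B": [1, 2, 3],
    "C": [1, 2, 3, 4],
}
print(likely_parents("A", columns))  # ['C']
```

Here A is a subset of both B and C, but B is also a subset of C, so only C remains as the likely parent.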

In the real world, some cases are difficult to resolve:

  • In one-to-one relations, which column is actually the parent?
  • In many-to-many relations (which sadly do occur in badly normalized databases), which column is actually the parent?
  • Some columns, such as counts, can easily be a subset of an incremental numerical index (1, 2, ..., 200 is a subset of 1, 2, ..., 10,000,000).
  • If A should be a subset of B, but B has deleted records that didn't cascade down to A, this relation is only detected if we allow for some tolerance or "fuzzy" subsets.
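The last case, the "fuzzy" subset, can be sketched as follows. This is a minimal example and not Scheminer's implementation: the function names and the default tolerance of 0.05 are assumptions chosen for illustration, and the check is done on distinct values.

```python
def missing_fraction(child, parent):
    """Fraction of distinct `child` values that have no match in `parent`."""
    child_values = set(child)
    if not child_values:
        return 0.0
    return len(child_values - set(parent)) / len(child_values)


def is_fuzzy_subset(child, parent, tolerance=0.05):
    """True if at most `tolerance` of the distinct child values are missing."""
    return missing_fraction(child, parent) <= tolerance


# B lost record 3 through a delete that did not cascade down to A.
b = [i for i in range(1, 101) if i != 3]
a = list(range(1, 51))

print(set(a) <= set(b))                      # False: the strict subset check fails
print(missing_fraction(a, b))                # 0.02
print(is_fuzzy_subset(a, b, tolerance=0.05)) # True
```

A larger tolerance recovers more of these damaged relations, but it also lets more spurious ones through, which is one reason the manual step remains necessary.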

In the future, these cases may be (partially) resolved using NLP techniques on column and table names. For now, however, these require manual intervention.

The general schema mining process can be split into three phases:

  1. Relationship detection
    • Analyze column contents to detect relationships
  2. Manual intervention
    • Clean detection errors
  3. Graph-based schema resolution
    • Analyze the graph to keep the minimum valid schema

Relationship detection

  1. Detect unidirectional relations (one-to-X, many-to-X)
    • Calculate relationship strength: the fraction of values of $A$ found in $B$, $\frac{|A \cap B|}{|A|}$
    • Calculate one-sided cardinality. If the number of unique values of $A$ found in $B$ equals the total number of values of $A$ found in $B$, the cardinality is one-to-X; otherwise it is many-to-X.
  2. Merge unidirectional relations into fully qualified relations (one-to-X + many-to-X -> many-to-one)
    • The table with the greatest strength pointing to the other is likely the child
  3. Filter out relations with strength less than 1 (minus tolerance)
    • We only want to keep "full" subsets
  4. Flip one-to-many relations to many-to-one
    • We want relations to always point from the child to the parent

Manual intervention

  1. Resolve parent-child confusions
    • Resolve cases where two columns both contain 100% of each other's values and we cannot automatically detect the correct parent-child direction.
  2. Filter out low-correlation columns.
    • Try to filter out some falsely detected relations. Some columns may be spurious subsets of other columns. A categorical [1, 2, 3] column, for example, can be a perfect subset of a numerical index. Such columns will have very little overlap the other way around, however, which we can filter on.
  3. Filter out incorrect relations
    • Filter out columns that are spurious subsets of others (often numerical columns)
    • Supported by automatically detecting columns where the parent column has only a weak relation the other way around. This can remove some relations between large indexes and smaller numerical columns.
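The automatic support for this step could look roughly like the sketch below, which flags relations whose parent overlaps only weakly with the child. The names, the data layout, and the `min_reverse` threshold of 0.1 are assumptions for illustration and are not taken from Scheminer.

```python
def overlap(a, b):
    """Fraction of distinct values of `a` that also occur in `b`."""
    a_values = set(a)
    return len(a_values & set(b)) / len(a_values) if a_values else 0.0


def flag_spurious(relations, columns, min_reverse=0.1):
    """Split (child, parent) relations into (kept, flagged).

    A relation is flagged when the parent-to-child overlap is below
    `min_reverse`, a hypothetical threshold.
    """
    kept, flagged = [], []
    for child, parent in relations:
        reverse = overlap(columns[parent], columns[child])
        (kept if reverse >= min_reverse else flagged).append((child, parent))
    return kept, flagged


columns = {
    "orders.id": list(range(1, 1001)),          # incremental index
    "items.order_id": list(range(1, 1001, 2)),  # plausible foreign key
    "reviews.score": [1, 2, 3, 4, 5],           # categorical column
}
relations = [
    ("reviews.score", "orders.id"),
    ("items.order_id", "orders.id"),
]
kept, flagged = flag_spurious(relations, columns)
print(kept)     # [('items.order_id', 'orders.id')]
print(flagged)  # [('reviews.score', 'orders.id')]
```

Flagged relations are returned rather than deleted, since a genuine foreign key from a small table into a large one also has a weak reverse overlap and still needs a human decision.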

Graph resolution

  1. For all columns with multiple outgoing edges (parents)
  2. Traverse trail of column edges to find top-level ancestors
  3. Remove links to parents if they share an ancestor with another parent.
    • In hindsight, it's also sufficient to just check whether a parent has a parent itself. Since $A \subseteq B \land B \subseteq C \implies A \subseteq C$, we know that any ancestor of a parent has also been detected as a direct parent of the column itself.

See the diagram below for an illustration of this step.

flowchart LR;

subgraph Filtered
tabaf[Table A] --Column x -> Column z-->tabcf[Table C]
tabbf[Table B] --Column y -> Column z--> tabcf
end

subgraph Unfiltered
taba[Table A] --Column x -> Column y--> tabb[Table B] --Column y -> Column z--> tabc[Table C]

taba --Column x -> Column z-->tabc
end

Unfiltered ==> Filtered
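The simplified check described above (drop a link to a parent that has a parent of its own) can be sketched on a plain list of edges. This is an illustrative sketch rather than Scheminer's code: `resolve_schema` is a hypothetical name, edges are (child, parent) tuples, and the fallback that keeps all parents when every candidate has a parent (for example in a cycle) is an assumption added here.

```python
def resolve_schema(edges):
    """Reduce (child, parent) edges to links to top-level parents only."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    kept = []
    for child, candidates in parents.items():
        # Only columns with multiple outgoing edges need resolving.
        if len(candidates) > 1:
            # A parent that has a parent itself is an intermediate column.
            top_level = {p for p in candidates if not parents.get(p)}
            # Assumed fallback: keep everything if no top-level parent exists.
            candidates = top_level or candidates
        kept.extend((child, parent) for parent in candidates)
    return sorted(kept)


# The "Unfiltered" graph from the diagram.
edges = [
    ("A.x", "B.y"),
    ("B.y", "C.z"),
    ("A.x", "C.z"),
]
print(resolve_schema(edges))  # [('A.x', 'C.z'), ('B.y', 'C.z')]
```

The output matches the "Filtered" graph: the link from A to B is removed because B has a parent of its own, while both links to C remain.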