GithubHelp home page GithubHelp logo

naru-project / naru Goto Github PK

View Code? Open in Web Editor NEW
96.0 7.0 33.0 56 KB

Neural Relation Understanding: neural cardinality estimators for tabular data

License: Apache License 2.0

Python 99.91% Shell 0.09%
cardinality-estimation query-optimization self-supervised-learning generative-model unsupervised-learning deep-generative-model autoregressive-neural-networks learned-database-components learned-database density-estimation

naru's Introduction

Neural Relation Understanding

Naru is a suite of neural cardinality estimators for tabular data.

GitHub

This repo contains the code for the VLDB'20 paper, Deep Unsupervised Cardinality Estimation.

Main modules:

  • common.py: a lightweight pandas-based library to load/analyze/represent tables
  • several deep autoregressive model architectures
  • ProgressiveSampling: approximate inference algorithms for deep autoregressive models
  • a generator for high-dimensional SQL queries
  • training/evaluation scripts

Quick start

To set up a conda environment, run:

conda env create -f environment.yml

Run the following to test on a tiny 100-row dataset:

source activate naru

# Trains a ResMADE on dataset 'dmv-tiny'.
# This will create a checkpoint with path 'models/dmv-tiny-<model spec>.pt'.
python train_model.py --epochs=100 --residual 

# Use the trained model as a cardinality estimator.
# --glob supports evaluating a set of checkpoints at once; here, there will only be one match.
python eval_model.py --glob='dmv-tiny*.pt' --residual

Model architectures

Naru currently implements three state-of-the-art autoregressive architectures:

  1. MADE: a highly efficient masked MLP, introduced in Masked Autoencoder for Distribution Estimation (ICML'15).
  2. ResMADE: MADE with residual connections, introduced in Autoregressive Energy Machines (ICML'19).
  3. Transformer: an autoregressive Transformer, the architecture powering several recent breakthroughs in natural language processing (e.g., BERT, GPT-2, XLNet).

In principle, Naru's inference algorithms can interface with any autoregressive model and turn them into cardinality estimators.

Datasets

DMV. The DMV dataset is publically available at catalog.data.gov. The data is continuously updated. Our frozen snapshot (~11.6M tuples) can be downloaded by running

bash ./download_dmv.sh

Specify --dataset=dmv when launching the training/evaluation scripts.

Registering custom datasets

A user can point a Naru model to her own datasets in a few steps.

First, put a CSV file under datasets/. Second, define in datasets.py a LoadMyDataset() function:

def LoadMyDataset(filepath):   
    # Make sure that this loads data correctly.  
    df = pd.read_csv(filepath, **kwargs)  
    return CsvTable('Name of Dataset', df, cols=df.columns)

Last, call this function in the appropriate places inside the train/evaluation scripts. Search for current usage of args.dataset in those files and extend accordingly.

Running experiments

Run python train_model.py --help to see a list of tunable knobs. We recommend at least setting --residual --direct-io --column-masking. (In terms of learning efficiency, ResMADE learns faster than MADE, and --direct-io also helps. Architecture: Transformer can achieve lower negative log-likelihoods so it fits complex datasets better albeit being more expensive.)

When running evaluation (eval_model.py), include the same set of architecture flags to make sure checkpoint loading is correct.

Examples:

# Use a small 256x5 ResMADE model, with column masking.
python train_model.py --num-gpus=1 --dataset=dmv --epochs=20 --warmups=8000 --bs=2048 \
    --residual --layers=5 --fc-hiddens=256 --direct-io --column-masking

# Evaluate.  To enable estimators other than Naru, see section below.
python eval_model.py --dataset=dmv --glob='<ckpt from above>' --num-queries=2000 \
    --residual --layers=5 --fc-hiddens=256 --direct-io --column-masking
    
# Alternative: larger MADE model reported in paper.
python train_model.py --num-gpus=1 --dataset=dmv --epochs=100 --warmups=12000 --bs=2048 \
    --layers=0 --direct-io --column-masking --input-encoding=binary --output-encoding=one_hot

# Alternative: use a Transformer.
python train_model.py --num-gpus=1 --dataset=dmv --epochs=20 --warmup=20000 --bs=1024 \
    --blocks=4 --dmodel=64 --dff=256 --heads=4 --column-masking

Baseline cardinality estimators

We also include a set of baseline cardinality estimators known in the database literature:

  • Naru (--glob to find trained checkpoints)
  • Sampling (--run-sampling)
  • Bayes nets (--run-bn)
  • MaxDiff n-dimensional histogram (--run-maxdiff)
  • Postgres (see estimators.Postgres)

Example: to run experiments using trained Naru model(s) and a Sampler:

python eval_model.py --dataset=dmv --num-queries=2000 --glob='dmv*.pt' --run-sampling

Parameters controling these estimators can be adjusted inside eval_model.py.

Contributors

This repo was written by: Amog Kamsetty, Chenggang Wu, Eric Liang, Zongheng Yang.

Reference

If you find this repository useful in your work, please cite our VLDB'20 paper:

@inproceedings{naru,
  title={Deep Unsupervised Cardinality Estimation},
  author={Yang, Zongheng and Liang, Eric and Kamsetty, Amog and Wu, Chenggang and Duan, Yan and Chen, Xi and Abbeel, Pieter and Hellerstein, Joseph M and Krishnan, Sanjay and Stoica, Ion},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={3},
  pages={279--292},
  year={2019},
  publisher={VLDB Endowment}
}

naru's People

Contributors

concretevitamin avatar cw75 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

naru's Issues

DMV dataset snapshot has been deleted

Hi,

As for the title, I found that the DMV dataset snapshot has been deleted. I wonder if anyone still keeps this snapshot and I would like to run benchmarking with this snapshot. Thank you!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.