(SODDY, TEDDY) & DREAM

This repository implements the training data generation algorithms (SODDY & TEDDY) and the deep cardinality estimators (DREAM) proposed in our paper "Cardinality Estimation of Approximate Substring Queries using Deep Learning". It was created by Suyong Kwon, Woohwan Jung and Kyuseok Shim.

Repository Overview

The repository consists of four folders, each of which contains its own README file and script.

Folder	Description
gen_train_data	training data generation algorithms
dream	deep cardinality estimators for approximate substring queries
astrid	a modified version of Astrid, based on the Astrid model downloaded from [github]
plot	example notebook files

Installation and Requirements

We recommend running the code in a CUDA environment. However, the code also works without CUDA when the PyTorch library does not support GPUs. (You may set CUDA_VISIBLE_DEVICES to -1 to enforce CPU mode.)
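As a minimal sketch of the CPU-mode setting mentioned above, the variable can be exported in the shell before any script is started. The PyTorch check at the end is an optional addition for illustration, not part of the repository, and it is skipped when PyTorch is not importable.

```shell
# Hide every GPU from CUDA-aware libraries in this shell session.
export CUDA_VISIBLE_DEVICES=-1

# Optional sanity check (an assumption, not a repository script):
# with the variable set, torch.cuda.is_available() should report False.
if python -c "import torch" >/dev/null 2>&1; then
    python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
fi
```

The export only affects the current shell and the processes started from it, so it has to be repeated in each new session.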

Method 1: Use the Docker Image

Running the image requires the NVIDIA Container Toolkit. If you do not have the toolkit, refer to the installation guide.

git clone https://github.com/sykwon/teddy-dream.git
cd teddy-dream   # the docker command below mounts the current directory as /workspace

# run docker image
docker run -it --gpus all --name dream -v ${PWD}:/workspace -u 1000:1000 sykwon/dream /bin/bash

# after starting docker
redis-server --daemonize yes
cd gen_train_data/
make clean && make && make info
cd ..
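The steps above start a Redis server in the background. One way to confirm that it is reachable before running the scripts is to send it a PING, as in the sketch below. The helper name check_redis and the REDIS_CLI override are hypothetical and are not part of the repository. The check assumes the server listens on the default local address.

```shell
# Hypothetical helper: report whether a Redis server answers PING.
# REDIS_CLI may point to a different client binary (defaults to redis-cli).
check_redis() {
    cli="${REDIS_CLI:-redis-cli}"
    if ! command -v "$cli" >/dev/null 2>&1; then
        echo "redis-cli not found"
        return 2
    fi
    # A running server replies to PING with PONG.
    if [ "$("$cli" ping 2>/dev/null)" = "PONG" ]; then
        echo "redis is running"
    else
        echo "redis is not reachable"
        return 1
    fi
}

# Usage:
#   check_redis || redis-server --daemonize yes
```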

Method 2: Create a Virtual Python Environment

This code requires Python 3.7 or higher.

sudo apt-get install -y redis-server git
sudo apt-get install -y binutils
sudo apt-get install -y texlive texlive-latex-extra texlive-fonts-recommended dvipng cm-super

conda create -n py37 python=3.7
source activate py37
conda install -y pytorch=1.7.1 torchvision=0.8.2 cudatoolkit=11.0 -c pytorch -c nvidia

pip install -r requirements.txt

Datasets

DBLP
GENE
WIKI
IMDB

Examples

These commands produce the experimental results.

cd gen_train_data
./run.sh DBLP     # to generate training data from the DBLP dataset
# ./run.sh GENE   # to generate training data from the GENE dataset
# ./run.sh WIKI   # to generate training data from the WIKI dataset
# ./run.sh IMDB   # to generate training data from the IMDB dataset
# ./run.sh all    # to generate training data from all datasets
cd ..

cd dream
./run.sh DBLP    # to train all models except Astrid with the DBLP dataset
# ./run.sh GENE  # to train all models except Astrid with the GENE dataset
# ./run.sh WIKI  # to train all models except Astrid with the WIKI dataset
# ./run.sh IMDB  # to train all models except Astrid with the IMDB dataset
# ./run.sh all   # to train all models except Astrid with all datasets
cd ..

cd astrid
./run.sh DBLP    # to train the Astrid model with the DBLP dataset
# ./run.sh GENE  # to train the Astrid model with the GENE dataset
# ./run.sh WIKI  # to train the Astrid model with the WIKI dataset
# ./run.sh IMDB  # to train the Astrid model with the IMDB dataset
# ./run.sh all   # to train the Astrid model with all datasets
cd ..
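The three stages above can also be chained for a single dataset. The function below is a hypothetical convenience wrapper, not a script shipped with the repository. It assumes it is called from the repository root and that each folder's run.sh accepts the dataset name as its only argument, as in the commands above.

```shell
# Hypothetical helper: run data generation, DREAM training, and Astrid
# training in that order for one dataset (DBLP, GENE, WIKI, IMDB, or all).
run_pipeline() {
    dataset="${1:?usage: run_pipeline DATASET}"
    for stage in gen_train_data dream astrid; do
        echo "== $stage: $dataset =="
        # The subshell keeps the caller's working directory unchanged.
        (cd "$stage" && ./run.sh "$dataset") || {
            echo "stage $stage failed" >&2
            return 1
        }
    done
}

# Usage, from the repository root:
#   run_pipeline DBLP
```

Stopping at the first failing stage avoids training models on incomplete training data.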

Please refer to [notebook] to see the experimental results.

Citation

Please consider citing our paper if you find this code useful:

@article{kwon2022cardinality,
    title={Cardinality estimation of approximate substring queries using deep learning},
    author={Kwon, Suyong and Jung, Woohwan and Shim, Kyuseok},
    journal={Proceedings of the VLDB Endowment},
    volume={15},
    number={11},
    year={2022}
}
