queryproc / optimizing-subgraph-queries-combining-binary-and-worst-case-optimal-joins

Code for the paper titled "Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins". VLDB'19

Home Page: http://amine.io/papers/wco-optimizer-vldb19.pdf

Shell 0.01% Python 3.04% ANTLR 0.90% Java 96.04%

Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins

Overview

For an overview of our one-time subgraph matching optimizer, check our paper.
We study the problem of optimizing subgraph queries using the new worst-case optimal join plans. Worst-case optimal plans evaluate queries by matching one query vertex at a time using multi-way intersections. The core problem in optimizing worst-case optimal plans is to pick an ordering of the query vertices to match. We design a cost-based optimizer that (i) picks efficient query vertex orderings for worst-case optimal plans; and (ii) generates hybrid plans that mix traditional binary joins with worst-case optimal style multiway intersections. Our cost metric combines the cost of binary joins with a new cost metric called intersection-cost. The plan space of our optimizer contains plans that are not in the plan spaces based on tree decompositions from prior work.
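As a rough illustration of matching one query vertex at a time, the sketch below evaluates the triangle query (a)->(b), (b)->(c), (a)->(c) in the vertex order a, b, c. It is a minimal toy in Python, not the optimizer or the Java operators from this repository; the function names and the sample graph are made up for the example.

```python
from collections import defaultdict

def build_forward_adjacency(edges):
    """Map each vertex to the set of its out-neighbours."""
    fwd = defaultdict(set)
    for src, dst in edges:
        fwd[src].add(dst)
    return fwd

def match_triangles(edges):
    """Match (a)->(b), (b)->(c), (a)->(c) using the vertex order a, b, c."""
    fwd = build_forward_adjacency(edges)
    matches = []
    # Step 1: scan the query edge (a)->(b).
    for a, b in edges:
        # Step 2: extend to c by intersecting the out-neighbours of a and b,
        # so that both (a)->(c) and (b)->(c) are satisfied at once.
        for c in sorted(fwd[a] & fwd[b]):
            matches.append((a, b, c))
    return matches

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
print(match_triangles(edges))
```

A different vertex order would intersect different adjacency lists and can produce a different amount of intermediate work, which is the choice the cost-based optimizer described above is meant to make.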

DO NOT DISTRIBUTE. USE ONLY FOR ACADEMIC RESEARCH PURPOSES.

Build Steps

  • To do a full clean build: ./gradlew clean build installDist
  • All subsequent builds: ./gradlew build installDist

Executing Queries

Getting Started

After building, run the following command in the project root directory:

. ./env.sh

You can now move into the scripts folder to load a dataset and execute queries:

cd scripts

Dataset Preparation

A dataset may consist of two files: (i) a vertex file, where IDs range from 0 to N and each line is of the format (ID,LABEL); and (ii) an edge file, where each line is of the format (FROM,TO,LABEL). If the vertex file is omitted, all vertices are assigned the same label. We mainly used datasets from SNAP. The serialize_dataset.py script loads a dataset from CSV files and serializes it to a format that is quick to load in subsequent runs.
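SNAP edge lists are usually whitespace-separated with '#' comment lines, so they need converting before use. The sketch below shows one possible conversion to FROM,TO,LABEL rows. The helper name, the placeholder label "0", and the dense renumbering of vertex IDs are assumptions made for illustration; check them against what serialize_dataset.py actually accepts.

```python
def snap_to_edge_rows(lines, edge_label="0"):
    """Convert SNAP-style 'src<whitespace>dst' lines into FROM,TO,LABEL rows.

    Lines starting with '#' are skipped as comments. Vertex IDs are remapped
    to a dense range starting at 0, in order of first appearance.
    The default edge_label is a hypothetical placeholder.
    """
    ids = {}
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        for vertex in (src, dst):
            ids.setdefault(vertex, len(ids))
        rows.append(f"{ids[src]},{ids[dst]},{edge_label}")
    return rows

sample = ["# FromNodeId\tToNodeId", "10\t20", "20\t30", "10\t30"]
print("\n".join(snap_to_edge_rows(sample)))
```

Writing the returned rows to a file, one per line, gives an edges file in the layout described above.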

To load and serialize a dataset from a single edges file, run the following command in the scripts folder:

python3 serialize_dataset.py /absolute/path/edges.csv /absolute/path/data

The system will assume that all vertices have the same label in this case. The serialized graph will be stored in the data directory. If the dataset consists of an edges file and a vertices file, the following command can be used instead:

python3 serialize_dataset.py /absolute/path/edges.csv /absolute/path/data -v /absolute/path/vertices.csv

After running one of the commands above, a catalog can be generated for the optimizer using the serialize_catalog.py script.

python3 serialize_catalog.py /absolute/path/data  

Executing Queries

Once a dataset has been prepared, executing a query is as follows:

python3 execute_query.py "(a)->(b),(b)->(c),(c)->(d)" /absolute/path/data

Example output on the Amazon0601 dataset from SNAP, with 1 edge label and 1 vertex label, is shown below. The log reports the dataset loading time, the optimizer run time, the query execution run time, and the query plan, together with the number of output and intermediate tuples.

Dataset loading run time: 626.713398 (ms)
Optimizer run time: 9.745375 (ms)
Plan initialization before exec run time: 9.745375 (ms)
Query execution run time: 2334.2977 (ms)
Number output tuples: 118175329
Number intermediate tuples: 34971362
Plan: SCAN (a)->(c), Single-Edge-Extend TO (b) From (a[Fwd]), Multi-Edge-Extend TO (d) From (b[Fwd]-c[Fwd])
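When running many queries, it can help to collect these numbers programmatically. The sketch below parses log lines of the shape shown above into a dictionary. The function name and the dictionary keys are choices made for this example, and the parser only assumes the "name: value" layout visible in the sample output.

```python
def parse_query_log(text):
    """Extract timings, tuple counts and the plan from execute_query.py output."""
    metrics = {}
    for line in text.splitlines():
        if "Plan: " in line:
            metrics["Plan"] = line.split("Plan: ", 1)[1].strip()
            continue
        head, sep, value = line.rpartition(": ")
        if not sep:
            continue
        # Drop any logger prefix such as "[INFO ][...] OptimizerExecutor".
        name = head.split(": ")[-1].strip()
        value = value.strip()
        if value.endswith("(ms)"):
            metrics[name + " (ms)"] = float(value[:-4])
        elif value.isdigit():
            metrics[name] = int(value)
    return metrics

sample = """Dataset loading run time: 626.713398 (ms)
Optimizer run time: 9.745375 (ms)
Query execution run time: 2334.2977 (ms)
Number output tuples: 118175329
Number intermediate tuples: 34971362
Plan: SCAN (a)->(c), Single-Edge-Extend TO (b) From (a[Fwd])"""

for key, value in parse_query_log(sample).items():
    print(key, "=", value)
```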

To execute the query above with multiple threads, pass the -t option. The following command uses 2 threads:

python3 execute_query.py "(a)->(b),(b)->(c),(c)->(d)" /absolute/path/data -t 2

The query above assigns arbitrary labels to its vertices and edges, e.g. (a), (b), (c), (a)->(b), and (b)->(c). Use it with unlabeled datasets only. When the dataset has labels, assign a label to each vertex and edge as follows:

python3 execute_query.py "(a:person)-[friendof]->(b:person), (b:person)-[likes]->(c:movie)" /absolute/path/data
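Longer patterns are easier to assemble from a list of edges than to type by hand. The helper below is a small, hypothetical convenience (it is not part of the repository) that formats query strings in the two styles shown above, with and without labels.

```python
def format_query(edges):
    """Build a query string from (src, src_label, edge_label, dst, dst_label) tuples.

    Pass None for a label to omit it, as in the unlabeled examples.
    """
    def vertex(name, label):
        return f"({name}:{label})" if label is not None else f"({name})"

    parts = []
    for src, src_label, edge_label, dst, dst_label in edges:
        arrow = f"-[{edge_label}]->" if edge_label is not None else "->"
        parts.append(vertex(src, src_label) + arrow + vertex(dst, dst_label))
    return ",".join(parts)

labeled = format_query([
    ("a", "person", "friendof", "b", "person"),
    ("b", "person", "likes", "c", "movie"),
])
print(labeled)
```

The resulting string can be passed as the first argument to execute_query.py.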

Requiring More Memory

By default, the JVM heap is limited to a maximum of 2GB. To raise the limit, prepend JAVA_OPTS with the desired -Xmx value (for example JAVA_OPTS='-Xmx500G') when calling the Python scripts:

JAVA_OPTS='-Xmx500G' python3 serialize_catalog.py /absolute/path/data  
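The same setting can be applied when driving the scripts from another Python program. The wrapper below is a sketch under the assumption that the scripts read JAVA_OPTS from the environment, as the command above implies; run_script and its parameters are names invented for this example.

```python
import os
import subprocess
import sys

def run_script(script, *args, max_heap=None, dry_run=False):
    """Run one of the scripts with an optional JVM heap limit.

    max_heap is a size such as "500G"; it is passed as JAVA_OPTS=-Xmx<size>.
    With dry_run=True the command and JAVA_OPTS value are returned, not run.
    """
    env = dict(os.environ)
    if max_heap is not None:
        env["JAVA_OPTS"] = f"-Xmx{max_heap}"
    cmd = [sys.executable, script, *args]
    if dry_run:
        return cmd, env.get("JAVA_OPTS")
    return subprocess.run(cmd, env=env, check=True)

cmd, java_opts = run_script(
    "serialize_catalog.py", "/absolute/path/data", max_heap="500G", dry_run=True
)
print(java_opts, cmd[1:])
```

Dropping dry_run=True executes the script from the current directory, so it should be called from the scripts folder after sourcing env.sh.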

Contact

Amine Mhedhbi, [email protected]

optimizing-subgraph-queries-combining-binary-and-worst-case-optimal-joins's Issues

No such file or directory

After building graphflow and converting a SNAP dataset to edges.csv, I ran the following command:
python serialize_dataset.py /absolute/dataset/edges.csv /absolute/data
but I got the following error:
No such file or directory: '/GRAPHFLOW_HOME/build/install/graphflow/bin/dataset-serializer': '/GRAPHFLOW_HOME/build/install/graphflow/bin/dataset-serialize
It seems that the build step did not generate all of graphflow.
I also cloned the graphflow repo with git; building it generates '/GRAPHFLOW_HOME/build/install/graphflow/bin/', but not the scripts to load data.
Is something wrong?

Looking forward to your reply!

About the size of the adjacency list

In the class SortedAdjList, there is a function named "size". However, I think the return statement should use "-" (subtraction), because offset[i] is the number of entries of all types (labels) not bigger than i.
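To make the arithmetic behind this report concrete, the sketch below uses a hypothetical layout that follows the description above: neighbours are sorted by label and offsets[i] holds the number of neighbours whose label is at most i. It is a Python illustration only, not the actual SortedAdjList code, and the names are invented for the example.

```python
def cumulative_offsets(sorted_labels, num_labels):
    """offsets[i] = number of neighbours whose label is <= i."""
    offsets = [0] * num_labels
    for label in sorted_labels:
        offsets[label] += 1
    for i in range(1, num_labels):
        offsets[i] += offsets[i - 1]
    return offsets

def partition_size(offsets, label):
    """Number of neighbours with exactly this label: a difference of offsets."""
    previous = offsets[label - 1] if label > 0 else 0
    return offsets[label] - previous

neighbour_labels = [0, 0, 1, 3, 3, 3]  # labels of one vertex's neighbours, sorted
offsets = cumulative_offsets(neighbour_labels, num_labels=4)
print(offsets)
print([partition_size(offsets, label) for label in range(4)])
```

Under this layout, the size of one label's partition is the difference between two consecutive cumulative counts, which is why a subtraction is expected.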

Query results do not match

The number of tuples output is often 1 or 2 less than there actually are.

Reproduce:

  • edges.csv:
9,1,3
9,4,3
12,1,0
9,1,2
9,6,4
2,1,3
5,1,2
9,6,3
12,1,2
11,2,4

  • vertices.csv:

0,1
1,1
2,0
3,1
4,2
5,1
6,1
7,1
8,1
9,1
10,2
11,0
12,2

  • commands:

root@dc124d4957e7:~# rm -r /root/data/graphflow/
root@dc124d4957e7:~# mkdir /root/data/graphflow/
root@dc124d4957e7:~# python3 eva_graphflow_stream/scripts/serialize_dataset.py /root/data/edges.csv /root/data/graphflow/ -v /root/data/vertices.csv
[INFO ][2023-07-04 16:11:17.845] KeyStore: Serializing the types and labels key store.
[INFO ][2023-07-04 16:11:17.853] Graph: Serializing the data graph.
root@dc124d4957e7:~# JAVA_OPTS='-Xmx500G' python3 eva_graphflow_stream/scripts/serialize_catalog.py /root/data/graphflow/ -v 2
[INFO ][2023-07-04 16:12:07.315] Catalog: serializing the data graph's catalog.
root@dc124d4957e7:~# python3 eva_graphflow_stream/scripts/execute_query.py "(a:1)-[3]->(b:1)" /root/data/graphflow/
(a:1)-[3]->(b:1)
[INFO ][2023-07-04 16:12:18.357] OptimizerExecutor: Dataset loading run time: 115.204859 (ms)
[INFO ][2023-07-04 16:12:18.370] OptimizerExecutor: Optimizer run time: 10.196823 (ms)
[INFO ][2023-07-04 16:12:18.372] OptimizerExecutor: Plan initialization before exec run time: 10.196823 (ms)
[INFO ][2023-07-04 16:12:18.374] OptimizerExecutor: Query execution run time: 0.0371 (ms)
[INFO ][2023-07-04 16:12:18.374] OptimizerExecutor: Number output tuples: 2
[INFO ][2023-07-04 16:12:18.375] OptimizerExecutor: Number intermediate tuples: 0
[INFO ][2023-07-04 16:12:18.375] OptimizerExecutor: Plan: SCAN (a)->(b)

Number of output tuples: expected 3, actual 2.
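A brute-force count over the two CSV files gives an independent reference for a single-edge query such as (a:1)-[3]->(b:1). The sketch below assumes the README's file layout, (FROM,TO,LABEL) for edges and (ID,LABEL) for vertices; if the fork used in this report reads the columns differently, the result will differ. The function name is made up for the example.

```python
def count_single_edge_matches(edge_rows, vertex_rows, src_label, edge_label, dst_label):
    """Count edges FROM,TO,LABEL whose endpoint and edge labels all match."""
    labels = {}
    for row in vertex_rows:
        vertex_id, label = row.strip().split(",")
        labels[int(vertex_id)] = label
    count = 0
    for row in edge_rows:
        src, dst, label = row.strip().split(",")
        if (label == edge_label
                and labels.get(int(src)) == src_label
                and labels.get(int(dst)) == dst_label):
            count += 1
    return count

edge_rows = ["9,1,3", "9,4,3", "12,1,0", "9,1,2", "9,6,4",
             "2,1,3", "5,1,2", "9,6,3", "12,1,2", "11,2,4"]
vertex_rows = ["0,1", "1,1", "2,0", "3,1", "4,2", "5,1", "6,1",
               "7,1", "8,1", "9,1", "10,2", "11,0", "12,2"]

reference = count_single_edge_matches(edge_rows, vertex_rows, "1", "3", "1")
print("reference count:", reference)
```

The printed count can be compared with the expected and actual values reported above.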

OutOfMemoryError:

Hello, and thank you for the open-source code. I tried setting JAVA_OPTS='-Xmx50G', but it does not seem to take effect. Could you please help me fix it?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at ca.waterloo.dsg.graphflow.query.QueryGraph.addQEdgeToQGraph(QueryGraph.java:107)
at ca.waterloo.dsg.graphflow.query.QueryGraph.addEdge(QueryGraph.java:95)
at ca.waterloo.dsg.graphflow.query.QueryGraph$$Lambda$63/0x00000008001b7c40.accept(Unknown Source)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
at ca.waterloo.dsg.graphflow.query.QueryGraph.addEdges(QueryGraph.java:53)
at ca.waterloo.dsg.graphflow.query.QueryGraph.copy(QueryGraph.java:234)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNoops(CatalogPlans.java:277)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNextOperators(CatalogPlans.java:163)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNextOperators(CatalogPlans.java:168)
at ca.waterloo.dsg.graphflow.planner.catalog.Catalog.populate(Catalog.java:233)
at ca.waterloo.dsg.graphflow.runner.dataset.CatalogSerializer.main(CatalogSerializer.java:62)
