Code for the paper titled "Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins". VLDB'19

Home Page: http://amine.io/papers/wco-optimizer-vldb19.pdf

Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins

Overview
Build Steps
Executing Queries
Contact

Overview

For an overview of our one-time subgraph matching optimizer, check our paper.
We study the problem of optimizing subgraph queries using the new worst-case optimal join plans. Worst-case optimal plans evaluate queries by matching one query vertex at a time using multi-way intersections. The core problem in optimizing worst-case optimal plans is to pick an ordering of the query vertices to match. We design a cost-based optimizer that (i) picks efficient query vertex orderings for worst-case optimal plans; and (ii) generates hybrid plans that mix traditional binary joins with worst-case optimal style multiway intersections. Our cost metric combines the cost of binary joins with a new cost metric called intersection-cost. The plan space of our optimizer contains plans that are not in the plan spaces based on tree decompositions from prior work.
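To make the idea of matching one query vertex at a time concrete, here is a minimal Python sketch for the triangle query (a)->(b),(b)->(c),(a)->(c). It is a generic illustration of a worst-case optimal style plan with the vertex ordering a, b, c, not the implementation in this repository; the function names and the toy edge list are assumptions made for the example.

```python
from collections import defaultdict


def build_forward_adjacency(edges):
    """Map every source vertex to the set of its out-neighbours."""
    fwd = defaultdict(set)
    for src, dst in edges:
        fwd[src].add(dst)
    return fwd


def count_triangles(edges):
    """Count matches of (a)->(b),(b)->(c),(a)->(c) with the ordering a, b, c."""
    fwd = build_forward_adjacency(edges)
    count = 0
    for a in list(fwd):              # match query vertex a
        for b in fwd[a]:             # extend to b through a's forward list
            # extend to c with one intersection of a's and b's forward lists
            count += len(fwd[a] & fwd.get(b, set()))
    return count


# Toy graph (hypothetical data): two directed triangles, (0,1,2) and (1,2,3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
print(count_triangles(edges))
```

The last extension never materializes the open path a->b->c. It intersects two adjacency lists instead, which is the step that a pure binary-join plan lacks.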

DO NOT DISTRIBUTE. USE ONLY FOR ACADEMIC RESEARCH PURPOSES.

Build Steps

To do a full clean build: ./gradlew clean build installDist
All subsequent builds: ./gradlew build installDist

Executing Queries

Getting Started

After building, run the following command in the project root directory:

. ./env.sh

You can now move into the scripts folder to load a dataset and execute queries:

cd scripts

Dataset Preparation

A dataset may consist of two files: (i) a vertex file, where IDs are from 0 to N and each line is of the format (ID,LABEL); and (ii) an edge file where each line is of the format (FROM,TO,LABEL). If the vertex file is omitted, all vertices are assigned the same label. We mainly used datasets from SNAP. The serialize_dataset.py script lets you load datasets from csv files and serialize them to the appropriate format for quick subsequent loading.
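As a rough illustration of the two file layouts, the following Python sketch writes a tiny dataset. The helper name write_dataset, the file names, and the label values are assumptions made for this example; only the (ID,LABEL) and (FROM,TO,LABEL) line formats come from the description above.

```python
import csv
import os
import tempfile


def write_dataset(directory, edges, vertex_labels=None):
    """Write edges.csv as FROM,TO,LABEL lines and, optionally, vertices.csv as ID,LABEL lines."""
    edges_path = os.path.join(directory, "edges.csv")
    with open(edges_path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(edges)

    vertices_path = None
    if vertex_labels is not None:
        vertices_path = os.path.join(directory, "vertices.csv")
        with open(vertices_path, "w", newline="") as f:
            # Vertex IDs are the positions in the list, starting at 0.
            csv.writer(f, lineterminator="\n").writerows(enumerate(vertex_labels))
    return edges_path, vertices_path


# Hypothetical toy data: three vertices and two labeled edges.
out_dir = tempfile.mkdtemp()
edges_path, vertices_path = write_dataset(
    out_dir,
    edges=[(0, 1, "friendof"), (1, 2, "likes")],
    vertex_labels=["person", "person", "movie"],
)
print(edges_path, vertices_path)
```

The resulting paths can then be passed as absolute paths to serialize_dataset.py.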

To load and serialize a dataset from a single edges file, run the following command in the scripts folder:

python3 serialize_dataset.py /absolute/path/edges.csv /absolute/path/data

The system will assume that all vertices have the same label in this case. The serialized graph will be stored in the data directory. If the dataset consists of an edges file and a vertices file, the following command can be used instead:

python3 serialize_dataset.py /absolute/path/edges.csv /absolute/path/data -v /absolute/path/vertices.csv

After running one of the commands above, a catalog can be generated for the optimizer using the serialize_catalog.py script.

python3 serialize_catalog.py /absolute/path/data

Executing Queries

Once a dataset has been prepared, executing a query is as follows:

python3 execute_query.py "(a)->(b),(b)->(c),(c)->(d)" /absolute/path/data

An example output on the Amazon0601 dataset from SNAP, with 1 edge label and 1 vertex label, is shown below. The log reports the dataset loading time, the optimizer run time, the query execution run time, and the query plan, along with the number of output and intermediate tuples.

Dataset loading run time: 626.713398 (ms)
Optimizer run time: 9.745375 (ms)
Plan initialization before exec run time: 9.745375 (ms)
Query execution run time: 2334.2977 (ms)
Number output tuples: 118175329
Number intermediate tuples: 34971362
Plan: SCAN (a)->(c), Single-Edge-Extend TO (b) From (a[Fwd]), Multi-Edge-Extend TO (d) From (b[Fwd]-c[Fwd])
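When comparing runs, it can help to pull these numbers out of the log programmatically. The sketch below is one possible approach, assuming the log lines keep the "name: value" shape shown above; the parse_metrics helper and its regular expressions are written for this example and are not part of the repository.

```python
import re

# A metric name, a colon, a number, and an optional "(ms)" unit at the end of the line.
NUMERIC_LINE = re.compile(r"([A-Z][A-Za-z ]+?): (\d+(?:\.\d+)?)(?: \(ms\))?\s*$")
PLAN_LINE = re.compile(r"Plan: (.+)$")


def parse_metrics(log_text):
    """Collect the timing and tuple-count lines, plus the plan, into a dict."""
    metrics = {}
    for line in log_text.splitlines():
        plan = PLAN_LINE.search(line)
        if plan:
            metrics["Plan"] = plan.group(1).strip()
            continue
        numeric = NUMERIC_LINE.search(line)
        if numeric:
            name, value = numeric.groups()
            metrics[name] = float(value) if "." in value else int(value)
    return metrics


sample = """Dataset loading run time: 626.713398 (ms)
Optimizer run time: 9.745375 (ms)
Plan initialization before exec run time: 9.745375 (ms)
Query execution run time: 2334.2977 (ms)
Number output tuples: 118175329
Number intermediate tuples: 34971362
Plan: SCAN (a)->(c), Single-Edge-Extend TO (b) From (a[Fwd]), Multi-Edge-Extend TO (d) From (b[Fwd]-c[Fwd])
"""
metrics = parse_metrics(sample)
print(metrics["Number output tuples"], metrics["Query execution run time"])
```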

To run the query above with multiple threads, pass the -t option. The following command uses 2 threads:

python3 execute_query.py "(a)->(b),(b)->(c),(c)->(d)" /absolute/path/data -t 2

The query above assigns arbitrary edge and vertex labels to (a), (b), (c), (a)->(b), and (b)->(c). Use it with unlabeled datasets only. When the dataset has labels, assign labels to each vertex and edge as follows:

python3 execute_query.py "(a:person)-[friendof]->(b:person), (b:person)-[likes]->(c:movie)" /absolute/path/data

Requiring More Memory

By default, the JVM heap is limited to a maximum of 2GB of memory. To change the maximum heap size, prepend JAVA_OPTS='-Xmx500G' when calling the python scripts:

JAVA_OPTS='-Xmx500G' python3 serialize_catalog.py /absolute/path/data
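If the scripts are driven from another Python program, the same setting can be passed through the environment. The following is a small sketch under that assumption; run_script and its dry_run flag are hypothetical helpers for this example, and only the JAVA_OPTS variable and the script name come from the command above.

```python
import os
import subprocess


def run_script(script, args, max_heap=None, dry_run=False):
    """Run one of the python scripts, optionally with a larger JVM heap."""
    env = dict(os.environ)
    if max_heap is not None:
        # Equivalent to prepending JAVA_OPTS='-Xmx<max_heap>' on the command line.
        env["JAVA_OPTS"] = f"-Xmx{max_heap}"
    command = ["python3", script, *args]
    if dry_run:
        # Return what would be executed without starting a process.
        return command, env.get("JAVA_OPTS")
    return subprocess.run(command, env=env, check=True)


command, java_opts = run_script(
    "serialize_catalog.py", ["/absolute/path/data"], max_heap="500G", dry_run=True
)
print(command, java_opts)
```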

Contact

Amine Mhedhbi, [email protected]

optimizing-subgraph-queries-combining-binary-and-worst-case-optimal-joins's Issues

No such file or directory

After building graphflow and converting a SNAP dataset to edges.csv, I ran the following command:
python serialize_dataset.py /absolute/dataset/edges.csv /absolute/data
but I got the following error:
No such file or directory: '/GRAPHFLOW_HOME/build/install/graphflow/bin/dataset-serializer': '/GRAPHFLOW_HOME/build/install/graphflow/bin/dataset-serialize
It seems that the build step did not generate all of graphflow.
I also cloned the graphflow repo with git. Building it generates '/GRAPHFLOW_HOME/build/install/graphflow/bin/, but not the scripts to load data.
Is something wrong with the build?

Looking forward to your reply!

About the size of the adjacency list

In the class SortedAdjList, there is a function named "size". However, I think it should use "-" in the return statement, because offset[i] is the number of entries whose type (label) is not bigger than i.
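The reasoning can be illustrated with a small Python sketch of a label-partitioned adjacency list. This is a hypothetical structure written for this example, not the repository's SortedAdjList; it assumes the convention described in the issue, where offsets[i] counts the entries whose label is at most i.

```python
class LabelPartitionedAdjList:
    """Neighbour IDs grouped by label, with cumulative per-label offsets."""

    def __init__(self, neighbours_by_label):
        self.neighbour_ids = []
        self.offsets = []
        for neighbours in neighbours_by_label:
            self.neighbour_ids.extend(sorted(neighbours))
            # offsets[i] = number of entries whose label is <= i
            self.offsets.append(len(self.neighbour_ids))

    def _start(self, label):
        return self.offsets[label - 1] if label > 0 else 0

    def size(self, label):
        """Partition size is a difference of two cumulative counts."""
        return self.offsets[label] - self._start(label)

    def neighbours(self, label):
        return self.neighbour_ids[self._start(label):self.offsets[label]]


adj = LabelPartitionedAdjList([[5, 2], [], [7, 1, 3]])
print([adj.size(label) for label in range(3)], adj.neighbours(2))
```

Under this convention, the size of one partition has to be computed by subtraction, since each offset alone already includes all smaller labels.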

Query results do not match

The number of output tuples is often 1 or 2 fewer than the actual number.

Reproduce:

edges.csv:

9,1,3
9,4,3
12,1,0
9,1,2
9,6,4
2,1,3
5,1,2
9,6,3
12,1,2
11,2,4

vertices.csv:

0,1
1,1
2,0
3,1
4,2
5,1
6,1
7,1
8,1
9,1
10,2
11,0
12,2

commands:

root@dc124d4957e7:~# rm -r /root/data/graphflow/
root@dc124d4957e7:~# mkdir /root/data/graphflow/
root@dc124d4957e7:~# python3 eva_graphflow_stream/scripts/serialize_dataset.py /root/data/edges.csv /root/data/graphflow/ -v /root/data/vertices.csv
[INFO ][2023-07-04 16:11:17.845] KeyStore: Serializing the types and labels key store.
[INFO ][2023-07-04 16:11:17.853] Graph: Serializing the data graph.
root@dc124d4957e7:~# JAVA_OPTS='-Xmx500G' python3 eva_graphflow_stream/scripts/serialize_catalog.py /root/data/graphflow/ -v 2
[INFO ][2023-07-04 16:12:07.315] Catalog: serializing the data graph's catalog.
root@dc124d4957e7:~# python3 eva_graphflow_stream/scripts/execute_query.py "(a:1)-[3]->(b:1)" /root/data/graphflow/
(a:1)-[3]->(b:1)
[INFO ][2023-07-04 16:12:18.357] OptimizerExecutor: Dataset loading run time: 115.204859 (ms)
[INFO ][2023-07-04 16:12:18.370] OptimizerExecutor: Optimizer run time: 10.196823 (ms)
[INFO ][2023-07-04 16:12:18.372] OptimizerExecutor: Plan initialization before exec run time: 10.196823 (ms)
[INFO ][2023-07-04 16:12:18.374] OptimizerExecutor: Query execution run time: 0.0371 (ms)
[INFO ][2023-07-04 16:12:18.374] OptimizerExecutor: Number output tuples: 2
[INFO ][2023-07-04 16:12:18.375] OptimizerExecutor: Number intermediate tuples: 0
[INFO ][2023-07-04 16:12:18.375] OptimizerExecutor: Plan: SCAN (a)->(b)

number of output tuples:
expected 3
actual 2.
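A brute-force count gives an independent reference when checking a reported tuple count. The sketch below assumes the README's file formats, (FROM,TO,LABEL) for edges and (ID,LABEL) for vertices, and reads the query (a:1)-[3]->(b:1) as "edges with label 3 whose endpoints both have label 1"; the helper name is hypothetical. Under that reading it counts 2 matches on the files above, (9,1) and (9,6), so the expected value of 3 may rest on a different interpretation of the files or of the query.

```python
import csv
import io

EDGES_CSV = """9,1,3
9,4,3
12,1,0
9,1,2
9,6,4
2,1,3
5,1,2
9,6,3
12,1,2
11,2,4
"""

VERTICES_CSV = """0,1
1,1
2,0
3,1
4,2
5,1
6,1
7,1
8,1
9,1
10,2
11,0
12,2
"""


def count_single_edge_matches(edges_csv, vertices_csv, src_label, edge_label, dst_label):
    """Count edges with the given label whose endpoints carry the given vertex labels."""
    vertex_label = {row[0]: row[1] for row in csv.reader(io.StringIO(vertices_csv)) if row}
    count = 0
    for row in csv.reader(io.StringIO(edges_csv)):
        if not row:
            continue
        src, dst, label = row
        if (label == edge_label
                and vertex_label.get(src) == src_label
                and vertex_label.get(dst) == dst_label):
            count += 1
    return count


print(count_single_edge_matches(EDGES_CSV, VERTICES_CSV, "1", "3", "1"))
```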

OutOfMemoryError:

Hello, thank you for the open-source code. I tried setting JAVA_OPTS='-Xmx50G', but it does not seem to work. Could you please help me fix it?
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at ca.waterloo.dsg.graphflow.query.QueryGraph.addQEdgeToQGraph(QueryGraph.java:107)
at ca.waterloo.dsg.graphflow.query.QueryGraph.addEdge(QueryGraph.java:95)
at ca.waterloo.dsg.graphflow.query.QueryGraph$$Lambda$63/0x00000008001b7c40.accept(Unknown Source)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
at ca.waterloo.dsg.graphflow.query.QueryGraph.addEdges(QueryGraph.java:53)
at ca.waterloo.dsg.graphflow.query.QueryGraph.copy(QueryGraph.java:234)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNoops(CatalogPlans.java:277)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNextOperators(CatalogPlans.java:163)
at ca.waterloo.dsg.graphflow.planner.catalog.CatalogPlans.setNextOperators(CatalogPlans.java:168)
at ca.waterloo.dsg.graphflow.planner.catalog.Catalog.populate(Catalog.java:233)
at ca.waterloo.dsg.graphflow.runner.dataset.CatalogSerializer.main(CatalogSerializer.java:62)
