alibaba / pilotscope

PilotScope is a middleware to bridge the gaps of deploying AI4DB (Artificial Intelligence for Databases) algorithms into actual database systems.

Shell 0.24% Python 74.86% CSS 0.01% HTML 0.12% JavaScript 0.75% AMPL 0.11% Jupyter Notebook 23.48% Dockerfile 0.45%

postgresql ai4db database middleware spark cardinality-estimation index-recommendation knob-tuning query-optimizer

pilotscope's Introduction

PilotScope

PilotScope is a middleware that bridges the gaps in deploying AI4DB (Artificial Intelligence for Databases) algorithms into actual database systems. It aims to hide the underlying details of different databases so that an AI4DB driver can steer any database in a unified manner. Applying PilotScope brings the following benefits:

DB users can try any AI4DB algorithm as a plug-in unit on their databases at little cost. Cloud computing service providers can operate and maintain AI4DB algorithms on their database products as a service to users. (More Convenient for Usage! 👏👏👏)
ML researchers can easily benchmark and iterate on their AI4DB algorithms in practical scenarios. (Much Faster to Iterate! ⬆️⬆️⬆️)
ML and DB developers are freed from learning the details of the other side. They can play to their own strengths and write code on their own side. (More Freedom to Develop! 🏄‍♀️🏄‍♀️🏄‍♀️)
All contributors can extend PilotScope to support more AI4DB algorithms, more databases, and more functions. (We highly encourage this! 😊😊😊)

News

🎉 [2023-12-15] Our paper on PilotScope has been accepted by VLDB 2024!

Code Structure

PilotScope/
├── algorithm_examples                         # Algorithm examples
├── fig                                        # Saved figures
├── paper                                 
│   ├── PilotScope.pdf                         # Paper of PilotScope
├── pilotscope
│   ├── Anchor                                 # Base push and pull anchors for implementing push and pull operators
│   │   ├── AnchorHandler.py
│   │   ├── AnchorEnum.py
│   │   ├── AnchorTransData.py
│   │   ├── ...
│   ├── Common                                 # Useful tools for PilotScope
│   │   ├── Index.py
│   │   ├── CardMetricCalc.py                   
│   │   ├── ...
│   ├── DBController                           # The implementation of DB controllers for different databases
│   │   ├── BaseDBController.py
│   │   ├── PostgreSQLController.py
│   │   ├── ...
│   ├── DBInteractor                           # Functionality for interacting with the database
│   │   ├── HttpInteractorReceiver.py
│   │   ├── PilotDataInteractor.py
│   │   ├── ...
│   ├── DataManager                            # The management of data
│   │   ├── DataManager.py
│   │   └── TableVisitedTracker.py
│   ├── Dataset                                # An easy-to-use API for loading benchmarks
│   │   ├── BaseDataset.py
│   │   ├── Imdb
│   │   ├── ...
│   ├── Exception                              # Exceptions that may occur during the lifecycle of PilotScope
│   │   └── Exception.py
│   ├── Factory                                # Factory patterns
│   │   ├── AnchorHandlerFactory.py
│   │   ├── DBControllerFectory.py
│   │   ├── ...
│   ├── PilotConfig.py                         # Configurations of PilotScope
│   ├── PilotEnum.py                           # Some related enumeration types
│   ├── PilotEvent.py                          # Some predefined events
│   ├── PilotModel.py                          # Base models of pilotscope 
│   ├── PilotScheduler.py                      # Scheduling training, inference, data collection, push-and-pull, and so on
│   ├── PilotSysConfig.py                      # System configuration of PilotScope 
│   └── PilotTransData.py                      # A unified data object for data collection
├── requirements.txt                           # Requirements for PilotScope
├── setup.py                                   # Setup for PilotScope
├── test_example_algorithms                    # Examples of some tasks, such as index recommendation, knob tuning, etc.
└── test_pilotscope                            # Unittests of PilotScope

Installation

Required Software Versions:

Python: 3.8
PostgreSQL: 13.1
Apache Spark: 3.3.2

You can install PilotScope Core and modified databases (e.g., PostgreSQL and Spark) following the documentation.

Feature Overview

The components of PilotScope Core on the ML side fall into two categories: Database Components and Deployment Components. The Database Components facilitate data exchange with, and control over, the database, while the Deployment Components automatically apply custom AI algorithms to each incoming SQL query.

A high-level overview of the PilotScope Core components is shown in the following figure.

The Database Components are highlighted in yellow, while the Deployment Components are highlighted in green. Each of these components is discussed in detail in the documentation.

An Example for Data Interaction with Database

The PilotConfig class configures the PilotScope application, for example the database credentials used to establish a connection. We first create a PilotConfig instance, specifying the database credentials and the name of the database to connect to, i.e., stats_tiny.

# Example of PilotConfig
# Import path follows the Code Structure listing above (pilotscope/PilotConfig.py).
from pilotscope.PilotConfig import PilotConfig, PostgreSQLConfig

config: PilotConfig = PostgreSQLConfig(host="localhost", port="5432", user="postgres", pwd="postgres")
# You can also instantiate a PilotConfig for other DBMSes, e.g.
# config: PilotConfig = SparkConfig()
# Configure PilotScope here, e.g., change the name of the database you want to connect to.
config.db = "stats_tiny"

The PilotDataInteractor class provides a flexible workflow for data exchange. It includes three main functions: push, pull, and execute. These functions assist the user in collecting data (pull operators) after setting additional data (push operators) in a single query execution process.

For instance, suppose the user wants to collect the execution time, estimated cost, and cardinality of all sub-queries within a query. Here is some example code:

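# Import path follows the Code Structure listing above (pilotscope/DBInteractor/PilotDataInteractor.py).
from pilotscope.DBInteractor.PilotDataInteractor import PilotDataInteractor

sql = "select count(*) from votes as v, badges as b, users as u where u.id = v.userid and v.userid = b.userid and u.downvotes>=0 and u.downvotes<=0"
data_interactor = PilotDataInteractor(config)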
data_interactor.pull_estimated_cost()
data_interactor.pull_subquery_card()
data_interactor.pull_execution_time()
data = data_interactor.execute(sql)
print(data)

The execute function returns a PilotTransData object named data, which serves as a placeholder for the collected data. Each member of this object represents a specific data point, and the values corresponding to the previously registered pull operators will be filled in, while the other values will remain as None.

execution_time: 0.00173
estimated_cost: 98.27
subquery_2_card: {'select count(*) from votes v': 3280.0, 'select count(*) from badges b': 798.0, 'select count(*) from users u where u.downvotes >= 0 and u.downvotes <= 0': 399.000006, 'select count(*) from votes v, badges b where v.userid = b.userid;': 368.609177, 'select count(*) from votes v, users u where v.userid = u.id and u.downvotes >= 0 and u.downvotes <= 0;': 333.655156, 'select count(*) from badges b, users u where b.userid = u.id and u.downvotes >= 0 and u.downvotes <= 0;': 425.102804, 'select count(*) from votes v, badges b, users u where v.userid = u.id and v.userid = b.userid and u.downvotes >= 0 and u.downvotes <= 0;': 37.536205}
buffercache: None
...
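
The placeholder behaviour described above can be illustrated with a minimal sketch. The class `TransDataSketch` and the helper `collected_fields` are hypothetical stand-ins written for this example; they are not PilotScope's actual `PilotTransData` implementation, and only a few of its members are mirrored here.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class TransDataSketch:
    """Hypothetical stand-in for PilotTransData (assumption, not the real class)."""
    execution_time: Optional[float] = None
    estimated_cost: Optional[float] = None
    subquery_2_card: Optional[Dict[str, float]] = None
    buffercache: Optional[dict] = None


def collected_fields(data: TransDataSketch) -> Dict[str, object]:
    """Return only the members that were filled in, skipping those left as None."""
    return {
        f.name: getattr(data, f.name)
        for f in fields(data)
        if getattr(data, f.name) is not None
    }


# Values copied from the sample output above; buffercache was never pulled,
# so it keeps its default of None.
data = TransDataSketch(
    execution_time=0.00173,
    estimated_cost=98.27,
    subquery_2_card={"select count(*) from votes v": 3280.0},
)
filled = collected_fields(data)
print(sorted(filled))
```

Filtering out the `None` members this way is one simple option for telling which pull operators were registered for a given execution.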

In some scenarios, the user wants to collect the execution time of a SQL query after applying new cardinalities (e.g., scaling the original cardinalities by 100) to all sub-queries within the SQL. The PilotDataInteractor provides a push function for this. Here is some example code:

# Example of PilotDataInteractor (registering operators again and execution)
data_interactor.push_card({k: v * 100 for k, v in data.subquery_2_card.items()})
data_interactor.pull_estimated_cost()
data_interactor.pull_execution_time()
new_data = data_interactor.execute(sql)
print(new_data)

By default, each call to the execute function resets any previously registered operators. Therefore, we need to push the new cardinalities and re-register the pull operators to collect the estimated cost and execution time. In this scenario, the new cardinalities replace the ones estimated by the database's cardinality estimator. As a result, parts of new_data (for example, the estimated cost) differ significantly from data, mainly because of the changed cardinality values.

execution_time: 0.00208
estimated_cost: 37709.05
...
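
The register, execute, and reset cycle described above can be sketched with a toy mock. Everything here is an assumption made for illustration: `ToyInteractor` is a hypothetical class, its "cost" is simply the sum of the cardinalities (a made-up cost model), and it never talks to a database. It only mirrors the behaviour the text describes: pushed values override the defaults, and `execute` clears all registered operators.

```python
from typing import Dict, Optional, Set


class ToyInteractor:
    """Hypothetical mock of the push/pull/execute workflow; not PilotScope code."""

    def __init__(self, default_cards: Dict[str, float]):
        # Stands in for the database's own cardinality estimates.
        self._default_cards = default_cards
        self._pushed_cards: Optional[Dict[str, float]] = None
        self._pulls: Set[str] = set()

    def push_card(self, cards: Dict[str, float]) -> None:
        self._pushed_cards = dict(cards)

    def pull_subquery_card(self) -> None:
        self._pulls.add("subquery_2_card")

    def pull_estimated_cost(self) -> None:
        self._pulls.add("estimated_cost")

    def execute(self, sql: str) -> Dict[str, object]:
        # Pushed cardinalities, if any, replace the default estimates.
        cards = self._pushed_cards if self._pushed_cards is not None else self._default_cards
        result: Dict[str, object] = {"subquery_2_card": None, "estimated_cost": None}
        if "subquery_2_card" in self._pulls:
            result["subquery_2_card"] = dict(cards)
        if "estimated_cost" in self._pulls:
            result["estimated_cost"] = sum(cards.values())  # made-up cost model
        # Every execute call resets the registered operators.
        self._pushed_cards = None
        self._pulls.clear()
        return result


sql = "select count(*) from votes as v, badges as b where v.userid = b.userid"
toy = ToyInteractor({
    "select count(*) from votes v": 3280.0,
    "select count(*) from badges b": 798.0,
})

toy.pull_subquery_card()
toy.pull_estimated_cost()
data = toy.execute(sql)

# Scale every cardinality by 100, then re-register the pull we still want.
toy.push_card({k: v * 100 for k, v in data["subquery_2_card"].items()})
toy.pull_estimated_cost()
new_data = toy.execute(sql)

# Nothing was registered for this call, so every member stays None.
after_reset = toy.execute(sql)
print(data["estimated_cost"], new_data["estimated_cost"], after_reset)
```

In this sketch `new_data["subquery_2_card"]` is `None` because that pull operator was not registered again before the second call.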

For more functionality, please refer to the documentation.

Documentation

The classes and methods of PilotScope are documented; see the project documentation for details.

License

PilotScope is released under Apache License 2.0.

Reference

If you find our work useful for your research or development, please cite the following:

@article{zhu2023pilotscope,
	title={PilotScope: Steering Databases with Machine Learning Drivers},
	author={Rong Zhu and Lianggui Weng and Wenqing Wei and Di Wu and Jiazhen Peng and Yifan Wang and Bolin Ding and Defu Lian and Bolong Zheng and Jingren Zhou},
	journal = {Proceedings of the VLDB Endowment},
	year={2024}}

Contributing

PilotScope is an open-source project, and we greatly appreciate any contribution!


pilotscope's Issues

Installing extensions

Is it feasible to install extensions (e.g., pgvector) on the modified PostgreSQL database?

Questions of running test_lero_example.py

After changing self.config in the setup() function, I ran /test_example_algorithms/test_lero_example.py directly and got some errors.
First, an "Invalid device id" error is raised:

......
RelType :  {'Seq Scan', 'Bitmap Index Scan', 'Hash', 'Sort', 'Nested Loop', 'Materialize', 'Hash Join', 'Bitmap Heap Scan', 'Index Only Scan', 'Aggregate', 'Index Scan', 'Merge Join'}
Training data set size = 7055
input_feature_dim: 25
Invalid device id
Exception in thread pretraining_async_start:
......
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

After that, there is another error:

start to test sql
current is the 0-th sql, and it is select count(*) from badges as b, users as u where b.userid= u.id and u.upvotes>=0
Terminate pilotscope either at the end of the program or in case of an exception.
E
======================================================================
ERROR: test_lero (__main__.LeroTest)

......

AttributeError: 'NoneType' object has no attribute 'transform'

I wonder whether there is a reference output to compare against, and whether there are more configs I need to change, since I installed PostgreSQL in a different directory and use a user other than pilotscope.

Questions about Applying AI Models Across DBMSs

Hi, this project is quite interesting to me. Since this project provides a good abstraction layer between AI models and DBMSs, I am wondering whether it can train the same AI models across DBMSs, instead of training and predicting on a specific DBMS. For example, the MSCN algorithm is trained based on PostgreSQL in the example code. Can we use the dataset from multiple DBMSs to train the same MSCN model and directly apply it to different DBMSs?

Looking forward to your reply!

Questions of deploying Bao in pilotscope

PostgreSQL 13.1 changed the standard_planner() function, whereas Bao (Bao: Making Learned Query Optimization Practical) is deployed on PostgreSQL 12, so errors occur when trying to set up Bao on PostgreSQL 13.1.

When using command:

make USE_PGXS=1 install

Error:

In file included from main.c:10:
bao_planner.h:226:12: error: too few arguments to function ‘standard_planner’
  226 |     plan = standard_planner(query_copy, cursorOptions, boundParams);
      |            ^~~~~~~~~~~~~~~~

It looks like this issue is caused by the PostgreSQL version.
I wonder how to integrate Bao into PilotScope in the same way as Lero.

Error: 'PostgresDatabaseConnector' object has no attribute 'data_interactor'

Hello, I am following the README in algorithm_examples/Index/index_selection_evaluation to install. When I run the step python3 -m selection benchmark_results/tpch_wo_2_17_20/config.json, I get the error 'PostgresDatabaseConnector' object has no attribute 'data_interactor'. As far as I can see, PostgresDatabaseConnector indeed has no data_interactor attribute. What might be the cause? Also, the index_selection_evaluation directory seems to be missing the hypopg and tpch-kit folders; I downloaded them myself from elsewhere. Could that be related?

xyc@xyc-virtual-machine:~/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation$ python3 -m selection benchmark_results/tpch_wo_2_17_20/config.json
2024-05-17 22:50:39,029 - root - DEBUG - Init IndexSelection
2024-05-17 22:50:39,029 - root - INFO - Starting Index Selection Evaluation
2024-05-17 22:50:39,029 - root - INFO - Using config file benchmark_results/tpch_wo_2_17_20/config.json
2024-05-17 22:50:39,029 - root - DEBUG - Database connector created: None
2024-05-17 22:50:39,032 - root - INFO - Postgres: Set random seed `SELECT setseed(0.17)`
2024-05-17 22:50:39,033 - root - DEBUG - Postgres connector created: None
2024-05-17 22:50:39,033 - root - DEBUG - Database with given scale factor already existing
2024-05-17 22:50:39,034 - root - DEBUG - Database connector created: indexselection_tpch___1
2024-05-17 22:50:39,036 - root - INFO - Postgres: Set random seed `SELECT setseed(0.17)`
2024-05-17 22:50:39,037 - root - DEBUG - Postgres connector created: indexselection_tpch___1
2024-05-17 22:50:39,037 - root - INFO - Generating TPC-H Queries
2024-05-17 22:50:39,037 - root - DEBUG - No need to run make
2024-05-17 22:50:39,043 - root - INFO - Queries generated
2024-05-17 22:50:39,044 - root - INFO - Dropping indexes
2024-05-17 22:50:39,047 - root - INFO - Postgres: Run `analyze`
2024-05-17 22:50:39,151 - root - INFO - Dropping indexes
2024-05-17 22:50:39,152 - root - INFO - Create new database connector (closing old)
2024-05-17 22:50:39,152 - root - DEBUG - Database connector closed: indexselection_tpch___1
2024-05-17 22:50:39,153 - root - DEBUG - Database connector created: indexselection_tpch___1
2024-05-17 22:50:39,155 - root - INFO - Postgres: Set random seed `SELECT setseed(0.17)`
2024-05-17 22:50:39,156 - root - DEBUG - Postgres connector created: indexselection_tpch___1
2024-05-17 22:50:39,156 - root - DEBUG - Init selection algorithm
2024-05-17 22:50:39,156 - root - INFO - Dropping indexes
2024-05-17 22:50:39,159 - root - DEBUG - Init cost evaluation
2024-05-17 22:50:39,159 - root - INFO - Cost estimation with actual_runtimes
2024-05-17 22:50:39,159 - root - DEBUG - Init WhatIfIndexCreation
2024-05-17 22:50:39,159 - root - INFO - Running algorithm {'name': 'extend', 'parameters': {'budget_MB': 250, 'max_index_width': 2, 'benchmark_name': 'tpch', 'min_cost_improvement': 1.003}, 'timeout': 300}
2024-05-17 22:50:39,159 - root - INFO - Calculating best indexes Extend
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/__main__.py", line 4, in <module>
    index_selection.run()  # pragma: no cover
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/index_selection_evaluation.py", line 59, in run
    self._run_algorithms(config_file)
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/index_selection_evaluation.py", line 155, in _run_algorithms
    indexes, what_if, cost_requests, cache_hits = self._run_algorithm(
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/index_selection_evaluation.py", line 206, in _run_algorithm
    indexes = algorithm.calculate_best_indexes(self.workload)
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/selection_algorithm.py", line 36, in calculate_best_indexes
    indexes = self._calculate_best_indexes(workload)
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/algorithms/extend_algorithm.py", line 49, in _calculate_best_indexes
    current_cost = self.cost_evaluation.calculate_cost(
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/cost_evaluation.py", line 77, in calculate_cost
    pilot_total_cost = self.pilot_calculate_cost(workload, indexes)
  File "/home/xyc/桌面/pilotscope/algorithm_examples/Index/index_selection_evaluation/selection/cost_evaluation.py", line 104, in pilot_calculate_cost
    data_interactor = self.db_connector.data_interactor
AttributeError: 'PostgresDatabaseConnector' object has no attribute 'data_interactor'

Usage tutorial

Hello, is there a tutorial on how to use PilotScope?
