hero's Introduction

Data Science Lifecycle Base Repo

Use this repo as a template repository for data science projects using the Data Science Lifecycle Process. This repo is meant to serve as a launching-off point. Our goal is to introduce only minimum viable opinions into the structure of this repo in order to make this repository/framework useful across a variety of data science projects and workflows. We will therefore err on the side of omitting something if we're not confident that it's widely useful, or if we think it's overly opinionated. That shouldn't stop you from forking this repo and adapting it to fit the needs of your project/team/organization.

With that in mind, if there is something that you think we're missing or should change, open an issue and we'll talk!

Get started

The only manual step required is creating the labels. The label names, descriptions, and color codes can be found in the .github/labels.yaml file. For more information on creating labels, review the GitHub docs.
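If you'd rather script this step, here is a minimal sketch that bulk-creates the labels from the YAML file. It assumes PyGithub and PyYAML are installed, that labels.yaml is a YAML list of entries with name, description, and color fields, and that the repo slug is a placeholder for your own:

```python
# Sketch: bulk-create labels from .github/labels.yaml.
# Assumes the file is a YAML list of {name, description, color} entries.
import os

import yaml                 # pip install PyYAML
from github import Github  # pip install PyGithub

# "your-org/your-repo" is a placeholder; point it at the repo created from this template.
repo = Github(os.environ["GITHUB_TOKEN"]).get_repo("your-org/your-repo")

with open(".github/labels.yaml") as f:
    labels = yaml.safe_load(f)

for label in labels:
    repo.create_label(
        name=label["name"],
        color=label["color"].lstrip("#"),  # the API expects the hex code without '#'
        description=label.get("description", ""),
    )
```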

Contributing

Issues and suggestions for this template repo should be opened in the main dslp repo.

Default Directory Structure

├── .cloud              # for storing cloud configuration files and templates (e.g. ARM, Terraform, etc)
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── Ask.md
│   │   ├── Data.Aquisition.md
│   │   ├── Data.Create.md
│   │   ├── Experiment.md
│   │   ├── Explore.md
│   │   └── Model.md
│   ├── labels.yaml
│   └── workflows
├── .gitignore
├── README.md
├── code
│   ├── datasets        # code for creating or getting datasets
│   ├── deployment      # code for deploying models
│   ├── features        # code for creating features
│   └── models          # code for building and training models
├── data                # directory is for consistent data placement. contents are gitignored by default.
│   ├── README.md
│   ├── interim         # storing intermediate results (mostly for debugging)
│   ├── processed       # storing transformed data used for reporting, modeling, etc
│   └── raw             # storing raw data to use as inputs to rest of pipeline
├── docs
│   ├── code            # documenting everything in the code directory (could be sphinx project for example)
│   ├── data            # documenting datasets, data profiles, behaviors, column definitions, etc
│   ├── media           # storing images, videos, etc, needed for docs.
│   ├── references      # for collecting and documenting external resources relevant to the project
│   └── solution_architecture.md    # describe and diagram solution design and architecture
├── environments
├── notebooks
├── pipelines           # for pipeline orchestrators, e.g. AzureML Pipelines, Airflow, Luigi, etc.
├── setup.py            # if using python, for finding all the packages inside of code.
└── tests               # for testing your code, data, and outputs
    ├── data_validation
    └── unit


hero's Issues

add costs

add cost values to the plan representation

Create sequential-all Dataset

Dataset Description

Collect the results of sequential calls of the EXPLAIN (FORMAT JSON) and EXPLAIN (ANALYZE, FORMAT JSON) commands for all queries from 3 common benchmarks (JOB, TPCH, sample_queries) under different environment settings (all combinations of 7 hints and 3 parallel modes). Execution of duplicated plans can be skipped in order to reduce the collection time.
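A minimal collection sketch, assuming psycopg2 and a PostgreSQL-style server; the hint names, the use of SET for both the hints and the parallel mode (dop), and the deduplication key are illustrative assumptions rather than the project's actual setup:

```python
import hashlib
import itertools
import json

import psycopg2  # pip install psycopg2-binary

HINTS = ["enable_hashjoin", "enable_mergejoin", "enable_nestloop",
         "enable_seqscan", "enable_indexscan", "enable_indexonlyscan",
         "enable_bitmapscan"]                       # assumed set of 7 boolean hints
DOPS = [0, 1, 4]                                    # assumed 3 parallel modes

def hintsets():
    """All 2^7 on/off combinations of the hints."""
    return itertools.product([True, False], repeat=len(HINTS))

def collect(dsn, queries):
    conn = psycopg2.connect(dsn)
    cur, seen, rows = conn.cursor(), set(), []
    for sql, flags, dop in itertools.product(queries, hintsets(), DOPS):
        for hint, on in zip(HINTS, flags):
            cur.execute(f"SET {hint} = {'on' if on else 'off'}")
        cur.execute(f"SET max_parallel_workers_per_gather = {dop}")
        cur.execute("EXPLAIN (FORMAT JSON) " + sql)
        plan = cur.fetchone()[0]                    # psycopg2 parses the json column
        key = hashlib.md5(json.dumps(plan, sort_keys=True).encode()).hexdigest()
        if key in seen:                             # duplicated plan: skip execution
            continue
        seen.add(key)
        cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql)
        rows.append({"sql": sql, "flags": flags, "dop": dop,
                     "plan": plan, "analyzed": cur.fetchone()[0]})
    return rows
```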

add experiment artifacts

add archives with experiment artifacts:

  • model weights,
  • loss curves, and
  • processed stratified metrics

explore the prediction modes

tl;dr: compare different hint prediction modes: a) by template, b) by logical plan, and c) by full plan (with estimates). A sketch of the three key granularities follows the goals below.

Goals

Find answers to the following questions:

  • "Is it possible to make robust template-based hint prediction?"
  • "Is the logical plan enough to make robust hint prediction?"
  • "What is the worst case for these types of predictions?"

Emulate online scenario

Main Questions

  1. For various load configurations, answer the following questions:
    • When can the application of a NN in an online scenario be beneficial?
    • What resources will be needed for this?
    • How much more effective is the hero approach?
  2. Consider a) planning time, b) training time (hero), and c) regression from predictions in an online scenario.
  3. Investigate the dependency of the achieved performance gain and required resources on the search space (only hintset / only dop / hintset and dop).

Scenarios of Interest

A scenario in emulation is determined by two components: the available data for model training and the workload.

  1. Data = all default plans, workload = all queries.
    Goal: to test the ability to generalise knowledge based on the history of standard plans without changing the workload.
  2. Data = results of the execution of plans previously selected by the NN, workload = all queries; the loop of training models, executing the workload, and collecting data (sketched after this list) is repeated until convergence to the optimum.
    Goal: to measure the resources needed to achieve a beneficial outcome using the classical approach.
  3. Data = plans of all fast queries, workload = long queries (and vice versa).
    Goal: to test the ability to generalise to a workload with changes in the distribution of query execution times.
  4. Data = plans of part of the queries with the structure of the standard tree X, workload = remaining queries with the same structure X.
    Goal: to test the ability to generalise knowledge from a partial history to a workload with changes only in the statistics of standard plans.
  5. Data = plans of part of the queries with the standard tree X, workload = remaining queries with the same standard tree X.
    Goal: to test the ability to generalise knowledge from a partial history to a workload without changes in standard plans.
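Scenario 2's loop, for example, could look like the following sketch; every function name here (execute, train, choose_hintset) is a hypothetical placeholder:

```python
def emulate(workload, execute, train, choose_hintset, tol=0.01, max_rounds=20):
    """Retrain on executed plans, rerun the workload with the model's choices,
    and repeat until the total workload time stops improving."""
    history, prev_total, model = [], float("inf"), None
    for _ in range(max_rounds):
        total = 0.0
        for query in workload:
            hintset = choose_hintset(model, query)  # default plan while model is None
            record = execute(query, hintset)        # returns the plan and measured time
            history.append(record)
            total += record["time"]
        if prev_total - total < tol * prev_total:   # converged: gains below tolerance
            break
        prev_total = total
        model = train(history)                      # retrain the NN on executed plans
    return model, history
```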

hintset exploration strategy

Task

Investigate possible learning algorithms (i.e., exploration strategies for finding good transitions).

Context

  1. We have realized that the robustness problem is quite acute even on commonly used benchmarks.
  2. The natural way to deal with it is to a) switch to offline learning and b) check the similarity of the custom plan and its estimated cardinalities against the experience accumulated in history.
  3. To guarantee the safety of any prediction of the model M, the transitions obtained with it must have already been explored. In that case we don't need any prediction model at all, because we can just take the times from history itself!
  4. This means that offline learning reduces to applying a smart strategy for filling history with the most useful transitions, i.e., we must explore queries and hintsets in such a way that transitions with the highest speedup are found as quickly as possible (see the sketch after this list).
  5. So hintsets become just a way to get the desired transition, and inference becomes a search, against the default plan, for hintsets that could potentially lead to a good, already-confirmed transition. We will see later why this is an extremely important property of the model.
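A minimal sketch of that strategy, with the priority heuristic and data layout as illustrative assumptions:

```python
import heapq

def explore(candidates, execute, budget):
    """candidates: (score, query, hintset) triples ranked by a cheap proxy for the
    expected speedup. Greedily spend the time budget on the most promising ones."""
    history = {}                                    # (default_plan_key, hintset) -> time
    heap = [(-score, query, hintset) for score, query, hintset in candidates]
    heapq.heapify(heap)
    while heap and budget > 0:
        _, query, hintset = heapq.heappop(heap)
        plan_key, elapsed = execute(query, hintset)
        history[(plan_key, hintset)] = elapsed
        budget -= elapsed
    return history

def infer(default_plan_key, history):
    """Safe inference: only suggest a hintset whose transition is already confirmed."""
    explored = {hs: t for (key, hs), t in history.items() if key == default_plan_key}
    return min(explored, key=explored.get) if explored else None
```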

Cost model vs NN

Compare a learned NN with the cost model on the plan ranking problem (in generalisation mode!): "What is the probability that 2 random plans will be ordered correctly via cost (NN prediction) comparison?"
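This is pairwise ranking accuracy, which can be estimated on held-out plans as in the sketch below (the names are illustrative):

```python
from itertools import combinations

def pairwise_accuracy(scores, true_times):
    """Fraction of plan pairs whose score ordering matches their true-time ordering."""
    correct = total = 0
    for i, j in combinations(range(len(scores)), 2):
        if true_times[i] == true_times[j]:
            continue                                # ties carry no ordering information
        total += 1
        correct += (scores[i] < scores[j]) == (true_times[i] < true_times[j])
    return correct / total if total else float("nan")

# Usage (generalisation mode: plans from queries unseen during training):
# pairwise_accuracy(cost_model_scores, times) vs pairwise_accuracy(nn_scores, times)
```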

check TCNN abilities

to do:

  • reimplement TCNN
  • check its ability to avoid regressions
  • try a neighbor prioritization approach during local search using TCNN prediction sorting
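For the last item, a sketch of how prediction-sorted neighbor prioritization could drive the local search; model.predict, plan_for, and neighbours are hypothetical stand-ins for the reimplemented TCNN and the plan/hintset plumbing:

```python
def local_search(query, start_hintset, model, plan_for, neighbours, max_steps=50):
    """Greedy local search over hintsets, trying neighbours in order of the
    TCNN's predicted latency and stopping when none looks better."""
    current = start_hintset
    best_pred = model.predict(plan_for(query, current))
    for _ in range(max_steps):
        scored = sorted(neighbours(current),
                        key=lambda hs: model.predict(plan_for(query, hs)))
        if not scored:
            break
        candidate = scored[0]
        cand_pred = model.predict(plan_for(query, candidate))
        if cand_pred >= best_pred:                  # no predicted improvement: stop
            break
        current, best_pred = candidate, cand_pred
    return current
```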

improve project structure

  • check configuration files in all project branches (.pylintrc, requirements.txt, etc.).
  • remove all unnecessary packages from the requirements.txt file
    ...
