
ccdt's Introduction

2023_KCC Model

This repository contains the code for the Korea Computer Congress 2023 (KCC 2023) paper

'A Study about Search Space of Knob Range Reduction for Database Tuning'.

This study proposes a search-space reduction method that narrows the value ranges of database parameters (knobs) so that an optimization algorithm can tune database performance more effectively.


- DBMS: MySQL 5.7

- Number of knobs (parameters): 139

- Number of configurations: 200

- Workloads: TPC-C, Twitter


Firstly, we randomly generate 200 samples via Latin Hypercube Sampling (LHS).
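A minimal sketch of this sampling step, assuming SciPy's `scipy.stats.qmc` module and a few illustrative knob bounds (the actual study samples all 139 MySQL 5.7 knobs):

```python
import numpy as np
from scipy.stats import qmc

# Illustrative knob bounds; the study samples 139 MySQL 5.7 knobs.
knob_bounds = {
    "innodb_buffer_pool_size": (128 * 1024**2, 8 * 1024**3),  # bytes
    "innodb_thread_concurrency": (0, 64),
    "max_connections": (100, 5000),
}

sampler = qmc.LatinHypercube(d=len(knob_bounds), seed=0)
unit_samples = sampler.random(n=200)                  # 200 points in the unit hypercube

lows = np.array([lo for lo, _ in knob_bounds.values()])
highs = np.array([hi for _, hi in knob_bounds.values()])
configs = qmc.scale(unit_samples, lows, highs)        # map each point onto the knob ranges

print(configs.shape)                                  # (200, 3) candidate configurations
```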

Secondly, we select 10 knobs that have a significant impact on database performance by a knob ranking algorithm.
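The issues below mention that the Top-K knobs are obtained with the SHAP algorithm; a minimal sketch of such a ranking, assuming a tree-based surrogate model and the `shap` library (the configurations and scores below are random placeholders, not the benchmark measurements):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
knob_names = [f"knob_{i}" for i in range(139)]   # placeholder names for the 139 MySQL knobs
configs = rng.random((200, len(knob_names)))     # placeholder for the LHS samples
scores = rng.random(200)                         # placeholder for measured performance scores

# Fit a tree-based surrogate of performance as a function of the knob values.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(configs, scores)

# Mean |SHAP value| per knob serves as its importance for ranking.
shap_values = shap.TreeExplainer(model).shap_values(configs)
importance = np.abs(shap_values).mean(axis=0)

top10_knobs = [knob_names[i] for i in np.argsort(importance)[::-1][:10]]
```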

Thirdly, we select 10 configurations from the generated samples based on their measured database performance, computing a score (throughput / latency) to compare configurations.
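A sketch of this selection step, assuming the measurements are collected in a pandas DataFrame (the tps and latency values below are random placeholders for the TPC-C / Twitter benchmark results):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder measurements; in the repository these come from the benchmark runs.
metrics = pd.DataFrame({
    "config_id": range(200),
    "tps": rng.uniform(500, 3000, 200),
    "latency": rng.uniform(10, 200, 200),
})

# Higher throughput and lower latency both raise the score.
metrics["score"] = metrics["tps"] / metrics["latency"]

top10 = metrics.sort_values("score", ascending=False).head(10)
top10_config_ids = top10["config_id"].tolist()
```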

Then, we find the range of values that each selected knob actually takes across the selected configurations.
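Continuing the sketches above, the reduced range of each knob can be taken as the span of values used by the selected configurations (variable names follow the previous sketches and are illustrative):

```python
import pandas as pd

# configs / knob_names / top10_knobs / top10_config_ids come from the sketches above.
configs_df = pd.DataFrame(configs, columns=knob_names)

# Restrict to the best-performing configurations and the highest-ranked knobs.
selected = configs_df.loc[top10_config_ids, top10_knobs]

# Reduced range = the span of values those configurations actually used for each knob.
reduced_ranges = {knob: (selected[knob].min(), selected[knob].max())
                  for knob in top10_knobs}
```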

With these newly defined knob ranges, the optimization algorithm can search knob values within a narrower range than its default range.
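The issues below refer to SMAC (SMAC_with_our_data_twitter.ipynb); one way such an optimizer could consume the reduced ranges is through a ConfigSpace search space. A sketch using the classic ConfigSpace API, treating every knob as a continuous hyperparameter for simplicity:

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

cs = ConfigurationSpace(seed=0)
for knob, (low, high) in reduced_ranges.items():
    # Each knob is searched over the narrowed [low, high] instead of its full default range.
    cs.add_hyperparameter(UniformFloatHyperparameter(knob, lower=float(low), upper=float(high)))
```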

Paper

Below is the link to the paper 'A Study about Search Space of Knob Range Reduction for Database Tuning':
Paper link

ccdt's People

Contributors

kwon-sein · addb-swstarlab


Forkers

kwon-sein · hyojoys

ccdt's Issues

About Model Selection

Why did you choose an ensemble model even though there are various supervised learning models other than ensemble architectures?

About XGBRegressor in SMAC_with_our_data_twitter.ipynb

Hello,

In the current code, you use fixed hyperparameter values for XGBRegressor.
Hyperparameters are key factors that determine a model's performance, and finding values optimized for the specific problem is important.
Additionally, I'm wondering whether you've applied hyperparameter tuning techniques such as Grid Search, Random Search, or Bayesian Optimization.
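For example, something along the lines of the following sketch (the parameter grid and data are illustrative placeholders, not the repository's code):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train = rng.random((200, 139))     # placeholder knob configurations
y_train = rng.random(200)            # placeholder performance scores

# Illustrative grid over a few common XGBRegressor hyperparameters.
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
```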

Thanks for reading.

About prediction model for feature selection

Looking at your code, I think you used a random forest model to select the features that are important for predicting database performance, because it provides the built-in feature_importances_ attribute.

However, the performance of the random forest prediction model seems rather poor, so I suspect the reliability of the important features it extracts is low. Have you tried a model that can use attention scores instead of the random forest or AdaBoost models, or any other kind of model?
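For reference, a minimal sketch of the impurity-based ranking I am referring to (X and y are random placeholders for the knob configurations and measured performance):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 139)),
                 columns=[f"knob_{i}" for i in range(139)])   # placeholder knob values
y = rng.random(200)                                           # placeholder performance

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ is the impurity-based importance mentioned above.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```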

Top-K knob

The code says that you get the Top-K knobs through the SHAP algorithm, but could LIME, PDP, or Permutation Feature Importance be used instead of SHAP?
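For instance, a rough sketch of how scikit-learn's permutation_importance could stand in for the SHAP-based ranking (the model and data are placeholders, not the repository's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((200, 139))     # placeholder knob configurations
y = rng.random(200)            # placeholder measured performance

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: performance drop when each knob's column is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
top10_idx = np.argsort(result.importances_mean)[::-1][:10]
print(top10_idx)
```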

About dataset

Hello, thanks for your work.

In Jupyter_model.ipynb, why didn't you remove outliers?

[Screenshot: 2023-03-10, 3:46 PM]

I think the marked point is an outlier.

About tps, latency

Hello,

I have two questions.
What do tps and latency mean in testing_tpcc.ipynb?

And why is the score defined as tps/latency?

Thanks!

About Code

Hello, I have some questions about the code.

In the "testing_tpcc_ipynb" file

Cell 6 and 7 is the same code?

In cell 6, you append best config in variable "best_config".
But you declare "best_config" as empty list in cell 7 and didn't use the "best_config" in cell 7.
So I wonder why you declare the "bset_config".

And cell 5 and 8, the result of metrics.sort_value('score', ascending=False) is different.
Can you tell me why?

Thank you.

About selection of clustering algorithm

When I looked at the code, I saw that the clustering technique used is k-means. However, as far as I know, there are many other clustering algorithms besides k-means, especially for configuration data in tabular form. Is there any reason why you chose k-means over other techniques?
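For example, a minimal sketch of swapping another clustering algorithm in for k-means on scaled tabular configuration data (the data here are placeholders, not the repository's code):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
configs = rng.random((200, 10))                      # placeholder tabular configuration data

X_scaled = StandardScaler().fit_transform(configs)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_scaled)
```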

About Scaler in new_idea.ipynb

Hello,

The current code uses StandardScaler and MinMaxScaler to scale the data, which adjusts the distribution of the data to help the learning algorithm perform better.

However, depending on the characteristics of the data and the requirements of the model, different scaling methods may be more appropriate.

For example, log transformation can be used to adjust the distribution of the data, or RobustScaler can be used to apply scaling that is less sensitive to outliers.
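For instance, a minimal sketch comparing those options on placeholder data (not the repository's code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 2))   # placeholder, right-skewed data

X_standard = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)       # rescale to [0, 1]
X_robust = RobustScaler().fit_transform(X)       # median / IQR, less sensitive to outliers
X_log = np.log1p(X)                              # log transform to compress a skewed distribution
```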

Have you experimented with these different scaling methods? If so, what were the results?

Thanks for reading.

About Classifier Hyperparameters

Hello,

What criteria did you use to set the hyperparameters for the XGBClassifier in CCDT/new_model/main_classification.py?

I also wonder how the results differ when the hyperparameter values are adjusted.

Thanks for reading.

About Kmeans label

You set four labels with the colors 'navy', 'tomato', 'green', and 'orange', but in practice you only use three labels.

Why did you do that?

Thank you for reading.

About the reason for the score

Hi,
First of all, thanks for the great research.

My question is how the score for each config is calculated in order to pick the best-performing one. I noticed that the ranges (scales) of the TPS and latency values are very different. If I calculate the ratio of these two attributes without any normalization or scaling, it seems that the relatively large scale of TPS will dominate the score. Is there any particular reason for calculating the score without scaling?
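To make the concern concrete, here is a hedged sketch of scaling both metrics before combining them (placeholder data, not the repository's method):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
metrics = pd.DataFrame({"tps": rng.uniform(500, 3000, 200),      # placeholder measurements
                        "latency": rng.uniform(10, 200, 200)})

# Put both metrics on [0, 1] so that neither dominates the combined score by scale alone.
scaled = MinMaxScaler().fit_transform(metrics[["tps", "latency"]])
metrics["tps_norm"] = scaled[:, 0]
metrics["latency_norm"] = scaled[:, 1]

eps = 1e-9
metrics["score_scaled"] = metrics["tps_norm"] / (metrics["latency_norm"] + eps)
```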

Pick Top-10 configurations

I would like to ask about the content of the paper (A Study about Search Space of Knob Range Reduction for Database Tuning).

Why do you pick the Top-10 configurations in Figure 1? Is it related to knob range reduction?
