
Trace Explorer

This repository contains the source code of Trace Explorer, a toolset to explain and visualize database workload traces and benchmark data points.

Installation

# Download the latest trace-explorer.zip from nightly.link
wget -O trace-explorer.zip https://nightly.link/lnsp/trace-explorer/workflows/lint/main/trace-explorer.zip
# Unzip the archive
unzip trace-explorer.zip
# Install onto your machine
pip install trace_explorer-1.1-py3-none-any.whl
# Remove zip folder and wheel
rm trace-explorer.zip trace_explorer-1.1-py3-none-any.whl

Congrats, you can now use the trace_explorer command.

You can also go to the latest Build & Lint workflow and download the latest trace-explorer.zip under the Artifacts section manually.

Web interface

Trace Explorer comes with an integrated web interface. To start it on your local machine, enter

trace_explorer web

and a web server will run on port 5000.

Note: The web UI may not expose all features of the command-line interface.

Command-line interface

The first step in exploring your measurements is data preparation. Trace Explorer assists you here in several ways, for example by automatically exploring different strategies to maximize dataset variance.

flowchart TD
A[Original dataset] --> |User-specified transformer handling JSON, XML, ...| B[Common parquet format]
B --> |Specify new columns via SQL| B{Parquet file}
B --> |Join with other dataset| B
B --> |Filter outliers or other conditions| B
B --> |Visualize samples| C[Plot as PDF]
B --> |Compare with other dataset| C

Turning the raw data into a parquet table

If you want to follow along and test the commands but do not have suitable data at hand, you can generate sample data using the generator.py script in the sample-data folder. Jump down to the section on Generating sample data and follow the examples there, or return to this step later on.

First, you have to convert your data into a common format. We use parquet for storing datasets because of its widespread compatibility and integrated compression. Each measurement is converted into one row of the common dataset format. We provide the Transformer interface so that users can plug in their own format converters; you can find examples of custom transformers in the transformers/ directory.

trace_explorer convert --using myconverter.py --source 'mydataset/*.merged' --output mydatasetcommon.parquet

The provided transformer has to export a single class called Transformer implementing the trace_explorer.transformer.Transformer abstract class.

import json
import trace_explorer.transformer

class Transformer(trace_explorer.transformer.Transformer):
    def columns(self):
        # Names of the columns in the common dataset format.
        return ['scan', 'join', 'filter']

    def transform(self, content: str):
        # Parse one raw measurement and return its row values,
        # in the same order as columns().
        obj = json.loads(content)
        return [obj['scan'], obj['join'], obj['filter']]
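
To illustrate what the conversion step does, here is a minimal hand-rolled sketch of applying such a transformer to raw files and writing the common parquet table. The actual trace_explorer convert internals may differ; the myconverter import and the pandas usage are assumptions.

import glob
import pandas as pd

# Hypothetical import of the Transformer defined in myconverter.py above.
from myconverter import Transformer

transformer = Transformer()
rows = []
for path in glob.glob('mydataset/*.merged'):
    with open(path) as f:
        # One raw measurement file becomes one row of the common format.
        rows.append(transformer.transform(f.read()))

df = pd.DataFrame(rows, columns=transformer.columns())
df.to_parquet('mydatasetcommon.parquet')  # requires pyarrow or fastparquet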

Find a good preprocessing pipeline

To maximize your chances of deriving conclusions from the data, a good preprocessing pipeline is essential. We provide a set of common preprocessing primitives and allow for automatic tuning by optimizing for global variance.

To speed up processing, we use DuckDB and Parquet for storing intermediate data.

# Clean up the dataset by dropping rows where any column has an absolute z-score above 5
trace_explorer clean --zscore 5 --source mydataset.parquet --output mydataset_cleaned.parquet

# Add a new generated column from existing data
trace_explorer generate --source mydataset.parquet --query 'select log(1 + execTime) as execTimeLog from dataset'

# Only keep read-only queries
trace_explorer generate --source mydataset.parquet --no_copy --query 'select * from dataset where writtenBytes = 0'

# Print out useful dataset stats
trace_explorer stats --source mydataset.parquet
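
For reference, the z-score cleanup above corresponds roughly to the following DuckDB query. This is a hand-written sketch, not the tool's actual SQL; execTime and scanTime are placeholder column names.

import duckdb

# Keep only rows whose placeholder columns lie within 5 standard
# deviations of the mean; `trace_explorer clean --zscore 5` applies
# this idea across all columns automatically.
duckdb.sql("""
    COPY (
        SELECT *
        FROM 'mydataset.parquet'
        QUALIFY abs(execTime - avg(execTime) OVER ()) <= 5 * stddev(execTime) OVER ()
           AND abs(scanTime - avg(scanTime) OVER ()) <= 5 * stddev(scanTime) OVER ()
    ) TO 'mydataset_cleaned.parquet' (FORMAT PARQUET)
""")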

Visualize your dataset

To make the most sense of your trace, you probably want to visualize your dataset. Trace Explorer supports clustering, auto-labeling and visualizing dataset clusters by

  • performing clustering on a transformed subset of your original dataset
  • auto-labeling the discovered clusters
  • plotting them in a 2D scatter plot using a t-SNE embedding
  • (optional) training a tree classifier to cluster the entire dataset

# Generate a plot with auto-labeled clusters
trace_explorer visualize --source mydataset.parquet --threshold 5
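
Conceptually, the visualization step resembles the following scikit-learn sketch. It illustrates the cluster-then-embed idea only, not the actual trace_explorer implementation; mapping --threshold to the clustering distance threshold is an assumption.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

df = pd.read_parquet('mydataset.parquet')
# Cluster a manageable subset of the data.
sample = df.sample(n=min(len(df), 2000), random_state=0)
labels = AgglomerativeClustering(n_clusters=None, distance_threshold=5).fit_predict(sample)

# Embed the samples into 2D for plotting.
embedding = TSNE(n_components=2, random_state=0).fit_transform(sample.to_numpy())

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4)
plt.savefig('plot.pdf')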

Compare different traces

Finding a good way to compare traces is difficult. A good approach when operating on a common or subset/superset feature space is to

  • either limit the feature superset to the feature subset, OR use a good imputation strategy to generate the missing columns
  • concatenate both datasets
  • take a subset of the data and cluster it via agglomerative clustering
  • use a random forest classifier to 'learn' the classification
  • apply the classifier to a larger set of data and compute the visualization for that as well
  • visualize the large set of data with the trained classifier (see the sketch after the command below)

# Compare both datasets in a single visualization
trace_explorer compare --superset dataset1.parquet --subset dataset2.parquet --exclude badcolumns
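
The sketch below spells out this pipeline with scikit-learn. It is illustrative only; the actual implementation, the cluster count, and any imputation strategy may differ.

import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

sup = pd.read_parquet('dataset1.parquet')
sub = pd.read_parquet('dataset2.parquet')

# Limit the superset to the shared feature space and concatenate.
common = sup.columns.intersection(sub.columns)
both = pd.concat([sup[common], sub[common]], ignore_index=True)

# Cluster a small sample, then 'learn' the clustering with a random forest.
sample = both.sample(n=min(len(both), 1000), random_state=0)
labels = AgglomerativeClustering(n_clusters=8).fit_predict(sample)
clf = RandomForestClassifier(random_state=0).fit(sample, labels)

# Apply the trained classifier to the full concatenated dataset.
both['cluster'] = clf.predict(both)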

Generating sample data

The repository contains a simple generator.py script in the sample-data directory. It generates multi-dimensional clustered data sampled from Laplace distributions.
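
The sampling idea itself fits in a few lines of NumPy. This is a sketch of the approach only; the exact distribution parameters and output format of generator.py are assumptions, and the real script writes files into a directory rather than printing.

import json
import numpy as np

rng = np.random.default_rng(0)
columns, k, n = ['join', 'scan', 'filter'], 5, 300

# Draw each sample from a Laplace distribution centered on a
# randomly chosen cluster center.
centers = rng.uniform(0, 100, size=(k, len(columns)))
for _ in range(n):
    center = centers[rng.integers(k)]
    sample = rng.laplace(loc=center, scale=2.0)
    print(json.dumps(dict(zip(columns, sample.tolist()))))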

# Generate a 3-column dataset with join, scan and filter fields, using 5 clusters (-k 5) and 300 samples (-n 300)
sample-data/generator.py -c join -c scan -c filter -n 300 -k 5 -d sample-data/raw/

# Convert the dataset into parquet
trace_explorer convert --using sample-data/transformer.py --destination sample-data/raw.parquet --source 'sample-data/raw/*.json'

# Visualize the dataset; writes out a file named plot.pdf
trace_explorer visualize --source sample-data/raw.parquet --threshold 20

# Compare two sampled datasets
trace_explorer compare --superset sample-data/raw.parquet --subset sample-data/raw2.parquet

Run in debug mode

If you want to extend Trace Explorer or fix a bug, you may want to run the web frontend in debug mode, which enables hot reloading of all relevant components.

# NOBROWSER=yes disables automatically opening a browser for the application.
NOBROWSER=yes flask --app trace_explorer.web --debug run
