Mesh Performance Tests

Performance tests of Kong Mesh.

Run

  1. Install dependencies
make dev/tools
  2. Create local cluster
ENV=local make start-cluster
  3. Run tests from the mesh-perf directory
make run
  4. Destroy local cluster
ENV=local make destroy-cluster

Setup EKS cluster from your machine

It is recommended to use saml2aws for AWS authorization. After authorizing, you just need to run:

AWS_PROFILE=saml ENV=eks make start-cluster
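
For example, a minimal end-to-end sketch (the saml2aws account configuration and IdP setup are assumptions):

saml2aws login                               # authenticate against your IdP and refresh AWS credentials
AWS_PROFILE=saml ENV=eks make start-cluster  # then create the cluster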

Observability

The observability tooling is a way to inspect the end result of perf tests. A perf test ends with a snapshot of the Prometheus TSDB saved on the host that ran the perf test (defaults to /tmp/prom-snapshots). This directory will look like this:

❯❯❯ ll -la /tmp/prom-snapshots/
total 0
drwxr-xr-x   6 jakub  wheel   192B Jun 29 15:40 ./
drwxrwxrwt  15 root   wheel   480B Jun 29 14:30 ../
drwxr-xr-x   6 jakub  wheel   192B Jun 29 15:28 20230629T125736Z-5c8c90f181c0b57f/
drwxr-xr-x   3 jakub  wheel    96B Jun 29 15:30 20230629T133034Z-77fee4f8e5a90c89/
drwxr-xr-x   3 jakub  wheel    96B Jun 29 15:33 20230629T133316Z-5e37819462543e4f/
drwxr-xr-x   3 jakub  wheel    96B Jun 29 15:40 20230629T134058Z-035f3439076d9f04/

You can run a Docker Compose stack of Prometheus + Grafana with the data from a test:

PROM_SNAPSHOT_PATH=/tmp/prom-snapshots/20230629T134058Z-035f3439076d9f04 make start-grafana

Grafana will be forwarded to localhost:3000, and the Kuma CP dashboard should be ready.

To update the kuma-cp.json dashboard:

  • place the mesh-perf project next to kuma
  • run make upgrade/dashboards from the top-level directory of mesh-perf.

Issues

Add resources to observability components

Today, when deploying Grafana and Prometheus using kumactl install observability, the resources: {} section is empty for all containers.

When resources are not specified, a container only receives whatever is left over. Even if that's fine for Grafana, throttling of Prometheus can affect test execution (the tests rely on Prometheus metrics) and the resulting snapshot. One possible fix is sketched below.
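
A minimal sketch of such a fix, assuming a deployment named prometheus-server with Prometheus as its first container (namespace, names, and resource values here are assumptions, not the project's actual configuration):

# give Prometheus explicit requests/limits so it can't be starved or throttled away
kubectl -n mesh-observability patch deployment prometheus-server --type=json --patch '[
  {"op": "add", "path": "/spec/template/spec/containers/0/resources",
   "value": {"requests": {"cpu": "1", "memory": "2Gi"}, "limits": {"memory": "2Gi"}}}]'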

Calculate most cost optimal EKS cluster for running bigger tests

Each EC2 instance type has a limited number of pods that can be deployed (list here). We also need enough CPU to accommodate the test services.

  • Find EC2 instance types with the highest possible pod count and the lowest cost.
  • Find a formula that helps us calculate cluster size from the expected number of pods in a perf test (see the sketch after this list).
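
A minimal sketch of the per-instance pod limit that AWS's published max-pods list is derived from (the m5.large figures below are its ENI/IP limits):

# max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# m5.large: 3 ENIs with 10 IPv4 addresses each
echo $(( 3 * (10 - 1) + 2 ))   # prints 29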

Change report format we send to Datadog

Today we send a log that looks like:

hostname="github-actions", service="mesh-perf-test", specReports=[{report1},{report2},{report3}...]

Apparently it's not possible to generate metrics in DD based on items inside the specReports array. We can access items only by index.

We have to change the format to:

hostname="github-actions", service="mesh-perf-test", specReport={report1}
hostname="github-actions", service="mesh-perf-test", specReport={report2}
hostname="github-actions", service="mesh-perf-test", specReport={report3}

In that case, extracting attributes in DD is pretty straightforward. A sketch of the transformation follows.
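
A minimal sketch of the split, assuming the current payload sits in a file such as report.json (a hypothetical name) and jq is available:

# emit one log line per spec report instead of one line holding the whole array
jq -c '.specReports[] | {hostname: "github-actions", service: "mesh-perf-test", specReport: .}' report.json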

Report should be comparable between test runs

So the reports should be an artifact in a good format that contains the parameters (number of services, number of pods, version...) and a set of aggregated metrics (it's OK to start with just duration).

We should then be able to retrieve all the runs for a period of time and then plot them to compare.
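
For illustration, one report entry could look like this (a purely hypothetical shape, not an agreed format):

{"numServices": 100, "podsPerService": 2, "version": "2.3.0", "durationSeconds": 241}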

Reduce the size of Prometheus snapshot

When running 2k pods the snapshot can be around 400 MB. It contains a lot of kube metrics we're not using for our dashboard, so it makes sense to somehow exclude them from the snapshot; one option is sketched below.
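
A minimal sketch of one option, assuming the scrape jobs live in the prometheus-server ConfigMap installed by kumactl install observability (namespace and key layout are assumptions):

# drop kube-state metrics at scrape time by adding, under each scrape job:
#
#   metric_relabel_configs:
#     - source_labels: [__name__]
#       regex: "kube_.*"
#       action: drop
#
kubectl -n mesh-observability edit configmap prometheus-server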

Test suite fails in "AfterAll" because namespace termination takes too much time

• [FAILED] [241.758 seconds]
Simple [AfterAll] should distribute certs when mTLS is enabled
  [AfterAll] /home/runner/go/pkg/mod/github.com/kumahq/kuma@…/test/framework/ginkgo.go:33
  [It] /home/runner/work/mesh-perf/mesh-perf/test/k8s/simple_test.go:240

  [FAILED] 'Wait for kuma-test Namespace to terminate.' unsuccessful after 60 retries
  
  In [AfterAll] at: /home/runner/go/pkg/mod/github.com/kumahq/kuma@…/test/framework/k8s_cluster.go:1004

Should we even wait for namespace termination if we destroy the cluster right after?

First test suite

(clean up the application and control plane between tests); we can reuse the service generator from kuma-tools (generate-mesh.go).

Perf Test stages

  1. Run perf test locally
  2. Run periodically on cloud env
  3. Tests run with Prometheus, and we are able to extract metrics after they complete
  4. Test results with metrics are persisted

LeaderElection `renewDeadline` can be too small

By default, renewDeadline is 10s, but when the Kube API is busy it can take much longer to reply (up to 60s). We should probably configure renewDeadline to be 80s.

Keep in mind that leaseDuration apparently can't be shorter than renewDeadline, so we should set it to 100s or something like that.

These parameters should be set here: https://github.com/kumahq/kuma/blob/master/pkg/plugins/bootstrap/k8s/plugin.go#L58. So this feature first requires making them configurable in Kuma.

Add inputs to action

We need:

  1. number of services
  2. number of pods per service

These should have good defaults; a sketch of the workflow inputs follows.
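
A minimal sketch of what the action inputs could look like in the workflow file (names, descriptions, and defaults are assumptions, not the action's actual interface):

workflow_dispatch:
  inputs:
    num_services:
      description: "Number of services to deploy"
      default: "5"
    pods_per_service:
      description: "Number of pods per service"
      default: "2"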

Reduce `scrape_interval` for Perf Tests

The current default value is 10s, and it's hardcoded in kumactl install observability.

We should either add a --scrape-interval flag to kumactl or override this value in the prometheus-server ConfigMap during test setup, as sketched below.
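
A minimal sketch of the ConfigMap route (namespace, key layout, and the 5s value are assumptions):

# override the interval before starting the test, e.g. under the global section:
#
#   global:
#     scrape_interval: 5s
#
kubectl -n mesh-observability edit configmap prometheus-server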
