keptn-contrib / dynatrace-sli-service

(Deprecated) Keptn service to receive metrics from Dynatrace metrics API

License: Apache License 2.0

Dockerfile 1.17% Go 97.04% Smarty 1.12% Shell 0.67%

Deprecated: Dynatrace SLI Service

Important information

This repository has been archived as the functionality has been moved to the dynatrace-service.


The dynatrace-sli-service is a Keptn service that retrieves the values of SLIs from your Dynatrace Tenant via the Dynatrace Metrics v2 API endpoint. To do so, it handles the Keptn event sh.keptn.internal.event.get-sli, which is sent as part of a quality gate evaluation.

The dynatrace-sli-service provides the capability to connect to different Dynatrace Tenants for your Keptn projects, stages or services. It also allows you to define SLIs either through sli.yaml files or through a Dynatrace dashboard. All of this is configurable through dynatrace.conf.yaml.

By default, even if you do not specify a custom sli.yaml or a Dynatrace dashboard, the following SLIs are automatically supported in case you reference them in your slo.yaml:

 - throughput: builtin:service.requestCount.total
 - error_rate: builtin:service.errors.total.rate
 - response_time_p50: builtin:service.response.time:percentile(50)
 - response_time_p90: builtin:service.response.time:percentile(90)
 - response_time_p95: builtin:service.response.time:percentile(95)

By default these metrics (SLIs) are queried from a Dynatrace-monitored service entity with the tags keptn_project, keptn_service, keptn_stage & keptn_deployment.

As highlighted above, the dynatrace-sli-service also provides the following capabilities:

  • Connecting to different Dynatrace Tenants (SaaS or Managed) depending on Keptn Project, Stage or Service

  • Defining a custom list of SLIs based on the Dynatrace Metrics API v2. This allows SLIs to reference any metric in Dynatrace: Application, Service, Process Groups, Host, Custom Devices, Calculated Service Metrics, External Metrics ...

  • Visually defining SLIs & SLOs through a Dynatrace Dashboard instead of sli.yaml and slo.yaml

Compatibility Matrix

| Keptn Version | Dynatrace-SLI-Service Service Image |
|---|---|
| 0.6.0 | keptncontrib/dynatrace-sli-service:0.3.0 |
| 0.6.1 | keptncontrib/dynatrace-sli-service:0.3.1 |
| 0.6.1, 0.6.2 | keptncontrib/dynatrace-sli-service:0.4.1 |
| 0.6.1, 0.6.2 | keptncontrib/dynatrace-sli-service:0.4.2 |
| 0.7.0 | keptncontrib/dynatrace-sli-service:0.5.0 |
| 0.7.1 | keptncontrib/dynatrace-sli-service:0.6.0 |
| 0.7.2 | keptncontrib/dynatrace-sli-service:0.7.0 |
| 0.7.2+ | keptncontrib/dynatrace-sli-service:0.7.1 |
| 0.7.3 | keptncontrib/dynatrace-sli-service:0.7.2 |
| 0.7.3 | keptncontrib/dynatrace-sli-service:0.7.3 |
| 0.8.0-alpha | keptncontrib/dynatrace-sli-service:0.8.0-alpha |
| 0.8.0, 0.8.1 | keptncontrib/dynatrace-sli-service:0.9.0 |
| 0.8.0 - 0.8.3 | keptncontrib/dynatrace-sli-service:0.10.0 |
| 0.8.0 - 0.8.3 | keptncontrib/dynatrace-sli-service:0.10.1 |
| 0.8.0 - 0.8.3 | keptncontrib/dynatrace-sli-service:0.10.2 |
| 0.8.0 - 0.8.3 | keptncontrib/dynatrace-sli-service:0.10.3 |
| 0.8.0 - 0.8.3 | keptncontrib/dynatrace-sli-service:0.11.0 |
| 0.8.4 - 0.8.6 | keptncontrib/dynatrace-sli-service:0.12.0 |
| 0.8.4 - 0.8.6 | keptncontrib/dynatrace-sli-service:0.12.1 |

Installation

Like any Keptn service, the dynatrace-sli-service needs to be installed on the k8s cluster where Keptn is installed.

Deploy in your Kubernetes cluster

  • The dynatrace-sli-service by default validates the SSL certificate of the Dynatrace API. If your Dynatrace API only has a self-signed certificate, you can disable the SSL certificate check by setting the variable dynatraceSliService.config.httpSSLVerify (default true) in the chart/values.yml file to false.

  • The dynatrace-sli-service can be configured to use a proxy server via the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables as described in httpproxy.FromEnvironment(). As the dynatrace-sli-service connects to a distributor, a NO_PROXY entry including 127.0.0.1 should be used to prevent these connections from being proxied. The HTTP_PROXY and HTTPS_PROXY environment variables can be configured using the dynatraceSliService.config.httpProxy (default "") and dynatraceSliService.config.httpsProxy (default "") variables in values.yml. NO_PROXY is set to 127.0.0.1 by default.

  • To deploy the current version of the dynatrace-sli-service in your Kubernetes cluster, use the helm chart located in the chart directory. Please use the same namespace for the dynatrace-sli-service as you are using for Keptn, e.g. keptn.

    helm upgrade --install dynatrace-sli-service -n keptn https://github.com/keptn-contrib/dynatrace-sli-service/releases/download/$VERSION/dynatrace-sli-service-$VERSION.tgz

    Note: Replace $VERSION with the desired version number (e.g. 0.12.1) you want to install.

  • This installs the dynatrace-sli-service into the keptn namespace, which you can verify using:

    kubectl -n keptn get deployment dynatrace-sli-service -o wide
    kubectl -n keptn get pods -l run=dynatrace-sli-service

Uninstall

To delete a deployed dynatrace-sli-service, use the helm CLI to uninstall the installed release of the service:

helm delete -n keptn dynatrace-sli-service

Debugging

Remote debugging is supported using Skaffold via skaffold debug, which starts a Delve instance prior to running the service.

Pre-Requisites: Dynatrace Tenant URL & API Token

In order for the dynatrace-sli-service to connect to Dynatrace you need to provide a Dynatrace Tenant URL and a Dynatrace API Token. The examples below follow the best practice of exporting these values in the environment variables DT_TENANT and DT_API_TOKEN, as explained in the Keptn documentation for Dynatrace.

Configuration of project- & Keptn-wide Dynatrace credentials

The dynatrace-sli-service uses the same implementation as the dynatrace-service when it comes to connecting to your Dynatrace Tenant (SaaS or Managed). Both services pull the Dynatrace Tenant URL and Dynatrace API Token from a k8s secret stored in the same namespace in which the respective service (dynatrace-service or dynatrace-sli-service) is installed.

Both services give you the option to configure project-wide-default or keptn-wide-default credentials. For project-wide, the secret needs to be named dynatrace-credentials-YOURPROJECT. For keptn-wide the secret can either be called dynatrace-credentials or just dynatrace.

The following is an example to define a secret for a Keptn project called sockshop:

kubectl create secret generic dynatrace-credentials-sockshop -n "keptn" --from-literal="DT_TENANT=$DT_TENANT" --from-literal="DT_API_TOKEN=$DT_API_TOKEN"

And here is an example that specifies a Keptn-wide default secret, which is used in case there is no project-wide secret defined for a particular Keptn project:

kubectl create secret generic dynatrace -n "keptn" --from-literal="DT_TENANT=$DT_TENANT" --from-literal="DT_API_TOKEN=$DT_API_TOKEN"

Configurations of Credentials through dynatrace.conf.yaml

While project-wide and Keptn-wide credentials give a certain flexibility, users have asked for more fine-grained control over Dynatrace credential management, as well as for configuring the behavior of other features of the dynatrace-sli-service on a project, stage and service level. This is why it is important to understand and use dynatrace.conf.yaml.

When the dynatrace-sli-service processes a sh.keptn.internal.event.get-sli event, it looks for a file called dynatrace/dynatrace.conf.yaml in the Keptn Configuration Repository. It first looks for it on the service level, then on the stage level and finally on the project level. This configuration file is also used by the dynatrace-service. For the dynatrace-sli-service it allows you to configure the following behavior:

  • Which k8s secret to use to pull Dynatrace Tenant Credentials (DT_TENANT & DT_API_TOKEN)
  • Whether to pull SLI/SLO information from a Dynatrace dashboard or use the stored sli.yaml and slo.yaml in the Keptn Configuration Repository

Here is an example dynatrace.conf.yaml:

---
spec_version: '0.1.0'
dtCreds: dynatrace-preprod
dashboard: query

To upload this to your Keptn project you can for instance use the Keptn CLI:

keptn add-resource --project=yourproject --stage=yourstage --resource=./dynatrace.conf.yaml --resourceUri=dynatrace/dynatrace.conf.yaml

dtCreds

dtCreds allows you to specify the name of the k8s secret in your Keptn namespace that holds the required credentials to connect to the Dynatrace Tenant. This extends the default behavior explained in the beginning by having the dynatrace-sli-service first look at the secret defined in dtCreds. If dtCreds is not specified, or if there is no dynatrace.conf.yaml at all, the default behavior applies.

In the example above, where dtCreds was specified with the value dynatrace-preprod, the dynatrace-sli-service would look for the first matching secret in the following order: dynatrace-preprod, dynatrace-credentials-YOURKEPTNPROJECT, dynatrace-credentials, dynatrace. If none of these secrets is configured in your k8s Keptn namespace, the dynatrace-sli-service responds with an error indicating that no Dynatrace credentials could be found.
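The lookup order described above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the service's actual code: the function names candidateSecrets and firstExistingSecret are hypothetical, and the exists callback stands in for a real lookup against the k8s API.

```go
package main

import "fmt"

// candidateSecrets returns the secret names in the order described above.
// dtCreds is the value from dynatrace.conf.yaml ("" if it is not set).
// Hypothetical helper for illustration only.
func candidateSecrets(dtCreds, project string) []string {
	var names []string
	if dtCreds != "" {
		names = append(names, dtCreds)
	}
	return append(names,
		"dynatrace-credentials-"+project, // project-wide default
		"dynatrace-credentials",          // Keptn-wide default
		"dynatrace",                      // alternative Keptn-wide name
	)
}

// firstExistingSecret returns the first candidate for which exists reports
// true. exists is a stand-in for a lookup in the Keptn namespace.
func firstExistingSecret(candidates []string, exists func(string) bool) (string, error) {
	for _, name := range candidates {
		if exists(name) {
			return name, nil
		}
	}
	return "", fmt.Errorf("no Dynatrace credentials found, tried %v", candidates)
}
```

With dtCreds set to dynatrace-preprod and the project sockshop, candidateSecrets yields dynatrace-preprod, dynatrace-credentials-sockshop, dynatrace-credentials and dynatrace, in that order.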

For completeness of the example, here is how to create that secret so it matches what is in dynatrace.conf.yaml:

kubectl create secret generic dynatrace-preprod -n "keptn" --from-literal="DT_TENANT=$DT_TENANT" --from-literal="DT_API_TOKEN=$DT_API_TOKEN"

dtCreds was requested by many users as it gives you the option to specify credentials for your different Dynatrace Tenants, e.g. my-dynatrace-preprod, my-dynatrace-prod, my-dynatrace-dev. You can then configure on project, stage or even service level which Dynatrace Tenant to use. This gives you the flexibility to manage multiple environments within a single project but separate them out by, e.g., stages.

Configurations of Dashboard SLI/SLO queries through dynatrace.conf.yaml

The dynatrace.conf.yaml provides an additional option to configure whether the dynatrace-sli-service should use the metric queries defined in sli.yaml, whether it should pull data from a specific dashboard, or whether it should query the data from a Dynatrace dashboard whose name matches the Keptn project, stage and service.

Here is an example dynatrace.conf.yaml including the dashboard parameter:

---
spec_version: '0.1.0'
dtCreds: dynatrace-prod
dashboard: query

Remember to upload this file using, e.g., the Keptn CLI or the Keptn API. It has to be in the subfolder dynatrace, which is why resourceUri=dynatrace/dynatrace.conf.yaml is used:

keptn add-resource --project=yourproject --stage=yourstage --resource=./dynatrace.conf.yaml --resourceUri=dynatrace/dynatrace.conf.yaml

dashboard

The dashboard parameter provides 3 options:

  • blank (default): If dashboard is not specified at all, or if you do not even have a dynatrace.conf.yaml, the dynatrace-sli-service will simply execute the metric queries as defined in sli.yaml.
  • query: This value means that the dynatrace-sli-service will look for a dashboard on your Dynatrace Tenant (dynatrace-prod in the example above) which has the following dashboard naming format: KQG;project=<YOURKEPTNPROJECT>;service=<YOURKEPTNSERVICE>;stage=<YOURKEPTNSTAGE>. If such a dashboard exists it will use the definition of that dashboard for SLIs as well as SLOs. If no dashboard is found that matches that name it goes back to default mode.
  • DASHBOARD-UUID: If you specify the UUID of a Dynatrace dashboard the dynatrace-sli-service will query this dashboard on the specified Dynatrace Tenant. If it exists it will use the definition of this dashboard for SLIs as well as SLOs. If the dashboard is not found the dynatrace-sli-service will raise an error and not continue.
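To make the three modes and their fallback behavior easier to compare, here is a minimal Go sketch. It is an illustration only: resolveDashboard, findByName and existsByID are hypothetical names, and the two callbacks stand in for calls to the Dynatrace Tenant.

```go
package main

import "fmt"

// resolveDashboard decides which dashboard (if any) to use for a given
// `dashboard` setting from dynatrace.conf.yaml. An empty result with a nil
// error means "no dashboard, use the SLI queries from the yaml files".
// findByName and existsByID are hypothetical lookups against the tenant.
func resolveDashboard(setting string, findByName func() (string, bool), existsByID func(string) bool) (string, error) {
	switch setting {
	case "":
		// blank: dashboards are not used at all
		return "", nil
	case "query":
		// query: use a dashboard with a matching KQG;... name if one exists,
		// otherwise fall back to default mode
		if id, ok := findByName(); ok {
			return id, nil
		}
		return "", nil
	default:
		// anything else is treated as a dashboard UUID; a missing dashboard
		// is an error and the evaluation does not continue
		if !existsByID(setting) {
			return "", fmt.Errorf("dashboard %q not found", setting)
		}
		return setting, nil
	}
}
```

The important difference is in the failure path: query silently falls back, while an explicit UUID that cannot be found results in an error.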

Here is an example of a dynatrace.conf.yaml specifying the UUID of a Dynatrace Dashboard:

---
spec_version: '0.1.0'
dtCreds: dynatrace-prod
dashboard: 311f4aa7-5257-41d7-abd1-70420500e1c8

Dashboard parsing behavior

If a dashboard is queried, the dynatrace-sli-service will first validate whether the dashboard has changed since the last evaluation. It does that by comparing the dashboard's JSON with the dashboard JSON that was used during the last evaluation run. If the dashboard.json has not changed, it will fall back to the sli.yaml and slo.yaml, as these were also created out of the dashboard in the previous run. If you want to overwrite this behavior you can simply put a KQG.QueryBehavior=Overwrite on your dashboard. Details on that are explained further down in this readme. This behavior also implies that the dynatrace-sli-service stores the content of the dashboard and the generated sli.yaml and slo.yaml in your configuration repo. You can find these files on service level under: dynatrace/dashboard.json, dynatrace/sli.yaml, slo.yaml.

Tip: You can easily find the dashboard id for an existing dashboard by navigating to it in your Dynatrace Web interface. The ID is then part of the URL.

SLI Configuration

While most users will use the dashboard approach it is important to understand how the general processing of SLIs works without dashboards. Dashboards give an additional convenience as the sli.yaml file doesn't need to be created or maintained by anybody as this information is extracted from a Dynatrace Dashboard. However - in very mature organizations the approach of using SLI & SLO yamls instead of Dynatrace Dashboards is very likely.

That is why this section gives you a basic understanding of how SLIs work with the dynatrace-sli-service.

The default SLI queries that come with the dynatrace-sli-service are defined as follows. They are used in case you have specified neither a custom sli.yaml nor a Dynatrace dashboard:

spec_version: "1.0"
indicators:
 throughput: "metricSelector=builtin:service.requestCount.total:merge(0):sum&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"
 error_rate: "metricSelector=builtin:service.errors.total.rate:merge(0):avg&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"
 response_time_p50: "metricSelector=builtin:service.response.time:merge(0):percentile(50)&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"
 response_time_p90: "metricSelector=builtin:service.response.time:merge(0):percentile(90)&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"
 response_time_p95: "metricSelector=builtin:service.response.time:merge(0):percentile(95)&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"

Note: The default SLI queries require the following tags on the services and within the query:

  • keptn_project
  • keptn_stage
  • keptn_service
  • keptn_deployment

When Keptn queries these SLIs for, e.g., the service carts in the stage dev within project sockshop, it would translate to the following tags in the query:

  • keptn_project:sockshop
  • keptn_stage:dev
  • keptn_service:carts
  • keptn_deployment:primary (or keptn_deployment:canary during tests)

If you use Keptn for the deployment of your artifacts using Keptn's Helm Service you will have these four tags automatically set and detected by Dynatrace. If you want to use other tags, you need to overwrite the SLI configuration (see below).

Overwrite SLI Configuration / Custom SLI queries

Users can override the predefined queries, as well as add custom queries, by creating an SLI configuration:

  • A custom SLI configuration is a YAML file as shown below:

    ---
    spec_version: "1.0"
    indicators:
      your_metric: "metricSelector=your_metric:count&entitySelector=tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT),type(SERVICE)"
  • To store this configuration, you need to add this file to Keptn's configuration store either on project, stage, or service level. The remote resourceUri needs to be dynatrace/sli.yaml. This is done by using the Keptn CLI with the keptn add-resource command. Here is an example:

    keptn add-resource --project=yourproject --stage=yourstage --service=yourservice --resource=./sli.yaml --resourceUri=dynatrace/sli.yaml

More examples on custom SLIs

You can define an sli.yaml that queries ANY type of metric available in Dynatrace, on ANY entity type (APPLICATION, SERVICE, PROCESS GROUP, HOST, CUSTOM DEVICE, etc.). You can either "hard-code" the queries in your sli.yaml or you can use placeholders such as $SERVICE, $STAGE, $PROJECT, $DEPLOYMENT as well as $LABEL.yourlabel1, $LABEL.yourlabel2. This is very powerful as you can define generic sli.yaml files and leverage the dynamic data of a Keptn event. Here is an example where we are retrieving the tag name from a label that is passed to Keptn:

indicators:
    throughput:  "metricSelector=builtin:service.requestCount.total:merge(0):sum&entitySelector=tag($LABEL.dttag),type(SERVICE)"

So, if you send an event to Keptn and pass in a label with the name dttag and a value of, e.g., evaluateforsli, the query will match a Dynatrace service that has this tag on it.
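The placeholder replacement described above can be pictured with the following Go sketch. It is a simplified illustration, not the service's implementation: the eventContext type and the replacePlaceholders function are hypothetical names, and the exact replacement rules of the service may differ.

```go
package main

import (
	"sort"
	"strings"
)

// eventContext holds the values of a Keptn event that can be referenced in
// an SLI query. Hypothetical type for illustration only.
type eventContext struct {
	Project, Stage, Service, Deployment string
	Labels                              map[string]string
}

// replacePlaceholders substitutes $LABEL.<name>, $PROJECT, $STAGE, $SERVICE
// and $DEPLOYMENT in an SLI query.
func replacePlaceholders(query string, ctx eventContext) string {
	// Replace labels with longer names first, so that $LABEL.dttag2 is not
	// partially replaced by the value of $LABEL.dttag.
	names := make([]string, 0, len(ctx.Labels))
	for name := range ctx.Labels {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool { return len(names[i]) > len(names[j]) })
	for _, name := range names {
		query = strings.ReplaceAll(query, "$LABEL."+name, ctx.Labels[name])
	}
	replacer := strings.NewReplacer(
		"$PROJECT", ctx.Project,
		"$STAGE", ctx.Stage,
		"$SERVICE", ctx.Service,
		"$DEPLOYMENT", ctx.Deployment,
	)
	return replacer.Replace(query)
}
```

For an event of project sockshop with the label dttag=evaluateforsli, the selector tag($LABEL.dttag),type(SERVICE) becomes tag(evaluateforsli),type(SERVICE). Note that type(SERVICE) is left untouched because only the $-prefixed placeholders are replaced.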

You can also have SLIs that span multiple layers of your stack, e.g: services, process groups and host metrics. Here is an example that queries one metric from a service, one from a process group and one from a host. The tag names come from labels that are sent to Keptn:

indicators:
    throughput:  "metricSelector=builtin:service.requestCount.total:merge(0):sum&entitySelector=tag($LABEL.dtservicetag),type(SERVICE)"
    gcheapuse:   "metricSelector=builtin:tech.nodejs.v8heap.gcHeapUsed:merge(0):sum&entitySelector=tag($LABEL.dtpgtag),type(PROCESS_GROUP_INSTANCE)"
    hostmemory:  "metricSelector=builtin:host.mem.usage:merge(0):avg&entitySelector=tag($LABEL.dthosttag),type(HOST)"

We hope these examples help you see what is possible. If you want to learn more about Dynatrace metrics and the queries you need to create to extract them, explore the Dynatrace API Explorer (Swagger UI) as well as the Metric API v2 documentation.

Advanced SLI Queries for Dynatrace

Here are a couple of additional query options that have been added to the Dynatrace SLI Service over time to extend the capabilities of querying more relevant data:

Dynatrace SLO Definition

With Dynatrace version 207, Dynatrace introduced native support for SLO monitoring. The dynatrace-sli-service is able to query these SLO definitions by referencing them by SLO ID. Here is such an SLO as seen in a dashboard:

And here is the corresponding SLI query which is specified as SLO;<SLOID>:

indicators:
    rt_faster_500ms: SLO;524ca177-849b-3e8c-8175-42b93fbc33c5

The dynatrace-sli-service queries the SLO using the /api/v2/slo/ endpoint and returns the evaluatedPercentage field.

Open Problems

One interesting metric is the number of open problems you may have in a particular environment or those that match a particular problem type. Dynatrace provides the Problem API v2, which allows you to query problems by entitySelector as well as problemSelector. You can pass both fields as part of an SLI query by prefixing it with PV2;. Here is an example of what such an SLI definition looks like:

indicators:
    problems: PV2;problemSelector=status(open)&entitySelector=managementZoneIds(7030365576649815430)

The dynatrace-sli-service returns the totalCount field of the /api/v2/problems endpoint, passing your query string.

Define Metric Unit for Metrics Query

Most SLIs you define are queried using the Metrics API v2. The following is an example:

indicators:
 teststep_rt_Basic_Check: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,Basic Check));entitySelector=type(SERVICE)"

When the dynatrace-sli-service executes this query it simply returns the value of that metric. What is not always known is the metric unit. Depending on the metric definition this could be nanoseconds, microseconds, milliseconds, seconds or even bytes, kilobytes, megabytes, ...

For some of the metrics the dynatrace-sli-service makes metric unit assumptions and, for instance, converts MicroSecond into MilliSeconds and Bytes into KiloBytes. However, these assumptions only work for builtin metrics and are therefore not a valid approach in general, unless the service queried the metric definition every time it queries these metrics. While this would work, it would require many extra API calls, which we want to avoid. To let the dynatrace-sli-service know about the expected metric unit you can prefix your query with MV2;<MetricUnit>;<Regular Query>. The above example can be changed as follows to tell the service that this metric is returned in MicroSeconds:

indicators:
 teststep_rt_Basic_Check: "MV2;MicroSecond;metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,Basic Check));entitySelector=type(SERVICE)"

The possible metric units are those that Dynatrace specifies in the API. Please have a look at the Metric API documentation for a complete overview. Currently the dynatrace-sli-service does the following conversions before returning the value to Keptn. While this does not yet solve every request we have seen from our users, we hope it covers many use cases of users asking for better handling of MicroSeconds and Bytes:

| Source Data Type | Converted To |
|---|---|
| MicroSeconds | MilliSeconds |
| Bytes | KiloBytes |

If you want a more flexible way to convert metric units, please let us know by creating an issue and explaining your use case.
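The query prefixes from this section (SLO;, PV2; and MV2;<MetricUnit>;) and the unit conversion can be summarized in a short Go sketch. This is an illustration under assumptions: the sliQuery type and both function names are hypothetical, and the divisors 1000 and 1024 are assumed conversion factors rather than values confirmed by this readme.

```go
package main

import (
	"fmt"
	"strings"
)

// sliQuery is a hypothetical representation of a parsed SLI definition.
type sliQuery struct {
	Kind  string // "SLO", "PV2", "MV2" or "METRICS" (no prefix)
	Unit  string // metric unit, only set for MV2
	Query string // SLO ID or the remaining query string
}

// parseSLIQuery splits an SLI definition into its prefix and query part.
func parseSLIQuery(raw string) (sliQuery, error) {
	switch {
	case strings.HasPrefix(raw, "SLO;"):
		return sliQuery{Kind: "SLO", Query: strings.TrimPrefix(raw, "SLO;")}, nil
	case strings.HasPrefix(raw, "PV2;"):
		return sliQuery{Kind: "PV2", Query: strings.TrimPrefix(raw, "PV2;")}, nil
	case strings.HasPrefix(raw, "MV2;"):
		// Split into at most three parts so that any further ';' stays
		// part of the regular query.
		parts := strings.SplitN(raw, ";", 3)
		if len(parts) != 3 {
			return sliQuery{}, fmt.Errorf("expected MV2;<MetricUnit>;<query>, got %q", raw)
		}
		return sliQuery{Kind: "MV2", Unit: parts[1], Query: parts[2]}, nil
	default:
		return sliQuery{Kind: "METRICS", Query: raw}, nil
	}
}

// scaleValue applies the conversions from the table above.
func scaleValue(unit string, value float64) float64 {
	switch unit {
	case "MicroSecond":
		return value / 1000 // microseconds to milliseconds
	case "Byte":
		return value / 1024 // bytes to kilobytes (assumed divisor)
	default:
		return value
	}
}
```

For example, a query prefixed with MV2;MicroSecond; that returns 2500 would be reported to Keptn as 2.5 under these assumptions.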

SLIs & SLOs for Problem Remediation

If Dynatrace sends problems to Keptn and this triggers an auto-remediation workflow, Keptn also evaluates your SLOs after the remediation action was executed. The default behavior that users expect is that the auto-remediation workflow can stop if the problem has been closed in Dynatrace and that it should continue otherwise.

When a Dynatrace problem initiates a Keptn auto-remediation workflow, the dynatrace-service adds the Dynatrace Problem URL as a label with the name "Problem URL". As labels are passed all the way through every event along a Keptn process, the label also ends up being passed as part of the sh.keptn.internal.event.get-sli event, which is handled by the dynatrace-sli-service. Here is an excerpt of that event showing the label:

    "labels": {
      "Problem URL": "https://abc12345.live.dynatrace.com/#problems/problemdetails;pid=3734886735257827488_1606270560000V2",
      "firstaction": "action.triggered.firstaction.sh"
    },
    "project": "demo-remediation"

So, if the dynatrace-sli-service detects that it is called in the context of a remediation workflow and finds a Dynatrace Problem ID (PID) as part of the Problem URL, it will query the status of that problem (OPEN or CLOSED) using Dynatrace's Problem API v2. It will then return an SLI called problem_open whose value is either 0 (problem no longer open) or 1 (problem still open). The dynatrace-sli-service will also define a key SLO for problem_open with a default pass criteria of <=0, meaning the evaluation will only succeed if the problem is closed. The following is an excerpt of that SLO definition:

objectives:
- sli: problem_open
  pass:
  - criteria:
    - <=0
  key_sli: true

As the SLO gets added if it is not defined, and as the SLI named problem_open is always returned, this capability allows you to either define your own custom SLO for problem_open or go with the default that the dynatrace-sli-service creates.
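As an illustration of the steps above, the following Go sketch extracts the problem ID from a "Problem URL" label and maps a problem status to the problem_open value. The function names and the regular expression are assumptions made for this example; the service may extract the PID differently.

```go
package main

import "regexp"

// pidPattern matches the pid parameter in a Dynatrace problem URL such as
// ".../#problems/problemdetails;pid=<PID>". The pattern is an assumption
// based on the URL format shown in the event excerpt above.
var pidPattern = regexp.MustCompile(`[;?&]pid=([^;&]+)`)

// problemIDFromURL returns the problem ID contained in the URL, if any.
func problemIDFromURL(problemURL string) (string, bool) {
	match := pidPattern.FindStringSubmatch(problemURL)
	if match == nil {
		return "", false
	}
	return match[1], true
}

// problemOpenValue maps the problem status to the value of the
// problem_open SLI: 1 while the problem is open, 0 otherwise.
func problemOpenValue(status string) float64 {
	if status == "OPEN" {
		return 1
	}
	return 0
}
```

With the default pass criteria of <=0, a returned value of 1 fails the key SLO and the remediation workflow continues.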

SLIs & SLOs via Dynatrace Dashboard

Based on user feedback we learned that defining custom SLIs via the sli.yaml and then defining SLOs via slo.yaml can be challenging as one has to be familiar with the Dynatrace Metrics v2 API to craft the necessary SLI queries. As dashboards are a prominent feature in Dynatrace to visualize metrics, it was a logical step to leverage dashboards as the basis for Keptn's SLI/SLO configuration.

If the dynatrace-sli-service parses your dashboard, it generates an sli.yaml and an slo.yaml and uploads them to your Keptn configuration repository. It also uploads the dashboard.json.

How dynatrace-sli-service locates a Dashboard

As explained earlier, the dynatrace-sli-service gives you two options through the dashboard property in your dynatrace.conf.yaml:

  1. query: This will query for a dashboard whose name follows this pattern: KQG;project=<YOURKEPTNPROJECT>;service=<YOURKEPTNSERVICE>;stage=<YOURKEPTNSTAGE>

  2. UUID: Use, e.g., dashboard: e6c947f2-4c29-483c-a065-269b3707bea4, which will then query exactly that dashboard

For more details refer to the section above where we explained dynatrace.conf.yaml.

SLI/SLO Dashboard Layout and how it generates SLI & SLO definitions

Here is a sample dashboard for our simplenode sample application:

And here is how the individual pieces matter:

1. Name of the dashboard

If the dashboard is not referenced in dynatrace.conf.yaml via the dashboard ID, the dynatrace-sli-service queries all dashboards and uses the one whose name starts with KQG; followed by the name-value pairs:

project=<project>;service=<service>;stage=<stage>

The order of these name-value pairs is not relevant but the values have to match your Keptn project, service and stage. In the example dashboard you see that this dashboard matches the project simpleproject, service simplenode, and stage staging.
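The order-independent name matching can be sketched in Go as follows. This is a minimal illustration that assumes the semicolon-separated format KQG;project=...;service=...;stage=... shown earlier in this readme; the function name matchesKeptnDashboard is hypothetical.

```go
package main

import "strings"

// matchesKeptnDashboard reports whether a dashboard name starts with KQG
// and carries name-value pairs matching the given project, service and
// stage, regardless of the order of the pairs.
func matchesKeptnDashboard(name, project, service, stage string) bool {
	parts := strings.Split(name, ";")
	if strings.TrimSpace(parts[0]) != "KQG" {
		return false
	}
	pairs := map[string]string{}
	for _, part := range parts[1:] {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) == 2 {
			pairs[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
		}
	}
	return pairs["project"] == project &&
		pairs["service"] == service &&
		pairs["stage"] == stage
}
```

Both KQG;project=simpleproject;service=simplenode;stage=staging and KQG;stage=staging;project=simpleproject;service=simplenode would match the example project in this sketch.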

2. Management Zone Filter

If you are building a dashboard specific to an application or part of your environment, it is a good practice to set a default management zone filter for your dashboard. The dynatrace-sli-service will use that filter. This can either be a custom created management zone or - like in the example above - the one that Keptn creates in case you use Keptn for the deployment.

3. Markdown with SLO Definitions

The dashboard is not only used to define which metrics should be evaluated (list of SLIs), it is also used to define the individual SLOs and global settings for the SLO, e.g., Total Score goals or Comparison Rules. These are settings you normally have in your slo.yaml. To specify those settings simply create a markdown tile that contains name-value pairs like in the example dashboard.

Here is the text from the markdown tile you see in the screenshot:

KQG.Total.Pass=90%;KQG.Total.Warning=75%;KQG.Compare.WithScore=pass;KQG.Compare.Results=1;KQG.Compare.Function=avg

It is not mandatory to define them as there are defaults for all of them. Here is a table that gives you the details on each setting:

| Setting | Default | Comment |
|---|---|---|
| KQG.Total.Pass | 90% | Specifies total pass goal of your SLO |
| KQG.Total.Warning | 75% | Specifies total warning goal of your SLO |
| KQG.Compare.Results | 1 | Against how many previous builds to compare your result to? |
| KQG.Compare.WithScore | pass | Which previous builds to include in the comparison: pass, pass_or_warn or all |
| KQG.Compare.Function | avg | When comparing against multiple builds which aggregation should be used: avg, p50, p90, p95 |
| KQG.QueryBehavior | | A dashboard is always parsed for SLIs & SLOs even if it has not changed. To only parse it when changes occurred use 'ParseOnChange' |

4. Tiles with SLI definition

The dynatrace-sli-service analyzes every tile but only includes those in the SLI/SLO analysis where the tile name includes the name-value pair: sli=sliprefix

If you look at the example dashboard screenshot, you see some tiles that have the sli=sliprefix and some that don't. This allows you to build dashboards that you can extend with metrics that should not be included in your SLI/SLO validation.

Similar to the markdown tile, each tile can define several configuration elements. The only mandatory one is sli=sliprefix. Here are a couple of examples of possible values. Each starts with a human-readable name that is not included in the analysis but makes the dashboard easier to read:

Test Step Response Time;sli=teststep_rt;pass=<500;warning=<1000;weight=2
Process Memory;sli=process_memory
Response time (P50);sli=svc_rt_p95;pass=<+10%,<500

| Setting | Sample Value | Comment |
|---|---|---|
| sli | test_rt | This will become the SLI name, e.g. test_rt. If the chart includes metrics split by dimensions, the value is a prefix and each dimension will be appended, e.g. test_rt_teststep1, test_rt_teststep2 |
| pass | <500,<+10% | This can be a comma-separated list which allows you to specify multiple criteria as you can also do in the slo.yaml. You are also allowed to specify multiple pass name/value pairs, which will result in multiple criteria just as allowed in the slo.yaml spec |
| warning | <1000 | Same as with pass |
| weight | 1 | Allows you to define a weight of the SLI. Default is 1 |
| key | true | If true, this SLI becomes a key SLI. Default is false |
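To show how such a tile title breaks down into SLO settings, here is a small Go sketch. It is only an approximation of the rules in the table above: the tileSLO type and parseTileTitle function are hypothetical, and the real parser may handle edge cases differently.

```go
package main

import (
	"strconv"
	"strings"
)

// tileSLO is a hypothetical representation of the settings in a tile title.
type tileSLO struct {
	DisplayName string     // human-readable part before the first ';'
	SLI         string     // SLI name or prefix
	Pass        [][]string // one criteria list per pass=... pair
	Warning     [][]string // one criteria list per warning=... pair
	Weight      int        // defaults to 1
	KeySLI      bool       // defaults to false
}

// parseTileTitle parses a title such as
// "Test Step Response Time;sli=teststep_rt;pass=<500;warning=<1000;weight=2".
// The second return value is false if the tile has no sli=... pair and
// should therefore be ignored.
func parseTileTitle(title string) (tileSLO, bool) {
	parts := strings.Split(title, ";")
	result := tileSLO{DisplayName: strings.TrimSpace(parts[0]), Weight: 1}
	for _, part := range parts[1:] {
		// Split on the first '=' only, so criteria like "<=0" stay intact.
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			continue
		}
		key, value := strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1])
		switch key {
		case "sli":
			result.SLI = value
		case "pass":
			result.Pass = append(result.Pass, strings.Split(value, ","))
		case "warning":
			result.Warning = append(result.Warning, strings.Split(value, ","))
		case "weight":
			if w, err := strconv.Atoi(value); err == nil {
				result.Weight = w
			}
		case "key":
			result.KeySLI = value == "true"
		}
	}
	return result, result.SLI != ""
}
```

A tile titled "Process Memory;sli=process_memory" yields an SLI with the default weight of 1 and no criteria, while a tile without any sli=... pair is skipped.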

5. Tile examples

Here are a couple of examples of tiles and how they translate into sli.yaml and slo.yaml definitions:

1: Service Response Time (p95)

  • Results in an sli.yaml like this:

    svc_rt_p95: metricSelector=builtin:service.response.time:percentile(50):names;entitySelector=type(SERVICE),mzId(-8783122447839702114)
    
  • And an slo.yaml definition like this:

    - sli: svc_rt_p95
      pass:
        - criteria:
            - "<+10%"
            - "<600"
      weight: 1
      key_sli: false
    

2: Test Step Response Time

  • Results in an SLI definition like this:

    teststep_rt_Basic_Check: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,Basic Check));entitySelector=type(SERVICE),mzId(-8783122447839702114)"
    teststep_rt_echo: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,echo));entitySelector=type(SERVICE),mzId(-8783122447839702114)"
    teststep_rt_homepage: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,homepage));entitySelector=type(SERVICE),mzId(-8783122447839702114)"
    teststep_rt_invoke: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,invoke));entitySelector=type(SERVICE),mzId(-8783122447839702114)"
    teststep_rt_version: "metricSelector=calc:service.teststepresponsetime:merge(0):avg:names:filter(eq(Test Step,version));entitySelector=type(SERVICE),mzId(-8783122447839702114)"
    
  • And an SLO like this:

        - sli: teststep_rt_invoke
          pass:
            - criteria:
                - "<500"
          warning:
            - criteria:
                - "<1000"
          weight: 2
          key_sli: false
        - sli: teststep_rt_version
          pass:
            - criteria:
                - "<500"
          warning:
            - criteria:
                - "<1000"
          weight: 2
          key_sli: false
          ...
    

Support for SLO Tiles

SLOs in Dynatrace are a new feature to monitor SLOs in production and report on status and error budget. As explained above, the dynatrace-sli-service already provides support for querying the SLO and returning the evaluatedPercentage field. All you need to do is add the SLO tile to your dashboard and it will be included. The dynatrace-sli-service will not only return the value but also use the warning and pass criteria defined in the SLO definition for the slo.yaml for Keptn.

Support for Problem Tiles

A great use case is to validate whether there are any open problems in a given environment as part of your Keptn Quality Gate Evaluation. As described above, the dynatrace-sli-service supports querying the number of problems that have a certain status using Dynatrace's Problem API v2. To include the open problem count that matches your dashboard's management zone, you can simply add the "Problems" tile to your dashboard. If this tile is on the dashboard you will get an SLI with the name problems, whose value is the total count of open problems. The default SLO is that problems is a key SLI with a pass criteria of <=0. This results in the following slo.yaml entry being generated:

objectives:
- sli: problem_open
  pass:
  - criteria:
    - <=0
  key_sli: true

Support for USQL Tiles

The dynatrace-sli-service also supports Dynatrace USQL tiles. The query will be executed as defined in the dashboard for the given timeframe of the SLI evaluation.

There are just some things to know for the different USQL result types:

Tile Type    | Comment
Single       | Just a single value
Pie Chart    | Takes dimension name and value
Column Chart | First column is considered the dimension and the second column the value
Table        | First column is considered the dimension and the last column the value
Funnel       | Currently not supported

Here is an example with two USQL Tiles showing a single value of a query:

This will translate into two SLIs called camp_adoption and camp_conv. The SLO definition is the same as explained above with regular time series.

Steps to set up a Keptn project for SLI/SLO Dashboards

This should work with any existing Keptn project you have. Just make sure you have the dynatrace-sli-service enabled for your project. Then create a dashboard as explained above that the dynatrace-sli-service can match to your project/service/stage.

Up to Keptn 0.7.2: if you start from scratch and have never run an evaluation in your project, make sure you upload an empty slo.yaml to your service. Otherwise the Lighthouse service skips the evaluation and never triggers the dynatrace-sli-service. This is a one-time initialization effort. Here is an empty slo.yaml you can use:

---
spec_version: '0.1.0'
comparison:
objectives:

Also check out the samples folder of this repo with some additional helper files and the exported dashboard from the example above.

Development

  • Get dependencies: go mod download
  • Build locally: go build -v -o dynatrace-sli-service ./cmd/
  • Run tests: go test -race -v ./...
  • Run local: ENV=local ./dynatrace-sli-service

Known Limitations

  • The Dynatrace Metrics API provides data following an "eventual consistency" approach. Therefore, the metrics data retrieved can be incomplete or even inconsistent for time frames that are within two hours of the current datetime. Usually, it takes a minute for the data to catch up, but in extreme situations this might not be enough. We try to mitigate that by delaying calls to the Metrics API by 60 seconds.

  • This service uses the Dynatrace Metrics v2 API by default but can also parse v1 metrics queries. If you use the v1 query language, the dynatrace-sli-service logs warnings that encourage you to update your queries to v2. More information about the Metrics v2 API can be found in the Dynatrace documentation.

dynatrace-sli-service's People

Contributors

agrimmer, arthurpitman, bacherfl, christian-kreuzberger-dtx, dependabot[bot], gabrielprioli, grabnerandi, johannes-b, tannergabriel, warber

dynatrace-sli-service's Issues

"Response time degredation in /" when spamming get-sli events

Using Keptn 0.8.2 and dynatrace-sli-service 0.8.0.

After a quick and dirty load-test (while true; do keptn trigger evaluation --project=sockshop --stage=staging --service=carts --timeframe=5m; sleep 5; done) with Dynatrace monitoring active, I noticed that dynatrace-sli-service performs poorly under sustained load:

image

This leads to get-sli events not being processed (or processed very very late).

The root-cause of this is most likely external API Calls (e.g., to Dynatrace API, but also to Keptn's configuration-service when calling GetKeptnResource which makes multiple API Calls to configuration-service).

However, those API Calls could be decoupled by putting them inside of a go-routine.

Definition of Done

  • dynatrace-sli-service uses a go-routine to perform long-waiting tasks such as contacting configuration-service
  • Dynatrace does not report a Response time degradation for dynatrace-sli-service any more (or: response time is reasonably low, in the area of 100 milliseconds)

Dynatrace-SLI-Service expects DT_PAAS_TOKEN in secret which is not mandatory for quality-gate-only

With the recent enhancements of dynatrace-service & dynatrace-sli-service, these two services only work if DT_TENANT, DT_API_TOKEN and DT_PAAS_TOKEN are stored in the dynatrace secret. For quality-gate-only installations DT_PAAS_TOKEN is typically not defined.

The problematic code is in common.go:GetDTCredentials, which should only validate DT_TENANT & DT_API_TOKEN and not make DT_PAAS_TOKEN mandatory!

We should fix this, provide a new version of both dynatrace-sli-service and dynatrace-service, and make a note in the release notes about known issues.

Workaround
The workaround is to specify a DT_PAAS_TOKEN in the dynatrace secret. The value can be a bogus value, as the correctness of the token is not validated - just its existence!
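
A minimal sketch of the relaxed validation suggested in this issue. `validateCredentials` is a hypothetical helper, not the actual GetDTCredentials implementation, and the secret is modelled as a plain map for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// validateCredentials checks only the keys needed for quality gates.
// DT_PAAS_TOKEN is deliberately not required.
func validateCredentials(secret map[string]string) error {
	for _, key := range []string{"DT_TENANT", "DT_API_TOKEN"} {
		if secret[key] == "" {
			return errors.New("missing required key " + key)
		}
	}
	return nil
}

func main() {
	// A quality-gate-only secret without DT_PAAS_TOKEN (placeholder values).
	secret := map[string]string{
		"DT_TENANT":    "your_tenant_id.live.dynatrace.com",
		"DT_API_TOKEN": "XYZ123456789",
	}
	if err := validateCredentials(secret); err != nil {
		fmt.Println("invalid secret:", err)
		return
	}
	fmt.Println("secret accepted")
}
```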

Parsing the Dynatrace SLO Dashboard shouldn't be case sensitive for sli/slo metadata

Right now, when the dynatrace service parses a Dynatrace dashboard for SLI/SLO information, it parses the header of a chart for metadata such as the SLI name and the pass and warning criteria. It seems that the current implementation expects all-lowercase key/value pairs, e.g: sli=sliname;warn=expression. If it is accidentally written like SLI=sliname, this configuration is ignored.

I suggest making the names of these configuration elements case insensitive to avoid accidental typing mistakes. I just ran into it and it took me a while to figure out why my dashboard wasn't parsed correctly.
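
One way to make the keys case insensitive is to lower-case them while parsing. The sketch below is illustrative only; `parseTileTitle` is a hypothetical function and does not reflect the service's real parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTileTitle splits a chart title such as
// "Response time;SLI=rt;Pass=<500" into lower-cased keys and their values.
// Segments without "=" (for example the display name) are skipped.
func parseTileTitle(title string) map[string]string {
	result := map[string]string{}
	for _, part := range strings.Split(title, ";") {
		// Split on the first "=" only, so criteria like "<=800" stay intact.
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			continue
		}
		key := strings.ToLower(strings.TrimSpace(kv[0]))
		result[key] = strings.TrimSpace(kv[1])
	}
	return result
}

func main() {
	meta := parseTileTitle("Response time;SLI=rt;Warn=<=800")
	fmt.Println(meta["sli"], meta["warn"])
}
```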

Allow defining SLIs in a Dynatrace dashboard vs sli.yaml

While sli.yaml is a good solution for defining SLIs, it is a bit hard to get all the Dynatrace queries right.
A better approach would be to have the Dynatrace SLI Service query for the existence of a Dynatrace dashboard with the tags "keptn_project:myproject", "keptn_stage:staging", "keptn_service:myservice".
The Dynatrace SLI Service can then parse the dashboard and extract both the SLI.yaml and the SLO.yaml definitions from it. Here is a screenshot containing a metric on a chart. The name and objective criteria are defined in the chart title:

image

If the SLI provider finds a dashboard, I propose that it generates an SLI.yaml as well as an SLO.yaml and stores them back to the Configuration Service so that every change is also automatically version controlled.

add https_proxy/http_proxy env var support

Container image is missing packages needed to allow alpine linux to connect outside the cluster using an http proxy.

Adding the following packages to the Dockerfile and rebuilding image fixed the problem.

apk add curl netcat-openbsd

Upgrade keptn/distributor to 0.8.4 for properly showing up at the integrations page in Keptn Bridge

In Keptn 0.8.4, the distributor will be extended with the functionality of registering itself as a Keptn uniform integration at the Keptn's Uniform API.

Related Video with short Tutorial (part of community meeting on June 17th): https://youtu.be/oZlf1v5qUvc?t=436

Goal: The integration/service should be visible in Keptn's Bridge Uniform screen:
image

If you have any questions, please reply to keptn/keptn#4418


To enable this feature, the following changes need to be made:

First, the image of the distributor container of the deployment needs to be set to keptn/distributor:0.8.4:

        - name: distributor
          image: keptn/distributor:0.8.4

Second, locate the env section of the distributor container:

        - name: distributor
          image: keptn/distributor:0.8.4
          resources: ...
          env:
            ...

and add the following environment variables:

            - name: VERSION
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: 'metadata.labels[''app.kubernetes.io/version'']'
            - name: K8S_DEPLOYMENT_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: 'metadata.labels[''app.kubernetes.io/name'']'
            - name: K8S_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName

Last but not least, ensure that the labels app.kubernetes.io/version and app.kubernetes.io/name are available under spec.template.metadata.labels in the K8s deployment:

        app.kubernetes.io/name: dynatrace-service
        app.kubernetes.io/version: 0.14.1

You can find a complete example of deployment.yaml (or service.yaml) here: https://github.com/keptn-contrib/unleash-service/blob/release-0.3.2/deploy/service.yaml

Allow keptn placeholders in dtCreds

This was brought up in a conversation with @leonvzGit - he had the requirement to also use placeholders, e.g: $PROJECT, $LABEL.ENV ... in dtCreds.

The use case is that they have different Dynatrace Tenants per Environment, e.g: dynatrace-prod, dynatrace-preprod.
The name of the Environment is always passed to Keptn via a Label, e.g: environment=prod, environment=preprod
They therefore define secrets called dynatrace-prod & dynatrace-preprod that hold the Dynatrace credentials.

To use the label to select the Dynatrace tenant, he suggested allowing a dynatrace.conf.yaml like this:

dtCreds: dynatrace-$LABEL.Environment
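
A rough sketch of how such placeholder substitution might work. `replacePlaceholders` here is a hypothetical helper written for illustration, not the service's own function, and it matches label names exactly (whether `$LABEL.Environment` should also match a label named `environment` is left open by the issue).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replacePlaceholders substitutes $PROJECT and $LABEL.<name> tokens in s.
// Longer label names are replaced first so that a label "env" cannot
// clobber the prefix of "$LABEL.environment".
func replacePlaceholders(s, project string, labels map[string]string) string {
	s = strings.ReplaceAll(s, "$PROJECT", project)
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool { return len(names[i]) > len(names[j]) })
	for _, name := range names {
		s = strings.ReplaceAll(s, "$LABEL."+name, labels[name])
	}
	return s
}

func main() {
	labels := map[string]string{"environment": "prod"}
	fmt.Println(replacePlaceholders("dynatrace-$LABEL.environment", "myproject", labels))
}
```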

Provide a feature flag for disabling SSL verification

As a user, I would like to be able to use the Dynatrace API which only has a self-signed certificate.

Tasks:

  • Introduce an environment variable HTTP_SSL_VERIFY, which allows disabling the SSL verification.
  • Provide documentation on how to disable SSL verification.

Update Service to use CloudEvents 1.0

The current implementation of the Service uses CloudEvents 0.2. Starting with Keptn 0.8, we will switch to CloudEvents 1.0.

Tasks:

Definition of Done:

  • Events are sent and received using the CloudEvents 1.0 specification

Enable/disable the feature of SLO/SLI generation from a DT-Dashboard

Status quo:
When receiving a sh.keptn.internal.event.get-sli.done event, the dynatrace-sli-service always queries Dynatrace for all dashboards to determine whether there is one for the service/project/stage to evaluate. This can become a time-consuming and API-heavy step in very large environments.

Improvement:
Introduce a flag as an environment variable that controls this behavior:
FETCH_SLO_SLI_FROM_DASHBOARD: [enabled | disabled | initial]

  • enabled: Behavior as explained above. With each sh.keptn.internal.event.get-sli.done event, find the corresponding dashboard in Dynatrace.
  • disabled: No interaction with the dynatrace/dashboard API
  • initial: When the array of received SLI is empty, the dynatrace-service queries Dynatrace for the corresponding dashboard.
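
The three modes above boil down to a small decision function. The following is only a sketch: `shouldQueryDashboard` is a hypothetical name, and treating unset or unknown values like "enabled" is an assumption made to preserve the current behavior.

```go
package main

import "fmt"

// shouldQueryDashboard decides whether the dashboard API should be called
// for the given FETCH_SLO_SLI_FROM_DASHBOARD mode.
func shouldQueryDashboard(mode string, receivedSLIs []string) bool {
	switch mode {
	case "disabled":
		return false
	case "initial":
		// Only look for a dashboard when no SLIs were received.
		return len(receivedSLIs) == 0
	default:
		// "enabled" and, by assumption, any unset or unknown value.
		return true
	}
}

func main() {
	fmt.Println(shouldQueryDashboard("initial", nil))
	fmt.Println(shouldQueryDashboard("initial", []string{"response_time_p95"}))
}
```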

Default SLI metric for error_rate is actually failed request count and not error rate

When no SLIs are uploaded, the sli-service falls back to 5 default metrics. One of them is called error_rate. The problem is that the current implementation uses the Dynatrace built-in metric metricSelector=builtin:service.errors.total.count instead of builtin:service.errors.server.rate. This leads to problems when relying on the defaults, because the returned SLI is not a rate between 0 and 100 but just a count.

I suggest to change the default to the correct rate metric: builtin:service.errors.server.rate

Create a Helm Chart for installing dynatrace-sli-service instead of Kubernetes deployment manifests

In order to align the dynatrace-sli-service with the dynatrace-service, it should provide a Helm Chart for installation (see keptn-contrib/dynatrace-service#261)

This Helm Chart should have all Environment variables provided as values.

Definition of Done

  • Make sure that skaffold works with the provided Helm Chart (see solution: keptn/keptn#3421)
  • Helm Chart for dynatrace-sli-service with respective values provided
  • README updated with usage examples
  • When releasing a new version, the Helm Chart needs to be an asset of the release (automation?)
  • Tutorials on tutorials.keptn.sh updated with new install instructions

Use k8s ServiceAccount with a restricted set of permissions instead of default

Currently, the service uses the default Service Account with a set of high privileges inside the k8s clusters.

Use a dedicated service account with a minimum set of privileges as described in our contributing guidelines here keptn-sandbox/contributing#2

Definition of Done:

  • Service either uses the keptn-default Service Account which has no privileges at all or a new, dedicated Service Account with a minimal set of permissions

Changes made for Release 0.7.3 need to be merged into master

The 0.7.3 release contains features that did not make it into master.

Task:

  • Research on implementation: What has been changed? Issues/Bug-Fixes
  • Merge 0.7.3 features into master branch
  • Implement tests for the changes

Definition of Done:

  • The master branch is feature-complete with 0.7.3

Dynatrace SLI Provider doesn't replace $DEPLOYMENT with actual deployment type

When defining an SLI query that uses the $DEPLOYMENT placeholder, the Dynatrace SLI service should replace that placeholder with the actual value of the "deployment" field of the GetSLI event. Currently this doesn't happen because that value is not passed to the internal structure used by the replacePlaceholders function.

The missing initialization is here: https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/cmd/main.go#L306
The assignment keptnEvent.Deployment = eventData.Deployment is missing.

This then impacts https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/common/common.go#L119 because Deployment is always empty

The impact is that the metric queries the SLI service executes do not filter on the correct service for that deployment type. Especially for Blue/Green deployments this means that instead of retrieving the values of, e.g., "Canary", the SLI value will be the aggregate of Canary and Primary, because the deployment is an empty string.

I looked back in the code and this issue seems to have been in the code for a while but it was not noticed so far!

Implement dashboard parsing support for native Dynatrace SLO Definitions

Dynatrace is introducing a native concept of SLOs where SLOs can be defined as a metric, timeframe + pass / warning criteria.
As a Dynatrace SLI Service user, I want to be able to put an SLO tile on a Dynatrace dashboard and have the Dynatrace SLI Service use that SLO definition for Keptn.

Here is such a tile
image

The dashboard definition includes the link to the SLO Definition ID

{
        "name": "Service-level objective",
        "tileType": "SLO",
        "configured": true,
        "bounds": {
            "top": 532,
            "left": 0,
            "width": 304,
            "height": 152
        },
        "tileFilter": {
            "timeframe": "-1w"
        },
        "assignedEntities": ["11111111-e29a-3d4c-a6a8-874557a52c05"]
    }]

There is a new Dynatrace API that allows us to query either all SLOs or a specific SLO, including a custom timeframe. This can then be used directly to query the SLI values from Dynatrace for the given timeframe. Here is the API example:

curl -X GET "https://demo.dev.dynatracelabs.com/api/v2/slo/11111111-e29a-3d4c-a6a8-874557a52c05?from=now-1w&to=now" -H "accept: application/json; charset=utf-8" -H "Authorization: Api-Token xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Here is the response:

{
  "id": "11111111-e29a-3d4c-a6a8-874557a52c05",
  "name": "SLO1",
  "description": "My SLO Definition",
  "evaluatedPercentage": 89.98165722494267,
  "status": "FAILURE",
  "errorBudget": -9.998342775057338,
  "targetSuccess": 99.98,
  "targetWarning": 99.99,
  "evaluationType": "AGGREGATE",
  "timeWindow": "-1w",
  "filter": ""
}
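
A response like the one above can be decoded with a small struct. This sketch covers only a subset of the fields shown; the type name `sloResult` and the helper `parseSLO` are illustrative and not taken from the service.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sloResult mirrors a subset of the fields in the SLO API response above.
type sloResult struct {
	ID                  string  `json:"id"`
	Name                string  `json:"name"`
	EvaluatedPercentage float64 `json:"evaluatedPercentage"`
	Status              string  `json:"status"`
	TargetSuccess       float64 `json:"targetSuccess"`
	TargetWarning       float64 `json:"targetWarning"`
}

// parseSLO decodes the JSON body of an SLO API response.
func parseSLO(body []byte) (sloResult, error) {
	var r sloResult
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	body := []byte(`{"id":"11111111-e29a-3d4c-a6a8-874557a52c05","name":"SLO1",
		"evaluatedPercentage":89.98165722494267,"status":"FAILURE",
		"targetSuccess":99.98,"targetWarning":99.99}`)
	slo, err := parseSLO(body)
	if err != nil {
		fmt.Println("cannot parse SLO response:", err)
		return
	}
	// evaluatedPercentage would become the SLI value.
	fmt.Println(slo.Name, slo.EvaluatedPercentage, slo.Status)
}
```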

Send back a get-sli.done with status failed when no dynatrace credentials could be retrieved

When sending a start-evaluation cloud-event via Keptn 0.6.2, the service does not respond at all if no dynatrace credentials are configured.

It should be easy enough to return a get-sli.done event with each SLO having a failed status, as no dynatrace credentials were set.

This would improve the user-experience in the case of forgetting to set the credentials.

This is currently handled in the retrieveMetrics function here:

if err != nil {
	stdLogger.Debug(err.Error())
	stdLogger.Debug("Failed to fetch global Dynatrace credentials, exiting.")
	return err
}

keptnEvent.Deployment is empty (instead of direct/canary), leads to a non-working metrics query

Related to keptn/spec#68 and keptn/keptn#3411

Using dynatrace-sli-service 0.8.0 (from release-0.8.0 branch) and Keptn 0.8.0-rc1, I noticed that there is a problem with querying metrics for a full continuous-delivery use-case.

In Keptn 0.7.x, the get-sli CloudEvent contained the attribute Deployment with the content "direct" or "canary" or "primary" (depending on the use-case) - see https://github.com/keptn/spec/blob/0.1.7/cloudevents.md#example-11 .

With Keptn 0.8.x, this has changed, and the get-sli.triggered CloudEvent looks as follows:

{
  "data": {
    "get-sli": {
      "end": "2021-03-01T12:40:40Z",
      "indicators": [
        "response_time_p95"
      ],
      "sliProvider": "dynatrace",
      "start": "2021-03-01T12:40:16Z"
    },
    "project": "tempberry",
    "service": "tempberry-backend",
    "stage": "hardening"
  },
  "id": "831329d5-2bff-4fce-894c-8e5209b3d9d4",
  "source": "lighthouse-service",
  "specversion": "1.0",
  "time": "2021-03-01T12:40:45.436Z",
  "type": "sh.keptn.event.get-sli.triggered",
  "shkeptncontext": "bafd72ba-f570-4be8-ab55-da29ff435544"
}

In combination with the sli.yaml in use, where we use the placeholder $DEPLOYMENT, this leads to a metrics query that looks for keptn_deployment:"" instead of keptn_deployment:"direct" or keptn_deployment:"canary" or keptn_deployment:"primary" and doesn't return anything.

Definition of Done

  • Once keptn/keptn#3411 has been implemented, dynatrace-sli-service needs to read the optional deployment payload and extract DeploymentNames "direct", "canary" or "primary"

Error when using secret format from-file (as described in README) with 0.4.0/0.4.1

When creating secrets according to the README using:

DT_TENANT: your_tenant_id.live.dynatracelabs.com
DT_API_TOKEN: XYZ123456789
kubectl create secret generic dynatrace-credentials-<project> -n "keptn" --from-file=dynatrace-credentials=your-dynatrace-creds.yaml

The service fails to read the secret. Right now we only support the format which is also used by the dynatrace-service:

kubectl -n keptn create secret generic dynatrace-credentials-<project> --from-literal="DT_TENANT=$DT_TENANT" --from-literal="DT_API_TOKEN=$DT_API_TOKEN"

I'm fine with no longer supporting the old format and switching to the new format; however, we need to update the README and docs to use the new format.

It doesn't correctly escape special characters

When a Keptn service is registered with a name that includes special characters, evaluation fails for the event type "sh.keptn.events.evaluation-done" with the API error message "Dynatrace Metrics API returned 0 result values, expected 1".
When creating a service with 'keptn create service' (used for quality gates where Keptn only needs to know the service name), the service name can consist of any characters.
But during evaluation, the dynatrace-sli-service runs into a problem.
Here is the log of the dynatrace-sli-service from a run that failed because special characters are not escaped correctly.

2020/05/21 05:48:55 Finished fetching metrics; Sending event now ...
trying to fetch metric builtin:service.response.time:merge(0):percentile(95)
{"timestamp":"2020-05-21T05:48:55.833992356Z","logLevel":"ERROR","message":"Dynatrace Metrics API returned 0 result values, expected 1"}
{"timestamp":"2020-05-21T06:04:09.987444972Z","logLevel":"INFO","message":"Retrieving Dynatrace timeseries metrics"}
2020/05/21 06:04:09 secrets "dynatrace-credentials-dynatrace" not found
{"timestamp":"2020-05-21T06:04:09.993086007Z","logLevel":"DEBUG","message":"Could not find secret 'dynatrace-credentials-dynatrace' in namespace keptn."}
{"timestamp":"2020-05-21T06:04:09.993286481Z","logLevel":"DEBUG","message":"Failed to fetch Dynatrace credentials for project, falling back to global credentials."}
{"timestamp":"2020-05-21T06:04:09.996082023Z","logLevel":"INFO","message":"Dynatrace credentials (Tenant, Token) received. Getting global custom queries ..."}
{"timestamp":"2020-05-21T06:04:09.996330723Z","logLevel":"INFO","message":"Checking for custom SLI queries"}
{"timestamp":"2020-05-21T06:04:10.059801837Z","logLevel":"INFO","message":"Fetching indicator: response_time_p95"}
Querying metric response_time_p95
Getting timeseries config for metric response_time_p95
Old=builtin:service.response.time:merge(0):percentile(95), new=builtin%3Aservice.response.time%3Amerge%280%29%3Apercentile%2895%29
TargetURL= https://pir290.dynatrace-managed.com/e/51b39109-3233-43e5-8b26-eb20bc212b9d/api/v2/metrics/series/builtin%3Aservice.response.time%3Amerge%280%29%3Apercentile%2895%29?from=1590040737000&resolution=Inf&scope=tag%28%5BEnvironment%5Dkeptn_project%3Adynatrace%29%2Ctag%28%5BEnvironment%5Dkeptn_stage%3Aperformance-test%29%2Ctag%28%5BEnvironment%5Dkeptn_service%3Atracing-jmeter-java-test%29%2Ctag%28%5BEnvironment%5Dkeptn_deployment%3A%29&to=1590040748000
Request finished, parsing body...
trying to fetch metric builtin:service.response.time:merge(0):percentile(95)

FYI: On the service in Dynatrace you can see that the keptn_service tag value contains the special character "+".

Only parse Dynatrace dashboard if it was changed from previous run

If the dynatrace-sli-service finds a matching dashboard based on the dashboard parameter, it always parses the dashboard, makes calls to the Metrics Description API to get all necessary information about data types, dimensions ... and then generates the SLI.yaml and SLO.yaml.

If a dashboard doesn't change between two evaluation runs, the dynatrace-sli-service does this work even though it is not necessary, as the current SLI.yaml and SLO.yaml already contain all the correct queries.
The current behavior also results in a lot of unnecessary Dynatrace API calls to the Metrics Description API.

To optimize this behavior, the service should first check whether the dashboard has changed. As the dashboard.json is stored in the Keptn configuration repo, it is as simple as comparing the two JSON documents. If nothing has changed, just use the existing SLI.yaml & SLO.yaml. If something has changed, process the dashboard and generate new SLI & SLO yamls.
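
The comparison could be done on the decoded JSON rather than on raw bytes, so that whitespace and key order do not count as changes. This is a sketch of that idea only; `dashboardChanged` is a hypothetical helper, and whether a semantic or byte-wise comparison is preferable is a design choice the issue leaves open.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// dashboardChanged reports whether two dashboard JSON documents differ,
// ignoring whitespace and object key order.
func dashboardChanged(stored, current []byte) (bool, error) {
	var a, b interface{}
	if err := json.Unmarshal(stored, &a); err != nil {
		return false, err
	}
	if err := json.Unmarshal(current, &b); err != nil {
		return false, err
	}
	return !reflect.DeepEqual(a, b), nil
}

func main() {
	stored := []byte(`{"id":"abc","tiles":[{"name":"Markdown"}]}`)
	current := []byte(`{ "tiles": [ {"name": "Markdown"} ], "id": "abc" }`)
	changed, err := dashboardChanged(stored, current)
	if err != nil {
		fmt.Println("cannot compare dashboards:", err)
		return
	}
	fmt.Println("dashboard changed:", changed)
}
```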

SLI-Service ignores filter of metric definition

When the SLI-service parses a dashboard for the quality gate use case, it ignores the filters defined on a metric and takes all available entities.

In our case we wanted to filter one single service, which can be seen in the last section of the provided dashboard-json:

{
    "metadata": {
        "configurationVersions": [3],
        "clusterVersion": "1.200.89.20200826-104932"
    },
    "id": "9786cacb-b172-4343-9dc7-84c8f42ae5d7",
  "dashboardMetadata": {
    "name": "KQG;project=myproj;service=myservice;stage=staging",
    "shared": false,
    "owner": "user",
    "sharingDetails": {
      "linkShared": true,
      "published": false
    },
    "dashboardFilter": {
      "timeframe": ""
    }
  },
    "tiles": [{
        "name": "Markdown",
        "tileType": "MARKDOWN",
        "configured": true,
        "bounds": {
            "top": 0,
            "left": 0,
            "width": 836,
            "height": 76
        },
        "tileFilter": {},
        "markdown": "KQG.Total.Pass=90%;KQG.Total.Warning=75%;KQG.Compare.WithScore=pass;KQG.Compare.Results=1;KQG.Compare.Function=avg\n"
    }, {
        "name": "",
        "tileType": "CUSTOM_CHARTING",
        "configured": true,
        "bounds": {
            "top": 114,
            "left": 0,
            "width": 836,
            "height": 456
        },
        "tileFilter": {},
        "filterConfig": {
            "type": "MIXED",
            "customName": "Server side response time | ;sli=median_response_time;pass=<+10%,<500;key=true",
            "defaultName": "Custom chart",
            "chartConfig": {
                "legendShown": true,
                "type": "TIMESERIES",
                "series": [{
                    "metric": "builtin:service.response.server",
                    "aggregation": "PERCENTILE",
                    "percentile": 50,
                    "type": "LINE",
                    "entityType": "SERVICE",
                    "dimensions": [{
                        "id": "0",
                        "name": "dt.entity.service",
                        "values": [],
                        "entityDimension": true
                    }],
                    "sortAscending": false,
                    "aggregationRate": "TOTAL",
                    "sortColumn": true
                }],
                "resultMetadata": {}
            },
            "filtersPerEntityType": {
                "SERVICE": {
                    "SPECIFIC_ENTITIES": ["SERVICE-2BDFB35714B81801"]
                }
            }
        }
    }]
}

Something like this was the result of the evaluation 😅
column_of_death

Too strict check of endtime to be in the future

When the system that sends a start-evaluation event with an end timestamp of time.Now() is slightly ahead of the k8s cluster where Keptn runs, the Dynatrace SLI Service errors out with "endtime is in the future".
This happened to me when running through some of the quality gate tutorials that are using the keptn cli to send a start-evaluation event. Like this:

keptn send event start-evaluation --project=qgproject --stage=qualitystage --service=evalservice --timeframe=10m

As my laptop is about 1.5s ahead of the EKS cluster where Keptn runs, I can never get this use case to work. As the SLI Service has a built-in 120s retry to Dynatrace anyway to avoid this situation, I suggest easing that time restriction to e.g: 30s. This will solve the problem and allow users to use the keptn CLI for kicking off quality gates even though their local clock might be off by a couple of seconds.

Parsing of dynatrace.conf.yaml doesn't work correctly

Currently the struct tags of the DynatraceConfigFile struct are not working properly due to the placement of omitempty:

type DynatraceConfigFile struct {
SpecVersion string `json:"spec_version" yaml:"spec_version"`
DtCreds string `json:"dtCreds",omitempty yaml:"dtCreds",omitempty`
Dashboard string `json:"dashboard",omitempty yaml:"dashboard",omitempty`
}

This should be fixed to make the parsing more robust.

0.8.0-alpha: Uncaught runtime Error when querying SLIs due to invalid url: invalid memory address or nil pointer dereference

I just got this when trying to fetch SLIs for sockshop (I called the project sockshop-dt on my installation).

{"timestamp":"2020-12-28T15:55:05.610500575Z","logLevel":"INFO","message":"Secret 'dynatrace' with credentials found, returning (https://ncl02507.sprint.dynatracelabs.com\n) ..."}
{"timestamp":"2020-12-28T15:55:05.770440682Z","logLevel":"INFO","message":"No custom SLI queries for project=sockshop-dt,stage=staging,service=carts found as no dynatrace/sli.yaml in repo. Going with default!"}
{"timestamp":"2020-12-28T15:55:05.770526087Z","logLevel":"INFO","message":"Fetching indicator: response_time_p95"}
{"timestamp":"2020-12-28T15:55:05.770691427Z","logLevel":"DEBUG","message":"Retrieved SLI config for response_time_p95: metricSelector=builtin:service.response.time:merge(0):percentile(95)\u0026entitySelector=type(SERVICE),tag(keptn_project:$PROJECT),tag(keptn_stage:$STAGE),tag(keptn_service:$SERVICE),tag(keptn_deployment:$DEPLOYMENT)"}
{"level":"error","ts":1609170905.771217,"logger":"fallback","caller":"client/invoker.go:66","msg":"call to Invoker.Invoke(...) has panicked: runtime error: invalid memory address or nil pointer dereference","stacktrace":"github.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2.1\n\t/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:66\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:394\ngithub.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).BuildDynatraceMetricsQuery\n\t/go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:611\ngithub.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).GetSLIValue\n\t/go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:1315\nmain.retrieveMetrics\n\t/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:324\nmain.gotEvent\n\t/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:79\nreflect.Value.call\n\t/usr/local/go/src/reflect/value.go:460\nreflect.Value.Call\n\t/usr/local/go/src/reflect/value.go:321\ngithub.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke\n\t/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/receiver.go:86\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2\n\t/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:69\ngithub.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke\n\t/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:71\ngithub.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1\n\t/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/client.go:233"}

Formatted stack trace:

github.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2.1
	/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:66
runtime.gopanic
	/usr/local/go/src/runtime/panic.go:679
runtime.panicmem
	/usr/local/go/src/runtime/panic.go:199
runtime.sigpanic
	/usr/local/go/src/runtime/signal_unix.go:394
github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).BuildDynatraceMetricsQuery
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:611
github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).GetSLIValue
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:1315
main.retrieveMetrics
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:324
main.gotEvent
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:79
reflect.Value.call
	/usr/local/go/src/reflect/value.go:460
reflect.Value.Call
	/usr/local/go/src/reflect/value.go:321
github.com/cloudevents/sdk-go/v2/client.(*receiverFn).invoke
	/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/receiver.go:86
github.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke.func2
	/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:69
github.com/cloudevents/sdk-go/v2/client.(*receiveInvoker).Invoke
	/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/invoker.go:71
github.com/cloudevents/sdk-go/v2/client.(*ceClient).StartReceiver.func2.1
	/go/pkg/mod/github.com/cloudevents/sdk-go/[email protected]/client/client.go:233

This is the get-sli cloud-event:

{
  "data": {
    "get-sli": {
      "customFilters": [],
      "end": "2020-12-28T15:55:05Z",
      "indicators": [
        "response_time_p95"
      ],
      "sliProvider": "dynatrace",
      "start": "2020-12-28T15:46:23Z"
    },
    "labels": {
      "DtCreds": "dynatrace"
    },
    "message": "",
    "project": "sockshop-dt",
    "result": "",
    "service": "carts",
    "stage": "staging",
    "status": ""
  },
  "id": "9977c3fe-82d5-4a66-ae95-0816c3539d0b",
  "source": "lighthouse-service",
  "specversion": "1.0",
  "time": "2020-12-28T15:55:05.458Z",
  "type": "sh.keptn.event.get-sli.triggered",
  "shkeptncontext": "eb1504d5-2ef8-4374-a9c1-a6d0bedcc870"
}

slo.yaml

---
spec_version: "1.0"
comparison:
  aggregate_function: "avg"
  compare_with: "single_result"
  include_result_with_score: "pass"
  number_of_comparison_results: 1
filter:
objectives:
  - sli: "response_time_p95"
    key_sli: false
    pass:             # pass if (relative change <= 10% AND absolute value is < 600ms)
      - criteria:
          - "<=+10%"  # relative values require a prefixed sign (plus or minus)
          - "<600"    # absolute values only require a logical operator
    warning:          # if the response time is below 800ms, the result should be a warning
      - criteria:
          - "<=800"
    weight: 1
total_score:
  pass: "90%"
  warning: "75%"

I did not add an sli.yaml file, so it should be using the default that's available.

Root cause

Initial analysis shows that the root cause is an invalid character in my Dynatrace Tenant URL (my mistake when copying it).
url.Parse then returned an error, which we do not catch; see:

u, _ := url.Parse(targetURL)
q, _ := url.ParseQuery(u.RawQuery)

Definition of Done

  • Catch invalid URLs/configurations (maybe on a global level) and return a failed get-sli.finished event
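
A minimal sketch of how the two ignored errors could be surfaced instead of discarded. The helper name parseTargetURL and the extra scheme/host check are assumptions for illustration, not code from the repository; the caller would turn the returned error into a failed get-sli.finished event.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseTargetURL is a hypothetical helper. It parses the target URL and its
// query string and returns any error instead of silently dropping it.
func parseTargetURL(targetURL string) (*url.URL, url.Values, error) {
	u, err := url.Parse(targetURL)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid Dynatrace URL %q: %w", targetURL, err)
	}
	// url.Parse accepts relative references, so also require scheme and host
	// (an assumption about what a usable tenant URL looks like).
	if u.Scheme == "" || u.Host == "" {
		return nil, nil, fmt.Errorf("invalid Dynatrace URL %q: missing scheme or host", targetURL)
	}
	q, err := url.ParseQuery(u.RawQuery)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid query string in %q: %w", targetURL, err)
	}
	return u, q, nil
}
```

With this in place, BuildDynatraceMetricsQuery would no longer dereference a nil *url.URL when the tenant URL is malformed.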

Allow LABELS as placeholder in Dynatrace.sli

Right now the Dynatrace-SLI Service only allows 4 placeholders: $PROJECT, $STAGE, $SERVICE, $DEPLOYMENT.
Many discussions we have around using Keptn for Quality Gates only or for "Performance as a Self-Service" have shown that these 4 placeholders are not enough.

To give you a concrete example from Discovery:
They are using Jenkins to deploy their service and then put a handful of custom tags on it. They also use Jenkins and leverage the x-dynatrace header to pass in things like TSN (Test Step Name: homepage, search, checkout) and LTN (Load Test Name: loadtest_job1, loadtest_job2, ...). LTN in particular is dynamic and changes with every Jenkins pipeline execution, as the Jenkins Job ID is used in that Test Name.

In order for Discovery to be able to use SLIs they would need to be able to reference these "dynamic values" - otherwise they would need to auto-generate the SLI for every Jenkins Job, make their own replacements, upload it to the keptn git repo and run the evaluation.

To solve this, I suggest we take the labels that are passed with the Deployment-Finished, Test-Finished or Start-Evaluation event, make sure these labels are forwarded to the SLI Provider (e.g. the Dynatrace SLI Provider), and then allow them as placeholders in the SLIs.
Here is an example:

{
  "contenttype": "application/json",
  "data": {
    "labels": {
      "testname": "mytest_1",
      "servicetag1": "mytag",
      "servicetag2": "mytag"
    },
    "project": "simpleproject",
    "service": "simplenode",
    "stage": "staging",
    "start": "2019-11-21T11:00:00.000Z",
    "end": "2019-11-21T11:05:00.000Z",
    "teststrategy": "performance"
  },
  "source": "jenkins",
  "type": "sh.keptn.event.start-evaluation"
}

The Dynatrace SLI file could then look like this:

spec_version: '1.0'
indicators:
  throughput: "builtin:service.requestCount.total:merge(0):sum?scope=tag($LABEL.servicetag1),tag($LABEL.servicetag2)"
  svccalls_test_invoke: "calc:service.teststepservicecalls:filter(eq(Test Name,$LABEL.testname)):merge(0):avg?scope=tag($LABEL.servicetag1),tag($LABEL.servicetag2)"

This would give our users the flexibility to pass any custom metadata via labels to Keptn and, in the end, to the Dynatrace SLI Provider.
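
The proposed $LABEL.<key> substitution could be sketched roughly as follows. The function name replaceLabelPlaceholders and the restriction of label keys to letters, digits and underscores are assumptions made for this illustration; unknown placeholders are left untouched so that a missing label is visible in the resulting query.

```go
package main

import (
	"regexp"
	"strings"
)

// labelPlaceholder matches $LABEL.<key>. Assumption: keys consist of
// letters, digits and underscores only.
var labelPlaceholder = regexp.MustCompile(`\$LABEL\.([A-Za-z0-9_]+)`)

// replaceLabelPlaceholders is a hypothetical helper that substitutes every
// $LABEL.<key> in an SLI query with the matching label value.
func replaceLabelPlaceholders(query string, labels map[string]string) string {
	return labelPlaceholder.ReplaceAllStringFunc(query, func(match string) string {
		key := strings.TrimPrefix(match, "$LABEL.")
		if value, ok := labels[key]; ok {
			return value
		}
		return match // no such label: keep the placeholder as-is
	})
}
```

Matching whole keys with a regular expression avoids the prefix problem a naive string replace would have, e.g. $LABEL.tag1 accidentally rewriting part of $LABEL.tag10.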

Run docker container as non-root users

As a dynatrace-sli-service user, I would like that the dynatrace-sli-service runs with a non-root user.

Tasks:

  • Fix docker image and add user
    RUN adduser -D nonroot -u 65532
    USER nonroot
    
  • Add security context in Helm chart and deployment manifest

Definition of Done:

  • Image runs as non-root user

Link Back to SLI Provider in Keptn's Bridge

Is it possible to provide a link back to your SLI provider in the evaluation events on Keptn's Bridge? For example, the performance signature plugin provides a link back to Dynatrace's Diagnostic tools --> Top Web requests for the evaluation period.

I have found this to be a great context provider for testers and developers: it encourages them to log into Dynatrace directly from the results and start diagnosing the performance problem.

Refactor error handling and logging

The dynatrace-sli-service should be refactored w.r.t. logging and error handling.

Let's use the following rules:

  • There should be only one line for one log message, i.e. make only one log entry instead of
    dt.Logger.Error("Failed sending Dynatrace API request: " + err.Error())
    dt.Logger.Error("Response Body:" + body)
    
  • Only log an error where the error is handled. Otherwise, we end up with several log entries that all have the same root cause.
  • Always adhere to the logging format and do not use fmt.Println

For example, the dynatrace-sli-service produces the following log output:

2020-09-15T09:38:17.604563211Z Processing custom chart tile Process Memory;sli=process_memory, sli=process_memoryRequest finished, parsing body...
2020-09-15T09:38:17.604835261Z {"metricId":"builtin:tech.generic.mem.workingSetSize","displayName":"Process memory","description":"","unit":"Byte","entityType":["PROCESS_GROUP_INSTANCE"],"aggregationTypes":["auto","avg","max","min"],"transformations":["filter","fold","merge","names","parents"],"defaultAggregation":{"type":"avg"},"dimensionDefinitions":[{"key":"dt.entity.process_group_instance","name":"Process","displayName":"Process","index":0,"type":"ENTITY"}]}
2020-09-15T09:38:17.606561262Z 
2020-09-15T09:38:17.60659643Z merging dimension Process
2020-09-15T09:38:17.606602683Z Finalize query for metricSelector=builtin:tech.generic.mem.workingSetSize:merge(0):avg:names;entitySelector=type(PROCESS_GROUP_INSTANCE)
2020-09-15T09:38:17.606607784Z Final Query= https://tno85405.live.dynatrace.com/api/v2/metrics/query/?entitySelector=type%28PROCESS_GROUP_INSTANCE%29&from=1600162396000&metricSelector=builtin%3Atech.generic.mem.workingSetSize%3Amerge%280%29%3Aavg%3Anames&resolution=Inf&to=1600162696000
2020-09-15T09:38:17.650767639Z Request finished, parsing body...
2020-09-15T09:38:17.650833349Z {"totalCount":134,"nextPageKey":"___a7acX3q0AAAAGAQA6YnVpbHRpbjp0ZWNoLmdlbmVyaWMubWVtLndvcmtpbmdTZXRTaXplOm1lcmdlKDApOmF2ZzpuYW1lcwEAAAAAAAAAZAEAAAF0kSERmAAAAXSRHHdgAAABdJEhC0ABAQAcdHlwZShQUk9DRVNTX0dST1VQX0lOU1RBTkNFKQEAP3Z1OU5SVlJTM3EwQVpHUUFBUUFZVEd4VlpGbHRkVFZUTWsxbmQyWXhOMU5rYVRSM1Jtczl2dTlOUlZSUzNxMP__2u2nF96t","result":[{"metricId":"builtin:tech.generic.mem.workingSetSize:merge(0):avg:names","data":[{"dimensions":[],"timestamps":[1600162740000],"values":[1.2162151518655463E8]}]}]}
2020-09-15T09:38:17.650843597Z 
2020-09-15T09:38:17.650848178Z received query result
2020-09-15T09:38:17.650894859Z Processing result for builtin:tech.generic.mem.workingSetSize:merge(0):avg:names
2020-09-15T09:38:17.650900095Z process_memory: 118771.01
2020-09-15T09:38:17.683266445Z Processing custom chart tile Process CPU;sli=process_cpu;pass=<20;warning=<50;key=true, sli=process_cpuRequest finished, parsing body...
2020-09-15T09:38:17.683356998Z {"metricId":"builtin:tech.generic.cpu.usage","displayName":"Process CPU usage","description":"","unit":"Percent","entityType":["PROCESS_GROUP_INSTANCE"],"aggregationTypes":["auto","avg","max","min"],"transformations":["filter","fold","merge","names","parents"],"defaultAggregation":{"type":"avg"},"dimensionDefinitions":[{"key":"dt.entity.process_group_instance","name":"Process","displayName":"Process","index":0,"type":"ENTITY"}]}
2020-09-15T09:38:17.683374399Z 
2020-09-15T09:38:17.683379214Z merging dimension Process
2020-09-15T09:38:17.683384337Z Finalize query for metricSelector=builtin:tech.generic.cpu.usage:merge(0):avg:names;entitySelector=type(PROCESS_GROUP_INSTANCE)
2020-09-15T09:38:17.683431015Z Final Query= https://tno85405.live.dynatrace.com/api/v2/metrics/query/?entitySelector=type%28PROCESS_GROUP_INSTANCE%29&from=1600162396000&metricSelector=builtin%3Atech.generic.cpu.usage%3Amerge%280%29%3Aavg%3Anames&resolution=Inf&to=1600162696000
2020-09-15T09:38:17.724089207Z Request finished, parsing body...
2020-09-15T09:38:17.724133811Z {"totalCount":134,"nextPageKey":"___a7acX3q0AAAAGAQAxYnVpbHRpbjp0ZWNoLmdlbmVyaWMuY3B1LnVzYWdlOm1lcmdlKDApOmF2ZzpuYW1lcwEAAAAAAAAAZAEAAAF0kSER5gAAAXSRHHdgAAABdJEhC0ABAQAcdHlwZShQUk9DRVNTX0dST1VQX0lOU1RBTkNFKQEAP3Z1OU5SVlJTM3EwQVpHUUFBUUFZVEd4VlpGbHRkVFZUTWsxbmQyWXhOMU5rYVRSM1Jtczl2dTlOUlZSUzNxMP__2u2nF96t","result":[{"metricId":"builtin:tech.generic.cpu.usage:merge(0):avg:names","data":[{"dimensions":[],"timestamps":[1600162740000],"values":[0.04291552607627476]}]}]}
2020-09-15T09:38:17.72416445Z 
2020-09-15T09:38:17.724169231Z received query result
2020-09-15T09:38:17.724174063Z Processing result for builtin:tech.generic.cpu.usage:merge(0):avg:names
2020-09-15T09:38:17.724178872Z process_cpu: 0.04
2020-09-15T09:38:17.724183328Z Chart Tile Test Step Response Time - NOT included as name doesnt include sli=SLINAME
2020-09-15T09:38:17.72418827Z Chart Tile Test Step Failure Rate - NOT included as name doesnt include sli=SLINAME
2020-09-15T09:38:17.724193119Z Chart Tile Test Step Service Calls - NOT included as name doesnt include sli=SLINAME
2020-09-15T09:38:17.724198605Z {"timestamp":"2020-09-15T09:38:17.723599007Z","logLevel":"INFO","message":"Dynatrace Dashboard"}
2020-09-15T09:38:17.725024981Z {"timestamp":"2020-09-15T09:38:17.724540231Z","logLevel":"INFO","message":"Uploading remote file"}
2020-09-15T09:38:17.889671379Z {"timestamp":"2020-09-15T09:38:17.889405905Z","logLevel":"INFO","message":"Generated SLI.yaml from Dynatrace Dashboard"}
2020-09-15T09:38:17.890436715Z {"timestamp":"2020-09-15T09:38:17.890276968Z","logLevel":"INFO","message":"Uploading remote file"}
2020-09-15T09:38:17.987573125Z {"timestamp":"2020-09-15T09:38:17.987415974Z","logLevel":"INFO","message":"Generated SLO.yaml from Dynatrace Dashboard"}
2020-09-15T09:38:17.990901904Z {"timestamp":"2020-09-15T09:38:17.990671089Z","logLevel":"INFO","message":"Uploading remote file"}

This is hard to debug!

React on a triggered event with a started and finally finished event

The dynatrace-sli-service has a subscription to the topic: sh.keptn.internal.event.get-sli

Changing behavior

  • Change the topic to: sh.keptn.event.get-sli.triggered
  • When the event is received and can be processed (because the SLI provider is dynatrace), send a sh.keptn.event.get-sli.started event
  • When the event processing is done, send a sh.keptn.event.get-sli.finished event (formerly known as sh.keptn.internal.event.get-sli.done)
  • For sending the events, the distributor can be used since the distributor will handle the communication to NATS/the public API.
    That means this service can simply post an event on 127.0.0.1:8081 and the distributor forwards the event to NATS/the public API.
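
The triggered/started/finished naming described above could be derived like this. Both function names are hypothetical, and the sketch covers only the event types and the provider check, not the actual sending via the distributor.

```go
package main

import (
	"fmt"
	"strings"
)

const triggeredSuffix = ".triggered"

// followUpEventTypes is a hypothetical helper that derives the .started and
// .finished event types from a .triggered event type.
func followUpEventTypes(triggeredType string) (started, finished string, err error) {
	if !strings.HasSuffix(triggeredType, triggeredSuffix) {
		return "", "", fmt.Errorf("event type %q does not end in %q", triggeredType, triggeredSuffix)
	}
	base := strings.TrimSuffix(triggeredType, triggeredSuffix)
	return base + ".started", base + ".finished", nil
}

// shouldHandle reports whether this service is the SLI provider addressed by
// the event, i.e. whether it should answer with a .started event at all.
func shouldHandle(sliProvider string) bool {
	return sliProvider == "dynatrace"
}
```

For sh.keptn.event.get-sli.triggered this yields sh.keptn.event.get-sli.started and sh.keptn.event.get-sli.finished; the old sh.keptn.internal.event.get-sli type is rejected.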

Task

  • Review of content in this repo that refers to the old format of CloudEvents

Better Error Logging for Dynatrace API 4xx errors when querying the Metrics API v2

Right now, when the Dynatrace Metrics API v2 returns an error such as "HTTP 403 - Insufficient privileges for the API Token", there is no log output in the dynatrace-sli-service that would indicate this issue. The only log output we see is "Dynatrace Metrics API returned an error: No valid response from metrics api for query".

When calling the API - especially the Metrics API v2 - we should handle and parse non-200 HTTP responses better.
Here is an example response for the MetricsV2 API:

{
  "error": {
    "code": 403,
    "message": "Token is missing required scope. Use one of: metrics.read (Read metrics)"
  }
}

The problem is easily fixed in the following locations by changing the if condition so that it only errors out when the response is empty (""):
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L611
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L657
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L697
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L732
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L765
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L809
https://github.com/keptn-contrib/dynatrace-sli-service/blob/release-0.7.2/pkg/lib/dynatrace/dynatrace.go#L838

This is also a good opportunity for some refactoring, as all of this is duplicated code that could be moved into a helper function.
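
A shared helper along these lines could extract the code and message from the error body shown above. The type and function names are hypothetical; only the JSON shape ("error" with "code" and "message") is taken from the example response.

```go
package main

import "encoding/json"

// apiErrorEnvelope mirrors the error body from the example response above.
type apiErrorEnvelope struct {
	Error *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// parseAPIError is a hypothetical helper. It reports ok=false when the body
// is not JSON or does not contain an "error" object.
func parseAPIError(body []byte) (code int, message string, ok bool) {
	var envelope apiErrorEnvelope
	if err := json.Unmarshal(body, &envelope); err != nil || envelope.Error == nil {
		return 0, "", false
	}
	return envelope.Error.Code, envelope.Error.Message, true
}
```

Each of the duplicated call sites could then log the returned code and message once, instead of the generic "No valid response" text.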

Better handling of unexpected Dynatrace API Response, e.g: HTTP 500

By default, the Dynatrace API always returns a JSON object, even when the API runs into an error, e.g. an invalid query format.
However, in case of an HTTP 500 the default response is an HTML error page like this:

<html>
<head><title>500 Internal Server Error</title></head>
<body>
<center><h1>500 Internal Server Error</h1></center>
<hr><center>openresty</center>
</body>
</html>

In this case the Dynatrace SLI Service runs into a parsing issue as it assumes a JSON Response which results in this message sent back to Keptn:

invalid character '<' looking for beginning of value

We need the Dynatrace SLI Service to handle this response better and return a proper error message indicating that the Dynatrace API returned an HTTP 500.
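
One way to get there is to look at the status code and the body before attempting to decode JSON. The sketch below assumes a hypothetical validateResponse helper that is called with the HTTP status code and the raw body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// validateResponse is a hypothetical pre-check that runs before the body is
// unmarshalled, so an HTML error page never reaches the JSON decoder.
func validateResponse(statusCode int, body []byte) error {
	if statusCode < 200 || statusCode > 299 {
		return fmt.Errorf("Dynatrace API returned HTTP %d (%s)", statusCode, http.StatusText(statusCode))
	}
	if !json.Valid(body) {
		return fmt.Errorf("Dynatrace API returned HTTP %d but the body is not valid JSON", statusCode)
	}
	return nil
}
```

For the HTML page above this produces "Dynatrace API returned HTTP 500 (Internal Server Error)" rather than the decoder's "invalid character '<'" message.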

Increase test coverage for Dashboard feature and refactoring

The dynatrace-sli-service got new features in regard to extracting SLO/SLI from a Dynatrace Dashboard.
Please see: #62

Increase the robustness of the new features.

Task

  • Review of the new feature
  • Add unit tests when required

Chore tasks:

  • The dashboard feature creates SLO/SLI without spec_version. Set the spec_version to: 0.1.4
  • Even though no dynatrace/dynatrace.conf.yaml is available, the service logs:
    2020-09-17T08:39:00.243879384Z {"timestamp":"2020-09-17T08:39:00.243634332Z","logLevel":"DEBUG","message":"Found dynatrace/dynatrace.conf.yaml on service level"}
  • Refactoring: func (ph *Handler) BuildDynatraceMetricsQuery(metricquery string, startUnix time.Time, endUnix time.Time, customFilters []*keptn.SLIFilter) (string, string) --> the property customFilters is obsolete

Correctly handle SLI inheritance across service, stage and projects as documented

In the doc - https://keptn.sh/docs/0.7.x/quality_gates/sli/#add-sli-configuration-to-a-service-stage-or-project - we explain that SLIs can be defined on any level, e.g: stage, service and project and that the sli-provider should "merge" / "inherit" SLI definitions

It seems the current dynatrace-sli-service implementation is not doing this. It looks at service, then stage and then project, BUT it takes the first SLI.yaml it finds and then doesn't go "up the hierarchy" to also include the SLI definitions from higher-level structures.

One of our users who followed the doc brought this to our attention, and @christian-kreuzberger-dtx then pointed out that this is not how it is implemented right now. Here is the problematic code:

// Downloads a resource from the Keptn Configuration Repo
// In RunLocal mode it gets it from the local disk
// In normal mode it first tries to find it on service level, then stage and then project level
//
func GetKeptnResource(keptnEvent *BaseKeptnEvent, resourceURI string, logger *keptn.Logger) (string, error) {
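
The documented inheritance could be sketched as a merge in which more specific levels override less specific ones. The function name mergeIndicators and the map-based representation of an sli.yaml are assumptions for this illustration; loading the three files is left out.

```go
package main

// mergeIndicators is a hypothetical helper. It merges SLI definitions given
// from least specific to most specific (project, stage, service); a later
// level overrides an earlier one for the same SLI name.
func mergeIndicators(levels ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, level := range levels {
		// Ranging over a nil map is a no-op, so missing levels are fine.
		for name, query := range level {
			merged[name] = query
		}
	}
	return merged
}
```

Called as mergeIndicators(projectSLIs, stageSLIs, serviceSLIs), a service-level file that defines only one indicator would still inherit the remaining ones from stage and project.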

Better error message when evaluation timeframe is 0 seconds

It is possible in Keptn to end up with an evaluation timeframe of 0 seconds. In this case the Dynatrace API always returns no values, but the error shown in the Bridge does not make clear to the end user that it is caused by the 0-second timeframe.


The Dynatrace-SLI Service should return a different error that is more descriptive, e.g: start/end timeframe is too short for the Dynatrace API

Default supported SLIs still use old metrics v1 queries

The dynatrace-sli-service supports 5 default SLIs: throughput, error rate, response time (p50, p90, p95).
When no SLI.yaml is defined, the service uses pre-defined queries that still use the old metrics query definition. While this works, it leads to a lot of WARNING logs, as the dynatrace-sli-service logs a warning whenever the old query format is used.

I suggest we update these default queries to the new metric v2 api query standard

Error retrieving key request metrics on KQG dashboard

Hi guys,
I came across this issue while trying to retrieve key request metrics on a KQG dashboard. They seem to work fine with no management zone set, but as soon as an MZ is set (on the dashboard or the tile) I get an error in SLI retrieval: "message": "Dynatrace API returned status code 400: Constraints violated."
Looking at the API calls the Dynatrace SLI service tries to make, it is using "entitySelector=type(SERVICE_METHOD),mzId(123)". This appears not to be allowed in the v2/metrics API, as calling it directly gives me this error:

{
  "error": {
    "code": 400,
    "message": "Constraints violated.",
    "constraintViolations": [
      {
        "path": "entitySelector",
        "message": "Predicate name mzId not applicable for type SERVICE_METHOD",
        "parameterLocation": "QUERY",
        "location": null
      }
    ]
  }
}

For reference the metric I tried was builtin:service.keyRequest.response.server

Allow addition of filters to dynatrace dashboard tiles for SLI/SLO generation

Tested with version 0.5.0

Currently filtering of entities for a dashboard only works with management zones but would be very useful to have the ability to add filters on a tile by tile basis. Currently if a tile contains filters the evaluation does not finish and the following logs are added to the dynatrace-sli-service pod:

Query all dashboards
--
  | Request finished, parsing dashboard list response body...
  | Analyzing if Dashboard matches: KQG;project=projectname;service=servicename;stage=environmentname
  | Found Dashboard Match: a5a7665c-2207-4707-9ab0-9eff03f9cac6
  | Query dashboard with ID: a5a7665c-2207-4707-9ab0-9eff03f9cac6
  | Request finished, parsing dashboard response body...
  | 2020/09/30 14:42:22 http: panic serving 10.248.0.1:54762: runtime error: invalid memory address or nil pointer dereference
goroutine 1435 [running]:
--
  | net/http.(*conn).serve.func1(0xc0000ce000)
  | /usr/local/go/src/net/http/server.go:1767 +0x139
  | panic(0x13819e0, 0x2081e30)
  | /usr/local/go/src/runtime/panic.go:679 +0x1b2
  | github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).QueryDynatraceDashboardForSLIs(0xc0006b4d58, 0xc0004056e0, 0xe, 0xc0004056a9, 0x3, 0xc000405700, 0xe, 0x0, 0x0, 0x3d09000, ...)
  | /go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:783 +0x395
  | main.getDataFromDynatraceDashboard(0xc000020d58, 0xc0001a8000, 0x3d09000, 0xed70657f9, 0x0, 0x137d9fc0, 0xed706580d, 0x0, 0x0, 0x0, ...)
  | /go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:159 +0x14c
  | main.retrieveMetrics(0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0xc0004242a0, 0x1706c00)
  | /go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:304 +0x782
  | main.gotEvent(0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0x0, 0x0)
  | /go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:86 +0xf9
  | reflect.Value.call(0x1354620, 0x15aa7c8, 0x13, 0x15088ae, 0x4, 0xc000424810, 0x2, 0x2, 0x40be3f, 0x14b60e0, ...)
  | /usr/local/go/src/reflect/value.go:460 +0x5f6
  | reflect.Value.Call(0x1354620, 0x15aa7c8, 0x13, 0xc000424810, 0x2, 0x2, 0xc0004242a0, 0xc0000311e0, 0x1)
  | /usr/local/go/src/reflect/value.go:321 +0xb4
  | github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*receiverFn).invoke(0xc0004fb200, 0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0xc0004246c0, 0x11d7f8b, ...)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/client/receiver.go:93 +0x220
  | github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*ceClient).obsReceive(0xc0002268c0, 0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0xc0004246c0, 0xc000031440, ...)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/client/client.go:132 +0xfc
  | github.com/cloudevents/sdk-go/pkg/cloudevents/client.(*ceClient).Receive(0xc0002268c0, 0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0xc0004246c0, 0x0, ...)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/client/client.go:120 +0xc4
  | github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).obsInvokeReceiver(0xc0004335f0, 0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0x0, 0xc000031d70, ...)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/transport/http/transport.go:500 +0x10f
  | github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).invokeReceiver(0xc0004335f0, 0x1706c00, 0xc0004242a0, 0x17379c0, 0xc0005000d0, 0x12f56e0, 0xc000121200, 0x1, 0x2, 0xffffffffffffffff, ...)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/transport/http/transport.go:484 +0xc5
  | github.com/cloudevents/sdk-go/pkg/cloudevents/transport/http.(*Transport).ServeHTTP(0xc0004335f0, 0x1702d40, 0xc0001e00e0, 0xc0001d4000)
  | /go/pkg/mod/github.com/cloudevents/[email protected]/pkg/cloudevents/transport/http/transport.go:592 +0xafb
  | net/http.(*ServeMux).ServeHTTP(0xc000226900, 0x1702d40, 0xc0001e00e0, 0xc0001d4000)
  | /usr/local/go/src/net/http/server.go:2387 +0x1bd
  | net/http.serverHandler.ServeHTTP(0xc0001e0000, 0x1702d40, 0xc0001e00e0, 0xc0001d4000)
  | /usr/local/go/src/net/http/server.go:2802 +0xa4
  | net/http.(*conn).serve(0xc0000ce000, 0x1706b40, 0xc0003ac000)
  | /usr/local/go/src/net/http/server.go:1890 +0x875
  | created by net/http.(*Server).Serve
  | /usr/local/go/src/net/http/server.go:2928 +0x384

Move build tasks to GitHub actions

Currently some build tasks (sonarqube checks, building and pushing docker images) are still executed on Travis CI.

These should be moved to GitHub Actions.

Definition of Done:

  • All tasks that have previously been executed on Travis-CI are executed via GitHub Actions

Crash of dynatrace-sli-service when using a dynatrace dashboard without a management zone

When using the new capability to define a Dynatrace dashboard for SLIs, and that dashboard doesn't have a management zone applied, the dynatrace-sli-service crashes with the following details:

2020/08/27 15:22:28 http: panic serving 10.1.94.1:57704: runtime error: invalid memory address or nil pointer dereference
goroutine 578 [running]:
net/http.(*conn).serve.func1(0xc0000d0000)
	/usr/local/go/src/net/http/server.go:1767 +0x139
panic(0x13819e0, 0x2081e30)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace.(*Handler).QueryDynatraceDashboardForSLIs(0xc000614d58, 0xc0003abb90, 0xb, 0xc0003abba0, 0xc, 0xc0003abbb0, 0xb, 0x0, 0x0, 0x39387000, ...)
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/pkg/lib/dynatrace/dynatrace.go:783 +0x395
main.getDataFromDynatraceDashboard(0xc000112d58, 0xc0001ae000, 0x39387000, 0xed6d9c41f, 0x0, 0x39c1c440, 0xed6d9c677, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:159 +0x14c
main.retrieveMetrics(0x17379c0, 0xc000432a90, 0x12f56e0, 0xc000229a40, 0x1, 0xc000322180, 0x1706c00)
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:304 +0x782
main.gotEvent(0x1706c00, 0xc000322180, 0x17379c0, 0xc000432a90, 0x12f56e0, 0xc000229a40, 0x1, 0x0, 0x0)
	/go/src/github.com/keptn-contrib/dynatrace-sli-service/cmd/main.go:86 +0xf9

The problem is that the current implementation accesses a property of an object that is nil.
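
A nil-safe accessor would avoid this kind of panic. The struct types below are hypothetical stand-ins for the dashboard model (the real type and field names in the repository may differ); the point is only that every optional level is checked before it is dereferenced.

```go
package main

// Hypothetical, simplified dashboard model. Optional parts are pointers and
// stay nil when the dashboard JSON does not contain them.
type ManagementZone struct {
	ID   string
	Name string
}

type DashboardFilter struct {
	ManagementZone *ManagementZone
}

type Dashboard struct {
	DashboardFilter *DashboardFilter
}

// managementZoneID returns the management zone ID of the dashboard, or ""
// when no management zone (or no filter at all) is set.
func managementZoneID(d *Dashboard) string {
	if d == nil || d.DashboardFilter == nil || d.DashboardFilter.ManagementZone == nil {
		return ""
	}
	return d.DashboardFilter.ManagementZone.ID
}
```

The caller can then treat an empty ID as "no management zone" and simply leave the mzId filter out of the query.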

CURL command that creates a DT-Dashboard for SLOs

For supporting a user with setting up a quality gate without manually writing the SLO/SLI file, the dynatrace-sli-service retrieves the SLO/SLI files from a DT-Dashboard that follows a certain format. This is already implemented.

The template for this dashboard is provided by Andi. Please see: https://yoj211.managed-sprint.dynalabs.io/e/ecc29184-1ae2-46dd-96fe-06997671fc57/#dashboard;id=359160e8-cab6-46b1-a843-07d8451ee75d;gtf=c_1594720800000_1594742400000;gf=all

The metrics in this dashboard are:

  • Response time,
  • Failure rate,
  • Throughput for all services.

Supporting material: https://www.youtube.com/watch?v=mGiTvRP1_LM&list=PLqt2rd0eew1YFx9m8dBFSiGYSBcDuWG38&index=8&t=1898s

Task / Definition of Done

  • Create a CURL command that generates the dashboard in Dynatrace
