zalando-zmon / service-level-reporting

Calculate SLI/SLO metrics from ZMON's timeseries data

License: Other

Python 55.37% HTML 44.06% Mako 0.15% JavaScript 0.29% Dockerfile 0.13%

monitoring reporting service-level-indicator service-level-objective sli slo zmon

service-level-reporting's People


coders-kitchen arjunrn lsavolainen avaczi fmaali christianberg thorbjoerng xuezhizeng lfroment0 andrerpena

service-level-reporting's Issues

Report: chart is cut from the top

Right now the maximum y value of the chart is calculated from the SLO threshold, not from the maximum value in the data. As a result, the chart is cut off at the top: points that are much higher than the SLO threshold are not visible.

Example of problem: https://pages.github.bus.zalan.do/continuous-delivery/automata-service-level-reports/messaging-bus/nakadi/20170519-20170522/index.html

Is it expected behavior or a bug?
If it's a bug - I can contribute with a fix.
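One possible fix, sketched here in Python, is to derive the upper bound from both the data and the threshold. The function name chart_y_max and the headroom parameter are hypothetical and not taken from the project's code; the sketch assumes non-negative values.

```python
def chart_y_max(values, slo_threshold, headroom=0.05):
    """Return an upper y-axis bound that covers both the data and the SLO threshold.

    Assumes non-negative values; `headroom` is an illustrative parameter.
    """
    # Ignore missing data points (None) when looking for the peak.
    present = [v for v in values if v is not None]
    # The threshold line must stay visible too, so it takes part in the maximum.
    peak = max(present + [slo_threshold])
    # Add a little headroom so the highest point is not drawn on the chart border.
    return peak * (1 + headroom)


print(chart_y_max([120.0, 480.0, None, 95.0], slo_threshold=200.0))
```

With this approach the threshold still determines the range when all points are below it, so charts without breaches look the same as before.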

Streamline API

Refactor
CRUD
DB Migrations
Consistency: #25
Errors: #24
...

Return better errors for the failed API calls

Creating a product group that violates the UNIQUE constraint returns a basic 500 error:

HTTP/1.1 500 INTERNAL SERVER ERROR
Connection: keep-alive
Content-Length: 252
Content-Type: application/problem+json
Date: Wed, 15 Feb 2017 16:51:16 GMT

{
    "detail": "The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.",
    "status": 500,
    "title": "Internal Server Error",
    "type": "about:blank"
}

We should improve the errors so that clients can figure out how to work around them.
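A minimal sketch of the idea, assuming the constraint violation surfaces as a database integrity error. It uses sqlite3 from the standard library only as a stand-in for the real database, and problem_for_exception is a hypothetical helper, not part of the service. The 409 status is one reasonable choice; 400 would also be defensible.

```python
import sqlite3


def problem_for_exception(exc):
    """Map an exception to a problem+json style dict (hypothetical helper)."""
    if isinstance(exc, sqlite3.IntegrityError):
        # A constraint violation is caused by the request, not by the server.
        return {
            "type": "about:blank",
            "title": "Conflict",
            "status": 409,
            "detail": "A resource with the same unique key already exists: {}".format(exc),
        }
    # Anything else remains a generic server error.
    return {
        "type": "about:blank",
        "title": "Internal Server Error",
        "status": 500,
        "detail": "Unexpected error.",
    }


# Stand-in table with a UNIQUE constraint, used only to trigger the error.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_group (name TEXT UNIQUE)")
conn.execute("INSERT INTO product_group VALUES ('checkout')")
try:
    conn.execute("INSERT INTO product_group VALUES ('checkout')")
    problem = None
except sqlite3.IntegrityError as e:
    problem = problem_for_exception(e)
print(problem)
```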

Increase timeouts for report generation

Creating a product with invalid product_group does not yield error

The API should return a 400 or 404 error. Currently the product is created with a NULL product_group_id.

Limit maximum data requested from KairosDB

Currently, some requests to KairosDB fail because the requested data slice is too large. This happens because the app tries to fill in all the data missing since the last update, a gap that can range from a couple of minutes to several months. The requested range should be restricted to a maximum of one day.

@hjacobs would you agree?
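The proposed one-day limit could be sketched as follows. The names clamp_query_start and MAX_WINDOW are hypothetical, and the sketch only computes the start of the query window; it does not call KairosDB.

```python
from datetime import datetime, timedelta

# Assumed upper bound for a single query, as proposed in this issue.
MAX_WINDOW = timedelta(days=1)


def clamp_query_start(last_update, now, max_window=MAX_WINDOW):
    """Return the start of the query window, never more than max_window before now."""
    earliest_allowed = now - max_window
    # If the last update is older than the limit, start at the limit instead.
    return max(last_update, earliest_allowed)


now = datetime(2017, 5, 22, 12, 0)
print(clamp_query_start(datetime(2017, 3, 1), now))
```

Note that clamping drops data older than the window. Filling a longer gap would need several bounded queries instead of one.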

Support pagination in CLI

CLI should:

Support pagination variables
Support pagination in some way (not clear yet)

Zappr file not compliant

Your Zappr file does not yield the correct config.

Consider doing this

X-Zalando-Team: "zmon"
X-Zalando-Type: code

approvals:
  minimum: 2

commit:
  message:
    patterns: # commit message has to match any one of
      - "^ *#[0-9]+" # starts with hash and digits

Ability to add, update or delete SLO from the UI

Delete product indicators does not work

When editing an indicator and pressing the delete button, you get the error message "Can't delete indicator".

Not all products SLI are updated in case of exception

In the run_sli_update greenlet, if the SLI update fails for one product, the remaining products are not updated.
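One way to isolate failures is to catch exceptions per product. The sketch below is not the actual run_sli_update code: update_all_products and the update_product callback are hypothetical names used for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def update_all_products(products, update_product):
    """Update every product; return the products whose update failed."""
    failed = []
    for product in products:
        try:
            update_product(product)
        except Exception:
            # Log and continue so that one broken product does not block the rest.
            logger.exception("SLI update failed for product %s", product)
            failed.append(product)
    return failed
```

Returning the failed products lets the caller decide whether to retry them or report them.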

Switch legacy to read-only

Switch legacy service/app to read-only API.

Allow SLI update for ranges

Add end field in resource to allow updating specific ranges. This could be useful to manually fix certain failing ranges with large data points count.

Error generating report

We get the following error when trying to generate a report:

$ ./generate-slr.py $API_ENDPOINT pss
Can not determine "period_from" and "period_to" for the report. Terminating!

What is the cause?

Report is invalid in case of duplicate SLI name across products

Products not displayed in page and not found with search (pagination & search issue).

Users reported being unable to find new products. Because there are currently more than 100 products and the pagination size is 100, newer products are not displayed. Because the search is performed in the frontend, products outside the first page are not found either.

Search should be performed by backend. API supports this already.
Pagination should be handled on UI. API supports this since the beginning.

As a quick fix, pagination limit will be increased to 150 and Search will be modified to be done via the API. Proper pagination handling should be implemented in the UI afterwards.

Generating a report for SLI with no days crashes the entire report

Fix division by zero

There is, at least, one code path that can result in a division by zero exception under certain circumstances.

Traceback (most recent call last):
  File "/generate-slr.py", line 159, in <module>
    cli()
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/generate-slr.py", line 155, in cli
    generate_weekly_report(base_url, product, output_dir)
  File "/generate-slr.py", line 87, in generate_weekly_report
    val = sum(values_by_sli[target['sli_name']]) / len(values_by_sli[target['sli_name']])
ZeroDivisionError: division by zero
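The failing line divides by len(...) of a list that can be empty. A guarded average, sketched below with the hypothetical helper name average_or_none, avoids the exception. How the report should render a missing value is left open.

```python
def average_or_none(values):
    """Return the arithmetic mean of values, or None when there are no values."""
    values = list(values)
    if not values:
        # No data points for this SLI: signal "no value" instead of dividing by zero.
        return None
    return sum(values) / len(values)


print(average_or_none([]))
print(average_or_none([100, 110, 120]))
```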

Only the reports directory index should be reversed

Currently the same template is used for directory indexes. This gives us the useful reversed sorting index of reports but doesn't really help with the product page index.

Implement UI

Provide UI:

Users should be able to use a user friendly interface to manage their SLO definitions (CRUD)
Users should be able to jump directly to the central SLR repository for a given SLO definition

TBD

Trigger report generation lazily or not (when first requested?)
Serve HTML from web application or push it elsewhere
Support sending reports via email? (maybe later)

Fix time unit

Currently hardcoded to minutes

Use latest Connexion

Can we upgrade to the latest Connexion release (https://github.com/zalando/connexion/releases/tag/1.1.14) or are there any impediments?

Authorization and scopes for the Products

Currently anybody with a valid token can modify the product/SLO/SLI data for any of the teams. As suggested by @hjacobs we should add authorization at possibly the product level.

Integrate Service Level Reports (SLRs) in web app

SLRs should ideally be generated and served (?) by the web application itself.

To be discussed:

Trigger report generation lazily or not (when first requested?)
Serve HTML from web application or push it elsewhere
Support sending reports via email? (maybe later)

Reports deeplink is broken

There is an extra / after slr, and visiting this URL leads to the root page of the reports instead of the requested report:

https://slo-host/slr//my-services/a-service/20170831-20170906/index.html
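A double slash like this usually comes from joining a base that ends in / with a path that starts with /. A small sketch of a join that normalizes the separators follows; join_url is a hypothetical helper, not part of the project.

```python
def join_url(base, *parts):
    """Join URL segments with exactly one slash between them."""
    segments = [base.rstrip("/")]
    # Strip slashes from each part and skip parts that are empty afterwards.
    segments.extend(p.strip("/") for p in parts if p.strip("/"))
    return "/".join(segments)


print(join_url("https://slo-host/slr/", "/my-services", "a-service",
               "20170831-20170906", "index.html"))
# prints https://slo-host/slr/my-services/a-service/20170831-20170906/index.html
```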

Add support for opentracing

Fix securityDefinitions in the API spec

The current swagger spec contains only the example securityDefinitions

securityDefinitions:
  oauth2:
    type: oauth2
    flow: implicit
    authorizationUrl: https://example.com/oauth2/dialog
    scopes:
      uid: Unique identifier of the user accessing the service.

This suggests that the wrong flow is specified.

Report: fix aggregation for requests

Showing the average number of requests per second per day is not very meaningful. We should compute the total number of requests per day and show it instead.

Report: add abbreviated day of week to table header (e.g. "Sat")

To immediately see what day of week it is (esp. relevant for weekends).

Use Authorization headers in verifying token info

Error generating reports when no breaches found

Report generation seems to fail when no breaches are found.
Getting this error:
  File "/generate-slr.py", line 153, in generate_weekly_report
    slo['breaches'] = max(breaches_by_sli.values())
ValueError: max() arg is an empty sequence

The error disappears when adjusting the threshold in the SLO definition to a value that ensures there are some breaches
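Since Python 3.4, max() accepts a default that is returned for an empty iterable, which would cover the no-breaches case. The sketch below wraps this in a hypothetical helper; treating "no breaches" as 0 is an assumption about the intended report semantics.

```python
def worst_breach_count(breaches_by_sli):
    """Return the highest breach count across SLIs, or 0 when there are none."""
    # default=0 is returned when the dict is empty, instead of raising ValueError.
    return max(breaches_by_sli.values(), default=0)


print(worst_breach_count({}))
print(worst_breach_count({"latency.p95": 3, "requests": 1}))
```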

Report: percentiles from plain data

I have another use case:

I have rare events, so that when I run my check every 5 mins, often there are no events at all or just 1 or 2. I cannot calculate 95th percentile every 5 minutes in this case. Instead I can easily record the maximum value for the last 5 minutes, or 0.

Would be nice to have something like "aggregation": "p95" in the reporting service, so that when the report is generated, it aggregates all the data for the week and is able to calculate correct percentiles.

Could you please consider implementing this type of aggregation?
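As an illustration of the requested aggregation, the sketch below computes a percentile over all raw values of a period using the nearest-rank method. Nearest-rank is one of several percentile definitions and is chosen here only as an assumption; the function name is hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of values; pct must be in the range (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct percent of the data at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Example: per-check maxima collected over a period, mostly 0 because events are rare.
weekly_values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 250]
print(percentile(weekly_values, 95))
```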

Custom time range for Service Level Reports

Currently the handler that retrieves the data for the report has the time range hardcoded to the past week. If the app is to support reports for larger intervals such as a month or a year, this handler should accept a start and end date as parameters.

Display a count of metrics on the UI

Ability to display count of metrics on the UI for the information of management users. This can also serve as an indicator of the capacity for the tool.

Total number of products
Total number of reports
Total number of SLOs and/or SLIs

Postgresql exception: value out of range: underflow

While inserting new SLI values.

Report: requests aggregation is wrong

Extend API to allow deleting SLO resources

It seems that, currently, it's not possible to delete SLO resources.

We should extend the API to be able to delete SLOs

Tests

Unit tests with decent coverage.

Trying to create a duplicated data-source returns 500 error code

Using the Client, in case a user tries to create a new data-source while the "sli_name" already exists, it returns 500.
Perhaps a better error message and type makes more sense.

Left and right y axes don't scale independently

When two different metrics are plotted in one graph, using the left and right y axis, both axes use the same range (same minimum and maximum).

This compresses one of the graphs and makes it useless in many cases.

The two axes should be scaled independently of each other.

CLI check keys should flatten results

Check result keys should be flattened when compared with SLI source keys.
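A minimal sketch of such flattening, assuming nested dicts are joined with a dot separator (the separator and the function name are assumptions, not confirmed by the project):

```python
def flatten(value, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, item in value.items():
        path = "{}.{}".format(prefix, key) if prefix else str(key)
        if isinstance(item, dict):
            # Recurse into nested dicts, carrying the key path along.
            flat.update(flatten(item, path))
        else:
            flat[path] = item
    return flat


check_result = {"latency": {"p95": 120, "p99": 300}, "requests": 42}
print(sorted(flatten(check_result)))
# prints ['latency.p95', 'latency.p99', 'requests']
```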

Improve consistency

Some resources accept POST requests directly on the resource path while others require the .../update URL. The HTTP verb should be enough for the operation decision.

Configuration API / UI

The configuration can only be done via SQL right now. Provide a UI or at least an HTTP API.

Report: weighted average over the day

At the moment the endpoint GET /service-level-objectives/{product}/reports/{report_type} will only generate an average value for each SLI for the whole day. This average should be weighted, i.e. minutes with more requests should get more weight.

Example of avg vs weighted avg:

select avg(sli_value) from zsm_data.service_level_indicator where sli_name = 'latency.p95' and date(sli_timestamp) = '2016-09-27'
-> 102.34ms
select sum(sli1.sli_value * sli2.sli_value)/sum(sli2.sli_value) from zsm_data.service_level_indicator sli1 join zsm_data.service_level_indicator sli2 on sli1.sli_timestamp = sli2.sli_timestamp and sli2.sli_name = 'requests' where sli1.sli_name = 'latency.p95' and date(sli1.sli_timestamp) = '2016-09-27';
-> 113.86ms
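The second query computes sum(value * weight) / sum(weight), with the request count of the same timestamp as the weight. The same calculation in Python, as a sketch with hypothetical names and made-up sample numbers:

```python
def weighted_average(samples):
    """Weighted mean of (value, weight) pairs, or None when the total weight is 0."""
    samples = list(samples)
    total_weight = sum(weight for _, weight in samples)
    if total_weight == 0:
        # No requests at all: a weighted average is undefined.
        return None
    return sum(value * weight for value, weight in samples) / total_weight


# (latency in ms, number of requests) per minute; illustrative numbers only.
minutes = [(100, 1), (200, 3)]
print(weighted_average(minutes))
# prints 175.0
```

The plain average of the two latencies would be 150, so the busier minute pulls the weighted result upward, as in the SQL example.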

UI: display error messages from API

Usually the API responds with informative error messages that can be helpful in UI (i.e. avoid generic error messages)

See #79

Faster report generation

Optimize report generation

Token from Authorization header is ignored when session contains token

Problem

An OAuth2 token sent in an Authorization: Bearer <token> HTTP header is ignored if the HTTP request also contains a slr-session cookie. The application continues to use the token stored in the session, even if that token has expired.

Steps to reproduce

Send an HTTP request to the SLR API containing an Authorization: Bearer header with a valid token. (This succeeds.)
Store the returned slr-session cookie.
Wait for the first token to expire.
Send another request to the API, containing both the cookie and an Authorization: Bearer header with a new (valid) token. This fails with a 401 Unauthorized response.

Proposed solution

API requests should have no session handling at all and always rely on the Authorization header.
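The proposed precedence could be sketched as below. The function name, the api_request flag and the "token" session key are hypothetical; headers and session are modeled as plain dicts instead of the web framework's objects.

```python
def select_token(headers, session, api_request=True):
    """Pick the token to verify: the Authorization header always wins."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    if api_request:
        # API requests never fall back to the session (assumption from the proposal).
        return None
    # Non-API (UI) requests may still use the session token; the key name is assumed.
    return session.get("token")


print(select_token({"Authorization": "Bearer new-token"}, {"token": "expired-token"}))
# prints new-token
```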