
zalando-zmon / service-level-reporting


Calculate SLI/SLO metrics from ZMON's timeseries data

License: Other

Python 55.37% HTML 44.06% Mako 0.15% JavaScript 0.29% Dockerfile 0.13%
monitoring reporting service-level-indicator service-level-objective sli slo zmon

service-level-reporting's People

Contributors

arjunrn, avaczi, christianberg, hjacobs, jan-m, lfroment0, lmineiro, marcinzaremba, mohabusama, vetinari

service-level-reporting's Issues

Report: chart is cut from the top

Right now the maximum y value of the chart is calculated from the SLO threshold, not from the maximum value in the data. As a result, the chart is cut off at the top: points that are much higher than the SLO threshold are not visible.

Example of problem: https://pages.github.bus.zalan.do/continuous-delivery/automata-service-level-reports/messaging-bus/nakadi/20170519-20170522/index.html

Is it expected behavior or a bug?
If it's a bug - I can contribute with a fix.
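A minimal sketch of one possible fix, assuming the chart's upper bound can be computed from both the plotted values and the threshold. The function name `chart_y_max` and the `headroom` parameter are hypothetical and not taken from the project's code.

```python
def chart_y_max(values, slo_threshold, headroom=0.05):
    """Return an upper y-axis bound that keeps data and threshold visible.

    Hypothetical helper: takes the larger of the SLO threshold and the
    highest data point, then adds a little headroom above it.
    """
    highest = max([slo_threshold, *values])
    return highest * (1 + headroom)


# A spike far above the threshold now determines the axis range.
y_max = chart_y_max([10, 250, 40], slo_threshold=100)
```

With this approach the threshold line stays visible even when every data point is below it, because the threshold is always one of the candidates for the maximum.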

Return better errors for the failed API calls

Creating a product group that violates the UNIQUE constraint returns a basic 500 error:

HTTP/1.1 500 INTERNAL SERVER ERROR
Connection: keep-alive
Content-Length: 252
Content-Type: application/problem+json
Date: Wed, 15 Feb 2017 16:51:16 GMT

{
    "detail": "The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.",
    "status": 500,
    "title": "Internal Server Error",
    "type": "about:blank"
}

We should improve the errors so that clients can figure out what went wrong and how to work around it.
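As an illustration only, a uniqueness violation could be translated into a 409 response that uses the same problem+json shape. The sketch below uses the standard library's `sqlite3` with an in-memory table so that it runs on its own; the table name `product_group`, the helper names, and the choice of status 409 are assumptions, and the real service's database layer and web framework are not shown here.

```python
import sqlite3


def problem(status, title, detail):
    """Build a problem+json style body like the one in the response above."""
    return {"type": "about:blank", "title": title, "status": status, "detail": detail}


def create_product_group(conn, name):
    """Insert a product group; map a UNIQUE violation to 409 instead of 500."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO product_group (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        detail = "A product group named {!r} already exists.".format(name)
        return problem(409, "Conflict", detail), 409
    return {"name": name}, 201


# Hypothetical schema, only for demonstrating the error mapping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_group (name TEXT UNIQUE NOT NULL)")

first_body, first_status = create_product_group(conn, "checkout")
second_body, second_status = create_product_group(conn, "checkout")
```

The second call returns a body whose `detail` names the conflicting value, which gives a client enough information to pick a different name or fetch the existing resource.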

Limit maximum data requested from KairosDB

Currently, some requests to KairosDB fail because the requested data slice is too large. That's because the app tries to fill in all the data missing since the last update, and that gap can range from a couple of minutes to several months. The requested range should be restricted to a maximum of one day.

@hjacobs would you agree?
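One way to read this proposal is to split the missing period into windows of at most one day and query each window separately. The sketch below shows only the splitting; `split_range` and `MAX_WINDOW` are hypothetical names, and the KairosDB query itself is left out.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(days=1)  # assumed cap, taken from the proposal above


def split_range(start, end, max_window=MAX_WINDOW):
    """Yield consecutive (chunk_start, chunk_end) pairs no longer than max_window."""
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + max_window, end)
        yield cursor, chunk_end
        cursor = chunk_end


# A gap of two and a half days becomes three bounded requests.
chunks = list(split_range(datetime(2017, 1, 1), datetime(2017, 1, 3, 12)))
```

If the intent is instead to drop everything older than one day, the same helper can be used by keeping only the last chunk.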

Zappr file not compliant

Your Zappr file does not yield the correct config.

Consider doing this

X-Zalando-Team: "zmon"
X-Zalando-Type: code

approvals:
  minimum: 2

commit:
  message:
    patterns: # commit message has to match any one of
      - "^ *#[0-9]+" # starts with hash and digits

Allow SLI update for ranges

Add an end field to the resource so that specific ranges can be updated. This could be useful for manually fixing ranges that fail because they contain a large number of data points.

Error generating report

We get the following error when trying to generate a report:

$ ./generate-slr.py $API_ENDPOINT pss
Can not determine "period_from" and "period_to" for the report. Terminating!

What is the cause?

Products not displayed in page and not found with search (pagination & search issue).

Users reported being unable to find new products. The current number of products is more than 100 and the pagination size is 100, so newer products are not displayed. Because the search is performed in the frontend, products outside the first page are not found either.

  • Search should be performed by backend. API supports this already.
  • Pagination should be handled on UI. API supports this since the beginning.

As a quick fix, pagination limit will be increased to 150 and Search will be modified to be done via the API. Proper pagination handling should be implemented in the UI afterwards.
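For the longer-term fix, a client could walk through every page instead of relying on one large page. The following is a sketch under assumptions: `fetch_page` stands in for whatever call retrieves one page from the API, and the `offset` and `limit` parameter names are hypothetical, not the API's documented ones.

```python
def iter_all(fetch_page, page_size=100):
    """Yield every item by requesting consecutive pages until one comes back short."""
    offset = 0
    while True:
        items = fetch_page(offset=offset, limit=page_size)
        yield from items
        if len(items) < page_size:
            return  # last (possibly empty) page reached
        offset += page_size


# Fake backend with 250 products, used only to exercise the loop.
PRODUCTS = ["product-{}".format(i) for i in range(250)]


def fake_fetch_page(offset, limit):
    return PRODUCTS[offset:offset + limit]


all_products = list(iter_all(fake_fetch_page))
```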

Fix division by zero

There is at least one code path that can raise a division by zero exception when an SLI has no values for the reporting period.

Traceback (most recent call last):
  File "/generate-slr.py", line 159, in <module>
    cli()
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/generate-slr.py", line 155, in cli
    generate_weekly_report(base_url, product, output_dir)
  File "/generate-slr.py", line 87, in generate_weekly_report
    val = sum(values_by_sli[target['sli_name']]) / len(values_by_sli[target['sli_name']])
ZeroDivisionError: division by zero
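A minimal guard for the averaging step in the traceback could look like the sketch below. The helper name `average_or_none` is hypothetical, and returning `None` for an SLI without data is an assumption about how the report should treat that case.

```python
def average_or_none(values):
    """Return the arithmetic mean, or None when there are no values."""
    values = list(values)
    if not values:
        return None
    return sum(values) / len(values)


# An SLI without data points no longer raises ZeroDivisionError.
values_by_sli = {"latency.p95": [100.0, 120.0], "requests": []}
averages = {name: average_or_none(vals) for name, vals in values_by_sli.items()}
```

The caller then has to decide how to render a `None` average, for example by skipping the target in the report.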

Implement UI

Provide UI:

  • Users should be able to use a user friendly interface to manage their SLO definitions (CRUD)
  • Users should be able to jump directly to the central SLR repository for a given SLO definition

TBD

  • Trigger report generation lazily or not (when first requested?)
  • Serve HTML from web application or push it elsewhere
  • Support sending reports via email? (maybe later)

Integrate Service Level Reports (SLRs) in web app

SLRs should ideally be generated and served (?) by the web application itself.

To be discussed:

  • Trigger report generation lazily or not (when first requested?)
  • Serve HTML from web application or push it elsewhere
  • Support sending reports via email? (maybe later)

Reports deeplink is broken

There is an extra / after slr in the report deeplink, and when visiting this URL we end up on the root page of the reports instead of the report itself:

https://slo-host/slr//my-services/a-service/20170831-20170906/index.html
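A doubled slash like this typically appears when a base path ending in / is concatenated with a segment starting with /. The sketch below shows one way to join the parts defensively; `join_url` is a hypothetical helper written in Python for illustration, and the place where the link is actually built is not shown in this report.

```python
def join_url(base, *segments):
    """Join URL parts with exactly one slash between them."""
    parts = [base.rstrip("/")]
    parts.extend(s.strip("/") for s in segments if s.strip("/"))
    return "/".join(parts)


link = join_url(
    "https://slo-host/slr/",
    "/my-services",
    "a-service",
    "20170831-20170906",
    "index.html",
)
```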

Fix securityDefinitions in the API spec

The current swagger spec contains only the example securityDefinitions:

securityDefinitions:
  oauth2:
    type: oauth2
    flow: implicit
    authorizationUrl: https://example.com/oauth2/dialog
    scopes:
      uid: Unique identifier of the user accessing the service.

This suggests that the wrong flow is specified.

Report: fix aggregation for requests

Showing the average number of requests per second per day is not very meaningful. We should compute the total number of requests per day and show it instead.

Error generating reports when no breaches found

Report generation seems to fail when no breaches are found.
Getting this error:

  File "/generate-slr.py", line 153, in generate_weekly_report
    slo['breaches'] = max(breaches_by_sli.values())
ValueError: max() arg is an empty sequence

The error disappears when the threshold in the SLO definition is adjusted to a value that ensures there are some breaches.
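Python's built-in `max()` accepts a `default` keyword for exactly this situation. The sketch below wraps the failing expression in a hypothetical helper; treating "no breaches" as a count of 0 is an assumption.

```python
def worst_breach_count(breaches_by_sli):
    """Return the highest breach count, or 0 when no SLI recorded a breach."""
    return max(breaches_by_sli.values(), default=0)


no_breaches = worst_breach_count({})
some_breaches = worst_breach_count({"latency.p95": 2, "requests": 5})
```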

Report: percentiles from plain data

I have another use case:

I have rare events, so that when I run my check every 5 mins, often there are no events at all or just 1 or 2. I cannot calculate 95th percentile every 5 minutes in this case. Instead I can easily record the maximum value for the last 5 minutes, or 0.

Would be nice to have something like "aggregation": "p95" in the reporting service, so that when the report is generated, it aggregates all the data for the week and is able to calculate correct percentiles.

Could you please consider implementing this type of aggregation?
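To make the request concrete, the sketch below computes a percentile over all raw values collected for a period using the nearest-rank method. The function name and the choice of nearest-rank (rather than an interpolating method) are assumptions; the reporting service does not necessarily offer this aggregation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Weekly data made of per-check maxima, mostly zeros with a few rare events.
weekly_maxima = [0] * 96 + [120, 180, 240, 900]
p95 = percentile(weekly_maxima, 95)
```

In this sample 96 of the 100 values are zero, so the 95th percentile is 0 and only higher percentiles reflect the rare events.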

Custom time range for Service Level Reports

Currently the handler that retrieves the data for the report has the time range hardcoded to the past week. If the app is to support reports for larger intervals, such as a month or a year, this handler should accept a start and an end date as parameters.
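A sketch of how such parameters might be resolved, keeping the past week as the default when nothing is passed. The function name, the parameter names, and the ISO date format are assumptions; `date.fromisoformat` requires Python 3.7 or later.

```python
from datetime import date, timedelta


def resolve_report_range(start=None, end=None, today=None):
    """Return (period_from, period_to); defaults to the 7 days before `end`."""
    today = today or date.today()
    period_to = date.fromisoformat(end) if end else today
    period_from = date.fromisoformat(start) if start else period_to - timedelta(days=7)
    if period_from >= period_to:
        raise ValueError("start must be before end")
    return period_from, period_to


default_range = resolve_report_range(today=date(2017, 9, 7))
monthly_range = resolve_report_range(start="2017-08-01", end="2017-09-01")
```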

Display a count of metrics on the UI

Display counts of key metrics in the UI for the information of management users. These counts can also serve as an indicator of the tool's capacity.

  1. Total number of products
  2. Total number of reports
  3. Total number of SLOs and/or SLIs

Tests

Unit tests with decent coverage.

Left and right y axes don't scale independently

When two different metrics are plotted in one graph, using the left and right y axis, both axes use the same range (same minimum and maximum).

This compresses one of the graphs and makes it useless in many cases.

The two axes should be scaled independently of each other.

Improve consistency

Some resources accept POST requests directly on the resource path while others require the .../update URL. The HTTP verb should be enough for the operation decision.

Configuration API / UI

The configuration can only be done via SQL right now. Provide a UI or at least an HTTP API.

Report: weighted average over the day

At the moment the endpoint GET /service-level-objectives/{product}/reports/{report_type} will only generate an average value for each SLI for the whole day. This average should be weighted, i.e. minutes with more requests should get more weight.

Example of avg vs weighted avg:

select avg(sli_value) from zsm_data.service_level_indicator where sli_name = 'latency.p95' and date(sli_timestamp) = '2016-09-27'
-> 102.34ms
select sum(sli1.sli_value * sli2.sli_value)/sum(sli2.sli_value) from zsm_data.service_level_indicator sli1 join zsm_data.service_level_indicator sli2 on sli1.sli_timestamp = sli2.sli_timestamp and sli2.sli_name = 'requests' where sli1.sli_name = 'latency.p95' and date(sli1.sli_timestamp) = '2016-09-27';
-> 113.86ms
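The second query is a weighted mean in which each latency sample is weighted by the request count at the same timestamp. A minimal Python sketch of the same calculation follows; the function name is hypothetical, and returning `None` when the total weight is zero is an assumption.

```python
def weighted_average(values, weights):
    """Return sum(value * weight) / sum(weight), or None if the weights sum to zero."""
    pairs = list(zip(values, weights))
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in pairs) / total_weight


# Latency samples paired with the request counts of the same minutes.
latencies_ms = [100, 200]
requests = [1, 3]
result = weighted_average(latencies_ms, requests)
```

Here the busier minute pulls the result toward 200 ms, whereas the plain average of the two samples would be 150 ms.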

Token from Authorization header is ignored when session contains token

Problem

An OAuth2 token sent in an Authorization: Bearer <token> HTTP header is ignored, if the HTTP request also contains a slr-session cookie. The application continues to use the token stored in the session, even if this token is expired.

Steps to reproduce

  1. Send an HTTP request to the SLR API containing an Authorization: Bearer header with a valid token. (This succeeds.)
  2. Store the returned slr-session cookie.
  3. Wait for the first token to expire.
  4. Send another request to the API, containing both the cookie and an Authorization: Bearer header with a new (valid) token. This fails with a 401 Unauthorized response.

Proposed solution

API requests should have no session handling at all and always rely on the Authorization header.
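The proposal could be sketched as a token lookup that reads only the Authorization header and never consults the session cookie. The helper name `bearer_token` is hypothetical, and the plain dict used for headers is a simplification: real frameworks usually provide case-insensitive header access.

```python
def bearer_token(headers):
    """Return the token from an 'Authorization: Bearer <token>' header, else None.

    Cookies are deliberately ignored so that a stale session cannot
    override a fresh token.
    """
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


# The new token wins even though a session cookie is also present.
request_headers = {"Authorization": "Bearer new-token", "Cookie": "slr-session=old"}
token = bearer_token(request_headers)
```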

Automate SLI data collection

Data collection (as defined in zsm_data.data_source) should be automatic. This could be done with a simple background job/thread (locking?).
