
dora-metrics / pelorus


Automate the measurement of organizational behavior

Home Page: https://pelorus.readthedocs.io/

License: Apache License 2.0

Languages: Python 84.98%, Shell 10.62%, Mustache 0.70%, Makefile 3.36%, Dockerfile 0.33%
Topics: devops, dora, dora-metrics, metrics, transformation

pelorus's People

Contributors

alyibrahim, cnuland, dependabot[bot], eformat, etsauer, fmenesesg, joecharles33, jtudelag, kenwilli, kevinmgranger, kkoller, kpiwko, malacourse, mateusoliveira43, mattheh, mpryc, mvmaestri, pcarney8, prakritikoller, rafamqrs, ramius345, rmarting, sabre1041, savitharaghunathan, shaheinm, themoosman, tolarewaju3, tomgeorge, weshayutin, willowmck


pelorus's Issues

When executing Helm commands you must be in the pelorus namespace

When executing the Helm commands to create an exporter, you must be in the pelorus namespace or the objects will be created in the current namespace. This occurs even when the --namespace flag is passed, as in helm template charts/exporter/ -f exporters/failure/values.yaml --namespace pelorus | oc apply -f-

Lead Time exporter should pull git repo information from BuildConfigs

Currently the lead time exporter requires an administrator to pass, via an environment variable, a comma-separated list of git repos representing application source code. This doesn't scale well, as we would have to make an administrative change for each application that comes on board.

What we should do instead is grab the git repository information from the BuildConfigs we find in the cluster.
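
A rough sketch of the idea, assuming the exporter keeps using the openshift dynamic client (the exact client wiring in the exporter code may differ):

from kubernetes import config
from openshift.dynamic import DynamicClient

# Hedged sketch: discover git repos from BuildConfigs instead of an
# admin-supplied list. The client wiring here is an assumption.
dyn_client = DynamicClient(config.new_client_from_config())
build_configs = dyn_client.resources.get(api_version='build.openshift.io/v1', kind='BuildConfig')

for bc in build_configs.get().items:  # all namespaces visible to the service account
    git = bc.spec.source.git if bc.spec.source else None
    if git and git.uri:
        print(bc.metadata.namespace, bc.metadata.name, git.uri)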

Create exporter for Deployment Frequency

Acceptance Criteria:

  • An exporter that captures deployments "to production", perhaps by using a label like env: production
  • Update SDM dashboard to use this metric.

Failing to provision jenkins collector

I'm repeatedly getting this error while rolling out the jenkins collector:

TASK [Rollout Build Collector] ************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "rollout", "-n", "hygieia", "latest", "dc/hygieia-jenkins-build-collector"], "delta": "0:00:00.311447", "end": "2019-01-22 09:38:33.878458", "msg": "non-zero return code", "rc": 1, "start": "2019-01-22 09:38:33.567011", "stderr": "error: #2 is already in progress (Running).", "stderr_lines": ["error: #2 is already in progress (Running)."], "stdout": "", "stdout_lines": []}

PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost                  : ok=4    changed=3    unreachable=0    failed=1   

Unable to view data that's been pushed to LTS (MinIO)

After data has been pushed to long term storage, it isn't visible in the Grafana dashboard.
Steps to reproduce

  1. Install Pelorus with LTS
  2. Install exporter(s) and allow data to be aggregated
  3. Wait a couple of hours for the Thanos archive cycle to run. You can verify by opening the MinIO console and viewing the data
  4. Restart both Prometheus instances
  5. Open dashboard and data isn't present

Update multi-cluster configuration documentation with thanos sidecar

The current documentation has three places where we point to the multi-cluster configuration, but none of them uses real config examples for such a scenario.

This issue is to add such documentation, so it's easier for the user to configure a Pelorus instance across multiple clusters.

CRDs in Helm chart not handled cleanly

We appear to have some difficulty managing the CRDs for the grafana operator, and the pelorus namespace doesn't clean up properly, forcing us to forcibly delete the namespace when it gets stuck in "Terminating". We should look at whether a refactor of the charts would address this.

Keep in mind we are using helm template to process this, so the baked-in Helm methodology for handling CRDs doesn't apply.

Write pelorus usage docs

Currently all of our documentation deals with deploying and configuring pelorus infrastructure. We need to write docs that deal with the usage of Pelorus once it is up and running. Some possible topics include:

  • How pelorus collects data (and how to generate some)
  • Features of the Software Delivery Performance dashboard
  • How to read and interpret the data

Long term storage solution

By default Prometheus only stores 2 weeks' worth of data. In order for MDT tooling to be valuable, we need to store at least 6 months to a year's worth of history. We need to do some research into a long-term data store for the stack.

Include graph showing 30, 60 day (or more) history for each?

When I worked at Circonus I designed executive dashboards with year-over-year, quarter-over-quarter, etc., graphs that gave devops teams 'how are we doing today' kinds of answers. It would be (IMHO) really interesting to show the same thing for each of our 4 key metrics (ALT, Deploy, MRT, CFR).

If interested I'd be happy to contribute, or give more specifics.

committime-exporter goes into CrashLoop when a build with a non-GitHub repo is found

I have a build that is hosted on an internal GitLab server. When the committime-exporter hits this build, it throws an error, and the pod crashes.

Only GitHub repos are currently supported. Skipping build slack-bot-4
Failed processing commit time for build slack-bot-4
'commit'
{'message': 'Not Found', 'documentation_url': 'https://developer.github.com/v3/repos/commits/#get-a-single-commit'}
Traceback (most recent call last):
  File "committime/app.py", line 191, in <module>
    REGISTRY.register(CommitCollector(username, token, namespaces, apps))
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 24, in register
    names = self._get_names(collector)
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 64, in _get_names
    for metric in desc_func():
  File "committime/app.py", line 31, in collect
    ld_metrics = generate_ld_metrics_list(self._namespaces)
  File "committime/app.py", line 178, in generate_ld_metrics_list
    metric.getCommitTime()
  File "committime/app.py", line 66, in getCommitTime
    self.commit_timestamp = loader.convert_date_time_to_timestamp(self.commit_time)
  File "/opt/app-root/src/committime/lib_pelorus/loader.py", line 18, in convert_date_time_to_timestamp
    timestamp = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%SZ')
TypeError: strptime() argument 1 must be str, not None
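
One possible fix, sketched loosely against the names in the traceback (the real loader may differ), is to tolerate a missing commit time and let the collector skip that build rather than crash-loop:

from datetime import datetime, timezone

def convert_date_time_to_timestamp(date_time):
    # Hedged sketch: return None instead of raising when the commit time is
    # missing (e.g. the repo lives on an unsupported git host).
    if not date_time:
        return None
    parsed = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc).timestamp()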

Need a cleaner uninstall process.

This is more of a development problem, where we are testing multiple instances of Pelorus in a single cluster. When we follow the documented uninstall process:

helm template --namespace pelorus pelorus ./charts/deploy/ | oc delete -f- -n pelorus

We end up deleting the operator CRDs, which may still be in use by other instances of Pelorus, or by non-Pelorus instances of Prometheus or Grafana. We need a safer way to uninstall the stack without leaving a bunch of cluster-level RBAC resources behind.

Deployment Frequency query is not yielding any data.

We are using the following query to calculate Deployment Frequency:

sum(delta(openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"}[$interval]))

The core metric openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"} appears to work just fine, but the delta/sum functions seem to break when we pull in the interval.

Get Token call does not work through web proxy

TASK [Get Tokens] *************************************************************************************************************************************************************************************************
FAILED - RETRYING: Get Tokens (5 retries left).
FAILED - RETRYING: Get Tokens (4 retries left).
FAILED - RETRYING: Get Tokens (3 retries left).
FAILED - RETRYING: Get Tokens (2 retries left).
FAILED - RETRYING: Get Tokens (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 5, "changed": false, "connection": "close", "content": "<Squid ERR_INVALID_REQ error page returned by the proxy; full HTML omitted>", "content_type": "text/html;charset=utf-8", "date": "Thu, 24 Jan 2019 00:00:02 GMT", "msg": "Status code was 411 and not [200]: HTTP Error 411: Length Required", "redirected": false, "server": "squid/4.0.21", "status": 411, "url": "http://hygieia.hygieia.apps.d2.casl.rht-labs.com/api/admin/apitokens", "via": "1.1 atlwifi3.atlanta-airport.com (squid/4.0.21)", "x_cache": "MISS from atlwifi3.atlanta-airport.com", "x_squid_error": "ERR_INVALID_REQ 0"}

Use Helm chart for install instead of an install script

The installation for Pelorus is really just a Helm chart, except that we rely on a couple of values stored in secrets in order to wire Pelorus up with the OpenShift monitoring stack.

Since we did this, Helm has added a lookup function, which could be used to fetch these values as part of processing the chart: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function

We should switch over to that so that we can get rid of the install script altogether.

Implement MTTR

Two use cases:

  1. Automated: pods go into crash loops and other failures observable from Kubernetes
  2. Human-defined: some sort of incident ticket being opened or closed. For now we could use GitHub issues to simulate this (sketched below).
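
For the human-defined case, a rough sketch of what "GitHub issues as incidents" could look like (the incident label, repo, and token handling are all placeholders, not anything Pelorus does today):

from datetime import datetime
import requests

def incident_durations(owner, repo, token, label="incident"):
    # Hedged sketch: time-to-restore = closed_at - created_at for closed
    # issues carrying an "incident" label. The label name is an assumption.
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        params={"state": "closed", "labels": label},
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    durations = []
    for issue in resp.json():
        opened = datetime.strptime(issue["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        closed = datetime.strptime(issue["closed_at"], "%Y-%m-%dT%H:%M:%SZ")
        durations.append((closed - opened).total_seconds())
    return durations

MTTR would then just be the average of those durations over whatever window the dashboard uses.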

Committime exporter hitting GitHub rate limits

I've noticed that my commit time exporter is hitting rate limits in the GitHub API:

Failed processing commit time for build committime-exporter-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Failed processing commit time for build nodejs-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Namespace:  basic-nginx-build , App:  basic-nginx-04343be1777087992fbbd87f81313db0cb369684 , Build:  basic-nginx-1

The way the exporter is currently written, we hit the github api once for each build it discovers in the cluster. According to the API docs, we are capped at 5000 requests per hour. https://developer.github.com/v3/#rate-limiting

We'll have to work on a few things to help with this:

  1. Updating the ServiceMonitor to have Prometheus scrape the exporter less frequently (currently it scrapes every 15 seconds)
  2. See if there is a way to batch the API calls to get data for multiple commits at once.
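
In the meantime, caching results between scrapes would also cut the call volume, since the same builds are rediscovered on every scrape. A rough sketch (the endpoint is the standard GitHub commits API; the cache shape is an assumption):

import requests

_commit_cache = {}

def get_commit_timestamp(owner, repo, sha, token):
    # Hedged sketch: remember each (repo, sha) we've already resolved so a
    # 15-second scrape interval doesn't re-hit the GitHub API for old builds.
    key = (owner, repo, sha)
    if key not in _commit_cache:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}",
            headers={"Authorization": f"token {token}"},
        )
        resp.raise_for_status()
        _commit_cache[key] = resp.json()["commit"]["committer"]["date"]
    return _commit_cache[key]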

Lead Time currently only supports GitHub Repos

Because we are using the GitHub API to grab the commit timestamps, we can currently only support source code on GitHub. Right now the collector code skips repos that don't have github.com in the URL, but I would like to find a more generic way to handle grabbing the commit timestamp.

This is more difficult than it sounds, as the only generic way seems to be to clone each repo. I'd like to find some less expensive operation than a full repo clone.
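
One candidate that is cheaper than a full clone is a shallow fetch of just the commit in question, which works against any git server that allows fetching a reachable SHA (GitHub and GitLab generally do, but that is an assumption worth verifying):

import subprocess
import tempfile

def commit_timestamp(repo_url, sha):
    # Hedged sketch: fetch only the single commit (depth 1) into a throwaway
    # repo, then read its committer timestamp. Avoids a full clone but still
    # requires git and network access from the exporter pod.
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "init", "-q", workdir], check=True)
        subprocess.run(["git", "-C", workdir, "fetch", "-q", "--depth", "1", repo_url, sha], check=True)
        out = subprocess.run(
            ["git", "-C", workdir, "show", "-s", "--format=%ct", "FETCH_HEAD"],
            check=True, capture_output=True, text=True,
        )
        return int(out.stdout.strip())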

Refactoring to a generic method would also remove a manual step in the install process, where the user has to go generate an API token, which is likely to have scalability problems anyway.

Error: failed to start container

Running in a disconnected install, I changed all image references to pull from a local Quay registry. It looks like everything deploys correctly, but I have this error for:
prometheus-prometheus-pelorus-0
prometheus-prometheus-pelorus-1
Error: failed to start container "prometheus-config-reloader": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: "/bin/prometheus-config-reloader": stat /bin/prometheus-config-reloader: no such file or directory"

quaylocal.local/coreos/prometheus-config-reloader:v0.33.0" already present on machine

This is on an OpenShift 3.11.153 cluster.

Support Bitbucket integration

We recently installed Pelorus into a cluster being used on an Open Innovation Labs residency in EMEA. The customer is using Bitbucket as their source code repository and Pelorus is currently unable to collect data from Bitbucket.

Raising this issue to track future development and integration with Bitbucket.

Committime exporter failing to return any data because of one build.

I have the committime exporter deployed in a cluster, and it is returning the following error in the logs:

INFO:root:Namespace: kenwilli-basic-spring-boot-build, App: kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d, Build: kenwilli-basic-spring-boot-3
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 52, in prometheus_app
    status, header, output = _bake_output(registry, accept_header, params)
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 40, in _bake_output
    output = encoder(registry)
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/openmetrics/exposition.py", line 56, in generate_latest
    floatToGoString(s.value),
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/utils.py", line 9, in floatToGoString
    d = float(d)
TypeError: ("float() argument must be a string or a number, not 'NoneType'", Metric(github_commit_timestamp, Commit timestamp, gauge, , [Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:b662b493c37a2f0810e6c96268a46bda0364bb5e93f0d7c672ea8fd20966da2e'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:5818ec7641d40144628c2537914d2874167c6edf64cfefa8bda89f8b525a36b6'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2e7efc7c011eef8795f2d73fe37b0ce71198aeb62eca97de49dd2379d8855885'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2f934e61518d81b9decd423ec8ea88f06e379281df860d120f564721328b7ae3'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:fbcb7bd0acf00c0f012b09e7fbab09c445e3ff71ca96e5bf2625970751029777'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'kenwilli-basic-spring-boot-build', 'app': 'kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d', 'image_sha': 'sha256:46332b417660900361f7185830a4eb6d5ddc7e3002944ba26ed260e83f415197'}, value=None, timestamp=None, exemplar=None)]))

Because of this one error, the exporter won't return ANY data. If you hit the endpoint for the exporter, it simply returns:

$ curl http://committime-exporter-pelorus-etsauer.apps.cluster-eric.blue.osp.opentlc.com/
A server error occurred.  Please contact the administrator.

We need to do the following:

  1. Figure out why this build is triggering a NoneType error (it shouldn't be)
  2. Find a way to ensure this type of error can be handled in the future, and just skip the broken build instead of failing to return data (see the sketch below).
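
For point 2, one hedged sketch is to drop samples whose value could not be resolved before they ever reach the Prometheus client, so one broken build cannot poison the whole scrape (the metric shape mirrors the traceback above; everything else is an assumption):

import logging
from prometheus_client.core import GaugeMetricFamily

def build_commit_metric(samples):
    # samples: iterable of (label_values, value) pairs collected from builds.
    metric = GaugeMetricFamily(
        'github_commit_timestamp', 'Commit timestamp',
        labels=['namespace', 'app', 'image_sha'])
    for label_values, value in samples:
        if value is None:
            logging.warning("Skipping sample with no commit timestamp: %s", label_values)
            continue
        metric.add_metric(label_values, value)
    return metric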

Committime exporter needs to handle builds in bad states.

We had a build in the cluster that was in a stuck/pending state because of an InvalidOutputReference, which caused the committime exporter to crash with:

AttributeError: 'NoneType' object has no attribute 'git'

We need to ensure the exporter can handle finding builds that are in an unexpected state and move on with collection.

Modify Deploytime exporter to use Kubernetes standard app label

Currently, the Deploytime exporter looks for deployments with a label of application. To match the Kubernetes standard, the exporter should look for the label of app.

List of things to change/update:

  • Change default label in the exporter code (see the sketch after this list)
  • Change basic NGINX example to use app labels
  • Update exporter documentation
  • Update Helm chart to allow passing the label as a variable
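
A minimal sketch of the first and last items, assuming the label name comes in through an environment variable (the variable name APP_LABEL and the use of Deployments here are assumptions):

import os
from kubernetes import client, config

APP_LABEL = os.environ.get("APP_LABEL", "app")  # default to the Kubernetes standard

config.load_kube_config()
apps = client.AppsV1Api()
# Selecting on the bare label key matches any workload that carries that label.
for dep in apps.list_deployment_for_all_namespaces(label_selector=APP_LABEL).items:
    print(dep.metadata.namespace, dep.metadata.name, dep.metadata.labels.get(APP_LABEL))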

Create an exporter that captures failure

Acceptance criteria:

  • An exporter that captures failure event timestamps (start and end)
  • Recommend to pull from a Ticketing System, like Jira
  • An update to the dashboard that does the math between Change Failures and Deployment Frequency (i.e. the change failure rate)

Pelorus Prometheus not scraping OpenShift-monitoring prometheus

Spun up pelorus on a new OpenShift 4.3 cluster.

$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905191700+7bd2e5b-dirty", GitCommit:"7bd2e5b", GitTreeState:"dirty", BuildDate:"2019-05-19T23:52:43Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+520769a", GitCommit:"520769a", GitTreeState:"clean", BuildDate:"2019-10-11T01:55:01Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

$ ./runhelm.sh
.. All creates succeed

All pods come up healthy:

$ oc get pods -n pelorus
NAME                                           READY   STATUS    RESTARTS   AGE
grafana-deployment-6dd5455957-4jzwp            2/2     Running   0          5h36m
grafana-operator-9778b7f46-sj7qq               1/1     Running   0          4h47m
prometheus-operator-pelorus-669cfd4649-4brhc   1/1     Running   0          5h37m
prometheus-prometheus-pelorus-0                4/4     Running   1          5h37m
prometheus-prometheus-pelorus-1                4/4     Running   1          5h37m

I can verify that the scrape configs for openshift prometheus get added to the config file.

$ oc get secrets -n pelorus prometheus-prometheus-pelorus -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | zcat
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: pelorus/prometheus-pelorus
    prometheus_replica: $(POD_NAME)
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-pelorus-rulefiles-0/*.yaml
scrape_configs:
- job_name: federated-prometheus-local
  scrape_interval: 15s
  honor_labels: true
  metrics_path: /federate
  params:
    match[]:
    - '{job="openshift-state-metrics"}'
  scheme: https
  basic_auth:
    username: internal
    password: o2wTgU6miU160slPv/dZ8pqarxxpUIKg3JZCGYpBTXrJoyJ1S2fizavMfnKimUvTPw+ebWp8k6x7aRn7NAq6y+kGNKyF62F1EjBvWY3RsMVRY0Ykt63559M0aDDSfhETVRorHRsbYXgGOLUklpqUfJGoaBs9jTRgll+utzYNufUUq2YWxxklZnhsEVV6Mn2pCH56pbHEWtOw5vylL9BpRv5+uzoBTlDxrBPplZbFyDDl0cFRsR7bOovLH9z73UNb4YRR8BXAd3/N7adgzsqgJuDl7tP69IMDiDPT5xOTPWaAMNtMoBDmDx7DIjKmj9g79SY0WqGa1Ar7/6yqQfra
  tls_config:
    insecure_skip_verify: true
  static_configs:
  - targets:
    - prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
    labels:
      federated_job: federated-prometheus-local
alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: prometheus_replica
  alertmanagers: []

However... there is no scrape data in the pelorus prometheus.

Missing dependency python2-openshift

The setup command ansible-playbook -i galaxy/openshift-toolkit/custom-dashboards/.applier galaxy/openshift-applier/playbooks/openshift-cluster-seed.yml -e include_tags=infrastructure fails with the following error:


TASK [/tmp/ansible.O_0Fyg/openshift-toolkit/custom-dashboards/mdt-secret-discovery : Fetch grafana_config secret] ******************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "This module requires the OpenShift Python client. Try `pip install openshift`"}

Fixed by installing the python2-openshift package and rerunning.
