dora-metrics / pelorus
Automate the measurement of organizational behavior
Home Page: https://pelorus.readthedocs.io/
License: Apache License 2.0
When executing the Helm commands to create an exporter, you must be in the pelorus
namespace, or the objects will be created in the current namespace. This occurs even when the --namespace
flag is passed, as in helm template charts/exporter/ -f exporters/failure/values.yaml --namespace pelorus | oc apply -f- (the --namespace flag only affects how helm template renders the chart; oc apply still targets the current context's namespace).
Currently the lead time exporter requires an administrator to pass a comma-separated list of git repos representing application source code via environment variable to the lead time exporter. This doesn't scale well, as we would have to make an administrative change for each application that comes on board.
What we should do instead is grab the git repository information from the BuildConfigs
we find in the cluster.
Acceptance Criteria:
env: production
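A minimal sketch of the discovery half, assuming the exporter already fetches BuildConfig objects as dicts (e.g. via the openshift DynamicClient); the function name is illustrative, not the exporter's actual API:

```python
def extract_git_uris(build_config_items):
    """Pull git source URIs out of a list of BuildConfig dicts.

    The items would come from listing build.openshift.io/v1 BuildConfigs
    in the watched namespaces; here we only show the extraction step.
    """
    repos = set()
    for bc in build_config_items:
        git = bc.get("spec", {}).get("source", {}).get("git") or {}
        uri = git.get("uri")
        if uri:
            repos.add(uri)
    return sorted(repos)
```

This would replace the administrator-maintained environment variable with data the cluster already has, so new applications are picked up automatically.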
I'm repeatedly getting this error while rolling out the jenkins collector:
TASK [Rollout Build Collector] ************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "rollout", "-n", "hygieia", "latest", "dc/hygieia-jenkins-build-collector"], "delta": "0:00:00.311447", "end": "2019-01-22 09:38:33.878458", "msg": "non-zero return code", "rc": 1, "start": "2019-01-22 09:38:33.567011", "stderr": "error: #2 is already in progress (Running).", "stderr_lines": ["error: #2 is already in progress (Running)."], "stdout": "", "stdout_lines": []}
PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=1
Attempts to log in to prometheus and grafana fail with a 500 Internal Error page.
Logs indicate that oauth fails with "certificate signed by unknown authority"
When running the script with "--set", you'll get an "invalid option" error. The getopts call in the script needs to be updated to accept it.
After data has been pushed to long term storage, it isn't visible in the Grafana dashboard.
Steps to reproduce
Missing the following dependencies:
dnf install -y libselinux-python
Current documentation has three places where we point to the multi-cluster configuration, but none of them uses real config examples for such a scenario:
This issue is to add such documentation, so it's easier for the user to configure a Pelorus instance across multiple clusters.
Rather than having to run our image builds in each cluster, we should provide already built images for our exporters.
Readme with a walkthrough
Sample app & automation
We appear to have some difficulty managing the CRDs for the grafana operator, and the pelorus namespace doesn't clean up properly, forcing us to forcibly delete the namespace stuck in "Terminating". We should look at whether a refactor of the charts would address this.
Keep in mind we are using helm template
to process this, so the baked in Helm methodology for handling CRDs doesn't apply.
I wanted to try changing APP_LABEL to one of the Kubernetes standard label keys, like app.kubernetes.io/name; however, this crashes the commit and deploy time exporters, which use jsonpath expressions to get the values from those labels.
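One possible workaround is to read the label straight from the object's labels dict instead of going through a jsonpath expression, since label keys containing dots and slashes need escaping in jsonpath. A sketch, where APP_LABEL and the object shape are assumptions for illustration:

```python
# Assumed label key; the real exporter takes this from configuration.
APP_LABEL = "app.kubernetes.io/name"

def get_app_label(obj):
    """Return the configured app label value from a k8s object dict, or None.

    Dict access treats the label key as an opaque string, so dots and
    slashes in keys like app.kubernetes.io/name are harmless.
    """
    labels = obj.get("metadata", {}).get("labels", {})
    return labels.get(APP_LABEL)
```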
Currently all of our documentation deals with deploying and configuring pelorus infrastructure. We need to write docs that deal with the usage of Pelorus once it is up and running. Some possible topics include:
By default prometheus only stores 2 weeks' worth of data. For MDT tooling to be valuable, we need to store at least 6 months to a year's worth of history. We need to do some research into a long-term data store for the stack.
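As a stopgap while that research happens, the retention window can be raised on the prometheus-operator Prometheus resource; a minimal sketch, where the resource name and the 1y value are assumptions, and longer retention would need matching storage:

```yaml
# Sketch only: field names follow the prometheus-operator Prometheus CRD.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-pelorus
  namespace: pelorus
spec:
  retention: 1y   # raise from the ~2-week default; assumes storage is sized for it
```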
When I worked at Circonus I designed executive dashboards which showed year over year, quarter over quarter, etc., graphs to show devops teams 'how are we doing today' kinds of answers. It would be (IMHO) really interesting to show that same thing for each of our 4 key metrics (ALT, Deploy, MRT, CFR).
If interested I'd be happy to contribute, or give more specifics.
I have a build that is hosted on an internal GitLab server. When the committime-exporter hits this build, it throws an error, and the pod crashes.
Only GitHub repos are currently supported. Skipping build slack-bot-4
Failed processing commit time for build slack-bot-4
'commit'
{'message': 'Not Found', 'documentation_url': 'https://developer.github.com/v3/repos/commits/#get-a-single-commit'}
Traceback (most recent call last):
File "committime/app.py", line 191, in <module>
REGISTRY.register(CommitCollector(username, token, namespaces, apps))
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 24, in register
names = self._get_names(collector)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 64, in _get_names
for metric in desc_func():
File "committime/app.py", line 31, in collect
ld_metrics = generate_ld_metrics_list(self._namespaces)
File "committime/app.py", line 178, in generate_ld_metrics_list
metric.getCommitTime()
File "committime/app.py", line 66, in getCommitTime
self.commit_timestamp = loader.convert_date_time_to_timestamp(self.commit_time)
File "/opt/app-root/src/committime/lib_pelorus/loader.py", line 18, in convert_date_time_to_timestamp
timestamp = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%SZ')
TypeError: strptime() argument 1 must be str, not None
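A defensive variant of the conversion helper would guard against the missing commit time instead of crashing; this is a sketch, not the loader's actual implementation:

```python
from datetime import datetime, timezone

def convert_date_time_to_timestamp(date_time):
    """Parse an ISO-8601 'Z' timestamp into an epoch float.

    Returns None instead of raising when the input is missing, so one
    build with no resolvable commit time doesn't crash the exporter.
    """
    if not date_time:
        return None
    parsed = datetime.strptime(date_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()
```

Callers would then need to skip metrics whose timestamp comes back as None rather than pass them to the registry.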
This is more of a development problem, where we are testing multiple instances of Pelorus in a single cluster. When we follow the documented uninstall process:
helm template --namespace pelorus pelorus ./charts/deploy/ | oc delete -f- -n pelorus
We end up deleting the operator CRDs, which may still be in use by other instances of pelorus, or by non-pelorus instances of prometheus or grafana. We need a safer way to uninstall the stack without leaving cluster-level RBAC resources behind.
We would like the resulting dashboard (in read-only mode) to be viewable without having to log in.
We are using the following query to calculate Deployment Frequency:
sum(delta(openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"}[$interval]))
The core metric openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"}
appears to work just fine, but the delta/sum
functions seem to break when we pull in the interval.
TASK [Get Tokens] *************************************************************************************************************************************************************************************************
FAILED - RETRYING: Get Tokens (5 retries left).
FAILED - RETRYING: Get Tokens (4 retries left).
FAILED - RETRYING: Get Tokens (3 retries left).
FAILED - RETRYING: Get Tokens (2 retries left).
FAILED - RETRYING: Get Tokens (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 5, "changed": false, "msg": "Status code was 411 and not [200]: HTTP Error 411: Length Required", "status": 411, "url": "http://hygieia.hygieia.apps.d2.casl.rht-labs.com/api/admin/apitokens", "server": "squid/4.0.21", "via": "1.1 atlwifi3.atlanta-airport.com (squid/4.0.21)", "x_squid_error": "ERR_INVALID_REQ 0"}
(The response body was a Squid "Invalid Request" error page from a captive airport-wifi proxy, listing "Content-Length missing for POST or PUT requests" among the possible causes; the injected HTML and ad scripts are omitted.)
The installation for Pelorus is really just a helm chart, except that we rely on a couple of values from secrets in the monitoring stack in order to wire pelorus up with the openshift monitoring stack.
Since we built this, Helm has added a lookup function, which could be used to fetch these values as part of processing the chart: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function
We should switch over to that so we can get rid of the install script altogether.
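A sketch of what that could look like in a chart template (the secret name, namespace, and key are assumptions for illustration, not the actual values Pelorus reads). One caveat: lookup returns an empty map under helm template and --dry-run, so this only works with a real helm install/upgrade:

```yaml
# Assumed secret coordinates; Pelorus reads equivalent values from the
# openshift-monitoring stack today via the install script.
{{- $secret := lookup "v1" "Secret" "openshift-monitoring" "grafana-datasources" }}
{{- if $secret }}
internal-password: {{ index $secret.data "prometheus.yaml" }}
{{- end }}
```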
Two use cases:
The pelorus project doesn't currently support private github instances.
I've noticed that my commit time exporter is hitting rate limits in the GitHub API:
Failed processing commit time for build committime-exporter-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Failed processing commit time for build nodejs-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Namespace: basic-nginx-build , App: basic-nginx-04343be1777087992fbbd87f81313db0cb369684 , Build: basic-nginx-1
The way the exporter is currently written, we hit the github api once for each build it discovers in the cluster. According to the API docs, we are capped at 5000 requests per hour. https://developer.github.com/v3/#rate-limiting
We'll have to work on a few things to help with this:
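One of those things could be memoizing lookups by (repo, sha), since many builds in a cluster share the same commit; a sketch with illustrative names, not the exporter's actual API:

```python
# Simple in-process cache: repeated builds of the same commit cost one
# GitHub API call instead of one call per build discovered in the cluster.
_commit_time_cache = {}

def get_commit_time_cached(repo, sha, fetch):
    """Return the commit time, calling fetch(repo, sha) only on a cache miss."""
    key = (repo, sha)
    if key not in _commit_time_cache:
        _commit_time_cache[key] = fetch(repo, sha)
    return _commit_time_cache[key]
```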
I think this is limited to the clusterrolebindings, but I'm noticing that every time I re-run the ./runhelm.sh
script to a new namespace, the clusterrolebindings are getting overwritten for my serviceaccount, which breaks pelorus in all other namespaces. We should reconfigure the helm chart to handle this better.
Because we are using the GitHub API to grab the commit timestamps, we can currently only support source code on GitHub. Right now the collector code skips repos that don't have github.com
in the URL, but I would like to find a more generic way to handle grabbing the commit timestamp.
This is more difficult than it sounds, as the only generic way seems to be to clone each repo. I'd like to find some less expensive operation than a full repo clone.
Refactoring to a generic method would also remove a manual step in the install process, where the user has to go generate an API token, which is likely to have scalability problems anyway.
When we provision our stack, it creates issues in the OpenShift monitoring stack. Alertmanager and prometheus pods go into crash loop.
Running in a disconnected install, I changed all image references to pull from a local quay.
It looks like everything deploys correctly, but I have this error for
prometheus-prometheus-pelorus-0
prometheus-prometheus-pelorus-1
Error: failed to start container "prometheus-config-reloader": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: "/bin/prometheus-config-reloader": stat /bin/prometheus-config-reloader: no such file or directory"
Image "quaylocal.local/coreos/prometheus-config-reloader:v0.33.0" already present on machine
3.11.153 cluster
The exporter code is becoming difficult to debug when it doesn't work. It would be useful to have a unit testing framework in place that encourages us to write more testable code.
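As an example of the kind of coverage a framework would encourage, here is a pytest-style check of a simplified stand-in for the "is this a GitHub repo" logic mentioned elsewhere in this tracker (the helper itself is illustrative, not the exporter's real code):

```python
def uses_github(url):
    """Simplified stand-in for the exporter's GitHub-repo check."""
    return "github.com" in url

def test_github_https_url():
    assert uses_github("https://github.com/org/app.git")

def test_internal_gitlab_rejected():
    assert not uses_github("https://gitlab.example.com/org/app.git")
```

Factoring decisions like this into small pure functions is what makes the exporters testable without a live cluster.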
We recently installed Pelorus into a cluster being used on an Open Innovation Labs residency in EMEA. The customer is using Bitbucket as their source code repository and Pelorus is currently unable to collect data from Bitbucket.
Raising this issue to track future development and integration to Bitbucket.
Python code quality scanner (figure out what to use, no external hosted service)
(ideas: https://github.com/features/actions; pylama)
I have the committime exporter deployed in a cluster, returning the following error in the logs:
INFO:root:Namespace: kenwilli-basic-spring-boot-build, App: kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d, Build: kenwilli-basic-spring-boot-3
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/wsgiref/handlers.py", line 137, in run
self.result = application(self.environ, self.start_response)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 52, in prometheus_app
status, header, output = _bake_output(registry, accept_header, params)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 40, in _bake_output
output = encoder(registry)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/openmetrics/exposition.py", line 56, in generate_latest
floatToGoString(s.value),
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/utils.py", line 9, in floatToGoString
d = float(d)
TypeError: ("float() argument must be a string or a number, not 'NoneType'", Metric(github_commit_timestamp, Commit timestamp, gauge, , [Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:b662b493c37a2f0810e6c96268a46bda0364bb5e93f0d7c672ea8fd20966da2e'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:5818ec7641d40144628c2537914d2874167c6edf64cfefa8bda89f8b525a36b6'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2e7efc7c011eef8795f2d73fe37b0ce71198aeb62eca97de49dd2379d8855885'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2f934e61518d81b9decd423ec8ea88f06e379281df860d120f564721328b7ae3'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:fbcb7bd0acf00c0f012b09e7fbab09c445e3ff71ca96e5bf2625970751029777'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'kenwilli-basic-spring-boot-build', 'app': 'kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d', 'image_sha': 'sha256:46332b417660900361f7185830a4eb6d5ddc7e3002944ba26ed260e83f415197'}, value=None, timestamp=None, exemplar=None)]))
Because of this one error, the exporter won't return ANY data. If you hit the endpoint for the exporter, it simply returns:
$ curl http://committime-exporter-pelorus-etsauer.apps.cluster-eric.blue.osp.opentlc.com/
A server error occurred. Please contact the administrator.
We need to do the following:
- NoneType error (it shouldn't be)

We had a build in the cluster that was in a stuck/pending state because of an InvalidOutputReference, and this caused the committime exporter to crash with:
AttributeError: 'NoneType' object has no attribute 'git'
We need to ensure the exporter can handle finding builds that are in an unexpected state and move on with collection.
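One way to do that is to filter out metrics with no resolvable commit timestamp during collection, logging and skipping instead of raising, so a single bad build can't poison the whole /metrics response; a sketch with illustrative names:

```python
import logging

def usable_metrics(metrics):
    """Yield only metrics whose commit_timestamp is set; log and skip the rest."""
    for m in metrics:
        if getattr(m, "commit_timestamp", None) is None:
            logging.warning("Skipping build %s: no commit timestamp",
                            getattr(m, "build_name", "<unknown>"))
            continue
        yield m
```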
Currently, the Deploytime exporter looks for deployments with a label of application. To match the Kubernetes standard, the exporter should look for the label app.
List of things to change/update:
- app labels

It appears the long-term gitops/infra-as-code solution is going to center around Helm as the templating framework of choice, and ArgoCD as the orchestration engine. We should look at what it would take to convert our applier inventory over to helm/argo.
Acceptance criteria:
Spun up pelorus on a new OpenShift 4.3 cluster.
$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905191700+7bd2e5b-dirty", GitCommit:"7bd2e5b", GitTreeState:"dirty", BuildDate:"2019-05-19T23:52:43Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+520769a", GitCommit:"520769a", GitTreeState:"clean", BuildDate:"2019-10-11T01:55:01Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
$ ./runhelm.sh
.. All creates succeed
All pods come up healthy:
$ oc get pods -n pelorus
NAME READY STATUS RESTARTS AGE
grafana-deployment-6dd5455957-4jzwp 2/2 Running 0 5h36m
grafana-operator-9778b7f46-sj7qq 1/1 Running 0 4h47m
prometheus-operator-pelorus-669cfd4649-4brhc 1/1 Running 0 5h37m
prometheus-prometheus-pelorus-0 4/4 Running 1 5h37m
prometheus-prometheus-pelorus-1 4/4 Running 1 5h37m
I can verify that the scrape configs for openshift prometheus get added to the config file.
$ oc get secrets -n pelorus prometheus-prometheus-pelorus -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | zcat
global:
evaluation_interval: 30s
scrape_interval: 30s
external_labels:
prometheus: pelorus/prometheus-pelorus
prometheus_replica: $(POD_NAME)
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-pelorus-rulefiles-0/*.yaml
scrape_configs:
- job_name: federated-prometheus-local
scrape_interval: 15s
honor_labels: true
metrics_path: /federate
params:
match[]:
- '{job="openshift-state-metrics"}'
scheme: https
basic_auth:
username: internal
password: o2wTgU6miU160slPv/dZ8pqarxxpUIKg3JZCGYpBTXrJoyJ1S2fizavMfnKimUvTPw+ebWp8k6x7aRn7NAq6y+kGNKyF62F1EjBvWY3RsMVRY0Ykt63559M0aDDSfhETVRorHRsbYXgGOLUklpqUfJGoaBs9jTRgll+utzYNufUUq2YWxxklZnhsEVV6Mn2pCH56pbHEWtOw5vylL9BpRv5+uzoBTlDxrBPplZbFyDDl0cFRsR7bOovLH9z73UNb4YRR8BXAd3/N7adgzsqgJuDl7tP69IMDiDPT5xOTPWaAMNtMoBDmDx7DIjKmj9g79SY0WqGa1Ar7/6yqQfra
tls_config:
insecure_skip_verify: true
static_configs:
- targets:
- prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
labels:
federated_job: federated-prometheus-local
alerting:
alert_relabel_configs:
- action: labeldrop
regex: prometheus_replica
alertmanagers: []
However... there is no scrape data in the pelorus prometheus.
Running pelorus without long term storage breaks Grafana. This is because when long term storage isn't run, the service https://thanos-pelorus.pelorus.svc:9092
isn't created. However, the Grafana dashboard still points to that service.
The setup command ansible-playbook -i galaxy/openshift-toolkit/custom-dashboards/.applier galaxy/openshift-applier/playbooks/openshift-cluster-seed.yml -e include_tags=infrastructure
fails with the following error:
TASK [/tmp/ansible.O_0Fyg/openshift-toolkit/custom-dashboards/mdt-secret-discovery : Fetch grafana_config secret] ******************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "This module requires the OpenShift Python client. Try `pip install openshift`"}
Fixed by installing python2-openshift package and rerunning.