dora-metrics / pelorus
Automate the measurement of organizational behavior
Home Page: https://pelorus.readthedocs.io/
License: Apache License 2.0
When executing the Helm commands to create an exporter, you must be in the pelorus
namespace, or the objects will be created in the current namespace. This occurs even when the --namespace
flag is passed, as in helm template charts/exporter/ -f exporters/failure/values.yaml --namespace pelorus | oc apply -f- (the --namespace flag only affects how helm template renders the chart; oc apply still targets the current context's namespace).
Currently the lead time exporter requires an administrator to pass a comma-separated list of git repos representing application source code via environment variable to the lead time exporter. This doesn't scale well, as we would have to make an administrative change for each application that comes on board.
What we should do instead is grab the git repository information from the BuildConfigs
we find in the cluster.
Acceptance Criteria:
env: production
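A minimal sketch of the discovery half, assuming the exporter already fetches BuildConfig objects as dicts (e.g. via the openshift DynamicClient); the function name is illustrative, not the exporter's actual API:

```python
def extract_git_uris(build_config_items):
    """Pull git source URIs out of a list of BuildConfig dicts.

    The items would come from listing build.openshift.io/v1 BuildConfigs
    in the watched namespaces; here we only show the extraction step.
    """
    repos = set()
    for bc in build_config_items:
        git = bc.get("spec", {}).get("source", {}).get("git") or {}
        uri = git.get("uri")
        if uri:
            repos.add(uri)
    return sorted(repos)
```

This would replace the administrator-maintained environment variable with data the cluster already has, so new applications are picked up automatically.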
I'm repeatedly getting this error while rolling out the jenkins collector:
TASK [Rollout Build Collector] ************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "rollout", "-n", "hygieia", "latest", "dc/hygieia-jenkins-build-collector"], "delta": "0:00:00.311447", "end": "2019-01-22 09:38:33.878458", "msg": "non-zero return code", "rc": 1, "start": "2019-01-22 09:38:33.567011", "stderr": "error: #2 is already in progress (Running).", "stderr_lines": ["error: #2 is already in progress (Running)."], "stdout": "", "stdout_lines": []}
PLAY RECAP ********************************************************************************************************************************************************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=1
Attempts to log in to prometheus and grafana fail with a 500 Internal Error page.
Logs indicate that oauth fails with "certificate signed by unknown authority"
When running the script with "--set", you'll get an "invalid option" error. The getopts call in the script needs to be updated to accept it.
After data has been pushed to long term storage, it isn't visible in the Grafana dashboard.
Steps to reproduce
Missing the following dependencies:
dnf install -y libselinux-python
Current documentation has three places where we point to the multi-cluster configuration, but none of them uses real config examples for such a scenario:
This issue is to add such documentation, so it's easier for the user to configure a Pelorus instance across multiple clusters.
Rather than having to run our image builds in each cluster, we should provide already built images for our exporters.
Readme with a walkthrough
Sample app & automation
We appear to have some difficulty managing the CRDs for the grafana operator, and the pelorus namespace doesn't clean up properly, forcing us to forcibly delete the namespace stuck in "Terminating". We should look at whether a refactor of the charts would address this.
Keep in mind we are using helm template
to process this, so the baked in Helm methodology for handling CRDs doesn't apply.
I wanted to try changing APP_LABEL to one of the Kubernetes standard label keys, like app.kubernetes.io/name; however, this crashes the commit and deploy time exporters, which use jsonpath expressions to get the values from those labels.
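One possible workaround is to read the label straight from the object's labels dict instead of going through a jsonpath expression, since label keys containing dots and slashes need escaping in jsonpath. A sketch, where APP_LABEL and the object shape are assumptions for illustration:

```python
# Assumed label key; the real exporter takes this from configuration.
APP_LABEL = "app.kubernetes.io/name"

def get_app_label(obj):
    """Return the configured app label value from a k8s object dict, or None.

    Dict access treats the label key as an opaque string, so dots and
    slashes in keys like app.kubernetes.io/name are harmless.
    """
    labels = obj.get("metadata", {}).get("labels", {})
    return labels.get(APP_LABEL)
```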
Currently all of our documentation deals with deploying and configuring pelorus infrastructure. We need to write docs that deal with the usage of Pelorus once it is up and running. Some possible topics include:
By default prometheus only stores 2 weeks' worth of data. For MDT tooling to be valuable, we need to store at least 6 months to a year's worth of history. We need to do some research into a long-term data store for the stack.
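As a stopgap while that research happens, the retention window can be raised on the prometheus-operator Prometheus resource; a minimal sketch, where the resource name and the 1y value are assumptions, and longer retention would need matching storage:

```yaml
# Sketch only: field names follow the prometheus-operator Prometheus CRD.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-pelorus
  namespace: pelorus
spec:
  retention: 1y   # raise from the ~2-week default; assumes storage is sized for it
```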
When I worked at Circonus I designed executive dashboards which showed year over year, quarter over quarter, etc., graphs to show devops teams 'how are we doing today' kinds of answers. It would be (IMHO) really interesting to show that same thing for each of our 4 key metrics (ALT, Deploy, MRT, CFR).
If interested I'd be happy to contribute, or give more specifics.
I have a build that is hosted on an internal GitLab server. When the committime-exporter hits this build, it throws an error, and the pod crashes.
Only GitHub repos are currently supported. Skipping build slack-bot-4
Failed processing commit time for build slack-bot-4
'commit'
{'message': 'Not Found', 'documentation_url': 'https://developer.github.com/v3/repos/commits/#get-a-single-commit'}
Traceback (most recent call last):
File "committime/app.py", line 191, in <module>
REGISTRY.register(CommitCollector(username, token, namespaces, apps))
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 24, in register
names = self._get_names(collector)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 64, in _get_names
for metric in desc_func():
File "committime/app.py", line 31, in collect
ld_metrics = generate_ld_metrics_list(self._namespaces)
File "committime/app.py", line 178, in generate_ld_metrics_list
metric.getCommitTime()
File "committime/app.py", line 66, in getCommitTime
self.commit_timestamp = loader.convert_date_time_to_timestamp(self.commit_time)
File "/opt/app-root/src/committime/lib_pelorus/loader.py", line 18, in convert_date_time_to_timestamp
timestamp = datetime.strptime(date_time, '%Y-%m-%dT%H:%M:%SZ')
TypeError: strptime() argument 1 must be str, not None
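A defensive variant of the conversion helper would guard against the missing commit time instead of crashing; this is a sketch, not the loader's actual implementation:

```python
from datetime import datetime, timezone

def convert_date_time_to_timestamp(date_time):
    """Parse an ISO-8601 'Z' timestamp into an epoch float.

    Returns None instead of raising when the input is missing, so one
    build with no resolvable commit time doesn't crash the exporter.
    """
    if not date_time:
        return None
    parsed = datetime.strptime(date_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()
```

Callers would then need to skip metrics whose timestamp comes back as None rather than pass them to the registry.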
This is more of a development problem, where we are testing multiple instances of Pelorus in a single cluster. When we follow the documented uninstall process:
helm template --namespace pelorus pelorus ./charts/deploy/ | oc delete -f- -n pelorus
We end up deleting the operator CRDs, which may still be in use by other instances of pelorus, or by non-pelorus instances of prometheus or grafana. We need a safer way to uninstall the stack without leaving cluster-level RBAC resources behind.
We would like the resulting dashboard (in read-only mode) to be viewable without having to log in.
We are using the following query to calculate Deployment Frequency:
sum(delta(openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"}[$interval]))
The core metric openshift_apps_deploymentconfigs_complete_rollouts_total{phase="available"}
appears to work just fine, but the delta/sum
functions seem to break when we pull in the interval.
TASK [Get Tokens] *************************************************************************************************************************************************************************************************
FAILED - RETRYING: Get Tokens (5 retries left).
FAILED - RETRYING: Get Tokens (4 retries left).
FAILED - RETRYING: Get Tokens (3 retries left).
FAILED - RETRYING: Get Tokens (2 retries left).
FAILED - RETRYING: Get Tokens (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 5, "changed": false, "msg": "Status code was 411 and not [200]: HTTP Error 411: Length Required", "status": 411, "url": "http://hygieia.hygieia.apps.d2.casl.rht-labs.com/api/admin/apitokens", "server": "squid/4.0.21", "via": "1.1 atlwifi3.atlanta-airport.com (squid/4.0.21)", "x_squid_error": "ERR_INVALID_REQ 0"}
(The response body was a Squid "Invalid Request" error page from a captive airport-wifi proxy, listing "Content-Length missing for POST or PUT requests" among the possible causes; the injected HTML and ad scripts are omitted.)
The installation for Pelorus is really just a helm chart, except that we rely on a couple of values from secrets in the monitoring stack in order to wire pelorus up with the openshift monitoring stack.
Since we built this, Helm has added a lookup function, which could be used to fetch these values as part of processing the chart: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/#using-the-lookup-function
We should switch over to that so we can get rid of the install script altogether.
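A sketch of what that could look like in a chart template (the secret name, namespace, and key are assumptions for illustration, not the actual values Pelorus reads). One caveat: lookup returns an empty map under helm template and --dry-run, so this only works with a real helm install/upgrade:

```yaml
# Assumed secret coordinates; Pelorus reads equivalent values from the
# openshift-monitoring stack today via the install script.
{{- $secret := lookup "v1" "Secret" "openshift-monitoring" "grafana-datasources" }}
{{- if $secret }}
internal-password: {{ index $secret.data "prometheus.yaml" }}
{{- end }}
```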
Two use cases:
The pelorus project doesn't currently support private github instances.
I've noticed that my commit time exporter is hitting rate limits in the GitHub API:
Failed processing commit time for build committime-exporter-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Failed processing commit time for build nodejs-1
'commit'
{'message': 'API rate limit exceeded for user ID 4500758.', 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}
Namespace: basic-nginx-build , App: basic-nginx-04343be1777087992fbbd87f81313db0cb369684 , Build: basic-nginx-1
The way the exporter is currently written, we hit the github api once for each build it discovers in the cluster. According to the API docs, we are capped at 5000 requests per hour. https://developer.github.com/v3/#rate-limiting
We'll have to work on a few things to help with this:
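One of those things could be memoizing lookups by (repo, sha), since many builds in a cluster share the same commit; a sketch with illustrative names, not the exporter's actual API:

```python
# Simple in-process cache: repeated builds of the same commit cost one
# GitHub API call instead of one call per build discovered in the cluster.
_commit_time_cache = {}

def get_commit_time_cached(repo, sha, fetch):
    """Return the commit time, calling fetch(repo, sha) only on a cache miss."""
    key = (repo, sha)
    if key not in _commit_time_cache:
        _commit_time_cache[key] = fetch(repo, sha)
    return _commit_time_cache[key]
```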
I think this is limited to the clusterrolebindings, but I'm noticing that every time I re-run the ./runhelm.sh
script to a new namespace, the clusterrolebindings are getting overwritten for my serviceaccount, which breaks pelorus in all other namespaces. We should reconfigure the helm chart to handle this better.
Because we are using the GitHub API to grab the commit timestamps, we can currently only support source code on GitHub. Right now the collector code skips repos that don't have github.com
in the URL, but I would like to find a more generic way to handle grabbing the commit timestamp.
This is more difficult than it sounds, as the only generic way seems to be to clone each repo. I'd like to find some less expensive operation than a full repo clone.
Refactoring to a generic method would also remove a manual step in the install process, where the user has to go generate an API token, which is likely to have scalability problems anyway.
When we provision our stack, it creates issues in the OpenShift monitoring stack. Alertmanager and prometheus pods go into crash loop.
Running in a disconnected install, I changed all image references to pull from a local quay.
It looks like everything deploys correctly, but I have this error for
prometheus-prometheus-pelorus-0
prometheus-prometheus-pelorus-1
Error: failed to start container "prometheus-config-reloader": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: "/bin/prometheus-config-reloader": stat /bin/prometheus-config-reloader: no such file or directory"
Image "quaylocal.local/coreos/prometheus-config-reloader:v0.33.0" already present on machine
3.11.153 cluster
The exporter code is becoming difficult to debug when it doesn't work. It would be useful to have a unit testing framework in place that encourages us to write more testable code.
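As an example of the kind of coverage a framework would encourage, here is a pytest-style check of a simplified stand-in for the "is this a GitHub repo" logic mentioned elsewhere in this tracker (the helper itself is illustrative, not the exporter's real code):

```python
def uses_github(url):
    """Simplified stand-in for the exporter's GitHub-repo check."""
    return "github.com" in url

def test_github_https_url():
    assert uses_github("https://github.com/org/app.git")

def test_internal_gitlab_rejected():
    assert not uses_github("https://gitlab.example.com/org/app.git")
```

Factoring decisions like this into small pure functions is what makes the exporters testable without a live cluster.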
We recently installed Pelorus into a cluster being used on an Open Innovation Labs residency in EMEA. The customer is using Bitbucket as their source code repository and Pelorus is currently unable to collect data from Bitbucket.
Raising this issue to track future development and integration to Bitbucket.
Python code quality scanner (figure out what to use, no external hosted service)
(ideas: https://github.com/features/actions; pylama)
I have the committime exporter deployed in a cluster, returning the following error in the logs:
INFO:root:Namespace: kenwilli-basic-spring-boot-build, App: kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d, Build: kenwilli-basic-spring-boot-3
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/wsgiref/handlers.py", line 137, in run
self.result = application(self.environ, self.start_response)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 52, in prometheus_app
status, header, output = _bake_output(registry, accept_header, params)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/exposition.py", line 40, in _bake_output
output = encoder(registry)
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/openmetrics/exposition.py", line 56, in generate_latest
floatToGoString(s.value),
File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/utils.py", line 9, in floatToGoString
d = float(d)
TypeError: ("float() argument must be a string or a number, not 'NoneType'", Metric(github_commit_timestamp, Commit timestamp, gauge, , [Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:b662b493c37a2f0810e6c96268a46bda0364bb5e93f0d7c672ea8fd20966da2e'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:5818ec7641d40144628c2537914d2874167c6edf64cfefa8bda89f8b525a36b6'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2e7efc7c011eef8795f2d73fe37b0ce71198aeb62eca97de49dd2379d8855885'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:2f934e61518d81b9decd423ec8ea88f06e379281df860d120f564721328b7ae3'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'basic-nginx-build', 'app': 'basic-nginx-d5959a2450fa575f95e27a77951ee296319ecfae', 'image_sha': 'sha256:fbcb7bd0acf00c0f012b09e7fbab09c445e3ff71ca96e5bf2625970751029777'}, value=1591804346.0, timestamp=None, exemplar=None), Sample(name='github_commit_timestamp', labels={'namespace': 'kenwilli-basic-spring-boot-build', 'app': 'kenwilli-basic-spring-boot-ea67853d8f9b4f43400500f807d72cdfb5b0936d', 'image_sha': 'sha256:46332b417660900361f7185830a4eb6d5ddc7e3002944ba26ed260e83f415197'}, value=None, timestamp=None, exemplar=None)]))
Because of this one error, the exporter won't return ANY data. If you hit the endpoint for the exporter, it simply returns:
$ curl http://committime-exporter-pelorus-etsauer.apps.cluster-eric.blue.osp.opentlc.com/
A server error occurred. Please contact the administrator.
We need to do the following:
- NoneType error (it shouldn't be)

We had a build in the cluster that was in a stuck/pending state because of an InvalidOutputReference, and this caused the committime exporter to crash with:
AttributeError: 'NoneType' object has no attribute 'git'
We need to ensure the exporter can handle finding builds that are in an unexpected state and move on with collection.
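One way to do that is to filter out metrics with no resolvable commit timestamp during collection, logging and skipping instead of raising, so a single bad build can't poison the whole /metrics response; a sketch with illustrative names:

```python
import logging

def usable_metrics(metrics):
    """Yield only metrics whose commit_timestamp is set; log and skip the rest."""
    for m in metrics:
        if getattr(m, "commit_timestamp", None) is None:
            logging.warning("Skipping build %s: no commit timestamp",
                            getattr(m, "build_name", "<unknown>"))
            continue
        yield m
```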
Currently, the Deploytime exporter looks for deployments with a label of application. To match the Kubernetes standard, the exporter should look for the label app.
List of things to change/update:
- app labels

It appears the long-term gitops/infra-as-code solution is going to center around Helm as the templating framework of choice, and ArgoCD as the orchestration engine. We should look at what it would take to convert our applier inventory over to helm/argo.
Acceptance criteria:
Spun up pelorus on a new OpenShift 4.3 cluster.
$ oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905191700+7bd2e5b-dirty", GitCommit:"7bd2e5b", GitTreeState:"dirty", BuildDate:"2019-05-19T23:52:43Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+520769a", GitCommit:"520769a", GitTreeState:"clean", BuildDate:"2019-10-11T01:55:01Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
$ ./runhelm.sh
.. All creates succeed
All pods come up healthy:
$ oc get pods -n pelorus
NAME READY STATUS RESTARTS AGE
grafana-deployment-6dd5455957-4jzwp 2/2 Running 0 5h36m
grafana-operator-9778b7f46-sj7qq 1/1 Running 0 4h47m
prometheus-operator-pelorus-669cfd4649-4brhc 1/1 Running 0 5h37m
prometheus-prometheus-pelorus-0 4/4 Running 1 5h37m
prometheus-prometheus-pelorus-1 4/4 Running 1 5h37m
I can verify that the scrape configs for openshift prometheus get added to the config file.
$ oc get secrets -n pelorus prometheus-prometheus-pelorus -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | zcat
global:
evaluation_interval: 30s
scrape_interval: 30s
external_labels:
prometheus: pelorus/prometheus-pelorus
prometheus_replica: $(POD_NAME)
rule_files:
- /etc/prometheus/rules/prometheus-prometheus-pelorus-rulefiles-0/*.yaml
scrape_configs:
- job_name: federated-prometheus-local
scrape_interval: 15s
honor_labels: true
metrics_path: /federate
params:
match[]:
- '{job="openshift-state-metrics"}'
scheme: https
basic_auth:
username: internal
password: o2wTgU6miU160slPv/dZ8pqarxxpUIKg3JZCGYpBTXrJoyJ1S2fizavMfnKimUvTPw+ebWp8k6x7aRn7NAq6y+kGNKyF62F1EjBvWY3RsMVRY0Ykt63559M0aDDSfhETVRorHRsbYXgGOLUklpqUfJGoaBs9jTRgll+utzYNufUUq2YWxxklZnhsEVV6Mn2pCH56pbHEWtOw5vylL9BpRv5+uzoBTlDxrBPplZbFyDDl0cFRsR7bOovLH9z73UNb4YRR8BXAd3/N7adgzsqgJuDl7tP69IMDiDPT5xOTPWaAMNtMoBDmDx7DIjKmj9g79SY0WqGa1Ar7/6yqQfra
tls_config:
insecure_skip_verify: true
static_configs:
- targets:
- prometheus-k8s.openshift-monitoring.svc.cluster.local:9091
labels:
federated_job: federated-prometheus-local
alerting:
alert_relabel_configs:
- action: labeldrop
regex: prometheus_replica
alertmanagers: []
However... there is no scrape data in the pelorus prometheus.
Running pelorus without long term storage breaks Grafana. This is because when long term storage isn't run, the service https://thanos-pelorus.pelorus.svc:9092
isn't created. However, the Grafana dashboard still points to that service.
The setup command ansible-playbook -i galaxy/openshift-toolkit/custom-dashboards/.applier galaxy/openshift-applier/playbooks/openshift-cluster-seed.yml -e include_tags=infrastructure
fails with the following error:
TASK [/tmp/ansible.O_0Fyg/openshift-toolkit/custom-dashboards/mdt-secret-discovery : Fetch grafana_config secret] ******************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "This module requires the OpenShift Python client. Try `pip install openshift`"}
Fixed by installing python2-openshift package and rerunning.