openshift / openshift-tools

A public repository of scripts used by OpenShift Operations for various purposes

License: Apache License 2.0

Python 88.78% Shell 2.65% PHP 0.13% CSS 0.11% JavaScript 0.04% HTML 0.30% Go 0.57% Makefile 0.01% Dockerfile 0.05% Jinja 7.36%

openshift-tools's Introduction

openshift-tools repository

Description

This repository contains scripts used by OpenShift Operations for various purposes, many of which do not pertain directly to management of OpenShift itself.

Code here may be under heavy development, and none of these scripts are officially supported by Red Hat. Use at your own risk.

The repository layout is designed for use with Tito.

See our Build instructions for more information.

Builds

RPMs from this repo can be found on our Copr Project.

Directory Structure

openshift-tools
├── docker                Docker image definitions used by OpenShift Operations
├── docs                  Documentation specific to this repository
├── jenkins               Build bot checks for our PRs
├── openshift             Contains OpenShift v3 specific files (templates, pod defs, etc)
├── openshift_tools       Python openshift_tools module
├── pmdas                 Custom PCP PMDAs
├── README.adoc           This file
├── rel-eng               Files used by tito (rpm build tool)
├── scripts               Scripts that usually use the openshift_tools module
├── support-tools         Support related tooling (will be moved into the above structure in the future)
└── web                   Python Web based tooling (e.g. the Zagg REST API)

openshift-tools's People

Contributors

aweiteka, blentz, blrm, blueshells, bmeng, dak1n1, dbaker-rh, dgoodwin, drewandersonnz, dyocum, ivanhorvath, jewzaam, joelsmith, jupierce, mbarnes, mmahut, mwoodson, ninabauer, openshift-ops-bot, rharrison10, rhdedgar, robotmaxtron, stenwt, tiwillia, twiest, wanghaoran1988, wshearn, yocum137, zgalor, zhiwliu


openshift-tools's Issues

Issue Deploying Local Development Monitoring

While following the How to do local development of openshift-tools monitoring components page, I ran into an issue.

OS: Fedora 25
OC Version: 1.3.3 (initially was running 1.3.4 but ran into an issue)

Ansible and docker are installed and working according to the guide. After cloning the openshift-tools repo, I ran the ./openshift-tools/docker/build-local-setup-centos7.sh script successfully. My next step was to run the ./openshift-tools/docker/local_development/start-local-dev-env.sh script which fails with the following error:

TASK [os_zabbix : Template Create Template] ************************************
fatal: [localhost]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ g_template_app_zabbix_server }}: {u'zitems': [{u'zabbix_type': u'internal', u'description': u'A simple count of the number of partition creates output by the housekeeper script.', u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'housekeeper_creates', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'A simple count of the number of partition drops output by the housekeeper script.', u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'housekeeper_drops', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'A simple count of the number of errors output by the housekeeper script.', u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'housekeeper_errors', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'A simple count of the total number of lines output by the housekeeper script.', u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'housekeeper_total', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,alerter,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,configuration syncer,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,db watchdog,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,discoverer,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,escalator,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,history syncer,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,housekeeper,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,http poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,icmp pinger,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,ipmi poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,java poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,node watcher,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': 
[u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,proxy poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,self-monitoring,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,snmp trapper,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,timer,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,trapper,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[process,unreachable poller,avg,busy]', u'units': u'%'}, {u'zabbix_type': u'internal', u'description': u'', u'interval': 600, u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'zabbix[queue,10m]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'interval': 600, u'applications': [u'Zabbix server'], u'value_type': u'int', u'key': u'zabbix[queue]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[rcache,buffer,pfree]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[wcache,history,pfree]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[wcache,text,pfree]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[wcache,trend,pfree]', u'units': u''}, {u'zabbix_type': u'internal', u'description': u'', u'applications': [u'Zabbix server'], u'value_type': u'float', u'key': u'zabbix[wcache,values]', u'delta': 1, u'units': u''}, {u'zabbix_type': u'trapper', u'applications': [u'Zabbix server'], u'value_type': u'int', u'description': u'How many versions behind is the current running version of Zabbix', u'key': u'zabbix.software.version.disparity'}], u'ztriggers': [{u'priority': u'high', u'url': u'https://github.com/openshift/ops-sop/blob/master/Services/Zabbix/Custom_Checks.asciidoc', u'expression': u'{Template App Zabbix Server:web.test.fail[check_v2_zabbix_web].last(#1)}<>0 and {Template App Zabbix Server:web.test.fail[check_v2_zabbix_web].last(#2)}<>0', u'description': u'Zabbix v2 Web UI is not responding', u'name': u'Zabbix v2 Web UI is not responding'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_DB_Housekeeping.asciidoc', u'expression': u'{Template App Zabbix Server:housekeeper_errors.last(0)}+{Template App Zabbix Server:housekeeper_creates.last(0)}+{Template App Zabbix Server:housekeeper_drops.last(0)}<>{Template App Zabbix Server:housekeeper_total.last(0)}', u'description': u"There has been unexpected output while running the housekeeping script on the Zabbix. 
There are only three kinds of lines we expect to see in the output, and we've gotten something enw.\r\n\r\nCheck the script's output in /var/lib/zabbix/state for more details.", u'name': u'Unexpected output in Zabbix DB Housekeeping'}, {u'priority': u'high', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:housekeeper_errors.last(0)}>0', u'description': u"An error has occurred during running the housekeeping script on the Zabbix. Check the script's output in /var/lib/zabbix/state for more details.", u'name': u'Errors during Zabbix DB Housekeeping'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,alerter,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix alerter processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,configuration syncer,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix configuration syncer processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,db watchdog,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix db watchdog processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,discoverer,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix discoverer processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,escalator,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix escalator processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,history syncer,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix history syncer processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,housekeeper,avg,busy].min(1800)}>75', u'description': u'', u'name': u'Zabbix housekeeper processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,http poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix http poller processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,icmp pinger,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix icmp pinger processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,ipmi 
poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix ipmi poller processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,java poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix java poller processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,node watcher,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix node watcher processes more than 75% busy'}, {u'priority': u'high', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix poller processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,proxy poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix proxy poller processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,self-monitoring,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix self-monitoring processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,snmp trapper,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix snmp trapper processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,timer,avg,busy].min(600)}>75', u'description': u'Timer processes usually are busy because they have to process time based trigger functions', u'name': u'Zabbix timer processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,trapper,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix trapper processes more than 75% busy'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/Zabbix_state_check.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[process,unreachable poller,avg,busy].min(600)}>75', u'description': u'', u'name': u'Zabbix unreachable poller processes more than 75% busy'}, {u'priority': u'high', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/data_lost_overview_plugin.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[queue,10m].min(600)}>1000', u'description': u'This alert generally indicates a performance problem or a problem with the zabbix-server or proxy.\r\n\r\nThe first place to check for issues is Administration > Queue. 
Be sure to check the general view and the per-proxy view.', u'name': u'More than 1000 items having missing data for more than 10 minutes'}, {u'priority': u'info', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/check_cache.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[rcache,buffer,pfree].min(600)}<5', u'description': u'Consider increasing CacheSize in the zabbix_server.conf configuration file', u'name': u'Less than 5% free in the configuration cache'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/check_cache.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[wcache,history,pfree].min(600)}<25', u'description': u'', u'name': u'Less than 25% free in the history cache'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/check_cache.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[wcache,text,pfree].min(600)}<25', u'description': u'', u'name': u'Less than 25% free in the text history cache'}, {u'priority': u'avg', u'url': u'https://github.com/openshift/ops-sop/blob/master/Alerts/check_cache.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix[wcache,trend,pfree].min(600)}<25', u'description': u'', u'name': u'Less than 25% free in the trends cache'}, {u'priority': u'info', u'url': u'https://github.com/openshift/ops-sop/blob/master/V3/Alerts/check_zabbix_version.asciidoc', u'expression': u'{Template App Zabbix Server:zabbix.software.version.disparity.last(0)}>0', u'name': u'New version of Zabbix is available', u'description': u'There is a new version of Zabbix available. This Zabbix server should be updated.'}], u'name': u'Template App Zabbix Server', u'zhttptests': [{u'application': u'Zabbix server', u'interval': 60, u'steps': [{u'url': u'{{ g_v2_monitor_url }}', u'status_codes': u'200', u'required': u'Zabbix', u'name': u'Check v2 Zabbix web', u'no': 1}], u'name': u'check_v2_zabbix_web'}]}: 'g_v2_monitor_url' is undefined\n\nThe error appears to have been in '/home/countpickering/openshift-tools/ansible/roles/lib_zabbix/tasks/create_template.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Template Create Template\n ^ here\n"}
to retry, use: --limit @/home/countpickering/openshift-tools/ansible/playbooks/adhoc/zabbix_setup/oo-config-zaio.retry

PLAY RECAP *********************************************************************
localhost : ok=80 changed=48 unreachable=0 failed=1

At this point, some of the environment is up and working. The OpenShift Web Console is working along with the deployments and pods: mysql, oso-cent7-zabbix-web, oso-cent7-zabbix-server. The oso-cent7-zagg-web deployment says "No deployments". I also noticed my hosts file has the oso-cent7-zabbix-web record set, but not the oso-cent7-zagg-web record.

The error message above reads as if the mentioned create_template.yml has a syntax issue, but I was unable to find one by inspection. Any help on this issue is appreciated.

simplezabbix: cannot import name VariableManager

Traceback (most recent call last):
  File "/usr/bin/ops-zagg-metric-processor", line 36, in <module>
    from openshift_tools.ansible.simplezabbix import SimpleZabbix
  File "/usr/lib/python2.7/site-packages/openshift_tools/ansible/simplezabbix.py", line 27, in <module>
    from ansible.vars import VariableManager
ImportError: cannot import name VariableManager
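
The import fails because Ansible 2.4 moved VariableManager from ansible.vars to ansible.vars.manager. A minimal compatibility sketch for simplezabbix.py, assuming the class is used the same way on both versions:

# Ansible 2.4 relocated VariableManager; try the new path first, then fall back.
try:
    from ansible.vars.manager import VariableManager  # Ansible >= 2.4
except ImportError:
    from ansible.vars import VariableManager          # Ansible < 2.4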

Testing only on localhost can create problems

Greetings,
I am an engineer who is interested in infrastructure-as-code (IaC) testing. Currently, I am looking for testing anti-patterns in IaC test scripts, and I noticed local-only testing occurring in test instances. This can give passing results in the local environment, but the same tests can fail on a real production system because of differences between the environments. My recommendation is to test IaC code in an isolated, non-local environment that closely resembles the production environment.

So I have the following queries:

Do you agree that this is an IaC testing anti-pattern?
Do you want to fix this?

Any feedback is appreciated.

Source Files:
https://github.com/openshift/openshift-tools/blob/prod/ansible/roles/lib_aws_service_limit/tests/test.yml

Hawkular Metrics: Provide debugging information when a check fails.

Currently, if we have a check that is failing, we don't have any information about what is going on with the server during that time. Having access to logs or other debugging information would be incredibly useful.

Things which are essential to attach:

  • the logs for Hawkular Metrics, Cassandra and Heapster
  • the output of 'oc get all -o yaml -n openshift-infra'
  • the output of 'oc describe all -n openshift-infra'

cron-send-os-master-metrics.py fails when nodes are not labeled

When a node has no "type" label, the node checks will print: Problem Openshift API checks: 'type'

This is masking the exception:

Traceback (most recent call last):
  File "/usr/bin/cron-send-os-master-metrics", line 422, in <module>
    OMCZ.run()
  File "/usr/bin/cron-send-os-master-metrics", line 98, in run
    self.nodes_not_schedulable()
  File "/usr/bin/cron-send-os-master-metrics", line 353, in nodes_not_schedulable
    if n['metadata']['labels']['type'] == 'master':
KeyError: 'type'

The code should not assume labels exist, and should probably raise some other error when they are missing.
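
A minimal sketch of the defensive lookup this suggests; the helper name and the way unlabeled nodes are reported are illustrative, not the script's actual API:

def node_type(node):
    """Return the node's 'type' label, or None when the node is unlabeled."""
    labels = node.get('metadata', {}).get('labels') or {}
    return labels.get('type')

# In nodes_not_schedulable(), instead of n['metadata']['labels']['type'] == 'master':
#     if node_type(n) == 'master': ...
# Nodes where node_type(n) is None can then be reported separately instead of
# failing the whole check with a KeyError.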

Hawkular Metrics: Provide Context as to why the check is failing

Currently there is a check which queries Hawkular Metrics to make sure that we are collecting metrics from each node, and it returns a yes-or-no answer.

Why the check fails is not included in the report, so when the check fails we don't know whether it's affecting just a single node, multiple nodes, or the whole cluster. We also don't know whether the test failed because the request to Hawkular Metrics failed, because it doesn't have any metrics stored for the time period requested, or because of some other error.

The check needs to provide some information about why it failed so that we have something to look into; a rough sketch of such a result follows the list below.

Minimal information that we need to know:

  • which node(s) is having the problem
  • why that node failed the check (e.g. the response was empty when we expected some value to be returned, the request to Hawkular Metrics failed with some error response, etc.)
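
A rough sketch of the kind of structured, per-node result the check could emit instead of a bare yes/no; the field names and the reporting shown here are assumptions, not the existing check's schema:

import collections

# Illustrative structure only; the real check may report through zabbix items instead.
NodeCheckResult = collections.namedtuple(
    'NodeCheckResult', ['node', 'passed', 'reason', 'http_status', 'datapoints'])

results = [
    NodeCheckResult('node-1.example.com', False, 'request to Hawkular Metrics failed', 503, 0),
    NodeCheckResult('node-2.example.com', False, 'no metrics stored for requested period', 200, 0),
]

failed = [r for r in results if not r.passed]
summary = '; '.join('%s: %s' % (r.node, r.reason) for r in failed)
print('%d node(s) failing: %s' % (len(failed), summary))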

Deploying on origin/centos

I've been looking into testing this for deploying some of the components on an origin cluster.
I have created and imported the generated secrets, but the first build fails at the "Save scriptrunner's ssh key" task:

https://github.com/openshift/openshift-tools/blob/prod/docker/oso-zabbix-server/centos7/root/config.yml#L78
Replacing this seems to resolve it but perhaps I'm missing some config steps?

-        {% set rsa_key = "" %}
-        {%- set key_name = "g_user_" + g_zbx_actions_config.zbx_scriptrunner_user %}
-        {%- set rsa_key = hostvars["localhost"][key_name]["priv_key"] %}
-        {{- rsa_key -}}
+        {{- g_user_scriptrunner_priv_key -}}

Node service restart caused by Sync Pod startup

What happened:
In our environment, the master node failed due to insufficient system resources, and most of the sync pods unexpectedly restarted during this period. This caused the node service on the compute nodes to restart, which in turn caused application Pods on those nodes to hit the MatchNodeSelector problem (as described in kubernetes/kubernetes#52902).

What you expected to happen:
We hope that the sync pod will not cause the atomic-node service on the compute node to restart when the openshift-node configmap has not changed.

How to reproduce it (as minimally and precisely as possible):
Simply delete the sync pod and wait for the controller to regenerate.

Anything else we need to know?:

Environment:

ocp version: v3.11.43
OS (e.g: cat /etc/os-release): RHEL 7.8
Kernel (e.g. uname -a): 3.10.0-1127.13.1.el7.x86_64

/kind bug

Hawkular Metrics: Check similar requests as Console

We should be performing checks similar to what the console does. As this is how most of our users interact with metrics in OpenShift, this should be the main check that we have.

Performing these checks should give us the best overall idea of how well metrics is performing.

As part of this we need to:

  • access metrics over the route and not via some other internal service name or ip address directly.

  • use token based authentication. This is what users use to access metrics.

  • perform queries over the last 1 hour, 4 hours, day, 3 days and week

  • perform queries for individual pods as well as deployments (eg replication controllers)

What we need to gather from these checks:

  • were any errors encountered

  • how long it took to perform each check (this would be one of the main metrics we would want to be gathering here)

  • we would also want to be collecting and monitoring the number of metrics returned for each request. This can get a bit tricky. To do this properly we need to know:

    • what the 'metrics_resolution' value is in the Heapster pod (and if this has changed). This will determine how many metrics we should expect to be collected within a certain interval
    • how long a pod has been running. This gets a lot more tricky with deployments as they consist of multiple pods
    • if the metrics stack has been up and running the whole time or if it was down for a certain period of time. If the metric pods were down, they would not have been able to collect metrics during that time.

Determining the exact number of metrics a particular pod or deployment should return across all time periods may not be something we can accurately account for. A good compromise may be to limit this to the last 5 or 10 minutes: we can more easily determine whether the metrics system and a pod have been up for the last 5 minutes than we can account for the whole history of a week (especially if there were intermittent metric failures during that time).
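
A minimal sketch of a console-like timing check along these lines, using the requests library. The route hostname, tenant, metric id, and token source are placeholders, and the endpoint shown is the generic Hawkular Metrics gauge data API rather than the exact calls the console makes:

import time
import requests

ROUTE = 'https://hawkular-metrics.apps.example.com'  # the exposed route, not an internal service IP
TENANT = 'myproject'                                  # Hawkular tenant == OpenShift project (assumed)
TOKEN = open('/var/run/secrets/kubernetes.io/serviceaccount/token').read().strip()
METRIC_ID = 'pod%2Fexample-pod-uid%2Fmemory%2Fusage'  # placeholder, URL-encoded gauge id

def query_window(hours):
    """Query raw datapoints for the last `hours` hours and record how long the call took."""
    end = int(time.time() * 1000)
    start = end - hours * 3600 * 1000
    began = time.time()
    resp = requests.get(
        '%s/hawkular/metrics/gauges/%s/data' % (ROUTE, METRIC_ID),
        headers={'Authorization': 'Bearer ' + TOKEN, 'Hawkular-Tenant': TENANT},
        params={'start': start, 'end': end})
    points = resp.json() if resp.status_code == 200 else []
    return resp.status_code, time.time() - began, len(points)

for hours in (1, 4, 24, 72, 168):  # last hour, 4 hours, day, 3 days, week
    status, elapsed, count = query_window(hours)
    print('%4dh window: HTTP %d in %.2fs, %d datapoints' % (hours, status, elapsed, count))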

Hawkular Metrics: automate update process

Currently, if Hawkular Metrics is reinstalled in the system, the metrics checks will fail until someone goes back into the system and updates the credentials being used. We should automate this process so that manual intervention is never required in these situations.

This could be as simple as automatically fetching the credentials from a secret, or just using a service account which has the proper permissions.

master-restart return code is not explicit

Currently, the line exec timeout 60 docker wait $child_container makes master-restart output only the numeric return code of the container.

Wouldn't something like this be more explicit?

timeout 60 docker wait $child_container
echo "Container $child_container returned with exit code $?"

I would do it myself, but I am not sure where the script lives, as it appears in two locations.

Building monitoring containers on RHEL7

Hello

In #2887 a change was made to roll back to older RHEL 7.3, but I was still having issues building the containers.

https://github.com/slashterix/openshift-tools/commit/a0e65d6578a511eaa0f9489a95b4368bc5127754 shows all the changes I needed to make to get all the containers to build.

I'm running OCP 3.6 Enterprise on premises (on VMware and physical hosts) and want to give this Zabbix monitoring a try, since there isn't much built in for monitoring the platform.

As I'm new here, I'd appreciate any feedback before I open a PR. Have I made any drastic mistakes or done things that don't fit the workflow others use?

Local development of openshift-tools monitoring

I'm attempting to build a local development of the openshift-tools monitoring following the instructions here: https://github.com/openshift/openshift-tools/blob/stg/docs/local_development_monitoring.adoc

The script I am having an issue with is "start-local-dev-env.sh"; the output of that script is below:

$ ./start-local-dev-env.sh
Opening ports on default zone for DNS and container traffic (non-permanent)
success
success
success
success
-- Checking OpenShift client ... OK
-- Checking Docker client ... OK
-- Checking Docker version ... OK
-- Checking for existing OpenShift container ... OK
-- Checking for registry.access.redhat.com/openshift3/ose:v3.4.1.12 image ...
Pulling image registry.access.redhat.com/openshift3/ose:v3.4.1.12
Pulled 1/4 layers, 25% complete
Pulled 2/4 layers, 50% complete
Pulled 3/4 layers, 75% complete
Pulled 4/4 layers, 100% complete
Extracting
Image pull complete
-- Checking Docker daemon configuration ... OK
-- Checking for available ports ...
WARNING: Port 443 is already in use and may cause routing issues for applications.
-- Checking type of volume mount ...
Using nsenter mounter for OpenShift volumes
-- Creating host directories ... OK
-- Finding server IP ...
Using 172.19.72.172 as the server IP
-- Starting OpenShift container ...
Creating initial OpenShift configuration
Starting OpenShift using container 'origin'
Waiting for API server to start listening
OpenShift server started
-- Adding default OAuthClient redirect URIs ... OK
-- Installing registry ... OK
-- Installing router ... OK
-- Importing image streams ... OK
-- Importing templates ... OK
-- Login to server ... OK
-- Creating initial project "myproject" ... OK
-- Removing temporary directory ... OK
-- Server Information ...
OpenShift server started.
The server is accessible via web console at:
https://172.19.72.172:8443

You are logged in as:
User: developer
Password: developer

To login as administrator:
oc login -u system:admin

Login successful.

You have one project on this server: "myproject"

Using project "myproject".
Now using project "monitoring" on server "https://localhost:8443".

You can add applications to this project with the 'new-app' command. For example, try:

oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.
secret/monitoring-secrets
template "ops-cent7-zabbix-monitoring" created
deploymentconfig "mysql" created
deploymentconfig "oso-cent7-zabbix-server" created
deploymentconfig "oso-cent7-zabbix-web" created
deploymentconfig "oso-cent7-zagg-web" created
route "zabbix-web-ssl-route" created
route "zagg-web-ssl-route" created
service "mysql" created
service "oso-cent7-zabbix-server" created
service "oso-cent7-zabbix-web" created
service "oso-cent7-zagg-web" created
Deploying mysql pod
Flag --latest has been deprecated, use 'oc rollout latest' instead
error: cannot trigger a deployment for "mysql" because it contains unresolved images - try 'oc rollout latest dc/mysql'

Any ideas?

instance of suspicious comments

Greetings,

I am a security researcher who is looking for security smells in Ansible scripts.
I found instances of certain keywords, such as TODO, HACK, FIXME, and bug repository IDs, in comments within the Ansible scripts.
According to the Common Weakness Enumeration organization this is a security weakness
(CWE-546: Suspicious Comment https://cwe.mitre.org/data/definitions/546.html).

I am trying to find out if you agree with the findings. I think it is possible to have a nuanced perspective. Any feedback is appreciated.

source: https://github.com/openshift/openshift-tools/blob/prod/openshift/installer/vendored/openshift-ansible-3.5.127/playbooks/common/openshift-cluster/upgrades/v3_5/validator.yml

CI fails when pull request contains double-quotes

When a pull request contains double-quotes in the title or in the description, the CI fails to run. No update from the CI is seen at all on the pull request, but a job is created on the Jenkins server.

This appears to be due to the way the openshift-jenkins-plugin escapes characters in the environment variables passed into a job. The double quotes end up being escaped twice, creating unparseable JSON:

..., "title": "This is a test PR with \\"double quotes\\"", ...

When it should end up like this:

..., "title": "This is a test PR with \"double quotes\"", ...

This issue can be worked around temporarily by removing the double-quotes from the PR title or description.
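
A quick illustration of the double escaping, sketched with Python's json module rather than the plugin's actual code path:

import json

title = 'This is a test PR with "double quotes"'

# Correct: serialize once; the inner quotes are escaped a single time.
good = json.dumps({'title': title})
print(good)  # {"title": "This is a test PR with \"double quotes\""}

# What the double escaping effectively does: run already-valid JSON through a second
# backslash-escaping pass, turning \" into \\" and terminating the string early.
bad = good.replace('\\', '\\\\')
print(bad)   # {"title": "This is a test PR with \\"double quotes\\""}

json.loads(good)  # parses fine
try:
    json.loads(bad)
except ValueError as err:
    print('unparseable: %s' % err)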

Make some of your awesome modules usable by others?

I've been using Zabbix for a while, and writing Zabbix modules to manage its resources has been on my todo list for a long time. Y'all did it, which rocks. So I'm playing with breaking them out into a separate repo, and would totally love to see them recommended for upstream Ansible.

I'm sure there are probably a few others that would be good to do this with.

Thoughts?

Proposal: excluded hosts inventory file

Problem

We need a way to filter which hosts or clusters ansible runs against. We need to be able to easily update these exceptions, track and control changes, and display excluded hosts in TBD tools.

Proposal

Create a flexible data structure that expresses excluded hosts, and extend the ansible/inventory/multi_inventory.py script to apply the excluded-hosts filter (a rough sketch of such a filter follows the data format example below).

Justification

Using the Kubernetes API as a model, the flat list of items of different 'kinds' allows us to arbitrarily add excluded-hosts rules. The downside of this approach is the difficulty of figuring out what the exception schedule is for a particular account. Another downside is the difficulty of understanding how a complex set of maintenance windows might impact a rollout.

Questions:

  • How would we differentiate between ansible runs to identify maintenance vs config loop vs upgrade?
  • How would we express daily windows that extend into another day? Multiple windows should work.

Details

Data format example excluded_hosts.yaml

---
- kind: no_maintenance
  description: Daily NASA blackout maintenance window
  match:
    type: oo_clusterid
    items:
      - companyb
      - acme
  windows:
    - range:
        - "10:00Z"
        - "23:59Z"
- kind: no_maintenance
  description: Multi-day EMEA blackout maintenance window
  match:
    type: oo_clusterid
    items:
      - tech1
  windows:
    - range:
        - "2017-12-018T01:00Z"
        - "2018-01-03T22:00Z"
- kind: no_configloop
  description: Quarantine broken hosts
  match:
    type: oo_name
    items:
      - opstest-ip-172-89-51-231.ec2.internal
- kind: no_upgrade
  description: Upgrade blacklist
  match:
    type: oo_clusterid
    items:
      - fusion8
      - brand1
      - cm-ops
      - dev-preview
      - devops
      - openshift-stage
      - openshiftops
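
A minimal sketch of how multi_inventory.py might consume this file for oo_clusterid matches. PyYAML, the helper names, and the timestamp handling are assumptions; oo_name matching, timezone handling, and daily windows that wrap past midnight are left out:

import datetime
import yaml  # PyYAML assumed available

def load_exclusions(path):
    with open(path) as f:
        return yaml.safe_load(f) or []

def _parse(ts, now):
    """Accept a daily "HH:MMZ" time or a full "YYYY-MM-DDTHH:MMZ" timestamp (UTC assumed)."""
    try:
        t = datetime.datetime.strptime(ts, '%H:%MZ').time()
        return datetime.datetime.combine(now.date(), t)
    except ValueError:
        return datetime.datetime.strptime(ts, '%Y-%m-%dT%H:%MZ')

def in_window(window, now):
    start, end = [_parse(ts, now) for ts in window['range']]
    return start <= now <= end

def excluded_clusterids(exclusions, kind, now=None):
    """Cluster ids excluded for `kind` right now; rules without windows always apply."""
    now = now or datetime.datetime.utcnow()
    excluded = set()
    for rule in exclusions:
        if rule.get('kind') != kind or rule.get('match', {}).get('type') != 'oo_clusterid':
            continue
        windows = rule.get('windows')
        if windows is None or any(in_window(w, now) for w in windows):
            excluded.update(rule['match']['items'])
    return excluded

# Hosts whose oo_clusterid lands in this set would be dropped when --exclude-hosts is passed.
print(excluded_clusterids(load_exclusions('excluded_hosts.yaml'), 'no_maintenance'))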

Implementation

Configuration

New item in multi_inventory.yaml:

excluded_hosts_location: /path/to/excluded_hosts.yaml

UX

Output all hosts inventory (current implementation)

ansible/inventory/multi_inventory.py 

Output inventory with excluded hosts removed using current time

ansible/inventory/multi_inventory.py --exclude-hosts

Output inventory without excluded hosts, override time

ansible/inventory/multi_inventory.py --exclude-hosts --time '2017-10-31T12:00:00Z'

Output just the excluded hosts, to answer the question "Which hosts would be excluded from inventory right now?"

ansible/inventory/multi_inventory.py --excluded-hosts-only

How to get there

  1. Extend ansible/inventory/multi_inventory.py with an --exclude-hosts flag (default False) and associated tests
  2. Develop excluded hosts validation test so we can safely change excluded_hosts.yaml
  3. Consider dynamic inventory logging so excluded_hosts can be monitored and audited
