ibm / ansible-openshift-provisioning

Automate the deployment of Red Hat OpenShift Container Platform on IBM zSystems (s390x). Automated User-Provisioned Infrastructure (UPI) setup using Kernel-based Virtual Machine (KVM).

Home Page: https://ibm.github.io/Ansible-OpenShift-Provisioning/

License: MIT License

Languages: Jinja 15.73%, YAML 81.58%, Python 2.69%
Topics: ansible, automation, ibm, ibmz, infrastructure, kvm, linux, linuxone, openshift, redhat, redhat-enterprise-linux

ansible-openshift-provisioning's People

Contributors

amadeuspodvratnik, ftmiranda, imgbotapp, isumitsolanki, jacobemery, k-shiva-sai, mohammedzee1000, pswilso2017, routerhan, silliman, smolin-de, veera-damisetti


ansible-openshift-provisioning's Issues

Documentation doesn't fill whole webpage.

The documentation web pages don't fill up the entire screen.

It would be amazing if someone could solve this issue!

This problem makes reading things like the tables in Step 2: Set Variables (group_vars) more difficult, as they run off the page.

The docs were created using mkdocs. Here's how to install it.

In this repository, here's the configuration file for mkdocs and here's the folder for all the content on the docs site.

The docs use GitHub Pages to deploy. After installing mkdocs, use the following command to create a local deployment for testing purposes:

mkdocs serve

and then, when it's ready, use this command to deploy to GitHub Pages:

mkdocs gh-deploy

That said, I believe you won't be able to deploy GitHub Pages directly, so just open a pull request and I will test it and deploy it for you.
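A possible starting point for the fix (an untested sketch, and it assumes the site uses the Material for MkDocs theme) is to register an extra stylesheet in mkdocs.yml that widens the content area:

# mkdocs.yml (sketch) -- register a stylesheet override
extra_css:
  - stylesheets/extra.css

# docs/stylesheets/extra.css would then contain something like:
#   .md-grid { max-width: 100%; }
# (the .md-grid selector is specific to the Material theme)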

Playbook 5 says dig validation of DNS fails, but dig lookup commands are successful

Ansible Output:

TASK [dns : Template out bastion's resolv.conf file, replacing initial resolv.conf] ***************************************************************************************
skipping: [bastion.ocp1.ibm.com]

TASK [dns : Restart network to update changes made to /etc/resolv.conf] ***************************************************************************************************
skipping: [bastion.ocp1.ibm.com]

TASK [check_dns : Check internal cluster DNS resolution for the bastion] **************************************************************************************************
changed: [bastion.ocp1.ibm.com]

TASK [check_dns : Check internal cluster DNS resolution for external API and apps services] *******************************************************************************
failed: [bastion.ocp1.ibm.com] (item=api.ocp1.ibm.com) => {"ansible_loop_var": "item", "changed": true, "cmd": "dig +short api.ocp1.ibm.com | tail -n1", "delta": "0:00:00.007357", "end": "2023-06-03 20:20:19.514264", "failed_when_result": true, "item": "api.ocp1.ibm.com", "msg": "", "rc": 0, "start": "2023-06-03 20:20:19.506907", "stderr": "", "stderr_lines": [], "stdout": "9.76.61.80", "stdout_lines": ["9.76.61.80"]}
failed: [bastion.ocp1.ibm.com] (item=apps.ocp1.ibm.com) => {"ansible_loop_var": "item", "changed": true, "cmd": "dig +short apps.ocp1.ibm.com | tail -n1", "delta": "0:00:00.006574", "end": "2023-06-03 20:20:19.777103", "failed_when_result": true, "item": "apps.ocp1.ibm.com", "msg": "", "rc": 0, "start": "2023-06-03 20:20:19.770529", "stderr": "", "stderr_lines": [], "stdout": "9.76.61.80", "stdout_lines": ["9.76.61.80"]}
failed: [bastion.ocp1.ibm.com] (item=test.apps.ocp1.ibm.com) => {"ansible_loop_var": "item", "changed": true, "cmd": "dig +short test.apps.ocp1.ibm.com | tail -n1", "delta": "0:00:00.007048", "end": "2023-06-03 20:20:20.039085", "failed_when_result": true, "item": "test.apps.ocp1.ibm.com", "msg": "", "rc": 0, "start": "2023-06-03 20:20:20.032037", "stderr": "", "stderr_lines": [], "stdout": "9.76.61.80", "stdout_lines": ["9.76.61.80"]}

PLAY RECAP ****************************************************************************************************************************************************************
127.0.0.1 : ok=8 changed=4 unreachable=0 failed=0 skipped=18 rescued=0 ignored=0
bastion.ocp1.ibm.com : ok=20 changed=15 unreachable=0 failed=1 skipped=16 rescued=0 ignored=0

Actual dig command from bastion

[root@bastion ~]# dig +short api.ocp1.ibm.com
9.76.61.80
[root@bastion ~]# dig +short apps.ocp1.ibm.com
9.76.61.80
[root@bastion ~]# dig +short test.apps.ocp1.ibm.com
9.76.61.80
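Note that the dig command itself succeeds (rc is 0 and an address is returned); the task is marked failed by its failed_when condition. A minimal sketch of that kind of check, assuming the role compares the resolved address against an expected IP (expected_ip and services_to_check are placeholders, not the role's actual variables):

- name: Check internal cluster DNS resolution for external API and apps services   # sketch, not the repository's actual task
  shell: "dig +short {{ item }} | tail -n1"
  loop: "{{ services_to_check }}"   # e.g. api.ocp1.ibm.com, apps.ocp1.ibm.com, test.apps.ocp1.ibm.com
  register: dig_result
  failed_when: dig_result.stdout != expected_ip

So even though 9.76.61.80 comes back from dig, the task fails if that is not the address the comparison expects.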

DNS api/app entries created without base domain; bastion DNS record missing

I am having several issues in playbook 5:

  1. You can see the failure below where named does not start.
  2. It fails to start because the DNS record for the bastion was not built.
  3. The "api" and "app" DNS records are missing the base domain name portion!

Is there some way I can change my YAML file so these problems don't occur?

ansible-playbook playbooks/5_setup_bastion.yaml

TASK [dns : Add infrastructure nodes to DNS reverse lookup file on bastion] ***************************************************************************************
changed: [bastion.ocp1.ibm.com] => (item=0)
changed: [bastion.ocp1.ibm.com] => (item=1)
changed: [bastion.ocp1.ibm.com] => (item=2)

TASK [dns : Restart named to update changes made to DNS] **********************************************************************************************************
fatal: [bastion.ocp1.ibm.com]: FAILED! => {"changed": false, "msg": "Unable to restart service named: Job for named.service failed because the control process exited with error code.\nSee "systemctl status named.service" and "journalctl -xe" for details.\n"}

PLAY RECAP ********************************************************************************************************************************************************
127.0.0.1 : ok=8 changed=4 unreachable=0 failed=0 skipped=18 rescued=0 ignored=0
bastion.ocp1.ibm.com : ok=31 changed=25 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0

[admin1@controller Ansible-OpenShift-Provisioning]$

May 31 16:26:11 bastion bash[23363]: zone 0.in-addr.arpa/IN: loaded serial 0
May 31 16:26:11 bastion bash[23363]: zone ibm.com/IN: NS 'bastion.ocp1.ibm.com'
has no address records (A or AAAA)
May 31 16:26:11 bastion bash[23363]: zone ibm.com/IN: not loaded due to errors.
May 31 16:26:11 bastion bash[23363]: _default/ibm.com/IN: bad zone

[root@bastion named]# cat ocp1.db
$TTL 86400
@ IN SOA bastion.ocp1.ibm.com. admin.ocp1.ibm.com.(
2020021821 ;Serial
3600 ;Refresh
1800 ;Retry
604800 ;Expire
86400 ;Minimum TTL
)

;Name Server / Bastion Information
@ IN NS bastion.ocp1.ibm.com.

;IP Address for Name Server
bastion IN A 9.76.61.82

;entry for bootstrap host.
bootstrap.ocp1.ibm.com. IN A 9.76.61.84

;entries for the control nodes
cp3.ocp1.ibm.com. IN A 9.76.61.87
cp2.ocp1.ibm.com. IN A 9.76.61.86
cp1.ocp1.ibm.com. IN A 9.76.61.85

;entries for the compute nodes
aw3.ocp1.ibm.com. IN A 9.76.61.93
aw2.ocp1.ibm.com. IN A 9.76.61.92
aw1.ocp1.ibm.com. IN A 9.76.61.91

;The api identifies the IP of your load balancer.
api.ocp1 IN CNAME bastion.ibm.com.
api-int.ocp1 IN CNAME bastion.ibm.com.

;The wildcard also identifies the load balancer.
apps.ocp1 IN CNAME bastion.ibm.com.
*.apps.ocp1 IN CNAME bastion.ibm.com.

;EOF
iw1.ocp1.ibm.com. IN A 9.76.61.88
iw2.ocp1.ibm.com. IN A 9.76.61.89
iw3.ocp1.ibm.com. IN A 9.76.61.90
[root@bastion named]# cat ocp1.rev
$TTL 86400
@ IN SOA bastion.ocp1.ibm.com. admin.ocp1.ibm.com (
2020011800 ;Serial
3600 ;Refresh
1800 ;Retry
604800 ;Expire
86400 ;Minimum TTL
)
;Name Server Information
@ IN NS bastion.ocp1.ibm.com.
bastion IN A 9.76.61.82

;Reverse lookup for Name Server
82 IN PTR bastion.ocp1.ibm.com.

;PTR Record IP address to Hostname
90 IN PTR iw3.ocp1.ibm.com.
89 IN PTR iw2.ocp1.ibm.com.
88 IN PTR iw1.ocp1.ibm.com.
93 IN PTR aw3.ocp1.ibm.com.
92 IN PTR aw2.ocp1.ibm.com.
91 IN PTR aw1.ocp1.ibm.com.
87 IN PTR cp3.ocp1.ibm.com.
86 IN PTR cp2.ocp1.ibm.com.
85 IN PTR cp1.ocp1.ibm.com.
84 IN PTR bootstrap.ocp1.ibm.com.
82 IN PTR api-int.ocp1.ibm.com.
82 IN PTR api.ocp1.ibm.com.
[root@bastion named]#
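For comparison, given metadata_name ocp1 and base_domain ibm.com, the records in the ibm.com zone would need to carry the cluster name so they expand to the intended FQDNs, and the name server needs an A record at its full name. A sketch of what the relevant records would look like (illustrative, not the generated file):

;IP Address for Name Server
bastion.ocp1.ibm.com.   IN A     9.76.61.82

;The api identifies the IP of your load balancer.
api.ocp1.ibm.com.       IN CNAME bastion.ocp1.ibm.com.
api-int.ocp1.ibm.com.   IN CNAME bastion.ocp1.ibm.com.

;The wildcard also identifies the load balancer.
*.apps.ocp1.ibm.com.    IN CNAME bastion.ocp1.ibm.com.

As generated, the relative name "bastion" expands to bastion.ibm.com in the ibm.com zone, which is why named reports that the NS bastion.ocp1.ibm.com has no A record and refuses to load the zone.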

[admin1@controller Ansible-OpenShift-Provisioning]$ cat inventories/default/group_vars/all.yaml

# Section 1 - Ansible Controller
env:
  controller:
    sudo_pass: its0

# Section 2 - LPAR(s)
  z:
    high_availability: False
    ip_forward: True
    lpar1:
      create: False
      hostname: rdbkkvm4
      ip: 9.76.61.184
      user: lnxadmin
      pass: lnx4rdbk
    lpar2:
      create: False
      hostname:
      ip:
      user:
      pass:
    lpar3:
      create: False
      hostname:
      ip:
      user:
      pass:

# Section 3 - File Server
  file_server:
    ip: 9.76.61.95
    user: admin1
    pass: its0
    protocol: http
    iso_mount_dir: /home/admin1/RHEL/8.7
    cfgs_dir: ocp-config

# Section 4 - Red Hat
  redhat:
    username: xxxxxxxxxx
    password: xxxxxxxx
    # Make sure to enclose pull_secret in 'single quotes'
    pull_secret: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

# Section 5 - Bastion
  bastion:
    create: True
    vm_name: bastion
    resources:
      disk_size: 30
      ram: 8192
      swap: 4096
      vcpu: 4
    networking:
      ip: 9.76.61.82
      hostname: bastion
      base_domain: ocp1.ibm.com
      subnetmask: 255.255.255.0
      gateway: 9.76.61.1
      nameserver1: 9.0.0.2
      nameserver2:
      forwarder: 9.0.0.2
      interface: enc1
    access:
      user: admin1
      pass: its0
      root_pass: its0
    options:
      dns: True
      loadbalancer:
        on_bastion: True
        public_ip: 9.76.61.80
        private_ip: 9.76.71.80

# Section 6 - Cluster Networking
  cluster:
    networking:
      metadata_name: ocp1
      base_domain: ibm.com
      subnetmask: 255.255.255.0
      gateway: 9.76.61.1
      nameserver1: 9.76.61.94
      nameserver2:
      forwarder: 9.0.0.2

# Section 7 - Bootstrap Node
    nodes:
      bootstrap:
        disk_size: 120
        ram: 16384
        vcpu: 4
        vm_name: bootstrap
        ip: 9.76.61.84
        hostname: bootstrap

# Section 8 - Control Nodes
      control:
        disk_size: 120
        ram: 16384
        vcpu: 4
        vm_name:
          - cp1
          - cp2
          - cp3
        ip:
          - 9.76.61.85
          - 9.76.61.86
          - 9.76.61.87
        hostname:
          - cp1
          - cp2
          - cp3

# Section 9 - Compute Nodes
      compute:
        disk_size: 120
        ram: 16384
        vcpu: 4
        vm_name:
          - aw1
          - aw2
          - aw3
        ip:
          - 9.76.61.91
          - 9.76.61.92
          - 9.76.61.93
        hostname:
          - aw1
          - aw2
          - aw3

# Section 10 - Infra Nodes
      infra:
        disk_size: 120
        ram: 16384
        vcpu: 4
        vm_name:
          - iw1
          - iw2
          - iw3
        ip:
          - 9.76.61.88
          - 9.76.61.89
          - 9.76.61.90
        hostname:
          - iw1
          - iw2
          - iw3

#######################################################################################
# All variables below this point do not need to be changed for a default installation #
#######################################################################################

# Section 11 - (Optional) Packages
  pkgs:
    galaxy: [ ibm.ibm_zhmc, community.general, community.crypto, ansible.posix, community.libvirt ]
    controller: [ openssh, expect ]
    kvm: [ libguestfs, libvirt-client, libvirt-daemon-config-network, libvirt-daemon-kvm, cockpit-machines, virt-top, qemu-kvm, python3-lxml, cockpit, lvm2 ]
    bastion: [ haproxy, httpd, bind, bind-utils, expect, firewalld, mod_ssl, python3-policycoreutils, rsync ]
    hypershift: [ make, jq, git, virt-install ]

# Section 12 - OpenShift Settings
  openshift:
    version: 4.12.0
    install_config:
      api_version: v1
      compute:
        architecture: s390x
        hyperthreading: Enabled
      control:
        architecture: s390x
        hyperthreading: Enabled
      cluster_network:
        cidr: 10.128.0.0/14
        host_prefix: 23
        type: OVNKubernetes
      service_network: 172.30.0.0/16
      fips: 'false'

# Section 13 - (Optional) Proxy
  proxy:
    http:
    https:
    no:

# Section 14 - (Optional) Misc
  language: en_US.UTF-8
  timezone: America/New_York
  keyboard: us
  root_access: false
  ansible_key_name: ansible-ocpz
  ocp_ssh_key_comment: OpenShift key
  bridge_name: bond4
  network_mode:

  #jumphost if network mode is NAT
  jumphost:
    name:
    ip:
    user:
    pass:
    path_to_keypair:

# Section 15 - RHCOS (CoreOS)

# rhcos_download_url with '/' at the end !
rhcos_download_url: "https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.12/4.12.3/"

# For rhcos_os_variant use the OS string as defined in 'osinfo-query os -f short-id'
rhcos_os_variant: rhel8.6

# RHCOS live image filenames
rhcos_live_kernel: "rhcos-4.12.3-s390x-live-kernel-s390x"
rhcos_live_initrd: "rhcos-4.12.3-s390x-live-initramfs.s390x.img"
rhcos_live_rootfs: "rhcos-4.12.3-s390x-live-rootfs.s390x.img"

# Section 16 - Hypershift
hypershift:
  kvm_host:
  kvm_host_user:
  bastion_hypershift:
  bastion_hypershift_user:
  mgmt_cluster_nameserver:

  go_version: "1.19.5" # Change this if you want to install any other version of go
  oc_url:

  # Hosted Control Plane Parameters
  hcp:
    clusters_namespace:
    hosted_cluster_name:
    basedomain:
    pull_secret_file: /root/ansible_workdir/auth_file
    ocp_release:
    machine_cidr:
    arch:
    # Make sure to enclose pull_secret in 'single quotes'
    pull_secret:

  # AgentServiceConfig Parameters
  asc:
    url_for_ocp_release_file:
    db_volume_size:
    fs_volume_size:
    ocp_version:
    iso_url:
    root_fs_url:
    mce_namespace: "multicluster-engine" # This is the Recommended Namespace for Multicluster Engine operator

path_to_key_pair: /home/admin1/.ssh/ansible-ocpz.pub

Proxy settings

In the current version an error exists: when the proxy settings are filled in, the NAT section ends up with the wrong indentation.
The easiest solution is to place the proxy settings at the end of the group_vars file.

IPv6 ULA address is not assigned to OCP nodes when OCP is installed on KVM

OCP bring-up with IPv6 on KVM for dual-stack support

Once OCP is installed, the OCP nodes don't get an IPv6 ULA address.

NAME       STATUS   ROLES                  AGE   VERSION
master-0   Ready    control-plane,master   33m   v1.27.4+2c83a9f
master-1   Ready    control-plane,master   32m   v1.27.4+2c83a9f
master-2   Ready    control-plane,master   31m   v1.27.4+2c83a9f
worker-0   Ready    worker                 17m   v1.27.4+2c83a9f
worker-1   Ready    worker                 16m   v1.27.4+2c83a9f

[root@bastion network-scripts]# oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         5m46s   Cluster version is 4.14.0-rc.0

[root@bastion network-scripts]# oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.0   True        False         False      6m22s
baremetal                                  4.14.0-rc.0   True        False         False      28m
cloud-controller-manager                   4.14.0-rc.0   True        False         False      33m
cloud-credential                           4.14.0-rc.0   True        False         False      32m
cluster-autoscaler                         4.14.0-rc.0   True        False         False      28m
config-operator                            4.14.0-rc.0   True        False         False      29m
console                                    4.14.0-rc.0   True        False         False      12m
control-plane-machine-set                  4.14.0-rc.0   True        False         False      28m
csi-snapshot-controller                    4.14.0-rc.0   True        False         False      29m
dns                                        4.14.0-rc.0   True        False         False      27m
etcd                                       4.14.0-rc.0   True        False         False      27m
image-registry                             4.14.0-rc.0   True        False         False      17m
ingress                                    4.14.0-rc.0   True        False         False      15m
insights                                   4.14.0-rc.0   True        False         False      22m
kube-apiserver                             4.14.0-rc.0   True        False         False      25m
kube-controller-manager                    4.14.0-rc.0   True        False         False      25m
kube-scheduler                             4.14.0-rc.0   True        False         False      25m
kube-storage-version-migrator              4.14.0-rc.0   True        False         False      29m
machine-api                                4.14.0-rc.0   True        False         False      28m
machine-approver                           4.14.0-rc.0   True        False         False      28m
machine-config                             4.14.0-rc.0   True        False         False      28m
marketplace                                4.14.0-rc.0   True        False         False      28m
monitoring                                 4.14.0-rc.0   True        False         False      9m15s
network                                    4.14.0-rc.0   True        False         False      28m
node-tuning                                4.14.0-rc.0   True        False         False      16m
openshift-apiserver                        4.14.0-rc.0   True        False         False      22m
openshift-controller-manager               4.14.0-rc.0   True        False         False      22m
openshift-samples                          4.14.0-rc.0   True        False         False      21m
operator-lifecycle-manager                 4.14.0-rc.0   True        False         False      28m
operator-lifecycle-manager-catalog         4.14.0-rc.0   True        False         False      28m
operator-lifecycle-manager-packageserver   4.14.0-rc.0   True        False         False      22m
service-ca                                 4.14.0-rc.0   True        False         False      29m
storage                                    4.14.0-rc.0   True        False         False      29m

The bastion got an IPv6 ULA address, but the OCP master and worker nodes have only private IPv4 addresses; no IPv6 ULA address is assigned.

bastion
[root@bastion network-scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enc1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:b6:a0:5e brd ff:ff:ff:ff:ff:ff
inet 192.168.122.2/24 brd 192.168.122.255 scope global noprefixroute enc1
valid_lft forever preferred_lft forever
inet6 fd03::22/128 scope global dynamic noprefixroute
valid_lft 3325sec preferred_lft 3325sec
inet6 fe80::5054:ff:feb6:a05e/64 scope link noprefixroute
valid_lft forever preferred_lft forever

Last login: Wed Sep 13 17:28:30 2023 from 192.168.122.2
[core@master-0 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enc1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
link/ether 52:54:00:32:1c:8c brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 2e:fa:5f:a8:43:1b brd ff:ff:ff:ff:ff:ff
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 52:54:00:32:1c:8c brd ff:ff:ff:ff:ff:ff
inet 192.168.122.4/24 brd 192.168.122.255 scope global noprefixroute br-ex
valid_lft forever preferred_lft forever
inet 169.254.169.2/29 brd 169.254.169.7 scope global br-ex
valid_lft forever preferred_lft forever
6: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
link/ether 56:7d:56:87:69:14 brd ff:ff:ff:ff:ff:ff
7: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 36:aa:7f:46:f7:eb brd ff:ff:ff:ff:ff:ff
inet 10.130.0.2/23 brd 10.130.1.255 scope global ovn-k8s-mp0
valid_lft forever preferred_lft forever
inet6 fe80::34aa:7fff:fe46:f7eb/64 scope link
valid_lft forever preferred_lft forever

worker-0
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enc1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
link/ether 52:54:00:7c:8d:dd brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 7e:bb:6f:87:16:be brd ff:ff:ff:ff:ff:ff
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 52:54:00:7c:8d:dd brd ff:ff:ff:ff:ff:ff
inet 192.168.122.7/24 brd 192.168.122.255 scope global noprefixroute br-ex
valid_lft forever preferred_lft forever
inet 169.254.169.2/29 brd 169.254.169.7 scope global br-ex
valid_lft forever preferred_lft forever
6: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
link/ether 9a:4a:e4:35:1f:48 brd ff:ff:ff:ff:ff:ff
7: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 92:f0:51:7f:9d:5e brd ff:ff:ff:ff:ff:ff
inet 10.131.0.2/23 brd 10.131.1.255 scope global ovn-k8s-mp0
valid_lft forever preferred_lft forever
inet6 fe80::90f0:51ff:fe7f:9d5e/64 scope link
valid_lft forever preferred_lft forever

Generated full bastion hostname is incorrect

The actual hostname produced is bastion.ibm.com. However, the bastion base domain supplied is ocp1.ibm.com, so the full hostname should be bastion.ocp1.ibm.com.

All.yaml section 5

# Section 5 - Bastion
bastion:
  create: True
  vm_name: bastion
  resources:
    disk_size: 30
    ram: 8192
    swap: 4096
    vcpu: 4
  networking:
    ip: 9.76.61.82
    hostname: bastion
    base_domain: ocp1.ibm.com
    subnetmask: 255.255.255.0
    gateway: 9.76.61.1
    nameserver1: 9.76.61.94
    nameserver2:
    forwarder: 9.0.0.2
    interface: enc1
  access:
    user: admin1
    pass: its0
    root_pass: its0
  options:
    dns: False
    loadbalancer:
      on_bastion: True
      public_ip: 9.76.61.80
      private_ip: 9.76.71.80

nameserver 9.0.0.2[root@bastion sysconfig]# hostnamectl
Static hostname: bastion.ibm.com
Icon name: computer-vm
Chassis: vm
Machine ID: bdd68adf28e04524ac75cfe120989c78
Boot ID: fb83c2f13ee848cfbad27a5643fb8de6
Virtualization: kvm
Operating System: Red Hat Enterprise Linux 8.7 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8::baseos
Kernel: Linux 4.18.0-425.3.1.el8.s390x
Architecture: s390x
[root@bastion sysconfig]#
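Based on the values above, the expected FQDN would be composed from the bastion's own hostname and base_domain. A sketch of that composition (illustrative, not the repository's actual template):

# expected bastion FQDN, composed from the Section 5 values above
bastion_fqdn: "{{ env.bastion.networking.hostname }}.{{ env.bastion.networking.base_domain }}"   # -> bastion.ocp1.ibm.com

The hostname that was actually set, bastion.ibm.com, matches the cluster base_domain (ibm.com) rather than the bastion base_domain (ocp1.ibm.com).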

File server with root userid fails

If a userid of root is used, the path /home/root is attempted. If the use of root is allowed, it should use /root, not /home/root.
If root is not supported, it should be rejected by the code, and the documentation should say it must never be used instead of merely "not recommended".

TASK [Set libvirt management to libvirt group instead of root.] ****************************************************************************************************
changed: [rdbkkvm4]
TASK [Create file for user's custom libvirt configurations.] *******************************************************************************************************
fatal: [rdbkkvm4]: FAILED! => {"changed": false, "msg": "Error, could not touch target: [Errno 2] No such file or directory: b'/home/root/.config/libvirt/libvirt.conf'", "path": "/home/root/.config/libvirt/libvirt.conf"}
PLAY RECAP *********************************************************************************************************************************************************
127.0.0.1 : ok=9 changed=4 unreachable=0 failed=0 skipped=31 rescued=0 ignored=0
rdbkkvm4 : ok=12 changed=7 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
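A possible fix (a sketch, not the repository's current code) is to derive the home directory from the gathered facts instead of building it as /home/<user>; ansible_user_dir resolves to /root for root and /home/<user> otherwise:

- name: Ensure the user's libvirt config directory exists   # sketch
  file:
    path: "{{ ansible_user_dir }}/.config/libvirt"
    state: directory

- name: Create file for user's custom libvirt configurations.   # sketch
  file:
    path: "{{ ansible_user_dir }}/.config/libvirt/libvirt.conf"
    state: touch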

Support for RHEL 9

The ansible code needs to support RHEL 9 for the KVM hosts AND guests.

Bastion kickstart file is installing 'Python 3.6.8' which has reached end of life already

The bastion kickstart file installs the python3 packages, which provide Python 3.6.8; that version is not supported anymore.
We need to install Python 3.9 instead. That package is included in RHEL 8.7 and RHEL 8.8.

https://www.python.org/downloads/release/python-368/

Python 3.6.8
Release Date: Dec. 24, 2018
Note: The release you are looking at is Python 3.6.8, the final bugfix release for the legacy 3.6 series 
which has now reached end-of-life and is no longer supported.

ssh_copy_id does not detect 'Permission denied' errors

The 'ssh_copy_id' task did not copy the ssh key, but returned "ok".
See the lines below:

TASK [ssh_copy_id : Print results of copying ssh id to remote host] ************
ok: [127.0.0.1] => {
    "ssh_copy": {
        "changed": true,
        "cmd": [
            "expect",
            "/home/jenkins/workspace/OCP-BOE/BOE-Installs/dev/ocp-multiarch-install-with-aop-ocp3_dhcp-cluster/aop/roles/ssh_copy_id/files/ssh-copy-id-expect-pass.exp"
        ],
        "delta": "0:00:21.077410",
        "end": "2023-10-12 15:02:23.552777",
        "failed": false,
        "msg": "",
        "rc": 0,
        "start": "2023-10-12 15:02:02.475367",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "spawn ssh-copy-id -f -o StrictHostKeyChecking=no -i /home/jenkins/.ssh/ansible-ocpz.pub ****@172.23.232.220\r\n/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: \"/home/jenkins/.ssh/ansible-ocpz.pub\"\r\nWarning: Permanently added '172.23.232.220' (ECDSA) to the list of known hosts.\r\r\n\r****@172.23.232.220's password: \r\nPermission denied, please try again.\r\r\n\r****@172.23.232.220's password: ",
        "stdout_lines": [
            "spawn ssh-copy-id -f -o StrictHostKeyChecking=no -i /home/jenkins/.ssh/ansible-ocpz.pub ****@172.23.232.220",
            "/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: \"/home/jenkins/.ssh/ansible-ocpz.pub\"",
            "Warning: Permanently added '172.23.232.220' (ECDSA) to the list of known hosts.",
            "",
            "",
            "****@172.23.232.220's password: ",
            "Permission denied, please try again.",
            "",
            "",
            "****@172.23.232.220's password: "
        ]
    }
}
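Because the expect command itself exits with rc 0, the task has to inspect the output to notice the failure. A minimal sketch of such a check (assuming the role keeps the expect-based approach; this task is illustrative, not the current code):

- name: Copy ssh id to remote host   # sketch
  command: expect ssh-copy-id-expect-pass.exp
  register: ssh_copy
  failed_when: ssh_copy.rc != 0 or 'Permission denied' in ssh_copy.stdout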

Can not install all released OCP versions

We have a problem with the OpenShift version definition in the Ansible-OpenShift-Provisioning tool.
The variable env.openshift.version is also used to specify the RHCOS version.
But the RHCOS image versions are different:
you can specify OCP version 4.12.6, but there is no RHCOS 4.12.6; the latest RHCOS version is 4.12.3.
So you cannot install all released OCP versions!

FTP server documentation needs update

Now that the KVM host serves everything the bastion needs to boot via HTTP, the FTP server and its variables need to be updated to reflect that.

An FTP server is still required when the user is booting the KVM host(s) for the first time, but after that it is not required.

I haven't figured out yet how to differentiate / describe these two functions in all.yaml and the documentation.

create control nodes

In an HA environment, why are the control nodes created one by one and not in a parallel loop, as is implemented in the create compute nodes role?

all.yaml bastion.options.dns: false results in bastion resolv.conf containing the forwarder, NOT nameserver1

Expected results:
With bastion.options.dns: false and bastion.networking.nameserver1: 9.76.61.94,
the bastion resolv.conf should contain 9.76.61.94.

Actual results:

[root@bastion network-scripts]# cat ifcfg-enc1
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
IPADDR=9.76.61.82
PREFIX=24
GATEWAY=9.76.61.1
DNS1=9.76.61.94
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=no
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=eui64
NAME=enc1
UUID=448a3561-0ca3-45e9-b0b3-545695619c1f
DEVICE=enc1
ONBOOT=yes
[root@bastion network-scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enc1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:45:d4:96 brd ff:ff:ff:ff:ff:ff
inet 9.76.61.82/24 brd 9.76.61.255 scope global noprefixroute enc1
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe45:d496/64 scope link
valid_lft forever preferred_lft forever
[root@bastion network-scripts]#
[root@bastion network-scripts]#
[root@bastion network-scripts]# cat /etc/resolv.conf
search ocp1.ibm.com
nameserver 9.0.0.2[root@bastion network-scripts]#

all.yaml bastion section

# Section 5 - Bastion
bastion:
  create: True
  vm_name: bastion
  resources:
    disk_size: 30
    ram: 8192
    swap: 4096
    vcpu: 4
  networking:
    ip: 9.76.61.82
    hostname: bastion
    base_domain: ocp1.ibm.com
    subnetmask: 255.255.255.0
    gateway: 9.76.61.1
    nameserver1: 9.76.61.94
    nameserver2:
    forwarder: 9.0.0.2
    interface: enc1
  access:
    user: admin1
    pass: its0
    root_pass: its0
  options:
    dns: False
    loadbalancer:
      on_bastion: True
      public_ip: 9.76.61.80
      private_ip: 9.76.71.80
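What I would expect (a sketch of the idea, not the actual role template) is that, with env.bastion.options.dns set to false, the templated resolv.conf uses nameserver1:

# sketched expectation for the bastion's /etc/resolv.conf template
search {{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
nameserver {{ env.bastion.networking.nameserver1 }}   # expected when env.bastion.options.dns is false; today the forwarder (9.0.0.2) ends up here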

Playbook 3 fails in loop control - Syntax problem

[admin1@controller Ansible-OpenShift-Provisioning]$ ansible-playbook playbooks/3_setup_kvm_host.yaml
ERROR! 'extended_allitems' is not a valid attribute for a LoopControl
The error appears to be in '/home/admin1/ocpauto/Ansible-OpenShift-Provisioning/roles/configure_storage/tasks/main.yaml': line 67, column 5, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
loop_control:
extended: yes
^ here

I commented out extended_allitems, and the playbook no longer had the syntax error.
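For reference, the workaround leaves only the loop_control attribute that older ansible-core releases know about (extended_allitems is only recognised by newer ansible-core versions):

loop_control:
  extended: yes
  # extended_allitems: ...   <- commented out; not supported by the ansible-core version in use here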

NAT vs IPForward documentation clarity

The current documentation is not clear about NAT vs IPForward. Specifically, the IPForward term could mean the kernel sysctl for IP forwarding, or the KVM network (as in virsh net-define) where a macvtap interface is defined with a "forward" XML tag. Perhaps the documentation should say NAT vs MacVTAP?

Further explanation / clarity would be helpful.

Task 'Check internal cluster DNS resolution' fails, when external DNS is used

When I use an external DNS server by setting "env.bastion.options.dns: false", the following tasks from playbooks/5_setup_bastion.yaml fail:

TASK [check_dns : Check internal cluster DNS resolution for the bastion] *******************************************************************************
fatal: [bastion]: FAILED! => {"changed": true, "cmd": "dig +short bastion.lnxero1.boe | tail -n1", "delta": "0:00:00.010796", "end": "2023-07-29 07:12:14.980317", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-07-29 07:12:14.969521", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
TASK [check_dns : Check internal cluster DNS resolution for bootstrap] *********************************************************************************
fatal: [bastion]: FAILED! => {"changed": true, "cmd": "dig +short bootstrap.ocp3.lnxero1.boe | tail -n1", "delta": "0:00:00.011422", "end": "2023-07-29 07:12:23.012899", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-07-29 07:12:23.001477", "stderr": "", "stderr_lines": [], "stdout": "172.23.232.224", "stdout_lines": ["172.23.232.224"]}
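A possible adjustment (a sketch only): run the internal-DNS checks only when the bastion actually provides DNS, or relax the comparison when an external server is used, along these lines:

- name: Check internal cluster DNS resolution for the bastion   # sketch
  shell: "dig +short {{ env.bastion.networking.hostname }}.{{ env.bastion.networking.base_domain }} | tail -n1"
  register: bastion_lookup
  failed_when: bastion_lookup.stdout != env.bastion.networking.ip
  when: env.bastion.options.dns   # skip the check when env.bastion.options.dns is false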

qcow2 images of the zKVM virtual machines are created in /var/lib/libvirt-*

During my experiments, I realized that the qcow2 images for the virtual machines are always put into a directory with the name /var/lib/libvirt-xxx where xxx is a string.

This is problematic in that we are assuming the user has enough storage available under /var/lib (through a mount on /, /var, or /var/lib, depending on their configuration).

If this is not already the case, the user will need to expand the storage available to /var/lib, which may be a mild irritation or a real problem depending on their storage configuration.

If the user has not chosen a storage configuration that is easily expandable, they may have to put data on /var/lib at risk to provide this additional storage.

It might make more sense to parameterize the location where these images are stored.
We could create a subdirectory with the xxx expansion and store the qcow2 images there, simply use the provided directory as-is, or find some other way to parameterize this.
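One way to parameterize it (a sketch; the variable name is made up for illustration) is a single group_vars entry that the create-node roles use when building disk paths:

# group_vars (hypothetical variable)
libvirt_image_dir: /var/lib/libvirt/images

# the create-node roles would then build disk paths from it, e.g. for virt-install:
#   --disk path={{ libvirt_image_dir }}/{{ vm_name }}.qcow2,size={{ disk_size }}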

Proxy for install packages

If you run the "install_packages" role behind a proxy, it looks like Ansible does NOT pick up the proxy environment variables set in the shell.
I therefore suggest adding the proxy information to the install_packages role (managed by a variable, I think).
I tried it, and with that modification it works.

The same is valid for the "registration" role as well.
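Roughly what I mean (a sketch of the modification, assuming the proxy values come from the existing env.proxy section and that dnf honours the standard proxy environment variables):

- name: Install packages   # sketch
  package:
    name: "{{ packages_to_install }}"   # placeholder list
    state: present
  environment:
    http_proxy: "{{ env.proxy.http | default('') }}"
    https_proxy: "{{ env.proxy.https | default('') }}"

The registration role tasks would get the same environment: block.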

NAT for prod environment

Dear developers,

can you please extend the NAT possibility to the HA (= 3 KVMs) setup as well?
Together with this, please consider what happens if the hop is from the KVM host to the bastion: what happens when the bastion is down?

Probably in an HA setup a second bastion (haproxy etc.) is required...

create bastion

The create bastion role uses a kickstart file in which SOME additional packages are installed on top of the "minimal" environment.
The all.yaml variable file also includes packages that are required for the bastion.
Please install the packages from ONE place only (better to maintain a single list in all.yaml).

playbooks/3_setup_kvm_host.yaml fails when 'ansible_user: root' is used

playbooks/3_setup_kvm_host.yaml fails when 'ansible_user: root' is used.

TASK [Set libvirt management to libvirt group instead of root.] *****************************************************************************
ok: [a3elp60]

TASK [Create file for user's custom libvirt configurations.] ********************************************************************************
fatal: [a3elp60]: FAILED! => {"changed": false, "msg": "Error, could not touch target: [Errno 2] No such file or directory: b'/home/root/.config/libvirt/libvirt.conf'", "path": "/home/root/.config/libvirt/libvirt.conf"}

ssh_copy_id task not working for hashed hostnames

If ssh uses the HashKnownHosts yes directive, the entries in the ~/.ssh/known_hosts file do not contain the hostname or IP in clear text.
As a consequence, the following step cannot find and remove the existing entry:

https://github.com/IBM/Ansible-OpenShift-Provisioning/blob/main/roles/ssh_copy_id/tasks/main.yaml#L7

- name: Delete SSH key from known hosts if it already exists for idempotency
  tags: ssh_copy_id, ssh
  lineinfile:
    path: "~/.ssh/known_hosts"
    search_string: "{{ ssh_target[0] }}"
    state: absent

A solution that works with any config is to run instead:

ssh-keygen -f "~/.ssh/known_hosts" -R "{{ ssh_target[0] }}"
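In Ansible terms, that suggestion could look roughly like this (a sketch; ansible_env.HOME is used here because the command module does not expand ~):

- name: Delete SSH key from known hosts if it already exists for idempotency   # sketch of the suggested replacement
  tags: ssh_copy_id, ssh
  command: ssh-keygen -f "{{ ansible_env.HOME }}/.ssh/known_hosts" -R "{{ ssh_target[0] }}"
  ignore_errors: true   # e.g. when the known_hosts file does not exist yet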

The alternative is to explicitly make sure that hostnames are not hashed, e.g. via the ssh config:

Host *
    HashKnownHosts no

It looks like the default is yes, at least on Ubuntu 22.04 (in /etc/ssh/ssh_config).

HTTP file server URL and filesystem path not documented / not configurable

It appears the path /home/<>/ is needed/assumed. At some sites, using this path may not be desired. It is also not clearly documented that, when setting up HTTP on the file server, a specific path needs to be used.

  1. Please document the required / default path.
  2. Recommend making the HTTP server URI and its source filesystem path configurable via variables.

qcow2 files for OCP node storage leads to poor CPU performance

We've observed very high iowait time being consumed by OCP cluster nodes when they are performing a high amount of disk I/O.

Please update the automation to give the option to use raw block (SCSI LUN or DASD) devices and pass these through to the OCP guest nodes.

sshuttle included by default, not in RHEL

The Ansible code attempts to install sshuttle by default. This package is NOT in the RHEL distribution. It is also listed in the all.yaml file in a section that says the items should not need to be changed.

  1. The provided ansible code should only include packages on the RHEL Linux distribution by default
  2. Documentation should describe the specific scenario when sshuttle is needed and how / where to add sshuttle when it is needed.

The OCP verification playbook waits forever trying to update the /etc/hosts file on the controller

Ansible Controller type: Mac M1

Ansible version:

ansible --version
ansible [core 2.15.3]
  config file = /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/ansible.cfg
  configured module search path = ['/Users/mohammedzeeshanahmed/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ansible
  ansible collection location = /Users/mohammedzeeshanahmed/.ansible/collections:/usr/share/ansible/collections
  executable location = /Library/Frameworks/Python.framework/Versions/3.9/bin/ansible
  python version = 3.9.13 (v3.9.13:6de2ca5339, May 17 2022, 11:37:23) [Clang 13.0.0 (clang-1300.0.29.30)] (/Library/Frameworks/Python.framework/Versions/3.9/bin/python3.9)
  jinja version = 3.1.2
  libyaml = True

Error:

❯ ansible-playbook playbooks/7_ocp_verification.yaml
[WARNING]: Found both group and host with same name: bastion

PLAY [7 OCP verification] *********************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************
ok: [bastion]

TASK [approve_certs : Cancel async 'approve_certs_task', if exists] ***************************************************************************************************************************
skipping: [bastion]

TASK [approve_certs : Approve all pending CSRs in the next 30 min (async task)] ***************************************************************************************************************
changed: [bastion]

TASK [check_nodes : Get and print nodes status] ***********************************************************************************************************************************************
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/common/tasks/print_ocp_node_status.yaml for bastion

TASK [check_nodes : Get OCP nodes status] *****************************************************************************************************************************************************
ok: [bastion]

TASK [check_nodes : Print OCP nodes status] ***************************************************************************************************************************************************
ok: [bastion] => {
    "oc_get_nodes.stdout_lines": [
        "NAME                     STATUS                     ROLES                  AGE     VERSION           KERNEL-VERSION                INTERNAL-IP    ",
        "ocpz-master-1            Ready                      control-plane,master   27m     v1.26.3+b404935   5.14.0-284.13.1.el9_2.s390x   192.168.122.10 ",
        "ocpz-master-2            Ready                      control-plane,master   26m     v1.26.3+b404935   5.14.0-284.13.1.el9_2.s390x   192.168.122.11 ",
        "ocpz-master-3            Ready                      control-plane,master   25m     v1.26.3+b404935   5.14.0-284.13.1.el9_2.s390x   192.168.122.12 "
    ]
}

TASK [check_nodes : Make sure control and compute nodes are 'Ready' before continuing (retry every 20s)] **************************************************************************************
changed: [bastion] => (item=ocpz-master-1)
changed: [bastion] => (item=ocpz-master-2)
changed: [bastion] => (item=ocpz-master-3)
FAILED - RETRYING: [bastion]: Make sure control and compute nodes are 'Ready' before continuing (retry every 20s) (90 retries left).
FAILED - RETRYING: [bastion]: Make sure control and compute nodes are 'Ready' before continuing (retry every 20s) (89 retries left).
FAILED - RETRYING: [bastion]: Make sure control and compute nodes are 'Ready' before continuing (retry every 20s) (88 retries left).
changed: [bastion] => (item=ocpz-compute-1)
changed: [bastion] => (item=ocpz-compute-2)

TASK [approve_certs : Cancel async 'approve_certs_task', if exists] ***************************************************************************************************************************
ok: [bastion]

TASK [approve_certs : Approve all pending CSRs in the next 30 min (async task)] ***************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Wait for cluster operators] ********************************************************************************************************************************
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/wait_for_cluster_operators/tasks/check_co.yaml for bastion => (item=First)
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/wait_for_cluster_operators/tasks/check_co.yaml for bastion => (item=Second)
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/wait_for_cluster_operators/tasks/check_co.yaml for bastion => (item=Third)
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/wait_for_cluster_operators/tasks/check_co.yaml for bastion => (item=Fourth)
included: /Users/mohammedzeeshanahmed/personal_bench/ibm/ansible-ocp-provisioner/IBM-Ansible-OpenShift-Provisioning/roles/wait_for_cluster_operators/tasks/check_co.yaml for bastion => (item=Fifth and last)

TASK [wait_for_cluster_operators : First round of checking cluster operators] *****************************************************************************************************************
changed: [bastion]

TASK [wait_for_cluster_operators : Print cluster operators which are only in 'PROGRESSING' state] *********************************************************************************************
ok: [bastion] => {
    "oc_get_co.stdout_lines": [
        "authentication                             4.13.1    False       True          True       24m     OAuthServerDeploymentAvailable: no oauth-openshift.openshift-authentication pods available on any node....",
        "console                                    4.13.1    False       True          False      9m53s   DeploymentAvailable: 0 replicas available for console deployment...",
        "dns                                        4.13.1    True        False         True       23m     DNS default is degraded",
        "etcd                                       4.13.1    True        True          False      14m     NodeInstallerProgressing: 1 nodes are at revision 7; 2 nodes are at revision 8",
        "ingress                                              False       True          True       23m     The \"default\" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)",
        "monitoring                                           False       True          True       12m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas",
        "network                                    4.13.1    True        True          False      24m     DaemonSet \"/openshift-multus/network-metrics-daemon\" is not available (awaiting 1 nodes)..."
    ]
}

TASK [wait_for_cluster_operators : First round of waiting for cluster operators. Trying 10 times before printing status again] ****************************************************************
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (10 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (9 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (8 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (7 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (6 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (5 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (4 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (3 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (2 retries left).
FAILED - RETRYING: [bastion]: First round of waiting for cluster operators. Trying 10 times before printing status again (1 retries left).
fatal: [bastion]: FAILED! => {"attempts": 10, "changed": true, "cmd": "set -o pipefail\n# Check for 'PROGRESSING' state\noc get co 2> /dev/null | awk '{print $4}'\n", "delta": "0:00:00.124020", "end": "2023-09-11 01:26:10.825538", "msg": "", "rc": 0, "start": "2023-09-11 01:26:10.701518", "stderr": "", "stderr_lines": [], "stdout": "PROGRESSING\nTrue\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nTrue\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nTrue\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse", "stdout_lines": ["PROGRESSING", "True", "False", "False", "False", "False", "False", "False", "False", "False", "False", "False", "False", "False", "False", "True", "False", "False", "False", "False", "False", "False", "False", "True", "False", "False", "False", "False", "False", "False", "False", "False", "False", "False"]}
...ignoring

TASK [wait_for_cluster_operators : Update local variable, if required] ************************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Second round of checking cluster operators] ****************************************************************************************************************
changed: [bastion]

TASK [wait_for_cluster_operators : Print cluster operators which are only in 'PROGRESSING' state] *********************************************************************************************
ok: [bastion] => {
    "oc_get_co.stdout_lines": [
        "authentication                             4.13.1    False       True          False      30m     WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://192.168.122.12:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)",
        "kube-apiserver                             4.13.1    True        True          False      19m     NodeInstallerProgressing: 2 nodes are at revision 5; 1 nodes are at revision 6",
        "monitoring                                           False       True          True       40s     reconciling Telemeter client cluster monitoring view ClusterRoleBinding failed: creating ClusterRoleBinding object failed: clusterroles.rbac.authorization.k8s.io \"cluster-monitoring-view\" not found, reconciling Prometheus Alertmanager RoleBinding failed: creating RoleBinding object failed: roles.rbac.authorization.k8s.io \"monitoring-alertmanager-edit\" not found, prometheuses.monitoring.coreos.com \"k8s\" not found"
    ]
}

TASK [wait_for_cluster_operators : Second round of waiting for cluster operators. Trying 10 times before printing status again] ***************************************************************
FAILED - RETRYING: [bastion]: Second round of waiting for cluster operators. Trying 10 times before printing status again (10 retries left).
FAILED - RETRYING: [bastion]: Second round of waiting for cluster operators. Trying 10 times before printing status again (9 retries left).
FAILED - RETRYING: [bastion]: Second round of waiting for cluster operators. Trying 10 times before printing status again (8 retries left).
FAILED - RETRYING: [bastion]: Second round of waiting for cluster operators. Trying 10 times before printing status again (7 retries left).
FAILED - RETRYING: [bastion]: Second round of waiting for cluster operators. Trying 10 times before printing status again (6 retries left).
changed: [bastion]

TASK [wait_for_cluster_operators : Update local variable, if required] ************************************************************************************************************************
ok: [bastion]

TASK [wait_for_cluster_operators : Third round of checking cluster operators] *****************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Print cluster operators which are only in 'PROGRESSING' state] *********************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Third round of waiting for cluster operators. Trying 10 times before printing status again] ****************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Update local variable, if required] ************************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Fourth round of checking cluster operators] ****************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Print cluster operators which are only in 'PROGRESSING' state] *********************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Fourth round of waiting for cluster operators. Trying 10 times before printing status again] ***************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Update local variable, if required] ************************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Fifth and last round of checking cluster operators] ********************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Print cluster operators which are only in 'PROGRESSING' state] *********************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Fifth and last round of waiting for cluster operators. Trying 10 times before printing status again] *******************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Update local variable, if required] ************************************************************************************************************************
skipping: [bastion]

TASK [wait_for_cluster_operators : Get final cluster operators] *******************************************************************************************************************************
ok: [bastion]

TASK [wait_for_cluster_operators : Print final cluster operators] *****************************************************************************************************************************
ok: [bastion] => {
    "oc_get_co.stdout_lines": [
        "NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE",
        "authentication                             4.13.1    True        False         False      32s     ",
        "baremetal                                  4.13.1    True        False         False      32m     ",
        "cloud-controller-manager                   4.13.1    True        False         False      37m     ",
        "cloud-credential                           4.13.1    True        False         False      38m     ",
        "cluster-autoscaler                         4.13.1    True        False         False      32m     ",
        "config-operator                            4.13.1    True        False         False      33m     ",
        "console                                    4.13.1    True        False         False      5m57s   ",
        "control-plane-machine-set                  4.13.1    True        False         False      32m     ",
        "csi-snapshot-controller                    4.13.1    True        False         False      33m     ",
        "dns                                        4.13.1    True        False         False      32m     ",
        "etcd                                       4.13.1    True        False         False      23m     ",
        "image-registry                             4.13.1    True        False         False      18m     ",
        "ingress                                    4.13.1    True        False         False      8m50s   ",
        "insights                                   4.13.1    True        False         False      26m     ",
        "kube-apiserver                             4.13.1    True        False         False      22m     ",
        "kube-controller-manager                    4.13.1    True        False         False      22m     ",
        "kube-scheduler                             4.13.1    True        False         False      23m     ",
        "kube-storage-version-migrator              4.13.1    True        False         False      33m     ",
        "machine-api                                4.13.1    True        False         False      32m     ",
        "machine-approver                           4.13.1    True        False         False      32m     ",
        "machine-config                             4.13.1    True        False         False      31m     ",
        "marketplace                                4.13.1    True        False         False      32m     ",
        "monitoring                                 4.13.1    True        False         False      2m33s   ",
        "network                                    4.13.1    True        False         False      33m     ",
        "node-tuning                                4.13.1    True        False         False      9m19s   ",
        "openshift-apiserver                        4.13.1    True        False         False      19m     ",
        "openshift-controller-manager               4.13.1    True        False         False      22m     ",
        "openshift-samples                          4.13.1    True        False         False      18m     ",
        "operator-lifecycle-manager                 4.13.1    True        False         False      33m     ",
        "operator-lifecycle-manager-catalog         4.13.1    True        False         False      33m     ",
        "operator-lifecycle-manager-packageserver   4.13.1    True        False         False      20m     ",
        "service-ca                                 4.13.1    True        False         False      33m     ",
        "storage                                    4.13.1    True        False         False      33m     "
    ]
}

TASK [wait_for_install_complete : Almost there! Add host info to /etc/hosts so you can login to the cluster via web browser. Ansible Controller sudo password required] ***********************










^C [ERROR]: User interrupted execution

What is happening?
The final stage of the playbook, a role called wait_for_install_complete, has a task that tries to patch the /etc/hosts file with host information for accessing the cluster.

- name: Almost there! Add host info to /etc/hosts so you can login to the cluster via web browser. Ansible Controller sudo password required
  tags: wait_for_install_complete
  become: true
  blockinfile:
    create: true
    backup: true
    marker: "# {mark} ANSIBLE MANAGED BLOCK FOR OCP CLUSTER: {{ env.cluster.networking.metadata_name }}"
    path: /etc/hosts
    block: |
      {{ env.bastion.networking.ip }} oauth-openshift.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} console-openshift-console.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} api.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
  delegate_to: 127.0.0.1

For reasons unknown as of yet, the playbook pauses here, does not proceed past this task, and has to be interrupted manually.

This could cause issues if it happens in automation such as a Jenkins job or a Tekton pipeline.

We should add a timeout for this operation while ignoring any errors, and also print this information and ask the user to validate it and add it to their hosts file manually if it's not there.
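One option along those lines (a sketch only, implementing the "print it and let the user add it manually" idea): print the block first and make the /etc/hosts edit best-effort. The sudo-password requirement on the controller is presumably what makes the current task sit and wait.

- name: Show the /etc/hosts entries needed to reach the cluster   # sketch
  debug:
    msg: |
      {{ env.bastion.networking.ip }} oauth-openshift.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} console-openshift-console.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} api.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}

- name: Try to add the entries to /etc/hosts on the controller (add them manually if this fails)   # sketch
  become: true
  blockinfile:
    path: /etc/hosts
    create: true
    backup: true
    marker: "# {mark} ANSIBLE MANAGED BLOCK FOR OCP CLUSTER: {{ env.cluster.networking.metadata_name }}"
    block: |
      {{ env.bastion.networking.ip }} oauth-openshift.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} console-openshift-console.apps.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
      {{ env.bastion.networking.ip }} api.{{ env.cluster.networking.metadata_name }}.{{ env.cluster.networking.base_domain }}
  delegate_to: 127.0.0.1
  ignore_errors: true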

KVM(Jumphost) host IP is unreachable if it is behind a Proxyjump

When installing an OCP cluster on a KVM host that is only reachable via a ProxyJump, the installation fails in playbook 5
with this error:

PLAY [5 setup bastion - configure bastion node with essential services] **************************************************************

TASK [Gathering Facts] ***************************************************************************************************************
fatal: [bastion3]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection timed out during banner exchange\r\nConnection to UNKNOWN port 65535 timed out", "unreachable": true}

PLAY RECAP ***************************************************************************************************************************
127.0.0.1                  : ok=16   changed=9    unreachable=0    failed=0    skipped=10   rescued=0    ignored=0   
bastion3                   : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
jumphost                   : ok=13   changed=5    unreachable=0    failed=0    skipped=2    rescued=1    ignored=0  

As a workaround, I had to edit the Ansible-OpenShift-Provisioning/roles/ssh_add_config/tasks/main.yaml file so that the script would insert my ProxyJump entry into the .ssh/config used during the Ansible run.

- name: Create ssh config file (or add to an existing file) if network mode is NAT
  [...]
    block: |
      Host {{ env.jumphost.name }} 
        HostName {{ env.jumphost.ip }}
        User {{ env.jumphost.user }}
        IdentityFile {{ path_to_key_pair.split('.')[:-1] | join('.') }}
        ProxyJump gateway   <--------------------
      Host {{ env.bastion.networking.ip }}
        [...]

Shouldn't a proxyjump be accounted for somehow?

Environment:
Scripts were run directly on my mac
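
One way to account for this (a sketch only; the env.jumphost.proxy_jump variable name and its default are assumptions, not part of the current variable scheme) would be to make the ProxyJump line conditional in the ssh_add_config role's blockinfile content:

    block: |
      Host {{ env.jumphost.name }}
        HostName {{ env.jumphost.ip }}
        User {{ env.jumphost.user }}
        IdentityFile {{ path_to_key_pair.split('.')[:-1] | join('.') }}
    {% if env.jumphost.proxy_jump is defined and env.jumphost.proxy_jump %}
        ProxyJump {{ env.jumphost.proxy_jump }}
    {% endif %}

With something like that in place, users behind a gateway could set env.jumphost.proxy_jump: gateway in group_vars, while everyone else leaves it unset and keeps the current behavior.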

Get OCP nodes status sometimes fails during installation, because of an internal timeout

Get OCP nodes status sometimes fails during installation, because of this internal timeout:

Error from server (Timeout): the server was unable to return a response in the time allotted, 
but may still be processing the request (get nodes)

The 'get nodes status' function should be improved. Right now the playbook stops on such an error.
A retry, for example, would be a better way to handle this situation (see the sketch after the log below).

Here the reported error:

...
TASK [approve_certs : Cancel async 'approve_certs_task', if exists] ************
skipping: [bastion]

TASK [approve_certs : Approve all pending CSRs in the next 30 min (async task)] ***
changed: [bastion]

TASK [check_nodes : Get and print nodes status] ********************************
included: /home/jenkins/workspace/OCP-BOE/BOE-Installs/dev/ocp-multiarch-install-with-aop-ocp3_dhcp-cluster/aop/roles/common/tasks/print_ocp_node_status.yaml for bastion

TASK [check_nodes : Get OCP nodes status] **************************************
fatal: [bastion]: FAILED! => {"changed": false, "cmd": "set -o pipefail\noc get nodes -o wide | awk -F '  +' '{ printf \"%-24s %-26s %-22s %-7s %-17s %-29s %-15s\\n\", $1, $2, $3, $4, $5, $9, $6 }'\n", "delta": "0:01:00.104890", "end": "2023-08-01 08:58:51.394272", "msg": "non-zero return code", "rc": 1, "start": "2023-08-01 08:57:51.289382", "stderr": "Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)", "stderr_lines": ["Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
127.0.0.1                  : ok=33   changed=10   unreachable=0    failed=0    skipped=32   rescued=0    ignored=0   
a3elp37                    : ok=7    changed=1    unreachable=0    failed=0    skipped=20   rescued=0    ignored=0   
bastion                    : ok=73   changed=54   unreachable=0    failed=1    skipped=41   rescued=0    ignored=3   
xkvmocp05                  : ok=25   changed=12   unreachable=0    failed=0    skipped=24   rescued=0    ignored=0   
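
A minimal sketch of such a retry (reusing the same shell pipeline already run by the check_nodes role; the retry count and delay are arbitrary values chosen for illustration):

- name: Get OCP nodes status, retrying on API server timeouts
  shell: |
    set -o pipefail
    oc get nodes -o wide | awk -F '  +' '{ printf "%-24s %-26s %-22s %-7s %-17s %-29s %-15s\n", $1, $2, $3, $4, $5, $9, $6 }'
  args:
    executable: /bin/bash
  register: oc_get_nodes
  retries: 5
  delay: 30
  until: oc_get_nodes.rc == 0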

haproxy config file does not contain day-2 compute nodes

Additional compute nodes that are added with "playbooks/create_compute_node.yaml" as a day-2 install
are not added to the "haproxy.cfg" file. A possible fix is sketched after the config excerpt below.

cat  /etc/haproxy/haproxy.cfg
...
listen ingress-router-443
  bind *:443
  mode tcp
  balance source
  #443 section
  server worker-2 worker-2.ocp1.test.multiarch:443 check inter 1s
  server worker-1 worker-1.ocp1.test.multiarch:443 check inter 1s
listen ingress-router-80
  bind *:80
  mode tcp
  balance source
  #80 section
  server worker-1 worker-1.ocp1.test.multiarch:80 check inter 1s
  server worker-2 worker-2.ocp1.test.multiarch:80 check inter 1s
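
One way to fix this (a sketch only; node_name and cluster_domain are placeholders for whatever variables the day-2 playbook actually uses, and "notify: restart haproxy" assumes a matching handler exists) is to append the new compute node to both listen sections and reload HAProxy from create_compute_node.yaml:

- name: Add new compute node to the 443 section of haproxy.cfg
  lineinfile:
    path: /etc/haproxy/haproxy.cfg
    insertafter: '#443 section'
    line: "  server {{ node_name }} {{ node_name }}.{{ cluster_domain }}:443 check inter 1s"
  notify: restart haproxy

- name: Add new compute node to the 80 section of haproxy.cfg
  lineinfile:
    path: /etc/haproxy/haproxy.cfg
    insertafter: '#80 section'
    line: "  server {{ node_name }} {{ node_name }}.{{ cluster_domain }}:80 check inter 1s"
  notify: restart haproxy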
  

Create optional playbook to setup file server

It would be nice for users, and relatively simple, to create a playbook that sets up the file server in a way that is compatible with these playbooks.

If the file server is going to be used to boot KVM hosts, it must have FTP; otherwise only HTTP needs to be installed and configured, unless the env.file_server.protocol variable specifies otherwise.
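
A minimal sketch of what such a playbook could look like (assuming RHEL-family packages httpd and vsftpd, an inventory group called file_server, and the existing env.file_server.protocol variable; the playbook layout is illustrative, not an existing part of this repository):

---
- name: Set up a simple file server compatible with these playbooks
  hosts: file_server
  become: true
  tasks:
    - name: Install HTTP server (always needed)
      package:
        name: httpd
        state: present

    - name: Install FTP server only when KVM hosts will boot over FTP
      package:
        name: vsftpd
        state: present
      when: env.file_server.protocol | default('http') == 'ftp'

    - name: Enable and start Apache
      service:
        name: httpd
        state: started
        enabled: true

    - name: Enable and start vsftpd when FTP is needed
      service:
        name: vsftpd
        state: started
        enabled: true
      when: env.file_server.protocol | default('http') == 'ftp'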

BUG in the get_ocp role - the install-config

In the get_ocp role, when creating the install-config YAML, there is a problem:

TASK [get_ocp : Use template file to create install-config and backup.] *******************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'dict object' has no attribute 'proxy'. 'dict object' has no attribute 'proxy'
failed: [zpocpbastions] (item=/root/ocpinst/install-config.yaml) => {"ansible_loop_var": "item", "changed": false, "item": "/root/ocpinst/install-config.yaml", "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'proxy'. 'dict object' has no attribute 'proxy'"}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'dict object' has no attribute 'proxy'. 'dict object' has no attribute 'proxy'
failed: [zpocpbastions] (item=/root/ocpinst/install-config-backup.yaml) => {"ansible_loop_var": "item", "changed": false, "item": "/root/ocpinst/install-config-backup.yaml", "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'proxy'. 'dict object' has no attribute 'proxy'"}
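
The error means the env dictionary has no proxy key in this setup. A possible fix (a sketch only; the key names under env.proxy are assumptions based on the error, not the project's actual variable scheme) is to guard the proxy section of the install-config template so it only renders when the variable exists:

{% if env.proxy is defined %}
proxy:
  httpProxy: {{ env.proxy.http_proxy | default('') }}
  httpsProxy: {{ env.proxy.https_proxy | default('') }}
  noProxy: {{ env.proxy.no_proxy | default('') }}
{% endif %}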

Old bootstrap files are not always deleted, disk will be filled over time

Old bootstrap files are not deleted. The disk will fill up over time, and this will result in errors like:

  • no space left on device or
  • The requested volume capacity will exceed the available pool space when the volume is fully allocated

Here is an example of old bootstrap file images (a cleanup sketch follows the listing):

 ocp3-bootstrap-1.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-1.qcow2        file   100.00 GiB   7.16 GiB
 ocp3-bootstrap-13.qcow2       /var/lib/libvirt/images/ocp3-bootstrap-13.qcow2       file   100.00 GiB   7.19 GiB
 ocp3-bootstrap-14.qcow2       /var/lib/libvirt/images/ocp3-bootstrap-14.qcow2       file   100.00 GiB   7.40 GiB
 ocp3-bootstrap-2.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-2.qcow2        file   100.00 GiB   7.64 GiB
 ocp3-bootstrap-3.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-3.qcow2        file   100.00 GiB   7.20 GiB
 ocp3-bootstrap-4.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-4.qcow2        file   100.00 GiB   7.50 GiB
 ocp3-bootstrap-5.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-5.qcow2        file   100.00 GiB   7.61 GiB
 ocp3-bootstrap-6.qcow2        /var/lib/libvirt/images/ocp3-bootstrap-6.qcow2        file   100.00 GiB   7.68 GiB
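
A possible cleanup step (a sketch only; it assumes the bootstrap volumes live in /var/lib/libvirt/images and follow the <cluster>-bootstrap-<n>.qcow2 naming visible above) would be to remove leftover bootstrap volumes once the bootstrap node itself has been torn down:

- name: Find leftover bootstrap disk images
  find:
    paths: /var/lib/libvirt/images
    patterns: "{{ env.cluster.networking.metadata_name }}-bootstrap-*.qcow2"
  register: old_bootstrap_disks

- name: Delete leftover bootstrap disk images
  file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ old_bootstrap_disks.files }}"

Depending on how the storage pool is defined, a "virsh pool-refresh" afterwards may also be needed so libvirt's view of the pool matches the filesystem.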

OPEN VPN

Please implement a variable that decides whether the OpenVPN setup is needed or not,

and make the OpenVPN deployment dependent on that variable (the default should probably be FALSE).
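
A sketch of what that could look like (the variable name setup_openvpn is an assumption, not an existing variable, and include_role with the name openvpn assumes the OpenVPN tasks are factored into a role of that name):

# group_vars: opt in explicitly, default is off
setup_openvpn: false

# in the playbook that currently deploys OpenVPN
- name: Set up OpenVPN only when requested
  include_role:
    name: openvpn
  when: setup_openvpn | default(false) | bool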

Required Community collections should be installed automatically by playbook0

Example:
ansible-galaxy collection install community.general community.crypto ansible.posix community.libvirt

Playbook 0 is about setting up the Ansible controller. If these collections were installed by playbook 0, it would result in fewer user errors and more automation.

If they are not installed there, it is important to document that the required community collections must be installed by the same user ID as the one running the subsequent playbooks!
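
A sketch of what that could look like in playbook 0 (the task simply shells out to ansible-galaxy, so the collections land under the same user who invoked the playbook; the changed_when check is a heuristic based on ansible-galaxy's usual "Nothing to do" output):

- name: Install required Ansible collections
  command: ansible-galaxy collection install community.general community.crypto ansible.posix community.libvirt
  register: galaxy_install
  changed_when: "'Nothing to do' not in galaxy_install.stdout"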

Playbook 0_setup.yaml is failing on Ubuntu image

I got the following error on an Ubuntu:20.04 image:

ansible-playbook playbooks/0_setup.yaml

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [127.0.0.1]

TASK [set_inventory : Find inventory directory from ansible.cfg] ***************
changed: [127.0.0.1]

TASK [set_inventory : Find absolute path to project.] **************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "set -o pipefail\nansible_config="/home/jenkins/workspace/OCP-BOE/BOE-Installs/dev/ocp-multiarch-install-with-aop/aop/ansible.cfg"\necho "${ansible_config%/*}/"\n", "delta": "0:00:00.013986", "end": "2023-05-22 10:54:03.776751", "msg": "non-zero return code", "rc": 2, "start": "2023-05-22 10:54:03.762765", "stderr": "/bin/sh: 1: set: Illegal option -o pipefail", "stderr_lines": ["/bin/sh: 1: set: Illegal option -o pipefail"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
127.0.0.1 : ok=2 changed=1 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
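
The failure happens because Ubuntu's default /bin/sh is dash, which does not support "set -o pipefail". A likely fix (a sketch only; the task body is abbreviated and the variable holding the ansible.cfg path is whatever the set_inventory role already uses) is to tell the shell task to run under bash:

- name: Find absolute path to project.
  shell: |
    set -o pipefail
    ansible_config="{{ path_to_ansible_cfg }}"   # placeholder for the role's existing variable
    echo "${ansible_config%/*}/"
  args:
    executable: /bin/bash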

env.bridge_name - Documentation / naming unclear.

env.bridge_name seems to be documented as the name of a bridge that Ansible will create.

  1. This is not a bridge that gets created; it is a KVM network definition (as in virsh net-define). Clarity on this point in the documentation, as well as in the variable naming, would be helpful.
  2. The variable appears to serve a dual purpose: a) which network interface to connect to, and b) what to call the KVM network definition. This is not discussed in the documentation, and you may want a separate variable for each item (see the sketch below).
  3. The documentation should recommend using something like a bond interface and not an encX interface.
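
For context, defining such a KVM network from Ansible looks roughly like the following (a generic macvtap-style example using the community.libvirt collection already required by the project; the bond0 device and the XML are illustrative, not the exact template this repository generates):

- name: Define the KVM network
  community.libvirt.virt_net:
    command: define
    name: "{{ env.bridge_name }}"
    xml: |
      <network>
        <name>{{ env.bridge_name }}</name>
        <forward mode="bridge">
          <interface dev="bond0"/>
        </forward>
      </network>

This illustrates why "which host interface to attach to" and "what to call the network definition" could reasonably be two separate variables.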
