ibm-mas / ansible-devops
Ansible collection supporting devops for IBM Maximo Application Suite
Home Page: https://ibm-mas.github.io/ansible-devops/
License: Eclipse Public License 2.0
It's not possible to install the Application Suite using the automation playbooks; a Truststore Certificate error related to the certificateIssuer occurs during the install.
1. ./ibm/mas_devops/playbooks/ocp/provision-quickburn.yml
2. ./ibm/mas_devops/playbooks/ocp/configure-ocp.yml
3. ./ibm/mas_devops/playbooks/mas/install-suite.yml
The Application Suite installation fails. In the OpenShift console, a "CreateContainerConfigError" is visible on the coreidp-login pod.
TASK [truststore : Wait for public certificate to be ready] ********************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to gather information about Certificate(s) even after waiting for 301 seconds"}
It was observed that this error happens because the certificateIssuer remained set to "None" in the Suite CR.
A workaround is to remove this section from the CR, after which the deployment finishes successfully.
This concerns the self-signed CA and server certificate used with the Mongo CE operator.
The problem really seems to be that the Common Name (CN) is identical on the CA and the server certificate. This causes a problem with Pymongo and hostname validation. The CN for the CA should probably just be something like "Mongo CA".
A separate conf file should be created for use when creating the CA, one that contains the different CN and the extensions used when creating it.
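A minimal sketch of such a dedicated CA conf and the two distinct subjects; the file names and the Mongo service hostname are illustrative, not the role's actual values:

```shell
# Dedicated conf for the CA only, with its own CN and CA extensions
cat > ca.cnf <<'EOF'
[ req ]
distinguished_name = req_dn
x509_extensions    = v3_ca
prompt             = no

[ req_dn ]
# Distinct from the server certificate's CN
CN = Mongo CA

[ v3_ca ]
basicConstraints = critical, CA:TRUE
keyUsage         = critical, keyCertSign, cRLSign
EOF

# Self-signed CA using the dedicated conf
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -keyout ca.key -out ca.crt -config ca.cnf

# Server certificate keeps the service hostname as its CN
# (hostname below is an illustrative Mongo CE service address)
openssl req -newkey rsa:2048 -nodes \
  -keyout server.key -out server.csr \
  -subj "/CN=mongo-ce-svc.mongoce.svc.cluster.local"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 825 -out server.crt
```

With the subjects no longer identical, Pymongo's hostname validation should have nothing to trip over while the chain still verifies.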
Sometimes, even after a cluster is deprovisioned, the loop here does not exit. An additional conditional check will be added to try to get the loop to exit faster.
"rc": 0,
"retries": 31,
"start": "2021-09-09 20:10:38.036981",
"stderr": "",
"stderr_lines": [],
"stdout": "0\nRetrieving cluster fvtrelease...",
"stdout_lines": [
"0",
"Retrieving cluster fvtrelease..."
]
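One hedged sketch of the extra conditional (task name, command, and variable names are illustrative): treat the cluster as gone when the lookup either errors out or reports zero matching clusters on its first output line.

```yaml
- name: "Wait for cluster {{ cluster_name }} to be deprovisioned"
  shell: |
    ibmcloud oc clusters --output json | jq '[.[] | select(.name == "{{ cluster_name }}")] | length'
  register: _cluster_count
  retries: 30
  delay: 60
  until: _cluster_count.rc != 0 or _cluster_count.stdout_lines[0] == "0"
  failed_when: false
```

In the captured output above, stdout was "0\nRetrieving cluster fvtrelease...", so a comparison against the whole stdout would never match; checking only the first stdout line avoids that.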
See TODOs in lite-core-roks.yml
... please deliver fixes to this branch: https://github.com/ibm-mas/ansible-devops/tree/bugfixes2411
# 4. Install BAS
# -----------------------------------------------------------------------------
- name: Install BAS
import_playbook: bas/install-bas.yml
vars:
# BAS Configuration
bas_namespace: "{{ lookup('env', 'BAS_NAMESPACE') | default('ibm-bas', true) }}"
bas_persistent_storage: "{{ lookup('env', 'BAS_PERSISTENT_STORAGE') }}"
bas_meta_storage_class: "{{ lookup('env', 'BAS_META_STORAGE') }}"
bas_username: "{{ lookup('env', 'BAS_USERNAME') | default('basuser', true) }}"
# TODO: Providing a default password of "password" is unacceptable, this needs to be randomly generated if not provided (and provide details in the documentation about how to look up the generated password)
# When fixing this, ensure it is fixed in any other playbooks with the same problem too
bas_password: "{{ lookup('env', 'BAS_PASSWORD') | default('password', true) }}"
# TODO: if this is related to BAS, the env vars should be prefixed BAS_ ... otherwise, why is the grafana config required under the BAS section here/why is the grafana username set to "basuser" as default?
# When fixing this, ensure it is fixed in any other playbooks with the same problem too
grafana_username: "{{ lookup('env', 'GRAFANA_USERNAME') | default('basuser', true) }}"
# TODO: Providing a default password of "password" is unacceptable, this needs to be randomly generated if not provided (and provide details in the documentation about how to look up the generated password)
# When fixing this, ensure it is fixed in any other playbooks with the same problem too
grafana_password: "{{ lookup('env', 'GRAFANA_PASSWORD') | default('password', true) }}"
# TODO: These all need to be made required env vars, these are not useable defaults
# When fixing this, ensure it is fixed in any other playbooks with the same problem too
contact:
email: "{{ lookup('env', 'BAS_CONTACT_MAIL') | default('[email protected]', true) }}"
firstName: "{{ lookup('env', 'BAS_CONTACT_FIRSTNAME') | default('John', true) }}"
lastName: "{{ lookup('env', 'BAS_CONTACT_LASTNAME') | default('Barnes', true) }}"
# MAS Configuration
mas_instance_id: "{{ lookup('env', 'MAS_INSTANCE_ID') }}"
mas_config_dir: "{{ lookup('env', 'MAS_CONFIG_DIR') }}"
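As a sketch of the password TODOs above: Ansible's password lookup with a /dev/null path can generate a random default without persisting it (the chars/length choices here are assumptions, and the generated value would still need to be surfaced per the documentation note):

```yaml
# Sketch only: generate a random 16-character password when BAS_PASSWORD
# is not set, instead of defaulting to "password"
bas_password: "{{ lookup('env', 'BAS_PASSWORD') | default(lookup('password', '/dev/null chars=ascii_letters,digits length=16'), true) }}"
```

Note that a /dev/null password lookup produces a new value on every evaluation, so in practice the result should be captured once (e.g. via set_fact) and reused wherever the password is referenced.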
Predict and H&P Utilities customers must access the Cloud Pak for Data dashboard to look at their Analytics projects and design notebooks, but the cp4d URL is not secure. This is important both for MAS MS and MAS Demo environments, as well as any environment used by customers.
By the end of this process cp4d would also be accessible using a suite.maximo.com domain.
To simplify the user experience, we should auto-download the oc and ibmcloud tools in any roles that require them. The download/extract commands for these can be found in the docker image; however, we must make sure that we handle different platforms gracefully .. either limit the auto-download support to specific platforms (and document as such) or make it work universally ... I'm mostly thinking about the usual "Mac problem" here :)
Alternatively, can we actually remove the use of ibmcloud and oc entirely and just call their APIs / use their Ansible modules? That would simplify things even more.
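A sketch of the platform handling for the auto-download, assuming the public OpenShift client mirror layout and the "stable" channel (both would need verifying against the docker image's download commands):

```shell
#!/bin/sh
# Map the host platform (output of `uname -s`) to the oc client
# download URL on the public OpenShift mirror; unsupported platforms
# fail loudly instead of downloading the wrong archive.
oc_download_url() {
  case "$1" in
    Linux)  echo "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz" ;;
    Darwin) echo "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-mac.tar.gz" ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Example usage: download and extract oc for the current platform
# curl -sL "$(oc_download_url "$(uname -s)")" | tar -xz oc
```

Keeping the mapping in one function makes the "Mac problem" explicit: either both branches are tested, or Darwin is dropped and the limitation documented.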
The collection should provide support to install MAS and any application using the Manual upgrade strategy, so anyone that uses this collection to set up their environment won't get any upgrade without their consent. I would suggest making it the default behavior and only installing with Automatic approval if explicitly specified by the executor.
When working on this issue it's important to ensure that all the operators in ibm-common-services are installed properly. If deployed by the MAS CSV, they will inherit the Manual upgrade strategy from MAS, and in that case we should also handle the approval for each operator there; otherwise the ODLM and Licensing operators won't get installed and the MAS installation will fail.
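For illustration, the upgrade strategy is controlled by the installPlanApproval field on the OLM Subscription; a sketch (the namespace, channel, and package names below are placeholders, not the collection's actual values):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-mas
  namespace: mas-inst1-core          # placeholder namespace
spec:
  channel: "8.x"                     # placeholder channel
  name: ibm-mas                      # placeholder package name
  source: ibm-operator-catalog
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual        # no upgrade proceeds without explicit approval
```

With Manual approval, the initial install also generates a pending InstallPlan that must be approved, e.g. `oc patch installplan <name> -n <namespace> --type merge -p '{"spec":{"approved":true}}'` — and the same handling is needed for each dependent operator in ibm-common-services.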
See https://ibm-mas.github.io/ansible-devops/roles/suite_dns/
We need to support a separate cis_apikey property, because the API key provided will be stored in a secret in the cluster, used by the webhook to create challenge request files in your DNS. We should support the ability to set the API key used here separately from the main IBM Cloud API key used elsewhere, so that it can be restricted to only the permissions required by CIS.
Update (@alequint):
- The suite_dns role today uses the IBM Cloud API Key, which is the same API key used for different purposes (the same one set in the corresponding playbook that uses the role).
- Instead of ibmcloud_apikey, the suite_dns role should use another property, cis_apikey, which will contain a more restricted API key.
In IBMCloud ROKS we have seen delays of over an hour before the Red Hat Operator catalog is ready to use. This will cause attempts to install anything from that CatalogSource to fail as the timeouts built into those roles are designed to catch problems with an install, rather than a half-provisioned cluster that is not properly ready to use.
The role will poll for 1.5 hours waiting for the redhat-operators CatalogSource to be ready. If it is not ready after that time, the role will fail the run.
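A sketch of such a wait task, assuming the kubernetes.core collection (90 retries at 60s each gives the 1.5-hour window):

```yaml
- name: "Wait for the redhat-operators CatalogSource to be ready"
  kubernetes.core.k8s_info:
    api_version: operators.coreos.com/v1alpha1
    kind: CatalogSource
    name: redhat-operators
    namespace: openshift-marketplace
  register: redhat_catalog_info
  retries: 90   # 90 x 60s = 1.5 hours
  delay: 60
  until:
    - redhat_catalog_info.resources | length > 0
    - redhat_catalog_info.resources[0].status.connectionState.lastObservedState | default('') == "READY"
```

The default('') guard matters on a half-provisioned cluster, where the resource can exist before its status block has been populated.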
We should be able to execute must-gather in a cluster by running a simple role: ibm.mas_devops.must_gather
It's really simple to run must-gather for MAS:
oc adm must-gather --image=quay.io/aiasupport/must-gather -- gather -cgl --noclusterinfo --mas-instance-id {{mas_instance_id}}
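A sketch of what the role's main task could look like; the dest-dir handling and variable name are assumptions, only the oc command itself comes from the issue:

```yaml
- name: "Run MAS must-gather"
  ansible.builtin.command: >
    oc adm must-gather
    --image=quay.io/aiasupport/must-gather
    --dest-dir={{ mas_must_gather_dir | default('/tmp/must-gather', true) }}
    -- gather -cgl --noclusterinfo --mas-instance-id {{ mas_instance_id }}
  register: must_gather_result
  changed_when: false
```

Wrapping the one-liner in a role keeps the image and gather flags in a single place and lets pipelines invoke it like any other ibm.mas_devops role.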
After several attempts, I have noticed that SafetyWorkspace is not able to complete successfully.
However, IoT is up and running and no pods indicate errors.
Then I tried the following workaround:
context cleanup 7bb3854f-fedb-4cf0-a653-c79014aa595b did not finish
[2021-11-25T18:42:28.071Z] [INFO] [TenantController.provisionTenant] [6356f81e-95e5-4639-921a-c5792bdb521a] [New] Provision request
[2021-11-25T18:42:28.072Z] [INFO] [TenantController.provisionTenant] [6356f81e-95e5-4639-921a-c5792bdb521a] [New] {
"name": "masdev",
"guid": "masdev",
"plan": "None",
"planGuid": "None",
"type": "internal",
"tier": "None",
"tenantId": "masdev",
"tenantName": "masdev",
"bluemixSpaceId": "None",
"bluemixOrgId": "None",
"contactName": "None",
"contactEmail": "None"
}
[2021-11-25T18:42:28.074Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService.provisionTenant] Provisioning tenant masdev
[2021-11-25T18:42:28.102Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [TenantService.BaseService.update] Updating a tenant 1
[2021-11-25T18:42:28.103Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [TenantService.BaseService.get] Getting a tenant 1
[2021-11-25T18:42:28.217Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService.provisionTenant] First part of tenant (1) provisioning is successful.
[2021-11-25T18:42:28.219Z] [INFO] [OpsNotifier.notify] [6356f81e-95e5-4639-921a-c5792bdb521a] [1] Notifying ops
[2021-11-25T18:42:28.239Z] [INFO] [1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService._provisionTenantAsync] Starting second part of tenant provisioning
[2021-11-25T18:42:28.240Z] [INFO] [1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService._createDefaultDocs] Creating default config, jobs, shields
[2021-11-25T18:42:28.240Z] [INFO] [1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService._createDefaultDocs] Creating default roles
[2021-11-25T18:42:36.700Z] [INFO] [TenantController.getProvisioningStatus] [e45d1198-0af6-4049-b15e-040dacc71bc0] [1] Get status of tenant provisioning
[2021-11-25T18:42:36.727Z] [INFO] [1] [e45d1198-0af6-4049-b15e-040dacc71bc0] [InternalTenantService.getProvisioningStatus] Getting status of tenant provisioning
[2021-11-25T18:42:36.729Z] [INFO] [1] [e45d1198-0af6-4049-b15e-040dacc71bc0] [TenantService.BaseService.get] Getting a tenant 1
[2021-11-25T18:42:58.260Z] [ERROR] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService._provisionTenantAsync] Second part of tenant (1) provisioning failed. DriverException: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
at DB2ExceptionConverter.convertException (/home/node/app/node_modules/@mikro-orm/core/platforms/ExceptionConverter.js:8:16)
at DB2ExceptionConverter.convertException (/home/node/app/node_modules/@iot4i/core/src/data/db2-driver/DB2ExceptionConverter.ts:36:18)
at DB2Driver.convertException (/home/node/app/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:194:54)
at /home/node/app/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:198:24
at runNextTicks (internal/process/task_queues.js:58:5)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
at DB2Driver.find (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:46:24)
at DB2Driver.findOne (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:60:21)
at SqlEntityManager.findOne (/home/node/app/node_modules/@mikro-orm/core/EntityManager.js:227:22)
at async Promise.all (index 1)
at InternalTenantService._createDefaultDocs (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:347:5)
at InternalTenantService._provisionTenantAsync (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:188:7)
previous KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
at DB2Client.acquireConnection (/home/node/app/node_modules/knex/lib/client.js:348:26)
at runNextTicks (internal/process/task_queues.js:58:5)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
at DB2Connection.executeQuery (/home/node/app/node_modules/@mikro-orm/core/connections/Connection.js:56:25)
at DB2Connection.execute (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlConnection.js:105:21)
at QueryBuilder.execute (/home/node/app/node_modules/@mikro-orm/knex/query/QueryBuilder.js:300:21)
at DB2Driver.find (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:46:24)
at DB2Driver.findOne (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:60:21)
at SqlEntityManager.findOne (/home/node/app/node_modules/@mikro-orm/core/EntityManager.js:227:22)
at async Promise.all (index 1)
at InternalTenantService._createDefaultDocs (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:347:5)
at InternalTenantService._provisionTenantAsync (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:188:7)
[2021-11-25T18:42:58.261Z] [INFO] [OpsNotifier.notify] [6356f81e-95e5-4639-921a-c5792bdb521a] [1] Notifying ops
[2021-11-25T18:42:58.262Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [TenantService.BaseService.update] Updating a tenant 1
[2021-11-25T18:42:58.262Z] [INFO] [-1] [6356f81e-95e5-4639-921a-c5792bdb521a] [TenantService.BaseService.get] Getting a tenant 1
[2021-11-25T18:42:58.264Z] [WARN] [xxxxxxxxx--No-Tenant-Id-xxxxxxxx] [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx] [uncaughtException] Possibly Unhandled rejection: DriverException: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
at DB2ExceptionConverter.convertException (/home/node/app/node_modules/@mikro-orm/core/platforms/ExceptionConverter.js:8:16)
at DB2ExceptionConverter.convertException (/home/node/app/node_modules/@iot4i/core/src/data/db2-driver/DB2ExceptionConverter.ts:36:18)
at DB2Driver.convertException (/home/node/app/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:194:54)
at /home/node/app/node_modules/@mikro-orm/core/drivers/DatabaseDriver.js:198:24
at runNextTicks (internal/process/task_queues.js:58:5)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
at DB2Driver.find (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:46:24)
at DB2Driver.findOne (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:60:21)
at SqlEntityManager.findOne (/home/node/app/node_modules/@mikro-orm/core/EntityManager.js:227:22)
at async Promise.all (index 1)
at InternalTenantService._createDefaultDocs (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:347:5)
at InternalTenantService._provisionTenantAsync (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:188:7)
previous KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
at DB2Client.acquireConnection (/home/node/app/node_modules/knex/lib/client.js:348:26)
at runNextTicks (internal/process/task_queues.js:58:5)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
at DB2Connection.executeQuery (/home/node/app/node_modules/@mikro-orm/core/connections/Connection.js:56:25)
at DB2Connection.execute (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlConnection.js:105:21)
at QueryBuilder.execute (/home/node/app/node_modules/@mikro-orm/knex/query/QueryBuilder.js:300:21)
at DB2Driver.find (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:46:24)
at DB2Driver.findOne (/home/node/app/node_modules/@mikro-orm/knex/AbstractSqlDriver.js:60:21)
at SqlEntityManager.findOne (/home/node/app/node_modules/@mikro-orm/core/EntityManager.js:227:22)
at async Promise.all (index 1)
at InternalTenantService._createDefaultDocs (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:347:5)
at InternalTenantService._provisionTenantAsync (/home/node/app/node_modules/@iot4i/core/src/services/InternalTenantService.ts:188:7)
[2021-11-25T18:42:58.283Z] [INFO] [1] [6356f81e-95e5-4639-921a-c5792bdb521a] [InternalTenantService._createDefaultDocs] Role administrator exists, skipping ...
Currently the ocp_login role requires the user to have the IBM Cloud Kubernetes Service Administrator platform role in the target IBM Cloud account, specifically in order to run the following command:
ibmcloud oc cluster config -c {{ cluster_name }} --admin
This may cause failures if the owner (the user/service ID) of the IBM Cloud API Key used doesn't have the required Kubernetes Service permissions in that IBM Cloud account, even though it may still have the cluster_admin role inside that OpenShift cluster (we are in that situation for the clusters in the P2PaaS account).
More discussions on Slack: https://ibm-watson-iot.slack.com/archives/C0195MVCEUD/p1640135699356200
Documentation improvement: https://ibm-mas.github.io/ansible-devops/playbooks/dependencies/#install-mongodb-ce
TASK [mongodb : mongodb/community : Fail if mongodb_storage_class is not provided] ******************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "mongodb_storage_class property is required"}
We can use this as a starting point, but it may be quite out of date:
https://github.com/ibm-watson-iot/iot-docs/tree/master/monitoring#3-grafana-setup
This should be added to the ocp_setup_mas_deps role.
https://github.com/ibm-mas/ansible-devops/runs/4418885903?check_suite_focus=true
We should be able to track back to the previous release; the failure may be due to a difference in the way Actions checks out the repository, possibly such that we don't get access to tags like we did in Travis.
npm install of git-latest-semver-tag starting
added 75 packages, and audited 76 packages in 4s
3 packages are looking for funding
run `npm fund` for details
3 high severity vulnerabilities
To address all issues, run:
npm audit fix
Run `npm audit` for details.
- npm install complete
LAST TAG =
Creating /home/runner/work/ansible-devops/ansible-devops/.changelog
RELEASE LEVEL = initial
Semantic versioning system initialized: 1.0.0
initial release of 1.0.0
Setting /home/runner/work/ansible-devops/ansible-devops/.version to 1.0.0-pre.master
Setting /home/runner/work/ansible-devops/ansible-devops/.previous_version to 1.0.0
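If the cause really is the Actions checkout, one likely fix (an assumption, to be verified against the workflow) is telling actions/checkout to fetch the full history rather than a shallow clone, so tags from previous releases are available:

```yaml
- uses: actions/checkout@v2
  with:
    fetch-depth: 0   # fetch all history and tags, not the default shallow clone
```

Travis performed a deeper clone by default, which would explain why git-latest-semver-tag found the previous tag there but sees none ("LAST TAG =") under Actions.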
The following setup tasks in the area of performance tuning and monitoring would be useful to have in Ansible. The MAS performance team has bash scripts, but it would be nice to port this functionality to the ibm-mas/ansible-devops collection.
In roles/suite_verify/tasks/main.yml:
# 5. Print MAS login information
# -----------------------------------------------------------------------------
# TODO: The admin url should be looked up from the route resource, this should
# not rely on the same logic to construct the default URL we used when we installed
# MAS.
- name: "Lookup cluster subdomain"
community.kubernetes.k8s_info:
api_version: config.openshift.io/v1
kind: Ingress
name: cluster
register: _cluster_subdomain
- name: "Configure domain"
set_fact:
mas_domain: "{{custom_domain | default(mas_instance_id ~ '.' ~ _cluster_subdomain.resources[0].spec.domain )}}"
- debug:
msg:
- "Maximo Application is Ready, use the superuser credentials to authenticate"
- "Superuser Credentials:"
- "Username: .... {{ superuser_credentials.resources[0].data.username | b64decode }}"
- "Password: .... {{ superuser_credentials.resources[0].data.password | b64decode }}"
- "Admin Url: ... https://admin.{{mas_domain}}"
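A sketch of the route-based lookup the TODO asks for; the namespace pattern and label selector below are assumptions about how the admin route is labelled:

```yaml
- name: "Lookup the admin dashboard route"
  community.kubernetes.k8s_info:
    api_version: route.openshift.io/v1
    kind: Route
    namespace: "mas-{{ mas_instance_id }}-core"             # assumed namespace pattern
    label_selectors:
      - "app.kubernetes.io/instance={{ mas_instance_id }}"  # assumed label
  register: _admin_route

- debug:
    msg: "Admin Url: ... https://{{ _admin_route.resources[0].spec.host }}"
```

Reading spec.host from the actual Route means the printed URL stays correct even if a custom domain or non-default naming was used at install time.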
After discussions with the team, it's been advised that we use the standalone db2u operator instead of Cloud Pak for Data to meet our Db2 dependency.
The task here is to implement alternative support for the standalone operator in a new role (db2u) as an alternative to the existing cp4d_db2wh role.
In the future we may eliminate the cp4d_db2wh role, but for now we just want to introduce this support as an alternative. Deploying Db2 without CP4D will reduce install time and footprint.
The work has already been started in the db2u branch, however after my accident no-one picked the work up so it's remained in limbo for a number of weeks now.
The role needs to:
With the clean-up of default handling into the roles, there is no longer a need to maintain separate provision-quickburn / provision-roks & deprovision-quickburn / deprovision-roks playbooks.
These should be consolidated into single provision / deprovision playbooks, with appropriate updates to the mas playbooks that call these, the documentation, and the associated Tekton pipelines.
Migrate code from GitHub Enterprise, prepare code for public release under EPL license & prep for publication to Ansible Galaxy.
Looking forward to build artifacts that can be reused and facilitate user interaction and pipeline automation with ibm.mas_devops. I'm creating this issue to track the work being done around the OpenShift Pipelines operator and this collection.
The goal is to create a 1:1 mapping between the playbooks available in this collection and Tekton tasks, and then pipelines.
The PoC will include a set of tasks to deploy Manage against a running MAS. The pipeline will be responsible for deploying and configuring db2wh, and for deploying and activating Manage.
Also as part of this work a new public image should be made available, so we can leverage the collection in containers using Tekton tasks.
A plus for this work would be to make our tasks available in the Tekton Hub: https://hub.tekton.dev/
We've ported parts of the internal build system to this repository, however judging from the results of this build it's not working 100% yet: https://app.travis-ci.com/github/ibm-mas/ansible-devops/builds/236812829
Is there a way to reduce the footprint of Kafka by consolidating the two AMQ operators into one? BAS and IoT both require Kafka, and both deploy AMQ operators separately into multiple projects, which seems redundant.
There is no check for whether the status object exists ... so there is a timing window on a new deploy where the deploy will blow up, as in the example below:
TASK [ibm.mas_devops.ocp_verify : Check if Red Hat Catalog is ready] ********************************************************************************************************************************************
Wednesday 24 November 2021 11:35:49 +0000 (0:00:02.602) 0:36:24.845 ****
fatal: [localhost]: FAILED! => {"msg": "The conditional check 'redhat_catalog_info.resources[0].status.connectionState.lastObservedState == \"READY\"' failed. The error was: error while evaluating conditional (redhat_catalog_info.resources[0].status.connectionState.lastObservedState == \"READY\"): 'dict object' has no attribute 'status'"}
NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************
PLAY RECAP ******************************************************************************************************************************************************************************************************
localhost : ok=19 changed=9 unreachable=0 failed=1 skipped=9 rescued=0 ignored=0
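A hedged sketch of the fix: guard each level of the lookup with `is defined` before comparing, so the task simply retries while the status block has not been populated yet on a fresh deploy (the surrounding task is abbreviated; only the condition is the point here):

```yaml
# Sketch: retry while the CatalogSource status is still unpopulated,
# instead of blowing up on the missing attribute
until:
  - redhat_catalog_info.resources is defined
  - redhat_catalog_info.resources | length > 0
  - redhat_catalog_info.resources[0].status is defined
  - redhat_catalog_info.resources[0].status.connectionState is defined
  - redhat_catalog_info.resources[0].status.connectionState.lastObservedState == "READY"
```

Because Ansible evaluates an until list top-down as an AND, the comparison on the last line is only reached once every parent attribute exists, closing the timing window.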
We've performed significant refactoring of the roles and playbooks to clean up various problems introduced in v4.4 and v4.5, before we formally release v5.0 we need to test the tekton pipelines, it's likely numerous updates are required.
Other roles have a conditional check for MAS_CONFIG_DIR and only save the file when it's set, for example in db2 (the same logic can be seen in the kafka and mongo dependency roles):
# 3. Generate a JdbcCfg for MAS configuration
# -----------------------------------------------------------------------------
- include_tasks: tasks/suite_jdbccfg.yml
when:
- mas_instance_id is defined
- mas_instance_id != ""
- mas_config_dir is defined
- mas_config_dir != ""
However in BAS we don't seem to have any check for whether the config dir has been set, resulting in potentially writing to a dangerous and/or undesirable directory if mas_config_dir == "" (which it will be if the env var was not set).
- name: Set facts for BASCfg
set_fact:
bas_segment_key: "{{_bas_segmentKey_result.resources[0].data.segmentkey | b64decode}}"
bas_api_key: "{{_bas_apiKey_result.resources[0].data.apikey | b64decode }}"
bas_endpoint_url: "https://{{_bas_endpoint.resources[0].spec.host}}"
bas_tls_crt: "{{_bas_certificates_result.resources[0].data['tls.crt'] | b64decode | regex_findall('(-----BEGIN .+?-----(?s).+?-----END .+?-----)', multiline=True, ignorecase=True) }}"
# 2. Write out the config to the local filesystem
# -----------------------------------------------------------------------------
- name: Copy BASCfg to filesystem
ansible.builtin.template:
src: bascfg.yml.j2
dest: "{{ mas_config_dir }}/bas-{{ bas_namespace }}.yml"
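The fix could simply mirror the db2 guard on the template task, e.g.:

```yaml
- name: Copy BASCfg to filesystem
  when:
    - mas_config_dir is defined
    - mas_config_dir != ""
  ansible.builtin.template:
    src: bascfg.yml.j2
    dest: "{{ mas_config_dir }}/bas-{{ bas_namespace }}.yml"
```

With the guard in place, the facts are still set for in-cluster use, but nothing is written to the filesystem unless MAS_CONFIG_DIR was explicitly provided.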
We have the bas role almost complete, but it's not fully integrated into the sample playbooks and there were a few remaining issues to be ironed out before it's considered ready to use
please deliver fixes to this branch: https://github.com/ibm-mas/ansible-devops/tree/bugfixes2411
https://ibm-mas.github.io/ansible-devops/playbooks/lite-core-roks/
This page needs various updates to reflect the addition of automated BAS deployment to the playbook.
Tackle this after the issues raised in #65 have been addressed, as the fixes there will impact the documentation (we only want to fix this once).
The same issues are present on these pages as well:
To complete support for Predict we actually need to complete support for Manage components.
Today in the code we have this:
---
# Default application spec for Manage
mas_app_ws_spec:
bindings:
jdbc: "{{ mas_appws_jdbc_binding | default( 'system' , true) }}"
components: "{{ mas_appws_components | default({'base': {'version': 'latest'}}, true) }}"
settings:
db:
dbSchema: "{{ db2_schema }}"
maxinst:
demodata: "{{ manage_demo_data | bool }}"
db2Vargraphic: true
indexSpace: MAXINDEX
tableSpace: MAXDATA
bypassUpgradeVersionCheck: false
However, because our playbook strategy is to drive configuration via environment variables, there is no way to leverage this capability today as it requires setting vars specifically in the playbook, which we don't want to do because we want the config to be passed in via env vars.
To solve this, let's support something like this:
export MAS_APPWS_COMPONENTS=base=latest,health=latest
ansible-playbook playbooks/mas/configure-app.yml
Inside the suite_app_configure role we should have a new plugin that can parse name=value pairs in this env var into the object that we need to pass to the workspace CR spec. This would be used in the defaults/main.yml file to set the default value for mas_appws_components, with the default being omit if the env var is not set (i.e. don't define the variable at all).
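The parsing itself is small; a sketch of such a filter plugin (the file path and filter name are assumptions, e.g. plugins/filter/appws_components.py exposing a components2dict filter):

```python
# Hypothetical filter plugin converting "base=latest,health=latest"
# into the dict structure expected by the workspace CR spec.

def components2dict(components_str):
    """Convert 'name=version,name=version' into
    {'name': {'version': 'version'}, ...}."""
    result = {}
    if not components_str:
        return result
    for pair in components_str.split(","):
        name, _, version = pair.strip().partition("=")
        # Fall back to "latest" if a bare component name is given
        result[name] = {"version": version or "latest"}
    return result


class FilterModule:
    """Expose the filter to Ansible."""

    def filters(self):
        return {"components2dict": components2dict}
```

In defaults/main.yml this could then look something like `mas_appws_components: "{{ lookup('env', 'MAS_APPWS_COMPONENTS') | components2dict }}"`, with the omit handling layered on top when the env var is empty.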
Then the pipeline itself will need to be updated to have a new param:
- name: mas_appws_components_manage
type: string
description: Manage components to configure in the workspace
default: "base=latest,health=latest"
and also ensure this param is passed to the manage configure workspace step in the pipeline:
# 16.2 Configure Manage workspace
- name: cfg-manage
params:
- name: junit_suite_name
value: app-cfg-manage
- name: mas_app_id
value: manage
- name: mas_workspace_id
value: "$(params.mas_workspace_id)"
- name: mas_appws_components
value: "$(params.mas_appws_components_manage)"
taskRef:
name: mas-devops-configure-app
kind: ClusterTask
Hopefully that gives a good idea of what we need here.
None of these applications can be successfully installed or installed + configured using the Ansible collection.
In my testing in fvt-dev, only the original application support is functional (IoT & Manage)
Please deliver fixes to this branch: https://github.com/ibm-mas/ansible-devops/tree/bugfixes2411
1st run:
TASK [ibm.mas_devops.bas_install : Wait for FullDeployment to be ready (60s delay)] ********************************************************************************************************************************************************
Wednesday 24 November 2021 15:52:40 +0000 (0:00:02.010) 0:03:43.775 ****
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (40 retries left).
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (39 retries left).
... <snip> ...
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (3 retries left).
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (2 retries left).
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (1 retries left).
fatal: [localhost]: FAILED! => {"api_found": true, "attempts": 40, "changed": false, "resources": [{"apiVersion": "bas.ibm.com/v1", "kind": "FullDeployment", "metadata": {"creationTimestamp": "2021-11-24T15:52:39Z", "generation": 1, "managedFields": [{"apiVersion": "bas.ibm.com/v1", "fieldsType": "FieldsV1", "fieldsV1": {"f:spec": {".": {}, "f:airgapped": {".": {}, "f:backup_deletion_frequency": {}, "f:backup_retention_period": {}, "f:enabled": {}}, "f:allowed_domains": {}, "f:db_archive": {".": {}, "f:frequency": {}, "f:persistent_storage": {".": {}, "f:storage_class": {}, "f:storage_size": {}}, "f:retention_age": {}}, "f:env_type": {}, "f:event_scheduler_frequency": {}, "f:ibmproxyurl": {}, "f:image_pull_secret": {}, "f:kafka": {".": {}, "f:storage_class": {}, "f:storage_size": {}, "f:zookeeper_storage_class": {}, "f:zookeeper_storage_size": {}}, "f:postgres": {".": {}, "f:storage_class": {}, "f:storage_size": {}}, "f:prometheus_metrics": {}, "f:prometheus_scheduler_frequency": {}}, "f:status": {"f:phase": {}}}, "manager": "OpenAPI-Generator", "operation": "Update", "time": "2021-11-24T15:53:49Z"}, {"apiVersion": "bas.ibm.com/v1", "fieldsType": "FieldsV1", "fieldsV1": {"f:status": {".": {}, "f:conditions": {}}}, "manager": "ansible-operator", "operation": "Update", "time": "2021-11-24T16:01:43Z"}], "name": "fulldeployment", "namespace": "ibm-bas", "resourceVersion": "133659", "selfLink": "/apis/bas.ibm.com/v1/namespaces/ibm-bas/fulldeployments/fulldeployment", "uid": "299ab75b-61f5-4751-97d7-b50840e5c1a7"}, "spec": {"airgapped": {"backup_deletion_frequency": "@daily", "backup_retention_period": 7, "enabled": "false"}, "allowed_domains": "*", "db_archive": {"frequency": "@monthly", "persistent_storage": {"storage_class": "", "storage_size": "10G"}, "retention_age": 6}, "env_type": "lite", "event_scheduler_frequency": "*/10 * * * *", "ibmproxyurl": "https://iaps.ibm.com", "image_pull_secret": "bas-images-pull-secret", "kafka": {"storage_class": "", "storage_size": 
"5G", "zookeeper_storage_class": "", "zookeeper_storage_size": "5G"}, "postgres": {"storage_class": "", "storage_size": "10G"}, "prometheus_metrics": [], "prometheus_scheduler_frequency": "@daily"}, "status": {"conditions": [{"ansibleResult": {"changed": 12, "completion": "2021-11-24T16:01:43.152198", "failures": 1, "ok": 17, "skipped": 3}, "lastTransitionTime": "2021-11-24T16:01:43Z", "message": "Failed to find exact match for kafka.strimzi.io/v1beta2.Kafka by [kind, name, singularName, shortNames]", "reason": "Failed", "status": "False", "type": "Failure"}, {"lastTransitionTime": "2021-11-24T16:01:43Z", "message": "Running reconciliation", "reason": "Running", "status": "True", "type": "Running"}], "phase": "Installing"}}]}
NO MORE HOSTS LEFT *************************************************************************************************************************************************************************************************************************
PLAY RECAP *********************************************************************************************************************************************************************************************************************************
localhost : ok=98 changed=22 unreachable=0 failed=1 skipped=32 rescued=0 ignored=0
Wednesday 24 November 2021 16:33:58 +0000 (0:41:17.379) 0:45:01.155 ****
===============================================================================
Second run, the next day: a different failure at the same place:
FAILED - RETRYING: Wait for FullDeployment to be ready (60s delay) (1 retries left).
fatal: [localhost]: FAILED! => {"api_found": true, "attempts": 40, "changed": false, "resources": [{"apiVersion": "bas.ibm.com/v1", "kind": "FullDeployment", "metadata": {"creationTimestamp": "2021-11-24T15:52:39Z", "generation": 1, "managedFields": [{"apiVersion": "bas.ibm.com/v1", "fieldsType": "FieldsV1", "fieldsV1": {"f:spec": {".": {}, "f:airgapped": {".": {}, "f:backup_deletion_frequency": {}, "f:backup_retention_period": {}, "f:enabled": {}}, "f:allowed_domains": {}, "f:db_archive": {".": {}, "f:frequency": {}, "f:persistent_storage": {".": {}, "f:storage_class": {}, "f:storage_size": {}}, "f:retention_age": {}}, "f:env_type": {}, "f:event_scheduler_frequency": {}, "f:ibmproxyurl": {}, "f:image_pull_secret": {}, "f:kafka": {".": {}, "f:storage_class": {}, "f:storage_size": {}, "f:zookeeper_storage_class": {}, "f:zookeeper_storage_size": {}}, "f:postgres": {".": {}, "f:storage_class": {}, "f:storage_size": {}}, "f:prometheus_metrics": {}, "f:prometheus_scheduler_frequency": {}}, "f:status": {"f:phase": {}}}, "manager": "OpenAPI-Generator", "operation": "Update", "time": "2021-11-24T15:53:49Z"}, {"apiVersion": "bas.ibm.com/v1", "fieldsType": "FieldsV1", "fieldsV1": {"f:status": {".": {}, "f:conditions": {}}}, "manager": "ansible-operator", "operation": "Update", "time": "2021-11-25T12:03:28Z"}], "name": "fulldeployment", "namespace": "ibm-bas", "resourceVersion": "846474", "selfLink": "/apis/bas.ibm.com/v1/namespaces/ibm-bas/fulldeployments/fulldeployment", "uid": "299ab75b-61f5-4751-97d7-b50840e5c1a7"}, "spec": {"airgapped": {"backup_deletion_frequency": "@daily", "backup_retention_period": 7, "enabled": "false"}, "allowed_domains": "*", "db_archive": {"frequency": "@monthly", "persistent_storage": {"storage_class": "", "storage_size": "10G"}, "retention_age": 6}, "env_type": "lite", "event_scheduler_frequency": "*/10 * * * *", "ibmproxyurl": "https://iaps.ibm.com", "image_pull_secret": "bas-images-pull-secret", "kafka": {"storage_class": "", "storage_size": 
"5G", "zookeeper_storage_class": "", "zookeeper_storage_size": "5G"}, "postgres": {"storage_class": "", "storage_size": "10G"}, "prometheus_metrics": [], "prometheus_scheduler_frequency": "@daily"}, "status": {"conditions": [{"ansibleResult": {"changed": 1, "completion": "2021-11-25T12:03:28.405458", "failures": 1, "ok": 18, "skipped": 4}, "lastTransitionTime": "2021-11-25T12:03:28Z", "message": "unknown playbook failure", "reason": "Failed", "status": "False", "type": "Failure"}, {"lastTransitionTime": "2021-11-25T12:03:28Z", "message": "Running reconciliation", "reason": "Running", "status": "True", "type": "Running"}], "phase": "Installing"}}]}
NO MORE HOSTS LEFT *************************************************************************************************************************************************************************************************************************
PLAY RECAP *********************************************************************************************************************************************************************************************************************************
localhost : ok=98 changed=13 unreachable=0 failed=1 skipped=32 rescued=0 ignored=0
ibm-bas namespace is left in this state:
/mnt/c/Users/DaveParker$ oc get pods -n ibm-bas
NAME READY STATUS RESTARTS AGE
amq-streams-cluster-operator-v1.7.3-6dc4b4d74d-fn62k 1/1 Running 0 20h
backrest-backup-instrumentationdb-r9c8c 0/1 Completed 0 15h
behavior-analytics-services-operator-d5f6f5899-s8qvd 2/2 Running 0 20h
createcluster-2zg6f 0/1 Completed 0 20h
dashboard-deployment-66c874dbbb-wksj4 2/2 Running 0 20h
instrumentationdb-7b69964d8c-9cjbb 1/1 Running 0 20h
instrumentationdb-backrest-shared-repo-75566dfd69-wz44f 1/1 Running 0 20h
instrumentationdb-stanza-create-crv5s 0/1 Completed 0 20h
kafka-zookeeper-0 0/1 Pending 0 20h
pgo-deploy-ln6lx 0/1 Completed 0 20h
postgres-operator-6c585b8c78-mzz8v 4/4 Running 0 20h
Doc on how to install AppConnect for MAS 8.6: https://www.ibm.com/docs/en/mas86/8.6.0?topic=ons-app-connect
AppConnect is required to deploy H&P Utilities in MAS 8.6, so its provisioning must be automated as part of this collection.
Steps informed by the team:
When installing AppConnect, spec.useCommonServices must be set to false.
For MAS 8.6 we use AppConnect operator version 1.5.2, dashboard version 12.0.1.0-r3, and the license for the above is AppConnectEnterpriseProduction L-KSBM-C37J2R.
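Putting the pieces above together, the Dashboard CR the new role would create could look roughly like this (a sketch only; apart from spec.useCommonServices, the license value, and the dashboard version quoted above, the field names and namespace are assumptions based on the App Connect operator, not confirmed settings):

```yaml
# Hedged sketch of an App Connect Dashboard CR for MAS 8.6.
# namespace and metadata.name are placeholders.
apiVersion: appconnect.ibm.com/v1beta1
kind: Dashboard
metadata:
  name: appconnect-dashboard
  namespace: mas-appconnect
spec:
  license:
    accept: true
    license: L-KSBM-C37J2R               # license ID quoted by the team
    use: AppConnectEnterpriseProduction  # license use quoted by the team
  useCommonServices: false               # per the team: must be set to false
  version: 12.0.1.0-r3                   # dashboard version for MAS 8.6
```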
See: https://app.travis-ci.com/github/ibm-mas/ansible-devops/builds/236820483
$ $HOME/build.common/bin/gitrelease.sh
/home/travis/.travis/functions: line 109: /home/travis/build.common/bin/gitrelease.sh: No such file or directory
Looks like we forgot to port this part of the build system over to the public repo (and it's still referencing the path used by build.common when calling it).
The quality of the role READMEs, which are pulled into the documentation (https://ibm-mas.github.io/ansible-devops/), is pretty low. A thorough review of each is required to ensure the documentation is accurate and complete.
BAS install fails due to ROKS provisioning errors. We don't know why ROKS sometimes messes up this aspect of the cluster config, but it causes problems for one of our dependencies: BAS.
We can put something in ocp verify that checks for one of the two secrets that we are expecting on ROKS. If things are going to fail because this secret is sometimes missing, we should make them fail as early as possible, seeing as we don't know how to resolve the issue.
Reference: https://github.com/ibm-mas/ansible-devops/blob/master/ibm/mas_devops/roles/bas_install/tasks/bascfg.yml#L63-L103 ... We need similar logic in ocp verify that checks that either of these secrets exists. If neither does, fail the verify, because the cluster is not ready for MAS.
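The early check in ocp verify could be sketched like this (hypothetical tasks; roks_expected_secrets and roks_secret_namespace are placeholder variables, and the real names must match the lookup in the bascfg.yml reference above):

```yaml
# Sketch: fail ocp_verify early when none of the expected ROKS secrets exist.
- name: "Look up the secrets we expect ROKS provisioning to have created"
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Secret
    name: "{{ item }}"
    namespace: "{{ roks_secret_namespace }}"  # placeholder
  loop: "{{ roks_expected_secrets }}"         # placeholder list of two names
  register: roks_secret_lookup

- name: "Fail fast if the cluster is missing both secrets"
  ansible.builtin.assert:
    that:
      # at least one lookup must have returned a resource
      - roks_secret_lookup.results | map(attribute='resources') | flatten | length > 0
    fail_msg: "Neither expected ROKS secret exists; the cluster is not ready for MAS"
```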
Need a new role to support enhancing ROKS clusters to configure:
There may be other places these are logged too; this is just one example. Passwords should not be logged. The documentation should direct the user to where they can obtain the passwords (e.g. which secret they are saved to in k8s, or whether they are written to a file on disk). The logs should be safe for a user to copy/paste into a GitHub issue without risking exposure of sensitive information.
TASK [ibm.mas_devops.bas_install : Generate bas_password if none has been provided] ***********************************************************************************************************************
Monday 29 November 2021 15:40:52 +0000 (0:00:00.055) 0:58:13.934 *******
ok: [localhost] => {"ansible_facts": {"bas_password": "ovjRTloehosQYLL"}, "changed": false}
TASK [ibm.mas_devops.bas_install : Generate bas_grafana_password if none has been provided] ***************************************************************************************************************
Monday 29 November 2021 15:40:52 +0000 (0:00:00.068) 0:58:14.002 *******
ok: [localhost] => {"ansible_facts": {"bas_grafana_password": "YnsEjgMfqVMWhhn"}, "changed": false}
... or sort out billing in Travis, either works.
FWIW .. this is the starter I wrote a while back / completely untested but might be helpful as a starting point:
name: Migrated Travis CI Build
# Do a SIMPLE migration of the travis build, worry about taking advantage of other features of Actions after we get it basically working first
# See if we can get it working on one branch first
on:
  push:
    branches: [ actionstest ]
jobs:
  build:
    runs-on: ubuntu-latest
    # Most scripts will probably fail because $GITHUB_WORKSPACE / $TRAVIS_BUILD_DIR envs are different, etc
    steps:
      - uses: actions/checkout@v2
      - name: Make scripts executable
        run: chmod u+x $TRAVIS_BUILD_DIR/build/bin/*.sh
      - name: Run initbuild
        run: $GITHUB_WORKSPACE/build/bin/initbuild.sh
      - name: Install Python requirements to build the pages documentation website
        run: python -m pip install -q ansible==2.10.3 mkdocs yamllint
      - name: Validate that the ansible collection lints successfully
        run: yamllint -c yamllint.yaml ibm/mas_devops
      - name: Validate that the mkdocs site builds successfully
        run: mkdocs build --verbose --clean --strict
      - name: Build the Ansible collection
        run: $GITHUB_WORKSPACE/build/bin/build-collection.sh
      - name: Install the Ansible collection in the container image
        run: cp $GITHUB_WORKSPACE/ibm/mas_devops/ibm-mas_devops-$(cat $GITHUB_WORKSPACE/.version).tar.gz $GITHUB_WORKSPACE/image/ansible-devops/ibm-mas_devops.tar.gz
      - name: Build the docker image
        run: $GITHUB_WORKSPACE/build/bin/docker-build.sh -n ibmmas -i ansible-devops
      - name: Push the docker image
        run: $GITHUB_WORKSPACE/build/bin/docker-push.sh
      - name: Build the tekton clustertasks
        run: $TRAVIS_BUILD_DIR/pipelines/bin/build-pipelines.sh
# TODOs
# Publish a GitHub release with the ansible collection tgz as an asset
# - $TRAVIS_BUILD_DIR/build/bin/git-release.sh $TRAVIS_BUILD_DIR/ibm/mas_devops/ibm-mas_devops-$(cat $TRAVIS_BUILD_DIR/.version).tar.gz $TRAVIS_BUILD_DIR/pipelines/ibm-mas_devops-clustertasks-$(cat $TRAVIS_BUILD_DIR/.version).tar.gz
#deploy:
#  provider: pages
#  skip_cleanup: true
#  github_url: github.com
#  github_token: $GITHUB_TOKEN
#  verbose: true
#  local_dir: site
#  target_branch: gh-pages
#  on:
#    branch: master
bash pipelines/bin/install-pipelines.sh
/home/david/ibm-mas/ansible-devops/pipelines/bin
subscription.operators.coreos.com/openshift-pipelines-operator created
Wait for Pipeline operator to be ready
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "tasks.tekton.dev" not found
See TODO in the script:
# TODO: do while STATE != ready
# otherwise the CRD lookup will fail, as the timeout only helps AFTER the CRD initially exists
# STATE=$(oc get subscription openshift-pipelines-operator -n openshift-operators -o=jsonpath="{.status.state}")
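The TODO above could be implemented with a generic retry helper (a sketch; wait_for is a hypothetical function, not code from the repo, and "AtLatestKnown" is the Subscription state we would poll for before the CRD lookup):

```shell
# wait_for: retry a command until it succeeds, up to MAX attempts,
# sleeping one second between attempts. Returns non-zero on timeout.
wait_for() {
  local max=$1; shift
  local n=0
  until "$@"; do
    n=$((n+1))
    [ "$n" -ge "$max" ] && return 1
    sleep 1
  done
}

# Against a real cluster the loop would look something like (untested):
#   wait_for 30 sh -c \
#     '[ "$(oc get subscription openshift-pipelines-operator -n openshift-operators -o=jsonpath="{.status.state}")" = "AtLatestKnown" ]'
```

Only after the wait succeeds would the script go on to look up tasks.tekton.dev, so the CRD lookup no longer races the operator install.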
New plan for GitHub actions:
Store the API key to publish to Galaxy in GitHub secure settings
A new workflow that runs on tags only, which will
We need to run ansible-lint to know whether the collection will actually be accepted into Galaxy; it catches things that a normal YAML lint will not (e.g. missing metadata required by Galaxy).
For now, creating tags remains a manual process: anyone with the power to create a tag can manually promote from master to release by creating one. Once all the kinks are ironed out of the system, we should be able to make a commit to master auto-create a tag, which will trigger the release.
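The tag-triggered workflow in the plan above could look roughly like this (a sketch; the secret name GALAXY_API_KEY and the build/publish steps are assumptions, not the repo's actual workflow):

```yaml
name: Release to Ansible Galaxy
on:
  push:
    tags: [ '*' ]  # runs on tags only, per the plan above
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install Ansible and ansible-lint
        run: python -m pip install -q ansible ansible-lint
      - name: Lint the collection (catches Galaxy metadata problems)
        run: ansible-lint ibm/mas_devops
      - name: Build the collection
        run: ansible-galaxy collection build ibm/mas_devops --output-path .
      - name: Publish to Galaxy using the API key held in GitHub secure settings
        run: ansible-galaxy collection publish ibm-mas_devops-*.tar.gz --api-key "${{ secrets.GALAXY_API_KEY }}"
```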
The below was written before I migrated to Github Actions:
When the master branch builds we should auto-publish to Ansible Galaxy; at present it's left to me to manually release the collection separately from the build. This should be easy to accomplish, but hold fire until the teething problems with the ported mini-build system are all resolved.
We need to find an alternative way to publish GH releases, as Travis-CI doesn't support SSH keys in the same way that our Travis Enterprise instance does. Until we can automate the GH release, we can't automate the Ansible release.
This is why the build currently reports error when trying to create the release:
Publishing new release 4.2.0 to GitHub
remote: Support for password authentication was removed on August 13, 2021. Please use a personal access token instead.
remote: Please see https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/ for more information.
fatal: Authentication failed for 'https://github.com/ibm-mas/ansible-devops.git/'
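One workaround for the authentication failure above is to push over HTTPS with a personal access token instead of a password or SSH key (a sketch; set_push_url is a hypothetical helper, and GITHUB_TOKEN is assumed to be provided as a CI secret):

```shell
# Point the push URL at GitHub using a personal access token, so the
# release push authenticates with the token rather than a password.
set_push_url() {
  git remote set-url --push origin \
    "https://x-access-token:${GITHUB_TOKEN}@github.com/ibm-mas/ansible-devops.git"
}
```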