This repository contains the program to analyze logs from Linux Pacemaker High Availability cluster for SAP solutions running in Google Cloud Platform.
It can parse pacemaker logs, system logs, hb_report from SUSE and sosreport from Redhat.
It generates an output file with critical Pacemaker events such as fencing, resource actions and errors such as failed resource operations.
This is not an officially supported Google product
The program requires Python 3.6+ to run.
Show help:
./logpaser -h
Option '-p' to parse up to two pacemaker logs
./logparser -p node1_pacemaker.log node2_pacemaker.log
Option '-s' to parse up to two system logs.
./logparser -s node1_system_log node2_system_log
Option '-hb' to parse one hb_report file from SLES.
./logparser -hb hb_report.tar.bz2
Option '-sos' to parse up to two sosreport from RHEL.
./logparser -sos sosreport1.tar.xz sosreport2.tar.xz
NOTE: At least one of the above four options is requried and they can be combined.
Optional options:
Option '-o' to specify output file name. By default, the output is written to logparser.out
Option '-b' to specify analysis begin time in format YYYY-MM-DD or YYYY-MM-DD-HH:MM
Option '-e' to specify analysis end time in format YYYY-MM-DD or YYYY-MM-DD-HH:MM
Examples:
-o output.txt
-o output.txt -b 2021-02-22
-o output.txt -b 2021-02-22 -e 2021-02-25
-o output.txt -e 2021-02-25-00:00
Below logs are filterted, captured and ordered in chronological order by the log parser
- Fencing action, reason, and result
Examples:
2021-03-26 03:10:38 node1 pengine: notice: LogNodeActions: * Fence (reboot) node2 'peer is no longer part of the cluster'
2021-03-26 03:10:57 node1 stonith-ng: notice: remote_op_done: Operation 'reboot' targeting node1 on node2 for [email protected]: OK
2020-12-21 12:15:18 node2 stonith-ng:notice: remote_op_done: Operation reboot of node1 by node2 for [email protected]: Timer expired
- Pacemaker actions to move/start/stop/recover/promote/demote cluster resources
Examples:
2021-03-26 03:10:38 node1 pengine: notice: LogAction: * Move rsc_vip_int-primary ( node2 -> node1 )
2021-03-26 03:10:38 node1 pengine: notice: LogAction: * Move rsc_ilb_hltchk ( node2 -> node1 )
2021-03-26 03:10:38 node1 pengine: notice: LogAction: * Stop rsc_SAPHanaTopology_SID_HDB00:1 ( node2 ) due to node availability
- Failed resource operations
Examples:
2021-03-26 04:50:48 node1 crmd: info: process_lrm_event: Result of monitor operation for rsc_SAPHana_SID_HDB00 on node1: 7 (not running) | call=232 key=rsc_SAPHana_SID_HDB00_monitor_61000 confirmed=false cib-update=345
2020-07-23 13:11:44 node2 crmd: info: process_lrm_event: Result of monitor operation for rsc_vip_gcp_ers on node2: 7 (not running)
- Corosync communication error and failure, also membership changes
Examples:
2021-11-25 03:19:32 node1 corosync: [TOTEM ] A processor failed, forming new configuration.
2021-11-25 03:19:33 node1 corosync: [TOTEM ] Failed to receive the leave message. failed: 2
2021-11-25 03:19:33 node2 corosync[2445]: message repeated 214 times: [ [TOTEM ] Retransmit List: 31609]
2021-11-25 03:19:33 node1 corosync: [TOTEM ] A new membership (10.0.0.10:2668) was formed. Members left: 2
2021-11-25 03:20:14 node1 corosync: [TOTEM ] A new membership (10.0.0.10:2672) was formed. Members joined: 2
- Cluster/Node/Resource maintenance/standby/manage mode change
Examples:
(cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode']: @value=true
(cib_perform_op) info: + /cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2']/nvpair[@id='nodes-2-standby']: @value=on
- Resource agent, fence agent warnings and errors
Examples:
2021-03-16 14:12:31 node1 SAPHana(rsc_SAPHana_SID_HDB01): ERROR: ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED
2021-01-11 02:21:03 node1 stonith-ng[7812]: notice: Operation 'monitor' [86041] for device 'STONITH-node1' returned: -62 (Timer expired)
2021-01-15 07:15:05 node1 gcp:stonith: ERROR - gcloud command not found at /usr/bin/gcloud
2021-02-08 17:05:30 node1 SAPInstance(rsc_sap_SID_ASCS10): ERROR: SAP instance service msg_server is not running with status GRAY !
2021-03-16 13:26:10 node2 SAPHana(rsc_SAPHana_SID_HDB01): INFO: ACT site=node2, setting SFAIL for secondary (5) - srRc=10 lss=4
- High CPU load or Pacemaker critical logs
Examples:
2021-02-11 10:49:43 node1 crmd: notice: High CPU load detected: 205.089996
2021-02-11 10:49:43 node2 crmd: crit: tengine_stonith_notify: We were allegedly just fenced by node1 for node1!
- Reach migration threshold and force resource off
Examples:
check_migration_threshold: Forcing rsc_name away from node1 after 1000000 failures (max=5000)
- Location constraint added due to manual resource movement
Examples:
2021-02-11 10:49:43 node2 cib: info: cib_perform_op: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-grp_sap_cs_sid-on-node1" rsc="grp_sap_cs_sid" role="Started" node="node1" score="-INFINITY"/>
2021-02-11 11:26:29 node2 stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: delete rsc_location[@id='cli-prefer-grp_sap_cs_sid']