percona-lab / pacemaker-replication-agents Goto Github PK

Repository of the Percona Pacemaker resource agents

License: ISC License

Shell 100.00%

pacemaker-replication-agents's Introduction

Percona resource agents

This repository stores Pacemaker resource agents used in solutions developed by Percona.

For bug fixes, support and development, please contact Percona Sales team at https://www.percona.com/about-percona/contact.

Mailing lists:

PRM:

For announce and general questions regarding the use and configuration of PRM:

https://groups.google.com/d/forum/prm-discuss

For discussion concerning the development of the mysql resource agent:

https://groups.google.com/d/forum/prm-devel

pacemaker-replication-agents's People

Contributors

Stargazers

Watchers

pacemaker-replication-agents's Issues

rename mysql_prm56 to mysql_prm_gtid

rename mysql_prm56 to mysql_prm_gtid
update docs
& other references

RA is not working anymore on RHEL7 / CentOS 7 / Ubuntu 14.04

Hi there,

I tried the resource agent with a PCS cluster with up to date OS (RHEL/CentOS 7.1 and Ubuntu 14.04.3) and up to date Percona Server (5.6.26-74.0-log and 5.6.27-75.0-log) and it is not working.
Creating a master/slave cluster the RA cannot initiate the M/S.

Brlow my my.cnf and the pacemaker config

node 1: SRVOWNSQL01
node 2: SRVOWNSQL02
primitive p_vip IPaddr2
params ip=172.28.107.200 cidr_netmask=24 nic=bond1
op start interval=0s timeout=20s
op stop interval=0s timeout=20s
op monitor interval=20s
meta target-role=Started
p_mysql ocf:percona:mysql
params config="/etc/my.cnf" pid="/var/lib/mysql/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" replication_user="repl_user"
replication_passwd="_" max_slave_lag="60" evict_outdated_slaves="false" binary="/sbin/mysqld"
test_user="test_user" test_passwd="_******"
op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1"
op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
op start interval="0" timeout="60s"
op stop interval="0" timeout="60s"
ms ms_MySQL p_mysql
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
colocation writer_on_master inf: p_vip ms_MySQL:Master
order ms_MySQL_promote_before_vip inf: ms_MySQL:promote p_vip:start
property cib-bootstrap-options:
stonith-enabled=false
no-quorum-policy=ignore
have-watchdog=false
dc-version=1.1.13-a14efad
cluster-infrastructure=corosync
cluster-name=DBCluster
last-lrm-refresh=1446108717
rsc_defaults rsc_defaults-options:
migration-threshold=1

[mysql]

CLIENT

port = 3306
socket = /var/lib/mysql/mysql.sock

[mysqld]

GENERAL

user = mysql
default-storage-engine = InnoDB
socket = /var/lib/mysql/mysql.sock
pid-file = /var/lib/mysql/mysql.pid
bind-address=0.0.0.0

relay-log=mysql-relay-bin

log-slave-updates

replicate-ignore-db=mysql

MyISAM

key-buffer-size = 32M
myisam-recover = FORCE,BACKUP

SAFETY

max-allowed-packet = 16M
max-connect-errors = 1000000

skip-name-resolve

sql-mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
sysdate-is-now = 1
innodb = FORCE
innodb-strict-mode = 1

DATA STORAGE

datadir = /var/lib/mysql/

BINARY LOGGING

log-bin = /var/lib/mysql/mysql-bin
expire-logs-days = 14
sync-binlog = 1
binlog_format = ROW
log-slave-updates

CACHES AND LIMITS

tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 100000
thread-cache-size = 500
open-files-limit = 65535
table-definition-cache = 4096
table-open-cache = 10000

INNODB

innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1
innodb-buffer-pool-size = 32G

LOGGING

log-error = /var/lib/mysql/mysql-error.log
log-queries-not-using-indexes = 1
slow-query-log = 1
slow-query-log-file = /var/lib/mysql/mysql-slow.log

REPLICATION

server_id = 1
slave_transaction_retries=4294967295

[client-server]

include all files from the config directory

!includedir /etc/my.cnf.d

mysql_prm56: auto_position=1 never 'forced'

It seems that if auto_position argument of the slave replication is once 0, it will never become 1, even though the code has some logic to enable it.

You will keep on getting these errors:

Aug 30 21:43:22 training-vm24 crm_ticket[25136]: notice: crm_log_args: Invoked: /usr/sbin/crm_ticket --info
Aug 30 21:43:22 training-vm24 mysql_prm56(p_mysql_sessions)[25108]: WARNING: Auto position is not enabled, we force it

GEO-DR: implement revoking of ticket when cluster fails

Is there some way to ensure that when the mysql resource for example (no master possible) fails on the datacenter which has the ticket, that the ticket gets revoked?

multiple invocations of mysql_start_low allowed, demote action can start mysql

Found a issue with a basic timeline like this:

master mysql server is '03'. mysqld on 03 is inadvertently killed (outside of pacemaker) and 01 is promoted to master. This is the timeline:

18:29:51 — mysqld is killed on 03
18:29:52 — demote on 03, vip down on 03
18:29:59 — MySQL not running: removing old PID file
mysql starting via demote action (from demote action -> mysql_status) — innodb recovery (first instance)
18:30:19 — demote timed out on 03 after 20000ms (apparently default demote action timeout)
18:30:20 — action: mysql start on 03 (second instance)
18:31:20 — start timeout after 60000ms on 03, promote on 01
18:31:21 — vip up on 01

18:32:13 — mysql explicitly aborted on 03 (second instance)
18:37:46 — mysql innodb recovery finished and up on 03 (first start)

during this I see lots of these in the mysql error log on 03:

InnoDB: Unable to lock ./ibdata1, error: 11
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.

So, there are a few basic issues:

mysql_start_low should not be able to start two instances of mysql (pid file check or whatever)
I'm not sure the demote action should be able to start mysql or not. Maybe.
Failover in this case (with respective timeouts) was 90 seconds, but I need to get it to less than 60. In this case mysql startup is pretty slow (and crash recovery like 8 minutes), so I want to circumvent the attempt to restart the master and just failover -- Is there a way for PRM to skip this?

substitute "hostname -n" with "crm_node -n"

Sometimes the corosync node name is different form the hostname.
For mysql_prm there is a specific attriute. Not so for myql_monitor.
In my case using crm_mon -n instean of hostname -n solves the problem and permits to set the attributs with the command crm_attribute -N

Demotion fails if a long-running query holds locks

We've found that demotion fails when there's a long-running query holding locks. It looks like "set_read_only on" blocks forever, which means the server won't be successfully demoted until the query ends-- even if it's just an innocuous SELECT. In our particular case we have some reporting queries that can run 20 minutes or so, turning the failover into a mess.

What we've done locally is change the demote so that we do a kill before the set_read_only, as well as after. We figured everything is going to get blown away anyway so there shouldn't be a functional difference.

mysql_prm56 not up to date

By looking at the diffs, there are several changes that happened on the mysql_prm but no on the mysql_prm56

mysqlbinlog is slow for agent purposes

For example:

[root@mysql tmp]# time mysqlbinlog -vvv --base64-output=decode-rows mysqld-relay-bin.001185 | grep 'Xid =' -A2 | grep -v '\-\-' | tail -n 186 | egrep 'Xid|\# at [0-9]{2,10}' | tac | grep -A2 Xid |grep '# at ' | tail -n 1 | rev | cut -d' ' -f1 | rev
422883242

real	0m19.517s
user	0m23.235s
sys	0m5.289s

This can cause monitor call to timeout - we need something like MHA binlog parser but the agent is written in Shell, so we might need another solution - perhaps one in python.

merge Jervin's non-root ssh user from mysql_prm56 to mysql_prm

merge Jervin's non-root ssh user from mysql_prm56 to mysql_prm, pull request 32

You cannot just choose any resource name

glb_master_exists=`echo "$resources" | grep -A2 $glb_master_resource | egrep -c 'Master[^\/]'`

The code uses this a lot... glb_master_resource... grep..

so if you have p_mysql and p_mysql_2, then well... problems again...

Documentation for mysql_prm_gtid

Implement some kind of release/stable branches

Currently I am fetching the 'master' branch as version to be used.

However, some of the commit messages show me that the 'master' is not necessarily always a stable version.

It would be great to have some kind of versioning and/or assurance that certain branches will always be stable.

PRM-operational-guide reader failure count

At the bottom of the document, you state setting the readable gt 0

I believe this also requires a change of -inf to inf, as you just swapped the rule from a true on neg (readable lt 1, true=failed), to a positive (readable gt 0, true=ok)

Or just using the line from the current setup instructions is fine.
location loc-no-reader-vip-1 reader_vip_1
rule $id="rule-no-reader-vip-1" -inf: readable lt 1

booth ticket names should not contain the same data

from debug:

++ /usr/sbin/crm_ticket --info
++ grep ticketSessions
++ awk '{ print $2 }'
++ grep -c granted

if you have ticketSessions and TicketSessions2, then some will fail....

Pacemaker agent: mysql_monitor doesn't check if mysql running properly

Hi,

we use PXC with Pacemaker and mysql_monitor agent.
If mysqld was killed (for example with "-9" or OOM) then the PID Check is not absolutely correct.
There is a if (line 467) inside the if (line 459). But only an "else" for the outer "if".

I am not very familiar with git functionality. So here is the diff:

This helps:
--- /usr/lib/ocf/resource.d/percona/mysql_monitor.old 2014-05-30 16:29:20.941745469 +0000
+++ /usr/lib/ocf/resource.d/percona/mysql_monitor 2014-05-30 16:31:45.555613831 +0000
@@ -537,6 +537,10 @@
;;

         esac

+ else

          ocf_log $1 "MySQL is not running"

```
          set_reader_attr 0
```
```
          set_writer_attr 0
   fi
```
else
ocf_log $1 "MySQL is not running"

Error: Unable to parse xml for: mysql

Using the latest mysql_prm (Version 20140901101510), Issue #8 remains unresolved.
I'm on Centos 6.5 (and RHEL 6.5):

Latest version of mysql_prm produces this error.

$ pcs resource describe ocf:percona:mysql
Resource options for: ocf:percona:mysql
Error: Unable to parse xml for: mysql

Although commit 741e860 claims to solve the problem, it does not. The only change in that commit is to the version number.

I have made the following patch locally to fix the problem:

--- mysql.orig  2014-10-29 00:22:19.954100868 -0700
+++ mysql   2014-10-29 00:14:08.970863412 -0700
@@ -140,7 +140,7 @@
 : ${OCF_RESKEY_geo_remote_IP}=""
 : ${OCF_RESKEY_booth_master_ticket}=${OCF_RESKEY_booth_master_ticket_default}
 : ${OCF_RESKEY_post_promote_script}=""
-: ${OCF_RESKEY_prm_binlog_parser_path}="`which prm_binlog_parser` 2> /dev/null"
+: ${OCF_RESKEY_prm_binlog_parser_path}="`which prm_binlog_parser 2> /dev/null`"

 : ${OCF_RESKEY_async_stop=${OCF_RESKEY_async_stop_default}}
 : ${OCF_RESKEY_try_restart_crashed_master=${OCF_RESKEY_try_restart_crashed_master_default}}

Race Condition on Monitor (meta_role=[Slave|Master]

On a specific situation where a DC goes down, and a manual ticket grant was executed on the secondary DC before a slave check from that DC completes. The new master's readable attribute will remain 0 (albeit, unintentionally reset to 0). Here is an example from a trace log where set_reader_attr was called to set 1 then 0:

[root@node1 ~]# cat log|egrep "^(notify|monitor|OCF_RESKEY_CRM_meta_role|\+ set_reader_attr|Mon Nov|\+ ocf_log debug 'MySQL monitor succeeded|OCF_RESKEY_CRM_meta_notify_operation|OCF_RESKEY_CRM_meta_notify_key_type|OCF_RESKEY_CRM_meta_notify_promote_uname)"
[...]
Mon Nov 19 21:50:58 EST 2014
monitor
OCF_RESKEY_CRM_meta_role=Slave
Mon Nov 19 21:51:00 EST 2014
notify
OCF_RESKEY_CRM_meta_notify_key_type=pre
OCF_RESKEY_CRM_meta_notify_operation=promote
OCF_RESKEY_CRM_meta_notify_promote_uname=node1
Mon Nov 19 21:51:00 EST 2014
notify
OCF_RESKEY_CRM_meta_notify_key_type=pre
OCF_RESKEY_CRM_meta_notify_operation=promote
OCF_RESKEY_CRM_meta_notify_promote_uname=node1
Mon Nov 19 21:51:01 EST 2014
OCF_RESKEY_CRM_meta_notify_promote_uname=node1
+ set_reader_attr 1
Mon Nov 19 21:51:01 EST 2014
notify
OCF_RESKEY_CRM_meta_notify_key_type=post
OCF_RESKEY_CRM_meta_notify_operation=promote
OCF_RESKEY_CRM_meta_notify_promote_uname=node1
+ set_reader_attr 0
Mon Nov 19 21:51:01 EST 2014
monitor
OCF_RESKEY_CRM_meta_role=Master
+ ocf_log debug 'MySQL monitor succeeded (master)'
+ ocf_log debug 'MySQL monitor succeeded (master)'
Mon Nov 19 21:51:06 EST 2014
monitor
OCF_RESKEY_CRM_meta_role=Master
+ ocf_log debug 'MySQL monitor succeeded (master)'

mysql_prm56 name is confusing

mysql_prm56 provides GTID functionality, but it's not necessary to use GTIDs when using 56.

Consider renaming it to mysql_prmGTID or merge them together.

False Check for Slave IO State

The agent has this code, however the if condition is bad. When Slave_IO_Running: Yes, the slave would be wrongly marked as broken.

    887       # Is the slave_io thread running?
    888       if [ "$slave_io" != 'No' ]; then
    889          ocf_log info "Slave IO thread not running, master likely dead or stopped"
    890          break;
    891       fi

The comparison should be changed:

    888       if [ "$slave_io" != 'Yes' ]; then

Unable to add mysql instance in corosync pacemaker

I am unable to add myinstance1 to corosync pacemaker cluster. any suggestions

Migration summary:

Node server1:
Node server2:
p_mysql_myinstance:1: migration-threshold=1000000 fail-count=1000000
Node qoruma01:

Failed actions:
p_mysql_myinstance:1_start_0 (node=server2, call=740, rc=5, status=complete): not installed

crm_attribute Error Cleanup

When the agent checks for any attribute and there was none set, it returns an error and is quite confusing from the logfile/syslog when enabled. In many cases, we can make this better by using the --default option of crm_attribute and changing the following checks for results.

++ /usr/sbin/crm_attribute -N maindb03.boardreader.com -q -l reboot --name p_mysql_master_crashed --query
Error performing operation: No such device or address

For example, when mysql_monitor is invoked with meta role == Master it triggers this code:

            # Is this following a recent master crash?
            master_crashed_ts=`$CRM_ATTR_MASTER_CRASHED_TS --query`

            if [ ! -z $master_crashed_ts ]; then
               if [ `date +%s` -gt "$((${master_crashed_ts}+3600))" ]; then
                  #Let's cleanup the cib
                  $CRM_ATTR_MASTER_CRASHED_TS -D
                  $CRM_ATTR_LAST_TRX -D
               fi
            fi

We can change it to:

            # Is this following a recent master crash?
            master_crashed_ts=`$CRM_ATTR_MASTER_CRASHED_TS --query --default=0`

            if [ "$master_crashed_ts" -gt "0" ]; then
               if [ `date +%s` -gt "$((${master_crashed_ts}+3600))" ]; then
                  #Let's cleanup the cib
                  $CRM_ATTR_MASTER_CRASHED_TS -D
                  $CRM_ATTR_LAST_TRX -D
               fi
            fi

geo-DR: remove need to have ssh

Why do we need ssh from every machine to every machine?

For geo-DR, this is used to fetch who the master is and it's binary log information.

Can't we just have a small daemon that runs in the cluster and serves those requests?
Or are there other ways through booth itself?

setup this Percona XtraDB Cluster

I am unable to find the answer on google with clarity so I am using this to clear my doubts I aplogise in advance if its not the right place to post it . does this pacemaker work with Percona XtraDB Cluster solution as I see its one master-to- slave where as Percona XtraDB Cluster is all master and all slave . Would like to know more about this

Monitor doesn't really work

After issuing command
kill pidof mysqld; nc -l -p 3306 > /dev/null &
crm_mon shows me that mysql is still running. Also on a slave:
I killed mysql then pacemaker tried to start, but not successful. After I started mysql slave by hand, but it still show's me that it is not running

REPL_INFO position need to change without cluster/pcs restart

Is there any way I can update REPL_INFO position without restarting PCS ?

Sometime, Temporary fluctuation at Slave side, reset the SLAVE node to very old position and it takes almost week to complete that lag.

Agent cannot resume a slave from previous position

When a slave lost connection to the master for some reason i.e. network issue and Pacemaker takes it down, when it comes back up, it will not resume from the replication coordinates before the shutdown, instead it will pickup REPL_INFO from the cluster configuration which is going back.

Invalid Monitor Result After pre-promote on A Slave

On a geo-DR setup, if I kill the ticket daemon (kill -9 boothd) on the active site, Pacemaker would trigger a pre-promote and unset the master on the slaves just to keep everything sane. However, the following monitor would fail because check slave would complain now that SHOW SLAVE STATUS is empty from this part on the check_slave function:

    772       # An empty status could happen when a master is demote in a
    773       # geo DR setup, let's check
    774       if [ $MYSQL_LAST_ERR -eq 0 -a $glb_master_side -eq 1 ]; then
    775          # This is not the master side, let's try to setup the slave
    776          # No need to unset the master since slave status is empty
    777          set_reader_attr 0
    778          set_master
    779          return $OCF_SUCCESS
    780       fi
    781
    782       ocf_log err "check_slave invoked on an instance that is not a replication slave."
    783       exit $OCF_ERR_GENERIC

Because the slave replication status was reset, there was no error and the site was in fact the master side, the set_master part is never activated. I believe the conditional should be:

    774       if [ $MYSQL_LAST_ERR -eq 0 -a $glb_master_exists -eq 1 ]; then

Error in Percona replication manager (PRM) setup guide

In the manual you add colocation with resource "writer_vip" before create it. This causes the error.

[root@centos-web02]# pcs constraint colocation add master ms_MySQL with writer_vip
Error: Resource 'writer_vip' does not exist

Characters left over after parsing '-INF': '-INF'

Hello,

I'm using pacemaker 1.1.5 with ocf:percona:mysql_prm_gtid and I'm getting the following error in the logs:

pengine: error: crm_int_helper: Characters left over after parsing '-inf': '-inf'

Which I fixed by changing set_master_score -INF to set_master_score -infinity in mysql_prm_gtid.

RPM/DEB to install

It would be great to have deb/rpm's available to install prm.

MySQL Nodes for Different MS Sets

On a cluster with multiple ms sets like below, we need to be able to run the same RA on the nodes without the primitives competing for them. For example, I don't want node1 to run p_mysql_historical, nor I do not want p_mysql_historical to monitor node1 if its resource is running there because p_mysql_sessions would be there.

 Master/Slave Set: ms_MySQL_sessions [p_mysql_sessions]
     Masters: [ node1 ]
     Slaves: [ node2 ]
 Master/Slave Set: ms_MySQL_historical [p_mysql_historical]
     Slaves: [ node3 ]
 sessions_r_vip    (ocf::heartbeat:IPaddr2):    Started node1
 sessions_w_vip    (ocf::heartbeat:IPaddr2):    Started node1
 Resource Group: g-booth
     booth-ip    (ocf::heartbeat:IPaddr2):    Started node1
     booth    (ocf::pacemaker:booth-site):    Started node1

I could add a rule like this, but still probes would cause the resources to stop and start in a loop.

location location-p_mysql_sessions p_mysql_sessions \
        rule -inf: #uname eq node3

Failed Detection of Ticket Leader when there are Multiple Tickets

If I have multiple tickets on a single GEO setup, detection of is_master_side from the agent always returns true if one of the ticket is granted on the site:

[root@node ~]# booth client list
ticket: ticketSessions, leader: 192.168.56.96, expires: 2014-09-05 01:38:24
ticket: ticketHistorical, leader: 192.168.56.97, expires: 2014-09-05 01:38:39

Here's an excerpt from the trace file, the problem is that newlines from the output is being removed:

+ is_master_side
+ local ticket crmTicketRet
+ '[' 13 -gt 0 ']'
++ /usr/sbin/crm_ticket --info
+ crmTicketRet='ticketHistorical        granted           last-granted='\''Fri Sep  5 00:26:37 2014'\''
ticketSessions  revoked           last-granted='\''Fri Sep  5 00:17:16 2014'\'''
+ '[' 0 -ne 0 ']'
++ grep -c granted
++ awk '{ print $2 }'
++ grep ticketSessions
++ echo ticketHistorical granted 'last-granted='\''Fri' Sep 5 00:26:37 '2014'\''' ticketSessions revoked 'last-granted='\''Fri' Sep 5 00:17:16 '2014'\'''
+ ticket=1
+ '[' 1 -eq 1 ']'
+ return 0
+ glb_master_side=0
+ '[' 13 -gt 0 -a 0 -ne 0 ']'

Docs - need updating for "pcs" compatability (Redhat 6.6+)

Firstly, the docs are excellent! Very well structured and written, but I have been trying to follow them on a Redhat 6.6 system and have run into a few problems with the updated sections relating to the "pcs" commands.

Firstly the command "pcs constraint colocation add writer_vip p_mysql role=Master" does not work and needs to be translated to "pcs constraint colocation add writer_vip ms_MySQL role=Master" just to get it to run, but that does not work as expected and produces an error saying the format is invalid.

It would be good to have the documentation updated to remove all references to "crm" and replace with the "pcs" equivalents, and validate that all works as expected.

Thanks in advance,
Dave

mysql_stop does not really kills mysql in debian

HI,
Found a case when mysql_stop does not behaves as it should.

The problem is in this part of code:
pid=cat ${OCF_RESKEY_pid}.starting 2> /dev/null
/bin/kill $pid > /dev/null
rc=$?
if [ $rc != 0 ]; then
ocf_log err "MySQL couldn't be stopped"
return $OCF_ERR_GENERIC
fi

Debug:

'[' '!' -f /var/run/mysqld/mysqld.pid.starting ']'
++ cat /var/run/mysqld/mysqld.pid.starting
pid=32752
/bin/kill 32752
rc=0
'[' 0 '!=' 0 ']'
'[' 0 -eq 1 ']'
shutdown_timeout=15
'[' -n 900000 ']'
shutdown_timeout=895
count=0
'[' 0 -lt 895 ']'
kill -s 0 32752
rc=0
'[' 0 -ne 0 ']'
++ expr 0 + 1
count=1
sleep 1
ocf_log debug 'MySQL still hasn'''t stopped yet. Waiting...'
'[' 2 -lt 2 ']'

It does kill /usr/bin/mysqld_safe process but its child process /usr/sbin/mysqld stays alive till timeout is over and mysql gets killed with -KILL. That makes the process of failover nonworkable.

OS - Debian 7
Mysql - Percona 5.6.22

geo-dr: before-acquire-handler

from: https://github.com/ClusterLabs/booth

Before ticket renewal, the leader runs an external program if
such program is set in 'before-acquire-handler'. The external
program should ensure that the cluster managed service which is
protected by this ticket can run at this site. If that program
fails, the leader relinquishes the ticket. It announces its
intention to step down by broadcasting an unsolicited VOTE_FOR
with an empty vote. On receiving such RPC other servers start new
elections to elect a new leader.

Do we have some checks to make sure no ticket will end up with a cluster that cannot handle the resources?
For example, when you have a secondary datacenter with a slave node. When being granted a ticket, but replication is broken, you will end up with a broken cluster. The secondary dc will have the ticket, but cannot start the resources. It's better to not grant the ticket at all.

mysql_monitor doesn't detect when mysqld was killed abruptely

In a properly configured percona-cluster+pacemaker with mysql_monitor agent, if you run killall -9 mysqld the pid file is left which confuses mysql_monitor, because it tries run [ "u$pid" != "u" -a -d /proc/$pid ] and the exit code is never catched to set the read/write property to 0

Here is a debug log of this situation:

Thu Apr  2 21:27:52 UTC 2015
monitor
OCF_RA_VERSION_MAJOR=1
OCF_RA_VERSION_MINOR=0
OCF_RESKEY_CRM_meta_OCF_CHECK_LEVEL=1
OCF_RESKEY_CRM_meta_clone=0
OCF_RESKEY_CRM_meta_clone_max=3
OCF_RESKEY_CRM_meta_clone_node_max=1
OCF_RESKEY_CRM_meta_globally_unique=false
OCF_RESKEY_CRM_meta_interval=1000
OCF_RESKEY_CRM_meta_name=monitor
OCF_RESKEY_CRM_meta_notify=false
OCF_RESKEY_CRM_meta_timeout=30000
OCF_RESKEY_OCF_CHECK_LEVEL=1
OCF_RESKEY_cluster_type=pxc
OCF_RESKEY_crm_feature_set=3.0.7
OCF_RESKEY_max_slave_lag=5
OCF_RESKEY_password=mia123
OCF_RESKEY_pid=/var/run/mysqld/mysqld.pid
OCF_RESKEY_socket=/var/run/mysqld/mysqld.sock
OCF_RESKEY_user=sstuser
OCF_RESOURCE_INSTANCE=res_mysql_monitor
OCF_RESOURCE_PROVIDER=percona
OCF_RESOURCE_TYPE=mysql_monitor
OCF_ROOT=/usr/lib/ocf
+ case $__OCF_ACTION in
+ mysql_monitor
+ '[' -e /var/run/mysqld/mysqld.pid ']'
++ cat /var/run/mysqld/mysqld.pid
+ pid=730
+ '[' -d /proc -a -d /proc/1 ']'
+ '[' u730 '!=' u -a -d /proc/730 ']'
+ '[' 1 -eq 0 ']'
+ mysql_monitor_monitor
+ '[' -f /var/run/resource-agents/mysql-monitor-res_mysql_monitor.state ']'
+ return 0
+ rc=0
+ ocf_log debug 'res_mysql_monitor monitor : 0'
+ '[' 2 -lt 2 ']'
+ __OCF_PRIO=debug
+ shift
+ __OCF_MSG='res_mysql_monitor monitor : 0'
+ case "${__OCF_PRIO}" in
+ __OCF_PRIO=DEBUG
+ '[' DEBUG = DEBUG ']'
+ ha_debug 'DEBUG: res_mysql_monitor monitor : 0'
+ '[' x0 = x0 ']'
+ return 0
+ exit 0

Error: Unable to parse xml for: mysql

Latest version of mysql_prm produces this error.

$ pcs resource describe ocf:percona:mysql
Resource options for: ocf:percona:mysql
Error: Unable to parse xml for: mysql

This bug was introduced in commit 13d5326. Parent commit works fine.

btw I'm on centos 6.5

mysql 5.7 support

mysql 5.7 deprecates information_schema in favor of performance_schema for global_variables. This affects:

mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL -BN -e "select replace(var\ iable_value,'\n','') as variable_value from information_schema.global_variables\ where variable_name in ('gtid_executed')" > $slave_status_file

And

mysql_run -Q -sw -O $MYSQL $MYSQL_OPTIONS_REPL -BN -e "select variable_name,\ replace(variable_value,'\n','') as variable_value from information_schema.globa\ l_variables where variable_name in ('server_uuid','gtid_executed')" > $master_s\ tatus_file

information_schema.global_variables is deprecated in favor of performance_schema.global_variables. See http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_show_compatibility_56.

This is reflected in the logs by the following entry:

mysql_prm_gtid(p_mysql)[10769]: 2016/09/02_13:24:44 ERROR: ERROR 3167 (HY000) at line 1: The 'INFORMATION_SCHEMA.GLOBAL_VARIABLES' feature is disabled; see the documentation for 'show_compatibility_56'

Additionally, the grep on this warning doesn't work anymore:

echo "$error" | egrep -v '^Warning: Using a password on the command line'\

See how mysql client 5.7 prints the warning:

mysql: [Warning] Using a password on the command line interface can be insecure.

Which makes the grep ineffective, which means a lot of spam in the logs.

Besides these two minor things, we haven't found any other problems using 5.7.

Last but not least: Thanks for the great work!!

REPL_INFO should not be used if slave was previously replicating properly

REPL_INFO can be stale - for example, if slave drops from the network for some reason, the agent would reset replication to use the values from REPL_INFO which would be outdated. At worse this can break replication, lucky if the old binlogs does not exists anymore.

Slaves Terminating Replication

When promoting a new master using a location rule described on the link below, slaves would lose connection from the master.

https://github.com/percona/percona-pacemaker-agents/blob/master/doc/PRM-operational-guide.rst#switching-roles

Another instance was when following the procedure here http://dotmanila.com/blog/2014/06/percona-replication-manager-renaming-cluster-hostnames/ after step 3.

140613 5:51:38 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)

mariadb 10.x support

Is it possible to modify the agent to support mariadb 10.x ?