masterha_manager doesn't start if one of slaves is dead and "ignore_fail=1" (mysql-master-ha, closed)

luxinle commented on August 28, 2024

masterha_manager doesn't start if one of the slaves is dead and "ignore_fail=1" is set.

from mysql-master-ha.

Comments (7)

GoogleCodeExporter commented on August 28, 2024

This is expected behavior. The ignore_fail parameter applies during master failover 
(via masterha_manager, or when running masterha_master_switch 
--master_state=dead manually), but not when *starting* 
masterha_manager. In the scenarios below, ignore_fail should work:
- Start masterha_manager when all servers (including the master) are alive. After 
MHA enters steady state (pinging the master), kill both the ignore_fail-marked slave 
and the master.
- Run masterha_master_switch --master_state=dead when the master and the 
ignore_fail-marked slave are down.
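
For readers unfamiliar with the setting: ignore_fail is set per server in the MHA application configuration file. The sketch below parses a made-up configuration and lists the servers marked ignore_fail=1. The host names and section contents are illustrative assumptions (they do not come from this thread), and Python's configparser is only a stand-in for MHA's own config reader.

```python
import configparser

# Hypothetical MHA-style application config; host names are made up.
EXAMPLE_CONF = """
[server default]
user=mha

[server1]
hostname=s1

[server2]
hostname=s2
candidate_master=1

[server3]
hostname=s3
ignore_fail=1
"""


def ignore_fail_hosts(conf_text):
    """Return hostnames of server sections that set ignore_fail=1."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    hosts = []
    for section in parser.sections():
        if section == "server default":
            continue  # shared defaults block, not a server entry
        if parser.get(section, "ignore_fail", fallback="0") == "1":
            hosts.append(parser.get(section, "hostname"))
    return hosts


print(ignore_fail_hosts(EXAMPLE_CONF))
```

In this example only s3 is reported, so only s3 could be down during a failover without blocking it.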

Original comment by [email protected] on 20 Nov 2012 at 8:25

GoogleCodeExporter commented on August 28, 2024

Well, let me try to explain the question I have in mind.
Suppose there is a small cluster of 5 machines:
s1-s2-s3-s4-s5
One of these is the master (r/w) and the others are slaves (read-only).

s1
 +--s2
 +--s3
 +--s4
 +--s5

s1 - master

I am thinking about scripting and automating the startup of masterha_manager, 
probably with Pacemaker. I can tell Pacemaker to start masterha_manager on a 
machine that is not the master, for example s2. If s1 dies, 
masterha_manager does the failover.
If s2 dies, Pacemaker sees it and starts masterha_manager on another 
machine, e.g. s3, which continues monitoring the master and can do a failover when 
necessary. All machines in the cluster have an identical mha.conf file.
But with the current behavior of masterha_manager this scenario is not 
possible: when s2 is dead and Pacemaker tries to start MHA on s3, MHA 
cannot start because one of the slaves listed in the conf is not alive.
For me it does not matter that s2 is dead; there are still 3 fully working slaves.
I would like MHA to start monitoring and handle failover anyway in that 
case. 
s2 can be repaired and added back to the cluster without any difficulty a little later.

Another case: after MHA has failed over from the dead s1 and moved the master 
role to s2, the current implementation terminates and exits to the console. So I 
should start MHA again on another machine, e.g. s3, but it cannot start because 
s1 is dead (all machines in the mha.conf file must be alive).

I think it would be good to have a variable such as 'ignore_fail_onstart' for 
handling the death of one of the slaves or machines. Or maybe a variable like 
'count_alive_slaves' that tells MHA how many slaves must be alive for it to keep 
working.

PS: I also understand that I could change the mha.conf file after any machine 
dies, but that makes operating the cluster more complex. I would then have to 
monitor all machines and correct the conf every time one dies or comes back.

Original comment by [email protected] on 21 Nov 2012 at 10:56

GoogleCodeExporter commented on August 28, 2024

I can easily add either a command-line argument or a conf file parameter to skip 
checking failed servers when masterha_manager starts. I think adding a command-line 
argument (e.g. --ignore-fail-on-start) makes more sense.

You can try the patch below if you want. It skips checking ignore_fail-marked 
instances on startup.

--- lib/MHA/MasterMonitor.pm.old      2012-11-14 18:28:20.000000000 -0800
+++ lib/MHA/MasterMonitor.pm     2012-11-21 18:56:32.000000000 -0800
@@ -359,7 +359,7 @@ sub wait_until_master_is_unreachable() {
         sprintf( "Identified master is %s.", $current_master->get_hostinfo() )
       );
     }
-    $_server_manager->validate_num_alive_servers( $current_master, 0 );
+    $_server_manager->validate_num_alive_servers( $current_master, 1 );
     if ( check_master_ssh_env($current_master) ) {
       if ( check_master_binlog($current_master) ) {
         $log->error("Master configuration failed.");
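
To make the effect of this one-character change easier to see, here is a rough Python model of the check. It assumes that the second argument of validate_num_alive_servers is a flag saying whether dead servers marked ignore_fail may be tolerated, which is consistent with how the patch is described above. The function name, field names, and sample cluster are illustrative; this is not MHA's actual implementation.

```python
def validate_alive_servers(servers, current_master, ignore_fail_check):
    """Simplified model: return the dead servers that block the operation.

    servers: list of dicts with 'host', 'alive' and 'ignore_fail' keys.
    ignore_fail_check: when true, dead servers marked ignore_fail are tolerated.
    """
    blocking = []
    for server in servers:
        if server["alive"] or server["host"] == current_master:
            continue  # alive servers and the master itself never block here
        if ignore_fail_check and server["ignore_fail"]:
            continue  # dead, but explicitly allowed to be down
        blocking.append(server["host"])
    return blocking


# Hypothetical cluster: s2 is dead and marked ignore_fail in the conf.
cluster = [
    {"host": "s1", "alive": True, "ignore_fail": False},
    {"host": "s2", "alive": False, "ignore_fail": True},
    {"host": "s3", "alive": True, "ignore_fail": False},
]

print(validate_alive_servers(cluster, "s1", 0))  # flag off: s2 blocks startup
print(validate_alive_servers(cluster, "s1", 1))  # flag on: nothing blocks
```

Under this model, a dead slave without ignore_fail=1 still blocks startup even with the patch applied.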

Original comment by [email protected] on 22 Nov 2012 at 3:01

Changed state: Accepted

GoogleCodeExporter commented on August 28, 2024

Thanks for the patch, it works.

I have a new problem :)

I tested the following case.
There are three machines with these roles:
172.16.50.14 (master)
 +--172.16.50.11 (slave)
 +--172.16.50.13 (slave)

In this test, before starting masterha_manager, I shut down mysql on 13 
and then started mha_manager.
It starts fine and sees that 13 is dead.
It writes:
###############
###############
Tue Nov 27 17:35:28 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:35:28 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] MHA::MasterMonitor version 0.54.
Tue Nov 27 17:35:28 2012 - [info] Dead Servers:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Servers:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Slaves:
Tue Nov 27 17:35:28 2012 - [info]   172.16.50.11(172.16.50.11:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:35:28 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:35:28 2012 - [info] Current Alive Master: 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Checking slave configurations..
Tue Nov 27 17:35:28 2012 - [info]  read_only=1 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [warning]  relay_log_purge=0 is not set on slave 
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:35:28 2012 - [info]  binlog_do_db= testdb, binlog_ignore_db=
Tue Nov 27 17:35:28 2012 - [info]  Replication filtering check ok.
Tue Nov 27 17:35:28 2012 - [info] Starting SSH connection tests..
Tue Nov 27 17:35:29 2012 - [info] All SSH connection tests passed successfully.
Tue Nov 27 17:35:29 2012 - [info] Checking MHA Node version..
Tue Nov 27 17:35:30 2012 - [info]  Version check ok.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication 
settings on the current master..
Tue Nov 27 17:35:30 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:35:30 2012 - [info] Master MHA Node version is 0.54.
Tue Nov 27 17:35:30 2012 - [info] Checking recovery script configurations on 
the current master..
Tue Nov 27 17:35:30 2012 - [info]   Executing command: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/ 
--output_file=/home/mha_manager_data/app1/save_binary_logs_test 
--manager_version=0.54 --start_file=mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info]   Connecting to 
[email protected](172.16.50.14)..
  Creating /home/mha_manager_data/app1 if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info] Master setting check done.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication and 
checking recovery script configurations on all alive slave servers..
Tue Nov 27 17:35:30 2012 - [info]   Executing command : apply_diff_relay_logs 
--command=test --slave_user='mha' --slave_host=172.16.50.11 
--slave_ip=172.16.50.11 --slave_port=3306 --workdir=/home/mha_manager_data/app1 
--target_version=5.5.28-MariaDB-log --manager_version=0.54 
--relay_log_info=/home/mysqldata/relay-log.info  --relay_dir=/home/mysqldata/  
--slave_pass=xxx
Tue Nov 27 17:35:30 2012 - [info]   Connecting to 
[email protected](172.16.50.11:22)..
  Checking slave recovery environment settings..
    Opening /home/mysqldata/relay-log.info ... ok.
    Relay log found at /home/mysqldata, up to mysql-relay-bin.000005
    Temporary relay log file is /home/mysqldata/mysql-relay-bin.000005
    Testing mysql connection and privileges.. done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Tue Nov 27 17:35:31 2012 - [info] Slaves settings check done.
Tue Nov 27 17:35:31 2012 - [info]
172.16.50.14 (current master)
 +--172.16.50.11

Tue Nov 27 17:35:31 2012 - [warning] master_ip_failover_script is not defined.
Tue Nov 27 17:35:31 2012 - [warning] shutdown_script is not defined.
Tue Nov 27 17:35:31 2012 - [info] Set master ping interval 3 seconds.
Tue Nov 27 17:35:31 2012 - [warning] secondary_check_script is not defined. It 
is highly recommended setting it to check master reachability from two or more 
routes.
Tue Nov 27 17:35:31 2012 - [info] Starting ping health check on 
172.16.50.14(172.16.50.14:3306)..
Tue Nov 27 17:35:31 2012 - [info] Ping(SELECT) succeeded, waiting until MySQL 
doesn't respond..
########################
########################

While it was monitoring the master, I started mysql on 13 and shut it down on 11.
Then I shut down the master to see what would happen.

I expected mha_manager to start a failover: look again at which slaves 
are working and then choose the best master candidate among them.
But it doesn't.
It wants to use only 11 as the candidate and doesn't consider 13.

#########
#########
Tue Nov 27 17:36:07 2012 - [warning] Got error on MySQL select ping: 2006 
(MySQL server has gone away)
Tue Nov 27 17:36:07 2012 - [info] Executing SSH check script: save_binary_logs 
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/ 
--output_file=/home/mha_manager_data/app1/save_binary_logs_test 
--manager_version=0.54 --binlog_prefix=mysql-bin
  Creating /home/mha_manager_data/app1 if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:36:07 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:36:10 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:10 2012 - [warning] Connection failed 1 time(s)..
Tue Nov 27 17:36:13 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:13 2012 - [warning] Connection failed 2 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Got error on MySQL connect: 2003 (Can't 
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:16 2012 - [warning] Connection failed 3 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Master is not reachable from health 
checker!
Tue Nov 27 17:36:16 2012 - [warning] Master 172.16.50.14(172.16.50.14:3306) is 
not reachable!
Tue Nov 27 17:36:16 2012 - [warning] SSH is reachable.
Tue Nov 27 17:36:16 2012 - [info] Connecting to a master server failed. Reading 
configuration file /etc/masterha_default.cnf and /etc/mha_manager/app1.cnf 
again, and trying to connect to all servers to check server status..
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [info] Checking slave configurations..
Tue Nov 27 17:36:16 2012 - [info]  read_only=1 is not set on slave 
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [warning]  relay_log_purge=0 is not set on slave 
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:36:16 2012 - [info]  Replication filtering check ok.
Tue Nov 27 17:36:16 2012 - [info] Master is down!
Tue Nov 27 17:36:16 2012 - [info] Terminating monitoring script.
Tue Nov 27 17:36:16 2012 - [info] Got exit code 20 (Master dead).
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations 
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from 
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] MHA::MasterFailover version 0.54.
Tue Nov 27 17:36:16 2012 - [info] Starting master failover.
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Checking master reachability via mysql(double 
check)..
Tue Nov 27 17:36:16 2012 - [info]  ok.
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)  
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info]     Replicating from 
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info]     Primary candidate for the new Master 
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ServerManager.pm, 
ln443]  Server 172.16.50.11(172.16.50.11:3306) is dead, but must be alive! 
Check server settings.
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ManagerUtil.pm, 
ln178] Got ERROR:  at /usr/share/perl/5.14/MHA/MasterFailover.pm line 258
#########
#########
It also does not do the failover in that case.


Questions:
1. As I understand it, I should restart mha_manager every time any of the slaves 
starts or stops, shouldn't I? Because MHA builds the list of candidates only on restart...
2. If I do as in point 1, it becomes very hard to do in Pacemaker (I'm not an ace 
in Pacemaker yet).
3. Could you maybe change its behavior and build the list of candidates (listed 
in the conf file) at failover time? That would be more logical to my mind, 
because slaves can stop and start many times in their life, and nobody knows 
what state they will be in at any point in time :) not even they do.

Original comment by [email protected] on 27 Nov 2012 at 2:11

GoogleCodeExporter commented on August 28, 2024

Is ignore_fail=1 set for slave 172.16.50.11? Otherwise MHA does not start the 
failover.
You don't need to restart MHA just because a slave starts or stops. You need to 
restart it when you add or remove slaves.
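
One way to check this from the manager log is to compare the hosts listed under "Dead Servers:" with the hosts that have ignore_fail=1 in the conf. The sketch below assumes log lines in the format shown in the log above; the helper names and the ignore_fail sets passed in are hypothetical, and this only approximates the check MHA performs.

```python
import re

# Excerpt in the same format as the failover log in this thread.
LOG = """\
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info]   172.16.50.13(172.16.50.13:3306)
"""

# Matches lines such as "... [info]   host(ip:port)" and captures the host.
HOST_LINE = re.compile(r"\[info\]\s+([^\s(]+)\(\S+:\d+\)\s*$")


def dead_servers(log_text):
    """Collect the hosts listed directly under a 'Dead Servers:' line."""
    dead, in_block = [], False
    for line in log_text.splitlines():
        if line.rstrip().endswith("Dead Servers:"):
            in_block = True
            continue
        match = HOST_LINE.search(line)
        if in_block and match:
            dead.append(match.group(1))
        else:
            in_block = False  # any other line ends the block
    return dead


def blocking_servers(log_text, master, ignore_fail):
    """Dead hosts that are neither the master nor marked ignore_fail."""
    return [h for h in dead_servers(log_text)
            if h != master and h not in ignore_fail]


print(blocking_servers(LOG, "172.16.50.14", set()))
print(blocking_servers(LOG, "172.16.50.14", {"172.16.50.11"}))
```

With an empty ignore_fail set the first call reports 172.16.50.11, the same host named in the "is dead, but must be alive!" error; with 172.16.50.11 marked, nothing blocks the failover.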

Original comment by [email protected] on 28 Nov 2012 at 7:42

GoogleCodeExporter commented on August 28, 2024

You're right, I forgot to set ignore_fail=1 on 172.16.50.11.
I have now set it and tested again. MHA works well and makes 13 the master.

Original comment by [email protected] on 28 Nov 2012 at 9:15

GoogleCodeExporter commented on August 28, 2024

Original comment by [email protected] on 16 Sep 2013 at 6:41

Changed state: Done
