signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server

Home Page: https://signal18.io/products/srm

License: GNU General Public License v3.0

Go 73.03% Shell 1.88% Python 8.70% HTML 8.70% Makefile 0.14% Dockerfile 0.04% JavaScript 3.00% CSS 4.53%
failover slave leader-election mariadb mysql replication gtid haproxy proxysql monitoring

replication-manager's Introduction

replication-manager

replication-manager is a high availability solution to manage MariaDB 10.x and MySQL & Percona Server 5.7 GTID replication topologies.

The main features are:

  • Replication monitoring
  • Topology detection
  • Slave to master promotion (switchover)
  • Master election on failure detection (failover)
  • Replication best practice enforcement
  • Targets zero data loss in most failure scenarios
  • Multiple cluster management
  • Proxy integration (ProxySQL, MaxScale, HAProxy, Spider)

License

replication-manager is released under the GPLv3 license. (complete licence text)

It includes third-party libraries released under their own licences. Please refer to the vendor directory for more information.

It also includes derivative work from the go-carbon library by Roman Lomonosov, released under the MIT licence and found under the graphite directory. The original library can be found here: https://github.com/lomik/go-carbon

Copyright and Support

Replication Manager for MySQL and MariaDB is developed and supported by SIGNAL18 CLOUD SAS.

replication-manager's People

Contributors

ahfa92, caffeinated92, dependabot[bot], emmaloubersac, fmbiete, guilhem, marielgn, markus456, ncstate-mafields, skord, sschwarz, svaroqui, tanji, tcaxias, terwey, testwill


replication-manager's Issues

Cannot build Dockerfile on 0.7 branch because of go shitty package manager

$ docker-compose up -d                                                                                                                                                                           
Building agent3
Step 1 : FROM golang:1.6-alpine
 ---> 6deacc16609c
Step 2 : RUN apk add --update git && rm -rf /var/cache/apk/*
 ---> Using cache
 ---> c0b0d13aed26
Step 3 : RUN mkdir -p /go/src/replication-manager
 ---> Using cache
 ---> b31b34b27264
Step 4 : WORKDIR /go/src/replication-manager
 ---> Using cache
 ---> 7c6e0df18245
Step 5 : COPY . /go/src/replication-manager
 ---> e9b10693323a
Removing intermediate container db9cbcebd446
Step 6 : RUN go get .
 ---> Running in 6f70e32db8ec
package github.com/mariadb-corporation/replication-manager/gtid: cannot find package "github.com/mariadb-corporation/replication-manager/gtid" in any of:
    /usr/local/go/src/github.com/mariadb-corporation/replication-manager/gtid (from $GOROOT)
    /go/src/github.com/mariadb-corporation/replication-manager/gtid (from $GOPATH)
ERROR: Service 'agent3' failed to build: The command '/bin/sh -c go get .' returned a non-zero code: 1

--daemon should be a command

I like the idea of the subcommands in mrm. Perhaps the --daemon switch in the monitor subcommand should be a separate command?

[0.6.2] Failover and switchover fail: invalid memory address or nil pointer dereference

Hi,
The failover and switchover functions fail.
The failover error message is as below:
2016/03/17 12:21:57 INFO : Server mdb1:9801 is dead.
2016/03/17 12:21:57 INFO : Starting master switch
2016/03/17 12:21:57 INFO : Electing a new master
2016/03/17 12:21:57 INFO : Slave mdb2:9801 [0] has been elected as a new master
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x40cdc8]

goroutine 1 [running]:
panic(0x702de0, 0xc82000e0f0)
/usr/lib/go/src/runtime/panic.go:464 +0x3e6
main.(*ServerMonitor).delete(0xc820070640, 0x8f41e0)
/home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/monitor.go:259 +0x208
main.masterFailover(0x7fff32300701)
/home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/failover.go:44 +0x9d1
main.main()
/home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/repmgr.go:286 +0x218d

The switchover error message is as below:
2016-03-17 11:54:32 Monitor started in switchover mode
2016-03-17 11:54:52 INFO : Slave mdb2:9801 [0] has been elected as a new master
2016-03-17 11:54:52 DEBUG: Election rig: mdb2:9801 elected as preferred master
2016-03-17 11:54:52 DEBUG: Checking eligibility of slave server mdb2:9801
2016-03-17 11:54:52 DEBUG: Processing 2 candidates
2016-03-17 11:54:52 INFO : Electing a new master
2016-03-17 11:54:52 INFO : Checking long running updates on master
2016-03-17 11:54:52 INFO : Flushing tables on mdb1:9801 (master)
2016-03-17 11:54:52 INFO : Starting master switch
2016-03-17 11:54:32 Monitor started in switchover mode
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x40cdc8]

goroutine 1 [running]:
panic(0x702de0, 0xc82000e100)
    /usr/lib/go/src/runtime/panic.go:464 +0x3e6
main.(*ServerMonitor).delete(0xc820072640, 0x8f41e0)
    /home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/monitor.go:259 +0x208
main.masterFailover(0xc8202b7e00)
    /home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/failover.go:44 +0x9d1
main.main()
    /home/tanj/Dev/go/src/github.com/mariadb-corporation/replication-manager/repmgr.go:313 +0x2a5c

runtime error: makeslice: len out of range

When following the instructions in the contrib/cluster-maxscale/README.md file I get this error after all the other services have been started.

panic: runtime error: makeslice: len out of range
replication-manager_1  | 
replication-manager_1  | goroutine 1 [running]:
replication-manager_1  | panic(0x972cc0, 0xc820278d20)
replication-manager_1  |    /usr/local/go/src/runtime/panic.go:481 +0x3e6
replication-manager_1  | main.glob.func10(0xcf9460, 0xc820084c80, 0x0, 0x5)
replication-manager_1  |    /go/src/replication-manager/repmgr.go:220 +0x35e
replication-manager_1  | replication-manager/vendor/github.com/spf13/cobra.(*Command).execute(0xcf9460, 0xc820084b40, 0x5, 0x5, 0x0, 0x0)
replication-manager_1  |    /go/src/replication-manager/vendor/github.com/spf13/cobra/command.go:565 +0x85a
replication-manager_1  | replication-manager/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xcf8860, 0xcf9460, 0x0, 0x0)
replication-manager_1  |    /go/src/replication-manager/vendor/github.com/spf13/cobra/command.go:651 +0x55c
replication-manager_1  | replication-manager/vendor/github.com/spf13/cobra.(*Command).Execute(0xcf8860, 0x0, 0x0)
replication-manager_1  |    /go/src/replication-manager/vendor/github.com/spf13/cobra/command.go:610 +0x2d
replication-manager_1  | main.main()
replication-manager_1  |    /go/src/replication-manager/main.go:30 +0x27
$ docker-compose ps
                 Name                                  Command               State             Ports
---------------------------------------------------------------------------------------------------------
clustermaxscale_agent1_1                replication-manager agent  ...   Up                               
clustermaxscale_agent2_1                replication-manager agent  ...   Up                               
clustermaxscale_agent3_1                replication-manager agent  ...   Up                               
clustermaxscale_mariadb1_1              docker-entrypoint.sh --log ...   Up       3306/tcp                
clustermaxscale_mariadb2_1              docker-entrypoint.sh --log ...   Up       3306/tcp                
clustermaxscale_mariadb3_1              docker-entrypoint.sh --log ...   Up       3306/tcp                
clustermaxscale_maxscale_1              /usr/bin/maxscale --nodaem ...   Up       0.0.0.0:32772->3306/tcp 
clustermaxscale_replication-manager_1   replication-manager monito ...   Exit 2     

Passing --daemon=true in the command serves as a workaround.

Cannot connect to node error: Please add hostname

I was unable to connect to db4 (mysql schema is not replicated and I forgot to add credentials there). It would save me a fair bit of time if the error said the issue was at db4.

mariadb@maxscale01:~$ ./go/bin/replication-manager --hosts=db1,db2,db3,db4 --user=replication_manager:..... --rpluser=slave_user --interactive --switchover=keep
2016/06/10 13:54:51 ERROR: Database access denied: Error 1045: Access denied for user 'replication_manager'@'172.27.4.103' (using password: YES)

FR: Toggle On-Call/On-Leave via console monitor

The Web-based Monitor can toggle --interactive, but the console-based monitor doesn't seem to be able to (it's not listed in the README). It would be great to have the same ability in both places. This helps when running the console monitor under screen.

Thanks!

replication-manager 0.6 question?

replication-manager --version
MariaDB Replication Manager version 0.6.3

master: 172.28.14.118 mariadb 10.0.23
slave: 172.28.14.114 mariadb 10.0.23

Both master and slave services are normal, and reads and writes work fine too. But when I execute replication-manager as below:
replication-manager --user=root:123456 --rpluser=root:123456 --hosts=172.28.14.118:3306 --failover=force --socket=/home/mtest1/data/mysql.sock
2016/03/18 11:21:06 ERROR: Could not autodetect a failed master!

replication-manager --user=root:123456 --rpluser=root:123456 --hosts=172.28.14.118:3306,172.28.14.114:3306 --failover=force --socket=/home/mtest1/data/mysql.sock
2016/03/18 11:21:28 ERROR: Multi-master topologies are not yet supported.

Is there a problem with our configuration or syntax?

Misleading Error Message: Error getting privileges for user root on server 127.0.0.1:3307: sql: no rows in result set.

When running the monitor:

replication-manager monitor --hosts=127.0.0.1:3307,127.0.0.1:3308,127.0.0.1:3309 --user=root --rpluser=r --verbose

I got this error message:

ERROR:ERR00015 Error getting privileges for user root on server 127.0.0.1:3307: sql: no rows in result set.

Turning on the general query log, I found out what queries this was after:

Execute SELECT Select_priv, Process_priv, Super_priv, Repl_slave_priv, Repl_client_priv, Reload_priv FROM mysql.user WHERE user = 'root' AND host = '127.0.0.1'
Execute SELECT Select_priv, Process_priv, Super_priv, Repl_slave_priv, Repl_client_priv, Reload_priv FROM mysql.user WHERE user = 'r' AND host = '127.0.0.1'
Execute SELECT Select_priv, Process_priv, Super_priv, Repl_slave_priv, Repl_client_priv, Reload_priv FROM mysql.user WHERE user = 'r' AND host = '%'

I had the first two users, but I did not have the r@% user.

This was confusing because the problem wasn't actually with the root user, as the error message stated.

It would be great if the error message could show which query it ran, to make debugging these a bit easier (see the sketch below).

Thanks!
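For illustration, a minimal sketch of how the error could carry the failing query; the helper name and column list are made up and this is not the project's actual code:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
)

// getPrivileges is a made-up helper showing how the error could include the
// exact query and arguments, so "no rows in result set" points at the failing
// user/host lookup instead of always naming the --user account.
func getPrivileges(db *sql.DB, user, host string) error {
    query := "SELECT Select_priv, Super_priv, Repl_slave_priv FROM mysql.user WHERE user = ? AND host = ?"
    var sel, super, repl string
    if err := db.QueryRow(query, user, host).Scan(&sel, &super, &repl); err != nil {
        // Wrap the original error together with the query and its arguments.
        return fmt.Errorf("error getting privileges: %v (query: %s, args: %s@%s)", err, query, user, host)
    }
    return nil
}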

FR: Dashboard: Notification when not connected

Feature request. The dashboard needs to have some sort of warning when it's no longer connected to the monitor. Otherwise, if I'm only looking at the dashboard, I wouldn't know that the monitor daemon has stopped and is not providing up-to-date information.

Mrm breaks if no transactions have been executed after a failover

If MRM does a failover but no transactions have been executed afterwards, the next failover does not work.

It will also not work if only transactions from an old-school slave have been executed in between the two failovers (see MDEV-10279).

As this would not really happen in the wild, I suggest adding a workaround: please execute a bogus transaction after the failover has succeeded (a sketch follows below).
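A minimal sketch of that workaround, assuming a database/sql handle on the newly promoted master; the replication_manager.heartbeat schema and table names are made up for this sketch:

package repman // hypothetical package name, for illustration only

import "database/sql"

// bumpGTID runs a bogus, binlogged transaction on the newly promoted master so
// its GTID position advances before the next failover.
func bumpGTID(db *sql.DB) error {
    if _, err := db.Exec("CREATE TABLE IF NOT EXISTS replication_manager.heartbeat (id TINYINT PRIMARY KEY, ts TIMESTAMP NOT NULL)"); err != nil {
        return err
    }
    // REPLACE always writes a row event, so the master's GTID sequence moves on.
    _, err := db.Exec("REPLACE INTO replication_manager.heartbeat VALUES (1, NOW())")
    return err
}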

Add basic build instructions to README.md (or elsewhere)

From my recent experience, one has to do at least the following to build replication-manager (I used Fedora 23):

If go is missing:

sudo yum install golang-bin
export GOPATH=

Then:

git clone https://github.com/mariadb-corporation/replication-manager.git
cd replication-manager
go get github.com/go-sql-driver/mysql
go get github.com/nsf/termbox-go
go get github.com/tanji/mariadb-tools/dbhelper
go get github.com/ogier/pflag
go build
./replication-manager --version

I think it makes sense to document this explicitly.

ERROR: No slaves were detected

Sorry to bother you again, but I still have a problem with my configuration. The structure looks like the following:

3 servers: 1 maxscale (10.0.0.1), 1 mariadb master(10.0.0.2), 1 mariadb slave(10.0.0.3).

A simple test to swap master and slave fails with the following output:

replication-manager -user user:user -rpluser replica:repl1ca -hosts 10.0.0.2:3306,10.0.0.3:3306 -failover=force -verbose

2015/11/30 10:59:49 DEBUG: Creating new server: 10.0.0.2:3306
2015/11/30 10:59:49 DEBUG: Checking if server 10.0.0.2:3306 is slave
2015/11/30 10:59:49 DEBUG: Server 10.0.0.2:3306 is not a slave. Setting aside
2015/11/30 10:59:49 DEBUG: Creating new server: 10.0.0.3:3306
2015/11/30 10:59:49 DEBUG: Checking if server 10.0.0.3:3306 is slave
2015/11/30 10:59:49 DEBUG: Server 10.0.0.3:3306 is not a slave. Setting aside
2015/11/30 10:59:49 ERROR: No slaves were detected.

I was looking for the error message in the sources and found the following condition:

if servers[k].UsingGtid != ""

But on the slave, the Using_Gtid value is definitely not empty:

Using_Gtid: Current_Pos
Gtid_IO_Pos: 0-1-12542

So, I am just curious what to check next.
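For reference, a minimal standalone sketch of that kind of check, reading Using_Gtid out of SHOW SLAVE STATUS by column name; this is not the project's exact code, just an illustration of the condition quoted above:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
)

// usingGtid returns the Using_Gtid value from SHOW SLAVE STATUS, or an empty
// string if the server reports no slave status row at all. It looks the column
// up by name, so it does not depend on column order.
func usingGtid(db *sql.DB) (string, error) {
    rows, err := db.Query("SHOW SLAVE STATUS")
    if err != nil {
        return "", err
    }
    defer rows.Close()
    cols, err := rows.Columns()
    if err != nil {
        return "", err
    }
    if !rows.Next() {
        return "", nil // no row: the server is not configured as a slave
    }
    vals := make([]sql.RawBytes, len(cols))
    ptrs := make([]interface{}, len(cols))
    for i := range vals {
        ptrs[i] = &vals[i]
    }
    if err := rows.Scan(ptrs...); err != nil {
        return "", err
    }
    for i, c := range cols {
        if c == "Using_Gtid" {
            return string(vals[i]), nil
        }
    }
    return "", fmt.Errorf("Using_Gtid column not found")
}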

auto failover?

If there is only one master and one slave, when the master goes down, can it not fail over?

One master and two slaves failover question

Hi,
we use one master and two slaves with replication-manager 0.6.4.
master: 172.28.14.114
slave: 172.28.14.118, 172.28.14.117

When we stop the mysql service on 172.28.14.114, slave 172.28.14.117 is elected as the new master,
172.28.14.118 becomes a slave of 172.28.14.117, and replication is normal.
But when we start the mysql service on 172.28.14.114 again, the IO thread shows an error:
Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-114-1775428, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'

MariaDB Replication Monitor and Health Checker version 0.6.3 | Mode: Auto Failover

Master Host   Port                              Current GTID      Binlog Position  Strict Mode

172.28.14.117 3306 0-117-1790726 0-117-1790726 OFF

 Slave Host   Port  Binlog   Using GTID         Current GTID           Slave GTID   Replication Health  Delay  RO

172.28.14.118 3306 ON Slave_Pos 0-117-1790694 0-117-1790694 Behind master 1 ON
172.28.14.114 3306 ON Current_Pos 0-114-1775428 0-118-1648605 NOT OK, IO Stopped 0 ON

Ctrl-Q to quit, Ctrl-S to switchover

2016-06-24 22:49:07 INFO : Master switch on 172.28.14.117 complete
2016-06-24 22:49:04 INFO : Change master on slave 172.28.14.118
2016-06-24 22:49:04 INFO : Switching other slaves to the new master
2016-06-24 22:49:04 INFO : Resetting slave on new master and set read/write mode on
2016-06-24 22:49:03 INFO : Stopping slave thread on new master
2016-06-24 22:49:03 INFO : Switching master
2016-06-24 22:49:03 INFO : Slave 172.28.14.117 [0] has been elected as a new master
2016-06-24 22:49:03 INFO : Electing a new master
2016-06-24 22:49:03 INFO : Starting master switch
2016-06-24 22:49:03 Declaring master as failed
2016-06-24 22:49:03 Master Failure detected! Retry 5/5
2016-06-24 22:49:02 Master Failure detected! Retry 4/5
2016-06-24 22:49:01 Master Failure detected! Retry 3/5
2016-06-24 22:49:00 Master Failure detected! Retry 2/5
2016-06-24 22:48:59 Master Failure detected! Retry 1/5

Read .my.cnf file

I have my credentials in ~/.my.cnf. Please implement support for reading them.
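A minimal sketch of reading the [client] section of ~/.my.cnf; it assumes the gopkg.in/ini.v1 parser, which is not necessarily what the project would choose:

package repman // hypothetical package name, for illustration only

import (
    "os"
    "path/filepath"

    "gopkg.in/ini.v1" // assumption: any INI-style parser would do here
)

// credentialsFromMyCnf reads user and password from the [client] section of
// ~/.my.cnf; missing keys simply come back as empty strings.
func credentialsFromMyCnf() (user, password string, err error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", "", err
    }
    cfg, err := ini.Load(filepath.Join(home, ".my.cnf"))
    if err != nil {
        return "", "", err
    }
    client := cfg.Section("client")
    return client.Key("user").String(), client.Key("password").String(), nil
}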

Error with Monitor: Preferred master is not included in the hosts option

I don't quite understand this error message:
Preferred master is not included in the hosts option

wfong@willcn:~$ replication-manager monitor --hosts=127.0.0.1:3307,127.0.0.1:3308,127.0.0.1:3309 --user=root --rpluser=r --http-server --http-root=go/src/github.com/mariadb-corporation/replication-manager/dashboard --autorejoin --prefmaster=127.0.0.1:3308,127.0.0.1:3309
2016/07/12 00:09:40 ERROR: Preferred master is not included in the hosts option
wfong@willcn:~$ replication-manager monitor --hosts=127.0.0.1:3307 --user=root --rpluser=r --http-server --http-root=go/src/github.com/mariadb-corporation/replication-manager/dashboard --autorejoin --prefmaster=127.0.0.1:3308,127.0.0.1:3309
2016/07/12 00:11:35 ERROR: Preferred master is not included in the hosts option
wfong@willcn:~$

I'm not sure how to specify the hosts in my replication environment, or how to specify which hosts I would like MRM to use as a master.

replication-manager doesn't work with Amazon RDS, maxscale

I tried to use replication-manager on my two Amazon RDS hosts, but it errored out on the SUPER privilege.
Amazon RDS doesn't grant the SUPER privilege to RDS users.

/usr/local/bin/replication-manager --user=XXX:XXX123 --rpluser=XXXX:XXXX --hosts=TEST1.rds.amazonaws.com,TEST2.rds.amazonaws.com --failover=monitor
2016/06/13 10:50:11 ERROR: User must have SUPER privilege

Is there any way to proceed with replication-manager without SUPER privileges? I have given my user all privileges except SUPER.

Panic: runtime error: index out of range

I just got the latest binaries and tried to set up failover with simple master-slave replication, something like this:

mariadb-repmgr -user user:user -rpluser repl:repl -hosts 10.0.0.2:3306,10.0.0.3:3306 -failover=force

Master went down and I got:

panic: runtime error: index out of range

goroutine 1 [running]:
main.main()
/home/tanj/Dev/go/src/github.com/tanji/mariadb-tools/mariadb-repmgr/repmgr.go:187 +0x2c5a

goroutine 5 [syscall]:
os/signal.loop()
/usr/lib/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/lib/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 6 [chan receive]:
database/sql.(*DB).connectionOpener(0xc208046140)
/usr/lib/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/lib/go/src/database/sql/sql.go:452 +0x31c

goroutine 7 [runnable]:
database/sql.(*DB).connectionOpener(0xc2080461e0)
/usr/lib/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/lib/go/src/database/sql/sql.go:452 +0x31c

Any clue how to solve this?

Binary releases?

Hello, just wondering if there are plans to put a binary release here?

Please implement slave catchup timeout instead of gtidcheck

Currently the documentation (in --help) of the --gtidcheck feature is unclear. It would be more useful if there was a timeout instead of a switch for this.

The timeout should determine how long mrm waits until it considers the slave to be too far out of sync to do the switchover. I think a default of 30 seconds is good in most cases.
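A minimal sketch of such a timeout, leaning on MariaDB's MASTER_GTID_WAIT(), which blocks until the slave reaches the given GTID position or the timeout (in seconds) expires; function and parameter names are made up:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
    "time"
)

// waitForCatchup asks the slave to wait until it has applied masterGTID.
// MASTER_GTID_WAIT returns 0 on success and -1 when the timeout expires, in
// which case the caller would abort the switchover instead of relying on a
// boolean --gtidcheck switch.
func waitForCatchup(slave *sql.DB, masterGTID string, timeout time.Duration) error {
    var ret sql.NullInt64
    err := slave.QueryRow("SELECT MASTER_GTID_WAIT(?, ?)", masterGTID, int(timeout.Seconds())).Scan(&ret)
    if err != nil {
        return err
    }
    if !ret.Valid || ret.Int64 != 0 {
        return fmt.Errorf("slave did not reach %s within %s, aborting switchover", masterGTID, timeout)
    }
    return nil
}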

Add sysbench control in web interface

Add parameters
sysbench_binary=
sysbench_host=
sysbench_port=
sysbench_threads=

The web interface would make it possible to run sysbench and report a graph, stepping the thread count in powers of 2 up to sysbench_threads.

The Docker integration of mrm-maxscale-keepalived-sysbench-xtrabackup would make it a nice service.

Add binlog control

purge_master_binlog=1

When this option is enabled, automatically purge the master's binary logs up to the position of the most lagging slave.

purge_relay_log=100000

When this option is enabled, stop the IO threads on the slaves when the number of events in the relay log exceeds the configured limit, so that disk space usage is capped; once the SQL thread catches up, the IO threads are restarted.
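A rough sketch of what purge_master_binlog could do, assuming the caller has already determined the oldest binlog file still needed by any slave (e.g. from each slave's Master_Log_File); names are illustrative only:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
)

// purgeMasterBinlogs removes binary logs on the master up to, but not
// including, oldestNeeded - the Master_Log_File reported by the most lagging
// slave - so no slave loses events it still has to fetch.
func purgeMasterBinlogs(master *sql.DB, oldestNeeded string) error {
    if oldestNeeded == "" {
        return fmt.Errorf("refusing to purge: no slave binlog position known")
    }
    // PURGE BINARY LOGS TO takes a literal file name, so the value is inlined
    // here; real code should validate it before building the statement.
    _, err := master.Exec(fmt.Sprintf("PURGE BINARY LOGS TO '%s'", oldestNeeded))
    return err
}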

SLA Uptime updates too frequently

On my completely idle system, the SLA Uptime updates too frequently, which makes the value bounce around from 99.94 to 99.99999. This is a bit distracting, and also less reliable.

I'm not sure what the purpose of this information is, so I'm not sure what to recommend as a fix. But it should at least take a longer sample.

Thanks,
-will

Topology is broken...

wfong@willcn:~$ replication-manager topology --hosts=127.0.0.1:3307,127.0.0.1:3308,127.0.0.1:3309 --user=root --rpluser=r
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x28 pc=0x62f33d]

goroutine 1 [running]:
panic(0x992da0, 0xc82000e140)
/usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
github.com/mariadb-corporation/replication-manager/state.(*StateMachine).SetMasterUpAndSync(0x0, 0x7ffccd840000)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/state/state.go:102 +0xbd
main.topologyDiscover(0x0, 0x0)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/topology.go:247 +0x2458
main.glob.func12(0xd3aa80, 0xc820129d70, 0x0, 0x3)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/topology.go:268 +0x36
github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra.(*Command).execute(0xd3aa80, 0xc820129c50, 0x3, 0x3, 0x0, 0x0)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra/command.go:575 +0x896
github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xd39ba0, 0xd3aa80, 0x0, 0x0)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra/command.go:661 +0x55c
github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra.(*Command).Execute(0xd39ba0, 0x0, 0x0)
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/vendor/github.com/spf13/cobra/command.go:620 +0x2d
main.main()
/home/wfong/go/src/github.com/mariadb-corporation/replication-manager/main.go:40 +0x27
wfong@willcn:~$

No need to change gtid_slave_pos on slaves, breaks multi-domain intermediate master setup

Hi,

Currently mrm changes the gtid_slave_pos on the slaves. I don't see why that is needed in normal operation.

I am testing MRM in a slightly special setup, where -> is a replication stream in domain 1 and => is a replication stream in domain 2:
Server A -> B, C, D; Server C => D
SET GLOBAL gtid_slave_pos fails on server C because domain 2 is not present; on server D it fails because the secondary replication stream is down.

We ran into two caveats:
The first is when you fail over a master and then have to let it rejoin the cluster. There is an easy fix for that situation: if there is no gtid_slave_pos present on a master that has to become a slave, we have to do CHANGE MASTER TO MASTER_USE_GTID=current_pos; (see the sketch below).
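A minimal sketch of that rejoin fix, with made-up function and parameter names; values are inlined for brevity and real code would escape them properly:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
)

// rejoinOldMaster points a demoted master that has no usable gtid_slave_pos at
// the new master using MASTER_USE_GTID=current_pos, so its own binlog position
// seeds the slave position.
func rejoinOldMaster(oldMaster *sql.DB, host string, port int, replUser, replPass string) error {
    stmt := fmt.Sprintf(
        "CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=%d, MASTER_USER='%s', MASTER_PASSWORD='%s', MASTER_USE_GTID=current_pos",
        host, port, replUser, replPass)
    if _, err := oldMaster.Exec(stmt); err != nil {
        return err
    }
    _, err := oldMaster.Exec("START SLAVE")
    return err
}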

Another problem arises when the slave receives writes in the same GTID domain as the master. This is an incorrect setup, and if we have strict_mode enabled, writing on the slave will break replication anyway.

Perhaps I missed something? Very interested in your response!

Cheers,
Michaël

switchover fails with "Error splitting GTID: 1-4134630-7,2-4134631-1"

Hello @tanji

Got the next problem. :-) I'm on version 0.61 now, but the problem seems to exist in both 0.52 and 0.61.

I am trying to switchover with:

/usr/local/bin/replication-manager --user=replmgr:pass --rpluser=repl:pass --hosts=10.4.129.11,10.4.129.12 --switchover=keep --logfile=/var/log/mrm.log

What I get is:

   Master Host   Port                              Current GTID      Binlog Position  Strict Mode                                                                    
    10.4.129.11   3306                   2-4134631-1,1-4134630-7          1-4134630-7          OFF                                                                    

     Slave Host   Port  Binlog   Using GTID         Current GTID           Slave GTID   Replication Health  Delay  RO                                                 
    10.4.129.12   3306      ON    Slave_Pos 1-4134630-7,2-4134631-1 1-4134630-7,2-4134631-1           Running OK      0  ON                                           


 Ctrl-Q to quit, Ctrl-S to switchover                                                                                                                                 


 2016-03-15 11:01:43 DEBUG: Checking eligibility of slave server 10.4.129.12                                                                                          
 2016-03-15 11:01:43 DEBUG: Processing 1 candidates                                                                                                                   
 2016-03-15 11:01:43 INFO : Electing a new master                                                                                                                     
 2016-03-15 11:01:43 INFO : Checking long running updates on master                                                                                                   
 2016-03-15 11:01:43 INFO : Flushing tables on 10.4.129.11 (master)                                                                                                   
 2016-03-15 11:01:43 INFO : Starting master switch                                                                                                                    
 2016-03-15 11:01:36 Monitor started in switchover mode
 2016/03/15 11:01:43 Error splitting GTID: 1-4134630-7,2-4134631-1

Any idea what can be the cause of this?
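For what it's worth, a minimal sketch of splitting a multi-domain GTID list such as 1-4134630-7,2-4134631-1 into per-domain triples; the type and function names are made up and this is not the project's parser, but the panic suggests the splitter only expected a single domain-server-sequence triple:

package repman // hypothetical package name, for illustration only

import (
    "fmt"
    "strconv"
    "strings"
)

// Gtid is one domain-server-sequence triple from a MariaDB GTID position.
type Gtid struct {
    DomainID uint64
    ServerID uint64
    SeqNo    uint64
}

// parseGtidList handles multi-domain positions such as
// "1-4134630-7,2-4134631-1" by splitting on commas first, then on dashes.
func parseGtidList(s string) ([]Gtid, error) {
    var out []Gtid
    for _, part := range strings.Split(s, ",") {
        fields := strings.Split(strings.TrimSpace(part), "-")
        if len(fields) != 3 {
            return nil, fmt.Errorf("error splitting GTID: %s", part)
        }
        var g Gtid
        var err error
        if g.DomainID, err = strconv.ParseUint(fields[0], 10, 32); err != nil {
            return nil, err
        }
        if g.ServerID, err = strconv.ParseUint(fields[1], 10, 32); err != nil {
            return nil, err
        }
        if g.SeqNo, err = strconv.ParseUint(fields[2], 10, 64); err != nil {
            return nil, err
        }
        out = append(out, g)
    }
    return out, nil
}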

Switchover/Failover should print out a status of all listed hosts

A Switchover/Failover event should show somewhere (maybe with just --verbose) all the candidate hosts listed, and perhaps a short note on why each one wasn't chosen. Anything would be helpful, e.g.: cannot connect, connection refused, replication in bad state, access denied, etc.

I have no idea why :3307 was not listed/considered/etc.

wfong@willcn:$ replication-manager switchover --hosts=127.0.0.1:3307,127.0.0.1:3308,127.0.0.1:3309 --prefmaster=127.0.0.1:3307 --user=root --rpluser=r --verbose
2016/07/11 15:55:01 INFO : Starting master switch
2016/07/11 15:55:01 INFO : Flushing tables on 127.0.0.1:3308 (master)
2016/07/11 15:55:01 INFO : Checking long running updates on master
2016/07/11 15:55:01 INFO : Electing a new master
2016/07/11 15:55:01 DEBUG: Processing 1 candidates
2016/07/11 15:55:01 DEBUG: Checking eligibility of slave server 127.0.0.1:3309 [0]
2016/07/11 15:55:01 DEBUG: Got sequence(s) [73] for server [0]
2016/07/11 15:55:01 INFO : Slave 127.0.0.1:3309 [0] has been elected as a new master
2016/07/11 15:55:01 INFO : Terminating all threads on 127.0.0.1:3308
2016/07/11 15:55:01 INFO : Rejecting updates on 127.0.0.1:3308 (old master)
2016/07/11 15:55:01 INFO : Switching master
2016/07/11 15:55:01 INFO : Waiting for candidate Master to synchronize
2016/07/11 15:55:01 DEBUG: Syncing on master GTID Binlog Pos [0-2-73]
2016/07/11 15:55:01 DEBUG: Server:127.0.0.1:3308 Current GTID:0-2-73 Slave GTID:0-3-72 Binlog Pos:0-2-73
2016/07/11 15:55:01 DEBUG: MASTER_POS_WAIT executed.
2016/07/11 15:55:01 DEBUG: Server:127.0.0.1:3309 Current GTID:0-2-73 Slave GTID:0-2-73 Binlog Pos:0-3-72
2016/07/11 15:55:01 INFO : Stopping slave thread on new master
2016/07/11 15:55:01 INFO : Resetting slave on new master and set read/write mode on
2016/07/11 15:55:01 INFO : Switching old master as a slave
2016/07/11 15:55:01 INFO : Switching other slaves to the new master
2016/07/11 15:55:01 INFO : Master switch on 127.0.0.1:3309 complete
wfong@willcn:~$ mysqladmin -h127.0.0.1 -P3307 ping
mysqld is alive
wfong@willcn:~$

Switchover and Failover help messages are the same

The differences between "switchover" and "failover" should be defined in the help messages more clearly. Currently, they are the same:

wfong@willcn:~$ diff -u <(replication-manager switchover --help 2>&1) <(replication-manager failover --help 2>&1)
--- /dev/fd/63 2016-07-11 15:17:54.952259345 +0800
+++ /dev/fd/62 2016-07-11 15:17:54.952259345 +0800
@@ -1,7 +1,7 @@
 Trigger failover on a dead master by promoting a slave.
 
 Usage:
-  replication-manager switchover [flags]
+  replication-manager failover [flags]
 
 Flags:
       --connect-timeout int   Database connection timeout in seconds (default 5)
wfong@willcn:~$

Replication manager does not support named connections

I cannot use this tool because SHOW SLAVE STATUS returns no information on my slave, as replication is configured using a named connection; it should instead do SHOW SLAVE 'foobar' STATUS.

Instead of adding a parameter to set the connection name, I would suggest adding an argument that allows setting arbitrary variables on the session, like I can do with pt-table-checksum. Then I could do --set-vars default_master_connection=foobar and the tool would be able to work with my setup (a sketch follows below).

Before you tell me that multi-source replication is not supported by this tool, I would like to clarify my setup:

I have 5 MariaDB servers: B replicates from A, D replicates from C, and E replicates from A and C. I would never include D in a switchover event; it is a backup server we use only for taking snapshots and data indexing. A switchover event would always be between A and B or between C and D. I had to rename the replication connections on those servers to match the names configured on D, otherwise I could not use pt-table-checksum to monitor the data on my servers.
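A minimal sketch of the suggested --set-vars behavior, applying name=value pairs as session variables before the monitor queries SHOW SLAVE STATUS; all names are made up, and a real implementation would whitelist the variable names, escape the values, and pin the setting to a single connection (or put it in the DSN), because database/sql pools connections:

package repman // hypothetical package name, for illustration only

import (
    "database/sql"
    "fmt"
    "strings"
)

// applySetVars applies "name=value" pairs as session variables, so that a
// later SHOW SLAVE STATUS reports the named connection (for example
// default_master_connection=foobar).
func applySetVars(db *sql.DB, setVars []string) error {
    for _, kv := range setVars {
        parts := strings.SplitN(kv, "=", 2)
        if len(parts) != 2 {
            return fmt.Errorf("invalid --set-vars entry: %s", kv)
        }
        // Values are inlined for brevity; real code should quote/escape them.
        stmt := fmt.Sprintf("SET SESSION %s = '%s'", parts[0], parts[1])
        if _, err := db.Exec(stmt); err != nil {
            return err
        }
    }
    return nil
}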

version 0.60 ERROR: Could not autodetect a master!

Hello.

I have a problem with the brand-new 0.60 version. Here is what I try:

/usr/local/bin/replication-manager --version --hosts=10.4.129.11,10.4.129.12 --user=replmgr:pass --rpluser=repl:pass --failover=monitor

That gives:

MariaDB Replication Manager version 0.6.0-9450672
2016/03/11 21:02:21 ERROR: Could not autodetect a master!

With 0.52 this works just fine. Maxscale looks good too.

maxscale root # maxadmin -pmariadb "show servers"
Server 0x7f8be0cae3a0 (server1)
        Server:                              10.4.129.11
        Status:                              Master, Running
        Protocol:                    MySQLBackend
        Port:                                3306
        Server Version:                 10.0.23-MariaDB-log
        Node Id:                     4134630
        Master Id:                   -1
        Slave Ids:                   4134631
        Repl Depth:                  0
        Number of connections:               28
        Current no. of conns:                0
        Current no. of operations:   0
Server 0x7f8be0cae2c0 (server2)
        Server:                              10.4.129.12
        Status:                              Slave, Running
        Protocol:                    MySQLBackend
        Port:                                3306
        Server Version:                 10.0.23-MariaDB-log
        Node Id:                     4134631
        Master Id:                   4134630
        Slave Ids:                   
        Repl Depth:                  1
        Number of connections:               28
        Current no. of conns:                0
        Current no. of operations:   0

Any hints?

thanks and cheers
t.

Both failover and switchover fail

Hi,

A few days ago, I tested MRM 0.6.3 with failover and switchover.
When I want to do a failover via MaxScale, MRM shows the message "could not autodetect a failed master", although I'm sure the master is down and the slaves are live.

When I want to do a switchover, MRM shows the message "no suitable candidates found", even though I used the --prefmaster option to choose the new master.

I want to reproduce the two issues, but I can't reproduce them again.

Could you tell me what problem could cause these two issues?
Thank you.

--ignore-servers is not honored after failover took place once

Please be aware I am running a small hack, https://github.com/michaeldg/replication-manager/blob/master/misc.go:49, getSeqFromGtid always returns 1. This is to work around issue #30.

The first failover I executed worked perfectly, the comamnd I used was:
./go/bin/replication-manager --hosts=a-oltp01,a-oltp02,b-oltp01,b-oltp02 --user=replication_manager:root --rpluser=slave_user:repl --interactive --switchover=keep --readonly=false --ignore-servers=b-oltp01,b-oltp02 --gtidcheck=0

The first failover was (as expected) to a-oltp02. The second failover was, incorrectly, to b-oltp01.

--ignore-servers is checked after doing sanity checks

Although issue #31 is resolved, I do think there is a minor improvement to be made. The output of the failover command shows errors for servers that are ignored for master election. This alarmed me a bit; it happens because these sanity checks are executed before the ignored servers are skipped. (monitor.go:261)

I think it makes sense to first skip a server if it is ignored, and only then continue with the sanity checks.

Error when splitting GTID when failing over

Master Host   Port                              Current GTID      Binlog Position  Strict Mode
 a-db2        3306                           0-203-4,1-11-13      0-203-4,1-11-13          OFF

 Slave Host   Port  Binlog   Using GTID             Current GTID               Slave GTID   Replication Health  Delay  RO
 a-db1        3306      ON    Slave_Pos          0-203-4,1-11-13          0-203-4,1-11-13           Running OK      0  ON
 b-db1        3306      ON    Slave_Pos  0-203-4,1-11-13,2-21-18          0-203-4,1-11-13           Running OK      0 OFF
 b-db2        3306      ON    Slave_Pos  0-203-4,1-11-13,2-21-18  0-203-4,1-11-13,2-21-18           Running OK      0 OFF

Ctrl-Q to quit, Ctrl-S to switchover

2016-06-10 16:48:27 INFO : Electing a new master
2016-06-10 16:48:27 INFO : Checking long running updates on master
2016-06-10 16:48:27 INFO : Flushing tables on a-db2 (master)
2016-06-10 16:48:27 INFO : Starting master switch
2016-06-10 16:48:25 Monitor started in switchover mode
2016/06/10 16:48:27 Error splitting GTID: 0-203-4,1-11-13
mariadb@maxscale01:~$ ^C

The switchover does not work

Hi,
I am testing the MRM 0.52 version.
When I want to do a switchover in my lab, I get an error message: "ERROR: Could not autodetect a master!"
My mrm command is "/mariadb/software/mrm/mariadb-repmgr -hosts=mdb1:3306,mdb2:3306,mdb3:3306 -user=admin:admin -rpluser=rep:rep -prefmaster=mdb2:3306 -switchover=keep"

Is my mrm command wrong?
If I want to switch over, how should I use the mrm options?

Keep state of the last failover in non-interactive mode

If MRM is used in interactive mode, there is a limit on the number of failovers that it does, and this number is obviously kept somewhere inside the application as state. However, when using MRM with MaxScale, it is used for a single failover and exits after performing the task, and thus it doesn't keep state. This could lead to flapping issues.

If you were able to keep the state of the previous failover(s), along with a timestamp of the last failover, you could add this protection to single executions as well.
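A minimal sketch of persisting that state across single-shot runs: a small JSON file holding a failover counter and the last failover timestamp, which the next invocation can consult before acting; the file layout and all names are made up:

package repman // hypothetical package name, for illustration only

import (
    "encoding/json"
    "os"
    "time"
)

// FailoverState is what a single-shot invocation would persist to disk so the
// next run knows how recently, and how often, a failover already happened.
type FailoverState struct {
    Count        int       `json:"count"`
    LastFailover time.Time `json:"last_failover"`
}

// loadState returns the previous state, or a zero state if none exists yet.
func loadState(path string) (FailoverState, error) {
    var s FailoverState
    data, err := os.ReadFile(path)
    if os.IsNotExist(err) {
        return s, nil
    }
    if err != nil {
        return s, err
    }
    if err := json.Unmarshal(data, &s); err != nil {
        return s, err
    }
    return s, nil
}

// saveState records a failover that has just been performed.
func saveState(path string, s FailoverState) error {
    s.Count++
    s.LastFailover = time.Now()
    data, err := json.Marshal(s)
    if err != nil {
        return err
    }
    return os.WriteFile(path, data, 0644)
}

// tooSoon is the anti-flapping guard: refuse another failover if the previous
// one happened less than minInterval ago.
func tooSoon(s FailoverState, minInterval time.Duration) bool {
    return !s.LastFailover.IsZero() && time.Since(s.LastFailover) < minInterval
}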

The failover.sh script does not work with the MRM 0.52 version

Hi,
I am testing the MRM 0.52 version.
I am using the failover.sh script for failover (https://mariadb.com/blog/mariadb-automatic-failover-maxscale-and-mariadb-replication-manager), and I also changed the failover option to 'force'.
When the master goes down, I get the message: "ERROR: None of the switchover or failover modes are set."
Then I printed the cmd command; the command can't get the $nodelist variable.

How should I fix my issue?
Thank you.

Could not autodetect a failed master

Hi ,
When I use the commands below, I get the error message "Could not autodetect a failed master!":
./mariadb-repmgr -user dba:dba -rpluser rep:rep -hosts db1:3307,db2:3307,db3:3307 -prefmaster=db2:3307 -failover=force
or
./mariadb-repmgr -user dba:dba -rpluser rep:rep -hosts db1:3307,db2:3307,db3:3307 -prefmaster=db2:3307 -switchover=keep

I can execute the first command on version 0.4.1, but the error occurs on version 0.5.1.

Add more checks for thread_handling=pool-of-threads

Add an option:

reject-write=max_connection | read_only (default: lower max connections to 1)

If max_connection is set, then:
thread_handling=pool-of-threads must be set to pass the failover check
extra-port= is checked to be the same as the MRM connection port

MaxScale task:
https://jira.mariadb.org/browse/MXS-778

In read_only mode, if MXS-778 is implemented, the master freeze in monitor.go would need to wait for at least 2 times MaxScale's default monitoring interval before entering the kill-long-running-transactions loop.

Go error during failover: slice bounds out of range

I'm testing MRM at the moment and I'm getting the following error when failing over manually:

/usr/local/bin/replication-manager --user=root:pass --rpluser=repl:replpass --hosts=10.10.18.11:3306,10.10.18.12:3306,10.10.18.13:3306 --failover=force --interactive=false --logfile=/var/log/failover.log
2016/03/16 11:02:39 INFO : Server 10.10.18.11:3306 is dead.
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
main.(*TermLog).Add(0x877670, 0xc20802c300, 0x33)
/home/vagrant/replication-manager/termlog.go:19 +0x37b
main.logprint(0xc208075900, 0x1, 0x1)
/home/vagrant/replication-manager/display.go:99 +0x502
main.masterFailover(0x7fff109ac801)
/home/vagrant/replication-manager/failover.go:12 +0x10f
main.main()
/home/vagrant/replication-manager/repmgr.go:285 +0x210c

goroutine 5 [syscall]:
os/signal.loop()
/usr/lib/golang/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/lib/golang/src/os/signal/signal_unix.go:27 +0x35

goroutine 6 [chan receive]:
database/sql.(*DB).connectionOpener(0xc208034140)
/usr/lib/golang/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/lib/golang/src/database/sql/sql.go:452 +0x31c

goroutine 8 [runnable]:
database/sql.(*DB).connectionOpener(0xc2080341e0)
/usr/lib/golang/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/lib/golang/src/database/sql/sql.go:452 +0x31c

goroutine 9 [runnable]:
database/sql.(*DB).connectionOpener(0xc208034000)
/usr/lib/golang/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
/usr/lib/golang/src/database/sql/sql.go:452 +0x31c

I tried to resolve it, but Go isn't one of the languages I use. I saw you didn't touch the termlog.go file in the past 5 months, but did change the logging behavior in display.go. It seems to be a problem with something exceeding the length of the TermLog.

I also had a different issue with the MaxScale/MRM failover: MaxScale only seems to send the list of nodes that are up in the $NODESLIST variable. This used to be the complete list of nodes, and I don't know whether this behavior changed in 1.3.0.
