I have configured a scheme with one master, one standby and one witness. repmgrd i

Automatic failover doesn't work about repmgr HOT 19 CLOSED

enterprisedb commented on September 16, 2024

Automatic failover doesn't work

from repmgr.

Comments (19)

ckruse commented on September 16, 2024

What version of repmgr are you using? RC1? What exactly happens after the last log message you posted, did repmgrd exit or is this message still repeating? What is your configuration?

The background to my questions is that several tunable options affect how long repmgrd waits before it performs the autofailover. With default settings, autofailover happens after roughly 30 minutes. Based on your log, the PostgreSQL server was killed only about six minutes ago, so the autofailover may simply not have happened yet.

The configuration options to look at are these:

retry_promote_interval_secs=300
master_response_timeout=60

# How many time we try to reconnect to master before starting failover procedure
reconnect_attempts=6
reconnect_interval=10
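
As a rough illustration of how the last two options combine, here is a minimal sketch. It assumes that repmgrd retries the master connection reconnect_attempts times and sleeps reconnect_interval seconds between tries; that is an assumption about its behaviour, not something taken from its source. The function name is hypothetical. With the values above, the result matches the "60 seconds before failover decision" warning that appears later in this thread.

```python
def seconds_before_failover_decision(reconnect_attempts, reconnect_interval, retries_done=0):
    """Estimate the remaining wait (in seconds) before a failover decision.

    Assumption: one sleep of `reconnect_interval` seconds per remaining retry.
    """
    remaining_retries = max(reconnect_attempts - retries_done, 0)
    return reconnect_interval * remaining_retries


# Values from the configuration snippet above
print(seconds_before_failover_decision(reconnect_attempts=6, reconnect_interval=10))
```

This estimate ignores the time each connection attempt itself takes (for example connect_timeout in conninfo), so the real delay can be longer.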

lexqt commented on September 16, 2024

What version of repmgr are you using

Latest (from git, master branch, commit @c3b58658ad856392edb4036adc3d023fbcaa52b0)

What exactly happens after the last log message you posted, did repmgrd exit or is this message still repeating?

Exit. Then upstart tries to respawn it but it dies again.

What is your configuration?

Master configuration (the standby's is the same except for the values of node, node_name and conninfo).

# Cluster
#################

cluster=cluster0

# Node
#################

node=1
node_name=srv4

# Connection
#################

conninfo='host=srv4 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
rsync_options='--archive --checksum --compress --progress --rsh="ssh -o \"StrictHostKeyChecking no\""'
ssh_options='-o "StrictHostKeyChecking no"'

# How many seconds we wait for master response before declaring master failure
master_response_timeout=60

# How many time we try to reconnect to master before starting failover procedure
reconnect_attempts=6
reconnect_interval=10

# Failover
#################

failover=automatic
monitor_interval_secs=2
priority=0
promote_command='/var/lib/postgresql/promote_postgres_master.sh'
follow_command='repmgr -f /var/lib/postgresql/repmgr/repmgr.conf -w 640 standby follow'

# only for manual failover
retry_promote_interval_secs=300

# Logging
#################

loglevel=NOTICE
logfacility=STDERR
logfile=/var/log/postgresql/repmgrd.log

# PostgreSQL
#################

pg_bindir=/usr/lib/postgresql/9.3/bin
pg_ctl_options='-l /var/log/postgresql/postgresql-9.3-main.log -o "--config-file=/etc/postgresql/9.3/main/postgresql.conf"'

Jaime2ndQuadrant commented on September 16, 2024

The problem you are experiencing seems to occur because automatic failover is no longer the default action. Please check repmgr.conf on the standby to see the value of the failover setting.

lexqt commented on September 16, 2024

But I have the same config on all servers (except for values of node, node_name and conninfo). And failover is automatic (as shown above).
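
A claim like this can be checked mechanically. The following is a minimal sketch, assuming repmgr.conf is a flat list of key=value lines with # comments and optional single quotes around values (a simplification of the real format). The helper names are hypothetical, and the standby values (node=2, node_name=srv14) are made up for illustration.

```python
def parse_repmgr_conf(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only: conninfo values contain '=' themselves
        key, sep, value = line.partition("=")
        if not sep:
            continue
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def differing_keys(a, b):
    """Return the keys whose values differ, or that exist on one side only."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


# Keys that are expected to differ between nodes
EXPECTED = {"node", "node_name", "conninfo"}

master = parse_repmgr_conf("""
cluster=cluster0
node=1
node_name=srv4
conninfo='host=srv4 user=repmgr dbname=repmgr'
failover=automatic
""")

standby = parse_repmgr_conf("""
cluster=cluster0
node=2
node_name=srv14
conninfo='host=srv14 user=repmgr dbname=repmgr'
failover=automatic
""")

unexpected = differing_keys(master, standby) - EXPECTED
print(sorted(unexpected))
```

An empty result means the two files differ only in the per-node settings; any other key printed (for example failover) would be worth a closer look.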

Jaime2ndQuadrant commented on September 16, 2024

Did you check the repmgr.conf? My comment is based on the log you showed us:
"""
[2014-02-17 15:06:34] [INFO] repmgrd Connecting to database 'host=srv14 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 15:06:34] [INFO] repmgrd Connected to database, checking its state
[2014-02-17 15:06:34] [INFO] repmgrd Connecting to primary for cluster 'cluster0'
[2014-02-17 15:06:34] [INFO] finding node list for cluster 'cluster0'
[2014-02-17 15:06:34] [INFO] checking role of cluster node 'host=srv11 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
"""
These messages come from getMasterConnection(), and the only place where it is called after a failure is detected is repmgrd.c:560, which only happens if (local_options.failover == MANUAL_FAILOVER).

btw, these are the logs from the standby, right?

lexqt commented on September 16, 2024

Did you check the repmgr.conf?

Yes.

btw, these are the logs from the standby, right?

Yes.

repmgrd.c:560, which only happens if (local_options.failover == MANUAL_FAILOVER).

But before repmgrd reaches line 560, line 557 (log_err(_("We couldn't reconnect to master. Now checking if another node has been promoted.\n"));) must be executed too. However, there is no such error message in the logs.

lexqt commented on September 16, 2024

I've added log_debug at the beginning of some functions and set loglevel to DEBUG.

More verbose log:

[2014-02-17 21:54:26] [DEBUG] StandbyMonitor
[2014-02-17 21:54:26] [DEBUG] CheckConnection
[2014-02-17 21:54:26] [DEBUG] is_pgup
[2014-02-17 21:54:26] [DEBUG] CancelQuery
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] CheckConnection
[2014-02-17 21:54:27] [DEBUG] is_pgup
[2014-02-17 21:54:27] [DEBUG] CancelQuery
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] CancelQuery
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused

[2014-02-17 21:54:29] [DEBUG] StandbyMonitor
[2014-02-17 21:54:29] [DEBUG] CheckConnection
[2014-02-17 21:54:29] [DEBUG] is_pgup
[2014-02-17 21:54:29] [DEBUG] CancelQuery
[2014-02-17 21:54:29] [DEBUG] wait_connection_availability
[2014-02-17 21:54:29] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused

[2014-02-17 21:54:29] [INFO] repmgrd Connecting to database 'host=srv14 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 21:54:29] [DEBUG] establishDBConnection
[2014-02-17 21:54:29] [INFO] repmgrd Connected to database, checking its state
[2014-02-17 21:54:29] [INFO] repmgrd Connecting to primary for cluster 'cluster0'
[2014-02-17 21:54:29] [DEBUG] getMasterConnection
[2014-02-17 21:54:29] [INFO] finding node list for cluster 'cluster0'
[2014-02-17 21:54:29] [INFO] checking role of cluster node 'host=srv11 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 21:54:29] [DEBUG] establishDBConnection
[2014-02-17 21:54:29] [ERROR] Connection to database failed: could not connect to server: Connection refused
        Is the server running on host "srv11" (10.129.235.24) and accepting
        TCP/IP connections on port 5432?
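
A log like this can be scanned for the last function marker printed before repmgrd starts up again. This is a minimal sketch, assuming that the "repmgrd Connecting to database" line marks a fresh start of the daemon (an interpretation that fits the earlier remark that upstart respawns it, but it is not confirmed here). The function names are hypothetical and the conninfo in the sample is shortened.

```python
import re

# Matches lines such as: [2014-02-17 21:54:29] [DEBUG] is_pgup
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] (?P<msg>.*)$")


def parse_log(text):
    """Return (timestamp, level, message) tuples for lines in the log format."""
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m:
            entries.append((m["ts"], m["level"], m["msg"]))
    return entries


def markers_before_restart(entries, restart_prefix="repmgrd Connecting to database"):
    """Collect DEBUG messages seen before the first assumed restart line."""
    markers = []
    for _ts, level, msg in entries:
        if level == "INFO" and msg.startswith(restart_prefix):
            break
        if level == "DEBUG":
            markers.append(msg)
    return markers


sample = """\
[2014-02-17 21:54:29] [DEBUG] StandbyMonitor
[2014-02-17 21:54:29] [DEBUG] CheckConnection
[2014-02-17 21:54:29] [DEBUG] is_pgup
[2014-02-17 21:54:29] [DEBUG] CancelQuery
[2014-02-17 21:54:29] [DEBUG] wait_connection_availability
[2014-02-17 21:54:29] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2014-02-17 21:54:29] [INFO] repmgrd Connecting to database 'host=srv14 user=repmgr dbname=repmgr'
[2014-02-17 21:54:29] [DEBUG] establishDBConnection
"""

markers = markers_before_restart(parse_log(sample))
print(markers[-1] if markers else None)
```

The last marker only shows the most recently entered function that logs one; the process may have stopped in a caller after that function returned.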

ckruse commented on September 16, 2024

This really is interesting since I can't reproduce that…

ckruse commented on September 16, 2024

Can you start repmgrd with -p repmgrd.pid and tell us if the repmgr PID still exists after it exits? And, if yes, can you send me a coredump file to [email protected]?

lexqt commented on September 16, 2024

postgres@srv14:/tmp/test1$ gdb --args /usr/bin/repmgrd -f /var/lib/postgresql/repmgr/repmgr.conf --monitoring-history --verbose -p repmgrd.pid
...
(gdb) run
Starting program: /usr/bin/repmgrd -f /var/lib/postgresql/repmgr/repmgr.conf --monitoring-history --verbose -p repmgrd.pid
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
(gdb) generate-core-file

Check your mail for coredump file.

ckruse commented on September 16, 2024

Hm, can you recompile with CFLAGS="-O0 -g3 -fno-omit-frame-pointer"? This core dump is not really helpful:

(gdb) bt
#0  0x00007ffff7bba288 in ?? ()
#1  0x00007ffff7ba6180 in ?? ()
#2  0x00007ffff7865dc0 in ?? ()
#3  0x0000000000000000 in ?? ()

ckruse commented on September 16, 2024

Ok, after a talk with Andres I realized that the issue with the core dump may be a lack of symbols on my system. Can you do the backtrace locally and post it here? Just type bt in gdb.

ckruse commented on September 16, 2024

I also pushed two new commits: one fixes a bug (it shouldn't be yours), and the other adds two more log messages so that problems can be distinguished more easily.

lexqt commented on September 16, 2024

I recompiled (without the two latest commits) with CFLAGS="-O0 -g3 -fno-omit-frame-pointer" and... no crash:

[2014-02-18 17:09:42] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused

[2014-02-18 17:09:42] [WARNING] repmgrd: Connection to master has been lost, trying to recover... 60 seconds before failover decision

Then I recompiled the latest version as usual, without extra CFLAGS. Crash:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
(gdb) bt
#0  0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
#1  0x00007ffff7bba59d in PQreset () from /usr/lib/libpq.so.5
#2  0x0000555555558296 in is_pgup (conn=0x5555557842d0, timeout=60) at dbutils.c:165
#3  0x0000555555559768 in CheckConnection (conn=0x5555557842d0, type=0x55555555f2c7 "master") at repmgrd.c:1080
#4  0x000055555555ba3a in StandbyMonitor () at repmgrd.c:546
#5  0x0000555555556dce in main (argc=<optimized out>, argv=<optimized out>) at repmgrd.c:410

ckruse commented on September 16, 2024

That's interesting. Can you install debug symbols for libpq and recompile with CFLAGS="-O2 -g3 -fno-omit-frame-pointer" and retry?

lexqt commented on September 16, 2024

Result:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in connectDBStart (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:1362
1362    /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c: No such file or directory.
(gdb) bt
#0  0x00007ffff7bba288 in connectDBStart (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:1362
#1  0x00007ffff7bba59d in PQreset (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:2983
#2  0x0000000000403abf in is_pgup (conn=0x62f2d0, timeout=60) at dbutils.c:165
#3  0x0000000000404f36 in CheckConnection (conn=0x62f2d0, type=0x40a697 "master") at repmgrd.c:1080
#4  0x0000000000406f64 in StandbyMonitor () at repmgrd.c:546
#5  0x00000000004026ad in main (argc=<optimized out>, argv=<optimized out>) at repmgrd.c:410

ckruse commented on September 16, 2024

Thank you very much for your help. I think I found the bug (commit f080792). Can you please test with current HEAD?

lexqt commented on September 16, 2024

Yes. Now it's ok. Thanks!

ckruse commented on September 16, 2024

Great! Thanks, once again, for your help!
