Comments (19)
What version of repmgr are you using? RC1? What exactly happens after the last log message you posted, did repmgrd exit or is this message still repeating? What is your configuration?
The background to my questions is that there are several tunable options that affect how long repmgrd waits before it performs the autofailover. With default settings, autofailover happens after roughly 30 minutes. Based on your log, the PostgreSQL server was killed only about six minutes ago, so it may well be that the autofailover simply hasn't happened yet.
The configuration options to look at are these:
retry_promote_interval_secs=300
master_response_timeout=60
# How many times we try to reconnect to master before starting failover procedure
reconnect_attempts=6
reconnect_interval=10
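To make the timing argument concrete, here is a rough sketch of the arithmetic in Python. How repmgrd actually combines these values is an assumption on my part, not something taken from the source: the reconnect window is assumed to be reconnect_attempts times reconnect_interval (which matches the "60 seconds before failover decision" warning that shows up later in this thread), and ASSUMED_PROMOTE_RETRIES is a hypothetical retry count picked only because it reproduces the "about 30 minutes" figure.

```python
# Sketch only: the way these settings are combined is assumed,
# not taken from the repmgr source.
settings = {
    "retry_promote_interval_secs": 300,
    "master_response_timeout": 60,
    "reconnect_attempts": 6,
    "reconnect_interval": 10,
}

# Assumed: one reconnect_interval sleep per reconnect attempt.
reconnect_window_secs = (
    settings["reconnect_attempts"] * settings["reconnect_interval"]
)

# Hypothetical retry count, chosen to match "about 30 minutes".
ASSUMED_PROMOTE_RETRIES = 6
promote_wait_secs = (
    ASSUMED_PROMOTE_RETRIES * settings["retry_promote_interval_secs"]
)

print(f"reconnect window: {reconnect_window_secs} s")
print(f"promote wait: {promote_wait_secs // 60} min")
```

If the real retry count differs, only ASSUMED_PROMOTE_RETRIES needs to change.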
from repmgr.
What version of repmgr are you using
Latest (from git, master branch, commit @c3b58658ad856392edb4036adc3d023fbcaa52b0)
What exactly happens after the last log message you posted, did repmgrd exit or is this message still repeating?
Exit. Then upstart tries to respawn it but it dies again.
What is your configuration?
Master configuration (the standby is the same but with different node, node_name and conninfo values).
# Cluster
#################
cluster=cluster0
# Node
#################
node=1
node_name=srv4
# Connection
#################
conninfo='host=srv4 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
rsync_options='--archive --checksum --compress --progress --rsh="ssh -o \"StrictHostKeyChecking no\""'
ssh_options='-o "StrictHostKeyChecking no"'
# How many seconds we wait for master response before declaring master failure
master_response_timeout=60
# How many times we try to reconnect to master before starting failover procedure
reconnect_attempts=6
reconnect_interval=10
# Failover
#################
failover=automatic
monitor_interval_secs=2
priority=0
promote_command='/var/lib/postgresql/promote_postgres_master.sh'
follow_command='repmgr -f /var/lib/postgresql/repmgr/repmgr.conf -w 640 standby follow'
# only for manual failover
retry_promote_interval_secs=300
# Logging
#################
loglevel=NOTICE
logfacility=STDERR
logfile=/var/log/postgresql/repmgrd.log
# PostgreSQL
#################
pg_bindir=/usr/lib/postgresql/9.3/bin
pg_ctl_options='-l /var/log/postgresql/postgresql-9.3-main.log -o "--config-file=/etc/postgresql/9.3/main/postgresql.conf"'
from repmgr.
The problem you are experiencing seems to be caused by automatic failover no longer being the default action. Please check the value of the failover setting in repmgr.conf on the standby.
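A quick way to check this on each node is to read the file and print the failover value. The snippet below is a minimal sketch in Python, not repmgr's own parser: parse_repmgr_conf is a hypothetical helper that only handles plain key=value lines, '#' comments and single-quoted values, and the sample text is a shortened stand-in for a real repmgr.conf.

```python
def parse_repmgr_conf(text):
    """Parse simple key=value lines; skip blank lines and '#' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        value = value.strip()
        # Drop one pair of surrounding single quotes, as in conninfo='...'
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]
        options[key.strip()] = value
    return options


# Shortened stand-in for a real repmgr.conf.
sample = """\
# Failover
failover=automatic
priority=0
conninfo='host=srv4 user=repmgr dbname=repmgr connect_timeout=10'
"""

options = parse_repmgr_conf(sample)
failover = options.get("failover", "<unset>")
print(f"failover={failover}")
```

If this prints `<unset>` on a node, the setting is missing from that node's file and repmgrd would fall back to its default.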
from repmgr.
But I have the same config on all servers (except for the values of node, node_name and conninfo). And failover is automatic (as shown above).
from repmgr.
Did you check the repmgr.conf? My comment is based on the log you showed us:
"""
[2014-02-17 15:06:34] [INFO] repmgrd Connecting to database 'host=srv14 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 15:06:34] [INFO] repmgrd Connected to database, checking its state
[2014-02-17 15:06:34] [INFO] repmgrd Connecting to primary for cluster 'cluster0'
[2014-02-17 15:06:34] [INFO] finding node list for cluster 'cluster0'
[2014-02-17 15:06:34] [INFO] checking role of cluster node 'host=srv11 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
"""
These messages come from getMasterConnection(), and the only place it is called after a failure is detected is repmgrd.c:560, which is reached only if (local_options.failover == MANUAL_FAILOVER).
Btw, these are the logs from the standby, right?
from repmgr.
Did you check the repmgr.conf?
Yes.
Btw, these are the logs from the standby, right?
Yes.
repmgrd.c:560, which only happens if (local_options.failover == MANUAL_FAILOVER).
But before repmgrd reaches line 560, line 557 (log_err(_("We couldn't reconnect to master. Now checking if another node has been promoted.\n"));) must be executed too. However, there is no such error message in the logs.
from repmgr.
I've added log_debug at the beginning of some functions and set loglevel to DEBUG.
More verbose log:
[2014-02-17 21:54:26] [DEBUG] StandbyMonitor
[2014-02-17 21:54:26] [DEBUG] CheckConnection
[2014-02-17 21:54:26] [DEBUG] is_pgup
[2014-02-17 21:54:26] [DEBUG] CancelQuery
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:26] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] CheckConnection
[2014-02-17 21:54:27] [DEBUG] is_pgup
[2014-02-17 21:54:27] [DEBUG] CancelQuery
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [DEBUG] CancelQuery
[2014-02-17 21:54:27] [DEBUG] wait_connection_availability
[2014-02-17 21:54:27] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2014-02-17 21:54:29] [DEBUG] StandbyMonitor
[2014-02-17 21:54:29] [DEBUG] CheckConnection
[2014-02-17 21:54:29] [DEBUG] is_pgup
[2014-02-17 21:54:29] [DEBUG] CancelQuery
[2014-02-17 21:54:29] [DEBUG] wait_connection_availability
[2014-02-17 21:54:29] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2014-02-17 21:54:29] [INFO] repmgrd Connecting to database 'host=srv14 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 21:54:29] [DEBUG] establishDBConnection
[2014-02-17 21:54:29] [INFO] repmgrd Connected to database, checking its state
[2014-02-17 21:54:29] [INFO] repmgrd Connecting to primary for cluster 'cluster0'
[2014-02-17 21:54:29] [DEBUG] getMasterConnection
[2014-02-17 21:54:29] [INFO] finding node list for cluster 'cluster0'
[2014-02-17 21:54:29] [INFO] checking role of cluster node 'host=srv11 user=repmgr dbname=repmgr connect_timeout=10 keepalives_idle=10 keepalives_interval=10 keepalives_count=6'
[2014-02-17 21:54:29] [DEBUG] establishDBConnection
[2014-02-17 21:54:29] [ERROR] Connection to database failed: could not connect to server: Connection refused
Is the server running on host "srv11" (10.129.235.24) and accepting
TCP/IP connections on port 5432?
from repmgr.
This really is interesting since I can't reproduce that…
from repmgr.
Can you start repmgrd with -p repmgrd.pid and tell us if the repmgr PID still exists after it exits? And, if yes, can you send me a coredump file to [email protected]?
from repmgr.
postgres@srv14:/tmp/test1$ gdb --args /usr/bin/repmgrd -f /var/lib/postgresql/repmgr/repmgr.conf --monitoring-history --verbose -p repmgrd.pid
...
(gdb) run
Starting program: /usr/bin/repmgrd -f /var/lib/postgresql/repmgr/repmgr.conf --monitoring-history --verbose -p repmgrd.pid
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
(gdb) generate-core-file
Check your mail for coredump file.
from repmgr.
Hm, can you recompile with CFLAGS="-O0 -g3 -fno-omit-frame-pointer"? This core dump is not really helpful:
(gdb) bt
#0 0x00007ffff7bba288 in ?? ()
#1 0x00007ffff7ba6180 in ?? ()
#2 0x00007ffff7865dc0 in ?? ()
#3 0x0000000000000000 in ?? ()
from repmgr.
Ok, after a talk with Andres I realized that the issue with the core dump may be a lack of symbols on my system. Can you do the backtrace locally and post it here? Just type bt in gdb.
from repmgr.
I also pushed two new commits: one fixing a bug (shouldn't be yours) and another adding two more log messages so that problems are easier to tell apart.
from repmgr.
I recompiled (without the two latest commits) with CFLAGS="-O0 -g3 -fno-omit-frame-pointer" and... no crash:
[2014-02-18 17:09:42] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2014-02-18 17:09:42] [WARNING] repmgrd: Connection to master has been lost, trying to recover... 60 seconds before failover decision
Then I recompiled the latest version as usual, without extra CFLAGS. Crash:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
(gdb) bt
#0 0x00007ffff7bba288 in ?? () from /usr/lib/libpq.so.5
#1 0x00007ffff7bba59d in PQreset () from /usr/lib/libpq.so.5
#2 0x0000555555558296 in is_pgup (conn=0x5555557842d0, timeout=60) at dbutils.c:165
#3 0x0000555555559768 in CheckConnection (conn=0x5555557842d0, type=0x55555555f2c7 "master") at repmgrd.c:1080
#4 0x000055555555ba3a in StandbyMonitor () at repmgrd.c:546
#5 0x0000555555556dce in main (argc=<optimized out>, argv=<optimized out>) at repmgrd.c:410
from repmgr.
That's interesting. Can you install debug symbols for libpq and recompile with CFLAGS="-O2 -g3 -fno-omit-frame-pointer" and retry?
from repmgr.
Result:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bba288 in connectDBStart (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:1362
1362 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c: No such file or directory.
(gdb) bt
#0 0x00007ffff7bba288 in connectDBStart (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:1362
#1 0x00007ffff7bba59d in PQreset (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:2983
#2 0x0000000000403abf in is_pgup (conn=0x62f2d0, timeout=60) at dbutils.c:165
#3 0x0000000000404f36 in CheckConnection (conn=0x62f2d0, type=0x40a697 "master") at repmgrd.c:1080
#4 0x0000000000406f64 in StandbyMonitor () at repmgrd.c:546
#5 0x00000000004026ad in main (argc=<optimized out>, argv=<optimized out>) at repmgrd.c:410
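For anyone reading along, the call chain can be pulled out of this backtrace with a few lines of Python. This is only a reading aid: FRAME_RE is a hypothetical pattern that handles frames of the form shown above (with an "at file:line" suffix) and would skip frames without source information, such as the earlier "?? ()" ones.

```python
import os
import re

# Hypothetical pattern for frames like "#N 0xADDR in func (args) at file:line".
FRAME_RE = re.compile(
    r"^#(\d+)\s+0x[0-9a-f]+\s+in\s+(\S+)\s+\(.*\)\s+at\s+(\S+):(\d+)$"
)

backtrace = """\
#0 0x00007ffff7bba288 in connectDBStart (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:1362
#1 0x00007ffff7bba59d in PQreset (conn=0x62f2d0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/interfaces/libpq/fe-connect.c:2983
#2 0x0000000000403abf in is_pgup (conn=0x62f2d0, timeout=60) at dbutils.c:165
#3 0x0000000000404f36 in CheckConnection (conn=0x62f2d0, type=0x40a697 "master") at repmgrd.c:1080
#4 0x0000000000406f64 in StandbyMonitor () at repmgrd.c:546
#5 0x00000000004026ad in main (argc=<optimized out>, argv=<optimized out>) at repmgrd.c:410
"""

frames = []
for line in backtrace.splitlines():
    match = FRAME_RE.match(line)
    if match is None:
        continue  # frame without source info; ignored in this sketch
    number, func, path, lineno = match.groups()
    frames.append((int(number), func, os.path.basename(path), int(lineno)))

# Outermost caller first, crashing function last.
call_chain = [func for _, func, _, _ in reversed(frames)]
print(" -> ".join(call_chain))
```

Read from left to right, the output ends in the two libpq frames, with is_pgup at dbutils.c:165 being the last repmgr frame before the crash.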
from repmgr.
Thank you very much for your help. I think I found the bug (commit f080792). Can you please test with current HEAD?
from repmgr.
Yes. Now it's ok. Thanks!
from repmgr.
Great! Thanks, once again, for your help!
from repmgr.