Hello and thank you for your time!
I tried a rolling upgrade from RonDB-21.04.9 to RonDB-22.10.0, but it failed, even though according to the documentation it ought to work:
https://docs.rondb.com/rondb_upgrade/#upgrading-from-rondb-2104-to-2210
I get an error (last four lines):
2023-07-24 21:36:54 [ndbd] INFO -- DIH reported initial start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x5b13ae]
ndbmtd(ndb_print_stacktrace()+0x45) [0x6371e5]
ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x26) [0x6f7b96]
ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x5f5629]
ndbmtd(Qmgr::execCM_REGCONF(Signal*)+0x599) [0x7e6b69]
ndbmtd() [0x517a25]
ndbmtd() [0x5dd293]
ndbmtd(mt_job_thread_main+0x317) [0x5cd947]
ndbmtd() [0x6372cc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f620c601609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f620ba24293]
2023-07-24 21:36:54 [ndbd] INFO -- incompatible version own=0x160a00 other=0x150409, shutting down
2023-07-24 21:36:54 [ndbd] INFO -- QMGR (Line: 1715) 0x00000006
2023-07-24 21:36:54 [ndbd] INFO -- Error handler shutting down system
2023-07-24 21:36:56 [ndbd] ALERT -- Node 1: Forced node shutdown completed. Occurred during startphase 1. Caused by error 6304: 'Unsupported version(Restart error). Temporary error, restart node'.
First I tried to upgrade from RonDB-21.04.8 directly to RonDB-22.10.0 and got the same "incompatible version/Unsupported version" error as above. Then I figured I probably ought to upgrade to RonDB-21.04.9 first. The upgrade to 21.04.9 worked, but the subsequent upgrade to 22.10.0 failed in the same way.
I followed the documentation; that is, I started from this state:
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=1 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0)
id=2 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0, *)
[ndb_mgmd(MGM)] 2 node(s)
id=65 @94.75.251.146 (RonDB-21.04.9)
id=66 @94.75.251.146 (RonDB-21.04.9)
[mysqld(API)] 5 node(s)
id=67 @94.75.251.146 (RonDB-21.04.9)
id=68 @94.75.251.146 (RonDB-21.04.9)
id=231 @94.75.251.146 (RonDB-21.04.9)
id=232 (not connected, accepting connect from any host)
id=233 (not connected, accepting connect from any host)
Then I first upgraded the management nodes and arrived at:
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=1 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0)
id=2 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0, *)
[ndb_mgmd(MGM)] 2 node(s)
id=65 @94.75.251.146 (RonDB-22.10.0)
id=66 @94.75.251.146 (RonDB-22.10.0)
[mysqld(API)] 5 node(s)
id=67 (not connected, accepting connect from any host)
id=68 (not connected, accepting connect from any host)
id=231 (not connected, accepting connect from any host)
id=232 (not connected, accepting connect from any host)
Then I stopped data node 1 and started a new data node running RonDB-22.10.0 with the --initial flag, but that failed as shown above.
I also noticed that once the two new management nodes are running, I get many warnings and alerts about missed heartbeats:
2023-07-24 21:42:15 [MgmtSrvr] WARNING -- Node 2: Node 66 missed heartbeat 3
2023-07-24 21:42:15 [MgmtSrvr] WARNING -- Node 1: Node 65 missed heartbeat 3
2023-07-24 21:42:15 [MgmtSrvr] WARNING -- Node 1: Node 66 missed heartbeat 3
2023-07-24 21:42:16 [MgmtSrvr] WARNING -- Node 2: Node 65 missed heartbeat 4
2023-07-24 21:42:16 [MgmtSrvr] ALERT -- Node 2: Node 65 declared dead due to missed heartbeat
2023-07-24 21:42:16 [MgmtSrvr] INFO -- Node 1: Communication to Node 65 closed
2023-07-24 21:42:16 [MgmtSrvr] INFO -- Node 2: Communication to Node 65 closed
2023-07-24 21:42:16 [MgmtSrvr] INFO -- Node 2: Lost arbitrator node 65 - process failure [state=6]
2023-07-24 21:42:16 [MgmtSrvr] INFO -- Node 2: President restarts arbitration thread [state=1]
Hm, while writing this ticket I tried to repeat the process, and after I stopped data node 1, node 2 died as well:
2023-07-25 11:09:44 [ndbd] INFO -- Node 1 disconnected in state: 0
2023-07-25 11:09:45 [ndbd] INFO -- Node 1 disconnected in state: 0 - Repeated 2 times
2023-07-25 11:09:45 [ndbd] INFO -- findNeighbours from: 6718 old (left: 1 right: 1) new (65535 65535)
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
2023-07-25 11:09:45 [ndbd] ALERT -- Network partitioning - no arbitrator available
2023-07-25 11:09:45 [ndbd] INFO -- President restarts arbitration thread [state=8]
stack_bottom = 0 thread_stack 0x0
ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x5ae73e]
ndbmtd(ndb_print_stacktrace()+0x45) [0x65a835]
ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x26) [0x6eda06]
ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x61bb69]
ndbmtd(Qmgr::startArbitThread(Signal*)+0x2ab) [0x7b2d9b]
ndbmtd(Qmgr::handleArbitCheck(Signal*)+0x46e) [0x7b344e]
ndbmtd() [0x5fedf4]
ndbmtd() [0x6010b8]
ndbmtd(mt_job_thread_main+0x1e6) [0x6028f6]
ndbmtd() [0x65a93c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f69efb23609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f69eef46293]
2023-07-25 11:09:45 [ndbd] INFO -- Arbitrator decided to shutdown this node
2023-07-25 11:09:45 [ndbd] INFO -- QMGR (Line: 8009) 0x00000002
2023-07-25 11:09:45 [ndbd] INFO -- Error handler shutting down system
2023-07-25 11:09:47 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2023-07-25 11:10:53 [ndbd] INFO -- Angel pid: 653061 started child: 653062
I started both data nodes again with RonDB-21.04.9 and tried once more:
-- RonDB -- Management Client --
ndb_mgm> show
Connected to Management Server at: nl3:1187
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=1 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0)
id=2 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0, *)
[ndb_mgmd(MGM)] 2 node(s)
id=65 @94.75.251.146 (RonDB-22.10.0)
id=66 @94.75.251.146 (RonDB-22.10.0)
[mysqld(API)] 5 node(s)
id=67 (not connected, accepting connect from any host)
id=68 (not connected, accepting connect from any host)
id=231 (not connected, accepting connect from any host)
id=232 (not connected, accepting connect from any host)
id=233 (not connected, accepting connect from any host)
ndb_mgm> 1 stop
This time data node 2 didn't die:
Connected to Management Server at: nl3:1187
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=1 (not connected, accepting connect from nl3)
id=2 @94.75.251.146 (RonDB-21.04.9, Nodegroup: 0, *)
[ndb_mgmd(MGM)] 2 node(s)
id=65 @94.75.251.146 (RonDB-22.10.0)
id=66 @94.75.251.146 (RonDB-22.10.0)
[mysqld(API)] 5 node(s)
id=67 (not connected, accepting connect from any host)
id=68 (not connected, accepting connect from any host)
id=231 (not connected, accepting connect from any host)
id=232 (not connected, accepting connect from any host)
id=233 (not connected, accepting connect from any host)
ndb_mgm>
Then I tried to start a new data node 1 with RonDB-22.10.0:
ndbmtd --ndb-connectstring=nl3:1186,nl3:1187 --ndb-nodeid=1 --initial
But I got the same result:
2023-07-25 11:47:42 [ndbd] INFO -- We are running with 16 LDM workers and 4 REDO log parts. This means that we need to use a mutex to access REDO log parts
2023-07-25 11:47:42 [ndbd] INFO -- Watchdog KillSwitch off.
2023-07-25 11:47:42 [ndbd] INFO -- Starting QMGR phase 1
2023-07-25 11:47:42 [ndbd] INFO -- DIH reported initial start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x5b13ae]
ndbmtd(ndb_print_stacktrace()+0x45) [0x6371e5]
ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x26) [0x6f7b96]
ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x5f5629]
ndbmtd(Qmgr::execCM_REGCONF(Signal*)+0x599) [0x7e6b69]
ndbmtd() [0x517a25]
ndbmtd() [0x5dd293]
ndbmtd(mt_job_thread_main+0x317) [0x5cd947]
ndbmtd() [0x6372cc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f15c0910609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f15bfd33293]
2023-07-25 11:47:42 [ndbd] INFO -- incompatible version own=0x160a00 other=0x150409, shutting down
2023-07-25 11:47:42 [ndbd] INFO -- QMGR (Line: 1715) 0x00000006
2023-07-25 11:47:42 [ndbd] INFO -- Error handler shutting down system
2023-07-25 11:47:44 [ndbd] ALERT -- Node 1: Forced node shutdown completed. Occurred during startphase 1. Caused by error 6304: 'Unsupported version(Restart error). Temporary error, restart node'.
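For reference, the two hex values in the "incompatible version" line appear to be packed version numbers. Assuming the usual NDB packing of (major << 16) | (minor << 8) | build, a small Python sketch decodes them (the function name decode_ndb_version is my own, not part of any RonDB tool):

```python
def decode_ndb_version(packed: int) -> str:
    """Decode a packed version number.

    Assumption: the value is laid out as (major << 16) | (minor << 8) | build,
    one byte per component.
    """
    major = (packed >> 16) & 0xFF
    minor = (packed >> 8) & 0xFF
    build = packed & 0xFF
    # The minor part is zero-padded to match the "21.04.9" naming style.
    return f"{major}.{minor:02d}.{build}"


if __name__ == "__main__":
    # Values copied from the "incompatible version own=... other=..." log line.
    print("own   =", decode_ndb_version(0x160A00))
    print("other =", decode_ndb_version(0x150409))
```

Under that assumption, own is the new 22.10.0 data node and other is the 21.04.9 node that is still running, so the version check itself seems to be rejecting the pair rather than some other misconfiguration.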
Please tell me if I should add more information to the ticket. I'd be glad if you pointed out what I am doing wrong.
Thanks again, Max