Comments (2)
@kbr-scylla this happened with consistent-topology-changes (and rbno).
Can you please look into this?
from scylladb.
Nothing unexpected happened here. The decommission process was interrupted too late, right at the end:
INFO 2024-06-16 06:39:01,130 [shard 0:strm] raft_topology - streaming completed
The topology coordinator had already reached the point of no return: once streaming is completed, we enter a state from which the only way is forward, i.e. finishing the decommission, which causes the node to be banned.
INFO 2024-06-16 06:39:01,134 [shard 0: gms] raft_topology - updating topology state: decommissioning: streaming completed for node 8d337728-7ec3-4376-8363-df83207c3a9b
ERROR 2024-06-16 06:39:01,302 [shard 0:main] rpc - client 127.0.7.4:7000: client connection dropped: recv: Connection reset by peer
ERROR 2024-06-16 06:39:01,303 [shard 0: gms] rpc - client 127.0.7.4:7000: client connection dropped: recv: Connection reset by peer
ERROR 2024-06-16 06:39:01,303 [shard 0: gms] rpc - client 127.0.7.4:7000: client connection dropped: recv: Connection reset by peer
INFO 2024-06-16 06:39:01,336 [shard 0: gms] raft_topology - entered `write both read new` transition state
INFO 2024-06-16 06:39:01,336 [shard 0: gms] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}
ERROR 2024-06-16 06:39:01,390 [shard 0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))
INFO 2024-06-16 06:39:01,393 [shard 0: gms] raft_topology - updating topology state: advance fence version to 12
INFO 2024-06-16 06:39:01,587 [shard 0: gms] raft_topology - executing global topology command barrier, excluded nodes: {}
ERROR 2024-06-16 06:39:01,646 [shard 0: gms] raft_topology - transition_state::write_both_read_new, global_token_metadata_barrier failed, error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))
WARN 2024-06-16 06:39:02,279 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
INFO 2024-06-16 06:39:03,019 [shard 0:main] raft_group_registry - marking Raft server 8d337728-7ec3-4376-8363-df83207c3a9b as dead for raft groups
WARN 2024-06-16 06:39:04,281 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:06,282 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:08,283 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:10,284 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
INFO 2024-06-16 06:39:11,650 [shard 0: gms] raft_topology - updating topology state: decommissioning: read fence completed
INFO 2024-06-16 06:39:11,766 [shard 0: gms] raft_topology - entered `left token ring` transition state
INFO 2024-06-16 06:39:11,766 [shard 0: gms] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}
ERROR 2024-06-16 06:39:11,804 [shard 0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))
INFO 2024-06-16 06:39:11,807 [shard 0: gms] raft_topology - updating topology state: advance fence version to 13
INFO 2024-06-16 06:39:11,919 [shard 0: gms] raft_topology - executing global topology command barrier, excluded nodes: {}
ERROR 2024-06-16 06:39:11,986 [shard 0: gms] raft_topology - transition_state::left_token_ring, raft_topology_cmd::command::barrier failed, error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))
WARN 2024-06-16 06:39:12,286 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:14,287 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:16,288 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:18,289 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
WARN 2024-06-16 06:39:20,290 [shard 0:strm] gossip - failure_detector_loop: Send echo to node 127.0.7.4, status = failed: seastar::rpc::closed_error (connection is closed)
INFO 2024-06-16 06:39:21,990 [shard 0: gms] raft_topology - updating topology state: report request completion in left_token_ring state
WARN 2024-06-16 06:39:22,098 [shard 0: gms] raft_topology - failed to tell node 8d337728-7ec3-4376-8363-df83207c3a9b to shut down - it may hang. It's safe to shut it down manually now. (Exception: seastar::rpc::closed_error (connection is closed))
INFO 2024-06-16 06:39:22,103 [shard 0: gms] raft_topology - removing node 8d337728-7ec3-4376-8363-df83207c3a9b from group 0 configuration...
INFO 2024-06-16 06:39:22,112 [shard 0: gms] raft_topology - node 8d337728-7ec3-4376-8363-df83207c3a9b removed from group 0 configuration
INFO 2024-06-16 06:39:22,116 [shard 0: gms] raft_topology - updating topology state: finished decommissioning node 8d337728-7ec3-4376-8363-df83207c3a9b
This is why the node gets stuck here when trying to restart:
INFO 2024-06-16 06:39:39,696 [shard 0:strm] raft_group0 - finish_setup_after_join: becoming a voter in the group 0 configuration...
It cannot contact the cluster because it's banned.
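The "too late" diagnosis can be checked mechanically from the log timestamps. Below is a minimal Python sketch, not part of any ScyllaDB tooling: the helper names `first_timestamp` and `interrupted_too_late` are made up for illustration, and it assumes the `client connection dropped` line is a fair proxy for the moment the node was interrupted.

```python
import re
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "2024-06-16 06:39:01,130".
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def first_timestamp(lines, needle):
    """Return the timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            match = TS_RE.search(line)
            if match:
                return datetime.strptime(match.group(1), TS_FORMAT)
    return None


def interrupted_too_late(
    lines,
    point_of_no_return="raft_topology - streaming completed",
    interruption="client connection dropped",  # assumed proxy for the interruption
):
    """True if the point-of-no-return line was logged at or before the interruption."""
    done = first_timestamp(lines, point_of_no_return)
    if done is None:
        # Streaming never completed, so the decommission could still be aborted.
        return False
    dropped = first_timestamp(lines, interruption)
    if dropped is None:
        raise ValueError("no interruption found in the log")
    return done <= dropped


# Two lines copied from the coordinator log above.
EXCERPT = [
    "INFO 2024-06-16 06:39:01,130 [shard 0:strm] raft_topology - streaming completed",
    "ERROR 2024-06-16 06:39:01,302 [shard 0:main] rpc - client 127.0.7.4:7000: "
    "client connection dropped: recv: Connection reset by peer",
]
print(interrupted_too_late(EXCERPT))
```

In this run the connection drop is logged 172 ms after `streaming completed`, so the check reports that the interruption came too late.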
The test is timing-based and therefore flaky.
We need to make sure that we interrupt the node before the decommission finishes. @temichus and @aleksbykov, you have dealt with such problems before and unflaked many such tests, so I'm assigning it to you.
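One possible shape for such a fix is to drive the interruption from the log instead of from a timer. This is only a sketch under assumptions: `interrupt_on_trigger` and `InterruptedTooLate` are hypothetical names, not part of the ScyllaDB test framework, and the trigger pattern is a placeholder because this thread does not name a log line that reliably precedes the end of streaming.

```python
import re


class InterruptedTooLate(Exception):
    """Raised when the point-of-no-return line shows up before the trigger."""


def interrupt_on_trigger(lines, trigger, too_late, interrupt):
    """Call `interrupt()` at the first log line matching `trigger`.

    lines     -- iterable of log lines in arrival order (e.g. a tailing generator)
    trigger   -- regex for a line that must precede the point of no return
                 (placeholder: the thread does not name such a line)
    too_late  -- regex for the point-of-no-return line
    interrupt -- zero-argument callable that stops the node

    Returns the line that fired the trigger. Fails loudly instead of silently
    running a scenario in which the decommission can no longer be aborted.
    """
    trigger_re = re.compile(trigger)
    too_late_re = re.compile(too_late)
    for line in lines:
        if too_late_re.search(line):
            raise InterruptedTooLate(line)
        if trigger_re.search(line):
            interrupt()
            return line
    raise LookupError("log ended before the trigger line appeared")
```

A test would pass its own node-stopping callable as `interrupt`. Note that this only narrows the race, since the node keeps running between the trigger line being written and the interruption taking effect; making the test fully deterministic would probably require pausing the node at a known step.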