Comments (26)
Thank you for pointing this out. Fixed.
from scylladb.
The issue description above got garbled due to missing `
test_simple_decommission_node_while_query_info
this time. See https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/6374/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/Tests___dtest___test_simple_decommission_node_while_query_info_2_/
self = <update_cluster_layout_tests.TestUpdateClusterLayout object at 0x7f02a482d810>
rf = 2

    @pytest.mark.required_features("!consistent-topology-changes")
    @pytest.mark.parametrize("rf", [1, 2])
    def test_simple_decommission_node_while_query_info(self, rf):
        """
        Test decommissioning node streams all data
        1. Create a cluster with three nodes with rf, insert data
        2. Decommission node, while node is decommissioning query data
        3. Check that the cluster returns all
        """
        cluster = self.cluster
        consistency = {1: ConsistencyLevel.ONE, 2: ConsistencyLevel.TWO}[rf]
        # Disable hinted handoff and set batch commit log so this doesn't
        # interfere with the test (this must be after the populate)
        cluster.set_configuration_options(values=self.default_config_options(), batch_commitlog=True)
        cluster.populate(3).start()
        node1, node2, node3 = cluster.nodelist()
        session = self.patient_cql_connection(node1)
        create_ks(session, "ks", rf)
        create_cf(session, "cf", read_repair=0.0, columns={"c1": "text", "c2": "text"})
        num_keys = 2000
        insert_c1c2(session, keys=range(num_keys), consistency=consistency)

        def run(stop_run):
            logger.debug("Background SELECT loop starting")
            query = session.prepare("SELECT * FROM cf")
            query.consistency_level = consistency
            while not stop_run.is_set():
                result = list(session.execute(query))
                assert len(result) == num_keys
                time.sleep(0.01)
            logger.debug("Background SELECT loop done")

        executor = ThreadPoolExecutor(max_workers=1)
        stop_run = Event()
        t = executor.submit(run, stop_run)
        logger.debug("Decommissioning node2")
        node2.decommission()
        stop_run.set()
>       t.result()
update_cluster_layout_tests.py:1196:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.11/concurrent/futures/_base.py:456: in result
return self.__get_result()
/usr/lib64/python3.11/concurrent/futures/_base.py:401: in __get_result
raise self._exception
/usr/lib64/python3.11/concurrent/futures/thread.py:58: in run
result = self.fn(*self.args, **self.kwargs)
update_cluster_layout_tests.py:1183: in run
result = list(session.execute(query))
cassandra/cluster.py:2726: in cassandra.cluster.Session.execute
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E cassandra.OperationTimedOut: errors={'127.0.73.1:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.73.1:9042
cassandra/cluster.py:5085: OperationTimedOut
History shows this issue, with test_simple_kill_node_while_decommissioning, has been happening since September:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/6a77bc60-c5d2-11ee-81c7-3986d18dafd5
test_simple_decommission_node_while_query_info was refactored recently in https://github.com/scylladb/scylla-dtest/pull/3814 by @bhalevy, so it's probably something new introduced as part of that.
Also seen in #16927.
@xemul since you're looking into issues with decommissioning (from the tablets path), can you please look into this issue too?
It's hurting us mostly in CI, but we need to understand the root cause and see how serious it is.
I think it might be related to the change in https://github.com/scylladb/scylla-dtest/pull/3814.
The specific test used to do only 100 requests in the background thread; you changed it to keep going until told to stop.
It is known that during topology changes we can get client-side errors (in c-s we have a default retry of 10 times), so I think we need better retries in this case.
The Python driver doesn't supply very good machinery for this; I'm trying to improve it in:
scylladb/python-driver#298
Meanwhile I think we should add retries in the test code itself.
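Such a test-side retry helper could look roughly like this. It's a minimal sketch: OperationTimedOut is stubbed here as a placeholder for the driver's cassandra.OperationTimedOut, and the attempts/delay values are arbitrary, not from the actual dtest change.

```python
import time

class OperationTimedOut(Exception):
    """Placeholder for cassandra.OperationTimedOut from the Python driver."""

def execute_with_retry(fn, attempts=5, delay=0.5):
    """Run fn(), retrying on client-side timeouts; re-raise after the last try."""
    for attempt in range(attempts):
        try:
            return fn()
        except OperationTimedOut:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

In the test's background loop, `result = list(session.execute(query))` would then become `result = execute_with_retry(lambda: list(session.execute(query)))`.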
Why is it ok to see client-side errors during topology changes?
We need to fix that.
Cc @avelanarius
Raising to P0.
https://github.com/scylladb/scylla-dtest/pull/3985 doesn't help; it still happens, rarely.
Driver complains
E cassandra.OperationTimedOut: errors={'127.0.79.1:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.79.1:9042
In node-1 logs I have
20:05:30,608 [shard 0:strm] storage_service - decommission[077e2627-bf0d-41bd-af6c-dee41073f850]: Added node=127.0.79.2 as leaving node, coordinator=127.0.79.2
20:05:43,588 [shard 0:stmt] cql_server - Processing EXECUTE request from 127.0.0.1:52688
20:05:43,612 [shard 0: gms] gossip - Removed endpoint 127.0.79.2
20:06:43,589 [shard 0:stmt] cql_server - Done processing EXECUTE request from 127.0.0.1:52688
20:06:43,589 [shard 0:stmt] cql_server - 127.0.0.1:52688: request resulted in read_timeout_error, stream 92, code 4608, message [Operation timed out for ks.cf - received only 1 responses from 2 CL=TWO.]
So what's holding this query for that long?
> So what's holding this query for that long?
This is what I'm currently investigating.
Apparently one of the replicas the coordinator tried to query was the decommissioned one, and it didn't respond.
node1 logs:
09:55:20,504 [shard 0:stmt] cql_server - Processing EXECUTE 0/386 request from 127.0.0.1:60216
09:55:20,530 [shard 0: gms] seastar - stopped client socket from 127.0.85.1:65066 to 127.0.85.2:7000
09:55:20,530 [shard 0:stmt] seastar - stopped client socket from 127.0.85.1:65530 to 127.0.85.2:7000
09:55:20,530 [shard 0:main] seastar - stopped client socket from 127.0.85.1:52920 to 127.0.85.2:7000
09:55:20,530 [shard 1:stmt] seastar - stopped client socket from 127.0.85.1:49867 to 127.0.85.2:7000
09:55:20,530 [shard 1:stmt] seastar - stopped client socket from 127.0.85.1:53841 to 127.0.85.2:7000
09:55:20,530 [shard 1:strm] seastar - stopped client socket from 127.0.85.1:63599 to 127.0.85.2:7000
09:55:20,530 [shard 0:strm] seastar - stopped client socket from 127.0.85.1:61294 to 127.0.85.2:7000
09:55:30,317 [shard 0: gms] seastar - stopped client socket from 127.0.85.1:59698 to 127.0.85.2:7000
09:55:30,317 [shard 0:main] seastar - stopped client socket from 127.0.85.1:58596 to 127.0.85.2:7000
09:55:30,317 [shard 1:strm] seastar - stopped client socket from 127.0.85.1:61301 to 127.0.85.2:7000
09:55:30,317 [shard 0:strm] seastar - stopped client socket from 127.0.85.1:57678 to 127.0.85.2:7000
09:56:20,505 [shard 0:stmt] cql_server - Done processing EXECUTE 1/389 request from 127.0.0.1:60216
09:56:20,505 [shard 0:stmt] cql_server - 127.0.0.1:60216: request resulted in read_timeout_error, stream 51, code 4608, message [Operation timed out for ks.cf - received only 1 responses from 2 CL=TWO.]
node2 logs:
09:55:08,629 [shard 0:strm] api - decommission
09:55:20,530 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:65066
09:55:20,530 [shard 1:strm] seastar - stopped server socket from 127.0.85.1:63599
09:55:20,530 [shard 1:strm] seastar - stopped server socket from 127.0.85.1:49867
09:55:20,530 [shard 1:strm] seastar - stopped server socket from 127.0.85.1:53841
09:55:20,530 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:52920
09:55:20,530 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:61294
09:55:20,530 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:65530
09:55:30,145 [shard 0:strm] storage_service - Stop transport: starts
09:55:30,146 [shard 0:strm] storage_service - Stop transport: shutdown rpc and cql server done
09:55:30,316 [shard 0:strm] storage_service - Stop transport: stop_gossiping done
09:55:30,317 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:59698
09:55:30,317 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:57678
09:55:30,317 [shard 0:strm] seastar - stopped server socket from 127.0.85.1:58596
09:55:30,317 [shard 1:strm] seastar - stopped server socket from 127.0.85.1:61301
09:55:30,317 [shard 0:strm] storage_service - Stop transport: shutdown messaging_service done
09:55:30,317 [shard 0:strm] storage_service - Stop transport: shutdown stream_manager done
09:55:30,317 [shard 0:strm] storage_service - Stop transport: done
Summary:
09:55:08,629 node2 decommission starts
09:55:20,504 node1 query starts
09:55:20,530 node2 closes some connections from node1
09:55:30,145 node2 starts stopping transport (after unbootstrapping)
09:55:30,317 node2 closes remaining connections from node1
09:56:20,505 node1 times-out processing query
11:22:15,843 [shard 1:stmt] cql_server - Processing EXECUTE 0/390 request from 127.0.0.1:42062
11:22:15,843 [shard 1:stmt] rpc - send READ_DATA to 127.0.67.2:0 // multiple
11:22:15,846 [shard 1:stmt] rpc - send READ_DIGEST to 127.0.67.2:0 // multiple
11:22:15,859 [shard 0: gms] gossip - Node 127.0.67.2 will be removed from gossip at [2024-03-08 11:22:14]: (expire = 1709886134996526720, now = 1709626935859056195, diff = 259199 seconds)
11:22:15,859 [shard 0: gms] storage_service - Removing tokens {...} for 127.0.67.2
11:22:15,863 [shard 1:stmt] rpc - send READ_DATA to 127.0.67.2:0 // multiple
11:22:15,863 [shard 1:stmt] rpc - send READ_DIGEST to 127.0.67.2:0 // multiple
11:22:15,865 [shard 0: gms] gossip - Removed endpoint 127.0.67.2
11:22:15,865 [shard 0: gms] gossip - InetAddress 127.0.67.2 is now DOWN, status = LEFT
11:22:15,865 [shard 0:strm] seastar - stopped client socket from 127.0.67.1:60194 to 127.0.67.2:7000
11:22:15,865 [shard 0:main] seastar - stopped client socket from 127.0.67.1:55812 to 127.0.67.2:7000
11:22:15,865 [shard 0:stmt] seastar - stopped client socket from 127.0.67.1:63380 to 127.0.67.2:7000
11:22:15,865 [shard 0: gms] seastar - stopped client socket from 127.0.67.1:50892 to 127.0.67.2:7000
11:22:15,865 [shard 1:stmt] seastar - stopped client socket from 127.0.67.1:54689 to 127.0.67.2:7000
11:22:15,866 [shard 1:stmt] rpc - message 3 to 127.0.67.2:0 failed with seastar::rpc::closed_error (connection is closed) // multiple
11:22:15,866 [shard 1:stmt] rpc - message 5 to 127.0.67.2:0 failed with seastar::rpc::closed_error (connection is closed) // multiple
^^^ this is the last message about RPC failure to node 67.2
11:22:15,866 [shard 1:strm] seastar - stopped client socket from 127.0.67.1:62397 to 127.0.67.2:7000
11:22:15,866 [shard 1:stmt] seastar - stopped client socket from 127.0.67.1:50003 to 127.0.67.2:7000
11:22:16,858 [shard 1:strm] rpc - send GOSSIP_ECHO to 127.0.67.2:0
+10 sec
11:22:25,003 [shard 0:strm] storage_service - decommission[6e00cf02-ef34-4163-8065-23f0ca571e20]: Started to check if nodes={127.0.67.2} have left the cluster, coordinator=127.0.67.2
11:22:25,003 [shard 0:strm] storage_service - decommission[6e00cf02-ef34-4163-8065-23f0ca571e20]: Finished to check if nodes={127.0.67.2} have left the cluster, coordinator=127.0.67.2
11:22:25,003 [shard 0:strm] storage_service - decommission[6e00cf02-ef34-4163-8065-23f0ca571e20]: Marked ops done from coordinator=127.0.67.2
11:22:25,008 [shard 0:strm] rpc - send RAFT_APPEND_ENTRIES to 127.0.67.2:0
11:22:25,644 [shard 0:strm] seastar - stopped server socket from 127.0.67.2:61160
11:22:25,644 [shard 0:strm] seastar - stopped server socket from 127.0.67.2:60168
11:22:25,644 [shard 0:strm] seastar - stopped server socket from 127.0.67.2:50518
11:22:25,644 [shard 0: gms] seastar - stopped client socket from 127.0.67.1:51680 to 127.0.67.2:7000
11:22:25,644 [shard 0:strm] seastar - stopped server socket from 127.0.67.2:51434
11:22:25,644 [shard 0:strm] seastar - stopped server socket from 127.0.67.2:65474
11:22:25,644 [shard 0:main] seastar - stopped client socket from 127.0.67.1:63482 to 127.0.67.2:7000
11:22:25,644 [shard 0:strm] seastar - stopped client socket from 127.0.67.1:59708 to 127.0.67.2:7000
11:22:25,644 [shard 1:strm] seastar - stopped server socket from 127.0.67.2:61331
11:22:25,644 [shard 1:strm] seastar - stopped server socket from 127.0.67.2:64683
11:22:25,644 [shard 1:strm] seastar - stopped client socket from 127.0.67.1:60679 to 127.0.67.2:7000
11:22:25,644 [shard 1:strm] seastar - stopped server socket from 127.0.67.2:61417
11:22:25,644 [shard 1:strm] seastar - stopped server socket from 127.0.67.2:52147
+50 sec
11:23:15,843 [shard 1:stmt] cql_server - Done processing EXECUTE 1/391 request from 127.0.0.1:42062
11:23:15,843 [shard 1:stmt] cql_server - 127.0.0.1:42062: request resulted in read_timeout_error, stream 94, code 4608, message [Operation timed out for ks.cf - received only 1 responses from 2 CL=TWO.]
11:23:15,844 [shard 1:stmt] cql_server - Advertising disconnection of CQL client 127.0.0.1:42062
Despite node2 failing all READ_DATA/READ_DIGEST requests from node1, it took node1 one more minute to time out its read.
The timeout exception is raised from digest_read_resolver::on_timeout().
Closed connections from the decommissioned node don't fail the request instantly because of:

void on_error(gms::inet_address ep, error_kind kind) override {
    if (waiting_for(ep)) {
        _failed++;
    }
    if (kind == error_kind::DISCONNECT && _block_for == _target_count_for_cl) {
        // if the error is because of a connection disconnect and there is no targets to speculate
        // wait for timeout in hope that the client will issue speculative read
        // FIXME: resolver should have access to all replicas and try another one in this case
        return;
    }
    ...
}

Here _block_for == 2 and _target_count_for_cl == 2.
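To make the condition concrete, here is a toy Python model of that decision; the names mirror the C++ fields above, but this is illustration only, not the actual resolver code.

```python
def fails_fast(kind, block_for, target_count_for_cl):
    # A DISCONNECT error with exactly as many targets as the CL requires does
    # not fail the read immediately: the coordinator waits out the timeout,
    # hoping the client will issue a speculative retry against another replica.
    return not (kind == "DISCONNECT" and block_for == target_count_for_cl)
```

With CL=TWO and only two targets (block_for == target_count_for_cl == 2), a disconnect therefore waits for the full timeout, which is exactly what the logs above show.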
The check comes from 7277ee2, with this description:
After ac27d1c if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.
@gleb-cloudius, please shed more light on this: what kind of "speculative read (retry)" is (or was) the client supposed to issue? In this test the client just waits until the timeout and fails the query (and eventually fails the whole test).
A range read unconditionally creates a never_speculating_read_executor, passing it 2 replica IPs, one of which belongs to node2. Then the gossiper removes the 2nd endpoint, node2 closes its sockets, and the digest resolver's on_error fires:
storage_proxy - creating range read executor for range (-inf, {-9210079157227413570, end}] in table ks.cf with targets {127.0.4.1, 127.0.4.3}
...
storage_proxy - creating range read executor for range ({-9108879658673895196, end},{-9095897187352045003, end}] in table ks.cf with targets {127.0.4.1, 127.0.4.2}
...
rpc - send READ_DIGEST to 127.0.4.3:0
...
rpc - send READ_DIGEST to 127.0.4.2:0
...
gossip - Removed endpoint 127.0.4.2
gossip - InetAddress 127.0.4.2 is now DOWN, status = LEFT
rpc - message READ_DIGEST to 127.0.4.2:0 failed with seastar::rpc::closed_error (connection is closed)
...
storage_proxy - digest read error 1, block_for 2 failed 1 target_count_for_cl 2
OK, the driver side has configurable (off by default) speculative execution. Turning it on fixes the issue.
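For reference, enabling speculative execution in the Python driver looks roughly like this; the delay and max_attempts values here are arbitrary, not necessarily what the dtest fix uses, and the statement must be marked idempotent for the policy to apply.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import ConstantSpeculativeExecutionPolicy

profile = ExecutionProfile(
    # Send an extra attempt to another replica after 0.5s, at most 2 extra tries
    speculative_execution_policy=ConstantSpeculativeExecutionPolicy(0.5, 2),
)
cluster = Cluster(execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

query = session.prepare("SELECT * FROM ks.cf")
query.is_idempotent = True  # speculative execution only fires for idempotent statements
```

This way, when one replica's connection drops during decommission, the driver races a second coordinator instead of waiting out the server-side timeout.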
dtest fix merged