[dtest] wide_rows_test.TestWideRows.test_large_row_in_materialized_view failed

Annamikhlin commented on May 28, 2024

[dtest] wide_rows_test.TestWideRows.test_large_row_in_materialized_view failed

from scylladb.

Comments (15)

mykaul commented on May 28, 2024

Anything interesting in node 127.0.98.3 logs?

Annamikhlin commented on May 28, 2024

Anything interesting in node 127.0.98.3 logs?

Not sure; I don't see anything interesting there:
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/7191/artifact/logs-full.release.001/1705484423669_wide_rows_test.py%3A%3ATestWideRows%3A%3Atest_large_row_in_materialized_view%5BSizeTieredCompactionStrategy%5D/node3.log

fruch commented on May 28, 2024

Most of the time you won't see anything interesting in the logs in cases of timeouts.

It has happened more than a few times in the last year:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/546d8010-b544-11ee-81c7-3986d18dafd5

I would argue that this CREATE MATERIALIZED VIEW statement should probably have a bigger timeout, but only after the people responsible for MVs "sign" it off.

nyh commented on May 28, 2024

I would argue that this CREATE MATERIALIZED VIEW statement should probably have a bigger timeout, but only after the people responsible for MVs "sign" it off.

I'm not sure how the people responsible for MV (me ;-)) need to sign it (what is "it"?) off.

Basically, CREATE MATERIALIZED VIEW is a slow operation. It shouldn't be very slow - e.g., it doesn't wait for the entire view building to happen (this happens asynchronously in the background, later) - but it's a schema operation and those are always relatively slow. In my cql-pytest work, I noticed that the Python driver's default request_timeout setting of 10 seconds was not always enough on extremely overloaded test machines and debug builds, which are often 100 times (!) slower than normal, so I increased request_timeout to 120. What request_timeout do you use in dtest? A quick glance at the code suggests maybe it's 600 seconds, which is more than enough, but please check which request_timeout applies to this specific test.

In addition to request_timeout, there are also connect_timeout and control_connection_timeout, which I also increased to 60 seconds. If I understand correctly, dtest_setup.py increases them to "just" 5 and 6 seconds, respectively; maybe that's not enough? But to be honest, I think the specific error you got doesn't indicate that these two parameters are relevant (see #11289 on why those two additional timeout increases were necessary).

If you do have a very long timeout (e.g., 600 seconds) and you're sure about it, you might have discovered a genuine bug in the schema update code which causes a hang. That would be something for the schema-change people (e.g., @kbr-scylla) to worry about, not the MV people.
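
To make the three timeouts above concrete, here is a minimal sketch of where each one is set in the Python driver. The helper names `build_timeouts` and `connect` are hypothetical and the values are just the ones mentioned in this comment (120 and 60 seconds); this is not the actual cql-pytest or dtest code.

```python
def build_timeouts(request_timeout=120.0, connect_timeout=60.0,
                   control_connection_timeout=60.0):
    """Collect the three driver timeouts discussed above (values in seconds)."""
    return {
        "request_timeout": request_timeout,
        "connect_timeout": connect_timeout,
        "control_connection_timeout": control_connection_timeout,
    }


def connect(contact_points, **overrides):
    """Open a session with the timeouts applied (hypothetical helper)."""
    # Imported lazily so the sketch can be loaded without the driver installed.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

    t = build_timeouts(**overrides)
    # request_timeout belongs to the execution profile, not to the Cluster.
    profile = ExecutionProfile(request_timeout=t["request_timeout"])
    cluster = Cluster(
        contact_points=contact_points,
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
        connect_timeout=t["connect_timeout"],
        control_connection_timeout=t["control_connection_timeout"],
    )
    return cluster.connect()
```

Something like `connect(["127.0.0.1"])` would then give a session whose default per-request timeout is 120 seconds instead of the driver's 10.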

fruch commented on May 28, 2024

dtest doesn't change the default_timeout of a session, and that's 10s.
If a specific command needs more than that, we change the timeout for that command only.
It seems you just explained that creating an MV is the same as any other schema-changing operation and might take more than 10s, and that it should only be considered a problem after 10 minutes?
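
For reference, a per-command override like the one described here could look roughly as follows. `execute_schema_change` and `SCHEMA_CHANGE_TIMEOUT` are hypothetical names, and 120 seconds is only the value suggested earlier in this thread; the sketch relies on the `timeout` argument of the driver's `Session.execute()`.

```python
# Assumed value, taken from the cql-pytest setting mentioned in this thread.
SCHEMA_CHANGE_TIMEOUT = 120.0


def execute_schema_change(session, statement, timeout=SCHEMA_CHANGE_TIMEOUT):
    """Run one slow statement with a longer timeout than the session default.

    Only this call is affected; every other session.execute() keeps the
    session's default timeout (10 seconds unless configured otherwise).
    """
    return session.execute(statement, timeout=timeout)
```

This keeps the 10-second default as a guard for ordinary queries while giving schema changes more room on slow machines.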

nyh commented on May 28, 2024

I thought I saw you set request_timeout=600, but it was apparently in some unused part of dtest, not the default.
10 minutes is ridiculously high (although the same might be said about 10 seconds ;-)). In cql-pytest I set it to 120 and it was more than enough.
Here is my comment about this:

        # The default timeout (in seconds) for execute() commands is 10, which
        # should have been more than enough, but in some extreme cases with a
        # very slow debug build running on a very busy machine and a very slow
        # request (e.g., a DROP KEYSPACE needing to drop multiple tables)
        # 10 seconds may not be enough, so let's increase it. See issue #7838.
        request_timeout=request_timeout)

As this notes, the slow request that caused problems for me was a DROP KEYSPACE that needed to drop multiple tables and views - not a single MV creation, which in my experience never exceeded 10 seconds - but who knows, maybe if the test machine is 500 times slower than normal, instead of the "usual" 100, even a single MV creation is too slow :-(

mykaul commented on May 28, 2024

@nyh - what should be the next step here?

kbr-scylla commented on May 28, 2024

Another failure spotted on 5.4 next

1712085589054_wide_rows_test.py TestWideRows test_large_row_in_materialized_view[SizeTieredCompactionStrategy].zip

dtest-gw3.log

nyh commented on May 28, 2024

Another failure spotted on 5.4 next

1712085589054_wide_rows_test.py TestWideRows test_large_row_in_materialized_view[SizeTieredCompactionStrategy].zip

dtest-gw3.log

Indeed, it also has

    def test_large_row_in_materialized_view(self):
...
>       session.execute('create materialized view %s as select * from %s '
...

E   cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.27.3:9042
cassandra/cluster.py:5048: OperationTimedOut

Looking at the Scylla log in the zip you attached (thanks), node3.log (the node whose IP address is reported in the error) has

INFO  2024-04-02 19:19:15,650 [shard 0:stat] migration_manager - Create new view: org.apache.cassandra.config.CFMetaData@0x60000511d500[cfId=e5336a20-f125-11ee-8c76-1527e943ec90,ksName=wide_row,cfName=user_events_view
...
INFO  2024-04-02 19:19:15,690 [shard 0:stat] schema_tables - Creating wide_row.user_events_view id=e5336a20-f125-11ee-8c76-1527e943ec90 version=033cd87c-ebaf-36dc-8ba9-97ceebb21464
...
INFO  2024-04-02 19:19:15,747 [shard 0:stat] view - Finished building view wide_row.user_events_view

So creating the view did not take a long time at all: in less than 0.1 seconds we not only created the view but even finished the view building (which CREATE MATERIALIZED VIEW doesn't even wait for). In the log, nothing further happens after 19:19:15 until 19:19:46 (about 30 seconds later), when Scylla is asked to shut down.
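
The "less than 0.1 seconds" figure can be checked directly from the timestamps. Below is a small sketch that parses the three log lines quoted above (abbreviated here with "..." after the part that matters) and computes the elapsed time; `log_time` and `elapsed_seconds` are hypothetical helper names.

```python
from datetime import datetime

# The three log lines quoted above, truncated after the relevant prefix.
LOG_LINES = [
    "INFO  2024-04-02 19:19:15,650 [shard 0:stat] migration_manager - Create new view: ...",
    "INFO  2024-04-02 19:19:15,690 [shard 0:stat] schema_tables - Creating wide_row.user_events_view ...",
    "INFO  2024-04-02 19:19:15,747 [shard 0:stat] view - Finished building view wide_row.user_events_view",
]


def log_time(line):
    """Parse the 'YYYY-MM-DD HH:MM:SS,mmm' timestamp that follows the log level."""
    _level, date, time = line.split()[:3]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(first, last):
    """Seconds between two log lines."""
    return (log_time(last) - log_time(first)).total_seconds()


print(elapsed_seconds(LOG_LINES[0], LOG_LINES[-1]))
```

For these lines the difference is 97 milliseconds, from "Create new view" to "Finished building view".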

I guess it's possible that we have a bug where a successful CREATE MATERIALIZED VIEW fails to return a response, but then why do we not see it all the time? Why isn't there any error logging or something?
I suspect some sort of Python driver bug, but don't know how to explain it. Does anyone know what "Connection defunct by heartbeat" even means?

kbr-scylla commented on May 28, 2024

Does anyone know what "Connection defunct by heartbeat" even means?

Good question. @sylwiaszunejko how is 'Connection defunct by heartbeat': 'Client request timeout...' different from a "normal timeout"? (I assume that this is some kind of special timeout and that not every timeout looks like this, but I may be wrong.)

sylwiaszunejko commented on May 28, 2024

Does anyone know what "Connection defunct by heartbeat" even means?

Good question. @sylwiaszunejko how is 'Connection defunct by heartbeat': 'Client request timeout...' different from a "normal timeout"? (I assume that this is some kind of special timeout and that not every timeout looks like this, but I may be wrong.)

@kbr-scylla The heartbeat sends an OPTIONS message on idle connections. Both the interval at which idle connections are checked and the timeout for which the heartbeat waits for a response can be configured. This helps keep connections open through network devices that expire idle connections, and helps discover bad connections early in low-traffic scenarios.
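
As a rough illustration of those two settings: in the Python driver they are the `idle_heartbeat_interval` and `idle_heartbeat_timeout` arguments of `Cluster`. The helper names below are hypothetical, and the 30-second defaults shown are my understanding of the driver's defaults rather than something confirmed in this thread.

```python
def heartbeat_settings(interval=30, timeout=30):
    """Build the Cluster keyword arguments that control idle heartbeats.

    interval: seconds between OPTIONS heartbeats on idle connections.
    timeout:  seconds to wait for the heartbeat responses.
    """
    if interval < 0 or timeout <= 0:
        raise ValueError("interval must be >= 0 and timeout must be > 0")
    return {
        "idle_heartbeat_interval": interval,
        "idle_heartbeat_timeout": timeout,
    }


def make_cluster(contact_points, **heartbeat):
    """Create a Cluster with the heartbeat settings applied (hypothetical helper)."""
    # Imported lazily so the sketch can be loaded without the driver installed.
    from cassandra.cluster import Cluster

    return Cluster(contact_points=contact_points, **heartbeat_settings(**heartbeat))
```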

kbr-scylla commented on May 28, 2024

@sylwiaszunejko so does it mean that the connection was marked as "defunct" due to this heartbeat, and that this happened before we sent the request? I'm trying to understand the relationship between "defunct by heartbeat" and "Client request timeout".

Did "client request timeout" cause "defunct by heartbeat", or was the connection marked as defunct first, which then caused the request to "timeout"?

What happens when we try to send a request over a connection that is "defunct by heartbeat"? Do we attempt the request nevertheless? Or does the driver immediately assume that the request will fail and return a timeout?

sylwiaszunejko commented on May 28, 2024

I guess this message comes from this place in python-driver: https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L4556. I am not that familiar with this logic, but judging by the comments and the code, I guess the connection was marked as defunct first, which then caused the request to "timeout".

sylwiaszunejko commented on May 28, 2024

@kbr-scylla FYI, after talking with @avelanarius, I realized that the 'Connection defunct by heartbeat': 'Client request timeout...' message can be misleading, and the timeout is not always due to a heartbeat issue. For example, in scylladb/python-driver#275 the timeout was caused by the ThreadPoolExecutor blocking tasks, not by a failing heartbeat.

kbr-scylla commented on May 28, 2024

Possibly related / the same? #14806
