Comments (15)

mykaul commented on May 28, 2024

Anything interesting in node 127.0.98.3 logs?

from scylladb.

Annamikhlin commented on May 28, 2024

> Anything interesting in node 127.0.98.3 logs?

I'm not sure there is anything interesting in it. This is the log I see:
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/7191/artifact/logs-full.release.001/1705484423669_wide_rows_test.py%3A%3ATestWideRows%3A%3Atest_large_row_in_materialized_view%5BSizeTieredCompactionStrategy%5D/node3.log

fruch commented on May 28, 2024

Most of the time you won't see anything interesting in the logs in cases of timeouts.

It has happened more than a few times in the last year:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/546d8010-b544-11ee-81c7-3986d18dafd5

I would argue that this CREATE MATERIALIZED VIEW should probably have a bigger timeout, but only after the people responsible for MVs "sign" it off.

nyh commented on May 28, 2024

> I would argue that this CREATE MATERIALIZED VIEW should probably have a bigger timeout, but only after the people responsible for MVs "sign" it off.

I'm not sure how the people responsible for MV (me ;-)) need to sign it (what is "it"?) off.

Basically, CREATE MATERIALIZED VIEW is a slow operation. It shouldn't be very slow - e.g., it doesn't wait for the entire view building to happen (this happens asynchronously in the background, later), but it's a schema operation and those are always relatively slow. In my cql-pytest work, I noticed the Python driver's default request_timeout setting of 10 seconds was not always enough on extremely overloaded test machines and debug builds - which are often 100 times (!) slower than normal, so I increased request_timeout to 120. What request_timeout do you use in dtest? A quick glance in the code suggests maybe it's 600 seconds - which is more than enough, but please check which request_timeout actually applies to this specific test.

In addition to request_timeout, there is also connect_timeout and control_connection_timeout, which I also increased to 60 seconds. If I understand correctly, dtest_setup.py increases them to "just" 5 and 6 seconds, respectively, maybe that's not enough? But to be honest, I think the specific error you got doesn't indicate that these two parameters are relevant (see #11289 on why those two additional timeout increases were necessary).
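
For reference, here is a minimal sketch of how these three timeouts can be raised with the Python driver. The values 120 and 60 are the ones mentioned above for cql-pytest; the helper name make_cluster and the contact point 127.0.0.1 are made up for illustration, and this is not the actual cql-pytest or dtest code.

```python
# Timeout values (in seconds) taken from the comment above; everything else
# in this sketch (helper name, contact point) is a hypothetical illustration.
REQUEST_TIMEOUT = 120
CONNECT_TIMEOUT = 60
CONTROL_CONNECTION_TIMEOUT = 60


def make_cluster(contact_points=("127.0.0.1",), request_timeout=REQUEST_TIMEOUT):
    """Build a Cluster object with longer timeouts than the driver defaults."""
    # Imported lazily so this module can be loaded without the driver installed.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

    # request_timeout applies to every execute() that uses the default profile.
    profile = ExecutionProfile(request_timeout=request_timeout)
    return Cluster(
        contact_points=list(contact_points),
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
        connect_timeout=CONNECT_TIMEOUT,
        control_connection_timeout=CONTROL_CONNECTION_TIMEOUT,
    )
```

Calling make_cluster().connect() would then return a session whose execute() calls wait up to 120 seconds by default.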

If you do have a very long timeout (e.g., 600 seconds) and you're sure about it, you might have discovered a genuine bug in the schema update code which causes a hang. That is something the schema-change people (e.g., @kbr-scylla) should be worried about, not the MV people.

fruch commented on May 28, 2024

dtest doesn't change the default_timeout of a session, and that's 10s.
If a specific command needs more than that, we change the timeout for that command only.
It seems like you just explained that creating an MV is the same as any other schema-changing operation and might take more than 10s, and that it should be considered a problem only after 10 minutes?
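
A minimal sketch of what such a per-command override could look like, using the timeout argument of the Python driver's Session.execute(). The helper name execute_schema_change and the 120-second value (borrowed from cql-pytest) are assumptions for illustration, not existing dtest code.

```python
# Hypothetical per-statement timeout for schema changes, in seconds.
SCHEMA_CHANGE_TIMEOUT = 120.0


def execute_schema_change(session, statement, timeout=SCHEMA_CHANGE_TIMEOUT):
    """Run one schema-changing statement with a longer timeout.

    The session's default_timeout (10s) is left untouched, so all other
    statements in the test keep the default behaviour.
    """
    return session.execute(statement, timeout=timeout)
```

In the failing test this would wrap only the 'create materialized view ...' statement, leaving the rest of the test on the 10s default.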

nyh commented on May 28, 2024

I thought I saw you set request_timeout=600, but it was apparently in some unused part of dtest, not the default.
10 minutes is ridiculously high (although the same might be said about 10 seconds ;-)). In cql-pytest I set it to 120 and it was more than enough.
Here is my comment about this:

        # The default timeout (in seconds) for execute() commands is 10, which
        # should have been more than enough, but in some extreme cases with a
        # very slow debug build running on a very busy machine and a very slow
        # request (e.g., a DROP KEYSPACE needing to drop multiple tables)
        # 10 seconds may not be enough, so let's increase it. See issue #7838.
        request_timeout=request_timeout)

As this notes, the slow request that caused problems for me was a DROP KEYSPACE that needed to drop multiple tables and views - not a single MV creation, which in my experience never exceeded 10 seconds - but who knows, maybe if the test machine is 500 times slower than necessary, instead of the "usual" 100, even a single MV creation is too slow :-(

mykaul commented on May 28, 2024

@nyh - what should be the next step here?

kbr-scylla commented on May 28, 2024

Another failure spotted on 5.4 next

1712085589054_wide_rows_test.py TestWideRows test_large_row_in_materialized_view[SizeTieredCompactionStrategy].zip

dtest-gw3.log

nyh commented on May 28, 2024

> Another failure spotted on 5.4 next
>
> 1712085589054_wide_rows_test.py TestWideRows test_large_row_in_materialized_view[SizeTieredCompactionStrategy].zip
>
> dtest-gw3.log

Indeed, it also has

    def test_large_row_in_materialized_view(self):
...
>       session.execute('create materialized view %s as select * from %s '
...

E   cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.27.3:9042
cassandra/cluster.py:5048: OperationTimedOut

Looking at the Scylla log in the zip you attached (thanks), node3.log (the node whose IP address is reported in the error) has

INFO  2024-04-02 19:19:15,650 [shard 0:stat] migration_manager - Create new view: org.apache.cassandra.config.CFMetaData@0x60000511d500[cfId=e5336a20-f125-11ee-8c76-1527e943ec90,ksName=wide_row,cfName=user_events_view
...
INFO  2024-04-02 19:19:15,690 [shard 0:stat] schema_tables - Creating wide_row.user_events_view id=e5336a20-f125-11ee-8c76-1527e943ec90 version=033cd87c-ebaf-36dc-8ba9-97ceebb21464
...
INFO  2024-04-02 19:19:15,747 [shard 0:stat] view - Finished building view wide_row.user_events_view

So it seems that creating the view did not take a long time: in less than 0.1 seconds we not only created the view, we even finished the view building (for which CREATE MATERIALIZED VIEW doesn't even wait). In the log, nothing further happens after 19:19:15 until 19:19:46 (about 30 seconds later), when Scylla is asked to shut down.
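
The "less than 0.1 seconds" figure can be checked directly from the first and last timestamps in the log excerpt. A small sketch (the helper name elapsed_seconds is made up for illustration):

```python
from datetime import datetime

# Scylla log timestamps look like "2024-04-02 19:19:15,650" (milliseconds
# after the comma); %f accepts the three-digit fraction.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# From "Create new view" to "Finished building view" in node3.log.
create_to_built = elapsed_seconds("2024-04-02 19:19:15,650",
                                  "2024-04-02 19:19:15,747")
print(f"{create_to_built:.3f}")  # prints 0.097
```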

I guess it's possible that we have a bug where a successful CREATE MATERIALIZED VIEW forgets to return a response, but why do we not see it all the time? Why isn't there any error logging or something?
I suspect some sort of Python driver bug, but don't know how to explain it. Does anyone know what "Connection defunct by heartbeat" even means?

kbr-scylla commented on May 28, 2024

> Does anyone know what "Connection defunct by heartbeat" even means?

Good question. @sylwiaszunejko how is 'Connection defunct by heartbeat': 'Client request timeout' different from a "normal timeout"? (I assume that this is some kind of special timeout, not every timeout looks like this, but I may be wrong)

sylwiaszunejko commented on May 28, 2024

> Does anyone know what "Connection defunct by heartbeat" even means?
>
> Good question. @sylwiaszunejko how is 'Connection defunct by heartbeat': 'Client request timeout' different from a "normal timeout"? (I assume that this is some kind of special timeout, not every timeout looks like this, but I may be wrong)

@kbr-scylla The heartbeat sends an OPTIONS message on idle connections. Both the interval at which idle connections are heartbeated and the timeout for which the heartbeat waits for a response can be configured. This helps keep connections open through network devices that expire idle connections, and it discovers bad connections early in low-traffic scenarios.
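
As far as I know, in the Python driver these two settings are the idle_heartbeat_interval and idle_heartbeat_timeout parameters of Cluster. A minimal sketch, where the helper name, the contact point and the 30-second values are only illustrative:

```python
# Illustrative values, in seconds; tune them for the environment at hand.
HEARTBEAT_INTERVAL = 30  # how often idle connections get an OPTIONS request
HEARTBEAT_TIMEOUT = 30   # how long to wait for the reply before giving up


def make_cluster_with_heartbeat(contact_points=("127.0.0.1",),
                                interval=HEARTBEAT_INTERVAL,
                                timeout=HEARTBEAT_TIMEOUT):
    """Build a Cluster object with explicit idle-heartbeat settings."""
    # Imported lazily so this module can be loaded without the driver installed.
    from cassandra.cluster import Cluster

    return Cluster(
        contact_points=list(contact_points),
        idle_heartbeat_interval=interval,
        idle_heartbeat_timeout=timeout,
    )
```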

kbr-scylla commented on May 28, 2024

@sylwiaszunejko so it means that the connection was marked as "defunct" due to this heartbeat, and this happened before we sent the request? I'm trying to understand the relationship between "defunct by heartbeat" and "Client request timeout".

Did "client request timeout" cause "defunct by heartbeat", or was the connection marked as defunct first, which then caused the request to "timeout"?

What happens when we try to send a request over a connection that is "defunct by heartbeat"? Do we attempt the request nevertheless? Or does the driver immediately assume that the request will fail and return a timeout?

sylwiaszunejko commented on May 28, 2024

I guess this message comes from this place in python-driver: https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L4556. I am not that familiar with this logic, but judging by the comments and the code, I guess the connection was marked as defunct first, which then caused the request to "timeout".

sylwiaszunejko commented on May 28, 2024

@kbr-scylla FYI, after talking with @avelanarius, I realized that the 'Connection defunct by heartbeat': 'Client request timeout' message can be misleading, and the timeout is not always due to a heartbeat issue. For example, in scylladb/python-driver#275 the timeout was caused by blocking tasks in a ThreadPoolExecutor, not by the heartbeat failing.

kbr-scylla commented on May 28, 2024

Possibly related / the same? #14806
