Comments (15)
Anything interesting in node 127.0.98.3 logs?
from scylladb.
Anything interesting in node 127.0.98.3 logs?
Not sure; I don't see anything interesting:
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release/7191/artifact/logs-full.release.001/1705484423669_wide_rows_test.py%3A%3ATestWideRows%3A%3Atest_large_row_in_materialized_view%5BSizeTieredCompactionStrategy%5D/node3.log
Most of the time you won't see anything interesting in the logs in cases of timeouts.
It has happened more than a few times in the last year:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/546d8010-b544-11ee-81c7-3986d18dafd5
I would argue this CREATE MATERIALIZED VIEW should probably have a bigger timeout, but only after the people responsible for MVs sign it off.
I would argue this CREATE MATERIALIZED VIEW should probably have a bigger timeout, but only after the people responsible for MVs sign it off.
I'm not sure how the people responsible for MV (me ;-)) need to sign it (what is "it"?) off.
Basically, CREATE MATERIALIZED VIEW is a slow operation. It shouldn't be very slow - e.g., it doesn't wait for the entire view building to happen (that happens asynchronously in the background, later) - but it's a schema operation and those are always relatively slow. In my cql-pytest work, I noticed the Python driver's default `request_timeout` setting of 10 seconds was not always enough on extremely overloaded test machines and debug builds - which are often 100 times (!) slower than normal - so I increased `request_timeout` to 120. What `request_timeout` do you use in dtest? A quick glance at the code suggests maybe it's 600 seconds - which is more than enough, but please check to make sure which `request_timeout` applies to this specific test.
In addition to `request_timeout`, there are also `connect_timeout` and `control_connection_timeout`, which I also increased to 60 seconds. If I understand correctly, dtest_setup.py increases them to "just" 5 and 6 seconds, respectively; maybe that's not enough? But to be honest, I think the specific error you got doesn't indicate that these two parameters are relevant (see #11289 on why those two additional timeout increases were necessary).
If you do have a very long timeout (e.g., 600 seconds) and you're sure about it, you might have discovered a genuine bug in the schema update code which causes a hang. This should be something that the schema-change people (e.g., @kbr-scylla) should be worried about, not the MV people.
dtest doesn't change a session's `default_timeout`, and that's 10s.
If for a specific command we need more than that, we change just that one.
And it seems you just explained that creating an MV is the same as any other schema-changing operation and might take more than 10s - so it should only be considered a problem after 10 minutes?
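The override pattern being described here - a session-wide default that a specific slow statement can raise per call - can be sketched in plain Python. `FakeSession` and `effective_timeout` are illustrative names, not the driver's internals:

```python
# Illustrative sketch (not the real driver code) of the pattern discussed:
# a session-wide default_timeout of 10 s that a specific slow statement,
# such as CREATE MATERIALIZED VIEW, can override on a per-call basis.
class FakeSession:
    default_timeout = 10.0  # seconds, like the Python driver's Session

    def effective_timeout(self, timeout=None):
        # A per-call timeout wins over the session-wide default.
        return timeout if timeout is not None else self.default_timeout

session = FakeSession()
print(session.effective_timeout())             # ordinary statement: 10.0
print(session.effective_timeout(timeout=120))  # slow DDL statement: 120
```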
I thought I saw you set `request_timeout=600`, but it was apparently in some unused part of dtest, not the default.
10 minutes is ridiculously high (although the same might be said about 10 seconds ;-)). In cql-pytest I set it to 120 and it was more than enough.
Here is my comment about this:
```python
# The default timeout (in seconds) for execute() commands is 10, which
# should have been more than enough, but in some extreme cases with a
# very slow debug build running on a very busy machine and a very slow
# request (e.g., a DROP KEYSPACE needing to drop multiple tables)
# 10 seconds may not be enough, so let's increase it. See issue #7838.
request_timeout=request_timeout)
```
As this notes, the slow request that caused problems for me was a DROP KEYSPACE that needed to drop multiple tables and views - not a single MV creation, which in my experience never exceeded 10 seconds - but who knows, maybe if the test machine is 500 times slower than normal, instead of the "usual" 100, even a single MV creation is too slow :-(
@nyh - what should be the next step here?
Another failure spotted on 5.4 next
Another failure spotted on 5.4 next
Indeed, it also has

```
def test_large_row_in_materialized_view(self):
...
>       session.execute('create materialized view %s as select * from %s '
...
E       cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.27.3:9042

cassandra/cluster.py:5048: OperationTimedOut
```
Looking at the scylla log in the zip you attached (thanks), node3.log (the node at the reported IP address) has

```
INFO  2024-04-02 19:19:15,650 [shard 0:stat] migration_manager - Create new view: org.apache.cassandra.config.CFMetaData@0x60000511d500[cfId=e5336a20-f125-11ee-8c76-1527e943ec90,ksName=wide_row,cfName=user_events_view
...
INFO  2024-04-02 19:19:15,690 [shard 0:stat] schema_tables - Creating wide_row.user_events_view id=e5336a20-f125-11ee-8c76-1527e943ec90 version=033cd87c-ebaf-36dc-8ba9-97ceebb21464
...
INFO  2024-04-02 19:19:15,747 [shard 0:stat] view - Finished building view wide_row.user_events_view
```
So it seems creating the view did not take a long time at all - in less than 0.1 seconds we not only created the view, we even finished the view building (which CREATE MATERIALIZED VIEW doesn't even wait for). In the log, nothing further happens after 19:19:15 until 19:19:46 (30 seconds later), when Scylla is asked to shut down.
I guess it's possible that we have a bug that a successful CREATE MATERIALIZED VIEW forgets to return a response, but why do we not see it all the time? Why isn't there any error logging or something?
I suspect some sort of Python driver bug, but don't know how to explain it. Does anyone know what "Connection defunct by heartbeat" even means?
Does anyone know what "Connection defunct by heartbeat" even means?
Good question. @sylwiaszunejko how is `'Connection defunct by heartbeat': 'Client request timeout'` different from a "normal timeout"? (I assume that this is some kind of special timeout, not every timeout looks like this, but I may be wrong)
Does anyone know what "Connection defunct by heartbeat" even means?
Good question. @sylwiaszunejko how is `'Connection defunct by heartbeat': 'Client request timeout'` different from a "normal timeout"? (I assume that this is some kind of special timeout, not every timeout looks like this, but I may be wrong)
@kbr-scylla Heartbeating means sending an OPTIONS message on idle connections. Both the timeout the heartbeat waits for idle-connection responses and the interval at which idle connections are heartbeated can be configured. This helps keep connections open through network devices that expire idle connections, and helps discover bad connections early in low-traffic scenarios.
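A toy model of that mechanism may help (purely illustrative, not the driver's actual code; `Connection` and `run_heartbeat` are made-up names): idle connections get a probe, and any that fail to answer within the heartbeat timeout are marked defunct.

```python
# Toy model (not the driver's real code) of heartbeat-based connection checks:
# idle connections are probed with OPTIONS; connections that don't respond
# within the heartbeat timeout are marked defunct so they are discovered early.
class Connection:
    def __init__(self, name, response_time):
        self.name = name
        self.response_time = response_time  # simulated probe round-trip, seconds
        self.defunct = False

def run_heartbeat(connections, heartbeat_timeout=5.0):
    defuncted = []
    for conn in connections:
        if conn.response_time > heartbeat_timeout:
            conn.defunct = True  # no answer in time: treat the connection as bad
            defuncted.append(conn.name)
    return defuncted

conns = [Connection("node1", 0.01), Connection("node3", float("inf"))]
print(run_heartbeat(conns))  # ['node3']
```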
@sylwiaszunejko so it means that the connection was marked as "defunct" due to this heartbeat, and this happened before we sent the request? I'm trying to understand the relationship between "defunct by heartbeat" and "Client request timeout".
Did "client request timeout" cause "defunct by heartbeat", or was the connection marked as defunct first, which then caused the request to "timeout"?
What happens when we try to send a request over a connection that is "defunct by heartbeat"? Do we attempt the request nevertheless? Or does the driver immediately assume that the request will fail and returns timeout?
I guess this message comes from this place in python-driver: https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L4556. I am not that familiar with this logic, but judging by the comments and the code, I guess the connection was marked as defunct first, which then caused the request to "timeout".
@kbr-scylla FYI, after talking with @avelanarius, I realized that the `'Connection defunct by heartbeat': 'Client request timeout'` message could be misleading, and the timeout is not always due to a heartbeat issue. For example, in scylladb/python-driver#275 the timeout was caused by ThreadPoolExecutor blocking tasks, not a heartbeat failing.
Possibly related / the same? #14806