Comments (12)
The following disk stall caused the problem:
13:42:42 disk_stall.go:128: test status: pausing 18.999452787s before next simulated disk stall on n1
13:43:01 cluster.go:2369: running cmd sudo dmsetup suspend --nofl... on nodes [:1]; details in run_134301.673602484_n1_sudo-dmsetup-suspend.log
13:43:32 cluster.go:2369: running cmd sudo dmsetup resume data1 on nodes [:1]; details in run_134332.310261019_n1_sudo-dmsetup-resume-.log
13:43:42 disk_stall.go:128: test status: pausing 9m18.999550257s before next simulated disk stall on n1
The pmax of observed WAL fsync latency is not unexpectedly high.
Unlike #122364, this run does not use an encrypted FS.
KV p99 is 10+s and n1 lost its leases:
![Screenshot 2024-05-19 at 5 15 52 PM](https://private-user-images.githubusercontent.com/54990988/331901711-4fa51710-2f75-4c82-abc3-4a535f247309.png)
from cockroach.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
from cockroach.
The p99 goroutine scheduling latency is low, and the number of runnable goroutines is also low, but we have slot exhaustion, which must be a side effect of request processing being stuck. We don't have latency histograms for the storage read path, which would have allowed us to narrow this down.
from cockroach.
We could enhance the test to monitor the used slots on n1 and if it exceeds 500, take a goroutine dump, so we know where those goroutines are stuck.
from cockroach.
Never mind. We have a dump.
from cockroach.
There are > 2500 goroutines stuck waiting on `<-v.loaded` in `tableCacheShard` [1], so they are waiting for the cache value (`*Reader` etc.) to be loaded from disk.
There are 3 goroutines that are trying to do the load:
goroutine 2557115 [runnable]:
runtime/pprof.Do({0x7d2ee08?, 0xca9b180?}, {{0xc00138ed00?, 0x0?, 0x0?}}, 0xc00a4d82e8)
GOROOT/src/runtime/pprof/runtime.go:52 +0xad
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x603b?}, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:931 +0x43c
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
This corresponds to this code:
https://github.com/cockroachdb/pebble/blob/c077435aef13d640fc599ad01d10455f68510d1a/table_cache.go#L931-L933
But they are in the runtime package here https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52, which (since this is at the closing brace of the function) looks like `defer SetGoroutineLabels(ctx)`, which calls into https://github.com/golang/go/blob/master/src/runtime/proflabel.go#L12-L35. It is unclear why they would be stuck here.
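To make the shape of the problem concrete, here is a rough sketch of the load-once pattern described above: one goroutine loads the value inside `pprof.Do`, and every other caller parks on `<-v.loaded`. This is not pebble's code; the `shard`, `cacheValue`, and `find` names, the `load` callback, and the label key/value are all hypothetical stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
	"sync"
)

// cacheValue is a hypothetical stand-in for a table cache entry.
type cacheValue struct {
	loaded chan struct{} // closed once reader is populated
	reader string        // stands in for the loaded *Reader
}

// shard is a hypothetical cache shard keyed by file number.
type shard struct {
	mu     sync.Mutex
	values map[int]*cacheValue
	loads  int // number of loads actually performed
}

// find returns the value for fileNum. The first caller performs the load
// under pprof labels; every other caller blocks on <-v.loaded.
func (s *shard) find(fileNum int, load func(int) string) *cacheValue {
	s.mu.Lock()
	if v, ok := s.values[fileNum]; ok {
		s.mu.Unlock()
		<-v.loaded // waiters park here until the loader finishes
		return v
	}
	v := &cacheValue{loaded: make(chan struct{})}
	s.values[fileNum] = v
	s.loads++
	s.mu.Unlock()

	// pprof.Do applies the labels for the duration of the callback and
	// restores the previous labels when the callback returns.
	labels := pprof.Labels("component", "table-cache") // illustrative label
	pprof.Do(context.Background(), labels, func(context.Context) {
		v.reader = load(fileNum)
	})
	close(v.loaded) // releases all waiters at once
	return v
}

func main() {
	s := &shard{values: map[int]*cacheValue{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.find(7, func(n int) string { return fmt.Sprintf("reader-%d", n) })
		}()
	}
	wg.Wait()
	fmt.Println("loads performed:", s.loads) // prints "loads performed: 1"
}
```

In this sketch a single slow `load` holds up every caller asking for the same entry, which is consistent with a few loader goroutines and thousands of waiters in the dump.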
[1]
goroutine 2942846 [runnable]:
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x6a03?}, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:862 +0x10f
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc0036446e0, {0x7d30150, 0xc0e3361a70}, 0xc03fc99a00, 0xc0e301b170, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:505 +0x97
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x7f82ead0a064?, {0x7d30150?, 0xc0e3361a70?}, 0x115a6ca?, 0xc00f372b40?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:219 +0x9e
github.com/cockroachdb/pebble.tableNewIters.TODO(0xd?, {0x7d30150?, 0xc0e3361a70?}, 0x125b0e5?, 0xc0e301b2d8?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}})
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:90 +0x6e
github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc0e301b108, 0xc03fc99a00, 0x1)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:647 +0x2d1
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc0e301b108, {0xc0a5d23140, 0xd, 0xd}, {0xc0a5d23670, 0xd, 0xd}, 0x0)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:725 +0xb7
github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc0e301ac00, {0xc0a5d23670?, 0x0?, 0x17d0e7acb5255d37?}, 0x0?, 0x0)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1035 +0xf1
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGEStrict(0xc0e301ac00, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0x1000121ebcf?, 0x7f80187fa960?}, 0x10?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1113 +0x65
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc01993b008?, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0xc008c0e9b8?, 0x4b6789?}, 0xd?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1104 +0x1d
github.com/cockroachdb/pebble.(*lazyCombinedIter).SeekPrefixGE(0xc0e301aa68, {0xc0a5d23140?, 0x0?, 0xd?}, {0xc0a5d23670, 0xd, 0xd}, 0x10?)
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/range_keys.go:617 +0x57
github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc0e301a608, {0xc0a5d23670, 0xd, 0xd})
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1492 +0x564
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc08ae59858, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}})
github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:373 +0xc5
github.com/cockroachdb/cockroach/pkg/storage.mvccGetMetadata({0x7db7b60, 0xc08ae59858}, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}}, 0xc0e30db440)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1656 +0xa5
github.com/cockroachdb/cockroach/pkg/storage.mvccPutInternal({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2277 +0x3b4
github.com/cockroachdb/cockroach/pkg/storage.mvccPutUsingIter({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2032 +0x1f6
github.com/cockroachdb/cockroach/pkg/storage.MVCCPut({_, _}, {_, _}, {_, _, _}, {_, _}, {{0xc0e3428900, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1939 +0x494
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Put({_, _}, {_, _}, {{0x7dd6d08, 0xc0e33070c0}, {{0x17d0e7afce337fe9, 0x0}, 0x0, {0x17d0e7afce378f6f, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_put.go:74 +0x391
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x17d0e7afce337fe9, 0x0}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:488 +0x3a9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc0e33698c0, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:305 +0xa5d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:736 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:703 +0x359
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluate1PC(_, {_, _}, {_, _}, _, _, _)
from cockroach.
Here is the dump:
goroutine_dump.2024-05-19T13_43_12.731.double_since_last_dump.000007660.txt.gz
from cockroach.
The test failure in #124399 (comment) does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.
14:24:38 test_runner.go:1098: ##teamcity[testFailed name='disk-stalled/wal-failover/among-stores' details='(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z|
from cockroach.
I am going to remove the release blocker label. This is a rare case where WAL failover is not mitigating SQL-level latency, but narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker, since the same would occur for any read that misses both the block cache and the page cache.
from cockroach.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.166684829s at 2024-05-28T14:05:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
from cockroach.
#124399 (comment) does not have a goroutine dump from the time of the failure.
from cockroach.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 0d7110a85b5bbf0eb68bafd15abca076948b434e:
(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.010474919s at 2024-06-12T14:08:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
Same failure on other branches
- #124977 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc]
from cockroach.