GithubHelp home page GithubHelp logo

Comments (12)

sumeerbhola avatar sumeerbhola commented on July 17, 2024

The following disk stall caused the problem
13:42:42 disk_stall.go:128: test status: pausing 18.999452787s before next simulated disk stall on n1
13:43:01 cluster.go:2369: running cmd sudo dmsetup suspend --nofl... on nodes [:1]; details in run_134301.673602484_n1_sudo-dmsetup-suspend.log
13:43:32 cluster.go:2369: running cmd sudo dmsetup resume data1 on nodes [:1]; details in run_134332.310261019_n1_sudo-dmsetup-resume-.log
13:43:42 disk_stall.go:128: test status: pausing 9m18.999550257s before next simulated disk stall on n1

The pmax of observed WAL fsync latency is not unexpectedly high.
Screenshot 2024-05-19 at 5 11 26 PM

Unlike #122364, this is not encrypted FS.
KV p99 is 10+s and n1 lost its leases:
Screenshot 2024-05-19 at 5 14 44 PM

Screenshot 2024-05-19 at 5 15 52 PM

from cockroach.

cockroach-teamcity avatar cockroach-teamcity commented on July 17, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

The p99 goroutine scheduling latency is low, and the number of runnable goroutines is also low, but we have slot exhaustion. Which must be a side-effect of request processing being stuck. We don't have latency histograms for the storage read path, which would allow us to narrow down.
Screenshot 2024-05-21 at 1 05 16 PM
Screenshot 2024-05-21 at 1 05 53 PM
Screenshot 2024-05-21 at 1 06 03 PM

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

We could enhance the test to monitor the used slots on n1 and if it exceeds 500, take a goroutine dump, so we know where those goroutines are stuck.

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

Never mind. We have a dump.

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

There are > 2500 goroutines stuck waiting on [1]

<-v.loaded

in tableCacheShard, so they are waiting for the cache value (*Reader etc.) to be loaded from disk.
There are 3 goroutines that are trying to do the load

goroutine 2557115 [runnable]:
runtime/pprof.Do({0x7d2ee08?, 0xca9b180?}, {{0xc00138ed00?, 0x0?, 0x0?}}, 0xc00a4d82e8)
	GOROOT/src/runtime/pprof/runtime.go:52 +0xad
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x603b?}, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:931 +0x43c
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c

Which is this code
https://github.com/cockroachdb/pebble/blob/c077435aef13d640fc599ad01d10455f68510d1a/table_cache.go#L931-L933

But they are in the runtime package here https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52 which (since this is at the closing brace of the function) looks like defer SetGoroutineLabels(ctx), which calls into https://github.com/golang/go/blob/master/src/runtime/proflabel.go#L12-L35. Unclear why this should be stuck here.
[1]

goroutine 2942846 [runnable]:
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x6a03?}, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:862 +0x10f
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc0036446e0, {0x7d30150, 0xc0e3361a70}, 0xc03fc99a00, 0xc0e301b170, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:505 +0x97
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x7f82ead0a064?, {0x7d30150?, 0xc0e3361a70?}, 0x115a6ca?, 0xc00f372b40?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:219 +0x9e
github.com/cockroachdb/pebble.tableNewIters.TODO(0xd?, {0x7d30150?, 0xc0e3361a70?}, 0x125b0e5?, 0xc0e301b2d8?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:90 +0x6e
github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc0e301b108, 0xc03fc99a00, 0x1)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:647 +0x2d1
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc0e301b108, {0xc0a5d23140, 0xd, 0xd}, {0xc0a5d23670, 0xd, 0xd}, 0x0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:725 +0xb7
github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc0e301ac00, {0xc0a5d23670?, 0x0?, 0x17d0e7acb5255d37?}, 0x0?, 0x0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1035 +0xf1
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGEStrict(0xc0e301ac00, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0x1000121ebcf?, 0x7f80187fa960?}, 0x10?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1113 +0x65
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc01993b008?, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0xc008c0e9b8?, 0x4b6789?}, 0xd?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1104 +0x1d
github.com/cockroachdb/pebble.(*lazyCombinedIter).SeekPrefixGE(0xc0e301aa68, {0xc0a5d23140?, 0x0?, 0xd?}, {0xc0a5d23670, 0xd, 0xd}, 0x10?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/range_keys.go:617 +0x57
github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc0e301a608, {0xc0a5d23670, 0xd, 0xd})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1492 +0x564
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc08ae59858, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}})
	github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:373 +0xc5
github.com/cockroachdb/cockroach/pkg/storage.mvccGetMetadata({0x7db7b60, 0xc08ae59858}, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}}, 0xc0e30db440)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1656 +0xa5
github.com/cockroachdb/cockroach/pkg/storage.mvccPutInternal({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2277 +0x3b4
github.com/cockroachdb/cockroach/pkg/storage.mvccPutUsingIter({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2032 +0x1f6
github.com/cockroachdb/cockroach/pkg/storage.MVCCPut({_, _}, {_, _}, {_, _, _}, {_, _}, {{0xc0e3428900, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1939 +0x494
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Put({_, _}, {_, _}, {{0x7dd6d08, 0xc0e33070c0}, {{0x17d0e7afce337fe9, 0x0}, 0x0, {0x17d0e7afce378f6f, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_put.go:74 +0x391
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x17d0e7afce337fe9, 0x0}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:488 +0x3a9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc0e33698c0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:305 +0xa5d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:736 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:703 +0x359
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluate1PC(_, {_, _}, {_, _}, _, _, _)

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

here is the dump
goroutine_dump.2024-05-19T13_43_12.731.double_since_last_dump.000007660.txt.gz

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

The test failure in #124399 (comment) does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.

14:24:38 test_runner.go:1098: ##teamcity[testFailed name='disk-stalled/wal-failover/among-stores' details='(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z|

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

I am going to remove the release blocker label since this is a rare case where WAL failover is not mitigating SQL-level latency, but narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker since the same would occur on a block cache + page cache miss for a read.

from cockroach.

cockroach-teamcity avatar cockroach-teamcity commented on July 17, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.166684829s at 2024-05-28T14:05:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

from cockroach.

sumeerbhola avatar sumeerbhola commented on July 17, 2024

#124399 (comment) does not have a goroutine dump from the time of the failure.

from cockroach.

cockroach-teamcity avatar cockroach-teamcity commented on July 17, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 0d7110a85b5bbf0eb68bafd15abca076948b434e:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.010474919s at 2024-06-12T14:08:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

  • #124977 roachtest: disk-stalled/wal-failover/among-stores failed [C-test-failure O-roachtest O-robot T-storage branch-release-24.1.1-rc]

This test on roachdash | Improve this report!

from cockroach.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.