arulajmani commented on July 18, 2024

This seems like a re-occurrence of the issue we diagnosed in #126272.

We see from the logs that we try to stall n4:

17:59:46 failover.go:856: failing n4 (pause)
teamcity-15831467-1719410159-51-n7cpu2: sending signal 19
18:00:46 failover.go:861: recovering n4 (pause)
teamcity-15831467-1719410159-51-n7cpu2: sending signal 18
18:00:48 monitor.go:177: Monitor event: n4: cockroach process for system interface died (exit code 7)

But the process is terminated before it can be recovered.

W240626 18:00:47.421735 694336 1@util/log/file.go:274 ⋮ [-] 1473  disk slowness detected: unable to sync log files within 10s
W240626 18:00:47.450892 110 1@server/server_http.go:296 ⋮ [T1,Vsystem,n4] 1479  close tcp ‹[::]:26258›: ‹use of closed network connection›
F240626 18:00:47.423273 694337 1@util/log/file.go:265 ⋮ [-] 1489  disk stall detected: unable to sync log files within 20s
F240626 18:00:47.423273 694337 1@util/log/file.go:265 ⋮ [-] 1489 !
F240626 18:00:47.423273 694337 1@util/log/file.go:265 ⋮ [-] 1489 !This node experienced a fatal error (printed above), and as a result the
F240626 18:00:47.423273 694337 1@util/log/file.go:265 ⋮ [-] 1489 !process is terminating.
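
To make the timing concrete, here is a rough sketch (not the actual CockroachDB code) of why a 60-second pause would trip both thresholds at once on resume. It assumes the detector compares wall-clock time and that a log sync was outstanding while the process was frozen; `classify_sync_delay` is a hypothetical helper, and the 10s/20s thresholds are simply the values printed in the log lines above.

```python
from datetime import datetime, timedelta

# Thresholds taken from the log messages above; the detector logic is assumed.
WARN_AFTER = timedelta(seconds=10)   # "disk slowness detected ... within 10s"
FATAL_AFTER = timedelta(seconds=20)  # "disk stall detected ... within 20s"


def classify_sync_delay(elapsed: timedelta) -> str:
    """Hypothetical classifier: how a wall-clock watchdog might rate a sync delay."""
    if elapsed >= FATAL_AFTER:
        return "fatal"
    if elapsed >= WARN_AFTER:
        return "warn"
    return "ok"


# Pause and resume times from the failover.go log lines (signal 19, then signal 18).
paused_at = datetime(2024, 6, 26, 17, 59, 46)
resumed_at = datetime(2024, 6, 26, 18, 0, 46)
pause = resumed_at - paused_at

print(pause.total_seconds(), classify_sync_delay(pause))
```

Under those assumptions, any sync that was in flight at 17:59:46 looks 60 seconds old the moment the process wakes up, which is consistent with the warning and the fatal being logged within milliseconds of each other at 18:00:47.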

@itsbilal this was fixed in #125703, but we never backported it to any of the release branches. Maybe we should?

This specific failure is on an rc branch, so we can probably close this out. However, I'll assign it to you and keep it open for now so that it gets your eyes for the backport ping.

itsbilal commented on July 18, 2024

I wonder if we can just tweak the failover roachtest to not stall the log directory at all, or to force a much higher value for COCKROACH_LOG_MAX_SYNC_DURATION. I'm wary of backporting a subtle change in disk stall behaviour to stable release branches, as it has some potential to backfire in the wild. WAL Failover is a new feature in 24.1 with not that many users (yet), but the way cockroach responds to disk stalls is more set in stone and relied on by more customers, and I don't think it's a good idea to put a subtle change in that space into a patch release.

@arulajmani what do you think?
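
As a rough illustration of the second option, the test could derive the override from the pause length so the fatal threshold always sits above it. This is only a sketch: `log_sync_override` and its `margin` parameter are hypothetical, the `"120s"` duration-string format is an assumption about how the variable is parsed, and how roachtest actually passes environment variables to nodes is not shown here.

```python
def log_sync_override(pause_seconds: int, margin: int = 2) -> dict:
    """Hypothetical helper: env override that outlasts the test's pause window."""
    # Assumes the variable accepts a duration string such as "120s".
    return {"COCKROACH_LOG_MAX_SYNC_DURATION": f"{pause_seconds * margin}s"}


# The failover test pauses the node for 60 seconds (17:59:46 to 18:00:46).
env = log_sync_override(60)
print(env)
```

With a 60-second pause and a 2x margin this yields a 120-second limit, comfortably above the stall the test itself induces.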
