
Comments (4)

bgentry commented on August 25, 2024

I was able to get one failure of this locally with go test -race -run 'Test_Client_InsertTriggersImmediateWork' -count 5000, but not with the new extra logging in place. At least it's not impossible to reproduce locally, just rare.

There are some odd things about the ordering of operations in this test that I'm looking into.

from river.

bgentry commented on August 25, 2024

I’ve got some more info on this flaky test. I added some more logging in the test callback function, as well as in the waitForClientHealthy routine. I have 3 different flaky test scenarios so far:

  1. For one of the failures, the notifier never comes up as healthy; it’s just stuck in initializing. I'm guessing the connect attempt occasionally hangs, and we don't have any specific timeout there to prevent it from hanging indefinitely (if no context timeout or pgx config ConnectTimeout is set).

    In my local testing, I believe I am able to fully resolve this issue by setting a tight 1 second context timeout on the notifier's establishConn. However, a hardcoded short timeout like that is not viable outside of dev/tests. Maybe we should just set the ConnectTimeout on the pgconn.Config for all our tests? This also feels like it should be documented in a "Postgres best practices" doc page, or included in some kind of automated "is my pool configured sanely for prod workloads" check.

    IMO it also highlights the importance of further developing our client health API. The client monitor is the only way to be able to debug this stuff right now, and it's not exposed outside of the river package.

  2. The 2nd scenario is more like the one you shared above. The notifier has started successfully, and yet the 2nd job does not get worked during the test: there's basically a 5 second gap between the 1st job running and the client getting shut down because the test context has ended. Still trying to understand the cause here.

  3. I ended up with a 3rd failure on my latest run. It's a variation of (2), except that after the main test failed there was an additional panic due to logging after the test ended. Seems we were trying to emit a resignation message on a pool that was already closed:

    panic: Log in goroutine after Test_Client_InsertTriggersImmediateWork has completed: time=2024-02-22T09:02:01.816-06:00 level=ERROR msg="error attempting to resign" elector.err="closed pool"
    


brandur commented on August 25, 2024

Nice investigating. Regarding (1):

  • As discussed on chat, seems plausible to try decreasing ConnectTimeout for tests.
  • The fact that the test seems to start at all also suggests a problem somewhere in waitForClientHealthy and/or the client monitoring infrastructure.

Another strategy we can try here if the connect timeout doesn't fix it is to work our way in from the edges and put more test coverage (including stress tests) on the various components like notifier, then producer. There are almost certainly bugs and rough edges in there, and it'll help tease those out and hopefully improve the overall resilience of Client.


brandur commented on August 25, 2024

Haven't seen this one in days. Definitively fixed by #253.

