HAProxy 2.9.4 Randomly Crashes

JB0925 avatar JB0925 commented on June 11, 2024
HAProxy 2.9.4 Randomly Crashes

from haproxy.

Comments (6)

wtarreau avatar wtarreau commented on June 11, 2024

OK, that was it: I could also reproduce it by modifying the code, and I can confirm the fix works. I've merged it now and will backport it. In the meantime, you can work around this by disabling the global pools (they're usually disabled except on some rare distros where it's unknown whether malloc() is fast). For this, just add -dMno-global on the command line and you won't hit that bug anymore.

Darlelet avatar Darlelet commented on June 11, 2024

OK, so apparently the process is crashing because ha_panic() is being called as a result of thread contention.

According to your trace and version, nearly all threads are stuck here (45 out of 64):

#1  0x00000000005d63c1 in pool_flush (pool=0x22b37c0) at src/pool.c:765
        next = <optimized out>
        temp = <optimized out>
        down = <optimized out>
        bucket = <optimized out>
#2  0x00000000004d9d00 in __task_free (t=<optimized out>) at include/haproxy/task.h:628
        __ptr = <optimized out>

Which corresponds to this:

static inline void __task_free(struct task *t)
{
        if (t == th_ctx->current) {
                th_ctx->current = NULL;
                __ha_barrier_store();
        }
        BUG_ON(task_in_wq(t) || task_in_rq(t));

        BUG_ON((ulong)t->caller & 1);
#ifdef DEBUG_TASK
        HA_ATOMIC_STORE(&t->debug.prev_caller, HA_ATOMIC_LOAD(&t->caller));
#endif
        HA_ATOMIC_STORE(&t->caller, (void*)1); // make sure to crash if used after free

        pool_free(pool_head_task, t);
        th_ctx->nb_tasks--;
        if (unlikely(stopping))
                pool_flush(pool_head_task); /* ======= line 628: STUCK HERE */
}

So this would suggest that the contention occurs during soft-stop. Do you know if your haproxy process is restarted, either manually or from a script, every once in a while? You mentioned that it was crashing roughly once a day, so maybe this coincides with a restart performed once per day?

Now, to explain why the crash occurs in 2.9.4 and not in the last 2.8 version you tested (which exact 2.8 version was it, by the way?): maybe this has something to do with 72c23bd (the most recent spoe code change), or with the recent changes in the pools code, introduced in 2.9, that reduce contention by using multiple buckets.

Perhaps in this case some spoe resources are slowly piling up during process runtime, and when soft-stop occurs, all those resources are scheduled for cleanup and there is too much at once for haproxy to keep up?

15ljindal avatar 15ljindal commented on June 11, 2024

We were able to partially reproduce the crash by restarting haproxy and observing the crash during shutdown.

Not every restart caused the crash though.

  • If I restart again less than 30 minutes after a previous restart, there is no crash. But if I restart after a couple of hours, there is a crash.
  • Hosts without the spoa do not seem to crash on restart, but hosts with the spoa did crash on restart after running for a couple of hours. So the pile-up seems to be related to the spoa being there.

We plan to run this spoa at a much higher request rate, so this issue is quite concerning.

> Perhaps in this case some spoe resources are slowly piling up during process runtime

I am wondering why we would have such an "excessive" pile-up that haproxy cannot keep up with it, and what we are doing wrong. Do you have any suggestions on what we should try in order to understand the pile-up (e.g. any commands we can run before restarting), so that we can apply the right fix?

I am trying to compare the "show pools" output of two hosts, one without spoa and one with spoa. I can see that some pools differ a lot, but I am not sure at what point the buildup can be called "excessive".
[Attached image: Screenshot 2024-02-08 at 7 19 58 PM]

wtarreau avatar wtarreau commented on June 11, 2024

There's something really odd here. It's not expected to spend that much time like this on an entry in a pool, particularly in this version where entries are sharded to further reduce contention.

I'm just wondering if we don't have a bug here in pool_flush() in case multiple threads call it at once. Indeed, the do { ... } while ((ret = xchg(BUSY)) == BUSY) may occasionally replace a NULL with a BUSY if the value changes between the test and the xchg() call. But the following test replaces it with NULL only if ret is not null, so unless I'm missing something, we may occasionally replace a NULL with BUSY since commit 2a4523f ("BUG/MAJOR: pools: fix possible race with free() in the lockless variant") merged in 2.5.
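
To make the suspected window easier to follow, here is a minimal single-threaded sketch that replays the interleaving described above using C11 atomics. It is not haproxy's code: free_list, LIST_BUSY, detach() and simulate_race() are hypothetical names invented for the illustration, the retry loop on BUSY is left out, and the always_unlock variant is only one possible way to close the window, not necessarily what the actual patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sentinel marking the list head as temporarily owned by one thread.
 * The name and value are assumptions made for this sketch. */
#define LIST_BUSY ((void *)1)

/* Hypothetical stand-in for the free-list head of one pool bucket. */
static _Atomic(void *) free_list;

/* Second half of the detach step, run after the caller has already
 * tested the head and seen a non-NULL, non-BUSY value.
 * always_unlock == 0 models the behaviour described in the comment;
 * always_unlock == 1 models one possible correction. */
static void *detach(int always_unlock)
{
        void *ret = atomic_exchange(&free_list, LIST_BUSY);

        if (ret == LIST_BUSY)
                return NULL; /* owned by someone else; a real loop would retry */
        if (ret != NULL || always_unlock)
                atomic_store(&free_list, NULL); /* release the head */
        return ret;
}

/* Replays the problematic interleaving on a single thread and returns 1
 * if the head is left stuck on LIST_BUSY, 0 otherwise. */
int simulate_race(int always_unlock)
{
        static int item; /* dummy free object */

        atomic_store(&free_list, (void *)&item);

        /* Thread A: the test sees a non-empty list. */
        void *seen = atomic_load(&free_list);
        assert(seen != NULL && seen != LIST_BUSY);

        /* Thread B: flushes the whole list in between. */
        void *taken_by_b = detach(always_unlock);
        assert(taken_by_b == (void *)&item);

        /* Thread A: performs its exchange on what is now an empty list. */
        void *taken_by_a = detach(always_unlock);
        assert(taken_by_a == NULL);

        return atomic_load(&free_list) == LIST_BUSY;
}
```

In this model, simulate_race(0) leaves the head on LIST_BUSY with nobody left to reset it, which would make every later caller spin on it, while simulate_race(1) leaves the head at NULL.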

I need to study this more carefully to make sure I'm not missing anything. As Aurélien mentioned, it's possible that a recent change puts a bit more stress on some pool_free() calls and triggers the issue more easily.

wtarreau avatar wtarreau commented on June 11, 2024

I'll likely merge the attached patch; you may want to try applying it.
pool.diff.txt

JB0925 avatar JB0925 commented on June 11, 2024

Hey @wtarreau and @Darlelet , I wanted to confirm that this fix is indeed working, and also thank you for the time and effort you put into it, as well as for the quick response. We really appreciate your efforts.
