Comments (6)
OK that was it, I could also reproduce it by modifying the code, and I can confirm the fix works. I've merged it now and will backport it. In the meantime, you can work around this by disabling the global pools (it's usually disabled except on some rare distros where it's unknown whether malloc() is fast). For this, just add -dMno-global on the command line and you won't hit that bug anymore.
from haproxy.
OK, so apparently the process is crashing due to ha_panic() being called as a result of thread contention.
According to your trace and version, nearly all threads are stuck here (45 out of 64):
#1 0x00000000005d63c1 in pool_flush (pool=0x22b37c0) at src/pool.c:765
        next = <optimized out>
        temp = <optimized out>
        down = <optimized out>
        bucket = <optimized out>
#2 0x00000000004d9d00 in __task_free (t=<optimized out>) at include/haproxy/task.h:628
        __ptr = <optimized out>
Which corresponds to this:
static inline void __task_free(struct task *t)
{
	if (t == th_ctx->current) {
		th_ctx->current = NULL;
		__ha_barrier_store();
	}
	BUG_ON(task_in_wq(t) || task_in_rq(t));
	BUG_ON((ulong)t->caller & 1);
#ifdef DEBUG_TASK
	HA_ATOMIC_STORE(&t->debug.prev_caller, HA_ATOMIC_LOAD(&t->caller));
#endif
	HA_ATOMIC_STORE(&t->caller, (void*)1); // make sure to crash if used after free
	pool_free(pool_head_task, t);
	th_ctx->nb_tasks--;
	if (unlikely(stopping))
		pool_flush(pool_head_task); /* ======= line 628: STUCK HERE */
}
So this would suggest that the contention occurs during soft-stop. Do you know if your haproxy process is restarted either manually or from a script every once in a while? You mentioned that it was crashing roughly once a day, so maybe this could coincide with a restart performed once per day?
Now to explain why the crash occurs in 2.9.4 and not in the last 2.8 version you tested (which exact 2.8 version was it, by the way?), maybe this could have something to do with 72c23bd (most recent spoe code change) or with the recent changes around the pools code to reduce contention with multiple buckets, introduced in 2.9.
Perhaps in this case some spoe resources are slowly piling up during process runtime, and when soft-stop occurs, all those resources are scheduled for cleanup and there is too much at once for haproxy to keep up?
from haproxy.
We were able to somewhat reproduce the crash by restarting haproxy and seeing the crash during shutdown.
Not every restart caused the crash though.
- if after a restart I do another restart within less than 30 minutes, there is no crash. But if I do the restart after a couple of hours, then there is a crash.
- hosts without the spoa do not seem to crash on restart but the hosts with spoa did crash on restart after running a couple of hours. So it seems the pile up is related to spoa being there.
We plan to run this spoa at much higher requests per second - so this issue seems quite concerning.
> Perhaps in this case some spoe resources are slowly piling up during process runtime
I am wondering why we would have such an "excessive" pile-up that is too much for haproxy to keep up with, and what it is that we are doing wrong. Do you have any suggestions on things we should try in order to understand the pile-up, e.g. any commands we can run before restarting, so that we can do the right fix?
I am trying to compare the "show pools" output of two hosts - one without spoa and one with spoa; I can see that some pools have a big difference. But I am not sure at what point the build up can be termed "excessive".
from haproxy.
There's something really odd here. It's not expected to spend that much time like this on an entry in a pool, particularly in this version where entries are sharded to further reduce contention.
I'm just wondering if we don't have a bug here in pool_flush() in case multiple threads call it at once. Indeed, the
    do { ... } while ((ret = xchg(BUSY)) == BUSY)
loop may occasionally replace a NULL with a BUSY if the value changes between the test and the xchg() call. But the following test replaces it with NULL only if ret is not NULL, so unless I'm missing something, we may occasionally leave a BUSY in place of a NULL, and this since commit 2a4523f ("BUG/MAJOR: pools: fix possible race with free() in the lockless variant") merged in 2.5.
I need to study this more carefully to make sure I'm not missing anything. As Aurélien mentioned, it's possible that a recent change puts a bit more stress on some pool_free() calls and more easily triggered the issue.
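To make the suspected interleaving concrete, here is a minimal single-threaded replay of it using C11 atomics. This is only a sketch under assumptions, not the actual pool_flush() code nor the merged patch: free_list, BUSY, item and replay_detach() are hypothetical names, and the "other thread" is simulated by a plain store placed between the test and the exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical marker meaning "list head is being manipulated". */
#define BUSY ((void *)1)

static _Atomic(void *) free_list; /* hypothetical shared list head */
static int item;                  /* stands for one cached object */

/* Replays: A tests the head, B empties the list, A does xchg(BUSY).
 * With always_release == 0 the head is reset only if the xchg returned
 * a non-NULL value (the behaviour suspected above). With
 * always_release == 1 the head is reset unconditionally.
 * Returns the final value left in free_list.
 */
void *replay_detach(int always_release)
{
	void *next;

	atomic_store(&free_list, (void *)&item);

	/* thread A: the head is neither BUSY nor NULL, so it goes on */
	next = atomic_load(&free_list);
	assert(next != NULL && next != BUSY);

	/* thread B (simulated): empties the list in the meantime */
	atomic_store(&free_list, NULL);

	/* thread A: the exchange returns NULL but installs BUSY */
	next = atomic_exchange(&free_list, BUSY);
	assert(next == NULL);

	/* thread A: release step */
	if (always_release || next != NULL)
		atomic_store(&free_list, NULL);

	return atomic_load(&free_list);
}
```

With the conditional release, replay_detach(0) leaves BUSY in the head, so any later caller waiting for the head to stop being BUSY would spin forever, which would match many threads being stuck in pool_flush(). With the unconditional release, replay_detach(1) leaves NULL.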
from haproxy.
I'll likely merge the attached patch, you may want to try to apply it.
pool.diff.txt
from haproxy.
Hey @wtarreau and @Darlelet , I wanted to confirm that this fix is indeed working, and also thank you for the time and effort you put into it, as well as for the quick response. We really appreciate your efforts.
from haproxy.