Detailed Description of the Problem Hi, when we deployed 3.0-lates

server-state-file causing SEGV in 3.0-c0ee2d7 [CLOSED]

felipewd commented on June 10, 2024

server-state-file causing SEGV in 3.0-c0ee2d7

from haproxy.

Comments (9)

wtarreau commented on June 10, 2024

Ah, thanks, so it's related to the duplication of the startup logs. I have no idea why it does something that strange but it will point me to some locations to look at. It's totally possible that it's a variant of the same that you faced earlier.

wtarreau commented on June 10, 2024

Found (and fixed) it now, thanks for your precious help as usual, Felipe! It was an issue of usable vs allocated size: the ring buffer was shrinking by 192 bytes between reexecs of the master, and you seem to have enough warnings to reach the end during the copy ;-) If you pull the latest version, or alternatively the patch above and its predecessor, that will fix it.
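
A toy model may help picture the "usable vs allocated" mix-up described here. It is not HAProxy's code: the `HEADER` constant, the 64 KiB starting size, and all function names are assumptions made for illustration. Only the 192-byte shrink per reexec comes from the comment.

```python
HEADER = 192          # assumed bookkeeping overhead; the thread only gives the 192-byte shrink
START_ALLOC = 65536   # assumed initial allocation, for illustration only


def usable(alloc_size: int) -> int:
    """Payload bytes left once the header is carved out of the allocation."""
    return alloc_size - HEADER


def reexec_sizes(generations: int, fixed: bool) -> list:
    """Return the usable size seen after each master reexec.

    Buggy variant: the copy is sized from the previous *usable* size, which is
    then treated as the new *allocated* size, so the header is subtracted again.
    Fixed variant: the copy is sized from the previous allocated size.
    """
    alloc = START_ALLOC
    sizes = []
    for _ in range(generations):
        alloc = alloc if fixed else usable(alloc)
        sizes.append(usable(alloc))
    return sizes


def copy_overflows(stored: int, old_alloc: int) -> bool:
    """True when `stored` bytes from the old ring cannot fit in the shrunken copy."""
    new_usable = usable(usable(old_alloc))
    return stored > new_usable


if __name__ == "__main__":
    print(reexec_sizes(3, fixed=False))
    print(reexec_sizes(3, fixed=True))
    print(copy_overflows(usable(START_ALLOC), START_ALLOC))
```

In this model a nearly empty ring survives the copy, while a ring filled close to its end does not, which would be consistent with the crash only showing up when many warnings were logged.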

felipewd commented on June 10, 2024

Looking at our server-state file, we see nothing unusual, but: we have backends that are declared by IP address and others by hostname, and like I said, some use http-checks and some use tcp-checks (unrelated to the IP/hostname distinction). In total we have 723 backends declared.

One thing is that our hostname-declared backends use server-templates, such as:

backend backend_80
    tcp-check connect
    balance static-rr
    server-template server 1-4 some_host_.com.br:80 check inter 5s downinter 2s resolvers dnsserver resolve-prefer ipv4

When using TLS backends:

backend backend_443
    tcp-check connect
    balance static-rr
    server-template server 1-4 another_host_.com.br:443 check inter 5s downinter 2s resolvers dnsserver resolve-prefer ipv4 ssl sni req.hdr(host) verify none

And for http-checks:

backend b3 
    http-reuse always
    default-server check fall 2 rise 4 inter 500ms fastinter 250ms downinter 5s
    option httpchk 
    http-check send meth GET uri /probe ver HTTP/1.1 hdr Host probe.blabla.com.br
    server s_b3 <IP address>:80

We could post it here, if you think it might help.

wtarreau commented on June 10, 2024

Thank you Felipe! At least it's unrelated to the stick-tables fix, so I'm reassured! The random cores could be caused by a double free or something similar. We've had recent updates related to server states (not the state file though), and maybe they unveiled a long-dormant bug. If it's 100% reproducible even outside of production with your state file, one test would be to build with ASAN by passing ARCH_FLAGS="-ggdb3 -fsanitize=address" and see if it catches something. It's usually pretty good at catching memory issues during startup.

felipewd commented on June 10, 2024

Hey @wtarreau, there were some errors we could catch:

==2406815==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 11668 byte(s) in 822 object(s) allocated from:
    #0 0x7f8d96e5c817 in strdup (/usr/lib64/libasan.so.6+0x5c817)
    #1 0x8b8054 in srv_state_srv_update src/server_state.c:360
    #2 0x8b8054 in srv_state_px_update src/server_state.c:514

Direct leak of 2520 byte(s) in 360 object(s) allocated from:
    #0 0x7f8d96e5c817 in strdup (/usr/lib64/libasan.so.6+0x5c817)
    #1 0x666afa in _srv_parse_init src/server.c:3315
    #2 0xa6f55f  (/usr/sbin/haproxy+0xa6f55f)

Indirect leak of 80 byte(s) in 1 object(s) allocated from:
    #0 0x7f8d96eb19a7 in calloc (/usr/lib64/libasan.so.6+0xb19a7)
    #1 0x8cb3d8 in map_create_descriptor src/map.c:83
    #2 0x8cb3d8 in sample_load_map src/map.c:109
    #3 0x8cb3d8 in sample_load_map src/map.c:98

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f8d96eb19a7 in calloc (/usr/lib64/libasan.so.6+0xb19a7)
    #1 0x7e2b6b in pattern_new_expr src/pattern.c:2168
    #2 0x60c0010d17bf  (<unknown module>)

SUMMARY: AddressSanitizer: 14300 byte(s) leaked in 1184 allocation(s).
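
As a cross-check of the report above, the per-stack leak entries can be summed and compared with the SUMMARY line. This is only a sketch: `leak_totals` and `summary` are hypothetical helper names, and the regular expressions are based solely on the line shapes visible in this report (`REPORT` is an excerpt holding just the header lines).

```python
import re

# Header lines copied from the report; stack frames omitted.
REPORT = """\
Direct leak of 11668 byte(s) in 822 object(s) allocated from:
Direct leak of 2520 byte(s) in 360 object(s) allocated from:
Indirect leak of 80 byte(s) in 1 object(s) allocated from:
Indirect leak of 32 byte(s) in 1 object(s) allocated from:
SUMMARY: AddressSanitizer: 14300 byte(s) leaked in 1184 allocation(s).
"""

LEAK_RE = re.compile(
    r"^(Direct|Indirect) leak of (\d+) byte\(s\) in (\d+) object\(s\)", re.M)
SUMMARY_RE = re.compile(
    r"^SUMMARY: AddressSanitizer: (\d+) byte\(s\) leaked in (\d+) allocation\(s\)", re.M)


def leak_totals(report: str) -> dict:
    """Sum (bytes, objects) per leak kind ('Direct' or 'Indirect')."""
    totals = {}
    for kind, nbytes, nobjs in LEAK_RE.findall(report):
        b, o = totals.get(kind, (0, 0))
        totals[kind] = (b + int(nbytes), o + int(nobjs))
    return totals


def summary(report: str):
    """Return (bytes, allocations) from the SUMMARY line, or None if absent."""
    m = SUMMARY_RE.search(report)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    print(leak_totals(REPORT))
    print(summary(REPORT))
```

The direct and indirect totals add up to the summary figures, and nearly all of the leaked bytes come from the two `strdup` call sites.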

wtarreau commented on June 10, 2024

Thanks. But then I'm confused: if it reports detected leaks, doesn't that mean it could exit cleanly without crashing?

felipewd commented on June 10, 2024

Oh, sorry... we run a config test before actually reloading, and these errors made it exit with a non-zero status ($? -ne 0), so it never actually reloaded.
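
The check-then-reload gate described here can be sketched roughly as follows. The configuration path, the reload command, and the function names are assumptions for illustration, not the poster's actual script; the point is only that a non-zero exit status from the check step (which a leak report from an ASAN build can produce) stops the reload from ever happening.

```python
import subprocess
from typing import Callable, Sequence

# Both commands are assumptions for illustration; adapt them to the local setup.
CHECK_CMD = ["haproxy", "-c", "-f", "/etc/haproxy/haproxy.cfg"]
RELOAD_CMD = ["systemctl", "reload", "haproxy"]


def run_status(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(cmd).returncode


def reload_if_valid(run: Callable[[Sequence[str]], int] = run_status) -> bool:
    """Reload only when the config check exits with status 0.

    Returns True if the reload was attempted and succeeded, False otherwise.
    """
    if run(CHECK_CMD) != 0:
        # Same effect as the "$? -ne 0" test in a shell wrapper: skip the reload.
        return False
    return run(RELOAD_CMD) == 0
```

Passing the runner as a parameter keeps the gate testable without having haproxy installed.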

When I removed that check and really reloaded, we got the SEGV:

=================================================================
==2433857==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x631000024740 at pc 0x7f790ee3c8de bp 0x7fff123f7870 sp 0x7fff123f7020
WRITE of size 33495 at 0x631000024740 thread T0
    #0 0x7f790ee3c8dd in __interceptor_memcpy (/usr/lib64/libasan.so.6+0x3c8dd)
    #1 0x990d8a in vp_getblk_ofs include/haproxy/vecpair.h:244
    #2 0x990d8a in vp_peek_ofs include/haproxy/vecpair.h:275
    #3 0x990d8a in ring_dup include/haproxy/ring.h:95
    #4 0x9925c4 in startup_logs_dup src/errors.c:208
    #5 0x4387df in main src/haproxy.c:3727
    #6 0x7f790e6fd03c in __libc_start_main (/lib64/libc.so.6+0x2403c)
    #7 0x43b339 in _start (/usr/sbin/haproxy+0x43b339)

0x631000024740 is located 0 bytes to the right of 65344-byte region [0x631000014800,0x631000024740)
allocated by thread T0 here:
    #0 0x7f790eeb17ef in __interceptor_malloc (/usr/lib64/libasan.so.6+0xb17ef)
    #1 0x9399fa in ring_make_from_area src/ring.c:91
    #2 0x9399fa in ring_new src/ring.c:111

SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib64/libasan.so.6+0x3c8dd) in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x0c627fffc890: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c627fffc8a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c627fffc8b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c627fffc8c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c627fffc8d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c627fffc8e0: 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa
  0x0c627fffc8f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c627fffc900: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c627fffc910: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c627fffc920: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c627fffc930: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==2433857==ABORTING

with the core just indicating this:

(gdb) bt f
#0  0x00007ff659de47d6 in ?? ()
No symbol table info available.
#1  0x00000000007d184e in cfg_parse_listen (file=<optimized out>, linenum=<optimized out>, args=0x433469 <SSL_get_finished@plt+9>, kwm=<optimized out>)
    at src/cfgparse-listen.c:2278
        optnum = <optimized out>
        curr_defproxy = <error reading variable curr_defproxy (Cannot access memory at address 0xcb5be0)>
        last_defproxy = <error reading variable last_defproxy (Cannot access memory at address 0xcb5c20)>
        err = <optimized out>
        rc = 0
        err_code = <optimized out>
        cond = 0x0
        errmsg = <optimized out>
        bind_conf = <optimized out>
#2  0x0000000000000000 in ?? ()
No symbol table info available.
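
The region line of the AddressSanitizer report above can be checked with a little arithmetic. The sketch below uses a hypothetical `parse_region` helper and a regular expression based only on the line shape seen in this report. It also notes that 65344 is 192 bytes below 65536, which would match a 192-byte shrink if the original area was 64 KiB; that starting size is an assumption, not something the report states.

```python
import re

# Copied from the report above.
REGION_LINE = ("0x631000024740 is located 0 bytes to the right of 65344-byte region "
               "[0x631000014800,0x631000024740)")

REGION_RE = re.compile(
    r"(0x[0-9a-f]+) is located (\d+) bytes to the right of (\d+)-byte region "
    r"\[(0x[0-9a-f]+),(0x[0-9a-f]+)\)")


def parse_region(line: str) -> dict:
    """Extract the faulting address, its offset past the region, and the region bounds."""
    m = REGION_RE.search(line)
    if m is None:
        raise ValueError("unrecognized region line")
    addr, offset, size, start, end = m.groups()
    return {
        "addr": int(addr, 16),
        "offset": int(offset),
        "size": int(size),
        "start": int(start, 16),
        "end": int(end, 16),
    }


if __name__ == "__main__":
    r = parse_region(REGION_LINE)
    print(r["end"] - r["start"])        # span of the half-open interval
    print(r["addr"] - r["end"])         # distance of the bad address past the end
    print(64 * 1024 - r["size"])        # gap to an assumed 64 KiB area
```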

felipewd commented on June 10, 2024

Hi @wtarreau thanks for the quick fix! Thank goodness for those warnings ;-)

I can confirm this fixes this issue.

It turns out the link to the server-state file is the warnings that haproxy emits when loading the file. We get one line per backend (so over 700 warnings), such as:

[WARNING]  (2568562) : config : Server mia-01/mia-01 is DOWN, changed from server-state after a reload. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

So this also explains why this issue vanished when we removed the state file.

Thanks!
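
A back-of-the-envelope estimate shows why that many warnings could fill the startup-log ring. This is a rough sketch only: it assumes every warning is about as long as the sample line, ignores any per-message framing, and takes the 65344-byte region size from the AddressSanitizer report as the ring size. The function names are hypothetical.

```python
# Sample warning copied from the thread.
WARNING = ("[WARNING]  (2568562) : config : Server mia-01/mia-01 is DOWN, changed from "
           "server-state after a reload. 0 active and 0 backup servers left. "
           "0 sessions active, 0 requeued, 0 remaining in queue.")
BACKENDS = 723       # backend count quoted in the thread
RING_BYTES = 65344   # region size from the AddressSanitizer report


def estimated_log_bytes(sample: str, count: int) -> int:
    """Rough estimate: one sample-sized line per backend."""
    return len(sample) * count


def fills_ring(sample: str, count: int, ring_bytes: int) -> bool:
    """True when the estimated log volume reaches the ring size."""
    return estimated_log_bytes(sample, count) >= ring_bytes


if __name__ == "__main__":
    print(estimated_log_bytes(WARNING, BACKENDS), RING_BYTES)
    print(fills_ring(WARNING, BACKENDS, RING_BYTES))
```

With a handful of backends the estimate stays far below the ring size, which fits the observation that the issue vanished once the state file, and with it the warnings, was removed.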

wtarreau commented on June 10, 2024

Yes, it totally makes sense indeed! I don't even know if these warnings are useful. Maybe we should turn them to diag instead? I'm wondering what could be done to avoid them from the user side!

Thanks for confirming the fix BTW! I'm closing so that we try to keep the focus on remaining issues, but feel free to continue here regarding the warnings if needed.
