Comments (10)

wtarreau commented on June 10, 2024

I had not noticed at first that you were using maxsslrate. Pretty interesting. Maybe you're facing a race condition that prevents it from properly recovering when the limit is met. That's something reasonably easy to try to reproduce on our side by setting a lower limit. At least your "show fd" shows the listener is active (thus not disabled) in the poller. Thanks for these, we'll need a bit of time to analyse it deeper now.

from haproxy.

capflam commented on June 10, 2024

I checked. At first glance, it seems similar but I doubt it is related, especially because here the listeners don't seem to have been limited when the issue occurred.

wtarreau commented on June 10, 2024

Very strange, I've never heard about any similar report even on the other versions you mention. Was the "show info" above produced when the problem was happening? Or can't you connect anymore to the stats socket when the problem is happening? Does it recover only by restarting haproxy? If you're able to connect to the stats socket, sending a "show stat" and a "show fd" could help.

What I suspect could be related to the size of the backlog: I'm seeing a sessrate of 446 in your "show info" output, which indicates what connection rate is acceptable with SSL negotiation. Let's assume your server can deal with 2k sessions/s including SSL etc. If you receive an attack with more than that, the accept queue will fill up. It will then take 40s to process the last entry in the queue at 2k/s, and by then the client will have aborted but there's no way to know, so it costs a handshake calculation for nothing. In such a case, an approach can be to limit the backlog to a much lower value via the "backlog" keyword on the "bind" line. This way during an attack, you won't be accumulating connections that users gave up, and the recovery can be much faster. Just set that to 2-3x the max rate you can accept so that users don't needlessly wait more than 2-3s before getting an error.
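To make the arithmetic above concrete, here is a small Python sketch. It is not part of HAProxy, and the function names are made up for illustration. It computes how long the last queued connection waits for a given accept-queue depth and processing rate, and the backlog that the 2-3x rule would suggest. The 80,000-entry depth used in the example is simply what 40s at 2k/s implies, not a value read from any configuration.

```python
def drain_time_s(queue_depth: int, rate_per_s: float) -> float:
    """Seconds until the last queued connection is accepted at a steady rate."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    return queue_depth / rate_per_s


def suggested_backlog(max_rate_per_s: int, factor: int = 3) -> int:
    """Backlog sized at `factor` times the sustainable rate (the 2-3x rule)."""
    return max_rate_per_s * factor


if __name__ == "__main__":
    rate = 2000  # sessions/s the server can handle, as assumed above
    # Wait of the last entry in a queue of 80,000 connections.
    print(drain_time_s(80_000, rate))
    # Backlog from the 3x rule, and the worst-case wait it implies.
    backlog = suggested_backlog(rate)
    print(backlog, drain_time_s(backlog, rate))
```

The value returned by `suggested_backlog` is what would go after the "backlog" keyword on the "bind" line.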

kubrickfr commented on June 10, 2024

> Very strange, I've never heard about any similar report even on the other versions you mention. Was the "show info" above produced when the problem was happening?

Yes

> Or can't you connect anymore to the stats socket when the problem is happening?

Nope, connecting to the socket is fine

> Does it recover only by restarting haproxy?

That is correct

> If you're able to connect to the stats socket, sending a "show stat" and a "show fd" could help.

Will do as soon as I can identify a host that has the issue again

> What I suspect could be related to the size of the backlog: I'm seeing a sessrate of 446 in your "show info" output, which indicates what connection rate is acceptable with SSL negotiation. Let's assume your server can deal with 2k sessions/s including SSL etc. If you receive an attack with more than that, the accept queue will fill up. It will then take 40s to process the last entry in the queue at 2k/s, and by then the client will have aborted but there's no way to know, so it costs a handshake calculation for nothing.

We set a max SSL rate of 3k sessions. (In fact the limits are dynamic: we set maxsslrate $(( $(nproc) * 1500 )) and maxconn $(( $(nproc) * 40000 )).) We've benchmarked it and it works fine for us (<100% CPU/network/memory).

> In such a case, an approach can be to limit the backlog to a much lower value via the "backlog" keyword on the "bind" line. This way during an attack, you won't be accumulating connections that users gave up, and the recovery can be much faster. Just set that to 2-3x the max rate you can accept so that users don't needlessly wait more than 2-3s before getting an error.

Ok I will try tinkering with backlog
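For readers who want to check the numbers, the per-core scaling described above can be sketched in Python. The function name is hypothetical; only the two multipliers (1500 and 40000) come from this thread. Note that `os.cpu_count()` is a stand-in for `nproc` and may differ from it when CPU affinity or cgroup limits are in effect.

```python
import os


def dynamic_limits(cores: int) -> dict:
    """Per-host limits scaled by core count, using the multipliers quoted above."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {"maxsslrate": cores * 1500, "maxconn": cores * 40000}


if __name__ == "__main__":
    # os.cpu_count() can return None, hence the fallback to 1.
    cores = os.cpu_count() or 1
    print(cores, dynamic_limits(cores))
```

With 2 cores this gives the 3k SSL rate mentioned above.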

kubrickfr commented on June 10, 2024

Ok, so on another host with 4 cores, and therefore our dynamic maxconn set to 160000, we get this

$ ss -ltnup 'sport = :443'
Netid  State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
tcp    LISTEN  0       90000   0.0.0.0:443         0.0.0.0:*
tcp    LISTEN  0       90000   0.0.0.0:443         0.0.0.0:*
tcp    LISTEN  90001   90000   0.0.0.0:443         0.0.0.0:*
tcp    LISTEN  90001   90000   0.0.0.0:443         0.0.0.0:*

This time we hit the non-dynamic limit of net.core.somaxconn = 90000 (which I realise should really be calculated as well).
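For listening sockets, ss reports the current accept-queue length in Recv-Q and the configured backlog in Send-Q, so the two rows showing 90001 against 90000 are listeners whose queue is full. A minimal Python sketch for spotting such rows automatically is shown below; the function name and the sample text are illustrative only, and the parser assumes the whitespace-separated column order seen in the output above.

```python
SAMPLE = """\
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 90000 0.0.0.0:443 0.0.0.0:*
tcp LISTEN 90001 90000 0.0.0.0:443 0.0.0.0:*
"""


def full_listen_queues(ss_output: str) -> list:
    """Return LISTEN rows whose accept queue (Recv-Q) exceeds the backlog (Send-Q)."""
    rows = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a listening socket.
        if len(fields) < 5 or fields[1] != "LISTEN":
            continue
        recv_q, send_q = int(fields[2]), int(fields[3])
        # The queue was observed at backlog + 1 here, so test for "greater than".
        if recv_q > send_q:
            rows.append({"local": fields[4], "recv_q": recv_q, "send_q": send_q})
    return rows


if __name__ == "__main__":
    for row in full_listen_queues(SAMPLE):
        print(row)
```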

$ echo show info | socat UNIX-CONNECT:/var/lib/haproxy/stats,connect-timeout=2 stdio
Name: HAProxy
Version: 2.9.6-9eafce5
Release_date: 2024/02/26
Nbthread: 4
Nbproc: 1
Process_num: 1
Pid: 2003
Uptime: 2d 5h33m21s
Uptime_sec: 192801
Memmax_MB: 0
PoolAlloc_MB: 6
PoolUsed_MB: 6
PoolFailed: 0
Ulimit-n: 2000043
Maxsock: 2000043
Maxconn: 1000000
Hard_maxconn: 1000000
CurrConns: 2208
CumConns: 8298281
CumReq: 4057092995
MaxSslConns: 0
CurrSslConns: 2207
CumSslConns: 8191837
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 3
ConnRateLimit: 0
MaxConnRate: 6002
SessRate: 3
SessRateLimit: 0
MaxSessRate: 6002
SslRate: 3
SslRateLimit: 6000
MaxSslRate: 6001
SslFrontendKeyRate: 4
SslFrontendMaxKeyRate: 7755
SslFrontendSessionReuse_pct: 0
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 40
SslCacheMisses: 40
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
Tasks: 3644
Run_queue: 0
Idle_pct: 71
node: ip-10-2-32-176.ap-southeast-1.compute.internal
Stopping: 0
Jobs: 2221
Unstoppable Jobs: 1
Listeners: 11
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 0
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 8706499242217
TotalSplicedBytesOut: 0
BytesOutRate: 47007488
DebugCommandsIssued: 0
CumRecvLogs: 0
Build info: 2.9.6-9eafce5
Memmax_bytes: 0
PoolAlloc_bytes: 6503920
PoolUsed_bytes: 6503920
Start_time_sec: 1712051025
Tainted: 0
TotalWarnings: 76
MaxconnReached: 0
BootTime_ms: 156
Niced_tasks: .
$ cat /proc/net/sockstat
sockets: used 183355
TCP: inuse 28965 orphan 0 tw 392 alloc 183185 mem 76212
UDP: inuse 11 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
$ uptime 
 15:17:47 up 2 days,  5:34,  1 user,  load average: 1.40, 1.50, 1.67
$ echo show stat | socat UNIX-CONNECT:/var/lib/haproxy/stats,connect-timeout=2 stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,wrew,connect,reuse,cache_lookups,cache_hits,srv_icur,src_ilim,qtime_max,ctime_max,rtime_max,ttime_max,eint,idle_conn_cur,safe_conn_cur,used_conn_cur,need_conn_est,uweight,agg_server_status,agg_server_check_status,agg_check_status,srid,sess_other,h1sess,h2sess,h3sess,req_other,h1req,h2req,h3req,proto,-,ssl_sess,ssl_reused_sess,ssl_failed_handshake,h2_headers_rcvd,h2_data_rcvd,h2_settings_rcvd,h2_rst_stream_rcvd,h2_goaway_rcvd,h2_detected_conn_protocol_errors,h2_detected_strm_protocol_errors,h2_rst_stream_resp,h2_goaway_resp,h2_open_connections,h2_backend_open_streams,h2_total_connections,h2_backend_total_streams,h1_open_connections,h1_open_streams,h1_total_connections,h1_total_streams,h1_bytes_in,h1_bytes_out,h1_spliced_bytes_in,h1_spliced_bytes_out,
www_ssl,FRONTEND,,,2365,66164,160000,2674547,7364640673982,1069039835706,0,0,1290,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,7755,,,,0,4056045898,0,1658,288043,32,,22701,38582,4056375550,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,1,6001,8191793,0,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,2674547,0,0,0,4056375550,0,0,,-,2674555,0,5439614,0,0,0,0,0,0,0,0,0,0,0,0,0,2365,346,2674547,4056374288,8412490505427,1068762896456,0,0,
www,FRONTEND,,,1,5,160000,3466,248130,593027,0,0,37,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,13,,,,0,3234,0,251,3,0,,0,13,3488,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,13,3466,3234,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,3466,1,0,0,3488,0,0,,-,0,0,0,0,0,3,0,1,0,0,0,0,0,0,1,0,1,0,3466,3475,296944,596122,0,0,
app,i-0ed07def403d51d04,0,0,1,33537,,5427241,9726848390,1385133631,,0,,33849,271,0,0,UP,1,1,0,5,2,18405,100,,1,4,9,,5427241,,2,89,,9419,L7OK,200,5,0,5393176,0,0,0,0,,,,5393176,15172,25,,,,,0,,,0,0,17,107,,,,Layer7 check passed,,1,3,3,,,,10.2.37.170:8080,,http,,,,,,,,0,60210,5367031,,,1,,0,7226,14244,60170,0,0,1,1,2,1,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-0f3070f2d8b7b1d0d,0,0,173,14884,,298538885,535274006840,77726797767,,0,,45294,43248,0,0,UP,256,1,0,4,2,18399,150,,1,4,10,,298538885,,2,10902,,20046,L7OK,200,0,0,298466900,0,6,0,0,,,,298466906,370418,14205,,,,,0,,,0,0,17,63,,,,Layer7 check passed,,1,3,3,,,,10.2.34.206:8080,,http,,,,,,,,0,839598,297699287,,,398,,0,7280,17522,60525,0,0,398,173,571,256,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-07c3394790ff363ff,0,0,172,14882,,296034911,530763867149,77021383861,,0,,50914,39210,0,0,UP,256,1,0,5,2,18418,100,,1,4,11,,296034911,,2,11475,,19980,L7OK,200,3,0,295959262,0,20,0,0,,,,295959282,372788,12903,,,,,0,,,0,0,18,67,,,,Layer7 check passed,,1,3,3,,,,10.2.35.181:8080,,http,,,,,,,,0,847203,295187708,,,399,,0,7248,18999,60953,0,0,399,172,571,256,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-0541d9d428440e383,0,0,1,225,,3966656,7115916553,1026978140,,0,,0,0,0,0,UP,1,1,0,0,1,18377,73,,1,4,1,,3966656,,2,46,,4191,L7OK,200,1,0,3966647,0,0,0,0,,,,3966647,10625,0,,,,,0,,,0,0,17,86,,,,Layer7 check passed,,1,3,3,,,,10.2.42.23:8080,,http,,,,,,,,0,20316,3946340,,,1,,0,11,1502,60198,0,0,1,1,2,1,,,,4,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,BACKEND,0,0,347,38700,32000,4056374503,7364640718370,1069039943633,0,0,,221932,169328,0,0,UP,514,4,0,,5,18418,31,,1,4,0,,4056373077,,1,22702,,38582,,,,0,4056045898,0,607,288046,32,,,,4056334583,823043,59015,0,0,0,0,0,,,0,0,17,59,,,,,,,,,,,,,,http,leastconn,,,,,,,0,3975617,4052397460,0,0,,,0,7280,18999,60953,0,,,,,514,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1146,347,4052868,4056450328,1063974567729,8696536936083,0,0,
stats,FRONTEND,,,0,3,1000000,12851,1696332,928968202,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,1,,,,0,12851,0,0,0,0,,0,1,12851,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,1,12851,12851,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,12851,0,0,0,12851,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12851,12851,2017607,929508480,0,0,
stats,BACKEND,0,0,0,0,100000,0,1696332,928968202,0,0,,0,0,0,0,UP,0,0,0,,0,192772,,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,15,,,0,0,1,2,,,,,,,,,,,,,,http,roundrobin,,,,,,,0,0,0,0,0,,,0,0,4,224,0,,,,,0,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

And the result of show fd:
2519.fds.txt

Maybe what these two instances have in common is that they both hit their maxsslrate limit at some point in the past.
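The "show info" output above supports this: MaxSslRate (6001) is at or above SslRateLimit (6000). A rough Python sketch of that check follows, assuming the "Key: value" layout shown above and that a limit of 0 means no limit is configured. The function names are hypothetical and the sample contains only the three relevant lines.

```python
SAMPLE = """\
SslRate: 3
SslRateLimit: 6000
MaxSslRate: 6001
"""


def parse_show_info(text: str) -> dict:
    """Turn 'Key: value' lines into a dict, ignoring lines without a colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def ssl_rate_limit_was_hit(info: dict) -> bool:
    """True if the peak SSL rate reached the configured limit at some point."""
    limit = int(info.get("SslRateLimit", 0))
    peak = int(info.get("MaxSslRate", 0))
    # Assumption: a limit of 0 means no limit is set, so it cannot be hit.
    return limit > 0 and peak >= limit


if __name__ == "__main__":
    print(ssl_rate_limit_was_hit(parse_show_info(SAMPLE)))
```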

Darlelet commented on June 10, 2024

Could it be related to #2476 then?

capflam commented on June 10, 2024

However, the fix was backported, thus it can be tested.

kubrickfr commented on June 10, 2024

[EDIT: previous version had typo 2.6.9 -> 2.9.6]
As mentioned in the bug description, the present issue happens with "haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007" which includes BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on session release from #2476

kubrickfr commented on June 10, 2024

Ha! Now that I re-read my message, I realise there is a typo: it's 2.9.6, not 2.6.9. This applies to my last comment as well, so I'm adding an [EDIT] note.

capflam commented on June 10, 2024

Thanks for the confirmation :) So it is indeed another issue.
