In what area(s)? /area autoscale What version o


Unwanted scale-down during sustained load triggers an eventual explosion in replicas

DavidR91 commented on July 22, 2024

Unwanted scale-down during sustained load triggers an eventual explosion in replicas

from serving.

Comments (3)

skonto commented on July 22, 2024

@DavidR91 hi, I will try to reproduce. Could you also paste or attach your logs from the autoscaler side with debug enabled?
I am looking for statements like: "Delaying scale to 0, staying at X".


DavidR91 commented on July 22, 2024

Attached an autoscaler log in debug. This is a less dramatic scale-down than the one mentioned above, but I think it is still a valid repro:
export.csv

The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).

Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (this coincides with a scale from 8 to 7 in the log):

(Charts are in UTC so 8:51 is the relevant time below):

Load test


DavidR91 commented on July 22, 2024

So I've been attempting to debug this, and not really finding much of use.

Here I added my own scraper to get the stat.proto values from individual pods off of :9090 of the queue-proxy, and plot the reported concurrency for each pod etc. over time:

Request volume does go up slightly at 1642, as does the request concurrency, yet the pod count is decreased at this time.

The effect can be manipulated with scale-down delay, stable window time, etc., but it doesn't completely go away: after an initial panic there will eventually be a scale-down, even in the middle of consistent load, and the time configs only delay it.

So I have a few questions about concurrency, since maybe we're just misusing it?

What does it actually mean? Is it totally unitless or is it analogous to RPS? Our target is always the default 100 no matter the service. Is this incorrect/unworkable?
If we switch to RPS but increase the RPS to e.g. 500 (so 70% util. brings it to 350) the pods stick around for the entire run and do not scale down at all - should we be setting concurrency to match these proportions? (500.0?)
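
One way to relate the two metrics, offered as a rough sketch and not as a description of the autoscaler's internals: by Little's law, average concurrency (requests in flight) equals request rate times mean request duration. The sizing rule below, ceil(observed / (target x utilization)), is an assumption based on the 70% utilization figure mentioned above; the function names and the latency values are hypothetical, and only the ~6k RPS figure comes from this thread.

```python
import math


def average_concurrency(rps: float, mean_latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x mean duration."""
    return rps * mean_latency_s


def desired_pods(observed: float, target_per_pod: float, utilization: float = 0.7) -> int:
    """Assumed sizing rule: ceil(observed metric / (per-pod target x utilization))."""
    return max(1, math.ceil(observed / (target_per_pod * utilization)))


rps = 6000.0  # the ~6k RPS soak described earlier in the thread

# Hypothetical mean request durations, in seconds (not measured values).
for latency_s in (0.05, 0.10, 0.20):
    concurrency = average_concurrency(rps, latency_s)
    by_concurrency = desired_pods(concurrency, target_per_pod=100)
    by_rps = desired_pods(rps, target_per_pod=500)
    print(f"latency={latency_s:.2f}s concurrency={concurrency:.0f} "
          f"pods(concurrency target 100)={by_concurrency} "
          f"pods(rps target 500)={by_rps}")
```

Under these assumptions the RPS-based pod count stays fixed while the load is steady, whereas the concurrency-based count moves with request duration. A concurrency target that "matches" an RPS target would therefore be roughly the RPS target multiplied by the typical request duration in seconds, not the same number.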
