
escalator's Introduction

Escalator


Escalator is a batch or job optimized horizontal autoscaler for Kubernetes

It is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down; Escalator ensures pods have completed on nodes before terminating them. It is also optimised for scaling up the cluster as fast as possible so that pods are not left in a pending state.

Key Features

  • Calculates requests and capacity to determine whether to scale up, scale down, or stay at the current scale
  • Waits until non-daemonset pods on nodes have completed before terminating the node
  • Designed to work on selected auto-scaling groups, allowing the default Kubernetes Autoscaler to continue to scale service-based workloads
  • Automatically terminates the oldest nodes first
  • Supports slack space to ensure extra headroom in the event of a spike of scheduled pods
  • Does not terminate cordoned nodes or factor them into calculations, allowing cordoned nodes to persist for debugging
  • Supports different cloud providers (AWS only at the moment)
  • Exposes scaling and utilisation metrics
  • Provides leader election so you can run an HA Deployment inside a cluster
  • Provides basic support for multiple different instance types in a Node Group

The need for this autoscaler is derived from our own experiences with very large batch workloads being scheduled and the default autoscaler not scaling up the cluster fast enough. These workloads can't be force-drained by the default autoscaler and must complete before the node can be terminated.

Documentation and Design

See Docs

Requirements

  • Kubernetes version 1.24+. Escalator has been tested and deployed on 1.24 and newer. Older versions of Kubernetes may have bugs or issues that prevent it from functioning properly.
  • Go version 1.20+
  • Dependencies and their locked versions can be found in go.mod and go.sum.

Building

# Fetch dependencies and build Escalator
make build

How to run - Quick Start

Locally (out of cluster)

go run cmd/main.go --kubeconfig=~/.kube/config --nodegroups=nodegroups_config.yaml

Deployment (in cluster)

See Deployment for full Deployment documentation.

# Build the docker image
docker build -t atlassian/escalator .

# Create RBAC configuration
kubectl create -f docs/deployment/escalator-rbac.yaml

# Create config map - modify to suit your needs
kubectl create -f docs/deployment/escalator-cm.yaml

# Create deployment
kubectl create -f docs/deployment/escalator-deployment.yaml

Configuring

See Configuration

Testing

make test

Test a specific package

For example, to test the controller package:

go test ./pkg/controller

Contributors

Pull requests, issues and comments welcome. For pull requests:

  • Add tests for new features and bug fixes
  • Follow the existing style (we are using goreturns to format and lint escalator)
  • Separate unrelated changes into multiple pull requests

See the existing issues for things to start contributing.

For bigger changes, make sure you start a discussion first by creating an issue and explaining the intended change.

Atlassian requires contributors to sign a Contributor License Agreement, known as a CLA. This serves as a record stating that the contributor is entitled to contribute the code/documentation/translation to the project and is willing to have it used in distributions and derivative works (or is willing to transfer ownership).

Prior to accepting your contributions we ask that you please follow the appropriate link below to digitally sign the CLA. The Corporate CLA is for those who are contributing as a member of an organization and the individual CLA is for those contributing as an individual.

License

Copyright (c) 2018 Atlassian and others. Apache 2.0 licensed, see LICENSE file.


escalator's Issues

Update dependencies and pin versions

Update the dep dependencies and pin packages to the following versions:

  • github.com/sirupsen/logrus: 1.0.5
  • k8s.io/api: kubernetes-1.10.3
  • k8s.io/apimachinery: kubernetes-1.10.3
  • k8s.io/client-go: 7.0.0
  • github.com/aws/aws-sdk-go: v1.13.57
  • k8s.io/kubernetes: 1.10.3

Filter out already cordoned nodes in calculations

We need to have a way to filter out/respect cordoned nodes in calculations, but when it is time to scale down, these cordoned nodes need to be prioritised for termination (after the grace periods and the node has been determined as empty).

This will be especially helpful when certain nodes are having issues and need to be prioritised for termination.

We will also need a special taint that, when applied to a node, causes Escalator to completely filter the node out of its calculations and never terminate it. This will be helpful when investigating an issue with a node and you need the node to not be terminated.

Setup AWS config

Create the setup for the AWS library to make API calls using the chosen credentials or role.

  • We want to support running both in-cluster and out-of-cluster

Incorrect scale up calculations

Scale up calculations are calculated as follows:

  • CPU Utilisation = 80%
  • Scale up threshold (scale_up_threshhold_percent) = 70
  • Node worth (100 nodes) = 1.0
  • Remaining percentage needed to be below scale up threshold: 80 - 70 = 10
  • Scale up delta: 10 / 1 = 10.0
  • Amount sent to cloud provider to increase by: ceil(10.0) = 10

The key calculation is the 80 - 70 = 10 line to determine the amount to scale up by. 10% is the amount we have determined to scale by.

Unfortunately this won't scale the node group up by enough; after the 10 new nodes come up, the utilisation would be 72.7272727273%.

This isn't ideal as it would potentially take 2 or more scale up activities to bring the utilisation below the scale up threshold (70%).

The correct way to do the scale up would be to calculate the percentage difference relative to the threshold: (original_value - new_value) / new_value * 100, i.e. (80 - 70) / 70 * 100

Using this, we should actually be increasing the node group by 14.2857142857% or 15 nodes. The utilisation after this scale up would be 69.5652173913%, below the scale up threshold of 70.
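A minimal sketch in Go of the corrected calculation (the function name and signature are illustrative, not Escalator's actual code):

package main

import (
	"fmt"
	"math"
)

// nodesToAdd is an illustrative helper: rather than using the raw difference
// between utilisation and the threshold, it computes the percentage increase
// in node count needed to bring utilisation back below the threshold.
func nodesToAdd(utilisation, threshold float64, currentNodes int) int {
	percentIncrease := (utilisation - threshold) / threshold * 100
	delta := float64(currentNodes) * percentIncrease / 100
	return int(math.Ceil(delta))
}

func main() {
	// Example from above: 100 nodes at 80% utilisation with a 70% threshold.
	add := nodesToAdd(80, 70, 100)
	fmt.Printf("add %d nodes, new utilisation %.2f%%\n", add, 80.0*100/float64(100+add))
	// Output: add 15 nodes, new utilisation 69.57%
}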

Add additional metrics

Add some additional Prometheus metrics, for example:

  • Whether the scaling lock is present
  • Time that the scaling lock has been present for
  • Time taken for new nodes to be created to be ready in Kubernetes
  • Time after node has been tainted until it was terminated
  • Current state (scaling up/doing nothing/scaling down), this could also just be the Node Delta value (negative, zero, positive)
  • ASG metrics (current ASG size, desired ASG size)

Add support for Kubernetes memory requests in millibytes

It was discovered in #115 that Escalator doesn't handle pods having memory requests in millibytes. This causes the memory requests for all pods in a node group to be inaccurate whilst there is a pod with a memory request in millibytes. This has a flow-on effect in that memory request metrics are inaccurate, and Escalator may start scaling down unexpectedly if only memory requests are specified.

Millibytes are supported in Kubernetes (see kubernetes/kubernetes#28741) for backwards-compatibility reasons.

We will need tests around different memory and cpu formats to handle edge cases like this as well.
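As a rough illustration (a sketch using k8s.io/apimachinery's resource package, not Escalator's actual accounting code), summing whole-unit values loses the milli component, whereas accumulating MilliValue() keeps it:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A memory request expressed in milli units: "1500m" is 1.5 bytes.
	q := resource.MustParse("1500m")

	// Value() rounds to whole units, which is lossy for small milli quantities.
	fmt.Println(q.Value())

	// MilliValue() preserves the milli component (prints 1500).
	fmt.Println(q.MilliValue())
}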

AWS internal ASG size becoming out of date/stale

We use the following for calculating the new size to tell AWS to set the ASG to:

size := n.Size()
newSize := size + delta

For some reason, n.Size() returns a size lower than the ASG's actual current size, so Escalator doesn't tell the ASG to scale up by enough.

Requested a scale up by 8:

time="2018-03-08T05:35:51Z" level=debug msg="**********[START NODEGROUP shared]**********"
time="2018-03-08T05:35:51Z" level=info msg="pods total: 977" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining total: 106" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining untainted: 106" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining tainted: 0" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="cpu: 72.85534591194968, memory: 76.92229586166764" nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Unlocking scale lock"
time="2018-03-08T05:35:51Z" level=info msg="lock(false): there are 0 upcoming nodes requested." nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Delta= 8" nodegroup=shared
time="2018-03-08T05:35:51Z" level=warning msg="There are no tainted nodes to untaint" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="increasing asg by 8" drymode=false nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Locking scale lock"
time="2018-03-08T05:35:51Z" level=debug msg="DeltaScaled= 8" nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Scaling took a total of 352.493456ms"

But only told AWS to increase the ASG to 109:

    "requestParameters": {
        "desiredCapacity": 109,
        "autoScalingGroupName": "",
        "honorCooldown": true
    },

Add option to perform a drain before terminating a node

Add an option (in the nodegroup.yaml configuration file) to perform a drain before terminating an instance in a node group.

This option should be per node group, as some node groups may have jobs only vs some with services only.

Instance does not belong in ASG when scaling down

Under specific circumstances, Escalator isn't able to terminate nodes in an ASG. This is due to the DeleteNodes() function in the AWS cloud provider returning the following error:

node ip-10-153-110-221.ec2.internal belongs in a different asg than <asg>

The cloud provider Belongs() function is responsible for determining if a node belongs in the target ASG and in this situation it is returning that the node isn't in the ASG.

The Belongs() function uses the results from an AWS DescribeAutoScalingGroups API call in the current run, so it could potentially be an issue with AWS returning invalid ASG information.

The problem seems to fix itself by restarting the pod that Escalator is running in, which may mean it is a problem with Escalator.

Some possible causes:

  • The Belongs() function does not work properly - highly doubt this is the issue, it's a very simple function
  • AWS is returning incorrect information on the DescribeAutoScalingGroups API call
    • We've seen this occur in the past with Autoscaling API outages with AWS which leads me to believe it is a problem with AWS
    • Maybe Escalator is using a problematic Autoscaling API endpoint and by restarting the pod it uses a newer, working one
  • This comment suggests that it's an issue with memory/how the ASG result is being accessed that is resulting in incorrect values

A node count less than the configured "min_nodes" doesn't trigger a scale up

Having a node count less than the configured "min_nodes" option in the nodegroups yaml won't trigger a scale up to ensure that the node group is at the minimum.

This leaves the node group in a broken state where it needs to scale up but can't because it is below the minimum node count.

See below logs for example.

time="2018-03-02T00:50:03Z" level=info msg="Starting with log level debug"
time="2018-03-02T00:50:03Z" level=info msg="Validating options: [PASS]" nodegroup=shared
time="2018-03-02T00:50:03Z" level=info msg="Registered with drymode false" nodegroup=shared
time="2018-03-02T00:50:03Z" level=info msg="Using in cluster config"
time="2018-03-02T00:50:03Z" level=info msg="Waiting for cache to sync..."
time="2018-03-02T00:50:03Z" level=debug msg="Trying to sync cache: tries = 0, max = 3"
time="2018-03-02T00:50:04Z" level=info msg="Cache took 201.721232ms to sync"
time="2018-03-02T00:50:05Z" level=info msg="aws session created successfully"
time="2018-03-02T00:50:05Z" level=debug msg="**********[AUTOSCALER FIRST LOOP]**********"
time="2018-03-02T00:50:05Z" level=debug msg="**********[START NODEGROUP shared]**********"
time="2018-03-02T00:50:05Z" level=info msg="pods total: 0" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining total: 3" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining untainted: 3" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining tainted: 0" nodegroup=shared
time="2018-03-02T00:50:05Z" level=warning msg="Node count of 3 less than minimum of 10" nodegroup=shared
time="2018-03-02T00:50:05Z" level=debug msg="Scaling took a total of 58.830382ms"

Utilisation is rounded down when determining scale delta

Utilisation is rounded down when calculating scale delta. For example:

  • Utilisation: 70.05%
  • Scale up threshold: 70

The utilisation is then rounded down to 70%, and because 70% is not greater than 70, no scaling is done.

70.05% should really be rounded up to 71% to ensure we scale when we exceed the threshold.
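A small sketch of the comparison in question (the values are the example above; the floor/ceil calls stand in for Escalator's rounding, which may differ in detail):

package main

import (
	"fmt"
	"math"
)

func main() {
	utilisation := 70.05
	threshold := 70.0

	// Rounding down masks the overshoot: 70 is not greater than 70.
	fmt.Println(math.Floor(utilisation) > threshold) // false

	// Rounding up triggers the scale up: 71 is greater than 70.
	fmt.Println(math.Ceil(utilisation) > threshold) // true
}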

Create documentation

Add some initial documentation for the project. All documentation that does not fit on the homepage README should go into the docs/ folder

Child Issues:

Move metrics inside controller (not global)

At the moment, metrics are global and can be used by just calling metrics.<metric-name>. They should ideally be part of the controller package. This will make it easier to test the controller package in the future, as we can easily mock out the metrics provider.

Implement healthcheck endpoint

Implement healthcheck endpoint to respond with the health of Escalator.

It should perform the following checks internally:

  • Whether the cloud provider session can list/describe node groups in the cloud provider
  • Whether it can list pods/nodes on the apiserver (depends on how long this takes and whether it is taxing)
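A minimal sketch of what such an endpoint could look like with Go's net/http; checkCloudProvider and checkAPIServer are hypothetical stand-ins for the checks listed above, not functions in Escalator's codebase:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// healthHandler runs the internal checks and reports overall health.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := checkCloudProvider(); err != nil {
		http.Error(w, fmt.Sprintf("cloud provider check failed: %v", err), http.StatusServiceUnavailable)
		return
	}
	if err := checkAPIServer(); err != nil {
		http.Error(w, fmt.Sprintf("apiserver check failed: %v", err), http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// checkCloudProvider would verify the cloud provider session can describe node groups.
func checkCloudProvider() error { return nil }

// checkAPIServer would verify pods and nodes can be listed from the apiserver.
func checkAPIServer() error { return nil }

func main() {
	http.HandleFunc("/healthz", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}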

Update documentation

  • Configuration options documentation
    • Threshold configuration
    • Slack capacity
  • Add documentation on deployment of escalator
    • K8s deployment, service, config and RBAC
    • Deployment on AWS
      • IAM roles and policy
      • AWS region
      • ASG name
  • Update diagram
  • Calculation/algorithm documentation
    • Usage calculation
    • Capacity calculation
    • Daemonsets
    • Multiple containers per pod
  • Node/pod labels/selectors
  • Common issues/gotchas
  • Prometheus metrics
  • Best practices
  • Add contributing section
  • Add copyright section
  • Add glossary of terms
  • scale down process/tainting
  • scale lock
  • Clear project mission statement
  • List of existing/planned features
  • List of requirements
  • Install/deployment instructions
  • Sharable roadmap for future development
  • LICENSE.txt
  • CODE_OF_CONDUCT.md

Delete node from Kubernetes before terminating in cloud provider

We have seen instances where terminating nodes directly from the cloud provider triggers log entries of Kubernetes trying to poll terminated nodes. This is because the node hasn't been gracefully removed from Kubernetes.

An enhancement would be to delete the node from Kubernetes before terminating it from the cloud provider.
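A minimal sketch of the enhancement, assuming client-go; the package name and terminateInstance are hypothetical stand-ins for the cloud provider side, not Escalator's actual code:

package cloudprovider // hypothetical package for illustration

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gracefulTerminate deletes the node object from Kubernetes first, then asks
// the cloud provider to terminate the underlying instance.
func gracefulTerminate(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	if err := client.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("deleting node %s from Kubernetes: %w", nodeName, err)
	}
	return terminateInstance(ctx, nodeName)
}

// terminateInstance is a placeholder for the cloud provider termination call.
func terminateInstance(ctx context.Context, nodeName string) error { return nil }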

Scaling up stops working when max_nodes is reached

If the cluster already has the maximum number of nodes and some of them are tainted, nothing will happen when the scale-up threshold is reached. No node will be untainted and, of course, no node will be added via the cloud provider either.
This bug also causes another problem. If you have tainted nodes and need to scale up, the maximum number of nodes that will be untainted at a time is the difference between max_nodes and the current number of nodes.
E.g.: Nodes = 9
Max nodes = 10
Tainted = 5
Need 3 more nodes? Instead of untainting 3 nodes at once, only 1 node will be untainted each time scaling up is run.

Another example from a log:
time="2018-06-13T18:23:36Z" level=debug msg="**********[AUTOSCALER MAIN LOOP]**********" time="2018-06-13T18:23:37Z" level=debug msg="**********[START NODEGROUP default]**********" time="2018-06-13T18:23:37Z" level=info msg="pods total: 48" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining total: 18" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="cordoned nodes remaining total: 0" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining untainted: 12" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining tainted: 6" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="cpu: 110.43333333333334, memory: 82.31950436831336" nodegroup=default time="2018-06-13T18:23:37Z" level=debug msg="Unlocking scale lock" time="2018-06-13T18:23:37Z" level=debug msg="Delta: 5" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="increasing nodes exceeds maximum (18). Clamping add amount to (0)" time="2018-06-13T18:23:37Z" level=warning msg="Scale up delta is less than or equal to 0 after clamping: 0" time="2018-06-13T18:23:37Z" level=debug msg="DeltaScaled: 0" nodegroup=default time="2018-06-13T18:23:37Z" level=debug msg="Scaling took a total of 110.042288ms"

Publish Docker image

Thanks for making this! I'd love to be able to pull a canonical image, especially tagged with a release number, rather than building the image myself. Is that planned?

Remove cordon/uncordon when terminating instance

Remove the cordon/uncordon process when terminating instances in the node group. This is a carry-over from phase 2 of Escalator, which only involved tainting/cordoning nodes and letting the default cluster autoscaler handle the rest.

Scale Up Calculation

Implement the logic to determine how many nodes need to be added/untainted on a scale up event

Support autodiscovery via ASG labels

We currently have to define autoscaling group sizes in two places: Once when creating the actual ASG, and once when configuring escalator. It would be nice if escalator could, given the name of the ASG it should be managing, find the max and min sizes. We've had situations in the past where we update the sizes in one place and not the other, which is easy to do since ASG configuration and Kubernetes deployment configuration usually lives in pretty different places.

The most important use case for us is to have autodiscovery of cluster size, but having autodiscovery of all configuration would be really cool.
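A minimal sketch of how the min/max sizes could be discovered with aws-sdk-go (the ASG name below is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String("my-node-group-asg")}, // placeholder ASG name
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, asg := range out.AutoScalingGroups {
		fmt.Printf("%s: min=%d max=%d\n", aws.StringValue(asg.AutoScalingGroupName),
			aws.Int64Value(asg.MinSize), aws.Int64Value(asg.MaxSize))
	}
}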

Document system architecture

Need to document how the application will be designed, including modules and structures, along with a UML diagram.

Multiple node groups for the same set of pods

We have the following use-case. We want escalator to manage auto-scaling two node groups where all of our pods could be scheduled in any of these two node groups, i.e. there would not be a 1-1 mapping between pod labels and node groups labels.

The reason why we want to have two node groups, is for example, so that one node group manages spot instances while another handles on-demand instances. We would prefer to scale up spot instances whenever possible, but if they are not available, we would be ok scaling up the on demand instance.

My question is, is this currently possible? If not, what would it take for us to implement this?

Thanks!

escalator_node_group_mem_request metric is frequently 0

I've started to monitor metrics and am seeing that escalator_node_group_mem_request only sometimes has values. This is showing escalator_node_group_mem_request in blue and escalator_node_group_mem_capacity in green:

[screenshot: escalator_node_group_mem_request (blue) vs escalator_node_group_mem_capacity (green) over time]

When I look at the metrics directly, I see:

# HELP escalator_node_group_mem_request milli value of node request mem
# TYPE escalator_node_group_mem_request gauge
escalator_node_group_mem_request{node_group="mygroup"} 0

escalator_node_group_pods is still showing pods (in the hundreds) and I know they have requests, so it looks like somehow the data is getting lost.

Remove unused/deprecated configuration options

Remove unused/deprecated configuration options, as they can be confusing. Options to remove include:

  • UntaintUpperCapacityThreshholdPercent
  • UntaintLowerCapacityThreshholdPercent
  • SoftTaintEffectPercent
  • DampeningStrength
  • DaemonSetUsagePercent
  • MinSlackSpacePercent
  • ScaleDownMinGracePeriodSeconds

Update metrics to be namespaced under "escalator"

At the moment, metrics are exposed without a namespace, e.g. "node_group_untainted_nodes". When there are a lot of metrics being scraped by prometheus, the Escalator metrics are very generic and can be confusing.

We should be using a common namespace across all of the metrics so that they are easily findable.

The updated metric name would be: "escalator_node_group_untainted_nodes"
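A minimal sketch with the Prometheus Go client showing how the Namespace field produces the prefixed metric name (the help text and label are illustrative):

package main

import "github.com/prometheus/client_golang/prometheus"

// untaintedNodes is exposed as escalator_node_group_untainted_nodes thanks to
// the Namespace field.
var untaintedNodes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "escalator",
		Name:      "node_group_untainted_nodes",
		Help:      "Number of untainted (in-service) nodes in the node group",
	},
	[]string{"node_group"},
)

func main() {
	prometheus.MustRegister(untaintedNodes)
	untaintedNodes.WithLabelValues("shared").Set(3)
}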

Implement leader election mechanism

Implement Kubernetes leader election mechanism to allow for running multiple Escalator replicas, with only one performing the actual work.

At the moment, if multiple Escalator replicas are deployed, all will perform the scaling and will lead to undesired results.
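A minimal sketch using client-go's leaderelection package (the lease name, namespace, and identity below are placeholder values, not Escalator's actual configuration):

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "escalator-leader", Namespace: "kube-system"}, // placeholders
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the scaling loop only while this replica holds the lease.
				log.Println("became leader; starting scaling loop")
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting")
			},
		},
	})
}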

Different node selection methods for termination

At the moment Escalator supports only one type of mode for the selection of which nodes to terminate - oldest first. This mode just prioritises the oldest nodes in the Kubernetes API by the creation timestamp. This works well and is simple, but some more modes may be needed to support service based workloads.

This issue proposes some new node selection methods for termination, which are:

  • Selection of nodes based on how easily drainable the node is. This would be determined with the drain simulation package provided by the cluster-autoscaler tool.
  • Selection of nodes based on how utilised they are. This would be determined by prioritising nodes with less requested resources and would terminate nodes that are close to idling or have low usage.

These node selection methods could potentially be used at the same time, with a weighted sum model used to determine the "ideal" or highest scoring nodes to terminate first. The weighted sum model would apply a score to each node when evaluating it against a set of criteria. The criteria could be how old the node is, how easily it is able to be drained and finally how utilised the node is. The nodes with the highest scores overall would be prioritised for termination.

Using the utilisation based termination method by itself may lead to a situation where some nodes aren't ever terminated because they are heavily utilised. Using a weighted sum model and pairing it with the current "oldest first" method, both utilisation and how old the node is would be considered before deciding which nodes to terminate.

Cluster autoscaler drain simulator: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/simulator
Weighted sum model: https://en.wikipedia.org/wiki/Weighted_sum_model

/cc @dadux @mwhittington21
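A minimal sketch of how a weighted sum score could combine those criteria (the weights, struct, and normalisation are illustrative assumptions only, not a proposed implementation):

package main

import (
	"fmt"
	"sort"
	"time"
)

type candidate struct {
	Name         string
	Age          time.Duration // time since node creation
	Drainability float64       // 0..1, higher = easier to drain
	Utilisation  float64       // 0..1, higher = more requested resources
}

// score combines the criteria; higher scores are terminated first.
func score(c candidate, maxAge time.Duration) float64 {
	const wAge, wDrain, wUtil = 0.4, 0.3, 0.3 // illustrative weights
	ageNorm := float64(c.Age) / float64(maxAge)
	return wAge*ageNorm + wDrain*c.Drainability + wUtil*(1-c.Utilisation)
}

func main() {
	nodes := []candidate{
		{"node-a", 72 * time.Hour, 0.9, 0.2},
		{"node-b", 12 * time.Hour, 0.5, 0.8},
		{"node-c", 48 * time.Hour, 0.7, 0.4},
	}
	// Sort so the highest-scoring (best termination candidates) come first.
	sort.Slice(nodes, func(i, j int) bool {
		return score(nodes[i], 72*time.Hour) > score(nodes[j], 72*time.Hour)
	})
	for _, n := range nodes {
		fmt.Printf("%s score=%.2f\n", n.Name, score(n, 72*time.Hour))
	}
}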
