
escalator's Introduction

Escalator


Escalator is a batch or job optimized horizontal autoscaler for Kubernetes

It is designed for large batch or job-based workloads that cannot be force-drained and moved when the cluster needs to scale down; Escalator ensures pods have completed on nodes before terminating them. It is also optimised for scaling up the cluster as fast as possible so that pods are not left in a pending state.

Key Features

  • Calculates requests and capacity to determine whether to scale up, scale down, or stay at the current scale
  • Waits until non-daemonset pods on nodes have completed before terminating the node
  • Designed to work on selected auto-scaling groups, allowing the default Kubernetes Autoscaler to continue to scale service-based workloads
  • Automatically terminates the oldest nodes first
  • Supports slack space to ensure extra headroom in the event of a spike of scheduled pods
  • Does not terminate cordoned nodes or factor them into calculations, allowing cordoned nodes to persist for debugging
  • Supports different cloud providers (AWS only at the moment)
  • Exposes scaling and utilisation metrics
  • Provides leader election so you can run an HA Deployment inside a cluster
  • Provides basic support for multiple different instance types in a Node Group

The need for this autoscaler is derived from our own experiences with very large batch workloads being scheduled and the default autoscaler not scaling up the cluster fast enough. These workloads can't be force-drained by the default autoscaler and must complete before the node can be terminated.

Documentation and Design

See Docs

Requirements

  • Kubernetes version 1.24+. Escalator has been tested and deployed on 1.24 and newer. Older versions of Kubernetes may have bugs or issues that prevent it from functioning properly.
  • Go version 1.20+
  • Dependencies and their locked versions can be found in go.mod and go.sum.

Building

# Fetch dependencies and build Escalator
make build

How to run - Quick Start

Locally (out of cluster)

go run cmd/main.go --kubeconfig=~/.kube/config --nodegroups=nodegroups_config.yaml

Deployment (in cluster)

See Deployment for full Deployment documentation.

# Build the docker image
docker build -t atlassian/escalator .

# Create RBAC configuration
kubectl create -f docs/deployment/escalator-rbac.yaml

# Create config map - modify to suit your needs
kubectl create -f docs/deployment/escalator-cm.yaml

# Create deployment
kubectl create -f docs/deployment/escalator-deployment.yaml

Configuring

See Configuration

Testing

make test

Test a specific package

For example, to test the controller package:

go test ./pkg/controller

Contributors

Pull requests, issues and comments welcome. For pull requests:

  • Add tests for new features and bug fixes
  • Follow the existing style (we are using goreturns to format and lint escalator)
  • Separate unrelated changes into multiple pull requests

See the existing issues for things to start contributing.

For bigger changes, make sure you start a discussion first by creating an issue and explaining the intended change.

Atlassian requires contributors to sign a Contributor License Agreement, known as a CLA. This serves as a record stating that the contributor is entitled to contribute the code/documentation/translation to the project and is willing to have it used in distributions and derivative works (or is willing to transfer ownership).

Prior to accepting your contributions we ask that you please follow the appropriate link below to digitally sign the CLA. The Corporate CLA is for those who are contributing as a member of an organization and the individual CLA is for those contributing as an individual.

License

Copyright (c) 2018 Atlassian and others. Apache 2.0 licensed, see LICENSE file.


escalator's Issues

Update dependencies and pin versions

Update the dep dependencies and pin packages to the following versions:

  • github.com/sirupsen/logrus: 1.0.5
  • k8s.io/api: kubernetes-1.10.3
  • k8s.io/apimachinery: kubernetes-1.10.3
  • k8s.io/client-go: 7.0.0
  • github.com/aws/aws-sdk-go: v1.13.57
  • k8s.io/kubernetes: 1.10.3

Filter out already cordoned nodes in calculations

We need to have a way to filter out/respect cordoned nodes in calculations, but when it is time to scale down, these cordoned nodes need to be prioritised for termination (after the grace periods and the node has been determined as empty).

This will be especially helpful when certain nodes are having issues and need to be prioritised for termination.

We will also need a special taint that, when applied to a node, causes Escalator to completely filter the node out of its calculations and never terminate it. This will be helpful when investigating an issue with a node and you need the node to not be terminated.

Setup AWS config

Create the setup for the AWS library to make API calls using the chosen credentials or role.

  • We want to support running both in-cluster and out-of-cluster

Incorrect scale up calculations

Scale up calculations are calculated as follows:

  • CPU Utilisation = 80%
  • Scale up threshold (scale_up_threshhold_percent) = 70
  • Node worth (100 nodes) = 1.0
  • Remaining percentage needed to be below scale up threshold: 80 - 70 = 10
  • Scale up delta: 10 / 1 = 10.0
  • Amount sent to cloud provider to increase by: ceil(10.0) = 10

The key calculation is the 80 - 70 = 10 line to determine the amount to scale up by. 10% is the amount we have determined to scale by.

Unfortunately this won't scale the node group up by enough; after the 10 new nodes come up, the utilisation would be 72.7272727273%.

This isn't ideal as it would potentially take 2 or more scale up activities to bring the utilisation below the scale up threshold (70%).

The correct way to do the scale up would be to calculate the percentage difference relative to the threshold: (original_value - new_value) / new_value * 100, i.e. (80 - 70) / 70 * 100

Using this, we should actually be increasing the node group by 14.2857142857% or 15 nodes. The utilisation after this scale up would be 69.5652173913%, below the scale up threshold of 70.
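A minimal sketch in Go of the corrected calculation (the function name and signature are illustrative, not Escalator's actual code):

package main

import (
	"fmt"
	"math"
)

// nodesToAdd is an illustrative helper: rather than using the raw difference
// between utilisation and the threshold, it computes the percentage increase
// in node count needed to bring utilisation back below the threshold.
func nodesToAdd(utilisation, threshold float64, currentNodes int) int {
	percentIncrease := (utilisation - threshold) / threshold * 100
	delta := float64(currentNodes) * percentIncrease / 100
	return int(math.Ceil(delta))
}

func main() {
	// Example from above: 100 nodes at 80% utilisation with a 70% threshold.
	add := nodesToAdd(80, 70, 100)
	fmt.Printf("add %d nodes, new utilisation %.2f%%\n", add, 80.0*100/float64(100+add))
	// Output: add 15 nodes, new utilisation 69.57%
}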

Add additional metrics

Add some additional Prometheus metrics, for example:

  • Whether the scaling lock is present
  • Time that the scaling lock has been present for
  • Time taken for new nodes to be created to be ready in Kubernetes
  • Time after node has been tainted until it was terminated
  • Current state (scaling up/doing nothing/scaling down), this could also just be the Node Delta value (negative, zero, positive)
  • ASG metrics (current ASG size, desired ASG size)

Add support for Kubernetes memory requests in millibytes

It was discovered in #115 that Escalator doesn't handle pods having memory requests in millibytes. This causes the memory requests for all pods in a node group to be inaccurate whilst there is a pod with a memory request in millibytes. This has a flow-on effect in that memory request metrics are inaccurate, and Escalator may start scaling down unexpectedly if only memory requests are specified.

Millibytes are supported in Kubernetes (see kubernetes/kubernetes#28741) for backwards-compatibility reasons.

We will need tests around different memory and cpu formats to handle edge cases like this as well.
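As a rough illustration (a sketch using k8s.io/apimachinery's resource package, not Escalator's actual accounting code), summing whole-unit values loses the milli component, whereas accumulating MilliValue() keeps it:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// A memory request expressed in milli units: "1500m" is 1.5 bytes.
	q := resource.MustParse("1500m")

	// Value() rounds to whole units, which is lossy for small milli quantities.
	fmt.Println(q.Value())

	// MilliValue() preserves the milli component (prints 1500).
	fmt.Println(q.MilliValue())
}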

AWS internal ASG size becoming out of date/stale

We use the following for calculating the new size to tell AWS to set the ASG to:

size := n.Size()
newSize := size + delta

For some reason, n.Size() returns a size lower than the ASG's actual current size, so Escalator doesn't tell the ASG to scale up by enough.

Requested a scale up by 8:

time="2018-03-08T05:35:51Z" level=debug msg="**********[START NODEGROUP shared]**********"
time="2018-03-08T05:35:51Z" level=info msg="pods total: 977" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining total: 106" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining untainted: 106" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="nodes remaining tainted: 0" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="cpu: 72.85534591194968, memory: 76.92229586166764" nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Unlocking scale lock"
time="2018-03-08T05:35:51Z" level=info msg="lock(false): there are 0 upcoming nodes requested." nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Delta= 8" nodegroup=shared
time="2018-03-08T05:35:51Z" level=warning msg="There are no tainted nodes to untaint" nodegroup=shared
time="2018-03-08T05:35:51Z" level=info msg="increasing asg by 8" drymode=false nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Locking scale lock"
time="2018-03-08T05:35:51Z" level=debug msg="DeltaScaled= 8" nodegroup=shared
time="2018-03-08T05:35:51Z" level=debug msg="Scaling took a total of 352.493456ms"

But only told AWS to increase the ASG to 109:

    "requestParameters": {
        "desiredCapacity": 109,
        "autoScalingGroupName": "",
        "honorCooldown": true
    },

Add option to perform a drain before terminating a node

Add an option (in the nodegroup.yaml configuration file) to perform a drain before terminating an instance in a node group.

This option should be per node group, as some node groups may have jobs only vs some with services only.

Instance does not belong in ASG when scaling down

Under specific circumstances, Escalator isn't able to terminate nodes in an ASG. This is due to the DeleteNodes() function in the AWS cloud provider returning the following error:

node ip-10-153-110-221.ec2.internal belongs in a different asg than <asg>

The cloud provider Belongs() function is responsible for determining if a node belongs in the target ASG and in this situation it is returning that the node isn't in the ASG.

The Belongs() function uses the results from an AWS DescribeAutoScalingGroups API call in the current run, so it could potentially be an issue with AWS returning invalid ASG information.

The problem seems to fix itself by restarting the pod that Escalator is running in, which may mean it is a problem with Escalator.

Some possible causes:

  • The Belongs() function does not work properly - highly doubt this is the issue, it's a very simple function
  • AWS is returning incorrect information on the DescribeAutoScalingGroups API call
    • We've seen this occur in the past with Autoscaling API outages with AWS which leads me to believe it is a problem with AWS
    • Maybe Escalator is using a problematic Autoscaling API endpoint and by restarting the pod it uses a newer, working one
  • This comment suggests that it's an issue with memory/how the ASG result is being accessed that is resulting in incorrect values

A node count less than the configured "min_nodes" doesn't trigger a scale up

Having a node count less than the configured "min_nodes" option in the nodegroups yaml won't trigger a scale up to ensure that the node group is at the minimum.

This leaves the node group in a broken state where it needs to scale up but can't because it is below the minimum node count.

See below logs for example.

time="2018-03-02T00:50:03Z" level=info msg="Starting with log level debug"
time="2018-03-02T00:50:03Z" level=info msg="Validating options: [PASS]" nodegroup=shared
time="2018-03-02T00:50:03Z" level=info msg="Registered with drymode false" nodegroup=shared
time="2018-03-02T00:50:03Z" level=info msg="Using in cluster config"
time="2018-03-02T00:50:03Z" level=info msg="Waiting for cache to sync..."
time="2018-03-02T00:50:03Z" level=debug msg="Trying to sync cache: tries = 0, max = 3"
time="2018-03-02T00:50:04Z" level=info msg="Cache took 201.721232ms to sync"
time="2018-03-02T00:50:05Z" level=info msg="aws session created successfully"
time="2018-03-02T00:50:05Z" level=debug msg="**********[AUTOSCALER FIRST LOOP]**********"
time="2018-03-02T00:50:05Z" level=debug msg="**********[START NODEGROUP shared]**********"
time="2018-03-02T00:50:05Z" level=info msg="pods total: 0" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining total: 3" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining untainted: 3" nodegroup=shared
time="2018-03-02T00:50:05Z" level=info msg="nodes remaining tainted: 0" nodegroup=shared
time="2018-03-02T00:50:05Z" level=warning msg="Node count of 3 less than minimum of 10" nodegroup=shared
time="2018-03-02T00:50:05Z" level=debug msg="Scaling took a total of 58.830382ms"

Utilisation is rounded down when determining scale delta

Utilisation is rounded down when calculating scale delta. For example:

  • Utilisation: 70.05%
  • Scale up threshold: 70

The utilisation is then rounded down to 70%, and because 70% is not greater than 70, no scaling is done.

70.05% should really be rounded up to 71% to ensure we scale when we exceed the threshold.
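A small sketch of the comparison in question (the values are the example above; the floor/ceil calls stand in for Escalator's rounding, which may differ in detail):

package main

import (
	"fmt"
	"math"
)

func main() {
	utilisation := 70.05
	threshold := 70.0

	// Rounding down masks the overshoot: 70 is not greater than 70.
	fmt.Println(math.Floor(utilisation) > threshold) // false

	// Rounding up triggers the scale up: 71 is greater than 70.
	fmt.Println(math.Ceil(utilisation) > threshold) // true
}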

Create documentation

Add some initial documentation for the project. All documentation that does not fit on the homepage README should go into the docs/ folder

Child Issues:

Move metrics inside controller (not global)

At the moment, metrics are global and can be used by just calling metrics.<metric-name>. They should ideally be part of the controller package. This will make it easier to test the controller package in the future, as we can easily mock out the metrics provider.

Implement healthcheck endpoint

Implement healthcheck endpoint to respond with the health of Escalator.

It should perform the following checks internally:

  • Whether the cloud provider session can list/describe node groups in the cloud provider
  • Whether it can list pods/nodes on the apiserver (depends on how long this takes and whether it is taxing)
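A minimal sketch of what such an endpoint could look like with Go's net/http; checkCloudProvider and checkAPIServer are hypothetical stand-ins for the checks listed above, not functions in Escalator's codebase:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// healthHandler runs the internal checks and reports overall health.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := checkCloudProvider(); err != nil {
		http.Error(w, fmt.Sprintf("cloud provider check failed: %v", err), http.StatusServiceUnavailable)
		return
	}
	if err := checkAPIServer(); err != nil {
		http.Error(w, fmt.Sprintf("apiserver check failed: %v", err), http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// checkCloudProvider would verify the cloud provider session can describe node groups.
func checkCloudProvider() error { return nil }

// checkAPIServer would verify pods and nodes can be listed from the apiserver.
func checkAPIServer() error { return nil }

func main() {
	http.HandleFunc("/healthz", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}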

Update documentation

  • Configuration options documentation
    • Threshold configuration
    • Slack capacity
  • Add documentation on deployment of escalator
    • K8s deployment, service, config and RBAC
    • Deployment on AWS
      • IAM roles and policy
      • AWS region
      • ASG name
  • Update diagram
  • Calculation/algorithm documentation
    • Usage calculation
    • Capacity calculation
    • Daemonsets
    • Multiple containers per pod
  • Node/pod labels/selectors
  • Common issues/gotchas
  • Prometheus metrics
  • Best practices
  • Add contributing section
  • Add copyright section
  • Add glossary of terms
  • scale down process/tainting
  • scale lock
  • Clear project mission statement
  • List of existing/planned features
  • List of requirements
  • Install/deployment instructions
  • Sharable roadmap for future development
  • LICENSE.txt
  • CODE_OF_CONDUCT.md

Delete node from Kubernetes before terminating in cloud provider

We have seen instances where terminating nodes directly from the cloud provider triggers log entries of Kubernetes trying to poll terminated nodes. This is because the node hasn't been gracefully removed from Kubernetes.

An enhancement would be to delete the node from Kubernetes before terminating it from the cloud provider.
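A minimal sketch of the enhancement, assuming client-go; the package name and terminateInstance are hypothetical stand-ins for the cloud provider side, not Escalator's actual code:

package cloudprovider // hypothetical package for illustration

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gracefulTerminate deletes the node object from Kubernetes first, then asks
// the cloud provider to terminate the underlying instance.
func gracefulTerminate(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	if err := client.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("deleting node %s from Kubernetes: %w", nodeName, err)
	}
	return terminateInstance(ctx, nodeName)
}

// terminateInstance is a placeholder for the cloud provider termination call.
func terminateInstance(ctx context.Context, nodeName string) error { return nil }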

Scaling up stops working when max_nodes is reached

If the cluster already has the maximum number of nodes and some of them are tainted, nothing will happen when the scale-up threshold is reached. No node will be untainted and, of course, no node will be added via the cloud provider either.
This bug also causes another problem. If you have tainted nodes and need to scale up, the maximum number of nodes that will be untainted at a time is the difference between max_nodes and the current number of nodes.
E.g.: Nodes = 9
Max nodes = 10
Tainted = 5
Need 3 more nodes? Instead of untainting 3 nodes at once, only 1 node will be untainted each time scaling up is run.

Another example from a log:
time="2018-06-13T18:23:36Z" level=debug msg="**********[AUTOSCALER MAIN LOOP]**********" time="2018-06-13T18:23:37Z" level=debug msg="**********[START NODEGROUP default]**********" time="2018-06-13T18:23:37Z" level=info msg="pods total: 48" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining total: 18" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="cordoned nodes remaining total: 0" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining untainted: 12" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="nodes remaining tainted: 6" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="cpu: 110.43333333333334, memory: 82.31950436831336" nodegroup=default time="2018-06-13T18:23:37Z" level=debug msg="Unlocking scale lock" time="2018-06-13T18:23:37Z" level=debug msg="Delta: 5" nodegroup=default time="2018-06-13T18:23:37Z" level=info msg="increasing nodes exceeds maximum (18). Clamping add amount to (0)" time="2018-06-13T18:23:37Z" level=warning msg="Scale up delta is less than or equal to 0 after clamping: 0" time="2018-06-13T18:23:37Z" level=debug msg="DeltaScaled: 0" nodegroup=default time="2018-06-13T18:23:37Z" level=debug msg="Scaling took a total of 110.042288ms"

Publish Docker image

Thanks for making this! I'd love to be able to pull a canonical image, especially tagged with a release number, rather than building the image myself. Is that planned?

Remove cordon/uncordon when terminating instance

Remove the cordon/uncordon process when terminating instances in the node group. This is a carry-over from phase 2 of Escalator, which only involved tainting/cordoning nodes and letting the default cluster autoscaler handle the rest.

Scale Up Calculation

Implement the logic to determine how many nodes need to be added/untainted on a scale up event

Support autodiscovery via ASG labels

We currently have to define autoscaling group sizes in two places: Once when creating the actual ASG, and once when configuring escalator. It would be nice if escalator could, given the name of the ASG it should be managing, find the max and min sizes. We've had situations in the past where we update the sizes in one place and not the other, which is easy to do since ASG configuration and Kubernetes deployment configuration usually lives in pretty different places.

The most important use case for us is to have autodiscovery of cluster size, but having autodiscovery of all configuration would be really cool.
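A minimal sketch of how the min/max sizes could be discovered with aws-sdk-go (the ASG name below is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String("my-node-group-asg")}, // placeholder ASG name
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, asg := range out.AutoScalingGroups {
		fmt.Printf("%s: min=%d max=%d\n", aws.StringValue(asg.AutoScalingGroupName),
			aws.Int64Value(asg.MinSize), aws.Int64Value(asg.MaxSize))
	}
}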

Document system architecture

Need to document how the application will be designed, including modules and structures, along with a UML diagram.

Multiple node groups for the same set of pods

We have the following use-case. We want escalator to manage auto-scaling two node groups where all of our pods could be scheduled in any of these two node groups, i.e. there would not be a 1-1 mapping between pod labels and node groups labels.

The reason why we want to have two node groups, is for example, so that one node group manages spot instances while another handles on-demand instances. We would prefer to scale up spot instances whenever possible, but if they are not available, we would be ok scaling up the on demand instance.

My question is, is this currently possible? If not, what would it take for us to implement this?

Thanks!

escalator_node_group_mem_request metric is frequently 0

I've started to monitor metrics and am seeing that escalator_node_group_mem_request only sometimes has values. This is showing escalator_node_group_mem_request in blue and escalator_node_group_mem_capacity in green:

[screenshot: escalator_node_group_mem_request (blue) vs escalator_node_group_mem_capacity (green) over time]

When I look at the metrics directly, I see:

# HELP escalator_node_group_mem_request milli value of node request mem
# TYPE escalator_node_group_mem_request gauge
escalator_node_group_mem_request{node_group="mygroup"} 0

escalator_node_group_pods is still showing pods (in the hundreds) and I know they have requests, so it looks like somehow the data is getting lost.

Remove unused/deprecated configuration options

Remove unused/deprecated configuration options, as they can be confusing. Options to remove include:

  • UntaintUpperCapacityThreshholdPercent
  • UntaintLowerCapacityThreshholdPercent
  • SoftTaintEffectPercent
  • DampeningStrength
  • DaemonSetUsagePercent
  • MinSlackSpacePercent
  • ScaleDownMinGracePeriodSeconds

Update metrics to be namespaced under "escalator"

At the moment, metrics are exposed without a namespace, e.g. "node_group_untainted_nodes". When there are a lot of metrics being scraped by prometheus, the Escalator metrics are very generic and can be confusing.

We should be using a common namespace across all of the metrics so that they are easily findable.

The updated metric name would be: "escalator_node_group_untainted_nodes"
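A minimal sketch with the Prometheus Go client showing how the Namespace field produces the prefixed metric name (the help text and label are illustrative):

package main

import "github.com/prometheus/client_golang/prometheus"

// untaintedNodes is exposed as escalator_node_group_untainted_nodes thanks to
// the Namespace field.
var untaintedNodes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "escalator",
		Name:      "node_group_untainted_nodes",
		Help:      "Number of untainted (in-service) nodes in the node group",
	},
	[]string{"node_group"},
)

func main() {
	prometheus.MustRegister(untaintedNodes)
	untaintedNodes.WithLabelValues("shared").Set(3)
}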

Implement leader election mechanism

Implement Kubernetes leader election mechanism to allow for running multiple Escalator replicas, with only one performing the actual work.

At the moment, if multiple Escalator replicas are deployed, all will perform the scaling and will lead to undesired results.
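A minimal sketch using client-go's leaderelection package (the lease name, namespace, and identity below are placeholder values, not Escalator's actual configuration):

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "escalator-leader", Namespace: "kube-system"}, // placeholders
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the scaling loop only while this replica holds the lease.
				log.Println("became leader; starting scaling loop")
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; exiting")
			},
		},
	})
}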

Different node selection methods for termination

At the moment Escalator supports only one type of mode for the selection of which nodes to terminate - oldest first. This mode just prioritises the oldest nodes in the Kubernetes API by the creation timestamp. This works well and is simple, but some more modes may be needed to support service based workloads.

This issue proposes some new node selection methods for termination, which are:

  • Selection of nodes based on how easily drainable the node is. This would be determined with the drain simulation package provided by the cluster-autoscaler tool.
  • Selection of nodes based on how utilised they are. This would be determined by prioritising nodes with less requested resources and would terminate nodes that are close to idling or have low usage.

These node selection methods could potentially be used at the same time, with a weighted sum model used to determine the "ideal" or highest scoring nodes to terminate first. The weighted sum model would apply a score to each node when evaluating it against a set of criteria. The criteria could be how old the node is, how easily it is able to be drained and finally how utilised the node is. The nodes with the highest scores overall would be prioritised for termination.

Using the utilisation based termination method by itself may lead to a situation where some nodes aren't ever terminated because they are heavily utilised. Using a weighted sum model and pairing it with the current "oldest first" method, both utilisation and how old the node is would be considered before deciding which nodes to terminate.

Cluster autoscaler drain simulator: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/simulator
Weighted sum model: https://en.wikipedia.org/wiki/Weighted_sum_model

/cc @dadux @mwhittington21
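A minimal sketch of how a weighted sum score could combine those criteria (the weights, struct, and normalisation are illustrative assumptions only, not a proposed implementation):

package main

import (
	"fmt"
	"sort"
	"time"
)

type candidate struct {
	Name         string
	Age          time.Duration // time since node creation
	Drainability float64       // 0..1, higher = easier to drain
	Utilisation  float64       // 0..1, higher = more requested resources
}

// score combines the criteria; higher scores are terminated first.
func score(c candidate, maxAge time.Duration) float64 {
	const wAge, wDrain, wUtil = 0.4, 0.3, 0.3 // illustrative weights
	ageNorm := float64(c.Age) / float64(maxAge)
	return wAge*ageNorm + wDrain*c.Drainability + wUtil*(1-c.Utilisation)
}

func main() {
	nodes := []candidate{
		{"node-a", 72 * time.Hour, 0.9, 0.2},
		{"node-b", 12 * time.Hour, 0.5, 0.8},
		{"node-c", 48 * time.Hour, 0.7, 0.4},
	}
	// Sort so the highest-scoring (best termination candidates) come first.
	sort.Slice(nodes, func(i, j int) bool {
		return score(nodes[i], 72*time.Hour) > score(nodes[j], 72*time.Hour)
	})
	for _, n := range nodes {
		fmt.Printf("%s score=%.2f\n", n.Name, score(n, 72*time.Hour))
	}
}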
