
vrivellino / spoptimize


Spoptimize: Replace AWS AutoScaling instances with spot instances

License: Mozilla Public License 2.0

Python 94.68% Shell 5.32%
aws autoscaling autoscaling-groups spot-instances ec2-spot ec2-spot-instances aws-step-functions python cloudformation serverless-application-model

spoptimize's Introduction

SPOPTIMIZE


Spoptimize is a tool that automates use of Amazon EC2 spot instances in your AutoScaling Groups.

About Spoptimize

Spoptimize was inspired by AutoSpotting and performs very similar actions, but it has its own, completely independent implementation.

But why reinvent the wheel and not use AutoSpotting?

I had been noodling on ways to use spot instances in AutoScaling groups for quite a while, and I had brainstormed a few different ideas before I came across AutoSpotting. I thought the idea was ingenious, but I also thought it might be fun to build a similar system that was event-driven rather than polling-based. I had never used AWS Step Functions before, so I took the opportunity to build my own tool using Step Functions whose executions are initiated by AutoScaling Launch Notifications.

How it works

Each launch notification is processed by a Lambda, which in turn begins an execution of Spoptimize's Step Functions.
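
As a rough illustration of that first hop, here is a minimal sketch of a Lambda handler that reads the SNS-wrapped launch notification and starts one Step Functions execution per launched instance. It is not Spoptimize's actual handler: the function names, the `SPOPTIMIZE_STATE_MACHINE_ARN` environment variable, and the shape of the execution input are assumptions made for this example.

```python
import json
import os


def parse_launch_notification(event):
    """Extract launch details from an SNS-wrapped AutoScaling notification."""
    launches = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Skip test notifications and any other event types.
        if message.get("Event") != "autoscaling:EC2_INSTANCE_LAUNCH":
            continue
        launches.append({
            "group_name": message["AutoScalingGroupName"],
            "instance_id": message["EC2InstanceId"],
        })
    return launches


def handler(event, context, sfn_client=None, state_machine_arn=None):
    """Start one Step Functions execution per launched instance."""
    if sfn_client is None:
        import boto3  # imported lazily so the parsing logic can be tested offline
        sfn_client = boto3.client("stepfunctions")
    # Hypothetical environment variable, used only in this sketch.
    arn = state_machine_arn or os.environ["SPOPTIMIZE_STATE_MACHINE_ARN"]
    launches = parse_launch_notification(event)
    for launch in launches:
        sfn_client.start_execution(stateMachineArn=arn, input=json.dumps(launch))
    return launches
```

Passing the client in as an argument keeps the handler easy to exercise with a stub in unit tests.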

The Step Function execution manages the execution of Lambda functions which perform these actions:

  1. Wait following new instance launch. (See spoptimize:init_sleep_interval below)
  2. Verify that the new on-demand instance is healthy according to autoscaling.
  3. Request Spot Instance using specifications defined in autoscaling group's launch configuration.
  4. Wait for Spot Request to be fulfilled and for spot instance to be online. (See spoptimize:spot_req_sleep_interval below)
  5. Acquire an exclusive lock on the autoscaling group. This step prevents multiple executions from attaching & terminating instances simultaneously.
  6. Attach spot instance to autoscaling group and terminate original on-demand instance.
  7. Wait for spot instance to be healthy according to autoscaling. (See spoptimize:spot_attach_sleep_interval below)
  8. Verify health of spot instance and release exclusive lock.
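
The exclusive lock in steps 5 and 8 can be pictured with the small in-memory sketch below. It only illustrates the semantics (one execution per group at a time, everyone else gets an error they can retry on); it says nothing about how Spoptimize actually stores its lock. The `GroupLocks` class and its method names are hypothetical.

```python
class GroupLocked(Exception):
    """Raised when another execution already holds the group's lock."""


class GroupLocks:
    """In-memory stand-in for a per-group exclusive lock (illustration only)."""

    def __init__(self):
        self._owners = {}

    def acquire(self, group_name, execution_id):
        # setdefault stores the caller as owner only if nobody holds the lock.
        owner = self._owners.setdefault(group_name, execution_id)
        if owner != execution_id:
            raise GroupLocked(f"{group_name} is locked by {owner}")

    def release(self, group_name, execution_id):
        # Only the current owner may release the lock.
        if self._owners.get(group_name) == execution_id:
            del self._owners[group_name]
```

A real implementation would need the acquire step to be atomic across concurrent Lambda invocations, which an in-process dictionary cannot provide.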

Screenshot of a successful execution: AWS Step Function execution

Deploying

Here's a breakdown of the privileges required for deployment. Deployment requires the ability to:

  • create/update/delete:
    • CloudFormation stacks
    • IAM Managed Policy
    • IAM Roles
    • CloudWatch Alarms
    • DynamoDb tables whose table names begin with spoptimize
    • Lambda functions whose function names begin with spoptimize
    • Step Functions whose names begin with spoptimize
  • create an SNS topic named spoptimize-init
  • create an S3 bucket named spoptimize-artifacts-YOUR_AWS_ACCOUNT_ID
  • read/write to the aforementioned S3 bucket with a prefix of spoptimize

Note: many of the names and prefixes can be overridden via setting environment variables prior to running the deployment script.

Quick Launch

You can deploy Spoptimize via the CloudFormation console using the following launch button. It will deploy the latest build:

Launch

Deployment Script

If you wish to deploy Spoptimize via a shell or an automated process, you can utilize the included deploy script.

Prerequisites:

  • Bash
  • AWS CLI
  • API access to an AWS account

First clone this repo, or download a tar.gz or zip from Releases.

Deploy both the IAM stack and the Step Functions & Lambdas:

$ ./deploy.sh

Deploy just the IAM stack:

$ ./deploy.sh iam

Deploy just the Step Functions and Lambdas:

$ ./deploy.sh cfn

Configuration

After Spoptimize is deployed, configure your autoscaling groups to send launch notifications to the spoptimize-init SNS topic.

Set via CloudFormation (see NotificationConfigurations):

  LaunchGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      LaunchConfigurationName: !Ref LaunchConfig
      DesiredCapacity: 0
      MinSize: 0
      MaxSize: 12
      VPCZoneIdentifier: 
        - !Select [ 0, !Ref SubnetIds ]
        - !Select [ 1, !Ref SubnetIds ]
      MetricsCollection: 
        - Granularity: 1Minute
      HealthCheckGracePeriod: 120
      Cooldown: 180
      HealthCheckType: ELB
      TargetGroupARNs:
        - !Ref ElbTargetGroup
      Tags:
        - Key: Name
          Value: !Ref AWS::StackName
          PropagateAtLaunch: true
      NotificationConfigurations:
        - TopicARN: !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:spoptimize-init"
          NotificationTypes:
            - autoscaling:EC2_INSTANCE_LAUNCH

And in the console: EC2 AutoScaling console showing notifications tab

Newly launched instances will (eventually) be replaced by spot instances.
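
The same notification setting could also be applied to an existing group from a script. The sketch below assumes a boto3 AutoScaling client is passed in and uses its `put_notification_configuration` call; the helper names are made up for this example, and the topic name follows the default used above.

```python
def notification_config(group_name, region, account_id, topic_name="spoptimize-init"):
    """Build the arguments for put_notification_configuration."""
    return {
        "AutoScalingGroupName": group_name,
        "TopicARN": f"arn:aws:sns:{region}:{account_id}:{topic_name}",
        "NotificationTypes": ["autoscaling:EC2_INSTANCE_LAUNCH"],
    }


def enable_launch_notifications(autoscaling_client, group_name, region, account_id):
    """Point a group's launch notifications at the Spoptimize topic."""
    config = notification_config(group_name, region, account_id)
    autoscaling_client.put_notification_configuration(**config)
    return config
```

For example, `enable_launch_notifications(boto3.client("autoscaling"), "my-group", "us-east-1", "123456789012")`, where the group name, region, and account ID are placeholders.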

Configuration Overrides

Spoptimize's wait intervals may be overridden per AutoScaling group via tags.

  • spoptimize:min_protected_instances: Set a minimum number of on-demand instances for the autoscaling group. Defaults to 0. This prevents Spoptimize from replacing all on-demand instances with spot instances. NOTE: Spoptimize leverages Instance Protection to achieve this.
  • spoptimize:init_sleep_interval: Initial wait interval after launch notification is received. Spoptimize won't do anything during this wait period. Defaults to approximately the group's Health Check Grace Period times the Desired Capacity plus 30-90s. The default scales with the group's capacity so that rolling updates can complete before any instances are replaced.
  • spoptimize:spot_req_sleep_interval: Wait interval following spot instance request. Default is 30s.
  • spoptimize:spot_attach_sleep_interval: Wait interval following attachment of spot instance to autoscaling group. Defaults to the group's Health Check Grace Period plus 30s.
  • spoptimize:spot_failure_sleep_interval: Wait interval between iterations following a spot instance failure. Defaults to 1 hour. A spot failure may be a failed spot instance request or a failure of the spot instance after it comes online.
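
To make the defaults concrete, the sketch below resolves the effective settings for one group from its tags. It assumes the group is a dictionary shaped like an entry returned by the AutoScaling `describe_auto_scaling_groups` API. The function names are hypothetical, and reading "plus 30-90s" as a random value in that range is an assumption made for this example.

```python
import random


def group_tags(group):
    """Return the group's tags as a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in group.get("Tags", [])}


def effective_settings(group, rng=random):
    """Combine the documented defaults with any spoptimize:* tag overrides."""
    grace = group["HealthCheckGracePeriod"]
    desired = group["DesiredCapacity"]
    defaults = {
        "min_protected_instances": 0,
        # Assumption: "plus 30-90s" is modelled as a random jitter.
        "init_sleep_interval": grace * desired + rng.randint(30, 90),
        "spot_req_sleep_interval": 30,
        "spot_attach_sleep_interval": grace + 30,
        "spot_failure_sleep_interval": 3600,
    }
    tags = group_tags(group)
    return {
        name: int(tags.get("spoptimize:" + name, default))
        for name, default in defaults.items()
    }
```

With a grace period of 120 seconds and a desired capacity of 2, the default initial wait lands between 270 and 330 seconds unless the tag overrides it.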

Below are override tags I used during development. (Note: these are very aggressive so that I could watch Spoptimize in action.)

Set via CloudFormation:

      Tags:
        - Key: Name
          Value: !Ref AWS::StackName
          PropagateAtLaunch: true
        - Key: spoptimize:min_protected_instances
          Value: 1
          PropagateAtLaunch: false
        - Key: spoptimize:init_sleep_interval
          Value: 45
          PropagateAtLaunch: false
        - Key: spoptimize:spot_req_sleep_interval
          Value: 10
          PropagateAtLaunch: false
        - Key: spoptimize:spot_attach_sleep_interval
          Value: 125
          PropagateAtLaunch: false
        - Key: spoptimize:spot_failure_sleep_interval
          Value: 900
          PropagateAtLaunch: false

And in the console: EC2 AutoScaling console showing tags tab

Notes

  • Auto-Scaling groups that deploy EC2 instances to VPCs are tested. Auto-Scaling groups in EC2-Classic should work, but are not tested.


spoptimize's Issues

Cancel spot requests for terminated instances

Perhaps wire up a lambda to ASG termination notices to cancel spot requests. EC2 will eventually close them on its own, but open spot requests associated with terminated instances count against the account-holder's limit.

Wait for cloudformation

If an auto-scaling group is managed by cloudformation and the associated cloudformation stack status is IN_PROGRESS, wait for the stack status to settle before proceeding with execution.

This will prevent spoptimize from doing anything during stack updates.
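
A minimal sketch of the check, assuming the group is a dictionary shaped like an entry from `describe_auto_scaling_groups` and that the stack is found through the `aws:cloudformation:stack-name` tag that CloudFormation applies to resources it manages. The helper names are hypothetical.

```python
STACK_NAME_TAG = "aws:cloudformation:stack-name"


def stack_name_for(group):
    """Return the owning CloudFormation stack name, or None if untagged."""
    for tag in group.get("Tags", []):
        if tag["Key"] == STACK_NAME_TAG:
            return tag["Value"]
    return None


def stack_is_settled(stack_status):
    """True once the stack is no longer in any *_IN_PROGRESS state."""
    return not stack_status.endswith("_IN_PROGRESS")
```

The status string would come from the `StackStatus` field returned by CloudFormation's `describe_stacks` call for that stack name.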

Revisit locking retry/back-off

Exclusive locking is implemented via step functions' retry semantics:

spoptimize/sam.yml

Lines 322 to 326 in 4aa555c

"Retry": [{
"ErrorEquals": [ "GroupLocked" ],
"IntervalSeconds": 5,
"MaxAttempts": 20,
"BackoffRate": 1.5

I'm not sure this is the right long-term solution. For larger autoscaling groups, it may take hours for all instances to be replaced after a deploy or mass update.

Perhaps allow for more than one instance to be replaced by Spoptimize at a time (configurable via tag)? Or just have a static interval between retries?
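
To put numbers on the retry settings quoted above, the arithmetic below adds up the waits, assuming the usual Step Functions behaviour where the first retry waits IntervalSeconds and each later wait is multiplied by BackoffRate. The helper name is made up for this sketch.

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry attempt under exponential back-off."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


waits = retry_waits(interval_seconds=5, max_attempts=20, backoff_rate=1.5)
total_hours = sum(waits) / 3600
last_wait_hours = waits[-1] / 3600
print(f"total wait: {total_hours:.1f}h, last wait: {last_wait_hours:.1f}h")
```

Under that assumption, an execution that never gets the lock spends a little over nine hours retrying, and the final wait alone is about three hours, so a lock that frees up early in one of the later waits goes unnoticed for a long time.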

Update deployment documentation

Split out getting-started/quick-deploy from advanced deployment topics.

Advanced deployment docs would include details on how to override defaults via environment variables.

Update readme to note MaxSpotInstanceCountExceeded

During testing, I came across this error from the request-spot lambda:

An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded: ClientError
Traceback (most recent call last):
  File "/var/task/handler.py", line 64, in handler
    event['launch_subnet_id'], client_token)
  File "/var/task/spoptimize/stepfns.py", line 106, in request_spot_instance
    return spot_helper.request_spot_instance(launch_config, az, subnet_id, client_token)
  File "/var/task/spoptimize/spot_helper.py", line 65, in request_spot_instance
    Type='one-time', ClientToken=client_token)
  File "/var/runtime/botocore/client.py", line 317, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 615, in _make_api_call
    raise error_class(parsed_response, operation_name)
ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded

The solution was to request a service limit increase via AWS Support.
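
If the request-spot lambda wanted to recognise this case rather than fail with a bare traceback, a check along these lines might work. It relies on the `response` dictionary that botocore's ClientError carries; the function name is hypothetical, and how the caller should react is left open.

```python
SPOT_LIMIT_CODE = "MaxSpotInstanceCountExceeded"


def is_spot_limit_error(error):
    """True if a botocore ClientError reports the spot instance limit."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") == SPOT_LIMIT_CODE
```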

Refactor handler.py & stepfns.py

Handler.py has zero test coverage, and it contains some logic that probably belongs in stepfns.py.

It'd be great to get test coverage for handler.py and to keep as much logic as possible in stepfns.py.

All lambda return values should be defined somewhere (perhaps in a standalone module) so that a test can compare them against the strings in sam.yml.

ECS Support

Hello!

Can spoptimize be used for EC2 instances with ECS?

Thanks!

Simplify spot instance attachment; Add locking

When testing the initial implementation, I found that with an autoscaling group with desired-capacity of 1, spoptimize and autoscaling get into a loop:

  1. Instance is launched by ASG
  2. Spoptimize attaches a spot instance in same AZ as launched on-demand instance
  3. ASG launches a new instance in another AZ, attempting to rebalance
  4. The original instance and the spot instance get nuked
  5. Process repeats in the other AZ

Rather than worry about seamless attachments and terminations, spoptimize should instead:

  • terminate on-demand and attach spot in same step
  • provide a lock-out mechanism to prevent parallel executions from attaching & terminating in the same ASG

With locking implemented, there won't be any service downtime as long as the autoscaling group has more than one instance running. And an autoscaling group of 1 implies that some service downtime is acceptable.
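
A minimal sketch of a combined attach-and-terminate step, assuming a boto3 AutoScaling client is passed in and the group's lock is already held. The function name and the order of the two calls are assumptions for this example, not a description of Spoptimize's code.

```python
def swap_in_spot_instance(autoscaling_client, group_name, ondemand_id, spot_id):
    """Attach the spot instance, then retire the on-demand one, in one step."""
    # Attaching an instance raises the group's desired capacity by one.
    autoscaling_client.attach_instances(
        InstanceIds=[spot_id],
        AutoScalingGroupName=group_name,
    )
    # Decrementing on termination brings desired capacity back to its old value.
    autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=ondemand_id,
        ShouldDecrementDesiredCapacity=True,
    )
```

Attaching first would fail for a group that is already at its maximum size, so a real implementation would have to handle that case.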

Update readme to note protected & standby instances

Make a note in the documentation that protected and standby instances are not replaced by spoptimize and execution will stop if the launched instance is detected by spoptimize to be protected from scale-in or is marked as standby.
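
The check described here could look roughly like the sketch below. It assumes the group is a dictionary shaped like an entry from `describe_auto_scaling_groups`, whose instances carry `ProtectedFromScaleIn` and `LifecycleState` fields. The function name is hypothetical.

```python
def is_replaceable(group, instance_id):
    """False for protected, standby, or unknown instances."""
    for instance in group.get("Instances", []):
        if instance["InstanceId"] != instance_id:
            continue
        if instance.get("ProtectedFromScaleIn"):
            return False
        return instance.get("LifecycleState") != "Standby"
    # The instance is no longer part of the group, so there is nothing to replace.
    return False
```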
