AWS EKS Spot workshop

Introduction

This workshop demonstrates how to adopt Spot Instances on EKS:

  1. Run an application with pods on both On-Demand and Spot nodes
  2. Gracefully handle Spot node interruption
  3. Verify the performance impact during a Spot node interruption

Note: the content is adapted from eksworkshop.com with a focus on Spot.

Pre-requisites

  1. An existing workload on AWS
  2. A tech team with foundational knowledge of EKS

Audience persona

  1. CTO / Decision maker
  2. Tech lead
  3. DevOps lead

Table of contents

  1. Why Spot on EKS?
  2. Set up AWS environment
  3. Create a template EKS cluster
  4. Create a Spot nodegroup
  5. Set up monitoring tools: CloudWatch Events, CloudWatch Container Insights
  6. Simulate Spot interruption event
  7. Tracing with X-Ray
  8. Partner's solutions

Why Spot on EKS

Spot strategy

  • Containers are often stateless, fault-tolerant, and a great fit for Spot Instances
  • Deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances
  • Given the statelessness of services and elastic scaling, running 100% Spot on EKS is achievable

Quick tips

  1. Use a Spot Fleet with multiple instance types rather than a single instance type to increase capacity availability
  2. Use the 2-minute Spot interruption notice to handle instance termination gracefully
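
To illustrate the second tip, here is a minimal shell sketch of how a process on a Spot node could watch for the interruption notice. It assumes IMDSv2 is enabled and polls the instance metadata path spot/instance-action. The function names are hypothetical helpers, not part of any tool. In this workshop the AWS Node termination handler does this work for you, so treat this only as an illustration of the mechanism.

```shell
#!/bin/bash
# Sketch only: poll the EC2 instance metadata service (IMDSv2) for a Spot
# interruption notice. All function names are hypothetical helpers.
IMDS="http://169.254.169.254/latest"

# Request a short-lived IMDSv2 session token.
imds_token() {
  curl -s -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 120"
}

# Print the instance-action document. When no interruption is scheduled the
# endpoint answers 404, so curl -f fails and nothing is printed.
spot_instance_action() {
  curl -sf -H "X-aws-ec2-metadata-token: $(imds_token)" \
    "$IMDS/meta-data/spot/instance-action"
}

# Extract the "action" field from a document read on stdin, for example
# {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
parse_action() {
  sed -n 's/.*"action"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example polling loop (commented out so that sourcing this file is harmless):
#   while true; do
#     if doc="$(spot_instance_action)"; then
#       echo "Interruption notice: $(printf '%s' "$doc" | parse_action)"
#       break
#     fi
#     sleep 5
#   done
```

The loop only makes sense when run on a Spot instance; anywhere else the metadata endpoint is not reachable.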

Set up AWS environment

Complete the "Start the workshop" section

Create a template EKS cluster

Complete the "Launch using eksctl" section

Create a Spot nodegroup

  1. Complete the "Add EC2 Workers - Spot" page

Additional notes

  1. We set the lifecycle label for the nodes to Ec2Spot
  2. We also taint the nodes with PreferNoSchedule, so the scheduler prefers not to place pods on Spot Instances
  3. Nodes are created with a varied set of instance types to increase availability
  • A taint is a property of a node that repels a set of pods: the node should not accept any pods that do not tolerate the taint. Possible taint effects are NoSchedule, PreferNoSchedule, and NoExecute
  • A toleration is a property of a pod that allows (but does not require) the pod to be scheduled onto nodes with matching taints
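
As a rough sketch of how a taint and a toleration pair up, the snippet below writes a patch file that adds a toleration to a deployment's pod template. The taint key and value (spotInstance=true) are an assumption for illustration; check the taint your Spot nodegroup was actually registered with. The file name is also hypothetical.

```shell
#!/bin/bash
# Sketch only: the taint key/value "spotInstance=true" is an assumption;
# use whatever key your Spot nodegroup was really registered with.

# A node could be tainted by hand like this (not run here):
#   kubectl taint nodes <node-name> spotInstance=true:PreferNoSchedule

# Strategic-merge patch that adds a matching toleration to a pod template.
cat > spot-toleration-patch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
      - key: "spotInstance"
        operator: "Equal"
        value: "true"
        effect: "PreferNoSchedule"
EOF

# Apply it to the frontend deployment (not run here):
#   kubectl patch deployment ecsdemo-frontend \
#     --patch "$(cat spot-toleration-patch.yaml)"
```

The toleration must match the taint's key, value, and effect; a pod without it is still schedulable on a PreferNoSchedule node, but only as a last resort.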

Deploy the AWS Node termination handler

Spot stats

  1. Complete the "Helm CLI installation" page
  2. Complete the "Deploy the AWS Node termination handler" page

Analyse the frontend deployment.yml, paying attention to affinity, matchExpressions, and tolerations.

Deploy an application on Spot

  1. Complete the "Deploy an application on Spot" section
  2. Remember to run cd .. afterwards to get back to the parent directory

Additional notes

  1. Node affinity is a property of pods that attracts them to a set of nodes. It is similar to nodeSelector, but:
    • Offers more flexible matching rules (vs. exact matches)
    • Offers soft preferences (vs. hard requirements), so pods can still be scheduled when no node matches
  2. Inter-pod affinity and anti-affinity add constraints against other pods (vs. just nodes), with rules that encourage or prevent pod co-location.
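
The two kinds of rules can be sketched as a patch for a deployment's pod template. This is a minimal example, not the workshop's actual manifest: the lifecycle=Ec2Spot node label comes from this workshop, while the pod label app: ecsdemo-frontend, the weights, and the file name are assumptions.

```shell
#!/bin/bash
# Sketch only: the pod label "app: ecsdemo-frontend", the weights, and the
# file name are assumptions made for illustration.
cat > spot-affinity-patch.yaml <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Soft rule: prefer nodes labelled lifecycle=Ec2Spot, but still
          # schedule elsewhere if no such node has room.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: lifecycle
                operator: In
                values:
                - Ec2Spot
        podAntiAffinity:
          # Soft rule: try not to co-locate two frontend pods on one node.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: ecsdemo-frontend
              topologyKey: kubernetes.io/hostname
EOF

# Apply it to the frontend deployment (not run here):
#   kubectl patch deployment ecsdemo-frontend \
#     --patch "$(cat spot-affinity-patch.yaml)"
```

Both rules use the preferred (soft) form, so losing every Spot node does not leave pods unschedulable; swapping in requiredDuringSchedulingIgnoredDuringExecution would make them hard requirements.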

Please do not perform the Cleanup step; we will need these resources for subsequent parts of this workshop.

Set up monitoring tools

CloudWatch Events

Configure a CloudWatch Events rule that listens to all EC2 events

  • Service: EC2
  • Events: All events
  • Target: CloudWatch log group e.g. /ec2Events
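
For reference, the same rule can be expressed as an event pattern. The sketch below writes two pattern files: one matching all EC2 events, as configured above, and a narrower one matching only the Spot interruption warning. The file names and the rule name in the comment are hypothetical; attaching the log group target is left to the console steps above.

```shell
#!/bin/bash
# Sketch only: file names and the rule name below are hypothetical.

# Pattern equivalent to "Service: EC2, Events: All events".
cat > ec2-all-events-pattern.json <<'EOF'
{
  "source": ["aws.ec2"]
}
EOF

# Narrower alternative: only the two-minute Spot interruption warning.
cat > ec2-spot-interruption-pattern.json <<'EOF'
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
EOF

# Creating the rule from the CLI would look like this (not run here):
#   aws events put-rule --name ec2-all-events \
#     --event-pattern file://ec2-all-events-pattern.json
```

Listening to all EC2 events is convenient for the workshop because instance state changes show up next to the interruption warning.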

CloudWatch Container Insights

  1. Since we are using different roles for the 2 nodegroups, we add the additional IAM policy to those roles from the console. Refer to the "Preparing to Install CloudWatch Container Insights" page for more details on these steps.
  2. Complete the "Installing CloudWatch Container Insights" page

Simulate Spot interruption event

Refer to this page for a description; note that its instructions are outdated.

  1. Log in to the AWS EC2 Console

    • In the left hand menu bar, choose Spot Requests

    • Click on Request Spot Instances button

    • Launch template: empty (so we can change configuration parameters)

    • Search for the AMI under Amazon AMIs, using the same AMI as the existing Spot nodes (e.g. amazon-eks-node-1.15-v20200507)

    • VPC: select eksworkshop

    • Select AZs with valid subnets

    • Key pair name: eksworkshop

    • Additional configurations:

      • Security groups: select both *ng-spot* and *ClusterSharedNodeSecurityGroup*
      • IPv4: enabled
      • IAM instance profile: ng-spot
      • User data:
          #!/bin/bash
          set -o xtrace
          /etc/eks/bootstrap.sh eksworkshop-eksctl --kubelet-extra-args --node-labels=lifecycle=Ec2Spot
      • Tags:
          Key                                        Value
          Name                                       EKSSpot-SpotFleet-Node
          kubernetes.io/cluster/eksworkshop-eksctl   owned
          k8s.io/cluster-autoscaler/enabled          true
          Spot                                       true
    • Ensure “Maintain target capacity” is checked, then select Create

    • Wait a few minutes (about 8-10)

  2. Verify that the new nodes have been added to the cluster with kubectl get nodes. Record the new node ID.

  3. Scale up the frontend. Verify that frontend pods are running on the new Spot node.

    kubectl scale deployment ecsdemo-frontend --replicas 10
    kubectl get pods -o wide
  4. Follow the “interruption handler” pod logs

    kubectl get pods -A -o wide

    Record the termination-handler pod that runs on the new Spot node. Replace <aws-node-termination-handler-id-*****> in the command below with that pod's ID.

    kubectl --namespace kube-system logs -f <aws-node-termination-handler-id-*****>
  5. Verify CloudWatch Container Insights is working

  6. Prepare your load test

  7. Get the ALB of the frontend deployment with kubectl get svc -o wide, then run your load test for 5 minutes

    siege -q -t 300S -c 50 -i http://${FE_ALB}
  8. Reduce the target capacity of the previous Spot Request to 0

  9. Verify metrics on CloudWatch dashboards

  10. After 5 minutes, stop the load test with Ctrl+C and verify that the siege stats show availability: 100.00

  11. Verify impact to the applications

    • Termination handler pod log
    • Spot Requests’ history tab: “termination_notified” event and its timestamp
    • EC2 instance is shutting down
    • Verify CloudWatch insights for terminated spot instance
    • Verify that the pod was evicted and redeployed to other nodes with kubectl get pods -o wide (notice the age)
  12. Verify EC2 Spot events through CloudWatch Events. Run this CloudWatch Logs Insights query (replace <i-09e6a1b2cff2*****> with the Spot instance ID from Spot Requests):

    fields @timestamp, @message, `detail-type`, detail.state
    | sort @timestamp desc
    | limit 20
    | filter `detail.instance-id` = "<i-09e6a1b2cff2*****>"
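
Steps 2 to 4 above involve copying the handler pod ID by hand. A small helper can look it up from the node name instead. This is a sketch under assumptions: the function names are hypothetical, and it relies only on the aws-node-termination-handler pod name prefix and the kube-system namespace used in step 4.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical. The pod name prefix and the
# kube-system namespace are taken from the commands in step 4.

# Print the termination handler pod (as pod/<name>) running on a given node.
handler_pod_on_node() {
  node="$1"
  kubectl --namespace kube-system get pods \
    --field-selector "spec.nodeName=${node}" -o name |
    grep 'aws-node-termination-handler'
}

# Follow the logs of the handler pod on a given node.
follow_handler_logs() {
  pod="$(handler_pod_on_node "$1")" || return 1
  kubectl --namespace kube-system logs -f "$pod"
}

# Example (not run here), using a node name from kubectl get nodes:
#   follow_handler_logs <new-spot-node-name>
```

Leave the log stream open in a separate terminal before reducing the Spot capacity, so the interruption handling is visible as it happens.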

Tracing with X-Ray

  1. Remove previous applications

    cd ~/environment/ecsdemo-frontend
    kubectl delete -f kubernetes/service.yaml
    kubectl delete -f kubernetes/deployment.yaml
    
    cd ~/environment/ecsdemo-crystal
    kubectl delete -f kubernetes/service.yaml
    kubectl delete -f kubernetes/deployment.yaml
    
    cd ~/environment/ecsdemo-nodejs
    kubectl delete -f kubernetes/service.yaml
    kubectl delete -f kubernetes/deployment.yaml
  2. Go to 100% Spot

    • Set the ASG capacity for On-Demand instances to 0
    • Increase the Spot Requests capacity to 2
  3. Increase the capacity of the previous Spot Request to 1

  4. Read this X-Ray overview page

  5. Manually add AWSXRayDaemonWriteAccess to the Spot nodegroup's IAM role (*ng-spot*)

  6. Deploy the X-Ray DaemonSet

  7. Deploy example microservices

    wget https://eksworkshop.com/intermediate/245_x-ray/sample-front.files/x-ray-sample-front-k8s.yml
    wget https://eksworkshop.com/intermediate/245_x-ray/sample-back.files/x-ray-sample-back-k8s.yml
    • Configure front and back deployment to have the same affinity and tolerations (as the previous frontend deployment)
    • Deploy the example X-ray applications
    kubectl apply -f x-ray-sample-front-k8s.yml
    kubectl apply -f x-ray-sample-back-k8s.yml
    • Scale both deployments to 10
    kubectl scale deployment x-ray-sample-front-k8s --replicas 10
    kubectl scale deployment x-ray-sample-back-k8s --replicas 10
    • Verify pods are on all 4 spot nodes
    kubectl get nodes; kubectl get pods -o wide
  8. Run load test again on new X-ray applications

    • Get service ALB
    kubectl get svc -o wide
    • Run load test. Replace ALB_URL below
    siege -q -t 300S -c 50 -i <ALB_URL>/api 
  9. Leave load test running for 5 minutes

  10. Interrupt the Spot instances by reducing the Spot Requests capacity from 2 to 0

  11. Verify the X-Ray panels (service map and stats)

  12. Stop the siege load test after 5 minutes. Verify the siege stats
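
Checking the siege stats can also be scripted. The helper below is a sketch that pulls the availability figure out of a siege summary; it assumes the summary contains either an "Availability: 100.00 %" line or a JSON-style "availability": 100.00 line, which depends on the siege version. Function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, and the two summary layouts
# handled here are assumptions about siege's output across versions.

# Read a siege summary on stdin and print the availability percentage.
siege_availability() {
  tr -d '",' | awk 'tolower($1) == "availability:" { print $2 }'
}

# Succeed only if the summary on stdin reports 100.00 availability.
full_availability() {
  [ "$(siege_availability)" = "100.00" ]
}

# Example (not run here). Capture the siege output, including stderr in case
# the summary is written there, then check it:
#   siege -t 300S -c 50 -i <ALB_URL>/api > siege.log 2>&1
#   full_availability < siege.log && echo "no failed requests"
```

A value below 100.00 means some requests failed while the Spot nodes were being reclaimed, which is worth comparing with the timestamps in the termination handler log.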

Clean up

kubectl delete -f x-ray-sample-front-k8s.yml
kubectl delete -f x-ray-sample-back-k8s.yml
kubectl delete -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml

Remove the X-Ray and CloudWatch policies (if added manually) from the nodegroups' IAM roles

eksctl delete cluster --name=eksworkshop-eksctl

Go to your Cloud9 environment, select the environment named eksworkshop, and choose Delete

Partner's solutions

Spot.io

  • Charges based on cost savings
  • Uses an algorithm that predicts which instances (based on type/region) are likely to be reclaimed and moves pods in advance
  • Offloads configuration operations and monitoring effort
  • Provides right-sizing recommendations
  • Cloud Analyzer recommends which components/pods can run on RI/OD/Spot
