AWS EKS Spot workshop

Introduction

This workshop demonstrates how to adopt Spot Instances on EKS:

Run an application with pods on both On-Demand and Spot nodes
Gracefully handle Spot node interruption
Verify the performance impact during Spot node interruption

Note: the content is adapted from eksworkshop.com with a focus on Spot.

Pre-requisites

An existing workload on AWS
A tech team with foundational knowledge of EKS

Audience persona

CTO / Decision maker
Tech lead
DevOps lead

Table of contents

Why Spot on EKS?
Set up AWS environment
Create a template EKS cluster
Create a Spot nodegroup
Set up monitoring tools: CloudWatch Events, CloudWatch Container Insights
Simulate Spot interruption event
Tracing with X-Ray
Partner's solutions

Why Spot on EKS

Containers are often stateless and fault-tolerant, which makes them a great fit for Spot Instances.
With Spot Instances you can deploy containerized workloads and manage clusters at any scale at a fraction of the cost.
Given stateless services and elastic scaling, running 100% Spot on EKS is achievable.

Quick tips

Use a Spot Fleet with multiple instance types instead of a single instance type to increase resource availability
Use the 2-minute Spot interruption notice to handle instance termination gracefully
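
To make the second tip more concrete, the sketch below shows one way the interruption notice could be detected from inside an instance, by polling the instance metadata service. It assumes IMDSv2 is available, and the function names (imds_token, spot_action_status, interruption_pending, watch_for_interruption) are hypothetical helpers. In this workshop the AWS Node termination handler does this work for you, so treat it as an illustration of the mechanism rather than something to deploy.

```shell
#!/bin/bash
# Sketch only: poll the EC2 instance metadata service (IMDSv2) for a Spot
# interruption notice. Function names are hypothetical helpers.

IMDS_BASE="${IMDS_BASE:-http://169.254.169.254/latest}"

# Request a short-lived IMDSv2 session token.
imds_token() {
  curl -s -X PUT "${IMDS_BASE}/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

# Print the HTTP status of the spot/instance-action endpoint.
# It answers 404 until an interruption is scheduled, then 200 with a JSON body.
spot_action_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $(imds_token)" \
    "${IMDS_BASE}/meta-data/spot/instance-action"
}

# Succeed (exit 0) only when the given status code means a notice is present.
interruption_pending() {
  [ "$1" = "200" ]
}

# Poll every 5 seconds, then run the given command once a notice appears.
watch_for_interruption() {
  while ! interruption_pending "$(spot_action_status)"; do
    sleep 5
  done
  "$@"
}
```

For example, `watch_for_interruption echo "Spot interruption notice received"` blocks until a notice is published and then prints the message.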

Set up AWS environment

Complete the "Start the workshop" section on eksworkshop.com

Create a template EKS cluster

Complete the "Launch using eksctl" section on eksworkshop.com

Create a Spot nodegroup

Complete the "Add EC2 Workers - Spot" page on eksworkshop.com

Additional notes

We set the lifecycle label of the nodes to Ec2Spot
We also taint the nodes with PreferNoSchedule, so the scheduler prefers not to place pods on Spot Instances unless they tolerate the taint
Nodes are created with a varied set of instance types to increase availability

A taint is a property of a node that repels a set of pods. It marks that the node should not accept any pods that do not tolerate the taint. Possible taint effects are NoSchedule, PreferNoSchedule, and NoExecute
A toleration is a property of a pod that allows (but does not require) the pod to be scheduled onto nodes with matching taints
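
As a small illustration of how a toleration pairs with a taint, the sketch below prints a Pod manifest that tolerates a Spot node taint. The taint key and value (spotInstance=true), the pod name, and the image are assumptions made for this example; the notes above only state the effect (PreferNoSchedule), so check your nodegroup configuration for the real key.

```shell
#!/bin/bash
# Sketch only: print a Pod manifest whose toleration matches a Spot node taint.
# The taint key/value (spotInstance=true) and the pod/image names are
# assumptions; only the PreferNoSchedule effect comes from the workshop notes.

TAINT_KEY="${TAINT_KEY:-spotInstance}"
TAINT_VALUE="${TAINT_VALUE:-true}"

toleration_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: toleration-example
spec:
  tolerations:
    - key: "${TAINT_KEY}"
      operator: "Equal"
      value: "${TAINT_VALUE}"
      effect: "PreferNoSchedule"
  containers:
    - name: app
      image: nginx
EOF
}
```

The manifest could be applied with `toleration_manifest | kubectl apply -f -`. The taints actually set on a node can be inspected with `kubectl describe node <node-name>`.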

Deploy the AWS Node termination handler

Complete the "Helm CLI installation" page on eksworkshop.com
Complete the "Deploy the AWS Node termination handler" page on eksworkshop.com

Analyse the frontend deployment.yml, paying attention to affinity, matchExpressions, and tolerations

Deploy an application on Spot

Complete the "Deploy an application on Spot" section on eksworkshop.com
Remember to run `cd ..` to get back to the parent directory

Additional notes

Node affinity is a property of pods that attracts them to a set of nodes. It is similar to nodeSelector, but:
- Offers more flexible matching rules (vs. exact matches)
- Offers soft preferences (vs. hard requirements), so pods can still be scheduled when no node matches
Inter-pod affinity and anti-affinity offer constraints against other pods (vs. just nodes), allowing rules that encourage or prevent pod co-location.
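
The soft-preference idea can be sketched as a Deployment that prefers nodes labelled lifecycle=Ec2Spot but can still be scheduled elsewhere. This is not the workshop's frontend deployment.yml: the deployment name, image, weight, and taint key/value (spotInstance=true) are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: print a Deployment that prefers (soft rule) nodes labelled
# lifecycle=Ec2Spot and tolerates the Spot taint. The deployment name, image,
# weight, and taint key/value (spotInstance=true) are assumptions.

spot_affinity_manifest() {
  cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-example
spec:
  replicas: 2
  selector:
    matchLabels:
      app: affinity-example
  template:
    metadata:
      labels:
        app: affinity-example
    spec:
      affinity:
        nodeAffinity:
          # Soft preference: pods still schedule if no Spot node is available.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: lifecycle
                    operator: In
                    values:
                      - Ec2Spot
      tolerations:
        - key: "spotInstance"
          operator: "Equal"
          value: "true"
          effect: "PreferNoSchedule"
      containers:
        - name: app
          image: nginx
EOF
}
```

Swapping preferredDuringSchedulingIgnoredDuringExecution for requiredDuringSchedulingIgnoredDuringExecution (which uses nodeSelectorTerms instead of weight/preference) would turn the preference into a hard requirement.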

Please do not perform the Cleanup step; we will need these resources for the subsequent parts of this workshop.

Set up monitoring tools

CloudWatch Events

Configure a CloudWatch Events rule that listens to all EC2 events

Service: EC2
Events: All events
Target: CloudWatch log group e.g. /ec2Events
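
If you prefer the command line to the console, the rule part of this setup might look like the sketch below. The rule name (ec2-all-events) and the function names are hypothetical, and attaching the CloudWatch log group as a target is left out; it is easiest to do from the console as described above.

```shell
#!/bin/bash
# Sketch only: create a CloudWatch Events rule matching all EC2 events.
# RULE_NAME and the function names are hypothetical; the log group target
# still has to be attached separately.

RULE_NAME="${RULE_NAME:-ec2-all-events}"

# Event pattern equivalent to "Service: EC2, Events: All events".
ec2_event_pattern() {
  printf '%s\n' '{"source": ["aws.ec2"]}'
}

# Requires the AWS CLI and valid credentials; call it explicitly when ready.
create_ec2_events_rule() {
  aws events put-rule \
    --name "${RULE_NAME}" \
    --event-pattern "$(ec2_event_pattern)"
}
```

Running `create_ec2_events_rule` creates the rule in the region configured for the AWS CLI.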

CloudWatch Container Insights

Since the 2 nodegroups use different IAM roles, add the additional IAM policy to both roles from the console. See the "Preparing to Install CloudWatch Container Insights" page for a fuller description of these steps.
Complete the Installing CloudWatch Container Insights page

Simulate Spot interruption event

Refer to this page for a description; its instructions are outdated, so follow the steps below instead.

Log in to the AWS EC2 console
- In the left hand menu bar, choose Spot Requests
- Click on Request Spot Instances button
- Launch template: empty (so we can change configuration parameters)
- Search for the AMI under Amazon AMIs; use the same AMI as the existing Spot nodes (e.g. amazon-eks-node-1.15-v20200507)
- VPC: select eksworkshop
- Select AZs with valid subnets
- Key pair name: eksworkshop
- Additional configurations:
  - Security groups: select both *ng-spot* and *ClusterSharedNodeSecurityGroup*
  - IPv4: enabled
  - IAM instance profile: ng-spot
  - User data
```
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh eksworkshop-eksctl --kubelet-extra-args --node-labels=lifecycle=Ec2Spot
```
  - Tags:

    | Key | Value |
    | --- | --- |
    | Name | EKSSpot-SpotFleet-Node |
    | kubernetes.io/cluster/eksworkshop-eksctl | owned |
    | k8s.io/cluster-autoscaler/enabled | true |
    | Spot | true |
- Ensure “Maintain target capacity” is checked. Select Create
- Wait a few minutes (about 8-10)
Verify that the new nodes have joined the cluster with `kubectl get nodes`. Record the new node ID.

Scale up front-end. Verify that front-end pods are running on the new spot node.

kubectl scale deployment ecsdemo-frontend --replicas 10
kubectl get pods -o wide

Follow the “interruption handler” pod logs
```
kubectl get pods -A -o wide
```
Record the termination-handler pod that runs on the new Spot node. Replace <aws-node-termination-handler-id-*****> in the command below with that pod's ID.
```
kubectl --namespace kube-system logs -f <aws-node-termination-handler-id-*****>
```
Verify CloudWatch Container Insights is working
Prepare your load test
Get the ALB hostname of the frontend deployment with `kubectl get svc -o wide` and store it in FE_ALB. Run your load test for 5 minutes
```
siege -q -t 300S -c 50 -i http://${FE_ALB}
```
Reduce the target capacity of the previous Spot request to 0
Verify metrics on CloudWatch dashboards
After 5 minutes, stop the load test with Ctrl+C and verify that the siege stats report availability: 100.00
Verify the impact on the applications
- Termination handler pod log
- Spot Requests’ history tab: “termination_notified” event and its timestamp
- EC2 instance is shutting down
- Verify CloudWatch insights for terminated spot instance
- Verify that the pod was evicted and rescheduled onto other nodes with `kubectl get pods -o wide` (notice the age)
Verify EC2 Spot events through CloudWatch Events. Run this CloudWatch Logs Insights query (replace <i-09e6a1b2cff2*****> with the Spot Instance ID from Spot Requests):
```
fields @timestamp, @message, `detail-type`, detail.state
| filter `detail.instance-id` = "<i-09e6a1b2cff2*****>"
| sort @timestamp desc
| limit 20
```
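
The availability check from the load-test step above can also be scripted. The sketch below assumes the siege summary has been saved to a file and contains a line of the form "Availability: 100.00 %". The file name siege.log and both function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: check the availability figure in a saved siege summary.
# Assumes a line like "Availability:   100.00 %"; siege.log is a
# hypothetical file name.

# Print the availability value from a siege summary read on stdin.
siege_availability() {
  awk 'tolower($1) == "availability:" { print $2; exit }'
}

# Succeed (exit 0) only if the summary on stdin reports 100.00.
availability_is_full() {
  [ "$(siege_availability)" = "100.00" ]
}
```

For example, `availability_is_full < siege.log && echo "no failed requests"` prints the message only when no requests failed during the interruption.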

Tracing with X-Ray

Remove previous applications

cd ~/environment/ecsdemo-frontend
kubectl delete -f kubernetes/service.yaml
kubectl delete -f kubernetes/deployment.yaml

cd ~/environment/ecsdemo-crystal
kubectl delete -f kubernetes/service.yaml
kubectl delete -f kubernetes/deployment.yaml

cd ~/environment/ecsdemo-nodejs
kubectl delete -f kubernetes/service.yaml
kubectl delete -f kubernetes/deployment.yaml

Go to 100% Spot
- Set the capacity of the ASG for On-Demand instances to 0
- Increase the Spot Requests capacity to 2
Increase the capacity of the previous Spot request to 1
Read the X-Ray overview page
Manually add AWSXRayDaemonWriteAccess to the Spot nodegroup's IAM role (*ng-spot*)
Deploy the X-Ray daemonset

Deploy example microservices

wget https://eksworkshop.com/intermediate/245_x-ray/sample-front.files/x-ray-sample-front-k8s.yml
wget https://eksworkshop.com/intermediate/245_x-ray/sample-back.files/x-ray-sample-back-k8s.yml

Configure front and back deployment to have the same affinity and tolerations (as the previous frontend deployment)
Deploy the example X-ray applications

kubectl apply -f x-ray-sample-front-k8s.yml
kubectl apply -f x-ray-sample-back-k8s.yml

Scale both deployments to 10

kubectl scale deployment x-ray-sample-front-k8s --replicas 10
kubectl scale deployment x-ray-sample-back-k8s --replicas 10

Verify pods are on all 4 spot nodes

kubectl get nodes; kubectl get pods -o wide
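
Reading the pod spread off the wide output by eye gets tedious with 20 pods. As a convenience, the sketch below counts pods per node. The helper names pods_per_node and show_pod_spread are hypothetical, and the second function assumes kubectl is pointed at the workshop cluster.

```shell
#!/bin/bash
# Sketch only: count pods per node. pods_per_node is a hypothetical helper
# that reads one node name per line on stdin and prints "<node> <count>".

pods_per_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Requires kubectl access to the cluster; call it explicitly when ready.
show_pod_spread() {
  kubectl get pods --no-headers \
    -o custom-columns=NODE:.spec.nodeName | pods_per_node
}
```

Running `show_pod_spread` prints one line per node that hosts at least one pod, so four lines would indicate that all 4 Spot nodes are in use.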

Run the load test again on the new X-Ray applications
- Get the service ALB
```
kubectl get svc -o wide
```
- Run load test. Replace ALB_URL below
```
siege -q -t 300S -c 50 -i <ALB_URL>/api 
```
Leave the load test running for 5 minutes
Interrupt the Spot Instances by reducing the Spot Requests capacity from 2 to 0
Verify the X-Ray panels
Stop the siege load test after 5 minutes. Verify the siege stats

Clean up

kubectl delete -f x-ray-sample-front-k8s.yml
kubectl delete -f x-ray-sample-back-k8s.yml
kubectl delete -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml

Remove the X-Ray and CloudWatch policies (if added manually) from the nodegroups' IAM roles

eksctl delete cluster --name=eksworkshop-eksctl

Go to your Cloud9 Environment, select the environment named eksworkshop and pick delete

Partner's solutions

Spot.io

Charges based on cost savings
Has an algorithm that predicts which instances (based on type/region) will be reclaimed and moves pods in advance
Offloads the configuration, operations, and monitoring effort
Right-sizing recommendations
Cloud Analyzer provides recommendations on which components/pods can run on RI/OD/Spot

ejlp12 / aws-eks-spot-workshop (GitHub)