This workshop includes a demonstration of how Spot can be adopted on EKS.
- Application with pods running on both on-demand and spot nodes
- Gracefully handle spot node interruption
- Verify performance impact during spot node interruption
Note: the content is customized from eksworkshop.com with the focus on Spot in mind.
- An existing workload on AWS
- Tech team with foundational knowledge of EKS
- CTO / Decision maker
- Tech lead
- DevOps lead
- Why Spot on EKS?
- Set up AWS environment
- Create a template EKS cluster
- Create a Spot nodegroup
- Set up monitoring tools: CloudWatch Events, CloudWatch Container Insights
- Simulate Spot interruption event
- Tracing with X-ray
- Partner's solutions
- Containers are often stateless, fault-tolerant, and a great fit for Spot Instances
- Deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances. Given the statelessness of the services and their elastic scaling, running 100% Spot on EKS is entirely possible
Quick tips
- Use a Spot Fleet with multiple instance types instead of a single instance type to increase resource availability
- Utilize the 2-minute Spot interruption window to gracefully handle instance termination
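The 2-minute notice can be observed from inside the instance via the metadata service. A minimal sketch, assuming IMDSv2 is enabled; the helper names are ours, and the sample JSON shape follows AWS's documented instance-action format:

```shell
# Sketch only: a 200 response on the spot/instance-action path means
# this instance has received an interruption notice.
check_spot_interruption() {
  # IMDSv2: fetch a short-lived session token first
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -f -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
}

# Pull the scheduled termination time out of the instance-action JSON,
# e.g. {"action": "terminate", "time": "2024-05-01T12:00:00Z"}
extract_action_time() {
  printf '%s' "$1" | sed -n 's/.*"time": *"\([^"]*\)".*/\1/p'
}
```

In practice this polling loop is exactly what the AWS Node Termination Handler (deployed later in this workshop) runs for you.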
Complete this Start the workshop section
Complete this Launch using eksctl section
- Complete this Add EC2 Workers - Spot page
Additional notes
- We set the lifecycle label for the nodes to Ec2Spot
- We are also tainting with PreferNoSchedule to prefer that pods not be scheduled on Spot Instances
- Nodes are created with a varied set of instance types to increase availability
- A taint is a property of a node that repels a set of pods. It marks the node so that it will not accept any pods that do not tolerate the taint. Possible taint effects are NoSchedule, PreferNoSchedule, and NoExecute
- A toleration is a property of a pod that allows (but does not require) the pod to be scheduled onto nodes with matching taints
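For illustration, a pod spec fragment that tolerates a Spot taint could look like the sketch below. The key/value pair spotInstance=true is an assumption here; match whatever taint your nodegroup actually applies:

```yaml
# Sketch: tolerate a hypothetical spotInstance=true:PreferNoSchedule taint.
# With PreferNoSchedule the taint only discourages scheduling, so this
# toleration simply removes that penalty for the pod.
tolerations:
  - key: spotInstance
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
```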
- Complete this Helm CLI installation page
- Complete this Deploy the AWS Node termination handler page
Analyse the frontend deployment.yml, paying attention to affinity, matchExpressions, and tolerations
- Complete this Deploy an application on Spot section
- Remember to run cd .. to get back to the parent directory
Additional notes
- Affinity and anti-affinity are properties of pods that attract them to a set of nodes. This is similar to nodeSelector, but:
  - Offers more flexible matching rules (vs. exact matches)
  - Offers soft preferences (vs. hard requirements) so pods will still be scheduled even when no node matches
  - Anti-affinity (inter-pod affinity/anti-affinity) offers constraints against other pods (vs. just nodes), allowing rules that encourage or prevent pod co-location
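As a sketch, the soft node preference described above could look like the following deployment pod-template fragment, using the lifecycle=Ec2Spot label set on the Spot nodes earlier (the weight value is an arbitrary example):

```yaml
# Sketch: prefer Spot nodes via the lifecycle=Ec2Spot label, but still
# allow scheduling on on-demand nodes when no Spot capacity is free.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: lifecycle
              operator: In
              values:
                - Ec2Spot
```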
Please do not perform Cleanup; we will need these resources for subsequent parts of this workshop.
Configure a CloudWatch Events rule that listens to all EC2 events
- Service: EC2
- Events: All events
- Target: CloudWatch log group e.g. /ec2Events
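For reference, the same rule can be expressed as an event pattern. This minimal sketch matches every EC2 event, mirroring the Service/Events selection above:

```json
{
  "source": ["aws.ec2"]
}
```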
- Since we are using different roles for the 2 nodegroups, we will add the additional IAM policy to those roles from the console. You can refer to the Preparing to Install CloudWatch Container Insights page for a fuller description of these steps.
- Complete the Installing CloudWatch Container Insights page
Refer to this page for a description, but note that its instructions are outdated
- Login to the AWS EC2 Console
- In the left-hand menu bar, choose Spot Requests
- Click on the Request Spot Instances button
- Launch template: empty (so we can change configuration parameters)
- Search for the AMI under Amazon AMIs; refer to the Spot nodegroup's AMI (e.g. amazon-eks-node-1.15-v20200507)
- VPC: select eksworkshop
- Select AZs with valid subnets
- Key pair name: eksworkshop
- Additional configurations:
- Security groups: select both *ng-spot* and *ClusterSharedNodeSecurityGroup*
- IPv4: enabled
- IAM instance profile: ng-spot
- User data:
  #!/bin/bash
  set -o xtrace
  /etc/eks/bootstrap.sh eksworkshop-eksctl --kubelet-extra-args --node-labels=lifecycle=Ec2Spot
- Tags:
  - Name: EKSSpot-SpotFleet-Node
  - kubernetes.io/cluster/eksworkshop-eksctl: owned
  - k8s.io/cluster-autoscaler/enabled: true
  - Spot: true
- Ensure "Maintain target capacity" is checked. Select Create
- Wait for a few minutes (about 8-10)
- Verify that new nodes are added to the cluster:
  kubectl get nodes
  Record the new node ID.
- Scale up the front-end. Verify that front-end pods are running on the new Spot node:
  kubectl scale deployment ecsdemo-frontend --replicas 10
  kubectl get pods -o wide
- Follow the "interruption handler" pod logs:
  kubectl get pods -A -o wide
  Record the termination-handler pod that runs on the new Spot node. Replace the pod ID for <aws-node-termination-handler-id-*****> in the command below:
  kubectl --namespace kube-system logs -f <aws-node-termination-handler-id-*****>
- Get the ALB of the frontend deployment:
  kubectl get svc -o wide
  Run your load test for 5 minutes:
  siege -q -t 300S -c 50 -i http://${FE_ALB}
- Reduce the previous Spot Request down to 0
- After 5 minutes, stop the load test (Ctrl+C) and verify that the siege stats show:
  Availability: 100.00
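The availability figure can also be checked programmatically. A minimal sketch, assuming siege's standard summary format ("Availability: ... %") has been saved to a file; the helper name is ours:

```shell
# Sketch: pull the availability percentage out of a saved siege summary
# (siege writes its stats to stderr, so capture with `2> siege.log`).
siege_availability() {
  sed -n 's/.*Availability:[[:space:]]*\([0-9.]*\) %.*/\1/p' "$1"
}
```

Usage: `siege -q -t 300S -c 50 -i http://${FE_ALB} 2> siege.log` and then `siege_availability siege.log`.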
- Verify the impact on the applications:
  - Termination handler pod log
  - Spot Requests' history tab: "termination_notified" event and its timestamp
  - EC2 instance is shutting down
  - Verify CloudWatch Container Insights for the terminated Spot instance
  - Verify that pods were evicted and redeployed to other nodes:
    kubectl get pods -o wide
    (notice the age)
- Verify EC2 Spot events through CloudWatch Events. Run this CloudWatch Logs Insights query (replace <i-09e6a1b2cff2*****> with the Spot instance ID from Spot Requests):
  fields @timestamp, @message, `detail-type`, detail.state | sort @timestamp desc | limit 20 | filter `detail.instance-id` = "<i-09e6a1b2cff2*****>"
- Remove the previous applications:
  cd ~/environment/ecsdemo-frontend
  kubectl delete -f kubernetes/service.yaml
  kubectl delete -f kubernetes/deployment.yaml
  cd ~/environment/ecsdemo-crystal
  kubectl delete -f kubernetes/service.yaml
  kubectl delete -f kubernetes/deployment.yaml
  cd ~/environment/ecsdemo-nodejs
  kubectl delete -f kubernetes/service.yaml
  kubectl delete -f kubernetes/deployment.yaml
- Going to 100% Spot:
  - Change the ASG for on-demand instances to 0
  - Increase the Spot Request capacity to 2
- Increase the previous Spot Request up to 1
- Read this X-Ray overview page
- Manually add AWSXRayDaemonWriteAccess to the Spot nodegroup's IAM role (*ng-spot*)
- Deploy the example microservices:
  wget https://eksworkshop.com/intermediate/245_x-ray/sample-front.files/x-ray-sample-front-k8s.yml
  wget https://eksworkshop.com/intermediate/245_x-ray/sample-back.files/x-ray-sample-back-k8s.yml
- Configure front and back deployment to have the same affinity and tolerations (as the previous frontend deployment)
- Deploy the example X-Ray applications:
  kubectl apply -f x-ray-sample-front-k8s.yml
  kubectl apply -f x-ray-sample-back-k8s.yml
- Scale both deployments to 10:
  kubectl scale deployment x-ray-sample-front-k8s --replicas 10
  kubectl scale deployment x-ray-sample-back-k8s --replicas 10
- Verify pods are on all 4 Spot nodes:
  kubectl get nodes
  kubectl get pods -o wide
- Run the load test again on the new X-Ray applications:
  - Get the service ALB:
    kubectl get svc -o wide
  - Run the load test. Replace ALB_URL below:
    siege -q -t 300S -c 50 -i <ALB_URL>/api
- Leave the load test running for 5 minutes
- Interrupt the Spot instances by reducing the Spot Request capacity from 2 to 0
kubectl delete -f x-ray-sample-front-k8s.yml
kubectl delete -f x-ray-sample-back-k8s.yml
kubectl delete -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml
Remove X-ray and CloudWatch policies (if added manually) from NodeGroups
eksctl delete cluster --name=eksworkshop-eksctl
Go to your Cloud9 environments, select the environment named eksworkshop, and choose Delete
- Charges based on cost savings
- Has an algorithm that predicts which instances (based on types/regions) will be reclaimed and moves pods in advance
- Offloads the configuration and monitoring effort
- Right-sizing recommendations
- Cloud Analyzer provides recommendations on which components/pods can run on RI/on-demand/Spot