dmathieu / dice

Roll all instances within a kubernetes cluster, using a zero-downtime strategy.

Home Page: https://godoc.org/github.com/dmathieu/dice

License: MIT License

Languages: Go 99.09%, Makefile 0.78%, Dockerfile 0.13%

Topics: aws, kubernetes

dice's Introduction

Well, Hello There!

I am a software engineer with a focus on backend, resilience and observability, currently working at @elastic.

Some of the technologies I work with are Go, Ruby, Kubernetes, OpenTelemetry. I am also a contributor to Open-Source.

I am writing about software engineering, and food fermentation (fr).

dice's Issues

AWS EKS 1.11 not working - x509: failed to load system roots and no roots

I am getting the following error:
ERROR: logging before flag.Parse: E1214 13:44:55.925002 1 delete_node.go:106] RequestError: send request failed caused by: Post https://ec2.us-east-1.amazonaws.com/: x509: failed to load system roots and no roots provided

Dice succeeds in draining the node and bringing up a new one, but fails to delete the unschedulable nodes.

Dice evicting itself

Hi, when running dice as a job inside the cluster, how do you get around the problems that come from dice draining the node it's currently running on?

When testing in a cluster with only 2 nodes (which I'm sure isn't really the intended use case), this caused a complete outage. Dice was running on Node A. Dice drained and deleted Node B, then when Node C came up, it cordoned/drained Node A. Dice started up again on Node C, flagging all the nodes with "dice=roll" again. It cordoned/drained Node C, and at that point there were no usable nodes left, and it stayed that way until our autoscaler was able to get new nodes up and joined.

But even in typical larger clusters, it seems that dice restarting whenever it drains its own node would make for an endless loop. Have you seen this behavior? It's very possible I am missing something with how it's supposed to be run.
Thanks!!

One-off dice run interacts badly with existing dice loop deployment

On our cluster we have dice in loop mode configured as a deployment, pretty much the same as that in the example in this repository, except it's configured for a max uptime of 24 hours.

After updating the AMI for our ASG, I ran a one-off dice roll to cycle the nodes onto the new image, using:
kubectl apply -f https://raw.githubusercontent.com/dmathieu/dice/master/examples/dice-aws.yml

However the logs showed:

$ kubectl logs dice-5kjtv -n kube-system --follow
ERROR: logging before flag.Parse: I0621 10:31:21.622566       1 run.go:22] Starting controllers
ERROR: logging before flag.Parse: I0621 10:31:21.649642       1 all_nodes_flagger.go:28] Found flagged nodes. Continuing with them.
ERROR: logging before flag.Parse: I0621 10:31:21.649878       1 run.go:36] Started all controllers
...

That is, it found already-flagged nodes and so did not flag all nodes as needing to be rolled.

This is presumably because the loop-mode dice deployment had flagged nodes for uptime at the same time.

Presumably the solution would be to use two different flag values (e.g. dice=roll-loop and dice=roll-run), so that the loop's flags do not stop the one-off run from adding its own:

dice/kubernetes/nodes.go

Lines 11 to 24 in 9e93d86

const (
	flagName  = "dice"
	flagValue = "roll"
)

// Node represents a kubernetes node
type Node struct {
	*corev1.Node
}

// IsFlagged checks whether the node has the dice label
func (n *Node) IsFlagged() bool {
	return n.Labels[flagName] == flagValue
}

for _, n := range nodes {
	if n.IsFlagged() {
		glog.Infof("Found flagged nodes. Continuing with them.")
		return nil
	}
}
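
The proposal of separate flag values per mode could be sketched roughly as follows. This is not dice's actual code: `Mode`, `ModeLoop`, `ModeRun`, `IsFlaggedFor` and `anyFlaggedFor` are hypothetical names, and `Node` is a simplified stand-in for the wrapper around `corev1.Node`, so the example stays self-contained.

```go
package main

import "fmt"

const flagName = "dice"

// Mode identifies which dice invocation flagged a node (hypothetical type).
type Mode string

const (
	ModeLoop Mode = "roll-loop" // flag value proposed for the loop deployment
	ModeRun  Mode = "roll-run"  // flag value proposed for a one-off run
)

// Node is a simplified stand-in for dice's Node wrapper around corev1.Node.
type Node struct {
	Name   string
	Labels map[string]string
}

// IsFlaggedFor reports whether the node carries the dice label for the given mode.
func (n Node) IsFlaggedFor(m Mode) bool {
	return n.Labels[flagName] == string(m)
}

// anyFlaggedFor reports whether at least one node is already flagged for the mode.
func anyFlaggedFor(nodes []Node, m Mode) bool {
	for _, n := range nodes {
		if n.IsFlaggedFor(m) {
			return true
		}
	}
	return false
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{flagName: string(ModeLoop)}},
		{Name: "node-b", Labels: map[string]string{}},
	}
	fmt.Println(anyFlaggedFor(nodes, ModeLoop)) // true
	fmt.Println(anyFlaggedFor(nodes, ModeRun))  // false
}
```

With this check, nodes flagged by the loop would no longer make a one-off run skip its own flagging. One caveat: a label key holds a single value, so flagging a node for the one-off run would overwrite the loop's flag on that node. Using two distinct label keys instead of two values would avoid that.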
