nais / babylon

Summer student project 2021 - cleanup of Kubernetes resources

License: MIT License

Dockerfile 1.80% Go 95.86% Makefile 2.34%
golang kubernetes

babylon's People

Contributors

aasehaa, chinatsu, dependabot[bot], erlendmariusommundsen, henrikhorluck, jhrv, muni10, sechmann, sondr3, sonhal


babylon's Issues

Grace period

  • Add a label to the deployment/pod/other resource showing that it is marked for rollback or scale-down

Notify teams with pods that struggle

A very useful feature would be to simply notify teams that have pods whose status indicates that something is wrong, in cases where we can't assume a rollback is appropriate.

For a feature like this, I think we can look at the problem the other way around. Instead of defining which states are bad, we could say that any Pod with a status other than Running or Completed triggers a notification to the team.
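The inverted check described above can be sketched in a few lines of Go. This is only an illustration, not Babylon's actual code: the names `healthyStatuses` and `shouldNotify` are hypothetical, and the status strings are plain text as they appear in the issue rather than values read from the Kubernetes API.

```go
package main

import "fmt"

// healthyStatuses lists the statuses that do NOT trigger a notification.
// "Running" and "Completed" come from the issue text; the name is hypothetical.
var healthyStatuses = map[string]bool{
	"Running":   true,
	"Completed": true,
}

// shouldNotify reports whether a team should be notified about a pod
// with the given status. Any status not listed as healthy notifies.
func shouldNotify(status string) bool {
	return !healthyStatuses[status]
}

func main() {
	for _, s := range []string{"Running", "Completed", "CrashLoopBackOff", "ImagePullBackOff"} {
		fmt.Printf("%-18s notify=%v\n", s, shouldNotify(s))
	}
}
```

The benefit of this shape is that a new, unforeseen failure status notifies by default instead of being silently ignored.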

Unleash

  • Add and start using feature toggles

  • Toggle sending of messages to Slack on/off

  • ??

Don't roll back/disable during a weekend

  • Give babylon working hours
  • Decide on the whens
  • Use crontab format for timing, maybe upstream from alertmanager
    • Can we have overlapping time periods?

We should create metrics regardless of namespace allowlist

Currently the application doesn't create metrics for failing deploys that are skipped because they are not on our allowlist, which is not what we want. Configure it so that metrics are created regardless of whether the resource is ignored.
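One way to get this behaviour is to record the metric before consulting the allowlist. The sketch below assumes a plain map as a stand-in for a real metrics counter; `handleFailing` and its parameters are hypothetical names and do not reflect Babylon's actual code.

```go
package main

import "fmt"

// handleFailing records a failing deploy for the namespace and then
// checks the allowlist. The counter map stands in for a real metrics
// counter. It returns true only if Babylon should act on the resource.
func handleFailing(counter map[string]int, allowlist map[string]bool, namespace string) bool {
	// Record first, so skipped namespaces still show up in metrics.
	counter[namespace]++
	return allowlist[namespace]
}

func main() {
	counter := map[string]int{}
	allowlist := map[string]bool{"team-a": true}

	for _, ns := range []string{"team-a", "team-b"} {
		act := handleFailing(counter, allowlist, ns)
		fmt.Printf("%s: act=%v failing=%d\n", ns, act, counter[ns])
	}
}
```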

Configuration of how and when to delete

  • Put options into nais.yml
    • Update liberator
    • Figure out what to do :shipit:
  • Document which options exist and what they do
  • Which options are relevant for overriding?
    • disable pruning by babylon
    • disable rollbacks
    • turn off the deploy instead of rolling back

more proposals

  • configure resource age/notification timeout/restart threshold
  • ++

Use a name other than babylon in the spec; suggestion: disableCleanUp

add change-cause annotation to rolled back deployment metadata

Add kubernetes.io/change-cause annotation to Deployment saying that the deployment was rolled back by Babylon because it could not start.
This annotation is copied from the Deployment to the ReplicaSet automatically.

  1. This can be very helpful when debugging.

  2. This annotation should also be used to prevent rolling back an application "twice", as this would cause an endless loop of rollbacks. In other words: Babylon should ignore ReplicaSets that have an annotation saying they were rolled back by Babylon.

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#checking-rollout-history-of-a-deployment
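The two uses of the annotation can be sketched on a plain annotations map, as below. The key `kubernetes.io/change-cause` is the one named in the issue; the message text, the `babylon:` prefix convention, and the function names are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// changeCauseKey is the annotation key named in the issue.
const changeCauseKey = "kubernetes.io/change-cause"

// rollbackCause is a hypothetical message; the "babylon:" prefix is an
// assumed convention used to recognise Babylon's own rollbacks.
const rollbackCause = "babylon: rolled back because the deployment could not start"

// markRolledBack sets the change-cause annotation and returns the map,
// allocating it if the resource had no annotations yet.
func markRolledBack(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[changeCauseKey] = rollbackCause
	return annotations
}

// rolledBackByBabylon reports whether the annotations say the resource
// was already rolled back by Babylon, so it can be skipped.
func rolledBackByBabylon(annotations map[string]string) bool {
	return strings.HasPrefix(annotations[changeCauseKey], "babylon:")
}

func main() {
	fresh := map[string]string{changeCauseKey: "kubectl apply --filename=app.yaml"}
	fmt.Println("skip fresh:", rolledBackByBabylon(fresh))

	rolled := markRolledBack(nil)
	fmt.Println("skip rolled back:", rolledBackByBabylon(rolled))
}
```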

Better logging

  • Reconsider the logging; it's far too noisy now
  • Remove unnecessary logging
  • Add logging for when we scale down deployments, successful rollbacks etc

Speed up integration test CI

  • Look into using ci-gcp for integration tests instead of a local docker instance of minikube/kuttl
  • Cache minikube/kuttl?

Store data long-term in InfluxDB/BigQuery

  • Talk to Terje Sannum

  • Decide on backend

  • Decide on what data is useful to store

  • Where and how do we fetch and write the data (Prometheus etc)

  • Fix access to Influx on OnPrem-Clusters

  • Fix time for logging - decouple from tickrate?

  • Are we logging the right stuff?

Handover of babylon to NAIS

  • Book a meeting with NAIS

What?

  • Slack channels
  • Alerts
  • NAIS.yaml
  • High-level overview of how things work
    • Sequence diagram
  • Use of ingresses; they are probably not necessary

Integration test with k8s in CI

  • Run k8s during tests (E2E)
  • Various failing resources
  • Simple to run (ideally one command, and fast)
  • Document how to run it in the README

Babylon sleeps differently across the clusters

As of 10:44 on Monday, 2 August, dev-gcp is the only cluster where we do not get log messages saying that it is sleeping.

  • Checked that date gives the same time and time zone in dev-fss and dev-gcp
  • Both clusters use the same Docker image
  • But they read different working-hours??? It seems that several deploys have not picked up that the ConfigMap was changed in commit ad4bd77

Ask for confirmation via Slack message

It would be very useful for users whose resources have been downscaled (and who may have been notified about it) to be able to easily delete them as well:

  • Get notification about downscaling of resources
  • Babylon asks for confirmation with a button in the Slack message, asking if the user wants to also delete the resource
  • User clicks button and the resource is deleted (might have to log in for safety)

Grafana integration (useful logging of resource usage/deletion)

Potential metrics

  • No. would-be-pruned resources

  • No. alerts sent with different time intervals

  • No. failing resources sorted by reason for failing

  • No. of teams affected

  • No. deployments currently in Pending or Failing, and their error messages

  • Add rule activations to controller metrics

Get application to deploy to `dev-gcp` again

When we switched from the Kubernetes Go client to controller-runtime's client, our app switched endpoints for health and readiness checks. As it stands, the checks are on a different port than the application itself, and we cannot use the same port to serve both the checks and the metrics. Currently metrics are under :8080/metrics and health checks are under :8081/healthz.

CI fails to deploy because NAIS deploy cannot check the state of our application.
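To make the port split concrete, here is a minimal standard-library sketch of two separate handlers, one for metrics and one for health checks. It is not how controller-runtime wires things internally; the function names are hypothetical, and the handlers are exercised with `httptest` so the example runs without binding ports.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMetricsMux serves only /metrics (bound to :8080 in the issue).
func newMetricsMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# placeholder metrics output")
	})
	return mux
}

// newHealthMux serves only /healthz (bound to :8081 in the issue).
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// statusOf sends a GET request for path to h in memory and returns
// the resulting HTTP status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("metrics mux /metrics:", statusOf(newMetricsMux(), "/metrics"))
	fmt.Println("metrics mux /healthz:", statusOf(newMetricsMux(), "/healthz"))
	fmt.Println("health mux  /healthz:", statusOf(newHealthMux(), "/healthz"))

	// In a real binary each mux would get its own listener, for example:
	//   go http.ListenAndServe(":8080", newMetricsMux())
	//   http.ListenAndServe(":8081", newHealthMux())
}
```

A probe that targets /healthz on the metrics port gets a 404, which matches the symptom of a deploy tool that cannot check the application's state.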

How do we handle state

  • Is a DB even needed?
  • Include NAIS people
  • Should we use postgres, etcd, some other persistent storage?
  • What should be stateful?
  • What goes where? (annotations, storage, etc)

Presentation

  • Record a demo for the presentation
  • Decide on content/order
  • Estimate of cost savings

Use `alerterator` to notify teams of `babylon`

  • Figure out how to send alerts to alerterator
  • DEMO 👻
  • We will likely have to default to the slack-channel of each team
  • Override when any existing alert-channels are available in the namespace
  • Allow manual override in (probably) /teams
  • Find out whether to create lots of child-alert resources or use Slack webhooks
  • Notify teams about which revision/containerImage to roll back to
