nais / babylon


Summer student project 2021 - cleanup of Kubernetes resources

License: MIT License

Dockerfile 1.80% Go 95.86% Makefile 2.34%

golang kubernetes

babylon's Issues

Arm and run in summer projects

Find the namespaces they use
Add to environment variables

Grace period

Add a label to the deployment/pod/whatever showing that it is marked for rollback/scale-down

Notify teams with pods that struggle

A very useful feature would be to simply notify the teams that have pods with a status indicating that something is wrong with the pod, since we can't assume a rollback is appropriate.

For a feature like this, I think we can look at the problem the other way around. Instead of defining which states are bad, we could say that any Pod with a status other than Running or Completed triggers a notification to the team.
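The inverted check can be sketched as follows. This is a minimal illustration, not babylon's implementation: the function name `shouldNotify` is hypothetical, and treating "Succeeded" as equivalent to "Completed" is an assumption (kubectl shows "Completed" for Pods whose phase is Succeeded).

```go
package main

import "fmt"

// healthyStatuses lists the Pod statuses that do NOT trigger a notification.
// "Succeeded" is included as an assumption, because kubectl displays
// "Completed" for Pods whose phase is "Succeeded".
var healthyStatuses = map[string]bool{
	"Running":   true,
	"Completed": true,
	"Succeeded": true,
}

// shouldNotify reports whether the owning team should be notified about a
// Pod with the given status. Anything not explicitly healthy notifies.
func shouldNotify(status string) bool {
	return !healthyStatuses[status]
}

func main() {
	for _, s := range []string{"Running", "Completed", "CrashLoopBackOff", "Pending"} {
		fmt.Printf("%-18s notify=%v\n", s, shouldNotify(s))
	}
}
```

The benefit of this shape is that an unknown or new failure status notifies by default instead of being silently ignored.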

Delete resources failing to start

CrashLoopBackOff
Don't delete resources that are too young

Deploy step should be split in dev/prod

Prod deploys only if dev succeeds
Disable fail-fast on deploys

Instead of deleting deploys, turn the replica-set down to 0 pods

From session

Use rollback to scale down if there are multiple revisions; if the current revision is 1, we can scale down the deployment directly.
https://github.com/kubernetes/dashboard/pull/4535/files#diff-dbe3ac2b295bb579f948f48dd53da069dfed0ad67d8f99d4684ae1847d9d9a8cR27-R52
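The decision described above could look roughly like this. The sketch reads the `deployment.kubernetes.io/revision` annotation that Kubernetes sets on Deployments; the `action` type and `chooseAction` function are hypothetical names. It also assumes that a revision above 1 means an older ReplicaSet is still available to roll back to, which may not hold if the revision history has been trimmed.

```go
package main

import (
	"fmt"
	"strconv"
)

// revisionAnnotation is set by Kubernetes on every Deployment.
const revisionAnnotation = "deployment.kubernetes.io/revision"

// action is a hypothetical type naming what to do with a failing Deployment.
type action string

const (
	actionRollback  action = "rollback"
	actionScaleDown action = "scale-down"
)

// chooseAction picks rollback when there is a previous revision to return
// to, and a direct scale-down to zero replicas when the revision is 1.
func chooseAction(annotations map[string]string) (action, error) {
	rev, err := strconv.Atoi(annotations[revisionAnnotation])
	if err != nil {
		return "", fmt.Errorf("parsing %s: %w", revisionAnnotation, err)
	}
	if rev > 1 {
		return actionRollback, nil
	}
	return actionScaleDown, nil
}

func main() {
	for _, rev := range []string{"1", "4"} {
		a, err := chooseAction(map[string]string{revisionAnnotation: rev})
		fmt.Println(rev, a, err)
	}
}
```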

Unleash

Add and start using feature toggles
Turn sending messages to Slack on/off
??

Do we actually need CLI flags?

Extract common options from dev-.yaml/prod-.yaml in .nais to a common.yaml file

Don't roll back/disable during a weekend

Give babylon working hours
Decide on the whens
Use crontab format for timing, maybe upstream from alertmanager
- Can we have overlapping time periods?

We should create metrics regardless of namespace allowlist

Currently the application doesn't create metrics for failing deploys that are skipped due to not being on our allowlist, which is not what we want. Configure so that metrics are created regardless of whether the resource is actually ignored.
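One way to get this behaviour is to record the metric before the allowlist check rather than after it. The sketch below uses a plain map as a stand-in for a real metrics counter; `failingDeploy` and `process` are hypothetical names.

```go
package main

import "fmt"

// failingDeploy is a hypothetical, minimal view of a failing deployment.
type failingDeploy struct {
	Namespace string
	Name      string
}

// process counts every failing deploy per namespace, then returns only the
// deploys in allowlisted namespaces as candidates for cleanup.
func process(deploys []failingDeploy, allowlist map[string]bool, failures map[string]int) []failingDeploy {
	var actionable []failingDeploy
	for _, d := range deploys {
		// The metric is recorded first, so skipped namespaces still count.
		failures[d.Namespace]++
		if allowlist[d.Namespace] {
			actionable = append(actionable, d)
		}
	}
	return actionable
}

func main() {
	failures := map[string]int{}
	allowlist := map[string]bool{"team-a": true}
	deploys := []failingDeploy{
		{Namespace: "team-a", Name: "api"},
		{Namespace: "team-b", Name: "worker"},
	}
	actionable := process(deploys, allowlist, failures)
	fmt.Println(len(actionable), failures["team-a"], failures["team-b"]) // 1 1 1
}
```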

Configuration of how and when to delete

Put options into nais.yml
- Update liberator
- Figure out what to do
Document which options exist and what they do
Which options are relevant for overriding?
- disable pruning by babylon
- disable rollbacks
- turn off deploy instead of rolling back

more proposals

configure resource age/notification timeout/restart threshold
++

Use a different name than babylon in the spec; suggestion: disableCleanUp

Which logs should be debug/info? ℹ️

add change-cause annotation to rolled back deployment metadata

Add kubernetes.io/change-cause annotation to Deployment saying that the deployment was rolled back by Babylon because it could not start.
This annotation is copied from the Deployment to the ReplicaSet automatically.

This can be very helpful when debugging.
This annotation should also be used to prevent rolling back an application "twice", as this would just cause an endless loop of rollbacks. In other words: Babylon should just ignore ReplicaSets that have an annotation that says it was rolled back by Babylon.

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#checking-rollout-history-of-a-deployment
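A minimal sketch of both halves of this idea: writing the annotation, and using it as a guard against a second rollback. The exact wording of the marker text is not specified in the issue, so `babylonCausePrefix` and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const changeCauseAnnotation = "kubernetes.io/change-cause"

// babylonCausePrefix is a hypothetical marker text; the issue does not
// prescribe the exact wording.
const babylonCausePrefix = "Rolled back by Babylon"

// withRollbackCause returns a copy of the annotations with the change-cause
// set to a Babylon rollback message. The input map is not modified.
func withRollbackCause(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[changeCauseAnnotation] = babylonCausePrefix + ": deployment could not start"
	return out
}

// rolledBackByBabylon reports whether the annotations carry the marker,
// in which case the resource should be ignored to avoid a rollback loop.
func rolledBackByBabylon(annotations map[string]string) bool {
	return strings.HasPrefix(annotations[changeCauseAnnotation], babylonCausePrefix)
}

func main() {
	fresh := map[string]string{changeCauseAnnotation: "kubectl apply --filename=app.yaml"}
	fmt.Println(rolledBackByBabylon(fresh))                    // false
	fmt.Println(rolledBackByBabylon(withRollbackCause(fresh))) // true
}
```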

Better logging

Reconsider the logging; it's far too noisy now
Remove unnecessary logging
Add logging for when we scale down deployments, successful rollbacks etc

Speed up integration test CI

Look into using ci-gcp for integration tests instead of a local docker instance of minikube/kuttl
Cache minikube/kuttl?

Allowlist and denylist by namespace ✅ 🚫

Decrease tickrate

Create metric for when deployments are scaled down

Setup application config

Find initial criteria for resource pruning

Document the criteria we select
Research (look on dev-gcp, previous projects etc)
Get NAIS to look at - and approve - what we found

Other smaller tasks for finishing babylon project

rewrite Architecture.md and include sequence diagram
push armed babylon to dev for all teams
InfluxDB

Send alerts to #babylon-alerts in addition to the team's own alert channel

Set notification timeout as annotation

Delete containers failing due to `ImagePullBackOff`

Store data over longer periods in InfluxDB/BigQuery

Talk to Terje Sannum
Decide on backend
Decide on what data is useful to store
Where and how do we fetch and write the data (Prometheus etc)
Fix access to Influx on OnPrem-Clusters
Fix timing for logging - decouple from tick rate?
Are we logging the right stuff?

Arm on all dev-* clusters

Write an announcement in #nais-internal that can be shared by 14:00
???
Watch the world burn

Weekend-off-time is not included in delay between detection and purging

The more natural behaviour here would possibly be to ignore the weekend time. So if there has been a weekend between detection and the current check, add 48h to the delay.

Export Prometheus metrics of numbers of deleted resources

Handover of babylon to NAIS

Book a meeting with NAIS

What?

Slack channels
Alerts
NAIS.yaml
High-level overview of how things work
- Sequence diagram
Use of ingresses; they are probably not necessary

Integration test with k8s in CI

Run k8s during tests (E2E)
Various failing resources
Simple to run (ideally one command, and fast)
Documented how to run in README

Babylon sleeps differently across the clusters

As of 10:44 on Monday 2 August, dev-gcp is the only cluster where we do not get log messages saying that it is sleeping.

Checked that date gives the same time and time zone in dev-fss and dev-gcp
Both clusters use the same Docker image
But they read different working-hours??? It seems that several deploys have not picked up that the ConfigMap was changed in commit ad4bd77

Ask for confirmation via Slack message

It would be very useful for users whose resources have been downscaled (and who may have been notified about it) to be able to easily delete them as well:

Get notification about downscaling of resources
Babylon asks for confirmation with a button in the Slack message, asking if the user wants to also delete the resource
User clicks button and the resource is deleted (might have to log in for safety)

Babylon oversleeps

According to the logs, babylon did not start until 10:00 on Monday 2 August

Grafana integration (useful logging of resource usage/deletion)

Potential metrics

No. would-be-pruned resources
No. alerts sent with different time intervals
No. failing resources sorted by reason for failing
No. of teams affected
No. deployments currently in Pending or Failing, and their error messages
Add rule activations to controller metrics

Metrics should maybe be reset once a previous deployment has been redeployed?

Get application to deploy to `dev-gcp` again

When we switched from the Kubernetes Go client to controller-runtime's client, our app switched endpoints for health and readiness checks. As it stands, the checks are on a different port than the application itself, and we cannot use the same port to serve both the checks and the metrics. Currently metrics are under :8080/metrics and health checks are under :8081/healthz.

CI fails to deploy because NAIS deploy cannot check the state of our application.