argoproj / argo-rollouts

Progressive Delivery for Kubernetes

Home Page: https://argo-rollouts.readthedocs.io/

License: Apache License 2.0

Dockerfile 0.09% Makefile 0.28% Go 86.11% Shell 0.35% SCSS 0.81% TypeScript 12.26% HTML 0.01% JavaScript 0.09% CSS 0.01%
argo-rollouts argoproj bluegreen canary deployments experiments gitops hacktoberfest kubernetes progressive-delivery

argo-rollouts's Introduction


Argoproj - Get stuff done with Kubernetes

Argo Image

What is Argoproj?

Argoproj is a collection of tools for getting work done with Kubernetes.

  • Argo Workflows - Container-native Workflow Engine
  • Argo CD - Declarative GitOps Continuous Delivery
  • Argo Events - Event-based Dependency Manager
  • Argo Rollouts - Progressive Delivery with support for Canary and Blue Green deployment strategies

Also, argoproj-labs is a separate GitHub org that we set up for community contributions related to the Argoproj ecosystem. Repos in argoproj-labs are administered by the owners of each project. Please reach out to us on the Argo Slack channel if you have a project that you would like to add to the org, to make it easier for others in the Argo community to find, use, and contribute back.

Community Blogs and Presentations

Project-specific community blogs and presentations are linked from the respective project repositories.

Adopters

Each Argo sub-project maintains its own list of adopters. Those lists are available in the respective project repositories:

Contributing

To learn about how to contribute to Argoproj, see our contributing documentation. Argo contributors must follow the CNCF Code of Conduct.

For help contributing, visit the #argo-contributors channel in CNCF Slack.

To learn about Argoproj governance, see our community governance document.

Project Resources

argo-rollouts's People

Contributors

34fathombelow, agrawroh, alexef, blkperl, chetan-rns, cronik, dependabot[bot], dthomson25, duboisf, github-actions[bot], harikrongali, huikang, jessesuen, khhirani, kostis-codefresh, leoluz, mclarke47, meeech, moensch, noam-codefresh, openguidou, perenesenko, ravihari, rbreeze, saradhis, schakrad, terrytangyuan, thomas-riccardi, zachaller, zcc35357949


argo-rollouts's Issues

Make number of replicas running under preview service customizable

If a user wants to run tests on the preview stack before switching that ReplicaSet to the active stack, they may want to run the preview stack with a reduced number of replicas. After they have finished testing, they will want to scale up the ReplicaSet before the switch to the active service.
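One possible shape for this, assuming a new (hypothetical) `previewReplicaCount` field on the blue-green strategy; the service names are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: guestbook-active
      previewService: guestbook-preview
      # Hypothetical field: run the preview ReplicaSet at a reduced size
      # until promotion, then scale it up to spec.replicas before the
      # active-service switch.
      previewReplicaCount: 1
```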

Rollouts should surface errors in ReplicaSets

When the controller tries to process a Rollout with an invalid init container, the method to create a ReplicaSet fails without recording any error condition on the Rollout. As a result, the user has no idea that there is an issue with their rollout, and can only detect the issue by checking the controller logs. The controller should add a status condition that makes it clear the rollout is in a degraded state when it cannot modify ReplicaSets.
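As a sketch, the controller could surface the failure as a condition in the Rollout status; the condition type, reason, and message below are illustrative, not an agreed-upon schema:

```yaml
status:
  conditions:
  - type: InvalidSpec          # illustrative condition type
    status: "True"
    reason: ReplicaSetCreateError
    message: 'Failed to create ReplicaSet: init container "setup" is invalid'
```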

Experiment CRD

To drive canary analysis, we are proposing to introduce a mechanism to launch an ephemeral run of one or more ReplicaSets, typically finishing after some specified time duration. To power this, we would introduce an "Experiment" CRD, which might look something like:

apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: guestbook-experiment
spec:
  durationSeconds: 600
  templates:
  - name: canary
    replicas: 1
    spec:
      containers:
      - name: guestbook
        image: guestbook:v2
  - name: baseline
    replicas: 1
    spec:
      containers:
      - name: guestbook
        image: guestbook:v1

This CRD would launch two replicasets, with the respective pod templates, for some time duration.

After the ReplicaSets have run for the specified duration, an analysis would follow which returns a score.

spec:
  durationSeconds: 600
  templates:
  - name: canary
...
  analysis:
    # syntax TBD
    intervalSeconds: 60
    realtime: false
    prometheus:
      server: prometheus.default:9000
      query: grpc_server_handled_total{job="argocd-server-metrics",grpc_service="application.ApplicationService",grpc_code="Error"} > 0

Note that we would also need the ability to analyze the Experiment in real time: if the experiment is going badly and the score drops below some threshold, the Experiment should be stopped prematurely and the rollout failed.

spec:
  durationSeconds: 600
  templates:
  - name: canary
...
  analysis:
    failureThreshold: 50
    failFast: true
status:
  score: 80

This would integrate with a Rollout through a new step type in the canary strategy, which initiates a run of the experiment and only proceeds to promotion if the experiment succeeds.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      steps:
      - experiment: # syntax TBD
         templates:
         - name: baseline
           specFrom: stable
         - name: canary
           specFrom: canary
      - setWeight: 50
      - pause: {}

Controller is missing patch event privileges

It seems the rollout-controller also needs patch privileges on events.

E0523 07:06:00.421401       1 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"rollouts-demo.15a13e031675200e", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"17258", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Rollout", Namespace:"default", Name:"rollouts-demo", UID:"259e2a57-7d29-11e9-b280-0eb058d58717", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"17355", FieldPath:""}, Reason:"ScalingReplicaSet", Message:"Scaled up replica set rollouts-demo-7d95bc4d88 to 3", Source:v1.EventSource{Component:"rollouts-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63694191929, loc:(*time.Location)(0x1e4f6e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf31af76186cff09, ext:347013080949, loc:(*time.Location)(0x1e4f6e0)}}, Count:2, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "rollouts-demo.15a13e031675200e" is forbidden: User "system:serviceaccount:argo-rollouts:argo-rollouts" cannot patch resource "events" in API group "" in the namespace "default"' (will not retry!)
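A minimal fix would be to add `patch` to the events rule in the controller's RBAC role; the role name below is illustrative and may differ from the install manifests:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-clusterrole   # name may differ in the install manifests
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "get", "list", "watch", "update", "patch"]
```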

Rollout doesn't remove `rollouts-pod-template-hash` service selector after switching from blue-green to canary strategy

The Rollout controller doesn't remove the rollouts-pod-template-hash service selector after switching from the blue-green to the canary strategy.

Steps to reproduce:

  1. Deploy a rollout using the blue-green strategy
  2. Change the strategy to canary and update the rollout image
  3. The Rollout deploys the canary version as expected, but the service doesn't route any traffic to it because it still has the rollouts-pod-template-hash selector that was set during the blue-green deployment.

Expected behavior:

The Rollout should ensure the rollouts-pod-template-hash service selector is removed unless the strategy is blue-green.

Version: v0.3.1
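A sketch of the expected reconciliation, assuming the controller compares the live Service selector against the active strategy; the helper below is illustrative, not the project's actual code:

```go
package main

import "fmt"

// rolloutsPodTemplateHash is the selector key injected by the blue-green strategy.
const rolloutsPodTemplateHash = "rollouts-pod-template-hash"

// reconcileServiceSelector removes the pod-template-hash selector when the
// rollout is no longer using the blue-green strategy. It returns the cleaned
// selector and whether the Service needs an update.
func reconcileServiceSelector(selector map[string]string, isBlueGreen bool) (map[string]string, bool) {
	if isBlueGreen {
		return selector, false
	}
	if _, ok := selector[rolloutsPodTemplateHash]; !ok {
		return selector, false
	}
	cleaned := make(map[string]string, len(selector))
	for k, v := range selector {
		if k != rolloutsPodTemplateHash {
			cleaned[k] = v
		}
	}
	return cleaned, true
}

func main() {
	sel := map[string]string{"app": "guestbook", rolloutsPodTemplateHash: "7d95bc4d88"}
	cleaned, changed := reconcileServiceSelector(sel, false)
	fmt.Println(cleaned, changed)
}
```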

Implicit understanding of rollback based on steps completion and pod hash

In the case of an emergency rollback where the N-1 revision of the manifests was re-applied, and the rollout strategy contains unfinished steps, it is hard to tell whether the steps logic should be repeated (e.g. a gradual canary) or whether we should move as fast as possible back to the N-1 revision (because it is a rollback).

The proposal is implicit behavior: if the steps were not completed for the current rollout, and we are moving toward the N-1 revision of the manifests, the entire steps logic should be skipped.
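The proposed rule reduces to a comparison of pod-template hashes; the function below is an illustrative sketch assuming the controller tracks the stable (N-1) and new ReplicaSet hashes:

```go
package main

import "fmt"

// shouldSkipSteps implements the proposed implicit-rollback rule: if the new
// ReplicaSet's pod-template hash matches the stable (N-1) hash and the current
// step sequence never completed, skip the steps and promote immediately.
func shouldSkipSteps(newHash, stableHash string, stepsCompleted bool) bool {
	return !stepsCompleted && newHash == stableHash
}

func main() {
	fmt.Println(shouldSkipSteps("abc123", "abc123", false)) // rollback to stable: skip steps
	fmt.Println(shouldSkipSteps("def456", "abc123", false)) // new revision: run steps
}
```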

argo-rollouts pod in CrashLoopBackOff

I installed argo-rollouts using install.yaml, but the argo-rollouts pod is constantly failing with:

time="2019-06-03T10:51:20Z" level=info msg="Creating event broadcaster"
time="2019-06-03T10:51:20Z" level=info msg="Setting up event handlers"
time="2019-06-03T10:51:20Z" level=info msg="Starting Rollout controller"
time="2019-06-03T10:51:20Z" level=info msg="Waiting for informer caches to sync"
E0603 10:51:20.412853 1 reflector.go:134] github.com/argoproj/argo-rollouts/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Rollout: rollouts.argoproj.io is forbidden: User "system:serviceaccount:default:argo-rollouts" cannot list resource "rollouts" in API group "argoproj.io" at the cluster scope
E0603 10:51:20.412853 1 reflector.go:134] github.com/argoproj/argo-rollouts/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Rollout: rollouts.argoproj.io is forbidden: User "system:serviceaccount:default:argo-rollouts" cannot list resource "rollouts" in API group "argoproj.io" at the cluster scope
log: exiting because of error: log: cannot create log: open /tmp/rollouts-controller.argo-rollouts-694bc885dd-gqmst.unknownuser.log.ERROR.20190603-105120.1: no such file or directory

What should I do?

Controller crashes due to inability to write to /tmp

Since moving to a scratch container, I think we now hit this crash in some cases:

log: exiting because of error: log: cannot create log: open /tmp/rollouts-controller.argo-rollouts-6f46569798-crbq7.unknownuser.log.ERROR.20190523-065729.1: no such file or directory

This is because /tmp does not exist and one of the logging libraries (glog) wants to write to it.
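One possible workaround, assuming the controller is deployed from the stock install manifests, is to mount an emptyDir at /tmp so glog has somewhere to write (Deployment fragment; container name may differ):

```yaml
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        volumeMounts:
        - name: tmp
          mountPath: /tmp       # gives glog a writable directory in the scratch image
      volumes:
      - name: tmp
        emptyDir: {}
```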

ReplicaSet informer unnecessarily looking at all ReplicaSets

All ReplicaSets created by Argo Rollouts carry the label rollouts-pod-template-hash. We should improve the informer to use tweakListOptions so it only watches ReplicaSets with our label. This should reduce the size of the cache so that the controller is not looking at ReplicaSets from unrelated parents, such as Deployments.
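The idea can be sketched with a tweak function of the shape client-go expects; a stand-in struct is used here so the example is self-contained, whereas the real controller would pass the function through the informer factory's tweakListOptions option:

```go
package main

import "fmt"

// listOptions mirrors the LabelSelector field of metav1.ListOptions.
type listOptions struct {
	LabelSelector string
}

// tweakListOptions restricts the informer's list/watch calls to ReplicaSets
// that carry the rollouts-pod-template-hash label, so ReplicaSets owned by
// unrelated parents (e.g. Deployments) never enter the cache.
func tweakListOptions(opts *listOptions) {
	opts.LabelSelector = "rollouts-pod-template-hash"
}

func main() {
	var opts listOptions
	tweakListOptions(&opts)
	fmt.Println(opts.LabelSelector)
}
```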

Rollout controller might become unresponsive

Version: v0.3.0

I managed to get the rollout controller into an unresponsive state by frequently deleting and recreating a rollout with the same name. It looks like if a rollout is deleted unexpectedly, the controller keeps requeuing it and keeps getting the same 404 error. See attached logs:

LOGS

time="2019-05-14T14:55:59Z" level=info msg="Started syncing rollout at (2019-05-14 14:55:59.040180787 +0000 UTC m=+59068.994216643)" rollout=rollouts-demo/rollouts-demo
time="2019-05-14T14:55:59Z" level=info msg="Rollout rollouts-demo/rollouts-demo has been deleted" rollout=rollouts-demo/rollouts-demo
E0514 14:55:59.040358       1 controller.go:259] error syncing 'rollouts-demo/rollouts-demo': rollout.argoproj.io "rollouts-demo" not found, requeuing
time="2019-05-14T14:55:59Z" level=info msg="Started syncing rollout at (2019-05-14 14:55:59.24018185 +0000 UTC m=+59069.194217617)" rollout=rollouts-demo/rollouts-demo
time="2019-05-14T14:55:59Z" level=info msg="Rollout rollouts-demo/rollouts-demo has been deleted" rollout=rollouts-demo/rollouts-demo
E0514 14:55:59.240403       1 controller.go:259] error syncing 'rollouts-demo/rollouts-demo': rollout.argoproj.io "rollouts-demo" not found, requeuing
time="2019-05-14T14:55:59Z" level=info msg="Started syncing rollout at (2019-05-14 14:55:59.34017327 +0000 UTC m=+59069.294209024)" rollout=rollouts-demo/rollouts-demo
time="2019-05-14T14:55:59Z" level=info msg="Rollout rollouts-demo/rollouts-demo has been deleted" rollout=rollouts-demo/rollouts-demo
E0514 14:55:59.340366       1 controller.go:259] error syncing 'rollouts-demo/rollouts-demo': rollout.argoproj.io "rollouts-demo" not found, requeuing
[... the same "not found, requeuing" cycle repeats for hours, with the requeue interval backing off from sub-second to roughly 16-minute intervals ...]

Add back service informer to handle Service recreations quicker

A user did the following:

  1. Created a blue/green Rollout with active and preview services.
  2. Deployed the services and the Rollout, and waited for the Rollout to fully deploy.
  3. Deleted the services.
  4. Redeployed the services. The recreated services are missing the pod hash label added by the Rollout.

After step 4, because we no longer have a Service informer, we do not detect that the services are missing the pod hash. As a consequence, both the active and preview services will serve traffic from both the active and preview ReplicaSets.

A while back we made a conscious decision not to add a Service informer because we felt users would not be deleting the pod-template-hash labels, and it would save an API server watch on Services. However, I did not consider the service delete/recreate use case, which is what happened here and may be a common/desirable thing to do.

In order to quickly correct situations where a Service was deleted and recreated (or the pod-template-hash labels were removed), we should consider adding back the Service informer to immediately restore the labels.

The tricky part is knowing how to efficiently associate Service updates with their related Rollouts (since the Services would not yet have any of our labels).

Clean up Status fields

To make it easier to understand, the Status struct can be cleaned up by streamlining the naming of fields used by both strategies (e.g. verifyingPreview and setPause) and moving strategy-specific fields into a per-strategy status struct (e.g. activeSelector -> blueGreenStatus.activeSelector).

Rollout Controller deployment should use the recreate strategy

Since the Argo Rollouts controller cannot be deployed by a Rollout it manages itself, we have a chicken-and-egg problem, so the controller is deployed as a regular Deployment. Currently it is deployed using the rolling update strategy. Instead, the controller should use the recreate strategy, because controllers operate under the assumption that they are the only actor modifying a resource, and a rolling update briefly runs two controller pods at once.
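A minimal sketch of the manifest change (image tag and labels illustrative, not the actual install manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
spec:
  replicas: 1
  strategy:
    type: Recreate  # old controller pod stops before the new one starts
  selector:
    matchLabels:
      app: argo-rollouts
  template:
    metadata:
      labels:
        app: argo-rollouts
    spec:
      containers:
      - name: argo-rollouts
        image: argoproj/argo-rollouts:latest
```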

StepsHash is not stable

Similar to #88, the function to calculate the steps hash is not stable across changes to the step type definition. We need a stable hashing function, or else adding new types of steps will break suspended canary rollouts, causing them to restart at step 1.
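As a hedged sketch of one possible approach (the step types below are hypothetical stand-ins, not the real Rollout API): hashing the canonical JSON encoding with omitempty fields means a newly added optional step type does not perturb hashes of existing steps.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// StepV1 stands in for the original step type definition; StepV2 adds a new
// optional step type. Both are illustrative only.
type StepV1 struct {
	SetWeight *int32 `json:"setWeight,omitempty"`
	Pause     *bool  `json:"pause,omitempty"`
}

type StepV2 struct {
	SetWeight *int32 `json:"setWeight,omitempty"`
	Pause     *bool  `json:"pause,omitempty"`
	Wait      *int32 `json:"wait,omitempty"` // newly added step type
}

// stableHash hashes the canonical JSON encoding rather than the in-memory
// struct layout, so unset optional fields never influence the result.
func stableHash(v interface{}) uint32 {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

func main() {
	w := int32(50)
	old := stableHash([]StepV1{{SetWeight: &w}})
	updated := stableHash([]StepV2{{SetWeight: &w}})
	fmt.Println(old == updated) // prints true: the hash survives the type change
}
```

The caveat is that renaming or reordering existing fields still changes the encoding, so this only protects against additive changes where new fields default to unset.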

Aggregate role names collide with workflow aggregate roles

Looks like there were some copy-and-paste issues. This ClusterRole should not be named argo-aggregate-to-view, since that name is already used by the Argo workflow-controller.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-aggregate-to-view
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
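One possible fix is simply prefixing the role name with the project so the two sets of aggregate roles no longer collide (the exact name is a suggestion):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-aggregate-to-view
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
```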

Integration between Rollout with Experiment CRD

Using the canary strategy, users would like to spin up new pods of the new desired pod spec, but without serving any production traffic. Additionally, users would like a way to reach the new canary pods through a service object. Essentially this is similar to blue-green's preview service, but for the canary strategy.

The proposal is to accomplish this use case with the following spec definition:
UPDATE: will solve this using experiment integration instead of a new preview step type.

spec:
  strategy:
    canary:
      canaryService: guestbook-canary
      steps:
      - setWeight: 0
      - preview:
          replicas: 1
      - pause: {}

The second step would bring up a new ReplicaSet of the new desired spec, but would only apply the pod-hash label to the pods. The pods would be missing all of the other labels in the spec (e.g. app=guestbook), which prevents production traffic from hitting the preview pods.

NOTE: This assumes that the feature for canaryService (#91), would modify the service to select only on the pod hash, and nothing else. In fact, we may need to raise an error condition if we detect any other user-defined label selectors in the canaryService (e.g. app=guestbook).

Introduce steps into blueGreen strategy

The blueGreen deployment strategy should have a steps field to control the rollout. Here is the current proposal:

  # manual gate
  strategy:
    blueGreen: 
      activeService: active-service
      previewService: preview-service
      steps:
      - setPreview: true
      - pause: true

  # manual teardown
  strategy:
    blueGreen: 
      activeService: active-service
      steps:
      - switchActive: true
      - pause: true

  # manual gate and manual teardown
  strategy:
    blueGreen: 
      activeService: active-service
      steps:
      - setPreview: true
      - pause: true
      - switchActive: true
      - pause: true

  # fully automated
  strategy:
    blueGreen: 
      activeService: active-service

  # fully automated with delayed teardown
  strategy:
    blueGreen: 
      activeService: active-service
      steps:
      - switchActive: true
      - wait: 600

Rollouts unprotected from invalid specs

Despite having Open API validation in the Rollout CRD, it was still possible for a user to set a bad label on a rollout:

metadata:
  labels:
    this-should-be-a-string-not-an-int: 0

When this happened, the entire rollout-controller ceased to function and spewed the following in the logs:

E0522 19:14:14.782189       1 reflector.go:134] github.com/argoproj/argo-rollouts/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Rollout: v1alpha1.RolloutList.Items: []v1alpha1.Rollout: v1alpha1.Rollout.Spec: v1alpha1.RolloutSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.ObjectMeta: v1.ObjectMeta.Labels: ReadString: expects " or n, but found 0, error found in #10 byte of ...|version":0,"rollout-|..., bigger context ...|1170dd38c2c2fae7a8c2a","rolling-restart-version":0,"rollout-config-map-hash":"e4984dbdd2eb949fbfd756|...
E0522 19:14:15.788187       1 reflector.go:134] github.com/argoproj/argo-rollouts/pkg/client/informers/externalversions/factory.go:117: Failed to list *v1alpha1.Rollout: v1alpha1.RolloutList.Items: []v1alpha1.Rollout: v1alpha1.Rollout.Spec: v1alpha1.RolloutSpec.Template: v1.PodTemplateSpec.Spec: v1.PodSpec.ObjectMeta: v1.ObjectMeta.Labels: ReadString: expects " or n, but found 0, error found in #10 byte of ...|version":0,"rollout-|..., bigger context ...|1170dd38c2c2fae7a8c2a","rolling-restart-version":0,"rollout-config-map-hash":"e4984dbdd2eb949fbfd756|...

This is essentially the same failure scenario as in: kubernetes/kubernetes#57705

It appears that even having Open API validation does not always save us from bad specs. We likely need to use the unstructured informer approach to protect us.

Additional printer columns

Rollouts should have similar printer columns as a deployment. We could also have WEIGHT there:

$ k get deploy
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
argocd-application-controller   1/1     1            1           3h41m
argocd-repo-server              1/1     1            1           3h41m
argocd-server                   1/1     1            1           3h41m
dex-server                      1/1     1            1           3h41m
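A sketch of what this could look like in the CRD manifest (the column set and JSONPaths are illustrative, and the weight path assumes a hypothetical status field):

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: rollouts.argoproj.io
spec:
  additionalPrinterColumns:
  - name: Desired
    type: integer
    JSONPath: .spec.replicas
  - name: Current
    type: integer
    JSONPath: .status.replicas
  - name: Available
    type: integer
    JSONPath: .status.availableReplicas
  - name: Weight
    type: string
    JSONPath: .status.canary.weight  # hypothetical status field
```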

Improve the kubernetes events emitted

Argo Rollouts can do a better job of emitting Kubernetes events on important changes in the system. For example, Argo Rollouts should emit an event when either the active or preview service is modified or when the stable replicaset is changed in a canary strategy.

Rollout is not scaling down old replicasets properly

Here is a rollout in a Suspended state, but with four ReplicaSets scaled to two replicas each:

image

The expectation is that in a steady state (Suspended), only two ReplicaSets (active and preview) should be scaled higher than 0.

I ran a diff against the last three ReplicaSet revisions (12, 11, 9). I'm not sure what happened to revision 10. Notice that the only differences are in metadata and status. The ReplicaSet spec is the same, which means the pod template is the same; however, the bug is that the pod-template-hash portion of the name is not.

$ diff rs-12 rs-11
6,8c6,7
<     rollout.argoproj.io/revision: '12'
<     rollout.argoproj.io/revision-history: '10'
<   creationTimestamp: '2019-05-14T21:54:49Z'
---
>     rollout.argoproj.io/revision: '11'
>   creationTimestamp: '2019-05-14T22:16:39Z'
16c15
<     rollouts-pod-template-hash: 65c456b799
---
>     rollouts-pod-template-hash: 7d58696fd9
18c17
<   name: web-service-integration-65c456b799
---
>   name: web-service-integration-7d58696fd9
27c26
<   resourceVersion: '95549146'
---
>   resourceVersion: '95556597'
29,30c28,29
<     /apis/apps/v1/namespaces/fdp-connectivity-web-service-integration-usw2-ppd-qal/replicasets/web-service-integration-65c456b799
<   uid: e584ea76-7692-11e9-9427-0a985b86565a
---
>     /apis/apps/v1/namespaces/fdp-connectivity-web-service-integration-usw2-ppd-qal/replicasets/web-service-integration-7d58696fd9
>   uid: f1c6de3d-7695-11e9-9427-0a985b86565a
36c35
<       rollouts-pod-template-hash: 65c456b799
---
>       rollouts-pod-template-hash: 7d58696fd9
57c56
<         rollouts-pod-template-hash: 65c456b799
---
>         rollouts-pod-template-hash: 7d58696fd9
$ diff rs-12 rs-9
6,7c6
<     rollout.argoproj.io/revision: '12'
<     rollout.argoproj.io/revision-history: '10'
---
>     rollout.argoproj.io/revision: '9'
16c15
<     rollouts-pod-template-hash: 65c456b799
---
>     rollouts-pod-template-hash: 748b545485
18c17
<   name: web-service-integration-65c456b799
---
>   name: web-service-integration-748b545485
27c26
<   resourceVersion: '95549146'
---
>   resourceVersion: '95539841'
29,30c28,29
<     /apis/apps/v1/namespaces/fdp-connectivity-web-service-integration-usw2-ppd-qal/replicasets/web-service-integration-65c456b799
<   uid: e584ea76-7692-11e9-9427-0a985b86565a
---
>     /apis/apps/v1/namespaces/fdp-connectivity-web-service-integration-usw2-ppd-qal/replicasets/web-service-integration-748b545485
>   uid: e58000cc-7692-11e9-9427-0a985b86565a
36c35
<       rollouts-pod-template-hash: 65c456b799
---
>       rollouts-pod-template-hash: 748b545485
57c56
<         rollouts-pod-template-hash: 65c456b799
---
>         rollouts-pod-template-hash: 748b545485

This implies that the same pod template may be producing different pod template hashes.

During this time, we know from talking to the user that the rollout's spec.template.spec was changed only to modify resource requests/limits to equivalent values (e.g. 2000m -> '2'). I suspect the underlying issue is that when we call controller.ComputeHash(), it does not consider these values to be the same, resulting in different pod template hashes.

Rework Rollout Conditions

Currently, there are invalidSpec and available conditions. The invalidSpec condition highlights invalid YAML, but the available condition needs to be fleshed out more. It is not used by the Canary strategy, and its use in the BlueGreen strategy does not indicate the status of a rollout well: the condition only indicates that the active service is pointing at a ReplicaSet and that the ReplicaSet is at the desired number of replicas. As a result, the rollout could have an available condition set to true while the controller is still working on the rollout.

With the work required to rework the available condition, we will need to make sure that we are able to answer the following questions. What does it mean for a rollout to be available? Does that mean the rollout is in a steady-state? Does that mean the rollout is serving traffic to prod?

Blue-Green Rollout is paused after first synchronization

The Rollout automatically switched to the paused state after the first synchronization, even though only the green version of the deployment exists. The Rollout should be paused only when both green and blue versions exist and the user has to promote blue to green.

Implement Replica-Based CanaryUpdate Strategy

This strategy will allow users to define a list of steps that they want the controller to run through before promoting a new ReplicaSet into the stable ReplicaSet. These steps include setting the percentage of the traffic split between the stable and canary ReplicaSets, waiting a user-defined amount of time after the rollout has achieved the desired weights, and pausing the rollout until the user chooses to continue.

Here is a very high-level overview of the work required:

  • Calculate the Replica Counts for the New and Canary (#23)
  • Implement the Pause functionality (#28)
  • Implement SetWeight functionality (#28)
  • Implement the Wait functionality (#28)
  • Implement the Scale functionality (#30)
  • Implement the Status functionality including validation of the rollout CRD (#30)
  • Add Status changes to Argo CD Lua health script
  • Discuss and implement expected behavior of the Argo CD wait command (argoproj/argo-cd#1187)
  • Document Canary Strategy (#31)

Support gradual increase of replicas for traffic shaped canary

There are two conflicting use cases with traffic-shaped canary (as opposed to replicaset-based canary).

  1. After changing the weight of a canary, the amount of traffic directed at the canary could be disproportionate to its scale and overwhelm the pods. In this case, we would want to set the weight gradually, eventually reaching the target weight for the canary.

  2. The weight set on a canary should be applied instantaneously despite a disproportionate number of replicas (for example, going from 100% old to 100% new atomically).

We need to allow users to be able to configure which of the two modes should happen when weights are set.
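As a strawman for the discussion only (the field name and values below are hypothetical, not an agreed design), the knob could live at the strategy level:

```yaml
strategy:
  canary:
    # hypothetical field; exact name and shape to be decided
    trafficShaping: gradual  # or: instant
    steps:
    - setWeight: 20
```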

Expose Prometheus metrics on rollouts

Argo Rollouts should expose Prometheus metrics in order to provide better overall insight into the rollouts in a cluster. Some initial metrics to include could be:

  • Number of rollouts and which strategies they are using
  • When a rollout starts progressing from a new pod spec
  • When a rollout finishes progressing from a new pod spec
  • Every time the rollout increments a step

As a reference implementation, we can probably use argoproj/argo-workflows@b9cffe9 as a baseline.
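For illustration only (metric names and labels below are hypothetical, not an agreed design), the exposition for the initial metrics might look like:

```
# HELP rollout_info Information about a rollout and its strategy
# TYPE rollout_info gauge
rollout_info{name="guestbook",namespace="default",strategy="canary"} 1
# HELP rollout_step_increments_total Times a rollout has incremented a canary step
# TYPE rollout_step_increments_total counter
rollout_step_increments_total{name="guestbook",namespace="default"} 3
```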

Test refactoring, cleanup, and standardization

  • standardize on helper functions to create new replicasets
  • standardize on fmt.Sprintf usage (or the removal of it)
  • provide the ability to write tests which verify the end state of rollouts/replicasets, rather than verifying the exact order of Kubernetes API calls
  • helpers to avoid hard-coding hashes in tests

Rollout does not scale preview service properly on changes

We have a v0.3.1 Rollout in the following condition:

image

The bottom rollout is actually the desired spec as a preview, and has 3 replicas. However, the rollout spec has 2 desired replicas, and the rollout currently has the following error condition:

Rollout "partner-support-portal" has timed out progressing.

[Bug] Kustomize configMapGenerator and merged patch do not work with apiVersion: argoproj.io/v1alpha1

Hi,

I've spent a few hours trying to debug an issue introduced when trying out Rollouts with Kustomize.

I was following the guide at the link below which suggests that on each kustomize build, the config map name has a hash suffix appended to trigger a rolling update of pods on configmap changes.

https://github.com/kubernetes-sigs/kustomize/blob/master/examples/configGeneration.md

The issue is, Kustomize works as expected when the standard Deployment resource type is used (see Kustomize renaming api-2-config-map -> production-api-2-config-map-g875tt789c):

image

However, when using the Rollout resource type under apiVersion argoproj.io/v1alpha1:

image

Note how the name in configMapKeyRef is not updated with the prefix and hash suffix.

Environment:
Kustomize v2.0.3 (installed on Mac via Homebrew)
Argo Rollouts API version: argoproj.io/v1alpha1
Resource type: Rollout

Rollout should fast-rollback on BlueGreen

When a rollout's template has a pod hash that matches the selector within the active service, the controller should treat that rollout as being in the ideal state. In this case, the controller should unpause the rollout and scale down the old ReplicaSets.

Change in pod template hash after upgrading k8s dependencies

I tried updating our k8s dependencies (e.g. k8s.io/client-go, k8s.io/kubernetes). However, after updating our dependencies, unit tests started failing because the computed hash of the pod template spec was different from before the update.

Even worse, when running the controller built with the updated k8s libraries, it:

  1. caused existing rollouts to be redeployed, since the controller detected the new pod template hash as a change in the spec.
  2. somehow caused the active service selector to select no valid pods, essentially causing downtime.

Currently we use the k8s.io/kubernetes/pkg/controller library to compute the hash:

controller.ComputeHash(&rollout.Spec.Template, rollout.Status.CollisionCount)

I suspect what is happening is that there are new fields in the pod template which cause the hashes to differ. Since Rollouts does need to keep up with the Kubernetes libraries, we need to figure out a way to allow library updates without triggering redeploys of rollouts.

Ability to specify canaryService Service object to reach only canary pods

Similar to blue-green's previewService, users of the canary strategy would like to reach the canary pods through a Service object, without hitting any of the old/stable pods. This proposal is to allow specifying a canary service like so:

spec:
  strategy:
    canary:
      canaryService: guestbook-canary

If the user specifies a canaryService in the strategy, the controller will modify the specified service to add the rollouts-pod-template-hash as a label selector.

For this to work, we will need to inject the rollouts-pod-template-hash label into the ReplicaSets created by the canary strategy. This is something we already do for the blue-green strategy, but will need to do for canary as well.
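The resulting Service might look roughly like this (the hash value and port numbers are illustrative; only the injected selector is the point):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: guestbook-canary
spec:
  selector:
    app: guestbook
    rollouts-pod-template-hash: 7d58696fd9  # injected by the controller
  ports:
  - port: 80
    targetPort: 8080
```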

Controller panics due to NPE in service.go

The controller panics when it encounters a Service that belongs to a Rollout with a nil strategy.

Logs

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1ce729b]

goroutine 68 [running]:
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x1e2fca0, 0x2cf6010)
	/usr/local/Cellar/go/1.11.5/libexec/src/runtime/panic.go:513 +0x1b9
github.com/argoproj/argo-rollouts/controller.(*Controller).getRolloutsForService(0xc0003febe0, 0xc0003da1e0, 0xc0003b17a0, 0xc00071bd78, 0xc00071bcd0, 0x3, 0x0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/controller/service.go:132 +0x1ab
github.com/argoproj/argo-rollouts/controller.(*Controller).handleService(0xc0003febe0, 0x1f9a900, 0xc0003da1e0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/controller/service.go:148 +0x4a
github.com/argoproj/argo-rollouts/controller.(*Controller).handleService-fm(0x1f9a900, 0xc0003da1e0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/controller/controller.go:147 +0x3e
github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(0xc0005eb260, 0xc0005eb270, 0xc0005eb280, 0x1f9a900, 0xc0003da1e0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache/controller.go:195 +0x49
github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0x102b753, 0xc00006edc8, 0x0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache/shared_informer.go:554 +0x21d
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0xc00082fe18, 0x102b262, 0xc0004913e0)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203 +0x9c
github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x89
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00006ef68)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00082ff68, 0xdf8475800, 0x0, 0x1e0ee01, 0xc0000acd80)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00006ef68, 0xdf8475800, 0xc0000acd80)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc0005f2e80)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x8d
github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache.(*processorListener).run-fm()
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390 +0x2a
github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0004160d0, 0xc0005eb300)
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/Users/amatyushentsev/root/go/src/github.com/argoproj/argo-rollouts/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

Unintuitive automatic pause behavior for blue green strategy

The current automatic pause behavior for the blue/green strategy is inconsistent, and hard to explain. Some examples:

  • it does not pause during the initial rollout
  • it does not pause if the preview service is unspecified, but does pause if one is specified.
  • it does not pause during a fast rollback

These rules are not obvious or documented. I think we need to rework the behavior so that it is more easily predictable.

I also feel the default behavior should be switched to automatically performing the cutover, instead of the current behavior of automatically pausing whenever there is a preview service. This would make the behavior consistent with rolling update, recreate, and canary, where the default is to perform the promotion automatically. Just because I don't specify a preview service doesn't necessarily mean I don't want it to pause, and vice versa.

Use rollout informer instead of API list

In service.go:

func (c *Controller) getRolloutsForService(service *corev1.Service) ([]*v1alpha1.Rollout, error) {
	allROs, err := c.rolloutsclientset.ArgoprojV1alpha1().Rollouts(service.Namespace).List(metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

We should use the informer's lister instead of the List API, to avoid a round-trip to the API server on every Service event.
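A rough sketch of the change, assuming a `rolloutsLister` field (name hypothetical) is wired up from the shared informer factory at controller construction time:

```go
func (c *Controller) getRolloutsForService(service *corev1.Service) ([]*v1alpha1.Rollout, error) {
	// Read from the informer's local cache instead of round-tripping to the
	// API server on every Service event.
	allROs, err := c.rolloutsLister.Rollouts(service.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}
	// ... filter allROs as before
```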

Spec documentation issues

	// This is a pointer to distinguish between explicit zero and not specified.
	// This is set to the max value of int32 (i.e. 2147483647) by default, which means
	// "retaining all old ReplicaSets".
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`

This comment is wrong; the field actually defaults to 10.

Investigate and Implement Istio Based Canary Deployments

Currently, Canary deployments are limited to traffic shaping using the round-robin behavior of the Kubernetes Service. This works for applications with many replicas (roughly 10 or more), because shifting a single pod changes only a small percentage of the traffic. Smaller applications cannot limit the percentage to the same degree. With a service mesh solution, small applications can achieve the same percentages as larger applications.

This issue will be updated after an initial investigation to determine the work required to add an Istio based canary deployment.

Rollout deletionTimestamp are not honored

When deleting a rollout, it seems that we do not honor the deletion timestamp, and may end up creating new ReplicaSets even though the rollout is being deleted.

I believe this will only happen in a transition state (e.g. when both the pod template is new, and rollout deletion timestamp is set). I observed this behavior when upgrading rollouts from v0.3 to v0.4, while a rollout was in the middle of transition (and the pod template hash changed).

CRD validation needs to be removed for resource requests/limits

Since adding resource validation for the pod template, it seems we removed the ability to set numeric values for cpu/memory. This caused Kubernetes to reject creation of Rollouts with floating point/integer cpu/memory requests/limits. We need to remove this validation. I added a convenience function to do this, which just needs to be applied to the resource cpu/memory fields as well.

removeValidataion(&un, "spec.template.metadata.creationTimestamp")

Provide OpenApi Definition in Json for Kustomize CRD's

I'd like to add rollouts as a CRD to kustomize. An example of how to do this can be found here: https://github.com/kubernetes-sigs/kustomize/blob/master/docs/kustomization.yaml#L215

The provided generated yaml does not work as kustomize requires json.
https://github.com/argoproj/argo-rollouts/blob/master/manifests/crds/rollout-crd.yaml

Converting the above yaml to json by commenting out https://github.com/argoproj/argo-rollouts/blob/master/hack/gen-openapi-spec/main.go#L82-L85 does not resolve the issue either.

Rollout with Canary deployment does not cleanup replicasets

Steps to reproduce:

  1. Create a rollout with canary deployment strategy and revisionHistoryLimit field set to e.g. 3
  2. Deploy a new version several times (more than the number in revisionHistoryLimit)
  3. See that replicasets are not getting deleted

Expected behavior

The max number of ReplicaSets should be limited to the number in the revisionHistoryLimit field

Version: v0.3

Rollouts should add owner references to Services

Per the Controller Ref proposal, there are three laws that a controller should follow:

  1. Take ownership of an object
  2. Don't interfere with an object that the controller does not own
  3. Don't share owned objects with other controllers

We aren't respecting these laws with the Service objects, because the rollout controller modifies Services that it does not own. To be in line with Kubernetes best practices, the rollout controller should follow these laws. However, owner references are also leveraged by Kubernetes garbage collection, and the rollout controller should not delete a Service when a Rollout is deleted; instead, it should orphan the Service. More investigation is required to see if it is possible to add this type of behavior to the rollout controller.
