Enqueued AnalysisRun next reconcile time not respected about argo-rollouts HOT 7 OPEN

kevinqian-db commented on June 10, 2024

Enqueued AnalysisRun next reconcile time not respected

from argo-rollouts.

Comments (7)

zachaller commented on June 10, 2024

I know this might not address the root issue, but could you try bumping the --analysis-threads flag on the controller to something like 60 (--analysis-threads=60)? I think it defaults to 30. See if that helps at all.

kevinqian-db commented on June 10, 2024

> I know this might not address the root issue, but could you try bumping the --analysis-threads flag on the controller to something like 60 (--analysis-threads=60)? I think it defaults to 30. See if that helps at all.

Our args configuration looks like the following:

- args:
  - --analysis-threads
  - "264"
  - --rollout-threads
  - "88"
  - --qps
  - "80"
  - --burst
  - "160"
  - --leader-elect
  - "true"

IMO that should be a sufficient number of threads.

kevinqian-db commented on June 10, 2024

It seems that with v1.6.3, reconciling completed AnalysisRuns becomes substantially slower under heavy load. Two other factors compound this: the AnalysisRun informers' resync period (15 min) periodically re-enqueues every completed AnalysisRun for reconciliation, and the workqueue is FIFO. Together they repeatedly starve the live AnalysisRuns that still need to make progress.

I think this strengthens the case for #3285. Would you mind checking whether it makes sense? Thanks! @zachaller
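The starvation described above can be illustrated with a rough back-of-the-envelope sketch. It is not the controller's code: the `queuedRun` type, the backlog of 50,000 completed runs, and the 0.5 s per-item reconcile cost are all assumptions made up for illustration. Only the 264 worker threads come from the configuration posted earlier in the thread.

```go
package main

import "fmt"

// queuedRun is a hypothetical stand-in for a key sitting in a FIFO workqueue.
type queuedRun struct {
	name string
	live bool // true if the AnalysisRun still needs to make progress
}

// firstLiveIndex returns how many items a FIFO queue must process before it
// reaches the first live run, or -1 if the queue holds no live run.
func firstLiveIndex(queue []queuedRun) int {
	for i, r := range queue {
		if r.live {
			return i
		}
	}
	return -1
}

// estimatedWaitSeconds approximates the queueing delay when `workers` threads
// drain `ahead` items that each take perItemSeconds to reconcile.
func estimatedWaitSeconds(ahead, workers int, perItemSeconds float64) float64 {
	return float64(ahead) * perItemSeconds / float64(workers)
}

func main() {
	const completed = 50000    // assumed backlog of completed runs
	const perItemSeconds = 0.5 // assumed reconcile cost under load
	const workers = 264        // --analysis-threads value from the thread

	// A resync enqueues every completed run; the live run lands behind them.
	queue := make([]queuedRun, 0, completed+1)
	for i := 0; i < completed; i++ {
		queue = append(queue, queuedRun{name: fmt.Sprintf("completed-%d", i)})
	}
	queue = append(queue, queuedRun{name: "live-run", live: true})

	ahead := firstLiveIndex(queue)
	fmt.Printf("items ahead of the live run: %d\n", ahead)
	fmt.Printf("estimated wait: %.1fs\n", estimatedWaitSeconds(ahead, workers, perItemSeconds))
}
```

Under these assumed numbers, adding threads only divides the wait; it does not remove the backlog that each resync puts in front of live runs.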

zachaller commented on June 10, 2024

I do think your analysis of the issue makes sense. I don't know if I am sold just yet on rollouts managing the TTL; I don't think the original design of analysis runs was meant to be used outside of a rollout. I have to see and think about what a proposal for that would look like.

zachaller commented on June 10, 2024

How are you guys creating your analysis runs? What do those specs look like? Does it make sense for that tool to manage the cleanup?

kevinqian-db commented on June 10, 2024

> How are you guys creating your analysis runs? What do those specs look like? Does it make sense for that tool to manage the cleanup?

We have an internal tool that directly generates an AnalysisRun manifest (without a template) from a set of configurations, along with the manifest for the type of resource the user requested (e.g. StatefulSet). Internal users are not realistically aware that Argo is involved, but they configure their own special metrics, and these metrics can be quite volatile, so a template would not be much help. After applying both manifests, the tool waits for updates to the AnalysisRun and decides whether to roll back based on its terminal status (or rolls back on timeout). Basically, we are relying on Argo's versatility in contacting different metrics endpoints and on its periodic metrics collection for this use case. Not super sure if other people have also tried to use Argo in similar ways.
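The wait-then-decide behavior described above might look roughly like the following. This is a minimal sketch, not the internal tool: the `decide` function, the `decision` values, and the 30-minute `deployTimeout` are hypothetical, and treating every non-successful terminal phase as a rollback is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// decision is a hypothetical result type for this sketch.
type decision string

const (
	proceed  decision = "proceed"
	rollback decision = "rollback"
	wait     decision = "wait"
)

// deployTimeout is a made-up value for illustration.
const deployTimeout = 30 * time.Minute

// decide maps the observed AnalysisRun phase and the elapsed time to an action.
// Assumption: any terminal phase other than "Successful" triggers a rollback.
func decide(phase string, elapsed, timeout time.Duration) decision {
	switch phase {
	case "Successful":
		return proceed
	case "Failed", "Error", "Inconclusive":
		return rollback
	}
	// The run has not finished yet: roll back only once the timeout is reached.
	if elapsed >= timeout {
		return rollback
	}
	return wait
}

func main() {
	fmt.Println(decide("Running", 5*time.Minute, deployTimeout))
	fmt.Println(decide("Failed", 5*time.Minute, deployTimeout))
	fmt.Println(decide("Running", 45*time.Minute, deployTimeout))
}
```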

Since this tool is essentially just a long-running script with one separate instance per deployment, it has no knowledge of other historical AnalysisRuns. We can definitely let it delete the AnalysisRun after completion, but we would like to keep completed AnalysisRuns around for 1-2 months until we are sure they are safe to delete. We could also create separate cron jobs on each cluster to do the cleanup, but given the number of k8s clusters we maintain and other infra complexity, baking it directly into Argo might be the easiest way, as long as it makes sense as a feature for general Argo use cases.

Also cc @gavin-db for other possible input.
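For readers weighing the cleanup options above, the retention rule (keep completed AnalysisRuns for a while, then delete them) can be sketched as a pure selection function. Everything here is hypothetical: `completedRun` is a simplified stand-in rather than the Argo Rollouts API type, the 60-day `retention` is just one point in the 1-2 month range mentioned, and the set of phases treated as terminal is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// completedRun is a hypothetical, simplified view of an AnalysisRun.
type completedRun struct {
	Name        string
	Phase       string
	CompletedAt time.Time // zero while the run is still in progress
}

// retention is an assumed TTL of 60 days.
const retention = 60 * 24 * time.Hour

// referenceNow is a fixed "current time" so the example is reproducible.
var referenceNow = time.Date(2024, 8, 10, 0, 0, 0, 0, time.UTC)

// isTerminal reports whether a phase is treated as terminal in this sketch.
func isTerminal(phase string) bool {
	switch phase {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	}
	return false
}

// expiredRuns returns the names of terminal runs that completed at least ttl ago.
func expiredRuns(runs []completedRun, ttl time.Duration, now time.Time) []string {
	var names []string
	for _, r := range runs {
		if !isTerminal(r.Phase) || r.CompletedAt.IsZero() {
			continue // never delete a run that may still be making progress
		}
		if now.Sub(r.CompletedAt) >= ttl {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	runs := []completedRun{
		{Name: "old-success", Phase: "Successful", CompletedAt: referenceNow.AddDate(0, 0, -90)},
		{Name: "recent-failure", Phase: "Failed", CompletedAt: referenceNow.AddDate(0, 0, -7)},
		{Name: "still-running", Phase: "Running"},
	}
	fmt.Println(expiredRuns(runs, retention, referenceNow))
}
```

The same selection logic would apply whether it runs in the generating tool, in a per-cluster cron job, or in the controller; only where it lives differs.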

gavin-db commented on June 10, 2024

@zachaller Users can apply AnalysisRuns directly to their cluster using kubecfg apply. This is actually explicitly supported by the API and the Argo docs, which document the following command:

This command creates a new AnalysisRun from an existing AnalysisTemplate resources or from an AnalysisTemplate file.

kubectl argo rollouts create analysisrun [flags]

The use case is performing analyses for non-Rollout resources (e.g. StatefulSets, DaemonSets, etc.). Users can simply trigger an AnalysisRun when deploying StatefulSets/DaemonSets/ConfigMaps/etc. and use the result to gauge the health of their system (and make subsequent decisions). We described this in some detail at ArgoCon last year.

Anytime a user triggers an AnalysisRun in this way, there is no cleanup mechanism today, and Argo Rollouts just leaves the AnalysisRuns in a terminal phase indefinitely (and continues reindexing them every 15 minutes forever, which inevitably kills the controller's performance).

We have a rough proposal for supporting a TTL in #3285. (An alternative would be to change the controller so it does not reindex terminal AnalysisRuns, but that would require a more significant change.) We can contribute it upstream if the interface makes sense.
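The alternative mentioned above, not reprocessing terminal AnalysisRuns on every resync, could in principle be expressed as a filter on informer update events. The sketch below is an assumption about how such a filter might look, not the controller's actual code: `analysisRun` is a hypothetical simplified type, and it relies on the client-go convention that a periodic resync delivers an update whose old and new objects carry the same ResourceVersion.

```go
package main

import "fmt"

// analysisRun is a hypothetical, simplified view of the object an informer delivers.
type analysisRun struct {
	Name            string
	ResourceVersion string
	Phase           string
}

// terminal reports whether a phase is treated as terminal in this sketch.
func terminal(phase string) bool {
	switch phase {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	}
	return false
}

// shouldEnqueue decides whether an update event deserves a reconcile.
// Resync events for terminal runs are dropped; real changes always pass.
func shouldEnqueue(oldRun, newRun analysisRun) bool {
	isResync := oldRun.ResourceVersion == newRun.ResourceVersion
	if isResync && terminal(newRun.Phase) {
		return false
	}
	return true
}

func main() {
	done := analysisRun{Name: "done", ResourceVersion: "42", Phase: "Successful"}
	live := analysisRun{Name: "live", ResourceVersion: "7", Phase: "Running"}

	fmt.Println(shouldEnqueue(done, done)) // resync of a terminal run
	fmt.Println(shouldEnqueue(live, live)) // resync of a live run
}
```

A filter like this would keep completed runs out of the queue without deleting them, which is a different trade-off from the TTL proposal: the objects still accumulate in the cluster and in the informer cache.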
