Comments (7)
I know this might not address the root issue, but could you try bumping the --analysis-threads flag on the controller to something like 60 (--analysis-threads=60)?
I think it defaults to 30. See if that helps at all.
from argo-rollouts.
> I know this might not allude to the root issue but could you try bumping --analysis-threads flag on the controller to say something like 60 (--analysis-threads=60). I think it defaults to 30, and see if that helps at all?
Our args configuration looks like the following:
- args:
  - --analysis-threads
  - "264"
  - --rollout-threads
  - "88"
  - --qps
  - "80"
  - --burst
  - "160"
  - --leader-elect
  - "true"
IMO that should have been a sufficient number of threads.
from argo-rollouts.
It seems that with v1.6.3, reconciling completed AnalysisRuns becomes substantially slower under heavy load. The AnalysisRun informer's resync period (15 min) periodically re-enqueues every completed AnalysisRun for reconciliation, and because the workqueue is FIFO, this repeatedly starves live AnalysisRuns that still need to make progress.
I think this strengthens the case for #3285. Would you mind checking whether that makes sense? Thanks! @zachaller
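To make the starvation argument above concrete, here is a rough back-of-the-envelope model in Python. It is not the controller's actual code: it assumes a single FIFO queue, a fixed per-item reconcile time, and workers that drain the queue in lockstep rounds. The function name and all numbers other than the 264 analysis threads quoted earlier in the thread are made up for illustration.

```python
from collections import deque


def seconds_until_live_run_reconciled(completed_runs, workers, seconds_per_reconcile):
    """Estimate how long a live AnalysisRun waits when it is enqueued right
    after a resync has put every completed AnalysisRun ahead of it.

    Simplifying assumptions (hypothetical model, not Argo Rollouts code):
    one FIFO queue, `workers` items processed in parallel per round, and
    every reconcile taking exactly `seconds_per_reconcile`.
    """
    queue = deque(["completed"] * completed_runs + ["live"])
    rounds = 0
    while queue:
        # Dequeue up to `workers` items from the front, in FIFO order.
        batch = [queue.popleft() for _ in range(min(workers, len(queue)))]
        rounds += 1
        if "live" in batch:
            return rounds * seconds_per_reconcile
    raise RuntimeError("unreachable: the live run is always in the queue")
```

Under these assumptions, 26,400 completed runs, 264 workers and 0.5 s per reconcile put the live run in round 101, so it is reconciled after roughly 50 s. If reconciling a completed run slows down by 10x, the same backlog delays the live run by roughly 500 s, and the backlog is rebuilt at every resync.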
from argo-rollouts.
I do think your analysis of the issue makes sense. I don't know if I am sold just yet on rollouts managing the TTL; I don't think the original design of analysis runs was meant to be used outside of a rollout. I'll have to see and think about what a proposal for that would look like.
from argo-rollouts.
How are you guys creating your analysis runs, and what do those specs look like? Does it make sense for that tool to manage the cleanup?
from argo-rollouts.
> How are you guys creating your analysis runs, what do those specs look like, does it make sense for that tool to manage the cleanup?
We have an internal tool that directly generates an AnalysisRun manifest (without a template) from a set of configurations, along with the manifest for the type of resource the user requested (e.g. StatefulSet). Internal users are not really aware that Argo is involved, but they configure their own special metrics, and these metrics can be quite volatile, so a template would not be much help. After applying both manifests, the tool waits for updates to the AnalysisRun and decides whether to roll back based on its terminal status (or rolls back on timeout). Basically, we rely on Argo's versatility in contacting different metrics endpoints and its periodic metrics collection for this use case. I'm not sure whether other people have tried to use Argo in similar ways.
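The wait-then-decide step described above could look roughly like the following Python sketch. The internal tool is not shown in this thread, so the function name, the return values, and the choice to roll back on Error and Inconclusive are assumptions; only the phase names come from the AnalysisRun status.

```python
# Phases in which an AnalysisRun will make no further progress.
TERMINAL_PHASES = {"Successful", "Failed", "Error", "Inconclusive"}


def decide(phase, elapsed_seconds, timeout_seconds):
    """Map one observed AnalysisRun phase to an action.

    Hypothetical policy: only "Successful" lets the deployment proceed;
    every other terminal phase, or running past the timeout, rolls back.
    """
    if phase == "Successful":
        return "proceed"
    if phase in TERMINAL_PHASES:
        return "rollback"
    if elapsed_seconds >= timeout_seconds:
        return "rollback"
    # Still Pending or Running and within the timeout: keep watching.
    return "wait"
```

A caller would invoke `decide` on every AnalysisRun update it receives and stop as soon as the result is not `"wait"`.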
Since this tool is essentially just a long-running script with one separate instance per deployment, it has no knowledge of other historical AnalysisRuns. We could certainly let it delete the AnalysisRun after completion, but we would like to keep completed AnalysisRuns around for 1-2 months until we are sure they are safe to delete. We could also create separate cron jobs on each cluster to do the cleanup, but given the number of k8s clusters we maintain and other infra complexity, baking it directly into Argo might be the easiest way, as long as it makes sense as a feature for general Argo use cases.
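For reference, the selection logic such a per-cluster cleanup job would need is small. The Python sketch below works on AnalysisRun objects already loaded as dicts (for example from a JSON listing) and returns the names that are past a retention window. It is only an illustration: `expired_run_names` is a hypothetical helper, it uses `metadata.creationTimestamp` as a stand-in for the completion time, and it does not perform the deletion itself.

```python
from datetime import datetime, timedelta, timezone

# Phases in which an AnalysisRun will make no further progress.
TERMINAL_PHASES = {"Successful", "Failed", "Error", "Inconclusive"}


def expired_run_names(runs, ttl, now):
    """Return names of terminal AnalysisRuns older than `ttl`.

    Assumption: age is measured from metadata.creationTimestamp (RFC 3339,
    UTC, e.g. "2024-01-01T00:00:00Z"), which only approximates when the run
    actually finished.
    """
    expired = []
    for run in runs:
        if run.get("status", {}).get("phase") not in TERMINAL_PHASES:
            continue  # never touch runs that are still pending or running
        created = datetime.strptime(
            run["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > ttl:
            expired.append(run["metadata"]["name"])
    return expired
```

With a retention of `timedelta(days=60)` this matches the "keep for 1-2 months" requirement; the returned names would then be passed to whatever delete call the job uses.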
Also cc @gavin-db for other possible input.
from argo-rollouts.
@zachaller Users can apply AnalysisRuns directly to their cluster using kubecfg apply. This is actually explicitly supported by the API and Argo docs, which support the following command:
This command creates a new AnalysisRun from an existing AnalysisTemplate resources or from an AnalysisTemplate file.
kubectl argo rollouts create analysisrun [flags]
The use case is performing analyses for non-Rollout resources (e.g. StatefulSets, DaemonSets, etc.). Users can simply trigger an AnalysisRun when deploying StatefulSets/DaemonSets/ConfigMaps/etc. and use the result to gauge the health of their system (and make subsequent decisions). We described this in some detail at ArgoCon last year.
Anytime a user triggers an AnalysisRun in this way, there is no cleanup mechanism today, and Argo Rollouts just leaves the AnalysisRuns in a terminal phase indefinitely (and continues reindexing them every 15 minutes forever, which inevitably kills the controller's performance).
We have a rough proposal for supporting a TTL in #3285. (An alternative would be to change the controller so that it does not reindex terminal AnalysisRuns, but that would require a more significant change.) We can contribute it upstream if the interface makes sense.
from argo-rollouts.