Comments (6)
For that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised, and then first merge the PR to update the Dockerfile, and then keep working on the new behavior
Maybe even move it to subprojects then so it's more clearly a separate thing
wait for the end of the scale up # 1
wait for the GPU label to appear #2
wait for the GPU Operator
Yep that's definitely what I had in mind
from ci-artifacts.
this image is 100% static, never updated, there's no need to rebuild it for every master GPU Operator test
If we stop always building it, it means that when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect and have to start patching things around to get the repo to use their revision, which might be annoying.
takes 8 minutes to build
With regard to the 8 minutes it saves, and CI performance in general: we do a lot of things in the CI unnecessarily linearly, e.g. "start the image helper build, wait for it to build".
I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.
For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything, we wouldn't even have to think about it: things that can happen in parallel will happen in parallel, and things that block on other things will simply wait for them to complete.
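A minimal sketch of that idea, in Ansible terms (the role/file/namespace names below are hypothetical, not the actual ci-artifacts layout): kick off the slow pieces without blocking, e.g. start the helper-image build in the background while the machineset manifest is applied, and only collect the result at the end.

```yaml
# Sketch: start everything that can run in parallel, do the waiting afterwards.
# The manifest file, BuildConfig and namespace names are made up for illustration.
- name: Apply the GPU machineset manifest (no wait)
  kubernetes.core.k8s:
    state: present
    src: gpu-machineset.yaml

- name: Start the helper image build in the background
  ansible.builtin.command:
    cmd: oc start-build helper-image --wait -n gpu-operator-ci
  async: 1800          # allow up to 30 minutes
  poll: 0              # do not block here
  register: helper_build

# ... other manifests get applied here while the nodes boot and the build runs ...

- name: Collect the helper image build result
  ansible.builtin.async_status:
    jid: "{{ helper_build.ansible_job_id }}"
  register: build_status
  until: build_status.finished
  retries: 60
  delay: 30
```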
from ci-artifacts.
If we stop always building it, it means that when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect and have to start patching things around to get the repo to use their revision, which might be annoying.
For that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised, and then first merge the PR to update the Dockerfile, and then keep working on the new behavior
I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.
Good point, although I'm afraid it might complicate the troubleshooting when things go wrong.
Maybe it could be coded differently: instead of "scale up cluster; wait for the end of the scale up; build the GPU Operator", it could be more focused on the needs (see the sketch after the list below):
deploy NFD (without waiting for it)
scale up cluster (without waiting for it)
build the GPU Operator
wait for the end of the scale up # 1
wait for the GPU label to appear #2
wait for the GPU Operator
(#1 and #2 are kind of redundant, but help troubleshooting)
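As a rough sketch of what the deferred wait steps could look like (an illustration only: the NFD PCI label, DaemonSet name and namespaces are assumptions about common GPU Operator deployments, not necessarily what ci-artifacts uses):

```yaml
# Sketch of the "wait" checkpoints; label and resource names are assumptions.
- name: Wait for the end of the scale up        # 1
  kubernetes.core.k8s_info:
    api_version: machine.openshift.io/v1beta1
    kind: MachineSet
    name: "{{ gpu_machineset_name }}"
    namespace: openshift-machine-api
  register: ms
  until: ms.resources[0].status.readyReplicas | default(0) == ms.resources[0].spec.replicas
  retries: 60
  delay: 30

- name: Wait for the GPU label to appear        # 2
  kubernetes.core.k8s_info:
    kind: Node
    label_selectors:
      - feature.node.kubernetes.io/pci-10de.present=true
  register: gpu_nodes
  until: gpu_nodes.resources | length > 0
  retries: 60
  delay: 15

- name: Wait for the GPU Operator (device plugin DaemonSet ready)
  kubernetes.core.k8s_info:
    kind: DaemonSet
    name: nvidia-device-plugin-daemonset
    namespace: gpu-operator-resources
  register: plugin_ds
  until: >
    plugin_ds.resources | length > 0 and
    plugin_ds.resources[0].status.numberReady | default(0) ==
    plugin_ds.resources[0].status.desiredNumberScheduled | default(-1)
  retries: 60
  delay: 30
```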
from ci-artifacts.
I added "Call hack/must-gather.sh script instead of custom scripts" to the list, once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346
from ci-artifacts.
Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"
- this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
- this image is duplicated in the NFD master test
- it takes 8 minutes to build
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021 23:32:23 +0000 (0:00:00.697) 0:00:08.016 *******
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built
It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. The thing about these statements, though, is that they suggest these images should be defined in a separate repository, to avoid building them on every test. If that were the case, the test would simply import the image at the beginning from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than it does to build and reconcile it, so this would be the best approach for reducing the test run time.
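For instance (a sketch only; the quay.io repository, ImageStream and namespace names are made up), importing a pre-built helper image at the start of the test could be a single ImageStream apply instead of a full BuildConfig run:

```yaml
# Sketch: import a pre-built helper image instead of rebuilding it per test.
# The external registry path and namespace below are hypothetical.
- name: Import the helper image from its external registry
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: image.openshift.io/v1
      kind: ImageStream
      metadata:
        name: ci-artifacts-helper
        namespace: gpu-operator-ci
      spec:
        lookupPolicy:
          local: true
        tags:
          - name: latest
            from:
              kind: DockerImage
              name: quay.io/example-org/ci-artifacts-helper:latest
            importPolicy:
              scheduled: true    # re-import periodically, like OCP integration streams
```

With something like this, the image is only rebuilt when its Dockerfile changes (e.g. by quay.io on master merges), and each test run only pays the import cost.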
from ci-artifacts.
One thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow (hence the notion of toolbox: you can reuse all of these tools to set up your cluster and/or test your ongoing development locally). Looking further, this also enables seamless portability to any other CI infrastructure. Only /testing/prow/ is lightly Prow-specific, as it expects a few env vars and some secret files at the right location; and of course the test definitions live in openshift/release.
from ci-artifacts.
Related Issues (20)
- Add the ability to entitle only GPU nodes HOT 2
- set_scale.sh: cannot specify the source machineset HOT 3
- entitlement: using the same content for the entitlement.pem and entitlement-key.pem isn't safe HOT 4
- Learn about ansible TAGS and refactor the roles to use it
- Test OpenShift upgrade with GPU workload running
- Force Ansible "connection: local" for running all the commands HOT 3
- Use Ansible role "template" files instead of custom sed replacement
- documentation: add roles/*/README.md descriptions
- GPU Operator: Prepare a disconnected driver-container POC HOT 1
- Prepare a 'release' checklist and scripts to properly cut `ci-artifacts` releases
- Rewrite the toolbox scripts with a proper CLI parameters handling framework HOT 1
- Include "p3.2xlarge" GPU instance during CI HOT 1
- Ansible-lint tests only modified files
- Create a full weekly suite for the PSAP operators suite HOT 3
- Create a bot to highlight weekly PRs
- GPU Operator: test PROXY configuration
- GPU Operator: run gpu-operator test_operatorhub should not specify the exact operator version
- Generic command for installing operators from OperatorHub HOT 2
- Typos HOT 2