
Comments (6)

omertuc commented on August 12, 2024

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised,
and then first merge the PR to update the Dockerfile, and then keep working on the new behavior

Maybe even move it to subprojects then so it's more clearly a separate thing

wait for the end of the scale-up #1
wait for the GPU label to appear #2
wait for the GPU Operator

Yep that's definitely what I had in mind


omertuc commented on August 12, 2024

this image is 100% static, never updated, there's no need to rebuild it for every master GPU Operator test

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they'll have to start patching things around to get the repo to use their revision, which might be annoying.

takes 8 minutes to build

Regarding the 8 minutes it saves, and CI performance in general: we do a lot of things in the CI unnecessarily linearly, e.g. "start the helper image build, then wait for it to build":

- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything we wouldn't even have to think about it - things that can happen in parallel will happen in parallel, things that block on other things will simply wait for them to complete.
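
As a rough sketch of that idea (purely illustrative: the manifest paths and the artifact_dir variable are hypothetical, only the wait task mirrors the existing one), the playbook would apply everything first and gather the waits at the end:

- name: Apply all the manifests up front (machineset scale-up, helper image build, NFD, ...)
  command: oc apply -f "{{ item }}"
  loop:
    - "{{ artifact_dir }}/machineset.yaml"
    - "{{ artifact_dir }}/helper-image-buildconfig.yaml"
    - "{{ artifact_dir }}/nfd-subscription.yaml"

# the build now overlaps with the machineset scale-up instead of adding to it
- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

The builds, the scale-up and the operator deployments all reconcile on their own, so the only serialization left is in the wait tasks.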


kpouget commented on August 12, 2024

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they'll have to start patching things around to get the repo to use their revision, which might be annoying.

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised,
and then first merge the PR to update the Dockerfile, and then keep working on the new behavior

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

good point, although I'm afraid it might complicate the troubleshooting when things go wrong

so maybe it could be coded differently: instead of "scale up the cluster; wait for the end of the scale-up; build the GPU Operator",
it could be more focused on the needs:

deploy NFD (without waiting for it)
scale up cluster (without waiting for it)

build the GPU Operator

wait for the end of the scale-up #1
wait for the GPU label to appear #2
wait for the GPU Operator

(#1 and #2 are kind of redundant, but they help with troubleshooting)
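
A rough sketch of what that ordering could look like as a playbook (the role names below are made up for illustration, they don't exist under these names in ci-artifacts):

roles:
  # kick off the long-running pieces without blocking on them
  - role: nfd_deploy                 # deploy NFD (don't wait)
  - role: cluster_scale_up           # scale up the cluster (don't wait)
  - role: gpu_operator_build         # build the GPU Operator

  # then block, once everything is already in flight
  - role: wait_machineset_scale_up   # 1 wait for the end of the scale-up
  - role: wait_gpu_node_label        # 2 wait for the GPU label to appear
  - role: wait_gpu_operator_ready    #   wait for the GPU Operator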


kpouget commented on August 12, 2024

I added

  • Call hack/must-gather.sh script instead of custom scripts

to the list, once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346


lranjbar commented on August 12, 2024
  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated, there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in NFD master test
    • takes 8 minutes to build
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. The thing about these statements, though, is that they suggest these images should be defined in a separate repository to avoid building them on every test. If that were the case, the test would simply import the image at the beginning from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than to build and reconcile it, so this would be the best approach to reducing the test run time.
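
For instance (the image name, tag and quay.io path below are hypothetical), importing such a pre-built image at the start of the test would boil down to a single task instead of a BuildConfig plus a wait loop:

- name: Import the pre-built helper image from quay.io instead of building it in-cluster
  command:
    oc import-image ci-utils-helper:latest
       --from=quay.io/openshift-psap/ci-artifacts-helper:latest
       --confirm
       -n gpu-operator-ci-utils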


kpouget commented on August 12, 2024

one thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow (hence the notion of the toolbox: you can reuse all of these tools to set up your cluster and/or test your ongoing development locally).

Looking further, this also enables seamless portability to any other CI infrastructure. Only /testing/prow/ is lightly Prow-specific, as it expects a few env vars and some secret files at the right location; and of course the test definitions live in openshift/release.

