
Comments (6)

omertuc commented on August 12, 2024

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised,
and then first merge the PR to update the Dockerfile, and then keep working on the new behavior

Maybe even move it to subprojects then so it's more clearly a separate thing

wait for the end of the scale-up #1
wait for the GPU label to appear #2
wait for the GPU Operator

Yep that's definitely what I had in mind


omertuc commented on August 12, 2024

this image is 100% static, never updated, there's no need to rebuild it for every master GPU Operator test

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they'll have to start patching things around to get the repo to use their revision, which might be annoying.

takes 8 minutes to build

Regarding the 8 minutes it saves, and CI performance in general: we do a lot of things in the CI unnecessarily linearly, e.g. "start the helper image build, then wait for it to build":

- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

For example, that image could easily be built while the machineset is scaling. If we just apply all the manifests and rely on the eventual reconciliation of everything we wouldn't even have to think about it - things that can happen in parallel will happen in parallel, things that block on other things will simply wait for them to complete.
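
As a rough sketch of that idea (purely illustrative: the manifest paths and the artifact_dir variable are hypothetical, only the wait task mirrors the existing one), the playbook would apply everything first and gather the waits at the end:

- name: Apply all the manifests up front (machineset scale-up, helper image build, NFD, ...)
  command: oc apply -f "{{ item }}"
  loop:
    - "{{ artifact_dir }}/machineset.yaml"
    - "{{ artifact_dir }}/helper-image-buildconfig.yaml"
    - "{{ artifact_dir }}/nfd-subscription.yaml"

# the build now overlaps with the machineset scale-up instead of adding to it
- name: Wait for the helper image to be built
  command:
    oc get pod -lopenshift.io/build.name
       -ocustom-columns=phase:status.phase
       --no-headers
       -n gpu-operator-ci-utils
  register: wait_helper_image
  until: "'Succeeded' in wait_helper_image.stdout or 'Failed' in wait_helper_image.stdout or 'Error' in wait_helper_image.stdout"
  retries: 40
  delay: 30

The builds, the scale-up and the operator deployments all reconcile on their own, so the only serialization left is in the wait tasks.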


kpouget commented on August 12, 2024

If we stop always building it, then when someone works on the repo and actually tries to change it, they might be surprised that their changes have no effect, and they'll have to start patching things around to get the repo to use their revision, which might be annoying.

for that, I was thinking of putting a big bold warning at the beginning of the file, so that people don't get surprised,
and then first merge the PR to update the Dockerfile, and then keep working on the new behavior

I think if we shift the approach to "apply as many manifests as you possibly can for everything", then separately do the waiting after-the-fact, we would get much more significant performance improvements that would make optimizations such as getting rid of the helper image pretty insignificant.

good point, although I'm afraid it might complicate the troubleshooting when things go wrong

so maybe it could be coded differently: instead of "scale up the cluster; wait for the end of the scale-up; build the GPU Operator",
it could be more focused on the needs:

deploy NFD (without waiting for it)
scale up cluster (without waiting for it)

build the GPU Operator

wait for the end of the scale-up #1
wait for the GPU label to appear #2
wait for the GPU Operator

(#1 and #2 are kind of redundant, but they help with troubleshooting)
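
A rough sketch of what that ordering could look like as a playbook (the role names below are made up for illustration, they don't exist under these names in ci-artifacts):

roles:
  # kick off the long-running pieces without blocking on them
  - role: nfd_deploy                 # deploy NFD (don't wait)
  - role: cluster_scale_up           # scale up the cluster (don't wait)
  - role: gpu_operator_build         # build the GPU Operator

  # then block, once everything is already in flight
  - role: wait_machineset_scale_up   # 1 wait for the end of the scale-up
  - role: wait_gpu_node_label        # 2 wait for the GPU label to appear
  - role: wait_gpu_operator_ready    #   wait for the GPU Operator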


kpouget commented on August 12, 2024

I added

  • Call hack/must-gather.sh script instead of custom scripts

to the list, once this PR is merged: https://gitlab.com/nvidia/kuberetes/gpu-operator/-/merge_requests/346


lranjbar commented on August 12, 2024
  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated, there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in NFD master test
    • takes 8 minutes to build
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built

It would be simple enough to turn these ImageStreams into Dockerfiles and use the built-in image build on them. The thing about these statements, though, is that they suggest these images should be defined in a separate repository to avoid building them on every test. If that were the case, the test would simply import the image at the beginning from an integration stream, similar to how OCP images are automatically imported from their integration streams. It also takes a lot less time to import an image than to build and reconcile it, so this would be the best approach to reducing the test run time.
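
For instance (the image name, tag and quay.io path below are hypothetical), importing such a pre-built image at the start of the test would boil down to a single task instead of a BuildConfig plus a wait loop:

- name: Import the pre-built helper image from quay.io instead of building it in-cluster
  command:
    oc import-image ci-utils-helper:latest
       --from=quay.io/openshift-psap/ci-artifacts-helper:latest
       --confirm
       -n gpu-operator-ci-utils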


kpouget commented on August 12, 2024

one thing I realize I forgot to stress about the current workflow is that it works the same way in your local cluster as in Prow (hence the notion of the toolbox: you can reuse all of these tools to set up your cluster and/or test your ongoing development locally).

Looking further, this also enables seamless portability to any other CI infrastructure. Only /testing/prow/ is lightly Prow-specific, as it expects a few env vars and some secret files at the right location; and of course the test definitions live in openshift/release.

