
canonical / training-operator


Kubeflow Training Operator

License: Apache License 2.0

Languages: Python 6.27%, Jinja 93.72%, Shell 0.01%
Topics: kubeflow, charm, charmed-kubeflow, single-charm

training-operator's Introduction

Training Operator

Overview

This repository hosts the Kubernetes Training Operator for Kubeflow training jobs.

Description

The Kubeflow Training Operator provides Kubernetes custom resources to run distributed or non-distributed training jobs, such as TFJobs and PyTorchJobs. The Training Operator in this repository is a Python script that wraps the latest released Kubeflow Training Operator manifests, providing lifecycle management and handling events (install, upgrade, integrate, remove). It is one of the Charmed Kubeflow operators.
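
The lifecycle pattern described above can be summarised in a short sketch. This is a hedged, minimal example and not the actual charm code: handler names, the manifest path (src/manifests.yaml) and the field manager are assumptions, and the real charm does more (status handling, auth resources, CRD checks).

from pathlib import Path

from lightkube import Client, codecs
from ops.charm import CharmBase
from ops.main import main


class TrainingOperatorSketchCharm(CharmBase):
    """Minimal sketch: apply wrapped manifests on install/upgrade, delete them on remove."""

    def __init__(self, *args):
        super().__init__(*args)
        self._client = Client(field_manager="training-operator-sketch")
        self.framework.observe(self.on.install, self._apply_manifests)
        self.framework.observe(self.on.upgrade_charm, self._apply_manifests)
        self.framework.observe(self.on.remove, self._remove_manifests)

    def _resources(self):
        # Hypothetical location of the wrapped upstream manifests.
        return codecs.load_all_yaml(Path("src/manifests.yaml").read_text())

    def _apply_manifests(self, _event):
        for resource in self._resources():
            # Server-side apply creates the resource if missing and patches it otherwise.
            self._client.apply(resource, force=True)

    def _remove_manifests(self, _event):
        for resource in self._resources():
            # Namespace handling omitted for brevity; namespaced resources would need it.
            self._client.delete(type(resource), resource.metadata.name)


if __name__ == "__main__":
    main(TrainingOperatorSketchCharm)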

Usage

While it is possible to deploy the Training Operator as a standalone operator, it works best when deployed alongside other components included in the Kubeflow bundle. For installation steps, please refer to the installation guide.

training-operator's People

Contributors

beliaev-maksim, ca-scribner, colmbhandal, dnplas, dparv, i-chvets, kimwnasptd, knkski, misohu, natalian98, nohaihab, orfeas-k, phoevos, renovate[bot]


training-operator's Issues

Update charm for 1.7 release

Update the training-operator charm for the 1.7 release, using the Contributing guide as a reference.
Work items are tracked in: https://warthogs.atlassian.net/browse/KF-905
Branch: https://github.com/canonical/training-operator/tree/KF-905-update-charm-1.7-release

Checklist:

  • image updated to v1.6.0-rc
  • CRDs updated
  • auth manifests updated
  • tests pass

Integration test notes:

  • No upstream examples yet for the newly added Job Kind PaddleJob
  • Not testing MXNet jobs until we have a CPU MXJob example

Make charm's images configurable in track/<last-version> branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection (a minimal sketch of the charm-side wiring follows this list).
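
To make the list above concrete, here is a minimal, hypothetical sketch (not the repo's actual charm.py) of how a charm could render a configurable image into a resource it applies, so an airgapped deployment can point at a local registry. The config option name (serving-image), the template and the default value are assumptions for illustration only.

from lightkube import Client, codecs
from ops.charm import CharmBase
from ops.main import main

# Hypothetical template for a resource whose image must be configurable.
CONFIGMAP_TEMPLATE = """
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
  namespace: {namespace}
data:
  serving_image: "{serving_image}"
"""


class ImageConfigSketchCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self._client = Client(field_manager="image-config-sketch")
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, _event):
        # Hypothetical option declared in config.yaml; the default points at the
        # upstream registry, while an airgapped deployment overrides it.
        serving_image = self.config.get("serving-image", "tensorflow/serving:2.1.0")
        rendered = CONFIGMAP_TEMPLATE.format(
            namespace=self.model.name, serving_image=serving_image
        )
        for resource in codecs.load_all_yaml(rendered):
            self._client.apply(resource, force=True)


if __name__ == "__main__":
    main(ImageConfigSketchCharm)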

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

Make charm's images configurable in branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

upgrade from 1.5 to 1.6 intermittently fails due to 409 conflict during k8s resource creation

Reproduce by:

juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator will appear in juju status as constantly working and in MaintenanceStatus, and logs will show it repeatedly trying to resolve the pebble-ready event but ending with a 409 conflict error.

Oddly, if we then

juju remove-application training-operator
juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator 1.6 deploys successfully. It is unclear whether this is related to event ordering or to some other inconsistency.
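
Not part of the original report, but for context: 409s during resource patching are often avoided by using server-side apply with force=True, so a new charm revision can take over fields owned by the previous revision's field manager. A minimal lightkube sketch, with an assumed manifest path:

from pathlib import Path

from lightkube import Client, codecs

client = Client(field_manager="training-operator-charm")

for resource in codecs.load_all_yaml(Path("src/manifests.yaml").read_text()):
    # force=True resolves field-ownership conflicts (HTTP 409) in favour of this
    # field manager instead of raising an ApiError.
    client.apply(resource, force=True)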

integration-with-profiles tests failed in CI with "Failed to execute kubectl auth"

integration-with-profiles tests failed in CI with the following error

FAILED tests/integration/test_charm_with_profile.py::test_authorization_for_creating_resources[examples/tfjob.yaml] - AssertionError: Failed to execute kubectl auth (1): no

The error seems intermittent, since rerunning the CI fixed it. I created this issue to document that we have come across this, in case we stumble upon it again.

Reproduce

Not sure how to reproduce since the error seems intermittent.

Logs

Nothing looks off in the CI logs. These are the last workload logs:

2023-09-12T12:10:41.199Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:202 created rotating log file "/var/log/machine-lock.log" with max size 10 MB and max backups 5
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:186 machine lock released for training-operator/0 uniter (run start hook)
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter.operation executor.go:121 lock released for training-operator/0
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter resolver.go:188 no operations in progress; waiting for changes
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] idle:

and these are the last operator logs:

2023-09-12T12:10:14.116Z [training-operator] 2023-09-12T12:10:14Z	INFO	Starting workers	{"controller": "mpijob-controller", "worker count": 1}
2023-09-12T12:10:14.920Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.002027076s 200
2023-09-12T12:10:27.782Z [pebble] GET /v1/plan?format=yaml 3.65407ms 200
2023-09-12T12:10:39.591Z [pebble] GET /v1/plan?format=yaml 133.702µs 200

Make charm's images configurable in track/<last-version> branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

Update `training-operator` manifests

Context

Each charm has a set of manifest files that have to be upgraded to their target version. The process of upgrading manifest files usually means going to the component's upstream repository, comparing the charm's manifest against the one in the repository, and adding the missing bits to the charm's manifest.
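
Purely as an illustration of that comparison step (file names are assumptions; the actual process may just use a visual diff), a small script can list which resources exist upstream but not yet in the charm's manifest:

from pathlib import Path

import yaml


def manifest_keys(path: str):
    """Return the set of (kind, name) pairs defined in a multi-document YAML file."""
    docs = yaml.safe_load_all(Path(path).read_text())
    return {(d["kind"], d["metadata"]["name"]) for d in docs if d}


# Hypothetical file names: the upstream build output vs the charm's wrapped copy.
upstream = manifest_keys("upstream-manifests.yaml")
charm = manifest_keys("src/manifests.yaml")

print("Missing from the charm:", sorted(upstream - charm))
print("Extra in the charm:", sorted(charm - upstream))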

What needs to get done

https://docs.google.com/document/d/1a4obWw98U_Ndx-ZKRoojLf4Cym8tFb_2S7dq5dtRQqs/edit?pli=1#heading=h.jt5e3qx0jypg

Definition of Done

  1. Manifests are updated
  2. Upstream image is used

upgrade tests are flaky

It looks like, when we trigger an upgrade, there is sometimes a race in which the charm has been refreshed but the manifests are still those of the old version.
That causes tests to fail (which is healed by a rerun); see the retry sketch below the traceback.

Traceback (most recent call last):
  File "/home/runner/work/training-operator/training-operator/tests/integration/test_charm.py", line 261, in test_upgrade
    assert (
AssertionError: assert ('tfjobs.kubeflow.org', 'v0.6.0') in [('xgboostjobs.kubeflow.org', 'v0.10.0'), ('tfjobs.kubeflow.org', 'v0.10.0'), ('pytorchjobs.kubeflow.org', 'v0.10.0'), ('mxjobs.kubeflow.org', 'v0.10.0'), ('mpijobs.kubeflow.org', 'v0.10.0'), ('paddlejobs.kubeflow.org', 'v0.10.0')]

see https://github.com/canonical/training-operator/actions/runs/5331136179/jobs/9660014634
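
Not the repo's actual test code, but one possible mitigation is to retry the assertion until the refreshed charm has re-applied the new manifests. A sketch using tenacity, written so the test's existing helper (whatever builds the (name, version) pairs seen in the failing assert) is passed in rather than invented here:

from typing import Callable, Dict

from tenacity import retry, stop_after_delay, wait_fixed


@retry(stop=stop_after_delay(300), wait=wait_fixed(10), reraise=True)
def assert_crds_upgraded(get_crd_versions: Callable[[], Dict[str, str]], expected: str):
    # get_crd_versions is the test suite's own helper for reading CRD versions.
    versions = get_crd_versions()
    assert versions.get("tfjobs.kubeflow.org") == expected

Retrying with a timeout turns the race into, at worst, a slower test instead of a flaky one.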

Default role kubeflow-edit not permitted to create/update pytorchjob

Hello

We have a case (SF #00361974) raised by Xperi where a Profile owner is trying to create pipelines with PyTorchJobs and gets the permission error below with the default ServiceAccount default-editor (bound to the default role kubeflow-edit):

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pytorchjobs.kubeflow.org is forbidden: User \"system:serviceaccount:maria-schumacher:default-editor\" cannot create resource \"pytorchjobs\" in API group \"kubeflow.org\" in the namespace \"maria-schumacher\"",
  "reason": "Forbidden",
  "details": {
    "group": "kubeflow.org",
    "kind": "pytorchjobs"
  },
  "code": 403
}

I checked the default permission charts on kf docs:

https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/user-management/#permissions-charts

and none of the default kubeflow roles have create/update permissions for pytorchjobs.

We can add the permission to the default kubeflow-edit role as a quick fix (using [1]), but I wanted to understand and recommend to them the best practices around RBAC in Kubeflow (a sketch of the quick fix follows at the end of this message).

Should we ask them to edit the existing kubeflow-edit role, considering that the user concerned in this scenario is a Profile owner?

Should we ask them to create a custom role from a copy of the existing kubeflow-edit role and bind this custom role to the user?

References:
[1] https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/user-management/#adding-permissions-for-a-kubeflow-user

Thanks & regards
Kamal Bhaskar
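
For illustration only (an unverified sketch, not a vetted recommendation): the quick fix mentioned above could be expressed as an extra ClusterRole that grants PyTorchJob access and aggregates into kubeflow-edit, which in Kubeflow manifests is typically an aggregated ClusterRole. The role name and the aggregation label key are assumptions that should be checked against the deployed manifests.

from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.models.rbac_v1 import PolicyRule
from lightkube.resources.rbac_authorization_v1 import ClusterRole

# Assumed aggregation label; verify the selector on the kubeflow-edit ClusterRole.
AGGREGATION_LABEL = "rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit"

role = ClusterRole(
    metadata=ObjectMeta(
        name="kubeflow-edit-pytorchjobs",  # hypothetical name
        labels={AGGREGATION_LABEL: "true"},
    ),
    rules=[
        PolicyRule(
            apiGroups=["kubeflow.org"],
            resources=["pytorchjobs"],
            verbs=["create", "get", "list", "watch", "update", "patch", "delete"],
        )
    ],
)

Client(field_manager="rbac-sketch").apply(role, force=True)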

Add logging relation to training-operator charm

Context

Add a logging relation to the training-operator charm using the loki_push_api interface and LogForwarder. Alternatively, LogProxyConsumer could be used if the service logs to a file instead of STDOUT; however, this needs to be discussed, because LogProxyConsumer is marked as deprecated. (A minimal sketch follows at the end of this issue.)

This task is part of the COS integration initiative for all Kubeflow charms.

What needs to get done

  1. Add logging relation using loki_push_api interface
  2. Use LogForwarder or LogProxyConsumer
  3. Use chisme abstraction for testing

Definition of Done

  1. Charm can be related to grafana-agent-k8s via the logging relation
  2. This relation was tested by integration tests
  3. This relation was tested manually with COS deployed
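
A minimal sketch of what this task describes, assuming the standard loki_push_api v1 charm library and a logging relation declared in metadata.yaml; the exact wiring in the real charm may differ.

from charms.loki_k8s.v1.loki_push_api import LogForwarder
from ops.charm import CharmBase
from ops.main import main


class TrainingOperatorCharmWithLogging(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Forwards the workload's STDOUT/STDERR from Pebble over the `logging`
        # relation, e.g. to grafana-agent-k8s.
        self._log_forwarder = LogForwarder(self, relation_name="logging")


if __name__ == "__main__":
    main(TrainingOperatorCharmWithLogging)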

Review Pebble event handler

Description

The Pebble event handler does not perform any specific actions. It can be removed, and the Pebble event can be hooked up to the main event handler.
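
A sketch of the proposed simplification (class and handler names are assumptions, not the charm's actual code): drop the dedicated Pebble handler and route pebble-ready, together with the other lifecycle events, to the single main handler.

from ops.charm import CharmBase
from ops.main import main


class PebbleRoutingSketchCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        for event in (
            self.on.install,
            self.on.config_changed,
            self.on.training_operator_pebble_ready,  # previously had its own handler
        ):
            self.framework.observe(event, self._on_event)

    def _on_event(self, event):
        # Single reconcile-style handler shared by all hooks.
        pass


if __name__ == "__main__":
    main(PebbleRoutingSketchCharm)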

Training-operator fails to create/patch `mpijobs.kubeflow.org` CRD on upgrade

When upgrading training-operator from 1.3/stable to latest/edge, the charm gets into waiting status with the message "waiting for units settled down". The unit is blocked with the message "Patching resources failed with code 404".
Note: 1.3/stable is the training-operator's channel in the 1.4 bundle.

Steps to reproduce

  1. Deploy training-operator from 1.4 bundle:
    juju deploy ch:training-operator --channel 1.3/stable --trust
  2. Refresh (upgrade) the charm to latest 1.6 version:
    juju refresh training-operator --channel latest/edge
    Same behaviour can be observed when a previously deployed training-operator 1.3 is removed from the model and then re-deployed from the latest/edge channel using juju deploy ch:training-operator --channel latest/edge --trust.
    It's observed both in a bundle and when deployed as the only charm in a model.

The following error can be observed:

unit-training-operator-0: 16:29:15 ERROR unit.training-operator/0.juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 176, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/httpx/_models.py", line 736, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/mpijobs.kubeflow.org'
For more information check: https://httpstatuses.com/404

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 190, in _on_config_changed
    self._patch_resource(resource_type="crds")
  File "./src/charm.py", line 114, in _patch_resource
    client.patch(
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/client.py", line 208, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 233, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 184, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 178, in raise_for_status
    raise transform_exception(e)
lightkube.core.exceptions.ApiError: customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found

When upgraded, training-operator doesn't create one of the CRDs, mpijobs.kubeflow.org, on the install event, and no errors are raised. It then tries to patch it on config-changed, but since that CRD doesn't exist, resource patching fails.
The CRD can be created with kubectl. It wasn't present in 1.3.

After re-running the install hook, the charm will get active but mpijobs.kubeflow.org will still not be created:

juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/install; ./dispatch"
# unit gets active
juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
# unit gets blocked again

A workaround is to create that CRD with kubectl and run the install hook. If the config-changed hook runs afterwards, it no longer produces errors and the unit remains active.
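
For context only (not the charm's current code): a patch-or-create approach would tolerate the missing CRD described above, creating it on a 404 instead of failing the config-changed hook. A lightkube sketch:

from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.resources.apiextensions_v1 import CustomResourceDefinition

client = Client(field_manager="training-operator-charm")


def patch_or_create(crd: CustomResourceDefinition):
    try:
        client.patch(CustomResourceDefinition, crd.metadata.name, crd)
    except ApiError as err:
        if err.status.code == 404:
            # A CRD newly added in this release (e.g. mpijobs.kubeflow.org) does
            # not exist yet after an upgrade, so create it instead of patching.
            client.create(crd)
        else:
            raise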

Here are full training-operator logs, both when upgraded and re-deployed in the same model.

As a side note, this log line seems to be produced only for the first CRD it should loop through. So for example, if xgboostjobs.kubeflow.org, tfjobs.kubeflow.org, pytorchjobs.kubeflow.org and mxjobs.kubeflow.org are already in the cluster when training-operator is installed (and CRDs don't get removed on juju remove-application), this log message will only be present for the first found resource:

unit-training-operator-0: 13:35:29 INFO unit.training-operator/0.juju-log xgboostjobs.kubeflow.org CRD already present. It will be used by the operator.

but not for the rest of them. It would also be the case for auth resources if there were more than 1 ClusterRole.

jira task

Make charm's images configurable in track/1.6 branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

TL;DR

Mark the following as done

  • Required changes (in metadata.yaml, config.yaml, src/charm.py)
  • Test on airgap environment
  • Publish to /stable

Required changes

WARNING: No breaking changes should be backported into the track/<version> branch. A breaking change is anything that requires extra steps to refresh from the previous /stable other than just juju refresh. Please avoid these situations at all costs.

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.). An extra airgap-specific check is sketched below.
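
As an extra airgap-specific check (not part of the repo's test suite; the label selector, namespace and registry address are assumptions), the deployed workload pods can be inspected to confirm that every container image is served from the local registry:

from lightkube import Client
from lightkube.resources.core_v1 import Pod

LOCAL_REGISTRY = "172.17.0.2:5000"  # assumption: the airgap registry address


def test_images_come_from_local_registry():
    client = Client()
    pods = client.list(
        Pod,
        namespace="kubeflow",
        labels={"app.kubernetes.io/name": "training-operator"},  # assumed label
    )
    for pod in pods:
        for container in pod.spec.containers:
            assert container.image.startswith(LOCAL_REGISTRY), container.image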

Publishing

After completing the changes and testing, this charm has to be published to its stable risk in Charmhub. For that you must wait for the charm to be published to /edge, which is the revision to be promoted to /stable. Use the workflow dispatch for this (Actions>Release charm to other tracks...>Run workflow).

Suggested changes/backports

training-operator is blocked when deployed as part of the Kubeflow bundle 1.6/stable

When deploying the Kubeflow bundle from 1.6/stable (juju deploy kubeflow --trust --channel 1.6/stable), the training-operator component is stuck in a blocked state, reporting a failure to create K8s resources. Removing and redeploying results in the same issue.

training-operator                                     waiting          1  training-operator        1.5/stable       65  10.152.183.125  no       installing agent
training-operator/0*          blocked      idle       10.1.234.79                     Patching resources failed with code 404.

Related logs:

lightkube.core.exceptions.ApiError: customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found

Complete Juju debug log:
https://pastebin.canonical.com/p/m6rbzmQfCK/
