
canonical / training-operator


Kubeflow Training Operator

License: Apache License 2.0

Languages: Python 6.27%, Jinja 93.72%, Shell 0.01%
Topics: kubeflow, charm, charmed-kubeflow, single-charm

training-operator's Introduction

Training Operator

Overview

This repository hosts the Kubernetes Training Operator for Kubeflow training jobs.

Description

The Kubeflow Training Operator provides Kubernetes custom resources to run distributed or non-distributed training jobs, such as TFJobs and PyTorchJobs. The Training Operator in this repository is a Python script that wraps the latest released Kubeflow Training Operator manifests, providing lifecycle management and handling events (install, upgrade, integrate, remove). It is one of the Charmed Kubeflow operators.
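
The lifecycle pattern described above can be summarised in a short sketch. This is a hedged, minimal example and not the actual charm code: handler names, the manifest path (src/manifests.yaml) and the field manager are assumptions, and the real charm does more (status handling, auth resources, CRD checks).

from pathlib import Path

from lightkube import Client, codecs
from ops.charm import CharmBase
from ops.main import main


class TrainingOperatorSketchCharm(CharmBase):
    """Minimal sketch: apply wrapped manifests on install/upgrade, delete them on remove."""

    def __init__(self, *args):
        super().__init__(*args)
        self._client = Client(field_manager="training-operator-sketch")
        self.framework.observe(self.on.install, self._apply_manifests)
        self.framework.observe(self.on.upgrade_charm, self._apply_manifests)
        self.framework.observe(self.on.remove, self._remove_manifests)

    def _resources(self):
        # Hypothetical location of the wrapped upstream manifests.
        return codecs.load_all_yaml(Path("src/manifests.yaml").read_text())

    def _apply_manifests(self, _event):
        for resource in self._resources():
            # Server-side apply creates the resource if missing and patches it otherwise.
            self._client.apply(resource, force=True)

    def _remove_manifests(self, _event):
        for resource in self._resources():
            # Namespace handling omitted for brevity; namespaced resources would need it.
            self._client.delete(type(resource), resource.metadata.name)


if __name__ == "__main__":
    main(TrainingOperatorSketchCharm)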

Usage

While it is possible to deploy the Training Operator as a standalone operator, it works best when deployed alongside other components included in the Kubeflow bundle. For installation steps, please refer to the installation guide.

training-operator's People

Contributors

beliaev-maksim, ca-scribner, colmbhandal, dnplas, dparv, i-chvets, kimwnasptd, knkski, misohu, natalian98, nohaihab, orfeas-k, phoevos, renovate[bot]


training-operator's Issues

Update charm for 1.7 release

Update the training-operator charm for the 1.7 release, using the Contributing guide as a reference.
Work items are tracked in: https://warthogs.atlassian.net/browse/KF-905
Branch: https://github.com/canonical/training-operator/tree/KF-905-update-charm-1.7-release

Checklist:

  • image updated to v1.6.0-rc
  • CRDs updated
  • auth manifests updated
  • tests pass

Integration test notes:

  • No upstream examples yet for the newly added Job Kind PaddleJob
  • Not testing MXNet jobs until we have a CPU MXJob example

Make charm's images configurable in track/<last-version> branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection (a minimal sketch of the charm-side wiring follows this list).
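
To make the list above concrete, here is a minimal, hypothetical sketch (not the repo's actual charm.py) of how a charm could render a configurable image into a resource it applies, so an airgapped deployment can point at a local registry. The config option name (serving-image), the template and the default value are assumptions for illustration only.

from lightkube import Client, codecs
from ops.charm import CharmBase
from ops.main import main

# Hypothetical template for a resource whose image must be configurable.
CONFIGMAP_TEMPLATE = """
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
  namespace: {namespace}
data:
  serving_image: "{serving_image}"
"""


class ImageConfigSketchCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self._client = Client(field_manager="image-config-sketch")
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, _event):
        # Hypothetical option declared in config.yaml; the default points at the
        # upstream registry, while an airgapped deployment overrides it.
        serving_image = self.config.get("serving-image", "tensorflow/serving:2.1.0")
        rendered = CONFIGMAP_TEMPLATE.format(
            namespace=self.model.name, serving_image=serving_image
        )
        for resource in codecs.load_all_yaml(rendered):
            self._client.apply(resource, force=True)


if __name__ == "__main__":
    main(ImageConfigSketchCharm)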

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

Make charm's images configurable in branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

upgrade from 1.5 to 1.6 intermittently fails due to 409 conflict during k8s resource creation

Reproduce by:

juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator will appear in juju status as constantly working and in MaintenanceStatus, and logs will show it repeatedly trying to resolve the pebble-ready event but ending with a 409 conflict error.

Oddly, if we then

juju remove-application training-operator
juju deploy training-operator --channel 1.5/stable --trust
# wait to settle
juju refresh training-operator --channel 1.6/stable --trust

training-operator 1.6 deploys successfully. It is unclear whether this is related to event ordering or to some other inconsistency.
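
Not part of the original report, but for context: 409s during resource patching are often avoided by using server-side apply with force=True, so a new charm revision can take over fields owned by the previous revision's field manager. A minimal lightkube sketch, with an assumed manifest path:

from pathlib import Path

from lightkube import Client, codecs

client = Client(field_manager="training-operator-charm")

for resource in codecs.load_all_yaml(Path("src/manifests.yaml").read_text()):
    # force=True resolves field-ownership conflicts (HTTP 409) in favour of this
    # field manager instead of raising an ApiError.
    client.apply(resource, force=True)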

integration-with-profiles tests failed in CI with "Failed to execute kubectl auth"

integration-with-profiles tests failed in CI with the following error

FAILED tests/integration/test_charm_with_profile.py::test_authorization_for_creating_resources[examples/tfjob.yaml] - AssertionError: Failed to execute kubectl auth (1): no

The error seems intermittent, since rerunning the CI fixed it. I created this issue to document that we have come across this, in case we stumble upon it again.

Reproduce

Not sure how to reproduce since the error seems intermittent.

Logs

Nothing looks off in the CI logs. These are the last workload logs:

2023-09-12T12:10:41.199Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:202 created rotating log file "/var/log/machine-lock.log" with max size 10 MB and max backups 5
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:186 machine lock released for training-operator/0 uniter (run start hook)
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter.operation executor.go:121 lock released for training-operator/0
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter resolver.go:188 no operations in progress; waiting for changes
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] idle:

and these are the last operator logs:

2023-09-12T12:10:14.116Z [training-operator] 2023-09-12T12:10:14Z	INFO	Starting workers	{"controller": "mpijob-controller", "worker count": 1}
2023-09-12T12:10:14.920Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.002027076s 200
2023-09-12T12:10:27.782Z [pebble] GET /v1/plan?format=yaml 3.65407ms 200
2023-09-12T12:10:39.591Z [pebble] GET /v1/plan?format=yaml 133.702µs 200

Make charm's images configurable in track/<last-version> branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

Required changes

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.).

Update `training-operator` manifests

Context

Each charm has a set of manifest files that have to be upgraded to their target version. The process of upgrading manifest files usually means going to the component's upstream repository, comparing the charm's manifest against the one in the repository, and adding the missing bits to the charm's manifest.
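
Purely as an illustration of that comparison step (file names are assumptions; the actual process may just use a visual diff), a small script can list which resources exist upstream but not yet in the charm's manifest:

from pathlib import Path

import yaml


def manifest_keys(path: str):
    """Return the set of (kind, name) pairs defined in a multi-document YAML file."""
    docs = yaml.safe_load_all(Path(path).read_text())
    return {(d["kind"], d["metadata"]["name"]) for d in docs if d}


# Hypothetical file names: the upstream build output vs the charm's wrapped copy.
upstream = manifest_keys("upstream-manifests.yaml")
charm = manifest_keys("src/manifests.yaml")

print("Missing from the charm:", sorted(upstream - charm))
print("Extra in the charm:", sorted(charm - upstream))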

What needs to get done

https://docs.google.com/document/d/1a4obWw98U_Ndx-ZKRoojLf4Cym8tFb_2S7dq5dtRQqs/edit?pli=1#heading=h.jt5e3qx0jypg

Definition of Done

  1. Manifests are updated
  2. Upstream image is used

upgrade tests are flaky

It looks like, when we trigger an upgrade, there is sometimes a race in which the charm has been refreshed but the manifests are still those of the old version.
That causes tests to fail (which is healed by a rerun); see the retry sketch below the traceback.

Traceback (most recent call last):
  File "/home/runner/work/training-operator/training-operator/tests/integration/test_charm.py", line 261, in test_upgrade
    assert (
AssertionError: assert ('tfjobs.kubeflow.org', 'v0.6.0') in [('xgboostjobs.kubeflow.org', 'v0.10.0'), ('tfjobs.kubeflow.org', 'v0.10.0'), ('pytorchjobs.kubeflow.org', 'v0.10.0'), ('mxjobs.kubeflow.org', 'v0.10.0'), ('mpijobs.kubeflow.org', 'v0.10.0'), ('paddlejobs.kubeflow.org', 'v0.10.0')]

see https://github.com/canonical/training-operator/actions/runs/5331136179/jobs/9660014634
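
Not the repo's actual test code, but one possible mitigation is to retry the assertion until the refreshed charm has re-applied the new manifests. A sketch using tenacity, written so the test's existing helper (whatever builds the (name, version) pairs seen in the failing assert) is passed in rather than invented here:

from typing import Callable, Dict

from tenacity import retry, stop_after_delay, wait_fixed


@retry(stop=stop_after_delay(300), wait=wait_fixed(10), reraise=True)
def assert_crds_upgraded(get_crd_versions: Callable[[], Dict[str, str]], expected: str):
    # get_crd_versions is the test suite's own helper for reading CRD versions.
    versions = get_crd_versions()
    assert versions.get("tfjobs.kubeflow.org") == expected

Retrying with a timeout turns the race into, at worst, a slower test instead of a flaky one.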

Default role kubeflow-edit not permitted to create/update pytorchjob

Hello

We have a case (SF #00361974) raised by Xperi where a Profile owner is trying to create pipelines with PyTorchJobs and gets the permission error below with the default ServiceAccount default-editor (bound to the default role kubeflow-edit):

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pytorchjobs.kubeflow.org is forbidden: User \"system:serviceaccount:maria-schumacher:default-editor\" cannot create resource \"pytorchjobs\" in API group \"kubeflow.org\" in the namespace \"maria-schumacher\"",
  "reason": "Forbidden",
  "details": {
    "group": "kubeflow.org",
    "kind": "pytorchjobs"
  },
  "code": 403
}

I checked the default permission charts on kf docs:

https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/user-management/#permissions-charts

and none of the default kubeflow roles have create/update permissions for pytorchjobs.

We can add the permission to the default kubeflow-edit role as a quick fix (using [1]), but I wanted to understand and recommend to them the best practices around RBAC in Kubeflow (a sketch of the quick fix follows at the end of this message).

Should we ask them to edit the existing kubeflow-edit role, considering that the user concerned in this scenario is a Profile owner?

Should we ask them to create a custom role from a copy of the existing kubeflow-edit role and bind this custom role to the user?

References:
[1] https://archive-docs.d2iq.com/dkp/kaptain/1.2.0/user-management/#adding-permissions-for-a-kubeflow-user

Thanks & regards
Kamal Bhaskar
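
For illustration only (an unverified sketch, not a vetted recommendation): the quick fix mentioned above could be expressed as an extra ClusterRole that grants PyTorchJob access and aggregates into kubeflow-edit, which in Kubeflow manifests is typically an aggregated ClusterRole. The role name and the aggregation label key are assumptions that should be checked against the deployed manifests.

from lightkube import Client
from lightkube.models.meta_v1 import ObjectMeta
from lightkube.models.rbac_v1 import PolicyRule
from lightkube.resources.rbac_authorization_v1 import ClusterRole

# Assumed aggregation label; verify the selector on the kubeflow-edit ClusterRole.
AGGREGATION_LABEL = "rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit"

role = ClusterRole(
    metadata=ObjectMeta(
        name="kubeflow-edit-pytorchjobs",  # hypothetical name
        labels={AGGREGATION_LABEL: "true"},
    ),
    rules=[
        PolicyRule(
            apiGroups=["kubeflow.org"],
            resources=["pytorchjobs"],
            verbs=["create", "get", "list", "watch", "update", "patch", "delete"],
        )
    ],
)

Client(field_manager="rbac-sketch").apply(role, force=True)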

Add logging relation to training-operator charm

Context

Add a logging relation to the training-operator charm using the loki_push_api interface and LogForwarder. Alternatively, LogProxyConsumer could be used if the service logs to a file instead of STDOUT; however, this needs to be discussed, because LogProxyConsumer is marked as deprecated. (A minimal sketch follows at the end of this issue.)

This task is part of the COS integration initiative for all Kubeflow charms.

What needs to get done

  1. Add logging relation using loki_push_api interface
  2. Use LogForwarder or LogProxyConsumer
  3. Use chisme abstraction for testing

Definition of Done

  1. Charm can be related to grafana-agent-k8s via the logging relation
  2. This relation was tested by integration tests
  3. This relation was tested manually with COS deployed
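
A minimal sketch of what this task describes, assuming the standard loki_push_api v1 charm library and a logging relation declared in metadata.yaml; the exact wiring in the real charm may differ.

from charms.loki_k8s.v1.loki_push_api import LogForwarder
from ops.charm import CharmBase
from ops.main import main


class TrainingOperatorCharmWithLogging(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Forwards the workload's STDOUT/STDERR from Pebble over the `logging`
        # relation, e.g. to grafana-agent-k8s.
        self._log_forwarder = LogForwarder(self, relation_name="logging")


if __name__ == "__main__":
    main(TrainingOperatorCharmWithLogging)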

Review Pebble event handler

Description

The Pebble event handler does not perform any specific actions. It can be removed, and the Pebble event can be hooked up to the main event handler.
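
A sketch of the proposed simplification (class and handler names are assumptions, not the charm's actual code): drop the dedicated Pebble handler and route pebble-ready, together with the other lifecycle events, to the single main handler.

from ops.charm import CharmBase
from ops.main import main


class PebbleRoutingSketchCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        for event in (
            self.on.install,
            self.on.config_changed,
            self.on.training_operator_pebble_ready,  # previously had its own handler
        ):
            self.framework.observe(event, self._on_event)

    def _on_event(self, event):
        # Single reconcile-style handler shared by all hooks.
        pass


if __name__ == "__main__":
    main(PebbleRoutingSketchCharm)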

Training-operator fails to create/patch `mpijobs.kubeflow.org` CRD on upgrade

When upgrading training-operator from 1.3/stable to latest/edge, the charm gets into waiting status with the message "waiting for units settled down". The unit is blocked with the message "Patching resources failed with code 404".
Note: 1.3/stable is the training-operator's channel in the 1.4 bundle.

Steps to reproduce

  1. Deploy training-operator from 1.4 bundle:
    juju deploy ch:training-operator --channel 1.3/stable --trust
  2. Refresh (upgrade) the charm to latest 1.6 version:
    juju refresh training-operator --channel latest/edge
    Same behaviour can be observed when a previously deployed training-operator 1.3 is removed from the model and then re-deployed from the latest/edge channel using juju deploy ch:training-operator --channel latest/edge --trust.
    It's observed both in a bundle and when deployed as the only charm in a model.

The following error can be observed:

unit-training-operator-0: 16:29:15 ERROR unit.training-operator/0.juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 176, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/httpx/_models.py", line 736, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/mpijobs.kubeflow.org'
For more information check: https://httpstatuses.com/404

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 190, in _on_config_changed
    self._patch_resource(resource_type="crds")
  File "./src/charm.py", line 114, in _patch_resource
    client.patch(
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/client.py", line 208, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 233, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 184, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 178, in raise_for_status
    raise transform_exception(e)
lightkube.core.exceptions.ApiError: customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found

When upgraded, training-operator doesn't create one of the CRDs, mpijobs.kubeflow.org, on the install event, and no errors are raised. It then tries to patch it on config-changed, but since that CRD doesn't exist, resource patching fails.
The CRD can be created with kubectl. It wasn't present in 1.3.

After re-running the install hook, the charm will get active but mpijobs.kubeflow.org will still not be created:

juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/install; ./dispatch"
# unit gets active
juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
# unit gets blocked again

A workaround is to create that CRD with kubectl and run the install hook. If the config-changed hook runs afterwards, it no longer produces errors and the unit remains active.
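
For context only (not the charm's current code): a patch-or-create approach would tolerate the missing CRD described above, creating it on a 404 instead of failing the config-changed hook. A lightkube sketch:

from lightkube import Client
from lightkube.core.exceptions import ApiError
from lightkube.resources.apiextensions_v1 import CustomResourceDefinition

client = Client(field_manager="training-operator-charm")


def patch_or_create(crd: CustomResourceDefinition):
    try:
        client.patch(CustomResourceDefinition, crd.metadata.name, crd)
    except ApiError as err:
        if err.status.code == 404:
            # A CRD newly added in this release (e.g. mpijobs.kubeflow.org) does
            # not exist yet after an upgrade, so create it instead of patching.
            client.create(crd)
        else:
            raise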

Here are full training-operator logs, both when upgraded and re-deployed in the same model.

As a side note, this log line seems to be produced only for the first CRD it should loop through. So for example, if xgboostjobs.kubeflow.org, tfjobs.kubeflow.org, pytorchjobs.kubeflow.org and mxjobs.kubeflow.org are already in the cluster when training-operator is installed (and CRDs don't get removed on juju remove-application), this log message will only be present for the first found resource:

unit-training-operator-0: 13:35:29 INFO unit.training-operator/0.juju-log xgboostjobs.kubeflow.org CRD already present. It will be used by the operator.

but not for the rest of them. It would also be the case for auth resources if there were more than 1 ClusterRole.

jira task

Make charm's images configurable in track/1.6 branch

Description

The goal of this task is to make all images configurable so that when this charm is deployed in an airgapped environment, all image resources are pulled from an arbitrary local container image registry (avoiding pulling images from the internet).
This serves as a tracking issue for the required changes and backports to the latest stable track/* GitHub branch.

TL;DR

Mark the following as done

  • Required changes (in metadata.yaml, config.yaml, src/charm.py)
  • Test on airgap environment
  • Publish to /stable

Required changes

WARNING: No breaking changes should be backported into the track/<version> branch. A breaking change is anything that requires extra steps to refresh from the previous /stable other than just juju refresh. Please avoid these situations at all costs.

The following files have to be modified and/or verified to enable image configuration:

  • metadata.yaml - the container image(s) of the workload containers have to be specified in this file. This only applies to sidecar charms. Example:
containers:
  training-operator:
    resource: training-operator-image
resources:
  training-operator-image:
    type: oci-image
    description: OCI image for training-operator
    upstream-source: kubeflow/training-operator:v1-855e096
  • config.yaml - in case the charm deploys containers that are used by resource(s) the operator creates. Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: {{ namespace }}
data:
  predictor_servers: |-
    {
        "TENSORFLOW_SERVER": {
          "protocols" : {
            "tensorflow": {
              "image": "tensorflow/serving", <--- this image should be configurable
              "defaultImageVersion": "2.1.0"
              },
            "seldon": {
              "image": "seldonio/tfserving-proxy",
              "defaultImageVersion": "1.15.0"
              }
            }
        },
...
  • tools/get-images.sh - a bash script that returns a list of all the images used by this charm. In a multi-charm repo, it is located at the root of the repo and gathers images from all charms in it.

  • src/charm.py - verify that nothing inside the charm code calls a subprocess that requires an internet connection.

Testing

  1. Spin up an airgap environment following canonical/bundle-kubeflow#682 and canonical/bundle-kubeflow#703 (comment)

  2. Build the charm making sure that all the changes for airgap are in place.

  3. Deploy the charms manually and observe the charm go to active and idle.

  4. Additionally, run integration tests or simulate them, for instance by creating a workload (like a PyTorchJob, a SeldonDeployment, etc.). An extra airgap-specific check is sketched below.
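
As an extra airgap-specific check (not part of the repo's test suite; the label selector, namespace and registry address are assumptions), the deployed workload pods can be inspected to confirm that every container image is served from the local registry:

from lightkube import Client
from lightkube.resources.core_v1 import Pod

LOCAL_REGISTRY = "172.17.0.2:5000"  # assumption: the airgap registry address


def test_images_come_from_local_registry():
    client = Client()
    pods = client.list(
        Pod,
        namespace="kubeflow",
        labels={"app.kubernetes.io/name": "training-operator"},  # assumed label
    )
    for pod in pods:
        for container in pod.spec.containers:
            assert container.image.startswith(LOCAL_REGISTRY), container.image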

Publishing

After completing the changes and testing, this charm has to be published to its stable risk in Charmhub. For that you must wait for the charm to be published to /edge, which is the revision to be promoted to /stable. Use the workflow dispatch for this (Actions>Release charm to other tracks...>Run workflow).

Suggested changes/backports

training-operator is blocked when deployed as part of the Kubeflow bundle 1.6/stable

When deploying the Kubeflow bundle from 1.6/stable (juju deploy kubeflow --trust --channel 1.6/stable), the training-operator component is stuck in a blocked state, reporting a failure to create K8s resources. Removing and redeploying results in the same issue.

training-operator                                     waiting          1  training-operator        1.5/stable       65  10.152.183.125  no       installing agent
training-operator/0*          blocked      idle       10.1.234.79                     Patching resources failed with code 404.

Related logs:

lightkube.core.exceptions.ApiError: customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found

Complete Juju debug log:
https://pastebin.canonical.com/p/m6rbzmQfCK/
