When upgrading training-operator from 1.3/stable to latest/edge, the charm gets into waiting status with message `waiting for units settled down`. The unit is blocked with message `Patching resources failed with code 404`.
Note: 1.3/stable is the training-operator channel in the 1.4 bundle.
Steps to reproduce
- Deploy training-operator from the 1.4 bundle:
  `juju deploy ch:training-operator --channel 1.3/stable --trust`
- Refresh (upgrade) the charm to the latest 1.6 version:
  `juju refresh training-operator --channel latest/edge`
The same behaviour can be observed when a previously deployed training-operator 1.3 is removed from the model and then re-deployed from the latest/edge channel using `juju deploy ch:training-operator --channel latest/edge --trust`.
This is observed both in a bundle and when training-operator is the only charm deployed in a model.
The following error can be observed:
```
unit-training-operator-0: 16:29:15 ERROR unit.training-operator/0.juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 176, in raise_for_status
    resp.raise_for_status()
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/httpx/_models.py", line 736, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions/mpijobs.kubeflow.org'
For more information check: https://httpstatuses.com/404

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 190, in _on_config_changed
    self._patch_resource(resource_type="crds")
  File "./src/charm.py", line 114, in _patch_resource
    client.patch(
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/client.py", line 208, in patch
    return self._client.request("patch", res=res, name=name, namespace=namespace, obj=obj,
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 233, in request
    return self.handle_response(method, resp, br)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 184, in handle_response
    self.raise_for_status(resp)
  File "/var/lib/juju/agents/unit-training-operator-0/charm/venv/lightkube/core/generic_client.py", line 178, in raise_for_status
    raise transform_exception(e)
lightkube.core.exceptions.ApiError: customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found
```
On upgrade, training-operator fails to create one of the CRDs, `mpijobs.kubeflow.org`, during the install event, without raising any errors. It then tries to patch that CRD on config-changed, and since the CRD doesn't exist, the patch fails with a 404.
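This failure mode suggests the patch step assumes the CRD already exists. A defensive pattern is to fall back to creating the resource when the patch returns 404. The sketch below is a minimal illustration of that pattern with an in-memory stub; `StubClient`, its methods, and this `ApiError` are simplified stand-ins for the lightkube client used by the charm, not the charm's actual code:

```python
class ApiError(Exception):
    """Simplified stand-in for lightkube.core.exceptions.ApiError."""

    def __init__(self, status_code):
        self.status_code = status_code
        super().__init__(f"HTTP {status_code}")


class StubClient:
    """Toy in-memory client; the real charm talks to the K8s API via lightkube."""

    def __init__(self, existing):
        self.store = dict(existing)

    def patch(self, name, obj):
        if name not in self.store:
            # Patching a missing CRD fails with 404, as in the traceback above.
            raise ApiError(404)
        self.store[name].update(obj)

    def create(self, name, obj):
        self.store[name] = dict(obj)


def patch_or_create(client, name, obj):
    """Patch the resource; if it does not exist yet, create it instead."""
    try:
        client.patch(name, obj)
        return "patched"
    except ApiError as err:
        if err.status_code != 404:
            raise
        client.create(name, obj)
        return "created"


client = StubClient({"tfjobs.kubeflow.org": {"spec": "v1"}})
print(patch_or_create(client, "tfjobs.kubeflow.org", {"spec": "v2"}))   # patched
print(patch_or_create(client, "mpijobs.kubeflow.org", {"spec": "v1"}))  # created
```

With this shape, a config-changed hook that runs before the CRD exists would create it instead of blocking the unit.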
The CRD can be created manually with kubectl; it wasn't present in 1.3.
After re-running the install hook, the charm becomes active, but `mpijobs.kubeflow.org` is still not created:
```
juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/install; ./dispatch"
# unit becomes active
juju run --unit training-operator/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
# unit gets blocked again
```
A workaround is to create the missing CRD with kubectl and then run the install hook. If the config-changed hook runs afterwards, it no longer produces errors and the unit remains active.
Attached are the full training-operator logs, both for the upgrade case and for re-deployment in the same model.
As a side note, the "CRD already present" log line seems to be produced only for the first CRD the charm loops through. For example, if `xgboostjobs.kubeflow.org`, `tfjobs.kubeflow.org`, `pytorchjobs.kubeflow.org` and `mxjobs.kubeflow.org` are already in the cluster when training-operator is installed (CRDs don't get removed on `juju remove-application`), this message is logged only for the first resource found:
```
unit-training-operator-0: 13:35:29 INFO unit.training-operator/0.juju-log xgboostjobs.kubeflow.org CRD already present. It will be used by the operator.
```
but not for the rest of them. The same would apply to the auth resources if there were more than one ClusterRole.
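One way such "only the first hit is logged" behaviour can arise is an early exit from the loop after the first pre-existing resource. The sketch below is hypothetical (it is not the charm's actual code) and just contrasts an early-exit loop with one that logs every pre-existing CRD:

```python
MSG = "{} CRD already present. It will be used by the operator."


def log_existing_broken(existing, wanted, log):
    """Hypothetical buggy version: stops after the first pre-existing CRD."""
    for name in wanted:
        if name in existing:
            log.append(MSG.format(name))
            break  # early exit: remaining pre-existing CRDs are never logged


def log_existing_fixed(existing, wanted, log):
    """Logs every wanted CRD that is already present in the cluster."""
    for name in wanted:
        if name in existing:
            log.append(MSG.format(name))


existing = {"xgboostjobs.kubeflow.org", "tfjobs.kubeflow.org",
            "pytorchjobs.kubeflow.org", "mxjobs.kubeflow.org"}
wanted = ["xgboostjobs.kubeflow.org", "tfjobs.kubeflow.org",
          "pytorchjobs.kubeflow.org", "mxjobs.kubeflow.org",
          "mpijobs.kubeflow.org"]

broken, fixed = [], []
log_existing_broken(existing, wanted, broken)
log_existing_fixed(existing, wanted, fixed)
print(len(broken), len(fixed))  # 1 4
```

Under this assumption, the observed log output (one "already present" line for four pre-existing CRDs) matches the broken variant.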
jira task