ROCKs for Seldon Core
License: Apache License 2.0
A base MLServer ROCK needs to be created and made available alongside the other Seldon ROCKs.
There are multiple MLServer containers in SeldonIO. They are all based on the following Dockerfile (v1.3.5):
https://github.com/SeldonIO/MLServer/blob/1.3.5/Dockerfile
For each server a separate runtime is used, e.g. huggingface, sklearn, etc. When those containers are built, the runtime is specified and a particular version of the Docker image is created. This was mapped to the Rockcraft framework, and each ROCK in this repository includes building/installing the specified runtime.
To build the base mlserver container image, all runtimes are specified, i.e. the all runtime is given when building the image. This builds and installs all runtimes.
To create the base MLServer ROCK, a Rockcraft file needs to build and install all runtimes, and it should be based on this Dockerfile: https://github.com/SeldonIO/MLServer/blob/1.3.5/Dockerfile
All related ROCKs are already tracked in this repository under mlserver-*/ and can be used as sample ROCKs.
Integration tests fail because the ROCK rounds results slightly differently, which results in the following error:
Full diff:
[
('data',
{'names': ['t:0',
't:1',
't:2',
't:3',
't:4',
't:5',
't:6',
't:7',
't:8',
't:9'],
'tensor': {'shape': [1, 10],
'values': [8.49343338e-22,
- 2.85119398e-35,
+ 2.85119369e-35,
0.123584226,
0.0665731356,
1.18265652e-28,
0.809836566,
4.16546084e-13,
1.48641526e-19,
6.06191043e-06,
2.40174282e-20]}}),
('meta', {'requestPath': {'classifier': 'seldonio/tfserving-proxy:1.17.1'}}),
]
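Such last-digit differences come from floating-point reassociation; a tolerance-based comparison (e.g. `math.isclose`) would make the assertion robust to them. A minimal sketch using the two values from the diff above:

```python
import math

# The two renderings of the same probability from the diff above; oneDNN
# reorders floating-point operations, so the last mantissa digits can differ.
upstream = 2.85119398e-35
rock = 2.85119369e-35

assert upstream != rock                            # exact equality fails
assert math.isclose(upstream, rock, rel_tol=1e-6)  # tolerant compare passes
```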
This could be caused by the following warning, which we also observed in #83:
╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config: model_name: model model_base_path: /models/model
...
canonical/rockcraft#476 clarifies how the entrypoint-service's args should be specified in the command. This:
should be updated to be:
command: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"' [ ]
(i.e. [ args ] is changed to [ ], with a space between the brackets). The content between the brackets is interpreted as a default set of args that is passed only when no others are provided, e.g. these two are currently equivalent:
docker run this-image
docker run this-image args
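The bracketed-defaults behaviour can be sketched in Python (hypothetical helper, not Pebble's actual implementation):

```python
def effective_args(given_args, default_args):
    """Pebble-style defaults: the bracketed args in `command` are used only
    when the container is started without any args of its own."""
    return given_args if given_args else default_args

# `command: ... [ ]` declares an empty default set, so with no runtime args
# nothing extra reaches "$@" ...
assert effective_args([], []) == []
# ... and any args given at `docker run` time are forwarded as-is.
assert effective_args(["extra"], []) == ["extra"]
```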
Taking a look at the new MLServer ROCK (PR #40) with @i-chvets, we noticed the following discrepancies with the other MLServer-* ROCKs; these should be investigated and probably fixed.
List of items that need to be addressed:
MLserver-* as part of the update PR.
EDIT: source /hack/activate-env.sh is not needed, since the appropriate files do not exist in the image and they log the following (the same happens in the upstream image too):
Environment tarball not found at '/mnt/models/environment.tar.gz'
Environment not found at './envs/environment'
After omitting the activation of the environment in the ROCK's command due to #70, we noticed that during integration tests of the mlserver-mlflow ROCK, the output of the upstream SeldonDeployment and of the ROCK we use was different.
This can also be observed in changes we've made in the past. When integrating the ROCK for 1.7, we changed the expected output, and we essentially reverted this change in the PR that updated Seldon from 1.15 (using ROCKs) to 1.17 (using upstream images). This difference has also been documented here.
The result of the metric is essentially the same; the differences are in other parameters of the response. Here's the tests' output:
AssertionError: assert [('id', 'None'), ('model_name', 'classifier'), ('model_version', 'v1'), ('outputs', [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'parameters': {'content_type': 'np'}, 'data': [6.016145744177844]}]), ('parameters', {'content_type': 'np'})] == [('id', 'None'), ('model_name', 'classifier'), ('model_version', 'v1'), ('outputs', [{'name': 'predict', 'shape': [1], 'datatype': 'FP64', 'parameters': None, 'data': [6.016145744177844]}]), ('parameters', None)]
At index 3 diff: ('outputs', [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'parameters': {'content_type': 'np'}, 'data': [6.016145744177844]}]) != ('outputs', [{'name': 'predict', 'shape': [1], 'datatype': 'FP64', 'parameters': None, 'data': [6.016145744177844]}])
Full diff:
[
('id', 'None'),
('model_name', 'classifier'),
('model_version', 'v1'),
('outputs',
[{'data': [6.016145744177844],
'datatype': 'FP64',
- 'name': 'predict',
? ^^^^^
+ 'name': 'output-1',
? +++ ^ ++
- 'parameters': None,
+ 'parameters': {'content_type': 'np'},
- 'shape': [1]}]),
+ 'shape': [1, 1]}]),
? +++
- ('parameters', None),
+ ('parameters', {'content_type': 'np'}),
]
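Since only the metadata differs, one option is to compare just the data fields of the two responses; a sketch with a hypothetical helper, not the actual test code:

```python
def output_values(response):
    """Collect just the numeric payloads from a V2 inference response dict,
    ignoring metadata such as names, shapes, and parameters."""
    return [out["data"] for out in response["outputs"]]

rock = {"outputs": [{"name": "output-1", "shape": [1, 1],
                     "parameters": {"content_type": "np"},
                     "data": [6.016145744177844]}]}
upstream = {"outputs": [{"name": "predict", "shape": [1],
                         "parameters": None,
                         "data": [6.016145744177844]}]}

# The predictions themselves agree; only the surrounding metadata differs.
assert output_values(rock) == output_values(upstream) == [[6.016145744177844]]
```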
This could be due to different packages being installed. Keep in mind that we're using a different Python version to run the deployment's .py file (logs from the SeldonDeployment pod):
2023-12-13T10:12:18.469Z [mlserver-mlflow] 2023/12/13 10:12:18 WARNING mlflow.pyfunc: The version of Python that the model was saved in, `Python 3.7.10`, differs from the version of Python that is currently running, `Python 3.8.16`, and may be incompatible
For reference, here are the outputs of pip freeze from the upstream image and from our ROCK, where we can see that most of the packages have been bumped.
upstream-pip-freeze.txt
rock-no-env-pip-freeze.txt
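A small script can highlight the bumped pins between the two attached freezes; the sample strings below are hypothetical stand-ins for the real files:

```python
def freeze_to_dict(text):
    """Parse `pip freeze` output ('pkg==version' lines) into a dict."""
    pairs = (line.split("==", 1) for line in text.splitlines() if "==" in line)
    return {name: version for name, version in pairs}

def version_bumps(upstream_txt, rock_txt):
    """Packages pinned in both freezes but at different versions."""
    up, rock = freeze_to_dict(upstream_txt), freeze_to_dict(rock_txt)
    return {pkg: (up[pkg], rock[pkg])
            for pkg in sorted(up.keys() & rock.keys()) if up[pkg] != rock[pkg]}

# Hypothetical sample content; in practice read the attached .txt files.
upstream = "fastapi==0.88.0\nnumpy==1.21.0\nstarlette==0.22.0"
rock = "fastapi==0.89.1\nnumpy==1.21.0\nstarlette==0.25.0"
print(version_bumps(upstream, rock))
```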
Here are the full logs of the classifier container of the SeldonDeployment's pod:
╰─$ kl mlflow-default-0-classifier-cd8b874b4-48gtr 1 ↵
Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
2023-12-13T10:12:14.621Z [pebble] Started daemon.
2023-12-13T10:12:14.630Z [pebble] POST /v1/services 8.759098ms 202
2023-12-13T10:12:14.631Z [pebble] Started default services with change 1.
2023-12-13T10:12:14.635Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && mlserver start ${MLSERVER_MODELS_DIR}'
2023-12-13T10:12:16.696Z [mlserver-mlflow] 2023-12-13 10:12:16,696 [mlserver.parallel] DEBUG - Starting response processing loop...
2023-12-13T10:12:16.697Z [mlserver-mlflow] /opt/mlserver/.local/lib/python3.8/site-packages/starlette_exporter/middleware.py:97: FutureWarning: group_paths and filter_unhandled_paths will change defaults from False to True in the next release. See https://github.com/stephenhillier/starlette_exporter/issues/79 for more info
2023-12-13T10:12:16.697Z [mlserver-mlflow] warnings.warn(
2023-12-13T10:12:16.697Z [mlserver-mlflow] 2023-12-13 10:12:16,697 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:9000
2023-12-13T10:12:16.720Z [mlserver-mlflow] INFO: Started server process [16]
2023-12-13T10:12:16.720Z [mlserver-mlflow] INFO: Waiting for application startup.
2023-12-13T10:12:16.733Z [mlserver-mlflow] 2023-12-13 10:12:16,733 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:6000
2023-12-13T10:12:16.733Z [mlserver-mlflow] 2023-12-13 10:12:16,733 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:6000/prometheus
2023-12-13T10:12:16.733Z [mlserver-mlflow] INFO: Started server process [16]
2023-12-13T10:12:16.733Z [mlserver-mlflow] INFO: Waiting for application startup.
2023-12-13T10:12:17.786Z [mlserver-mlflow] INFO: Application startup complete.
2023-12-13T10:12:17.787Z [mlserver-mlflow] 2023-12-13 10:12:17,787 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9500
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO: Application startup complete.
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO: Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
2023-12-13T10:12:18.469Z [mlserver-mlflow] 2023/12/13 10:12:18 WARNING mlflow.pyfunc: The version of Python that the model was saved in, `Python 3.7.10`, differs from the version of Python that is currently running, `Python 3.8.16`, and may be incompatible
2023-12-13T10:12:18.744Z [mlserver-mlflow] 2023-12-13 10:12:18,744 [mlserver] INFO - Loaded model 'classifier' succesfully.
2023-12-13T10:12:18.746Z [mlserver-mlflow] 2023-12-13 10:12:18,746 [mlserver] INFO - Loaded model 'classifier' succesfully.
2023-12-13T10:12:28.772Z [mlserver-mlflow] INFO: 192.168.2.3:57036 - "GET /v2/health/ready HTTP/1.1" 200 OK
2023-12-13T10:12:28.773Z [mlserver-mlflow] INFO: 192.168.2.3:57048 - "GET /v2/health/ready HTTP/1.1" 200 OK
2023-12-13T10:12:29.163Z [mlserver-mlflow] INFO: 192.168.2.3:42970 - "POST /v2/models/classifier/infer HTTP/1.1" 200 OK
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
rockcraft.yaml is updated
To make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future
As we can see in the ROCKs integration PR, the tests that use this server fail.
What I've observed so far:
Both the upstream image and the ROCK have (approximately) the same behaviour when doing docker run:
╰─$ docker run tensorflow/serving:2.1.0
2024-01-15 16:03:43.701551: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config: model_name: model model_base_path: /models/model
2024-01-15 16:03:43.701845: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2024-01-15 16:03:43.701855: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: model
2024-01-15 16:03:43.701992: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:362] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model
╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config: model_name: model model_base_path: /models/model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294340: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294352: I tensorflow_serving/model_servers/server_core.cc:594] (Re-)adding model: model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294922: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:353] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model with error NOT_FOUND: /models/model not found
However, when the charm uses the ROCK for the server and we apply the tf-serving or hpt CRs, those SeldonDeployments do not behave as expected. As a result, their tests time out since they cannot extract a prediction.
╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq -c classifier
error: unknown flag `port'
This port arg is passed by the ROCK itself, which is also how it is done upstream:
╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq --all-containers 1 ↵
2024/01/15 16:21:40 NOTICE: Config file "/.rclone.conf" not found - using defaults
2024/01/15 16:21:41 INFO : 00000123/saved_model.pb: Copied (new)
2024/01/15 16:21:42 INFO : 00000123/variables/variables.data-00000-of-00001: Copied (new)
2024/01/15 16:21:42 INFO : 00000123/variables/variables.index: Copied (new)
2024/01/15 16:21:42 INFO : 00000123/assets/foo.txt: Copied (new)
2024/01/15 16:21:42 INFO :
Transferred: 12.058 KiB / 12.058 KiB, 100%, 0 B/s, ETA -
Transferred: 4 / 4, 100%
Elapsed time: 1.6s
error: unknown flag `port'
{"level":"info","ts":1705335761.3749456,"logger":"entrypoint","msg":"Full health checks ","value":false}
{"level":"info","ts":1705335761.3751297,"logger":"entrypoint.maxprocs","msg":"maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined"}
{"level":"info","ts":1705335761.3751352,"logger":"entrypoint","msg":"Hostname unset will use localhost"}
{"level":"info","ts":1705335761.3769522,"logger":"entrypoint","msg":"Starting","worker":1}
{"level":"info","ts":1705335761.3769732,"logger":"entrypoint","msg":"Starting","worker":2}
{"level":"info","ts":1705335761.376975,"logger":"entrypoint","msg":"Starting","worker":3}
{"level":"info","ts":1705335761.3769767,"logger":"entrypoint","msg":"Starting","worker":4}
{"level":"info","ts":1705335761.3769782,"logger":"entrypoint","msg":"Starting","worker":5}
{"level":"info","ts":1705335761.37698,"logger":"entrypoint","msg":"Starting","worker":6}
{"level":"info","ts":1705335761.3769813,"logger":"entrypoint","msg":"Starting","worker":7}
{"level":"info","ts":1705335761.376983,"logger":"entrypoint","msg":"Starting","worker":8}
{"level":"info","ts":1705335761.3769846,"logger":"entrypoint","msg":"Starting","worker":9}
{"level":"info","ts":1705335761.376987,"logger":"entrypoint","msg":"Starting","worker":10}
{"level":"info","ts":1705335761.3774252,"logger":"entrypoint","msg":"Running http server ","port":8000}
{"level":"info","ts":1705335761.3774323,"logger":"entrypoint","msg":"Creating non-TLS listener","port":8000}
{"level":"info","ts":1705335761.3775222,"logger":"entrypoint","msg":"Running grpc server ","port":5001}
{"level":"info","ts":1705335761.377525,"logger":"entrypoint","msg":"Creating non-TLS listener","port":5001}
{"level":"info","ts":1705335761.377585,"logger":"entrypoint","msg":"Setting max message size ","size":2147483647}
{"level":"info","ts":1705335761.3777068,"logger":"entrypoint","msg":"gRPC server started"}
{"level":"info","ts":1705335761.3780322,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"info","ts":1705335761.3780477,"logger":"entrypoint","msg":"http server started"}
{"level":"error","ts":1705335781.3396814,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/[email protected]/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
{"level":"error","ts":1705335782.2400084,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/[email protected]/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
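For context on the `"$@"` forwarding used in the service command above: everything bash -c receives after the script (past the explicit argv[0]) becomes the script's positional parameters, so args appended by the orchestrator reach the server process. A minimal demonstration, using flag names from the logs above purely as sample values:

```python
import subprocess

# `bash -c 'script' name args...`: "name" becomes $0 and the remaining
# args become "$@" inside the script, mirroring the service command.
out = subprocess.run(
    ["bash", "-c", 'printf "%s\\n" "$@"', "bash",
     "--port=8500", "--rest_api_port=8501"],
    capture_output=True, text=True, check=True,
).stdout
print(out)  # the two flags, one per line
```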
Looking at the seldon-core logs, I don't see anything unexpected, but I'm attaching them here for reference:
seldon-core container logs.txt. It logs the following reconciler error, but after that it seems to reconcile without errors:
Failed to update InferenceService status","SeldonDeployment":"default/hpt"
configmap__predictor__tensorflow__tensorflow) or directly in the charm's configmap (TENSORFLOW_SERVER.protocols.tensorflow fields).
tox -e seldon-servers-integration -- --model testing -k tensorflow
or tox -e seldon-servers-integration -- --model testing -k tf-serving
Juju 3.1
Microk8s 1.26
It looks like the tests had passed in the ROCKs repo and the ROCK was published because we hadn't configured the tests properly.
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
rockcraft.yaml is updated
To make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future
Most of the seldonio ROCKs correspond to images that are Inference Runtimes, which let you define how your model should be used within MLServer.
The unusual thing about those ROCKs is that they are all based on the same Dockerfile.
Depending on the args received, it installs only the corresponding wheel. This way, multiple release artefacts are produced, ensuring there is a "Docker image for each Inference Runtime containing only that specific runtime" (corresponding to the mlserver-* ROCKs) while there is also a "Docker image containing every inference runtime maintained within the MLServer repo" (corresponding to the mlserver ROCK).
While updating the Seldon ROCKs for CKF 1.8 (#37), we concluded with @i-chvets that the current rockcraft.yaml file is based on the upstream Dockerfile.conda and doesn't take the Dockerfile into account. Looking at the upstream Makefile, though, we see that in order to build their image they use both (with Dockerfile.conda as the BASE_IMAGE). Thus, we need to implement the other Dockerfile in our ROCK as well.
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
rockcraft.yaml is updated
To make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future
This issue tracks the process of updating the Seldon ROCKs to Seldon 1.17 for the CKF 1.8 release. For the process, we're following our internal Kubeflow ROCK Images Best Practices, which has a section about Upgrade of ROCK Images.
The changes that this process introduces should match what upstream has for release 1.17.1.
Starting this thread to document our knowledge of the sklearnserver ROCK implementation, since it is quite convoluted. This comes after #44, where we reached false conclusions before realising that its implementation was not clear to us.
Workflow that builds the upstream image for sklearnserver:
The seldon-core-s2i image (Dockerfile) is built with this command, which means that it uses docker.io/seldonio/conda-ubi8 as its base image (Dockerfile.conda).
Thus, rockcraft.yaml should implement the above workflow.
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
rockcraft.yaml is updated
To make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future
After adding source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} to the command field, as upstream does here, the ROCK fails to start during integration tests with the following output:
Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
2023-12-12T10:15:05.442Z [pebble] Started daemon.
2023-12-12T10:15:05.447Z [pebble] POST /v1/services 4.247605ms 202
2023-12-12T10:15:05.448Z [pebble] Started default services with change 1.
2023-12-12T10:15:05.451Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} && mlserver start ${MLSERVER_MODELS_DIR}'
2023-12-12T10:15:05.776Z [mlserver-mlflow] --> Unpacking environment at /mnt/models/environment.tar.gz...
2023-12-12T10:15:10.227Z [mlserver-mlflow] --> Sourcing new environment at ./envs/environment...
2023-12-12T10:15:10.248Z [mlserver-mlflow] --> Calling conda-unpack...
2023-12-12T10:15:10.592Z [mlserver-mlflow] --> Disabling user-installed packages...
2023-12-12T10:15:10.732Z [mlserver-mlflow] Traceback (most recent call last):
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/envs/environment/bin/mlserver", line 5, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from mlserver.cli import main
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/__init__.py", line 2, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from .server import MLServer
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/server.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from mlserver.repository.factory import ModelRepositoryFactory
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/repository/__init__.py", line 1, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from .repository import (
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/repository/repository.py", line 9, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from ..errors import ModelNotFound
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/errors.py", line 1, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from fastapi import status
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/__init__.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from .applications import FastAPI as FastAPI
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 15, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from fastapi import routing
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/routing.py", line 23, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from fastapi.datastructures import Default, DefaultPlaceholder
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/datastructures.py", line 3, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from starlette.datastructures import URL as URL # noqa: F401
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/starlette/datastructures.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from starlette.concurrency import run_in_threadpool
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 6, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] import anyio
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/anyio/__init__.py", line 21, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from ._core._fileio import AsyncFile as AsyncFile
2023-12-12T10:15:10.732Z [mlserver-mlflow] File "/opt/mlserver/.local/lib/python3.8/site-packages/anyio/_core/_fileio.py", line 10, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow] from typing import (
2023-12-12T10:15:10.732Z [mlserver-mlflow] ImportError: cannot import name 'Final' from 'typing' (/opt/mlserver/envs/environment/lib/python3.7/typing.py)
2023-12-12T10:15:10.744Z [pebble] Service "mlserver-mlflow" stopped unexpectedly with code 1
2023-12-12T10:15:10.744Z [pebble] Service "mlserver-mlflow" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-12-12T10:15:11.248Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} && mlserver start ${MLSERVER_MODELS_DIR}'
I cannot tell from this whether we are using the Python 3.8 interpreter (which we shouldn't) with Python 3.7 libraries, or the other way around.
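A quick check run inside the failing container would disambiguate; `typing.Final` only exists from Python 3.8 on, and the traceback's failing import comes from a python3.7 tree, so the 3.7 standard library is the one being resolved. A sketch:

```python
import sys
import typing

# The traceback fails importing `Final` from
# /opt/mlserver/envs/environment/lib/python3.7/typing.py. `typing.Final`
# was added in Python 3.8, so whichever interpreter binary launched the
# process, the *3.7* stdlib is the one on its import path.
print(sys.version.split()[0])  # interpreter version actually running
print(typing.__file__)         # which stdlib's typing module was imported
print(hasattr(typing, "Final"))  # False would mean a pre-3.8 stdlib
```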
Note that we only came across this now because previously we didn't pass the proper arguments to the activate-env.sh script, which meant the script never ran and instead just logged:
[mlserver-mlflow] Invalid number of arguments
[mlserver-mlflow] Usage: ./activate-env.sh <envTarball>
Add source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} to the command and run the ROCK's integration tests:
tox -e pack
tox -e export-to-docker
tox -e integration
This will run the seldon-server-integration test with the -k mlflowserver-v2 argument.
For reference, here's what the upstream image logs during integration tests:
╰─$ kl mlflow-default-0-classifier-666d794cb-xw98v -f
Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
--> Unpacking environment at /mnt/models/environment.tar.gz...
--> Sourcing new environment at ./envs/environment...
--> Calling conda-unpack...
--> Disabling user-installed packages...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO: 192.168.2.3:37242 - "GET /v2/health/ready HTTP/1.1" 200 OK
The ROCKs in the Seldon IO ROCK repository cannot be re-built. When the ROCKs were initially created, they built without issues. Here is the log from Jul 17, 2023 with a successful build of the mlserver-sklearn ROCK.
As of Sep 4, 2023, without any changes to the rockcraft.yaml file or the GitHub repository, the same mlserver-sklearn ROCK fails to build. When building manually, the following error is observed (with no changes to the source code or rockcraft.yaml):
sed: can't read opt/mlserver/.local/bin/mlserver: No such file or directory
'override-stage' in part 'mlserver-sklearn' failed with code 2.
The problem occurs on the following line of rockcraft.yaml:
https://github.com/canonical/seldonio-rocks/blob/main/mlserver-sklearn/rockcraft.yaml#L90
It looks like the opt/mlserver/.local/bin/ directory does not exist.
Possible causes for this are:
- A change in pip that was introduced with the updated version of Ubuntu 22.04 (is this possible?)
- pip on this line, which corresponds to this line in the original Dockerfile. It looks like the upgraded pip installs packages differently than the original when the --prefix option is given.
The above causes require more investigation.
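One thing worth checking is which install scheme the pip in the image uses; vanilla CPython's scheme puts console scripts under `<prefix>/bin`, which is where rockcraft.yaml expects mlserver. The Debian-patched Python on Ubuntu 22.04 adds a `posix_local` scheme that inserts an extra `local/` component, which (assumption to verify) could move the script to `<prefix>/local/bin`. A sketch:

```python
import sysconfig

prefix = "/opt/mlserver/.local"

# Vanilla CPython's 'posix_prefix' scheme resolves console scripts to
# <prefix>/bin, i.e. the path the override-stage sed command expects.
scripts = sysconfig.get_path("scripts", "posix_prefix", vars={"base": prefix})
print(scripts)  # /opt/mlserver/.local/bin

# Assumption to verify inside the 22.04 build environment: a Debian-patched
# pip defaulting to the 'posix_local' scheme would instead place the script
# under <prefix>/local/bin, which would explain the missing file.
```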
No solution proposed at this time. More investigation into root cause is needed.
Possible approaches:
- Use pip from the Conda installation.
- Pin pip (this will cause the ROCK to be out of sync with the upstream Dockerfile).
Using 20.04 should make all mlserver-* ROCKs less prone to unexpected changes, because upstream is using Python 3.8 and 20.04 provides an easier maintenance base for ROCKs.
Example of ROCK implementation using 20.04: #40
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
rockcraft.yaml is updated
To make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future
(This issue doesn't apply to mlserver-huggingface.)
While updating the mlserver-* ROCKs to MLServer version 1.3.5 (for Seldon 1.17.1), the ROCKs started failing with the error 'FastAPI' object has no attribute 'debug'. Note that the FastAPI package is pinned in the MLServer images.
Googling this suggests that it has to do with the installed starlette package (tiangolo/fastapi#5977). However, starlette isn't pinned anywhere in the MLServer repo, considering also the things we install in our ROCK.
Somehow, I have a .rock file that works as expected. I'm not sure where exactly it was built from, although it uses mlserver package 1.3.5. Comparing the two pip freezes, it looks like the "working" one uses starlette version 0.22.0. However, when I tried to downgrade it inside the image's command by modifying it to the below (essentially adding pip install --prefix opt/mlserver/.local starlette==0.22.0):
docker run mlserver-sklearn:1.3.5 exec bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && pip install --prefix opt/mlserver/.local starlette==0.22.0 && mlserver start /mnt/models/'
the image returns the error
File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/__init__.py", line 5, in <module>
from starlette import status as status
ModuleNotFoundError: No module named 'starlette'
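One possible explanation (an assumption on my part, not verified against this image): pip resolves a relative --prefix such as opt/mlserver/.local against the current working directory of the pip process, so the downgraded starlette may have landed outside the directory listed on PYTHONPATH. A minimal sketch of the path arithmetic:

```python
from pathlib import PurePosixPath


def resolve_prefix(prefix: str, cwd: str) -> str:
    """Where files land when pip is given a (possibly relative) --prefix.

    Assumption: pip resolves a relative --prefix against the working
    directory of the pip process, like any other relative path.
    """
    return str(PurePosixPath(cwd) / prefix)


# If the service's working directory were /opt/mlserver, the relative prefix
# would nest under it instead of hitting /opt/mlserver/.local:
print(resolve_prefix("opt/mlserver/.local", cwd="/opt/mlserver"))
# -> /opt/mlserver/opt/mlserver/.local

# An absolute prefix is unaffected by the working directory:
print(resolve_prefix("/opt/mlserver/.local", cwd="/opt/mlserver"))
# -> /opt/mlserver/.local
```

If this assumption holds, retrying with an absolute `--prefix /opt/mlserver/.local` would be a cheap way to rule the hypothesis in or out.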
On top of that, I tried removing the pip install requirements/docker.txt line from the rockcraft.yaml file, which resulted in starlette==0.22.0 being installed, but the produced image still returned 'FastAPI' object has no attribute 'debug'.
Here are the pip freeze results from the aforementioned images (from mlserver-sklearn):
pip-freeze-1.3.5.txt
pip-freeze-1.3.5-no-pip-install-requirements-line.txt
pip-freeze-1.3.5-working.txt
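To compare freezes like these without eyeballing them, a small diff helper is enough (a sketch; the attachment names above are real, the sample versions below are purely illustrative):

```python
def freeze_diff(freeze_a: str, freeze_b: str) -> dict:
    """Return {package: (version_in_a, version_in_b)} for entries that differ.

    Expects `pip freeze` style "name==version" lines; other lines (VCS
    installs, editable installs, comments) are ignored.
    """
    def parse(text: str) -> dict:
        return dict(
            line.strip().split("==", 1)
            for line in text.splitlines()
            if "==" in line
        )

    a, b = parse(freeze_a), parse(freeze_b)
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(set(a) | set(b))
        if a.get(pkg) != b.get(pkg)
    }


# Illustrative input, not the real freeze contents:
failing = "fastapi==0.89.1\nstarlette==0.35.0"
working = "fastapi==0.89.1\nstarlette==0.22.0"
print(freeze_diff(failing, working))  # {'starlette': ('0.35.0', '0.22.0')}
```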
Comparing also the .tar files of the failing ROCK and the one that returns expected results using google-container-diff, the only remove/add difference I see concerns the starlette package:
These entries have been deleted from ./mlserver-xgboost:working.tar:
FILE SIZE
/opt/mlserver/.local/lib/python3.8/site-packages/starlette-0.22.0.dist-info 12.2K
/opt/mlserver/.local/lib/python3.8/site-packages/starlette-0.22.0.dist-info/INSTALLER 4B
...
The rest of the output lists all the other files that differ between the two images.
╰─$ docker run mlserver-sklearn:1.3.5
2024-01-09T13:58:01.674Z [pebble] Started daemon.
2024-01-09T13:58:01.679Z [pebble] POST /v1/services 3.752579ms 202
2024-01-09T13:58:01.679Z [pebble] Started default services with change 1.
2024-01-09T13:58:01.682Z [pebble] Service "mlserver-sklearn" starting: bash -c 'cd /opt/mlserver && export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && mlserver start ${MLSERVER_MODELS_DIR}'
2024-01-09T13:58:02.637Z [mlserver-sklearn] Traceback (most recent call last):
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/bin/mlserver", line 8, in <module>
2024-01-09T13:58:02.637Z [mlserver-sklearn] sys.exit(main())
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 263, in main
2024-01-09T13:58:02.637Z [mlserver-sklearn] root()
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
2024-01-09T13:58:02.637Z [mlserver-sklearn] return self.main(*args, **kwargs)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1078, in main
2024-01-09T13:58:02.637Z [mlserver-sklearn] rv = self.invoke(ctx)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn] return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn] return ctx.invoke(self.callback, **ctx.params)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn] return __callback(*args, **kwargs)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 23, in wrapper
2024-01-09T13:58:02.637Z [mlserver-sklearn] return asyncio.run(f(*args, **kwargs))
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/conda/lib/python3.8/asyncio/runners.py", line 44, in run
2024-01-09T13:58:02.637Z [mlserver-sklearn] return loop.run_until_complete(main)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 46, in start
2024-01-09T13:58:02.637Z [mlserver-sklearn] server = MLServer(settings)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/server.py", line 32, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn] self._metrics_server = MetricsServer(self._settings)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/metrics/server.py", line 25, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn] self._app = self._get_app()
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/metrics/server.py", line 28, in _get_app
2024-01-09T13:58:02.637Z [mlserver-sklearn] app = FastAPI(debug=self._settings.debug)
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 146, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn] self.middleware_stack: ASGIApp = self.build_middleware_stack()
2024-01-09T13:58:02.637Z [mlserver-sklearn] File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 152, in build_middleware_stack
2024-01-09T13:58:02.637Z [mlserver-sklearn] debug = self.debug
2024-01-09T13:58:02.637Z [mlserver-sklearn] AttributeError: 'FastAPI' object has no attribute 'debug'
2024-01-09T13:58:02.732Z [pebble] Service "mlserver-sklearn" stopped unexpectedly with code 1
2024-01-09T13:58:02.732Z [pebble] Service "mlserver-sklearn" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
Currently the build and scan workflow is executed nightly. This needs to be changed to a weekly schedule to conform to the spec.
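A weekly schedule can be expressed with a cron trigger; a minimal sketch of the change for a GitHub Actions workflow (the day and time chosen here are an arbitrary assumption, not taken from the spec):

```yaml
on:
  schedule:
    # Run once a week, Monday 00:00 UTC (arbitrary choice), instead of nightly
    - cron: "0 0 * * 1"
```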
Similar to what we have in the kubeflow-rocks repo.
This reduces the manual work required to test that images can be built and published.
Update the corresponding rockcraft.yaml file to make sure the version for CKF 1.8 is built.
DoD:
- rockcraft.yaml is updated
This is to make sure we have fewer CVEs in 1.8 than upstream, and to enable us to potentially create patches for package versions in the future.
Add the mlserver rock to the build_and_scan_rocks workflow.
During #40, we forgot to add mlserver to the build_and_scan_rocks workflow.
We need to add it to ensure we do not miss any CVEs in the final report.
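As a sketch, assuming the workflow builds rocks from a matrix (the exact structure of build_and_scan_rocks is not shown here), the fix amounts to listing mlserver alongside the runtime-specific rocks:

```yaml
strategy:
  matrix:
    # The rock directory names come from this repository;
    # the matrix layout itself is an assumption about the workflow.
    rock: [mlserver, mlserver-sklearn, mlserver-xgboost, mlserver-huggingface]
```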
Main issue: canonical/bundle-kubeflow#692
Workflows are needed for this repository.
Without integration tests in the ROCK repository, how do we ensure a ROCK is acceptable for use in a charm? Do we publish the ROCK regardless, as long as it passes sanity tests?