canonical / seldonio-rocks Goto Github PK

View Code? Open in Web Editor NEW

0.0 5.0 1.0 192 KB

ROCKs for Seldon Core

License: Apache License 2.0

Python 100.00%

charmed-kubeflow kubeflow rocks seldon seldon-core

seldonio-rocks's Introduction

seldonio-rocks

seldonio-rocks's People

Contributors

Watchers

Forkers

sed-i

seldonio-rocks's Issues

Create MLserver ROCK

Description

Base MLserver ROCK needs to be created and made available along side other Seldon ROCKs.

There are multiple MLserver containers in SeldonIO. They are all based on the following Dockerfile (v1.3.5):
https://github.com/SeldonIO/MLServer/blob/1.3.5/Dockerfile

For each server a separate runtime is used, eg. huggingface, sklearn, etc. When those containers are built the runtime is specified and a particular version of Docker image is created. This was mapped to Rockcraft framework and each ROCK in this repository included building/installing of the specified runtime.

To build base mlserver container image all runtimes are specified, eg. all runtime is given when building image. This will build and install all runtimes.

Solution

To create base MLserver ROCK a Rockcraft file needs to build and install all runtimes and it should be based on this Dockerfile https://github.com/SeldonIO/MLServer/blob/1.3.5/Dockerfile

All realted ROCKs are already tracked in this repository under mlserver-*/ and can be used as sample ROCKs.

tensorflow-serving rounds up results differently than upstream

Integration tests fail because ROCK rounds up results a bit differently which result in the following error:

  Full diff:
    [
     ('data',
      {'names': ['t:0',
                 't:1',
                 't:2',
                 't:3',
                 't:4',
                 't:5',
                 't:6',
                 't:7',
                 't:8',
                 't:9'],
       'tensor': {'shape': [1, 10],
                  'values': [8.49343338e-22,
-                            2.85119398e-35,
+                            2.85119369e-35,
                             0.123584226,
                             0.0665731356,
                             1.18265652e-28,
                             0.809836566,
                             4.16546084e-13,
                             1.48641526e-19,
                             6.06191043e-06,
                             2.40174282e-20]}}),
     ('meta', {'requestPath': {'classifier': 'seldonio/tfserving-proxy:1.17.1'}}),
    ]

This could be caused by the following warning that we observed also in #83

╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
...

fix tensorflow-serving rock's default args

Bug Description

canonical/rockcraft#476 clarifies how entrypoint-service's args should be specified to the command. This:

seldonio-rocks/tensorflow-serving/rockcraft.yaml

Line 18 in 941be9a

 command: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"' [ args ] 

should be updated to be:

 command: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"' [ ]

(where [ args ] is changed to [ ] (with a space between the brackets). The content between the brackets is interpreted as a default set of args that will be passed if none others are sent, eg these two are equivalent atm:

docker run this-image

docker run this-image args

To Reproduce

Environment

Relevant Log Output

Additional Context

No response

MLServer ROCK discrepancies

Description

Taking a look at new MLServer ROCK (PR #40) with @i-chvets, we noticed the following discrepancies with other MLServer-* ROCKs and these should be investigated and probably fixed.

List of items that need to be addressed:

It doesn't copy the openapi directory. We didn't do it previously in our MLServer ROCKs but it is being added as part of the update PR.
has this redundant chmod command. This isn't anywhere in the upstream Dockerfile.
Pins setuptools<65.6.0. Pinning was ditched as of 1.3.5 version upstream.
Doesn't pip install the requirements/docker.txt as the upstream Dockerfile does. We add this in the rest of MLserver-* as part of the update PR.
~~doesn't source /hack/activate-env.sh as we do in the rest of ROCKS~~

EDIT: source /hack/activate-env.sh is not needed since the appropriate files do not exist in the image and they log (same happens in upstream image too)

Environment tarball not found at '/mnt/models/environment.tar.gz'
Environment not found at './envs/environment'

`mlserver-mlflow` ROCK returns response in different format than upstream during integration tests

Context

After omitting the activation of environment in the ROCK's command due to #70, we noticed that during integration tests of mlserver-mlflow ROCK, the output of the upstream SeldonDeployment and the ROCK we use was different.

This can also be observed in changes we 've done in the past. When integrating the ROCK for 1.7, we changed the expected output. And we essentailly reverted this change in the PR that updated Seldon from 1.15 (using ROCKs) to 1.17 (using upstream images). This difference has also been documented here.

The result of the metric is essentially the same but the differences are in other parameters of the response. Here's the tests' output:

AssertionError: assert [('id', 'None'), ('model_name', 'classifier'), ('model_version', 'v1'), ('outputs', [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'parameters': {'content_type': 'np'}, 'data': [6.016145744177844]}]), ('parameters', {'content_type': 'np'})] == [('id', 'None'), ('model_name', 'classifier'), ('model_version', 'v1'), ('outputs', [{'name': 'predict', 'shape': [1], 'datatype': 'FP64', 'parameters': None, 'data': [6.016145744177844]}]), ('parameters', None)]
  At index 3 diff: ('outputs', [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'parameters': {'content_type': 'np'}, 'data': [6.016145744177844]}]) != ('outputs', [{'name': 'predict', 'shape': [1], 'datatype': 'FP64', 'parameters': None, 'data': [6.016145744177844]}])
  Full diff:
    [
     ('id', 'None'),
     ('model_name', 'classifier'),
     ('model_version', 'v1'),
     ('outputs',
      [{'data': [6.016145744177844],
        'datatype': 'FP64',
-       'name': 'predict',
?                 ^^^^^
+       'name': 'output-1',
?                +++ ^ ++
-       'parameters': None,
+       'parameters': {'content_type': 'np'},
-       'shape': [1]}]),
+       'shape': [1, 1]}]),
?                 +++
-    ('parameters', None),
+    ('parameters', {'content_type': 'np'}),
    ]

This could be due to different packages installed. Keep in mind that we 're using a different Python version to run the deployment's .py file (logs from the SeldonDeployment pod)

2023-12-13T10:12:18.469Z [mlserver-mlflow] 2023/12/13 10:12:18 WARNING mlflow.pyfunc: The version of Python that the model was saved in, `Python 3.7.10`, differs from the version of Python that is currently running, `Python 3.8.16`, and may be incompatible

Additional Context

For a reference, here are the output of pip freeze from upstream image and our ROCK, where we can see most of the packages having been bumped.
upstream-pip-freeze.txt
rock-no-env-pip-freeze.txt

Here's the full logs of the classifier container of the seldonDeployment's pod

╰─$ kl mlflow-default-0-classifier-cd8b874b4-48gtr                                                            1 ↵
Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
2023-12-13T10:12:14.621Z [pebble] Started daemon.
2023-12-13T10:12:14.630Z [pebble] POST /v1/services 8.759098ms 202
2023-12-13T10:12:14.631Z [pebble] Started default services with change 1.
2023-12-13T10:12:14.635Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && mlserver start ${MLSERVER_MODELS_DIR}'
2023-12-13T10:12:16.696Z [mlserver-mlflow] 2023-12-13 10:12:16,696 [mlserver.parallel] DEBUG - Starting response processing loop...
2023-12-13T10:12:16.697Z [mlserver-mlflow] /opt/mlserver/.local/lib/python3.8/site-packages/starlette_exporter/middleware.py:97: FutureWarning: group_paths and filter_unhandled_paths will change defaults from False to True in the next release. See https://github.com/stephenhillier/starlette_exporter/issues/79 for more info
2023-12-13T10:12:16.697Z [mlserver-mlflow]   warnings.warn(
2023-12-13T10:12:16.697Z [mlserver-mlflow] 2023-12-13 10:12:16,697 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:9000
2023-12-13T10:12:16.720Z [mlserver-mlflow] INFO:     Started server process [16]
2023-12-13T10:12:16.720Z [mlserver-mlflow] INFO:     Waiting for application startup.
2023-12-13T10:12:16.733Z [mlserver-mlflow] 2023-12-13 10:12:16,733 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:6000
2023-12-13T10:12:16.733Z [mlserver-mlflow] 2023-12-13 10:12:16,733 [mlserver.metrics] INFO - Prometheus scraping endpoint can be accessed on http://0.0.0.0:6000/prometheus
2023-12-13T10:12:16.733Z [mlserver-mlflow] INFO:     Started server process [16]
2023-12-13T10:12:16.733Z [mlserver-mlflow] INFO:     Waiting for application startup.
2023-12-13T10:12:17.786Z [mlserver-mlflow] INFO:     Application startup complete.
2023-12-13T10:12:17.787Z [mlserver-mlflow] 2023-12-13 10:12:17,787 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:9500
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO:     Application startup complete.
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
2023-12-13T10:12:17.787Z [mlserver-mlflow] INFO:     Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
2023-12-13T10:12:18.469Z [mlserver-mlflow] 2023/12/13 10:12:18 WARNING mlflow.pyfunc: The version of Python that the model was saved in, `Python 3.7.10`, differs from the version of Python that is currently running, `Python 3.8.16`, and may be incompatible
2023-12-13T10:12:18.744Z [mlserver-mlflow] 2023-12-13 10:12:18,744 [mlserver] INFO - Loaded model 'classifier' succesfully.
2023-12-13T10:12:18.746Z [mlserver-mlflow] 2023-12-13 10:12:18,746 [mlserver] INFO - Loaded model 'classifier' succesfully.
2023-12-13T10:12:28.772Z [mlserver-mlflow] INFO:     192.168.2.3:57036 - "GET /v2/health/ready HTTP/1.1" 200 OK
2023-12-13T10:12:28.773Z [mlserver-mlflow] INFO:     192.168.2.3:57048 - "GET /v2/health/ready HTTP/1.1" 200 OK
2023-12-13T10:12:29.163Z [mlserver-mlflow] INFO:     192.168.2.3:42970 - "POST /v2/models/classifier/infer HTTP/1.1" 200 OK

Update rockraft file of mlserver-mlflow for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

tensorflow-serving ROCK integration tests fail

As we can see in the ROCKs integrate PR, the tests that use this server fail.

Debugging

What I 've observed until now

docker run

Both upstream image and ROCK have (approx) the same behaviour when doing docker run

╰─$ docker run tensorflow/serving:2.1.0                         
2024-01-15 16:03:43.701551: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15 16:03:43.701845: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2024-01-15 16:03:43.701855: I tensorflow_serving/model_servers/server_core.cc:573]  (Re-)adding model: model
2024-01-15 16:03:43.701992: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:362] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model

╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294340: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294352: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294922: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:353] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model with error NOT_FOUND: /models/model not found

However, when charm uses the ROCK for the server and we apply the tf-serving or hpt CRs, those SeldonDeployments do not behave as expected. As a result, their tests time out since they cannot extract a prediction.

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq -c classifier
error: unknown flag `port'

This port args is passed by the ROCK itself, which is how it is done in upstream too though

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq --all-containers                                             1 ↵
2024/01/15 16:21:40 NOTICE: Config file "/.rclone.conf" not found - using defaults
2024/01/15 16:21:41 INFO  : 00000123/saved_model.pb: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.data-00000-of-00001: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.index: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/assets/foo.txt: Copied (new)
2024/01/15 16:21:42 INFO  : 
Transferred:   	   12.058 KiB / 12.058 KiB, 100%, 0 B/s, ETA -
Transferred:            4 / 4, 100%
Elapsed time:         1.6s

error: unknown flag `port'
{"level":"info","ts":1705335761.3749456,"logger":"entrypoint","msg":"Full health checks ","value":false}
{"level":"info","ts":1705335761.3751297,"logger":"entrypoint.maxprocs","msg":"maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined"}
{"level":"info","ts":1705335761.3751352,"logger":"entrypoint","msg":"Hostname unset will use localhost"}
{"level":"info","ts":1705335761.3769522,"logger":"entrypoint","msg":"Starting","worker":1}
{"level":"info","ts":1705335761.3769732,"logger":"entrypoint","msg":"Starting","worker":2}
{"level":"info","ts":1705335761.376975,"logger":"entrypoint","msg":"Starting","worker":3}
{"level":"info","ts":1705335761.3769767,"logger":"entrypoint","msg":"Starting","worker":4}
{"level":"info","ts":1705335761.3769782,"logger":"entrypoint","msg":"Starting","worker":5}
{"level":"info","ts":1705335761.37698,"logger":"entrypoint","msg":"Starting","worker":6}
{"level":"info","ts":1705335761.3769813,"logger":"entrypoint","msg":"Starting","worker":7}
{"level":"info","ts":1705335761.376983,"logger":"entrypoint","msg":"Starting","worker":8}
{"level":"info","ts":1705335761.3769846,"logger":"entrypoint","msg":"Starting","worker":9}
{"level":"info","ts":1705335761.376987,"logger":"entrypoint","msg":"Starting","worker":10}
{"level":"info","ts":1705335761.3774252,"logger":"entrypoint","msg":"Running http server ","port":8000}
{"level":"info","ts":1705335761.3774323,"logger":"entrypoint","msg":"Creating non-TLS listener","port":8000}
{"level":"info","ts":1705335761.3775222,"logger":"entrypoint","msg":"Running grpc server ","port":5001}
{"level":"info","ts":1705335761.377525,"logger":"entrypoint","msg":"Creating non-TLS listener","port":5001}
{"level":"info","ts":1705335761.377585,"logger":"entrypoint","msg":"Setting max message size ","size":2147483647}
{"level":"info","ts":1705335761.3777068,"logger":"entrypoint","msg":"gRPC server started"}
{"level":"info","ts":1705335761.3780322,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"info","ts":1705335761.3780477,"logger":"entrypoint","msg":"http server started"}
{"level":"error","ts":1705335781.3396814,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/[email protected]/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
{"level":"error","ts":1705335782.2400084,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/[email protected]/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

Looking at the seldon-core logs, I don't see anything that doesn't look expected but I 'm attaching those here for reference
seldon-core container logs.txt. It logs the following reconciler errror, but after that, it seems to reconcile without errors.

Failed to update InferenceService status","SeldonDeployment":"default/hpt"

Reproduce

Using published image charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
, replace in images-list (configmap__predictor__tensorflow__tensorflow) or directly in the charm's configmap (TENSORFLOW_SERVER.protocols.tensorflow fields)
Run either tox -e seldon-servers-integration -- --model testing -k tensorflow or tox -e seldon-servers-integration -- --model testing -k tf-serving

Environment

Juju 3.1
Microk8s 1.26

Note

It looks like tests had passed in ROCKs repo and the ROCK was published because we hadn't configured the tests properly

Update rockraft file of sklearnserver for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

Document how `mlserver-*` images are built upstream

Context

Most of the seldonio ROCKs correspond to images that are Inference Runtimes that allow you to define how your model should be used within MLServer.

The unusual with those ROCKs is that they are all based of the same Dockerfile.

How they are built upstream

During release, upstream uses the same Dockerfile and passes the build-args RUNTIME
Inside the Dockerfile, there is this part in the Dockerfile which according to the args received, installs only the corresponding wheel.

This way, multiple release artefacts are produced ensuring there is a "Docker image for each Inference Runtime containing only that specific runtime" (corresponding to the mlserver-* ROCKs) while also there also a "Docker image containing every inference runtime maintained within the MLServer repo" (corresponding to the mlserver ROCK).

sklearn-server ROCK should implement a second Dockerfile too

During updating Seldon ROCKs for CKF 1.8 #37, we concluded with @i-chvets that current rockcraft.yaml file is based on upstream Dockerfile.conda and doesn't take into account the Dockerfile. Looking at the upstream Makefile though, we see that in order to build their image, they use both (with Dockerfile.conda as a BASE_IMAGE). Thus, we need to implement the other Dockerfile into our ROCK as well.

Update rockraft file of mlserver-huggingface for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

Update rockraft file of mlserver-sklearn for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

Update Seldon ROCKs to 1.17.1 version for CKF release 1.8

This issue tracks the process of updating Seldon ROCKs to Seldon's 1.17 for CKF release 1.8. For the process, we 're following our internal Kubeflow ROCK Images Best Practices that has a section about Upgrade of ROCK Images.

The changes that this process will introduce should match what the upstream has for version release 1.17.1.

sklearnserver ROCK technical implementation

Starting this thread to document our knowledge on sklearnserver ROCK implementation, since this is quite convoluted. This comes after #44, where we came to false conclusions realising that its implementation is not clear to us.

Upstream Image build

Workflow that builds upstream image for sklearnserver:

Image uses the s2i in order to be built using as arguments this environmnet and seldon-core-s2i image as base image.
(TO DO: Add more information about what s2i build does)
The seldon-core-s2i image (Dockerfile) is built with this command which means that it uses docker.io/seldonio/conda-ubi8 as its base image (Dockerfile.conda).

Thus, rockcraft.yaml should implement the above workflow.

Update rockraft file of mlserver-xgboost for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

`mlserver-mlflow`: Activating environment in the command results in failure when integrated

Bug Description

After adding the source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} in the command field as upstream does here, the ROCK fails to start during integration tests with the following output

Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
2023-12-12T10:15:05.442Z [pebble] Started daemon.
2023-12-12T10:15:05.447Z [pebble] POST /v1/services 4.247605ms 202
2023-12-12T10:15:05.448Z [pebble] Started default services with change 1.
2023-12-12T10:15:05.451Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} && mlserver start ${MLSERVER_MODELS_DIR}'
2023-12-12T10:15:05.776Z [mlserver-mlflow] --> Unpacking environment at /mnt/models/environment.tar.gz...
2023-12-12T10:15:10.227Z [mlserver-mlflow] --> Sourcing new environment at ./envs/environment...
2023-12-12T10:15:10.248Z [mlserver-mlflow] --> Calling conda-unpack...
2023-12-12T10:15:10.592Z [mlserver-mlflow] --> Disabling user-installed packages...
2023-12-12T10:15:10.732Z [mlserver-mlflow] Traceback (most recent call last):
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/envs/environment/bin/mlserver", line 5, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from mlserver.cli import main
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/__init__.py", line 2, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from .server import MLServer
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/server.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from mlserver.repository.factory import ModelRepositoryFactory
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/repository/__init__.py", line 1, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from .repository import (
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/repository/repository.py", line 9, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from ..errors import ModelNotFound
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/errors.py", line 1, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from fastapi import status
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/__init__.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from .applications import FastAPI as FastAPI
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 15, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from fastapi import routing
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/routing.py", line 23, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from fastapi.datastructures import Default, DefaultPlaceholder
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/datastructures.py", line 3, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from starlette.datastructures import URL as URL  # noqa: F401
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/starlette/datastructures.py", line 7, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from starlette.concurrency import run_in_threadpool
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 6, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     import anyio
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/anyio/__init__.py", line 21, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from ._core._fileio import AsyncFile as AsyncFile
2023-12-12T10:15:10.732Z [mlserver-mlflow]   File "/opt/mlserver/.local/lib/python3.8/site-packages/anyio/_core/_fileio.py", line 10, in <module>
2023-12-12T10:15:10.732Z [mlserver-mlflow]     from typing import (
2023-12-12T10:15:10.732Z [mlserver-mlflow] ImportError: cannot import name 'Final' from 'typing' (/opt/mlserver/envs/environment/lib/python3.7/typing.py)
2023-12-12T10:15:10.744Z [pebble] Service "mlserver-mlflow" stopped unexpectedly with code 1
2023-12-12T10:15:10.744Z [pebble] Service "mlserver-mlflow" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-12-12T10:15:11.248Z [pebble] Service "mlserver-mlflow" starting: bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} && mlserver start ${MLSERVER_MODELS_DIR}'

I cannot understand if this means that we use Python3.8 interpreter(which we shouldn't) and try to use Python3.7 libs or the other way around.

Note that we came across this now because before we didn't pass the proper arguments to activate-env.sh script which means that the script never ran and instead just logged

[mlserver-mlflow] Invalid number of arguments
[mlserver-mlflow] Usage: ./activate-env.sh <envTarball>

To Reproduce

Add the source /hack/activate-env.sh ${MLSERVER_ENV_TARBALL} to the command and run ROCK's integration tests

tox -e pack
tox -e export-to-docker
tox -e integration

This will run seldon-server-integration test with -k mlflowserver-v2 argument.

Environment

microk8s 1.25-strict/stable
juju 3.1/stable

Relevant Log Output

For reference, here's what upstream image logs during integration tests.

╰─$ kl mlflow-default-0-classifier-666d794cb-xw98v -f
Defaulted container "classifier" out of: classifier, seldon-container-engine, classifier-model-initializer (init)
--> Unpacking environment at /mnt/models/environment.tar.gz...
--> Sourcing new environment at ./envs/environment...
--> Calling conda-unpack...
--> Disabling user-installed packages...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
INFO:     192.168.2.3:37242 - "GET /v2/health/ready HTTP/1.1" 200 OK

Additional Context

rock-pip-freeze.txt
upstream-pip-freeze.txt

mlserver-* ROCKs cannot be built

Description

ROCKs in Seldon IO ROCK repository cannot be re-built. When ROCKs were initially created, they were building without issues. Here is the log from Jul 17, 2023 with successful build of mlserver-sklearn ROCK.

As of Sep 4, 2023, without any changes to rockcraft.yaml file and Github repository the same mlserver-sklearn ROCK fails to build. When building manually the following error is obseved (no changes to source code and rockcraft.yaml):

sed: can't read opt/mlserver/.local/bin/mlserver: No such file or directory                                                                                                    
'override-stage' in part 'mlserver-sklearn' failed with code 2.

The problem occurs on the following line of rockcraft.yaml:
https://github.com/canonical/seldonio-rocks/blob/main/mlserver-sklearn/rockcraft.yaml#L90

It looks like that opt/mlserver/.local/bin/ directory does not exist.

Possible causes for this are:

~~Change in the way installation is done through PIP/Conda that was introduced.~~
~~Updated pip that was introduced with updated version of Ubuntu 22.04 (is this possible?)~~
~~Upgrade of pip on this line which corresponds to this line in original Dockerfile. It looks like upgraded pip installs packages differently than original when --prefix option is given.~~
See debugging comment below.

~~The above causes require more investigration.~~

Solution

~~No solution proposed at this time. More investigation into root cause is needed.~~

Possible approaches:

Consider use 20.04 base to avoid conflicts between Python 3.10 and 3.8
~~Use pip from Conda installation.~~
~~Do not upgrade pip (this will cause ROCK to be out of sync with upstream Dockerfile)~~

Using 20.04 should make all mlserver-* ROCK less prone to unexpected changes, because upstream is using Python 3.8 and 20.04 provides an easier maintenance base for ROCKs.

Example of ROCK implementation using 20.04: #40

Update rockraft file of jupyter-scipy for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

`mlserver-*` fail to start with "'FastAPI' object has no attribute 'debug'"

(this issue doesn't apply to mlserver-huggingface.)
While updating the mlserver-* ROCKs to MLServer version 1.3.5 (for Seldon 1.17.1), the ROCKs started failing with error 'FastAPI' object has no attribute 'debug'. Note that FastApi package is pinned in MLServer images.

Debugging

Googling this prompts me that this has to do with starlette package installed tiangolo/fastapi#5977. However, starlette isn't pinned anywhere in the MLServer repo, considering also the things we install in our ROCK.
Somehow, I have a .rock file that is working as expected. I 'm not sure from where exactly this was built although it uses mlserver package 1.3.5. Comparing the two pip freezes, it looks like the "working" one uses starlette package version of 0.22.0. However, when I tried to downgrade it inside the image's command modifying it to the below (essentially adding pip install --prefix opt/mlserver/.local starlette==0.22.0)
```
docker run mlserver-sklearn:1.3.5 exec bash -c 'export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && pip install --prefix opt/mlserver/.local starlette==0.22.0 && mlserver start /mnt/models/
```
the image returns the error
```
  File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/__init__.py", line 5, in <module>
  from starlette import status as status
ModuleNotFoundError: No module named 'starlette'
```
On top of that, I tried to remove the pip install requirements/docker.txt line from the rockcraft.yaml file, which ended up with starlette==0.22.0 but the produced image still returned 'FastAPI' object has no attribute 'debug'.

Here are the pip freeze results from the aforementioned images (from mlserver-sklearn)
pip-freeze-1.3.5.txt
pip-freeze-1.3.5-no-pip-install-requirements-line.txt
pip-freeze-1.3.5-working.txt

Comparing also the .tar files using google-container-diff from the ROCK that fails and the one that returns expected results, the only remove/add difference I see is about the starlette package.

These entries have been deleted from ./mlserver-xgboost:working.tar:
FILE                                                                                                   SIZE
/opt/mlserver/.local/lib/python3.8/site-packages/starlette-0.22.0.dist-info                            12.2K
/opt/mlserver/.local/lib/python3.8/site-packages/starlette-0.22.0.dist-info/INSTALLER                  4B
...

The rest of the output mentions that all the other files that are different between the two images.

Full error logs

╰─$ docker run mlserver-sklearn:1.3.5                                                                                                                                                                                            130 ↵
2024-01-09T13:58:01.674Z [pebble] Started daemon.
2024-01-09T13:58:01.679Z [pebble] POST /v1/services 3.752579ms 202
2024-01-09T13:58:01.679Z [pebble] Started default services with change 1.
2024-01-09T13:58:01.682Z [pebble] Service "mlserver-sklearn" starting: bash -c 'cd /opt/mlserver && export PATH=/opt/conda/bin/:/opt/mlserver/.local/bin:${PATH}:/usr/bin && export PYTHONPATH=/opt/mlserver/.local/lib/python3.8/site-packages/:${PYTHONPATH} && eval $(/opt/conda/bin/conda shell.bash hook 2> /dev/null) && mlserver start ${MLSERVER_MODELS_DIR}'
2024-01-09T13:58:02.637Z [mlserver-sklearn] Traceback (most recent call last):
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/bin/mlserver", line 8, in <module>
2024-01-09T13:58:02.637Z [mlserver-sklearn]     sys.exit(main())
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 263, in main
2024-01-09T13:58:02.637Z [mlserver-sklearn]     root()
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return self.main(*args, **kwargs)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1078, in main
2024-01-09T13:58:02.637Z [mlserver-sklearn]     rv = self.invoke(ctx)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return ctx.invoke(self.callback, **ctx.params)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return __callback(*args, **kwargs)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 23, in wrapper
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return asyncio.run(f(*args, **kwargs))
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/conda/lib/python3.8/asyncio/runners.py", line 44, in run
2024-01-09T13:58:02.637Z [mlserver-sklearn]     return loop.run_until_complete(main)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/cli/main.py", line 46, in start
2024-01-09T13:58:02.637Z [mlserver-sklearn]     server = MLServer(settings)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/server.py", line 32, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn]     self._metrics_server = MetricsServer(self._settings)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/metrics/server.py", line 25, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn]     self._app = self._get_app()
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/mlserver/metrics/server.py", line 28, in _get_app
2024-01-09T13:58:02.637Z [mlserver-sklearn]     app = FastAPI(debug=self._settings.debug)
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 146, in __init__
2024-01-09T13:58:02.637Z [mlserver-sklearn]     self.middleware_stack: ASGIApp = self.build_middleware_stack()
2024-01-09T13:58:02.637Z [mlserver-sklearn]   File "/opt/mlserver/.local/lib/python3.8/site-packages/fastapi/applications.py", line 152, in build_middleware_stack
2024-01-09T13:58:02.637Z [mlserver-sklearn]     debug = self.debug
2024-01-09T13:58:02.637Z [mlserver-sklearn] AttributeError: 'FastAPI' object has no attribute 'debug'
2024-01-09T13:58:02.732Z [pebble] Service "mlserver-sklearn" stopped unexpectedly with code 1
2024-01-09T13:58:02.732Z [pebble] Service "mlserver-sklearn" on-failure action is "restart", waiting ~500ms before restart (backoff 1)

Change schedule for build and scan to weekly

Description

Currently build and scan workflow is executed nightly. This needs to be changed to weekly schedule to confirm to spec.

Introduce ROCK automations for building and pushing ROCKs

What needs to get done

Similarly to what we have in kubeflow-rocks repo

A workflow for building ROCKs on each PR
https://github.com/canonical/kubeflow-rocks/blob/main/.github/workflows/on_pull_request.yaml
A workflow for automatically building/publishing a ROCK
https://github.com/canonical/kubeflow-rocks/blob/main/.github/workflows/build_and_publish_rock.yaml

Why it needs to get done

To reduce the manual work required of testing that images can be built and published

Update rockraft file of seldon-core-operator for CKF 1.8

What needs to get done

Update the corresponding rockcraft.yaml file to make sure the version of CKF 1.8 is built.

DoD:

The rockcraft.yaml is updated
There is automation that builds the image in each PR
There is automation that pushes the image, once a PR is merged

Why it needs to get done

To make sure we have less CVEs in 1.8 than upstream, and enable us to be able to potentially create patches for package versions in the future

Add `mlserver` ROCK to `build_and_scan_rocks` workflow

What needs to get done

Add mlserver rock to build_and_scan_rocks workflow

Why it needs to get done

During #40, we forgot to add mlserver to build_and_scan_rocks workflow.

We need to add it in order to ensure we do not miss any CVEs from the final report

Workflows are needed for this repository

Description

Main issue: canonical/bundle-kubeflow#692

Workflows are needed for this repository:

on pull request execute ROCKs sanity tests (a.k.a. smoke tests), peform CVE scans
on merge publish ROCK

Without integration tests in ROCK repository how do we ensure ROCK is acceptable to be used in charm? Do we publish ROCK regardless as long as it passed sanity tests?

canonical / seldonio-rocks Goto Github PK

seldonio-rocks's Introduction

seldonio-rocks

seldonio-rocks's People

Contributors

Watchers

Forkers

seldonio-rocks's Issues

Description

Solution

Bug Description

To Reproduce

Environment

Relevant Log Output

Additional Context

Description

Context

Additional Context

What needs to get done

Why it needs to get done

Debugging

docker run

Reproduce

Environment

Note

What needs to get done

Why it needs to get done

Context

How they are built upstream

What needs to get done

Why it needs to get done

What needs to get done

Why it needs to get done

Upstream Image build

What needs to get done

Why it needs to get done

Bug Description

To Reproduce

Environment

Relevant Log Output

Additional Context

Description

Solution

What needs to get done

Why it needs to get done

Debugging

Full error logs

Description

What needs to get done

Why it needs to get done

What needs to get done

Why it needs to get done

What needs to get done

Why it needs to get done

Description

Recommend Projects

Recommend Topics

Recommend Org

Jobs