
caikit-tgis-serving's People

Contributors

bdattoma, dagrayvid, danielezonca, dependabot[bot], dtrifiro, esposem, guimou, heyselbi, jimknochelmann, jooho, maxusmusti, melissaflinn, openshift-ci[bot], openshift-merge-robot, rhuss, rpancham, spolti, taneem-ibrahim, tarukumar, vaibhavjainwiz, vedantmahabaleshwarkar, xaenalt, ymoatti, z103cb


caikit-tgis-serving's Issues

SMMR update is not needed anymore

Restructure and Enhance the QuickStart documentation

The new documentation should have sections that answer the following questions:

  • How do you install the stack?
  • How do you add/deploy a model, with a sample model and a sample inference URL in the example?
  • How do you remove (undeploy) a model?
  • How do you perform a canary rollout or A/B testing?
  • How do you upgrade the runtime?
  • How do you access metrics?

Synchronization issue when the model is just launched

Describe the bug

There is a synchronization issue at the launch of the Pod with the current images:

  • all the containers become Ready:
flan-t5-small-gpu-predictor-00001-deployment-6768c548d8-8btqc   4/4     Running   0          41s
  • the model appears as Loaded in the inference service:
  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
  • but the model takes several extra seconds to be able to serve requests:
HOST=...
METHOD=caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
while true; do
  GRPCURL_DATA='{"max_new_tokens": 25, "min_new_tokens": 25, "text": "At what temperature does liquid Nitrogen boil?"}'
  grpcurl  -insecure  -d "$GRPCURL_DATA"  -H "mm-model-id: flan-t5-small-caikit"  $HOST  $METHOD
  sleep 1
done

ERROR:
  Code: Internal
  Message: Unhandled exception during prediction
(the same ERROR repeats once per second, seven times in total, before the first successful response)
{
  "generated_text": "74 degrees F.C., a temperature of 74 degrees F.C., a temperature of ",
  "generated_tokens": "25",
  "finish_reason": "MAX_TOKENS",
  "producer_id": {
    "name": "Text Generation",
    "version": "0.1.0"
  },
  "input_token_count": "10"
}

In the transformer-container logs, we can see this error:

{"channel": "GP-SERVICR-I", "exception": null, "level": "warning", "log_code": "<RUN49049070W>", "message": "<_InactiveRpcError of RPC that terminated with:
\tstatus = StatusCode.UNAVAILABLE
\tdetails = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused\"
\tdebug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused {created_time:\"2023-10-24T11:48:51.016344787+00:00\", grpc_status:14}\"
>", "model_id": "flan-t5-small-caikit", "num_indent": 0, "stack_trace": "Traceback (most recent call last):
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/servicers/global_predict_servicer.py\", line 283, in _handle_predict_exceptions
    yield
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/servicers/global_predict_servicer.py\", line 260, in predict_model
    response = work.do()
  File \"/caikit/lib/python3.9/site-packages/caikit/runtime/work_management/abortable_action.py\", line 118, in do
    return self.__work_thread.get_or_throw()
  File \"/caikit/lib/python3.9/site-packages/caikit/core/toolkit/destroyable_thread.py\", line 188, in get_or_throw
    raise self.__runnable_exception
  File \"/caikit/lib/python3.9/site-packages/caikit/core/toolkit/destroyable_thread.py\", line 124, in run
    self.__runnable_result = self.runnable_func(
  File \"/caikit/lib/python3.9/site-packages/caikit_nlp/modules/text_generation/text_generation_tgis.py\", line 237, in run
    return self.tgis_generation_client.unary_generate(
  File \"/caikit/lib/python3.9/site-packages/caikit_nlp/toolkit/text_generation/tgis_utils.py\", line 315, in unary_generate
    batch_response = self.tgis_client.Generate(request)
  File \"/caikit/lib64/python3.9/site-packages/grpc/_channel.py\", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File \"/caikit/lib64/python3.9/site-packages/grpc/_channel.py\", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
\tstatus = StatusCode.UNAVAILABLE
\tdetails = \"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused\"
\tdebug_error_string = \"UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused {created_time:\"2023-10-24T11:48:51.016344787+00:00\", grpc_status:14}\"
>
", "thread_id": 140123215742720, "timestamp": "2023-10-24T11:48:51.017178"}

Platform

  • quay.io/opendatahub/text-generation-inference@sha256:0e3d00961fed95a8f8b12ed7ce50305acbbfe37ee33d37e81ba9e7ed71c73b69
  • quay.io/opendatahub/caikit-tgis-serving@sha256:adb8d1153b900e304fbcc934189c68cffea035d4b82848446c72c3d5554ee0ca

Sample Code

caikit_tgit_config.yaml.log
inference_service.yaml.log
serving_runtime.yaml.log

Add instructions about setting TGI(S) parameters

Users may need to set particular TGI(S) parameters when using the Caikit+TGIS runtime on KServe. An example is the model timeout parameter, which may need to be tweaked based on the model size.

We should document the procedure in our docs.

Additionally, in a future UI effort, this option should be present in the user interface.
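
As a sketch of what the documentation could show, TGIS settings would be injected as environment variables on the transformer-container in the ServingRuntime. The variable names below are illustrative assumptions, not verified TGIS options:

# Fragment of a ServingRuntime spec (variable names are illustrative only)
spec:
  containers:
    - name: transformer-container
      env:
        - name: MODEL_LOAD_TIMEOUT    # hypothetical: the model timeout mentioned above
          value: "300"
        - name: MAX_SEQUENCE_LENGTH   # hypothetical: cap the context window
          value: "2048"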

Caikit + TGIS returns empty answer when using HTTP calls

While using the ServingRuntime definition from https://github.com/opendatahub-io/caikit-tgis-serving/pull/131/files#diff-94e62eddc4f3b075ea6c7d9eb86d45728d2c9ebb3c00ae43fd81863ccb6c01f9, which relies on REST calls (HTTP port 8080), I'm facing issues in getting the model's answers.

The query returns an empty response. These are 2 examples of REST calls I tried:

curl -d '{"model_id": "<model_name>","inputs": "At what temperature does water boil?"}' -insecure <ksvc_url>:8080/api/v1/task/text-generation
curl --json '{"model_id": "<model_name>","inputs": "At what temperature does water boil?"}' -insecure <ksvc_url>:8080/api/v1/task/text-generation

I also tried getting the cluster CA secret and including it in the curl call, like this:

  1. oc get secret -n openshift-ingress router-certs-default -o json | jq '.data."tls.crt"' | sed 's/"//g' | base64 -d > <filename>.crt
  2. curl --json '{"model_id": "<model_name>","inputs": "At what temperature does water boil?"}' -insecure <ksvc_url>:8080/api/v1/task/text-generation --cacert <filename>.crt

Is there anything wrong with the way I'm performing the call? Note that the same ServingRuntime, set to use the gRPC port, works just fine.
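
One thing worth ruling out first: unlike grpcurl, curl only understands long options with a double dash, so -insecure is not parsed as --insecure. A hedged variant of the call, with the TLS check disabled via -k and an explicit Content-Type header:

curl -k -H "Content-Type: application/json" \
  -d '{"model_id": "<model_name>", "inputs": "At what temperature does water boil?"}' \
  https://<ksvc_url>/api/v1/task/text-generation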

GRPC endpoint not responding properly after the InferenceService reports as `Loaded`

As part of my automated scale test, I observe that the InferenceService sometimes reports as Loaded, but calls to the gRPC endpoint return errors.

Examples:

<command>
set -o pipefail;
i=0;

GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )

grpcurl    -insecure    -d "$GRPCURL_DATA"    -H "mm-model-id: flan-t5-small-caikit"    u0-m7-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443    caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict
</command>

<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: connections to all backends failing; last error: UNKNOWN: ipv4:127.0.0.1:8033: Failed to connect to remote host: Connection refused
<command>
set -o pipefail;
set -e;
dest=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/answers.json
queries=/mnt/logs/016__watsonx_serving__validate_model_all/u0-m6/questions.json
rm -f "$dest" "$queries"

for i in $(seq 10); do
  GRPCURL_DATA=$(cat "subprojects/llm-load-test/openorca-subset-006.json" | jq .dataset[$i].input )
  echo $GRPCURL_DATA >> "$queries"
  grpcurl    -insecure    -d "$GRPCURL_DATA"    -H "mm-model-id: flan-t5-small-caikit"    u0-m6-predictor-watsonx-serving-scale-test-u0.apps.psap-watsonx-dgxa100.perf.lab.eng.bos.redhat.com:443    caikit.runtime.Nlp.NlpService/TextGenerationTaskPredict    >> "$dest"
  echo "Call $i/10 passed"
done
</command>

<stdout> Call 1/10 passed
<stdout> Call 2/10 passed
<stdout> Call 3/10 passed
<stdout> Call 4/10 passed
<stdout> Call 5/10 passed
<stdout> Call 6/10 passed
<stdout> Call 7/10 passed
<stdout> Call 8/10 passed
<stdout> Call 9/10 passed
<stderr> ERROR:
<stderr>   Code: Unavailable
<stderr>   Message: error reading from server: EOF

Versions

NAME                          DISPLAY                                          VERSION    REPLACES                                   PHASE
jaeger-operator.v1.47.1-5     Red Hat OpenShift distributed tracing platform   1.47.1-5   jaeger-operator.v1.47.0-2-0.1696814090.p   Succeeded
kiali-operator.v1.65.9        Kiali Operator                                   1.65.9     kiali-operator.v1.65.8                     Succeeded
rhods-operator.2.3.0          Red Hat OpenShift Data Science                   2.3.0      rhods-operator.2.2.0                       Succeeded
serverless-operator.v1.30.1   Red Hat OpenShift Serverless                     1.30.1     serverless-operator.v1.30.0                Succeeded
servicemeshoperator.v2.4.4    Red Hat OpenShift Service Mesh                   2.4.4-0    servicemeshoperator.v2.4.3                 Succeeded
quay.io/opendatahub/text-generation-inference@sha256:0e3d00961fed95a8f8b12ed7ce50305acbbfe37ee33d37e81ba9e7ed71c73b69
quay.io/opendatahub/caikit-tgis-serving@sha256:ed920d21a4ba24643c725a96b762b114b50f580e6fee198f7ccd0bc73a95a6ab

[Bug] TGIS container fails to run on a FIPS cluster

When deploying an LLM using the new Caikit+TGIS architecture introduced with #107, the TGIS container (i.e., transformer-container) fails to start if the cluster has FIPS cryptography enabled.

These are the two errors I got in the container logs:
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE
Note: TRANSFORMERS_CACHE is actually set in the ServingRuntime.

This was found on an OpenShift 4.13.18 cluster with RHODS 2.1.2 (aka 1.32.2) and KServe 0.11 installed.
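
For reference, FIPS mode on the nodes can be confirmed with the kernel flag (1 means enabled):

oc debug node/<node_name> -- chroot /host cat /proc/sys/crypto/fips_enabled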

Investigate the use of dev containers as mechanism for supporting development on Apple silicon

Developing and testing the caikit-tgis-serving component on an Apple laptop (Intel or ARM chipsets) does not seem to be supported by this project.

We need to find a way to allow developers using Apple hardware to make meaningful contributions to the project. To that end we should:

  1. Investigate whether the use of dev containers would allow for meaningful contributions
  2. If dev containers prove to be beneficial, document:
  • how to set up the environment
  • how to debug / unit test / integration test in the environment

An initial pass at the problem has been discussed in #171.
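
To make item 1 concrete, a minimal .devcontainer sketch; the base image and post-create command are assumptions for illustration, not a tested setup:

mkdir -p .devcontainer
cat > .devcontainer/devcontainer.json <<'EOF'
{
  // image and postCreateCommand are assumptions; adjust to the project layout
  "name": "caikit-tgis-serving",
  "image": "mcr.microsoft.com/devcontainers/python:3.9",
  "postCreateCommand": "pip install -r requirements.txt"
}
EOF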

Caikit Standalone image/SR

A Caikit standalone image/ServingRuntime (SR) needs to be created.
Several steps are required:

  • Contribute the RH Dockerfile upstream
  • Sync upstream with the odh repo
  • Build the image and share it here
  • Add the SR and InferenceService to the caikit-tgis-serving repo
  • Communicate the new SR to the QE and UI teams

KServe does not catch the Caikit runtime status correctly when a subprocess (TGIS) has issues

When I create a ServingRuntime+InferenceService with some incorrect parameters, Caikit cannot load the model.

{"channel": "MODEL-LOADER", "exception": null, "level": "error", "log_code": "<RUN62912924E>", "message": "load failed when processing path: /mnt/models/flan-t5-small-caikit with error: RuntimeError('TGIS failed to boot up with the model. See logs for details')", "model_id": "flan-t5-small-caikit", "num_indent": 0, "thread_id": 140660900353792, "timestamp": "2023-09-21T19:39:45.781105"}

This part is expected. However, the InferenceService still shows the model as Loaded, which is unexpected:

  modelStatus:
    copies:
      failedCopies: 0
      totalCopies: 1
    states:
      activeModelState: Loaded
      targetModelState: Loaded
    transitionStatus: UpToDate

Jaeger is needed for installing Service Mesh

On a fresh cluster, the scripts/docs do not work because the SMCP is not running properly, with this message:

    - lastTransitionTime: '2023-10-20T12:54:10Z'
      message: >-
        Dependency "Jaeger CRD" is missing: error: no matches for kind "Jaeger"
        in version "jaegertracing.io/v1"
      reason: DependencyMissingError
      status: 'False'
      type: Reconciled
    - lastTransitionTime: '2023-10-20T12:54:10Z'
      message: >-
        Dependency "Jaeger CRD" is missing: error: no matches for kind "Jaeger"
        in version "jaegertracing.io/v1"
      reason: DependencyMissingError
      status: 'False'
      type: Ready

I am not sure whether this is a Service Mesh issue, but it blocked the KServe installation, so we need to add Jaeger as a prerequisite. By the way, we removed Jaeger based on this confirmation message:

The Kiali and Jaeger Tracing operators are not required to be installed

@bartoszmajsak do you have any idea?
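
If Jaeger is added as a prerequisite, the install step would look roughly like this (operator channel and catalog source may differ per cluster):

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: jaeger-product
  namespace: openshift-operators
spec:
  channel: stable
  name: jaeger-product
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF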

ODH Operator v2 changed the API, so docs/scripts are not working now

OpenDataHub Operator v2.1 changed the API.

The DataScienceCluster resource at [this address](https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/custom-manifests/opendatahub/kserve-dsc.yaml) doesn't work with the latest RC update (components are now toggled with Managed/Removed instead of true/false).

Due to this change, the manifests need to be updated.
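
For reference, a sketch of what the updated kserve-dsc.yaml would look like, assuming the v2 API toggles components via managementState:

oc apply -f - <<EOF
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kserve:
      managementState: Managed   # v1 used a true/false style toggle here
EOF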

Caikit/TGIS swallows the error when model loading runs out of memory

When trying to load a model in a Pod whose memory limit is too low, the out-of-memory error message is swallowed by TGIS and is hard to troubleshoot (in addition to Caikit swallowing the TGIS error):

2023-09-26T09:40:45.259993Z  INFO text_generation_launcher: Starting shard 0
Shard 0: supports_causal_lm = False, supports_seq2seq_lm = True
2023-09-26T09:40:55.279072Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-09-26T09:40:57.571196Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-09-26T09:40:57.571219Z  INFO text_generation_launcher: Shutting down shards
{"channel": "TGISPROC", "exception": null, "level": "error", "log_code": "<MTS11752287E>", "message": "exception raised: RuntimeError('TGIS failed to boot up with the model. See logs for details')", "num_indent": 0, "thread_id": 140590947739392, "timestamp": "2023-09-26T09:40:59.288074"}

While troubleshooting it, I observed that even the TGIS return code does not reflect the OOM error, although my attempts confirmed that not giving enough memory was the cause of the load failure:

sh-4.4$ text-generation-launcher --num-shard 1 --model-name /mnt/models/flan-t5-large/artifacts/ --port 3000;
2023-09-26T11:42:33.150862Z  INFO text_generation_launcher: Launcher args: Args { model_name: "/mnt/models/flan-t5-large/artifacts/", revision: None, deployment_framework: "hf_transformers", dtype: None, dtype_str: Some("float16"), num_shard: Some(1), max_concurrent_requests: 150, max_sequence_length: 4096, max_new_tokens: 1024, max_batch_size: 256, max_batch_weight: Some(47458400), max_prefill_weight: None, max_waiting_tokens: 24, port: 3000, grpc_port: 8033, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, json_output: false, tls_cert_path: None, tls_key_path: None, tls_client_ca_cert_path: None, output_special_tokens: false, cuda_process_memory_fraction: 1.0 }
2023-09-26T11:42:33.151097Z  INFO text_generation_launcher: Starting shard 0
Shard 0: supports_causal_lm = False, supports_seq2seq_lm = True
2023-09-26T11:42:43.180572Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-09-26T11:42:50.384697Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-09-26T11:42:50.384723Z  INFO text_generation_launcher: Shutting down shards
sh-4.4$ echo $?
1
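
When the kernel OOM-kills the whole container, the termination reason is at least visible from outside the Pod; this shows nothing, though, if only the TGIS shard subprocess was killed and the container kept running:

oc get pod <predictor_pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'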

Review OpenShift CI flows

Per a comment in openshift-ci troubleshooting, it's possible our build/push workflow isn't quite correct: Thread

It's not 100% clear to me: it seems like the mirror job also builds, or at least waits for the build, but the comment from the team there leaves it ambiguous.

It's also a good chance to review the entirety of our test workflows with the openshift-ci team, and to use the 'request consultation' option for an overall review of our openshift-ci jobs, for better maintenance in the future.

Remove namespaces from demo/kserve/custom-manifests/metrics yamls

In https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/metrics.md

The doc says to use -n $TEST_NS when applying:
custom-manifests/metrics/uwm-cm-enable.yaml and custom-manifests/metrics/uwm-cm-conf.yaml


However, since the namespace is already defined in the YAMLs, the command fails with:

$▶ oc apply -f ./custom-manifests/metrics/uwm-cm-enable.yaml -n $TEST_NS
error: the namespace from the provided object "openshift-monitoring" does not match the namespace "kserve-demo".
You must pass '--namespace=openshift-monitoring' to perform this operation.

Please update the YAMLs and remove the namespace field so that the above command works.
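
Until the YAMLs are fixed, the commands work if the -n flag is simply dropped, since each manifest already carries its own namespace:

oc apply -f ./custom-manifests/metrics/uwm-cm-enable.yaml
oc apply -f ./custom-manifests/metrics/uwm-cm-conf.yaml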

Related feature: opendatahub-io/caikit#3

FAQ doc to gather known issues/solution

In order to solve typical issues, it would be a good idea to start documenting them in FAQ style.

This doc will help users solve their issues by themselves.

Add documentation about the available metrics

Based on watsonx requirements, we should make available these metrics, at least:

  • # of inference requests over a defined time period
  • Avg. response time over a defined time period
  • # of successful / failed inference requests over a defined time period
  • Compute utilization (CPU, GPU, memory)

However, users won't find metrics with these exact names, and some of them need to be computed by combining others. Examples:

  • failed inference requests over a defined time period: you must do something like tgi_batch_inference_count - tgi_batch_inference_success, plus adding the time-period syntax
  • memory consumption: there isn't a specific istio/tgi/caikit metric for it (at least, I didn't find one). I think users can compute it with something similar to: sum(container_memory_working_set_bytes{pod='<isvc_predictor_pod_name>',namespace='<isvc_namespace>',container=''}) BY (pod, namespace)

Moreover, there are additional metrics which deserve to be documented, like tgi_request_generated_tokens_count.
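
As an example for the future doc, user-workload metrics can be queried through the thanos-querier route; a sketch using the tgi_request_generated_tokens_count metric mentioned above:

TOKEN=$(oc whoami -t)
THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# Rate of generated tokens over the last 5 minutes, summed across pods.
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum(rate(tgi_request_generated_tokens_count[5m]))' \
  "https://$THANOS/api/v1/query"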

add/improve docs for caikit+tgis setup

#107 added a caikit image which relies on a separate tgis container, and includes an example ServingRuntime/InferenceService manifest that can be deployed and tested, but the documentation is missing and/or outdated.

Create a list of package requirements

Using the latest version of https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/scripts/generate-wildcard-certs.sh, the script fails with the following error:

+ openssl req -x509 -newkey rsa:2048 -sha256 -days 3560 -nodes -subj '/CN=<value>' -extensions san -config <configpath> -CA <cert_path> -CAkey <key_path> -keyout <keyout_path> -out <out_path>
req: Unrecognized flag CA
req: Use -help for summary.

The cause was an outdated OpenSSL package installed on the system. This could happen to anyone using the script, hence it would be good to explicitly state the minimum requirements.
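
A sketch of the kind of guard the script could grow; if I recall correctly, the -CA/-CAkey options of openssl req only appeared with OpenSSL 3.0:

# Abort early when the installed OpenSSL is too old for `openssl req -CA`.
if ! openssl version | grep -qE 'OpenSSL 3\.'; then
  echo "OpenSSL >= 3.0 is required (found: $(openssl version))" >&2
  exit 1
fi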

model_id is mandatory for model queries but it can have any value

In order to query a model using the Caikit+TGIS runtime, we must pass the model_id parameter in the HTTP payload (or the mm-model-id header for gRPC).
However, it can have any value: as long as the endpoint is correct, the model responds.

In the following screenshot you can see 3 calls:

  1. first one using the actual model id
  2. second one using a dummy model id
  3. third one without the model id parameter

(screenshot omitted)

Caikit-tgis-serving with July 28th code drop cannot load models

Currently the July 28th caikit-nlp (the one used in pr-25) is not able to load models, with errors of the following form:

{"channel": "TXT_GEN", "exception": null, "level": "error", "log_code": "<NLP51672289E>", "message": "exception raised: ValueError('value check failed: Cannot run model /opt/models/flan-t5-small-caikit/artifacts with TGIS locally since it has no base artifacts')", "num_indent": 0, "thread_id": 139931863193344, "timestamp": "2023-07-28T18:55:16.932746"}
{"channel": "MODEL-LOADER", "exception": null, "level": "error", "log_code": "<RUN62912924E>", "message": "load failed when processing path: /opt/models/flan-t5-small-caikit with error: ValueError('value check failed: Cannot run model /opt/models/flan-t5-small-caikit/artifacts with TGIS locally since it has no base artifacts')", "model_id": "flan-t5-small-caikit", "num_indent": 0, "thread_id": 139931863193344, "timestamp": "2023-07-28T18:55:16.933255"}

This seems to have something to do with the Caikit TGIS local backend.

The kserve install failed with `TARGET_OPERATOR=rhods`

The kserve install failed with the following error:

[perfci@f23-h33-000-6018r ~]$ oc -n redhat-ods-operator describe subs/rhods-operator
...
  Conditions:
    Message:               constraints not satisfiable: no operators found from catalog rhods-catalog in namespace openshift-marketplace referenced by subscription rhods-operator, subscription rhods-operator exists
    Reason:                ConstraintsNotSatisfiable
    Status:                True
    Type:                  ResolutionFailed
    Last Transition Time:  2023-08-22T19:06:09Z
    Message:               targeted catalogsource openshift-marketplace/rhods-catalog missing
    Reason:                UnhealthyCatalogSourceFound
    Status:                True
    Type:                  CatalogSourcesUnhealthy
  Last Updated:            2023-08-22T19:06:09Z
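
The second condition points at the catalog source itself; a first diagnostic step is to check that it exists and that its pod is healthy:

oc get catalogsource -n openshift-marketplace
oc get pods -n openshift-marketplace | grep rhods-catalog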

[RFE] support "custom" namespaces

Currently the Caikit install instructions for the t5/flan demo have you use specific namespaces. But namespaces are cluster-scoped and must be unique: two users in the same cluster cannot both create a namespace named minio.

It would be good if the instructions let you specify the namespaces ahead of time (via bash env vars) for minio and the other components, and then used those vars, for example myminio and mydemo.
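
A sketch of what the parameterized instructions could look like, using the example names from above:

export MINIO_NS=myminio
export TEST_NS=mydemo
oc new-project "$MINIO_NS"
oc new-project "$TEST_NS"
# every subsequent `oc apply` then takes -n "$MINIO_NS" / -n "$TEST_NS"
# instead of a hard-coded namespace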

Support demo/kserve script for ODH

At the moment, the scripts only support RHODS/PREVIEW RHODS.

However, the new ODH operator v2.1 is out, so we need to enhance the scripts to support OpenDataHub.

Update the caikit dependencies

Caikit has released new versions of the libraries. We need to update the dependencies and validate that the image is correct.

gRPC connection with Python over SSL does not (always) work

  • Caikit+TGIS stack deployed as per the procedure.
  • Model loaded and working, confirmed with a grpcurl command using the --insecure parameter, which bypasses certificate validation.

I'm now trying to make this work with the grpc library in Python. The Python implementation does not allow bypassing certificate validation for TLS encryption (grpcurl is written in Go, for which the bypass is implemented, hence it works).

So to get it to work, you have to export the SSL certificate and use it when defining the channel, like this:

import grpc

# Load the exported server certificate and use it as the root of trust.
with open('certificate.pem', 'rb') as f:
    creds = grpc.ssl_channel_credentials(f.read())

server_address = inference_server_url  # e.g. "<host>:443"

channel = grpc.secure_channel(server_address, creds)

This works on some servers, but not on others, where you get this error:

_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:52.87.25.239:443: Peer name caikit-example-isvc-predictor-kserve-demo.apps.aisrhods-dell.bj30.p1.openshiftapps.com is not in peer certificate"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:52.87.25.239:443: Peer name caikit-example-isvc-predictor-kserve-demo.apps.aisrhods-dell.bj30.p1.openshiftapps.com is not in peer certificate {grpc_status:14, created_time:"2023-10-09T16:09:55.378527928+00:00"}"
>

The self-signed certificates' format is identical in both cases (only the CN or Organization changes, obviously), and the installation of the Caikit+TGIS stack is identical as far as we can tell.

So to solve the issue, it's either:

  • provide a working recipe for using the self-signed certificates from the deployment
  • provide a working recipe to bypass self-signed certificate validation
  • provide an HTTP interface, as the Python requests methods allow bypassing self-signed certificates
  • any other solution to be able to consume Caikit+TGIS from Python (grpcurl is not an option)
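
For reference, the certificate.pem used above is typically captured from the endpoint itself; a common recipe, not specific to this stack:

# Grab the server certificate presented during the TLS handshake.
openssl s_client -showcerts -connect <inference_host>:443 </dev/null 2>/dev/null \
  | openssl x509 -outform PEM > certificate.pem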

Metrics for caikit, tgi and istio were not observed in the deployed namespace

After following the instructions to deploy and access metrics on RHODS 1.32 v2 RC7 (brew.registry.redhat.io/rh-osbs/iib:568805), according to
https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/metrics.md
which involves applying 2 ConfigMaps into a test namespace (for example TEST_NS=watsonx):

The ConfigMaps were created in the test namespace:

$▶ oc describe configmap/cluster-monitoring-config -n ${TEST_NS}
Name:         cluster-monitoring-config
Namespace:    watsonx
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
enableUserWorkload: true


BinaryData
====

Events:  <none>

$▶ oc describe configmap/user-workload-monitoring-config -n ${TEST_NS}
Name:         user-workload-monitoring-config
Namespace:    watsonx
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
prometheus:
  logLevel: debug 
  retention: 15d #Change as needed


BinaryData
====

Events:  <none>

But the expected metrics for caikit, tgi or istio were not observed:
(screenshot omitted)

Looking at the default namespace openshift-monitoring, we can see that the original ConfigMap was not changed:

(screenshot omitted)

Apparently, if the default ConfigMaps are updated instead (cluster-monitoring-config in openshift-monitoring and user-workload-monitoring-config in openshift-user-workload-monitoring), the expected metrics show up.
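
In other words, the fix that made the metrics appear is placing the ConfigMaps in their standard namespaces, roughly:

oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
# and user-workload-monitoring-config goes in openshift-user-workload-monitoring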

Related feature: opendatahub-io/caikit#3

Update `caikit-nlp` version to resolve caikit/caikit-nlp#237

The current version of caikit-tgis-serving exposes some gRPC function arguments (and services) in random order. Two examples:
(two screenshots omitted)

The critical part of it (the random argument order) is already solved in caikit/caikit-nlp#237 on main; a PR is open to fix the service method order.

The caikit-nlp git ref should be updated before publishing the next release of caikit-tgis-serving.
