Comments (5)

kpouget commented on August 23, 2024

I could work around the issue by increasing the memory limit of the Istio egress/ingress Pods (to 4GB, to be safe):

apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: minimal
  namespace: istio-system
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi
    ingress:
      runtime:
        container:
          resources:
            limits:
              memory: 4Gi

But this wasn't happening a few weeks ago with RHOAI 2.1.0 and 300 models (that run was on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift).

Can this be a regression, or is it somehow expected?

from caikit-tgis-serving.

bartoszmajsak commented on August 23, 2024

@kpouget I am wondering if we can get some insights into these metrics as well:

  • pilot_xds_push_time_bucket
  • pilot_proxy_convergence_time_bucket
  • pilot_proxy_queue_time_bucket
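
As a rough sketch of how these three histograms could be inspected: the helper below only builds a PromQL expression for a chosen quantile of each metric. The function name `pilot_quantile_query`, the 0.99 quantile, and the 5m rate window are illustrative assumptions and are not taken from this thread.

```shell
#!/bin/bash
# Build a PromQL expression for one quantile of an istiod histogram metric.
# Defaults (0.99 quantile, 5m window) are illustrative, not from the thread.
pilot_quantile_query() {
    local metric="$1"            # e.g. pilot_xds_push_time_bucket
    local quantile="${2:-0.99}"  # quantile to estimate
    local window="${3:-5m}"      # rate() look-back window
    printf 'histogram_quantile(%s, sum(rate(%s[%s])) by (le))\n' \
        "$quantile" "$metric" "$window"
}

# Print one query per metric mentioned above.
for metric in pilot_xds_push_time_bucket \
              pilot_proxy_convergence_time_bucket \
              pilot_proxy_queue_time_bucket; do
    pilot_quantile_query "$metric"
done
```

Each printed expression can be pasted into whichever Prometheus console scrapes istiod in the cluster; where that Prometheus lives depends on the installation and is not specified here.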

bartoszmajsak commented on August 23, 2024

but this wasn't happening a few weeks ago, with RHOAI 2.1.0 and 300 models (when running on AWS with 35 nodes, whereas this bug occurred on a single-node OpenShift)

@kpouget was it also running on Istio underneath? If so, how was it configured?

kpouget commented on August 23, 2024

@kpouget was it also running on Istio underneath? If so, how was it configured?

Yes, it was. Istio was configured with these files (pinned to the commit I used at the time of the test).

bartoszmajsak commented on August 23, 2024

I managed to reduce resource consumption roughly by half. Here's a script you can apply.

In short, this script:

  • sets resource constraints for pilot and the gateways
  • enables PILOT_FILTER_GATEWAY_CLUSTER_CONFIG
    • This reduces the amount of configuration data that Pilot sends to the Istio gateways, specifically the egress and ingress gateways. It filters out service registry information that is not relevant to a particular gateway.
  • limits the outbound endpoints pushed to the sidecar proxies in each project by using a Sidecar resource.
    • This is based on the assumption that there is no cross-namespace communication in place. If that is not true, we have to revise the Sidecar settings @Jooho @israel-hdez

#!/bin/bash

cat <<EOF > smcp-patch.yaml 
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:  
  name: data-science-smcp
  namespace: istio-system  
spec:
  gateways:
    egress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
    ingress:
      runtime:
        container:
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1G
  runtime:
    components:
      pilot:
        container:
          env:
            PILOT_FILTER_GATEWAY_CLUSTER_CONFIG: "true"
          resources:
            limits:
              cpu: 1024m
              memory: 4G
            requests:
              cpu: 128m
              memory: 1024Mi

EOF

trap '{ rm -rf -- smcp-patch.yaml; }' EXIT

kubectl patch smcp/data-science-smcp -n istio-system --type=merge --patch-file smcp-patch.yaml 

namespaces=$(kubectl get ns -ltopsail.scale-test -o name | cut -d'/' -f 2)


# limit sidecarproxy endpoints to its own ns and istio-system
for ns in $namespaces; do
    cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: $ns
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF
done

# force changes to take effect
for ns in $namespaces; do
    kubectl delete pods --all -n "${ns}"
done


# force re-creation of all pods with envoy service registry rebuilt
kubectl delete pods --all -n istio-system

Initial state

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
1065

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-6b7fdb6cb9-lh5jg        100m         2519Mi          
istio-ingressgateway-7dbdc66dd7-nkxxq       91m          2320Mi          
istiod-data-science-smcp-65f4877fff-tndf4   82m          1392Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             14m          372Mi           
...

After modifications

❯ istioctl proxy-config endpoint deployment/istio-ingressgateway -n istio-system | wc -l
1052 // unchanged: the ingress gateway still knows about every endpoint

❯ istioctl proxy-config endpoint $(kubectl get pods -o name -n watsonx-scale-test-u1) -n watsonx-scale-test-u1 | wc -l
34

❯ kubectl top pods -n istio-system
NAME                                        CPU(cores)   MEMORY(bytes)   
istio-egressgateway-5778df8594-j869r        83m          444Mi           
istio-ingressgateway-6847d4b974-sk25z       77m          946Mi           
istiod-data-science-smcp-5568884d7d-45zkz   36m          950Mi 

❯ kubectl top pods -n watsonx-scale-test-u0 --containers
POD                                               NAME                    CPU(cores)   MEMORY(bytes)   
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   POD                     0m           0Mi             
u0-m0-predictor-00001-deployment-c46f9d59-jv9pq   istio-proxy             6m           136Mi           
...
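
To put the two snapshots side by side, here is a small sketch that computes the percentage reduction for each figure quoted above. The helper names `reduction_pct` and `sum_memory_mi` are made up for this illustration, and `sum_memory_mi` assumes the MEMORY column is always reported in Mi, as it is in the output above.

```shell
#!/bin/bash
# Figures are copied from the kubectl top / istioctl output in this comment.

# Percentage reduction from BEFORE to AFTER, rounded to the nearest integer.
reduction_pct() {
    local before="$1" after="$2"
    echo $(( ((before - after) * 100 + before / 2) / before ))
}

# Sum the MEMORY column (assumed to be in Mi) of `kubectl top pods`
# output read from stdin; the first line is treated as the header.
sum_memory_mi() {
    awk 'NR > 1 { gsub(/Mi/, "", $3); total += $3 } END { print total + 0 }'
}

echo "egress gateway memory:  $(reduction_pct 2519 444)%"
echo "ingress gateway memory: $(reduction_pct 2320 946)%"
echo "istiod memory:          $(reduction_pct 1392 950)%"
echo "sidecar proxy memory:   $(reduction_pct 372 136)%"
echo "sidecar endpoint lines: $(reduction_pct 1065 34)%"
```

`sum_memory_mi` can be fed directly, for example `kubectl top pods -n istio-system | sum_memory_mi`. These are single `kubectl top` snapshots, so the percentages are indicative rather than a benchmark.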
