Comments (14)
Oh so they are rejected:
The table size of maglev must be prime
I didn't know about that constraint, and we don't validate or document it -- but clearly we should.
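Since Envoy rejects any non-prime Maglev table size, a quick way to sanity-check a proposed tableSize before applying a DestinationRule is a trial-division primality test. This is just an illustrative sketch (not part of istioctl or istiod):

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fast enough for Maglev table sizes."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

# 500 (the failing DestinationRule below) is rejected; 547 (the later fix) is fine.
print(is_prime(500))  # False
print(is_prime(547))  # True
```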
from istio.
Ah, got it. Is this happening repeatedly, or do you just have a few stuck pods and haven't restarted all of them?
If it's happening repeatedly, can you send a new log now that the maglev issue is fixed, to make sure that wasn't somehow interfering with things?
Steps to reproduce this issue:
- Starting with all the proxies in a synced state (istio version 1.19.7; also possible to reproduce in 1.18.x):
istioctl proxy-status -n istio-system
NAME CLUSTER CDS LDS EDS RDS ECDS ISTIOD VERSION
alertmanager-kube-prometheus-stack-alertmanager-0.istio-system Kubernetes SYNCED SYNCED SYNCED SYNCED NOT SENT istiod-55575fbb98-7rk66 1.19.7
grafana-6b7bd6c985-cgvxl.istio-system Kubernetes SYNCED SYNCED SYNCED SYNCED NOT SENT istiod-55575fbb98-7rk66 1.19.7
istio-ingressgateway-774c6b8695-2gz6t.istio-system Kubernetes SYNCED SYNCED SYNCED SYNCED NOT SENT istiod-55575fbb98-7rk66 1.19.7
istio-ingressgateway-774c6b8695-lk8jt.istio-system Kubernetes SYNCED SYNCED SYNCED SYNCED NOT SENT istiod-55575fbb98-7rk66 1.19.7
...
Relevant snapshot of /debug/syncz from istiod:
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "f3a59072-f389-4f09-876a-c1281254474c",
"cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
"listener_sent": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
"listener_acked": "4aeb4141-bd97-49f7-a9f3-691fff13bb25",
"route_sent": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
"route_acked": "838a8e16-8181-4d77-8a64-31f0f9e7eee6",
"endpoint_sent": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089",
"endpoint_acked": "10d2a8bf-0c70-46a5-8ecb-3f9e8b244089"
},
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
"cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
"listener_sent": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
"listener_acked": "d1a184af-89ca-4b05-95c5-267ae1118a5e",
"route_sent": "705e2514-e807-43e5-ba96-fa9345f5d750",
"route_acked": "705e2514-e807-43e5-ba96-fa9345f5d750",
"endpoint_sent": "28d32732-be43-44a0-8f41-f602975786cf",
"endpoint_acked": "28d32732-be43-44a0-8f41-f602975786cf"
},
- Create a DestinationRule with the Maglev load balancer, for example:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: fortio-test-routing
namespace: fortio
spec:
host: fortio-client.fortio.svc.cluster.local
trafficPolicy:
loadBalancer:
consistentHash:
httpQueryParameterName: url
maglev:
tableSize: 500
- CDS on all proxies becomes stale. We see the error
The table size of maglev must be prime
on all proxies.
/debug/syncz info:
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "d39434ea-00eb-4016-a490-873944dfc003",
"cluster_acked": "f3a59072-f389-4f09-876a-c1281254474c",
"listener_sent": "b7346660-ae73-4574-865c-e9556730ccb7",
"listener_acked": "b7346660-ae73-4574-865c-e9556730ccb7",
"route_sent": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
"route_acked": "ee5cb966-1e65-440b-af6d-5a8163d4e6bb",
"endpoint_sent": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7",
"endpoint_acked": "b68fd7c2-07da-4ee1-bce6-87a1e0efc2a7"
},
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "7954d038-fbb8-499c-8381-88a458efda70",
"cluster_acked": "752ba83c-e00d-4261-ae0f-27f3b567e4d6",
"listener_sent": "d00c1755-1edb-411f-98ab-86b85e7c871a",
"listener_acked": "d00c1755-1edb-411f-98ab-86b85e7c871a",
"route_sent": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
"route_acked": "7799a716-784e-46ce-b6bb-ab738ecf3d7d",
"endpoint_sent": "704fd12d-3941-4da3-ba78-24d5493dc025",
"endpoint_acked": "704fd12d-3941-4da3-ba78-24d5493dc025"
},
- Restart istiod, and the proxies go into the STALE (Never Acknowledged) state.
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "e4266698-343c-41de-895c-ec390e2f9895",
"listener_sent": "75f7814f-c4bb-4775-8bff-0affdc375d42",
"listener_acked": "75f7814f-c4bb-4775-8bff-0affdc375d42",
"route_sent": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
"route_acked": "62f4b423-b12c-4a98-83d3-8ad3ea2a2734",
"endpoint_sent": "62ff557f-4bb9-423e-95db-1c0789d2d98c",
"endpoint_acked": "62ff557f-4bb9-423e-95db-1c0789d2d98c"
},
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "ea8157b8-6b65-40b2-a77d-47ddce075284",
"listener_sent": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
"listener_acked": "bfaf6443-3264-4c91-8da7-d64c09f60b84",
"route_sent": "ab1b8361-9f84-4083-ad96-7d45402da957",
"route_acked": "ab1b8361-9f84-4083-ad96-7d45402da957",
"endpoint_sent": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43",
"endpoint_acked": "acc4e7c0-cd0b-44e4-98b4-2f28a4f66d43"
},
- After some time (~30 mins), istiod stops sending CDS configuration to all the proxies:
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-2gz6t.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"listener_sent": "68816399-c1b8-490a-a655-703283417d18",
"listener_acked": "68816399-c1b8-490a-a655-703283417d18",
"route_sent": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
"route_acked": "d311ea57-276e-42e5-b93a-749ab8b46b4d",
"endpoint_sent": "7b6b697f-b6bd-47cc-8207-e7186582dfb2",
"endpoint_acked": "7b6b697f-b6bd-47cc-8207-e7186582dfb2"
},
{
"cluster_id": "Kubernetes",
"proxy": "istio-ingressgateway-774c6b8695-lk8jt.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"listener_sent": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
"listener_acked": "ed77d779-6bed-4b4d-aa9e-6cb4325150e5",
"route_sent": "20f491fd-c194-4255-bbb4-4bafa0730d23",
"route_acked": "20f491fd-c194-4255-bbb4-4bafa0730d23",
"endpoint_sent": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e",
"endpoint_acked": "44d5669b-6e82-4701-aa73-e1ad1ec1f82e"
}
- Now, whether you fix the DestinationRule or delete it, the proxies remain stuck in this state.
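For reference, "fixing" the DestinationRule means switching to a prime table size; using the value 547 mentioned later in this thread, the corrected resource would look like:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fortio-test-routing
  namespace: fortio
spec:
  host: fortio-client.fortio.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpQueryParameterName: url
        maglev:
          tableSize: 547  # must be prime; 500 is rejected by Envoy
```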
Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)?
Any errors or issues in their logs?
Can you reproduce this if everyone is running 1.19.10? I see you have 1.17.3 (12 proxies) -- that's more than 2 minor versions behind.
Were istiod-5756dc65b7-pl9ns and istiod-5756dc65b7-sw8sc (not sent) different in any way from istiod-5756dc65b7-42jmx (synced)? Any errors or issues in their logs?
Nothing stood out to me, apart from a duplicate ServiceEntry that was causing errors on all of the istiod pods:
"message": "Duplicate cluster outbound|15199||localhost.service.entry found while pushing CDS"
Can you reproduce this if everyone is running 1.19.10? I see you have 1.17.3 (12 proxies) -- that's more than 2 minor versions behind.
We do not have control over the proxy pods; they are brought to the new version when their deployment is updated. Not sure if I can reproduce this anyway.
Do you have full envoy proxy logs for the proxies with the issue?
Attached. [Removed the actual traffic logs and the "connected to upstream XDS server istiod" messages.]
log.txt
The above problem is now fixed; it was on one of the destination rules whose proxy was stuck (envoy was not coming up).
Do you mean you fixed the maglev error and still see the original issue of clusters not being sent? Or are both fixed?
Yes, the maglev issue is fixed now, but the cluster updates are still not happening unless the proxies are restarted.
I still have some pods which I have not restarted and which are still broken.
@howardjohn anything else we can check here?
@bseenu it's a bit hard to tell, because it could easily just be the maglev issue. It would help if you could reproduce it now that that error isn't present, and include the istiod logs.
Adding a bit more information about this issue and steps to reproduce it:
I looked at two proxies from the same ingress, where one pod (replica jsnck) was not getting CDS updates sent to it.
Info from the /debug/syncz endpoint about the two proxies:
{
"cluster_id": "Kubernetes",
"proxy": "istio-internal-ingressgateway-ff7f456dd-252qg.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"cluster_sent": "224c32c9-6776-4997-8d72-59648893b2c2",
"cluster_acked": "224c32c9-6776-4997-8d72-59648893b2c2",
"listener_sent": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
"listener_acked": "3f6b051c-5ca7-4356-92cc-82349d2a785e",
"route_sent": "f5ab51b2-25ce-4b43-8928-57c4885391db",
"route_acked": "f5ab51b2-25ce-4b43-8928-57c4885391db",
"endpoint_sent": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33",
"endpoint_acked": "dc1aa3aa-8672-4eb6-b5fd-3e5172b18a33"
},
{
"cluster_id": "Kubernetes",
"proxy": "istio-internal-ingressgateway-ff7f456dd-jsnck.istio-system",
"proxy_type": "router",
"istio_version": "1.19.7",
"listener_sent": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
"listener_acked": "97f4c48a-b8da-4a3c-8f3a-3161401022a5",
"route_sent": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
"route_acked": "ef2a6589-3f4e-47a7-bffc-cfe2a88d3e92",
"endpoint_sent": "b4e7e726-7a18-4191-b8bb-bf791abb8de9",
"endpoint_acked": "b4e7e726-7a18-4191-b8bb-bf791abb8de9"
},
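The stuck replica can be spotted mechanically: its /debug/syncz entry has no cluster_sent/cluster_acked nonces at all. A small sketch (assuming the endpoint's JSON array has already been fetched and parsed) that flags such proxies:

```python
def find_stuck_cds(syncz_entries):
    """Return proxies whose syncz entry is missing CDS nonces entirely,
    i.e. istiod is no longer sending CDS to them."""
    return [
        e["proxy"]
        for e in syncz_entries
        if "cluster_sent" not in e or "cluster_acked" not in e
    ]

# Abbreviated entries mirroring the two ingress replicas above.
entries = [
    {"proxy": "istio-internal-ingressgateway-ff7f456dd-252qg.istio-system",
     "cluster_sent": "224c32c9", "cluster_acked": "224c32c9"},
    {"proxy": "istio-internal-ingressgateway-ff7f456dd-jsnck.istio-system"},
]
print(find_stuck_cds(entries))
# ['istio-internal-ingressgateway-ff7f456dd-jsnck.istio-system']
```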
Looking at the envoy config dump from the "stuck" replica, we can see that it still has the incorrect maglev table size of 500 (last updated 4/1), even though we had updated the DestinationRule to a maglev table size of 547 (a prime number, as required by the envoy docs):
"dynamic_warming_clusters": [
{
"version_info": "2024-04-03T22:17:04Z/3068",
"cluster": {
"@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
"name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
"type": "EDS",
"eds_cluster_config": {
"eds_config": {
"ads": {},
"initial_fetch_timeout": "0s",
"resource_api_version": "V3"
},
"service_name": "outbound|8080||svc-a.namespace-a.svc.cluster.local"
},
"connect_timeout": "10s",
"lb_policy": "MAGLEV",
"metadata": {
"filter_metadata": {
"istio": {
"services": [
{
"name": "svc-a",
"host": "svc-a.namespace-a.svc.cluster.local",
"namespace": "namespace-a"
}
],
"config": "/apis/networking.istio.io/v1alpha3/namespaces/namespace-a/destination-rule/svc-a-transcode-routing"
}
}
},
"common_lb_config": {
"locality_weighted_lb_config": {}
},
"filters": [
{
"name": "istio.metadata_exchange",
"typed_config": {
"@type": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
"protocol": "istio-peer-exchange"
}
}
],
"transport_socket_matches": [],
"maglev_lb_config": {
"table_size": "500"
}
},
"last_updated": "2024-04-03T22:17:05.266Z"
}
]
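A cluster that never finishes warming, like the Maglev cluster above, stays in dynamic_warming_clusters indefinitely. A sketch that extracts any such clusters from a saved config dump (the field names follow the structure shown above; the sample dict is an abbreviated stand-in for real /config_dump output):

```python
def warming_clusters(config_dump: dict):
    """List (name, lb_policy, last_updated) for clusters stuck in warming."""
    out = []
    for section in config_dump.get("configs", []):
        for entry in section.get("dynamic_warming_clusters", []):
            cluster = entry.get("cluster", {})
            out.append((cluster.get("name"),
                        cluster.get("lb_policy"),
                        entry.get("last_updated")))
    return out

# Abbreviated stand-in for json.load()-ed /config_dump output.
dump = {"configs": [{
    "dynamic_warming_clusters": [{
        "cluster": {"name": "outbound|8080||svc-a.namespace-a.svc.cluster.local",
                    "lb_policy": "MAGLEV"},
        "last_updated": "2024-04-03T22:17:05.266Z"}]}]}
for name, policy, ts in warming_clusters(dump):
    print(name, policy, ts)
```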