Comments (20)
I may be missing something, but what happens if something has changed that caused the kubelet to not be ready ... imagine the network plugin is not working ... during the initial period, until the check changes the state, pods will be scheduled on the node and will fail ...
I think that we are operating on the assumption that this is a restart and that nothing changed during that time, but is this a safe assumption, or can we somehow guarantee that nothing has changed that could impact the node readiness state?
from kubernetes.
/sig node
from kubernetes.
I can reproduce this issue in v1.30.
When the container runtime is healthy, the kubelet should not report KubeletNotReady with the reason "runtime status check may not have completed yet".
We call kl.updateRuntimeUp() in the fastStatusUpdateOnce function to update the status of the container runtime, but we do not do this in syncNodeStatus:
kubernetes/pkg/kubelet/kubelet_node_status.go
Lines 467 to 469 in 37ca037
We only need to execute kl.updateRuntimeUp() once before fastStatusUpdateOnce and syncNodeStatus, just as we do inside the fastStatusUpdateOnce function, and then this problem can be avoided.
But in #122338 (comment), @aojea seems to have a different opinion...
from kubernetes.
Yeah, in our environment we modified the kubelet the same way you describe, so that it executes kl.updateRuntimeUp once before syncNodeStatus, and it worked.
I can submit a PR later.
from kubernetes.
I may be missing something, but what happens if something has changed that caused the kubelet to not be ready ... imagine the network plugin is not working ... during the initial period, until the check changes the state, pods will be scheduled on the node and will fail ...
I think that we are operating on the assumption that this is a restart and that nothing changed during that time, but is this a safe assumption, or can we somehow guarantee that nothing has changed that could impact the node readiness state?
In this issue, what I found is that nothing has changed: every time I restart the kubelet, the node becomes NotReady only in the first sync period, and in the next period it is Ready again. It's not an assumption.
A network plugin that is not working, or some other problem, will of course make the node NotReady, but those cases are outside this issue's scope. Here the NotReady reason is the "container runtime status check may not have completed yet" message, which has no relationship to other problems.
from kubernetes.
But in #122338 (comment), @aojea seems to have some different opinions...
OK, I misread this issue, sorry. So it is not to blindly set the node to ready, it is just to perform the runtime check before the other checks ... I think you both are right ... actually it seems that if fastStatusUpdateOnce wins the race then there is no problem, right @HirazawaUi?
from kubernetes.
actually it seems that if fastStatusUpdateOnce wins the race then there is no problem, right @HirazawaUi?
Yes, but fastStatusUpdateOnce and syncNodeStatus run in different goroutines, so we cannot guarantee that fastStatusUpdateOnce will complete first.
So it seems like a good choice to execute updateRuntimeUp before running syncNodeStatus. Or do you have a better suggestion?
from kubernetes.
diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index af74a095628..cd8acc7fbf0 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -1626,23 +1626,27 @@ func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
// Start volume manager
go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)
+ // Check the container runtime status.
+ // This has to run before kl.syncNodeStatus (https://issues.k8s.io/124397)
+ go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
+
if kl.kubeClient != nil {
// Start two go-routines to update the status.
//
- // The first will report to the apiserver every nodeStatusUpdateFrequency and is aimed to provide regular status intervals,
- // while the second is used to provide a more timely status update during initialization and runs an one-shot update to the apiserver
+ // The first will is used to provide a more timely status update during initialization and runs an one-shot update to the apiserver
// once the node becomes ready, then exits afterwards.
+ go kl.fastStatusUpdateOnce()
+
+ // The second will report to the apiserver every nodeStatusUpdateFrequency and is aimed to provide regular status intervals,
//
// Introduce some small jittering to ensure that over time the requests won't start
// accumulating at approximately the same time from the set of nodes due to priority and
// fairness effect.
go wait.JitterUntil(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, 0.04, true, wait.NeverStop)
- go kl.fastStatusUpdateOnce()
// start syncing lease
go kl.nodeLeaseController.Run(context.Background())
}
- go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
// Set up iptables util rules
if kl.makeIPTablesUtilChains {
https://go.dev/play/p/752NWud709S
We need to put the goroutines in the right order to make the startup more predictable
from kubernetes.
But in #122338 (comment), @aojea seems to have a different opinion...
OK, I misread this issue, sorry. So it is not to blindly set the node to ready, it is just to perform the runtime check before the other checks ... I think you both are right ... actually it seems that if fastStatusUpdateOnce wins the race then there is no problem, right @HirazawaUi?
Yes, if fastStatusUpdateOnce executes very fast, there may be no problem, because fastStatusUpdateOnce itself executes updateRuntimeUp.
I have tested v1.28 (where the code in fastStatusUpdateOnce is almost the same as in 1.30). The issue does not always reproduce, but it still occurs sometimes.
In 1.22, fastStatusUpdateOnce sleeps for about 100ms before doing anything else, so the issue can be reproduced every time.
from kubernetes.
We need to put the goroutines in the right order to make the startup more predictable
Arranging the goroutines in order can solve our problem in most cases, but the updateRuntimeUp method needs to call the container runtime API to get the status. I am worried that its response will not be timely enough in special scenarios, so there may still be a risk that syncNodeStatus completes first.
kubernetes/pkg/kubelet/kubelet.go
Lines 2873 to 2878 in 2806ffe
from kubernetes.
Arranging the goroutines in order can solve our problems in most cases
Moving go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop) up makes it execute first: https://go.dev/play/p/752NWud709S
from kubernetes.
Moving go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop) up makes it execute first: https://go.dev/play/p/752NWud709S
I think I did not express myself clearly.
What I'm worried about is that, under special circumstances, updateRuntimeUp calls the API of the container runtime and cannot return immediately (I am not sure whether this situation really exists). Even if the updateRuntimeUp method is started first, it can still complete later than syncNodeStatus :)
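The concern above can be illustrated with a small Go sketch. It is an illustration under assumptions, not kubelet code: the goroutine stands in for updateRuntimeUp, the reads of runtimeUp stand in for syncNodeStatus, and a channel simulates a container runtime call that does not return immediately. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// raceDemo starts the simulated runtime check first, but the simulated
// runtime call only returns after the first status sync has run.
// It returns what the first and the second status sync observe.
func raceDemo() (first, second bool) {
	var runtimeUp atomic.Bool
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	// Hypothetical stand-in for updateRuntimeUp: started first, finishes late.
	go func() {
		close(started)
		// Simulated slow container runtime API call.
		<-release
		runtimeUp.Store(true)
		close(done)
	}()

	// The check is definitely running by now.
	<-started
	// Hypothetical stand-in for the first syncNodeStatus.
	first = runtimeUp.Load()
	// The runtime call finally returns.
	close(release)
	<-done
	// The next sync period sees a ready runtime.
	second = runtimeUp.Load()
	return first, second
}

func main() {
	first, second := raceDemo()
	fmt.Printf("first sync ready=%v, second sync ready=%v\n", first, second)
}
```

In this model the first sync never sees a ready runtime even though the check was started earlier, which matches the NotReady-for-one-period behavior reported in this issue.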
from kubernetes.
Arranging the goroutines in order can solve our problems in most cases
Moving go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop) up makes it execute first: https://go.dev/play/p/752NWud709S
To add some detail:
updateRuntimeUp calls the runtime API to check the runtime status, then sets the lastBaseRuntimeSync variable.
kubernetes/pkg/kubelet/kubelet.go
Line 2913 in 9227001
syncNodeStatus checks the lastBaseRuntimeSync variable. If lastBaseRuntimeSync has not been set yet, the problem described in this issue occurs.
kubernetes/pkg/kubelet/kubelet_node_status.go
Line 748 in bf07ef3
kubernetes/pkg/kubelet/runtime.go
Line 108 in 9227001
Even if the updateRuntimeUp goroutine is started first, the container runtime check still takes some time, so there is no guarantee that lastBaseRuntimeSync is set by the time syncNodeStatus is first called.
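The mechanism described above can be modeled in a few lines of Go. This is a simplified sketch, not the kubelet's code: the runtimeState type below is a hypothetical stand-in that keeps only the lastBaseRuntimeSync field, and it assumes the field is a time.Time whose zero value means that no check has completed. setRuntimeSync and firstSyncError are made-up helper names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// runtimeState is a hypothetical, stripped-down model of the kubelet's
// runtime state; only the field discussed in this thread is kept.
type runtimeState struct {
	mu sync.RWMutex
	// Assumption: the zero value means "never synced".
	lastBaseRuntimeSync time.Time
}

// setRuntimeSync models what updateRuntimeUp records after a successful check.
func (s *runtimeState) setRuntimeSync(t time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastBaseRuntimeSync = t
}

// runtimeErrors models the check that syncNodeStatus relies on.
func (s *runtimeState) runtimeErrors() error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.lastBaseRuntimeSync.IsZero() {
		return errors.New("container runtime status check may not have completed yet")
	}
	return nil
}

// firstSyncError returns the error seen by the first status sync,
// optionally running the runtime check once beforehand.
func firstSyncError(checkFirst bool) error {
	s := &runtimeState{}
	if checkFirst {
		s.setRuntimeSync(time.Now())
	}
	return s.runtimeErrors()
}

func main() {
	fmt.Println("sync before check:", firstSyncError(false))
	fmt.Println("sync after check: ", firstSyncError(true))
}
```

The first call reports the error from this issue, and the second reports none, because the timestamp was recorded before the status was read.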
from kubernetes.
are you suggesting to lock syncNodeStatus on updateRuntimeUp?
I'm afraid of some corner cases we can hit ... if the container runtime does not return do we block forever?
from kubernetes.
are you suggesting to lock syncNodeStatus on updateRuntimeUp?
I'm afraid of some corner cases we can hit ... if the container runtime does not return do we block forever?
Container runtime API calls have a timeout.
kubernetes/pkg/kubelet/cri/remote/remote_runtime.go
Lines 620 to 626 in 695a984
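As a rough illustration of why a timeout bounds the wait, here is a minimal Go sketch using the standard context package. It does not reproduce the referenced kubelet code: hungRuntimeStatus, checkWithTimeout and timedOut are hypothetical names, and the 50ms timeout is an arbitrary value chosen for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hungRuntimeStatus is a hypothetical runtime call that never answers on
// its own; it returns only when the context is cancelled or times out.
func hungRuntimeStatus(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// checkWithTimeout bounds a runtime status call with a deadline, so a
// caller waiting on the result cannot block forever.
func checkWithTimeout(status func(context.Context) error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return status(ctx)
}

// timedOut reports whether err is a deadline expiry.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := checkWithTimeout(hungRuntimeStatus, 50*time.Millisecond)
	fmt.Println("hung runtime:", err, "timed out:", timedOut(err))
}
```

With a deadline like this, a caller that waits for the runtime check gets an error back after the timeout instead of hanging.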
from kubernetes.
Can you try this on a newer version? 1.22 is out of support and we have made changes to node readiness since.
/triage needs-information
from kubernetes.
Can you try this on a newer version? 1.22 is out of support and we have made changes to node readiness since.
/triage needs-information
I have tried 1.28 and can reproduce it there; @HirazawaUi can reproduce it in 1.30.
from kubernetes.
/remove-triage needs-information
from kubernetes.