Comments (8)
@paulomarchi According to the screenshot you provided, the error is from a snapshot on node ip-10-23-81-115; however, this node is not present in the cluster. I'm not sure how that's possible. Did you delete the node? Was the node in the process of being deleted when the failed snapshots were taken? Did you pull the logs and kubectl get nodes output from a completely different cluster?
We need logs from the node where the failed snapshot occurred; pulling logs from other nodes does not help.
Additionally, when pulling logs, please use the node names in the log file names (or tar them up and use the node name as the tarball name), so we don't have to go poking through each individual log file to figure out which node it's from.
It looks like the issue is caused by multiple on-demand snapshots being taken simultaneously. Since the snapshot names embed a second-resolution timestamp, simultaneous snapshots end up with the same name, writing to the same files and stepping on each other.
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:35Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.285704Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.300327Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.304681Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.306343Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.304912Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.30844Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.312246Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.331945Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.33199Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.036856Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.091707Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.091832Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Saving etcd snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 to S3"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Uploading snapshot to s3://REDACTED-etcd-bkp/cluster-teste-etcd-backup/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.137221Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.154013Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.17016Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.184471Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Uploaded snapshot metadata s3://REDACTED-etcd-bkp/cluster-teste-etcd-backup/.metadata/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="S3 upload complete for on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:38Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
The rancher-system-agent logs show only a single save command being run, which reports two snapshots being created (one local and one in S3):
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240617-112430/0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1"
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[Applyinator] Running command: rke2 [etcd-snapshot save]"
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --agent-token found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --cloud-provider-name found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --cni found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-apiserver found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-controller-manager found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-scheduler found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --kubelet-arg found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-ip found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-taint found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --private-registry found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:38Z\" level=info msg=\"Snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 saved.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:38Z\" level=info msg=\"Snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 saved.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[Applyinator] Command rke2 [etcd-snapshot save] finished with err: <nil> and exit code: 0"
I'm not seeing any way that a single etcd-snapshot save command would trigger multiple snapshots, but that sure looks like what's happening. There's also a mutex on the snapshot process that is supposed to prevent multiple snapshots from being taken at the same time, but there's clearly some issue preventing it from working as designed.
Thanks for the info; this gives me somewhere to start.
Could this be a bug in etcd? Somehow the temporary .part file is created successfully here:
https://github.com/etcd-io/etcd/blob/3b252db4f6e68c3ae3ecaa87ab1b502f46d39d6e/client/v3/snapshot/v3_snapshot.go#L61-L65
but then the rename fails later when trying to move it to its requested name in the same directory:
https://github.com/etcd-io/etcd/blob/3b252db4f6e68c3ae3ecaa87ab1b502f46d39d6e/client/v3/snapshot/v3_snapshot.go#L94-L95
Can you attach the complete logs from the rancher-system-agent and rke2 systemd units, and the etcd pod logs (from /var/log/pods), on the node where this occurs? I don't see how this could even happen, unless something is deleting the .part file out from under etcd while it's in the process of taking the snapshot.
@jakefhyde is also investigating this, but we still need the information @brandond requested in the previous comment.
Hi guys, I can help with logs, because I'm having the same problem.
rke2-server-etcd3.log
rke2-server-etcd1.log
rke2-server-etcd2.log
rancher-system-agent-etcd1.log
rke2-server-controlplane2.log
rke2-server-controlplane1.log
rancher-system-agent-etcd3.log
rancher-system-agent-etcd2.log
rancher-system-agent-controlplane2.log
pod-etcd3.log
pod-etcd2.log
pod-etcd1.log
rancher-system-agent-controlplane1.log
root ~ # /var/lib/rancher/rke2/bin/kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-23-64-36.ap-northeast-1.compute.internal Ready worker 41h v1.28.10+rke2r1 10.23.64.36 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-65-170.ap-northeast-1.compute.internal Ready etcd 41h v1.28.10+rke2r1 10.23.65.170 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-67-83.ap-northeast-1.compute.internal Ready control-plane,master 41h v1.28.10+rke2r1 10.23.67.83 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-74-6.ap-northeast-1.compute.internal Ready worker 41h v1.28.10+rke2r1 10.23.74.6 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-76-235.ap-northeast-1.compute.internal Ready etcd 41h v1.28.10+rke2r1 10.23.76.235 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-85-109.ap-northeast-1.compute.internal Ready worker 41h v1.28.10+rke2r1 10.23.85.109 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-85-11.ap-northeast-1.compute.internal Ready etcd 41h v1.28.10+rke2r1 10.23.85.11 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-85-224.ap-northeast-1.compute.internal Ready control-plane,master 41h v1.28.10+rke2r1 10.23.85.224 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
root ~ # /var/lib/rancher/rke2/bin/crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
ac28e9301dcd4 11fefebefa034 41 hours ago Running liveness-probe 0 58dac4bc1869f ebs-csi-node-zgn7g
71b947ccab9f7 97183ef0473ec 41 hours ago Running node-driver-registrar 0 58dac4bc1869f ebs-csi-node-zgn7g
0f275ba37a239 47072691a2b51 41 hours ago Running ebs-plugin 0 58dac4bc1869f ebs-csi-node-zgn7g
92642360ba602 5c6ffd2b2a1d0 42 hours ago Running calico-node 0 df4c479836e66 calico-node-lplf7
20cc10f801024 b7e03d90f06bb 42 hours ago Running kube-proxy 0 7dc53e047c36e kube-proxy-ip-10-23-76-235.ap-northeast-1.compute.internal
d8d8165acae72 7893f7425a52a 42 hours ago Running etcd 0 0f1349313f87a etcd-ip-10-23-76-235.ap-northeast-1.compute.internal
root ~ # /var/lib/rancher/rke2/bin/crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
58dac4bc1869f 41 hours ago Ready ebs-csi-node-zgn7g kube-system 0 (default)
df4c479836e66 42 hours ago Ready calico-node-lplf7 calico-system 0 (default)
7dc53e047c36e 42 hours ago Ready kube-proxy-ip-10-23-76-235.ap-northeast-1.compute.internal kube-system 0 (default)
0f1349313f87a 42 hours ago Ready etcd-ip-10-23-76-235.ap-northeast-1.compute.internal kube-system 0 (default)
Hi guys, I created a new test cluster to reproduce the issue again.
Error:
on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476
could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)
Additional info for the cluster:
root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-23-66-29.ap-northeast-1.compute.internal Ready etcd 21m v1.28.10+rke2r1 10.23.66.29 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-66-35.ap-northeast-1.compute.internal Ready worker 18m v1.28.10+rke2r1 10.23.66.35 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-75-65.ap-northeast-1.compute.internal Ready etcd 21m v1.28.10+rke2r1 10.23.75.65 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-77-75.ap-northeast-1.compute.internal Ready worker 18m v1.28.10+rke2r1 10.23.77.75 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-78-61.ap-northeast-1.compute.internal Ready control-plane,master 21m v1.28.10+rke2r1 10.23.78.61 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-80-127.ap-northeast-1.compute.internal Ready etcd 21m v1.28.10+rke2r1 10.23.80.127 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-80-177.ap-northeast-1.compute.internal Ready worker 18m v1.28.10+rke2r1 10.23.80.177 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
ip-10-23-82-184.ap-northeast-1.compute.internal Ready control-plane,master 21m v1.28.10+rke2r1 10.23.82.184 <none> Amazon Linux 2 4.14.336-257.566.amzn2.x86_64 containerd://1.7.11-k3s2
root ~ # /var/lib/rancher/rke2/bin/crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
0cbd23dd9e731 11fefebefa034 18 minutes ago Running liveness-probe 0 a90617e1a8a53 ebs-csi-node-mxckl
0bf6c0adfe7af 97183ef0473ec 18 minutes ago Running node-driver-registrar 0 a90617e1a8a53 ebs-csi-node-mxckl
e66cfa1a7b38a 47072691a2b51 18 minutes ago Running ebs-plugin 0 a90617e1a8a53 ebs-csi-node-mxckl
fba5072a0523c 5c6ffd2b2a1d0 23 minutes ago Running calico-node 0 dbd97eeb0fef6 calico-node-8r6t6
6a8fef2226567 b7e03d90f06bb 24 minutes ago Running kube-proxy 0 8ba314f23804b kube-proxy-ip-10-23-66-29.ap-northeast-1.compute.internal
bb91f2f315c01 7893f7425a52a 35 minutes ago Running etcd 0 111dc5bda984e etcd-ip-10-23-66-29.ap-northeast-1.compute.internal
root ~ # /var/lib/rancher/rke2/bin/crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
a90617e1a8a53 18 minutes ago Ready ebs-csi-node-mxckl kube-system 0 (default)
dbd97eeb0fef6 23 minutes ago Ready calico-node-8r6t6 calico-system 0 (default)
8ba314f23804b 24 minutes ago Ready kube-proxy-ip-10-23-66-29.ap-northeast-1.compute.internal kube-system 0 (default)
111dc5bda984e 36 minutes ago Ready etcd-ip-10-23-66-29.ap-northeast-1.compute.internal kube-system 0 (default)
Logs:
Rancher Server:
I'm using Rancher Server v2.8.4 running on a K3s cluster (v1.28.10+k3s1).
Moving to Blocked as we're waiting on new RKE2 versions where this should be fixed.