
Comments (8)

brandond commented on July 27, 2024

@paulomarchi according to the screenshot you provided, the error is from a snapshot on node ip-10-23-81-115; however, this node is not present in the cluster. I'm not sure how that's possible - did you delete the node? Was the node in the process of being deleted when the failed snapshots were taken? Did you pull the logs and kubectl get nodes output from a completely different cluster?

We need logs from the node where the failed snapshot occurred - pulling logs from other nodes does not help.

Additionally, when pulling logs please use the node names in the log file names (or maybe just tar them up and use the node name as the tarball name), so we don't have to go poking through each individual log file to figure out which node it's from.

from rancher.

brandond commented on July 27, 2024

It looks like the issue is caused by multiple on-demand snapshots being taken simultaneously. Since the snapshot names embed a timestamp with one-second resolution, concurrent snapshots end up with identical names, writing to the same files and stepping on each other.

Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:35Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Checking if S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="S3 bucket REDACTED-etcd-bkp exists"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.285704Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.300327Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.304681Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.306343Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.304912Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.30844Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:36Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.312246Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.331945Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 17 11:24:36 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:36.33199Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.036856Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.091707Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.091832Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Saving etcd snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 to S3"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Uploading snapshot to s3://REDACTED-etcd-bkp/cluster-teste-etcd-backup/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.137221Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.154013Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.17016Z","logger":"etcd-client.client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: {"level":"info","ts":"2024-06-17T11:24:37.184471Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"25 MB","took":"now"}
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Uploaded snapshot metadata s3://REDACTED-etcd-bkp/cluster-teste-etcd-backup/.metadata/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="S3 upload complete for on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476"
Jun 17 11:24:37 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:37Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rke2[4698]: time="2024-06-17T11:24:38Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"

The rancher-system-agent logs only show a single save command being run, which reports two snapshots (one local, and one s3) being created:

Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240617-112430/0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1"
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[Applyinator] Running command: rke2 [etcd-snapshot save]"
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --agent-token found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --cloud-provider-name found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --cni found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-apiserver found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-controller-manager found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --disable-scheduler found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --kubelet-arg found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-ip found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-label found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --node-taint found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Unknown flag --private-registry found in config.yaml, skipping\\n\""
Jun 17 11:24:35 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:35Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:35Z\" level=warning msg=\"Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:38Z\" level=info msg=\"Snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 saved.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[0e44cd31d8c2a573d0d864cdd7d6ea35158633db17a14c383adc520838c4d5bd_1:stderr]: time=\"2024-06-17T11:24:38Z\" level=info msg=\"Snapshot on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 saved.\""
Jun 17 11:24:38 ip-10-23-66-29.ap-northeast-1.compute.internal rancher-system-agent[7299]: time="2024-06-17T11:24:38Z" level=info msg="[Applyinator] Command rke2 [etcd-snapshot save] finished with err: <nil> and exit code: 0"

I'm not seeing any way that a single etcd-snapshot save command would trigger multiple snapshots, but that sure looks like what's happening. There's also a mutex on the snapshot process that is supposed to prevent multiple snapshots from being taken at the same time, but there's clearly some issue that's preventing that from working as designed.
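The intended guard can be sketched like this. This is a hypothetical illustration of the pattern, not rke2's actual implementation: a `sync.Mutex` with `TryLock` (Go 1.18+) rejects a second snapshot attempt while one is in flight, rather than letting it race the first onto the same files.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// snapshotter is a hypothetical sketch of serializing snapshot creation.
type snapshotter struct {
	mu sync.Mutex
}

var errInProgress = errors.New("snapshot already in progress")

// Save runs do() only if no other snapshot is currently being taken;
// otherwise it fails fast instead of queueing or racing.
func (s *snapshotter) Save(do func() error) error {
	if !s.mu.TryLock() {
		return errInProgress
	}
	defer s.mu.Unlock()
	return do()
}

func main() {
	var s snapshotter
	started := make(chan struct{})
	release := make(chan struct{})
	// First save holds the lock until released.
	go s.Save(func() error { close(started); <-release; return nil })
	<-started
	// A concurrent attempt is rejected while the first is in flight.
	fmt.Println(s.Save(func() error { return nil }))
	close(release)
}
```

If the real guard is scoped per-request rather than per-process (or is bypassed on one code path), concurrent saves could still slip through, which would be consistent with the log output above.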

Thanks for the info, this gives me somewhere to start.


brandond commented on July 27, 2024

This appears to be a bug in etcd? Somehow the temporary part file is created successfully here:
https://github.com/etcd-io/etcd/blob/3b252db4f6e68c3ae3ecaa87ab1b502f46d39d6e/client/v3/snapshot/v3_snapshot.go#L61-L65

But then the rename fails later when trying to move it to its requested name, in the same directory:
https://github.com/etcd-io/etcd/blob/3b252db4f6e68c3ae3ecaa87ab1b502f46d39d6e/client/v3/snapshot/v3_snapshot.go#L94-L95


brandond commented on July 27, 2024

Can you attach the complete logs from the rancher-system-agent and rke2 systemd units, and the etcd pod logs (from /var/log/pods), on the node where this occurs? I don't see how this would even happen, unless something is deleting the .part file out from under etcd while it's in the process of taking the snapshot.


snasovich commented on July 27, 2024

@jakefhyde is also doing some investigations on this - but this doesn't mean we don't need what @brandond requested in the previous comment.


paulomarchi commented on July 27, 2024

Hi guys,
I can help with logs because I'm having the same problem.
Screenshot from 2024-06-12 20-58-13

rke2-server-etcd3.log
rke2-server-etcd1.log
rke2-server-etcd2.log
rancher-system-agent-etcd1.log
rke2-server-controlplane2.log
rke2-server-controlplane1.log
rancher-system-agent-etcd3.log
rancher-system-agent-etcd2.log
rancher-system-agent-controlplane2.log
pod-etcd3.log
pod-etcd2.log
pod-etcd1.log
rancher-system-agent-controlplane1.log

root ~ # /var/lib/rancher/rke2/bin/kubectl get nodes -o wide
NAME                                              STATUS   ROLES                  AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-23-64-36.ap-northeast-1.compute.internal    Ready    worker                 41h   v1.28.10+rke2r1   10.23.64.36    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-65-170.ap-northeast-1.compute.internal   Ready    etcd                   41h   v1.28.10+rke2r1   10.23.65.170   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-67-83.ap-northeast-1.compute.internal    Ready    control-plane,master   41h   v1.28.10+rke2r1   10.23.67.83    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-74-6.ap-northeast-1.compute.internal     Ready    worker                 41h   v1.28.10+rke2r1   10.23.74.6     <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-76-235.ap-northeast-1.compute.internal   Ready    etcd                   41h   v1.28.10+rke2r1   10.23.76.235   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-85-109.ap-northeast-1.compute.internal   Ready    worker                 41h   v1.28.10+rke2r1   10.23.85.109   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-85-11.ap-northeast-1.compute.internal    Ready    etcd                   41h   v1.28.10+rke2r1   10.23.85.11    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-85-224.ap-northeast-1.compute.internal   Ready    control-plane,master   41h   v1.28.10+rke2r1   10.23.85.224   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2



root ~ # /var/lib/rancher/rke2/bin/crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                    ATTEMPT             POD ID              POD
ac28e9301dcd4       11fefebefa034       41 hours ago        Running             liveness-probe          0                   58dac4bc1869f       ebs-csi-node-zgn7g
71b947ccab9f7       97183ef0473ec       41 hours ago        Running             node-driver-registrar   0                   58dac4bc1869f       ebs-csi-node-zgn7g
0f275ba37a239       47072691a2b51       41 hours ago        Running             ebs-plugin              0                   58dac4bc1869f       ebs-csi-node-zgn7g
92642360ba602       5c6ffd2b2a1d0       42 hours ago        Running             calico-node             0                   df4c479836e66       calico-node-lplf7
20cc10f801024       b7e03d90f06bb       42 hours ago        Running             kube-proxy              0                   7dc53e047c36e       kube-proxy-ip-10-23-76-235.ap-northeast-1.compute.internal
d8d8165acae72       7893f7425a52a       42 hours ago        Running             etcd                    0                   0f1349313f87a       etcd-ip-10-23-76-235.ap-northeast-1.compute.internal


root ~ # /var/lib/rancher/rke2/bin/crictl pods
POD ID              CREATED             STATE               NAME                                                         NAMESPACE           ATTEMPT             RUNTIME
58dac4bc1869f       41 hours ago        Ready               ebs-csi-node-zgn7g                                           kube-system         0                   (default)
df4c479836e66       42 hours ago        Ready               calico-node-lplf7                                            calico-system       0                   (default)
7dc53e047c36e       42 hours ago        Ready               kube-proxy-ip-10-23-76-235.ap-northeast-1.compute.internal   kube-system         0                   (default)
0f1349313f87a       42 hours ago        Ready               etcd-ip-10-23-76-235.ap-northeast-1.compute.internal         kube-system         0                   (default)


paulomarchi commented on July 27, 2024

Hi guys,
I created a new test cluster to reproduce the issue again.
Screenshot from 2024-06-17 12-25-14

Error:

on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476
could not rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476 (rename /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476.part /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-10-23-66-29.ap-northeast-1.compute.internal-1718623476: no such file or directory)

Additional cluster info:

root ~ # /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide
NAME                                              STATUS   ROLES                  AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-23-66-29.ap-northeast-1.compute.internal    Ready    etcd                   21m   v1.28.10+rke2r1   10.23.66.29    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-66-35.ap-northeast-1.compute.internal    Ready    worker                 18m   v1.28.10+rke2r1   10.23.66.35    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-75-65.ap-northeast-1.compute.internal    Ready    etcd                   21m   v1.28.10+rke2r1   10.23.75.65    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-77-75.ap-northeast-1.compute.internal    Ready    worker                 18m   v1.28.10+rke2r1   10.23.77.75    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-78-61.ap-northeast-1.compute.internal    Ready    control-plane,master   21m   v1.28.10+rke2r1   10.23.78.61    <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-80-127.ap-northeast-1.compute.internal   Ready    etcd                   21m   v1.28.10+rke2r1   10.23.80.127   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-80-177.ap-northeast-1.compute.internal   Ready    worker                 18m   v1.28.10+rke2r1   10.23.80.177   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2
ip-10-23-82-184.ap-northeast-1.compute.internal   Ready    control-plane,master   21m   v1.28.10+rke2r1   10.23.82.184   <none>        Amazon Linux 2   4.14.336-257.566.amzn2.x86_64   containerd://1.7.11-k3s2



root ~ # /var/lib/rancher/rke2/bin/crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                    ATTEMPT             POD ID              POD
0cbd23dd9e731       11fefebefa034       18 minutes ago      Running             liveness-probe          0                   a90617e1a8a53       ebs-csi-node-mxckl
0bf6c0adfe7af       97183ef0473ec       18 minutes ago      Running             node-driver-registrar   0                   a90617e1a8a53       ebs-csi-node-mxckl
e66cfa1a7b38a       47072691a2b51       18 minutes ago      Running             ebs-plugin              0                   a90617e1a8a53       ebs-csi-node-mxckl
fba5072a0523c       5c6ffd2b2a1d0       23 minutes ago      Running             calico-node             0                   dbd97eeb0fef6       calico-node-8r6t6
6a8fef2226567       b7e03d90f06bb       24 minutes ago      Running             kube-proxy              0                   8ba314f23804b       kube-proxy-ip-10-23-66-29.ap-northeast-1.compute.internal
bb91f2f315c01       7893f7425a52a       35 minutes ago      Running             etcd                    0                   111dc5bda984e       etcd-ip-10-23-66-29.ap-northeast-1.compute.internal


root ~ # /var/lib/rancher/rke2/bin/crictl pods
POD ID              CREATED             STATE               NAME                                                        NAMESPACE           ATTEMPT             RUNTIME
a90617e1a8a53       18 minutes ago      Ready               ebs-csi-node-mxckl                                          kube-system         0                   (default)
dbd97eeb0fef6       23 minutes ago      Ready               calico-node-8r6t6                                           calico-system       0                   (default)
8ba314f23804b       24 minutes ago      Ready               kube-proxy-ip-10-23-66-29.ap-northeast-1.compute.internal   kube-system         0                   (default)
111dc5bda984e       36 minutes ago      Ready               etcd-ip-10-23-66-29.ap-northeast-1.compute.internal         kube-system         0                   (default)

Logs:

rancher-etcd-logs.zip

Rancher Server:

I'm using Rancher Server v2.8.4 running on a K3s cluster, v1.28.10+k3s1.


snasovich commented on July 27, 2024

Moving to Blocked as we're waiting on new RKE2 versions where this should be fixed.

