ryaneorth / k8s-scheduled-volume-snapshotter
Kubernetes operator for automatically creating volume snapshots
License: Apache License 2.0
Hi, I have this setup:
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: scheduled-volume-snapshotter
spec:
  gitImplementation: go-git
  interval: 24h
  ref:
    tag: v0.14.1
  timeout: 20s
  url: https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ${release}
  namespace: ${namespace}
spec:
  interval: 1m
  timeout: 10m
  releaseName: ${release}
  targetNamespace: ${namespace}
  test:
    enable: true
    timeout: 10m
  chart:
    spec:
      chart: helm/charts/scheduled-volume-snapshotter
      sourceRef:
        kind: GitRepository
        name: scheduled-volume-snapshotter
      interval: 24h
  values:
    # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/tree/main/helm/charts/scheduled-volume-snapshotter
    schedule: "*/5 * * * *"
    rbac:
      enabled: true
    successfulJobsHistoryLimit: 3
    failedJobsHistoryLimit: 1
    logLevel: INFO
    startingDeadlineSeconds: 120
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-hourly
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 1h
  snapshotRetention: 4h
  snapshotLabels:
    frequency: hourly
    envName: ${envName}
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-daily
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 24h
  snapshotRetention: 7d
  snapshotLabels:
    frequency: daily
    envName: ${envName}
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-weekly
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 7d
  snapshotRetention: 30d
  snapshotLabels:
    frequency: weekly
    envName: ${envName}
But something is broken in how it manages the daily snapshots. In the image, the first 4 lines are fine: it took one daily snapshot per day on March 19, 20, 21, and 22, each at about 13:10. Then it went wrong: from the 5th line down, all the snapshots are from March 22, about 10-15 minutes apart from each other.
Nothing changed in the deployed resources, and I now have another problem: many snapshots are broken because the driver reports that another operation already exists on the same share, even after I completely removed all the offending snapshots (all those with ready: false):
Status:
  Bound Volume Snapshot Content Name: snapcontent-fae7a4ed-2fd3-47a3-b9e7-438ebcb3f63e
  Error:
    Message: Failed to check and update snapshot content: failed to take snapshot of the volume 172.16.0.102#mnt/kube_data#mongodb/lab/datadir-common-mongodb-hidden-0_pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#: "rpc error: code = Internal desc = failed to mount src nfs server: rpc error: code = Aborted desc = An operation with the given Volume ID 172.16.0.102#mnt/kube_data#mongodb/lab/datadir-common-mongodb-hidden-0_pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#pvc-0132f065-a5d2-490b-903c-2be6345dbd1e# already exists"
    Time: 2024-03-25T15:00:09Z
  Ready To Use: false
What's going on, and how can I fix it? Thanks in advance.
When a ScheduledVolumeSnapshot has a snapshotClass, such as:
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: a
spec:
  persistentVolumeClaimName: a-pvc
  snapshotClassName: a-vsc
  snapshotFrequency: 8h
  snapshotRetention: 3
The snapshotter cronjob should create a VolumeSnapshot that includes the volumeSnapshotClass; however, what I get is:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: a-timestamp
spec:
  source:
    persistentVolumeClaimName: a-pvc
instead of:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: a-timestamp
spec:
  source:
    persistentVolumeClaimName: a-pvc
  volumeSnapshotClassName: a-vsc
When no default snapshot class is defined, this leaves the VolumeSnapshot stuck as non-provisionable, with the following message:
Failed to set default snapshot class with error cannot find default snapshot class
Apart from this issue, the VolumeSnapshot is fine; in fact, if I manually add the missing attribute, snapshot provisioning starts immediately.
Unfortunately, relying only on the default VolumeSnapshotClass does not work in my case, since I need different sets of tags for different snapshots (see https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/driver-parameters.md#volumesnapshotclass).
I suspect the cause lies here: https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/snapshotter.py#L95
In fact, the field should be volumeSnapshotClassName rather than snapshotClassName, as shown in https://kubernetes.io/docs/concepts/storage/volume-snapshots/#volumesnapshots. So that line could be changed from:
'snapshotClassName': scheduled_snapshot.get('spec', {}).get('snapshotClassName'),
to:
'volumeSnapshotClassName': scheduled_snapshot.get('spec', {}).get('snapshotClassName'),
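For illustration, here is a minimal sketch of a snapshot body carrying the renamed field. The function name `build_volume_snapshot_body` is hypothetical (it is not the actual helper in snapshotter.py); only the field rename itself is the proposed fix.

```python
def build_volume_snapshot_body(scheduled_snapshot, snapshot_name):
    """Build a VolumeSnapshot body from a ScheduledVolumeSnapshot dict.

    Hypothetical helper for illustration; `scheduled_snapshot` is assumed to be
    the custom object as returned by the Kubernetes custom objects API.
    """
    spec = scheduled_snapshot.get('spec', {})
    body = {
        'apiVersion': 'snapshot.storage.k8s.io/v1',
        'kind': 'VolumeSnapshot',
        'metadata': {'name': snapshot_name},
        'spec': {
            'source': {
                'persistentVolumeClaimName': spec.get('persistentVolumeClaimName'),
            },
        },
    }
    # The v1 schema expects `volumeSnapshotClassName`, not `snapshotClassName`,
    # so the CRD field is mapped onto the correctly named key here.
    snapshot_class = spec.get('snapshotClassName')
    if snapshot_class is not None:
        body['spec']['volumeSnapshotClassName'] = snapshot_class
    return body
```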
Can you help? Thank you
When the ScheduledVolumeSnapshot object is used to create a VolumeSnapshot, its name is used as one of the labels. Label values can be at most 63 characters, but object names can be longer, so creation of the VolumeSnapshot fails. We need to truncate the ScheduledVolumeSnapshot name when it is used in a label of the VolumeSnapshot:
https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/snapshotter.py#L90
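A minimal sketch of the kind of truncation meant here (the helper name is hypothetical; 63 is the Kubernetes limit for label values):

```python
# Kubernetes label values are limited to 63 characters, while object names
# may be up to 253, so a long name used verbatim as a label value is rejected.
MAX_LABEL_LENGTH = 63

def truncate_for_label(name):
    """Hypothetical helper: shorten a name so it is a valid label value."""
    truncated = name[:MAX_LABEL_LENGTH]
    # Label values must end with an alphanumeric character, so strip any
    # trailing separators left over after truncation.
    return truncated.rstrip('-_.')
```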
deprecationWarning: snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated;
use snapshot.storage.k8s.io/v1 VolumeSnapshot
Any suggestion on how to get notified when snapshots are not being created, without using Prometheus or Grafana alerts? We found yesterday that the snapshotter pod was in Pending state and had not created any snapshots for the last 12 days...
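One lightweight option, sketched here under stated assumptions, is a separate periodic check that looks at the age of the newest VolumeSnapshot and alerts if it is stale. The staleness logic below is self-contained; fetching the snapshot list (e.g. from `kubectl get volumesnapshots -o json`) and sending the notification (email, webhook, ...) are left to the surrounding script, and the 25-hour threshold is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

def newest_snapshot_age(snapshots):
    """Return the age of the most recent snapshot, or None if there are none.

    `snapshots` is a list of VolumeSnapshot dicts as returned by the API,
    each with metadata.creationTimestamp in RFC 3339 format.
    """
    timestamps = [
        datetime.strptime(s['metadata']['creationTimestamp'], '%Y-%m-%dT%H:%M:%SZ')
        .replace(tzinfo=timezone.utc)
        for s in snapshots
    ]
    if not timestamps:
        return None
    return datetime.now(timezone.utc) - max(timestamps)

def snapshots_are_stale(snapshots, max_age=timedelta(hours=25)):
    """True if no snapshot exists or the newest one is older than max_age."""
    age = newest_snapshot_age(snapshots)
    return age is None or age > max_age
```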
When using this Helm chart on Kubernetes clusters 1.21+, it reports the following:
W0304 14:39:32.435391 4669 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0304 14:39:35.812064 4669 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
We are interested in deploying this application with Helm, but we don't want to keep the CronJob's pod history, which is controlled by these fields in the CronJob YAML:
...
spec:
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
...
Would it be possible to expose these attributes in the values YAML file?
Regards
The latest version of the snapshotter (v0.10.3) has a host of critical and high vulnerabilities, detected by grype using the following command:
❯ grype ryaneorth/scheduled-volume-snapshotter:0.10.3 --add-cpes-if-none --only-fixed --fail-on high
✔ Vulnerability DB [no update available]
✔ Loaded image
✔ Parsed image
✔ Cataloged packages [455 packages]
✔ Scanned image [5858 vulnerabilities]
[...]
Note that high and critical vulnerabilities are less than 5% of those 5858 vulnerabilities. Most of them stem from kubernetes==10.0.1, whose vulnerabilities are fixed in versions >= 10.1.0.
I did try to fix some of the vulnerabilities myself, by:
- changing the base image from python:3.7.3 to python:3.9.13-slim-bullseye (-slim to have a slimmer image, -bullseye to have a recent OS version... too bad we can't use scratch or at least alpine for Python :-( )
- bumping, in requirements.txt, the version of kubernetes from 10.0.1 to 10.1.0
My current result is:
❯ grype svs --add-cpes-if-none --only-fixed --fail-on high
✔ Vulnerability DB [no update available]
✔ Loaded image
✔ Parsed image
✔ Cataloged packages [126 packages]
✔ Scanned image [94 vulnerabilities]
NAME INSTALLED FIXED-IN TYPE VULNERABILITY SEVERITY
PyYAML 3.13 5.4 python GHSA-8q59-q68h-6hv4 Critical
pip 20.0.2 21.1 python GHSA-5xp3-jfq3-5q8x Medium
1 error occurred:
* discovered vulnerabilities at or above the severity threshold
However, a vulnerability remains in PyYAML==3.13 (not directly mentioned in requirements.txt); maybe this can be fixed by bumping kubernetes further, beyond 10.1.0, if PyYAML is pulled in by it.
Could you provide a vulnerability-free image, at least for high and critical ones? Thanks!
Hi, is it possible to add this kind of behaviour?
To give you some context: we had issues on Azure, which has a limit of 200 snapshots per file share. While something was broken they reached a peak of over 700, so I had to remove them manually.
But after I cleaned everything up, all the missed snapshots started being queued for execution, which just does not make sense: I don't need 20 daily snapshots all for the SAME day. If one is missed, just wait for the next round...
As per the article above, it seems we need two values, concurrencyPolicy and startingDeadlineSeconds:
- Set .spec.concurrencyPolicy to Forbid in the CronJob YAML (this is already done in the chart template, I just checked): https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#concurrency-policy
  spec.concurrencyPolicy: Forbid will hold off starting a second job if there is still an old one running. However, that job will be queued to start immediately after the old job finishes.
- To skip running a new job entirely and instead wait until the next scheduled time, set .spec.startingDeadlineSeconds to be smaller than the cronjob interval (but larger than the max expected startup time of the job). If you're running a job every 30 minutes and know the job will never take more than one minute to start, set .spec.startingDeadlineSeconds: 60
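Put together, the relevant CronJob fields would look something like this. This is an illustrative sketch, not the chart's actual template; the image name and schedule are placeholders, and the chart would still need to expose startingDeadlineSeconds as a value:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduled-volume-snapshotter
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid        # don't start a new job while one is still running
  startingDeadlineSeconds: 60      # skip a missed run instead of queueing it
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshotter
              image: ryaneorth/scheduled-volume-snapshotter  # illustrative tag omitted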
thanks
There is a conflict if you attempt to schedule multiple snapshots of the same PVC (e.g. with different frequencies, retention intervals, or deletion policies). The issue occurs because the name of the VolumeSnapshot is based on the name of the PVC plus the current time as a UNIX timestamp. Because both schedules initially trigger at the same UNIX timestamp, there is a name conflict.
In my opinion, the correct solution is for snapshot names to be based on the name of the ScheduledVolumeSnapshot plus the UNIX timestamp (or an ISO 8601-encoded time in UTC, e.g. 2022-03-04T14:56+00:00).
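A sketch of what such a naming scheme could look like (the function name is hypothetical; snapshotter.py currently derives the name from the PVC instead):

```python
import time

def snapshot_name(scheduled_snapshot_name, now=None):
    """Derive the VolumeSnapshot name from the ScheduledVolumeSnapshot name.

    Hypothetical helper: because each ScheduledVolumeSnapshot has a unique
    name within its namespace, two schedules targeting the same PVC can fire
    at the same second without colliding.
    """
    timestamp = int(now if now is not None else time.time())
    # Kubernetes object names are capped at 253 characters.
    return '{}-{}'.format(scheduled_snapshot_name, timestamp)[:253]
```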
Example:
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: my-frequent-snapshot
  namespace: my-namespace
spec:
  persistentVolumeClaimName: my-pvc
  snapshotClassName: my-snapshotclass
  snapshotFrequency: 1h
  snapshotLabels:
    backup: frequent
  snapshotRetention: 7d
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: my-infrequent-snapshot
  namespace: my-namespace
spec:
  persistentVolumeClaimName: my-pvc
  snapshotClassName: my-snapshotclass
  snapshotFrequency: 7d
  snapshotLabels:
    backup: infrequent
  snapshotRetention: 90d
Results in the following error:
INFO:root:Creating snapshot my-pvc-1646405113 in namespace my-namespace
INFO:root:Creating snapshot my-pvc-1646405113 in namespace my-namespace
ERROR:root:Unable to create volume snapshot my-pvc-1646405113 in namespace my-namespace
Traceback (most recent call last):
  File "snapshotter.py", line 115, in create_new_snapshot
    volume_snapshot_body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '86aa6dd3-cf2c-49fd-9254-54b6cb95c11b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'd8dd147f-7043-420b-9e24-d2d2c7964694', 'X-Kubernetes-Pf-Prioritylevel-Uid': '38ae6d53-8107-49f0-b303-e28321b88c05', 'Date': 'Fri, 04 Mar 2022 14:45:14 GMT', 'Content-Length': '320'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"volumesnapshots.snapshot.storage.k8s.io \"my-pvc-1646405113\" already exists","reason":"AlreadyExists","details":{"name":"my-pvc-1646405113","group":"snapshot.storage.k8s.io","kind":"volumesnapshots"},"code":409}
Hi, I have some issues using the latest version (0.12.2, updated yesterday from the previous 0.10.4; I just saw you released 0.12.3 in the last few hours).
I have some PVCs of 10 GB in both the lab and qa namespaces, but as you can see, the VolumeSnapshots of the lab ones (in green, on top) have a restoreSize of 100 GB, which is wrong. The qa ones seem correct, though, and the snapshotter is the same for both envs...
If you need any other info, just ask.
Best regards, and thanks