ryaneorth / k8s-scheduled-volume-snapshotter

Kubernetes operator for automatically creating volume snapshots

License: Apache License 2.0

Dockerfile 1.94% Python 89.63% Mustache 8.43%
kubernetes volumes operator helm snapshots volumesnapshots csi persistent-volumes scheduled-snapshots kubernetes-operator


k8s-scheduled-volume-snapshotter's Issues

daily schedule not respected

Hi, I have this setup:

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: scheduled-volume-snapshotter
spec:
  gitImplementation: go-git
  interval: 24h
  ref:
    tag: v0.14.1
  timeout: 20s
  url: https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ${release}
  namespace: ${namespace}
spec:
  interval: 1m
  timeout: 10m
  releaseName: ${release}
  targetNamespace: ${namespace}
  test:
    enable: true
    timeout: 10m
  chart:
    spec:
      chart: helm/charts/scheduled-volume-snapshotter
      sourceRef:
        kind: GitRepository
        name: scheduled-volume-snapshotter
      interval: 24h
  values:
    # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/tree/main/helm/charts/scheduled-volume-snapshotter
    schedule: "*/5 * * * *"
    rbac:
      enabled: true
    successfulJobsHistoryLimit: 3
    failedJobsHistoryLimit: 1
    logLevel: INFO
    startingDeadlineSeconds: 120
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-hourly
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 1h
  snapshotRetention: 4h
  snapshotLabels:
    frequency: hourly
    envName: ${envName}
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-daily
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 24h
  snapshotRetention: 7d
  snapshotLabels:
    frequency: daily
    envName: ${envName}
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: ha-lts-snapshots-weekly
  namespace: ${namespace}
spec:
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter#scheduling-snapshots
  # see https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/helm/charts/scheduled-volume-snapshotter/crds/scheduled-volume-snapshot-crd.yaml
  snapshotClassName: ${release}
  persistentVolumeClaimName: ha-core-data-lts
  snapshotFrequency: 7d
  snapshotRetention: 30d
  snapshotLabels:
    frequency: weekly
    envName: ${envName}

But something is broken in how it manages daily snapshots. In the image, the first four rows are correct: it took one daily snapshot per day on March 19, 20, 21, and 22, each at about 13:10. From the fifth row down it goes wrong: all of those snapshots are from March 22, each about 10-15 minutes apart from the next.

[screenshot: daily snapshot timestamps]

Nothing changed in the deployed resources, and I now have another problem: many snapshots are broken because the driver says another operation for the same share already exists, even if I completely remove all the offending snapshots (all those with ready: false):

Status:
  Bound Volume Snapshot Content Name:  snapcontent-fae7a4ed-2fd3-47a3-b9e7-438ebcb3f63e
  Error:
    Message:     Failed to check and update snapshot content: failed to take snapshot of the volume 172.16.0.102#mnt/kube_data#mongodb/lab/datadir-common-mongodb-hidden-0_pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#: "rpc error: code = Internal desc = failed to mount src nfs server: rpc error: code = Aborted desc = An operation with the given Volume ID 172.16.0.102#mnt/kube_data#mongodb/lab/datadir-common-mongodb-hidden-0_pvc-0132f065-a5d2-490b-903c-2be6345dbd1e#pvc-0132f065-a5d2-490b-903c-2be6345dbd1e# already exists"
    Time:        2024-03-25T15:00:09Z
  Ready To Use:  false

What's going on, and how can it be fixed? Thanks in advance.

Volumesnapshot does not work when default class is not defined

When a ScheduledVolumeSnapshot has a snapshotClassName, such as:

apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: a
spec:
  persistentVolumeClaimName: a-pvc
  snapshotClassName: a-vsc
  snapshotFrequency: 8h
  snapshotRetention: 3

The snapshotter cronjob should create a VolumeSnapshot that includes the volume snapshot class; however, what I actually get is:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: a-timestamp
spec:
  source:
    persistentVolumeClaimName: a-pvc

instead of:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: a-timestamp
spec:
  source:
    persistentVolumeClaimName: a-pvc
  volumeSnapshotClassName: a-vsc

When no default snapshot class is defined, this leaves the VolumeSnapshot stuck as non-provisionable, with the following message:

Failed to set default snapshot class with error cannot find default snapshot class

Apart from this issue, the VolumeSnapshot is fine; in fact, if I manually add the missing attribute, snapshot provisioning starts immediately.

Unfortunately, using only the default VolumeSnapshotClass does not work for my case, since I need different sets of tags for different snapshots (see https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/driver-parameters.md#volumesnapshotclass ).

I suspect that the cause lies here: https://github.com/ryaneorth/k8s-scheduled-volume-snapshotter/blob/main/snapshotter.py#L95

In fact, the field should be volumeSnapshotClassName rather than snapshotClassName, as shown in https://kubernetes.io/docs/concepts/storage/volume-snapshots/#volumesnapshots .

So that code line might be changed from:

'snapshotClassName': scheduled_snapshot.get('spec', {}).get('snapshotClassName'),

to:

'volumeSnapshotClassName': scheduled_snapshot.get('spec', {}).get('snapshotClassName'),
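For illustration, here is a minimal sketch of the corrected body construction. The helper name and dict layout are assumptions for this example, not the project's actual code; the field names come from the snapshot.storage.k8s.io/v1 API:

```python
def build_volume_snapshot_body(scheduled_snapshot, snapshot_name):
    """Sketch: map a ScheduledVolumeSnapshot spec to a VolumeSnapshot body."""
    spec = {
        'source': {
            'persistentVolumeClaimName':
                scheduled_snapshot.get('spec', {}).get('persistentVolumeClaimName'),
        },
    }
    # Read snapshotClassName from the custom resource, but write it out
    # under the name the VolumeSnapshot API actually expects.
    snapshot_class = scheduled_snapshot.get('spec', {}).get('snapshotClassName')
    if snapshot_class is not None:
        spec['volumeSnapshotClassName'] = snapshot_class
    return {
        'apiVersion': 'snapshot.storage.k8s.io/v1',
        'kind': 'VolumeSnapshot',
        'metadata': {'name': snapshot_name},
        'spec': spec,
    }
```

With this shape, omitting snapshotClassName in the custom resource falls through to the cluster's default snapshot class, while setting it produces the second (working) VolumeSnapshot shown above.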

Can you help? Thank you

notification

Any suggestions on how to get notified when snapshots are not taken, for whatever reason, without using Prometheus or Grafana alerts? We found yesterday that the snapshotter pod had been stuck in Pending state and had not created any snapshots for the last 12 days...
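Lacking a built-in notification hook, one low-tech workaround is a small cron script that checks the age of the newest VolumeSnapshot and prints an alert line you can pipe into whatever notifier you already use (mail, a webhook, etc.). This is a hypothetical sketch, not part of the project; it assumes kubectl with the snapshot CRDs installed, and GNU date with a BSD fallback:

```shell
#!/bin/sh
# Hypothetical freshness check: alert when the newest VolumeSnapshot in
# $NAMESPACE is older than 25 hours (i.e. a daily snapshot was missed).
NAMESPACE="${NAMESPACE:-default}"
THRESHOLD="$(date -u -d '25 hours ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
  || date -u -v-25H +%Y-%m-%dT%H:%M:%SZ)"
# creationTimestamp is RFC 3339, so a plain lexical sort is chronological.
LATEST="$(kubectl get volumesnapshots -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\n"}{end}' 2>/dev/null \
  | sort | tail -n 1)"
# Alert when there is no snapshot at all, or the newest one predates the threshold.
if [ -z "$LATEST" ] || [ "$(printf '%s\n' "$LATEST" "$THRESHOLD" | sort | head -n 1)" = "$LATEST" ]; then
  echo "ALERT: newest snapshot in $NAMESPACE is ${LATEST:-missing} (threshold $THRESHOLD)"
fi
```

Run it from cron and route any output line to your notifier of choice; a healthy namespace produces no output.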

Deprecated K8s API versions in K8s 1.21+

When using these Helm charts on Kubernetes clusters 1.21+, the following warnings are reported:

  W0304 14:39:32.435391    4669 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
  W0304 14:39:35.812064    4669 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
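Fixing the warnings would mean bumping the API versions in the chart templates; a sketch of just the changed lines:

```yaml
# CRD template: apiextensions.k8s.io/v1 is available from Kubernetes 1.16
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
---
# CronJob template: batch/v1 is available from Kubernetes 1.21
apiVersion: batch/v1
kind: CronJob
```

Note that the v1 CRD API also requires a structural OpenAPI schema, so the CRD migration may involve more than the apiVersion line.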

Feature - cronjob max complete/failed pods history limits

We are interested in using this application via a Helm deployment, but we don't want to keep the cronjob's pod history, which is controlled inside the CronJob YAML:

...
spec:
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
...

Would it be possible to expose these attributes in the values YAML file?
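For reference, a sketch of what such values might look like, with names assumed to mirror the underlying CronJob fields (not the chart's confirmed schema):

```yaml
# Hypothetical values.yaml additions; the chart template would pass
# these straight through to the CronJob spec.
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
```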

regards

Snapshotter container image has critical/high vulnerabilities

The latest version of the snapshotter (v0.10.3) has a host of critical and high vulnerabilities, which are detected by grype using the following command:

❯ grype ryaneorth/scheduled-volume-snapshotter:0.10.3 --add-cpes-if-none --only-fixed --fail-on high
 ✔ Vulnerability DB        [no update available]
 ✔ Loaded image
 ✔ Parsed image
 ✔ Cataloged packages      [455 packages]
 ✔ Scanned image           [5858 vulnerabilities]
[...]

Note that high and critical vulnerabilities account for less than 5% of those 5858 findings.

Most of those high and critical vulnerabilities stem from:

  • an old Python version (3.7.3) and a "fat" base image with lots of files
  • old versions of Python dependencies (such as kubernetes==10.0.1, whose vulnerabilities are fixed in versions >= 10.1.0)

I tried to fix some of the vulnerabilities myself by:

  • updating the base image from python:3.7.3 to python:3.9.13-slim-bullseye (-slim for a slimmer image, -bullseye for a recent OS version ... too bad we can't use scratch or at least alpine for Python :-( )
  • updating the version of kubernetes in requirements.txt from 10.0.1 to 10.1.0
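The two changes above can be sketched as a Dockerfile fragment. This is an illustrative guess at the layout, not the project's actual Dockerfile:

```dockerfile
# Slim, newer base image instead of the fat python:3.7.3 one
FROM python:3.9.13-slim-bullseye
WORKDIR /app
# requirements.txt bumped to kubernetes>=10.1.0 (illustrative)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY snapshotter.py .
CMD ["python", "snapshotter.py"]
```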

My current result is:

❯ grype svs --add-cpes-if-none --only-fixed --fail-on high
 ✔ Vulnerability DB        [no update available]
 ✔ Loaded image
 ✔ Parsed image
 ✔ Cataloged packages      [126 packages]
 ✔ Scanned image           [94 vulnerabilities]
NAME    INSTALLED  FIXED-IN  TYPE    VULNERABILITY        SEVERITY
PyYAML  3.13       5.4       python  GHSA-8q59-q68h-6hv4  Critical
pip     20.0.2     21.1      python  GHSA-5xp3-jfq3-5q8x  Medium
1 error occurred:
	* discovered vulnerabilities at or above the severity threshold

However, a vulnerability remains in PyYAML==3.13 (not directly mentioned in requirements.txt); maybe this can be fixed by bumping kubernetes further, beyond 10.1.0, if PyYAML is pulled in as one of its dependencies.

Could you provide a vulnerability-free image, at least for high and critical ones? Thanks!

add options to avoid queue and just wait for next schedule time

Hi, is it possible to add this kind of behaviour?

To give you some context: we had issues on Azure, which has a limit of 200 snapshots per file share. Something was broken and the count climbed above 700, so I had to remove them manually.

But after I cleaned everything up, all the missed snapshots were enqueued for execution, which just doesn't make sense: I don't need 20 daily snapshots all for the SAME day. If one is missed, just wait for the next round.

As per the Kubernetes documentation linked below, it seems we need two values, concurrencyPolicy and startingDeadlineSeconds:

Set the following property to Forbid in the CronJob YAML (this is already done in the chart template; I just checked):

.spec.concurrencyPolicy
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#concurrency-policy

spec.concurrencyPolicy: Forbid will hold off starting a second job if there is still an old one running. However that job will be queued to start immediately after the old job finishes.

To skip running a new job entirely and instead wait until the next scheduled time, set .spec.startingDeadlineSeconds to be smaller than the cronjob interval (but larger than the max expected startup time of the job).

If you're running a job every 30 minutes and know the job will never take more than one minute to start, set .spec.startingDeadlineSeconds: 60
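Putting the two fields together, the relevant part of the CronJob spec would look roughly like this (a sketch; the schedule, name, and image tag are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduled-volume-snapshotter
spec:
  schedule: "*/30 * * * *"       # illustrative: every 30 minutes
  concurrencyPolicy: Forbid      # never start a second job while one is running
  startingDeadlineSeconds: 60    # a run that misses its slot by >60s is skipped, not queued
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshotter
              image: ryaneorth/scheduled-volume-snapshotter:0.12.3  # illustrative tag
```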

thanks

Conflict on multiple scheduled snapshots of same PVC

There is a conflict if you attempt to schedule multiple snapshots of the same PVC (e.g. with different frequencies, retention intervals or deletion policies). The issue occurs because the name of the VolumeSnapshot is based on the name of the PVC and current time as UNIX timestamp. Because both initially trigger at the same UNIX timestamp, there is a conflict.

In my opinion the correct solution is to base snapshot names on the name of the ScheduledVolumeSnapshot plus a UNIX timestamp (or an ISO 8601-encoded time in UTC, e.g. 2022-03-04T14:56+00:00).
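That naming scheme can be sketched as follows (a hypothetical helper, not the project's code):

```python
import time

def snapshot_name(scheduled_snapshot_name: str, now: float) -> str:
    # Hypothetical helper: derive the VolumeSnapshot name from the
    # ScheduledVolumeSnapshot name (not the PVC name) plus a UNIX
    # timestamp, so two schedules for the same PVC can never collide.
    return f"{scheduled_snapshot_name}-{int(now)}"

# Two schedules for the same PVC, triggered at the same instant,
# still get distinct names:
t = time.time()
print(snapshot_name("my-frequent-snapshot", t))
print(snapshot_name("my-infrequent-snapshot", t))
```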

Example:

apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: my-frequent-snapshot
  namespace: my-namespace
spec:
  persistentVolumeClaimName: my-pvc
  snapshotClassName: my-snapshotclass
  snapshotFrequency: 1h
  snapshotLabels:
    backup: frequent
  snapshotRetention: 7d
---
apiVersion: k8s.ryanorth.io/v1beta1
kind: ScheduledVolumeSnapshot
metadata:
  name: my-infrequent-snapshot
  namespace: my-namespace
spec:
  persistentVolumeClaimName: my-pvc
  snapshotClassName: my-snapshotclass
  snapshotFrequency: 7d
  snapshotLabels:
    backup: infrequent
  snapshotRetention: 90d

Results in the following error:

INFO:root:Creating snapshot my-pvc-1646405113 in namespace my-namespace
INFO:root:Creating snapshot my-pvc-1646405113 in namespace my-namespace
ERROR:root:Unable to create volume snapshot my-pvc-1646405113 in namespace my-namespace
Traceback (most recent call last):
  File "snapshotter.py", line 115, in create_new_snapshot
    volume_snapshot_body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '86aa6dd3-cf2c-49fd-9254-54b6cb95c11b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'd8dd147f-7043-420b-9e24-d2d2c7964694', 'X-Kubernetes-Pf-Prioritylevel-Uid': '38ae6d53-8107-49f0-b303-e28321b88c05', 'Date': 'Fri, 04 Mar 2022 14:45:14 GMT', 'Content-Length': '320'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"volumesnapshots.snapshot.storage.k8s.io \"my-pvc-1646405113\" already exists","reason":"AlreadyExists","details":{"name":"my-pvc-1646405113","group":"snapshot.storage.k8s.io","kind":"volumesnapshots"},"code":409}

difference between pvc size and snapshot size

Hi, I'm having some issues using the latest version (0.12.2, updated yesterday from 0.10.4; I just saw you released 0.12.3 in the last few hours).

I have some PVCs of 10 GB in both the lab and qa namespaces:
[screenshot: PVC list showing 10 GB capacities]

But as you can see, the VolumeSnapshots of the lab PVCs (in green, on top) have a restoreSize of 100 GB, which is wrong. The qa namespace ones seem correct, though, and the snapshotter is the same for both environments.

[screenshot: VolumeSnapshot restoreSize values]

If you need any other info, just ask.

Best regards, and thanks.
