backube / volsync

Asynchronous data replication for Kubernetes volumes

Home Page: https://volsync.readthedocs.io

License: GNU Affero General Public License v3.0

Languages: Go 92.87%, Shell 3.84%, Roff 2.37%, Makefile 0.43%, Dockerfile 0.18%, Jinja 0.18%, Python 0.06%, Smarty 0.04%, Ruby 0.01%, GAP 0.01%
Topics: kubernetes, storage, kubernetes-operator, mirroring, csi, data-replication, data-protection, persistent-volume, disaster-recovery, data-migration

volsync's Issues

Make kubectl plugin `go install`-able

Describe the feature you'd like to have.
It should be possible to install the cli w/ a go install 1-liner

Current status:

$ go version
go version go1.16.4 linux/amd64
$ go install github.com/backube/volsync/cmd/volsync@main
go: downloading github.com/backube/volsync v0.3.1-0.20210805200137-216ac45a8ac7
$ which volsync
~/.gvm/pkgsets/go1.16.4/global/bin/volsync

...but the exe needs to be named kubectl-volsync for it to function as a plugin

I think this is just a matter of changing the directory and package name to make this work.
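For illustration, a sketch of the desired one-liner assuming the command directory were renamed (the cmd/kubectl-volsync path is hypothetical):

$ go install github.com/backube/volsync/cmd/kubectl-volsync@main
$ kubectl volsync --help   # kubectl discovers plugins by the kubectl- name prefix

Since go install names the binary after the last path element, the rename alone should be sufficient.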

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

Return code ignored in Rclone & Rsync mover

Describe the bug
The main rclone mover script attempts to capture and print the return code of Rclone. There are a couple of problems:

  • With the addition of backube/scribe#104, extra commands have been inserted before rc is captured from $?
  • The script (correctly IMO) runs with set -e, meaning it will terminate on any non-zero command.
    This means it's not possible for the final status line to ever say anything other than rc=0

The error handling of this script should be re-evaluated:

  • Should we do something other than bail on command failure?
    • If so, failing commands need to be guarded (e.g., if cmd; then ... or cmd || rc=$?) to permit catching and processing the failure (see the sketch below this list)
    • If not, rc should be removed so as to not mislead
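If we keep set -e and want the final status line to stay truthful, a minimal sketch of catching the failure (the command line is illustrative, not the script's actual invocation):

#!/bin/bash
set -e -o pipefail

# ...setup commands still fail fast under set -e...

# Capture rclone's exit status without tripping set -e:
rc=0
rclone sync "${SOURCE}" "${DEST}" || rc=$?
echo "Rclone completed with rc=${rc}"
exit "${rc}"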

Steps to reproduce

Expected behavior

Actual results

Additional context
I think this is fairly low-priority as I'm not aware of any current negative effects. I think this is more for clarity & cleanup.

The script ./bin/external-rsync-source should accept a port flag

Describe the bug
To work with a local kind cluster, the script bin/external-rsync-source should accept a port number as an ssh option. Right now, it does not accept one.

Steps to reproduce

  1. Create a kind cluster
  2. Install a replication destination with Type: ClusterIP
  3. Port-forward to the destination pod
  4. Run ./bin/external-rsync-source

Expected behavior
The script should accept a port number so that the ssh connection can be established over the forwarded port.

Actual results
Fails to establish the ssh connection due to a permission error.

Better feedback on problems

Describe the feature you'd like to have.
It would be good if the operator provided more actionable feedback in the CR status so that it's easier to troubleshoot when things aren't working as expected.

What is the value to the end user? (why is it a priority?)
End users would be able to more quickly diagnose what's not working and why so that they can get their environment up and running more quickly.

How will we know we have a good solution? (acceptance criteria)

  • The .status.conditions section of the Source and Destination CRs should provide information about where in the reconcile sequence the operator is.
  • Anything else?

Additional context
In #66, the operator wasn't able to complete the sync iteration because the snapshot couldn't be created. Unfortunately, there was no error (or status information) generated to indicate that this was the problem. Due to the async nature of kube, it's going to be hard to tell whether something is broken/misconfigured or just taking a while, but by having the operator expose more status information, it can at least narrow down where to look when things aren't progressing as quickly as expected.

Brainstorming a bit:
We have the "Synchronizing" condition:

const (
	ConditionSynchronizing     string = "Synchronizing"
	SynchronizingReasonSync    string = "SyncInProgress"
	SynchronizingReasonSched   string = "WaitingForSchedule"
	SynchronizingReasonManual  string = "WaitingForManual"
	SynchronizingReasonCleanup string = "CleaningUp"
)

While the expressiveness of "reason" is fairly limited (and I don't know that I want to expand the number of reason codes), the free form messages could be more descriptive:

  • "Creating keys"
  • "Creating Snapshot"
  • "Waiting for Job to complete"
  • etc.
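A hedged sketch of how the free-form message could be set while keeping the existing reason codes (setSyncCondition is a hypothetical helper; apimeta is k8s.io/apimachinery/pkg/api/meta):

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setSyncCondition updates the Synchronizing condition, reusing the
// limited Reason vocabulary while carrying step-specific detail in Message.
func setSyncCondition(conditions *[]metav1.Condition, message string) {
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:    ConditionSynchronizing,
		Status:  metav1.ConditionTrue,
		Reason:  SynchronizingReasonSync,
		Message: message, // e.g., "Waiting for Job to complete"
	})
}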

corev1 imported more than once

Describe the bug

In some files, corev1 is imported more than once, under both the corev1 and v1 aliases.
For example, these two lines in scribe/replicationsource_test.go:

import ( 
	// ...
	corev1 "k8s.io/api/core/v1"
	v1 "k8s.io/api/core/v1"
	// ....
)
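The fix is to keep a single alias (conventionally corev1) and update any v1 references in the file accordingly:

import (
	// ...
	corev1 "k8s.io/api/core/v1"
	// ...
)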

Steps to reproduce

Expected behavior

Actual results

Additional context

Benchmarking

Describe the feature you'd like to have.
We should find a way to characterize the performance of the data movers.

What is the value to the end user? (why is it a priority?)
Potential (and current) users would like to have some idea what performance to expect when they choose to replicate a volume with Scribe.

How will we know we have a good solution? (acceptance criteria)
We should be able to provide some sort of guidance on:

  • What size volumes work well (or don't)?
  • How does the volume's file count affect performance?
  • What data change rates are ok?
  • Are certain data types problematic (big files, small files, binary, compressed, etc.)?
  • For a given data set, change rate, and network bandwidth, what is a reasonable schedule (frequency to use)?
  • Are there certain environments that just don't work very well?

Additional context
While there are too many variables to provide definitive answers to the above, we can probably characterize the performance such that for a proposed workload we can say: 🔴 🟡 🟢
... and what to watch out for.

Support for private image registry

Describe the feature you'd like to have.
It should be possible to deploy VolSync using the helm chart when the container images (operator & movers) reside in a private registry.

What is the value to the end user? (why is it a priority?)
Users may want to host the VolSync images in private image repositories. Currently, to support this, a cluster-wide pull secret must be added (requiring admin privs). It should be possible to use the Helm chart's imagePullSecrets field to specify a secret to use.

How will we know we have a good solution? (acceptance criteria)

  • Users can deploy (and use) VolSync from a private registry using the Chart, without changing cluster-wide settings

Additional context
The existing imagePullSecrets field will be used for the operator container, but not for the movers. This requires:

  • Adding a CLI option to the operator to specify a pull secret to be used for movers
  • Operator behavior:
    • Copy the pull secret into the namespace that the mover will use
    • Add it to the mover pod

One concern I have with this solution is that we will be exposing the pull secret to users of the operator since the mover runs in the user's namespace.

Also, there are 2 ways we could specify the pull secret for the mover: directly in the podspec or via the ServiceAccount that will run the mover. Is there a reason to prefer one approach over the other?
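To illustrate the two placements (the secret and ServiceAccount names are hypothetical):

# Option 1: reference the pull secret directly in the mover Job's pod template
spec:
  template:
    spec:
      imagePullSecrets:
        - name: mover-pull-secret

# Option 2: attach it to the ServiceAccount that runs the mover
apiVersion: v1
kind: ServiceAccount
metadata:
  name: volsync-mover
imagePullSecrets:
  - name: mover-pull-secret

The effect is the same either way, since kubelet merges the ServiceAccount's pull secrets into the pod, so the choice is mostly about which object the operator already manages.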

Scribe on Operator Hub

Describe the feature you'd like to have.
I would like to be able to install Scribe from OperatorHub.

What is the value to the end user? (why is it a priority?)
Ease of deployment

How will we know we have a good solution? (acceptance criteria)

Additional context

cli should be usable w/o a config file

Describe the feature you'd like to have.
It should be possible to use the cli w/o having a config file.

Currently:

$ kubectl volsync help
error: Config File "config.yaml" Not Found in "[/home/jstrunk/.volsyncconfig]"

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • A missing config file should be treated as empty
  • It should be possible to provide an alternate location for the config file (this already exists: --config <filename>)
  • It should be possible to use the cli in full w/o a config file (everything via cli parameters)

Additional context
Related to #22, #10

Consider stunnel instead of ssh for rsync

Describe the feature you'd like to have.
Consider replacing sshd with stunnel in the rsync mover.

What is the value to the end user? (why is it a priority?)
In order to use sshd on OpenShift, we need a custom SCC that allows extra caps beyond just running as anyuid. If we get rid of sshd, we may be able to just use the standard anyuid SCC within OCP.

How will we know we have a good solution? (acceptance criteria)

Additional context

  • This may complicate setups for syncing data in/out of kube environments since external systems may not have stunnel (but probably do have ssh)
  • https://github.com/konveyor/pvc-migrate may provide a useful example for rsync over stunnel

Restic unit test failure

Describe the bug
There appears to be a timing bug in the restic unit tests
Likely similar to #41

Steps to reproduce

Expected behavior

Actual results

Restic as a destination when used as destination when a destination volume is supplied [It] is used directly 
/home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:727

  Unexpected error:
      <*errors.StatusError | 0xc000553400>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "PersistentVolumeClaim \"dest\" not found",
              Reason: "NotFound",
              Details: {
                  Name: "dest",
                  Group: "",
                  Kind: "PersistentVolumeClaim",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 404,
          },
      }
      PersistentVolumeClaim "dest" not found
  occurred

  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:729
------------------------------
[BeforeEach] Restic as a destination
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:644
[BeforeEach] when a destination volume is supplied
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:707
[JustBeforeEach] Restic as a destination
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:666
[JustBeforeEach] when used as destination
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:674
[It] is used directly
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:727
[AfterEach] Restic as a destination
  /home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:669

•••••••••••••••••••••••••

Summarizing 1 Failure:

[Fail] Restic as a destination when used as destination when a destination volume is supplied [It] is used directly 
/home/runner/work/volsync/volsync/controllers/mover/restic/restic_test.go:729

Additional context

Build rclone and restic binaries

Describe the feature you'd like to have.
Today, we download the released binaries for rclone and restic to include in the mover containers. This presents an auditing problem for the downstream build & release process.
We should clone and build the sources ourselves as a part of the Docker build of the mover containers.

What is the value to the end user? (why is it a priority?)

  • This ensures reproducible builds
  • Makes it easier to manage upstream/downstream releases and keeps them better in sync
  • Makes it easier to build for multiple architectures

How will we know we have a good solution? (acceptance criteria)

  • The upstream source is cloned & built on each container build
  • The build is locked to a specific release by tag and git hash
  • Version information is visible in the Dockerfile such that a simple diff of the file can determine if/when the version changed (i.e., the version shouldn't be a buildarg)
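A hedged sketch of what this could look like for rclone in a mover Dockerfile (the tag and commit hash are placeholders, not real pins):

FROM golang:1.16 AS rclone-builder
# The version is spelled out literally (not a build-arg), so a plain
# diff of the Dockerfile shows when it changes.
RUN git clone --depth 1 --branch v1.56.0 https://github.com/rclone/rclone.git /rclone
WORKDIR /rclone
# Fail the build if the tag doesn't resolve to the expected commit.
RUN test "$(git rev-parse HEAD)" = "<expected-commit-sha>" && \
    go build -o /usr/local/bin/rclone .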

Additional context
Related to open-cluster-management/backlog#14604

CLI - `migration create`

Describe the feature you'd like to have.
kubectl migration create <args>
This sub-command should prepare a destination to receive incoming transfers from an external source.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • Saves relationship info
  • Creates the Namespace if it doesn't exist
  • Creates a PVC in the Namespace if it doesn't exist
    • Using the specified capacity, accessModes, StorageClass, and name
  • Creates an associated ReplicationDestination

Additional context

Operator / Build-Operator action fails on restic test

Describe the bug

During PR tests, operator build failed on test Restic as a source when used as source Source volume is handled properly when CopyMethod is Clone [It] the source is NOT used as the data PVC
Test fails due to reason: PersistentVolumeClaim "s" not found

Steps to reproduce

Not sure if this is a one-off scenario, but I've never encountered this error locally.

Expected behavior

Test should pass.

Actual results

Test errors.

Additional context

Error output given:

  Unexpected error:
      <*errors.StatusError | 0xc000437ea0>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "PersistentVolumeClaim \"s\" not found",
              Reason: "NotFound",
              Details: {
                  Name: "s",
                  Group: "",
                  Kind: "PersistentVolumeClaim",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 404,
          },
      }
      PersistentVolumeClaim "s" not found
  occurred

  /home/runner/work/scribe/scribe/controllers/mover/restic/restic_test.go:446

Identify the amount of resources various replication types use.

Describe the feature you'd like to have.
Now that we have metrics, we should identify and potentially put limits in place for the various movers.

What is the value to the end user? (why is it a priority?)
This would ensure that the movers don't consume too many resources.

How will we know we have a good solution? (acceptance criteria)
Valid numbers that give us a default or rough idea of the resources each mover uses.

Additional context
We could just generally throw a large file in a pvc and then run the various methods.

Consider adding schema for Helm chart values

Describe the feature you'd like to have.
Add a values.schema.json to the Helm chart to ensure the provided values conform to the expected schema

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context
Description here: https://helm.sh/docs/topics/charts/

Allow helm install to use sha hashes of container images

Describe the feature you'd like to have.
Currently, the helm chart allows specifying the container image and a tag name. It should also be possible to specify a sha hash instead of a named tag.

What is the value to the end user? (why is it a priority?)
We need to use the hashes in openshift ci builds, and currently it has to be hacked around.

How will we know we have a good solution? (acceptance criteria)

Additional context
cc: @sallyom

OpenShift e2e testing

Describe the feature you'd like to have.
OpenShift specific e2e testing.

What is the value to the end user? (why is it a priority?)
In the event that OCP behaves differently, it would be good to catch problems early.

How will we know we have a good solution? (acceptance criteria)
testing is created

Additional context

Add trigger methods for synchronization

Currently, the only method for TriggerSpec is a cronspec string. A webhook endpoint trigger and/or a kubeapi trigger would provide more flexibility and enable on-demand synchronization.
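For illustration, a sketch of what the kubeapi-style trigger could look like (the manual field is a proposal, not existing API):

spec:
  trigger:
    schedule: "*/5 * * * *"    # today: cronspec only

# proposed:
spec:
  trigger:
    manual: sync-2021-08-01    # changing the string via the kube API triggers a sync

A webhook trigger would likely be an endpoint exposed by the operator rather than a spec field.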

CLI - `migration rsync`

Describe the feature you'd like to have.
kubectl migration rsync <args>
Syncs data into an existing migration relationship using rsync over ssh

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • Wait for and retrieve the ssh keys
  • Wait for and retrieve the remote address
  • Invoke rsync using the default args to sync data
  • Make the "shutdown" call to signal completion
  • Delete the local copy of the ssh keys

Additional context

FIPS support

Describe the feature you'd like to have.
VolSync should be FIPS 140 compliant

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • VolSync operator and associated containers should only use FIPS 140 approved algorithms
  • When running on a FIPS enabled host, VolSync should use the proper, approved crypto libraries
  • The above items should be automatically verified in CI

Additional context

cli options should be more clear

The current CLI options are confusing to anyone trying to develop on the CLI.
The source options that are direct members of SetupReplicationOptions should be wrapped in another structure (for example, SourceOptions), just like DestinationOptions, so that the distinction between source and destination is clearer.

Rclone unit test failure in CI

Describe the bug
One of the unit tests failed in CI.

Steps to reproduce
This is probably a timing issue w/ no good reproducer

Expected behavior
Tests should pass

Actual results

Failure [0.282 seconds]
ReplicationDestination [rclone] when ReplicationDestination is provided with a minimal rclone spec when ReplicationDestination is provided AccessModes & Capacity when Job Status is set to succeeded when Using a CopyMethod of None when A Storage Class is specified [It] Is used in the destination PVC 
/home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:218

  Expected success, but got an error:
      <*errors.StatusError | 0xc000907ea0>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Job.batch \"volsync-rclone-src-instance\" not found",
              Reason: "NotFound",
              Details: {
                  Name: "volsync-rclone-src-instance",
                  Group: "batch",
                  Kind: "Job",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 404,
          },
      }
      Job.batch "volsync-rclone-src-instance" not found

  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:220
------------------------------
[BeforeEach] ReplicationDestination [rclone]
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:36
[BeforeEach] when ReplicationDestination is provided with a minimal rclone spec
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:104
[BeforeEach] when ReplicationDestination is provided AccessModes & Capacity
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:123
[BeforeEach] when Using a CopyMethod of None
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:143
[BeforeEach] when A Storage Class is specified
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:215
[JustBeforeEach] ReplicationDestination [rclone]
  /home/runner/work/volsync/volsync/controllers/rclone_mover_test.go:90

Additional context

I haven't looked into it, but I expect this is due to a delay in the object being visible via Get() directly after Create().

Evaluate syncthing as a mover

Describe the feature you'd like to have.
We should see if there is any benefit to adding Syncthing as an additional data mover.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context
I don't currently have an opinion on this (or know anything about syncthing). I'm just recording it so it doesn't get lost.

RFE: support deployment in podman

Describe the feature you'd like to have.

Replicating data between podman instances and between podman and kubernetes might help to provide a holistic approach for data in a hybrid container deployment environment.

What is the value to the end user? (why is it a priority?)

Some Use Cases:

  1. As someone who is managing containers in podman and considering migrating to kubernetes, data movement can be a pretty strong barrier to entry. A simple solution would make migration simpler.

  2. As someone who is managing containers in a mix of podman and kubernetes because of resource constraints or other considerations, managed data replication would result in less hand-built/maintained software managing the environment.

How will we know we have a good solution? (acceptance criteria)

Users can deploy containers in both podman and kubernetes and "share" data.

Switch condition manipulation to use apimachinery

Describe the feature you'd like to have.

In 1.19, apimachinery added metav1.Condition and helper functions. We should switch to using those and remove the dependency on operator-sdk's operatorlib. However, this needs to wait until controller-runtime is moved to 1.19.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

CLI: extend function to complete pvc-copy leading to workload-copy

With scribe plugin, create a complete pvc-copy command that:

  1. Creates a ReplicationDestination
  2. Creates a ReplicationSource
  3. Syncs the ssh secret
  4. Creates a PVC in the destination from the copyMethod

Then, extend that to workload-copy, where the workload can be a Deployment, StatefulSet, etc.

Race in `set-replication`

When set-replication performs the manual sync, it waits for the sync to finish on the source (trigger status == spec), then it immediately attempts to retrieve the latestImage from the ReplicationDestination. There is no guarantee that latestImage has been updated yet, even though we know the sync has completed from the PoV of the source.

This results in the following error being triggered:
https://github.com/backube/scribe/blob/a9c0d0c9d4084c3aeaa61f771c19b8d736acd8a5/pkg/cmd/set_replication.go#L197

This may be a bit tricky to solve since the mere presence of a latestImage is insufficient (it may be from a previous sync). We somehow need to poll and ensure what we find is generated after the completion of the sync. Perhaps:

  1. Check destination and remember what we find
  2. Do the manual sync on the source
  3. Poll the destination until it's different than what we saw in step 1 (or timeout)
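A minimal sketch of step 3 (waitForNewImage and getLatestImageName are hypothetical helpers):

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForNewImage polls the ReplicationDestination until latestImage
// differs from what was observed before the manual sync was started.
func waitForNewImage(previous string) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		current, err := getLatestImageName() // hypothetical accessor for .status.latestImage
		if err != nil {
			return false, err
		}
		return current != previous, nil
	})
}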

Fix installation of golangci-lint

Describe the feature you'd like to have.
The method we use to install golangci-lint is not supported (though it does currently work). We should use a better method.

From CI:

golangci/golangci-lint info checking GitHub for tag 'v1.43.0'
golangci/golangci-lint info found version: 1.43.0 for v1.43.0/linux/amd64
golangci/golangci-lint info installed /home/runner/work/volsync/volsync/bin/golangci-lint
golangci/golangci-lint err this script is deprecated, please do not use it anymore. check https://github.com/goreleaser/godownloader/issues/207
/home/runner/work/volsync/volsync/bin/golangci-lint run ./...

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

Investigate sparse & large file handling

Describe the feature you'd like to have.
Investigate how the various movers handle both sparse and large files.

What is the value to the end user? (why is it a priority?)
VM images, in particular, can be rather large. We should characterize what performance we expect (and optimize where we can). These images may also be sparse, and detecting/preserving that can also be a big benefit.

How will we know we have a good solution? (acceptance criteria)

  • Understand what performance we should expect w/ large files.
    • Are small changes in the file handled efficiently?
    • Are there any optimizations that would make it better?
  • Understand how sparse files are handled.
    • Is sparseness preserved?

Additional context
Rsync & sparse: https://gergap.wordpress.com/2013/08/10/rsync-and-sparse-files/
Related to backube/scribe#122
Prompted by @mykaul

When allocating a new volume based on another, use the correct capacity

When generating a clone of an existing PVC, we currently use the requested size:

corev1.ResourceStorage: *src.Spec.Resources.Requests.Storage(),

corev1.ResourceStorage: *original.Spec.Resources.Requests.Storage(),

corev1.ResourceStorage: *h.srcPVC.Spec.Resources.Requests.Storage(),

corev1.ResourceStorage: *h.srcPVC.Spec.Resources.Requests.Storage(),

We should be using the actual size (.status.capacity). These two could differ in the case of volume resize.
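A minimal sketch of the change at each of these call sites (variable names follow the snippets above):

// Before: the requested size, which can lag after a volume resize
corev1.ResourceStorage: *src.Spec.Resources.Requests.Storage(),

// After: the actual provisioned size from the PVC's status
corev1.ResourceStorage: *src.Status.Capacity.Storage(),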

cli: set-replication doesn't obey `source-name`

Describe the bug
start-replication allows creating a custom-named ReplicationSource via the source-name parameter. set-replication should also honor this parameter, since it's commonly used as part of a start/set sequence.

Steps to reproduce
Config file:

source-kube-context: kind-kind
source-name: wiki-src
source-namespace: wiki
source-copy-method: Clone
source-pvc: dokuwiki

dest-kube-context: admin
dest-name: wiki-dest
dest-namespace: wiki-dest
dest-service-type: LoadBalancer
dest-capacity: 2Gi
dest-access-mode: ReadWriteOnce
dest-copy-method: Snapshot
$ kubectl volsync start-replication
...
$ kubectl volsync set-replication
I0813 10:27:22.687209  523925 set_replication.go:144] Fetching ReplicationSource wiki-source in namespace wiki
Error from server (NotFound): replicationsources.volsync.backube "wiki-source" not found

$ kubectl --context kind-kind -n wiki get replicationsource
NAME       SOURCE     LAST SYNC              DURATION        NEXT SYNC
wiki-src   dokuwiki   2021-08-13T14:30:19Z   19.081158175s   2021-08-13T14:35:00Z

Expected behavior

Actual results
set-replication tried to use "wiki-source" instead of "wiki-src"

Additional context

Demo: rsync fail-back

Describe the feature you'd like to have.
Our demos thus far showing failover have gone 1-way. We should have something that shows how to re-sync and re-activate the original primary.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

  • This will require a second source/destination for the reverse direction, but if these are created independently, they will cause a full re-sync from the destination back to the source even though the majority of the data is already on the source cluster.
  • We should be able to seed the re-sync w/ the "original" primary data to minimize copying.
  • I'm not sure how much code will be needed here, but it's non-zero.

Rename CopyMethod: None

Describe the feature you'd like to have.
When describing how data should be replicated, we have 3 methods defined: Snapshot, Clone, and None. The "none" method was originally meant to signify that neither a clone nor a snapshot would be used, so there would be no point-in-time image.

I propose to change None to Direct. That is to say, we are copying data directly to/from the PVC.

What is the value to the end user? (why is it a priority?)
Hopefully this will make the API a bit clearer

How will we know we have a good solution? (acceptance criteria)

Additional context
We don't really need to remove None. We could just add Direct that has the same functionality and update all the docs None -> Direct.
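For example (the sourcePVC name is illustrative):

spec:
  sourcePVC: mydata
  rsync:
    copyMethod: Direct   # proposed spelling; today this would be None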

Pod affinity for RWO w/ copyMethod:None

Describe the feature you'd like to have.
When ReplicationSource is used on a RWO volume w/ copyMethod: None, the source mover needs to mount the live volume while the application is still using it. Kubernetes does not (natively) understand to co-schedule the mover on the node w/ the application, so it gets scheduled elsewhere and fails to start due to unmounted volumes.

We should detect this combination of options (RWO + None) and add the proper pod affinity to the mover so that it gets scheduled on the same node as the application. The process would look something like:

  • Notice RWO + None
  • Search for the pod that mounts the source PVC
  • Add a required pod affinity to the mover that targets the app pod (matching on its labels; see the sketch below this list)
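A hedged sketch of the affinity the operator could inject into the mover pod, assuming the app pod found in step 2 carries an app: my-app label:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname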

What is the value to the end user? (why is it a priority?)
Currently, the combination of RWO + None won't work due to kubernetes scheduling limitations

How will we know we have a good solution? (acceptance criteria)

  • RWO + None will be usable

Additional context
There's also a need to ensure this affinity remains correct. The pod could be deleted and recreated w/ a different name by the RS or Deployment. We'd need to detect and update our rules on the mover Job/pod.

Refactor reconcile loop

Describe the feature you'd like to have.
The reconcile process needs to be refactored.

  • Reduce code duplication (there are replication_methods * 2 copies of some stuff)
  • Adjust the state machine to ensure proper cleanup in the presence of failed API requests

What is the value to the end user? (why is it a priority?)
This will cut the maintenance overhead and LoC of the operator significantly

How will we know we have a good solution? (acceptance criteria)

  • Common operations between source and dest will be combined
  • Reconcile will be robust to failed api calls and stale data

Additional context
The manifestation of the reconcile issues is that we sometimes leave a mover container around after a sync completes, and we then miss the next cycle, but get back on track for the 3rd.

The restructuring itself needs some discussion prior to actual coding.

Print mover container images at startup

Describe the feature you'd like to have.
When the operator starts, it should print the container image names/tags that will be used for the movers. Currently, this works for rsync & rclone, but the image path for restic (which uses the new mover interface) isn't printed.

Enhance the Mover interface such that the container image name gets printed to the log when a mover is added to the catalog.

What is the value to the end user? (why is it a priority?)
It should be easy to identify exactly which container images are used to help troubleshooting.

How will we know we have a good solution? (acceptance criteria)

  • The full container image name + tag are printed for all movers
  • There exists a standard way of doing this using the Mover interface

Additional context

volsync cmds should retrieve the parameters from volsync-config.yaml if not provided as part of cmd

Describe the bug
Though volsync-config.yaml is present in the /root/.volsyncconfig dir, the volsync cmds ignore the config file and ask for the parameters on the command line. IMO, if the config file is present, the values should be retrieved from there, and any parameter provided on the command line should override the corresponding value from the config file.

Steps to reproduce
Execute any volsync cmd without cmd line args
ex:
# kubectl volsync start-replication
Expected behavior
Things should work if the minimum required args are present in the config file.

Actual results
Shows the help text instead of using the default config file.

Add more developer docs

Create developer docs for the kind and OpenShift use cases, e.g., pointing to the hack directory and the useful scripts it contains for developers.

CLI - E2E test of the `migrate` sub-command

Describe the feature you'd like to have.
Add a kuttl-based e2e that uses the CLI to migrate some test data into a PVC

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • Uses the CLI to create a PVC in the test cluster
  • Syncs in some known data
  • Removes the relationship
  • Verifies resources got cleaned up
  • Starts a Job that verifies the PVC contents are as-expected

Additional context
Blocked by:

`start-replication` should read the ssh Secret name from `.status`

Describe the bug
Currently, start-replication assumes (correctly) that it knows the name of the auto-generated ssh Secret, and tries to retrieve it
https://github.com/backube/scribe/blob/a9c0d0c9d4084c3aeaa61f771c19b8d736acd8a5/pkg/cmd/create_replication.go#L271

Sometimes, the CLI tries to retrieve the Secret before it is available from the API server, resulting in an error.

Instead, it should poll for the name of the Secret in the .status field of the ReplicationDestination. By taking this approach:

  1. it will be robust to naming changes
  2. it will not race with the operator that is creating the secret
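For reference, the destination publishes the Secret name in its status once the keys exist; a sketch of the relevant field (the secret name is illustrative, and the layout assumes the rsync mover's status):

status:
  rsync:
    sshKeys: volsync-rsync-dest-src-database-destination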

Steps to reproduce
This can be triggered intermittently via the kuttl e2e

Expected behavior

Actual results

Additional context

CLI - `migration delete`

Describe the feature you'd like to have.
kubectl migration delete <args>
Cleans up a migration relationship

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

  • Removes the ReplicationDestination that is associated w/ the relationship (if it exists in-cluster)
  • Deletes the relationship config file

Additional context

Restic: restore from old backups

Describe the feature you'd like to have.
Currently, the restic mover only supports restoring the latest backup even though we have retention policies that keep multiple older backups.

Something like:

spec:
  restic:
    restoreAsOf: "2021-07-14T12:56:27Z"
    previous: 2

Both fields would be optional, and if not specified would default to "the current time" and 0, respectively.

The "as-of" time refers to the backup that is the most recent as of that time. For example, if the as-of time is yesterday at noon, it would refer to the most recent backup still in the repo that is older than yesterday at noon.

The "previous" field can be used to reference older backups in sequence. For example, previous: 0 would be the most recent backup, and previous: 1 would be the backup prior to the most recent.

What is the value to the end user? (why is it a priority?)
Today, users can only restore from the most recent backup using Scribe. If they want to access an older backup, they must directly access the restic repo via the restic cli, and they are on their own for getting that data back into a PVC.

How will we know we have a good solution? (acceptance criteria)

Additional context
Restic normally requires specifying the "snapshot id" of the backup in order to restore it. The above proposal is designed to avoid having to deal directly with the IDs which would require some sort of interface to list them and have the user then specify the desired one.

The scribe cli could also be useful here to give access to the restic repo by getting the credentials from the cluster and invoking restic on the user's behalf.

Issue with destination pod on Openshift 4.7

Describe the bug
Simple setup following the documentation. I see a destination pod is created and listening.
I create a source custom resource based on the ReplicationSource CRD and I can see the source pod starts replicating:
[amatev@ocp-installer ~]$ oc logs -f volsync-rsync-src-woof-7rhpk
VolSync rsync container version: v0.3.0
Syncing data to 172.32.156.187:22 ...

Number of files: 1 (dir: 1)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 0
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 42
Total bytes received: 19

sent 42 bytes received 19 bytes 122.00 bytes/sec
total size is 0 speedup is 0.00
Rsync completed in 0s
Synchronization completed successfully. Notifying destination...
Initiating shutdown. Exit code: 0

When it finishes, the destination pod stops with state "Completed 0/1". Destination pod log:
VolSync rsync container version: v0.3.0
Waiting for connection...
Exiting... Exit code: 0

And then, after a new source pod is created in 5 minutes (as per the Trigger in the config), it can't connect, as there is no destination pod running.

Steps to reproduce

  1. Create destination RSync config following: https://volsync.readthedocs.io/en/latest/usage/rsync/index.html
  2. On the source, create a Secret with destination.pub, source, and source.pub from the destination machine's secret
  3. Create a new Source CR

Expected behavior

It is expected that the destination pod (created by the Job) keeps running so it can handle subsequent rsync connection requests.

Actual results

Destination pod stops.

Additional context

CLI - need a review of defaulted option values

There are many API fields that have default values, and these are exposed as CLI options. A few of them have been given defaults in the CLI:

destination pvc capacity: "2Gi"
source pvc capacity: "2Gi"
source-cron-spec: "*/3 * * * *"

Should others be defaulted? Should these not have defaults?

Broken links in README

Describe the bug
The following links in the main README are broken:

  • single cluster cross namespace example
  • multiple cluster example
