foundationdb / fdb-kubernetes-operator

240 stars · 240 watchers · 83 forks · 35.28 MB

A kubernetes operator for FoundationDB

License: Apache License 2.0

Dockerfile 0.12% Makefile 0.42% Go 98.39% Shell 0.86% Python 0.15% Smarty 0.06%

fdb-kubernetes-operator's People

Contributors

09harsh, akshaychitneni, ammolitor, annigerishambu, bktsh, bnamasivayam, brownleej, dependabot[bot], funkypenguin, gm42, imperatorx, jkylling, jleach4, johscheuer, joshuamcmanus, manfontan, maxmcd, mengzhbj, nicmorales9, ovaistariq, pierrez, rbtcollins, renxuanw, sbodagala, sears, simenl, vishesh, viveksinghggits, xumengpanda, yao-xiao-github


fdb-kubernetes-operator's Issues

Using an annotation to determine which version of the config map a pod has

In the Update Config Map part of the reconciliation, we are checking every instance to see if it has the latest contents for the config map, even if the config map has not changed. I think that once we confirm that a pod has the latest config map contents, we should set an annotation on the pod with a hash of the config map, and if we detect it has a matching config map we can skip the check. This should reduce unnecessary traffic between the operator and the pods.

Options for disabling client compatibility check when upgrading

We have a check that all clients are compatible with the new version when doing an upgrade. I think this is a good default, but there may be some cases where someone doesn't want to use it, or where it's reporting false positives, and there's no way around it with the current implementation. I think we should add an option in the cluster spec for disabling this check. We could make it a list of IP subnets to ignore, which would allow skipping the check for specific clients or skipping the check entirely.

New strategy for generating pod names and instance IDs

We are currently generating instance IDs using an incrementing counter, which is stored in the cluster spec. This means that when we do replacements or delete stateless pods, we are moving to a higher set of instance IDs. I have an idea for an alternative strategy that will keep the pod names bounded to a smaller set and remove the need to track instance IDs in the spec.

I propose that we make the instance IDs take the form ${process_class}-${num}. The corresponding pod name would be ${cluster_name}-${process_class}-${num}. When adding new pods, we would first look for any PVCs that match the cluster and process class and do not have a matching pod, and we would try to re-use them. If we have no more re-usable PVCs, we would look for the smallest number num greater than 0 that does not have a corresponding pod for this process class, and would create a pod for that number.

When we replace instances by adding them to the removal list in the spec, this strategy will create a gap in the instance IDs. This means that unlike with StatefulSets, there is not a direct correspondence between the highest pod name and the number of instances. I think this will not be a problem for us, and it's necessary to preserve the ability to do live replacements. The next time we need to expand or replace instances, however, we will fill in these gaps, so there is a bound on how high the instance IDs will get based on the size of the cluster.

This strategy should be resilient against stale results from the cache in the operator. If we see a gap in instance IDs where it doesn't exist, then our attempt to create the pod will fail. If we fail to see a gap in instance IDs where it does exist, then a subsequent operation will eventually see the gap and fill it.

Another benefit of this naming convention is that it is clear from the pod name what role a pod is filling.

Setting up a PRB job

I think it would be good to have a PRB job that runs the unit tests for the operator, so that we can confirm the tests are passing when reviewing PRs.

Adding a custom user to our docker images

It's often considered a security best practice to run as a non-root user in your containers. We should define a user in our images to make this easy. I'd recommend creating a user called foundationdb with UID 4059, and creating a group with the same name and ID.

This applies to both the operator image and the sidecar image.

Breaking reconciliation into concrete subreconcilers

We have a pattern that is implicit in the code where we have subparts of the reconciliation handled by separate methods. These methods tend to return either an error, or an error and a boolean. This pattern produces code bloat and repetition when we're invoking the reconcilers, and it leads to a huge file of reconciliation code. I think we should introduce a formalization of this pattern, and break separate parts of reconciliation into separate files. Ideally we would also find a way to have dedicated tests for each subreconciler.
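One possible formalization of the pattern is shown below. The names and the signature are illustrative rather than the operator's actual API, but they capture the idea: each stage lives in its own file with its own tests, and a single runner replaces the repetitive invocation code:

```go
package main

import "fmt"

// subReconciler is one stage of reconciliation. It reports whether
// reconciliation can continue, plus any error it hit.
type subReconciler struct {
	name string
	run  func() (canContinue bool, err error)
}

// runReconciliation invokes each stage in order, stopping at the first stage
// that fails or asks for a requeue. It returns the name of the stage it
// stopped at, or "" if every stage completed.
func runReconciliation(stages []subReconciler) (string, error) {
	for _, s := range stages {
		canContinue, err := s.run()
		if err != nil {
			return s.name, fmt.Errorf("stage %s failed: %w", s.name, err)
		}
		if !canContinue {
			return s.name, nil // requeue and resume from the top later
		}
	}
	return "", nil
}

func main() {
	stages := []subReconciler{
		{"updateStatus", func() (bool, error) { return true, nil }},
		{"updateConfigMap", func() (bool, error) { return false, nil }}, // asks for requeue
		{"excludeInstances", func() (bool, error) { return true, nil }},
	}
	stopped, _ := runReconciliation(stages)
	fmt.Println("stopped at:", stopped)
}
```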

Include backup and restore mechanism in the operator

If we install the FoundationDB operator and create a cluster, the current implementation does not automatically enable the backup and restore mechanisms, in particular backup_agent. This makes backup and restore of FoundationDB very difficult to achieve.

I think we should have backup_agent and the other utilities that enable backup and restore enabled in the default configuration, or there should at least be a field in the FoundationDBCluster spec that lets us enable them.

Add tests for replacing pod while reusing its volume

The Kubernetes operator has support for deleting a pod and recreating it with the same volume. When I initially implemented this, I hit challenges creating unit tests for the scenario. We should revisit it and add tests for this case.

Using non-blocking excludes in the Kubernetes operator

When we do excludes in the operator, we are running the exclusion command with a short timeout, and then re-running the reconciliation until we can get the exclusion command to succeed. There are risks in relying on timeouts to determine if an operation has completed, since we can get timeouts for other reasons and this process can make it harder to debug those issues. Triggering an exclude can also restart data distribution, which can block progress on the excludes, getting us caught in a loop. Instead, I think we should be doing a non-blocking exclude for any new additions to the exclusion list, and use a separate mechanism to check if the exclusion has finished. That mechanism does not exist yet, but there's an issue for it on the FoundationDB repo: apple/foundationdb#2159

Preventing downgrades in the Kubernetes operator

FDB does not support downgrading a cluster to a previous version in the general case. We should add a check for this in the operator to prevent someone from trying to downgrade a cluster and getting it into a bad state that will be harder to recover from.

Breaking configuration changes into multiple stages

There are cases where we need to break database configuration changes into multiple stages, with a wait for the cluster to become healthy between each stage. This is necessary when making some changes to the region config, since we cannot change usable_regions and regions in the same command. We should automate this in the operator.

Supporting different sidecar versions for different FDB versions

The Kubernetes sidecar is versioned based on the FDB version, with a build number to allow us to push out independent changes to the sidecar for previous FDB versions. The operator has an option for this build number, but it’s applied regardless of the FDB version. Someone might be upgrading from 6.1.12, using sidecar 6.1.12-2, and want to go to 6.2.8, with sidecar 6.2.8-1. To support this, I think we should make the sidecar version settings specific to the FDB versions.

Webhook for rejecting invalid database config

It's possible for someone to supply a database configuration in the cluster spec that cannot be fulfilled. For instance, someone could define usable_regions to be higher than the number of regions, or use a redundancy_mode that is not supported. We should have a webhook that will reject these changes before they get persisted in Kubernetes, so that we can fail fast and avoid getting stuck in a loop where we can't complete reconciliation.

Usage and development documentation

We should add documentation on using the operator, as well as on developing on the operator. The usage documentation should include a stock YAML that someone can use to run the latest version of the operator in their environment.

Missing a mechanism to configure clients in the same cluster

Hello! We've been experimenting with this operator, and getting an FDB cluster up and running was very smooth! However, when we then went to deploy some other services alongside it that connect to the new FDB cluster, it wasn't obvious how best to wire that up. In this case, our services are just some Go apps that use the Go FDB bindings and are expecting a cluster file.

Of course, I could copy+paste the cluster config into each service, but I was hoping for maybe some invocation of the sidecar that could drop a config into my services, or even a configmap containing the dynamic config. But I couldn't find anything.

I might be able to slice off some time to open a PR if this functionality is missing, depending on what you want it to look like!

Add additional stateless processes for 6.2 and later

FDB 6.2 added new dedicated roles for data_distributor and ratekeeper. We should increase the default number of stateless processes by 2 to have dedicated processes for these roles, in order to prevent CPU contention. We should also support configuring process counts for these types in the process counts in the cluster spec.

Improve handling of missing CLI versions

I recently tried to create a cluster through the operator on 6.2, with an operator image that did not have 6.2 binaries. This led to hard-to-trace errors at the time the operator tried to configure the database. I think we should improve this in two ways:

  1. When we get an error from running a CLI command, we are casting it to ExitError unsafely, so that we can handle exit errors specially. We should make this a safe cast, and pass any other kind of error further up the chain.
  2. If we get a request to reconcile a cluster with an unsupported version, we should raise an error immediately, and make sure to include information to help resolve the error.

Improving concurrency

I did some experiments with creating clusters in parallel in different namespaces through the Kubernetes operator. It seemed like the creation of the second cluster was totally blocked on the reconciliation of the first cluster, which had some operations that were polling for config files to get updated. We should see if there are ways to improve the concurrency story here so that we don’t have activities on one cluster blocking activities on another unnecessarily.

Pinning Ubuntu version in our dockerfile

The dockerfile currently bases the image on ubuntu:latest. This can introduce inconsistencies in the build. We should pin to a specific image version instead.

Protections against bouncing processes repeatedly

The Kubernetes operator bounces the processes under its control when it updates the start commands for the processes. There may be situations where it does multiple bounces close together, especially when someone is deploying across Kubernetes clusters and has multiple instances of the operator working on the same FDB cluster. We should try to add protections against this, like refusing to bounce when there has been a recovery recently.

Use fault domains to determine which processes to shrink

When we do shrinks through the operator, we are choosing processes to remove based on the instance ID. I think it would be better to choose the processes such that we optimize the distribution across fault domains after the removal, with the instance ID as a tie-breaker.

Pushing cluster file to pods after changing coordinators

When we change coordinators in the Kubernetes Operator, we don’t do the full work to push it out to the pods. This generally shouldn’t be necessary, because any connected process will update its own local copy and ignore the version the operator provides. The operator will eventually notice the discrepancy, though, and will try to reconcile it. Rather than this blocking an unrelated operation down the line, we should push out these changes after changing coordinators.

Provide up-to-date docker image

Hi!

I have started playing with this operator and got stuck on this error: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'

 {"level":"error","ts":1578666059.9345756,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"foundationdbcluster-controller","request":"default/sample-cluster","error":"quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'","stacktrace":"github.com/foundationdb/fdb-kubernetes-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/foundationdb/fdb-kubernetes-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/foundationdb/fdb-kubernetes-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/foundationdb/fdb-kubernetes-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

The error happens when I use the latest sample-cluster.yml with the current latest image on Docker Hub, which is this one:

latest
Last updated 4 days ago by foundationdb
digest:sha256:78dd2d9ed5a34af2ffd2ef7c16a4a890ef6090d73dc477d8f79cfd3562c8c029

Pushing my own Docker image based on latest commit did fix the issue.

Could you set up some CI to push an image for each commit to Docker Hub?

Thanks.

Provide a way to install the operator in non default namespaces

I think that as of now, the operator can only be installed in the default namespace, which is what happens if we follow the README. We should have a way to install the cluster in non-default namespaces as well.

Even if we deploy the operator in the default namespace and try to deploy local_cluster.yaml in another namespace, the FDB cluster pods don't get into the running state, because there are some secrets that the FDB cluster requires to get the pods running.

Upgrading to Kubebuilder 2

We're on an old version of Kubebuilder, and there are substantial changes in Kubebuilder 2 to the way some parts of the project are built. We should look into upgrading to Kubebuilder 2 to avoid getting stuck on an unsupported toolchain.

Providing an instance ID in the locality parameters

There are various places in FDB, and in some of our internal tooling, where we use IP/port combinations as a persistent identifier for processes. This is going to have to change on Kubernetes, because the IPs are not persistent when pods are recreated. To address this, I think we should add an instance ID in the locality fields. There's an open issue in FDB to allow exclusions based on locality fields: apple/foundationdb#1750. I think we should add locality fields for a unique instance ID to get us ready for this feature.

The instance IDs will need to be unique across the entire cluster, even when the cluster is spread across multiple Kubernetes clusters with independent instances of the operator running. To support this, we can add a field in the cluster spec for the instance ID prefix, so that different instances of the operator will generate different instance IDs.

Skip check on dynamic conf when a process is pending removal

When we are replacing a process and also pushing out a change to the custom parameters, we will check the dynamic conf for all of the processes prior to removing the old processes. This could be a problem if we are replacing a process that is partitioned from the operator. The check for the conf updates could get stuck forever because it can't reach the sidecar for the process that is partitioned. I think we should skip the check for processes that are pending removal, so that we can move past the conf check for partitioned processes and get to the point where we can tear down the processes.

Making PendingRemovals a list instead of a map

The PendingRemovals field in the spec is a map that maps the names of pods to their IP addresses. During shrinks, both the name and IP are populated by the operator. For explicit replacements, the user can put the name in the spec mapped to an empty IP, and the operator will discover the IP at run-time. I think this is a little counter-intuitive, and preserving the IP address will be a little dangerous if we're in an environment where IPs can change. I think that we should replace this map with a list that contains instance IDs. We can then look up the pod name and IP address at the time we are working on them based on this instance ID. Once we have the ability to exclude by locality, we'll be able to exclude directly by instance ID, which will make this structure handy.

Checking errors from calls to Update Status

In the updateStatus method in the Kubernetes operator, we call r.Status().Update, but we discard the errors it returns. We should check for errors from this method and bubble them up as errors in reconciliation. I tried doing this earlier and it caused a bunch of problems in unit tests, so we’ll need to dig into those errors.

Adding a more generic PodTemplate field to the cluster spec

We've been gradually adding new fields to the spec to allow customizing parts of the pods we create. For instance, we recently had to add security contexts and automount options to the cluster spec. Rather than keep complicating our own spec and requiring new code every time we identify a new customization point, I think we should add a podTemplate field that will allow nearly complete customization of the pod specs. This will allow us to deprecate several fields, and hopefully remove them before we release a production-ready version of the operator.

Making sure that connection strings are consistent when multiple instances of the operator are working on a cluster

The operator will try to change coordinators when it detects an unreachable coordinator or a discrepancy between the coordinator count and the database config. If we have multiple instances of the operator working on a single cluster, e.g. in a multi-DC setup, then the different instances could independently try to change the coordinators. This could lead to inconsistent configuration between the DCs. We should find a way to synchronize this action across different instances of the operator.

Guaranteeing fault tolerance when choosing coordinators in the Kubernetes operator

The Kubernetes operator has logic for choosing coordinators when creating a new cluster or replacing instances. This logic does not use the locality information to determine which coordinators it chooses. This can lead to it recruiting coordinators that are not fault tolerant. We should factor in the locality information when choosing coordinators here. As an alternative, we could consider using coordinators auto. We’ve historically preferred to have other tooling make the decision about coordinators, but the database seems like it would have as much or even more information to inform this choice, so it seems like a shame to have to reimplement it in different tooling.

Supporting different volume claim properties for different process types

We currently have two stateful process types: log and storage. They have different storage needs, since the log processes are keeping a small amount of data that needs to be written quickly, and the storage processes are keeping a large amount of data that needs to be read randomly. Someone may want to use different volume sizes or storage classes to help optimize the performance for each process type. We should find a way to configure the volume claims differently for the different process types.

Ignore pods that were not created by the operator

If there are pods that were not created by the operator, but have labels that match the pods the operator creates, the operator should ignore them. This will prevent us from deleting pods that aren't in the operator's scope, and make it easier for us to transition from our older Kubernetes tooling to the operator.

Adding FDB_INSTANCE_ID as a default substitution in the Kubernetes sidecar

We're adding FDB_INSTANCE_ID as a variable substitution for clusters created through the operator, since it is not included by default in the sidecar. We should add this as a default substitution in the sidecar and remove the code to add it through the sidecar config for versions of FDB that support it.

Special process count rules for satellites

There’s a couple of areas where it will be helpful to be able to factor the satellite log count into the desired process count in the Kubernetes operator. If a data center is only serving as a satellite, we should ignore the normal process count rules and only provision enough processes for the satellite logs plus the fault tolerance. If a data center is serving as a satellite and a main DC, we should add the satellite log count to the normal provisioned log count.

Converting between TLS and non-TLS configurations

We should build a safe process for converting a cluster between TLS and non-TLS configurations. We can do the conversion to TLS without downtime by launching the processes with both a TLS and non-TLS listener, changing the coordinators to use the TLS ports, and then launching the processes with only the TLS listeners. Disabling TLS can work through a similar process in reverse. The risk of this process is that if clients are not configured with valid TLS configuration, then they will be cut off when we change the coordinators to use the TLS ports. I think to mitigate that risk we need a new feature in FoundationDB: apple/foundationdb#1848

Dropping support for not using the status subresource in the operator

We have an environment variable to control whether the operator is running in a Kubernetes environment with support for the status subresource. I think the status subresource feature is going to be critical for us going forward to have reliable reconciliation, because it allows us to track generations reliably. We needed this flag in the past because Docker for Mac was shipping an old version of Kubernetes, but that has been fixed. The status subresource is supported in Kubernetes 1.10+, so I think we're safe to require it.

Strategy for resizing volumes

We allow specifying the volume size in the cluster spec, but we do not have a strategy for changing it. I think that we can handle this in a clean way by doing a live replacement. When we are in the UpdatePods stage of reconciliation, we can detect any pods whose PVCs do not match the volume size or storage class in the cluster spec and add them to the pending removals list. After making this change we can requeue reconciliation, as we normally do after updating the spec. This will cause the operator to do a replacement, and the new pods it creates will pick up the new spec.

We can also have an option for doing this for all changes to the pod spec. This will allow users to make changes that require recreating pods without taking on any loss of fault tolerance, at the cost of requiring additional temporary resources and additional data movement.

Bringing new processes up at the old version when upgrading and expanding at the same time

In theory, the Kubernetes Operator allows you to submit a change that changes the FDB version and the process count at the same time. In this scenario, the operator will try to bring up the new processes at the new version, which could prevent them from connecting to the old processes. The operator should bring up the processes at the old version instead. The old version should be stored as the RunningVersion in the cluster spec.

Using the version of FDB from the main FDB container when it matches the desired version

We are currently using the version of FoundationDB supplied by the sidecar even when it matches the version in the main container. While this makes things more consistent with the mid-upgrade case, it may be out of step with people's expectations about what each container's responsibility should be. People may be more comfortable using a binary provided by the main image. I think we should only copy binaries from the sidecar when we need to, during an upgrade. After the upgrade, we will do a rolling bounce to get the cluster back on a config where the main container provides the desired binary.

Checking client compatibility during upgrades

When we do upgrades, we need to check that the clients are compatible with the new version before pushing out the new server version. We should build this logic into the operator.
