open-policy-agent / cert-controller
License: Apache License 2.0
We are using the cert-controller library to bootstrap the webhook server in our cluster. We'd like to reduce the certificate validity duration to comply with our security policies.
I see there is already an open issue for this. I can raise a pull request that adds an option to configure the duration, falling back to the default 10-year validity.
https://github.com/open-policy-agent/cert-controller/blob/master/pkg/rotator/rotator.go#L587-L593
It would be less overhead to do a GET on the Secret by name than to watch the full list of Secrets.
The Gatekeeper External Data Provider, like the other supported Webhook types, requires the caBundle
for the server.
We made this change in HNC's version of cert-controller and it reduced the initial startup time from >100s to about 10s. See kubernetes-retired/multi-tenancy@b070055.
I'd like to make the same change here: if the --cert-restart-on-secret-refresh flag is set (the name is negotiable), cert-controller will call os.Exit() when it updates a secret. This should only happen after initial installation or on rotation (every 10 years by default).
@maxsmythe , @ritazh , any thoughts on this?
I see that the master branch has been updated.
Current solution:
go get github.com/open-policy-agent/cert-controller@master
go.mod:
require github.com/open-policy-agent/cert-controller v0.1.1-0.20210308205344-203624759536
This is blocked on #27
We should have the ability for one centralized process to manage the key rotation, so that it can be done in a gradual manner.
The alternative would be to have some sort of leader election to figure out which pod is managing key rotation.
We may want to support both models, given that different consumers may have different hosting schemes and availability requirements.
When the cert controller is added to a non-leader manager, i.e., with CertRotator.RequireLeaderElection set to false, it fails with the following error message:
{"level":"error","ts":"2023-12-12T18:04:44.726776367Z","caller":"controller/controller.go:203","msg":"Could not wait for Cache to sync","controller":"cert-rotator","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
I'm looking at using this awesome library in my admission webhook after a long search.
I'm curious whether the library has any built-in mechanism to coordinate first-time cert provisioning or renewals when the webhook itself is deployed as a ReplicaSet with more than one replica. Otherwise the replicas race each other and either end up with different certs or hit write-write conflicts on the webhook configuration's caBundle field.
Or is this concern inherently not valid (maybe because Secrets eventually propagate and processes restart etc)?
This will prevent the old key from being marked as invalid before the new key is fully distributed.
With the addition of #45, the cert-controller can be set to run only in the leader, making the leader responsible for certificate injection and management.
But how can we send the same signal to the followers?
With the current implementation, their ready channel will never be signaled.
Add config options for:
We should put off doing this until more frequent cert rotations are safe with respect to availability. The work for doing so is listed in the "Allow Setting Cert Validity Duration" milestone.
We are using gatekeeper with the automatic certificate management provided by cert-controller. We'd like to be able to configure the duration for certificate validity (and likely lookahead interval) to align with our internal policies.
If you are open to this change, I am happy to create a PR for it.
I'm using cert-controller in one of my projects to bootstrap a mutating webhook. I've configured the rotator using the example provided in the docs. Interestingly, in most CI runs and in local testing, I'm seeing a delay before the certs are available in the mount: it takes up to 1m30s in a few instances before the certs are ready in the mount path. The delay could be because the Kubernetes Secret update is delayed and the mount republish is missed on the first attempt.
Is this a known behavior? Is that why there is a RestartOnSecretRefresh property in the struct?
github.com/open-policy-agent/cert-controller v0.2.0
k8s.io/kubernetes v1.21.2
sigs.k8s.io/controller-runtime v0.9.2
Usage:
// Make sure certs are generated and valid if cert rotation is enabled.
setupFinished := make(chan struct{})
if !disableCertRotation {
entryLog.Info("setting up cert rotation")
if err := rotator.AddRotator(mgr, &rotator.CertRotator{
SecretKey: types.NamespacedName{
Namespace: util.GetNamespace(),
Name: secretName,
},
CertDir: webhookCertDir,
CAName: caName,
CAOrganization: caOrganization,
DNSName: dnsName,
IsReady: setupFinished,
Webhooks: webhooks,
}); err != nil {
entryLog.Error(err, "unable to set up cert rotation")
os.Exit(1)
}
} else {
close(setupFinished)
}
I use a different flag framework to parse flags, and would like this to be a configuration option you pass to the rotator when starting up. However this might be a breaking change for other users. How do you feel about changing this?
Based on my experimentation, it seems that the kubelet's latency to reflect the updates on a watched Secret (configMapAndSecretChangeDetectionStrategy=Watch) to a container's filesystem seems to be ranging from 30-100 seconds (i.e. not instant), regardless of minikube, kind, GKE or kubeadm clusters.
Does this basically mean that until the container running the webhook (and automating certificate management with the cert-controller package) picks up the new files, the webhook will actually be down? This library updates the WebhookConfiguration's .caBundle field with the new CA cert (which takes effect instantly), so for another minute or so it will no longer match the TLS certificate the server is serving.
Is this a known issue, or is it factored into the current design and already solved (maybe I'm seeing it incorrectly)?
See
cert-controller/pkg/rotator/rotator.go
Line 203 in d025255
Using kustomize or operators could result in a webhook config with no webhooks; this shouldn't be an error.
Different organizations may have different views of the best setting. What's a good one-size-fits-all value?
We should discuss this once all the other work in this milestone is complete.
Having certs valid for 10 years seems sketchy, and we want to test that rotation works by setting the validity to 5 minutes.
Give users an idea of how/when certs are generated and when they are rotated.
Also document the impact of generation/rotation (e.g. pod restarts)
If you use rotator.AddRotator
to build a rotator.ReconcileWH
and add it to the controller manager, it uses a context.Background()
which is never cancelled. This means the Watch added to the controller manager is never terminated, causing the controller manager to wait its entire Options.GracefulShutdownTimeout
(default: 30s) before exiting after SIGTERM/SIGINT.
https://github.com/open-policy-agent/cert-controller/blob/master/pkg/rotator/rotator.go#L110
Can rotator.AddRotator
be made to accept a context and pass it through so the controller can exit quickly & gracefully?
Or am I using it wrong?
I have a special scenario in which g8r (gatekeeper) is deployed out of cluster, and I configure the --kubeconfig option in controller-runtime so that g8r watches the cluster I care about.
In this case, cert-controller generates the CA and updates the Secret, which lives in the remote cluster. However, the local files, such as tls.crt in certDir, are not updated. So, because of the certFile check below, the webhook will not start.
cert-controller/pkg/rotator/rotator.go
Lines 700 to 722 in 54af894
I wonder if checking the tls.crt in the Secret would be better; after all, the caBundle injected into the webhook is based on the Secret, not the certFile. Alternatively, we could add sync logic so that when the cert in the Secret differs from the local file, the local file is updated. I think the latter is better.
Isn't it a problem that the ValidatingWebhookConfiguration and the Secret are updated independently of each other? I think this can lead to a condition where the CA is already renewed but the ValidatingWebhookConfiguration still has the old CA, so calls to the webhook would fail.
I haven't actually run into problems with this; I only looked at the code and thought it might become one. Or am I missing something?
Following up on #44, it appears that 4842e47 added RestartOnSecretRefresh, which restarts the process (os.Exit(0)) every time refreshCerts() is called to update the Secret.
That said, Kubernetes typically takes up to ~1 minute to deliver the updated Secret to the kubelet (easily reproducible on minikube, kind, or a GKE cluster) with default kubelet configurations.
Since delivery of the updated Secret to the Pod is not instant (or even quick), what makes the os.Exit(0) useful if the kubelet will still serve the old Secret after the restart?
cc: @stijndehaes
/cc @maxsmythe
/cc @ritazh
Hey Max + Rita, I see that we've recently gotten rid of the pre-v1 APIs we were using to access CRDs (I'd forgotten that we weren't on v1 yet). Does it make sense to cut a v0.3.0 release so there's a stable tag for this repo that supports K8s 1.22+?
Thanks!
Keeping different Secrets for different pods would allow gradual rotation, which mitigates the downsides of an instantaneous rotation.
In KEDA we are integrating this solution as a default (not production-hardened) option for cert management. Our problem is that we have three different components (the admission webhooks, the operator, and the metrics server), and all of them should share the certs because we also want to use them for some internal communication.
Using cert-manager or another solution, we can create a single certificate with multiple DNS names and share the same Secret, securing all internal communication as well as the webhooks and API services. But for new adopters, requiring a third party could be a problem, which is why we want to introduce this project for certificate management in non-production environments; not having the option to set multiple DNS names blocks us.
I'll open a PR with the changes in case you think this is useful