What
As a cluster admin, I want to onboard and eventually terminate tenants in our cluster by creating and later deleting a root namespace for each tenant. The onboarding process works like a charm, but I am not too happy about the termination process, which we had an opportunity to try out this week. Let me explain:
We use a GitOps process where tenants are onboarded via a pull request to our "tenants" project. In this project, a tenant is defined by a few simple resources:
- An Accurate root namespace with some templated resources defined in our tenant-template Accurate template namespace.
- A Flux gitops-reconciler service account granted admin permissions in the tenant root namespace. Both the SA and admin role binding are configured with propagation to sub-namespaces. This allows the tenant to use Flux to provision most resources using a GitOps process in their namespace tree.
- Flux GitRepository and Kustomization resources pointing to a Git project controlled by the tenant - allowing the tenant to bootstrap their resources.
So far, so good. But this week we received a request for a tenant termination. Using a modern GitOps tool like Flux, with pruning enabled, we thought it would be as simple as reverting the onboarding PR in Git. So we did that, after getting the PR approved by the tenant responsible. What we forgot to do was check whether any sub-namespaces were defined under the tenant root namespace. After merging the tenant termination PR, Flux immediately reported an error: the delete was (correctly) blocked by the Accurate namespace webhook:
delete failed, errors: Namespace/blnc delete failed: admission webhook "namespace.accurate.cybozu.io" denied the request: child namespaces exist;
kustomization/flux-tenants.flux-tenants
But this error was reported only once, which surprised me, as Flux usually keeps reconciling until the actual state equals the desired state. And as I suspected, the tenant namespace was still present, including the child namespaces that we had not thought about. However, the Flux controller resource no longer had any knowledge of the resources it used to control, which left the tenant root namespace (including its children) orphaned in our cluster. This is something we are trying hard to avoid.
After cleaning up manually, I reached out to the Flux maintainers on Slack, and you can read all the details in the CNCF Slack thread - if you are interested. TL;DR: Flux maintainers think this is an issue with Accurate, and I tend to agree now.
To fix this for future tenant terminations in our clusters, I suggest adding an opt-in setting that lets us configure the Accurate namespace webhook to allow cascading namespace DELETE requests.
How
I have some ideas after looking at the code, but please let me know what you think first! I will update this description once we agree on a solution. I'll be happy to submit a PR fixing this.
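As a starting point for discussion, here is a minimal sketch of the admission decision such an opt-in could lead to. It is not taken from the Accurate code base: `WebhookConfig`, `AllowCascadingDeletion`, `validateDelete` and the child namespace names are hypothetical names used only for illustration. What should happen to the child namespaces once the request is admitted is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// WebhookConfig is a hypothetical configuration struct for the namespace
// webhook; Accurate's real configuration may look different.
type WebhookConfig struct {
	// AllowCascadingDeletion is the proposed opt-in (assumed name).
	// The zero value keeps today's behavior: deny when children exist.
	AllowCascadingDeletion bool
}

// errChildrenExist mirrors the denial message quoted in this issue.
var errChildrenExist = errors.New("child namespaces exist")

// validateDelete decides whether a DELETE request for a namespace is
// admitted, given the names of its child namespaces.
func validateDelete(cfg WebhookConfig, children []string) error {
	if len(children) == 0 {
		return nil // nothing depends on this namespace
	}
	if cfg.AllowCascadingDeletion {
		return nil // opt-in: admit the request despite existing children
	}
	return errChildrenExist
}

func main() {
	// Hypothetical child namespaces under the tenant root namespace.
	children := []string{"blnc-sub1", "blnc-sub2"}

	fmt.Println(validateDelete(WebhookConfig{}, children))
	fmt.Println(validateDelete(WebhookConfig{AllowCascadingDeletion: true}, children))
}
```

Making the field default to `false` would keep the current protection for everyone who does not opt in, which I think matters since the webhook did exactly what it was designed to do in our case.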
Checklist