
Comments (13)

zetaab avatar zetaab commented on June 4, 2024

It seems that the API NLB ports 443 and 3988 (using dns=none) are now unhealthy, and the cluster contains 2 instances that are trying to come up, but the error is:

Jan 24 17:29:01 i-001318f220a7e654f nodeup[1816]: W0124 17:29:01.529726    1816 main.go:133] got error running nodeup (will retry in 30s): failed to get node config from server: Post "https://10.124.0.222:3988/bootstrap": context deadline exceeded (Client.Timeout exceeded while awaiting headers); Post "https://10.124.6.165:3988/bootstrap": context deadline exceeded (Client.Timeout exceeded while awaiting headers); Post "https://10.124.9.236:3988/bootstrap": context deadline exceeded (Client.Timeout exceeded while awaiting headers

Basically this means that this bug is quite critical: it will break cluster scalability if people execute update --yes.

(Two screenshots attached, taken 2024-01-24 at 20:23 and 21:39.)
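To confirm the unhealthy targets from the CLI (rather than the console screenshots above), something like the following should work; the target-group ARN is a placeholder to be taken from the first command's output:

aws elbv2 describe-target-groups --query 'TargetGroups[].{Name:TargetGroupName,Arn:TargetGroupArn}'
aws elbv2 describe-target-health --target-group-arn <target-group-arn>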

The problem might be that we are assuming it is possible to modify old NLBs that do not have security groups? It is not possible to add security groups to existing NLBs that were created without them.

So what happened? When --yes was executed in one cluster, it deleted the existing security group rules. However, it could not add the new rules -> API NLB unhealthy (missing security group rules) -> scaling does not work.
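As a stopgap until kops can reconcile the rules again, the deleted rules can presumably be re-added by hand so the NLB health checks on 443 and 3988 pass; the security group ID and CIDR below are purely illustrative:

# the control-plane / api-elb security group ID and VPC CIDR are hypothetical examples
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 10.124.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 3988 --cidr 10.124.0.0/16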

from kops.

zetaab avatar zetaab commented on June 4, 2024

I tried to recreate the NLB, and now the whole cluster does not work.

It looks like ALL nodes and control-plane instances need to be rotated after that to get anything working.

from kops.

justinsb avatar justinsb commented on June 4, 2024

Sorry about this - this shouldn't be the case. The theory is that I introduced it with the ForAPIServer changes, so I'm taking a look.

from kops.

justinsb avatar justinsb commented on June 4, 2024

I think the issue might be #15993

The error is "InvalidConfigurationRequest: You cannot set security groups on a Network Load Balancer which was created without any security groups.", and indeed in 1.28 (and earlier) we did not create NLBs with security groups; this was introduced in the above PR, which is only in 1.29.
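To check whether a given cluster will hit this, one way is to look at whether the existing API NLB reports any security groups (the load balancer name is a placeholder; NLBs created by 1.28 and earlier should show an empty result):

aws elbv2 describe-load-balancers --names <api-nlb-name> --query 'LoadBalancers[].SecurityGroups'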

from kops.

justinsb avatar justinsb commented on June 4, 2024

We do have upgrade tests that cover this upgrade e.g. https://testgrid.k8s.io/sig-cluster-lifecycle-kops#kops-aws-upgrade-k127-ko128-to-k128-kolatest-many-addons , but it isn't using load balancers; another good reason to switch to dns=none in our tests!

from kops.

justinsb avatar justinsb commented on June 4, 2024

So the best workaround I've found so far is to delete the load balancer; the next kops update can then recreate it. The problem is that with dns=none (at least) the IP address changes, which then requires a forced rolling update of the nodes. With dns=none, the kubeconfig also changes, because the load balancer name is randomized by AWS.

I haven't found a way to keep the load balancer IP / name yet.
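For the kubeconfig part at least, regenerating it after the endpoint changes is just the usual export, e.g.:

kops export kubeconfig ${CLUSTER_NAME} --admin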

from kops.

justinsb avatar justinsb commented on June 4, 2024

Thinking on this, I think the upgrade experience is fundamentally going to be bad (thank you for flagging @zetaab, and sorry that you hit it!). Rather than talking many people through the process, I think we should do something to make it a smooth process.

Two ideas:

  • We revert the security group change on the NLB entirely. I don't love this; it feels like a big change.
  • We can look at creating a second NLB during the upgrade. We would likely have to move the security group rule cleanup to a new phase after the rolling update.

The second option is trickier, but I'm going to kick the tires on what it looks like.

from kops.

zetaab avatar zetaab commented on June 4, 2024

the process could look like:

  1. create an entirely new API NLB with security group rules, and wait until all control planes and nodes are using it (how to verify, or just wait for the next kops version?)
  2. clean up the old security group rules & the old API NLB

But the thing is that this is going to change the Kubernetes API DNS address in the case of dns=none, and also if someone is still using kops with gossip. We updated something like 20 CI pipelines because of this. It was not huge, but still something to keep in mind.
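For reference, the endpoint those CI pipelines point at is just the server field of the exported kubeconfig, which can be read with e.g.:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'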

This is not an easy problem to solve, or it is if we just accept downtime for everyone. Though AWS could have done this better, for example by making it possible to keep the DNS name the same (or by offering a button for "yes, I want to recreate my NLB, which causes a short downtime, but I want it to support security groups")...

from kops.

justinsb avatar justinsb commented on June 4, 2024

That is the broad approach I ended up on also! I uploaded a WIP/hack that implements it: #16291 . It's big, but not that big, and I'm trying to extract out the refactorings that I think are good ideas regardless (and revert some things I did that are irrelevant!) so we can whittle it down to something easier to analyze.

We do still end up needing to distribute a new kubeconfig, but I don't think there is downtime, because the old NLB should still work until we run the (proposed) kops update cluster --cleanup-after-upgrade --yes command. The idea is that the new flow would be kops update cluster / kops rolling-update cluster / kops update cluster --cleanup-after-upgrade, and we now have somewhere to do those "cleanup" steps.

Here the cleanup involves deleting the old NLB (which is what breaks the kubeconfig with dns=none / gossip), deleting the old TargetGroup (we can't have two NLBs pointing at the same TargetGroup), cleaning up the SecurityGroupRules that were allowing access from the old NLB, and detaching the old TargetGroups from the AutoScalingGroup.
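To sketch the proposed flow end to end (the --cleanup-after-upgrade flag only exists in the WIP PR and may well change before merging):

kops update cluster ${CLUSTER_NAME} --yes                               # creates the new NLB alongside the old one
kops rolling-update cluster ${CLUSTER_NAME} --yes                       # moves instances over to the new NLB
kops update cluster ${CLUSTER_NAME} --cleanup-after-upgrade --yes       # removes the old NLB, TargetGroup and rules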

I think this is potentially a powerful technique, but does introduce another step for users (cc @hakman )

from kops.

zetaab avatar zetaab commented on June 4, 2024

Hmm, could kops rolling-update cluster trigger cleanup-after-upgrade automatically, without the user really knowing about it? I am thinking that rolling-update knows when the upgrade is finished; it could check whether all instances have migrated to the new format. If they have, it could automatically trigger cleanup-after-upgrade?

Actually, we do have pre- and post-migration steps in how we manage kOps clusters ourselves. But something similar could exist in kOps itself?

from kops.

justinsb avatar justinsb commented on June 4, 2024

Hmm, could kops rolling-update cluster trigger cleanup-after-upgrade automatically, without the user really knowing about it? I am thinking that rolling-update knows when the upgrade is finished; it could check whether all instances have migrated to the new format. If they have, it could automatically trigger cleanup-after-upgrade?

It certainly could. The pattern we have today is that kops update tells you that you need to run kops rolling-update, and we could have that pattern also.

Actually, we do have pre- and post-migration steps in how we manage kOps clusters ourselves. But something similar could exist in kOps itself?

Right, I think this is a good opportunity, in that a post-upgrade cleanup can allow for bigger / safer changes. And it's reasonable to want to delay cleanup until we've verified that the cluster (or even the workloads) are actually working. I think that to plug into your own logic, you would want this to be runnable separately. That said, I do agree that as we add steps it becomes less user-friendly, and we may want a user-friendly "easy mode wrapper" that runs update / rolling-update / post-update.

from kops.

justinsb avatar justinsb commented on June 4, 2024

It came up in office hours that this might not hit the default configuration (cc @hakman). I think you're right, for gossip clusters on AWS created with 1.28 at least, although users do still have to delete the existing NLB manually. We did hit some other problems, so I'm going to verify with a few more cases...

I created a gossip cluster:

export CLUSTER_NAME=foo.k8s.local
unset KOPS_BASE_URL

kops-1.28.3 create cluster ${CLUSTER_NAME} --zones us-east-2a --ssh-public-key ~/.ssh/id_rsa.pub
kops-1.28.3 update cluster ${CLUSTER_NAME} --yes --admin
kops-1.28.3 validate cluster ${CLUSTER_NAME} --wait=10m

Then with latest I did see the NLB change in kops update:

  NetworkLoadBalancer/api.foo.k8s.local
        SecurityGroups           <nil> -> [name:api-elb.foo.k8s.local id:sg-047dec0e24d2ed5be]

That change is the one that requires a new NLB.

SSHing to one of the nodes though, I can see they have the internal IP address of the control plane. They got this address over gossip, AFAICT.

kops update cluster --yes --admin gives the expected error:

W0203 10:38:07.147773  203956 executor.go:141] error running task "NetworkLoadBalancer/api.foo.k8s.local" (9m58s remaining to succeed): Error updating security groups on Load Balancer: InvalidConfigurationRequest: You cannot set security groups on a Network Load Balancer which was created without any security groups.

We can delete the NLB with:

ARN=$(go run ./cmd/kops toolbox dump -ojson | jq -r '.resources[] | select(.type=="load-balancer") | .raw.LoadBalancerArn') 
echo "ARN=${ARN}"
export AWS_DEFAULT_REGION=us-east-2
aws elbv2 delete-load-balancer --load-balancer-arn ${ARN}
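Deletion can take a short while to complete, so before re-running kops update it's worth checking that the load balancer is actually gone, e.g.:

aws elbv2 describe-load-balancers --query 'LoadBalancers[].LoadBalancerName'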

I can see that the cluster is trying to set the ApiserverAdditionalIPs in the nodeup config; it gets a little confused because that address is not known yet, but I don't think that's a real problem.

If we run go run ./cmd/kops validate cluster ${CLUSTER_NAME} --wait=10m, we need to wait a few minutes for the NLB to actually start serving, but it does validate at this point. (I had to delete an aws-node-termination-handler replicaset, which will be an issue at some point, but it was caused by switching from a release build to a dev build, I believe, and I guess we don't include the image sha256 for dev builds?)

The node still has the internal IP address of the control-plane VM, at this point.

So now I kick off the rolling-update:

go run ./cmd/kops rolling-update cluster ${CLUSTER_NAME}
go run ./cmd/kops rolling-update cluster ${CLUSTER_NAME} --yes

We're able to terminate and replace the control-plane VM, but the next problem is that cilium now fails to start on the node:

level=fatal msg="failed to start: daemon creation failed: unable to initialize kube-proxy replacement options: Invalid value for --kube-proxy-replacement: true" subsys=daemon

Using the ConfigMap with an OnDelete daemonset seems wrong, and I think the kube-proxy-replacement flag is indeed new in cilium 1.14. Not sure how I didn't hit this one before, but I deleted the crash-looping cilium pod to unblock the rolling-update.
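For reference, unblocking looked roughly like deleting the crash-looping pod so the DaemonSet recreates it (the pod name is a placeholder; this assumes kops's usual k8s-app=cilium label in kube-system):

kubectl -n kube-system get pods -l k8s-app=cilium
kubectl -n kube-system delete pod cilium-xxxxx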

And at this point the rolling-update did complete. So not terrible - particularly if we fix the cilium issue which I think we need to do anyway (hoping to figure out why that didn't hit me previously)!

from kops.

justinsb avatar justinsb commented on June 4, 2024

BTW the TLDR for 1.28 foo.k8s.local (gossip) with default public topology is that the normal process works if the user deletes the NLB, modulo two bugs(?) that have nothing to do with the NLB (aws-node-termination-handler and cilium)

I tried the same test with 1.28 foo.k8s.local (gossip) with private topology and got the same results.

I do think creating a second load balancer is a good direction for the project, but we may be able to get away without it for 1.29. That said, the behaviour isn't great (deleting the NLB), so we might still consider it. I'm going to continue testing the big scenarios while also trying to get some of the pre-work refactoring done (that will enable creating a second NLB), along with looking at the two bugs - I think at least the cilium one is real.

from kops.
