Comments (16)

andrewsykim commented on September 1, 2024

Bizarrely, we don't seem to have a canonical list of machine networks anywhere, so we would be left either using heuristics or asking the user to enter them. --node-ip is an existing mechanism that achieves this and is already well tested.

This might be the fundamental problem here: generally, the cloud provider should be able to look up the desired node address range for the internal IP, which should be configurable by the cluster admin (like the internal-network-names config you mentioned). If this is not the case, then --node-ip is the unfortunate fallback.

mdbooth commented on September 1, 2024

Bizarrely, we don't seem to have a canonical list of machine networks anywhere, so we would be left either using heuristics or asking the user to enter them. --node-ip is an existing mechanism that achieves this and is already well tested.

This might be the fundamental problem here: generally, the cloud provider should be able to look up the desired node address range for the internal IP, which should be configurable by the cluster admin (like the internal-network-names config you mentioned). If this is not the case, then --node-ip is the unfortunate fallback.

This is the can of worms I'd prefer not to open today, although it is now on the waiting list for an appointment with the can opener :) I definitely agree we should be able to do this. I feel like CAPI should be the ultimate source of this information, but even in CAPI I don't think there's any explicit concept of 'this specific interface is a machine interface, that one is not'. But the worms are escaping, put them back!

My available sources of data today are essentially anything which would be available to CAPI (OpenShift is different, but similar enough), the existing contents of cloud.conf, and of course anything I haven't thought of. I don't think that's enough to unambiguously determine which interfaces are machine interfaces in all edge cases, but I'd love to be wrong.

If I'm not wrong, is there any appetite for enabling filtering by --node-ip in CCMs?

mdbooth commented on September 1, 2024

If I'm not wrong, is there any appetite for enabling filtering by --node-ip in CCMs?

I'm also proposing to do the work, btw. My immediate problem is agreeing it's a problem that requires a solution, then coming up with an acceptable design.

andrewsykim commented on September 1, 2024

If I'm not wrong, is there any appetite for enabling filtering by --node-ip in CCMs?

By this you mean that CCM should remove all other IPs that are not --node-ip, right?

mdbooth commented on September 1, 2024

If I'm not wrong, is there any appetite for enabling filtering by --node-ip in CCMs?

By this you mean that CCM should remove all other IPs that are not --node-ip, right?

Or any other acceptable solution which would achieve the same result. Possibilities:

  • Update kubelet to [optionally] retain the same filtering behaviour [1]
  • Update cloud-provider to [optionally] do the filtering there for all cloud providers
  • Update OpenStack CCM to [optionally] do the filtering for just OpenStack CCM

I assume that at this stage the behaviour would have to be optional, or we risk surprising, for a second time, users who have already made the switch. I believe this would mean that if we did implement a solution in cloud-provider, individual cloud providers would still have to 'opt in' by adding their own config to enable it.

[1] Thinking about this some more, I don't think this can be done in kubelet; I think it has to be done by the CCM. The reason is that the filtering kubelet does above is done in the NodeAddress Setter, which is then written to NodeStatus and subsequently consumed directly from NodeStatus. We can't allow kubelet to update values owned by CCM, so any filtering like this would have to be done by the CCM.
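
To make options 2 and 3 concrete, here's a rough sketch of what the filtering itself might look like, mirroring the filtering kubelet used to do with the legacy provider. The function and package names are illustrative only; nothing like this exists in cloud-provider today, and the opt-in wiring is omitted:

```go
// Hypothetical CCM-side filtering of cloud-reported NodeAddresses by the
// kubelet-provided node IP. Illustrative only; not existing cloud-provider code.
package nodeipfilter

import (
	"fmt"
	"net"

	v1 "k8s.io/api/core/v1"
)

// filterByNodeIP keeps only the addresses matching nodeIP (Hostname entries are
// kept unconditionally), mirroring what kubelet did with the legacy cloud provider.
func filterByNodeIP(nodeIP net.IP, addrs []v1.NodeAddress) ([]v1.NodeAddress, error) {
	var out []v1.NodeAddress
	matched := false
	for _, a := range addrs {
		if a.Type == v1.NodeHostName {
			out = append(out, a)
			continue
		}
		if ip := net.ParseIP(a.Address); ip != nil && ip.Equal(nodeIP) {
			out = append(out, a)
			matched = true
		}
	}
	if !matched {
		return nil, fmt.Errorf("node IP %s not found in cloud-reported addresses", nodeIP)
	}
	return out, nil
}
```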

aojea commented on September 1, 2024

/cc @danwinship @aojea

danwinship commented on September 1, 2024

KEP-1664 was supposed to solve this, along with too many other things at the same time, and a lot of it didn't happen. This comment gives a good summary of the final state with some links to relevant earlier comments. There was no actual objection to the "replace --node-ip with a more-consistent and cooler --node-ips" idea, it just didn't get enough momentum behind it to actually happen.

But anyway, the general consensus in discussion then seemed to be that external cloud providers behave this way because they are assumed to know enough about the cluster configuration that they can always provide the right primary IP. And that seems to work well for actually-in-the-cloud clouds. Maybe less so for VMware/OpenStack...

OpenStack CCM has an internal-network-names config directive which would allow us to restrict the NodeAddresses returned by OpenStack CCM

(FWIW you don't have to actually remove any IPs. You just need to list the right one first; the other ones are mostly ignored anyway.)

specifically to those which match the given network names. I have some issues with the implementation of this directive, but despite this I think it could work for us if we were able to use it. The principal issue with it is that we don't have an existing, unambiguous, canonical source we can use to automatically populate it.

Sure, but you don't have an existing, unambiguous, canonical source you can use to populate --node-ip either, do you? It seems to me that if you can autodetect the right node IP to use, then you can autodetect the right internal-network-names value instead, and if you can provide a config option for the end user to use to override --node-ip, then you can provide a config option for the end user to use to override internal-network-names instead.

Basically, in the actual-cloud case, the cloud provider knows what IPs to use because the cloud only allows configuring things in certain ways. In the DIY pseudo-cloud case, you might do something arbitrarily weird, so the CCM can't automatically know which IP is the right one, but either you have to autodetect/override on the kubelet side, or you have to autodetect/override on the cloud config side, and it doesn't seem like either should be easier or harder than the other. But doing it on the cloud-config side would make the OpenStack cloud provider behave more like the AWS/GCP/etc cloud providers, so maybe that would be good?

mdbooth commented on September 1, 2024

KEP-1664 was supposed to solve this, along with too many other things at the same time, and a lot of it didn't happen. This comment gives a good summary of the final state with some links to relevant earlier comments. There was no actual objection to the "replace --node-ip with a more-consistent and cooler --node-ips" idea, it just didn't get enough momentum behind it to actually happen.

But anyway, the general consensus in discussion then seemed to be that external cloud providers behave this way because they are assumed to know enough about the cluster configuration that they can always provide the right primary IP. And that seems to work well for actually-in-the-cloud clouds. Maybe less so for VMware/OpenStack...

OpenStack CCM has an internal-network-names config directive which would allow us to restrict the NodeAddresses returned by OpenStack CCM

(FWIW you don't have to actually remove any IPs. You just need to list the right one first; the other ones are mostly ignored anyway.)

@bparees also suggested this, but I was wary that having 'ineligible' IPs listed as internal might end up biting us in edge cases. If we don't think this is the case then this solution would be simple to implement centrally for all CCMs, and presumably safe enough to do by default. This sounds like a good way forward.

specifically to those which match the given network names. I have some issues with the implementation of this directive, but despite this I think it could work for us if we were able to use it. The principal issue with it is that we don't have an existing, unambiguous, canonical source we can use to automatically populate it.

Sure, but you don't have an existing, unambiguous, canonical source you can use to populate --node-ip either, do you? It seems to me that if you can autodetect the right node IP to use, then you can autodetect the right internal-network-names value instead, and if you can provide a config option for the end user to use to override --node-ip, then you can provide a config option for the end user to use to override internal-network-names instead.

Unfortunately not easily. There are a couple of problems. Firstly, node-ip is determined by a (janky, but battle-hardened) runtime test executed on the host before kubelet comes up. CCM doesn't run on the host and its configuration is cluster-wide, so centralising and maintaining node-ips would be its own challenge. Another problem is that we'd still have to translate node-ips to cloud network names. So yes, the information is theoretically there today, but it's not in a form that is easy to consume centrally.

However, thanks for the prod! You prompted me to actually write down how I think this would have to work. I still don't like what I wrote down, but at least I'm in a better position to understand why :)

Basically, in the actual-cloud case, the cloud provider knows what IPs to use because the cloud only allows configuring things in certain ways. In the DIY pseudo-cloud case, you might do something arbitrarily weird, so the CCM can't automatically know which IP is the right one, but either you have to autodetect/override on the kubelet side, or you have to autodetect/override on the cloud config side, and it doesn't seem like either should be easier or harder than the other. But doing it on the cloud-config side would make the OpenStack cloud provider behave more like the AWS/GCP/etc cloud providers, so maybe that would be good?

I completely agree that the cloud provider should know this. However, I don't think cloud-provider specifically can be canonical as it's not responsible for creating infrastructure: that's the domain of CAPI (or MAPI in OpenShift), which is to me the logical place for this to live. Because network definitions in CAPI (and MAPI) are cloud-specific, I think it would have to be pushed all the way down to individual providers to implement, but that might be ok. We would then need another mechanism to move that information from the Machine to the Node object. Lots of unknowns there, but I would like to explore it properly eventually.

However, you've prompted me to think through exactly how we could do this centrally with the information we have today. This would require a new controller which watches Nodes, has access to OpenStack cloud credentials, and knows whether the internal or external cloud provider is currently in use. This mechanism would be OpenShift-specific because it makes assumptions that are not necessarily true for other clusters.

Step 1: Collate all node-ips

With legacy cloud provider we can determine node-ip centrally because we know (in OpenShift) that node-ip is set, and therefore kubelet will have filtered internal NodeAddresses on all Nodes to contain only that IP. With external cloud provider we can determine node-ip centrally because kubelet adds it as an annotation to the Node.
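
As a sketch of how step 1 could be implemented in the external cloud provider case: I believe kubelet records --node-ip in the alpha.kubernetes.io/provided-node-ip annotation, so collating the values is just a walk over the Nodes (collateNodeIPs is an illustrative name):

```go
// Step 1 sketch: collect the kubelet-declared node IP for every Node.
package nodeips

import v1 "k8s.io/api/core/v1"

// Annotation kubelet sets when started with --node-ip and an external cloud
// provider (assumed here; see AnnotationAlphaProvidedIPAddr in k8s.io/cloud-provider/api).
const providedNodeIPAnnotation = "alpha.kubernetes.io/provided-node-ip"

// collateNodeIPs returns the node-ip declared by kubelet for each Node, keyed
// by Node name.
func collateNodeIPs(nodes []v1.Node) map[string]string {
	ips := map[string]string{}
	for _, n := range nodes {
		if ip, ok := n.Annotations[providedNodeIPAnnotation]; ok && ip != "" {
			ips[n.Name] = ip
		}
	}
	return ips
}
```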

Step 2: Translate list of node-ips into a list of network names

We can already reference a Machine from the Node object (in OpenShift), so we now need to check which network attached to that Machine owns the node-ip. Unfortunately the Machine API isn't intended to service this query, so we're going to have to query the cloud directly. In fact, this is the NodeAddresses query again, except this time we need to not throw away the network name of each interface. By querying each instance individually to look for the network corresponding to its node-ip we can unambiguously obtain a list of network names. We could avoid hitting the OpenStack API for this every time, which would get prohibitively expensive, by annotating the Machine object with its internal network name and short-cutting the lookup in future.

After querying the cloud for every Node, our controller can collate a list of internal network names and update the CCM config. If the config has changed, we restart the CCM. The CCM will then re-query the cloud for each instance to fetch NodeAddresses, but this time it will filter based on our centralised list of internal-network-names.
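
For step 2, the per-instance lookup could look roughly like this, using gophercloud's networking v2 API (v1 import paths). The server ID lookup from the Machine, the Machine annotation caching, and the CCM config update are all omitted, and the function name is illustrative:

```go
// Step 2 sketch: find the name of the OpenStack network that owns a node-ip by
// listing the instance's ports. This is essentially the NodeAddresses query,
// but keeping the network name rather than discarding it.
package networkname

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/networks"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

func internalNetworkNameForNodeIP(client *gophercloud.ServiceClient, serverID, nodeIP string) (string, error) {
	// List only the ports attached to this instance.
	allPages, err := ports.List(client, ports.ListOpts{DeviceID: serverID}).AllPages()
	if err != nil {
		return "", err
	}
	allPorts, err := ports.ExtractPorts(allPages)
	if err != nil {
		return "", err
	}
	for _, p := range allPorts {
		for _, fixedIP := range p.FixedIPs {
			if fixedIP.IPAddress != nodeIP {
				continue
			}
			// The port that owns node-ip tells us which network is 'internal'.
			network, err := networks.Get(client, p.NetworkID).Extract()
			if err != nil {
				return "", err
			}
			return network.Name, nil
		}
	}
	return "", fmt.Errorf("no port on server %s owns node IP %s", serverID, nodeIP)
}
```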

Non-bootstrap worker node case

In the case where we add a new Node whose internal-network-name is already configured:

  • CCM calculates NodeAddresses correctly in the first instance -> Node comes up
  • Controller queries new Node's Machine's cloud instance for internal-network-name, and annotates Machine
  • Controller does not update CCM config because it is unchanged, and CCM is not restarted

Bootstrap worker node case

In the case where we add a new Node whose internal-network-name is not already configured:

  • CCM initially populates Node with unfiltered list of NodeAddresses -> Errors, node does not come up
  • Controller queries new Node's Machine's cloud instance for internal-network-name, and annotates Machine
  • Controller updates CCM config, and CCM is restarted
  • CCM (among other startup tasks) refreshes NodeAddresses on every Node, applying the newly discovered internal-network-name to the filter
  • New Node updated with correctly filtered NodeAddresses
  • Open Questions
    • Does it generate a new CSR? If so, when?
    • Will it restart static pods which were started with an invalid IP?
  • Hopefully new Node now comes up

Bootstrap master node case

I have never seen this configuration, but we don't prevent it as far as I'm aware. This might occur if, for example, non-local volume storage was required on the control plane. In this case, a master node with at least one non-internal network is coming up when its internal-network-name is not already configured.

  • kubelet bootstraps NodeAddresses from node-ip
  • Node comes up far enough to start CCM, which has the appropriate tolerations
  • CCM populates unfiltered NodeAddresses -> Errors, node not ready
  • Our controller needs the same tolerations as CCM to run
  • Proceeds as for bootstrap worker case

So, with the caveats above about the details of Node initialisation (and reinitialisation) that I'm not 100% clear on, this might work. It would be complex, though, with a lot of subtle interactions. To be clear, my hesitancy isn't around writing the controller: now I've written it down it's probably not that hard. My worry would primarily be around edge cases, timing, general robustness, and the volume of context required to debug/maintain it.

The other suggested approach of simply putting the annotated node-ip first in the NodeAddresses list is looking much more attractive, as long as we're confident there are no dragons hidden in the tail of invalid NodeAddresses.

mdbooth commented on September 1, 2024

Thinking about it, putting the annotated NodeIP first in the returned list of NodeAddresses seems safe enough that we should probably do it anyway, even if we still have to look for edge cases.

@andrewsykim Do you think it's worth knocking up a PR in this repo to do this unconditionally for all cloud providers?

Specifically, the behaviour would be:

  • Validate that node-ip is present in NodeAddresses, which we already do
  • Move the validated node-ip to the head of the list, but do not filter any NodeAddresses
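
A minimal sketch of that behaviour (prioritizeNodeIP is an illustrative name, not an existing cloud-provider helper):

```go
// Sketch of the proposal: validate that the kubelet-provided node IP appears in
// the cloud-reported addresses, then move the matching entries to the front
// without dropping anything else.
package nodeaddresses

import (
	"fmt"
	"net"

	v1 "k8s.io/api/core/v1"
)

func prioritizeNodeIP(nodeIP net.IP, addrs []v1.NodeAddress) ([]v1.NodeAddress, error) {
	var matching, remaining []v1.NodeAddress
	for _, a := range addrs {
		if ip := net.ParseIP(a.Address); ip != nil && ip.Equal(nodeIP) {
			matching = append(matching, a)
		} else {
			remaining = append(remaining, a)
		}
	}
	if len(matching) == 0 {
		// Same validation the cloud node controller performs today.
		return nil, fmt.Errorf("node IP %s not present in cloud-reported NodeAddresses", nodeIP)
	}
	// Keep every address; just list the node-ip entries first.
	return append(matching, remaining...), nil
}
```

The existing validation is unchanged; only the ordering of the returned slice differs.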

andrewsykim commented on September 1, 2024

Move the validated node-ip to the head of the list, but do not filter any NodeAddresses

This seems reasonable to me -- worth noting that there's some history around the merge strategy for node addresses that clobbered the ordering. I think @danwinship actually fixed this in kubelet, but I need to double check whether we also fixed it in the cloud node controller. We should verify that first.

andrewsykim commented on September 1, 2024

This is the PR kubernetes/kubernetes#79391 I was referring to.

mdbooth commented on September 1, 2024

@danwinship @andrewsykim We'll hopefully knock that patch up today.

However, based on the linked PR I do have a concern that we're going to intentionally clobber the ordering of NodeAddresses. Specifically, an existing Node whose node-ip was not previously listed first will have a new effective node-ip after this patch. A mitigating factor, though, is that there's a strong likelihood such a Node wasn't previously working anyway.

danwinship commented on September 1, 2024

(FWIW you don't have to actually remove any IPs. You just need to list the right one first; the other ones are mostly ignored anyway.)

bparees also suggested this, but I was wary that having 'ineligible' IPs listed as internal might end up biting us in edge cases.

In the past it would have (because of the randomly-reordering bug Andrew mentioned), but not these days. Eg, the AWS cloud provider returns all IPs on all AWS-created interfaces on the node, but it always returns the first IP on "eth0" first, and everything works.

CCM doesn't run on the host and its configuration is cluster-wide, so centralising and maintaining node-ips would be its own challenge.

Ah, right, forgot that part.

I completely agree that the cloud provider should know this. However, I don't think cloud-provider specifically can be canonical as it's not responsible for creating infrastructure: that's the domain of CAPI (or MAPI in OpenShift)

Yeah... I think in the case of the public clouds, the set of possible configurations is more constrained, so even though the cloud provider doesn't "own" that data, it can still figure out the right answers. (Eg, I don't think I've ever seen someone in a public cloud set up their cluster so that the primary node IPs are on an interface other than the one with the default route.)

With external cloud provider we can determine node-ip centrally because kubelet adds it as an annotation to the Node.

kubelet --node-ip is not supposed to be an input into the NodeAddresses-generating algorithm. It's an override of the NodeAddresses-generating algorithm. (In the internal CloudProvider case, the cloud provider has to generate the list of NodeAddresses without seeing the --node-ip value, and I think it's assumed that external providers would work the same way, even though it's technically possible for them not to.)

So, either the cloud provider knows everything and can generate a correctly-sorted list of NodeAddresses and doesn't need the admin to use kubelet --node-ip (eg AWS, GCP, Azure), or the cloud provider doesn't know everything, and can only generate an unsorted list of NodeAddresses for each node, and so the admin needs to use kubelet --node-ip to tell it which one it should be using (eg OpenStack, vSphere).

Move the validated node-ip to the head of the list, but do not filter any NodeAddresses

This makes sense to me.

andrewsykim commented on September 1, 2024

/triage accpeted

k8s-ci-robot commented on September 1, 2024

@andrewsykim: The label(s) triage/accpeted cannot be applied, because the repository doesn't have them.

In response to this:

/triage accpeted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andrewsykim commented on September 1, 2024

/triage accepted
