Comments (20)

2000yeshu avatar 2000yeshu commented on August 11, 2024 2

from machine-controller.

2000yeshu avatar 2000yeshu commented on August 11, 2024 1

@realjenius Can you review #1706 ?

realjenius avatar realjenius commented on August 11, 2024 1

@realjenius Can you review #1706 ?

First scan through looks really good!

I'd like to try to upgrade a canary environment in our system to this branch. I'm working on that now - we also have some general "upgrade machine-controller" stuff to work through as well so it's going to take me a little bit.

realjenius avatar realjenius commented on August 11, 2024 1

Yup! I just looked and I was toying with a similar change. Will report back shortly.

embik avatar embik commented on August 11, 2024

Since Vultr is a community provider, we very much welcome contributions to the Vultr provider to make it more robust. The approach seems sensible if tagging is unreliable, but opens up the possibility that machine-controller fails to update the status resource and reconciles the Machine again. So I wonder if this trades one potential source of "double VM" with another.

I would also like to hear if @2000yeshu as the original author of the provider has any thoughts on this matter.

realjenius avatar realjenius commented on August 11, 2024

Ahh, good point. Unfortunately, we don't even know for certain whether our theory is correct. We have only detected the symptom, and by the time it happens, our network is being disrupted because our IP routing tables hot-loop between the two machines. That puts us in more of a "high-priority fix" mode, and we haven't found any other breadcrumbs pointing to the root cause.

So our focus so far has been finding ways to make the cloud provider less likely to get into this state, so definitely open to other suggestions!

2000yeshu avatar 2000yeshu commented on August 11, 2024

The issue is definitely related to the eventual consistency of labels, but I'm not sure it is limited to the instance GET-by-labels side. Since this is a recent issue, I have a feeling that they might have started maintaining a reverse map of labels -> instance.
So one thing we could do to verify is get all instances and filter by label on our side to find the instance.
Edit: I meant tags, not labels.
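
A minimal sketch of that verification idea, assuming the full instance list has already been fetched: the `Instance` type and `filterByTag` helper are hypothetical stand-ins rather than the Vultr client's types, and the data is in-memory so no API calls are made.

```go
package main

import "fmt"

// Instance is a hypothetical, trimmed-down stand-in for a provider
// instance record; it is not the Vultr client's type.
type Instance struct {
	ID   string
	Tags []string
}

// filterByTag returns every instance in all that carries the given tag.
// The filtering happens client-side, so it does not depend on the
// provider's server-side tag index.
func filterByTag(all []Instance, tag string) []Instance {
	var matched []Instance
	for _, inst := range all {
		for _, t := range inst.Tags {
			if t == tag {
				matched = append(matched, inst)
				break
			}
		}
	}
	return matched
}

func main() {
	// Assumed result of an unfiltered "list all instances" call.
	all := []Instance{
		{ID: "a", Tags: []string{"machine-uid-1"}},
		{ID: "b", Tags: []string{"other"}},
		{ID: "c", Tags: []string{"machine-uid-1", "extra"}},
	}

	matched := filterByTag(all, "machine-uid-1")
	fmt.Println("matches:", len(matched))
	if len(matched) > 1 {
		fmt.Println("more than one instance carries the machine's tag")
	}
}
```

If the tags themselves are attached late, the unfiltered list will be missing them as well, so this only helps when the lag is in the server-side lookup by tag.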

realjenius avatar realjenius commented on August 11, 2024

I think we could work on a patch to try that.

I saw that the droplet code in the DigitalOcean provider has something similar to this concept, where it polls for tag acknowledgement: https://github.com/kubermatic/machine-controller/blob/d69e4e9d19e34b41b22d9ac9ab25c12a0f8f786a/pkg/cloudprovider/provider/digitalocean/provider.go#L329C2-L346

What's the failure state in this scenario? If we time out waiting for the tag to show up, is the only option in this case a terminal error?
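
To make the question concrete, here is a rough sketch of polling for tag visibility with a deadline. It is not the DigitalOcean provider's code: `waitForTag`, `errTagNotVisible`, and the `hasTag` callback are hypothetical names, and whether the caller treats the timeout as terminal or retries on the next reconcile is exactly the open question.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTagNotVisible is a hypothetical sentinel returned when the tag never
// shows up within the timeout. The caller decides whether that is terminal.
var errTagNotVisible = errors.New("timed out waiting for tag to become visible")

// waitForTag calls hasTag every interval until it reports true, returns an
// error, or another attempt would run past the timeout.
func waitForTag(interval, timeout time.Duration, hasTag func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := hasTag()
		if err != nil {
			return err // a lookup failure ends the wait immediately
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errTagNotVisible
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated provider: the tag becomes visible on the third lookup.
	calls := 0
	err := waitForTag(10*time.Millisecond, time.Second, func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	fmt.Println("calls:", calls, "err:", err)

	// Simulated provider where the tag never becomes visible.
	err = waitForTag(10*time.Millisecond, 30*time.Millisecond, func() (bool, error) {
		return false, nil
	})
	fmt.Println("err:", err)
}
```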


The other complication is that my team also uses bare metal machines (our older fork of the Vultr provider has support for that), so separately we also want to submit upstream support for MachineType as an optional configuration parameter, allowing for picking bare metal over cloud instances. We think this bug exists for both, however; the API just varies slightly, so both flavors would need this logic I'd think.

2000yeshu avatar 2000yeshu commented on August 11, 2024

I just verified that polling for all machines and filtering on tags doesn't work either.
I can try the DigitalOcean approach as well in a couple of hours. That looks like a good temporary solution.
The solution that you suggested feels a little anti-operator-pattern to me. The operator shouldn't use its own controlled CRD's status to make global state consistent. Instead, it should use the Spec to make the global state consistent, and the Status subresource should be a side effect of the global state.
@embik described a precise reason why this is an anti-pattern.

As for your bare-metal question, my team is already using a solution for that. I can open a PR for it by tomorrow.

realjenius avatar realjenius commented on August 11, 2024

Excellent news, thank you @2000yeshu ! Please let us know if we can help in any way!

2000yeshu avatar 2000yeshu commented on August 11, 2024

For sure. As the bare-metal solution is specific to our use case, I might need input from you to make it GA.

realjenius avatar realjenius commented on August 11, 2024

For sure. As the bare-metal solution is specific to our use case, I might need input from you to make it GA.

Definitely, happy to discuss on another issue, on the PR, or wherever. In short, we use it just like the cloud instances: we added a MachineType property to the provider RawConfig that can be cloud-instance (the default if unset) or bare-metal, and the logic switches between the corresponding APIs accordingly.
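
Roughly, the shape of that switch looks like the sketch below. This is an illustration, not our fork's code: the `machineType` JSON key, the `rawConfig` struct, and the `parseConfig`/`createVia` helpers are all hypothetical names chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MachineType selects which provider API family is used.
type MachineType string

const (
	CloudInstance MachineType = "cloud-instance"
	BareMetal     MachineType = "bare-metal"
)

// rawConfig is a hypothetical, trimmed-down provider config.
type rawConfig struct {
	MachineType MachineType `json:"machineType,omitempty"`
	Region      string      `json:"region"`
	Plan        string      `json:"plan"`
}

// parseConfig decodes the raw config, defaults an unset machine type to
// cloud-instance, and rejects unknown values.
func parseConfig(raw []byte) (rawConfig, error) {
	var cfg rawConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	switch cfg.MachineType {
	case "":
		cfg.MachineType = CloudInstance
	case CloudInstance, BareMetal:
		// valid as given
	default:
		return cfg, fmt.Errorf("unsupported machineType %q", cfg.MachineType)
	}
	return cfg, nil
}

// createVia names the API family a create call would go through.
func createVia(cfg rawConfig) string {
	if cfg.MachineType == BareMetal {
		return "bare-metal API"
	}
	return "instance API"
}

func main() {
	for _, raw := range []string{
		`{"region":"dfw","plan":"example-plan"}`,
		`{"machineType":"bare-metal","region":"dfw","plan":"example-plan"}`,
	} {
		cfg, err := parseConfig([]byte(raw))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(cfg.MachineType, "->", createVia(cfg))
	}
}
```

The provider spec posted further down this thread uses a boolean `physicalMachine` field instead of a string-valued type, so treat the key name here purely as an example.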

realjenius avatar realjenius commented on August 11, 2024

Hey @2000yeshu ! I just tried the latest as of this morning, and had some similar results, it seems. I've got two Vultr bare-metal servers with the same labels and tags (different IPs) and only one machine:

[Screenshot 2023-09-19 at 16 15 32]

kubectl get machines -n kube-system
NAME                                 PROVIDER   OS       NODE                                 KUBELET   ADDRESS        AGE
vultr-dfw-9b9d6589-hx2gh             vultr      ubuntu   vultr-dfw-9b9d6589-hx2gh             1.24.12   45.32.195.47   19m

The underlying provider spec for this MachineDeployment:

      providerSpec:
        value:
          sshPublicKeys: []
          cloudProvider: "vultr"
          cloudProviderSpec:
            apiKey:
              secretKeyRef:
                namespace: kube-system
                name: machine-controller-vultr
                key: token
            physicalMachine: true
            region: "{{ $location }}"
            plan: "{{ $plan }}"
            osId: 387
            tags:
              - [omitted]-node
          operatingSystem: "ubuntu"
          operatingSystemSpec:
            disableAutoUpdate: true

I'm still hunting down if it could be something on my side, but thoughts would be welcome!

realjenius avatar realjenius commented on August 11, 2024

I added some logs, and it looks like the Poll didn't retry for some reason (the "Waiting for instance creation" log is at the top of the poll function): it failed after the first 10-second sleep with an error in the wait for instance creation. I'll keep investigating.

2023-09-20T14:47:25.721Z	info	machine-controller	vultr/provider.go:429	Creating a machine: "15d07445-cb07-44f0-bcd0-95f5ad46cd84"	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:25.721Z	info	machine-controller	vultr/provider.go:419	Creating a physical machine for machine "15d07445-cb07-44f0-bcd0-95f5ad46cd84"	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:36.232Z	info	machine-controller	vultr/provider.go:356	Waiting for instance creation of: %q15d07445-cb07-44f0-bcd0-95f5ad46cd84	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.430Z	error	machine-controller	machine/controller.go:407	Reconciling failed	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "error": "failed to create machine at cloudprovider, due to instance \"vultr-dfw-9b9d6589-h4t2t\" created but controller failed to fetch instance details"}
2023-09-20T14:47:46.431Z	debug	machine-controller	machine/controller.go:382	Reconciling	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t"}
2023-09-20T14:47:46.431Z	debug	events	record/event.go:327	failed to create machine at cloudprovider, due to instance "vultr-dfw-9b9d6589-h4t2t" created but controller failed to fetch instance details	{"type": "Warning", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354465"}, "reason": "ReconcilingError"}
2023-09-20T14:47:46.431Z	debug	machine-controller	machine/controller.go:800	Requesting instance for machine from cloudprovider because no associated node with status ready found...	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.626Z	debug	machine-controller	machine/controller.go:808	Validated machine spec	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.652Z	info	machine-controller	vultr/provider.go:429	Creating a machine: "15d07445-cb07-44f0-bcd0-95f5ad46cd84"	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.652Z	info	machine-controller	vultr/provider.go:419	Creating a physical machine for machine "15d07445-cb07-44f0-bcd0-95f5ad46cd84"	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:57.078Z	info	machine-controller	vultr/provider.go:356	Waiting for instance creation of: %q15d07445-cb07-44f0-bcd0-95f5ad46cd84	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:48:07.422Z	error	machine-controller	machine/controller.go:407	Reconciling failed	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "error": "failed to create machine at cloudprovider, due to instance \"vultr-dfw-9b9d6589-h4t2t\" created but controller failed to fetch instance details"}
2023-09-20T14:48:07.422Z	debug	machine-controller	machine/controller.go:382	Reconciling	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t"}
2023-09-20T14:48:07.422Z	debug	machine-controller	machine/controller.go:800	Requesting instance for machine from cloudprovider because no associated node with status ready found...	{"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:48:07.423Z	debug	events	record/event.go:327	failed to create machine at cloudprovider, due to instance "vultr-dfw-9b9d6589-h4t2t" created but controller failed to fetch instance details	{"type": "Warning", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354482"}, "reason": "ReconcilingError"}
2023-09-20T14:48:08.133Z	debug	events	record/event.go:327	Found instance at cloud provider, addresses: map[45.32.192.59:ExternalIP]	{"type": "Normal", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354482"}, "reason": "InstanceFound"}
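
One general way a poll loop can end after a single attempt, offered only as a guess and not as a diagnosis of the PR: with the common `(done bool, err error)` condition shape (the one used by the `wait` helpers in `k8s.io/apimachinery`), returning a non-nil error stops the polling, and only `(false, nil)` triggers another attempt. The sketch below uses a hypothetical hand-rolled `poll` helper and a simulated lookup to show the difference; it omits the sleep between attempts to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound simulates a lookup that cannot see the new instance yet.
var errNotFound = errors.New("instance not found yet")

// poll calls cond up to maxAttempts times and reports how many calls were
// made. Any error from cond aborts immediately; only (false, nil) retries.
func poll(maxAttempts int, cond func() (bool, error)) (int, error) {
	for i := 1; i <= maxAttempts; i++ {
		done, err := cond()
		if err != nil {
			return i, err
		}
		if done {
			return i, nil
		}
	}
	return maxAttempts, errors.New("poll attempts exhausted")
}

// newLookup returns a fake lookup that fails with errNotFound until the
// third call, standing in for an eventually consistent API.
func newLookup() func() error {
	calls := 0
	return func() error {
		calls++
		if calls < 3 {
			return errNotFound
		}
		return nil
	}
}

func main() {
	// Strict condition: every lookup error is passed through, so the
	// first "not found" ends the poll.
	lookup := newLookup()
	attempts, err := poll(5, func() (bool, error) {
		if e := lookup(); e != nil {
			return false, e
		}
		return true, nil
	})
	fmt.Println("strict:", attempts, err)

	// Tolerant condition: "not found" means "not ready yet", so the poll
	// keeps going until the instance appears.
	lookup = newLookup()
	attempts, err = poll(5, func() (bool, error) {
		e := lookup()
		if errors.Is(e, errNotFound) {
			return false, nil
		}
		if e != nil {
			return false, e
		}
		return true, nil
	})
	fmt.Println("tolerant:", attempts, err)
}
```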

2000yeshu avatar 2000yeshu commented on August 11, 2024

@realjenius Can you please pull the latest commit and try again? I made a big mistake in the check for instance creation.

realjenius avatar realjenius commented on August 11, 2024

It took a while to hunt down what I was seeing, @2000yeshu, but I commented on the PR. I think there were still a couple of edge cases I was hitting.

realjenius avatar realjenius commented on August 11, 2024

@2000yeshu I've taken the latest patch, and have re-enabled a subset of our Vultr clusters. I will be gradually scaling back up these environments over the next day, but so far so good!

2000yeshu avatar 2000yeshu commented on August 11, 2024

@2000yeshu I've taken the latest patch, and have re-enabled a subset of our Vultr clusters. I will be gradually scaling back up these environments over the next day, but so far so good!

Cool, I am also testing on my clusters. Will keep you updated.

embik avatar embik commented on August 11, 2024

#1706 looks good to me now. @realjenius @2000yeshu would you prefer to keep it open until you have validated for a bit or do you think the PR should be merged?

2000yeshu avatar 2000yeshu commented on August 11, 2024

I think this can be merged now.
