Comments (20)
from machine-controller.
@realjenius Can you review #1706 ?
from machine-controller.
First scan through looks really good!
I'd like to try to upgrade a canary environment in our system to this branch. I'm working on that now - we also have some general "upgrade machine-controller" stuff to work through as well so it's going to take me a little bit.
from machine-controller.
Yup! I just looked and I was toying with a similar change. Will report back shortly.
from machine-controller.
Since Vultr is a community provider, we very much welcome contributions to the Vultr provider to make it more robust. The approach seems sensible if tagging is unreliable, but it opens up the possibility that machine-controller fails to update the status resource and reconciles the Machine again. So I wonder if this trades one potential source of "double VM" with another.
I would also like to hear if @2000yeshu as the original author of the provider has any thoughts on this matter.
from machine-controller.
Ahh, good point. Unfortunately we also don't even know for certain if our theory is correct. We have only detected the symptom, and by the time it happens, our network is being disrupted due to hot-looping our ip routing tables between the two machines, so we're in more of a "high-priority fix" issue, and haven't found any other breadcrumbs for the root cause.
So our focus so far has been finding ways to make the cloud provider less likely to get into this state, so definitely open to other suggestions!
from machine-controller.
The issue is definitely related to the eventual consistency of labels, but I'm not sure it is only on the instance GET-by-labels side. Since this is a recent issue, I have a feeling that they might have started maintaining a reverse map of labels -> instance.
So one thing we could do to verify is get all instances and filter by label to confirm the instance is there.
Edit: I meant tags not labels
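As a rough illustration of the "get all instances and filter by tag" check described above, here is a minimal Go sketch. The `Instance` type and `filterByTag` helper are hypothetical and stand in for whatever the provider's list call actually returns; this is not the Vultr client API.

```go
package main

// Instance is a hypothetical, simplified view of a cloud instance.
// The real provider type carries many more fields.
type Instance struct {
	ID   string
	Tags []string
}

// filterByTag returns every instance in all that carries the given tag.
// The filtering happens client-side, so it does not depend on the
// provider's own tag index being up to date.
func filterByTag(all []Instance, tag string) []Instance {
	var matched []Instance
	for _, inst := range all {
		for _, t := range inst.Tags {
			if t == tag {
				matched = append(matched, inst)
				break
			}
		}
	}
	return matched
}
```

More than one match for a machine's unique tag would point to the "double VM" symptom; zero matches right after creation would suggest the tag has not propagated yet.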
from machine-controller.
I think we could work on a patch to try that.
I saw the droplet code in the DigitalOcean provider has something similar to this concept where it polls for tag acknowledgement: https://github.com/kubermatic/machine-controller/blob/d69e4e9d19e34b41b22d9ac9ab25c12a0f8f786a/pkg/cloudprovider/provider/digitalocean/provider.go#L329C2-L346
What's the failure state in this scenario? If we time out waiting for the tag to show up, is the only option in this case a terminal error?
The other complication is that my team also uses bare metal machines (our older fork of the Vultr provider has support for that), so separately we also want to submit upstream support for MachineType as an optional configuration parameter, allowing for picking bare metal over cloud instances. We think this bug exists for both, however; the API just varies slightly, so both flavors would need this logic I'd think.
from machine-controller.
I just verified that polling for all machines and filtering on tags doesn't work either.
I can try the DigitalOcean approach as well in a couple of hours. That looks like a good temporary solution.
The solution that you suggested feels a little anti-operator-pattern to me. The operator shouldn't use its own controlled CRD's status to make global state consistent. Instead, it should use the Spec to make the global state consistent, and the Status subresource should be a side effect of the global state.
@embik described the precise reason why this is an anti-pattern.
As for your bare metal question, my team is already using a solution for that. I can open a PR for that by tomorrow.
from machine-controller.
Excellent news, thank you @2000yeshu ! Please let us know if we can help in any way!
from machine-controller.
For sure, as the bare metal sol'n is specific to our use case, I might need input from you to make it GA.
from machine-controller.
Definitely, happy to discuss on another issue or on the PR or wherever. In short, we use it just like the cloud instances - we just added a MachineType property to the provider RawConfig that can be cloud-instance (default if unset) or bare-metal, and the logic just switches between the various APIs accordingly.
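A minimal sketch of how such a MachineType switch could look in Go, going only by the description above. The type names, the `machineType` JSON key, and the `resolveMachineType` helper are assumptions for illustration, not the fork's actual code.

```go
package main

import "fmt"

// MachineType selects which Vultr API family a machine is created with.
type MachineType string

const (
	CloudInstance MachineType = "cloud-instance"
	BareMetal     MachineType = "bare-metal"
)

// RawConfig is a hypothetical, trimmed-down provider config; the real
// RawConfig has many more fields and the JSON key is an assumption.
type RawConfig struct {
	MachineType MachineType `json:"machineType,omitempty"`
}

// resolveMachineType applies the default (cloud-instance when unset)
// and rejects values other than the two supported ones.
func resolveMachineType(c RawConfig) (MachineType, error) {
	switch c.MachineType {
	case "", CloudInstance:
		return CloudInstance, nil
	case BareMetal:
		return BareMetal, nil
	default:
		return "", fmt.Errorf("unsupported machineType %q", c.MachineType)
	}
}
```

The provider's create, get, and delete paths would then branch once on the resolved value instead of re-checking the raw string in several places.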
from machine-controller.
Hey @2000yeshu ! I just tried the latest as of this morning, and had some similar results, it seems. I've got two Vultr bare-metal servers with the same labels and tags (different IPs) and only one machine:
kubectl get machines -n kube-system
NAME PROVIDER OS NODE KUBELET ADDRESS AGE
vultr-dfw-9b9d6589-hx2gh vultr ubuntu vultr-dfw-9b9d6589-hx2gh 1.24.12 45.32.195.47 19m
The underlying provider spec for this MachineDeployment:
providerSpec:
  value:
    sshPublicKeys: []
    cloudProvider: "vultr"
    cloudProviderSpec:
      apiKey:
        secretKeyRef:
          namespace: kube-system
          name: machine-controller-vultr
          key: token
      physicalMachine: true
      region: "{{ $location }}"
      plan: "{{ $plan }}"
      osId: 387
      tags:
        - [omitted]-node
    operatingSystem: "ubuntu"
    operatingSystemSpec:
      disableAutoUpdate: true
I'm still hunting down if it could be something on my side, but thoughts would be welcome!
from machine-controller.
I added some logs; it looks like the Poll didn't retry for some reason (the "Waiting for instance creation" log is at the top of the poll function) - it failed after the first 10 second sleep + error in the wait for instance creation. I'll keep investigating.
2023-09-20T14:47:25.721Z info machine-controller vultr/provider.go:429 Creating a machine: "15d07445-cb07-44f0-bcd0-95f5ad46cd84" {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:25.721Z info machine-controller vultr/provider.go:419 Creating a physical machine for machine "15d07445-cb07-44f0-bcd0-95f5ad46cd84" {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:36.232Z info machine-controller vultr/provider.go:356 Waiting for instance creation of: %q15d07445-cb07-44f0-bcd0-95f5ad46cd84 {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.430Z error machine-controller machine/controller.go:407 Reconciling failed {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "error": "failed to create machine at cloudprovider, due to instance \"vultr-dfw-9b9d6589-h4t2t\" created but controller failed to fetch instance details"}
2023-09-20T14:47:46.431Z debug machine-controller machine/controller.go:382 Reconciling {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t"}
2023-09-20T14:47:46.431Z debug events record/event.go:327 failed to create machine at cloudprovider, due to instance "vultr-dfw-9b9d6589-h4t2t" created but controller failed to fetch instance details {"type": "Warning", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354465"}, "reason": "ReconcilingError"}
2023-09-20T14:47:46.431Z debug machine-controller machine/controller.go:800 Requesting instance for machine from cloudprovider because no associated node with status ready found... {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.626Z debug machine-controller machine/controller.go:808 Validated machine spec {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.652Z info machine-controller vultr/provider.go:429 Creating a machine: "15d07445-cb07-44f0-bcd0-95f5ad46cd84" {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:46.652Z info machine-controller vultr/provider.go:419 Creating a physical machine for machine "15d07445-cb07-44f0-bcd0-95f5ad46cd84" {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:47:57.078Z info machine-controller vultr/provider.go:356 Waiting for instance creation of: %q15d07445-cb07-44f0-bcd0-95f5ad46cd84 {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:48:07.422Z error machine-controller machine/controller.go:407 Reconciling failed {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "error": "failed to create machine at cloudprovider, due to instance \"vultr-dfw-9b9d6589-h4t2t\" created but controller failed to fetch instance details"}
2023-09-20T14:48:07.422Z debug machine-controller machine/controller.go:382 Reconciling {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t"}
2023-09-20T14:48:07.422Z debug machine-controller machine/controller.go:800 Requesting instance for machine from cloudprovider because no associated node with status ready found... {"machine": "kube-system/vultr-dfw-9b9d6589-h4t2t", "provider": "vultr"}
2023-09-20T14:48:07.423Z debug events record/event.go:327 failed to create machine at cloudprovider, due to instance "vultr-dfw-9b9d6589-h4t2t" created but controller failed to fetch instance details {"type": "Warning", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354482"}, "reason": "ReconcilingError"}
2023-09-20T14:48:08.133Z debug events record/event.go:327 Found instance at cloud provider, addresses: map[45.32.192.59:ExternalIP] {"type": "Normal", "object": {"kind":"Machine","namespace":"kube-system","name":"vultr-dfw-9b9d6589-h4t2t","uid":"15d07445-cb07-44f0-bcd0-95f5ad46cd84","apiVersion":"cluster.k8s.io/v1alpha1","resourceVersion":"354482"}, "reason": "InstanceFound"}
from machine-controller.
@realjenius Can you please pull the latest commit and try again? Made a big mistake in checking for instance creation.
from machine-controller.
Took a while to hunt down what I was seeing @2000yeshu but I commented on the PR. I think there were still a couple edges I was hitting.
from machine-controller.
@2000yeshu I've taken the latest patch, and have re-enabled a subset of our Vultr clusters. I will be gradually scaling back up these environments over the next day, but so far so good!
from machine-controller.
Cool, I am also testing on my clusters. Will keep you updated.
from machine-controller.
#1706 looks good to me now. @realjenius @2000yeshu would you prefer to keep it open until you have validated for a bit or do you think the PR should be merged?
from machine-controller.
Think this can be merged now.
from machine-controller.