Comments (30)
This issue looks related: openshift/cluster-machine-approver#15
from dev-scripts.
My machine-approver log has a bunch of:
I0321 18:09:39.113975 1 main.go:97] CSR csr-grgft added
I0321 18:09:39.133134 1 main.go:123] CSR csr-grgft not authorized: No target machine
I0321 18:14:32.453059 1 main.go:97] CSR csr-f66gv added
I0321 18:14:32.464451 1 main.go:123] CSR csr-f66gv not authorized: No target machine
I0321 18:18:11.088381 1 main.go:97] CSR csr-jmf46 added
I0321 18:18:11.102598 1 main.go:123] CSR csr-jmf46 not authorized: No target machine
I0321 18:22:27.112564 1 main.go:97] CSR csr-zxjkn added
I0321 18:22:27.127977 1 main.go:123] CSR csr-zxjkn not authorized: No target machine
I0321 18:27:20.454085 1 main.go:97] CSR csr-f85k9 added
The code of interest is authorizeCSR() here:
https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L91
It's failing right now because it can't find a Machine with a NodeRef whose name matches the requestor. The request includes something like machine-0, and that does match the name of the Nodes in our env. The issue appears to be that NodeRef isn't getting set on our Machine, and that is because we don't have our actuator / machine controller running; that's what would set NodeRef.
If I'm reading all this right, we need to complete integration of our cluster-api provider to get this to work properly.
openshift/machine-api-operator#235
This is where NodeRef gets set:
https://github.com/metalkube/cluster-api-provider-baremetal/blob/master/vendor/sigs.k8s.io/cluster-api/pkg/controller/node/node.go#L76
Right now NodeRef will never get set because our Nodes do not have the machine.openshift.io/machine annotation set on the Node. That is supposed to be set by the nodelink-controller here, which is run by the machine-api-operator:
https://github.com/openshift/machine-api-operator/blob/master/cmd/nodelink-controller/main.go#L330
In my current cluster, the machine-api-operator is not running the clusterapi-manager-controllers pod, which would include the nodelink controller. This is due to other errors during machine-api-operator startup.
machine-api-operator has the following error:
E0401 19:16:52.677137 1 operator.go:177] Failed getting operator config: no platform provider found on install config
which implies that it's seeing the provider type as blank in the Infrastructure config object.
https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/config.go#L48
https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L192-L200
However, I see it set to BareMetal when I look at it manually:
$ oc get infrastructures.config.openshift.io --all-namespaces -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: 2019-04-01T17:15:56Z
    generation: 1
    name: cluster
    resourceVersion: "396"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: cfc641b8-54a1-11e9-ba61-52fdfc072182
  spec: {}
  status:
    apiServerURL: https://api.ostest.test.metalkube.org:6443
    etcdDiscoveryDomain: ostest.test.metalkube.org
    platform: BareMetal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Now I'm trying to understand why the machine-api-operator wouldn't be getting that successfully ...
I started running the machine-api-operator locally and got different behavior. It was able to get the provider type successfully from the Infrastructure config object, but was falling back to the no-op provider due to the following problem:
openshift/machine-api-operator#268
Next problem: the clusterapi-manager-controllers pod is running, but in CrashLoopBackOff.
Here's the error:
$ oc logs -n openshift-machine-api pod/clusterapi-manager-controllers-58f6f9769f-xckj7 -c controller-manager
flag provided but not defined: -logtostderr
Usage of ./manager:
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
        The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.
This is the same problem we fixed in metal3-io/cluster-api-provider-baremetal#39, except this time it's the manager that we build from ./vendor/github.com/openshift/cluster-api/cmd/manager.
Indeed, if I build and run this locally from openshift/cluster-api-provider-baremetal, I get the same behavior.
$ make
go build -o bin/machine-controller-manager ./cmd/manager
go build -o bin/manager ./vendor/github.com/openshift/cluster-api/cmd/manager
$ bin/manager -logtostderr=true
flag provided but not defined: -logtostderr
Usage of bin/manager:
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
        The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.
It seems this flags issue was fixed in openshift/cluster-api-provider-aws by directly modifying the vendored copy of openshift/cluster-api, rather than upstreaming the fix into openshift/cluster-api.
openshift/cluster-api-provider-aws@50732ee#diff-6ae4328a95448a2a20bcb23ee01dca50
Next, we need this fix applied to openshift/cluster-api, and then to update our copy of it in openshift/cluster-api-provider-baremetal.
Also for reference, openshift/cluster-api-provider-libvirt got the fix in a similar way, by directly modifying the vendored openshift/cluster-api in this commit:
openshift/cluster-api-provider-libvirt@07a9858#diff-6ae4328a95448a2a20bcb23ee01dca50
Patch proposed to fix this directly in openshift/cluster-api-provider-baremetal here: openshift/cluster-api-provider-baremetal#9
Correction: the cluster-api copies were not modified directly; those providers switched to a different branch, and we needed to switch our provider as well.
The above patch to openshift/cluster-api-provider-baremetal merged, but I needed a way to run a custom build of the actuator. I documented how I'm doing that manually here: #271
The next error is in the container that's actually running our actuator.
$ oc logs -n openshift-machine-api clusterapi-manager-controllers-5cdf45d7f9-n9z4b -c machine-controller
...
E0402 20:22:42.603016 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:43.605226 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:44.606683 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
...
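As a hedged illustration of what granting that access looks like, the RBAC would be shaped roughly like the following. All names here are hypothetical, not the actual machine-api-operator manifests:

```yaml
# Illustrative only: the controller's service account needs cluster-scope
# access to baremetalhosts in the metalkube.org API group.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: baremetalhost-reader   # hypothetical name
rules:
- apiGroups: ["metalkube.org"]
  resources: ["baremetalhosts"]
  verbs: ["get", "list", "watch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: baremetalhost-reader   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: baremetalhost-reader
subjects:
- kind: ServiceAccount
  name: default                # the account named in the error message
  namespace: openshift-machine-api
```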
The RBAC error above is fixed by the following PR: openshift/machine-api-operator#271
With that in place, the baremetal machine controller is running successfully. \o/
Two things:
- There are enough fixes in place that we should be able to proceed with actuator-driven worker node deployments.
- The original cert issue is not yet resolved, but we've gotten enough fixed that we're back to working on the Machine and Node resource linkage necessary to resolve the cert issue. The lack of that linkage is causing failures in the cluster-machine-approver pod, plus the nodelink-controller and machine-healthcheck containers of the clusterapi-machine-controllers pod.
We now have Ironic introspection data available on a BareMetalHost.
I put up a PR to make the nodelink-controller deal with the fact that we may have multiple internal IPs listed for a bare metal Machine object: openshift/machine-api-operator#314
Actuator update to populate the IPs from the BMH: metal3-io/cluster-api-provider-baremetal#24
We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.
> We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.
I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.
> We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.
> I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.
A couple of complications:
- Manifests generated by the installer are generated prior to running terraform.
- Introspection data isn't available until after terraform is running and the cluster is coming up.
- The hardware details are in the status section, which can't be set in a manifest anyway.
We first need a way to set this at all, and then we'll have to figure out how to use the installer and terraform to get the information into the cluster. I filed this to create a way to pass in the info: metal3-io/baremetal-operator#242
I expect hardware data to be available when https://github.com/metal3-io/metal3-docs/blob/master/design/hardware-status.md is implemented.
Here are some updates to the latest status of this issue.
I recently cleaned up dev-scripts a bit to remove some related hacks that are no longer required: #686
I reviewed the current code in OpenShift that automatically approves CSRs and documented my understanding of the process here: openshift/cluster-machine-approver#32
Now, status for the baremetal platform:
- CSRs for masters get automatically approved successfully. This is done by a service that runs on the bootstrap VM during installation.
- CSRs for workers do not get approved successfully yet, but we are very close:
  - Approval of the client CSR was blocked on cluster-api-provider-baremetal not populating the NodeInternalDNS address field with the hostname. That was resolved here: metal3-io/cluster-api-provider-baremetal#100 and then added to OpenShift here: openshift/cluster-api-provider-baremetal#40
  - Once we are running a new enough release image to include the above change, automatic CSR approvals for workers should be working, but this needs validation.
There are still things to consider for improvements:
- Automatic CSR approval for workers relies on hardware introspection data on the BareMetalHost object. If the IPs used by the host change for some reason from what the host had during introspection, this will break. "Continuous Introspection" discussed in previous comments would resolve this.
- Note that once the first CSR is approved, future CSRs will get approved automatically. The addresses just need to line up for the first one, which occurs immediately post-deployment, so the problem of mismatched addresses is pretty unlikely.
Further clarification about masters. Indeed, both the client and server certs will be approved during bootstrapping. However, only the rotations of the client cert will get automatically approved from then on. Automation is still required for approval of the server certs.
I looked into it and now understand why this is required.
The docs were just updated last week to clarify that only the node client cert will be auto approved on an ongoing basis, and that some automation will always be required for the server certs.
https://bugzilla.redhat.com/show_bug.cgi?id=1720178
openshift/openshift-docs@6912507
openshift/openshift-docs#16060
This means that we are back to requiring the addresses on masters to enable the cluster-machine-approver to automate the server CSR approval for masters. Otherwise, we have to use a cron job of sorts to do it, either outside the cluster like dev-scripts has been doing it, or perhaps inside the cluster by doing something like https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml
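As a sketch of the in-cluster option, a CronJob along these lines could run the approval loop. Everything here (names, image, schedule) is hypothetical, and blanket-approving pending CSRs is only acceptable as a stopgap; the service account would also need RBAC to get and approve CSRs:

```yaml
# Illustrative only; not the actual openshift-ansible manifest.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: csr-approver          # hypothetical name
  namespace: openshift-machine-api
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: csr-approver   # hypothetical; needs CSR RBAC
          restartPolicy: Never
          containers:
          - name: approve
            image: registry.example.com/openshift/cli:latest  # hypothetical
            command:
            - /bin/bash
            - -c
            - |
              # Approve any CSR that has no status yet (i.e. still pending).
              oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
                | xargs --no-run-if-empty oc adm certificate approve
```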
proposing #713, which is a cronjob running within the cluster.
https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c2
This proposes a change to the cluster machine approver to be able to automatically approve server cert refreshes in a way that would not require us to solve the IP addresses-on-master-Machines issue.
#730 is a temporary work-around for this
Automatic CSR approvals don't work anymore after downgrading Ironic; more info here: #782
As the fix for this bug has merged, can we consider this issue closed?
openshift/cluster-machine-approver#38 landed and @karmab reported an install-scripts deployment that stayed up for >24h without fix_certs, so this may now be resolved?
The PR you refer to resolves our issues for masters, but not workers. You'd still have to manually approve the first CSRs for each worker without some hack in place. This is because we're lacking the hostname after downgrading Ironic, and that's one of the pieces of info needed in the automated CSR approval workflow.
hi @russellb!
Do you mean that we're still hitting this bug? I can see how workers are not joining the cluster (at the node level) but they are provisioned and functional at machine/baremetalhost level. To make it work, I have to manually approve the pending CSRs. Thanks in advance!
Btw, I'm talking about real baremetal deployment, not dev-scripts.
AFAIK, this issue should be resolved.
For workers, we have enough information on the BareMetalHost and Machine resources to support automatic CSR approval. If that's not working, we need to check the logs of the cluster-machine-approver; that would be a regression.
Master CSRs are automatically approved during installation. Refreshing them should have been resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1737611
we've removed all hacks from dev-scripts as of #915
Please open new bugs if anyone sees a similar issue at this point