
Comments (30)

russellb avatar russellb commented on July 30, 2024

This issue looks related: openshift/cluster-machine-approver#15

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

My machine-approver log has a bunch of:

I0321 18:09:39.113975       1 main.go:97] CSR csr-grgft added
I0321 18:09:39.133134       1 main.go:123] CSR csr-grgft not authorized: No target machine
I0321 18:14:32.453059       1 main.go:97] CSR csr-f66gv added
I0321 18:14:32.464451       1 main.go:123] CSR csr-f66gv not authorized: No target machine
I0321 18:18:11.088381       1 main.go:97] CSR csr-jmf46 added
I0321 18:18:11.102598       1 main.go:123] CSR csr-jmf46 not authorized: No target machine
I0321 18:22:27.112564       1 main.go:97] CSR csr-zxjkn added
I0321 18:22:27.127977       1 main.go:123] CSR csr-zxjkn not authorized: No target machine
I0321 18:27:20.454085       1 main.go:97] CSR csr-f85k9 added

The code of interest is authorizeCSR() in here:

https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L91

It's failing right now because it can't find a Machine with a NodeRef whose name matches the requestor. The request includes a name like machine-0, and that does match the Node names in our env.

The issue appears to be that NodeRef isn't getting set on our Machines.

And that is because we don't have our actuator / machine controller running; that's what would be setting NodeRef.

If I'm reading all this right, we need to complete integration of our cluster-api provider to get this to work properly.
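
For illustration, here's roughly what that check amounts to (a Go sketch, not the actual cluster-machine-approver code). The node name is assumed to have already been extracted from the CSR, and the machine type is a hypothetical stand-in for the real Machine resource:

package sketch

// machine is a hypothetical stand-in for the Machine resource; only the
// fields this sketch needs are included.
type machine struct {
	Name        string
	NodeRefName string // empty until the machine controller sets status.nodeRef
}

// authorized reports whether some Machine's NodeRef matches the node name
// requested in the CSR.
func authorized(nodeName string, machines []machine) bool {
	for _, m := range machines {
		if m.NodeRefName != "" && m.NodeRefName == nodeName {
			return true
		}
	}
	// With no actuator / machine controller running, NodeRef is never set,
	// so every CSR ends up here: "not authorized: No target machine".
	return false
}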

openshift/machine-api-operator#235

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

This is where NodeRef gets set:
https://github.com/metalkube/cluster-api-provider-baremetal/blob/master/vendor/sigs.k8s.io/cluster-api/pkg/controller/node/node.go#L76

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

Right now NodeRef will never get set because our Nodes do not have the machine.openshift.io/machine annotation set on the Node.

That is supposed to be set by the nodelink-controller here, which is run by the machine-api-operator.

https://github.com/openshift/machine-api-operator/blob/master/cmd/nodelink-controller/main.go#L330

In my current cluster, the machine-api-operator is not running the clusterapi-manager-controllers pod, which would include the nodelink controller. This is due to other errors during machine-api-operator startup.
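
For reference, the linkage amounts to something like this (a rough Go sketch, not the controller's actual code): find the Machine whose internal IP matches one of the Node's addresses, then record the link via the machine annotation. The machinesByIP map is a hypothetical index from internal IP to the owning Machine's "namespace/name":

package sketch

import corev1 "k8s.io/api/core/v1"

// linkNode looks up the Machine owning this Node by internal IP and records
// the link with the machine.openshift.io/machine annotation. It returns true
// if a matching Machine was found.
func linkNode(node *corev1.Node, machinesByIP map[string]string) bool {
	for _, addr := range node.Status.Addresses {
		if addr.Type != corev1.NodeInternalIP {
			continue
		}
		if machineKey, ok := machinesByIP[addr.Address]; ok {
			if node.Annotations == nil {
				node.Annotations = map[string]string{}
			}
			node.Annotations["machine.openshift.io/machine"] = machineKey
			return true
		}
	}
	return false
}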

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

machine-api-operator has the following error:

E0401 19:16:52.677137       1 operator.go:177] Failed getting operator config: no platform provider found on install config

which implies that it's seeing the provider type as blank in the Infrastructure config object.

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/config.go#L48

https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L192-L200

However, I see it set to BareMetal when I look at it manually:

$ oc get infrastructures.config.openshift.io --all-namespaces -oyaml                                                                                                     
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: 2019-04-01T17:15:56Z
    generation: 1
    name: cluster
    resourceVersion: "396"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: cfc641b8-54a1-11e9-ba61-52fdfc072182
  spec: {}
  status:
    apiServerURL: https://api.ostest.test.metalkube.org:6443
    etcdDiscoveryDomain: ostest.test.metalkube.org
    platform: BareMetal
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Now I'm trying to understand why the machine-api-operator isn't reading that value successfully ...
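
For reference, here's a rough sketch of reading status.platform from that object with the dynamic client (illustration only; the operator's own config code is linked above and differs):

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// platformFromInfrastructure reads status.platform from the cluster-scoped
// Infrastructure object named "cluster".
func platformFromInfrastructure(ctx context.Context) (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	gvr := schema.GroupVersionResource{
		Group: "config.openshift.io", Version: "v1", Resource: "infrastructures",
	}
	infra, err := dyn.Resource(gvr).Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	platform, found, err := unstructured.NestedString(infra.Object, "status", "platform")
	if err != nil || !found {
		return "", fmt.Errorf("no platform set in infrastructure status")
	}
	return platform, nil // expected to be "BareMetal" here
}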

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

I started running the machine-api-operator locally and got different behavior. It was able to get the provider type successfully from the Infrastructure config object, but was falling back to the no-op provider due to the following problem:

openshift/machine-api-operator#268

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

next problem: the clusterapi-manager-controllers pod is running, but in CrashLoopBackoff.

Here's the error:

$ oc logs -n openshift-machine-api pod/clusterapi-manager-controllers-58f6f9769f-xckj7 -c controller-manager
flag provided but not defined: -logtostderr
Usage of ./manager:
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
        The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.

This is the same problem we fixed in metal3-io/cluster-api-provider-baremetal#39, except this time it's the manager binary that we build from ./vendor/github.com/openshift/cluster-api/cmd/manager.

Indeed, if I build and run this locally from openshift/cluster-api-provider-baremetal, I get the same behavior.

$ make
go build -o bin/machine-controller-manager ./cmd/manager
go build -o bin/manager ./vendor/github.com/openshift/cluster-api/cmd/manager

$ bin/manager -logtostderr=true
flag provided but not defined: -logtostderr
Usage of bin/manager:
  -kubeconfig string
    	Paths to a kubeconfig. Only required if out-of-cluster.
  -master string
    	The address of the Kubernetes API server. Overrides any value in kubeconfig. Only required if out-of-cluster.
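
The general shape of the fix (a sketch only; the actual patch to the vendored manager may differ) is to register klog's flags, which include -logtostderr, on the standard flag set before parsing:

package main

import (
	"flag"

	"k8s.io/klog"
)

func main() {
	klog.InitFlags(nil) // nil registers klog's flags (incl. -logtostderr) on flag.CommandLine
	flag.Parse()
	klog.Info("flags parsed; -logtostderr is now accepted")
	// ... start the controller manager here ...
}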

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

It seems this flags issue was fixed in openshift/cluster-api-provider-aws by directly modifying the vendored copy of openshift/cluster-api rather than upstreaming the fix into openshift/cluster-api.

openshift/cluster-api-provider-aws@50732ee#diff-6ae4328a95448a2a20bcb23ee01dca50

Next we need this fix applied to openshift/cluster-api, and then to update our vendored copy of it in openshift/cluster-api-provider-baremetal.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

also for reference, openshift/cluster-api-provider-libvirt got the fix in a similar way, by directly modifying the vendored openshift/cluster-api in this commit:

openshift/cluster-api-provider-libvirt@07a9858#diff-6ae4328a95448a2a20bcb23ee01dca50

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

Patch proposed to fix this directly in openshift/cluster-api-provider-baremetal here: openshift/cluster-api-provider-baremetal#9

It turns out the vendored cluster-api copies were not modified directly after all. Those providers switched to a different branch of cluster-api, and we needed to switch our provider to it as well.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

The above patch to openshift/cluster-api-provider-baremetal merged, but I needed a way to run a custom build of the actuator. I documented how I'm doing that manually here: #271

The next error is in the container that's actually running our actuator.

$ oc logs -n openshift-machine-api clusterapi-manager-controllers-5cdf45d7f9-n9z4b -c machine-controller

...
E0402 20:22:42.603016       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:43.605226       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
E0402 20:22:44.606683       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metalkube.org is forbidden: User "system:serviceaccount:openshift-machine-api:default" cannot list resource "baremetalhosts" in API group "metalkube.org" at the cluster scope
...

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

The RBAC error above is fixed by the following PR: openshift/machine-api-operator#271
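
Roughly, the missing permission amounts to something like this, sketched with the rbac/v1 Go types and a hypothetical role name (the actual manifests in the PR may differ):

package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// bareMetalHostReader grants the cluster-scoped read access the machine
// controller needs on baremetalhosts.metalkube.org.
var bareMetalHostReader = rbacv1.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: "baremetalhost-reader"}, // hypothetical name
	Rules: []rbacv1.PolicyRule{
		{
			APIGroups: []string{"metalkube.org"},
			Resources: []string{"baremetalhosts"},
			Verbs:     []string{"get", "list", "watch"},
		},
	},
}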

With that in place, the baremetal machine controller is running successfully. \o/

Two things:

  1. There are enough fixes in place that we should be able to proceed with actuator-driven worker node deployments.

  2. The original cert issue is not yet resolved, but we've gotten enough fixed that we're back to working on the Machine and Node resource linkage necessary to resolve it. The lack of that linkage is causing failures in the cluster-machine-approver pod, as well as in the nodelink-controller and machine-healthcheck containers of the clusterapi-manager-controllers pod.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

We now have Ironic introspection data available on a BareMetalHost.

I put up a PR to make the nodelink-controller deal with the fact that we may have multiple internal IPs listed for a bare metal Machine object: openshift/machine-api-operator#314
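
The gist of the change, as a hypothetical sketch rather than the PR itself: when indexing Machines by address, add an entry for every InternalIP a Machine reports instead of assuming there is exactly one.

package sketch

// machineAddresses is a stand-in for the relevant bits of a Machine:
// its "namespace/name" key and all InternalIP entries from status.addresses.
type machineAddresses struct {
	Key         string
	InternalIPs []string
}

// indexByInternalIP builds a lookup map with one entry per internal IP, so a
// Node can be matched on any of a Machine's addresses.
func indexByInternalIP(machines []machineAddresses) map[string]string {
	byIP := make(map[string]string)
	for _, m := range machines {
		for _, ip := range m.InternalIPs {
			byIP[ip] = m.Key
		}
	}
	return byIP
}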

from dev-scripts.

bcrochet avatar bcrochet commented on July 30, 2024

Actuator update to populate the IPs from the BMH: metal3-io/cluster-api-provider-baremetal#24

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

from dev-scripts.

hardys avatar hardys commented on July 30, 2024

We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

We are much closer here. Much of the plumbing to get addresses on Machines is done. We still lack addresses on BareMetalHosts that represent masters, as those don’t go through introspection driven by the baremetal operator. A bug needs to be opened for this.

I was expecting that to be solved via openshift-metal3/kni-installer#46 combined with openshift-metal3/terraform-provider-ironic#28 but perhaps @dhellmann and @stbenjam can confirm if we need an additional issue to track wiring the introspection data into the BMH resources registered via the installer.

A couple of complications:

  • The installer generates its manifests before terraform runs.
  • Introspection data isn't available until after terraform has run and the cluster is coming up.
  • The hardware details live in the status section, which can't be set via a manifest anyway.

We first need a way to set this at all, and then we'll have to figure out how to use the installer and terraform to get the information into the cluster. I filed this to create a way to pass in the info: metal3-io/baremetal-operator#242

from dev-scripts.

dhellmann avatar dhellmann commented on July 30, 2024

I expect hardware data to be available when https://github.com/metal3-io/metal3-docs/blob/master/design/hardware-status.md is implemented.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

Here are some updates to the latest status of this issue.

I recently cleaned up dev-scripts a bit to remove some related hacks that are no longer required: #686

I reviewed the current code in OpenShift that automatically approves CSRs and documented my understanding of the process here: openshift/cluster-machine-approver#32

Now, status for the baremetal platform:

  • CSRs for masters get automatically approved successfully. This is done by a service that runs on the bootstrap VM during installation.

  • CSRs for workers do not get approved successfully yet, but we are very close:

Approval of the client CSR was blocked on cluster-api-provider-baremetal not populating the NodeInternalDNS address field with the hostname. That was resolved here: metal3-io/cluster-api-provider-baremetal#100 and then added to OpenShift here: openshift/cluster-api-provider-baremetal#40
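
In simplified terms (a sketch, not the approver's actual code), the client CSR for a worker can only be matched to a Machine if the node name requested in the CSR shows up as a DNS-type address on that Machine, which is why the missing NodeInternalDNS entry blocked approval:

package sketch

// address mirrors the shape of machine.status.addresses entries.
type address struct {
	Type    string // e.g. "InternalDNS", "InternalIP"
	Address string
}

// hasDNSName reports whether the Machine's addresses include the requested
// node name as a DNS-type entry.
func hasDNSName(addrs []address, nodeName string) bool {
	for _, a := range addrs {
		if (a.Type == "InternalDNS" || a.Type == "ExternalDNS") && a.Address == nodeName {
			return true
		}
	}
	return false
}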

Once we are running a new enough release image to include the above change, automatic CSR approval for workers should be working, but this needs validation.

There are still things to consider for improvements:

Automatic CSR approval for workers relies on hardware introspection data on the BareMetalHost object. If the IPs used by the host change for some reason from what the host had during introspection, this will break. "Continuous Introspection" discussed in previous comments would resolve this.

Note that once the first CSR is approved, future CSRs will get approved automatically. The addresses just need to line up for the first one, which occurs immediately post-deployment, so mismatched addresses are pretty unlikely to be a problem.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

Further clarification about masters. Indeed, both the client and server certs will be approved during bootstrapping. However, only the rotations of the client cert will get automatically approved from then on. Automation is still required for approval of the server certs.
I looked into it and now understand why this is required.

The docs were just updated last week to clarify that only the node client cert will be auto approved on an ongoing basis, and that some automation will always be required for the server certs.

https://bugzilla.redhat.com/show_bug.cgi?id=1720178

openshift/openshift-docs@6912507

openshift/openshift-docs#16060

This means we are back to requiring addresses on the master Machines so the cluster-machine-approver can automate server CSR approval for masters. Otherwise, we have to use a cron job of sorts to do it, either outside the cluster as dev-scripts has been doing, or perhaps inside the cluster with something like https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml
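
Whichever way it runs, the automation boils down to a loop like this (a rough sketch using client-go's current certificates/v1 API; the cron-job and script approaches above achieve the same thing with oc commands):

package sketch

import (
	"context"

	certificatesv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// approvePendingCSRs marks every pending CSR as approved. A real approver
// must validate the CSR (requestor, usages, requested names) before
// approving; approving blindly like this is only acceptable as a stop-gap.
func approvePendingCSRs(ctx context.Context) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range csrs.Items {
		csr := &csrs.Items[i]
		if decided(csr) {
			continue
		}
		csr.Status.Conditions = append(csr.Status.Conditions, certificatesv1.CertificateSigningRequestCondition{
			Type:    certificatesv1.CertificateApproved,
			Status:  corev1.ConditionTrue,
			Reason:  "AutoApproved",
			Message: "approved by in-cluster workaround",
		})
		if _, err := client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, csr.Name, csr, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// decided reports whether the CSR has already been approved or denied.
func decided(csr *certificatesv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certificatesv1.CertificateApproved || c.Type == certificatesv1.CertificateDenied {
			return true
		}
	}
	return false
}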

from dev-scripts.

karmab avatar karmab commented on July 30, 2024

proposing #713, which is a cronjob running within the cluster.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

https://bugzilla.redhat.com/show_bug.cgi?id=1737611#c2

This proposes a change to the cluster-machine-approver so it can automatically approve server cert refreshes in a way that would not require us to solve the addresses-on-master-Machines issue.

from dev-scripts.

dhellmann avatar dhellmann commented on July 30, 2024

#730 is a temporary work-around for this

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

Automatic CSR approval no longer works after downgrading Ironic; more info here: #782

from dev-scripts.

karmab avatar karmab commented on July 30, 2024

Now that the fix for this bug has merged, can we consider this issue closed?

from dev-scripts.

hardys avatar hardys commented on July 30, 2024

openshift/cluster-machine-approver#38 landed and @karmab reported an install-scripts deployment that stayed up for >24h without fix_certs, so this may now be resolved?

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

The PR you refer to resolves our issues for masters, but not workers. You'd still have to manually approve the first CSRs for each worker without some hack in place. This is because we're lacking the hostname after downgrading Ironic, and that's one of the pieces of info needed in the automated CSR approval workflow.

from dev-scripts.

oglok avatar oglok commented on July 30, 2024

hi @russellb!

Do you mean that we're still hitting this bug? I can see that workers are not joining the cluster (at the Node level), but they are provisioned and functional at the Machine/BareMetalHost level. To make it work, I have to manually approve the pending CSRs. Thanks in advance!

Btw, I'm talking about real baremetal deployment, not dev-scripts.

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

AFAIK, this issue should be resolved.

For workers, we have enough information on the BareMetalHost and Machine resources to support automatic CSR approval. If that's not working, we need to check the logs of the cluster-machine-approver. That would be a regression.

Master CSRs are automatically approved during installation. Refreshes of them should have been resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1737611

from dev-scripts.

russellb avatar russellb commented on July 30, 2024

we've removed all hacks from dev-scripts as of #915

Please open new bugs if anyone sees a similar issue at this point

from dev-scripts.
