Update 3/27/2024: The issue should no longer be reproducible


Comments (34)

snasovich commented on September 24, 2024 11

QA validations were completed and rancher/charts#3700 has just been merged, meaning rancher-provisioning-capi version 103.2.0+up0.0.1 is now released. All default-configured non-airgap v2.8.0-v2.8.2 Rancher deployments will automatically update to this fixed version of the chart, which should fix the issue.

Any users who applied the workaround of downgrading the chart to 103.0.0+up0.0.1 (e.g. by following the instructions in #44929 (comment)) should now be free to roll back the workaround.

Important Note: Rancher automatically refreshes chart data every 6 hours. To force an immediate refresh, please follow these steps:

* Select the local cluster
* Open "Apps" -> "Repositories"
* Locate and check Rancher in the list of displayed repos
* Select "Refresh" and wait for the repo status to update to "Active"
* rancher-provisioning-capi will upgrade to version 103.2.0+up0.0.1 shortly
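
To confirm that the upgrade in the last step has happened, one option is to list the Helm release from a shell with access to the local cluster. The following is a minimal sketch, not an official procedure. It assumes the release lives in the cattle-provisioning-capi-system namespace (the namespace used in a rollback command elsewhere in this thread) and that helm is available there. The helper name show_release_cmd is hypothetical, and it only prints the command so that it can be reviewed before running.

```shell
#!/bin/sh
# Assumption: the chart is installed as a Helm release in this namespace.
NAMESPACE="cattle-provisioning-capi-system"

# Hypothetical helper: print (do not run) a `helm list` command for one release.
show_release_cmd() {
  release="$1"
  # --filter takes a regular expression matched against release names.
  printf 'helm -n %s list --filter %s\n' "$NAMESPACE" "$release"
}

show_release_cmd rancher-provisioning-capi
```

Once reviewed, the printed command can be run directly. After the fix has landed, the CHART column of the output should show version 103.2.0+up0.0.1.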

Keeping the issue open for some time to ensure the fix works for all affected users.

from rancher.

m4rCsi commented on September 24, 2024 9

We had the same issue. Suddenly, out of nowhere, across all our rancher clusters and "downstream" clusters, we saw the same issue.

After a long investigation, we came to the same conclusion (i.e., the capi upgrade from 1.4.4 to 1.5.5 is the cause).

We weren't (and aren't) sure what is the easiest way to pin it to 1.4.4. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:

* Going into the Apps -> Repositories section
* Changing the repository named "Rancher" from branch "release-v2.8" to commit a318ef65fddf66b44c468d4a2636930ef39a88fd
* Going to Installed Apps
* Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)

Maybe that will help someone else work around this until a proper fix has been found.

If someone knows how to pin it to 1.4.4 in a better way, please let us know :)

hansbogert commented on September 24, 2024 8

Setting the capi-controller-manager to version 1.4.4 (was 1.5.5) lets me deploy new clusters with correct health status. Deploying new clusters had not been possible for roughly 10 hours.

nickvth commented on September 24, 2024 2

same here

josh383451 commented on September 24, 2024 2

> QA validations were completed and rancher/charts#3700 has just been merged, meaning rancher-provisioning-capi version 103.2.0+up0.0.1 is now released. All default-configured non-airgap v2.8.0-v2.8.2 Rancher deployments will automatically update to this fixed version of the chart, which should fix the issue.
>
> Any users who applied the workaround of downgrading the chart to 103.0.0+up0.0.1 (e.g. by following the instructions in #44929 (comment)) should now be free to roll back the workaround.
>
> Keeping the issue open for some time to ensure the fix works for all affected users.

Can confirm this is working with provisioning AWS EC2 instances using 103.2.0+up0.0.1

daleckystepan commented on September 24, 2024 1

@richardcase no labels at all on secret app-kubeconfig

richardcase commented on September 24, 2024 1

Thanks @nickvth & @daleckystepan. The lack of labels appears to be the issue. Just to let you know, we are looking at how to resolve this.

josh383451 commented on September 24, 2024 1

> We had the same issue. Suddenly, out of nowhere, across all our rancher clusters and "downstream" clusters, we saw the same issue.
>
> After a long investigation, we came to the same conclusion (i.e., the capi upgrade from 1.4.4 to 1.5.5 is the cause).
>
> We weren't (and aren't) sure what is the easiest way to pin it to 1.4.4. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:
>
> * Going into the Apps -> Repositories section
> * Changing the repository named "Rancher" from branch "release-v2.8" to commit a318ef65fddf66b44c468d4a2636930ef39a88fd
> * Going to Installed Apps
> * Downgrading `rancher-provisioning-capi` (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)
>
> Maybe that will help someone else work around this until a proper fix has been found.
>
> If someone knows how to pin it to 1.4.4 in a better way, please let us know :)

Can confirm this is working for new cluster provisioning with AWS EC2

zackbradys commented on September 24, 2024 1

I can confirm that the above fix worked for existing clusters and provisioning new clusters.

Additionally, an alternative fix would be to redeploy rancher with useBundledSystemChart: true, which will redeploy the capi-controller-manager and any other related resources. I haven’t tried manually labeling the cluster, but others stated earlier that it worked as well.
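
For reference, setting that option on a Helm-managed Rancher install might look like the sketch below. The release name rancher, the chart reference rancher-stable/rancher, and the namespace cattle-system are assumptions based on a conventional install and may differ in your environment; the helper name bundled_charts_cmd is hypothetical. It only prints the command. Note that other commenters in this thread reported that changing this setting on an already-installed Rancher had no effect for them.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a `helm upgrade` that enables the
# bundled system charts. Release name, chart reference and namespace are
# supplied by the caller because they vary between installs.
bundled_charts_cmd() {
  release="$1"
  chart="$2"
  namespace="$3"
  printf 'helm upgrade %s %s --namespace %s --reuse-values --set useBundledSystemChart=true\n' \
    "$release" "$chart" "$namespace"
}

# Example values only; these are assumptions, not taken from the thread.
bundled_charts_cmd rancher rancher-stable/rancher cattle-system
```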

nickvth commented on September 24, 2024 1

Workaround: deploy Rancher with useBundledSystemChart=true. This may be the recommended approach in general if you don't want every merge/push to the release-v2.8 git branch to update your cluster.

Configure Rancher server to use the packaged copy of Helm system charts. The system charts repository contains all the catalog items required for features such as monitoring, logging, alerting and global DNS. These Helm charts are located in GitHub, but since you are in an air gapped environment, using the charts that are bundled within Rancher is much easier than setting up a Git mirror.

After that:

* Going to Installed Apps
* Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)
* No new version available

snasovich commented on September 24, 2024 1

@kingnarmer, please use the workaround from #44929 (comment), as it pins the charts repo to a commit from before the problematic chart update was released.

As an overall update, we're working on releasing a new version of the chart that essentially rolls back the CAPI version upgrade, which should address the issue. This is currently undergoing QA, and rancher/charts#3700 is the PR to release the fixed chart.

Denys-Janrain-L commented on September 24, 2024 1

My list of installed apps is always empty; I don't know why.
So only the first two steps of the workaround worked for me. After that I had to open the Rancher local console (through the browser) and run:
helm -n cattle-provisioning-capi-system rollback rancher-provisioning-capi 1
which rolled it back to 103.0.0+up0.0.1
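
The revision number 1 in the rollback above is specific to that installation. As a hedged sketch, helm history lists every revision of a release together with its chart version, which makes it possible to pick the revision showing 103.0.0+up0.0.1 before rolling back. The helper name rollback_cmds is hypothetical, and it only prints the two commands for review.

```shell
#!/bin/sh
NAMESPACE="cattle-provisioning-capi-system"   # namespace from the command above
RELEASE="rancher-provisioning-capi"

# Hypothetical helper: print the history command, then the rollback command
# for the revision chosen by the caller.
rollback_cmds() {
  revision="$1"
  printf 'helm -n %s history %s\n' "$NAMESPACE" "$RELEASE"
  printf 'helm -n %s rollback %s %s\n' "$NAMESPACE" "$RELEASE" "$revision"
}

rollback_cmds 1
```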

Oats87 commented on September 24, 2024 1

> Unfortunately, this did not work. I also had added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi was updated to the latest version as well.

Your screenshot shows that your cluster is waiting for a worker node to be registered (and on top of that, your cluster does not have a worker listed in your machine list).

bagutzu commented on September 24, 2024

nickvth commented on September 24, 2024

Looks like it's broken after the upgrade to https://github.com/rancher/charts/tree/release-v2.8/charts/rancher-provisioning-capi/103.1.0%2Bup0.1.0

fplantinga-guida commented on September 24, 2024

Same here

daleckystepan commented on September 24, 2024

I also tried to downgrade but it is quickly updated again.

richardcase commented on September 24, 2024

@daleckystepan (or anyone else seeing this issue) - could you look at the secret that contains the kubeconfig for one of the clusters and see what labels it has? I'd be interested if there is one called cluster.x-k8s.io/cluster-name.

nickvth commented on September 24, 2024

No label cluster.x-k8s.io/cluster-name @richardcase

kg secret donald-prod-1-kubeconfig  -o yaml
apiVersion: v1
data:
  token: *****
  value: *****
kind: Secret
metadata:
  creationTimestamp: "2024-02-02T09:42:29Z"
  name: donald-prod-1-kubeconfig
  namespace: fleet-default
  ownerReferences:
  - apiVersion: provisioning.cattle.io/v1
    kind: Cluster
    name: donald-prod-1
    uid: a9e0e6f8-ba10-4a5d-9ed5-ded4077631dd
  resourceVersion: "142716571"
  uid: 5aacaca0-1ea7-4849-a055-cced99c75d4c
type: Opaque
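
The check above can be narrowed to just the labels. This is a minimal sketch, assuming kubectl access to the local cluster and that kg in the output above is a shell alias for kubectl get. The helper name show_labels_cmd is hypothetical, and it only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: print a command that shows the labels on a kubeconfig
# secret. The namespace defaults to fleet-default, as in the output above.
show_labels_cmd() {
  secret="$1"
  namespace="${2:-fleet-default}"
  printf 'kubectl -n %s get secret %s --show-labels\n' "$namespace" "$secret"
}

show_labels_cmd donald-prod-1-kubeconfig
```

On an affected cluster the LABELS column would be expected to show no cluster.x-k8s.io/cluster-name entry.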

daleckystepan commented on September 24, 2024

I added the label manually and it seems to be working.

Oats87 commented on September 24, 2024

> We had the same issue. Suddenly, out of nowhere, across all our rancher clusters and "downstream" clusters, we saw the same issue.
>
> After a long investigation, we came to the same conclusion (i.e., the capi upgrade from 1.4.4 to 1.5.5 is the cause).
>
> We weren't (and aren't) sure what is the easiest way to pin it to 1.4.4. Every time we downgraded, it auto-upgraded itself back to the broken 1.5.5 version. What we ended up doing in the interest of speed was:
>
> * Going into the Apps -> Repositories section
> * Changing the repository named "Rancher" from branch "release-v2.8" to commit a318ef65fddf66b44c468d4a2636930ef39a88fd
> * Going to Installed Apps
> * Downgrading rancher-provisioning-capi (from 103.1.0+up0.1.0 to 103.0.0+up0.0.1)
>
> Maybe that will help someone else work around this until a proper fix has been found.
>
> If someone knows how to pin it to 1.4.4 in a better way, please let us know :)

This is the most effective way I can think of right now to pin the chart version so that it does not get inadvertently upgraded.

> I added label manually and it seems to be working.

Adding the label manually to the kubeconfig secrets is also a short-term solution, but if Rancher deems the kubeconfig invalid (e.g. the token is no longer valid or the server-url has changed), it will recreate that kubeconfig secret without the label.
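
Applying the label by hand might look like the sketch below. It assumes the secret is named after the cluster with a -kubeconfig suffix (as in the donald-prod-1 example in this thread), so the cluster name is derived by stripping that suffix; verify this against the secret's ownerReferences before relying on it. The helper name label_cmd is hypothetical, and it only prints the command. As noted above, Rancher may recreate the secret without the label, so this is a stopgap.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the kubectl command that adds the
# cluster.x-k8s.io/cluster-name label to a kubeconfig secret.
# Assumption: the secret name is "<cluster-name>-kubeconfig".
label_cmd() {
  secret="$1"
  namespace="${2:-fleet-default}"
  cluster="${secret%-kubeconfig}"   # strip the suffix to get the cluster name
  printf 'kubectl -n %s label secret %s cluster.x-k8s.io/cluster-name=%s\n' \
    "$namespace" "$secret" "$cluster"
}

label_cmd donald-prod-1-kubeconfig
```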

atsai1220 commented on September 24, 2024

This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other

The generated kubeconfig by the Control Plane providers must be labelled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME}. This is required for the CAPI managers caches to store and retrieve them for the required operations.

This was the PR that propagated the change to Rancher 2.8.x environments. rancher/charts#3688

nickvth commented on September 24, 2024

> This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other
>
> The generated kubeconfig by the Control Plane providers must be labelled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME}. This is required for the CAPI managers caches to store and retrieve them for the required operations.
>
> This was the PR that propagated the change to Rancher 2.8.x environments. rancher/charts#3688

Thanks for sharing, but 2.8.3 is not released yet. So why was this change propagated?

atsai1220 commented on September 24, 2024

> > This was described in the migration notes from 1.4 to 1.5: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/migrations/v1.4-to-v1.5.md#other
> >
> > The generated kubeconfig by the Control Plane providers must be labelled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME}. This is required for the CAPI managers caches to store and retrieve them for the required operations.
> >
> > This was the PR that propagated the change to Rancher 2.8.x environments. rancher/charts#3688
>
> Thanks for sharing, but 2.8.3 is not released yet. So why was this change propagated?

I have the same question. We will look to useBundledSystemChart=true in the future to prevent surprises.

pvlkov commented on September 24, 2024

Thank you for providing a quick fix. We will use the label workaround for now and upgrade to 2.8.3 as soon as it's out.

kingnarmer commented on September 24, 2024

Unfortunately, neither workaround worked for me.

I downgraded rancher-provisioning-capi from 103.1.0+up0.1.0 to 103.0.0+up0.0.1 via the Rancher GUI (Apps -> Installed Apps). It was fine for a few minutes, then the newer version came back.
Setting useBundledSystemChart=true on the existing Rancher install had no effect.

I'd appreciate help on how to mitigate this.

sulaimantok commented on September 24, 2024

Same here; this workaround also worked for me.

daleckystepan commented on September 24, 2024

I tried to change CATTLE_SYSTEM_CATALOG to bundled, but it probably has no effect if Rancher is already installed. Is there any other way to prevent those online updates and make them more transparent and manageable for us?

avthart commented on September 24, 2024

> I tried to change CATTLE_SYSTEM_CATALOG to bundled, but it probably has no effect if Rancher is already installed. Is there any other way to prevent those online updates and make them more transparent and manageable for us?

Either wait for the fix, or manually roll back rancher-provisioning-capi using Rancher Apps.

qhris commented on September 24, 2024

The capi upgrade also makes it impossible to provision new clusters, for the same reason.

We verified that the workaround of setting the label on the kubeconfig secret works.
We also tested Rancher 2.8.3-rc6, which works as well.

romarioschneider commented on September 24, 2024

same issue

dylanthepodman commented on September 24, 2024

Unfortunately, this did not work. I also had added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi was updated to the latest version as well.

snasovich commented on September 24, 2024

@dylanthepodman, thank you for reporting this. Most likely you're running into a different issue. Was provisioning working OK on the same setup earlier?

dylanthepodman commented on September 24, 2024

> > Unfortunately, this did not work. I also had added the cluster name label to the kubeconfig secret under fleet-default. At the moment, I still cannot provision a custom RKE2 cluster. Is it possible I'm doing something wrong here? The rancher-provisioning-capi was updated to the latest version as well.
>
> Your screenshot shows that your cluster is waiting for a worker node to be registered (and on top of that, your cluster does not have a worker listed in your machine list).

I tried, and it is working now. Thank you for mentioning this to me.

Interestingly enough, I tried this same setup in Rancher v2.6.12 and it did not have this problem.

Resolved - [BUG] All pools unavailable on RKE2/K3s provisioned clusters on Rancher 2.8.0-2.8.2 (CLOSED)
