
Comments (21)

bpedersen2 commented on July 29, 2024

I could manually fix it (see the sketch below):

  1. go to the Harvester embedded Rancher and get the kubeconfig
  2. update the kubeconfig in the Harvester credential in the cattle-global-data namespace on the local cluster (running Rancher); the secret is probably named hv-cred
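
A minimal kubectl sketch of those two steps, run against the local (Rancher) management cluster. The secret name (cc-xxxxx vs. hv-cred) and the data key harvestercredentialConfig-kubeconfigContent are assumptions about how the Harvester node driver stores the kubeconfig, so verify them against your own secret first:

# Save the fresh kubeconfig from the embedded Rancher as harvester.kubeconfig,
# then locate the Harvester cloud credential secret:
kubectl -n cattle-global-data get secrets

# Inspect the secret to confirm which key holds the kubeconfig, then replace
# it with the fresh one (key name is an assumption):
kubectl -n cattle-global-data patch secret cc-xxxxx --type merge \
  -p "{\"data\":{\"harvestercredentialConfig-kubeconfigContent\":\"$(base64 -w0 harvester.kubeconfig)\"}}"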


khushalchandak17 commented on July 29, 2024

I was able to replicate the issue by setting the token expiry to 10 minutes, as shared by @m-ildefons. The Rancher deployed is v2.8.3 and the Harvester version is v1.2.1. I noticed that when the token associated with the downstream cluster expires, the connection between Rancher and Harvester is disrupted. This token expiry value is set to 30 days by default, as documented here. The change was introduced in Rancher 2.8, as mentioned in this PR.

I suspect this issue was not observed in earlier versions of Rancher, as the token TTL was set to infinite, as documented here. The change appears to have been made for security reasons, for performance, and to limit the number of unexpired tokens.
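
For reference, the current default can be inspected on the local cluster; the setting name kubeconfig-default-token-ttl-minutes is an assumption about how this global setting is exposed and may differ between Rancher versions:

kubectl get settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
  -o jsonpath='{.value}{"\n"}{.default}{"\n"}'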


m-ildefons commented on July 29, 2024

I've managed to reproduce the issue in a test environment:

  1. Install Harvester (tested with v1.2.1, likely irrelevant)
  2. Install Rancher (tested with v2.8.2, testing with other versions TBD)
  3. Import Harvester cluster
  4. Upload a suitable cloud image and create a VM network, so VMs can be created
  5. To shorten the time to reproduce, set the default token TTL in Rancher to e.g. 10 minutes. This is a global config setting in Rancher.
  6. Create a Cloud Credential for the Harvester cluster
  7. Create a K8s cluster with the Harvester cluster as infrastructure provider, using the previously created Cloud Credential for authentication
  8. Wait until the default token TTL has elapsed. The token associated with the Cloud Credential will expire and eventually be removed, but the Cloud Credential itself will remain. This does not cause an error just yet, though.
  9. Scale the K8s cluster from step 7 up or down. This operation will fail with behavior and errors similar to the reported problem.

I'm not sure if and how OIDC interacts here, since it wasn't part of my test environment. Since the original bug report does not mention an external identity provider, and leaving it out keeps my test environment simpler, I'll focus on locally provided users.

As a workaround, I suggest creating Cloud Credentials associated with a token that has no expiration date.
To do that, set both the maximum token TTL and the default token TTL (both global settings in Rancher) to 0. Then create the Cloud Credentials to be used for the Harvester cluster, and finally create the K8s cluster using them.
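
A sketch of that workaround with kubectl against the local cluster, assuming the two global settings are exposed as settings.management.cattle.io objects named auth-token-max-ttl-minutes and kubeconfig-default-token-ttl-minutes (the exact names may differ per Rancher version; changing them via Global Settings in the UI is the safer route):

kubectl patch settings.management.cattle.io auth-token-max-ttl-minutes \
  --type merge -p '{"value":"0"}'
kubectl patch settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
  --type merge -p '{"value":"0"}'

After that, create the new Cloud Credentials and then the cluster as usual.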

To recover an existing cluster, adjust the maximum token TTL and default token TTL to 0, create a new Cloud Credential for the Harvester cluster, and edit the cluster's YAML so that .spec.cloudCredentialSecretName points to the new Cloud Credential.
The K8s cluster will eventually recover and any outstanding scaling operation will complete. The old Cloud Credential can be disposed of afterwards.
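
For the recovery step, a hedged sketch of re-pointing an existing cluster at the new Cloud Credential; the cluster name (my-cluster) and secret name (cc-new12345) are placeholders, and the namespace:name format of cloudCredentialSecretName is an assumption, so copy the exact format from the old value first:

# Check the current value:
kubectl -n fleet-default get clusters.provisioning.cattle.io my-cluster \
  -o jsonpath='{.spec.cloudCredentialSecretName}{"\n"}'

# Point it at the new credential:
kubectl -n fleet-default patch clusters.provisioning.cattle.io my-cluster \
  --type merge -p '{"spec":{"cloudCredentialSecretName":"cattle-global-data:cc-new12345"}}'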


bpedersen2 commented on July 29, 2024

Maybe related to #44929?


bpedersen2 commented on July 29, 2024

This seems to occur even after the fix for #44929, both when scaling and when creating a new cluster.


bpedersen2 commented on July 29, 2024

And I am on rancher v2.8.2


bpedersen2 commented on July 29, 2024

Looking at the created job (for a worker-node scale-up):

"args": [ 8 items
"--driver-download-url=https://<host>/assets/docker-machine-driver-harvester",
"--driver-hash=a9c2847eff3234df6262973cf611a91c3926f3e558118fcd3f4197172eda3434",
"--secret-namespace=fleet-default",
"--secret-name=staging-pool-worker-bbfc2798-d5jsj-machine-state",
"rm",
"-y",
"--update-config",
"staging-pool-worker-bbfc2798-d5jsj"

The first thing the driver tries is to delete the non-existing pod, and it fails... I would expect a create instead. I just don't know where this command is generated.


sarahhenkens commented on July 29, 2024

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?


sarahhenkens commented on July 29, 2024

Following the manual fix steps by getting the kubeconfig and manually updating the secret in Rancher worked for me!


bpedersen2 commented on July 29, 2024

@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?

No, it is running standalone.


bpedersen2 commented on July 29, 2024

What I observe is that the token in Harvester changes.

Rancher is configured to use OIDC, and in the Rancher logs I get:

Error refreshing token principals, skipping: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] [keycloak oidc] GetPrincipal: error creating new http client: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] error syncing 'user-XXX': handler mgmt-auth-userattributes-controller: oauth2: "invalid_grant" "Token is not active", requeuing

With a local user, it seems to work


bpedersen2 commented on July 29, 2024

I re-registered the Harvester cluster using a non-OIDC admin account, and now the connection seems to be stable again. It looks like a problem with token expiration to me.


dawid10353 commented on July 29, 2024

I have the same problem:

Failed creating server [fleet-default/rke2-rc-control-plane-2aae5bdf-2m48z] of kind (HarvesterMachine) for machine rke2-rc-control-plane-5b74797746x4dpcs-ncdxf in infrastructure provider: CreateError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)" The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

Rancher v2.8.2
Dashboard v2.8.0
Helm v2.16.8-rancher2
Machine v0.15.0-rancher106
Harvester: v1.2.1


dawid10353 commented on July 29, 2024

I have been stuck in a loop for many hours:
a new VM is created, it errors, the VM is deleted, then a new VM is created, it errors again, and is deleted again...


dawid10353 commented on July 29, 2024

I could manually fix it:

  1. go to the Harvester embedded Rancher and get the kubeconfig
  2. update the kubeconfig in the Harvester credential in the cattle-global-data namespace on the local cluster (running Rancher); the secret is probably named hv-cred

OK, that worked for me. I have Rancher with users provided by Active Directory.


dawid10353 commented on July 29, 2024

Now I have this error:

	Failed deleting server [fleet-default/rke2-rc-control-plane-3fba9236-dxptf] of kind (HarvesterMachine) for machine rke2-rc-control-plane-77f9455c9dx9xgsk-4kcwf in infrastructure provider: DeleteError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped About to remove rke2-rc-control-plane-3fba9236-dxptf WARNING: This action will delete both local reference and remote instance. Error removing host "rke2-rc-control-plane-3fba9236-dxptf": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io rke2-rc-control-plane-3fba9236-dxptf)


m-ildefons commented on July 29, 2024

Hi,
thanks for this bug report. May I ask which Harvester versions you were using, @bpedersen2 and @sarahhenkens, and when you last updated them?


bpedersen2 commented on July 29, 2024

I am on Harvester 1.2.1 and Rancher 2.8.3 (and waiting for 1.2.2 to be able to upgrade to 1.3.x eventually).


sarahhenkens commented on July 29, 2024

Ran into the same issue today as @dawid10353. I'm running Harvester 1.2.1 and Rancher 2.8.2.


m-ildefons commented on July 29, 2024

Could you please check the expiry of your API access tokens? There needs to be a kubeconfig token that isn't expired and is associated with the Harvester cluster.

[Screenshots from 2024-05-13 showing the API tokens list and their expiry dates in the Rancher UI]
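
Besides the UI, the tokens can also be listed on the local cluster; the field names below are assumptions based on the management.cattle.io/v3 Token type and may vary between Rancher versions:

kubectl get tokens.management.cattle.io \
  -o custom-columns=NAME:.metadata.name,USER:.userId,EXPIRED:.expired,EXPIRES:.expiresAt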

It would also be helpful to know what you were trying to do when you observed the problems and at which step in the process the problems started to occur.


innobead commented on July 29, 2024

The original issue #41919 introduced the default TTL value (30 minutes for kubeconfig) to securely manage tokens for users (or headless users used programmatically).

However, Harvester cloud credentials, which are used for authenticating and authorizing the Rancher cluster to manage downstream Harvester clusters, should not use the unified TTL applied in this case, as the token is not a user token but part of an internal mechanism.

cc @ibrokethecloud @bk201 @Vicente-Cheng @m-ildefons

