thinkparq / beegfs-csi-driver

The BeeGFS Container Storage Interface (CSI) driver provides high-performing and scalable storage for workloads running in Kubernetes.

License: Apache License 2.0
Hi!
So after upgrading half of my worker nodes to a newer Kubernetes version (v1.24.9), I noticed that some of the pods got stuck in a failed mount:
Warning FailedMount 15s (x6 over 31s) kubelet
MountVolume.MountDevice failed for volume "pvc-5bc91a74" : rpc error: code =
Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/874cf8f302b0da66de76a4edb4ca3f7e0c5f7a6f25ad368e8ce8fda969225eb5/globalmount: no such file or directory
To get them up and running again, I forced them to use nodes with the old Kubernetes version (v1.23.15), and that works.
Versions:
Regarding the CSI driver deployment, I am using the k8s one from the repo.
Config:
config:
  beegfsClientConf:
    connClientPortUDP: "8028"
    connDisableAuthentication: "true"
    logType: "helperd"
And the only modification I had to make was in csi-beegfs-node.yaml, where I set the plugins-mount-dir to /var/lib/kubelet/plugins/kubernetes.io/csi/pv instead of /var/lib/kubelet/plugins/kubernetes.io/csi.
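For reference, that change amounts to pointing the node service's host path at the pv subdirectory. A sketch of the relevant volume entry (the volume name and surrounding layout here are illustrative, not copied from the shipped manifest):

```yaml
# Hypothetical excerpt of csi-beegfs-node.yaml; names around the
# hostPath are illustrative and may differ from the actual manifest.
volumes:
  - name: plugins-mount-dir
    hostPath:
      path: /var/lib/kubelet/plugins/kubernetes.io/csi/pv  # was .../csi
      type: Directory
```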
The directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.23.15 worker node:
tree -L 4
.
└── pv
├── pvc-01ba9661
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
├── pvc-03357f3e
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
...
The directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.24.9 worker node:
tree -L 4
.
├── beegfs.csi.netapp.com
└── pv
├── pvc-090f23e1
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
├── pvc-14ba4b44
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
...
So for some reason the node with the newer kubernetes version has an empty beegfs.csi.netapp.com directory.
Why are the pods on the "new" nodes trying to mount this other location? Is the v1.3.0 version of the driver incompatible with kubernetes 1.24.9? Should I upgrade the driver to v1.4.0?
Please let me know if you need any more info.
Thanks in advance!
Just wanted to report that it builds okay:
$ make
./release-tools/verify-go-version.sh "go"
======================================================
WARNING
This projects is tested with Go v1.15.
Your current Go version is v1.17.
This may or may not be close enough.
In particular test-gofmt and test-vendor
are known to be sensitive to the version of
Go.
======================================================
mkdir -p bin
echo '' | tr ';' '\n' | while read -r os arch suffix; do \
if ! (set -x; CGO_ENABLED=0 GOOS="$os" GOARCH="$arch" go build -a -ldflags ' -X main.version=v1.2.1-0-g316c1cd -extldflags "-static"' -o "./bin/beegfs-csi-driver$suffix" ./cmd/beegfs-csi-driver); then \
echo "Building beegfs-csi-driver for GOOS=$os GOARCH=$arch failed, see error(s) above."; \
exit 1; \
fi; \
done
+ CGO_ENABLED=0 GOOS= GOARCH= go build -a -ldflags -X main.version=v1.2.1-0-g316c1cd -extldflags "-static" -o ./bin/beegfs-csi-driver ./cmd/beegfs-csi-driver
$ ./bin/beegfs-csi-driver -version
beegfs-csi-driver v1.2.1-0-g316c1cd
"--endpoint=${CSI_ENDPOINT}"

so maybe that will simplify things somewhat). This is what I get no matter how I try to change the paths. I'll try to take another look at this another day.

Apr 20 14:57:10 b5 nomad[6938]: 2022-04-20T14:57:10.809Z [ERROR] client.alloc_runner.task_runner.task_hook: killing task because plugin failed: alloc_id=e2f36449-9542-e65d-0769-6f8e15aa32c3 task=plugin error="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /opt/nomad/data/client/csi/plugins/e2f36449-9542-e65d-0769-6f8e15aa32c3/csi.sock: no such file or directory"

Also, node-id (seen here) seems to be nodeid in more recent config examples by HashiCorp staff.
It's not entirely clear from the instructions whether the driver has to be deployed on a host (worker) with a functioning BeeGFS client or not. For example, the connAuth section in plugin.nomad: if the BeeGFS cluster isn't using connAuth, can that section be left out of the plugin job? Or will using the plugin break the BeeGFS client if it's left out (or if it's not left out)? I know I'll find out once I get there, but it should be clear up front, because many users may be concerned about whether the plugin can interfere with their BeeGFS client.
beegfs-dkms-client uninstalls beegfs-client (v7.3.0), which is dangerous. If that package isn't necessary, let's not suggest installing it.
Further updating various Operator dependencies breaks things in a number of places. While there currently aren't any pressing reasons to update those dependencies further, in the future we should first update the SDK version, as it's possible that will help smooth over the rest of the dependency updates.
See the "End-to-End Test" stage and runIntegrationSuite() from the original Jenkinsfile. More documentation is also in the test/e2e/README.md. This includes both the driver deployed with and without the operator as there doesn't seem to be a separate test suite/command.
Unless we plan to continue using statically deployed clusters for these tests, much of what was there needs to be reworked. This should actually simplify things greatly, as all of the pieces to clean up old tests no longer apply. We can just run the Ginkgo e2e tests once the driver is deployed in the existing e2e job that is already set up to run against a matrix of BeeGFS and Kubernetes versions.
To use the current e2e test setup with Minikube with the Ginkgo e2e tests, we will need to deploy a multi-node Minikube cluster. This shouldn't be too difficult, since we no longer need to use the "none" driver to use BeeGFS with Minikube so long as we deploy BeeGFS into K8s. Alternatively, we would need to deploy K8s clusters outside GitHub Actions (ideally managed through something like Rancher) as self-hosted runners.
Currently the e2e tests are built on Ginkgo, and some tests use deprecated Ginkgo features. Part of this issue is evaluating how we can update our current tests to use the latest version of Ginkgo. If migrating the current test suites will take significant effort, that should be done as a separate issue where we also reevaluate what else is out there for K8s/CSI e2e tests and whether it would be better to start from scratch with something else. Historically, many K8s e2e tests have used Ginkgo, so this seems unlikely, but it is worth exploring.
Currently the module path as defined by go.mod is "github.com/netapp/beegfs-csi-driver", even though the actual path is now "github.com/thinkparq/beegfs-csi-driver". While migrated GitHub repositories set up a redirect (so either go get github.com/netapp/beegfs-csi-driver or go get github.com/thinkparq/beegfs-csi-driver will work), it would be good to eventually update the path to reflect the new repository.
Because this is a CSI driver, most users will be downloading/deploying it using a prebuilt container image and would not be affected by changing the module path. It is much less likely (though not impossible) there are other projects or libraries that are importing these packages and would be broken. Ideally any change to the module path would result in a major version bump.
A change that would be much more impactful to users is changing the driver name from "beegfs.csi.netapp.com" to "beegfs.csi.thinkparq.com". As this would almost certainly require bumping the major version it makes sense to plan for these two changes to happen together.
As there is no pressing reason to make either of these changes, we'll put them on the back burner until there is a more compelling reason to ship a 2.0 version.
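For context, Go's semantic import versioning ties a major version bump to the module path, so a combined rename and 2.0 release would look roughly like the sketch below (the /v2 suffix requirement is standard Go modules behavior; the go directive version shown is illustrative):

```
// go.mod after a hypothetical rename plus 2.0 release: a v2+ module
// must append the major version to its module path.
module github.com/thinkparq/beegfs-csi-driver/v2

go 1.21
```

This is also why bundling the two breaking changes is convenient: importers would have to update their import paths once either way.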
Hi,
when the csi-provisioner container inside the csi-beegfs-controller-0 pod starts, it logs a warning with the message:
W0617 12:47:35.586467 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
Full log:
I0617 12:47:31.644718 1 csi-provisioner.go:121] Version: v2.0.2
I0617 12:47:31.644795 1 csi-provisioner.go:135] Building kube configs for running in cluster...
I0617 12:47:31.651298 1 connection.go:153] Connecting to unix:///csi/csi.sock
I0617 12:47:35.580545 1 common.go:111] Probing CSI driver for readiness
I0617 12:47:35.580566 1 connection.go:182] GRPC call: /csi.v1.Identity/Probe
I0617 12:47:35.580570 1 connection.go:183] GRPC request: {}
I0617 12:47:35.585663 1 connection.go:185] GRPC response: {}
I0617 12:47:35.585730 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.585742 1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginInfo
I0617 12:47:35.585745 1 connection.go:183] GRPC request: {}
I0617 12:47:35.586397 1 connection.go:185] GRPC response: {"name":"beegfs.csi.netapp.com","vendor_version":"v1.1.0-0-gc65b537"}
I0617 12:47:35.586447 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.586458 1 csi-provisioner.go:182] Detected CSI driver beegfs.csi.netapp.com
W0617 12:47:35.586467 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0617 12:47:35.586477 1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0617 12:47:35.586481 1 connection.go:183] GRPC request: {}
I0617 12:47:35.587155 1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0617 12:47:35.587274 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.587284 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0617 12:47:35.587288 1 connection.go:183] GRPC request: {}
I0617 12:47:35.587831 1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
I0617 12:47:35.588083 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.588233 1 csi-provisioner.go:210] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0617 12:47:35.588790 1 controller.go:735] Using saving PVs to API server in background
I0617 12:47:35.689094 1 volume_store.go:97] Starting save volume queue
Does this driver support Prometheus metrics? If so, how can they be enabled? And is there any way for a Kubernetes admin to track how much storage is in use?
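For the sidecar warning specifically, the flag named in the message belongs to the external-provisioner sidecar rather than the driver itself. A sketch of setting it in the controller spec (the exact args list in the shipped csi-beegfs-controller.yaml may differ, and port 8080 is an arbitrary choice):

```yaml
# Sketch: give the csi-provisioner sidecar a metrics listen address so it
# exposes a Prometheus /metrics endpoint. The flag name is taken from the
# warning message above; the other args shown are illustrative.
containers:
  - name: csi-provisioner
    args:
      - --csi-address=/csi/csi.sock
      - --metrics-address=:8080
```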
I used to have the controller running on a master node via an affinity rule (version 1.2.0), but now that I have upgraded to v1.2.1 I had to remove the affinity rule, otherwise the beegfs container would fail like this:
16:59 # kubectl logs -f -n kube-system csi-beegfs-controller-0 beegfs
E0725 14:57:42.243667 1 main.go:70] "msg"="Fatal: Failed to initialize driver" "error"="failed to read client configuration template file: open /host/etc/beegfs/beegfs-client.conf: no such file or directory" "fullError"="open /host/etc/beegfs/beegfs-client.conf: no such file or directory\nfailed to read client configuration template file" "goroutine"="main"
My master nodes never had the BeeGFS requirements installed (everything required, such as the client), and it always worked. So my question is whether any change has been introduced that requires the controller to be scheduled on a BeeGFS-capable client node, and if not, what the issue could be.
Cheers.
This test confirms issues can be created by the public.
Hi,
I was wondering about plans to support newer versions of K8s. Are there plans to support BeeGFS 7.1.5 on K8s 1.23.x and other versions, since 1.24 is planned for April?
Hi Team,
I deployed the beegfs-csi-driver using kubectl apply -k deploy/k8s/overlays/default,
then I updated the configmap csi-beegfs-config to add:
...
data:
  csi-beegfs-config.yaml: |
    config:
      connInterfaces:
        - ib0
      connRDMAInterfaces:
        - ib0
      beegfsClientConf:
        sysMgmtdHost: 172.19.204.1
        connClientPortUDP: 8004
        connHelperdPortTCP: 8006
        connMgmtdPortTCP: 8008
        connMgmtdPortUDP: 8008
        connPortShift: 0
        connCommRetrySecs: 600
        connFallbackExpirationSecs: 900
        connMaxInternodeNum: 12
        connMaxConcurrentAttempts: 0
        connUseRDMA: true
        connRDMABufNum: 70
        connRDMABufSize: 8192
        connRDMATypeOfService: 0
        logClientID: false
        logLevel: 3
        logType: helperd
        quotaEnabled: false
        sysCreateHardlinksAsSymlinks: false
        sysMountSanityCheckMS: 11000
        sysSessionCheckOnClose: false
        sysSyncOnClose: false
        sysTargetOfflineTimeoutSecs: 900
        sysUpdateTargetStatesSecs: 30
        sysXAttrsEnabled: false
        tuneFileCacheType: buffered
        tuneRemoteFSync: true
        tuneUseGlobalAppendLocks: false
        tuneUseGlobalFileLocks: false
...
but the pod csi-beegfs-controller-0 is still in error:
"msg"="Fatal: Failed to initialize driver" "error"="failed to get valid default client configuration template file" "fullError"="failed to get valid default client configuration template file\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.newBeegfsDriver\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs.go:225\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.NewBeegfsDriver\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs.go:160\nmain.handle\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/cmd/beegfs-csi-driver/main.go:67\nmain.main\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/cmd/beegfs-csi-driver/main.go:62\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581" "goroutine"="main"
What am I missing?
👋 Hi, are there any plans to add support for snapshots?
Thanks!
I used the "one liner" to deploy, and it deployed to kube-system.
Perhaps we should document that this approach doesn't deploy to a dedicated namespace and suggest creating one?
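One way to document this might be a thin custom kustomize overlay that sets a namespace (the namespace name and the relative resources path here are illustrative, not part of the repo):

```yaml
# Sketch of a user-provided kustomization.yaml layered on top of the
# shipped default overlay; kustomize rewrites all namespaced resources.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: beegfs-csi
resources:
  - ../default
```

The namespace object itself would still need to be created first (e.g. with kubectl create namespace beegfs-csi) or included in the overlay.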
A couple of months ago, ThinkParQ posted a survey in the BeeGFS newsletter and the BeeGFS user group asking for feedback on the current and future use of BeeGFS in containers and in the cloud (public, hybrid, private, or otherwise). This survey was created in collaboration with the BeeGFS CSI driver maintainers in the hopes of better understanding what features we should add to the 1.x version of the driver (and what a 2.x version of the driver might look like). Unfortunately, the survey didn't receive enough responses to be particularly useful. If you have not already taken the survey please consider doing so at the link below!
I have a pre-existing BeeGFS volume for which I manually created a PV and a PVC to make it available in a certain namespace. Now I would like to have it mounted in a different namespace, so I have created a PV and a new PVC, but it throws:
---- ------ ---- ---- -------
Warning ClaimMisbound 38s persistentvolume-controller Two claims are bound to the same volume, this one is bound incorrectly
Is this allowed? We also have commercial support if needed.
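For background, Kubernetes binds each PV to exactly one PVC, so exposing the same BeeGFS directory in a second namespace generally means creating a second PV object with a distinct name. A sketch of such a PV (all names, the capacity, and the volumeHandle format are illustrative assumptions, not taken from the driver docs):

```yaml
# Sketch of a second static PV for the same underlying BeeGFS directory.
# PV/PVC binding is 1:1, so each namespace's PVC needs its own PV.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: beegfs-vol-ns2  # must differ from the PV bound in the first namespace
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: beegfs.csi.netapp.com
    volumeHandle: beegfs://mgmtd.example.com/path/to/volume  # illustrative
```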
I'm not sure if this would be useful to BeeGFS CSI users, but a new flavor of RWO called ReadWriteOncePod could be added if there's a use case for it.
It seems like an anti-feature for BeeGFS CSI... maybe there's a use case for lowering the risk of application-induced data corruption in RWO workloads?
Our conn_auth is random binary bytes, and I didn't find a way to put it in the csi-beegfs-connauth.yaml file. Would it be possible to support base64 encoding for this field?
e.g.

```go
// ConnAuthConfig associates a ConnAuth with a SysMgmtdHost.
type ConnAuthConfig struct {
	SysMgmtdHost string `json:"sysMgmtdHost"`
	ConnAuth     string `json:"connAuth"`
	// Add new field for configuration
	Encoding string `json:"encoding"`
}
```
The Encoding field could be either raw or base64.
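A minimal sketch of how such a field might be handled, assuming the proposed Encoding field is adopted (decodedConnAuth is a hypothetical helper, not part of the driver):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// ConnAuthConfig mirrors the struct proposed above, with the
// hypothetical Encoding field added.
type ConnAuthConfig struct {
	SysMgmtdHost string `json:"sysMgmtdHost"`
	ConnAuth     string `json:"connAuth"`
	Encoding     string `json:"encoding"` // "raw" (default) or "base64"
}

// decodedConnAuth returns the secret bytes to hand to the BeeGFS client,
// decoding the YAML-safe base64 form when requested.
func decodedConnAuth(c ConnAuthConfig) ([]byte, error) {
	switch c.Encoding {
	case "", "raw":
		return []byte(c.ConnAuth), nil
	case "base64":
		return base64.StdEncoding.DecodeString(c.ConnAuth)
	default:
		return nil, fmt.Errorf("unknown connAuth encoding %q", c.Encoding)
	}
}

func main() {
	// "c2VjcmV0" is base64 for "secret".
	c := ConnAuthConfig{SysMgmtdHost: "mgmtd.example.com", ConnAuth: "c2VjcmV0", Encoding: "base64"}
	secret, err := decodedConnAuth(c)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(secret)) // prints "secret"
}
```

Treating an empty Encoding as raw would keep existing connAuth files working unchanged.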