thinkparq / beegfs-csi-driver

The BeeGFS Container Storage Interface (CSI) driver provides high-performing and scalable storage for workloads running in Kubernetes.

License: Apache License 2.0
Hi!
So after upgrading half of my worker nodes to a newer Kubernetes version (v1.24.9), I noticed that some of the pods got stuck in a failed mount:
Warning FailedMount 15s (x6 over 31s) kubelet
MountVolume.MountDevice failed for volume "pvc-5bc91a74" : rpc error: code =
Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/beegfs.csi.netapp.com/874cf8f302b0da66de76a4edb4ca3f7e0c5f7a6f25ad368e8ce8fda969225eb5/globalmount: no such file or directory
To get them up and running again, I forced them to use nodes with the old Kubernetes version (v1.23.15), and that works.
Versions:
Regarding the CSI driver deployment, I am using the k8s one from the repo.
Config:
config:
  beegfsClientConf:
    connClientPortUDP: "8028"
    connDisableAuthentication: "true"
    logType: "helperd"
And the only modification I had to make was in csi-beegfs-node.yaml, where I set the plugins-mount-dir to /var/lib/kubelet/plugins/kubernetes.io/csi/pv instead of /var/lib/kubelet/plugins/kubernetes.io/csi.
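For reference, that change amounts to pointing the node service's host path at the pv subdirectory. A sketch of the relevant volume entry (the volume name and surrounding layout here are illustrative, not copied from the shipped manifest):

```yaml
# Hypothetical excerpt of csi-beegfs-node.yaml; names around the
# hostPath are illustrative and may differ from the actual manifest.
volumes:
  - name: plugins-mount-dir
    hostPath:
      path: /var/lib/kubelet/plugins/kubernetes.io/csi/pv  # was .../csi
      type: Directory
```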
The directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.23.15 worker node:
tree -L 4
.
└── pv
├── pvc-01ba9661
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
├── pvc-03357f3e
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
...
The directory structure of /var/lib/kubelet/plugins/kubernetes.io/csi on a Kubernetes 1.24.9 worker node:
tree -L 4
.
├── beegfs.csi.netapp.com
└── pv
├── pvc-090f23e1
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
├── pvc-14ba4b44
│ ├── globalmount
│ │ ├── beegfs-client.conf
│ │ └── mount
│ └── vol_data.json
...
So for some reason the node with the newer kubernetes version has an empty beegfs.csi.netapp.com directory.
Why are the pods on the "new" nodes trying to mount this other location? Is the v1.3.0 version of the driver incompatible with kubernetes 1.24.9? Should I upgrade the driver to v1.4.0?
Please let me know if you need any more info.
Thanks in advance!
Just wanted to report that it builds okay:
$ make
./release-tools/verify-go-version.sh "go"
======================================================
WARNING
This projects is tested with Go v1.15.
Your current Go version is v1.17.
This may or may not be close enough.
In particular test-gofmt and test-vendor
are known to be sensitive to the version of
Go.
======================================================
mkdir -p bin
echo '' | tr ';' '\n' | while read -r os arch suffix; do \
if ! (set -x; CGO_ENABLED=0 GOOS="$os" GOARCH="$arch" go build -a -ldflags ' -X main.version=v1.2.1-0-g316c1cd -extldflags "-static"' -o "./bin/beegfs-csi-driver$suffix" ./cmd/beegfs-csi-driver); then \
echo "Building beegfs-csi-driver for GOOS=$os GOARCH=$arch failed, see error(s) above."; \
exit 1; \
fi; \
done
+ CGO_ENABLED=0 GOOS= GOARCH= go build -a -ldflags -X main.version=v1.2.1-0-g316c1cd -extldflags "-static" -o ./bin/beegfs-csi-driver ./cmd/beegfs-csi-driver
$ ./bin/beegfs-csi-driver -version
beegfs-csi-driver v1.2.1-0-g316c1cd
"--endpoint=${CSI_ENDPOINT}"

so maybe that will simplify things somewhat). This is what I get no matter how I try to change the paths. I'll try to take another look at this another day.

Apr 20 14:57:10 b5 nomad[6938]: 2022-04-20T14:57:10.809Z [ERROR] client.alloc_runner.task_runner.task_hook: killing task because plugin failed: alloc_id=e2f36449-9542-e65d-0769-6f8e15aa32c3 task=plugin error="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /opt/nomad/data/client/csi/plugins/e2f36449-9542-e65d-0769-6f8e15aa32c3/csi.sock: no such file or directory"

Also, node-id (seen here) seems to be nodeid in more recent config examples by HashiCorp staff.
It's not entirely clear from the instructions whether the driver has to be deployed on a host (worker) with a functioning BeeGFS client or not. For example, the connAuth section in plugin.nomad: if the BeeGFS cluster isn't using connAuth, can that section be left out of the plugin job? Or will using the plugin break the BeeGFS client if it's left out (or if it's not left out)? I know I'll find out once I get there, but it should be clear up front, because many users may be concerned about whether the plugin can interfere with their BeeGFS client.
beegfs-dkms-client uninstalls beegfs-client (v7.3.0), which is dangerous. If that package isn't necessary, let's not suggest installing it.
Further updating various Operator dependencies breaks things in a number of places. While there currently aren't any pressing reasons to update those dependencies further, in the future we should first update the SDK version, as it's possible that will help smooth over the rest of the dependency updates.
See the "End-to-End Test" stage and runIntegrationSuite() from the original Jenkinsfile. More documentation is also in the test/e2e/README.md. This includes both the driver deployed with and without the operator as there doesn't seem to be a separate test suite/command.
Unless we plan to continue using statically deployed clusters for these tests, much of what was there needs to be reworked. This should actually simplify things greatly, as all of the pieces to clean up old tests no longer apply. We can just run the Ginkgo e2e tests once the driver is deployed in the existing e2e job that is already set up to run against a matrix of BeeGFS and Kubernetes versions.
To use the current e2e test setup with Minikube with the Ginkgo e2e tests, we will need to deploy a multi-node Minikube cluster. This shouldn't be too difficult, since we no longer need to use the "none" driver to use BeeGFS with Minikube so long as we deploy BeeGFS into K8s. Alternatively, we would need to deploy K8s clusters outside GitHub Actions (ideally managed through something like Rancher) as self-hosted runners.
Currently the e2e tests are built on Ginkgo, and some tests use deprecated Ginkgo features. Part of this issue is evaluating how we can update our current tests to use the latest version of Ginkgo. If migrating the current test suites will take significant effort, that should be done as a separate issue where we also reevaluate what else is out there for K8s/CSI e2e tests and whether it would be better to start from scratch with something else. Historically, many K8s e2e tests have used Ginkgo, so this seems unlikely, but it is worth exploring.
Currently the module path as defined by go.mod is "github.com/netapp/beegfs-csi-driver", even though the actual path is now "github.com/thinkparq/beegfs-csi-driver". While migrated GitHub repositories set up a redirect (so either go get github.com/netapp/beegfs-csi-driver or go get github.com/thinkparq/beegfs-csi-driver will work), it would be good to eventually update the path to reflect the new repository.
Because this is a CSI driver, most users will be downloading/deploying it using a prebuilt container image and would not be affected by changing the module path. It is much less likely (though not impossible) there are other projects or libraries that are importing these packages and would be broken. Ideally any change to the module path would result in a major version bump.
A change that would be much more impactful to users is changing the driver name from "beegfs.csi.netapp.com" to "beegfs.csi.thinkparq.com". As this would almost certainly require bumping the major version it makes sense to plan for these two changes to happen together.
As there is no pressing reason to make either of these changes, we'll put them on the back burner until there is a more compelling reason to ship a 2.0 version.
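For context, Go's semantic import versioning ties a major version bump to the module path, so a combined rename and 2.0 release would look roughly like the sketch below (the /v2 suffix requirement is standard Go modules behavior; the go directive version shown is illustrative):

```
// go.mod after a hypothetical rename plus 2.0 release: a v2+ module
// must append the major version to its module path.
module github.com/thinkparq/beegfs-csi-driver/v2

go 1.21
```

This is also why bundling the two breaking changes is convenient: importers would have to update their import paths once either way.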
Hi,
when the csi-provisioner container inside the csi-beegfs-controller-0 pod starts, it logs a warning with the message:
W0617 12:47:35.586467 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
Full log:
I0617 12:47:31.644718 1 csi-provisioner.go:121] Version: v2.0.2
I0617 12:47:31.644795 1 csi-provisioner.go:135] Building kube configs for running in cluster...
I0617 12:47:31.651298 1 connection.go:153] Connecting to unix:///csi/csi.sock
I0617 12:47:35.580545 1 common.go:111] Probing CSI driver for readiness
I0617 12:47:35.580566 1 connection.go:182] GRPC call: /csi.v1.Identity/Probe
I0617 12:47:35.580570 1 connection.go:183] GRPC request: {}
I0617 12:47:35.585663 1 connection.go:185] GRPC response: {}
I0617 12:47:35.585730 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.585742 1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginInfo
I0617 12:47:35.585745 1 connection.go:183] GRPC request: {}
I0617 12:47:35.586397 1 connection.go:185] GRPC response: {"name":"beegfs.csi.netapp.com","vendor_version":"v1.1.0-0-gc65b537"}
I0617 12:47:35.586447 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.586458 1 csi-provisioner.go:182] Detected CSI driver beegfs.csi.netapp.com
W0617 12:47:35.586467 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0617 12:47:35.586477 1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0617 12:47:35.586481 1 connection.go:183] GRPC request: {}
I0617 12:47:35.587155 1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0617 12:47:35.587274 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.587284 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0617 12:47:35.587288 1 connection.go:183] GRPC request: {}
I0617 12:47:35.587831 1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}}]}
I0617 12:47:35.588083 1 connection.go:186] GRPC error: <nil>
I0617 12:47:35.588233 1 csi-provisioner.go:210] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0617 12:47:35.588790 1 controller.go:735] Using saving PVs to API server in background
I0617 12:47:35.689094 1 volume_store.go:97] Starting save volume queue
Does this driver support Prometheus metrics? If so, how can they be enabled? And is there any way for a Kubernetes admin to track how much storage is in use?
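For the sidecar warning specifically, the flag named in the message belongs to the external-provisioner sidecar rather than the driver itself. A sketch of setting it in the controller spec (the exact args list in the shipped csi-beegfs-controller.yaml may differ, and port 8080 is an arbitrary choice):

```yaml
# Sketch: give the csi-provisioner sidecar a metrics listen address so it
# exposes a Prometheus /metrics endpoint. The flag name is taken from the
# warning message above; the other args shown are illustrative.
containers:
  - name: csi-provisioner
    args:
      - --csi-address=/csi/csi.sock
      - --metrics-address=:8080
```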
I used to have the controller running on a master node via an affinity rule (version 1.2.0), but now that I have upgraded to v1.2.1 I had to remove the affinity rule, otherwise the beegfs container would fail like this:
16:59 # kubectl logs -f -n kube-system csi-beegfs-controller-0 beegfs
E0725 14:57:42.243667 1 main.go:70] "msg"="Fatal: Failed to initialize driver" "error"="failed to read client configuration template file: open /host/etc/beegfs/beegfs-client.conf: no such file or directory" "fullError"="open /host/etc/beegfs/beegfs-client.conf: no such file or directory\nfailed to read client configuration template file" "goroutine"="main"
My master nodes never had the BeeGFS requirements installed (everything required, such as the client), and it always worked. So my question is whether any change has been introduced that requires the controller to be scheduled on a BeeGFS-capable client node, and if not, what the issue could be.
Cheers.
This test confirms issues can be created by the public.
Hi,
I was wondering about plans to support newer versions of K8s. Are there plans to support BeeGFS 7.1.5 on K8s 1.23.x and other versions, since 1.24 is planned for April?
Hi Team,
I deployed the beegfs-csi-driver using kubectl apply -k deploy/k8s/overlays/default,
then I updated the configmap csi-beegfs-config to add:
...
data:
  csi-beegfs-config.yaml: |
    config:
      connInterfaces:
        - ib0
      connRDMAInterfaces:
        - ib0
      beegfsClientConf:
        sysMgmtdHost: 172.19.204.1
        connClientPortUDP: 8004
        connHelperdPortTCP: 8006
        connMgmtdPortTCP: 8008
        connMgmtdPortUDP: 8008
        connPortShift: 0
        connCommRetrySecs: 600
        connFallbackExpirationSecs: 900
        connMaxInternodeNum: 12
        connMaxConcurrentAttempts: 0
        connUseRDMA: true
        connRDMABufNum: 70
        connRDMABufSize: 8192
        connRDMATypeOfService: 0
        logClientID: false
        logLevel: 3
        logType: helperd
        quotaEnabled: false
        sysCreateHardlinksAsSymlinks: false
        sysMountSanityCheckMS: 11000
        sysSessionCheckOnClose: false
        sysSyncOnClose: false
        sysTargetOfflineTimeoutSecs: 900
        sysUpdateTargetStatesSecs: 30
        sysXAttrsEnabled: false
        tuneFileCacheType: buffered
        tuneRemoteFSync: true
        tuneUseGlobalAppendLocks: false
        tuneUseGlobalFileLocks: false
...
but the pod csi-beegfs-controller-0 is still in error:
"msg"="Fatal: Failed to initialize driver" "error"="failed to get valid default client configuration template file" "fullError"="failed to get valid default client configuration template file\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.newBeegfsDriver\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs.go:225\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.NewBeegfsDriver\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs.go:160\nmain.handle\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/cmd/beegfs-csi-driver/main.go:67\nmain.main\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/cmd/beegfs-csi-driver/main.go:62\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581" "goroutine"="main"
What am I missing?
👋 Hi, are there any plans to add support for snapshots?
Thanks!
I used the "one liner" to deploy, and it deployed to kube-system.
Perhaps we should document that this approach doesn't deploy to a dedicated namespace and suggest creating one?
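One way to document this might be a thin custom kustomize overlay that sets a namespace (the namespace name and the relative resources path here are illustrative, not part of the repo):

```yaml
# Sketch of a user-provided kustomization.yaml layered on top of the
# shipped default overlay; kustomize rewrites all namespaced resources.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: beegfs-csi
resources:
  - ../default
```

The namespace object itself would still need to be created first (e.g. with kubectl create namespace beegfs-csi) or included in the overlay.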
A couple of months ago, ThinkParQ posted a survey in the BeeGFS newsletter and the BeeGFS user group asking for feedback on the current and future use of BeeGFS in containers and in the cloud (public, hybrid, private, or otherwise). This survey was created in collaboration with the BeeGFS CSI driver maintainers in the hopes of better understanding what features we should add to the 1.x version of the driver (and what a 2.x version of the driver might look like). Unfortunately, the survey didn't receive enough responses to be particularly useful. If you have not already taken the survey please consider doing so at the link below!
I have a pre-existing BeeGFS volume for which I manually created a PV and a PVC to make it available in a certain namespace. Now I would like to have it mounted in a different namespace, so I have created a PV and a new PVC, but it throws:
---- ------ ---- ---- -------
Warning ClaimMisbound 38s persistentvolume-controller Two claims are bound to the same volume, this one is bound incorrectly
Is this allowed? We also have commercial support if needed.
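For background, Kubernetes binds each PV to exactly one PVC, so exposing the same BeeGFS directory in a second namespace generally means creating a second PV object with a distinct name. A sketch of such a PV (all names, the capacity, and the volumeHandle format are illustrative assumptions, not taken from the driver docs):

```yaml
# Sketch of a second static PV for the same underlying BeeGFS directory.
# PV/PVC binding is 1:1, so each namespace's PVC needs its own PV.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: beegfs-vol-ns2  # must differ from the PV bound in the first namespace
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: beegfs.csi.netapp.com
    volumeHandle: beegfs://mgmtd.example.com/path/to/volume  # illustrative
```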
I'm not sure if this would be useful to BeeGFS CSI users, but a new flavor of RWO called ReadWriteOncePod could be added if there's a use case for it.
It seems like an anti-feature for BeeGFS CSI... maybe there's a use case for lowering the risk of application-induced data corruption in RWO workloads?
Our conn_auth is random binary bytes, and I didn't find a way to put it in the csi-beegfs-connauth.yaml file. Would it be possible to support base64 encoding for this field?
e.g.

```go
// ConnAuthConfig associates a ConnAuth with a SysMgmtdHost.
type ConnAuthConfig struct {
	SysMgmtdHost string `json:"sysMgmtdHost"`
	ConnAuth     string `json:"connAuth"`
	// Add new field for configuration
	Encoding string `json:"encoding"`
}
```
The Encoding field could be either raw or base64.
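A minimal sketch of how such a field might be handled, assuming the proposed Encoding field is adopted (decodedConnAuth is a hypothetical helper, not part of the driver):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// ConnAuthConfig mirrors the struct proposed above, with the
// hypothetical Encoding field added.
type ConnAuthConfig struct {
	SysMgmtdHost string `json:"sysMgmtdHost"`
	ConnAuth     string `json:"connAuth"`
	Encoding     string `json:"encoding"` // "raw" (default) or "base64"
}

// decodedConnAuth returns the secret bytes to hand to the BeeGFS client,
// decoding the YAML-safe base64 form when requested.
func decodedConnAuth(c ConnAuthConfig) ([]byte, error) {
	switch c.Encoding {
	case "", "raw":
		return []byte(c.ConnAuth), nil
	case "base64":
		return base64.StdEncoding.DecodeString(c.ConnAuth)
	default:
		return nil, fmt.Errorf("unknown connAuth encoding %q", c.Encoding)
	}
}

func main() {
	// "c2VjcmV0" is base64 for "secret".
	c := ConnAuthConfig{SysMgmtdHost: "mgmtd.example.com", ConnAuth: "c2VjcmV0", Encoding: "base64"}
	secret, err := decodedConnAuth(c)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(secret)) // prints "secret"
}
```

Treating an empty Encoding as raw would keep existing connAuth files working unchanged.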