
Comments (16)

ejweber commented on June 12, 2024

Thanks for taking a look @scaleoutsean!

  • Unfortunately, we're not currently running an up-to-date Nomad environment, so we will have to defer investigation of the path issue. With CSI going GA in Nomad soon, we'll discuss plans to get this environment going again.
  • node-id is essentially a command line argument to the BeeGFS CSI driver. We have also seen that many/most CSI drivers use nodeid instead (e.g. https://github.com/kubernetes-csi/csi-driver-host-path/blob/f5fd42e78f3884ed6b780d23c1c43798a0d29d35/deploy/kubernetes-1.21/hostpath/csi-hostpath-plugin.yaml#L226). We cannot make a change here to align with the Nomad examples because doing so would be a breaking change regardless of container orchestrator (Nomad vs. Kubernetes).
  • We can definitely do a better job clarifying what is and isn't required in plugin.nomad. The General Configuration section of docs/deploy.md discusses the fact that all driver configuration is optional and is generally only required in "interesting" environments (e.g. when existing file systems are set up to use a connAuthInterfaces file), but there is nothing in plugin.nomad to reiterate that or to point to that documentation. More to your point, however, the way we have plugin.nomad set up, omitting either csi-beegfs-config.yaml or csi-beegfs-connauth.yaml will cause a driver failure. The driver will expect to find at least a blank file at both paths. One option is likely to remove the data section from either (or both) template blocks. Another option is to remove the mention of these files from the args list.
  • The driver can technically run with the beegfs-client package, but the beegfs-client-dkms package is specified as a prerequisite for all supported driver deployments (https://github.com/NetApp/beegfs-csi-driver#prerequisites). Both packages build and install the same BeeGFS client kernel module, but the beegfs-client package is governed by a systemd unit and the beegfs-mounts.conf file, while the beegfs-client-dkms package uses the DKMS infrastructure to build the module and standard mount commands (e.g. mount -t beegfs -ocfgFile=... beegfs_nodev /mnt/beegfs) to handle mounts. The driver uses this command format to mount BeeGFS file systems to nodes on demand.
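For illustration, a minimal pair of files that would satisfy the "at least a blank file at both paths" expectation might look like this (a sketch; the values mirror the plugin.nomad templates shown later in this thread, and the connAuth secret is a placeholder):

```yaml
# csi-beegfs-config.yaml -- all settings optional, but the file must exist
config:
  beegfsClientConf:
    connUseRDMA: "false"   # beegfs-client.conf values must be quoted strings
---
# csi-beegfs-connauth.yaml -- one list entry per file system that uses connAuth
- sysMgmtdHost: 192.168.1.191
  connAuth: secret
```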

from beegfs-csi-driver.

ejweber commented on June 12, 2024

The volume creation issues are difficult to troubleshoot without a working Nomad environment, but I can at least give you a bit of context.

The driver writes client configuration files to the configured directory directly (e.g. /opt/nomad/data/client/csi/monolith/beegfs-plugin0). This write happens in the driver's own mount namespace (within the driver container).

When the driver executes beegfs-ctl, it uses chroot to make this execution happen in the host namespace. This allows it to use the beegfs-ctl utility already installed on the host (instead of shipping one). Note that this is even more important for the node service's NodeStageVolume command, as mount calls must reference a client configuration file in the host's namespace.
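The chroot step can be pictured roughly as follows (a sketch, assuming the host root is bind-mounted read-only at /host as in the plugin.nomad example later in this thread; the "example" directory name is hypothetical):

```shell
# Run the HOST'S beegfs-ctl, resolving all paths against the host's root.
# Because the command runs chrooted into /host, a config path like
# /opt/nomad/data/... means the same thing to beegfs-ctl as it does to the host.
chroot /host beegfs-ctl \
  --cfgFile=/opt/nomad/data/client/csi/monolith/beegfs-plugin0/example/beegfs-client.conf \
  --unmounted --getentryinfo /mnt/beegfs
```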

We take care in both Kubernetes and Nomad to ensure that the driver and the host both see the client configuration files as having the same path. That way, both the driver and the host can refer to them at that path. In Kubernetes, we accomplish this with a bind mount:
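The kind of Kubernetes spec fragment meant here looks roughly like this (an illustrative sketch; the exact path used in the project's Kubernetes manifests may differ):

```yaml
# The container mounts the host directory at the SAME path it occupies on the
# host, so any path the driver writes is equally valid in both namespaces.
volumeMounts:
  - name: plugin-dir
    mountPath: /var/lib/kubelet/plugins/beegfs.csi.netapp.com
volumes:
  - name: plugin-dir
    hostPath:
      path: /var/lib/kubelet/plugins/beegfs.csi.netapp.com
      type: DirectoryOrCreate
```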

There is an implicit assumption in plugin.nomad that this /opt/nomad/data/client/csi/monolith/beegfs-plugin0 directory is similarly configured out of the box. And there is good evidence to suggest this is the case: Nomad writes to a socket in this directory, and the driver reads from that same socket.

That being said, the logs tell a different story. Since there is no error on directory creation, we can safely assume that the driver creates the configuration directory in its mount namespace. Since there is a failure on the part of beegfs-ctl to load the map file, I suspect that the host has a different view of what is or isn't contained in the same directory. This could be the result of some Nomad change since the version we tested with or an overlooked detail. (That "cleaning up path" message on log line 6 is an indicator that the client.conf file is blown away after the failure, so I wouldn't expect to be able to find it after the fact.)
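One way to probe that suspicion from the host side is to pre-create what the driver would have written and run the same style of command the driver logs show (a sketch; the "test" directory name is hypothetical, and success would mean the path itself is fine in the host namespace):

```shell
# Stand in for the driver's write, from the HOST's point of view:
sudo mkdir -p /opt/nomad/data/client/csi/monolith/beegfs-plugin0/test
sudo cp /etc/beegfs/beegfs-client.conf /opt/nomad/data/client/csi/monolith/beegfs-plugin0/test/
# The same style of invocation seen in the controller logs:
sudo beegfs-ctl \
  --cfgFile=/opt/nomad/data/client/csi/monolith/beegfs-plugin0/test/beegfs-client.conf \
  --unmounted --getentryinfo /mnt/beegfs
```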


ejweber commented on June 12, 2024

The changes we introduced in v1.3.0 addressed the known Nomad issues. Additional cleanup is being done in v1.4.0. I'm going to go ahead and close this one for now, but please feel free to open a followup issue with additional feedback if/when it makes sense.


scaleoutsean commented on June 12, 2024

That's helpful, thank you.

More to your point, however, the way we have plugin.nomad set up, omitting either csi-beegfs-config.yaml or csi-beegfs-connauth.yaml will cause a driver failure. The driver will expect to find at least a blank file at both paths.

I went with that assumption (and a few others for other steps, which made troubleshooting harder because several assumptions were in play at once). I initially tried leaving connAuth empty, but later I also configured connection authentication to see whether the lack of it was causing my problems (it seemed it wasn't). Which brings me to this part of the page you linked above:

NOTE: beegfs-client.conf values MUST be specified as strings, even if they appear to be integers or booleans (e.g. "8000", not 8000 and "true", not true).

That seems to refer only to values in YAML files and not to the secret string in the connAuth file (sample here), because that's part of a template file used for configuration and not an argument passed to the beegfs-client binary.

But as I just discovered, that's not the case: surrounding the values in plugin.nomad's docker template (connAuth, connUseRDMA) with double quotes seems to have helped, and now the plugin works. So that may be a bug in the reference plugin.nomad file. And indeed, volume.hcl (from the example) indicates that quotes should be used.

I still can't create a volume (getting Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.195_mnt_beegfs_nomad_VOLUME__NAME/beegfs-client.conf) but I just got to this step so I'll investigate this further.


ejweber commented on June 12, 2024

Interesting!

This commit added plugin.nomad on 08/05/21, and it was tested in a Nomad environment we had set up at the time. plugin.nomad hasn't substantially changed since then.

This commit changed the YAML parser the driver uses on 11/20/21. Both csi-beegfs-config.yaml and csi-beegfs-connauth.yaml are parsed in the same way, so the warning applies to both. When that commit went in, we updated all YAML files in the project, but it looks like we missed the Nomad templates (as you found).

As far as I understand (and remember), only values that might otherwise be interpreted as non-strings must be quoted. In the PR you submitted, it makes sense to me that "true" should be quoted. In my mind, this is what was causing the issue. I do not think "1.1.1.1" or "secret1" need to be quoted, and I would be curious whether the driver would run without them.


scaleoutsean commented on June 12, 2024

You're right, it works with only the boolean value surrounded by quotes. But:

  • What if there's a space, ', or " in the password? (I didn't want to think about that or test it.)
  • There's that note that "beegfs-client.conf values MUST be specified as strings".
  • BeeGFS itself doesn't seem to need quotes around string values, but I haven't checked whether those quotes are "lost" by the time those parts of the config files get saved to disk. It may also be that passwords and paths with spaces or quotes aren't supported because of that. Personally, I'd rather use double quotes consistently, since they're known to work, than investigate those what-ifs.

The details from the second bullet are in another file that my PR doesn't change, so if you want to selectively surround only boolean values with quotes, that's fine; the PR allows "minor edits from maintainers". But if you make edits, I would suggest also updating the other example and that note (which can be done by you in a separate PR) to minimize confusion for users.

Since I'm already editing this comment, I'll add that yesterday I was thinking about suggesting we move /deploy and /examples into /docs and make everything under /docs publishable to GH Pages or elsewhere. I didn't mention it because it's not a big pain point now, but if additional changes are made to the docs, we could use the opportunity to do that.


ejweber commented on June 12, 2024

Any group of characters beginning with an alphabetic or numeric character in a YAML file is interpreted as a string unless it belongs to a special group (like integer, boolean, time, etc.). Additionally, any value can be forced to be interpreted as a string using double quotes. My intention with that beegfs-client.conf comment was to remind users to force special values in the config section to strings (because the Kubernetes YAML parser, for whatever reason, will error out instead of unmarshalling true to "true" in a map[string]string). I didn't intend it to mean "quote everything", but quotes around everything certainly don't hurt!
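To make the rule concrete, a sketch of which values need quoting (keys are examples taken from beegfs-client.conf; the port value is illustrative):

```yaml
config:
  beegfsClientConf:
    connUseRDMA: "true"        # bare true would parse as a YAML boolean -> error
    connMgmtdPortTCP: "8008"   # bare 8008 would parse as a YAML integer -> error
    logType: helperd           # a plain word is already a string; quotes optional
```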


ejweber commented on June 12, 2024

We'd like to incorporate your PR into the upcoming 1.2.2 release. Our current process does not allow us to merge PRs directly on GitHub, but we can pull the commits in, test them in our infrastructure (which currently doesn't include a Nomad deployment, so it's just a formality), and include them when the release goes live. It'd be best if we did that on a fully working example, though. Hopefully we can get to the bottom of the remaining issue. To that end, if there are additional commands or output you can share, I'd be happy to try and help troubleshoot.


scaleoutsean commented on June 12, 2024

Do you mean for the next step (volume create)? Sure, I haven't been able to figure that one out.

  • BeeGFS Mgmt Host - b1 - 192.168.1.191
  • BeeGFS Client - b5 - 192.168.1.195
    • Also Nomad server and Nomad client
    • I'm not sure whether the BeeGFS client should be installed where the BeeGFS CSI containers run, but I've been using the same client with a host volume (non-CSI), which saves me one VM's worth of resources
$ sudo beegfs-ctl --listnodes --nodetype=mgmt
b1 [ID: 1]

$ sudo beegfs-ctl --listnodes --nodetype=client
9E22-6260ECCA-b5 [ID: 1]

$ nslookup b1
Non-authoritative answer:
Name:	b1
Address: 192.168.1.191

$ nslookup b5
Non-authoritative answer:
Name:	b5
Address: 127.0.2.1
Name:	b5
Address: 192.168.1.195
  • Nomad server/client on b5 (BeeGFS client), v1.3.0 Beta 1:
$ nomad node status
ID        DC   Name  Class   Drain  Eligibility  Status
a987e631  dc1  b5    <none>  false  eligible     ready
  • BeeGFS CSI Plugin is installed with only "false" surrounded by double quotes.
job "beegfs-csi-plugin" {
  type = "system"
  datacenters = ["dc1"]
  group "csi" {
    task "plugin" {
      driver = "docker"
      template {
        data        = <<EOH
config:
  beegfsClientConf:
    connUseRDMA: "false"
        EOH
        destination = "${NOMAD_TASK_DIR}/csi-beegfs-config.yaml"
      }
      template {
        data        = <<EOH
- connAuth: secret
  sysMgmtdHost: 192.168.1.191
        EOH
        destination = "${NOMAD_SECRETS_DIR}/csi-beegfs-connauth.yaml"
      }
      config {
        mount {
          type     = "bind"
          target   = "/host"
          source   = "/"
          readonly = true
        }
        image = "netapp/beegfs-csi-driver:v1.2.1"
        args = [
          "--driver-name=beegfs.csi.netapp.com",
          "--client-conf-template-path=/host/etc/beegfs/beegfs-client.conf",
          "--cs-data-dir=/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
          "--config-path=${NOMAD_TASK_DIR}/csi-beegfs-config.yaml",
          "--connauth-path=${NOMAD_SECRETS_DIR}/csi-beegfs-connauth.yaml",
          "--v=5",
          "--endpoint=unix://opt/nomad/data/client/csi/monolith/beegfs-plugin0/csi.sock",
          "--node-id=node-${NOMAD_ALLOC_INDEX}",
        ]
        privileged = true
      }
      csi_plugin {
        id = "beegfs-plugin0"
        type = "monolith"
        mount_dir = "/opt/nomad/data/client/csi/monolith/beegfs-plugin0"
      }
      resources {
        cpu = 256
        memory = 128
      }
    }
  }
}
  • Plugin
$ nomad plugin status beegfs-plugin
ID                   = beegfs-plugin0
Provider             = beegfs.csi.netapp.com
Version              = v1.2.1-0-g316c1cd
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
ca0a455e  a987e631  csi         0        run      running  19m58s ago  19m52s ago

  • BeeGFS mounted on the usual path on the client, with dyn as the subdirectory where dynamic volumes would be created
$ dir -lat /mnt/beegfs/dyn/
total 1
drwxrwxr-x 2 vagrant vagrant 0 Apr 22 15:44 .
drwxrwxrwx 5 root    root    3 Apr 22 15:44 ..
  • Volume
id = "VOLUME"
name = "VOLUME"
type = "csi"
plugin_id = "beegfs-plugin0"
capacity_min = "1MB"
capacity_max = "1GB"
capability {
  access_mode = "single-node-reader-only"
  attachment_mode = "file-system"
}
capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}
parameters {
  sysMgmtdHost   = "192.168.1.191"
  volDirBasePath = "/mnt/beegfs/dyn"
}
  • Error
Error creating volume: Unexpected response code: 500 (1 error occurred:
	* controller create volume: CSI.ControllerCreateVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = beegfs-ctl failed with stdOut:  and stdErr: 
Error: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.195_mnt_beegfs_dyn_VOLUME/beegfs-client.conf

[BeeGFS Control Tool Version: 7.3.0
Refer to the default config file (/etc/beegfs/beegfs-client.conf)
or visit http://www.beegfs.com to find out about configuration options.]

: exit status 1

)
  • When I check that path, the file does not exist, but maybe it is created and removed too quickly.
$ dir -lat /mnt/beegfs/dyn/VOLUME
dir: cannot access '/mnt/beegfs/dyn/VOLUME': No such file or directory

$ sudo dir -lat /opt/nomad/data/client/csi/monolith/beegfs-plugin0/
total 8
drwx------ 5 root root 4096 Apr 22 01:57 ..
drwx------ 2 root root 4096 Apr 20 14:23 .

  • Here's what I see in controller logs - the first row claims the config files are being written.
I0422 16:10:53.072271       1 beegfs_ctl.go:34]  "msg"="Creating BeeGFS directory" "reqID"="009b" "path"="/mnt/beegfs/dyn/VOLUME" "volumeID"="beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME"
I0422 16:10:53.072320       1 beegfs_ctl.go:138]  "msg"="Executing command" "reqID"="009b" "command"=["beegfs-ctl","--cfgFile=/opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf","--unmounted","--getentryinfo","/mnt/beegfs/dyn/VOLUME"]
I0422 16:10:53.076350       1 beegfs_ctl.go:161]  "msg"="stderr from command" "reqID"="009b" "command"=["beegfs-ctl","--cfgFile=/opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf","--unmounted","--getentryinfo","/mnt/beegfs/dyn/VOLUME"] "stderr"="\nError: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf\n\n[BeeGFS Control Tool Version: 7.3.0\nRefer to the default config file (/etc/beegfs/beegfs-client.conf)\nor visit http://www.beegfs.com to find out about configuration options.]\n\n"
I0422 16:10:53.077025       1 beegfs_util.go:270]  "msg"="Unmounting volume from path" "reqID"="009b" "path"="/opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/mount" "volumeID"="beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME"
W0422 16:10:53.077259       1 mount_helper_common.go:33] Warning: Unmount skipped because path does not exist: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/mount
I0422 16:10:53.077409       1 beegfs_util.go:283]  "msg"="Cleaning up path" "reqID"="009b" "path"="/opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME" "volumeID"="beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME"
E0422 16:10:53.077790       1 server.go:195]  "msg"="GRPC error" "error"="rpc error: code = Internal desc = beegfs-ctl failed with stdOut:  and stdErr: \nError: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf\n\n[BeeGFS Control Tool Version: 7.3.0\nRefer to the default config file (/etc/beegfs/beegfs-client.conf)\nor visit http://www.beegfs.com to find out about configuration options.]\n\n: exit status 1: beegfs-ctl failed with stdOut:  and stdErr: \nError: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf\n\n[BeeGFS Control Tool Version: 7.3.0\nRefer to the default config file (/etc/beegfs/beegfs-client.conf)\nor visit http://www.beegfs.com to find out about configuration options.]\n\n: exit status 1" "fullError"="exit status 1\nbeegfs-ctl failed with stdOut:  and stdErr: \nError: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf\n\n[BeeGFS Control Tool Version: 7.3.0\nRefer to the default config file (/etc/beegfs/beegfs-client.conf)\nor visit http://www.beegfs.com to find out about configuration 
options.]\n\n\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.(*beegfsCtlExecutor).execute\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs_ctl.go:154\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.(*beegfsCtlExecutor).statDirectoryForVolume\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs_ctl.go:70\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.(*beegfsCtlExecutor).createDirectoryForVolume\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/beegfs_ctl.go:36\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.(*controllerServer).CreateVolume\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/controllerserver.go:139\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5676\ngithub.com/netapp/beegfs-csi-driver/pkg/beegfs.logGRPC\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/pkg/beegfs/server.go:193\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:5678\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/vendor/google.golang.org/grpc/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/vendor/google.golang.org/grpc/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/var/lib/jenkins/workspace/beegfs-csi-driver_master@2/vendor/google.golang.org/grpc/server.go:934\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nrpc error: code = Internal desc = beegfs-ctl failed with stdOut:  and stdErr: \nError: Failed to load map file: 
/opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME/beegfs-client.conf\n\n[BeeGFS Control Tool Version: 7.3.0\nRefer to the default config file (/etc/beegfs/beegfs-client.conf)\nor visit http://www.beegfs.com to find out about configuration options.]\n\n: exit status 1" "reqID"="009b" "method"="/csi.v1.Controller/CreateVolume" "request"="{\"accessibility_requirements\":{},\"capacity_range\":{\"limit_bytes\":1000000000,\"required_bytes\":1000000},\"name\":\"VOLUME\",\"parameters\":{\"sysMgmtdHost\":\"192.168.1.191\",\"volDirBasePath\":\"/mnt/beegfs/dyn\"},\"volume_capabilities\":[{\"AccessType\":{\"Mount\":{}},\"access_mode\":{\"mode\":2}},{\"AccessType\":{\"Mount\":{}},\"access_mode\":{\"mode\":1}}]}"

There's no log line about the config file being deleted, so I assume it should still be there if it was created, but that doesn't seem to be the case. I haven't looked at the source to see whether deletions are logged as well; if they're not, it might be good to add that so we can tell whether the file was created and then quickly removed.

The volume workflow from this repo uses sed to search and replace the volume name in the template file, but that behaves no differently: it fails with the same error.

$ sed -e "s/VOLUME_NAME/sean[1]/" "volume.hcl" | nomad volume create -
Error creating volume: Unexpected response code: 500 (1 error occurred:
	* controller create volume: CSI.ControllerCreateVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = beegfs-ctl failed with stdOut:  and stdErr: 
Error: Failed to load map file: /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_sean%5B1%5D/beegfs-client.conf

[BeeGFS Control Tool Version: 7.3.0
Refer to the default config file (/etc/beegfs/beegfs-client.conf)
or visit http://www.beegfs.com to find out about configuration options.]

: exit status 1

)
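As an aside, the "funny" directory names in these errors follow a recognizable pattern. Below is a small sketch that reproduces the mapping seen in the logs above (a guess at the pattern from the observed names, not the driver's actual code; note that one log line earlier in this thread also shows an underscore in a volume name apparently doubled, which this sketch does not handle):

```python
from urllib.parse import quote

def sanitized_dir_name(volume_id: str) -> str:
    # Drop the beegfs:// scheme, URL-escape characters that are unsafe in file
    # names (e.g. [ and ] become %5B and %5D), and join the host and path
    # components with underscores.
    rest = volume_id.removeprefix("beegfs://")
    return "_".join(quote(part, safe="") for part in rest.split("/"))

# 'beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME' -> '192.168.1.191_mnt_beegfs_dyn_VOLUME'
print(sanitized_dir_name("beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME"))
```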

I've tried several different things, no luck.

I can't see anything important in the BeeGFS management server logs. I assume that's because the map file can't be loaded, so nothing gets sent its way when nomad volume create is executed.


scaleoutsean commented on June 12, 2024

One more piece of feedback: it's a little hard to tell from the CLI output (plugin status) when a plugin is actually in good shape. It seems Nomad shows the plugin as healthy as long as it's up and running.

In this particular case the volume doesn't get created, so we know there's something wrong, but until that point it seems hard to tell. For example, I can enter the sysMgmtdHost IP as 1.1.1.1 (edit: I mean in plugin.nomad) and the plugin will still show as healthy in the plugin status output. Thankfully it doesn't show in the Web UI under CSI > Plugins in that case, so it seems some checks are involved.

However, when I enter 192.168.1.195 (own IP of the Nomad Server / Nomad Client / BeeGFS Client), then the plugin does show in the Web UI. That may be due to its monolithic nature, but it also indicates that those health checks may not be reliable.

I wonder if a check command of some sort (maybe even just manually executed curl commands) could be used to test whether a temp volume can be created and deleted.
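Something along these lines could be scripted with the nomad CLI alone (a hypothetical smoke test; the volume name and parameters are illustrative):

```shell
cat > /tmp/csi-smoke.hcl <<'EOF'
id           = "csi-smoke"
name         = "csi-smoke"
type         = "csi"
plugin_id    = "beegfs-plugin0"
capacity_min = "1MB"
capacity_max = "1MB"
capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
parameters {
  sysMgmtdHost   = "192.168.1.191"
  volDirBasePath = "/mnt/beegfs/dyn"
}
EOF
# If both commands succeed, the plugin can round-trip a volume end to end.
nomad volume create /tmp/csi-smoke.hcl && nomad volume delete csi-smoke
```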


ejweber commented on June 12, 2024

For the "another feedback", it sounds like things are working as expected here. The configuration options provided to the driver in csi-beegfs-config.yaml and csi-beegfs-connauth.yaml are used as needed. If a volume is created that references the file system with sysMgmtdHost 1.1.1.1, the driver will use the configuration associated with that file system. If no volume is ever created referencing the file system with sysMgmtdHost 1.1.1.1, the driver never uses that configuration, and simply having that configuration does not constitute an error. Of course, the reverse is also true (and this is one of the main reasons the driver was designed to work this way): you can specify absolutely no configuration, calling out the sysMgmtdHost of no specific file system, and still create volumes referencing arbitrary file systems. As long as those file systems don't NEED special configuration, the driver can still mount them.


scaleoutsean commented on June 12, 2024

If no volume is ever created referencing the file system with sysMgmtdHost 1.1.1.1, the driver never uses that configuration, and simply having that configuration does not constitute an error.

That's a good argument in favor of the current approach. It also lets us configure CSI before storage is ready and leave the configuration in place during storage maintenance or downtime.

But I still wonder whether at least a warning (if not an outright error status) should be emitted so the user can tell that the plugin cannot access the sysMgmtdHost. While in a large cluster there may always be some worker(s) that can't reach the sysMgmtdHost IP, it would be useful to know that before the problem bubbles up to applications.

Since there is a failure on the part of beegfs-ctl to load the map file, I suspect that the host has a different view of what is or isn't contained in the same directory.

I'm also willing to consider that I may have made some incorrect assumptions (as I mentioned, I'm not 100% sure I didn't), so I'm open to rechecking other details or providing additional information about my environment.

I looked at the bind mount and also at this note, which made me leave beegfs-client.conf in place, although my Ansible scripts deployed beegfs-client-b1.conf (b1 is the hostname of the SysMgmt worker). But I copied that file to beegfs-client.conf for the BeeGFS CSI driver to load, so the two have identical content.

Distroless-based containers don't have any shell, otherwise it'd be easy to get in and find out what's wrong. I had to implement a proven technique that I used in the 90's whenever Windows 3.1 had issues finding correct DLL files...

$ sudo mkdir -p /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME__NAME/
$ sudo cp /etc/beegfs/beegfs-client.conf /opt/nomad/data/client/csi/monolith/beegfs-plugin0/192.168.1.191_mnt_beegfs_dyn_VOLUME__NAME/

Then volume create worked.

$ nomad volume status VOLUME_NAME
ID                   = VOLUME_NAME
Name                 = VOLUME_NAME
External ID          = beegfs://192.168.1.191/mnt/beegfs/dyn/VOLUME_NAME
Plugin ID            = beegfs-plugin0
Provider             = beegfs.csi.netapp.com
Version              = v1.2.1-0-g316c1cd
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

This isn't how it's supposed to work, but I may be able to continue testing other things.

Note that Access Mode and other fields in the volume status output are missing, probably due to the hackish workaround. Also, volume delete doesn't work (I have to use volume deregister). I haven't tried to actually use the volume from a container yet, so who knows whether that works. Edit: all right, it doesn't... But csi_hook sure looks for some funny path names.

failed to setup alloc: pre-run hook "csi_hook" failed: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = beegfs-ctl failed with stdOut: and stdErr: Error: Failed to load map file: /local/csi/staging/VOLUME_NAME/ro-file-system-single-node-reader-only/beegfs-client.conf [BeeGFS Control Tool Version: 7.3.0 Refer to the default config file (/etc/beegfs/beegfs-client.conf) or visit http://www.beegfs.com to find out about configuration options.] : exit status 1


ejweber commented on June 12, 2024

If you could run the plugin.nomad, then capture a docker inspect on the running plugin container, that'd give us information on exactly which directories are and aren't being shared between the host and the container. It's looking like there may be a decent dev lift to rework the way our Nomad examples handle paths (either due to changes in Nomad, Nomad's CSI support, or the driver itself). The BeeGFS CSI driver is somewhat unique in how picky it is about paths matching inside and outside its container. I wouldn't expect most drivers to be that particular (because most drivers don't execute a mount command that requires host namespace access to a file they have written).
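For the record, the mounts can be pulled out of the inspect output directly (a sketch; CONTAINER stands for the plugin container's name or ID, and jq is optional):

```shell
docker inspect --format '{{json .Mounts}}' CONTAINER | jq .
# or, without jq:
docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' CONTAINER
```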


scaleoutsean commented on June 12, 2024

Sure!

  • plugin.nomad (I've been working on some demos, so the datacenter is now dc1-f2; the rest is the same):
job "beegfs" {
  type = "system"
  datacenters = ["dc1-f2"]
  group "csi" {
    task "plugin" {
      driver = "docker"
      template {
        data        = <<EOH
config:
  beegfsClientConf:
    connUseRDMA: "false"
        EOH
        destination = "${NOMAD_TASK_DIR}/csi-beegfs-config.yaml"
      }
      template {
        data        = <<EOH
- connAuth: secret
  sysMgmtdHost: 192.168.1.191
        EOH
        destination = "${NOMAD_SECRETS_DIR}/csi-beegfs-connauth.yaml"
      }
      config {
        mount {
          type     = "bind"
          target   = "/host"
          source   = "/"
          readonly = true
        }
        image = "netapp/beegfs-csi-driver:v1.2.1"
        args = [
          "--driver-name=beegfs.csi.netapp.com",
          "--client-conf-template-path=/host/etc/beegfs/beegfs-client.conf",
          "--cs-data-dir=/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
          "--config-path=${NOMAD_TASK_DIR}/csi-beegfs-config.yaml",
          "--connauth-path=${NOMAD_SECRETS_DIR}/csi-beegfs-connauth.yaml",
          "--v=5",
          "--endpoint=unix://opt/nomad/data/client/csi/monolith/beegfs-plugin0/csi.sock",
          "--node-id=node-${NOMAD_ALLOC_INDEX}",
        ]
        privileged = true
      }
      csi_plugin {
        id = "beegfs-plugin0"
        type = "monolith"
        mount_dir = "/opt/nomad/data/client/csi/monolith/beegfs-plugin0"
      }
      resources {
        cpu = 256
        memory = 128
      }
    }
  }
}

  • Allocation status:
ID                  = 1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e
Eval ID             = d40a547b
Name                = beegfs.csi[0]
Node ID             = 4f3d8916
Node Name           = b5
Job ID              = beegfs
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 10h31m ago
Modified            = 10h31m ago

Task "plugin" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/256 MHz  6.5 MiB/128 MiB  300 MiB  

Task Events:
Started At     = 2022-04-26T04:43:11Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2022-04-26T04:43:11Z  Plugin became healthy  plugin: beegfs-plugin0
2022-04-26T04:43:11Z  Started                Task started by client
2022-04-26T04:43:11Z  Task Setup             Building Task Directory
2022-04-26T04:43:11Z  Received               Task received by client

  • Container
$ docker ps -a
CONTAINER ID   IMAGE                             COMMAND                  CREATED        STATUS        PORTS     NAMES
7f85adbb6b0b   netapp/beegfs-csi-driver:v1.2.1   "beegfs-csi-driver -…"   11 hours ago   Up 11 hours             plugin-1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e
  • Details:
[
    {
        "Id": "7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c",
        "Created": "2022-04-26T04:43:11.528855254Z",
        "Path": "beegfs-csi-driver",
        "Args": [
            "--driver-name=beegfs.csi.netapp.com",
            "--client-conf-template-path=/host/etc/beegfs/beegfs-client.conf",
            "--cs-data-dir=/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
            "--config-path=/local/csi-beegfs-config.yaml",
            "--connauth-path=/secrets/csi-beegfs-connauth.yaml",
            "--v=5",
            "--endpoint=unix://opt/nomad/data/client/csi/monolith/beegfs-plugin0/csi.sock",
            "--node-id=node-0"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 3743,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-04-26T04:43:11.742808522Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:a8414e83431d0ca80b8db3aae569bc1497b6b059b33577f2f83e4caecc076361",
        "ResolvConfPath": "/var/lib/docker/containers/7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c/hostname",
        "HostsPath": "/var/lib/docker/containers/7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c/hosts",
        "LogPath": "/var/lib/docker/containers/7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c/7f85adbb6b0b453d5d3e888889991ec528baa120afdb4b0b3a60f02b59324a6c-json.log",
        "Name": "/plugin-1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "unconfined",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/alloc:/alloc",
                "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/plugin/local:/local",
                "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/plugin/secrets:/secrets"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {
                    "max-file": "2",
                    "max-size": "2m"
                }
            },
            "NetworkMode": "default",
            "PortBindings": null,
            "RestartPolicy": {
                "Name": "",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "CgroupnsMode": "host",
            "Dns": null,
            "DnsOptions": null,
            "DnsSearch": null,
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": true,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "label=disable"
            ],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 256,
            "Memory": 134217728,
            "NanoCpus": 0,
            "CgroupParent": "cpuset",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": -1,
            "MemorySwappiness": 0,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "Mounts": [
                {
                    "Type": "bind",
                    "Source": "/",
                    "Target": "/host",
                    "ReadOnly": true,
                    "BindOptions": {}
                },
                {
                    "Type": "bind",
                    "Source": "/opt/nomad/data/client/csi/plugins/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e",
                    "Target": "/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
                    "BindOptions": {
                        "Propagation": "rshared"
                    }
                },
                {
                    "Type": "bind",
                    "Source": "/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
                    "Target": "/local/csi",
                    "BindOptions": {
                        "Propagation": "rshared"
                    }
                },
                {
                    "Type": "bind",
                    "Source": "/dev",
                    "Target": "/dev",
                    "BindOptions": {
                        "Propagation": "rprivate"
                    }
                }
            ],
            "MaskedPaths": null,
            "ReadonlyPaths": null
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/cba60653c4ce6743570250e571dce17dbe302dfe849a4de9101834dab4ff846e-init/diff:/var/lib/docker/overlay2/3414806ddbd29b6a4d3f8541a00deaff76bc8b66bf28f6cb92fecc65f216abad/diff:/var/lib/docker/overlay2/b276823b9b0577260de440c35fc6fbad7b060452ae5374804bc713794de3c10d/diff:/var/lib/docker/overlay2/4974a8f2a57e177f41c053ef1af77f17fa045d8535d8119ed252c17c3034145e/diff",
                "MergedDir": "/var/lib/docker/overlay2/cba60653c4ce6743570250e571dce17dbe302dfe849a4de9101834dab4ff846e/merged",
                "UpperDir": "/var/lib/docker/overlay2/cba60653c4ce6743570250e571dce17dbe302dfe849a4de9101834dab4ff846e/diff",
                "WorkDir": "/var/lib/docker/overlay2/cba60653c4ce6743570250e571dce17dbe302dfe849a4de9101834dab4ff846e/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/dev",
                "Destination": "/dev",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/alloc",
                "Destination": "/alloc",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/plugin/local",
                "Destination": "/local",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/alloc/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e/plugin/secrets",
                "Destination": "/secrets",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/",
                "Destination": "/host",
                "Mode": "",
                "RW": false,
                "Propagation": "rslave"
            },
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/client/csi/plugins/1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e",
                "Destination": "/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
                "Mode": "",
                "RW": true,
                "Propagation": "rshared"
            },
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
                "Destination": "/local/csi",
                "Mode": "",
                "RW": true,
                "Propagation": "rshared"
            }
        ],
        "Config": {
            "Hostname": "7f85adbb6b0b",
            "Domainname": "",
            "User": "0",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "CSI_ENDPOINT=unix:///opt/nomad/data/client/csi/monolith/beegfs-plugin0/csi.sock",
                "NOMAD_ALLOC_DIR=/alloc",
                "NOMAD_ALLOC_ID=1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e",
                "NOMAD_ALLOC_INDEX=0",
                "NOMAD_ALLOC_NAME=beegfs.csi[0]",
                "NOMAD_CPU_LIMIT=256",
                "NOMAD_DC=dc1-f2",
                "NOMAD_GROUP_NAME=csi",
                "NOMAD_JOB_ID=beegfs",
                "NOMAD_JOB_NAME=beegfs",
                "NOMAD_MEMORY_LIMIT=128",
                "NOMAD_NAMESPACE=default",
                "NOMAD_PARENT_CGROUP=/nomad",
                "NOMAD_REGION=global",
                "NOMAD_SECRETS_DIR=/secrets",
                "NOMAD_TASK_DIR=/local",
                "NOMAD_TASK_NAME=plugin",
                "PATH=/netapp://usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt"
            ],
            "Cmd": [
                "--driver-name=beegfs.csi.netapp.com",
                "--client-conf-template-path=/host/etc/beegfs/beegfs-client.conf",
                "--cs-data-dir=/opt/nomad/data/client/csi/monolith/beegfs-plugin0",
                "--config-path=/local/csi-beegfs-config.yaml",
                "--connauth-path=/secrets/csi-beegfs-connauth.yaml",
                "--v=5",
                "--endpoint=unix://opt/nomad/data/client/csi/monolith/beegfs-plugin0/csi.sock",
                "--node-id=node-0"
            ],
            "Image": "netapp/beegfs-csi-driver:v1.2.1",
            "Volumes": null,
            "WorkingDir": "/",
            "Entrypoint": [
                "beegfs-csi-driver"
            ],
            "OnBuild": null,
            "Labels": {
                "com.hashicorp.nomad.alloc_id": "1aab2b04-1ecb-754f-3f7a-dcd7cfc3a85e",
                "description": "BeeGFS CSI Driver",
                "maintainers": "NetApp",
                "revision": "v1.2.1-0-g316c1cd"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "95252430e6b58cd1788ecf80a737109d95249674b0455d64637640bfea105259",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/95252430e6b5",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "b088a0b0a4a011f0c9c2eb88b8ff37e288956b139e37f512bb45b1c6d8a19f26",
            "Gateway": "172.17.0.1",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "IPv6Gateway": "",
            "MacAddress": "02:42:ac:11:00:02",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "5be822a1bfacba34251a8d16e1c47102f0c9c9bfde84112b0652ae1e2bf5ba4a",
                    "EndpointID": "b088a0b0a4a011f0c9c2eb88b8ff37e288956b139e37f512bb45b1c6d8a19f26",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02",
                    "DriverOpts": null
                }
            }
        }
    }
]

from beegfs-csi-driver.

ejweber commented on June 12, 2024

I am actively reworking our Nomad support now (though only in my spare time for the moment). I have a Nomad cluster up and have started to work through the issues you experienced. As best I can tell, there HAVE been changes to the way Nomad handles CSI paths since our original implementation, and, as I guessed above, the unique need for our driver and the host to agree on the full path to configuration files is causing problems.

I fixed your initial CreateVolume issue with a new bind mount (/opt/nomad/client/csi/monolith/beegfs-plugin0:/opt/nomad/client/csi/monolith/beegfs-plugin0) that allows the controller service (running in the container) and beegfs-ctl/mount (outside the container) to agree on the location of beegfs-client.conf. There were some additional code changes required to make this workable, so I don't recommend trying it with a v1.2.2 container.
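For illustration, a minimal sketch of what such a bind mount could look like in the plugin.nomad Docker task config (the exact stanza and image tag here are assumptions, not the final manifest; the paths mirror the ones above):

```hcl
task "plugin" {
  driver = "docker"

  config {
    image = "netapp/beegfs-csi-driver:v1.2.2"

    # Bind the plugin directory to the identical path inside the container so
    # the controller service (inside) and beegfs-ctl/mount (on the host) agree
    # on the full path to the written beegfs-client.conf.
    mount {
      type     = "bind"
      source   = "/opt/nomad/client/csi/monolith/beegfs-plugin0"
      target   = "/opt/nomad/client/csi/monolith/beegfs-plugin0"
      readonly = false
    }
  }
}
```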

The node service issue you ran into is a bigger challenge. Nomad now bind mounts /opt/nomad/client/csi/monolith/beegfs-plugin0:/local/csi automatically for CSI drivers and provides staging_target_paths in NodeStageVolume like /local/csi/.... (For what it's worth, Kubernetes provides absolute staging_target_paths like /var/lib/kubelet/plugins/beegfs.csi.netapp.com, which are much easier for us to deal with.) Many drivers don't care, as all of their userspace utilities run inside the driver container and have a synchronized view of the file system. However, we choose not to package beegfs-ctl and the core utilities in our container.

We're talking through fixes, but we may need something like an additional command line argument that helps the driver modify the staging_target_paths it is provided before attempting to mount.
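For illustration only (the paths and prefix here are hypothetical, and no such flag exists yet): such an argument would essentially let the driver rebase the container-relative path Nomad supplies onto the host-visible plugin directory before issuing the mount, roughly:

```shell
#!/bin/sh
# Hypothetical path rebase: Nomad hands the driver a container-relative
# staging path under /local/csi; host-side mount utilities need the
# equivalent path under the host's plugin mount_dir.
STAGING_TARGET_PATH="/local/csi/staging/vol1/rw-file-system-multi-node-multi-writer"
CONTAINER_PREFIX="/local/csi"
HOST_MOUNT_DIR="/opt/nomad/client/csi/monolith/beegfs-plugin0"

# Strip the container prefix and prepend the host mount dir.
HOST_PATH="${HOST_MOUNT_DIR}${STAGING_TARGET_PATH#"$CONTAINER_PREFIX"}"
echo "$HOST_PATH"
```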


ejweber commented on June 12, 2024

We have identified a twofold path to get Nomad working again.

  1. We created hashicorp/nomad#13263 and hashicorp/nomad#13919 to make it possible for the driver (as it exists today) to deal with the staging_target_paths and target_paths Nomad provides. It's not clear if/when these changes will be incorporated into Nomad.
  2. We are completely reworking the Nomad manifests to make use of the proposed Nomad changes (and just work better in general). These changes will be in the next version (v1.3.0) of the driver, but they will only work with our internal builds of Nomad until our proposed changes to Nomad are incorporated.

