det-lab / jupyterhub-deploy-kubernetes-jetstream

This project forked from zonca/jupyterhub-deploy-kubernetes-jetstream


CDMS JupyterHub deployment on XSEDE Jetstream

Shell 49.57% Dockerfile 3.75% Python 8.69% Jupyter Notebook 31.91% Makefile 6.09%

jupyterhub-deploy-kubernetes-jetstream's People

Contributors

glass-ships, pibion, zonca

Forkers

pandeylekhraj

jupyterhub-deploy-kubernetes-jetstream's Issues

Test master-only configuration

Make sure that we can run with only the master node active and still have one single-user session working.

Images pulled to the nodes

Worker node

[fedora@k8s-4qtmvqk6gv47-minion-0 ~]$ sudo docker images
REPOSITORY                                            TAG                 IMAGE ID            CREATED             SIZE
docker.io/supercdms/cdms-jupyterlab                   1.8b                764e36e089da        5 weeks ago         20 GB
docker.io/jupyterhub/k8s-image-awaiter                0.8.2               938cb370f906        10 months ago       4.15 MB
docker.io/jupyterhub/k8s-network-tools                0.8.2               02576979bd59        13 months ago       5.62 MB
gcr.io/kubernetes-helm/tiller                         v2.11.0             ac5f7ee9ae7e        16 months ago       71.8 MB
gcr.io/google_containers/kubernetes-dashboard-amd64   v1.8.3              0c60bcf89900        23 months ago       102 MB
docker.io/coredns/coredns                             1.0.1               58d63427cdea        2 years ago         45.1 MB
gcr.io/google_containers/pause                        3.0                 99e59f495ffa        3 years ago         747 kB

Master node

REPOSITORY                                                       TAG                 IMAGE ID            CREATED             SIZE
quay.io/kubernetes-ingress-controller/nginx-ingress-controller   0.24.1              98675eb54d0e        9 months ago        631 MB
k8s.gcr.io/defaultbackend-amd64                                  1.5                 b5af743e5984        16 months ago       5.13 MB
docker.io/k8scloudprovider/openstack-cloud-controller-manager    v0.2.0              5b5ea0c144e8        18 months ago       39.4 MB
gcr.io/google_containers/heapster-amd64                          v1.4.2              d4e02f5922ca        2 years ago         73.4 MB
gcr.io/google_containers/pause                                   3.0                 99e59f495ffa        3 years ago         747 kB

We don't have the single-user image pulled on the master node, but the node is schedulable (for just one user); we need to configure this.

Originally posted by @zonca in #3 (comment)

Distributed computing with dask

@pibion do you also want a Pangeo-like capability where a user can request a cluster of dask workers so that they can run computations in parallel?

In this case, do you have an example of such distributed computation?

Also, the data: how are Jupyter Notebook users going to access data, and where do we store it?
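
For concreteness, here is a minimal sketch of the kind of Pangeo-style parallel workload this would enable. It assumes dask and dask.distributed are available in the user image, and uses a LocalCluster only as a stand-in for whatever dask-gateway / dask-kubernetes setup we would actually deploy:

# Minimal sketch of a parallel dask workload (assumes dask[distributed] is installed).
# In a Pangeo-style deployment the user would request a Kubernetes-backed cluster
# (e.g. via dask-gateway); LocalCluster is used here only as a local stand-in.
import dask.array as da
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Example "distributed computation": reduce a large random array chunk by chunk.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = x.mean().compute()
print(f"mean = {result:.6f}")

client.close()
cluster.close()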

Update Jupyterlab in user container

We currently have JupyterLab 2.x on the user images.
I am testing JupyterLab 3; the nice thing is that extensions are now much easier to install because they are plain Python packages.

Setup permanent domain name

I think we are at a good point in testing the deployment,
so we can start thinking about setting up a permanent URL.
Does CDMS or your institution have a procedure to set that up?
Or should I investigate whether we can get a *.xsede.org domain?

Allow users to spin up large or extra-large instances?

Hi @zonca, we have some users who are likely going to need a lot of RAM for some upcoming analysis (@ziqinghong). The goal is to eventually eliminate the need for > 10 GB, but for now there are times when it would be helpful to have a whole lot of RAM available.

I wonder if it's possible to give users the option to request a large or extra-large instance. If this is difficult, please don't worry about it; I thought I'd ask mainly to discuss the possibility.
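
For reference, one common way to offer this with zero-to-jupyterhub is a spawner profile list, so users pick a size on the spawn page. Below is a rough sketch at the KubeSpawner level (e.g. placed in hub.extraConfig); the profile names and CPU/memory numbers are placeholders, not actual Jetstream flavors:

# Sketch: let users choose an instance size at spawn time via KubeSpawner profiles.
# The names and CPU/memory numbers below are illustrative placeholders only.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Default (4 CPU, 10 GB RAM)",
        "default": True,
        "kubespawner_override": {"cpu_limit": 4, "mem_limit": "10G"},
    },
    {
        "display_name": "Large (8 CPU, 30 GB RAM)",
        "kubespawner_override": {"cpu_limit": 8, "mem_limit": "30G"},
    },
    {
        "display_name": "Extra large (16 CPU, 60 GB RAM)",
        "kubespawner_override": {"cpu_limit": 16, "mem_limit": "60G"},
    },
]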

Providing latex and fonts for matplotlib

Analyzers using matplotlib often want to format their axis titles with LaTeX, and this requires dependencies that are not installed with the matplotlib package.

@bloer has considered putting these dependencies into CVMFS, but LaTeX isn't so easy to compile and we're considering instead installing the dependencies on some of the underlying systems, at least as a stopgap.

@zonca I just wanted to check in about this plan. I can edit the Dockerfile at https://github.com/zonca/docker-jupyter-cdms-light and put in a pull request. I was also wondering whether we have a way of monitoring how long it takes instances to spin up. I'm worried the extra dependencies will make startup noticeably longer, and it's already fairly slow. It'd be nice to see histograms of what people encounter.
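
For reference, this is the kind of matplotlib usage that needs a working TeX toolchain in the image (a minimal sketch; it only succeeds once the LaTeX dependencies are installed alongside matplotlib):

# Minimal example of the matplotlib + LaTeX usage that fails without a TeX install.
# With only the matplotlib wheel present, savefig() errors out about a missing
# `latex` executable; a texlive/dvipng installation in the image fixes this.
import matplotlib
matplotlib.use("Agg")  # no display inside the container
import matplotlib.pyplot as plt

plt.rcParams["text.usetex"] = True

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel(r"$E_{\mathrm{recoil}}$ [keV]")
ax.set_ylabel(r"$\frac{dN}{dE}$ [arb. units]")
fig.savefig("latex_labels.png")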

Extend and renew XSEDE project

@pibion the allocation ends on October 27th. I suggest you first ask for a 6-month extension, see https://portal.xsede.org/allocations/policies#356 (in the text, also mention that you'd like to extend ECSS).
You can check how many hours are left on the allocation and decide how much of a supplement we might need, if any; we expect to keep using hours at the same rate as the last 3-4 months.

Then, in December, you could apply for a renewal (which would start in April): https://portal.xsede.org/allocations/research#xracquarterly

Test autoscaling

Autoscaling support for our deployment from the official project is still very far away, see zonca#34.

So I implemented a hacked-up script myself (see https://zonca.dev/2021/01/autoscaling_script_kubespray_jupyterhub.html); I tested it on a sample deployment and it worked just fine.

Now I would like to deploy it on CDMS (after #48). The main issue is that scaling up is very slow: it takes about 20 minutes to get another node up.

  1. To maximize savings we could keep only the master running, which supports one user (with 5 GB of memory); a second concurrent user trying to connect would have to wait 20 minutes for their session to start.
  2. We keep the master plus one node always running, so autoscaling is triggered only after those are full (or someone asks for a full node).

@pibion what do you think? We could also try option 1 for a couple of weeks and then decide.
I would like to set this up only after the redeploy (#48), so there is plenty of time to discuss this.

Update to zero-to-jupyterhub 0.9.0

Until now I have been using 0.8.2.

0.9.0 was released 2 weeks ago, it upgrades to JupyterHub 1.1.0.

See changelog for 0.9.0: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#090---2020-04-15

The GitLab authentication (#18) was not working with 0.8.2; it works fine with 0.9.0.
I authorized the whole SuperCDMS group to log in, please let me know if there are issues.

The deployment is now at https://supercdms.jetstream-cloud.org/

@pibion

Persistent storage disabled

Warning: I currently had to disable persistent storage because there are some conflicts in the new release. I'll work on fixing it in the next few days and update this issue. This only affects the home directories of the Jupyter Notebook users; the data folder is not affected.
So you still have some local storage, but if your session is killed, all your data is lost.

Unsure how to change machine allocation

I started working with a tiny server option (1 CPU), but then realized that I needed a default server option (4 CPUs). However, I am unsure how to change the machine allocation from tiny to default.

Setup authentication

After #17 is done, let's set up authentication.
@pibion, what are your plans for that?

Do we want to use GitHub accounts, or do you have a third-party authenticator that CDMS uses?
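
If we go with GitHub accounts, the hub-side configuration is fairly small. Below is a rough sketch using oauthenticator; the callback URL, credentials, and organization name are placeholders, and with zero-to-jupyterhub this would normally go through the auth: section of config.yaml rather than a raw jupyterhub_config.py:

# Sketch of GitHub-based authentication for JupyterHub (jupyterhub_config.py style).
# Client ID/secret come from a GitHub OAuth app; all values below are placeholders.
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.GitHubOAuthenticator.oauth_callback_url = "https://example.org/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "GITHUB_CLIENT_ID"
c.GitHubOAuthenticator.client_secret = "GITHUB_CLIENT_SECRET"
# Restrict logins to members of a collaboration organization (placeholder name);
# older oauthenticator releases call this trait github_organization_whitelist.
c.GitHubOAuthenticator.allowed_organizations = {"supercdms-example-org"}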

Add me to allocation

Can you please add me to the allocation if you haven't done so yet?
Can you also write the allocation ID here?

Automatic kernel selection

I'm wondering whether it would be possible to configure the Jupyter notebooks to open with the most recent CDMS kernel by default.

This may be related to an email I just sent regarding the possibility of running install_cdms_kernels during the image spawning process.
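
One possible approach, sketched below, is to set the default kernel name in the notebook server configuration baked into the user image, so new notebooks open with the CDMS kernel. The kernelspec name "cdms" here is a placeholder for whatever install_cdms_kernels actually registers:

# Sketch: make new notebooks open with the CDMS kernel by default
# (e.g. in a jupyter_notebook_config.py baked into the user image).
# "cdms" is a placeholder; use the actual kernelspec name, which you can
# list with `jupyter kernelspec list`.
c.MappingKernelManager.default_kernel_name = "cdms"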

/cvmfs/data read-only when logged in through JupyterHub

We're trying to use DataCat (http://titus.stanford.edu:8080/git/summary/?r=DataHandling/DataCat.git) to grab data files and store them on /cvmfs/data as needed.

However, this mode of copying data requires users logged in to JupyterHub to have write privileges in /cvmfs/data, which they currently don't have. @thathayhaykid will follow up with a way to reproduce the issue.

@bloer, @zonca, do you have any thoughts on ways to handle this? Maybe we could update DataCat to connect to SLAC and run the copy from there. People would have to make sure they have an ssh key and config set up properly, but that's maybe reasonable.
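
As a rough sketch of that stopgap (placeholder host and paths, and it assumes each user has a working ssh key/config), the copy could run from the notebook side into a writable scratch directory instead of into /cvmfs/data:

# Sketch: stage a data file from SLAC over ssh/rsync into a writable scratch area,
# instead of writing into the read-only /cvmfs/data mount.
# Host name and paths are placeholders; assumes the user's ssh key/config works.
import subprocess
from pathlib import Path

def stage_from_slac(remote_path: str,
                    remote_host: str = "user@slac-host.example.org",
                    scratch: Path = Path.home() / "data_scratch") -> Path:
    """Copy one file from SLAC into a local, writable scratch directory."""
    scratch.mkdir(parents=True, exist_ok=True)
    dest = scratch / Path(remote_path).name
    subprocess.run(
        ["rsync", "-av", f"{remote_host}:{remote_path}", str(dest)],
        check=True,
    )
    return dest

# local_file = stage_from_slac("/remote/example/raw/run0001.mid.gz")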

Redeployment on top of newer Kubernetes version

The Jetstream team is working on a newer Kubernetes environment;
within 1 or 2 weeks they will notify me about its availability, and I will then tear down the deployment and rebuild it on top of the new environment.

  • save the data volume so it can be reattached to the newer environment (I'll try my best, but there is a small possibility I will lose the data and it will need to be copied again)
  • tear down the old deployment
  • deploy Kubernetes and check that it is working correctly (especially logging, which stopped working a month ago)
  • deploy the CVMFS / NFS service, re-mounting the data volume
  • make sure the networking issue (#10) is solved
  • deploy JupyterHub
  • update the documentation if anything changed

Install nfs-common on nodes

In order to mount the NFS share, I have to manually install the nfs-common package on the nodes.

Check whether I can add this to kubespray, or ask Jetstream if they can add it to the images.

All but 2 nodes are unavailable

When trying to start with a default server configuration, I get this error, and the loading for the server stalls:
2021-03-18T00:02:34.528016Z [Warning] 0/2 nodes are available: 2 Insufficient cpu.

Amy checked the admin panel, and nobody else seems to be using any nodes.

NFS sharing issue

I found an issue with NFS sharing: it works fine if the Jupyter Notebook pod is on the same node, but it doesn't work across nodes, which is strange because Kubernetes networking should handle that automatically.

To debug, I deployed 2 pods, one on the master and one on a minion, and then ran:

telnet 10.254.77.77 111

It can connect on the master (where the CVMFS/NFS pod is located), but it hangs on the other node.

`ls /cvmfs` takes a long time

I hadn't noticed this behavior before, so maybe the issue is that there's now more data in the directory? Here is output showing the issue:

bash-4.2$ time ls
lost+found

real    0m0.005s
user    0m0.005s
sys     0m0.000s
bash-4.2$ time ls /cvmfs/data
CDMS  lost+found  not_writable  test_file  test_file_2  tf

real    0m40.055s
user    0m0.000s
sys     0m0.007s

JupyterHub boot up timeout on Tiny server configuration

I received a timeout error while starting JupyterHub with the Tiny server configuration. I had previously received the same timeout error with the Default server configuration, and then decided to retry with the Tiny configuration.

Here is the event log for the Tiny configuration:

Server requested
2021-02-25 00:15:24+00:00 [Normal] Successfully assigned jhub/jupyter-zkromer to zonca-k8s-master-1
2021-02-25 00:15:27+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r6a6a768faa154a8583abc7c98ca0be7b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:30+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r724eb5e9a9794a3994f3564ef0e04832.scope mount.nfs: Operation not permitted
2021-02-25 00:15:33+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r7f54813ea3f249bd8e4d1e4d12eca05b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:36+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r1b65550e5a7649f89ca73052fa5fd8a1.scope mount.nfs: Operation not permitted
2021-02-25 00:15:41+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-rfc770abe56ee49df86fb77fe7106cfb4.scope mount.nfs: Operation not permitted
2021-02-25 00:15:51+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r084be4dee5284884b974cb8a0de309db.scope mount.nfs: Operation not permitted
2021-02-25 00:16:01+00:00 [Normal] AttachVolume.Attach succeeded for volume "pvc-666b461d-e7d0-43f7-8747-44483d7b19a8"
2021-02-25 00:16:10+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r308a0c45b6df4c4fa77e087b71e1df91.scope mount.nfs: Operation not permitted
2021-02-25 00:16:43+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r0c099fa65d754c779938401dbc1fa3e2.scope mount.nfs: Operation not permitted
2021-02-25 00:17:27+00:00 [Warning] Unable to attach or mount volumes: unmounted volumes=[cvmfs-nfs-volume], unattached volumes=[volume-zkromer cvmfs-nfs-volume]: timed out waiting for the condition
2021-02-25 00:17:49+00:00 [Warning] (combined from similar events): MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r35c2a8d4c7bb40b1bdf8d111076efd07.scope mount.nfs: Operation not permitted
Spawn failed: pod/jupyter-zkromer did not start in 600 seconds!

Cannot start JupyterLab session

@jpanmany466 is having difficulty starting a JupyterLab session; the complaint it returns is that the spawner is timing out:

[screenshot: spawner_timeout]

IP addresses for whitelisting?

In our collaboration we sometimes whitelist ranges of IP addresses for easier access to collaboration resources.

Is there a range of IP addresses for the JupyterHub instances people spin up that we could ask our colleague to whitelist?

Test number of users per Jetstream VM

@pibion I have created a cluster with 1 master and 2 worker nodes named k8s_cdms.
First I want to test how many users I can accommodate per node.
I am worried about this given reports at zonca#23.

Once this is solved I'll continue with the planned steps.

JupyterLab spawning failure

@jpanmany is running into spawning issues again:

[screenshot: spawn_fail2]

This time it looks like the issue is that she doesn't have ssh keys, and the script is failing when it tries to set permissions.

Jupyter notebooks spawning time

Prompted by #46, let's debug what takes so long when starting up a session.

Here is the Kubernetes event log:

Events:
  Type    Reason                  Age    From                     Message
  ----    ------                  ----   ----                     -------
  Normal  Scheduled               2m41s  jhub-user-scheduler      Successfully assigned jhub/jupyter-zonca to zonca-k8s-node-nf-1
  Normal  SuccessfulAttachVolume  2m8s   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a4d0d8fb-46b4-4dbc-b5cb-3bad7ced6280"
  Normal  Pulled                  79s    kubelet                  Container image "jupyterhub/k8s-network-tools:0.9.0" already present on machine
  Normal  Created                 78s    kubelet                  Created container block-cloud-metadata
  Normal  Started                 78s    kubelet                  Started container block-cloud-metadata
  Normal  Pulled                  75s    kubelet                  Container image "zonca/docker-jupyter-cdms-light:2020.07.07" already present on machine
  Normal  Created                 75s    kubelet                  Created container notebook
  Normal  Started                 74s    kubelet                  Started container notebook
  • startup took about 1 minute 30 seconds; @pibion, do you often see startup times much longer than that?
  • it takes ~30 seconds to mount the user volume
  • however, there is another minute after that, and I don't know what is driving it (see the timing sketch below)
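
If we want actual numbers (also relevant to the monitoring question in the LaTeX/fonts issue above), one option is to time spawns through the JupyterHub REST API. A rough sketch follows; it assumes an API token with permission to start the given user's server, and the token and user name are placeholders:

# Sketch: measure how long a user server takes to spawn via the JupyterHub REST API.
# Requires an API token (e.g. from an admin user); token and user name are placeholders.
import time
import requests

HUB_API = "https://supercdms.jetstream-cloud.org/hub/api"
TOKEN = "REPLACE_WITH_API_TOKEN"
USER = "testuser"
headers = {"Authorization": f"token {TOKEN}"}

start = time.monotonic()
r = requests.post(f"{HUB_API}/users/{USER}/server", headers=headers)
r.raise_for_status()  # 201 = started immediately, 202 = spawn pending

# Poll the user model until the server is up (or give up after 10 minutes).
while time.monotonic() - start < 600:
    user = requests.get(f"{HUB_API}/users/{USER}", headers=headers).json()
    if user.get("pending") is None and user.get("server"):
        break
    time.sleep(5)

print(f"spawn took {time.monotonic() - start:.0f} s")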

Distributed data access - object store

Data is currently all stored at SLAC. There is a "data catalog" Python library that allows users to query for data paths. If the files don't exist locally, they're downloaded to disk.

Is it possible to have a storage disk that is mounted in everyone's container? For initial testing, 50 GB would be more than enough. If we want to support full CDMS analysis efforts, that's more like 10 TB.

Originally posted by @pibion in #7 (comment)
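
For context, the access pattern described above is roughly the following. This is a hypothetical sketch, not the real DataCat API; the function, cache path, and transfer step are made up for illustration:

# Hypothetical sketch of the "query for a path, download if missing" pattern
# described above. This is NOT the real DataCat API; names are illustrative only.
from pathlib import Path
import shutil

LOCAL_CACHE = Path("/data/cache")          # shared read/write disk (placeholder)
REMOTE_STORE = Path("/remote/slac/store")  # stand-in for the SLAC-side source

def get_data_file(dataset: str, filename: str) -> Path:
    """Return a local path for a catalog entry, fetching it if not cached."""
    local_path = LOCAL_CACHE / dataset / filename
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # In reality this would be an rsync/scp/DataCat transfer from SLAC.
        shutil.copy(REMOTE_STORE / dataset / filename, local_path)
    return local_path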

GPU access through Jetstream?

@zonca, @00KevinMac and @farnoush-bk are doing some machine learning work that requires GPU computing.

@00KevinMac has taken a look at XSEDE resources, and allocations on pretty much any of their GPU resources (e.g. Bridges or Comet) would serve well.

I'm wondering if it's possible to access these resources from the Jetstream JupyterHub, or if Jetstream has GPU VMs.

I'd like to stay away from installing the CDMS software on new systems, although maybe that's a fundamental requirement if we want to use other systems. Our experience within the collaboration is that getting a system to work with CDMS software, even one that already supports CVMFS, is a real pain.

So another option is: we use the XSEDE JupyterHub for CPU tasks and send people elsewhere for GPU tasks.

CDMS Python environment

A word of warning: it seems that the very first imports that pull in CDMS Python packages are failing. This is probably because the CVMFS environment doesn't install them. @bloer is the authority on this, though.

Originally posted by @pibion in #8 (comment)

I would like some details about the Python packages for CDMS analysis.

Offsite backup of user data

from @pibion:

I'm wondering if there's a way to access the existing storage volumes, and also whether there's a way to set up a permanent backup with e.g. the Open Storage Network (CDMS has an allocation there now). Adding @glass-ships and @thathayhaykid as they might be interested in thinking about this.

Distributed data access - block store

The conclusion of #8 is that an object store is not suitable.

The other two options are:

  1. Use Manila on Jetstream, which provides an NFS service managed by OpenStack, so we don't have to run it ourselves. This gives a standard read/write filesystem we can mount on all pods.

  2. Deploy our own NFS server; in fact, we can probably use the NFS server we already run for CVMFS to also serve this 50 GB volume read/write.

I have never used Manila before, so I would rather use our own NFS server; we can run some benchmarks later.

So the plan is to have one pod that mounts a large volume read-write and exposes its SSH port with certificate-only access so that data can be copied there with rsync. Down the road we could deploy a Globus endpoint. This pod then runs an NFS server that shares the data read-only with the Jupyter Notebook pods.

I haven't decided yet whether this should be a standalone pod or the same pod as CVMFS; I'll track progress in this issue.
