det-lab / jupyterhub-deploy-kubernetes-jetstream
CDMS JupyterHub deployment on XSEDE Jetstream
This project was forked from zonca/jupyterhub-deploy-kubernetes-jetstream.
Make sure that we can have only the master node running and still have one single-user session working.
[fedora@k8s-4qtmvqk6gv47-minion-0 ~]$ sudo docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/supercdms/cdms-jupyterlab 1.8b 764e36e089da 5 weeks ago 20 GB
docker.io/jupyterhub/k8s-image-awaiter 0.8.2 938cb370f906 10 months ago 4.15 MB
docker.io/jupyterhub/k8s-network-tools 0.8.2 02576979bd59 13 months ago 5.62 MB
gcr.io/kubernetes-helm/tiller v2.11.0 ac5f7ee9ae7e 16 months ago 71.8 MB
gcr.io/google_containers/kubernetes-dashboard-amd64 v1.8.3 0c60bcf89900 23 months ago 102 MB
docker.io/coredns/coredns 1.0.1 58d63427cdea 2 years ago 45.1 MB
gcr.io/google_containers/pause 3.0 99e59f495ffa 3 years ago 747 kB
On the master node:
REPOSITORY TAG IMAGE ID CREATED SIZE
quay.io/kubernetes-ingress-controller/nginx-ingress-controller 0.24.1 98675eb54d0e 9 months ago 631 MB
k8s.gcr.io/defaultbackend-amd64 1.5 b5af743e5984 16 months ago 5.13 MB
docker.io/k8scloudprovider/openstack-cloud-controller-manager v0.2.0 5b5ea0c144e8 18 months ago 39.4 MB
gcr.io/google_containers/heapster-amd64 v1.4.2 d4e02f5922ca 2 years ago 73.4 MB
gcr.io/google_containers/pause 3.0 99e59f495ffa 3 years ago 747 kB
We don't have the single-user image on the master node, but we want the node schedulable (just for one user); we need to configure this.
Originally posted by @zonca in #3 (comment)
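For reference, a sketch of one way to make the master schedulable for user pods: remove its NoSchedule taint. The node name here is a guess based on the listing above, and the taint key can differ between Kubernetes setups:
# Allow user pods to be scheduled on the master
# (node name and taint key are assumptions, verify with `kubectl get nodes` / `kubectl describe node`):
kubectl taint nodes k8s-4qtmvqk6gv47-master-0 node-role.kubernetes.io/master:NoSchedule-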
Zero-to-JupyterHub released 0.10 with JupyterHub 1.2.1;
we currently have 0.9 with JupyterHub 1.1.0.
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md
In principle it should be an easy upgrade. I can do it directly in the production instance, and if anything stops working I should be able to revert to the old version.
I will do it when I see no one is connected; I'll announce it here and tag some users.
@thathayhaykid and I are running into an issue with lost data:
Files saved to /home/jovyan and /home/jovyan/work disappear when the server is restarted.
@pibion do you also want a Pangeo-like capability for a user to request a cluster of Dask workers so that they can run in parallel?
If so, do you have an example of such a distributed computation?
Also, the data: how are Jupyter Notebook users going to access data, and where do we store it?
Several people have tried to upload data files (larger than 100 MB, less than 1 GB) but have run into "file too large to upload" errors.
Is it possible (and wise) to increase this limit?
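One place this limit can come from, given the nginx ingress controller in the image list above, is the controller's default body-size cap. Raising it would look roughly like this (the ingress name and namespace are assumptions):
# Raise the nginx ingress request-body limit for the hub's ingress:
kubectl annotate ingress jupyterhub -n jhub nginx.ingress.kubernetes.io/proxy-body-size=2g --overwrite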
We now have JupyterLab 2.x on the user images.
I am testing JupyterLab 3; the nice thing is that it is now a lot easier to install extensions because they are plain Python packages.
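For example, installing an extension is now a plain pip install with no Node.js rebuild step (jupyterlab-git is just an illustrative choice, not necessarily one we will ship):
pip install jupyterlab-git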
I have deployed Prometheus and Grafana; this is useful to check the health of the production deployment.
I followed and updated the tutorial at https://zonca.dev/2019/04/kubernetes-monitoring-prometheus-grafana.html
See the README in the secret repository for instructions on how to access: https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/blob/master/README.md
Some example dashboards
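For reference, a minimal sketch of that kind of setup with the community Helm charts; the release names and namespace here are assumptions, see the tutorial above for the actual values used:
# Add the chart repositories and install Prometheus + Grafana into a monitoring namespace:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace
helm install grafana grafana/grafana --namespace monitoring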
I think we are at a good point in testing the deployment; we can start to think about setting up a permanent URL.
Does CDMS or your institution have a procedure to set that up?
Or should I investigate whether we can get a *.xsede.org domain?
Hi @zonca , we have some users that are likely going to need a lot of RAM for some upcoming analysis (@ziqinghong). The goal is to eventually eliminate the need for > 10 GB but for now there are times when it would be helpful to have a whole lot of RAM available.
I wonder if it's possible to make it an option for users to request a large or extra-large instance. If this is difficult please don't worry about it. I thought I'd ask mainly to discuss the possibility.
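This should be doable: Zero-to-JupyterHub supports a profileList that shows users a menu of sizes at spawn time. A sketch, assuming the usual config.yaml workflow (profile names and sizes are placeholders, not our actual values):
# Append a spawn-time size menu to the Helm config and apply it:
cat >> config.yaml <<'EOF'
singleuser:
  profileList:
    - display_name: "Default (5 GB RAM)"
      default: true
    - display_name: "Large (16 GB RAM)"
      kubespawner_override:
        mem_guarantee: "16G"
        mem_limit: "16G"
EOF
helm upgrade jhub jupyterhub/jupyterhub --values config.yaml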
Analyzers using matplotlib often want to format their axis titles with LaTeX and this requires dependencies that are not installed with the matplotlib package.
@bloer has considered putting these dependencies into CVMFS, but LaTeX isn't so easy to compile and we're considering instead installing the dependencies on some of the underlying systems, at least as a stopgap.
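If we do bake it into the image, a sketch of the Dockerfile addition (CentOS 7 base; package names are assumptions and untested; matplotlib's usetex mode needs latex, dvipng, and ghostscript):
# Install the LaTeX toolchain matplotlib's usetex mode relies on:
RUN yum install -y texlive texlive-dvipng texlive-type1cm ghostscript && yum clean all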
@zonca I just wanted to check in with this plan. I can edit the dockerfile at https://github.com/zonca/docker-jupyter-cdms-light and put in a pull request. I was wondering if we have a way of monitoring how long it takes instances to spin up? I'm worried the extra dependencies will make it take noticeably longer, and it's already fairly slow. It'd be nice to see histograms of what people encounter.
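On monitoring spin-up time: JupyterHub exposes Prometheus metrics, including a spawn-duration histogram, so the Grafana deployment mentioned elsewhere here could chart exactly those histograms. A quick manual check might look like this (the endpoint may require authentication depending on the hub configuration):
# Inspect the hub's spawn-duration histogram buckets:
curl -s https://supercdms.jetstream-cloud.org/hub/metrics | grep jupyterhub_server_spawn_duration_seconds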
zonca/docker-cvmfs-client@9e882e6
Testing the build; this simplifies deployment and redeployment because the key is built into the image.
Next: check the build logs and test at the next redeployment.
@pibion the allocation ends on October 27th, I suggest you ask first for an extension of 6 months, see https://portal.xsede.org/allocations/policies#356 (in the text also mention you'd like to extend ECSS).
You can check how many hours are left on the allocation and decide how much of a supplement we might need, if any; we expect to keep using hours at the same rate as the last 3-4 months.
Then, in December, you could apply for a renewal (which starts in April): https://portal.xsede.org/allocations/research#xracquarterly
This is in no way a priority, but I wanted to record the idea somewhere before I forget it:
It might be nice to have the Theia IDE available in the JupyterHub environment. It looks like there is some support for this: https://jupyter-server-proxy.readthedocs.io/en/latest/convenience/packages/theia.html.
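Per those docs, the integration runs through jupyter-server-proxy installed in the user image; a minimal first step (the Theia launcher itself would be set up following the linked page):
# jupyter-server-proxy is the base package the Theia convenience integration builds on:
pip install jupyter-server-proxy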
Autoscaling support for our deployment from the official project is still very far away, see zonca#34
So I implemented a hacked-up script myself (see https://zonca.dev/2021/01/autoscaling_script_kubespray_jupyterhub.html); I tested it on a sample deployment and it worked just fine.
Now I would like to deploy it on CDMS (after #48); the main issue is that scaling up is very slow: it takes about 20 min to get another node up.
With only the master node running, which supports 1 user (with 5 GB of memory), a second concurrent user trying to connect will have to wait 20 min for their session to start.
@pibion what do you think? We could also try a couple of weeks with 1 node and then decide.
I would like to set this up only after the redeploy (#48), so there is lots of time to discuss this.
Got a notification at 6am PT that the hub is down, investigating now
@pibion
Until now I have been using 0.8.2.
0.9.0 was released 2 weeks ago; it upgrades to JupyterHub 1.1.0.
See changelog for 0.9.0: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#090---2020-04-15
The GitLab authentication (#18) was not working with 0.8.2; it works fine with 0.9.0.
I authorized the whole SuperCDMS group to login, please let me know if there are issues.
The deployment is now at https://supercdms.jetstream-cloud.org/
Warning: I had to temporarily disable persistent storage because of some conflicts in the new release; I'll work on fixing it in the next few days and update this issue. This only affects the home directories of the Jupyter Notebook users; the data folder is not affected.
So you have some local storage, but if your session is killed, all your data is lost.
I started working with a tiny server option (1 CPU), but then realized that I needed a default server option (4 CPUs). However, I am unsure how to change the machine allocation from tiny to default.
Can you please add me to the allocation if you haven't done it yet?
Can you also write the allocation ID here?
I'm wondering if it would be possible to configure the Jupyter notebooks to open with the most recent CDMS kernel by default?
This may be related to an email I just sent regarding the possibility of running install_cdms_kernels during the image spawning process.
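If install_cdms_kernels can run at spawn time, one hedged way to wire it in is Zero-to-JupyterHub's postStart lifecycle hook; this is a sketch and assumes the script is on the image's PATH:
# Run the kernel-install script in each user pod right after it starts:
cat >> config.yaml <<'EOF'
singleuser:
  lifecycleHooks:
    postStart:
      exec:
        command: ["/bin/sh", "-c", "install_cdms_kernels || true"]
EOF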
We're trying to use DataCat (http://titus.stanford.edu:8080/git/summary/?r=DataHandling/DataCat.git) to grab data files and store them on /cvmfs/data as needed.
However, this mode of copying data requires users logged in to JupyterHub to have write privileges in /cvmfs/data, which they currently don't. @thathayhaykid will follow up with a way to reproduce the issue.
@bloer, @zonca, do you have any thoughts on ways to handle this? Maybe we could update DataCat to connect to SLAC and run the copy from there. People would have to make sure they've got an ssh key and config set up properly, but that's maybe reasonable.
When trying to access https://supercdms.jetstream-cloud.org/ I get a "Site cannot be reached" error, even though my internet connection is perfectly fine.
Any help would be appreciated, thanks!
After the discussion in #7, I have a prototype deployment of Dask Gateway accessible to the SuperCDMS users, see #43.
The documentation on how to use it is as usual at https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream.
When someone in the team is ready to test it, @pibion, please first reply here and let me test it again; it has many moving parts and I would like to make sure it is usable before you spend time on it.
@pibion I was thinking we could evaluate at some point how difficult it would be to set up CVMFS access for JupyterHub users:
https://cernvm.cern.ch/portal/filesystem
Let's first finalize the work on the JupyterHub deployment and then consider if we should add this to the workplan.
We've had a user request gfortran. I think I can just add this to https://github.com/zonca/docker-jupyter-cdms-light/blob/master/Dockerfile, is that correct?
The images won't be rebuilt and deployed automatically after this change, will they?
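For reference, since the image is based on CentOS 7, the change itself should be a one-line addition to that Dockerfile (gcc-gfortran is the CentOS package name):
# Add the GNU Fortran compiler to the user image:
RUN yum install -y gcc-gfortran && yum clean all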
Right now we have only vi - would it be possible to add nano, vim, and emacs?
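Same pattern as for gfortran above; a sketch with the usual CentOS 7 package names (untested):
# Add common terminal editors to the user image:
RUN yum install -y nano vim-enhanced emacs-nox && yum clean all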
The Jetstream team is working on a newer Kubernetes environment,
within 1 or 2 weeks they will notify me about availability and I will tear down the deployment and rebuild it again on top of the new environment.
I'm trying to access a JupyterHub instance, but it seems to be taking a very long time to spin up.
Is anyone else having this issue? @zonca @ziqinghong
I started to update the user images, my work is based on the images released by the Jupyter team:
https://github.com/jupyter/docker-stacks
but modified to be based on CentOS 7 instead of Ubuntu:
https://github.com/zonca/jupyter-docker-stacks-centos7
So we will get a newer version of JupyterLab, and I'll try to fix #25 and #27
I'll post a notification when this is deployed.
In order to mount the NFS share I have to manually install the nfs-common package on the nodes.
To do: check if I can add this to kubespray, or ask Jetstream if they can add it to the images.
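Until that lands in kubespray or the base images, a one-off with Ansible against the kubespray inventory might look like this (the inventory path is an assumption, and nfs-common assumes Ubuntu nodes):
# Install nfs-common on every node in the inventory:
ansible all -i inventory/mycluster/hosts.yaml -b -m apt -a "name=nfs-common state=present"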
When trying to start with a default server configuration, I get this error, and the loading for the server stalls:
2021-03-18T00:02:34.528016Z [Warning] 0/2 nodes are available: 2 Insufficient cpu.
Amy checked the admin panel, and nobody else seems to be using any nodes.
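The "Insufficient cpu" message refers to reserved CPU requests rather than actual usage, so system pods alone can fill up a node's allocatable CPU even when no users are active. A quick generic check of what is holding the requests:
# Show per-node reserved CPU/memory requests and limits:
kubectl describe nodes | grep -A 8 "Allocated resources"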
Found an issue with NFS sharing: it works fine if the Jupyter Notebook pod is on the same node, but it doesn't work across nodes, which is strange because Kubernetes networking should handle that automatically.
To debug, I deployed 2 pods, one on the master and one on the minion, then ran:
telnet 10.254.77.77 111
It connects on the master (where the CVMFS/NFS pod is located) but hangs on the other node.
I hadn't noticed this behavior before, so maybe the issue is that there's now more data in the directory? Here is an output showing the issue:
bash-4.2$ time ls
lost+found
real 0m0.005s
user 0m0.005s
sys 0m0.000s
bash-4.2$ time ls /cvmfs/data
CDMS lost+found not_writable test_file test_file_2 tf
real 0m40.055s
user 0m0.000s
sys 0m0.007s
Whenever I initially log in to JupyterLab, I am unable to use ssh, getting the error: "Bad owner or permissions on /home/jovyan/.ssh/config". I am able to fix this by changing the permissions on the config file (chmod 600 config).
I received a timeout error while booting up JupyterHub on the Tiny server configuration. I had previously received the same timeout error when booting up on the Default server configuration, and then decided to retry with a Tiny server configuration.
Here is the event log for the Tiny configuration:
Server requested
2021-02-25 00:15:24+00:00 [Normal] Successfully assigned jhub/jupyter-zkromer to zonca-k8s-master-1
2021-02-25 00:15:27+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r6a6a768faa154a8583abc7c98ca0be7b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:30+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r724eb5e9a9794a3994f3564ef0e04832.scope mount.nfs: Operation not permitted
2021-02-25 00:15:33+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r7f54813ea3f249bd8e4d1e4d12eca05b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:36+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r1b65550e5a7649f89ca73052fa5fd8a1.scope mount.nfs: Operation not permitted
2021-02-25 00:15:41+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-rfc770abe56ee49df86fb77fe7106cfb4.scope mount.nfs: Operation not permitted
2021-02-25 00:15:51+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r084be4dee5284884b974cb8a0de309db.scope mount.nfs: Operation not permitted
2021-02-25 00:16:01+00:00 [Normal] AttachVolume.Attach succeeded for volume "pvc-666b461d-e7d0-43f7-8747-44483d7b19a8"
2021-02-25 00:16:10+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r308a0c45b6df4c4fa77e087b71e1df91.scope mount.nfs: Operation not permitted
2021-02-25 00:16:43+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r0c099fa65d754c779938401dbc1fa3e2.scope mount.nfs: Operation not permitted
2021-02-25 00:17:27+00:00 [Warning] Unable to attach or mount volumes: unmounted volumes=[cvmfs-nfs-volume], unattached volumes=[volume-zkromer cvmfs-nfs-volume]: timed out waiting for the condition
2021-02-25 00:17:49+00:00 [Warning] (combined from similar events): MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r35c2a8d4c7bb40b1bdf8d111076efd07.scope mount.nfs: Operation not permitted
Spawn failed: pod/jupyter-zkromer did not start in 600 seconds!
In our collaboration we sometimes whitelist ranges of IP addresses for easier access to collaboration resources.
Is there a range of IP addresses for the JupyterHub instances people spin up that we could ask our colleague to whitelist?
Prompted by #46, let's debug what takes so long in starting up a session:
Here is the Kubernetes log:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m41s jhub-user-scheduler Successfully assigned jhub/jupyter-zonca to zonca-k8s-node-nf-1
Normal SuccessfulAttachVolume 2m8s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-a4d0d8fb-46b4-4dbc-b5cb-3bad7ced6280"
Normal Pulled 79s kubelet Container image "jupyterhub/k8s-network-tools:0.9.0" already present on machine
Normal Created 78s kubelet Created container block-cloud-metadata
Normal Started 78s kubelet Started container block-cloud-metadata
Normal Pulled 75s kubelet Container image "zonca/docker-jupyter-cdms-light:2020.07.07" already present on machine
Normal Created 75s kubelet Created container notebook
Normal Started 74s kubelet Started container notebook
Data is currently all stored at SLAC. There is a "data catalog" python library that allows users to query for data paths. If they don't exist locally they're downloaded to disk.
Is it possible to have a storage disk that is mounted to everyone's container? For initial testing 50 GB would be more than enough. If we want to try to support full CDMS analysis efforts that's more like 10 TB.
Originally posted by @pibion in #7 (comment)
Now that the issue with volumes is fixed (zonca#23), start working on this and log progress here.
@zonca , @00KevinMac and @farnoush-bk are doing some machine learning work that requires GPU computing.
@00KevinMac has taken a look at XSEDE resources, and allocations on pretty much any of their GPU resources (e.g. Bridges or Comet) would serve well.
I'm wondering if it's possible to access these resources from the JetStream JupyterHub, or if JetStream has GPU VMs.
I'd like to stay away from installing the CDMS software on new systems - although maybe that's a fundamental requirement if we want to use other systems? Our experience within the collaboration of getting a system to work with CDMS software - even if it already supports CVMFS - is that it's a real pain.
So another option is: we use the XSEDE JupyterHub for CPU tasks and send people elsewhere for GPU tasks.
A word of warning - it seems that the very first imports that pull in CDMS python packages are failing. This is probably because the CVMFS environment doesn't install those. @bloer is the authority on this, though.
Originally posted by @pibion in #8 (comment)
I would like some details about the Python packages for CDMS analysis.
From @pibion:
I'm wondering if there's a way to access the existing storage volumes, and I'm also wondering if there's a way to set a permanent backup with e.g. the Open Storage Network (CDMS has an allocation there now). Adding @glass-ships and @thathayhaykid as they might be interested in thinking about this.
Conclusion of #8 is that the object store is not suitable.
The two remaining options are:
1. Use Manila on Jetstream, which provides an NFS service managed by OpenStack, so we don't have to run it ourselves. This gives a standard read/write filesystem we can mount on all pods.
2. Deploy our own NFS server; we can probably reuse the NFS server we already run for CVMFS to also serve this 50 GB volume read/write.
I have never used Manila before, so I would rather use our own NFS server; we can run some benchmarks later.
So the plan is to have one pod that mounts a large volume read/write and exposes an SSH port with certificate-only access, so the data can be copied there with rsync. Down the road we could deploy a Globus endpoint. This pod then runs an NFS server that shares the data read-only with the Jupyter Notebook pods.
I haven't decided yet whether this should be a standalone pod or the same pod as CVMFS; I'll track progress in this issue.
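A sketch of what copying data in could look like under that plan; the host, port, user, key, and paths are all placeholders, not actual deployment values:
# Copy a local directory to the read/write volume over the exposed SSH port:
rsync -avP -e "ssh -i ~/.ssh/cdms_data_key -p 30022" local_data/ datacopy@supercdms.jetstream-cloud.org:/export/data/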