det-lab / jupyterhub-deploy-kubernetes-jetstream

This project forked from zonca/jupyterhub-deploy-kubernetes-jetstream


CDMS JupyterHub deployment on XSEDE Jetstream

Shell 49.57% Dockerfile 3.75% Python 8.69% Jupyter Notebook 31.91% Makefile 6.09%

jupyterhub-deploy-kubernetes-jetstream's People

Contributors

glass-ships, pibion, zonca

Forkers

pandeylekhraj

jupyterhub-deploy-kubernetes-jetstream's Issues

Test master-only configuration

Make sure that we can run with only the master node active and still have one single-user session working.

Images pulled to the nodes

Worker node

[fedora@k8s-4qtmvqk6gv47-minion-0 ~]$ sudo docker images
REPOSITORY                                            TAG                 IMAGE ID            CREATED             SIZE
docker.io/supercdms/cdms-jupyterlab                   1.8b                764e36e089da        5 weeks ago         20 GB
docker.io/jupyterhub/k8s-image-awaiter                0.8.2               938cb370f906        10 months ago       4.15 MB
docker.io/jupyterhub/k8s-network-tools                0.8.2               02576979bd59        13 months ago       5.62 MB
gcr.io/kubernetes-helm/tiller                         v2.11.0             ac5f7ee9ae7e        16 months ago       71.8 MB
gcr.io/google_containers/kubernetes-dashboard-amd64   v1.8.3              0c60bcf89900        23 months ago       102 MB
docker.io/coredns/coredns                             1.0.1               58d63427cdea        2 years ago         45.1 MB
gcr.io/google_containers/pause                        3.0                 99e59f495ffa        3 years ago         747 kB

Master node

REPOSITORY                                                       TAG                 IMAGE ID            CREATED             SIZE
quay.io/kubernetes-ingress-controller/nginx-ingress-controller   0.24.1              98675eb54d0e        9 months ago        631 MB
k8s.gcr.io/defaultbackend-amd64                                  1.5                 b5af743e5984        16 months ago       5.13 MB
docker.io/k8scloudprovider/openstack-cloud-controller-manager    v0.2.0              5b5ea0c144e8        18 months ago       39.4 MB
gcr.io/google_containers/heapster-amd64                          v1.4.2              d4e02f5922ca        2 years ago         73.4 MB
gcr.io/google_containers/pause                                   3.0                 99e59f495ffa        3 years ago         747 kB

We don't have the single-user image pulled on the master node, but the node is schedulable (for just one user); we need to configure this.

Originally posted by @zonca in #3 (comment)

Distributed computing with dask

@pibion do you also want a Pangeo-like capability where a user can request a cluster of dask workers so that they can run computations in parallel?

In this case, do you have an example of such distributed computation?

Also, the data: how are Jupyter Notebook users going to access data, and where do we store it?
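
For concreteness, here is a minimal sketch of the kind of Pangeo-style parallel workload this would enable. It assumes dask and dask.distributed are available in the user image, and uses a LocalCluster only as a stand-in for whatever dask-gateway / dask-kubernetes setup we would actually deploy:

# Minimal sketch of a parallel dask workload (assumes dask[distributed] is installed).
# In a Pangeo-style deployment the user would request a Kubernetes-backed cluster
# (e.g. via dask-gateway); LocalCluster is used here only as a local stand-in.
import dask.array as da
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Example "distributed computation": reduce a large random array chunk by chunk.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = x.mean().compute()
print(f"mean = {result:.6f}")

client.close()
cluster.close()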

Update Jupyterlab in user container

We currently have JupyterLab 2.x on the user images.
I am testing JupyterLab 3; the nice thing is that extensions are now much easier to install because they are plain Python packages.

Setup permanent domain name

I think we are at a good point in testing the deployment,
so we can start thinking about setting up a permanent URL.
Does CDMS or your institution have a procedure to set that up?
Or should I investigate whether we can get a *.xsede.org domain?

Allow users to spin up large or extra-large instances?

Hi @zonca, we have some users who are likely going to need a lot of RAM for some upcoming analysis (@ziqinghong). The goal is to eventually eliminate the need for > 10 GB, but for now there are times when it would be helpful to have a whole lot of RAM available.

I wonder if it's possible to give users the option to request a large or extra-large instance. If this is difficult, please don't worry about it; I thought I'd ask mainly to discuss the possibility.
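
For reference, one common way to offer this with zero-to-jupyterhub is a spawner profile list, so users pick a size on the spawn page. Below is a rough sketch at the KubeSpawner level (e.g. placed in hub.extraConfig); the profile names and CPU/memory numbers are placeholders, not actual Jetstream flavors:

# Sketch: let users choose an instance size at spawn time via KubeSpawner profiles.
# The names and CPU/memory numbers below are illustrative placeholders only.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Default (4 CPU, 10 GB RAM)",
        "default": True,
        "kubespawner_override": {"cpu_limit": 4, "mem_limit": "10G"},
    },
    {
        "display_name": "Large (8 CPU, 30 GB RAM)",
        "kubespawner_override": {"cpu_limit": 8, "mem_limit": "30G"},
    },
    {
        "display_name": "Extra large (16 CPU, 60 GB RAM)",
        "kubespawner_override": {"cpu_limit": 16, "mem_limit": "60G"},
    },
]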

Providing latex and fonts for matplotlib

Analyzers using matplotlib often want to format their axis titles with LaTeX, and this requires dependencies that are not installed with the matplotlib package.

@bloer has considered putting these dependencies into CVMFS, but LaTeX isn't so easy to compile and we're considering instead installing the dependencies on some of the underlying systems, at least as a stopgap.

@zonca I just wanted to check in about this plan. I can edit the Dockerfile at https://github.com/zonca/docker-jupyter-cdms-light and put in a pull request. I was also wondering whether we have a way of monitoring how long it takes instances to spin up. I'm worried the extra dependencies will make startup noticeably longer, and it's already fairly slow. It'd be nice to see histograms of what people encounter.
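
For reference, this is the kind of matplotlib usage that needs a working TeX toolchain in the image (a minimal sketch; it only succeeds once the LaTeX dependencies are installed alongside matplotlib):

# Minimal example of the matplotlib + LaTeX usage that fails without a TeX install.
# With only the matplotlib wheel present, savefig() errors out about a missing
# `latex` executable; a texlive/dvipng installation in the image fixes this.
import matplotlib
matplotlib.use("Agg")  # no display inside the container
import matplotlib.pyplot as plt

plt.rcParams["text.usetex"] = True

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel(r"$E_{\mathrm{recoil}}$ [keV]")
ax.set_ylabel(r"$\frac{dN}{dE}$ [arb. units]")
fig.savefig("latex_labels.png")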

Extend and renew XSEDE project

@pibion the allocation ends on October 27th. I suggest you first ask for a 6-month extension, see https://portal.xsede.org/allocations/policies#356 (in the text, also mention that you'd like to extend ECSS).
You can check how many hours are left on the allocation and decide how much of a supplement we might need, if any; we expect to keep using hours at the same rate as the last 3-4 months.

Then, in December, you could apply for a renewal (which would start in April): https://portal.xsede.org/allocations/research#xracquarterly

Test autoscaling

Autoscaling support for our deployment from the official project is still very far away, see zonca#34.

So I implemented a hacked-up script myself (see https://zonca.dev/2021/01/autoscaling_script_kubespray_jupyterhub.html); I tested it on a sample deployment and it worked just fine.

Now I would like to deploy it on CDMS (after #48). The main issue is that scaling up is very slow: it takes about 20 minutes to get another node up.

  1. To maximize savings we could keep only the master running, which supports one user (with 5 GB of memory); a second concurrent user trying to connect would have to wait 20 minutes for their session to start.
  2. We keep the master plus one node always running, so autoscaling is triggered only after those are full (or someone asks for a full node).

@pibion what do you think? We could also try option 1 for a couple of weeks and then decide.
I would like to set this up only after the redeploy (#48), so there is plenty of time to discuss this.

Update to zero-to-jupyterhub 0.9.0

Until now I have been using 0.8.2.

0.9.0 was released 2 weeks ago, it upgrades to JupyterHub 1.1.0.

See changelog for 0.9.0: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/CHANGELOG.md#090---2020-04-15

The GitLab authentication (#18) was not working with 0.8.2; it works fine with 0.9.0.
I authorized the whole SuperCDMS group to log in, please let me know if there are issues.

The deployment is now at https://supercdms.jetstream-cloud.org/

@pibion

Persistent storage disabled

Warning: I currently had to disable persistent storage because there are some conflicts in the new release. I'll work on fixing it in the next few days and update this issue. This only affects the home directories of the Jupyter Notebook users; the data folder is not affected.
So you still have some local storage, but if your session is killed, all your data is lost.

Unsure how to change machine allocation

I started working with a tiny server option (1 CPU), but then realized that I needed a default server option (4 CPUs). However, I am unsure how to change the machine allocation from tiny to default.

Setup authentication

After #17 is done, let's set up authentication.
@pibion, what are your plans for that?

Do we want to use GitHub accounts, or do you have a third-party authenticator that CDMS uses?
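
If we go with GitHub accounts, the hub-side configuration is fairly small. Below is a rough sketch using oauthenticator; the callback URL, credentials, and organization name are placeholders, and with zero-to-jupyterhub this would normally go through the auth: section of config.yaml rather than a raw jupyterhub_config.py:

# Sketch of GitHub-based authentication for JupyterHub (jupyterhub_config.py style).
# Client ID/secret come from a GitHub OAuth app; all values below are placeholders.
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.GitHubOAuthenticator.oauth_callback_url = "https://example.org/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "GITHUB_CLIENT_ID"
c.GitHubOAuthenticator.client_secret = "GITHUB_CLIENT_SECRET"
# Restrict logins to members of a collaboration organization (placeholder name);
# older oauthenticator releases call this trait github_organization_whitelist.
c.GitHubOAuthenticator.allowed_organizations = {"supercdms-example-org"}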

Add me to allocation

Can you please add me to the allocation if you haven't done so yet?
Can you also write the allocation ID here?

Automatic kernel selection

I'm wondering whether it would be possible to configure the Jupyter notebooks to open with the most recent CDMS kernel by default.

This may be related to an email I just sent regarding the possibility of running install_cdms_kernels during the image spawning process.
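
One possible approach, sketched below, is to set the default kernel name in the notebook server configuration baked into the user image, so new notebooks open with the CDMS kernel. The kernelspec name "cdms" here is a placeholder for whatever install_cdms_kernels actually registers:

# Sketch: make new notebooks open with the CDMS kernel by default
# (e.g. in a jupyter_notebook_config.py baked into the user image).
# "cdms" is a placeholder; use the actual kernelspec name, which you can
# list with `jupyter kernelspec list`.
c.MappingKernelManager.default_kernel_name = "cdms"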

/cvmfs/data read-only when logged in through JupyterHub

We're trying to use DataCat (http://titus.stanford.edu:8080/git/summary/?r=DataHandling/DataCat.git) to grab data files and store them on /cvmfs/data as needed.

However, this mode of copying data requires users logged in to JupyterHub to have write privileges in /cvmfs/data, which they currently don't have. @thathayhaykid will follow up with a way to reproduce the issue.

@bloer, @zonca, do you have any thoughts on ways to handle this? Maybe we could update DataCat to connect to SLAC and run the copy from there. People would have to make sure they have an ssh key and config set up properly, but that's maybe reasonable.
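
As a rough sketch of that stopgap (placeholder host and paths, and it assumes each user has a working ssh key/config), the copy could run from the notebook side into a writable scratch directory instead of into /cvmfs/data:

# Sketch: stage a data file from SLAC over ssh/rsync into a writable scratch area,
# instead of writing into the read-only /cvmfs/data mount.
# Host name and paths are placeholders; assumes the user's ssh key/config works.
import subprocess
from pathlib import Path

def stage_from_slac(remote_path: str,
                    remote_host: str = "user@slac-host.example.org",
                    scratch: Path = Path.home() / "data_scratch") -> Path:
    """Copy one file from SLAC into a local, writable scratch directory."""
    scratch.mkdir(parents=True, exist_ok=True)
    dest = scratch / Path(remote_path).name
    subprocess.run(
        ["rsync", "-av", f"{remote_host}:{remote_path}", str(dest)],
        check=True,
    )
    return dest

# local_file = stage_from_slac("/remote/example/raw/run0001.mid.gz")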

Redeployment on top of newer Kubernetes version

The Jetstream team is working on a newer Kubernetes environment;
within 1 or 2 weeks they will notify me about its availability, and I will then tear down the deployment and rebuild it on top of the new environment.

  • save the data volume so it can be reattached to the newer environment (I'll try my best, but there is a small possibility I will lose the data and it will need to be copied again)
  • tear down the old deployment
  • deploy Kubernetes and check that it is working correctly (especially logging, which stopped working a month ago)
  • deploy the CVMFS / NFS service, re-mounting the data volume
  • make sure the networking issue (#10) is solved
  • deploy JupyterHub
  • update the documentation if anything changed

Install nfs-common on nodes

In order to mount the NFS share, I have to manually install the nfs-common package on the nodes.

Check whether I can add this to kubespray, or ask Jetstream if they can add it to the images.

All but 2 nodes are unavailable

When trying to start with a default server configuration, I get this error, and the loading for the server stalls:
2021-03-18T00:02:34.528016Z [Warning] 0/2 nodes are available: 2 Insufficient cpu.

Amy checked the admin panel, and nobody else seems to be using any nodes.

NFS sharing issue

I found an issue with NFS sharing: it works fine if the Jupyter Notebook pod is on the same node, but it doesn't work across nodes, which is strange because Kubernetes networking should handle that automatically.

To debug, I deployed 2 pods, one on the master and one on a minion, and then ran:

telnet 10.254.77.77 111

It can connect on the master (where the CVMFS/NFS pod is located), but it hangs on the other node.

`ls /cvmfs` takes a long time

I hadn't noticed this behavior before, so maybe the issue is that there's now more data in the directory? Here is output showing the issue:

bash-4.2$ time ls
lost+found

real    0m0.005s
user    0m0.005s
sys     0m0.000s
bash-4.2$ time ls /cvmfs/data
CDMS  lost+found  not_writable  test_file  test_file_2  tf

real    0m40.055s
user    0m0.000s
sys     0m0.007s

JupyterHub boot up timeout on Tiny server configuration

I received a timeout error while starting JupyterHub with the Tiny server configuration. I had previously received the same timeout error with the Default server configuration, and then decided to retry with the Tiny configuration.

Here is the event log for the Tiny configuration:

Server requested
2021-02-25 00:15:24+00:00 [Normal] Successfully assigned jhub/jupyter-zkromer to zonca-k8s-master-1
2021-02-25 00:15:27+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r6a6a768faa154a8583abc7c98ca0be7b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:30+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r724eb5e9a9794a3994f3564ef0e04832.scope mount.nfs: Operation not permitted
2021-02-25 00:15:33+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r7f54813ea3f249bd8e4d1e4d12eca05b.scope mount.nfs: Operation not permitted
2021-02-25 00:15:36+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r1b65550e5a7649f89ca73052fa5fd8a1.scope mount.nfs: Operation not permitted
2021-02-25 00:15:41+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-rfc770abe56ee49df86fb77fe7106cfb4.scope mount.nfs: Operation not permitted
2021-02-25 00:15:51+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r084be4dee5284884b974cb8a0de309db.scope mount.nfs: Operation not permitted
2021-02-25 00:16:01+00:00 [Normal] AttachVolume.Attach succeeded for volume "pvc-666b461d-e7d0-43f7-8747-44483d7b19a8"
2021-02-25 00:16:10+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r308a0c45b6df4c4fa77e087b71e1df91.scope mount.nfs: Operation not permitted
2021-02-25 00:16:43+00:00 [Warning] MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r0c099fa65d754c779938401dbc1fa3e2.scope mount.nfs: Operation not permitted
2021-02-25 00:17:27+00:00 [Warning] Unable to attach or mount volumes: unmounted volumes=[cvmfs-nfs-volume], unattached volumes=[volume-zkromer cvmfs-nfs-volume]: timed out waiting for the condition
2021-02-25 00:17:49+00:00 [Warning] (combined from similar events): MountVolume.SetUp failed for volume "cvmfs-nfs-volume" : mount failed: exit status 32 Mounting command: systemd-run Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume --scope -- mount -t nfs 10.233.46.63:/ /var/lib/kubelet/pods/bb9c71a5-6a02-4f6f-b83f-7015be5549b4/volumes/kubernetes.io~nfs/cvmfs-nfs-volume Output: Running scope as unit: run-r35c2a8d4c7bb40b1bdf8d111076efd07.scope mount.nfs: Operation not permitted
Spawn failed: pod/jupyter-zkromer did not start in 600 seconds!

Cannot start JupyterLab session

@jpanmany466 is having difficulty starting a JupyterLab session; the complaint it returns is that the spawner is timing out:

[screenshot: spawner_timeout]

IP addresses for whitelisting?

In our collaboration we sometimes whitelist ranges of IP addresses for easier access to collaboration resources.

Is there a range of IP addresses for the JupyterHub instances people spin up that we could ask our colleague to whitelist?

Test number of users per Jetstream VM

@pibion I have created a cluster with 1 master and 2 worker nodes named k8s_cdms.
First I want to test how many users I can accommodate per node.
I am worried about this given reports at zonca#23.

Once this is solved I'll continue with the planned steps.

JupyterLab spawning failure

@jpanmany is running into spawning issues again:

[screenshot: spawn_fail2]

This time it looks like the issue is that she doesn't have ssh keys, and the script is failing when it tries to set permissions.

Jupyter notebooks spawning time

Prompted by #46, let's debug what takes so long when starting up a session.

Here is the Kubernetes event log:

Events:
  Type    Reason                  Age    From                     Message
  ----    ------                  ----   ----                     -------
  Normal  Scheduled               2m41s  jhub-user-scheduler      Successfully assigned jhub/jupyter-zonca to zonca-k8s-node-nf-1
  Normal  SuccessfulAttachVolume  2m8s   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a4d0d8fb-46b4-4dbc-b5cb-3bad7ced6280"
  Normal  Pulled                  79s    kubelet                  Container image "jupyterhub/k8s-network-tools:0.9.0" already present on machine
  Normal  Created                 78s    kubelet                  Created container block-cloud-metadata
  Normal  Started                 78s    kubelet                  Started container block-cloud-metadata
  Normal  Pulled                  75s    kubelet                  Container image "zonca/docker-jupyter-cdms-light:2020.07.07" already present on machine
  Normal  Created                 75s    kubelet                  Created container notebook
  Normal  Started                 74s    kubelet                  Started container notebook
  • startup took about 1 minute 30 seconds; @pibion, do you often see startup times much longer than that?
  • it takes ~30 seconds to mount the user volume
  • however, there is another minute after that, and I don't know what is driving it (see the timing sketch below)
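
If we want actual numbers (also relevant to the monitoring question in the LaTeX/fonts issue above), one option is to time spawns through the JupyterHub REST API. A rough sketch follows; it assumes an API token with permission to start the given user's server, and the token and user name are placeholders:

# Sketch: measure how long a user server takes to spawn via the JupyterHub REST API.
# Requires an API token (e.g. from an admin user); token and user name are placeholders.
import time
import requests

HUB_API = "https://supercdms.jetstream-cloud.org/hub/api"
TOKEN = "REPLACE_WITH_API_TOKEN"
USER = "testuser"
headers = {"Authorization": f"token {TOKEN}"}

start = time.monotonic()
r = requests.post(f"{HUB_API}/users/{USER}/server", headers=headers)
r.raise_for_status()  # 201 = started immediately, 202 = spawn pending

# Poll the user model until the server is up (or give up after 10 minutes).
while time.monotonic() - start < 600:
    user = requests.get(f"{HUB_API}/users/{USER}", headers=headers).json()
    if user.get("pending") is None and user.get("server"):
        break
    time.sleep(5)

print(f"spawn took {time.monotonic() - start:.0f} s")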

Distributed data access - object store

Data is currently all stored at SLAC. There is a "data catalog" Python library that allows users to query for data paths. If the files don't exist locally, they're downloaded to disk.

Is it possible to have a storage disk that is mounted in everyone's container? For initial testing, 50 GB would be more than enough. If we want to support full CDMS analysis efforts, that's more like 10 TB.

Originally posted by @pibion in #7 (comment)
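
For context, the access pattern described above is roughly the following. This is a hypothetical sketch, not the real DataCat API; the function, cache path, and transfer step are made up for illustration:

# Hypothetical sketch of the "query for a path, download if missing" pattern
# described above. This is NOT the real DataCat API; names are illustrative only.
from pathlib import Path
import shutil

LOCAL_CACHE = Path("/data/cache")          # shared read/write disk (placeholder)
REMOTE_STORE = Path("/remote/slac/store")  # stand-in for the SLAC-side source

def get_data_file(dataset: str, filename: str) -> Path:
    """Return a local path for a catalog entry, fetching it if not cached."""
    local_path = LOCAL_CACHE / dataset / filename
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # In reality this would be an rsync/scp/DataCat transfer from SLAC.
        shutil.copy(REMOTE_STORE / dataset / filename, local_path)
    return local_path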

GPU access through Jetstream?

@zonca, @00KevinMac and @farnoush-bk are doing some machine learning work that requires GPU computing.

@00KevinMac has taken a look at XSEDE resources, and allocations on pretty much any of their GPU resources (e.g. Bridges or Comet) would serve well.

I'm wondering if it's possible to access these resources from the Jetstream JupyterHub, or if Jetstream has GPU VMs.

I'd like to stay away from installing the CDMS software on new systems, although maybe that's a fundamental requirement if we want to use other systems. Our experience within the collaboration is that getting a system to work with CDMS software, even one that already supports CVMFS, is a real pain.

So another option is: we use the XSEDE JupyterHub for CPU tasks and send people elsewhere for GPU tasks.

CDMS Python environment

A word of warning: it seems that the very first imports that pull in CDMS Python packages are failing. This is probably because the CVMFS environment doesn't install them. @bloer is the authority on this, though.

Originally posted by @pibion in #8 (comment)

I would like some details about the Python packages for CDMS analysis.

Offsite backup of user data

from @pibion:

I'm wondering if there's a way to access the existing storage volumes, and also whether there's a way to set up a permanent backup with e.g. the Open Storage Network (CDMS has an allocation there now). Adding @glass-ships and @thathayhaykid as they might be interested in thinking about this.

Distributed data access - block store

The conclusion of #8 is that an object store is not suitable.

The other two options are:

  1. Use Manila on Jetstream, which provides an NFS service managed by OpenStack, so we don't have to run it ourselves. This gives a standard read/write filesystem we can mount on all pods.

  2. Deploy our own NFS server; in fact, we can probably use the NFS server we already run for CVMFS to also serve this 50 GB volume read/write.

I have never used Manila before, so I would rather use our own NFS server; we can run some benchmarks later.

So the plan is to have one pod that mounts a large volume read-write and exposes its SSH port with certificate-only access so that data can be copied there with rsync. Down the road we could deploy a Globus endpoint. This pod then runs an NFS server that shares the data read-only with the Jupyter Notebook pods.

I haven't decided yet whether this should be a standalone pod or the same pod as CVMFS; I'll track progress in this issue.
