caicloud / ciao
Kernel for Kubeflow in Jupyter Notebook
License: Apache License 2.0
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could discuss here how to support model serving in jupyter notebook.
/kind feature
/priority p3
/kind feature
/priority p2
[kubeflow] Building the Docker image...
[kubeflow] Failed to build the image: Building docker.io/wwbloveww/jupyter-kernel-ssweh:v1
Setting up the rootfs... this may take a bit.
time="2021-07-05T09:12:44Z" level=warning msg="Process sandbox is not available, consider unmasking procfs: "
#1 [internal] load .dockerignore
#1 digest: sha256:3f76dea899bf1b71ba075dbd1989f536cfdd9740f8e3e679285a28b32cd0b823
#1 name: "[internal] load .dockerignore"
#1 started: 2021-07-05 09:12:44.874143734 +0000 UTC m=+0.033279376
#1 completed: 2021-07-05 09:12:44.874206862 +0000 UTC m=+0.033342485
#1 duration: 63.109µs
#2 [internal] load build definition from Dockerfile
#2 digest: sha256:94cc84d8df80af2f135b1f4d5667ad9ed8f0645c2666121668d0a74df1e068c1
#2 name: "[internal] load build definition from Dockerfile"
#2 started: 2021-07-05 09:12:44.874298093 +0000 UTC m=+0.033433716
#2 completed: 2021-07-05 09:12:44.890747271 +0000 UTC m=+0.049882893
#2 duration: 16.449177ms
#2 transferring dockerfile: 153B done
#1 [internal] load .dockerignore
#1 started: 2021-07-05 09:12:44.875227943 +0000 UTC m=+0.034364046
#1 completed: 2021-07-05 09:12:44.898065215 +0000 UTC m=+0.057200827
#1 duration: 22.836781ms
#1 transferring context: 2B done
failed to solve: failed to read dockerfile: failed to mount /tmp/buildkit-mount172811560: [{Type:bind Source:/tmp/img/runc/native/snapshots/snapshots/1 Options:[rbind ro]}]: operation not permitted
Failed to create job: exit status 1
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
gometalinter is unmaintained, so we should switch to golangci-lint instead.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
/kind feature
/priority p2
Currently we only support TensorFlow via TFJob; we need to support PyTorch once Kubeflow 0.3 is released.
/kind feature
We need to support resource limits via magic commands or config files. I would prefer to support config files first, then allow overriding them with magic commands.
/kind feature
/kind feature
/priority p2
Now we rely on s2i to build the image and then need to push it to a registry. As @inc0 suggested, we should investigate whether Knative Build is a better fit.
I am wondering whether implementing a whole kernel for TensorFlow jobs is too heavyweight. My understanding is that a cell magic could achieve a similar result: send the code to Kubeflow and get back the logs.
Am I missing something? Thanks!
/kind feature
/priority p2
/kind discussion
I am not sure it works. If we build and push a Docker image every time we run the code, the overhead is definitely high.
Rough approaches:
- Analyze the Python source code and install the imported packages, or use a %package xxx syntax (slow).
- Bundle all commonly used packages into a base image (huge image).
- Others...
Currently we use an environment variable to set KUBECONFIG; we need to discuss whether there is a better way.
/kind discussion
/priority p1
/kind enhancement
/priority p3
Some variables (e.g. KUBECONFIG) must be configured before running, so we need a mechanism to support configuration.
/priority p2
/kind feature
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Currently Kubeflow must be installed before ciao can be used. To make it easier for people to get started, I think we can do:
wdyt? @gaocegege
/kind feature
The user experience here is that users essentially have to copy/paste their already well-written code and submit it to Kubeflow, i.e. mostly the training part. What I have in mind is that, eventually, we provide a smoother experience that makes the full "develop-train-eval" loop possible.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When the kernel is run in Docker, it returns an error:
zmq4 was installed with ZeroMQ version 4.3.1, but the application links with version 4.2.5
/assign @yeya24
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The output shows that Kubeflow cannot get the pods.
[kubeflow] Building the Docker image...
[kubeflow] Image built successfully
[kubeflow] Getting tensorflow Job jupyter-kernel-koxqy
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
Tried 10 times but cannot get the pods
Job jupyter-kernel-koxqy is created.
But the pods have been created and are running:
vagrant@client-1:~$ kubectl get all -n ciao
NAME READY STATUS RESTARTS AGE
pod/ciao-56fcd4f588-zv722 1/1 Running 0 75s
pod/jupyter-kernel-koxqy-ps-0 1/1 Running 0 34s
pod/jupyter-kernel-koxqy-worker-0 1/1 Running 0 34s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ciao-service NodePort 10.233.16.108 <none> 8889:30885/TCP 3h22m
service/jupyter-kernel-koxqy-ps-0 ClusterIP None <none> 2222/TCP 35s
service/jupyter-kernel-koxqy-worker-0 ClusterIP None <none> 2222/TCP 35s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ciao 1/1 1 1 76s
What you expected to happen:
It looks like the operator's label format has changed:
kubeflow/training-operator#951
How to reproduce it (as minimally and precisely as possible):
I use the following operator version:
vagrant@client-1:~$ kubectl describe deploy pytorch-operator -n kubeflow
Name: pytorch-operator
Namespace: kubeflow
CreationTimestamp: Fri, 26 Jul 2019 02:54:11 -0700
Labels: name=pytorch-operator
Annotations: deployment.kubernetes.io/revision: 1
Selector: name=pytorch-operator
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: name=pytorch-operator
Service Account: pytorch-operator
Containers:
pytorch-operator:
Image: gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-rc.0
Port: <none>
Host Port: <none>
Command:
/pytorch-operator.v1beta2
--alsologtostderr
-v=1
Environment:
MY_POD_NAMESPACE: (v1:metadata.namespace)
MY_POD_NAME: (v1:metadata.name)
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: pytorch-operator-8b6d4ff5d (1/1 replicas created)
Events: <none>
vagrant@client-1:~$ kubectl describe deploy tf-job-operator -n kubeflow
Name: tf-job-operator
Namespace: kubeflow
CreationTimestamp: Fri, 26 Jul 2019 01:48:10 -0700
Labels: name=tf-job-operator
Annotations: deployment.kubernetes.io/revision: 1
Selector: name=tf-job-operator
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: name=tf-job-operator
Service Account: tf-job-operator
Containers:
tf-job-operator:
Image: gcr.io/kubeflow-images-public/tf_operator:v0.5.3
Port: <none>
Host Port: <none>
Command:
/opt/kubeflow/tf-operator.v1
--alsologtostderr
-v=1
--monitoring-port=8443
Environment:
MY_POD_NAMESPACE: (v1:metadata.namespace)
MY_POD_NAME: (v1:metadata.name)
Mounts:
/etc/config from config-volume (rw)
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tf-job-operator-config
Optional: false
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: tf-job-operator-865c7ddb5f (1/1 replicas created)
Events: <none>
Anything else we need to know?:
/kind feature
/priority p1
/priority p1
/kind feature
We do not push the image to a Docker registry, so we only support a one-node local cluster for now.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The ciao log shows that the Kubeflow kernel fails to start:
[KernelGatewayApp] Kernel args: {'kernel_name': u'kubeflow', 'env': {'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', u'KERNEL_LAUNCH_TIMEOUT': u'40', u'KERNEL_WORKING_DIR': u'/home/jovyan'}}
[I 190731 06:30:15 web:2162] 201 POST /api/kernels (10.233.80.43) 23.44ms
[I 190731 06:30:15 web:2162] 200 GET /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5 (10.233.80.43) 1.05ms
2019/07/31 06:30:15 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:15.738126 9 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:15.738144 9 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:15 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] Initializing websocket connection /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5/channels
[KernelGatewayApp] WARNING | No session ID specified
[KernelGatewayApp] Requesting kernel info from 4711b4c3-0f7e-4ddb-9650-1b96f31477d5
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:35719
[KernelGatewayApp] KernelRestarter: restarting kernel (1/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:46113
2019/07/31 06:30:18 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:18.709932 19 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:18.709942 19 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:18 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (2/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:34603
2019/07/31 06:30:21 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:21.722328 25 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:21.722337 25 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:21 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (3/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:38448
2019/07/31 06:30:24 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:24.730290 32 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:24.730299 32 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:24 Error building kubeConfig: invalid configuration: no configuration has been provided
What you expected to happen:
The kernel gateway prevents the kernel from inheriting environment variables when spawning it, so KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are empty.
See: jupyter-server/kernel_gateway#280
How to reproduce it (as minimally and precisely as possible):
Using the following ./hack/k8s.config.yaml to build the Docker image:
namespace: ciao
s2i:
  provider: configmap
Anything else we need to know?:
/kind feature
/priority p1
We should investigate whether we can support remote kernels.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Maybe blocked by #15
/priority p3
/kind feature
We implemented a simple interpreter based on string comparison, which is not robust. We need to implement a real one.
/kind bug
/priority p0
If a code cell finishes successfully, we cannot re-run it; if it fails, we can. This is definitely a bug.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
/priority p3
I'm not sure whether JupyterLab uses the same protocol as Notebook (I hope it does). Since JupyterLab's end goal is to replace Jupyter Notebook, we had better keep that in mind when designing the kernel.
Per my understanding, a cell magic could do a similar job: send the code to Kubeflow and get back the logs. What is the major reason for creating a new kernel?
This is the core API of the kernel, but it is nested too deeply; I would expect it to be easily discoverable. The specific commands should be defined somewhere else, preferably versioned, with the interpreter just importing them. That would also make it easier to implement features such as listing all available commands.
FEATURE REQUEST:
Add support for ciao running in k8s cluster.
ciao/cmd/kubeflow-kernel/command/run.go
Line 54 in ebedd2a
Here we just read the config file path from the YAML. Can we add support for running ciao with the in-cluster config?
/kind feature
/kind bug
What happened:
When I run:
docker build -t caicloud/ciao .
What you expected to happen:
I expected the ciao image to build successfully, but it failed.
Sending build context to Docker daemon 51.38MB
Step 1/25 : ARG RUNC_VERSION=9f9c96235cc97674e935002fc3d78361b696a69e
Step 2/25 : FROM golang:1.10-alpine AS build-env
---> 7b53e4a31d21
Step 3/25 : RUN apk add --no-cache zeromq-dev zeromq gcc musl-dev
---> Running in 31af2c464d73
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
gcc (missing):
required by: world[gcc]
musl-dev (missing):
required by: world[musl-dev]
zeromq (missing):
required by: world[zeromq]
zeromq-dev (missing):
required by: world[zeromq-dev]
The command '/bin/sh -c apk add --no-cache zeromq-dev zeromq gcc musl-dev' returned a non-zero code: 4
Also, maybe you should provide an official image?
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Delete all resources when a job is finished.
Now it takes about 10 seconds to set up the containers and run the distributed code (Docker push is not included). We need to try to reduce the overhead of Kubeflow for better UX.
/priority p3
We should specify the username/password/registry before using img to push the images.
Ref #43 (comment)
Is it still correct to say s2i now that, with the configmap provider, we don't really build an image? I can live with this if we take the "image" in s2i to mean the pre-defined image plus the code, but considering what the s2i project actually does (building an image), this can be a little confusing.
We want to support Kubeflow and Python (and R and other languages) in one notebook, so we have two choices:
@gaocegege Hi, I want to know what path I should set in the command docker run -v {kubeconfig}:{kubeconfig} -p 8889:8889 caicloud/ciao. I have built the image, but when I run it, it shows the following logs (attached as logs.txt).
logs.txt
Can you help me look into the error? Thank you.
/kind documentation
/priority p0
/kind feature