caicloud / ciao
Kernel for Kubeflow in Jupyter Notebook
License: Apache License 2.0
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could discuss here how to support model serving in jupyter notebook.
/kind feature
/priority p3
/kind feature
/priority p2
[kubeflow] Building the Docker image...
[kubeflow] Failed to build the image: Building docker.io/wwbloveww/jupyter-kernel-ssweh:v1
Setting up the rootfs... this may take a bit.
time="2021-07-05T09:12:44Z" level=warning msg="Process sandbox is not available, consider unmasking procfs: "
#1 [internal] load .dockerignore
#1 digest: sha256:3f76dea899bf1b71ba075dbd1989f536cfdd9740f8e3e679285a28b32cd0b823
#1 name: "[internal] load .dockerignore"
#1 started: 2021-07-05 09:12:44.874143734 +0000 UTC m=+0.033279376
#1 completed: 2021-07-05 09:12:44.874206862 +0000 UTC m=+0.033342485
#1 duration: 63.109µs
#2 [internal] load build definition from Dockerfile
#2 digest: sha256:94cc84d8df80af2f135b1f4d5667ad9ed8f0645c2666121668d0a74df1e068c1
#2 name: "[internal] load build definition from Dockerfile"
#2 started: 2021-07-05 09:12:44.874298093 +0000 UTC m=+0.033433716
#2 completed: 2021-07-05 09:12:44.890747271 +0000 UTC m=+0.049882893
#2 duration: 16.449177ms
#2 transferring dockerfile: 153B done
#1 [internal] load .dockerignore
#1 started: 2021-07-05 09:12:44.875227943 +0000 UTC m=+0.034364046
#1 completed: 2021-07-05 09:12:44.898065215 +0000 UTC m=+0.057200827
#1 duration: 22.836781ms
#1 transferring context: 2B done
failed to solve: failed to read dockerfile: failed to mount /tmp/buildkit-mount172811560: [{Type:bind Source:/tmp/img/runc/native/snapshots/snapshots/1 Options:[rbind ro]}]: operation not permitted
Failed to create job: exit status 1
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
gometalinter is unmaintained, so we should switch to golangci-lint instead.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
/kind feature
/priority p2
Currently we only support TensorFlow via TFJob; we need to support PyTorch once Kubeflow 0.3 is released.
/kind feature
We need to support resource limits via magic commands or config files. I would prefer to support config files first, then allow overriding them with magic commands.
/kind feature
/kind feature
/priority p2
Now we rely on s2i to build the image and then need to push it to a registry. As @inc0 suggested, we should investigate whether Knative Build is a better fit.
I am wondering whether implementing a whole kernel for TensorFlow jobs is too heavyweight. My understanding is that a cell magic could achieve a similar result: send the code to Kubeflow and get back the logs.
Am I missing something? Thanks!
/kind feature
/priority p2
/kind discussion
I am not sure it works. If we build and push a Docker image every time we run the code, the overhead is definitely high.
Rough approaches:
- Analyze the Python source code and install the imported packages, or use a %package xxx syntax (slow).
- Bundle all commonly used packages into a base image (huge image).
- Others...
Currently we use an environment variable to set KUBECONFIG; we need to discuss whether there is a better way.
/kind discussion
/priority p1
/kind enhancement
/priority p3
Some variables (e.g. KUBECONFIG) must be configured before running, so we need a mechanism to support configuration.
/priority p2
/kind feature
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Currently Kubeflow must be installed before ciao can be used. To make it easier for people to get started, I think we can do:
wdyt? @gaocegege
/kind feature
The user experience here is that users essentially have to copy/paste their already well-written code and submit it to Kubeflow, i.e. mostly the training part. What I have in mind is that, eventually, we provide a smoother experience that makes the full "develop-train-eval" loop possible.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When the kernel is run in Docker, it returns an error:
zmq4 was installed with ZeroMQ version 4.3.1, but the application links with version 4.2.5
/assign @yeya24
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The output shows that Kubeflow cannot get the pods.
[kubeflow] Building the Docker image...
[kubeflow] Image built successfully
[kubeflow] Getting tensorflow Job jupyter-kernel-koxqy
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
Tried 10 times but cannot get the pods
Job jupyter-kernel-koxqy is created.
But the pods have been created and are running:
vagrant@client-1:~$ kubectl get all -n ciao
NAME READY STATUS RESTARTS AGE
pod/ciao-56fcd4f588-zv722 1/1 Running 0 75s
pod/jupyter-kernel-koxqy-ps-0 1/1 Running 0 34s
pod/jupyter-kernel-koxqy-worker-0 1/1 Running 0 34s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ciao-service NodePort 10.233.16.108 <none> 8889:30885/TCP 3h22m
service/jupyter-kernel-koxqy-ps-0 ClusterIP None <none> 2222/TCP 35s
service/jupyter-kernel-koxqy-worker-0 ClusterIP None <none> 2222/TCP 35s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ciao 1/1 1 1 76s
What you expected to happen:
It looks like the operator's label format has changed:
kubeflow/training-operator#951
How to reproduce it (as minimally and precisely as possible):
I use the following operator version:
vagrant@client-1:~$ kubectl describe deploy pytorch-operator -n kubeflow
Name: pytorch-operator
Namespace: kubeflow
CreationTimestamp: Fri, 26 Jul 2019 02:54:11 -0700
Labels: name=pytorch-operator
Annotations: deployment.kubernetes.io/revision: 1
Selector: name=pytorch-operator
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: name=pytorch-operator
Service Account: pytorch-operator
Containers:
pytorch-operator:
Image: gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-rc.0
Port: <none>
Host Port: <none>
Command:
/pytorch-operator.v1beta2
--alsologtostderr
-v=1
Environment:
MY_POD_NAMESPACE: (v1:metadata.namespace)
MY_POD_NAME: (v1:metadata.name)
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: pytorch-operator-8b6d4ff5d (1/1 replicas created)
Events: <none>
vagrant@client-1:~$ kubectl describe deploy tf-job-operator -n kubeflow
Name: tf-job-operator
Namespace: kubeflow
CreationTimestamp: Fri, 26 Jul 2019 01:48:10 -0700
Labels: name=tf-job-operator
Annotations: deployment.kubernetes.io/revision: 1
Selector: name=tf-job-operator
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: name=tf-job-operator
Service Account: tf-job-operator
Containers:
tf-job-operator:
Image: gcr.io/kubeflow-images-public/tf_operator:v0.5.3
Port: <none>
Host Port: <none>
Command:
/opt/kubeflow/tf-operator.v1
--alsologtostderr
-v=1
--monitoring-port=8443
Environment:
MY_POD_NAMESPACE: (v1:metadata.namespace)
MY_POD_NAME: (v1:metadata.name)
Mounts:
/etc/config from config-volume (rw)
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tf-job-operator-config
Optional: false
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
OldReplicaSets: <none>
NewReplicaSet: tf-job-operator-865c7ddb5f (1/1 replicas created)
Events: <none>
Anything else we need to know?:
/kind feature
/priority p1
/priority p1
/kind feature
We do not push the image to a Docker registry, so we only support a one-node local cluster for now.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The ciao log shows that the Kubeflow kernel fails to start:
[KernelGatewayApp] Kernel args: {'kernel_name': u'kubeflow', 'env': {'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', u'KERNEL_LAUNCH_TIMEOUT': u'40', u'KERNEL_WORKING_DIR': u'/home/jovyan'}}
[I 190731 06:30:15 web:2162] 201 POST /api/kernels (10.233.80.43) 23.44ms
[I 190731 06:30:15 web:2162] 200 GET /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5 (10.233.80.43) 1.05ms
2019/07/31 06:30:15 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:15.738126 9 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:15.738144 9 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:15 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] Initializing websocket connection /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5/channels
[KernelGatewayApp] WARNING | No session ID specified
[KernelGatewayApp] Requesting kernel info from 4711b4c3-0f7e-4ddb-9650-1b96f31477d5
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:35719
[KernelGatewayApp] KernelRestarter: restarting kernel (1/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:46113
2019/07/31 06:30:18 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:18.709932 19 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:18.709942 19 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:18 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (2/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:34603
2019/07/31 06:30:21 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:21.722328 25 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:21.722337 25 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:21 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (3/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:38448
2019/07/31 06:30:24 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:24.730290 32 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ERROR: logging before flag.Parse: W0731 06:30:24.730299 32 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:24 Error building kubeConfig: invalid configuration: no configuration has been provided
What you expected to happen:
The kernel gateway prevents the kernel from inheriting environment variables when spawning it, so KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are empty.
See: jupyter-server/kernel_gateway#280
How to reproduce it (as minimally and precisely as possible):
Using the following ./hack/k8s.config.yaml to build the Docker image:
namespace: ciao
s2i:
  provider: configmap
Anything else we need to know?:
/kind feature
/priority p1
We should investigate whether we can support remote kernels.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Maybe blocked by #15
/priority p3
/kind feature
We implemented a simple interpreter based on string comparison, which is not robust. We need to implement a real one.
/kind bug
/priority p0
If a code cell finishes successfully, we cannot re-run it; if it fails, we can. This is definitely a bug.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
/priority p3
I'm not sure whether JupyterLab uses the same protocol as Notebook (I hope it does). Since JupyterLab's end goal is to replace Jupyter Notebook, we had better keep that in mind when designing the kernel.
Per my understanding, a cell magic could do a similar job: send the code to Kubeflow and get back the logs. What is the major reason for creating a new kernel?
This is the core API of the kernel, but it is nested too deeply; I would expect it to be easily discoverable. The specific commands should be defined somewhere else, preferably versioned, with the interpreter just importing them. That would also make it easier to implement features such as listing all available commands.
FEATURE REQUEST:
Add support for ciao running in k8s cluster.
ciao/cmd/kubeflow-kernel/command/run.go
Line 54 in ebedd2a
Here we just read the config file path from the YAML. Can we add support for running ciao with the in-cluster config?
/kind feature
/kind bug
What happened:
When I run:
docker build -t caicloud/ciao .
What you expected to happen:
I expected the ciao image to build successfully, but it failed.
Sending build context to Docker daemon 51.38MB
Step 1/25 : ARG RUNC_VERSION=9f9c96235cc97674e935002fc3d78361b696a69e
Step 2/25 : FROM golang:1.10-alpine AS build-env
---> 7b53e4a31d21
Step 3/25 : RUN apk add --no-cache zeromq-dev zeromq gcc musl-dev
---> Running in 31af2c464d73
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
gcc (missing):
required by: world[gcc]
musl-dev (missing):
required by: world[musl-dev]
zeromq (missing):
required by: world[zeromq]
zeromq-dev (missing):
required by: world[zeromq-dev]
The command '/bin/sh -c apk add --no-cache zeromq-dev zeromq gcc musl-dev' returned a non-zero code: 4
Also, maybe you should provide an official image?
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Delete all resources when a job is finished.
Now it takes about 10 seconds to set up the containers and run the distributed code (Docker push is not included). We need to try to reduce the overhead of Kubeflow for better UX.
/priority p3
We should specify the username/password/registry before using img to push the images.
Ref #43 (comment)
Is it still correct to say s2i now that, with the configmap provider, we don't really build an image? I can live with this if we take the "image" in s2i to mean the pre-defined image plus the code, but considering what the s2i project actually does (building an image), this can be a little confusing.
We want to support Kubeflow and Python (and R and other languages) in one notebook, so we have two choices:
@gaocegege Hi, I want to know what path I should set in the command docker run -v {kubeconfig}:{kubeconfig} -p 8889:8889 caicloud/ciao. I have built the image, but when I run it, it shows the following logs (attached as logs.txt).
logs.txt
Can you help me look into the error? Thank you.
/kind documentation
/priority p0
/kind feature