caicloud / ciao

Stars: 68 | Watchers: 12 | Forks: 18 | Size: 13 MB

Kernel for Kubeflow in Jupyter Notebook

License: Apache License 2.0

Go 94.22% Shell 1.19% Dockerfile 2.99% Python 1.61%
clever kubeflow jupyter jupyter-kernel

ciao's People

Contributors

bbbmj, caicloud-bot, e271828-, gaocegege, jiachengxu, minshenglin, yeya24

ciao's Issues

[feasibility research] Support Customized UI in Jupyter

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:
/kind feature

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Failed to build the image

[kubeflow] Building the Docker image...
[kubeflow] Failed to build the image: Building docker.io/wwbloveww/jupyter-kernel-ssweh:v1
Setting up the rootfs... this may take a bit.
time="2021-07-05T09:12:44Z" level=warning msg="Process sandbox is not available, consider unmasking procfs: "

#1 [internal] load .dockerignore
#1 digest: sha256:3f76dea899bf1b71ba075dbd1989f536cfdd9740f8e3e679285a28b32cd0b823
#1 name: "[internal] load .dockerignore"
#1 started: 2021-07-05 09:12:44.874143734 +0000 UTC m=+0.033279376
#1 completed: 2021-07-05 09:12:44.874206862 +0000 UTC m=+0.033342485
#1 duration: 63.109µs

#2 [internal] load build definition from Dockerfile
#2 digest: sha256:94cc84d8df80af2f135b1f4d5667ad9ed8f0645c2666121668d0a74df1e068c1
#2 name: "[internal] load build definition from Dockerfile"
#2 started: 2021-07-05 09:12:44.874298093 +0000 UTC m=+0.033433716
#2 completed: 2021-07-05 09:12:44.890747271 +0000 UTC m=+0.049882893
#2 duration: 16.449177ms
#2 transferring dockerfile: 153B done

#1 [internal] load .dockerignore
#1 started: 2021-07-05 09:12:44.875227943 +0000 UTC m=+0.034364046
#1 completed: 2021-07-05 09:12:44.898065215 +0000 UTC m=+0.057200827
#1 duration: 22.836781ms
#1 transferring context: 2B done

failed to solve: failed to read dockerfile: failed to mount /tmp/buildkit-mount172811560: [{Type:bind Source:/tmp/img/runc/native/snapshots/snapshots/1 Options:[rbind ro]}]: operation not permitted
Failed to create job: exit status 1

[enhancement] Upgrade TFJob/PyTorchJob to v1

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

[chore] Replace gometalinter with golangci-lint

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

gometalinter is unmaintained, thus we should use golangci-lint instead.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

[feature] Support resource limits

/kind feature

We need to support resource limits through magic commands or config files. I prefer to support config files first, and then allow magic commands to override the configured values.
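
The "config file first, magic command overrides" order described above could be sketched as follows. This is only an illustration under assumptions: the `Resources` type, its fields, and the `Merge` helper are hypothetical names and are not taken from ciao's code.

```go
package main

import "fmt"

// Resources holds hypothetical per-replica limits. The field names are
// illustrative only and do not come from ciao.
type Resources struct {
	CPU    string
	Memory string
}

// Merge returns base with every non-empty field of override applied on top,
// so values from a magic command win over values from the config file.
func Merge(base, override Resources) Resources {
	out := base
	if override.CPU != "" {
		out.CPU = override.CPU
	}
	if override.Memory != "" {
		out.Memory = override.Memory
	}
	return out
}

func main() {
	fromConfig := Resources{CPU: "1", Memory: "1Gi"} // defaults from a config file
	fromMagic := Resources{Memory: "4Gi"}            // e.g. parsed from a magic command
	fmt.Printf("%+v\n", Merge(fromConfig, fromMagic))
}
```

Fields the magic command leaves empty keep their configured defaults, which keeps the override path optional.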

[feature] Support Configuration

Some variables, e.g. KUBECONFIG, need to be configured before running. Thus we need a mechanism to support configuration.

/priority p2
/kind feature
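
One possible lookup order for such a setting, sketched under assumptions: an explicit value from a config file wins, then the KUBECONFIG environment variable, then kubectl's default location `~/.kube/config`. The `ResolveKubeconfig` function is hypothetical and not part of ciao. For simplicity the sketch treats KUBECONFIG as a single path, although kubectl also accepts a list of paths there.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ResolveKubeconfig picks a kubeconfig path. Hypothetical helper: the
// precedence (config value, then KUBECONFIG, then ~/.kube/config) is an
// assumption for illustration, not ciao's documented behaviour.
func ResolveKubeconfig(fromConfig string) string {
	if fromConfig != "" {
		return fromConfig
	}
	if env := os.Getenv("KUBECONFIG"); env != "" {
		return env
	}
	home, err := os.UserHomeDir()
	if err != nil {
		// No home directory available; the caller has to handle the empty path.
		return ""
	}
	return filepath.Join(home, ".kube", "config")
}

func main() {
	fmt.Println("kubeconfig:", ResolveKubeconfig(""))
}
```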

[UX] non-kubeflow backend

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind feature

Currently, Kubeflow must be installed in order to use ciao. To make it easier for people to get started, I think we can:

  • use kubernetes job backend
  • only install required operators (can be installed via helm charts)

wdyt? @gaocegege
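
A pluggable backend along these lines might look roughly like the sketch below. Everything in it is hypothetical: the `Backend` interface, the two placeholder implementations, and the `NewBackend` selector are illustrations of the idea, not ciao's actual design, and the `Submit` methods only return placeholder job names instead of talking to a cluster.

```go
package main

import "fmt"

// Backend is a hypothetical abstraction over where the code is executed.
type Backend interface {
	Name() string
	Submit(code string) (jobName string, err error)
}

// kubeflowBackend stands in for the existing operator-based path.
type kubeflowBackend struct{}

func (kubeflowBackend) Name() string { return "kubeflow" }

// Submit is a placeholder; a real implementation would create a TFJob or PyTorchJob.
func (kubeflowBackend) Submit(code string) (string, error) {
	return "kubeflow-job-placeholder", nil
}

// jobBackend stands in for a plain Kubernetes Job backend.
type jobBackend struct{}

func (jobBackend) Name() string { return "job" }

// Submit is a placeholder; a real implementation would create a batch Job.
func (jobBackend) Submit(code string) (string, error) {
	return "k8s-job-placeholder", nil
}

// NewBackend selects a backend by name, e.g. from a config file value.
func NewBackend(kind string) (Backend, error) {
	switch kind {
	case "kubeflow":
		return kubeflowBackend{}, nil
	case "job":
		return jobBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown backend %q", kind)
	}
}

func main() {
	b, err := NewBackend("job")
	if err != nil {
		fmt.Println(err)
		return
	}
	name, _ := b.Submit("print('hello')")
	fmt.Println(b.Name(), name)
}
```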

[discussion] Improve User Experience

The current user experience is that users essentially have to copy and paste their already well-written code and submit it to Kubeflow, i.e. mostly the training part. What I have in mind is that, eventually, we provide a smoother experience that makes a full "develop-train-eval" workflow possible.

#1 (comment)

Kubeflow kernel tries to get the pods but fails

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The output shows that the Kubeflow kernel cannot get the pods.

[kubeflow] Building the Docker image...
[kubeflow] Image built successfully
[kubeflow] Getting tensorflow Job jupyter-kernel-koxqy
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
[kubeflow] Waiting for all replicas (0, 1, 1)
Tried 10 times but cannot get the pods
Job jupyter-kernel-koxqy is created.

But the pods have been created and are running:

vagrant@client-1:~$ kubectl get all -n ciao
NAME                                READY   STATUS    RESTARTS   AGE
pod/ciao-56fcd4f588-zv722           1/1     Running   0          75s
pod/jupyter-kernel-koxqy-ps-0       1/1     Running   0          34s
pod/jupyter-kernel-koxqy-worker-0   1/1     Running   0          34s

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/ciao-service                    NodePort    10.233.16.108   <none>        8889:30885/TCP   3h22m
service/jupyter-kernel-koxqy-ps-0       ClusterIP   None            <none>        2222/TCP         35s
service/jupyter-kernel-koxqy-worker-0   ClusterIP   None            <none>        2222/TCP         35s

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ciao   1/1     1            1           76s

What you expected to happen:

It looks like the label format used by the operator has changed:

kubeflow/training-operator#951
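
If the lookup fails because the operator changed its label keys, one tolerant approach is to match pods against every known label scheme. The sketch below only illustrates that idea: the two label keys are made-up placeholders (the real keys have to be taken from the operator version in use), and `belongsToJob` is a hypothetical helper, not ciao's code.

```go
package main

import "fmt"

// Hypothetical label keys, used only for illustration. The real keys depend
// on the operator version and must be looked up in the operator itself.
const (
	oldJobNameKey = "example-old-job-name"
	newJobNameKey = "example-new-job-name"
)

// matches reports whether labels contains every key/value pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// candidateSelectors returns one selector per known label scheme.
func candidateSelectors(jobName string) []map[string]string {
	return []map[string]string{
		{newJobNameKey: jobName},
		{oldJobNameKey: jobName},
	}
}

// belongsToJob reports whether a pod's labels match any known scheme.
func belongsToJob(podLabels map[string]string, jobName string) bool {
	for _, sel := range candidateSelectors(jobName) {
		if matches(podLabels, sel) {
			return true
		}
	}
	return false
}

func main() {
	pod := map[string]string{oldJobNameKey: "jupyter-kernel-koxqy"}
	fmt.Println(belongsToJob(pod, "jupyter-kernel-koxqy"))
}
```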

How to reproduce it (as minimally and precisely as possible):

I use the following operator version:

vagrant@client-1:~$ kubectl describe deploy pytorch-operator -n kubeflow                                                                                                                                    
Name:                   pytorch-operator
Namespace:              kubeflow
CreationTimestamp:      Fri, 26 Jul 2019 02:54:11 -0700
Labels:                 name=pytorch-operator
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               name=pytorch-operator
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:           name=pytorch-operator
  Service Account:  pytorch-operator
  Containers:
   pytorch-operator:
    Image:      gcr.io/kubeflow-images-public/pytorch-operator:v1.0.0-rc.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /pytorch-operator.v1beta2
      --alsologtostderr
      -v=1
    Environment:
      MY_POD_NAMESPACE:   (v1:metadata.namespace)
      MY_POD_NAME:        (v1:metadata.name)
    Mounts:              <none>
  Volumes:               <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   pytorch-operator-8b6d4ff5d (1/1 replicas created)
Events:          <none>
vagrant@client-1:~$ kubectl describe deploy tf-job-operator -n kubeflow                                                                                                                                     
Name:                   tf-job-operator
Namespace:              kubeflow
CreationTimestamp:      Fri, 26 Jul 2019 01:48:10 -0700
Labels:                 name=tf-job-operator
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               name=tf-job-operator
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:           name=tf-job-operator
  Service Account:  tf-job-operator
  Containers:
   tf-job-operator:
    Image:      gcr.io/kubeflow-images-public/tf_operator:v0.5.3
    Port:       <none>
    Host Port:  <none>
    Command:
      /opt/kubeflow/tf-operator.v1
      --alsologtostderr
      -v=1
      --monitoring-port=8443
    Environment:
      MY_POD_NAMESPACE:   (v1:metadata.namespace)
      MY_POD_NAME:        (v1:metadata.name)
    Mounts:
      /etc/config from config-volume (rw)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tf-job-operator-config
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   tf-job-operator-865c7ddb5f (1/1 replicas created)
Events:          <none>

Anything else we need to know?:

Kernel cannot start successfully if kubeconfig is not set

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The ciao log shows that the Kubeflow kernel fails to start:

[KernelGatewayApp] Kernel args: {'kernel_name': u'kubeflow', 'env': {'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', u'KERNEL_LAUNCH_TIMEOUT': u'40', u'KERNEL_WORKING_DIR': u'/home/jovyan'}}
[I 190731 06:30:15 web:2162] 201 POST /api/kernels (10.233.80.43) 23.44ms
[I 190731 06:30:15 web:2162] 200 GET /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5 (10.233.80.43) 1.05ms
2019/07/31 06:30:15 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:15.738126       9 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ERROR: logging before flag.Parse: W0731 06:30:15.738144       9 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:15 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] Initializing websocket connection /api/kernels/4711b4c3-0f7e-4ddb-9650-1b96f31477d5/channels
[KernelGatewayApp] WARNING | No session ID specified
[KernelGatewayApp] Requesting kernel info from 4711b4c3-0f7e-4ddb-9650-1b96f31477d5
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:35719
[KernelGatewayApp] KernelRestarter: restarting kernel (1/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:46113
2019/07/31 06:30:18 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:18.709932      19 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ERROR: logging before flag.Parse: W0731 06:30:18.709942      19 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:18 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (2/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:34603
2019/07/31 06:30:21 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:21.722328      25 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ERROR: logging before flag.Parse: W0731 06:30:21.722337      25 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:21 Error building kubeConfig: invalid configuration: no configuration has been provided
[KernelGatewayApp] KernelRestarter: restarting kernel (3/5), new random ports
[KernelGatewayApp] Starting kernel: [u'kubeflow-kernel', u'run', u'--connection-file', u'/root/.local/share/jupyter/runtime/kernel-4711b4c3-0f7e-4ddb-9650-1b96f31477d5.json']
[KernelGatewayApp] Connecting to: tcp://127.0.0.1:38448
2019/07/31 06:30:24 Using config file: /etc/ciao/config.yaml
ERROR: logging before flag.Parse: W0731 06:30:24.730290      32 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ERROR: logging before flag.Parse: W0731 06:30:24.730299      32 client_config.go:557] error creating inClusterConfig, falling back to default config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
2019/07/31 06:30:24 Error building kubeConfig: invalid configuration: no configuration has been provided

What you expected to happen:

The kernel gateway prevents the kernel from inheriting environment variables when it is spawned, so KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are empty.

See: jupyter-server/kernel_gateway#280
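
The condition behind the log message can be checked explicitly before building a client. This is a minimal sketch, assuming only what the log states, namely that both KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined; the `inClusterAddr` helper is hypothetical and takes the lookup function as a parameter so it can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// inClusterAddr returns the API server address derived from the two
// environment variables named in the log, or an error if either is missing.
// Hypothetical helper for illustration; it is not part of ciao or client-go.
func inClusterAddr(getenv func(string) string) (string, error) {
	host := getenv("KUBERNETES_SERVICE_HOST")
	port := getenv("KUBERNETES_SERVICE_PORT")
	if host == "" || port == "" {
		return "", errors.New("KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined")
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := inClusterAddr(os.Getenv)
	if err != nil {
		// This is the situation the kernel ends up in when the gateway
		// does not pass the variables on to the spawned process.
		fmt.Println("in-cluster config unavailable:", err)
		return
	}
	fmt.Println("API server:", addr)
}
```

Failing early with a message like this would make the missing variables visible on the first start attempt instead of across repeated kernel restarts.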

How to reproduce it (as minimally and precisely as possible):

Using the following ./hack/k8s.config.yaml to build docker image:

namespace: ciao
s2i:
  provider: configmap

Anything else we need to know?:

[feature] Support %tensorboard magic command

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Maybe blocked by #15

[bug] Fail to re-run the code cell

/kind bug
/priority p0

If a code cell finishes successfully, we cannot re-run it; if it fails, we can. This is definitely a bug.

[discussion] jupyter lab support?

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind feature
/priority p3

I'm not sure whether JupyterLab uses the same protocol as Jupyter Notebook (I hope it does). Since the end goal of JupyterLab is to replace Jupyter Notebook, we'd better keep that in mind when designing the kernel.

[enhancement] Refactor Logic about Interpreter

This is the core API of the kernel, but it is nested too deeply; I would expect it to be easily discoverable. Perhaps the specific commands should be defined somewhere else, preferably versioned, and the interpreter just imports them? That would also make it easier for us to implement features such as listing all available commands.
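
One way to read this proposal is a small command registry that the interpreter only consults. The sketch below is an assumption-laden illustration: `Registry`, `Command`, and the `echo` command are hypothetical names, not the kernel's existing API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Command is a hypothetical handler for one magic command.
type Command func(args []string) (string, error)

// Registry maps command names to handlers, so the interpreter does not
// need to know about individual commands.
type Registry struct {
	commands map[string]Command
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{commands: make(map[string]Command)}
}

// Register adds or replaces the handler for name.
func (r *Registry) Register(name string, c Command) {
	r.commands[name] = c
}

// Run looks up name and executes its handler.
func (r *Registry) Run(name string, args []string) (string, error) {
	cmd, ok := r.commands[name]
	if !ok {
		return "", fmt.Errorf("unknown command %q", name)
	}
	return cmd(args)
}

// Names lists all registered commands in sorted order.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.commands))
	for n := range r.commands {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	r := NewRegistry()
	r.Register("echo", func(args []string) (string, error) {
		return strings.Join(args, " "), nil
	})
	out, err := r.Run("echo", []string{"hello", "kernel"})
	fmt.Println(out, err, r.Names())
}
```

With this shape, "list all available commands" reduces to calling `Names`, and the interpreter itself stays free of per-command logic.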

Failed to build ciao image

/kind bug

What happened:

When I run:

docker build -t caicloud/ciao .

What you expected to happen:

I expected the ciao image to build successfully, but it failed.

Sending build context to Docker daemon  51.38MB
Step 1/25 : ARG RUNC_VERSION=9f9c96235cc97674e935002fc3d78361b696a69e
Step 2/25 : FROM golang:1.10-alpine AS build-env
 ---> 7b53e4a31d21
Step 3/25 : RUN apk add --no-cache     zeromq-dev     zeromq     gcc     musl-dev
 ---> Running in 31af2c464d73
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: DNS lookup error
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: DNS lookup error
ERROR: unsatisfiable constraints:
  gcc (missing):
    required by: world[gcc]
  musl-dev (missing):
    required by: world[musl-dev]
  zeromq (missing):
    required by: world[zeromq]
  zeromq-dev (missing):
    required by: world[zeromq-dev]
The command '/bin/sh -c apk add --no-cache     zeromq-dev     zeromq     gcc     musl-dev' returned a non-zero code: 4

Also, perhaps you should provide an official image?

[discussion] Reduce the Overhead of Kubeflow

Now it takes about 10 seconds to set up the containers and run the distributed code (Docker push is not included). We need to try to reduce the overhead of Kubeflow for better UX.

/priority p3

[refactor] Separate CM and image in s2i

Ref ##43 (comment)

Is it still correct to say s2i now that, with configmap, we don't really build an image?

I can live with this if we think the image in s2i means the pre-defined image + code, but considering what the s2i project does (actually building an image), this can be a little confusing.
