
chainer-operator's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Get Involved

Please refer to the Community page.

chainer-operator's People

Contributors

disktnk, everpeace, jlewi, terrytangyuan

chainer-operator's Issues

Remove gcp test infra support

Hi @kubeflow/wg-training-leads, I noticed there are many low-activity training repos.
Can I disable GCP test infra for them?
I believe for repos you keep maintaining, they have already moved to optional test infra.

https://github.com/kubeflow/community/blob/2336f10cc98b35b3812960feb8a3ab2a8df6cea0/wgs.yaml#L456-L483

subprojects:
  - name: caffe2-operator
  - name: chainer-operator
  - name: common
  - name: fate-operator
  - name: mpi-operator
  - name: mxnet-operator
  - name: pytorch-operator
  - name: tf-operator
  - name: xgboost-operator

Move to bare pod model like TFJob

Currently, chainer-operator expands a ChainerJob into:

  • one master (kind: Job)
  • several worker sets (kind: StatefulSet)

A kind: Job keeps retrying even when the failure is caused by a bug in user code. activeDeadlineSeconds can mitigate this, but it does not work well in practice because real jobs often run for a very long time.

So my idea is to drop Job and StatefulSet and move to a bare Pod model like TFJob. Users could then use retryPolicy: ErrorCode to control retry behavior.
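
To make the proposal concrete, here is a minimal Go sketch of an exit-code-based retry decision of the kind a bare-Pod controller could apply. The helper name isRetryableExitCode and the specific set of retryable codes are assumptions for illustration, not code or policy taken from chainer-operator or TFJob.

```go
package main

import "fmt"

// isRetryableExitCode is a hypothetical helper, not part of chainer-operator.
// Assumed convention: exit codes produced by signals that usually point to an
// infrastructure event (SIGINT=130, SIGKILL=137, SIGTERM=143) are retryable.
// Everything else, including ordinary user-code failures such as 1, is
// treated as permanent so the job is not retried forever.
func isRetryableExitCode(code int32) bool {
	switch code {
	case 130, 137, 143:
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int32{1, 2, 137, 143} {
		fmt.Printf("exit code %d retryable: %v\n", code, isRetryableExitCode(code))
	}
}
```

With a rule like this, a Python exception in the training script (exit code 1) would fail the job immediately, while a Pod killed by node eviction could be recreated.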

/area 0.4.0
/priority p1

Pave Initial Readme

Toc idea:


Overview

Deploy the Operator (admin task)

Creating a Job (user task)

explain

  • ChainerJob structure
  • backend

Using GPUs

Accessing logs


update Go

Hi @kubeflow/wg-training-leads, while adding the script to generate the dependabot configuration I noticed that the Go dependency files are not compatible. I'm guessing Go needs to be updated so that go.mod and go.sum are created rather than Gopkg.lock and Gopkg.toml. Once that is done, running make build-dependabot from the root of the repo will update the config and allow it to scan for Go dependencies.

#30

Support gang-scheduling

A multi-node ChainerJob is expected to be scheduled as a group of pods; in other words, all of its pods should be scheduled at once. kube-arbitrator supports this kind of scheduling.
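
The all-or-nothing idea can be sketched in Go as follows. This is a toy model only: placeGang and its capacity representation are hypothetical and do not reflect how kube-arbitrator or the Kubernetes scheduler is implemented.

```go
package main

import "fmt"

// placeGang is a hypothetical illustration of gang scheduling.
// free[i] is the number of pods node i can still accept (assumed >= 0).
// It returns the number of pods placed on each node only if every pod of the
// group fits; otherwise it returns nil and places nothing.
func placeGang(free []int, pods int) []int {
	total := 0
	for _, f := range free {
		total += f
	}
	if pods > total {
		return nil // all-or-nothing: never start a partial group
	}
	placed := make([]int, len(free))
	remaining := pods
	for i, f := range free {
		n := f
		if n > remaining {
			n = remaining
		}
		placed[i] = n
		remaining -= n
	}
	return placed
}

func main() {
	fmt.Println(placeGang([]int{2, 1}, 3)) // whole group fits
	fmt.Println(placeGang([]int{2, 1}, 4)) // group does not fit, nothing placed
}
```

Without such a check, a master pod could start and sit waiting for workers that never get scheduled, holding resources that another job could have used.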

controller fails when mpiConfig.slots is null

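The defaulter sets a default for slots only when a container named chainer exists. In the reproducer below the containers are named chainer-job, so slots stays null and the controller panics.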

https://github.com/kubeflow/chainer-operator/blob/master/pkg/apis/chainer/v1alpha1/defaults.go#L50

  • reproducer
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: chainer-job
  namespace: default
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - args:
          - -n
          - "2"
          - -N
          - "-1"
          - --allow-run-as-root
          - python3
          - /train_mnist.py
          - -e
          - "2"
          - -b
          - "1000"
          - -u
          - "100"
          command:
          - mpiexec
          image: everpeace/chainermn:latest
          name: chainer-job
  workerSets:
    ws:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - -c
            - trap exit TERM; while true; do sleep 1 & wait; done
            command:
            - sh
            image: everpeace/chainermn:latest
            name: chainer-job
  • log
 E0628 12:31:07.098157       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/asm_amd64.s:573
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/panic.go:502
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/panic.go:63
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/signal_unix.go:388
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:282
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/util.go:199
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:234
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:136
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:448
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:340
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:348
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:301
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:287
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/asm_amd64.s:2361
chainer-operator-c9cb5f946-cfxpc chainer-operator panic: runtime error: invalid memory address or nil pointer dereference [recovered]
chainer-operator-c9cb5f946-cfxpc chainer-operator       panic: runtime error: invalid memory address or nil pointer dereference
chainer-operator-c9cb5f946-cfxpc chainer-operator [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe9e1c0]
chainer-operator-c9cb5f946-cfxpc chainer-operator
chainer-operator-c9cb5f946-cfxpc chainer-operator goroutine 127 [running]:
chainer-operator-c9cb5f946-cfxpc chainer-operator github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
chainer-operator-c9cb5f946-cfxpc chainer-operator       /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
chainer-operator-c9cb5f946-cfxpc chainer-operator panic(0xfaee60, 0x18c3d70)
chainer-operator-c9cb5f946-cfxpc chainer-operator       /usr/local/go/src/runtime/panic.go:502 +0x229
chainer-operator-c9cb5f946-cfxpc chainer-operator github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi.newConfigMap(0xc420288000, 0xc4202b6dc0)
chainer-operator-c9cb5f946-cfxpc chainer-operator       /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:282 +0x180
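
The trace above ends in newConfigMap, which suggests a nil pointer is dereferenced there. One common guard is to read optional pointer fields through a helper that falls back to a default. The sketch below assumes a simplified MPIConfig type with a Slots *int32 field; the type, the helper slotsOrDefault, and the default value are illustrative and are not the operator's actual definitions.

```go
package main

import "fmt"

// MPIConfig is a simplified, hypothetical stand-in for the operator's MPI
// settings. Slots is a pointer so that "unset" (nil) differs from zero.
type MPIConfig struct {
	Slots *int32
}

// slotsOrDefault returns the configured slot count, or def when either the
// config itself or its Slots field is nil. Hypothetical helper.
func slotsOrDefault(cfg *MPIConfig, def int32) int32 {
	if cfg == nil || cfg.Slots == nil {
		return def
	}
	return *cfg.Slots
}

func main() {
	two := int32(2)
	fmt.Println(slotsOrDefault(nil, 1))                      // config missing
	fmt.Println(slotsOrDefault(&MPIConfig{}, 1))             // slots unset
	fmt.Println(slotsOrDefault(&MPIConfig{Slots: &two}, 1)) // slots set
}
```

A guard at the point of use keeps the controller alive even when defaulting is skipped, for example because the container name does not match.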
