kubeflow / chainer-operator
Repository for chainer operator
License: Apache License 2.0
A multi-node ChainerJob is expected to be scheduled as a group of pods; in other words, all the pods should be scheduled at once. kube-arbitrator supports such scheduling.
I want to use a private Azure Container Registry with Kubernetes. Is there a way to specify imagePullSecrets as a parameter?
@jlewi Could you help me do this? Thanks in advance.
This repository has not had a commit since Nov 14, 2021, therefore I propose we archive it to reflect that it is no longer in development.
Currently chainer-operator expands ChainerJob to
kind: Job
kind: StatefulSets
kind: Job will keep retrying even though a failure was caused by a user code bug. activeDeadlineSeconds can mitigate this. However, this doesn't work in practice because actual jobs often run for a very long time.
So, my idea is to drop Job and StatefulSets and move to a bare Pods model like TFJob. Then, users can use retryPolicy: ErrorCode to control retry behavior.
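To make the proposal concrete, here is a minimal Go sketch of how a controller managing bare Pods might decide whether to recreate a failed pod from its exit code. The type and function names are hypothetical, and the exit-code convention (0 is success, 1 to 127 is a permanent failure such as a user code bug, 128 and above is retryable because the process was likely killed by a signal) is an assumption for illustration; the issue does not specify it.

```go
package main

import "fmt"

// RetryDecision is a hypothetical type for this sketch;
// it is not part of chainer-operator.
type RetryDecision int

const (
	Succeeded RetryDecision = iota
	PermanentFailure
	Retryable
)

// String makes decisions readable when printed.
func (d RetryDecision) String() string {
	switch d {
	case Succeeded:
		return "Succeeded"
	case PermanentFailure:
		return "PermanentFailure"
	default:
		return "Retryable"
	}
}

// classifyExitCode applies the assumed convention:
//   0       -> success
//   >= 128  -> retryable (process likely terminated by a signal)
//   other   -> permanent failure (for example, a user code bug)
func classifyExitCode(code int) RetryDecision {
	switch {
	case code == 0:
		return Succeeded
	case code >= 128:
		return Retryable
	default:
		return PermanentFailure
	}
}

func main() {
	for _, code := range []int{0, 1, 137} {
		fmt.Printf("exit code %d -> %s\n", code, classifyExitCode(code))
	}
}
```

With a rule like this, a crash caused by a bug in the training script would stop the job instead of being retried indefinitely, while a pod killed by the node (for example by SIGKILL, exit code 137) could still be recreated.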
/area 0.4.0
/priority p1
The chainer component should be using an image tagged 0.3 for the 0.3 release.
Currently it's using the "latest" tag:
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/chainer-job/prototypes/chainer-operator.jsonnet#L6
Ideally the image would be published to gcr.io/kubeflow-images-public as well.
Is chainer ready for use?
When it is, should we add docs to kubeflow.org?
We should create a ksonnet package.
ref: #11 (comment)
/priority p1
Hi @kubeflow/wg-training-leads, I noticed there are many low-activity training repos.
Can I disable GCP test infra for them?
I believe for repos you keep maintaining, they have already moved to optional test infra.
subprojects:
- name: caffe2-operator
- name: chainer-operator
- name: common
- name: fate-operator
- name: mpi-operator
- name: mxnet-operator
- name: pytorch-operator
- name: tf-operator
- name: xgboost-operator
Hi @kubeflow/wg-training-leads, while adding the script to generate the dependabot configuration I noticed that the Go dependency files are not compatible. I'm guessing Go needs to be updated so that go.mod and go.sum are created rather than Gopkg.lock and Gopkg.toml. Once that is done, running make build-dependabot from the root of the repo will update the config and allow it to scan for Go dependencies.
ToC idea:
explain ChainerJob structure
Currently I pushed it to my personal Docker Hub, everpeace/chainer-operator.
We need to research how to deliver Kubeflow-related container images.
The defaulter sets default slots when a container with name chainer exists.
https://github.com/kubeflow/chainer-operator/blob/master/pkg/apis/chainer/v1alpha1/defaults.go#L50
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: chainer-job
  namespace: default
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - args:
          - -n
          - "2"
          - -N
          - "-1"
          - --allow-run-as-root
          - python3
          - /train_mnist.py
          - -e
          - "2"
          - -b
          - "1000"
          - -u
          - "100"
          command:
          - mpiexec
          image: everpeace/chainermn:latest
          name: chainer-job
  workerSets:
    ws:
      replicas: 1
      template:
        spec:
          containers:
          - args:
            - -c
            - trap exit TERM; while true; do sleep 1 & wait; done
            command:
            - sh
            image: everpeace/chainermn:latest
            name: chainer-job
E0628 12:31:07.098157 1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/asm_amd64.s:573
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/panic.go:502
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/panic.go:63
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/signal_unix.go:388
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:282
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/util.go:199
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:234
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:136
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:448
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:340
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:348
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:301
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/chainer_controller.go:287
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/asm_amd64.s:2361
chainer-operator-c9cb5f946-cfxpc chainer-operator panic: runtime error: invalid memory address or nil pointer dereference [recovered]
chainer-operator-c9cb5f946-cfxpc chainer-operator panic: runtime error: invalid memory address or nil pointer dereference
chainer-operator-c9cb5f946-cfxpc chainer-operator [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe9e1c0]
chainer-operator-c9cb5f946-cfxpc chainer-operator
chainer-operator-c9cb5f946-cfxpc chainer-operator goroutine 127 [running]:
chainer-operator-c9cb5f946-cfxpc chainer-operator github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
chainer-operator-c9cb5f946-cfxpc chainer-operator panic(0xfaee60, 0x18c3d70)
chainer-operator-c9cb5f946-cfxpc chainer-operator /usr/local/go/src/runtime/panic.go:502 +0x229
chainer-operator-c9cb5f946-cfxpc chainer-operator github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi.newConfigMap(0xc420288000, 0xc4202b6dc0)
chainer-operator-c9cb5f946-cfxpc chainer-operator /go/src/github.com/kubeflow/chainer-operator/pkg/controllers/backends/mpi/mpi_backend.go:282 +0x180
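One plausible reading of the trace, given that the manifest names its containers chainer-job rather than chainer, is that the slots value was never defaulted and was later dereferenced while nil. That is an inference, not something the trace confirms. The Go sketch below shows a guard that returns an error instead of panicking in that situation. All types and names here (Container, WorkerSet, resolveSlots) are simplified, hypothetical stand-ins and not the operator's actual code.

```go
package main

import "fmt"

// Container and WorkerSet are simplified, hypothetical stand-ins
// for the real API types; they exist only for this sketch.
type Container struct {
	Name string
}

type WorkerSet struct {
	// Slots is a pointer so that "unset" (nil) is distinguishable from 0.
	Slots      *int
	Containers []Container
}

// defaultContainerName is the container name the defaulter looks for,
// per the issue text.
const defaultContainerName = "chainer"

// hasContainer reports whether a container with the given name exists.
func hasContainer(containers []Container, name string) bool {
	for _, c := range containers {
		if c.Name == name {
			return true
		}
	}
	return false
}

// resolveSlots returns the slot count without ever dereferencing nil.
// If Slots is unset and no container named "chainer" exists, it returns
// an error so the caller can report it rather than crash.
func resolveSlots(ws WorkerSet, fallback int) (int, error) {
	if ws.Slots != nil {
		return *ws.Slots, nil
	}
	if !hasContainer(ws.Containers, defaultContainerName) {
		return 0, fmt.Errorf("slots unset and no container named %q found", defaultContainerName)
	}
	return fallback, nil
}

func main() {
	ws := WorkerSet{Containers: []Container{{Name: "chainer-job"}}}
	if _, err := resolveSlots(ws, 1); err != nil {
		fmt.Println("validation error:", err)
	}
}
```

Returning an error at this point would let the controller mark the job as invalid and keep running, rather than bringing down the worker goroutine.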
ref: #11 (comment)