kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/training

License: Apache License 2.0

Go 62.26% Shell 1.42% Python 35.83% Makefile 0.34% Dockerfile 0.15%
ai distributed fine-tuning gpu huggingface jax kubeflow kubernetes llm machine-learning

training-operator's Introduction

Kubeflow Training Operator

Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, XGBoost, MPI, Paddle, and others.

Training Operator allows you to use Kubernetes workloads to effectively train your large models via the Kubernetes Custom Resources API or the Training Operator Python SDK.

Note: Before the v1.2 release, Kubeflow Training Operator only supported TFJob on Kubernetes.

Prerequisites

  • Kubernetes cluster and kubectl, version >= 1.25

Installation

Master Branch

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

Stable Release

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

TensorFlow Release Only

For users who prefer the original TensorFlow controllers, please check out the v1.2-branch; patches for bug fixes are still accepted on this branch.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"

Python SDK for Kubeflow Training Operator

Training Operator provides a Python SDK for the custom resources. To learn more about the available SDK APIs, check the TrainingClient.

Use the pip install command to install the latest release of the SDK:

pip install kubeflow-training

The Training Operator controller and the Python SDK share the same release versions.

Quickstart

Please refer to the getting started guide to quickly create your first Training Operator Job using the Python SDK.
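
A minimal sketch of creating a job with the SDK is shown below. The TrainingClient method and parameter names used here (create_job, train_func, num_workers, get_job_logs) are taken from recent SDK releases and may differ between versions, so treat this as an illustration rather than a definitive reference.

from kubeflow.training import TrainingClient

def train_func():
    # Placeholder training function that runs inside each worker pod;
    # replace with real training code.
    import torch
    print(f"PyTorch version: {torch.__version__}")

client = TrainingClient()

# Create a PyTorchJob that runs train_func on two workers.
client.create_job(
    name="pytorch-example",
    train_func=train_func,
    num_workers=2,
)

# Stream the logs of the job's pods.
client.get_job_logs(name="pytorch-example", follow=True)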

If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.

API Documentation

Please refer to the API documentation.

Community

The following resources provide information about getting involved in the community:

This project is part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.

Contributing

Please refer to the DEVELOPMENT guide.

Change Log

Please refer to the CHANGELOG.

Version Matrix

The following table lists the most recent few versions of the operator.

Operator Version API Version Kubernetes Version
v1.0.x v1 1.16+
v1.1.x v1 1.16+
v1.2.x v1 1.16+
v1.3.x v1 1.18+
v1.4.x v1 1.23+
v1.5.x v1 1.23+
v1.6.x v1 1.23+
v1.7.x v1 1.25+
latest (master HEAD) v1 1.25+

Acknowledgement

This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.

training-operator's People

Contributors

andreyvelich, chanyilin, cheimu, codeflitting, deepanker13, dependabot[bot], gaocegege, hackerboy01, jeffwan, jian-he, jimexist, jinchihe, jlewi, johnugeorge, jose5918, kuizhiqing, lluunn, lowang-bh, moon03432, nagar-ajay, oikomi, richardsliu, scorpiocph, syulin7, tenzen-y, terrytangyuan, u2takey, wackxu, wbuchwalter, zw0610

training-operator's Issues

Clean up examples; don't require cloning the repo

The examples are a bit of a mess.

Do we need to use helm for the examples?

It would be nice if folks didn't need to clone the repo just to run the examples. If we use helm we should deploy the package to GCS as part of our releases so users can run them just by installing the archive using a public link. If we have raw, untemplated, YAML then we can just use a github https link with kubectl.

API Review

Opening an issue to track API issues that should be addressed as part of moving from Alpha to Beta.

  • TfReplicaType
    - Should this be an enum?
    - Instead of specifying replica type should we specify features e.g. run forever, restart always, etc...

  • master vs. chief nomenclature

@mrry from the TensorFlow team should be included in the eventual API review.

Rename project

If I understand correctly, mlkube was the internal name of this project at Google?
What do you think of renaming this repo to TFJob, or Tensorflow-operator?
Not a big deal, but it would be good for discoverability/clarity IMO.

E2E test should actually run TensorFlow

The E2E helm test doesn't actually run TensorFlow. We should update the test so that it will actually run a simple distributed TensorFlow program which will fail if all the workers and parameter servers can't communicate.
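
A minimal sketch of such a smoke test, written against the TF 1.x distributed APIs of that era and assuming the operator injects a TF_CONFIG-style cluster description; the probe logic here is illustrative, not the actual test.

import json
import os
import tensorflow as tf

# TF_CONFIG is expected to hold the cluster spec and this replica's task info.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config.get("cluster", {}))
job_name = tf_config["task"]["type"]
task_index = int(tf_config["task"]["index"])

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    # Parameter servers simply serve variables until the job is torn down.
    server.join()
else:
    # Place a probe variable on the first PS and read it back from the worker;
    # the read fails if workers and parameter servers cannot communicate.
    with tf.device("/job:ps/task:0"):
        probe = tf.Variable(42, name="probe")
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        assert sess.run(probe) == 42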

mlkube.io -> tensorflow/k8s

Now that we've moved the repo to tensorflow/k8s we need to update all the go imports and other references to mlkube.io

periodic test is failing

Periodic test is failing

exit status 1 Command output: RUNNING: rzp-tfjob-test-gzbl8t
FAILED: rzp-tfjob-test-gzbl8t, run `kubectl logs rzp-tfjob-test-gzbl8t --namespace default` for more info

I think we need to fix #82 so that we have the stdout/stderr so we can debug it.

Better GPU support

We should make it easier to use GPUs.

Right now to use GPUs the user would have to add appropriate volume mounts to the PodSpec in the TfJob to mount the GPU devices from the host and set other specs like environment variables if needed.

I think we should have a higher level API. For example

type TFReplicaSpec struct {
  ...
  Gpus []GpuSpec
}

type GpuSpec struct {
  Type  string
  Count int32
}

The TfJob controller could then be instantiated with the necessary information to add the appropriate volume mounts and scheduling information to the pods.

Dependency management

Hi,
Currently the project is not buildable because there have been breaking changes in k8s.io/client-go/pkg/api/v1, so go install complains that it cannot find buildable Go source files.

Are you open to adding something like dep or Glide to lock in specific versions of the project's dependencies?

Consider how we manage replicas (stateful sets, managing pods directly)

In the current implementation, if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas, we end up creating N Job controllers. This is largely a holdover from the initial implementation, which predated StatefulSets. Now that StatefulSets are more mature, we should consider switching to StatefulSets. This should simplify the logic in the CRD.

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the index in the stateful set.

Here's a snippet showing the struct that's stored in TF_CONFIG.

TfConfig{
	Cluster: s.Job.ClusterSpec(),
	Task: map[string]interface{}{
		"type":  strings.ToLower(string(s.Spec.TfReplicaType)),
		"index": index,
	},
}

Currently we construct a unique value of the environment variable TF_CONFIG for each job controller. For stateful sets we'd need a new mechanism to configure this for each replica.

It doesn't look like we can use a PreStart hook since there's no guarantee it runs before the ENTRYPOINT.
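
One possible mechanism is sketched below: each replica derives its index from the StatefulSet pod-name convention "<statefulset-name>-<ordinal>" and assembles TF_CONFIG itself. The CLUSTER_SPEC and REPLICA_TYPE environment variables are hypothetical names for data the operator would still have to inject.

import json
import os
import socket

# StatefulSet pods are named "<statefulset-name>-<ordinal>", e.g. "worker-2".
hostname = socket.gethostname()
index = int(hostname.rsplit("-", 1)[-1])

tf_config = {
    # CLUSTER_SPEC is assumed to carry the shared cluster description as JSON.
    "cluster": json.loads(os.environ["CLUSTER_SPEC"]),
    "task": {
        "type": os.environ.get("REPLICA_TYPE", "worker").lower(),
        "index": index,
    },
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)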

Create a web page to list releases

It would be nice to be able to list releases without using gsutil.

It would be nice to create a simple web page that automatically lists and shows all releases.

Alternatively when we push a release we could create an index.html file and serve that from GCS.

TfJobRestClient.Create doesn't set kind appropriately

If you create a job using TfJobRestClient.Create and then use kubectl, we get an error:

kubectl get tfjobs -o yaml gpu-load-test-job-q5tf
error: Object 'Kind' is missing in '{"metadata":{"name":"gpu-load-test-job-q5tf","namespace":"default","selfLink":"/apis/mlkube.io/v1beta1/namespaces/default/tfjobs/gpu-load-test-job-q5tf","uid":"edb05532-71bf-11e7-a0f8-42010a8e0021","resourceVersion":"1222","creationTimestamp":"2017-07-26T05:04:39Z","labels":{"test.mlkube.io":""}},"spec":{"RuntimeId":"u3nm","replica_specs":[{"replicas":1,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[{"args":["--gpu"],"env":[{"name":"LD_LIBRARY_PATH","value":"/usr/local/cuda/lib64"}],"image":"gcr.io/gcb/tf_smoke_cmle-375-20:latest","name":"tensorflow","resources":{},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/dev/nvidia0","name":"dev-nvidia"},{"mountPath":"/dev/nvidiactl","name":"dev-nvidiactl"},{"mountPath":"/dev/nvidia-uvm","name":"dev-nvidia-uvm"}]}],"restartPolicy":"OnFailure","volumes":[{"hostPath":{"path":"/dev/nvidia0"},"name":"dev-nvidia"},{"hostPath":{"path":"/dev/nvidiactl"},"name":"dev-nvidiactl"},{"hostPath":{"path":"/dev/nvidia-uvm"},"name":"dev-nvidia-uvm"}]}},"tf_port":2222,"tf_replica_type":"MASTER"}]},"status":{"conditions":null,"controlPaused":false,"phase":"Done","reason":"","replicaStatuses":[{"ReplicasStates":{"Succeeded":1},"state":"Succeeded","tf_replica_type":"MASTER"}],"state":"Succeeded"}}

No results show up if you click on mlkube-build-periodic

If you click on periodic jobs in Prow, we get the following internal server error:

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/apps/s~k8s-gubernator/v20171019-c2170473.404922314496660020/view_base.py", line 56, in dispatch
    webapp2.RequestHandler.dispatch(self)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~k8s-gubernator/v20171019-c2170473.404922314496660020/view_build.py", line 240, in get
    refs.append((x[1], ''))
IndexError: list index out of range

Setup release process for CRD

We should set up a release process for the CRD.

The artifact should be a helm package published in GCS so that customers can just do

helm install http://storage.googleapis.com/mlkube-io-charts/tfjob/latest.tgz

or for users wanting a specific version

helm install http://storage.googleapis.com/mlkube-io-charts/tfjob/chart-v20171019-383eafd.tgz

The steps involved are

  1. Get the latest green postsubmit
  2. Check out the code
  3. Build and push the docker image
  4. Update default values for the chart to point to the new image.
  5. Push a .tar to GCS

We can have our postsubmit ProwJobs write a GCS file containing the latest green postsubmit.

I think it's better to rebuild the Docker image rather than promote images in gcr.io/mlkube-testing because permissions on the images in gcr.io/mlkube-testing are rather broad (they could be modified by any code running under K8s Prow presubmits).

We could eventually set up GCB or some other mechanism to trigger pushes automatically, but that's not a priority.

If handling Add event fails, TfJob should be marked as failed with appropriate error

Currently if there is a problem handling the Add event for a TfJob, the job just gets stuck as opposed to being reported as failed with an appropriate error message.

Here's an example:

apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-08-21T13:41:49Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  name: dv2-eval-0641
  namespace: default
  resourceVersion: "877473"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/eval-0641
  uid: 7b3eb3cd-8676-11e7-a025-42010a8e0097
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - command:
          - python
          - -m
          - my_code.train
          - --master=
          - --checkpoint_dir=gs:/some/path
          - --eval_dir=gs:/some/path
          - --alsologtostderr
          image: gcr.io/cloud-ml-dev/image:latest
          name: tensorflow
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorBoard:
    logDir: null

NewTBReplicaSet in this case is failing because logDir isn't specified.

The TfJob just remains in the state indicated by the YAML above. That's not very helpful. The TfJob should be marked as failed with a helpful error message.

Use K8s Garbage Collection

The code in gc.go collects pods, services, and deployments with the special label app=tensorflow-job.
But the resources in a TfJob, such as its Jobs and Services, don't have that label.
So, will you add the label when creating those resources in the future?
In addition, a TfJob doesn't have a Kubernetes Deployment, and deleting the pods of a Job doesn't work because the Job object hasn't been deleted.
Finally, where does garbage come from? If a user deletes a TfJob and tf-operator crashes, or tf-operator receives the delete event but restarts before it deletes the TfJob, there will be some garbage. Such circumstances rarely occur. Are there any other cases?

How to create TF Jobs from the user side?

I am wondering what the actual procedure would be to go from a simple hello world example to actually deploying it on an mlkube-enabled k8s cluster.

Your example seems to include a Docker container and setting the script (which also has to contain some rather specific scaffolding, like reading config from the environment) as ENTRYPOINT. Is that the recommended way? Or even the only way?

Ideally, I'd like to just map in a file containing the run() function via a volume and avoid forcing everyone to include the scaffolding or something like that. Maybe I'm missing something though.

Structured (Json) logging for Tf Processes

We'd like to support outputting structured (json) log entries from TensorFlow processes. These json log entries should contain metadata in the form of key-value pairs which provide information such as

  • Job name
  • Replica type and index that generated the log message

The goal is to allow K8s clusters to be configured to use a logging backend (e.g. StackDriver Logging on GKE) that can index these labels to facilitate searching for logs from a particular job and replica.

Option 1: Python Logger

  1. We use a Python logger that can emit JSON-formatted logs and attach appropriate labels.

  2. The TfJob operator passes the necessary metadata to the Python logger (e.g. via environment variables)

  3. We make it easy for users to automatically use this custom Json logger

    • We use a config map to add the custom logger and a sitecustomize.py as a volume to all Tf processes
    • The sitecustomize could configure Python to use the json logger by default
    • TfJob operator could set the Python path so that Tf processes end up using the supplied sitecustomize.py

Option 2: Sidecar

  1. TfJob operator could redirect StdError/StdOut to a Pod volume
  2. A sidecar could rewrite the logs to create json logs and attach appropriate metadata.

Plan

I think option 1 is better. Using a sidecar, we'd have to parse the log entries, which requires knowing how they are formatted, and this leads to various edge cases when dealing with things like multi-line log entries. I think using Python's logging infrastructure to format log entries as needed is cleaner.

Option 2 would also change how a user uses kubectl logs to get the logs, since they would now have to get the logs of the sidecar and not the tensorflow container.
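
A minimal sketch of the Option 1 logger, assuming the operator injects the job metadata via environment variables (TF_JOB_NAME, TF_REPLICA_TYPE, and TF_REPLICA_INDEX are hypothetical names, not something the operator sets today):

import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Formats each record as a single JSON object with job metadata."""

    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "severity": record.levelname,
            "job_name": os.environ.get("TF_JOB_NAME", ""),
            "replica_type": os.environ.get("TF_REPLICA_TYPE", ""),
            "replica_index": os.environ.get("TF_REPLICA_INDEX", ""),
        })

# A sitecustomize.py shipped via a config map could install this handler
# so that TF processes emit JSON logs by default.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)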

Include TfJob name in labels

We should include the TfJob name in the labels of all resources. This would make it easy to identify all resources associated with a particular job.

Here is the current set of labels attached to a pod.

labels:
    controller-uid: e5e93fc9-b5e5-11e7-8522-42010a8e01a4
    job-name: master-76no-0
    job_type: MASTER
    mlkube.io: ""
    runtime_id: 76no
    task_index: "0"

E2E tests leaking GKE clusters

E2E tests seem to be leaking GKE clusters

gcloud --project=mlkube-testing container clusters list
NAME            ZONE           MASTER_VERSION  MASTER_IP        MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
prow            us-central1-f  1.7.6-gke.1     35.202.163.166   n1-standard-4  1.7.6 *       1          RUNNING
v20171017-153b  us-central1-f  1.7.6-gke.1     35.202.214.30    n1-standard-8  1.7.6 *       1          STOPPING
v20171017-4e44  us-central1-f  1.7.6-gke.1     35.188.116.185   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-bbc7  us-central1-f  1.7.6-gke.1     35.202.143.139   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-efab  us-central1-f  1.7.6-gke.1     104.198.197.46   n1-standard-8  1.7.6 *       1          STOPPING
v20171024-083b  us-central1-f  1.7.6-gke.1     130.211.234.145  n1-standard-8  1.7.6 *       1          STOPPING
v20171024-1cfc  us-central1-f  1.7.6-gke.1     35.184.45.20     n1-standard-8  1.7.6 *       1          STOPPING

(Clusters are listed as stopping because I manually deleted them).

More validation of TfJob

We should add more validation to TfJobSpec.Validate so that we reject invalid jobs with clear error messages; a sketch of these checks follows the list below.

Things to check

  1. At least 1 ReplicaSet is specified
  2. There is a MASTER
  3. There aren't multiple ReplicaSets of the same type
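
A rough sketch of these checks, written in Python purely for illustration (the operator's actual validation lives in Go in TfJobSpec.Validate):

def validate_replica_specs(replica_specs):
    """Reject obviously invalid TfJob specs with clear error messages."""
    if not replica_specs:
        raise ValueError("at least one replicaSpec must be specified")

    types = [spec.get("tfReplicaType") for spec in replica_specs]
    if "MASTER" not in types:
        raise ValueError("a MASTER replicaSpec is required")

    duplicates = sorted({t for t in types if types.count(t) > 1})
    if duplicates:
        raise ValueError("duplicate replica types: %s" % ", ".join(duplicates))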

UI / Kubernetes Dashboard Integration

It would be nice to have a UI that provides features like the following

  • List of jobs with relevant links (e.g. to TensorBoard)
  • Wizard to help create new jobs
    • Something similar to the "deploy" wizard in the kubernetes dashboard.

It looks like K8s is thinking about how to support CRDs in the K8s dashboard issue/1559. So that might give us a lot of the functionality we want.
cc @wbuchwalter

presubmit test(bootstrap.py) doesn't properly check out PRs

Presubmit test submitted on #36 failed with the following error.

INFO:root:Image info:
{
  "image": "gcr.io/mlkube-testing/builder:v20171017-470b0d6"
}
INFO:root:repo https://github.com/jlewi/mlkube.io.git
INFO:root:Running: git clone https://github.com/jlewi/mlkube.io.git /go/src/github.com/jlewi/mlkube.io
Cloning into '/go/src/github.com/jlewi/mlkube.io'...
INFO:root:Running: git checkout 9a62ee96cabfeb0a10fba304a634771325501e82
fatal: reference is not a tree: 9a62ee96cabfeb0a10fba304a634771325501e82
Traceback (most recent call last):
  File "/workspace/bootstrap.py", line 110, in <module>
    main()
  File "/workspace/bootstrap.py", line 102, in main
    src_dir, sha = clone_repo()
  File "/workspace/bootstrap.py", line 78, in clone_repo
    run(["git", "checkout", sha], cwd=dest)
  File "/workspace/bootstrap.py", line 38, in run
    subprocess.check_call(command, cwd=cwd)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'checkout', '9a62ee96cabfeb0a10fba304a634771325501e82']' returned non-zero exit status 128

Looks like the issue is that bootstrap.py is trying to check out the code from github.com/jlewi/mlkube.io. But since the PR is from a different user, the code is in a different repo.

So to fix it I need to figure out which environment variables to use. We can modify bootstrap.py to log the environment variables.

Structured Logging For the operator

I think it would be useful if the operator used structured logging.

For example, it would be nice if the operator output JSON-formatted records with various metadata tags. One tag could be the name of the job a log message pertains to. This would make it easy to filter the log messages by job.

https://github.com/sirupsen/logrus is a Go package for structured logging. The main reason I initially didn't use that and went with https://github.com/golang/glog was because logrus doesn't support outputting the file and line number of an error.

Ideally we'd like the best of both packages; i.e. structured logs with file and line number.

tensorflow 1.4 and estimator support

In TensorFlow 1.4, TF_CONFIG uses "chief" and not "master"; see here.

We should figure out what changes we should make to support this. We should also figure out how to continue supporting older versions of TF.

Estimator also added evaluation replicas which might not finish until after the master/chief finishes. So we will need to take evaluation replicas into account when determining job status.
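
For illustration only, the kind of TF_CONFIG rewrite this implies, mapping the legacy "master" task to the "chief" name expected by TF >= 1.4 (this is not the operator's actual code):

import json
import os

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]

# Rename the legacy "master" entry to the "chief" name used by TF >= 1.4.
if "master" in cluster:
    cluster["chief"] = cluster.pop("master")
if tf_config["task"]["type"] == "master":
    tf_config["task"]["type"] = "chief"

os.environ["TF_CONFIG"] = json.dumps(tf_config)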

Integrate with Prow for Continuous Testing

We would like to integrate with Prow, the K8s test infra, to provide continuous testing.

After some initial discussion with @foxish and @krzyzacy, the current plan is to have two Prow jobs:

  • The first job will check out and build a container containing the CRD and E2E tests.
  • The second job will install the CRD and run the E2E tests.

The current plan is not to rely on helm test but just to write Go tests as necessary.

First step is probably to create a prow job to build and push the Docker image.

Remaining Issues Blocking This Issue

  • #62 Post submit jobs not uploaded correctly
  • #75 Presubmit results not showing up in test grid
  • #76 No results show up for periodic jobs
  • #82 Need to copy stdout/stderr of tests to GCS.
  • #84 Update prow to use tensorflow/k8s
  • #113 postsubmit results show up in test grid

Permanent errors don't cause job failure

If a container crashes with an exit code of 1 this should be considered a permanent error and cause the job to fail.

This doesn't happen because isRetryableTerminationState
(https://github.com/jlewi/mlkube.io/blob/master/pkg/trainer/training.go)
requires that a termination message be set in order for the exit code to be trusted.

This is legacy code that is no longer applicable; it assumes we were using a launcher.sh script, which users aren't.

We should get rid of that check.

func c.findAllTfJobs() in controller.go will never be reached

err := c.createCRD()
if err != nil {
	if k8sutil.IsKubernetesResourceAlreadyExistError(err) {
		// CRD has been initialized before. We need to recover existing cluster.
		watchVersion, err = c.findAllTfJobs()

Only if c.createCRD() returns an error that matches the "AlreadyExists" status will func c.findAllTfJobs() be called. However, see the code in func createCRD() below:

_, err := c.ApiCli.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
if err != nil && !apierrors.IsAlreadyExists(err) {
	return err
}

If it receives the "AlreadyExists" response after creating the CRD, it will not return in the subsequent code. Instead, it will start a polling get call for the CRD. In the end, there will never be an err that matches the "AlreadyExists" status.
See "!apierrors.IsAlreadyExists(err)"; maybe the "!" should be removed?

TensorBoard Integration

How do you see TensorBoard integrating with this solution?
It would be really cool if I could create a template, ask for TensorBoard to be deployed as well and receive either a ClusterIP or Public IP.

For example this could look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  addons:
    - tensorboard:
         ip-type: LoadBalancer
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure

TensorBoard would then run as a sidecar in the master's pod.
Now the main issue here is accessing the log files.
An easy way would be to document a certain convention. For example, we assume that the log files are saved under /var/tensorflow/logs and then mount this directory into the TensorBoard container through the node.

This also begs the question of data persistence: in this state, once the job shuts down, all data is lost. Do you think we need to address this question right away, or could this be discussed later on?

Happy to work on this if you approve.
