kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/training

License: Apache License 2.0

Go 62.26% Shell 1.42% Python 35.83% Makefile 0.34% Dockerfile 0.15%
ai distributed fine-tuning gpu huggingface jax kubeflow kubernetes llm machine-learning

training-operator's Introduction

Kubeflow Training Operator

Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, XGBoost, MPI, Paddle, and others.

Training Operator allows you to use Kubernetes workloads to effectively train your large models via the Kubernetes Custom Resources API or the Training Operator Python SDK.

Note: Before the v1.2 release, Kubeflow Training Operator only supported TFJob on Kubernetes.

Prerequisites

  • Kubernetes cluster and kubectl, version >= 1.25

Installation

Master Branch

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"

Stable Release

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

TensorFlow Release Only

For users who prefer the original TensorFlow controllers, please check out the v1.2-branch; patches for bug fixes are still accepted on this branch.

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"

Python SDK for Kubeflow Training Operator

Training Operator provides a Python SDK for the custom resources. To learn more about the available SDK APIs, check the TrainingClient.

Use the pip install command to install the latest release of the SDK:

pip install kubeflow-training

The Training Operator controller and the Python SDK share the same release versions.

Quickstart

Please refer to the getting started guide to quickly create your first Training Operator Job using the Python SDK.
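
A minimal sketch of creating a job with the SDK is shown below. The TrainingClient method and parameter names used here (create_job, train_func, num_workers, get_job_logs) are taken from recent SDK releases and may differ between versions, so treat this as an illustration rather than a definitive reference.

from kubeflow.training import TrainingClient

def train_func():
    # Placeholder training function that runs inside each worker pod;
    # replace with real training code.
    import torch
    print(f"PyTorch version: {torch.__version__}")

client = TrainingClient()

# Create a PyTorchJob that runs train_func on two workers.
client.create_job(
    name="pytorch-example",
    train_func=train_func,
    num_workers=2,
)

# Stream the logs of the job's pods.
client.get_job_logs(name="pytorch-example", follow=True)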

If you want to work directly with Kubernetes Custom Resources provided by Training Operator, follow the PyTorchJob MNIST guide.

API Documentation

Please refer to the API documentation.

Community

The following resources provide information about getting involved in the community:

This project is part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.

Contributing

Please refer to the DEVELOPMENT guide.

Change Log

Please refer to the CHANGELOG.

Version Matrix

The following table lists the most recent few versions of the operator.

Operator Version API Version Kubernetes Version
v1.0.x v1 1.16+
v1.1.x v1 1.16+
v1.2.x v1 1.16+
v1.3.x v1 1.18+
v1.4.x v1 1.23+
v1.5.x v1 1.23+
v1.6.x v1 1.23+
v1.7.x v1 1.25+
latest (master HEAD) v1 1.25+

Acknowledgement

This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.

training-operator's People

Contributors

andreyvelich, chanyilin, cheimu, codeflitting, deepanker13, dependabot[bot], gaocegege, hackerboy01, jeffwan, jian-he, jimexist, jinchihe, jlewi, johnugeorge, jose5918, kuizhiqing, lluunn, lowang-bh, moon03432, nagar-ajay, oikomi, richardsliu, scorpiocph, syulin7, tenzen-y, terrytangyuan, u2takey, wackxu, wbuchwalter, zw0610

training-operator's Issues

Clean up examples; don't require cloning the repo

The examples are a bit of a mess.

Do we need to use helm for the examples?

It would be nice if folks didn't need to clone the repo just to run the examples. If we use helm we should deploy the package to GCS as part of our releases so users can run them just by installing the archive using a public link. If we have raw, untemplated, YAML then we can just use a github https link with kubectl.

API Review

Opening an issue to track API issues that should be addressed as part of moving from Alpha to Beta.

  • TfReplicaType
    - Should this be an enum?
    - Instead of specifying replica type should we specify features e.g. run forever, restart always, etc...

  • master vs. chief nomenclature

@mrry from the TensorFlow team should be included in the eventual API review.

Rename project

If I understand correctly, mlkube was the internal name of this project at Google?
What do you think of renaming this repo to TFJob, or Tensorflow-operator?
Not a big deal, but it would be good for discoverability/clarity IMO.

E2E test should actually run TensorFlow

The E2E helm test doesn't actually run TensorFlow. We should update the test so that it will actually run a simple distributed TensorFlow program which will fail if all the workers and parameter servers can't communicate.
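
A minimal sketch of such a smoke test, written against the TF 1.x distributed APIs of that era and assuming the operator injects a TF_CONFIG-style cluster description; the probe logic here is illustrative, not the actual test.

import json
import os
import tensorflow as tf

# TF_CONFIG is expected to hold the cluster spec and this replica's task info.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config.get("cluster", {}))
job_name = tf_config["task"]["type"]
task_index = int(tf_config["task"]["index"])

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    # Parameter servers simply serve variables until the job is torn down.
    server.join()
else:
    # Place a probe variable on the first PS and read it back from the worker;
    # the read fails if workers and parameter servers cannot communicate.
    with tf.device("/job:ps/task:0"):
        probe = tf.Variable(42, name="probe")
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        assert sess.run(probe) == 42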

mlkube.io -> tensorflow/k8s

Now that we've moved the repo to tensorflow/k8s we need to update all the go imports and other references to mlkube.io

periodic test is failing

Periodic test is failing

exit status 1 Command output: RUNNING: rzp-tfjob-test-gzbl8t
FAILED: rzp-tfjob-test-gzbl8t, run `kubectl logs rzp-tfjob-test-gzbl8t --namespace default` for more info

I think we need to fix #82 so that we have the stdout/stderr so we can debug it.

Better GPU support

We should make it easier to use GPUs.

Right now to use GPUs the user would have to add appropriate volume mounts to the PodSpec in the TfJob to mount the GPU devices from the host and set other specs like environment variables if needed.

I think we should have a higher level API. For example

type TFReplicaSpec struct {
  ...
  Gpus []GpuSpec
}

type GpuSpec struct {
  Type  string
  Count int32
}

The TfJob controller could then be instantiated with the necessary information to add the appropriate volume mounts and scheduling information to the pods.

Dependency management

Hi,
Currently the project is not buildable because there have been breaking changes in k8s.io/client-go/pkg/api/v1, so go install complains that it cannot find buildable Go source files.

Are you open to adding something like dep or Glide to lock in specific versions of the project's dependencies?

Consider how we manage replicas (stateful sets, managing pods directly)

In the current implementation, if a TfProcess (e.g. PS, MASTER, WORKER) has N replicas, we end up creating N Job controllers. This is largely a holdover from the initial implementation, which predated StatefulSets. Now that StatefulSets are more mature, we should consider switching to StatefulSets. This should simplify the logic in the CRD.

The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the index in the stateful set.

Here's a snippet showing the struct that's stored in TF_CONFIG.

TfConfig{
	Cluster: s.Job.ClusterSpec(),
	Task: map[string]interface{}{
		"type":  strings.ToLower(string(s.Spec.TfReplicaType)),
		"index": index,
	},
}

Currently we construct a unique value of the environment variable TF_CONFIG for each job controller. For stateful sets we'd need a new mechanism to configure this for each replica.

It doesn't look like we can use a PreStart hook since there's no guarantee it runs before the ENTRYPOINT.
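
One possible mechanism is sketched below: each replica derives its index from the StatefulSet pod-name convention "<statefulset-name>-<ordinal>" and assembles TF_CONFIG itself. The CLUSTER_SPEC and REPLICA_TYPE environment variables are hypothetical names for data the operator would still have to inject.

import json
import os
import socket

# StatefulSet pods are named "<statefulset-name>-<ordinal>", e.g. "worker-2".
hostname = socket.gethostname()
index = int(hostname.rsplit("-", 1)[-1])

tf_config = {
    # CLUSTER_SPEC is assumed to carry the shared cluster description as JSON.
    "cluster": json.loads(os.environ["CLUSTER_SPEC"]),
    "task": {
        "type": os.environ.get("REPLICA_TYPE", "worker").lower(),
        "index": index,
    },
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)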

Create a web page to list releases

It would be nice to be able to list releases without using gsutil.

It would be nice to create a simple web page that automatically lists and shows all releases.

Alternatively when we push a release we could create an index.html file and serve that from GCS.

TfJobRestClient.Create doesn't set kind appropriately

If you create a job using TfJobRestClient.Create and then use kubectl, we get an error:

kubectl get tfjobs -o yaml gpu-load-test-job-q5tf
error: Object 'Kind' is missing in '{"metadata":{"name":"gpu-load-test-job-q5tf","namespace":"default","selfLink":"/apis/mlkube.io/v1beta1/namespaces/default/tfjobs/gpu-load-test-job-q5tf","uid":"edb05532-71bf-11e7-a0f8-42010a8e0021","resourceVersion":"1222","creationTimestamp":"2017-07-26T05:04:39Z","labels":{"test.mlkube.io":""}},"spec":{"RuntimeId":"u3nm","replica_specs":[{"replicas":1,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[{"args":["--gpu"],"env":[{"name":"LD_LIBRARY_PATH","value":"/usr/local/cuda/lib64"}],"image":"gcr.io/gcb/tf_smoke_cmle-375-20:latest","name":"tensorflow","resources":{},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/dev/nvidia0","name":"dev-nvidia"},{"mountPath":"/dev/nvidiactl","name":"dev-nvidiactl"},{"mountPath":"/dev/nvidia-uvm","name":"dev-nvidia-uvm"}]}],"restartPolicy":"OnFailure","volumes":[{"hostPath":{"path":"/dev/nvidia0"},"name":"dev-nvidia"},{"hostPath":{"path":"/dev/nvidiactl"},"name":"dev-nvidiactl"},{"hostPath":{"path":"/dev/nvidia-uvm"},"name":"dev-nvidia-uvm"}]}},"tf_port":2222,"tf_replica_type":"MASTER"}]},"status":{"conditions":null,"controlPaused":false,"phase":"Done","reason":"","replicaStatuses":[{"ReplicasStates":{"Succeeded":1},"state":"Succeeded","tf_replica_type":"MASTER"}],"state":"Succeeded"}}

No results show up if you click on mlkube-build-periodic

If you click on periodic jobs in Prow, we get the following internal server error:

Traceback (most recent call last):
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1535, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/apps/s~k8s-gubernator/v20171019-c2170473.404922314496660020/view_base.py", line 56, in dispatch
    webapp2.RequestHandler.dispatch(self)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27_experiment/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~k8s-gubernator/v20171019-c2170473.404922314496660020/view_build.py", line 240, in get
    refs.append((x[1], ''))
IndexError: list index out of range

Setup release process for CRD

We should set up a release process for the CRD.

The artifact should be a helm package published in GCS so that customers can just do

helm install http://storage.googleapis.com/mlkube-io-charts/tfjob/latest.tgz

or for users wanting a specific version

helm install http://storage.googleapis.com/mlkube-io-charts/tfjob/chart-v20171019-383eafd.tgz

The steps involved are

  1. Get the latest green postsubmit
  2. Check out the code
  3. Build and push the docker image
  4. Update default values for the chart to point to the new image.
  5. Push a .tar to GCS

We can have our postsubmit ProwJobs write a GCS file containing the latest green postsubmit.

I think it's better to rebuild the Docker image rather than promote images in gcr.io/mlkube-testing because permissions on the images in gcr.io/mlkube-testing are rather broad (they could be modified by any code running under K8s Prow presubmits).

We could eventually set up GCB or some other mechanism to trigger pushes automatically, but that's not a priority.

If handling Add event fails, TfJob should be marked as failed with appropriate error

Currently if there is a problem handling the Add event for a TfJob, the job just gets stuck as opposed to being reported as failed with an appropriate error message.

Here's an example:

apiVersion: mlkube.io/v1beta1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-08-21T13:41:49Z
  deletionGracePeriodSeconds: null
  deletionTimestamp: null
  name: dv2-eval-0641
  namespace: default
  resourceVersion: "877473"
  selfLink: /apis/mlkube.io/v1beta1/namespaces/default/tfjobs/eval-0641
  uid: 7b3eb3cd-8676-11e7-a025-42010a8e0097
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - command:
          - python
          - -m
          - my_code.train
          - --master=
          - --checkpoint_dir=gs:/some/path
          - --eval_dir=gs:/some/path
          - --alsologtostderr
          image: gcr.io/cloud-ml-dev/image:latest
          name: tensorflow
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorBoard:
    logDir: null

NewTBReplicaSet in this case is failing because logDir isn't specified.

The TfJob just remains in the state indicated by the YAML above. That's not very helpful. The TfJob should be marked as failed with a helpful error message.

Use K8s Garbage Collection

The code in gc.go collects pods, services, and deployments with the special label app=tensorflow-job.
But the resources in a TfJob, such as its Jobs and Services, don't have that label.
So, will you add the label when creating those resources in the future?
In addition, a TfJob doesn't have a Kubernetes Deployment, and deleting the pods of a Job doesn't work because the Job object hasn't been deleted.
Finally, where does garbage come from? If a user deletes a TfJob and tf-operator crashes, or tf-operator receives the delete event but restarts before it deletes the TfJob, there will be some garbage. Such circumstances rarely occur. Are there any other cases?

How to create TF Jobs from the user side?

I am wondering what the actual procedure would be to go from a simple hello world example to actually deploying it on an mlkube-enabled k8s cluster.

Your example seems to include a Docker container and setting the script (which also has to contain some rather specific scaffolding, like reading config from the environment) as ENTRYPOINT. Is that the recommended way? Or even the only way?

Ideally, I'd like to just map in a file containing the run() function via a volume and avoid forcing everyone to include the scaffolding or something like that. Maybe I'm missing something though.

Structured (Json) logging for Tf Processes

We'd like to support outputting structured (json) log entries from TensorFlow processes. These json log entries should contain metadata in the form of key-value pairs which provide information such as

  • Job name
  • Replica type and index that generated the log message

The goal is to allow K8s clusters to be configured to use a logging backend (e.g. StackDriver Logging on GKE) that can index these labels to facilitate searching for logs from a particular job and replica.

Option 1: Python Logger

  1. We use a Python logger that can emit JSON-formatted logs and attach appropriate labels.

  2. The TfJob operator passes the necessary metadata to the Python logger (e.g. via environment variables)

  3. We make it easy for users to automatically use this custom Json logger

    • We use a config map to add the custom logger and a sitecustomize.py as a volume to all Tf processes
    • The sitecustomize could configure Python to use the json logger by default
    • TfJob operator could set the Python path so that Tf processes end up using the supplied sitecustomize.py

Option 2: Sidecar

  1. TfJob operator could redirect StdError/StdOut to a Pod volume
  2. A sidecar could rewrite the logs to create json logs and attach appropriate metadata.

Plan

I think option 1 is better. Using a sidecar, we'd have to parse the log entries, which requires knowing how they are formatted, and this leads to various edge cases when dealing with things like multi-line log entries. I think using Python's logging infrastructure to format log entries as needed is cleaner.

Option 2 would also change how a user uses kubectl logs to get the logs, since they would now have to get the logs of the sidecar and not the tensorflow container.
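
A minimal sketch of the Option 1 logger, assuming the operator injects the job metadata via environment variables (TF_JOB_NAME, TF_REPLICA_TYPE, and TF_REPLICA_INDEX are hypothetical names, not something the operator sets today):

import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Formats each record as a single JSON object with job metadata."""

    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "severity": record.levelname,
            "job_name": os.environ.get("TF_JOB_NAME", ""),
            "replica_type": os.environ.get("TF_REPLICA_TYPE", ""),
            "replica_index": os.environ.get("TF_REPLICA_INDEX", ""),
        })

# A sitecustomize.py shipped via a config map could install this handler
# so that TF processes emit JSON logs by default.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)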

Include TfJob name in labels

We should include the TfJob name in the labels of all resources. This would make it easy to identify all resources associated with a particular job.

Here is the current set of labels attached to a pod.

labels:
    controller-uid: e5e93fc9-b5e5-11e7-8522-42010a8e01a4
    job-name: master-76no-0
    job_type: MASTER
    mlkube.io: ""
    runtime_id: 76no
    task_index: "0"

E2E tests leaking GKE clusters

E2E tests seem to be leaking GKE clusters

gcloud --project=mlkube-testing container clusters list
NAME            ZONE           MASTER_VERSION  MASTER_IP        MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
prow            us-central1-f  1.7.6-gke.1     35.202.163.166   n1-standard-4  1.7.6 *       1          RUNNING
v20171017-153b  us-central1-f  1.7.6-gke.1     35.202.214.30    n1-standard-8  1.7.6 *       1          STOPPING
v20171017-4e44  us-central1-f  1.7.6-gke.1     35.188.116.185   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-bbc7  us-central1-f  1.7.6-gke.1     35.202.143.139   n1-standard-8  1.7.6 *       1          STOPPING
v20171017-efab  us-central1-f  1.7.6-gke.1     104.198.197.46   n1-standard-8  1.7.6 *       1          STOPPING
v20171024-083b  us-central1-f  1.7.6-gke.1     130.211.234.145  n1-standard-8  1.7.6 *       1          STOPPING
v20171024-1cfc  us-central1-f  1.7.6-gke.1     35.184.45.20     n1-standard-8  1.7.6 *       1          STOPPING

(Clusters are listed as stopping because I manually deleted them).

More validation of TfJob

We should add more validation to TfJobSpec.Validate so that we reject invalid jobs with clear error messages; a sketch of these checks follows the list below.

Things to check

  1. At least 1 ReplicaSet is specified
  2. There is a MASTER
  3. There aren't multiple ReplicaSets of the same type
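
A rough sketch of these checks, written in Python purely for illustration (the operator's actual validation lives in Go in TfJobSpec.Validate):

def validate_replica_specs(replica_specs):
    """Reject obviously invalid TfJob specs with clear error messages."""
    if not replica_specs:
        raise ValueError("at least one replicaSpec must be specified")

    types = [spec.get("tfReplicaType") for spec in replica_specs]
    if "MASTER" not in types:
        raise ValueError("a MASTER replicaSpec is required")

    duplicates = sorted({t for t in types if types.count(t) > 1})
    if duplicates:
        raise ValueError("duplicate replica types: %s" % ", ".join(duplicates))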

UI / Kubernetes Dashboard Integration

It would be nice to have a UI that provides features like the following

  • List of jobs with relevant links (e.g. to TensorBoard)
  • Wizard to help create new jobs
    • Something similar to the "deploy" wizard in the kubernetes dashboard.

It looks like K8s is thinking about how to support CRDs in the K8s dashboard issue/1559. So that might give us a lot of the functionality we want.
cc @wbuchwalter

presubmit test(bootstrap.py) doesn't properly check out PRs

Presubmit test submitted on #36 failed with the following error.

INFO:root:Image info:
{
  "image": "gcr.io/mlkube-testing/builder:v20171017-470b0d6"
}
INFO:root:repo https://github.com/jlewi/mlkube.io.git
INFO:root:Running: git clone https://github.com/jlewi/mlkube.io.git /go/src/github.com/jlewi/mlkube.io
Cloning into '/go/src/github.com/jlewi/mlkube.io'...
INFO:root:Running: git checkout 9a62ee96cabfeb0a10fba304a634771325501e82
fatal: reference is not a tree: 9a62ee96cabfeb0a10fba304a634771325501e82
Traceback (most recent call last):
  File "/workspace/bootstrap.py", line 110, in <module>
    main()
  File "/workspace/bootstrap.py", line 102, in main
    src_dir, sha = clone_repo()
  File "/workspace/bootstrap.py", line 78, in clone_repo
    run(["git", "checkout", sha], cwd=dest)
  File "/workspace/bootstrap.py", line 38, in run
    subprocess.check_call(command, cwd=cwd)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'checkout', '9a62ee96cabfeb0a10fba304a634771325501e82']' returned non-zero exit status 128

Looks like the issue is that bootstrap.py is trying to check out the code from github.com/jlewi/mlkube.io. But since the PR is from a different user, the code is in a different repo.

So to fix it I need to figure out which environment variables to use. We can modify bootstrap.py to log the environment variables.

Structured Logging For the operator

I think it would be useful if the operator used structured logging.

For example, it would be nice if the operator output JSON-formatted records with various metadata tags. One tag could be the name of the job a log message pertains to. This would make it easy to filter the log messages by job.

https://github.com/sirupsen/logrus is a Go package for structured logging. The main reason I initially didn't use that and went with https://github.com/golang/glog was because logrus doesn't support outputting the file and line number of an error.

Ideally we'd like the best of both packages; i.e. structured logs with file and line number.

tensorflow 1.4 and estimator support

In TensorFlow 1.4, TF_CONFIG uses "chief" and not "master"; see here.

We should figure out what changes we should make to support this. We should also figure out how to continue supporting older versions of TF.

Estimator also added evaluation replicas which might not finish until after the master/chief finishes. So we will need to take evaluation replicas into account when determining job status.
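
For illustration only, the kind of TF_CONFIG rewrite this implies, mapping the legacy "master" task to the "chief" name expected by TF >= 1.4 (this is not the operator's actual code):

import json
import os

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]

# Rename the legacy "master" entry to the "chief" name used by TF >= 1.4.
if "master" in cluster:
    cluster["chief"] = cluster.pop("master")
if tf_config["task"]["type"] == "master":
    tf_config["task"]["type"] = "chief"

os.environ["TF_CONFIG"] = json.dumps(tf_config)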

Integrate with Prow for Continuous Testing

We would like to integrate with Prow, the K8s test infra, to provide continuous testing.

After some initial discussion with @foxish and @krzyzacy, the current plan is to have two Prow jobs:

  • The first job will check out and build a container containing the CRD and E2E tests.
  • The second job will install the CRD and run the E2E tests.

The current plan is not to rely on helm test but just to write Go tests as necessary.

First step is probably to create a prow job to build and push the Docker image.

Remaining Issues Blocking This Issue

  • #62 Post submit jobs not uploaded correctly
  • #75 Presubmit results not showing up in test grid
  • #76 No results show up for periodic jobs
  • #82 Need to copy stdout/stderr of tests to GCS.
  • #84 Update prow to use tensorflow/k8s
  • #113 postsubmit results show up in test grid

Permanent errors don't cause job failure

If a container crashes with an exit code of 1 this should be considered a permanent error and cause the job to fail.

This doesn't happen because isRetryableTerminationState
(https://github.com/jlewi/mlkube.io/blob/master/pkg/trainer/training.go)
requires that a termination message be set in order for the exit code to be trusted.

This is legacy code that is no longer applicable; it assumes we were using a launcher.sh script, which users aren't.

We should get rid of that check.

func c.findAllTfJobs() in controller.go will never be reached

err := c.createCRD()
if err != nil {
	if k8sutil.IsKubernetesResourceAlreadyExistError(err) {
		// CRD has been initialized before. We need to recover existing cluster.
		watchVersion, err = c.findAllTfJobs()

Only if c.createCRD() returns an error that matches the "AlreadyExists" status will func c.findAllTfJobs() be called. However, see the code in func createCRD() below:

_, err := c.ApiCli.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
if err != nil && !apierrors.IsAlreadyExists(err) {
	return err
}

If it receives the "AlreadyExists" response after creating the CRD, it will not return in the subsequent code. Instead, it will start a polling get call for the CRD. In the end, there will never be an err that matches the "AlreadyExists" status.
See "!apierrors.IsAlreadyExists(err)"; maybe the "!" should be removed?

TensorBoard Integration

How do you see TensorBoard integrating with this solution?
It would be really cool if I could create a template, ask for TensorBoard to be deployed as well and receive either a ClusterIP or Public IP.

For example this could look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  addons:
    - tensorboard:
         ip-type: LoadBalancer
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure

TensorBoard would then run as a sidecar in the master's pod.
Now the main issue here is accessing the log files.
An easy way would be to document a certain convention. For example, we assume that the log files are saved under /var/tensorflow/logs and then mount this directory into the TensorBoard container through the node.

This also begs the question of data persistence: in this state, once the job shuts down, all data is lost. Do you think we need to address this question right away, or could this be discussed later on?

Happy to work on this if you approve.
