dsdinter / pytorch-operator

This project forked from kubeflow/pytorch-operator

PyTorch on Kubernetes

License: Apache License 2.0

Dockerfile 0.04% Shell 4.86% Go 95.09%

Kubernetes Custom Resource and Operator for PyTorch jobs

Overview

This repository contains the specification and implementation of the PyTorchJob custom resource definition (CRD). With this custom resource, users can create and manage PyTorch jobs in the same way as built-in Kubernetes resources. See the CRD definition for details.

Prerequisites

Installing PyTorch Operator

Please refer to the installation instructions in the Kubeflow user guide. They install the pytorchjob CRD and the pytorch-operator controller, which manages the lifecycle of PyTorch jobs.

Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the distributed MNIST example config file, and adapt it to your requirements.

cat examples/dist-mnist/pytorch_job_mnist.yaml
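For orientation, a manifest of roughly the following shape would correspond to the job shown in the sample output later in this README. This is only a sketch reconstructed from the fields visible in that output (`kubeflow.org/v1alpha1`, one `MASTER` and three `WORKER` replicas). The file name `my_pytorch_job.yaml` and the job name `my-dist-mnist` are hypothetical, and the actual example file in the repository may contain additional or different fields.

```shell
# Write a hypothetical PyTorchJob manifest to the current directory.
# Assumption: the field layout mirrors the sample `kubectl get -o yaml`
# output in this README; it is not copied from the real example file.
cat > my_pytorch_job.yaml <<'EOF'
apiVersion: kubeflow.org/v1alpha1
kind: PyTorchJob
metadata:
  name: my-dist-mnist   # hypothetical job name
spec:
  replicaSpecs:
  - masterPort: 23456
    replicaType: MASTER
    replicas: 1
    template:
      spec:
        containers:
        - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
          imagePullPolicy: IfNotPresent
          name: pytorch
        restartPolicy: OnFailure
  - masterPort: 23456
    replicaType: WORKER
    replicas: 3
    template:
      spec:
        containers:
        - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
          imagePullPolicy: IfNotPresent
          name: pytorch
        restartPolicy: OnFailure
EOF
```

Changing `replicas` under the `WORKER` entry is presumably how the number of worker pods is scaled. Check the field names against the CRD definition before relying on them.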

Deploy the PyTorchJob resource to start training:

kubectl create -f examples/dist-mnist/pytorch_job_mnist.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. Inspect the logs to follow training progress.

PODNAME=$(kubectl get pods -l pytorch_job_name=dist-mnist-for-e2e-test,task_index=0 -o name)
kubectl logs -f ${PODNAME}
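The commands above follow only the pod with `task_index=0`. To glance at every replica, a small helper along these lines may be useful. It is a sketch that assumes all pods of the job carry the `pytorch_job_name` label used above; the function name `tail_job_logs` and its default of 20 lines are hypothetical.

```shell
# Print the last few log lines from every pod belonging to a PyTorchJob.
# Assumption: pods are labelled pytorch_job_name=<job>, as in this README.
tail_job_logs() {
  local job_name="$1"      # PyTorchJob name, e.g. dist-mnist-for-e2e-test
  local lines="${2:-20}"   # number of trailing log lines per pod
  local pod
  for pod in $(kubectl get pods -l "pytorch_job_name=${job_name}" -o name); do
    echo "== ${pod} =="
    kubectl logs --tail="${lines}" "${pod}"
  done
}

# Usage (requires access to a cluster):
#   tail_job_logs dist-mnist-for-e2e-test 50
```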

Monitoring a PyTorch Job

kubectl get -o yaml pytorchjobs dist-mnist-for-e2e-test

See the status section of the output to monitor the job. Here is sample output for a job that completed successfully.

apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha1
  kind: PyTorchJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-06-22T08:16:14Z
    generation: 1
    name: dist-mnist-for-e2e-test
    namespace: default
    resourceVersion: "3276193"
    selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/pytorchjobs/dist-mnist-for-e2e-test
    uid: 87772d3b-75f4-11e8-bdd9-42010aa00072
  spec:
    RuntimeId: kmma
    pytorchImage: pytorch/pytorch:v0.2
    replicaSpecs:
    - masterPort: 23456
      replicaType: MASTER
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    - masterPort: 23456
      replicaType: WORKER
      replicas: 3
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            imagePullPolicy: IfNotPresent
            name: pytorch
            resources: {}
          restartPolicy: OnFailure
    terminationPolicy:
      master:
        replicaName: MASTER
        replicaRank: 0
  status:
    phase: Done
    reason: ""
    replicaStatuses:
    - ReplicasStates:
        Succeeded: 1
      replica_type: MASTER
      state: Succeeded
    - ReplicasStates:
        Running: 1
        Succeeded: 2
      replica_type: WORKER
      state: Running
    state: Succeeded
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
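Rather than re-running the command by hand, the `state` field in the output above can be polled from a script. The following is a minimal sketch, assuming the job exposes `status.state` as in the sample output and that a failed job reports `Failed` (an assumption, since only `Succeeded` and `Running` appear in this README). The function name `wait_for_pytorchjob` and its default timeout and interval are hypothetical.

```shell
# Poll a PyTorchJob until its status.state reaches a terminal value.
# Assumptions: status.state exists as in the sample output above, and
# a failed job reports the value "Failed".
wait_for_pytorchjob() {
  local job_name="$1"          # PyTorchJob name
  local timeout_s="${2:-900}"  # give up after this many seconds
  local interval_s="${3:-10}"  # seconds between polls (must be > 0)
  local waited=0
  local state
  while [ "${waited}" -lt "${timeout_s}" ]; do
    state="$(kubectl get pytorchjobs "${job_name}" -o jsonpath='{.status.state}')"
    case "${state}" in
      Succeeded) echo "job ${job_name} succeeded"; return 0 ;;
      Failed)    echo "job ${job_name} failed" >&2; return 1 ;;
    esac
    sleep "${interval_s}"
    waited=$((waited + interval_s))
  done
  echo "timed out waiting for ${job_name}" >&2
  return 2
}

# Usage (requires access to a cluster):
#   wait_for_pytorchjob dist-mnist-for-e2e-test 900 10
```

The distinct return codes (0, 1, 2) let a calling script tell success, failure, and timeout apart.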
