isabella232 / tensorflowonyarn

This project forked from manuzhang/tensorflowonyarn

Support TensorFlow on YARN

License: Apache License 2.0

TensorFlowOnYARN

TensorFlow on YARN (TOY) is a toolkit that gives Hadoop users an easy way to run TensorFlow applications in distributed mode and to handle tasks such as model management and inference serving.

  • This project focuses on supporting TensorFlow on YARN, as part of the Deep Learning on Hadoop (HDL) effort.
  • YARN-6043

Goals

  • Support all TensorFlow components on YARN: TensorFlow distributed clusters, TensorFlow Serving, TensorBoard, etc.
  • Support multi-tenancy, taking into account different types of users such as DevOps engineers, data scientists and data engineers
  • Support running TensorFlow applications as short-lived or long-running jobs, in both between-graph and in-graph mode
  • Support model management for deployment, plus a service layer that makes it easy to handle inference requests from upper layers such as Spark or a web backend
  • Require minor or no changes to run a user's existing TensorFlow application (which can be written in any officially supported language, including Python, C++, Java and Go)

Note that the current project is a prototype with limitations and is still under development.

Architecture

Figure 1. TOY Architecture

Features

  • Launch a TensorFlow cluster with a specified number of workers and parameter servers (PS)
  • Replace the Python layer with a Java bridge layer to start the server
  • Generate the ClusterSpec dynamically
  • RPC support for the client to get the ClusterSpec from the AM
  • Signal handling for graceful shutdown
  • Package TensorFlow runtime as a resource that can be distributed easily
  • Run in-graph TensorFlow application in client mode
  • TensorBoard support
  • Better handling of network port conflicts
  • Fault tolerance
  • Cluster mode based on Docker
  • Real-time logging support
  • Code refinement and more tests
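
Two of the features above, dynamic ClusterSpec generation and port-conflict handling, can be illustrated with a small sketch. This is not the project's Java implementation. It assumes each task picks a free port on its own host and reports a hypothetical `(job, task_index, host, port)` tuple to a coordinator, which then assembles the spec. The function names are made up for illustration.

```python
import json
import socket
from collections import defaultdict


def pick_free_port():
    """Task side: let the OS choose an unused TCP port by binding to port 0.

    The port is released when the socket closes, so another process could
    still grab it before the real server binds; this only narrows the window.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


def assemble_cluster_spec(registrations):
    """Coordinator side: group (job, task_index, host, port) reports by job."""
    by_job = defaultdict(list)
    for job, task_index, host, port in registrations:
        by_job[job].append((task_index, "%s:%d" % (host, port)))
    # Order addresses by task index so list position matches the task index.
    return {job: [addr for _, addr in sorted(tasks)]
            for job, tasks in by_job.items()}


if __name__ == "__main__":
    # Hypothetical reports, arriving in arbitrary order.
    reports = [
        ("worker", 1, "node2", 22255),
        ("ps", 0, "node1", 22257),
        ("worker", 0, "node3", 22253),
        ("ps", 1, "node2", 22222),
    ]
    print("ClusterSpec:", json.dumps(assemble_cluster_spec(reports)))
    print("free port on this host:", pick_free_port())
```

Sorting by task index keeps each address at the list position that a training script would later select with its task index, regardless of the order in which tasks register.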

Quick Start

  1. Prepare the build environment following the instructions from https://www.tensorflow.org/install/install_sources

  2. Clone the TensorFlowOnYARN repository.

    git clone --recursive https://github.com/Intel-bigdata/TensorFlowOnYARN
  3. Build the assembly.

    cd TensorFlowOnYARN/tensorflow-parent
    mvn package -Pnative -Pdist

    tensorflow-yarn-${VERSION}.tar.gz and tensorflow-yarn-${VERSION}.zip are built in the tensorflow-parent/tensorflow-yarn-dist/target directory. Distribute the assembly to the client node of a YARN cluster and extract it.

  4. Run the between-graph MNIST example.

    cd tensorflow-yarn-${VERSION}
    bin/ydl-tf launch --num_worker 2 --num_ps 2

    This will launch a YARN application, which creates a tf.train.Server instance for each task. The ClusterSpec is printed on the console so that you can pass its host lists to the training script, running the script once per worker task, e.g.

    ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}
    python examples/between-graph/mnist_feed.py \
      --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
      --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
      --task_index=0
    
    python examples/between-graph/mnist_feed.py \
      --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
      --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
      --task_index=1
  5. Get the ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application.

    bin/ydl-tf cluster --app_id <Application ID>
  6. You may also use YARN commands through ydl-tf.

    For example, to list the running applications,

    bin/ydl-tf application --list

    or to kill an existing YARN application (TensorFlow cluster),

    bin/ydl-tf kill --application <Application ID>
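
In step 4, the host lists for `--ps_hosts` and `--worker_hosts` have to be copied by hand from the printed ClusterSpec. As a convenience, a small helper along the following lines could derive the per-worker commands from that console line. This is a sketch, not part of the toolkit: `cluster_spec_to_flags` and `worker_commands` are hypothetical names, and it assumes the console line has exactly the `ClusterSpec: {...}` JSON form shown in step 4.

```python
import json
import shlex


def cluster_spec_to_flags(console_line):
    """Turn a 'ClusterSpec: {...}' console line into host-list flag values."""
    # Drop the "ClusterSpec:" label if present, then parse the JSON payload.
    payload = console_line.split("ClusterSpec:", 1)[-1].strip()
    spec = json.loads(payload)
    return {
        "ps_hosts": ",".join(spec["ps"]),
        "worker_hosts": ",".join(spec["worker"]),
        "num_workers": len(spec["worker"]),
    }


def worker_commands(console_line,
                    script="examples/between-graph/mnist_feed.py"):
    """Build one command line per worker task index."""
    flags = cluster_spec_to_flags(console_line)
    commands = []
    for task_index in range(flags["num_workers"]):
        argv = [
            "python", script,
            "--ps_hosts=" + flags["ps_hosts"],
            "--worker_hosts=" + flags["worker_hosts"],
            "--task_index=%d" % task_index,
        ]
        # Quote each argument so the line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    line = ('ClusterSpec: {"ps":["node1:22257","node2:22222"],'
            '"worker":["node3:22253","node2:22255"]}')
    for command in worker_commands(line):
        print(command)
```

The helper only prints the commands; each one still has to be run on a machine that can reach the listed hosts.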

Contributors

manuzhang
