GithubHelp home page GithubHelp logo

hhy5277 / tensorflowonspark Goto Github PK

View Code? Open in Web Editor NEW

This project forked from yahoo/tensorflowonspark

0.0 2.0 0.0 710 KB

TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters

License: Apache License 2.0

Shell 3.00% Python 97.00%

tensorflowonspark's Introduction

TensorFlowOnSpark

What's TensorFlowOnSpark?

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from deep learning framework TensorFlow and big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

TensorFlowOnSpark enables distributed TensorFlow training and inference on Apache Spark clusters. It seeks to minimize the amount of code changes required to run existing TensorFlow programs on a shared grid. Its Spark-compatible API helps manage the TensorFlow cluster with the following steps:

  1. Reservation - reserves a port for the TensorFlow process on each executor and also starts a listener for data/control messages.
  2. Startup - launches the Tensorflow main function on the executors.
  3. Data ingestion
  4. Readers & QueueRunners - leverages TensorFlow's Reader mechanism to read data files directly from HDFS.
  5. Feeding - sends Spark RDD data into the TensorFlow nodes using the feed_dict mechanism. Note that we leverage the Hadoop Input/Output Format for access to TFRecords on HDFS.
  6. Shutdown - shuts down the Tensorflow workers and PS nodes on the executors.

We have also enhanced TensorFlow to support direct access to remote memory (RDMA) on Infiniband networks.

TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

Why TensorFlowOnSpark?

TensorFlowOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • Easily migrate all existing TensorFlow programs with <10 lines of code change;
  • Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard;
  • Server-to-server direct communication achieves faster learning when available;
  • Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow;
  • Easily integrate with your existing data processing pipelines and machine learning algorithms (ex. MLlib, CaffeOnSpark);
  • Easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.

Using TensorFlowOnSpark

Please check TensorFlowOnSpark wiki site for detailed documentations such as getting started guides for YARN cluster and AWS EC2 cluster. A Conversion Guide has been provided to help you convert your TensorFlow programs.

Mailing List

Please join TensorFlowOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See LICENSE file for terms.

tensorflowonspark's People

Contributors

anfeng avatar junshi15 avatar leewyang avatar viirya avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.