Prophet is a predictable communication scheduling strategy to schedule the gradient transfer in an adequate order, with the aim of maximizing the GPU and network resource utilization.

Prophet

This repository is forked from https://github.com/bytedance/byteps, with modifications to the communication scheduling mechanism.

Optimizing the performance of Distributed Deep Neural Network (DDNN) training has become increasingly important as DNN models grow more complex and training datasets grow larger. Existing work on communication scheduling mostly focuses on overlapping computation with communication to improve DDNN training performance, yet the GPU and network resources in DDNN training clusters remain under-utilized.

To tackle this issue, we design and implement a predictable communication scheduling strategy named Prophet to schedule the gradient transfer in an adequate order, with the aim of maximizing the GPU and network resource utilization. Leveraging our observed stepwise pattern of gradient transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals between gradients to predict the appropriate number of gradients that can be grouped into blocks. These gradient blocks are then transferred one by one to guarantee high utilization of GPU and network resources while preserving the priority of gradient transfer (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet lets the forward propagation start as early as possible, greedily reducing the waiting (idle) time of GPU resources during the DDNN training process.
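The grouping idea can be illustrated with a minimal sketch. This is not Prophet's actual algorithm or code: the function name `group_into_blocks`, its parameters, and the greedy packing rule are all assumptions made for illustration. It assumes gradients arrive in a fixed order, that a single time budget (standing in for the profiled interval) applies to every block, and that transfer time is simply size divided by bandwidth.

```python
from typing import List


def group_into_blocks(sizes_bytes: List[float],
                      budget_s: float,
                      bandwidth_bytes_per_s: float) -> List[List[int]]:
    """Greedily pack gradients, in order, into blocks.

    Hypothetical helper (not part of Prophet or BytePS): a block is closed
    as soon as adding the next gradient would push its estimated transfer
    time (bytes / bandwidth) past the time budget.
    """
    # Largest number of bytes that can be sent within one time budget.
    capacity = budget_s * bandwidth_bytes_per_s

    blocks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0.0

    for idx, size in enumerate(sizes_bytes):
        # Close the current block if this gradient would not fit.
        if current and current_bytes + size > capacity:
            blocks.append(current)
            current, current_bytes = [], 0.0
        # A gradient larger than the capacity still gets its own block.
        current.append(idx)
        current_bytes += size

    if current:
        blocks.append(current)
    return blocks


# Example: five gradients, 10 bytes/s of bandwidth, 1 s budget per block.
example_blocks = group_into_blocks([4, 4, 4, 10, 2], 1.0, 10.0)
print(example_blocks)
```

In this toy setting, a higher measured bandwidth or a longer interval yields larger blocks, which is the qualitative behavior the description above relies on; the real system additionally accounts for gradient priority and the stepwise start-time pattern.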

Publication

Zhenwei Zhang, Qiang Qi, Ruitao Shang, Li Chen, Fei Xu*, "Prophet: Speeding up Distributed DNN Training with Predictable Communication Scheduling," in: Proc. of ICPP 2021, August 9-12, 2021. Article No. 69.

@inproceedings{zhang2021prophet,
  title={Prophet: Speeding up distributed dnn training with predictable communication scheduling},
  author={Zhang, Zhenwei and Qi, Qiang and Shang, Ruitao and Chen, Li and Xu, Fei},
  booktitle={Proceedings of the 50th International Conference on Parallel Processing},
  pages={1--11},
  year={2021}
}
