Parcae

This is the artifact repository for our NSDI '24 paper "Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances". [NSDI '24], [arXiv]

Parcae is a system that enables cheap, fast, and scalable DNN training on preemptible instances by proactively adjusting the parallelization strategy of a DNN training job to adapt to predicted resource changes before instance preemptions and allocations actually happen, which significantly reduces the cost of handling these events. Parcae optimizes liveput, a novel metric that measures the expected training throughput of a DNN job under various possible preemption scenarios. Compared to existing reactive, throughput-optimized systems, Parcae’s proactive, liveput-optimized solution considers both the throughput of a job and its robustness under preemptions. To optimize liveput, Parcae supports lightweight instance migration and uses an availability predictor to forecast future preemptions. It then uses a liveput optimizer to discover an optimal strategy to parallelize DNN training under predicted preemptions. We evaluate Parcae on a variety of DNNs and preemption traces and show that Parcae outperforms existing spot-instance DNN training systems by up to 10×. More importantly, Parcae achieves near-optimal performance for training large DNNs under frequent preemptions, in which case existing approaches cannot make any progress.
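
To make the idea of liveput concrete, here is a minimal sketch that treats it as a probability-weighted average of throughput over preemption scenarios. It is a simplified illustration, not Parcae's optimizer: the names `Scenario`, `expected_throughput`, and `pick_strategy`, the strategy labels, and all numbers are hypothetical and do not come from the repository.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical representation: (probability, number of instances preempted).
Scenario = Tuple[float, int]


def expected_throughput(
    throughput_after_loss: Dict[int, float],
    scenarios: Sequence[Scenario],
) -> float:
    """Probability-weighted throughput of one strategy over all scenarios.

    throughput_after_loss maps "number of preempted instances" to the
    throughput (e.g. samples/s) the strategy still achieves in that case.
    """
    total_prob = sum(prob for prob, _ in scenarios)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(prob * throughput_after_loss[lost] for prob, lost in scenarios)


def pick_strategy(
    strategies: Dict[str, Dict[int, float]],
    scenarios: Sequence[Scenario],
) -> str:
    """Return the name of the strategy with the highest expected throughput."""
    return max(
        strategies,
        key=lambda name: expected_throughput(strategies[name], scenarios),
    )


if __name__ == "__main__":
    # Assumed forecast: 50% chance of no preemption, 50% chance of losing one instance.
    scenarios = [(0.5, 0), (0.5, 1)]
    strategies = {
        # Fastest when nothing is preempted, but stalls after a single loss.
        "throughput_optimized": {0: 100.0, 1: 0.0},
        # Slower at full capacity, but keeps making progress after a loss.
        "robust": {0: 80.0, 1: 60.0},
    }
    print(pick_strategy(strategies, scenarios))
```

Under these assumed numbers the robust strategy wins (70.0 versus 50.0 expected throughput), which mirrors the trade-off between raw throughput and robustness under preemptions described above.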

Requirements

Parcae has been tested with PyTorch 1.11 and CUDA 11.3.
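
As a convenience, the following sketch compares version strings against the tested PyTorch 1.11 and CUDA 11.3 combination. The helper names `parse_major_minor` and `matches_tested_setup` are hypothetical and not part of Parcae; the check only reports whether an environment matches the tested one and does not imply that other versions fail.

```python
from typing import Optional, Tuple

TESTED_TORCH = (1, 11)  # PyTorch 1.11
TESTED_CUDA = (11, 3)   # CUDA 11.3


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.11.0+cu113' or '11.3'."""
    core = version.split("+")[0]  # drop a local build suffix like '+cu113'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def matches_tested_setup(torch_version: str, cuda_version: Optional[str]) -> bool:
    """Return True if both versions match the tested combination.

    cuda_version may be None, which is what CPU-only PyTorch builds report.
    """
    if cuda_version is None:
        return False
    return (
        parse_major_minor(torch_version) == TESTED_TORCH
        and parse_major_minor(cuda_version) == TESTED_CUDA
    )
```

With PyTorch installed, the check can be called as `matches_tested_setup(torch.__version__, torch.version.cuda)`.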

Installation

The installation of Parcae is the same as installing DeepSpeed. You can also refer to the DeepSpeed documentation for detailed instructions.

pip install -r requirements.txt
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 pip install .

Getting Started

Parcae is evaluated by replaying preemption traces on on-demand instances. Try Parcae with the following steps:

Acknowledgement

Parcae is built on DeepSpeed. We also learned a lot from Bamboo (thanks John and Pengzhan) and TorchElastic.

Citation

@inproceedings{nsdi24parcae,
  author = {Jiangfei Duan and Ziang Song and Xupeng Miao and Xiaoli Xi and Dahua Lin and Harry Xu and Minjia Zhang and Zhihao Jia},
  title = {Parcae: Proactive, {Liveput-Optimized} {DNN} Training on Preemptible Instances},
  booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
  year = {2024},
  address = {Santa Clara, CA},
  url = {https://www.usenix.org/conference/nsdi24/presentation/duan},
  publisher = {USENIX Association},
  month = apr
}
