Awesome System for Machine Learning

A curated list of research in machine learning systems. Links to code are included where available. I also summarize some papers that I find particularly interesting.

The categorization is my own. Pull requests are welcome!

Table of Contents

  • Resources
    • Book
    • Video
    • Course
    • Survey
    • Useful Tools
    • Project
  • Papers
    • Model Deployment Papers
    • Machine Learning System Papers (Inference)
    • Machine Learning System Papers (Training)
    • Resource Management
    • Deep Reinforcement Learning System
    • Video System Papers
    • Edge or Mobile Papers
    • Advanced Theory
    • Traditional System Optimization Papers

Book

  • Computer Architecture: A Quantitative Approach [Must read]
  • Streaming Systems [Book]
  • Kubernetes in Action (currently reading) [Book]

Video

  • A New Golden Age for Computer Architecture: History, Challenges, and Opportunities. David Patterson [YouTube]
  • How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
  • SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
  • SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]

Course

Survey

  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, D., et al. (NIPS 2015)
    • Summary:
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper presents a characterization of DL models in production and then derives new design principles for DL hardware.

Useful Tools

  • Intel® VTune™ Amplifier [Website]
    • Stop guessing why software is slow. Advanced sampling and profiling techniques quickly analyze your code, isolate issues, and deliver insights for optimizing performance on modern processors
  • NVIDIA DALI [GitHub]
    • A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications (a pipeline sketch follows this list)
  • gpushare-scheduler-extender [GitHub]
    • A Kubernetes scheduler extension that lets multiple tasks share the same NVIDIA GPU device to increase GPU utilization
  • TensorRT [NVIDIA]
    • It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result
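
Many of these tools are meant to be embedded directly in a training or serving loop. As a concrete example, below is a minimal sketch of a DALI image pipeline using the library's `fn` API; the dataset path, image size, and normalization constants are illustrative placeholders, and a reasonably recent DALI release is assumed.

```python
# Minimal DALI pipeline sketch: read JPEGs, decode on GPU, resize, normalize.
# Assumes a recent nvidia-dali release with the `fn` API; paths and
# preprocessing parameters below are placeholders, not recommendations.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def
def image_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")  # CPU parse + GPU decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = image_pipeline("/path/to/images", batch_size=64, num_threads=4, device_id=0)
pipe.build()
images, labels = pipe.run()  # one preprocessed batch, ready for the framework
```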

Project

  • Weld: A runtime for improving the performance of data-intensive applications. [Project Website]
  • MindsDB: Aims to make it very simple for developers to use the power of artificial neural networks in their projects. [GitHub]
  • PAI: OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. [Microsoft Project]
  • Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems [Facebook Project]
  • Osquery is a SQL powered operating system instrumentation, monitoring, and analytics framework. [Facebook Project]
  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Project Website]
  • Horovod: Distributed training framework for TensorFlow, Keras, and PyTorch (a usage sketch follows this list). [GitHub]
  • Seldon: Seldon Core is an open source platform for deploying machine learning models on a Kubernetes cluster. [GitHub]
  • Kubeflow: Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. [GitHub]
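
Several of the projects above implement the same data-parallel pattern: run one process per GPU, average gradients across processes, and keep weights synchronized. Below is a minimal Horovod-with-PyTorch sketch of that pattern; the model and the random batches are placeholders, and the learning-rate scaling by world size follows the commonly cited Horovod recipe.

```python
# Minimal Horovod data-parallel training sketch (PyTorch backend).
# Launch with e.g. `horovodrun -np 4 python train.py`.
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(784, 10).cuda()     # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers (ring-allreduce).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 784).cuda()         # placeholder batch
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```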

Model Deployment Papers

  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: adaptive batching (a minimal sketch follows this list)
  • InferLine: ML Inference Pipeline Composition Framework [Paper]
    • Crankshaw, Daniel, et al. (Preprint)
    • Summary: an updated version of Clipper
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al. (Preprint)
    • Summary: addresses the model cold-start problem
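
Clipper's adaptive batching (referenced in its entry above) can be sketched independently of any serving framework: hold the first request, keep accepting more until either a maximum batch size or a latency deadline is reached, then run one batched forward pass. The sketch below is a hypothetical simplification; Clipper itself additionally adapts the batch-size cap online with an AIMD policy.

```python
# Hypothetical sketch of deadline-bounded dynamic batching, in the spirit of
# Clipper. Each request is a dict holding an input and a
# concurrent.futures.Future that the caller waits on.
import queue
import time

def batching_loop(requests, predict_batch, max_batch=32, max_wait_s=0.005):
    """Group incoming requests into batches bounded by size and latency."""
    while True:
        batch = [requests.get()]                 # block for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = predict_batch([r["input"] for r in batch])
        for req, out in zip(batch, outputs):
            req["future"].set_result(out)        # hand result back to caller
```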

Machine Learning System Papers (Inference)

  • Dynamic Space-Time Scheduling for GPU Inference [Paper]
    • Jain, Paras, et al. (NIPS 18, System for ML)
    • Summary:
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (Ongoing work)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: Their system, HiveMind, takes as input models grouped into model batches that are amenable to co-optimization and co-execution; it consists of a compiler and a runtime.

Machine Learning System Papers (Training)

  • Beyond data and model parallelism for deep neural networks [Paper]
    • Jia, Zhihao, Matei Zaharia, and Alex Aiken. (SysML 2019)
    • Summary: SOAP (sample, operator, attribute, and parameter) parallelism. Operator graph, device topology, and execution optimizer. MCMC search algorithm and execution simulator.
  • Device placement optimization with reinforcement learning [Paper]
    • Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. (ICML 17)
    • Summary: Uses REINFORCE to learn a device placement policy. Groups operations before placement to keep the action space tractable. Requires many GPUs. (A toy sketch follows this list.)
  • Spotlight: Optimizing device placement for training deep neural networks [Paper]
    • Gao, Yuanxiang, Li Chen, and Baochun Li (ICML 18)
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper][GitHub] [News]
    • Huang, Yanping, et al. (arXiv preprint arXiv:1811.06965 (2018))
    • Summary:
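
The device placement paper above (Mirhoseini et al.) trains its placement policy with REINFORCE. The toy loop below shows that core idea with an independent softmax policy per operation and a made-up cost model; the paper's actual policy is a sequence-to-sequence RNN and its rewards are measured runtimes, so treat everything here as an illustrative stand-in.

```python
# Toy REINFORCE loop for device placement: reward = -simulated runtime.
import numpy as np

rng = np.random.default_rng(0)
NUM_OPS, NUM_DEVICES = 8, 2
logits = np.zeros((NUM_OPS, NUM_DEVICES))   # one softmax policy per op

def simulated_runtime(placement):
    # Stand-in cost model: runtime is the most-loaded device's total op cost.
    costs = np.linspace(1.0, 2.0, NUM_OPS)
    return max(costs[placement == d].sum() for d in range(NUM_DEVICES))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

baseline, lr = 0.0, 0.1
for step in range(500):
    probs = softmax(logits)
    placement = np.array([rng.choice(NUM_DEVICES, p=p) for p in probs])
    reward = -simulated_runtime(placement)
    baseline = 0.9 * baseline + 0.1 * reward        # moving-average baseline
    # d/dlogits of log pi(placement) = one_hot(placement) - probs.
    grad = -probs
    grad[np.arange(NUM_OPS), placement] += 1.0
    logits += lr * (reward - baseline) * grad
```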

Resource Management

  • Resource management with deep reinforcement learning [Paper] [GitHub]
    • Mao, Hongzi, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula (ACM HotNets 2016)
    • Summary: Highly cited paper with a nice problem definition. An example solution that translates the problem of packing tasks with multiple resource demands into a learning problem and then uses DRL to solve it. (A toy environment sketch follows.)
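
To make that learning formulation concrete, here is a hypothetical gym-style environment for the multi-resource packing task the paper studies; the observation layout, job demands, and reward are simplified stand-ins for the paper's image-like cluster state and slowdown-based reward.

```python
# Hypothetical multi-resource packing environment in the spirit of DeepRM.
import numpy as np

class PackingEnv:
    """Place pending jobs onto a machine with two resources (e.g. CPU, RAM)."""

    def __init__(self, capacity=(10.0, 10.0), num_pending=5, seed=0):
        rng = np.random.default_rng(seed)
        self.capacity = np.asarray(capacity)
        self.jobs = rng.uniform(1, 5, size=(num_pending, 2))  # per-job demands
        self.free = self.capacity.copy()
        self.placed = np.zeros(num_pending, dtype=bool)

    def state(self):
        # Observation: normalized free capacity plus each pending job's demand.
        pending = self.jobs * ~self.placed[:, None]
        return np.concatenate([self.free / self.capacity, pending.ravel()])

    def step(self, action):
        # action = index of a pending job to place next.
        if self.placed[action] or np.any(self.jobs[action] > self.free):
            return self.state(), -1.0, True      # illegal move ends the episode
        self.free -= self.jobs[action]
        self.placed[action] = True
        return self.state(), 1.0, bool(self.placed.all())
```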

Deep Reinforcement Learning System

  • Ray: A Distributed Framework for Emerging AI Applications [GitHub]
    • Moritz, Philipp, et al. (OSDI 2018)
    • Summary: A distributed system for DRL training, simulation, and inference; can also be used as a general high-performance Python framework (a minimal API sketch follows).
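
Ray's task and actor API is compact enough to show inline. Below is a minimal sketch of the pattern it enables for DRL workloads, with a placeholder rollout function standing in for a real environment simulation.

```python
# Minimal Ray sketch: stateless remote tasks plus one stateful actor.
import ray

ray.init()

@ray.remote
def rollout(seed):
    # Placeholder for an environment simulation that returns a reward.
    return float(seed % 3)

@ray.remote
class ParameterServer:
    def __init__(self):
        self.weights = 0.0

    def update(self, delta):
        self.weights += delta
        return self.weights

ps = ParameterServer.remote()
rewards = ray.get([rollout.remote(s) for s in range(8)])  # parallel rollouts
print(ray.get(ps.update.remote(sum(rewards))))
```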

Video System Papers

  • Live Video Analytics at Scale with Approximation and Delay-Tolerance [Paper]
    • Zhang, Haoyu, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. (NSDI 2017)
  • Chameleon: scalable adaptation of video analytics [Paper]
    • Jiang, Junchen, et al. (SIGCOMM 2018)
    • Summary: A configuration controller for balancing accuracy against resource cost. The golden-configuration idea is a good design: the cost of naive periodic profiling often exceeded any resource savings gained by adapting the configurations. (A sketch of the adaptation loop follows this list.)
  • Noscope: optimizing neural network queries over video at scale [Paper] [GitHub]
    • Kang, Daniel, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. (VLDB 2017)
    • Summary:
  • SVE: Distributed video processing at Facebook scale [Paper]
    • Huang, Qi, et al. (SOSP 2017)
    • Summary:
  • Scanner: Efficient Video Analysis at Scale [Paper][GitHub]
    • Poms, Alex, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian (SIGGRAPH 2018)
    • Summary:
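
The adaptation loop in Chameleon (entry above) boils down to: periodically profile a small set of candidate configurations against the golden (most expensive) configuration, then run the cheapest one whose accuracy stays above a floor until the next profiling window. The sketch below is a hypothetical simplification; `profile` and `run` are placeholder callbacks, and the real system also exploits temporal and cross-camera correlations to amortize profiling cost.

```python
# Hypothetical sketch of Chameleon-style periodic configuration adaptation.
# configs[0] is assumed to be the golden (most expensive, most accurate) config.

def pick_config(configs, profile, accuracy_floor=0.9):
    """profile(cfg) -> (accuracy_vs_golden, cost); cheapest acceptable config."""
    scored = [(profile(cfg), cfg) for cfg in configs]
    acceptable = [(cost, cfg) for (acc, cost), cfg in scored
                  if acc >= accuracy_floor]
    if not acceptable:
        return configs[0]                     # fall back to the golden config
    return min(acceptable, key=lambda pair: pair[0])[1]

def analytics_loop(frames, configs, profile, run, window=1000):
    current = configs[0]
    for i, frame in enumerate(frames):
        if i % window == 0:                   # re-profile once per window
            current = pick_config(configs, profile)
        run(current, frame)
```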

Edge or Mobile Papers

  • NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision [Paper]
    • Fang, Biyi, Xiao Zeng, and Mi Zhang. (MobiCom 2018)
    • Summary: Borrows ideas from network pruning; the pruned model can then be recovered at runtime to trade off compute resources against accuracy (a selection sketch follows this list).
  • Lavea: Latency-aware video analytics on edge computing platform [Paper]
    • Yi, Shanhe, et al. (ACM/IEEE Symposium on Edge Computing, SEC 2017)
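
The runtime half of NestDNN amounts to picking, from a nested family of pruned-and-recovered model variants, the most accurate one whose footprint fits the resources currently available. Below is a hypothetical sketch of that selection step, with made-up numbers.

```python
# Hypothetical sketch of NestDNN-style runtime variant selection.
# Each variant is a descendant of one multi-capacity model; numbers are made up.
VARIANTS = [
    {"name": "20%-capacity", "memory_mb": 25, "latency_ms": 12, "accuracy": 0.71},
    {"name": "50%-capacity", "memory_mb": 60, "latency_ms": 28, "accuracy": 0.78},
    {"name": "full-capacity", "memory_mb": 120, "latency_ms": 55, "accuracy": 0.83},
]

def select_variant(free_memory_mb, latency_budget_ms):
    """Most accurate variant that fits both the memory and latency budgets."""
    feasible = [v for v in VARIANTS
                if v["memory_mb"] <= free_memory_mb
                and v["latency_ms"] <= latency_budget_ms]
    return max(feasible, key=lambda v: v["accuracy"]) if feasible else None

print(select_variant(free_memory_mb=70, latency_budget_ms=30))  # 50%-capacity
```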

Advanced Theory

  • Differentiable MPC for End-to-end Planning and Control [Paper] [GitHub]
    • Amos, Brandon, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter (NIPS 2018)

Traditional System Optimization Papers

  • AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers [Paper]
    • Gandhi, Anshul, et al. (TOCS 2012)
