A curated list of research in machine learning systems, with links to code where available. I also summarize papers I find particularly interesting.
I categorized them myself. Pull requests are kindly welcome!
- Model Deployment
- Inference Optimization
- Distributed Training
- Resource Management
- Deep Reinforcement Learning System
- Edge AI
- Video System
- Advanced Theory
- Traditional System Optimization
- Computer Architecture: A Quantitative Approach [Book] [Must read]
- Streaming Systems [Book]
- Kubernetes in Action (currently reading) [Book]
- A New Golden Age for Computer Architecture: History, Challenges, and Opportunities. David Patterson [YouTube]
- How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
- SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
- SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]
- CS294: AI For Systems and Systems For AI. [UC Berkeley] (strongly recommended)
- CSE 599W: Systems for ML. [Tianqi Chen] [University of Washington]
- CSE 291F: Advanced Data Analytics and ML Systems. [UCSD]
- CSci 8980: Machine Learning in Computer Systems [University of Minnesota, Twin Cities]
- Hidden technical debt in machine learning systems [Paper]
- Sculley, David, et al. (NIPS 2015)
- Summary: Argues that ML systems accrue hidden maintenance costs beyond ordinary code, including entanglement, hidden feedback loops, glue code, and configuration debt.
- End-to-end arguments in system design [Paper]
- Saltzer, Jerome H., David P. Reed, and David D. Clark. (ACM TOCS 1984)
- System Design for Large Scale Machine Learning [Thesis]
- Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
- Park, Jongsoo, Maxim Naumov, Protonu Basu, et al. (arXiv 2018)
- Summary: This paper presents a characterization of Facebook's DL inference workloads and then derives new design principles for DL hardware.
- Intel® VTune™ Amplifier [Website]
- Stop guessing why software is slow. Advanced sampling and profiling techniques quickly analyze your code, isolate issues, and deliver insights for optimizing performance on modern processors.
- NVIDIA DALI [GitHub]
- A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications
- gpushare-scheduler-extender [GitHub]
- A Kubernetes scheduler extension that lets multiple tasks share the same NVIDIA GPU device to increase GPU utilization
- TensorRT [NVIDIA]
- It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result.
- Weld: a runtime for improving the performance of data-intensive applications. [Project Website]
- MindsDB: MindsDB's goal is to make it very simple for developers to use the power of artificial neural networks in their projects [GitHub]
- PAI: OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. [Microsoft Project]
- Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems [Facebook Project]
- Osquery: a SQL-powered operating system instrumentation, monitoring, and analytics framework. [Facebook Project]
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Project Website]
- Horovod: Distributed training framework for TensorFlow, Keras, and PyTorch. [GitHub]
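For flavor, a minimal Horovod + PyTorch setup (assumes `horovod` and `torch` are installed; the model and learning rate are placeholders):

```python
# Launch with e.g. `horovodrun -np 4 python train.py` (one process per GPU).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())      # pin this process to one GPU

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via ring-allreduce,
# and start every worker from identical weights.
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```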
- Seldon: Seldon Core is an open-source platform for deploying machine learning models on a Kubernetes cluster. [GitHub]
- Kubeflow: Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. [GitHub]
- Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
- Crankshaw, Daniel, et al. (NSDI 2017)
- Summary: Adaptive batching under latency SLOs (the maximum batch size is tuned with an AIMD scheme), plus caching and model selection in a modular serving layer.
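A minimal sketch of the deadline-driven batching idea (my simplification, not Clipper's code; `predict_batch`, `max_batch`, and `slo_ms` are assumed):

```python
import time
from queue import Queue, Empty

def batching_loop(requests: Queue, predict_batch, max_batch=32, slo_ms=100.0):
    """Flush a batch when it is full or the oldest request nears its deadline."""
    batch, deadline = [], None
    while True:
        timeout = None if deadline is None else max(deadline - time.time(), 0.0)
        try:
            batch.append(requests.get(timeout=timeout))
            if deadline is None:                             # first request in batch
                deadline = time.time() + slo_ms / 1000.0 / 2 # leave compute headroom
        except Empty:
            pass
        if batch and (len(batch) >= max_batch or time.time() >= deadline):
            predict_batch(batch)   # one framework call amortizes per-query overhead
            batch, deadline = [], None
```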
- InferLine: ML Inference Pipeline Composition Framework [Paper]
- Crankshaw, Daniel, et al. (Preprint)
- Summary: an updated version of Clipper that composes and provisions whole ML inference pipelines.
- TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
- Dakkak, Abdul, et al. (Preprint)
- Summary: addresses the model cold-start problem by transparently sharing models across FaaS functions.
- Dynamic Space-Time Scheduling for GPU Inference [Paper]
- Jain, Paras, et al. (NeurIPS Systems for ML Workshop 2018)
- Summary:
- Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
- Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (Ongoing)
- Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
- D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
- Summary: Their system, HiveMind, takes as input models grouped into batches that are amenable to co-optimization and co-execution; it consists of a compiler and a runtime.
- Beyond data and model parallelism for deep neural networks [Paper]
- Jia, Zhihao, Matei Zaharia, and Alex Aiken. (SysML 2019)
- Summary: SOAP (sample, operator, attribute, and parameter) parallelism. The operator graph and device topology feed an execution optimizer, which runs an MCMC search algorithm on top of an execution simulator.
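A toy version of the Metropolis-style search loop (my sketch; `random_neighbor` and `simulate_runtime` stand in for FlexFlow's strategy mutations and execution simulator):

```python
import math
import random

def mcmc_search(init, simulate_runtime, random_neighbor, steps=10_000, beta=0.05):
    """Minimize simulated runtime over parallelization strategies."""
    current, cur_cost = init, simulate_runtime(init)
    best, best_cost = current, cur_cost
    for _ in range(steps):
        cand = random_neighbor(current)   # e.g. re-split one operator's SOAP dims
        cost = simulate_runtime(cand)     # predicted by the simulator, not measured
        # Always accept improvements; accept regressions with Boltzmann probability.
        if cost <= cur_cost or random.random() < math.exp(-beta * (cost - cur_cost)):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost
```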
- Device placement optimization with reinforcement learning [Paper]
- Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. (ICML 17)
- Summary: uses REINFORCE to learn a device placement policy; operations are grouped before being placed. Needs a lot of GPUs.
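The core update, reduced to a toy (a fake `measure_runtime` is assumed; the actual paper uses an RNN policy over grouped ops, not independent per-op logits):

```python
import numpy as np

def reinforce_placement(num_ops, num_devices, measure_runtime, iters=500, lr=0.1):
    """Learn per-op device logits by policy gradient on measured runtime."""
    logits = np.zeros((num_ops, num_devices))
    baseline = 0.0
    for step in range(iters):
        z = logits - logits.max(axis=1, keepdims=True)      # stable softmax
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        placement = [np.random.choice(num_devices, p=p) for p in probs]
        reward = -measure_runtime(placement)                # faster => higher reward
        baseline = reward if step == 0 else 0.9 * baseline + 0.1 * reward
        advantage = reward - baseline                       # variance reduction
        for op, dev in enumerate(placement):
            grad = -probs[op]                               # d log pi / d logits
            grad[dev] += 1.0
            logits[op] += lr * advantage * grad
    return logits.argmax(axis=1)
```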
- Spotlight: Optimizing device placement for training deep neural networks [Paper]
- Gao, Yuanxiang, Li Chen, and Baochun Li. (ICML 18)
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper][GitHub] [News]
- Huang, Yanping, et al. (arXiv preprint arXiv:1811.06965, 2018)
- Summary: partitions a network's layers across accelerators and pipelines micro-batches through the partitions, using re-materialization to keep activation memory low.
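The schedule in miniature (my illustration): with K stages and M micro-batches, the pipeline finishes in K + M - 1 ticks instead of K × M, so the bubble fraction (K-1)/(M+K-1) shrinks as M grows:

```python
def pipeline_schedule(num_stages: int, num_micro: int):
    """Yield, per clock tick, the (stage, micro_batch) pairs that run concurrently."""
    for t in range(num_stages + num_micro - 1):
        yield [(k, t - k) for k in range(num_stages) if 0 <= t - k < num_micro]

# 3 stages, 4 micro-batches:
# tick 0: [(0, 0)]
# tick 1: [(0, 1), (1, 0)]
# tick 2: [(0, 2), (1, 1), (2, 0)]  <- all stages busy
for tick, work in enumerate(pipeline_schedule(3, 4)):
    print(tick, work)
```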
- Resource management with deep reinforcement learning [Paper] [GitHub]
- Mao, Hongzi, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. (ACM HotNets 2016)
- Summary: Highly cited paper with a nice problem definition. An example solution that translates the problem of packing tasks with multiple resource demands into a learning problem and then uses DRL to solve it.
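A sketch of the paper's reward shaping (the Job fields are simplified; the paper's state is an image-like tensor of cluster and queue occupancy):

```python
from dataclasses import dataclass

@dataclass
class Job:
    duration: int        # ideal completion time T_j in time steps
    cpu: float           # fraction of cluster CPU demanded per step
    mem: float           # fraction of cluster memory demanded per step

def step_reward(jobs_in_system):
    # Paying -1/T_j per step for every job still in the system means the
    # episode return is minus the sum of slowdowns, so maximizing the
    # return minimizes average job slowdown.
    return -sum(1.0 / j.duration for j in jobs_in_system)
```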
- Ray: A Distributed Framework for Emerging AI Applications [GitHub]
- Moritz, Philipp, et al. (OSDI 2018)
- Summary: a distributed system for DRL training, simulation, and inference; it can also be used as a general high-performance Python framework.
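Ray's task API in a few lines (real API; `simulate` is a placeholder workload):

```python
import ray

ray.init()  # local runtime; pass an address to join an existing cluster

@ray.remote
def simulate(seed: int) -> float:
    return seed * 0.5     # stand-in for an expensive rollout

futures = [simulate.remote(s) for s in range(8)]   # scheduled in parallel
print(ray.get(futures))                            # block and gather results
```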
- Live Video Analytics at Scale with Approximation and Delay-Tolerance [Paper]
- Zhang, Haoyu, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. (NSDI 2017)
- Chameleon: scalable adaptation of video analytics [Paper]
- Jiang, Junchen, et al. (SIGCOMM 2018)
- Summary: a configuration controller that balances accuracy against resource cost; checking candidate configurations against a "golden configuration" is a nice design. Because the cost of naive periodic profiling often exceeds any resource savings gained by adapting the configurations, Chameleon exploits temporal and cross-camera correlations to keep profiling cheap.
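The adaptation step, boiled down (my sketch; a `profile_accuracy` helper is assumed, scoring a configuration's output against the golden configuration on a few profiling frames):

```python
def choose_config(configs, profile_accuracy, accuracy_target=0.8):
    """configs: list of (cost, config). Pick the cheapest config that still
    matches the golden configuration closely enough; re-run periodically."""
    viable = [(cost, c) for cost, c in configs
              if profile_accuracy(c) >= accuracy_target]
    if not viable:                                   # nothing meets the target:
        return max(configs, key=lambda t: t[0])[1]   # fall back to the priciest
    return min(viable, key=lambda t: t[0])[1]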
- Noscope: optimizing neural network queries over video at scale [Paper] [GitHub]
- Kang, Daniel, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. (VLDB 2017)
- Summary: answers queries over video by cascading cheap specialized models and difference detectors in front of the reference NN, which is invoked only on frames the cheap models are unsure about.
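A generic inference cascade in the same spirit (the thresholds and model interfaces are illustrative assumptions, not the paper's):

```python
def label_frame(frame, prev_frame, diff_detector, cheap_model, full_model,
                diff_thresh=0.1, low=0.2, high=0.8):
    """Return True/False for the query, or None to reuse the previous label."""
    if diff_detector(frame, prev_frame) < diff_thresh:
        return None                       # frame barely changed: skip entirely
    p = cheap_model(frame)                # tiny specialized model, very fast
    if p >= high:
        return True                       # confident positive
    if p <= low:
        return False                      # confident negative
    return full_model(frame) >= 0.5       # uncertain: pay for the reference NN
```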
- SVE: Distributed video processing at Facebook scale [Paper]
- Huang, Qi, et al. (SOSP 2017)
- Summary:
- Scanner: Efficient Video Analysis at Scale [Paper][GitHub]
- Poms, Alex, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. (SIGGRAPH 2018)
- Summary:
- NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision [Paper]
- Fang, Biyi, Xiao Zeng, and Mi Zhang. (MobiCom 2018)
- Summary: borrows ideas from network pruning; pruned models can then be recovered at runtime to trade off computational resources against accuracy.
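The runtime selection step, sketched (the descendant list and FLOPs-based cost model are assumptions; NestDNN additionally shares weights between descendants):

```python
def pick_descendant(descendants, budget):
    """descendants: list of (flops, accuracy, model) variants of one network,
    produced by iterative pruning. Pick the most accurate one that fits."""
    feasible = [d for d in descendants if d[0] <= budget]
    if not feasible:
        return min(descendants, key=lambda d: d[0])[2]   # degrade gracefully
    return max(feasible, key=lambda d: d[1])[2]
```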
- Lavea: Latency-aware video analytics on edge computing platform [Paper]
- Yi, Shanhe, et al. (ACM/IEEE Symposium on Edge Computing (SEC), 2017)
- Differentiable MPC for End-to-end Planning and Control [Paper] [GitHub]
- Amos, Brandon, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter (NIPS 2018)
- AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers [Paper]
- Gandhi, Anshul, et al. (TOCS 2012)