This project forked from huaizhengzhang/awesome-system-for-machine-learning

A curated list of research in machine learning systems. I also summarize some papers if I think they are really interesting.

License: MIT License

awesome-system-for-machine-learning's Introduction

Awesome System for Machine Learning

Path to system for AI [Whitepaper You Must Read]

A curated list of research in machine learning systems. Links to the code are also included where available. I also summarize some papers if I think they are really interesting.

Table of Contents

Resources

Papers

System for AI

AI for System

PR template

- Title [[Paper]](link) [[GitHub]](link)
  - Author (*conference(journal) year*)
  - Summary: 

Book

  • Computer Architecture: A Quantitative Approach [Must read]
  • Streaming Systems [Book]
  • Kubernetes in Action (start to read) [Book]

Video

  • SysML 2019: [YouTube]
  • ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
  • ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
  • A New Golden Age for Computer Architecture: History, Challenges, and Opportunities. David Patterson [YouTube]
  • How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
  • SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
  • SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]

Course

Survey

  • Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools [Paper]
    • Mayer, Ruben, and Hans-Arno Jacobsen
    • Summary:
  • How (and How Not) to Write a Good Systems Paper [Advice]
  • Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
    • Hazelwood, Kim, et al. (HPCA 2018)
  • Infrastructure for Usable Machine Learning: The Stanford DAWN Project
    • Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, David, et al. (NIPS 2015)
    • Summary:
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper presents a characterization of DL models and then derives new design principles for DL hardware.

Useful Tools

  • Netron: Visualizer for deep learning and machine learning models [GitHub]
  • Facebook/FBGEMM: FBGEMM (Facebook GEneral Matrix Multiplication) is a low-precision, high-performance matrix-matrix multiplication and convolution library for server-side inference. [GitHub]
  • XiaoMi/mobile-ai-bench: Benchmarking Neural Network Inference on Mobile Devices [GitHub]
  • Dslabs: Distributed Systems Labs and Framework for UW system course [GitHub]
  • Machine Learning Model Zoo [Website]
  • MLPerf Benchmark Suite/Inference: Reference implementations of inference benchmarks [GitHub]
  • Pytorch-Memory-Utils: track your GPU memory usage during training with PyTorch. [GitHub]
  • Faiss: A library for efficient similarity search and clustering of dense vectors [GitHub]
  • torchstat: a lightweight neural network analyzer based on PyTorch. [GitHub]
  • Microsoft/MMdnn: A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. [GitHub]
  • Popular Network memory consumption and FLOP counts [GitHub]
  • Intel® VTune™ Amplifier [Website]
    • Stop guessing why software is slow. Advanced sampling and profiling techniques quickly analyze your code, isolate issues, and deliver insights for optimizing performance on modern processors
  • NVIDIA DALI [GitHub]
    • A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications
  • gpushare-scheduler-extender [GitHub]
    • Some of these tasks can be run on the same Nvidia GPU device to increase GPU utilization
  • TensorRT [NVIDIA]
    • It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result
  • TensorStream: A library for real-time video stream decoding to CUDA memory [GitHub]

Project

  • Machine Learning for .NET [GitHub]
    • ML.NET is a cross-platform open-source machine learning framework which makes machine learning accessible to .NET developers.
    • ML.NET allows .NET developers to develop their own models and infuse custom machine learning into their applications, using .NET, even without prior expertise in developing or tuning machine learning models.
  • ONNX: Open Neural Network Exchange [GitHub]
  • BentoML: Machine Learning Toolkit for packaging and deploying models [GitHub]
  • ModelDB: A system to manage ML models [GitHub] [MIT short paper]
  • EuclidesDB: A multi-model machine learning feature embedding database [GitHub]
  • Prefect: Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. [GitHub]
  • MindsDB: MindsDB's goal is to make it very simple for developers to use the power of artificial neural networks in their projects [GitHub]
  • PAI: OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. [Microsoft Project]
  • Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems [Facebook Project]
  • Osquery: A SQL-powered operating system instrumentation, monitoring, and analytics framework. [Facebook Project]
  • Horovod: Distributed training framework for TensorFlow, Keras, and PyTorch. [GitHub]
  • Seldon: Seldon Core is an open source platform for deploying machine learning models on a Kubernetes cluster. [GitHub]
  • Kubeflow: Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. [GitHub]

Data Processing

  • Google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more [GitHub]
  • CuPy: NumPy-like API accelerated with CUDA [GitHub]
  • Modin: Speed up your Pandas workflows by changing a single line of code [GitHub]
  • Weld: Weld is a runtime for improving the performance of data-intensive applications. [Project Website]
  • Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines [Project Website]
    • Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. (PLDI 2013)
    • Summary: Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines.

Model Serving

  • {PRETZEL}: Opening the Black Box of Machine Learning Prediction Serving Systems. [Paper]
    • Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
    • Summary:
  • Brusta: PyTorch model serving project [GitHub]
  • Model Server for Apache MXNet: Model Server for Apache MXNet is a tool for serving neural net models for inference [GitHub]
  • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website]
    • Baylor, Denis, et al. (KDD 2017)
    • Summary:
  • TensorFlow-Serving: Flexible, High-Performance ML Serving [Paper] [GitHub]
    • Olston, Christopher, et al.
  • IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: Adaptive batching
  • InferLine: ML Inference Pipeline Composition Framework [Paper]
    • Crankshaw, Daniel, et al. (Preprint)
    • Summary: An updated version of Clipper
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al (Preprint)
    • Summary: model cold start problem
  • Rafiki: machine learning as an analytics service system [Paper] [GitHub]
    • Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
    • Summary: Covers both training and inference. Automatic hyper-parameter search for training. Ensemble models for inference. Uses DRL to balance the trade-off between accuracy and latency.
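
The "adaptive batching" idea noted for Clipper above can be illustrated with a small sketch. It assumes an additive-increase, multiplicative-decrease (AIMD) rule that grows the batch size while observed latency stays within a latency objective and shrinks it otherwise. The class name `AIMDBatchSizer`, its parameters, and the `toy_latency` cost model are all hypothetical and chosen for illustration; this is not Clipper's actual code or API.

```python
from typing import Callable, List


class AIMDBatchSizer:
    """Tune a batch size against a latency objective (illustrative only)."""

    def __init__(self, latency_slo_s: float, initial: int = 1, step: int = 2,
                 backoff: float = 0.9, max_size: int = 1024) -> None:
        self.latency_slo_s = latency_slo_s  # latency objective in seconds
        self.batch_size = initial
        self.step = step                    # additive increase
        self.backoff = backoff              # multiplicative decrease factor
        self.max_size = max_size

    def update(self, observed_latency_s: float) -> int:
        """Adjust the batch size after observing one batch's latency."""
        if observed_latency_s <= self.latency_slo_s:
            self.batch_size = min(self.max_size, self.batch_size + self.step)
        else:
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size


def toy_latency(batch_size: int) -> float:
    """Hypothetical cost model: 2 ms fixed overhead plus 0.5 ms per input."""
    return 0.002 + 0.0005 * batch_size


def tune(sizer: AIMDBatchSizer, latency_fn: Callable[[int], float],
         rounds: int) -> List[int]:
    """Run the sizer for a number of rounds and record each new batch size."""
    history = []
    for _ in range(rounds):
        latency = latency_fn(sizer.batch_size)
        history.append(sizer.update(latency))
    return history


if __name__ == "__main__":
    sizer = AIMDBatchSizer(latency_slo_s=0.020)
    print(tune(sizer, toy_latency, rounds=30))
```

With a real model, `latency_fn` would be replaced by a timed call to the model's batch prediction function. The gentle backoff keeps the batch size close to the largest value that still meets the objective.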

Machine Learning System Papers (Inference)

  • Dynamic Space-Time Scheduling for GPU Inference [Paper]
    • Jain, Paras, et al. (NIPS 18, System for ML)
    • Summary:
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (On going)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: Their system, HiveMind, takes as input models grouped into model batches that are amenable to co-optimization and co-execution. Its components include a compiler and a runtime.

Machine Learning System Papers (Training)

  • Mesh-TensorFlow: Deep Learning for Supercomputers [Paper] [GitHub]
    • Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, et al. (NIPS 2018)
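    • Summary: Generalizes data parallelism by splitting any tensor dimension across a mesh of processors; demonstrated on Transformer language models.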
  • PyTorch-BigGraph: A Large-scale Graph Embedding System [Paper] [GitHub]
    • Lerer, Adam and Wu, Ledell and Shen, Jiajun and Lacroix, Timothee and Wehrstedt, Luca and Bose, Abhijit and Peysakhovich, Alex (SysML 2019)
  • Beyond data and model parallelism for deep neural networks [Paper] [GitHub]
    • Jia, Zhihao, Matei Zaharia, and Alex Aiken. (SysML 2019)
    • Summary: SOAP (sample, operation, attribute and parameter) parallelism. Operator graph, device topology and execution optimizer. MCMC search algorithm and execution simulator.
  • Device placement optimization with reinforcement learning [Paper]
    • Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. (ICML 17)
    • Summary: Uses REINFORCE to learn a device placement policy. Groups operations for execution. Needs a lot of GPUs.
  • Spotlight: Optimizing device placement for training deep neural networks [Paper]
    • Gao, Yuanxiang, Li Chen, and Baochun Li (ICML 18)
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper] [GitHub] [News]
    • Huang, Yanping, et al. (arXiv preprint arXiv:1811.06965 (2018))
    • Summary:
  • Gandiva: Introspective cluster scheduling for deep learning. [Paper]
    • Xiao, Wencong, et al. (OSDI 2018)
    • Summary: Improves the efficiency of hyper-parameter search in a cluster. Aware of hardware utilization.
  • Optimus: an efficient dynamic resource scheduler for deep learning clusters [Paper]
    • Peng, Yanghua, et al. (EuroSys 2018)
    • Summary: Job scheduling on clusters. Total completion time as the metric.
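
Several entries above search over device placements using a cost estimate; one summary mentions an MCMC search paired with an execution simulator. The sketch below shows that general pattern on a toy problem: a Metropolis-style random walk over operator-to-device assignments, scored by a made-up cost function. The function names, the cost model (heaviest device load plus a fixed penalty per cross-device edge), and all parameter values are assumptions for illustration, not any paper's actual simulator or search procedure.

```python
import math
import random
from typing import List, Sequence, Tuple


def simulate_cost(placement: Sequence[int], op_costs: Sequence[float],
                  edges: Sequence[Tuple[int, int]], num_devices: int,
                  comm_cost: float) -> float:
    """Toy simulator: heaviest device load plus a penalty per cross-device edge."""
    load = [0.0] * num_devices
    for op, device in enumerate(placement):
        load[device] += op_costs[op]
    transfers = sum(1 for src, dst in edges if placement[src] != placement[dst])
    return max(load) + comm_cost * transfers


def mcmc_search(op_costs: Sequence[float], edges: Sequence[Tuple[int, int]],
                num_devices: int, comm_cost: float = 0.5, steps: int = 2000,
                temperature: float = 0.5, seed: int = 0) -> Tuple[List[int], float]:
    """Metropolis search over placements; returns the best placement seen."""
    rng = random.Random(seed)
    current = [0] * len(op_costs)  # start with every operator on device 0
    current_cost = simulate_cost(current, op_costs, edges, num_devices, comm_cost)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Propose moving one randomly chosen operator to a random device.
        proposal = list(current)
        proposal[rng.randrange(len(proposal))] = rng.randrange(num_devices)
        proposal_cost = simulate_cost(proposal, op_costs, edges,
                                      num_devices, comm_cost)
        # Always accept improvements; accept regressions with decaying probability.
        accept_prob = math.exp(min(0.0, (current_cost - proposal_cost) / temperature))
        if rng.random() < accept_prob:
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


if __name__ == "__main__":
    # A four-operator chain on two devices.
    placement, cost = mcmc_search([4.0, 4.0, 4.0, 4.0],
                                  [(0, 1), (1, 2), (2, 3)], num_devices=2)
    print(placement, cost)
```

Occasionally accepting worse proposals lets the walk escape local minima, which a purely greedy search cannot do. The quality of the result depends entirely on how faithful the cost function is to real execution.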

Machine Learning Compiler

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Project Website]
    • {TVM}: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper]
      • Chen, Tianqi, et al. (OSDI 2018)
  • Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
  • Tensorflow/mlir: "Multi-Level Intermediate Representation" Compiler Infrastructure [GitHub]
  • PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]

Deep Reinforcement Learning System

  • Ray: A Distributed Framework for Emerging {AI} Applications [GitHub]
    • Moritz, Philipp, et al. (OSDI 2018)
    • Summary: Distributed DRL training, simulation and inference system. Can be used as a high-performance Python framework.
  • Elf: An extensive, lightweight and flexible research platform for real-time strategy games [Paper] [GitHub]
    • Tian, Yuandong, Qucheng Gong, Wenling Shang, Yuxin Wu, and C. Lawrence Zitnick. (NIPS 2017)
    • Summary:
  • Horizon: Facebook's Open Source Applied Reinforcement Learning Platform [Paper] [GitHub]
    • Gauci, Jason, et al. (preprint 2019)
  • RLgraph: Modular Computation Graphs for Deep Reinforcement Learning [Paper] [GitHub]
    • Schaarschmidt, Michael, Sven Mika, Kai Fricke, and Eiko Yoneki. (SysML 2019)
    • Summary:

Video System papers

  • CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video [Paper]
    • Mao, Huizi, Taeyoung Kong, and William J. Dally. (SysML 2019)
  • Live Video Analytics at Scale with Approximation and Delay-Tolerance [Paper]
    • Zhang, Haoyu, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. (NSDI 2017)
  • Chameleon: scalable adaptation of video analytics [Paper]
    • Jiang, Junchen, et al. (SIGCOMM 2018)
    • Summary: A configuration controller that balances accuracy against resource use. The golden configuration is a good design. The cost of periodic profiling often exceeds any resource savings gained by adapting the configurations.
  • Noscope: optimizing neural network queries over video at scale [Paper] [GitHub]
    • Kang, Daniel, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. (VLDB 2017)
    • Summary:
  • SVE: Distributed video processing at Facebook scale [Paper]
    • Huang, Qi, et al. (SOSP 2017)
    • Summary:
  • Scanner: Efficient Video Analysis at Scale [Paper] [GitHub]
    • Poms, Alex, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian (SIGGRAPH 2018)
    • Summary:
  • A cloud-based large-scale distributed video analysis system [Paper]
    • Wang, Yongzhe, et al. (ICIP 2016)
  • Rosetta: Large scale system for text detection and recognition in images [Paper]
    • Borisyuk, Fedor, Albert Gordo, and Viswanath Sivakumar. (KDD 2018)
    • Summary:
  • Neural adaptive content-aware internet video delivery. [Paper] [GitHub]
    • Yeo, H., Jung, Y., Kim, J., Shin, J. and Han, D., 2018. (OSDI 2018)
    • Summary: Combine video super-resolution and ABR

Edge or Mobile Papers

  • NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision [Paper]
    • Fang, Biyi, Xiao Zeng, and Mi Zhang. (MobiCom 2018)
    • Summary: Borrows ideas from network pruning. The pruned model can then be recovered at runtime to trade off compute resources against accuracy.
  • Lavea: Latency-aware video analytics on edge computing platform [Paper]
    • Yi, Shanhe, et al. (Second ACM/IEEE Symposium on Edge Computing. ACM, 2017.)
  • Scaling Video Analytics on Constrained Edge Nodes [Paper] [GitHub]
    • Canel, C., Kim, T., Zhou, G., Li, C., Lim, H., Andersen, D. G., Kaminsky, M., and Dulloo (SysML 2019)

Resource Management

  • Resource management with deep reinforcement learning [Paper] [GitHub]
    • Mao, Hongzi, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula (ACM HotNets 2016)
    • Summary: Highly cited paper with a nice problem definition. An example solution that translates the problem of packing tasks with multiple resource demands into a learning problem and then uses DRL to solve it.
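
To make the "packing tasks with multiple resource demands" framing above concrete, here is a minimal sketch of how such a problem might be posed as an environment with states, actions, and rewards. Everything in it is an assumption for illustration: the `Job` and `PackingEnv` names, the reward (minus the number of unfinished jobs per time step), and the first-fit policy standing in where a learned policy would go. It is not the paper's environment or reward definition.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Job:
    demand: Tuple[int, ...]  # units needed of each resource, e.g. (cpu, memory)
    duration: int            # time steps the job runs once scheduled


class PackingEnv:
    """Toy multi-resource cluster: schedule queued jobs or let time advance."""

    def __init__(self, capacity: Sequence[int], jobs: Sequence[Job]) -> None:
        self.free = list(capacity)
        self.queue: List[Job] = list(jobs)
        self.running: List[Tuple[Job, int]] = []  # (job, remaining steps)
        self.time = 0

    def fits(self, job: Job) -> bool:
        return all(d <= f for d, f in zip(job.demand, self.free))

    def done(self) -> bool:
        return not self.queue and not self.running

    def step(self, action: Optional[int]) -> float:
        """Schedule queue[action] if it fits; otherwise advance time one step."""
        if (action is not None and 0 <= action < len(self.queue)
                and self.fits(self.queue[action])):
            job = self.queue.pop(action)
            self.free = [f - d for f, d in zip(self.free, job.demand)]
            self.running.append((job, job.duration))
            return 0.0
        # Time passes: every waiting or running job costs one unit of reward.
        penalty = float(len(self.queue) + len(self.running))
        self.time += 1
        still_running = []
        for job, remaining in self.running:
            if remaining > 1:
                still_running.append((job, remaining - 1))
            else:  # job finishes, release its resources
                self.free = [f + d for f, d in zip(self.free, job.demand)]
        self.running = still_running
        return -penalty


def first_fit_policy(env: PackingEnv) -> Optional[int]:
    """Baseline stand-in for a learned policy: pick the first job that fits."""
    for index, job in enumerate(env.queue):
        if env.fits(job):
            return index
    return None


def run_episode(env: PackingEnv, policy: Callable[[PackingEnv], Optional[int]],
                max_steps: int = 1000) -> float:
    """Roll out a policy until all jobs finish (or max_steps is reached)."""
    total = 0.0
    for _ in range(max_steps):
        if env.done():
            break
        total += env.step(policy(env))
    return total
```

A reinforcement learning agent would replace `first_fit_policy` with a policy that maps the cluster state (free resources plus queued demands) to an action, and would be trained to maximize the episode's total reward.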

Advanced Theory

  • Differentiable MPC for End-to-end Planning and Control [Paper] [GitHub]
    • Amos, Brandon, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter (NIPS 2018)

Traditional System Optimization Papers

  • AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers [Paper]
    • Gandhi, Anshul, et al. (TOCS 2012)
