
awesome-mixture-of-experts Awesome

MIT License

A collection of AWESOME things about mixture-of-experts

This repo is a collection of AWESOME things about mixture-of-experts, including papers, code, etc. Feel free to star and fork.

Contents

  • Open Models
  • Papers
  • Library

Open Models

  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [Jan 2024] Repo Paper
  • LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training [Dec 2023] Repo
  • Mixtral of Experts [Dec 2023] Repo Paper
  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023] Repo Paper
  • Efficient Large Scale Language Modeling with Mixtures of Experts [Dec 2021] Repo Paper
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Feb 2021] Repo Paper

Papers

Must Read

I list my favorite MoE papers here; they are a great starting point for newcomers who want to learn about this topic. A minimal code sketch of a sparsely-gated MoE layer follows the list.

  • A Review of Sparse Expert Models in Deep Learning [4 Sep 2022]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
  • Scaling Vision with Sparse Mixture of Experts [NeurIPS 2021]
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
  • Mixture-of-Experts with Expert Choice Routing [NeurIPS 2022]
  • Brainformers: Trading Simplicity for Efficiency [ICML 2023]
  • From Sparse to Soft Mixtures of Experts [2 Aug 2023]
  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023]
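
Before diving into the papers above, it can help to see the core mechanism in code. Below is a minimal, illustrative sketch of a top-k sparsely-gated MoE layer in PyTorch, in the spirit of the Sparsely-Gated Mixture-of-Experts and Switch Transformers papers; all names, sizes, and the dense per-expert loop are assumptions chosen for clarity, not code from any of the listed works.

```python
# Minimal sketch of a top-k sparsely-gated MoE layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])               # (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)    # (num_tokens, num_experts)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # each token picks k experts
        out = torch.zeros_like(tokens)
        # Dense per-expert loop for readability; real systems use capacity-limited
        # dispatch plus all-to-all communication so each token only pays for k experts.
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoE(d_model=64, d_ff=256, num_experts=8, top_k=2)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

Production systems such as GShard, Switch Transformers, DeepSpeed-MoE, and FastMoE replace the per-expert Python loop with capacity-limited token dispatch and all-to-all communication across devices, and typically add an auxiliary load-balancing loss on the router; the papers above cover these details.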

MoE Model

Publication

  • Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [ICML 2023]
  • Robust Mixture-of-Expert Training for Convolutional Neural Networks [ICCV 2023]
  • Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [EMNLP 2023]
  • PAD-Net: An Efficient Framework for Dynamic Networks [ACL 2023]
  • Brainformers: Trading Simplicity for Efficiency [ICML 2023]
  • On the Representation Collapse of Sparse Mixture of Experts [NeurIPS 2022]
  • StableMoE: Stable Routing Strategy for Mixture of Experts [ACL 2022]
  • Taming Sparsely Activated Transformer with Stochastic Experts [ICLR 2022]
  • Go Wider Instead of Deeper [AAAI 2022]
  • Hash layers for large sparse models [NeurIPS 2021]
  • DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning [NeurIPS 2021]
  • Scaling Vision with Sparse Mixture of Experts [NeurIPS 2021]
  • BASE Layers: Simplifying Training of Large, Sparse Models [ICML 2021]
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [ICLR 2017]
  • CPM-2: Large-scale cost-effective pre-trained language models [AI Open]
  • Mixture of experts: a literature survey [Artificial Intelligence Review]

arXiv

  • MoEC: Mixture of Expert Clusters [19 Jul 2022]
  • No Language Left Behind: Scaling Human-Centered Machine Translation [6 Jul 2022]
  • Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners [8 Jun 2022]
  • Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts [6 Jun 2022]
  • Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation [5 Jun 2022]
  • Interpretable Mixture of Experts for Structured Data [5 Jun 2022]
  • Task-Specific Expert Pruning for Sparse Mixture-of-Experts [1 Jun 2022]
  • Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [28 May 2022]
  • AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [24 May 2022]
  • Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT [24 May 2022]
  • One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code [12 May 2022]
  • SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach [26 Apr 2022]
  • Residual Mixture of Experts [20 Apr 2022]
  • Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [16 Apr 2022]
  • MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [15 Apr 2022]
  • Mixture-of-experts VAEs can disregard variation in surjective multimodal data [11 Apr 2022]
  • Efficient Language Modeling with Sparse all-MLP [14 Mar 2022]
  • Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [2 Mar 2022]
  • Mixture-of-Experts with Expert Choice Routing [18 Feb 2022]
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
  • Designing Effective Sparse Expert Models [17 Feb 2022]
  • Unified Scaling Laws for Routed Language Models [2 Feb 2022]
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [28 Jan 2022]
  • One Student Knows All Experts Know: From Sparse to Dense [26 Jan 2022]
  • Dense-to-Sparse Gate for Mixture-of-Experts [29 Dec 2021]
  • Efficient Large Scale Language Modeling with Mixtures of Experts [20 Dec 2021]
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
  • Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition [10 Dec 2021]
  • SpeechMoE2: Mixture-of-Experts Model with Improved Routing [23 Nov 2021]
  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [23 Nov 2021]
  • Towards More Effective and Economic Sparsely-Activated Model [14 Oct 2021]
  • M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [8 Oct 2021]
  • Sparse MoEs meet Efficient Ensembles [7 Oct 2021]
  • MoEfication: Conditional Computation of Transformer Models for Efficient Inference [5 Oct 2021]
  • Cross-token Modeling with Conditional Computation [5 Sep 2021]
  • M6-T: Exploring Sparse Expert Models and Beyond [31 May 2021]
  • SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [7 May 2021]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
  • Exploring Routing Strategies for Multilingual Mixture-of-Experts Models [28 Sep 2020]

MoE System

Publication

  • Pathways: Asynchronous Distributed Dataflow for ML [MLSys 2022]
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [OSDI 2022]
  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models [PPoPP 2022]
  • BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores [PPoPP 2022]
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [ICLR 2021]

arXiv

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts [29 Nov 2022]
  • HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [28 Mar 2022]
  • SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System [20 Mar 2022]
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [14 Jan 2022]
  • SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [29 Sep 2021]
  • FastMoE: A Fast Mixture-of-Expert Training System [24 Mar 2021]

MoE Application

Publication

  • Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields [ICLR 2023]

arXiv

  • Spatial Mixture-of-Experts [24 Nov 2022]
  • A Mixture-of-Expert Approach to RL-based Dialogue Management [31 May 2022]
  • Pluralistic Image Completion with Probabilistic Mixture-of-Experts [18 May 2022]
  • ST-ExpertNet: A Deep Expert Framework for Traffic Prediction [5 May 2022]
  • Mixture of Experts for Biomedical Question Answering [15 Apr 2022]
  • Build a Robust QA System with Transformer-based Mixture of Experts [20 Mar 2022]

Library


