This project forked from fminference/flexgen

Running large language models on a single GPU for throughput-oriented scenarios.

License: Apache License 2.0

Shell 3.55% Python 96.45%

FlexGen

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

Throughput-Oriented Inference for Large Language Models

In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks. These tasks include benchmarking, information extraction, data wrangling, and form processing.

One key characteristic of these applications is that they are throughput-oriented: they require running LLM inferences over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to latency - the user starts up a job and lets it run overnight - but increasing throughput is critical for reducing costs. Throughput is a measure of tokens processed per second over the job's entire runtime (which can be hours). Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexGen is to create a high-throughput system to enable new and exciting applications of foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU instead of expensive systems.

Check out the examples of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.

Limitation. As an offloading-based system running on weak GPUs, FlexGen has its limitations. It can be significantly slower than a setup with enough powerful GPUs to hold the whole model, especially for small-batch cases. FlexGen is mostly optimized for throughput-oriented batch processing (e.g., classifying or extracting information from many documents in batches) on a single GPU.


Install

Requirements:

Method 1: With pip

pip install flexgen

Method 2: From source

git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip install -e .

Examples

HELM Benchmark

FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the commands below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
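The flags in the command above can be read with a small sketch. It rests on two assumptions that this page does not state: that the six `--percent` values are (weights on GPU, weights on CPU, cache on GPU, cache on CPU, activations on GPU, activations on CPU) with any remainder going to disk, and that the effective batch size is `--gpu-batch-size` times `--num-gpu-batches`. The names `Placement`, `parse_percent`, and `effective_batch_size` are hypothetical helpers for illustration, not part of FlexGen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical container: where one tensor type lives, in percent."""
    gpu: int
    cpu: int

    @property
    def disk(self) -> int:
        # Assumption: whatever is not on GPU or CPU is offloaded to disk.
        return 100 - self.gpu - self.cpu


def parse_percent(values):
    """Split six --percent values into weights, cache, and activations."""
    if len(values) != 6:
        raise ValueError("expected six percentages")
    pairs = [Placement(values[i], values[i + 1]) for i in range(0, 6, 2)]
    for p in pairs:
        if p.gpu < 0 or p.cpu < 0 or p.disk < 0:
            raise ValueError("each GPU/CPU pair must be non-negative and sum to at most 100")
    return dict(zip(("weights", "cache", "activations"), pairs))


def effective_batch_size(gpu_batch_size, num_gpu_batches):
    """Assumption: sequences processed per pass = batch size x number of batches."""
    return gpu_batch_size * num_gpu_batches


# Values taken from the HELM command above.
policy = parse_percent([20, 80, 0, 100, 0, 100])
for name, placement in policy.items():
    print(name, placement.gpu, placement.cpu, placement.disk)
print(effective_batch_size(48, 3))
```

Under these assumptions, the command keeps 20% of the weights on the GPU, places everything else in CPU memory, and processes 144 sequences per pass.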

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.

Data Wrangling

You can run the examples in this paper, 'Can Foundation Models Wrangle Your Data?', by following the instructions here.

Performance Benchmark

Generation Throughput (token/s)

The corresponding effective batch sizes are in parentheses. Please see here for more details.

System                     OPT-6.7B           OPT-30B            OPT-175B
Hugging Face Accelerate    25.12 (2 on GPU)   0.62 (8 on CPU)    0.01 (2 on disk)
DeepSpeed ZeRO-Inference   9.28 (16 on CPU)   0.60 (4 on CPU)    0.01 (1 on disk)
Petals                     8.25 (2 on GPU)    2.84 (2 on GPU)    0.08 (2 on GPU)
FlexGen                    25.26 (2 on GPU)   7.32 (144 on CPU)  0.69 (256 on disk)
FlexGen with Compression   29.12 (72 on GPU)  8.38 (512 on CPU)  1.12 (144 on CPU)
  • Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
  • Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a large value that maximizes the generation throughput for each system.
  • Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).
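The metric definition above can be written as a short function. This is a minimal sketch: the function name and the timing values are made up for illustration and are not measurements from the benchmark.

```python
def generation_throughput(num_generated_tokens, prompt_seconds, generation_seconds):
    """Generated tokens divided by total time (prompt processing + generation)."""
    total_seconds = prompt_seconds + generation_seconds
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return num_generated_tokens / total_seconds


# Illustrative numbers only: 144 sequences, each generating 32 output tokens.
tokens = 144 * 32
print(generation_throughput(tokens, prompt_seconds=200.0, generation_seconds=376.0))  # 8.0
```

Because prompt processing time sits in the denominator, the metric rewards large batches that amortize it over many generated tokens.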

How to reproduce.

Roadmap

We plan to work on the following features.

  • Optimize the performance for multiple GPUs on the same machine
  • Support more models (BLOOM, CodeGen, GLM)
  • Release the cost model and policy optimizer
  • MacBook support (M1 and M2)
  • AMD support
