
premai-io / benchmarks


🕹ī¸ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.

License: MIT License

Shell 54.87% Rust 4.11% Python 41.02%
ai inference-engines llmops mlops benchmarks latency performances

benchmarks's Introduction

🕹ī¸ Benchmarks

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models



Check out our release blog to know more.

Table of Contents
  1. A quick glance at performance benchmarks
  2. ML Engines
  3. Why Benchmarks
  4. Usage and workflow
  5. Contribute

đŸĨŊ A quick glance at performance benchmarks

Take a first look at the performance metrics of Mistral 7B v0.1 Instruct and Llama 2 7B Chat across different precisions and inference engines. Below is the run specification that generated these benchmark reports.

Environment:

  • Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
  • CUDA Version: 12.1
  • Batch size: 1

Command:

./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
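
For reference, here is a minimal sketch of how the same configuration could be run for both models back to back. It assumes benchmark.sh takes one model per invocation, as the mistral/llama placeholder above suggests:

# Sketch: run the published configuration for both models sequentially.
for model in mistral llama; do
  ./benchmark.sh \
    --repetitions 10 \
    --max_tokens 512 \
    --device cuda \
    --model "$model" \
    --prompt 'Write an essay about the transformer model architecture'
done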

Mistral 7B v0.1 Instruct

Performance Metrics: (unit: Tokens/second)

Engine float32 float16 int8 int4
transformers (pytorch) 39.61 Âą 0.65 37.05 Âą 0.49 5.08 Âą 0.01 19.58 Âą 0.38
AutoAWQ - - - 63.12 Âą 2.19
AutoGPTQ - - 39.11 Âą 0.42 42.94 Âą 0.80
DeepSpeed - 79.88 Âą 0.32 - -
ctransformers - - 86.14 Âą 1.40 87.22 Âą 1.54
llama.cpp - - 88.27 Âą 0.72 95.33 Âą 5.54
ctranslate 43.17 Âą 2.97 68.03 Âą 0.27 45.14 Âą 0.24 -
PyTorch Lightning 32.79 Âą 2.74 43.01 Âą 2.90 7.75 Âą 0.12 -
Nvidia TensorRT-LLM 117.04 Âą 2.16 206.59 Âą 6.93 390.49 Âą 4.86 427.40 Âą 4.84
vllm 84.91 Âą 0.27 84.89 Âą 0.28 - 106.03 Âą 0.53
exllamav2 - - 114.81 Âą 1.47 126.29 Âą 3.05
onnx 15.75 Âą 0.15 22.39 Âą 0.14 - -
Optimum Nvidia 50.77 Âą 0.85 50.91 Âą 0.19 - -

Performance Metrics: GPU Memory Consumption (unit: MB)

Engine float32 float16 int8 int4
transformers (pytorch) 31071.4 15976.1 10963.91 5681.18
AutoGPTQ - - 13400.80 6633.29
AutoAWQ - - - 6572.47
DeepSpeed - 80097.34 - -
ctransformers - - 10255.07 6966.74
llama.cpp - - 9141.49 5880.41
ctranslate 32602.32 17523.8 10074.72 -
PyTorch Lightning 48783.95 18738.05 10680.32 -
Nvidia TensorRT-LLM 79536.59 78341.21 77689.0 77311.51
vllm 73568.09 73790.39 - 74016.88
exllamav2 - - 21483.23 9460.25
onnx 33629.93 19537.07 - -
Optimum Nvidia 79563.85 79496.74 - -

(Data updated: 30th April 2024)

Llama 2 7B Chat

Performance Metrics: (unit: Tokens/second)

Engine float32 float16 int8 int4
transformers (pytorch) 36.65 Âą 0.61 34.20 Âą 0.51 6.91 Âą 0.14 17.83 Âą 0.40
AutoAWQ - - - 63.59 Âą 1.86
AutoGPTQ - - 34.36 Âą 0.51 36.63 Âą 0.61
DeepSpeed - 84.60 Âą 0.25 - -
ctransformers - - 85.50 Âą 1.00 86.66 Âą 1.06
llama.cpp - - 89.90 Âą 2.26 97.35 Âą 4.71
ctranslate 46.26 Âą 1.59 79.41 Âą 0.37 48.20 Âą 0.14 -
PyTorch Lightning 38.01 Âą 0.09 48.09 Âą 1.12 10.68 Âą 0.43 -
Nvidia TensorRT-LLM 104.07 Âą 1.61 191.00 Âą 4.60 316.77 Âą 2.14 358.49 Âą 2.38
vllm 89.40 Âą 0.22 89.43 Âą 0.19 - 115.52 Âą 0.49
exllamav2 - - 125.58 Âą 1.23 159.68 Âą 1.85
onnx 14.28 Âą 0.12 19.42 Âą 0.08 - -
Optimum Nvidia 53.64 Âą 0.78 53.82 Âą 0.11 - -

Performance Metrics: GPU Memory Consumption (unit: MB)

Engine float32 float16 int8 int4
transformers (pytorch) 29114.76 14931.72 8596.23 5643.44
AutoAWQ - - - 7149.19
AutoGPTQ - - 10718.54 5706.35
DeepSpeed - 80105.13 - -
ctransformers - - 9774.83 6889.14
llama.cpp - - 8797.55 5783.95
ctranslate 29951.52 16282.29 9470.74 -
PyTorch Lightning 42748.35 14736.69 8028.16 -
Nvidia TensorRT-LLM 79421.24 78295.07 77642.86 77256.98
vllm 77928.07 77928.07 - 77768.69
exllamav2 - - 16582.18 7201.62
onnx 33072.09 19180.55 - -
Optimum Nvidia 79429.63 79295.41 - -

(Data updated: 30th April 2024)

Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct. It currently benchmarks only on an A100 80 GB GPU, because our primary focus is enterprises. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Please note that the engines are continuously maintained and improved, so those older numbers may be somewhat outdated.

đŸ›ŗ ML Engines

In the current market, there are several ML engines. Here is an overview of the engines used for this benchmark, along with a summary of their support matrix. You can find the details about the nuances here.

Engine Float32 Float16 Int8 Int4 CUDA ROCM Mac M1/M2 Training
candle ⚠ī¸ ✅ ⚠ī¸ ⚠ī¸ ✅ ❌ 🚧 ❌
llama.cpp ❌ ❌ ✅ ✅ ✅ 🚧 🚧 ❌
ctranslate ✅ ✅ ✅ ❌ ✅ ❌ 🚧 ❌
onnx ✅ ✅ ❌ ❌ ✅ ⚠ī¸ ❌ ❌
transformers (pytorch) ✅ ✅ ✅ ✅ ✅ 🚧 ✅ ✅
vllm ✅ ✅ ❌ ✅ ✅ 🚧 ❌ ❌
exllamav2 ❌ ❌ ✅ ✅ ✅ 🚧 ❌ ❌
ctransformers ❌ ❌ ✅ ✅ ✅ 🚧 🚧 ❌
AutoGPTQ ✅ ✅ ⚠ī¸ ⚠ī¸ ✅ ❌ ❌ ❌
AutoAWQ ❌ ❌ ❌ ✅ ✅ ❌ ❌ ❌
DeepSpeed-MII ❌ ✅ ❌ ❌ ✅ ❌ ❌ ⚠ī¸
PyTorch Lightning ✅ ✅ ✅ ✅ ✅ ⚠ī¸ ⚠ī¸ ✅
Optimum Nvidia ✅ ✅ ❌ ❌ ✅ ❌ ❌ ❌
Nvidia TensorRT-LLM ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌

Legend:

  • ✅ Supported
  • ❌ Not Supported
  • ⚠ī¸ There is a catch related to this
  • 🚧 Supported but not implemented in the current version

You can check out the nuances related to ⚠ī¸ and 🚧 in detail here.

🤔 Why Benchmarks

This is a common question: what benefits can you expect from this repository? Here are some quick pointers to answer it.

  1. Choosing an engine or precision for an LLM inference workflow can be confusing, because compute constraints and other requirements differ from case to case. This repository helps you quickly figure out what to use based on your requirements.

  2. There is often a quality vs. speed tradeoff between engines and precisions. This repository keeps track of those tradeoffs so you can weigh them against your priorities.

  3. Fully reproducible and hackable scripts. The latest benchmarks follow a number of best practices so that they run robustly on GPU devices, and you can reference and extend the implementations to build your own workflows.

🚀 Usage and workflow

Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides some form of optimization, whether through quantization alone or through device-specific optimizations such as custom CUDA kernels.

To get started, you need to download the models first. The following command downloads Llama 2 7B Chat and Mistral 7B v0.1 Instruct:

./download.sh

Please note that for the Llama 2 7B Chat weights, we assume you have already agreed to the required terms and conditions and have been verified to download them.
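
If download.sh pulls the gated Llama 2 weights from the Hugging Face Hub (an assumption; check the script for the actual source), you typically need to authenticate with an access token from an account that has accepted the license, for example:

pip install -U "huggingface_hub[cli]"  # only if the CLI is not already installed
huggingface-cli login                  # paste a token that has access to the gated repo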

A Benchmark workflow

When you run a benchmark, the following set of events occurs:

  • Automatically setting up the environment and installing the required dependencies.

  • Converting the models to a specific format (if required) and saving them.

  • Running the benchmarks and storing the results inside the logs folder. Each log folder has the following structure:

      • performance.log: tracks the run performance. You can see the tokens/sec and memory consumption (MB) here.

      • quality.md: an automatically generated markdown file containing qualitative comparisons of the different precisions supported by an engine. We take 5 prompts, run them for each supported precision, and put the results side by side. Our ground truth is the output of the Hugging Face PyTorch model with raw float32 weights.

      • quality.json: the same content as the markdown file, but in raw JSON format.
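
As a quick way to inspect these artifacts after a run (the exact directory layout under logs is an assumption; adjust it to whatever your bench script actually writes):

ls logs/                       # one folder per benchmark run
cat logs/*/performance.log     # tokens/sec and memory consumption (MB)
less logs/*/quality.md         # side-by-side qualitative comparison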

Inside each benchmark folder, you will also find a README.md file which contains all the information and the qualitative comparison for that engine. For example: bench_tensorrtllm.

Running a Benchmark

Here is how we run benchmarks for an inference engine.

./bench_<engine-name>/bench.sh \
 --prompt <value> \ # Prompt string to benchmark with
 --max_tokens <value> \  # Maximum number of tokens to generate
 --repetitions <value> \  # Number of repetitions to run for the prompt
 --device <cpu/cuda/metal> \  # The device on which to benchmark
 --model_name <name-of-the-model> # The model to use (options: 'llama' for Llama 2 and 'mistral' for Mistral 7B v0.1)

Here is an example. Let's say we want to benchmark Nvidia TensorRT-LLM. The command would look like this:

./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10

For more details, here is an explanation of each command-line argument (a full example invocation follows the list).

 -p, --prompt Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
 -r, --repetitions Number of repetitions for benchmarks (default: 10)
 -m, --max_tokens Maximum number of tokens for benchmarks (default: 512)
 -d, --device Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
 -n, --model_name The name of the model to benchmark (possible values: 'llama' for using Llama2, 'mistral' for using Mistral 7B v0.1)
 -lf, --log_file Logging file name.
 -h, --help Show this help message
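
Putting it all together, a full invocation with the long-form flags might look like this (the prompt and log file name are illustrative values only):

./bench_tensorrtllm/bench.sh \
  --prompt 'Write an essay about the transformer model architecture' \
  --repetitions 10 \
  --max_tokens 512 \
  --device cuda \
  --model_name llama \
  --log_file benchmark_tensorrtllm.log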

🤝 Contribute

We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:

Creating a New Benchmark

1. Create a New Folder

Start by creating a new folder for your benchmark. Name it bench_{new_bench_name} for consistency.

mkdir bench_{new_bench_name}

2. Folder Structure

Inside the new benchmark folder, include the following structure:

bench_{new_bench_name}
├── bench.sh # Benchmark script for setup and execution
├── requirements.txt # Dependencies required for the benchmark
└── ... # Any additional files needed for the benchmark

3. Benchmark Script (bench.sh):

The bench.sh script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the command-line parameters described in the Running a Benchmark section above.
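
As a starting point, here is a minimal, hypothetical skeleton for such a script. It only parses the documented parameters; the engine-specific setup, model conversion, and benchmark loop are left as comments:

#!/bin/bash
# bench.sh: hypothetical skeleton for a new benchmark (adapt to your engine).
set -euo pipefail

# Defaults matching the documented CLI.
PROMPT='Write an essay about the transformer model architecture'
REPETITIONS=10
MAX_TOKENS=512
DEVICE='cuda'
MODEL_NAME='llama'
LOG_FILE='benchmark.log'

while [ $# -gt 0 ]; do
  case "$1" in
    -p|--prompt)      PROMPT="$2"; shift 2 ;;
    -r|--repetitions) REPETITIONS="$2"; shift 2 ;;
    -m|--max_tokens)  MAX_TOKENS="$2"; shift 2 ;;
    -d|--device)      DEVICE="$2"; shift 2 ;;
    -n|--model_name)  MODEL_NAME="$2"; shift 2 ;;
    -lf|--log_file)   LOG_FILE="$2"; shift 2 ;;
    -h|--help)        echo "Usage: $0 [-p prompt] [-r repetitions] [-m max_tokens] [-d device] [-n model_name] [-lf log_file]"; exit 0 ;;
    *) echo "Unknown argument: $1"; exit 1 ;;
  esac
done

# 1. Set up the environment and install requirements.txt here.
# 2. Download / convert the model if the engine needs a specific format.
# 3. Run the benchmark REPETITIONS times and write tokens/sec and memory
#    usage to "$LOG_FILE" (e.g. logs/<run>/performance.log).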

Pre-commit Hooks

We use pre-commit hooks to maintain code quality and consistency.

1. Install Pre-commit: Ensure you have pre-commit installed

pip install pre-commit

2. Install Hooks: Run the following command to install the pre-commit hooks

pre-commit install

The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.
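
Optionally, you can run all hooks against the whole codebase once, before your first commit:

pre-commit run --all-files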

benchmarks's People

Contributors

actions-user, anindyadeep, biswaroop1547, filopedraz, nsosio, swarnimarun


benchmarks's Issues

Roadmap - Task List

Minor fixes:

  • Fix the supported-features table to better reflect reality.
  • Clean up the code and ensure names for user-exposed ENV variables are consistent.
  • Handle quantization information, and document the quantization method used across models and scripts in the README (llama.cpp quantization vs tinygrad vs CTranslate2 vs GPTQ).
  • Provide more CLI options for running individual scripts with other models as well, e.g. for testing frameworks on systems with less memory.

Features (in order of priority):

  • Add custom model-runner code for running benchmarks, and provide hooks for directly reporting performance metrics as output.
  • Set up scripts for running benchmarks with a single command and getting proper performance reports.
  • Improve model caching; currently some scripts end up re-downloading the models (already fixed for some).
  • Simplify running on any specific platform (NVIDIA/Mac) with any supported model.
  • Auto-setup of Rust, Python, Git, etc. for the user before running the benchmark (low priority).

Linter

Description

Most of the repo is in Python. Set up a linter accordingly; check prem-daemon for reference.

AWQ Quantization

Description

Benchmarks for inference engines supporting this quantization method.

PyTorch

PyTorch (Transformers): test multiple versions, e.g. 1.2.1 vs 2.1.0.

Petals

Test with a private swarm (refer to premAI-io/dev-portal#69)

Questions

  • Where are the bottlenecks?
  • Are there no advantages to build_gpu? Is there no way to force using both the local GPU and the swarm (without connecting the local GPU to the swarm)?
  • Compare Petals with DeepSpeed in a centralized scenario.

Burn Upgrade

Upgrade the burn version from 0.9.0 to 0.10.0 in llama burn.

Two commands to run all the benchmarks

Description

The repo should expose two commands:

  • Command to run benchmarks on Mac (CPU/GPU when available)
  • Command to run benchmarks on NVIDIA GPUs

I should be able to clone the repo with git clone and run bash ./mac.sh or bash ./nvidia.sh. You can expose multiple abstractions and CLIs if you want, but this is the final objective.

Standard output should print the results in a consistent format so they can be checked easily.
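
A minimal, hypothetical sketch of what the NVIDIA wrapper could look like (the bench_*/bench.sh layout is taken from this repo; the exact flags each bench.sh accepts should be double-checked):

#!/bin/bash
# nvidia.sh: hypothetical wrapper that runs every engine's benchmark on CUDA.
set -euo pipefail
for bench in bench_*/bench.sh; do
  echo "==> Running $bench"
  "$bench" --device cuda --model_name llama --repetitions 10
done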
