A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
Take a first glance at the Llama-2-7B model performance metrics across different precisions and inference engines. Metric used: tokens/sec.
| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| burn | 10.04 ± 0.64 | - | - | - |
| candle | - | 36.78 ± 2.17 | - | - |
| llama.cpp | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
| ctranslate | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
| tinygrad | - | 20.32 ± 0.06 | - | - |
| onnx | - | 54.16 ± 3.15 | - | - |
| transformers (pytorch) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
| vllm | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20 |
| exllamav2 | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
| ctransformers | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
| AutoGPTQ | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
| AutoAWQ | - | - | - | 109.20 ± 3.28 |
| DeepSpeed | - | 81.44 ± 8.13 | - | - |
| PyTorch Lightning | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
| Optimum Nvidia | 110.36 ± 0.52 | 109.09 ± 4.26 | - | - |
| Nvidia TensorRT-LLM | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
*(Data updated: 4th February 2024)*
- The above benchmarks were run on an A100 80GB GPU. You can find more details for other devices like CPU/Metal under the docs folder.
- If you want more detailed information about each benchmark, you can find it in the respective benchmark folders.
- If you want to compare side by side which inference engines support which precisions and devices, check out the ml_engines.md file. Please note that this file is incomplete; a better comparison of engines will be added in later versions.
The benchmarks can also be considered a repository of hackable scripts that contain the code and knowledge needed to run the popular inference engines.
Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Here's a quick guide to get you started:
- Benchmark Organization: Each benchmark is uniquely identified as `bench_name` and resides in its dedicated folder, named `bench_{bench_name}`.
- Benchmark Script (`bench.sh`): Within each benchmark folder, you'll find a common script named `bench.sh`. This script takes care of everything from setup and environment configuration to actual execution.
The `bench.sh` script supports the following key parameters, allowing for customization and flexibility:

- `prompt`: Benchmark-specific prompt.
- `max_tokens`: Maximum tokens for the benchmark.
- `repetitions`: Number of benchmark repetitions.
- `log_file`: File for storing benchmark logs.
- `device`: Specify the device for benchmark execution (CPU, CUDA, Metal).
- `models_dir`: Directory containing the necessary model files.
The overarching `benchmark.sh` script further simplifies the benchmark execution process:
- File Download: It automatically downloads essential files required for benchmarking.
- Folder Iteration: The script iterates through all benchmark folders in the repository, streamlining the process for multiple benchmarks.
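Conceptually, the folder-iteration step can be pictured as a small loop like the one below. This is only a rough sketch of the idea, not the actual implementation; it assumes every benchmark folder follows the `bench_{bench_name}` convention and accepts the same flags:

```bash
# Rough sketch of the folder-iteration idea (not the actual benchmark.sh).
# Assumes the shared model files have already been downloaded.
for bench_dir in bench_*/; do
    echo "Running ${bench_dir%/}..."
    "./${bench_dir}bench.sh" "$@"   # forward --prompt, --max_tokens, etc.
done
```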
This approach empowers users to effortlessly execute benchmarks based on their preferences. To run a specific benchmark, navigate to the corresponding benchmark folder (e.g., `bench_{bench_name}`) and execute the `bench.sh` script with the required parameters.
To utilize the benchmarking capabilities of this repository, follow these usage examples:
Navigate to the benchmark folder and execute the `bench.sh` script with the desired parameters:

```bash
./bench_{bench_name}/bench.sh --prompt <value> --max_tokens <value> --repetitions <value> --log_file <file_path> --device <cpu/cuda/metal> --models_dir <path_to_models>
```
Replace `<value>` with the specific values for your benchmark, and `<file_path>` and `<path_to_models>` with the appropriate file and directory paths.
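For instance, a single benchmark run might look like the following. The folder name, prompt, and paths here are purely illustrative, so substitute the ones that match your setup:

```bash
./bench_llamacpp/bench.sh \
  --prompt "Explain the transformer architecture in one paragraph" \
  --max_tokens 256 \
  --repetitions 10 \
  --log_file logs/llamacpp_cuda.log \
  --device cuda \
  --models_dir ./models
```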
For a comprehensive execution of all benchmarks, use the overarching `benchmark.sh` script:

```bash
./benchmark.sh --prompt <value> --max_tokens <value> --repetitions <value> --log_file <file_path> --device <cpu/cuda/metal> --models_dir <path_to_models>
```
Again, customize the parameters according to your preferences, ensuring that `<file_path>` and `<path_to_models>` point to the correct locations.
Feel free to adjust the parameters as needed for your specific benchmarking requirements. Please note that running all the benchmarks collectively requires a lot of storage (around 500 GB), so make sure you have enough disk space before running them all at once.
We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:
1. Create a New Folder

Start by creating a new folder for your benchmark. Name it `bench_{new_bench_name}` for consistency.

```bash
mkdir bench_{new_bench_name}
```
2. Folder Structure

Inside the new benchmark folder, include the following structure:

```
bench_{new_bench_name}
├── bench.sh             # Benchmark script for setup and execution
├── requirements.txt     # Dependencies required for the benchmark
└── ...                  # Any additional files needed for the benchmark
```
3. Benchmark Script (`bench.sh`)

The `bench.sh` script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters mentioned in the Benchmark Script Parameters section.
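As a starting point, a minimal `bench.sh` skeleton might look like the sketch below. This is only an illustrative outline; it assumes your benchmark is driven by a Python entry point (here called `benchmark.py`, a hypothetical stand-in), so adapt the setup and run steps to your engine:

```bash
#!/bin/bash
# Hypothetical skeleton for a new bench_{new_bench_name}/bench.sh.
# Parameter names match the ones documented above; everything else is illustrative.
set -euo pipefail

PROMPT="Hello, world"
MAX_TOKENS=100
REPETITIONS=10
LOG_FILE="benchmark.log"
DEVICE="cuda"
MODELS_DIR="./models"

# Parse the standard benchmark parameters.
while [ $# -gt 0 ]; do
  case "$1" in
    --prompt)      PROMPT="$2";      shift 2 ;;
    --max_tokens)  MAX_TOKENS="$2";  shift 2 ;;
    --repetitions) REPETITIONS="$2"; shift 2 ;;
    --log_file)    LOG_FILE="$2";    shift 2 ;;
    --device)      DEVICE="$2";      shift 2 ;;
    --models_dir)  MODELS_DIR="$2";  shift 2 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

# 1. Environment setup: create an isolated environment and install dependencies.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# 2. Run the benchmark and append the results to the log file.
python benchmark.py \
  --prompt "$PROMPT" \
  --max_tokens "$MAX_TOKENS" \
  --repetitions "$REPETITIONS" \
  --device "$DEVICE" \
  --models_dir "$MODELS_DIR" | tee -a "$LOG_FILE"
```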
We use pre-commit hooks to maintain code quality and consistency.
1. Install Pre-commit: Ensure you have `pre-commit` installed:

```bash
pip install pre-commit
```
2. Install Hooks: Run the following command to install the pre-commit hooks:

```bash
pre-commit install
```
The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.
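If you want to trigger the checks manually across the whole repository (for example, right after installing the hooks), you can run `pre-commit run --all-files`.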
In our upcoming versions, we will be adding support for the following:
- Add more metrics on memory consumption. This includes how much RAM/GPU memory is consumed when we run the benchmarks.
- Add support for more models. Upcoming versions will support popular LLMs like Mamba, Mistral, Mixtral, Phi2 etc.
- Add ways to understand and articulate how generation quality changes across frameworks and precisions. We will try to add ways to measure how the generation quality of an LLM changes when we change the model's precision or use a different inference engine.
- Add support for batching. Batching is very important when deploying LLMs, so coming versions will benchmark LLMs on batched inputs.
If you feel like there is something more to add, feel free to open an issue or a PR. We would be super happy to take contributions from the community.
🌟 Exciting news, AI enthusiasts! Prem is thrilled to launch the Prem Grant Program, exclusively designed for forward-thinking AI startups ready to reshape the future. With this program, you get six months of free access to OpenAI, Anthropic, Cohere, Llama2, Mistral (or any other open-source model) APIs, opening doors to endless AI possibilities at zero cost. Enjoy free fine-tuning, seamless model deployment, and expert ML support. This is more than a grant; it's an invite to lead the AI revolution. Don't miss out – apply now and let's build the future of AI together with Prem! 🌟
Read more about the Prem Startup grant program here. You can directly apply to the program from here.