This repository is no longer under maintenance. I strongly suggest https://github.com/huggingface/optimum-benchmark, a well-designed multi-backend, multi-model benchmarking library.
Quantization is currently the go-to way to run models locally. However, a model's performance can vary remarkably depending on many factors: not only the quantization method, but also the hardware, model architecture, kernel, and so on. The lighthouse's intention is to shed light on LLM inference performance, providing out-of-the-box functionality to benchmark quantized models and quickly understand their potential across different configurations.
To use the functionalities of the lighthouse on your machine, just clone the repo with:
git clone https://github.com/LorenzoPozzi97/lighthouse.git
Call the setup.sh script, which creates a virtual environment with all the necessary packages, then activate the venv:
bash setup.sh
source .lighthouse/bin/activate
- Linux
- Windows
- Mac OS
The lighthouse works in three steps:
- Decide the parameters of your experiments
- Store the results in your personal database, namely the bulb 💡
- Interrogate the bulb 💡 to create interactive graphs
Here is a typical interaction with the library to test a model quantized in GGUF format:
python ./lighthouse/benchmark_gguf.py --model-path solar-10.7b-instruct-v1.0.Q4_K_M.gguf
Each configuration in the experiment will be appended to your bulb 💡. To create an interactive parallel coordinates graph, use:
python ./lighthouse/parallel_coordinates.py
or, for a 2D graph (passing the autogenerated name of the run):
python ./lighthouse/bidimensional_graphs.py --run-anchor straightforward_turkey_trot
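If you want to inspect the bulb 💡 programmatically instead of through the plotting scripts, filtering by run name is straightforward. This is a purely illustrative sketch that represents records as in-memory dicts; the bulb's actual storage format may differ:

```python
def filter_runs(rows, run_name):
    """Keep only the configurations recorded under a given run name."""
    return [r for r in rows if r["run_name"] == run_name]

# Toy records standing in for the real bulb database.
bulb = [
    {"run_name": "straightforward_turkey_trot", "batch": 1, "latency_s": 7.2},
    {"run_name": "straightforward_turkey_trot", "batch": 4, "latency_s": 9.1},
    {"run_name": "other_run", "batch": 1, "latency_s": 5.0},
]

for row in filter_runs(bulb, "straightforward_turkey_trot"):
    print(row["batch"], row["latency_s"])
```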
Experiments can track a number of different metrics.
memo
: A brief comment on the experiment run.

Run Name
: A unique autogenerated name given to the experiment.
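Run names like `straightforward_turkey_trot` could, for instance, be produced by joining random words. The word pools below are made up for illustration; this is not the library's actual generator:

```python
import random

# Hypothetical word pools; the real generator's vocabulary is not documented.
ADJECTIVES = ["straightforward", "brisk", "quiet", "radiant"]
ANIMALS = ["turkey", "otter", "falcon", "badger"]
ACTIONS = ["trot", "dash", "glide", "waltz"]

def generate_run_name(seed=None):
    """Return a memorable adjective_animal_action identifier for a run."""
    rng = random.Random(seed)
    return "_".join([rng.choice(ADJECTIVES), rng.choice(ANIMALS), rng.choice(ACTIONS)])

print(generate_run_name())
```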
- auto_gptq_v
- llama_cpp_v
- kernel (GPTQ only)
- Quant. Method
- Model
- Model Size
- Batch
- Threads
- Batch Threads
- Context Window
- Prompt Length
- New Tokens
- GPU Layers (GGUF only)
- Device
- VRAM
- RAM
- CPU Count
- Mem. Usage (TODO) [GB]
- Time To First Token (TTFT) [s] (= Prompt Eval Time)
- TTFT [tk/s] (= prompt tokens / TTFT = Prompt Eval Time [tk/s])
- Time Per Output Token [s/tk] (TPOT)
- Eval Time [tk/s] (= TPOT^-1)
- Latency [s] (= TTFT + TPOT * new tokens)
- Latency [tk/s] (= (prompt tokens + new tokens) / Latency [s])
- Load time [s]
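The derived metrics above all follow from two raw timings: how long prompt evaluation took and how long generation took. Here is a minimal sketch with made-up numbers; the function name and dict keys are illustrative, not lighthouse's actual API:

```python
def derive_metrics(prompt_tokens, new_tokens, prompt_eval_s, generation_s):
    """Derive the throughput metrics listed above from two raw timings."""
    ttft_s = prompt_eval_s                           # Time To First Token [s]
    prompt_tps = prompt_tokens / prompt_eval_s       # prompt eval speed [tk/s]
    tpot_s_per_tk = generation_s / new_tokens        # Time Per Output Token [s/tk]
    eval_tps = 1.0 / tpot_s_per_tk                   # generation speed [tk/s]
    latency_s = ttft_s + tpot_s_per_tk * new_tokens  # end-to-end latency [s]
    overall_tps = (prompt_tokens + new_tokens) / latency_s
    return {
        "ttft_s": ttft_s, "prompt_tps": prompt_tps,
        "tpot_s_per_tk": tpot_s_per_tk, "eval_tps": eval_tps,
        "latency_s": latency_s, "overall_tps": overall_tps,
    }

# Made-up timings: 512 prompt tokens evaluated in 0.8 s, 128 tokens generated in 6.4 s.
m = derive_metrics(prompt_tokens=512, new_tokens=128, prompt_eval_s=0.8, generation_s=6.4)
print(f"TTFT {m['ttft_s']:.2f} s | TPOT {m['tpot_s_per_tk']:.3f} s/tk | latency {m['latency_s']:.1f} s")
```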
- GGUF
- GPTQ
- EETQ
- bitsandbytes
- AWQ
Any contribution is very much appreciated! This project is still in its embryonic stage, but it could save many people a lot of time. Here's how you can contribute:
- Fork the Repository: Start by forking the repository to your GitHub account. This creates your own copy of the project where you can make changes.
- Clone Your Fork: Clone your fork to your local machine using Git. Replace YOUR-USERNAME with your GitHub username.
git clone https://github.com/YOUR-USERNAME/project-name.git
- Create a Branch: Navigate into the cloned repository and create a branch for your contribution.
git checkout -b feature/your-feature-name
- Make Your Changes Locally: Implement your changes or fixes in your branch.
- Commit Your Changes: Once you're happy with your changes, commit them to your branch. Make sure your commit messages are clear and descriptive.
git commit -am "Add a concise commit message describing your change"
- Push to Your Fork: Push your changes to your GitHub fork.
git push origin feature/your-feature-name
- Open a Pull Request (PR): Go to the original repository on GitHub, and you'll see a prompt to open a pull request from your fork. Fill in the PR template with details about your changes.
- Review: Once your PR is submitted, the project maintainers will review your changes.
- Merge: Once the review passes, the maintainers will merge your PR. Thank you for your contribution!