Comments (12)
Thanks for the request. We can definitely integrate FP6 quantization into vLLM as another supported quantization method for running FP6 models. It shouldn't be too hard, given that it only quantizes linear layers and the FP6 linear kernels are open source.
Meanwhile, I'd still keep FP8 as the standard (though the term "standard" isn't really important, since FP8 is also implemented as just another quantization method and the choice is up to users). The reason is that FP8 is officially supported by GPU vendors (both NVIDIA and AMD) at the instruction level, meaning that 1) the vendors will maintain compatibility and performance in future GPU releases, and 2) more workloads (e.g., FP8 flash attention and the KV cache) can be covered.
Is this feature now usable with non-Arctic models?
Installed from source and it works as expected, amazing!
- Download a model, e.g. https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct
- Put https://huggingface.co/Snowflake/snowflake-arctic-instruct/blob/main/quant_config.json into the model dir
- Run: python -m vllm.entrypoints.openai.api_server --model ./NousResearch/Meta-Llama-3-8B-Instruct --quantization deepspeedfp
The loaded model weight size is reduced as well:
no quantization: model_runner.py:167] Loading model weights took 14.9595 GB
8bits: model_runner.py:167] Loading model weights took 8.6860 GB
6bits: model_runner.py:167] Loading model weights took 7.0610 GB
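For reference, here is a minimal sketch of the same flow through the offline LLM API (assuming the quant_config.json has already been copied into the local model directory as in the steps above):

```python
from vllm import LLM, SamplingParams

# Same setup as the api_server command above: the local model dir is a
# plain FP16 checkpoint plus the copied quant_config.json, and the
# weights are quantized at load time via the deepspeedfp method.
llm = LLM(
    model="./NousResearch/Meta-Llama-3-8B-Instruct",
    quantization="deepspeedfp",
)

outputs = llm.generate(
    ["Explain FP6 weight-only quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```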
Thanks for the clarification. Then we can close this issue I suppose?
@nivibilla thanks for sharing the separated kernel implementation - this makes it a lot more straightforward to understand and I would be interested in implementing this within vLLM.
We will face the same problem that we have with dynamic FP8 quantization: we have to fully load the model weights in the original precision (i.e., FP16) and only then quantize them down. This means peak memory consumption will still be equivalent to the original FP16 weights. We will address this soon with a weight-loader refactor.
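As a rough illustration of why the peak stays at the FP16 size (a generic sketch only, with simple symmetric integer rounding standing in for the real FP6/FP8 packing, not vLLM's actual loader code):

```python
import torch

def quantize_after_load(fp16_weight: torch.Tensor, bits: int = 6):
    """Schematic dynamic quantization: the full FP16 tensor has to be
    materialized first, so peak memory still matches the FP16 model
    even though the packed result is much smaller."""
    qmax = 2 ** (bits - 1) - 1                # 31 for 6-bit
    scale = fp16_weight.abs().amax() / qmax
    q = torch.clamp(torch.round(fp16_weight / scale), -qmax, qmax).to(torch.int8)
    return q, scale                           # only now can the FP16 copy be freed
```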
I just want to share that they did compare against fine-grained and coarse-grained W4A16 INT4 kernels, for which we already have very good implementations, and saw performance slightly below them, as expected. So while I don't think this offers a particularly new capability in vLLM, it would be very nice to get relatively accurate 6-bit model compression with just a runtime flag.
Thanks @mgoin, yes, the speed isn't as good as INT4. However, the model quality is nearly indistinguishable from FP16, which is really nice. I hope FP6 becomes the new standard in place of FP8; there's no need to run the weights at any higher precision. I think it's a nice tradeoff versus INT4: better model quality at slightly lower speed.
And yes, the weight loading may be an issue. I'd like to load two replicas of a model on the same GPU, so if one of them takes up the entire GPU memory while loading, it will fail. Hope this can be fixed too.
cc @comaniac
It seems support for this will land in #4652 as quantization="deepspeedfp"
Not really. The PR you pointed out only uses FP6/8 checkpoints. The compute is still in FP16.
@comaniac FP6_LLM is weight-only quantization, i.e., W6A16; you can see this in the graph I shared in my comment above. There are no compute savings with this method compared to FP16. Also, the PR I pointed to allows quantizing at runtime, like our FP8 quantization, not just loading pre-quantized checkpoints.
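To make the W6A16 point concrete, here is a schematic of what a weight-only linear boils down to (a sketch only, with plain int8 storage standing in for the packed 6-bit format; the real kernel fuses dequantization into the GEMM):

```python
import torch

def weight_only_linear(x_fp16: torch.Tensor, q_weight: torch.Tensor,
                       scale: torch.Tensor) -> torch.Tensor:
    # q_weight stands in for the packed low-bit weights. They are expanded
    # back to FP16 before the matmul, so activations, accumulation, and
    # output all stay in FP16: memory shrinks, but the FLOPs do not.
    w_fp16 = q_weight.to(torch.float16) * scale
    return x_fp16 @ w_fp16.t()
```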
@mgoin I'm a bit confused: why does FP6 not save VRAM, even if the activations are in FP16? Surely the weights being in FP6 save memory, right?
Hi @nivibilla, you just misunderstood what I said: there are no compute savings, meaning the computation still happens entirely at FP16 precision. That does not imply there are no memory savings, which very much do happen.
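As a rough back-of-envelope against the numbers posted earlier in the thread (assuming ~8.0B total parameters for Llama-3-8B, of which roughly 1.1B are embedding/LM-head weights that I assume stay in FP16, and ignoring scale metadata):

```python
GiB = 2 ** 30
fp16_total = 8.0e9 * 2 / GiB                       # ~14.9 GiB, matches the "no quantization" log
fp6_estimate = (6.9e9 * 6 / 8 + 1.1e9 * 2) / GiB   # ~6.9 GiB vs. the 7.06 GiB measured
print(f"{fp16_total:.1f} GiB -> {fp6_estimate:.1f} GiB")
```

That lines up reasonably well with the 14.96 GiB -> 7.06 GiB drop shown in the logs above.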
@mgoin ohhh I see. Lol mb