Comments (11)
Thanks for reporting! I can reproduce the issue and I believe I have a fix. It'll take a bit to finish up the other stuff I'm working on, get this fix in, and make a release. But if you want to use it now, you can build aphrodite from source (clone the repo and run "pip install -e ."), then modify vocab_parallel_embedding.py at line 92:
index becd6f9..20db81e 100644
--- a/aphrodite/modeling/layers/vocab_parallel_embedding.py
+++ b/aphrodite/modeling/layers/vocab_parallel_embedding.py
@@ -91,16 +91,24 @@ class VocabParallelEmbedding(torch.nn.Module):
     def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         output_dim = getattr(param, "output_dim", None)
+        packed_dim = getattr(param, "packed_dim", None)
         if output_dim is not None:
-            assert loaded_weight.shape[output_dim] == self.org_vocab_size
-            loaded_weight = loaded_weight.narrow(
-                output_dim, self.vocab_start_index,
-                min(self.vocab_end_index - self.vocab_start_index,
-                    self.org_vocab_size - self.vocab_start_index))
+            shard_offset = self.vocab_start_index
+            shard_size = min(self.vocab_end_index,
+                             self.org_vocab_size) - shard_offset
+            if packed_dim == output_dim:
+                shard_size = shard_size // param.pack_factor
+                shard_offset = shard_offset // param.pack_factor
+            loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
+                                                 shard_size)
         if isinstance(param, torch.nn.parameter.UninitializedParameter):
             vocab_shape = list(loaded_weight.shape)
             if output_dim is not None:
-                vocab_shape[output_dim] = self.num_embeddings_per_partition
+                if packed_dim == output_dim:
+                    vocab_shape[
+                        output_dim] = self.num_embeddings_per_partition // param.pack_factor
+                else:
+                    vocab_shape[output_dim] = self.num_embeddings_per_partition
             param.materialize(vocab_shape, dtype=loaded_weight.dtype)
         if output_dim is not None:
             param.data.narrow(
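The offset arithmetic in the diff above can be tried out in isolation. This is only a rough sketch in plain Python, not the actual aphrodite code: the function name shard_bounds, the example vocabulary sizes, and the pack factor of 8 are all assumptions made up for illustration. It shows how the shard offset and size shrink when the vocab dimension is also the packed dimension.

```python
def shard_bounds(vocab_start, vocab_end, org_vocab_size,
                 pack_factor=1, packed=False):
    """Return (offset, size) along the vocab dimension for one rank.

    Hypothetical helper mirroring the arithmetic in the diff; the
    parameter names stand in for the attributes used there.
    """
    offset = vocab_start
    # Clamp to the unpadded vocabulary so padding rows are never read.
    size = min(vocab_end, org_vocab_size) - offset
    if packed:
        # When the vocab dimension is packed, both the start and the
        # length are divided by the pack factor, as in the diff.
        offset //= pack_factor
        size //= pack_factor
    return offset, size


# Assumed example: rank 1 of 2, vocabulary of 32000 padded to 32064,
# so this rank owns rows 16032..32064 of the padded vocabulary.
unpacked = shard_bounds(16032, 32064, 32000)
packed = shard_bounds(16032, 32064, 32000, pack_factor=8, packed=True)
print(unpacked)
print(packed)
```

With a packed tensor the size along the vocab dimension is smaller than org_vocab_size, which is presumably why the old assertion had to go.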
from aphrodite-engine.
Great, happy to help! I suppose we will be seeing some improvements there soon then? :)
Your fix worked, btw - it is running now. However, I believe I will keep using AWQ for now due to the higher tokens/s. These would be the benchmarks on it btw (relatively informal):
Aphrodite bench, 4bpw OpenHermes-2.5 on an RTX 3090, Threadripper 3960X, 64 GB DDR4, Ubuntu:
48 parallel requests:
- GPTQ: 850-900 tok/s
- AWQ: 1250-1350 tok/s
- exl2: 550-700 tok/s
single request:
- AWQ: 100-115 tok/s
- GPTQ: 135-147 tok/s
- exl2: 80-115 tok/s
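To make the comparison easier to read, here is a small sketch that turns the ranges above into midpoints and ratios relative to GPTQ. The helper names are made up for illustration, and taking the midpoint of each range is just one assumed way to summarize informal numbers like these.

```python
# Throughput ranges in tok/s, copied from the list above.
parallel_48 = {"GPTQ": (850, 900), "AWQ": (1250, 1350), "exl2": (550, 700)}
single = {"AWQ": (100, 115), "GPTQ": (135, 147), "exl2": (80, 115)}


def midpoint(value_range):
    """Midpoint of a (low, high) range."""
    low, high = value_range
    return (low + high) / 2


def relative_to(results, baseline):
    """Ratio of each midpoint to the baseline's midpoint, rounded to 2 places."""
    base = midpoint(results[baseline])
    return {name: round(midpoint(r) / base, 2) for name, r in results.items()}


print(relative_to(parallel_48, "GPTQ"))
print(relative_to(single, "GPTQ"))
```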
from aphrodite-engine.
Thanks for your quick response! I will try it out ASAP and get back to you :)
from aphrodite-engine.
No problem. Seems to be a problem with this quant specifically, or rather this type. Works with tinyllama exl2 for example.
Thanks to this issue, I may have found a solution to the exllamav2 tensor parallel roadblock I hit in #375
from aphrodite-engine.
GPTQ is generally faster than exl2 because it's a simpler quant format. You're also using a 5bit quant for exl2, while the GPTQ/AWQ ones are 4bit.
EDIT: ah wait didn't notice you said 4bpw
from aphrodite-engine.
Yeah, at first I had 5bpw, but I then changed it; it wouldn't really be fair otherwise ;)
Are you sure exl2 should be slower? In my experience it's about the same, and there are also some benchmarks like this one: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
Another question: do you have an idea why GPTQ seems to be faster for single requests but slower for multiple requests? Seems pretty unintuitive.
from aphrodite-engine.
You may want to read the "ExLlama v1 vs ExLlama v2 GPTQ speed (update)" section of ooba's blog post:
So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all
from aphrodite-engine.
Getting the same error with exl2 and command-r model (turboderp/command-r-v01-35B-exl2):
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/vllm/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 51, in __init__
    self._init_worker()
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 86, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/worker.py", line 108, in load_model
    self.model_runner.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 134, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/loader.py", line 98, in get_model
    model.load_weights(
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 340, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'lm_head.q_groups'
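The traceback shows load_weights looking up a checkpoint tensor name in params_dict and failing because the model has no parameter called lm_head.q_groups. As a rough diagnostic sketch (not a fix, and not aphrodite's own code), the mismatch can be listed up front. The function name and every tensor name below other than lm_head.q_groups are hypothetical.

```python
def unexpected_weight_names(checkpoint_names, params_dict):
    """Return checkpoint tensor names that have no matching model parameter."""
    return sorted(name for name in checkpoint_names if name not in params_dict)


# Hypothetical stand-ins: a real params_dict would map names to parameters,
# and the checkpoint names would come from the quantized model files.
params_dict = {
    "model.embed_tokens.weight": object(),
    "model.layers.0.mlp.up_proj.q_weight": object(),
}
checkpoint_names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.q_weight",
    "lm_head.q_groups",  # the name from the KeyError above
]

print(unexpected_weight_names(checkpoint_names, params_dict))
```

Any name this reports would raise the same KeyError at the params_dict[name] lookup.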
from aphrodite-engine.
Have you tried the solution above? That worked for me.
from aphrodite-engine.
> Have you tried the solution above? That worked for me.
Didn't work, as I'm not using Llama.
I'm getting the same error with and without --quantization exl2.
from aphrodite-engine.
This should be fixed as of commit 638547e.
from aphrodite-engine.