[Bug]: Exllama v2 not working about aphrodite-engine HOT 11 OPEN

SalomonKisters commented on May 28, 2024

from aphrodite-engine.

Comments (11)

AlpinDale commented on May 28, 2024

Thanks for reporting! I can reproduce the issue and I believe I have a fix. It'll take a bit to finish up the other stuff I'm working on, get this fix in, and make a release. But if you want to use it now, you can build aphrodite from source (clone the repo and run pip install -e .) then modify vocab_parallel_embedding.py at line 92:

index becd6f9..20db81e 100644
--- a/aphrodite/modeling/layers/vocab_parallel_embedding.py
+++ b/aphrodite/modeling/layers/vocab_parallel_embedding.py
@@ -91,16 +91,24 @@ class VocabParallelEmbedding(torch.nn.Module):
 
     def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         output_dim = getattr(param, "output_dim", None)
+        packed_dim = getattr(param, "packed_dim", None)
         if output_dim is not None:
-            assert loaded_weight.shape[output_dim] == self.org_vocab_size
-            loaded_weight = loaded_weight.narrow(
-                output_dim, self.vocab_start_index,
-                min(self.vocab_end_index - self.vocab_start_index,
-                    self.org_vocab_size - self.vocab_start_index))
+            shard_offset = self.vocab_start_index
+            shard_size = min(self.vocab_end_index,
+                             self.org_vocab_size) - shard_offset
+            if packed_dim == output_dim:
+                shard_size = shard_size // param.pack_factor
+                shard_offset = shard_offset // param.pack_factor
+            loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
+                                                 shard_size)
         if isinstance(param, torch.nn.parameter.UninitializedParameter):
             vocab_shape = list(loaded_weight.shape)
             if output_dim is not None:
-                vocab_shape[output_dim] = self.num_embeddings_per_partition
+                if packed_dim == output_dim:
+                    vocab_shape[
+                        output_dim] = self.num_embeddings_per_partition // param.pack_factor
+                else:
+                    vocab_shape[output_dim] = self.num_embeddings_per_partition
             param.materialize(vocab_shape, dtype=loaded_weight.dtype)
         if output_dim is not None:
             param.data.narrow(
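
For anyone reading the diff without the repo at hand, here is a minimal standalone sketch of the offset arithmetic it introduces, with no torch dependency. The function name `shard_range` and the example numbers (vocab size 32000, two ranks, pack factor 8) are made up for illustration; only the arithmetic mirrors the patch.

```python
def shard_range(vocab_start, vocab_end, org_vocab_size, pack_factor=1, packed=False):
    """Return (offset, size) along the vocab dimension for one tensor-parallel rank.

    Hypothetical helper: mirrors the shard_offset / shard_size logic in the
    patch above, but takes plain integers instead of a Parameter.
    """
    shard_offset = vocab_start
    # Clamp to the original vocab size so padded rows are never read.
    shard_size = min(vocab_end, org_vocab_size) - shard_offset
    if packed:
        # A packed tensor stores pack_factor logical rows per physical row,
        # so both the offset and the size shrink by that factor.
        shard_size //= pack_factor
        shard_offset //= pack_factor
    return shard_offset, shard_size


# Rank 1 of 2 with a 32000-token vocab (assumed numbers).
unpacked = shard_range(16000, 32000, 32000)
packed = shard_range(16000, 32000, 32000, pack_factor=8, packed=True)
print(unpacked, packed)
```

With an unpacked weight the rank reads rows 16000 onward; with the assumed pack factor of 8 the same logical range maps to a physical offset and size of 2000 each, which is the case the removed assert on `loaded_weight.shape[output_dim]` could not handle.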

SalomonKisters commented on May 28, 2024

Great, happy to help! I suppose we will be seeing some improvements there soon then? :)
Your fix worked, btw - it is running now. However, I believe I will keep using AWQ for now due to the higher tokens/s. These would be the benchmarks on it btw (relatively informal):

Aphrodite benchmark, 4bpw OpenHermes-2.5 on an RTX 3090, Threadripper 3960X, 64 GB DDR4, Ubuntu:

48 parallel requests:

GPTQ: 850-900 tok/s,
AWQ: 1250-1350 tok/s,
exl2: 550-700 tok/s

Single request:

AWQ: 100-115 tok/s,
GPTQ: 135-147 tok/s,
exl2: 80-115 tok/s
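
To put the informal numbers above on one scale, here is a rough sketch that takes the midpoint of each reported range and normalizes to AWQ. Using midpoints is a simplification of mine, not part of the original benchmark, and the helper names are hypothetical.

```python
# Reported tok/s ranges from the comment above (low, high).
parallel_48 = {"GPTQ": (850, 900), "AWQ": (1250, 1350), "exl2": (550, 700)}
single = {"AWQ": (100, 115), "GPTQ": (135, 147), "exl2": (80, 115)}


def midpoint(tok_range):
    """Midpoint of a (low, high) throughput range."""
    low, high = tok_range
    return (low + high) / 2


def relative_to(results, baseline):
    """Midpoint throughput of each format divided by the baseline's midpoint."""
    base = midpoint(results[baseline])
    return {name: round(midpoint(r) / base, 2) for name, r in results.items()}


# How much aggregate throughput grows going from 1 to 48 parallel requests.
scaling = {
    name: round(midpoint(parallel_48[name]) / midpoint(single[name]), 1)
    for name in single
}

print(relative_to(parallel_48, "AWQ"))
print(relative_to(single, "AWQ"))
print(scaling)
```

On these midpoints GPTQ is ahead of AWQ for a single request but falls behind at 48 parallel requests, because AWQ's aggregate throughput scales roughly twice as much under batching in this run.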

SalomonKisters commented on May 28, 2024

Thanks for your quick response! I will try it out ASAP and get back to you :)

AlpinDale commented on May 28, 2024

No problem. Seems to be a problem with this quant specifically, or rather this type. Works with tinyllama exl2 for example.

Thanks to this issue, I may have found a solution to the exllamav2 tensor parallel roadblock I hit in #375

AlpinDale commented on May 28, 2024

GPTQ is generally faster than exl2 because it's a simpler quant format. You're also using a 5-bit quant for exl2, while the GPTQ/AWQ ones are 4-bit.

EDIT: ah wait didn't notice you said 4bpw

SalomonKisters commented on May 28, 2024

Yeah, at first I had 5bpw, but I then changed it; it wouldn't really be fair otherwise ;)
Are you sure exl2 should be slower? In my experience it's about the same, and there are also some benchmarks like this one: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

Another question: do you have an idea why GPTQ seems to be faster for a single request but slower for multiple parallel requests? That seems pretty unintuitive.

sgsdxzy commented on May 28, 2024

You may want to read the "ExLlama v1 vs ExLlama v2 GPTQ speed (update)" section of ooba's blog post:

"So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all"

ccdv-ai commented on May 28, 2024

@SalomonKisters @AlpinDale

Getting the same error with exl2 and command-r model (turboderp/command-r-v01-35B-exl2):

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/vllm/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 51, in __init__
    self._init_worker()
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 86, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/worker.py", line 108, in load_model
    self.model_runner.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 134, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/loader.py", line 98, in get_model
    model.load_weights(
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 340, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'lm_head.q_groups'
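
The traceback ends in a plain dict lookup: `load_weights` appears to look up a checkpoint tensor name in the model's parameter dict, and `lm_head.q_groups` has no entry there. Below is a minimal sketch of that failure mode and of a way to list unmatched names up front. The helper `unmatched_weight_names` and every name other than `lm_head.q_groups` are hypothetical, and this is not aphrodite's actual loader.

```python
def unmatched_weight_names(checkpoint_names, params_dict):
    """Return checkpoint tensor names that have no matching model parameter."""
    return sorted(name for name in checkpoint_names if name not in params_dict)


# Stand-in for dict(model.named_parameters()); values are placeholders.
params_dict = {
    "model.embed_tokens.weight": object(),
    "model.norm.weight": object(),
}

# Stand-in for the tensor names found in a quantized checkpoint.
checkpoint_names = [
    "model.embed_tokens.weight",
    "model.norm.weight",
    "lm_head.q_groups",
]

missing = unmatched_weight_names(checkpoint_names, params_dict)
print(missing)

# The direct lookup is what raises in the traceback above.
try:
    params_dict["lm_head.q_groups"]
except KeyError as err:
    print("KeyError:", err)
```

Listing the unmatched names before loading makes it easier to tell whether the checkpoint carries quantized tensors for a layer the model class does not define, rather than failing on the first missing key.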

SalomonKisters commented on May 28, 2024

Have you tried the solution above? That worked for me.

ccdv-ai commented on May 28, 2024

"Have you tried the solution above? That worked for me."

It didn't work, as I'm not using Llama.
I'm getting the same error with and without --quantization exl2.

sgsdxzy commented on May 28, 2024

This should be fixed since 638547e
