
Comments (11)

AlpinDale commented on May 28, 2024

Thanks for reporting! I can reproduce the issue and I believe I have a fix. It'll take a bit to finish up the other things I'm working on, get this fix in, and make a release. But if you want to use it now, you can build aphrodite from source (clone the repo and run pip install -e .), then modify vocab_parallel_embedding.py at line 92:

diff --git a/aphrodite/modeling/layers/vocab_parallel_embedding.py b/aphrodite/modeling/layers/vocab_parallel_embedding.py
index becd6f9..20db81e 100644
--- a/aphrodite/modeling/layers/vocab_parallel_embedding.py
+++ b/aphrodite/modeling/layers/vocab_parallel_embedding.py
@@ -91,16 +91,24 @@ class VocabParallelEmbedding(torch.nn.Module):
 
     def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         output_dim = getattr(param, "output_dim", None)
+        packed_dim = getattr(param, "packed_dim", None)
         if output_dim is not None:
-            assert loaded_weight.shape[output_dim] == self.org_vocab_size
-            loaded_weight = loaded_weight.narrow(
-                output_dim, self.vocab_start_index,
-                min(self.vocab_end_index - self.vocab_start_index,
-                    self.org_vocab_size - self.vocab_start_index))
+            shard_offset = self.vocab_start_index
+            shard_size = min(self.vocab_end_index,
+                             self.org_vocab_size) - shard_offset
+            if packed_dim == output_dim:
+                shard_size = shard_size // param.pack_factor
+                shard_offset = shard_offset // param.pack_factor
+            loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
+                                                 shard_size)
         if isinstance(param, torch.nn.parameter.UninitializedParameter):
             vocab_shape = list(loaded_weight.shape)
             if output_dim is not None:
-                vocab_shape[output_dim] = self.num_embeddings_per_partition
+                if packed_dim == output_dim:
+                    vocab_shape[
+                        output_dim] = self.num_embeddings_per_partition // param.pack_factor
+                else:
+                    vocab_shape[output_dim] = self.num_embeddings_per_partition
             param.materialize(vocab_shape, dtype=loaded_weight.dtype)
         if output_dim is not None:
             param.data.narrow(
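
To make the shard arithmetic in the patch concrete: a packed quantized format stores several logical vocab rows per physical row, so both the shard offset and the shard size must shrink by the pack factor before narrow() is applied. A minimal standalone sketch, with made-up numbers (the vocab split and pack factor below are illustrative only, not taken from the reported model):

import torch

# Toy numbers: a 32000-token vocab split across two tensor-parallel ranks,
# stored in a packed format with 8 values per int32 (pack_factor = 8).
org_vocab_size = 32000
vocab_start_index, vocab_end_index = 16000, 32000  # this rank's slice
pack_factor = 8
output_dim = 0

# Unpacked shard: logical rows [16000, 32000) of the embedding table.
shard_offset = vocab_start_index
shard_size = min(vocab_end_index, org_vocab_size) - shard_offset

# Packed shard: pack_factor logical rows live in each physical row, so
# both the offset and the size shrink by the same factor.
packed_offset = shard_offset // pack_factor  # 2000
packed_size = shard_size // pack_factor      # 2000

packed_checkpoint = torch.zeros(org_vocab_size // pack_factor, 512,
                                dtype=torch.int32)
shard = packed_checkpoint.narrow(output_dim, packed_offset, packed_size)
print(shard.shape)  # torch.Size([2000, 512])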


SalomonKisters commented on May 28, 2024

Great, happy to help! I suppose we'll be seeing some improvements there soon then? :)
Your fix worked, btw; it is running now. However, I believe I will keep using AWQ for now due to the higher tokens/s. Here are the benchmarks (relatively informal; a rough reproduction harness is sketched after the list):

Aphrodite bench, 4bpw OpenHermes-2.5 on an RTX 3090, Threadripper 3960X, 64 GB DDR4, Ubuntu:

48 parallel requests:

  • GPTQ: 850-900 tok/s
  • AWQ: 1250-1350 tok/s
  • exl2: 550-700 tok/s

single request:

  • AWQ: 100-115 tok/s
  • GPTQ: 135-147 tok/s
  • exl2: 80-115 tok/s
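
Informal throughput numbers like these can be reproduced with a small concurrency harness against the OpenAI-compatible completions endpoint the engine serves. A rough sketch, assuming such an endpoint; the URL, port, model name, and prompt below are placeholders, not the setup actually benchmarked:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: adjust the URL/port, model name, and prompt to your deployment.
URL = "http://localhost:2242/v1/completions"
MODEL = "openhermes-2.5"
PARALLEL = 48          # matches the 48-parallel-request runs above
MAX_TOKENS = 256

def one_request(_):
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": "Write a short story about a robot.",
        "max_tokens": MAX_TOKENS,
    })
    # OpenAI-style responses report generated token counts under "usage".
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    total_tokens = sum(pool.map(one_request, range(PARALLEL)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s "
      f"-> {total_tokens / elapsed:.0f} tok/s")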


SalomonKisters commented on May 28, 2024

Thanks for your quick response! I will try it out ASAP and get back to you :)


AlpinDale commented on May 28, 2024

No problem. It seems to be a problem with this quant specifically, or rather with this type of quant; it works with a TinyLlama exl2 quant, for example.

Thanks to this issue, I may have found a solution to the exllamav2 tensor parallel roadblock I hit in #375


AlpinDale commented on May 28, 2024

GPTQ is generally faster than exl2 because it's a simpler quant format. You're also using a 5-bit quant for exl2, while the GPTQ/AWQ ones are 4-bit.

EDIT: ah wait, didn't notice you said 4bpw


SalomonKisters commented on May 28, 2024

Yeah, at first I had 5bpw, but I then changed it; wouldn't really be fair otherwise ;)
Are you sure exl2 should be slower? In my experience it's about the same, and there are also some benchmarks like this one: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

Another question: do you have an idea why GPTQ seems to be faster for single but slower for multiple requests? Seems pretty unintuitive.


sgsdxzy commented on May 28, 2024

You may read the "ExLlama v1 vs ExLlama v2 GPTQ speed (update)" section of ooba's blog:

> So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all


ccdv-ai commented on May 28, 2024

@SalomonKisters @AlpinDale

Getting the same error with exl2 and command-r model (turboderp/command-r-v01-35B-exl2):

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/vllm/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 51, in __init__
    self._init_worker()
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 86, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/worker.py", line 108, in load_model
    self.model_runner.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 134, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/loader.py", line 98, in get_model
    model.load_weights(
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 340, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'lm_head.q_groups'
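
For context on the traceback: load_weights in cohere.py indexes params_dict by checkpoint tensor name, and exl2 checkpoints carry quantization metadata tensors (lm_head.q_groups here) that have no matching module parameter, hence the KeyError. A tolerant loader loop in the spirit of a fix might look like the sketch below; this is an illustration only, not the actual change that landed in 638547e:

from typing import Dict, Iterable, Tuple

import torch

def load_weights_tolerant(params_dict: Dict[str, torch.nn.Parameter],
                          weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
    """Copy checkpoint tensors into module parameters, skipping names the
    model has no parameter for (e.g. exl2 metadata like lm_head.q_groups)."""
    for name, loaded_weight in weights:
        if name not in params_dict:
            # Skipping here avoids the KeyError shown in the traceback.
            continue
        param = params_dict[name]
        # aphrodite-style models attach a custom weight_loader to some
        # parameters; fall back to a plain copy when none is present.
        loader = getattr(param, "weight_loader",
                         lambda p, w: p.data.copy_(w))
        loader(param, loaded_weight)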


SalomonKisters commented on May 28, 2024

Have you tried the solution above? That worked for me.


ccdv-ai commented on May 28, 2024

> Have you tried the solution above? That worked for me.

It didn't work, as I'm not using Llama.
I'm getting the same error with and without --quantization exl2.


sgsdxzy commented on May 28, 2024

This should be fixed since commit 638547e.

