
Comments (7)

philipturner commented on August 15, 2024

That scheme requires detecting which layers to train during the training process. We could try fine-tuning and testing certain layer combinations for LLaMa-65B. I have a 32 GB Apple GPU and a good algorithm for evaluating the model without CUDA. My fast SSD could also cover the slight (>0.5 GB) memory overrun. Alternatively, I could start with 3 bits (which provides acceptable performance at 65B parameters) and consume 24.375 GB of memory.
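
A quick check of that 24.375 GB figure (back-of-the-envelope, ignoring quantization metadata such as scales and zero points, which add a few percent on top):

```python
# Back-of-the-envelope memory footprint for 3-bit weights on 65B parameters.
params = 65e9
bits_per_weight = 3
print(params * bits_per_weight / 8 / 1e9)   # 24.375 GB
```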

We could also vary the quantization level within each layer. That may require a prefix sum within the shader, but it's theoretically possible to cache intermediate sums while achieving a net positive compression ratio. Finally, we could try lossless compression, decoding with a complex algorithm in the GPU shader.
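
For illustration, a minimal NumPy sketch (not actual shader code) of how a prefix sum over per-block bit widths yields the offsets needed to decode a packed mixed-precision buffer; these intermediate sums are what a shader could cache:

```python
import numpy as np

# Hypothetical mixed-precision packing: each weight block gets its own bit width.
# An exclusive prefix sum over the per-block sizes gives the bit offset at which
# each block starts inside one packed buffer, so block i can be decoded without
# walking all previous blocks.
block_sizes = np.array([128, 128, 128, 128])   # weights per block
bit_widths  = np.array([4, 3, 2, 4])           # bits per weight, per block

bits_per_block = block_sizes * bit_widths
bit_offsets = np.concatenate(([0], np.cumsum(bits_per_block)[:-1]))

print(bit_offsets)                     # [   0  512  896 1152]
print(bits_per_block.sum() / 8)        # 208.0 bytes, vs 256 bytes at uniform 4-bit
```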

turboderp commented on August 15, 2024

Wouldn't it be hard to do anything useful with the remaining VRAM, though? Fitting the model weights on the GPU is one thing, but to run inference you need quite a bit of VRAM on top for the key/value cache. You could theoretically dump the cache after each layer is processed, but then you're redoing the whole computation from scratch with every new token, and at that point you might as well run the whole inference pass on the CPU.

E.g. the 30B 4-bit model does fit on a 24GB GPU, using some 16 GB or so of VRAM. With that you have room for a sequence length of maybe 600 tokens before memory starts running out. Quantizing half the parameters of the 60B model further down to two bits would leave you with basically no room for a prompt, let alone a response.

philipturner commented on August 15, 2024

If your GPU has a direct path to the disk and a fast SSD, the SSD bandwidth could cover the memory overflow. I calculated that on my M1 Max MBP, it could maintain the theoretical minimum latency with a few hundred MB of overfill.

If you're training, you can afford slightly more overfill by using larger batch sizes. That approach would get more information learned in the same time, given fixed memory pressure.

but to run inference you need quite a bit of VRAM on top for the key/value cache

Can you quantify how large the cache is (in bytes)? I just need concrete numbers, accurate to within a factor of ~0.5-2.0.

turboderp commented on August 15, 2024

In elements:

2 * num_layers * batch_size * num_attn_heads * key_value_dim * seq_len
= 2 * num_layers * batch_size * hidden_dim * seq_len

So for half precision, multiply the whole thing by two again. The 60B model has 80 layers and a hidden dimension of 8192, for reference, so it should work out to 2560 kB per token if my math is right. (EDIT: Looked at the 30B numbers before. This should be correct now.)
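
Plugging those numbers into the formula (a quick check, half precision, batch size 1):

```python
# KV-cache size per token for an 80-layer, 8192-hidden-dim model
# (the "60B"-class figures quoted above), fp16, batch size 1.
num_layers = 80
hidden_dim = 8192
batch_size = 1
bytes_per_element = 2   # half precision

elements_per_token = 2 * num_layers * batch_size * hidden_dim   # keys + values
bytes_per_token = elements_per_token * bytes_per_element

print(bytes_per_token / 1024)           # 2560.0 -> 2560 kB per token
print(2048 * bytes_per_token / 2**30)   # 5.0 -> ~5 GiB for a full 2048-token context
```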

That's the theoretical minimum amount of data (not counting any overhead from stride etc.) that you would have to pass between iterations if you want anything resembling speed.

As for the computation itself, I guess it depends. I'm looking into it in depth right now because I want to try to achieve a higher sequence length with the 30B model on a 24 GB GPU, and I'm not convinced the Transformers implementation is all that efficient with VRAM usage.

Regardless, there's for sure some intermediate processing you also need to take into account where the query and key matrices are multiplied to produce the attention score matrix, which scales quadratically with the sequence length. It's quite a few big matrices that have to exist in VRAM at the same time.
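
To put rough numbers on that quadratic scaling (assuming 52 heads for the 30B-class model, fp16 scores, batch size 1; the true peak depends on how the implementation allocates and frees these buffers):

```python
# Approximate size of one layer's attention score matrix (batch, heads, seq, seq)
# in fp16. This is an estimate, not a measurement; fused kernels can avoid
# materializing the full matrix.
num_heads = 52          # assumed head count for the 30B-class model
bytes_per_element = 2   # fp16
batch_size = 1

def attn_scores_bytes(seq_len):
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for seq_len in (512, 1024, 2048):
    print(f"{seq_len:5d} tokens: {attn_scores_bytes(seq_len) / 2**20:7.1f} MiB")
# 512 -> 26 MiB, 1024 -> 104 MiB, 2048 -> 416 MiB (quadratic in seq_len)
```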

In any case streaming to an SSD sounds both much too slow and like a good way to burn out an SSD. (?)

philipturner commented on August 15, 2024

In any case streaming to an SSD sounds both much too slow and like a good way to burn out an SSD. (?)

The SSD burns out if you perform enough write operations - roughly 1,000-3,000 program/erase cycles per cell over its lifetime. That says nothing about durability against read operations.

The workflow would use something like Metal fast resource loading, DirectStorage, or GPUDirect. You carefully set up a streaming workflow where two of the 60 layers are not held in memory. For each token in the sequence, you load the 29th layer (L29) while L0-L28 are executing. Then you execute L29, discard its weights, and start loading L59, which should also arrive just in time, after L30-L58 finish executing.

This is an incremental gain, but it could be the difference between "fits" and "doesn't fit" for a particular model. You can also page (L28, L29) <-> (L58, L59) or (L27, L28, L29) <-> (L57, L58, L59) to save more memory. Dividing SSD bandwidth by RAM bandwidth gives the ideal proportion of paged layers; the examples in this paragraph are (2 + 2) / 60 and (3 + 3) / 60, respectively.
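
A minimal sketch of that schedule, with hypothetical load_layer_weights() and run_layer() stubs standing in for the real fast-resource-loading call and the GPU kernels:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 60
STREAMED = (29, 59)      # the two layers paged from SSD each token; the rest stay resident

def load_layer_weights(i):
    """Hypothetical: read layer i's quantized weights from the SSD
    (the real path would be Metal fast resource loading / DirectStorage / GPUDirect)."""
    return f"weights-{i}"            # placeholder

def run_layer(i, x, weights):
    """Hypothetical: apply transformer layer i to the activations x."""
    return x                         # placeholder

def forward(x, resident):
    """One token's forward pass, overlapping SSD loads with resident-layer compute."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer_weights, STREAMED[0])   # starts before L0 runs
        next_streamed = 1
        for i in range(NUM_LAYERS):
            if i in STREAMED:
                w = pending.result()                 # should already be on hand if the SSD kept up
                if next_streamed < len(STREAMED):    # immediately queue the next streamed layer
                    pending = io.submit(load_layer_weights, STREAMED[next_streamed])
                    next_streamed += 1
                x = run_layer(i, x, w)
                del w                                # discard streamed weights right away
            else:
                x = run_layer(i, x, resident[i])
        return x

resident = {i: f"weights-{i}" for i in range(NUM_LAYERS) if i not in STREAMED}
out = forward("activations", resident)
```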

turboderp commented on August 15, 2024

Silly me, I was thinking about swapping state in and out of VRAM. Of course you meant streaming just weights, which would be read only. I can't see why that wouldn't work.

As for the practical memory requirements, I did a few tests on 30B, measuring max memory allocation for a single inference step on different context lengths:

(before inf.):  17562.44 MB
256 tokens:     17677.23 MB
512 tokens:     18584.57 MB
768 tokens:     18698.00 MB
1024 tokens:    18810.15 MB
1280 tokens:    18922.09 MB
1536 tokens:    19034.63 MB
1792 tokens:    19149.66 MB
2048 tokens:    19260.77 MB

There's an odd bump at around 500 tokens, which I can only think has to do with PyTorch switching to a different memory allocation strategy for tensors over a certain size. I need to investigate that. In the meantime, I did it again in finer steps just to confirm, and it came out looking like this:

https://i.imgur.com/6nWWIYA.png

It's very odd. But in any case, it seems that with the model as-is and a simple, cached forward pass, you would need an extra ~1.7 GB of available VRAM on top of the weights to make use of the max sequence length of LLaMA-30B. I can't run the 60B model, but I would expect it to take up 64% more space (going by the layer count and hidden dim).
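
In numbers, using the measurements above (a rough extrapolation that only scales by layers × hidden dim):

```python
# Measured overhead on 30B at 2048 tokens, scaled by (layers * hidden_dim) to
# estimate the larger model, matching the 64% figure above. The attention-score
# buffers actually scale with head count and seq_len^2, so this is approximate.
overhead_30b_mb = 19260.77 - 17562.44            # ~1698 MB
scale = (80 * 8192) / (60 * 6656)                # ~1.64, i.e. +64%
print(overhead_30b_mb, overhead_30b_mb * scale)  # ~1698 MB -> ~2787 MB
```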

Beam search would of course increase it a lot.

qwopqwop200 commented on August 15, 2024

There are currently no plans to support any quantization other than GPTQ. Also, in my experience so far, 4-bit quantization has been the most efficient.
