Comments (13)

qwopqwop200 commented on August 15, 2024

Currently the problem seems to be caused by layernorm, but it looks very difficult to solve.

fgdfgfthgr-fox commented on August 15, 2024

How did you even quantize galactica? When I tried your command on the 6.7b model, it tells me

OSError: Token is required (`token=True`), but no token found.

IamDavyG commented on August 15, 2024

@oobabooga I also get gibberish with my 4-bit quantized Galactica 30B model with wikitext2:

The top 10 equations of all time are:

  1. $-$0.0122 $\times$ number of authors + 0.0000 $\times$ number of disambiguated names + 0.0000 \⃝raisebox{0.25pt}{ \textcircled{\raisebox{-0.9pt} {1}} }\

I think it's to do with the markdown, because plain text works OK.

I also get worse results when I quantize it with groupsize=128; it just outputs "Sy" for every prompt:

The top 10 equations of all time are:

  1. Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy Sy

My GPTQ version is the one that was rolled back, if it matters.

@fgdfgfthgr-fox Have you signed into your Hugging Face account and issued a token to your client?
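
If not, a minimal sketch of logging in from Python (the token string below is a placeholder; generate a real read token at https://huggingface.co/settings/tokens):

from huggingface_hub import login

login(token="hf_xxx")  # placeholder; paste your own read token here

Running huggingface-cli login in a terminal achieves the same thing.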

oobabooga commented on August 15, 2024

@IamDavyG did you see the wikitext2 score after your quantization was finished? I forgot to do that when I first tried. We should be aiming for something between 5 and 10. If the score is above that, then the quantization failed completely.
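
For context, the reported score is the exponential of the average per-token negative log-likelihood on wikitext2, so the check amounts to something like this (a minimal sketch with made-up numbers):

import torch

# made-up per-window mean negative log-likelihoods, in nats per token
nlls = torch.tensor([1.70, 1.75, 1.65])
ppl = torch.exp(nlls.mean())  # perplexity = exp(mean NLL)
print(ppl.item())  # ~5.5, inside the healthy 5-10 range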

IamDavyG commented on August 15, 2024

5.55 with no groupsize setting and 5.44 with groupsize 128, so it looks OK.

oobabooga commented on August 15, 2024

That's interesting. If you feel like it, uploading your int4 file to Hugging Face would be very useful.

qwopqwop200 commented on August 15, 2024

Currently OPT seems to work but galactica doesn't. Are OPT and galactica different architectures?

oobabooga commented on August 15, 2024

Both are OPTForCausalLM models, but GALACTICA is not a fine-tune of OPT. I am not sure if they are exactly the same.

qwopqwop200 commented on August 15, 2024

I tried many things to solve this problem, but galactica is the only model that still doesn't work.

oobabooga commented on August 15, 2024

Thank you for looking into this.

qwopqwop200 commented on August 15, 2024

@oobabooga
It's an old issue, but I've confirmed that galactica works using AutoGPTQ.
galactica-125m gets 17.16 ppl on wikitext2.

oobabooga commented on August 15, 2024

Thanks a lot for the heads up @qwopqwop200. Could you share the script that you used? I keep getting

TypeError: OPTForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'

both with the OPTGPTQForCausalLM.from_pretrained example in the README and with the first AutoGPTQForCausalLM.from_pretrained example.
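
One hypothetical workaround, since the error suggests the tokenizer is emitting token_type_ids that OPTForCausalLM.forward() does not accept, would be to drop that key before calling the model (tokenizer and model assumed already loaded):

enc = tokenizer("The Transformer architecture", return_tensors="pt")
enc.pop("token_type_ids", None)  # OPTForCausalLM.forward() rejects this key
out = model.generate(**enc, max_new_tokens=32)
print(tokenizer.decode(out[0]))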

qwopqwop200 commented on August 15, 2024

> Thanks a lot for the heads up @qwopqwop200. Could you share the script that you used? I keep getting
>
> TypeError: OPTForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
>
> both with the OPTGPTQForCausalLM.from_pretrained example in the README and with the first AutoGPTQForCausalLM.from_pretrained example.

Here is the script I used:

import os

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import numpy as np
import torch
import torch.nn as nn

pretrained_model_dir = "facebook/galactica-125m"
quantized_model_dir = "galactica-125m-4bit-128g"

# os.makedirs(quantized_model_dir, exist_ok=True)
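# Build the wikitext2 calibration data: nsamples random windows of seqlen tokens from the train split.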
def get_wikitext2(nsamples, seed, seqlen, model):
    from datasets import load_dataset
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
    testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')

    from transformers import AutoTokenizer
    try:
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    except Exception:
        # fall back to the fast tokenizer when no slow tokenizer is available
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
    trainenc = tokenizer("\n\n".join(traindata['text']), return_tensors='pt')
    testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')

    import random
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    
    trainloader = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        trainloader.append({'input_ids':inp})
    return trainloader, testenc

@torch.no_grad()
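# Evaluate wikitext2 perplexity layer by layer so that only one decoder layer occupies the GPU at a time.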
def opt_eval(model, testenc, dev, seqlen = 2048):
    print('Evaluating ...')

    testenc = testenc.input_ids
    nsamples = testenc.numel() // seqlen

    use_cache = model.config.use_cache
    model.config.use_cache = False
    layers = model.model.decoder.layers

    model.model.decoder.embed_tokens = model.model.decoder.embed_tokens.to(dev)
    model.model.decoder.embed_positions = model.model.decoder.embed_positions.to(dev)
    if hasattr(model.model.decoder, 'project_out') and model.model.decoder.project_out:
        model.model.decoder.project_out = model.model.decoder.project_out.to(dev)
    if hasattr(model.model.decoder, 'project_in') and model.model.decoder.project_in:
        model.model.decoder.project_in = model.model.decoder.project_in.to(dev)
    layers[0] = layers[0].to(dev)

    dtype = next(iter(model.parameters())).dtype
    inps = torch.zeros((nsamples, seqlen, model.config.hidden_size), dtype=dtype, device=dev)
    cache = {'i': 0, 'attention_mask': None}

    class Catcher(nn.Module):

        def __init__(self, module):
            super().__init__()
            self.module = module

        def forward(self, inp, **kwargs):
            inps[cache['i']] = inp
            cache['i'] += 1
            cache['attention_mask'] = kwargs['attention_mask']
            raise ValueError

    layers[0] = Catcher(layers[0])
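    # run each test window through the model just far enough for the Catcher to record the first layer's inputs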
    for i in range(nsamples):
        batch = testenc[:, (i * seqlen):((i + 1) * seqlen)].to(dev)
        try:
            model(batch)
        except ValueError:
            pass
    layers[0] = layers[0].module

    layers[0] = layers[0].cpu()
    model.model.decoder.embed_tokens = model.model.decoder.embed_tokens.cpu()
    model.model.decoder.embed_positions = model.model.decoder.embed_positions.cpu()
    if hasattr(model.model.decoder, 'project_out') and model.model.decoder.project_out:
        model.model.decoder.project_out = model.model.decoder.project_out.cpu()
    if hasattr(model.model.decoder, 'project_in') and model.model.decoder.project_in:
        model.model.decoder.project_in = model.model.decoder.project_in.cpu()
    torch.cuda.empty_cache()

    outs = torch.zeros_like(inps)
    attention_mask = cache['attention_mask']

    # propagate the captured hidden states through each decoder layer in turn
    for i in range(len(layers)):
        print(f'Layer {i}')
        layer = layers[i].to(dev)

        for j in range(nsamples):
            outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
        layers[i] = layer.cpu()
        del layer
        torch.cuda.empty_cache()
        inps, outs = outs, inps

    if model.model.decoder.final_layer_norm is not None:
        model.model.decoder.final_layer_norm = model.model.decoder.final_layer_norm.to(dev)
    if model.model.decoder.project_out is not None:
        model.model.decoder.project_out = model.model.decoder.project_out.to(dev)
    model.lm_head = model.lm_head.to(dev)

    testenc = testenc.to(dev)
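    # accumulate per-window negative log-likelihoods; perplexity is exp of their token-averaged sum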
    nlls = []
    for i in range(nsamples):
        hidden_states = inps[i].unsqueeze(0)
        if model.model.decoder.final_layer_norm is not None:
            hidden_states = model.model.decoder.final_layer_norm(hidden_states)
        if model.model.decoder.project_out is not None:
            hidden_states = model.model.decoder.project_out(hidden_states)
        lm_logits = model.lm_head(hidden_states)
        shift_logits = lm_logits[:, :-1, :].contiguous()
        shift_labels = testenc[:, (i * seqlen):((i + 1) * seqlen)][:, 1:]
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        neg_log_likelihood = loss.float() * seqlen
        nlls.append(neg_log_likelihood)
    ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen))
    print(f'wikitext2 ppl: {ppl.item():.2f}')

    model.config.use_cache = use_cache

def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    trainloader,testenc = get_wikitext2(128, 0, 2048, pretrained_model_dir)

    quantize_config = BaseQuantizeConfig(
        bits=4,  # quantize model to 4-bit
        group_size=128,  # it is recommended to set the value to 128
    )

    # load the un-quantized model; it is always force-loaded onto CPU first
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    # quantize the model; the examples must be a list of dicts whose only keys are
    # "input_ids" and "attention_mask", with torch.LongTensor values
    model.quantize(trainloader, use_triton=False)

    # save quantized model
    model.save_quantized(quantized_model_dir)

    # save the quantized model again, this time using safetensors
    model.save_quantized(quantized_model_dir, use_safetensors=True)

    # load the quantized model; currently only CPU or a single GPU is supported
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

    opt_eval(model.model, testenc, "cuda:0")

if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
    )

    main()
