asahi417 / lmppl

Calculate perplexity on a text with pre-trained language models. Supports MLMs (e.g. DeBERTa), recurrent LMs (e.g. GPT-3), and encoder-decoder LMs (e.g. Flan-T5).

License: MIT License

Language: Python 100.00%
Topics: bart, gpt, languagemodel, nlp


lmppl's Issues

Dataset too large

I am using the run_mlm.py file, but with my own copy, because I changed the tokenizer path: the tokenizer lives at a different path from the model, which is local.

While initially working with this method, I used the first two lines of my dataset and it worked just fine, but now that I have expanded the input, I am getting this error:

IndexError                                Traceback (most recent call last)
Cell In[58], line 5
      3 scorer = MaskedLM('/data/user/home/nchendri/LongRun/')
      4 text =  dsMap['test']['text']
----> 5 ppl = scorer.get_perplexity(text, batch=32)
      6 print(ppl)
      7 print(list(zip(text, ppl)))

Cell In[57], line 162, in MaskedLM.get_perplexity(self, input_texts, batch)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 162, in <listcomp>(.0)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 157, in MaskedLM.get_perplexity.<locals>.encode_mask(mask_position)
    155 # add the correct token id as the label
    156 label = [PAD_TOKEN_LABEL_ID] * _e['input_ids'].shape[1]
--> 157 label[mask_position + len(self.sp_token_prefix)] = masked_token_id
    158 _e['labels'] = torch.tensor([label], dtype=torch.long)
    159 return _e

IndexError: list assignment index out of range
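A hedged workaround sketch, assuming the failure comes from texts longer than the model's maximum length (the mask position plus the special-token prefix then indexes past the truncated encoding): pre-truncate each text with the same tokenizer before scoring. The local path, dsMap, and scorer below are the ones from the snippet above.

from transformers import AutoTokenizer

# Same local checkpoint path as in the traceback above.
tok = AutoTokenizer.from_pretrained('/data/user/home/nchendri/LongRun/')

def truncate(text, max_length=tok.model_max_length - 8):  # headroom for special tokens; set manually if the tokenizer defines no limit
    # Re-encode without special tokens, cut to the model limit, decode back to text.
    ids = tok(text, truncation=True, max_length=max_length, add_special_tokens=False)['input_ids']
    return tok.decode(ids)

text = [truncate(t) for t in dsMap['test']['text']]
ppl = scorer.get_perplexity(text, batch=32)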

ppl in openai model

nll.append(sum([i for i in completion['choices'][0]['logprobs']['token_logprobs'] if i is not None]))
I think this calculation may be wrong. It may need to change to:

nll.append(sum([i for i in completion['choices'][0]['logprobs']['token_logprobs'] if i is not None]) / len([i for i in completion['choices'][0]['logprobs']['token_logprobs'] if i is not None]))
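For reference, a minimal sketch of that normalization (the completion object is the hypothetical OpenAI response from the lines above): perplexity is the exponential of the mean negative token log-probability, so the sum does need to be divided by the token count.

import math

logprobs = [lp for lp in completion['choices'][0]['logprobs']['token_logprobs'] if lp is not None]
mean_nll = -sum(logprobs) / len(logprobs)  # average negative log-likelihood per token
ppl = math.exp(mean_nll)                   # perplexity = exp(mean NLL)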

ImportError while trying to use the device_map="auto" parameter.

Hi. I encountered an ImportError while trying to use the device_map="auto" parameter. The builder module from google.protobuf.internal raises an ImportError, which blocks the usage of the device_map parameter. Tested with CPU, 1xT4, and 1xA100.

Python Version:

pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Python 3.10.12

Versions of protobuf dependencies:

protobuf in /usr/local/lib/python3.10/dist-packages (3.19.6)
grpcio in /usr/local/lib/python3.10/dist-packages (1.56.0)
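For context: google.protobuf.internal.builder only exists in protobuf >= 3.20, and the environment above pins 3.19.6, so a plausible fix (an assumption, not a confirmed resolution of this issue) is upgrading:

pip install --upgrade 'protobuf>=3.20'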

Please Update Readme - Available Models

Hello! This is a great package, but it would help to know all the models it's set up for. I see in the code
references to GPT-2 XL and also GPT-4. Are the examples in the readme the only ones that should be used at this time?
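For what it's worth, the README examples map onto three scorer classes; a minimal sketch (the checkpoint names below are illustrative, and other compatible Hugging Face checkpoints of each architecture should in principle work too):

import lmppl

scorer_lm = lmppl.LM('gpt2')                                 # causal/recurrent LMs
scorer_mlm = lmppl.MaskedLM('microsoft/deberta-v3-small')    # masked LMs
scorer_s2s = lmppl.EncoderDecoderLM('google/flan-t5-small')  # encoder-decoder LMs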

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0!

Hi,

Thanks for this great resource.

Trying to run this snippet of code

import lmppl

scorer = lmppl.EncoderDecoderLM("/home/racball/models--flan-t5-xxl", device_map='auto', low_cpu_mem_usage=True)

inputs = [
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.',
    'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.'
]
outputs = [
    'I am happy.',
    'I am sad.'
]
ppl = scorer.get_perplexity(input_texts=inputs, output_texts=outputs)
print(list(zip(outputs, ppl)))

runs into this stack of errors

RuntimeError                              Traceback (most recent call last)
Cell In[6], line 14
      6 inputs = [
      7     'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.',
      8     'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee.'
      9 ]
     10 outputs = [
     11     'I am happy.',
     12     'I am sad.'
     13 ]
---> 14 ppl = scorer.get_perplexity(input_texts=inputs, output_texts=outputs)
     15 print(list(zip(outputs, ppl)))
     16 # >>> [
     17 #   ('I am happy.', 4138.748977714201),
     18 #   ('I am sad.', 2991.629250051472)
     19 # ]
     20 # print(f"prediction: {outputs[ppl.index(min(ppl))]}")
     21 # >>> "prediction: I am sad."

File /nobackup/racball/miniconda3/envs/bertviz/lib/python3.10/site-packages/lmppl/ppl_encoder_decoder_lm.py:157, in EncoderDecoderLM.get_perplexity(self, input_texts, output_texts, batch)
    155 # model run & loss conversion into likelihood
    156 valid_length = (model_inputs["labels"] != PAD_TOKEN_LABEL_ID).sum(dim=-1)
--> 157 output = self.model(**{k: v.to(self.device) for k, v in model_inputs.items()})
    158 loss = self.loss_fct(output['logits'].view(-1, self.config.vocab_size), model_inputs["labels"].view(-1))
    159 loss = loss.view(len(output['logits']), -1)

File /nobackup/racball/miniconda3/envs/bertviz/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /nobackup/racball/miniconda3/envs/bertviz/lib/python3.10/site-packages/accelerate/hooks.py:158, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    156         output = old_forward(*args, **kwargs)
    157 else:
--> 158     output = old_forward(*args, **kwargs)
    159 return module._hf_hook.post_forward(module, output)

File /nobackup/racball/miniconda3/envs/bertviz/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1696, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1694 if labels is not None:
   1695     loss_fct = CrossEntropyLoss(ignore_index=-100)
...
   3024 if size_average is not None or reduce is not None:
   3025     reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3026 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument target in method wrapper_nll_loss_forward)

Tried forcing .to('cuda:0') in multiple parts of the source code to no avail. Any thoughts?
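One plausible workaround, assuming the mismatch comes from T5 computing the loss internally while device_map='auto' leaves the logits on a different GPU than the labels: skip passing labels into forward() and compute the loss on the logits' device yourself. A self-contained sketch (flan-t5-small stands in for the local xxl checkpoint; this is not lmppl's code):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-small', device_map='auto')
tok = AutoTokenizer.from_pretrained('google/flan-t5-small')

enc = tok('sentiment classification: I dropped my laptop on my knee.', return_tensors='pt')
labels = tok('I am sad.', return_tensors='pt').input_ids

# Run forward WITHOUT labels so T5 does not compute the loss internally.
out = model(
    input_ids=enc.input_ids.to(model.device),
    attention_mask=enc.attention_mask.to(model.device),
    decoder_input_ids=model.prepare_decoder_input_ids_from_labels(labels.to(model.device)),
)

# Move the labels to wherever the logits ended up before computing the loss.
labels = labels.to(out.logits.device)
loss = torch.nn.functional.cross_entropy(out.logits.view(-1, out.logits.size(-1)), labels.view(-1))
print(float(loss))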

A quite large perplexity issue

Hi, thank you for developing lmppl.

I have a question about a suspiciously large perplexity.

I installed lmppl and executed the commands described in the README as follows, but get_perplexity() returns quite large values.
Is there something wrong with my procedure?

>>> import lmppl
>>> scorer = lmppl.LM('gpt2')
Using pad_token, but it is not set yet.
>>> text = [
...     'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am happy.',
...     'sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.'
... ]
>>> ppl = scorer.get_perplexity(text)
100%|██████████| 1/1 [00:00<00:00,  3.03it/s]
>>> ppl
[4.2328431180493815e+43, 4.732356477497072e+43] # <-- They are quite large, there seems to be something wrong.

Version of some modules in my environment:

  • python 3.7.10
  • lmppl==0.2.9
  • transformers==4.11.3
  • torch==1.12.1+cu116

Thank you.
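For comparison, a quick sanity check with plain transformers (nothing lmppl-specific; the expected magnitude in the comment is an assumption based on typical GPT-2 perplexities). If the manual number looks sane, the mismatch more likely sits in the environment (the pinned transformers==4.11.3 is quite old) than in the model.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained('gpt2').eval()
tok = GPT2TokenizerFast.from_pretrained('gpt2')

ids = tok('sentiment classification: I dropped my laptop on my knee, and someone stole my coffee. I am sad.', return_tensors='pt').input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean NLL per token
print(torch.exp(loss).item())           # perplexity; expect tens to hundreds, not 1e43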

RuntimeError: Sizes of tensors must match

I always get this error when I apply the get_perplexity() method to a list of texts or a DataFrame column.

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 194 but got size 193 for tensor number 125 in the list.

Any idea what is causing this error?

Thank you
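A hedged workaround while the root cause is unclear: the message comes from stacking tensors of different lengths, so scoring texts one at a time avoids mixing lengths within a batch, at the cost of speed. A sketch, with scorer being any lmppl scorer and texts the list that triggers the error:

ppls = [scorer.get_perplexity([t])[0] for t in texts]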
