
kanishkamisra / minicons

101 stars · 6 watchers · 25 forks · 6.38 MB

Utility for behavioral and representational analyses of Language Models

Home Page: https://minicons.kanishka.website

License: MIT License

Python 100.00%
nlp natural-language-processing transformers language-model

minicons's People

Contributors

aaronmueller · carina-kauf · dependabot[bot] · eltociear · itsmehemant123 · kanishkamisra · plonerma · ryskina · wwt17


minicons's Issues

Discrepancies in output between simonepri/lm-scorer and minicons libraries

Issue

I am the author of hashformers, a state-of-the-art library for hashtag segmentation.

I am currently transitioning from using simonepri/lm-scorer to using minicons as the backbone for my library. So my goal right now is to replicate the exact scores produced by lm-scorer using minicons.

Here's the original code snippet using lm-scorer:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=batch_size)
logprobs = scorer.tokens_score("I like this package.")
print(logprobs)

The corresponding scorer in the lm-scorer library can be found here.

In my attempts to duplicate this functionality with minicons, I came up with the following code:

import torch
from minicons import scorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = scorer.IncrementalLMScorer('gpt2', device)
logprobs = model.compute_stats(model.prepare_text("I like this package."))
print(logprobs)

However, this code doesn't provide the expected output. The differences include:

  • The number of tokens returned by the two libraries doesn't match.
  • The absolute value of each token score differs between the two outputs.
  • The relative positions of the token scores also differ. np.argsort returns different results for each library's output.

This is the output produced by lm-scorer (the relevant scores are the first list in the tuple):

(
  [0.018321018666028976, 0.0066428035497665405, 0.08063317090272903, 0.000607448979280889, 0.277709037065506, 0.0036384568084031343], 
  [40, 588, 428, 5301, 13, 50256], 
  ['I', 'Ġlike', 'Ġthis', 'Ġpackage', '.', '<|endoftext|>']
)

This is the output produced by minicons:

Using pad_token, but it is not set yet.

[[-6.164241790771484, -3.1028060913085938, -7.756439208984375, -1.4581527709960938]]
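Part of the gap is visible in the outputs themselves: lm-scorer reports raw probabilities for six tokens because it prepends a BOS context and appends <|endoftext|> (note the `i != 0` filtering in its code below), while minicons reports natural-log probabilities and drops the first token by default. A hedged sketch of a closer comparison, assuming prepare_text accepts the optional bos_token flag that another issue below says defaults to False:

import math

import torch
from minicons import scorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = scorer.IncrementalLMScorer("gpt2", device)

# bos_token=True should prepend the model's BOS token, so the first real
# token ("I") gets a conditional probability instead of being discarded.
logprobs = model.compute_stats(
    model.prepare_text("I like this package.", bos_token=True)
)

# minicons returns natural-log probabilities; exponentiate to compare with
# the raw probabilities printed by lm-scorer.
print([math.exp(lp) for lp in logprobs[0]])

Even then, minicons does not append a final <|endoftext|>, so lm-scorer's last score would have no counterpart.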

Code comparison

It's hard to spot where the difference is, because the code in the two libraries is rather similar:

minicons

minicons/scorer.py:

        ids = [
            [i for i in instance if i != self.tokenizer.pad_token_id]
            for instance in encoded["input_ids"].tolist()
        ]

        ## Ignore the probabilities of the first token.
        effective_ids = [id[1:] for id in ids]

        with torch.no_grad():
            logits = self.model(**encoded).logits.detach()

        logits[:, :, self.tokenizer.pad_token_id] = float("-inf")

        logits = logits.split([1] * len(offsets))

        ## Set up storage variables
        scores = []
        if rank:
            ranks = []

        for logit, idx, offset in zip(logits, effective_ids, offsets):
            length = len(idx)
            logit = logit.squeeze(0)[:, :-1][
                torch.arange(offset, length),
            ]

            logprob_distribution = logit - logit.logsumexp(1).unsqueeze(1)
            query_ids = idx[offset:]

...


                    score = logprob_distribution[
                        torch.arange(length - offset), query_ids
                    ].tolist()

...

            scores.append(score)

...

            return scores

lm-scorer

lm-scorer/gpt2.py:

        with torch.no_grad():
            ids = encoding["input_ids"].to(self.model.device)
            attention_mask = encoding["attention_mask"].to(self.model.device)
            nopad_mask = ids != self.tokenizer.pad_token_id
            logits: torch.Tensor = self.model(ids, attention_mask=attention_mask)[0]

        for sent_index in range(len(text)):
            sent_nopad_mask = nopad_mask[sent_index]
            # len(tokens) = len(text[sent_index]) + 1
            sent_tokens = [
                tok
                for i, tok in enumerate(encoding.tokens(sent_index))
                if sent_nopad_mask[i] and i != 0
            ]

            # sent_ids.shape = [len(text[sent_index]) + 1]
            sent_ids = ids[sent_index, sent_nopad_mask][1:]
            # logits.shape = [len(text[sent_index]) + 1, vocab_size]
            sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
            sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
            # ids_scores.shape = [seq_len + 1]
            sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
            # log_prob.shape = [seq_len + 1]
            sent_log_probs = sent_ids_scores - sent_logits.logsumexp(1)

            sent_log_probs = cast(torch.DoubleTensor, sent_log_probs)
            sent_ids = cast(torch.LongTensor, sent_ids)

            output = (sent_log_probs, sent_ids, sent_tokens)
            outputs.append(output)

        return outputs

I'd appreciate any insights or suggestions on how to make the minicons output match the lm-scorer output.

Thank you for your assistance!

incorrect tokenizers

Hi,

Minicons does not seem to tokenize alphabet-based texts into whole words when "bert" pre-trained models are used. It is desirable to generate surprisal values for the word forms found in real life rather than for split subword forms. For example, "symbolised" is split into ('symbol', 9.485310554504395) and ('##ised', 6.920506000518799). However, I want to get the surprisal value for the real word ("symbolised"). I am not sure how to solve this problem. The package also seems to generate misleading surprisal values for some real words, particularly long words with a suffix or prefix, because such a word is split into several units.

Many thanks!

In [13]: model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')

In [14]: ge_sen=["Janus symbolisierte häufig Veränderungen und Übergänge, wie de
    ...: n Wechsel von einer Bedingung zur anderen, von einer Perspektive zur an
    ...: deren und das Heranwachsen junger Menschen zum Erwachsenenalter."]

In [15]: model.token_score(ge_sen, surprisal = True, base_two = True)
Out[15]: 
[[('Jan', 7.411351680755615),
  ('##us', 6.953413963317871),
  ('symbol', 8.663262367248535),
  ('##isierte', 8.227853775024414),
  ('häufig', 9.369148254394531),
  ('Veränderungen', 4.863248348236084),
  ('und', 3.3478829860687256),
  ('Über', 3.3023200035095215),
  ('##gänge', 0.40428581833839417),
  (',', 0.048578906804323196),
  ('wie', 1.878091812133789),
  ('den', 5.769808769226074),
  ('Wechsel', 3.2879366874694824),
  ('von', 0.016336975619196892),
  ('einer', 0.016496576368808746),
  ('Bed', 0.0244187843054533),
  ('##ingu', 0.09460146725177765),
  ('##ng', 0.018612651154398918),
  ('zur', 0.9586092829704285),
  ('anderen', 1.2600054740905762),
  (',', 0.3100062906742096),
  ('von', 0.013392632827162743),
  ('einer', 0.025651555508375168),
  ('Pers', 0.007922208867967129),
  ('##pektive', 0.03971010446548462),
  ('zur', 0.8729674220085144),
  ('anderen', 1.6451447010040283),
  ('und', 2.9337639808654785),
  ('das', 0.1244136244058609),
  ('Hera', 1.1853374242782593),
  ('##n', 1.9540393352508545),
  ('##wachsen', 0.006810512859374285),
  ('junge', 2.0289151668548584),
  ('##r', 0.007776367478072643),
  ('Menschen', 3.1449434757232666),
  ('zum', 5.088050365447998),
  ('Er', 0.001235523377545178),
  ('##wachsenen', 0.01289732288569212),
  ('##alter', 0.12524327635765076),
  ('.', 0.02648257650434971)]]

In [16]: en_sen=["Janus often symbolised changes and transitions, such as moving
    ...:  from one condition to another, from one perspective to another, and yo
    ...: ung people growing into adulthood."]

In [17]: en_sen
Out[17]: ['Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood.']

In [18]: model.token_score(en_sen, surprisal = True, base_two = True)
Out[18]: 
[[('Jan', 7.161930084228516),
  ('##us', 4.905619144439697),
  ('often', 5.8594160079956055),
  ('symbol', 9.485310554504395),
  ('##ised', 6.920506000518799),
  ('changes', 4.574926853179932),
  ('and', 3.2199747562408447),
  ('transition', 5.44439697265625),
  ('##s', 0.018392512574791908),
  (',', 0.02080027014017105),
  ('such', 0.04780016839504242),
  ('as', 0.013945729471743107),
  ('moving', 7.4285569190979),
  ('from', 0.008073553442955017),
  ('one', 0.2561193108558655),
  ('condition', 18.707305908203125),
  ('to', 0.014606142416596413),
  ('another', 0.8214359283447266),
  (',', 0.7367089986801147),
  ('from', 0.06036728620529175),
  ('one', 0.4734668731689453),
  ('perspective', 13.356915473937988),
  ('to', 0.06987723708152771),
  ('another', 0.7075008749961853),
  (',', 0.08287912607192993),
  ('and', 2.1124203205108643),
  ('young', 6.065392017364502),
  ('people', 3.042752742767334),
  ('growing', 4.334306716918945),
  ('into', 4.379203796386719),
  ('adult', 1.3680847883224487),
  ('##hood', 0.2171218991279602),
  ('.', 0.06372988969087601)]]

In [19]: sp_sen=["Jano suele simbolizar los cambios y las transiciones, como el 
    ...: paso de una condición a otra, de una perspectiva a otra, y el crecimien
    ...: to de los jóvenes hacia la edad adulta."]

In [20]: model.token_score(sp_sen, surprisal = True, base_two = True)
Out[20]: 
[[('Jan', 11.449429512023926),
  ('##o', 7.180861949920654),
  ('suele', 7.2584357261657715),
  ('simbol', 4.928884983062744),
  ('##izar', 0.018150361254811287),
  ('los', 0.03109721466898918),
  ('cambios', 3.5657286643981934),
  ('y', 6.550257682800293),
  ('las', 0.04733512923121452),
  ('trans', 3.946718454360962),
  ('##iciones', 0.35458695888519287),
  (',', 0.0718887448310852),
  ('como', 0.6874077916145325),
  ('el', 0.009603511542081833),
  ('paso', 5.542746067047119),
  ('de', 0.015120714902877808),
  ('una', 0.010806013830006123),
  ('condición', 15.124244689941406),
  ('a', 0.03922305256128311),
  ('otra', 0.5000980496406555),
  (',', 0.5917675495147705),
  ('de', 0.05246984213590622),
  ('una', 0.015467431396245956),
  ('perspectiva', 11.579629898071289),
  ('a', 0.05401906371116638),
  ('otra', 0.39069506525993347),
  (',', 0.024377508088946342),
  ('y', 1.9166929721832275),
  ('el', 0.006273927167057991),
  ('crecimiento', 6.725331783294678),
  ('de', 0.011221524327993393),
  ('los', 0.7993561029434204),
  ('jóvenes', 4.965604305267334),
  ('hacia', 3.6372487545013428),
  ('la', 0.27643802762031555),
  ('edad', 0.262629896402359),
  ('adulta', 0.3033374845981598),
  ('.', 0.03442404791712761)]]

In [21]: ru_sen=["Янус часто символизировал изменения и переходы, такие как пере
    ...: ход от одного состояния к другому, от одной перспективы к другой, а так
    ...: же молодых людей, вступающих во взрослую жизнь."]

In [22]: model.token_score(ru_sen, surprisal = True, base_two = True)
Out[22]: 
[[('Ян', 7.062388896942139),
  ('##ус', 7.699002742767334),
  ('часто', 10.491772651672363),
  ('символ', 1.846983551979065),
  ('##из', 0.5921100974082947),
  ('##ировал', 7.98089599609375),
  ('изменения', 9.341201782226562),
  ('и', 1.0752657651901245),
  ('пер', 0.0009851165814325213),
  ('##еход', 0.024955371394753456),
  ('##ы', 0.4438115358352661),
  (',', 0.036848314106464386),
  ('такие', 1.1838680505752563),
  ('как', 0.00436423160135746),
  ('пер', 0.006749975029379129),
  ('##еход', 0.0007649788167327642),
  ('от', 0.038956135511398315),
  ('одного', 0.11314807087182999),
  ('состояния', 9.765267372131348),
  ('к', 0.005316327791661024),
  ('другому', 1.0975052118301392),
  (',', 0.7671073079109192),
  ('от', 0.021667061373591423),
  ('одной', 1.209750771522522),
  ('пер', 0.016496576368808746),
  ('##спект', 0.001849157502874732),
  ('##ивы', 0.4393042325973511),
  ('к', 0.001108944183215499),
  ('другой', 1.3383814096450806),
  (',', 0.024982888251543045),
  ('а', 0.026218410581350327),
  ('также', 1.0678788423538208),
  ('молодых', 8.310323715209961),
  ('людей', 0.6578267812728882),
  (',', 0.008406511507928371),
  ('в', 0.719817578792572),
  ('##ступ', 0.6718330383300781),
  ('##ающих', 0.12413019686937332),
  ('во', 0.23966126143932343),
  ('в', 0.04291310906410217),
  ('##з', 0.00735535379499197),
  ('##рос', 0.2500462532043457),
  ('##лу', 0.015621528029441833),
  ('##ю', 0.00034396530827507377),
  ('жизнь', 3.4308295249938965),
  ('.', 0.15315811336040497)]]
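Since surprisal is a log quantity, the surprisal of a whole word is the sum of the surprisals of its pieces under the chain rule. A minimal sketch that merges WordPiece continuations (the "##" prefix) back into whole words; merge_wordpieces is a hypothetical helper, not part of the minicons API:

def merge_wordpieces(token_scores):
    """Merge ('symbol', s1), ('##ised', s2) into ('symbolised', s1 + s2)."""
    merged = []
    for token, score in token_scores:
        if token.startswith("##") and merged:
            word, total = merged[-1]
            merged[-1] = (word + token[2:], total + score)
        else:
            merged.append((token, score))
    return merged

# With the English output above, this yields
# ('symbolised', 9.485... + 6.920...) ≈ ('symbolised', 16.41):
merged = merge_wordpieces(model.token_score(en_sen, surprisal=True, base_two=True)[0])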




Testing word-by-word surprisal in languages other than English

Hello,

Thank you for making this amazing work available! I am wondering if the code works for languages other than English. I was following your example and was able to load English and Japanese models without an error.

model_jp = scorer.IncrementalLMScorer("colorfulscoop/gpt2-small-ja", 'cpu')
model_en = scorer.IncrementalLMScorer("gpt2", 'cpu')

But when I ran the line model_jp.logprobs(model_jp.prepare_text([text])) with the Japanese model, it threw the following error:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in tokens(self, batch_index)
    293         """
    294         if not self._encodings:
--> 295             raise ValueError("tokens() is not available when using Python-based tokenizers")
    296         return self._encodings[batch_index].tokens
    297 

ValueError: tokens() is not available when using Python-based tokenizers

I would appreciate it if you could point me to any possible solutions, and I apologize if this is a very basic question. Thank you!
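The traceback indicates that this checkpoint resolved to a Python-based ("slow") tokenizer, which does not implement tokens(). A quick diagnostic, independent of minicons, to check whether a Rust-backed ("fast") tokenizer is available for the checkpoint:

from transformers import AutoTokenizer

# use_fast=True requests the Rust-backed tokenizer; tokens() is only
# implemented for fast tokenizers.
tok = AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja", use_fast=True)
print(tok.is_fast)  # False means only a Python-based tokenizer ships with it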

A question about extracting word representations

Hi! I have a question about extracting word representations.
If a sentence contains the target word twice, e.g. "The red one is mine, and the other one is yours.", then when using model.extract_representation(['The red one is mine, and the other one is yours.', 'one'], layer = 12), which representation is extracted? Is it the representation of the first "one"?
Many thanks!
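One way to sidestep the ambiguity, whatever extract_representation does with repeated words, is to pull the hidden states directly with transformers and select the occurrence by token position; a sketch independent of the minicons API:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentence = "The red one is mine, and the other one is yours."
enc = tok(sentence, return_tensors="pt")

# Indices of every occurrence of "one"; pick the one you want explicitly.
positions = [i for i, t in enumerate(enc.tokens()) if t == "one"]

with torch.no_grad():
    hidden = bert(**enc).hidden_states[12]  # layer 12

first_one = hidden[0, positions[0]]   # representation of the first "one"
second_one = hidden[0, positions[1]]  # representation of the second "one"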

Linking to docs

Hi Kanishka, this is a great tool! I was wondering if you could link to the documentation (https://minicons.kanishka.website/) in the README? I actually found it through a Google search on computing word-level surprisals out of the box, and couldn't find it linked on the README when I was trying to re-reference it. The tutorials are nice, but it's also helpful to have the docs on hand here as a quick reference.

Thanks for your consideration!

IncrementalLMScorer discards probability of first token

Consider this section of code from IncrementalLMScorer:

## Ignore the probabilities of the first token.
effective_ids = [id[1:] for id in ids]

If I'm understanding this correctly, the class is discarding the probability the model assigns to the first token in every element of a batch. I understand why such logic would make sense in the context of a model that uses a BOS token; but does this mean that this class is unusable for models that do not use a BOS token? It is not at all clear from the docs that this class is only meant to be used with BOS token models.

A BOS token is mentioned in other places in the code, for example as a Boolean argument to prepare_text, but in such places it's clearly marked as optional, with the default being False. So I'm a little confused by the lack of such optionality in the function above.

Am I understanding this correctly? If so, is there a workaround (besides reduplication of the code)?
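Given the optional bos_token flag mentioned above, one workaround sketch (assuming the flag simply prepends the model's BOS token) is to make the discarded first position be the BOS token rather than a real token:

from minicons import scorer

model = scorer.IncrementalLMScorer("gpt2", "cpu")

# With bos_token=True, position 0 is the BOS token, so dropping "the first
# token's probability" discards the BOS score and every real token keeps a
# well-defined conditional probability.
scores = model.compute_stats(model.prepare_text("I like this package.", bos_token=True))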

Can't make the surprisal example work

Hi, I couldn't make the surprisal example at https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md work:

>>> import minicons
>>> import torch
>>> from torch.utils.data import DataLoader
>>> 
>>> import json
>>> model = scorer.IncrementalLMScorer('gpt2', 'cpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'scorer' is not defined
>>> 

It also doesn't work with minicons.scorer.

I'm not sure where scorer is supposed to come from...

Sorry, if it's just me doing something stupid. My python abilities are mostly from copy&paste (I'm an R person)
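For reference, the snippets in the other issues on this page import the submodule directly, which resolves the NameError:

from minicons import scorer  # scorer is a submodule of the minicons package

model = scorer.IncrementalLMScorer('gpt2', 'cpu')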

GPT2 minicons surprisal: IndexError: index out of range in self

I am trying to calculate surprisal values by feeding in a txt file with about 5000 sentences, but I encounter an error: IndexError: index out of range in self. Can anyone help with this issue?

Here is the code and the error message (attached as screenshots, not reproduced here).

Expected behavior:
I would like to have the surprisal value for each word for the whole text file.

Thank you!
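A common cause of this particular IndexError with GPT-2 is an input longer than the model's 1024-token context window, though the screenshots would be needed to confirm that here. A hedged pre-filtering sketch (the input filename is hypothetical):

from minicons import scorer

model = scorer.IncrementalLMScorer("gpt2", "cpu")
max_len = model.tokenizer.model_max_length  # 1024 for GPT-2

with open("sentences.txt") as f:  # hypothetical input file
    sentences = [line.strip() for line in f if line.strip()]

# Score only the sentences that fit in the context window.
in_range = [s for s in sentences if len(model.tokenizer(s)["input_ids"]) <= max_len]
surprisals = model.token_score(in_range, surprisal=True)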

wrong tokenization for the first word

When I try to get probabilities for "Banana is a word." the tokenization is messed up:
['Ban', 'ana', 'Ġis', 'Ġa', 'Ġword', '.']

(Do you know why there is no Ġ for 'Ban', 'ana'?)

If I include an initial space, it works fine: ['ĠBanana', 'Ġis', 'Ġa', 'Ġword', '.']
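For context, Ġ in the GPT-2 BPE vocabulary marks a preceding space, so a sentence-initial word is tokenized by different merges than the same word after a space. A quick demonstration with transformers:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Banana is a word."))   # ['Ban', 'ana', 'Ġis', 'Ġa', 'Ġword', '.']
print(tok.tokenize(" Banana is a word."))  # ['ĠBanana', 'Ġis', 'Ġa', 'Ġword', '.']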

Chinese word surprisal

Hi,

Transformer models like "bert-base-multilingual-uncased" can be used in minicons to compute token surprisal or probabilities for different languages, given a text in that language as input. This works for English, German, Spanish, and other alphabet-based languages.

However, it doesn't seem to work for Chinese. As you know, Chinese is usually pre-processed with word segmentation. Even when the input Chinese text is already segmented into words (two-character, or three-or-more-character combinations), the output still reports surprisal for each Chinese character rather than for the segmented words. I am not sure how to compute surprisal values for Chinese words.

Many thanks!

The following is the example:

from minicons import scorer
import torch
from torch.utils.data import DataLoader

import numpy as np

import json
model = scorer.IncrementalLMScorer('bert-base-multilingual-uncased', 'cpu')
Using bos_token, but it is not set yet.
sentences = ["我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜", "他们 那边 都 是 小 小葱 包括 重庆 那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('[CLS]', 0.0),
  ('我', 16.780792236328125),
  ('昨', 18.67901039123535),
  ('天', 29.759370803833008),
  ('下', 39.109107971191406),
  ('午', 33.43532943725586),
  ('我', 34.247886657714844),
  ('就', 23.704923629760742),
  ('是', 25.778093338012695),
  ('直', 31.338485717773438),
  ('接', 28.79427146911621),
  ('买', 44.60960388183594),
  ('了', 30.6632022857666),
  ('一', 25.942493438720703),
  ('份', 44.91115188598633),
  ('那', 35.40247344970703),
  ('个', 37.76634979248047),
  ('凉', 35.126708984375),
  ('菜', 11.82837963104248),
  ('[SEP]', 32.64777755737305)],
 [('[CLS]', 0.0),
  ('他', 15.437037467956543),
  ('们', 10.030117988586426),
  ('那', 30.752634048461914),
  ('边', 45.248435974121094),
  ('都', 20.54657745361328),
  ('是', 27.90602684020996),
  ('小', 31.462167739868164),
  ('小', 2.2013779016560875e-05),
  ('葱', 17.992713928222656),
  ('包', 13.990900039672852),
  ('括', 34.425636291503906),
  ('重', 31.417207717895508),
  ('庆', 23.46117401123047),
  ('那', 34.11079788208008),
  ('边', 42.18030548095703),
  ('[SEP]', 36.32227325439453)]]
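Because the input is already whitespace-segmented, the character-level surprisals can be re-aggregated into the segmented words by summing (surprisal is additive in log space). A sketch; the greedy character alignment is an assumption that holds when the tokenizer splits exactly at character boundaries:

def word_surprisals(segmented_sentence, token_scores):
    """Sum per-character surprisals back into whitespace-segmented words."""
    specials = {"[CLS]", "[SEP]"}
    pieces = [(t, s) for t, s in token_scores if t not in specials]
    merged, i = [], 0
    for word in segmented_sentence.split():
        total, consumed = 0.0, ""
        while len(consumed) < len(word):
            token, score = pieces[i]
            consumed += token
            total += score
            i += 1
        merged.append((word, total))
    return merged

sentence = "我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜"
scores = model.token_score([sentence], surprisal=True, base_two=True)[0]
print(word_surprisals(sentence, scores))
# e.g. ('昨天', 18.679... + 29.759...) ≈ ('昨天', 48.44)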

Installing with conda

Thank you for providing this package which works well for me. I have been using it in a Jupyter notebook but I encountered some problems when trying to use it in an anaconda virtual environment.

I tried to pip install minicons in the virtual environment, but it results in the kernel crashing. Is there a way to conda install minicons?

Very low log-probabilities (and therefore high surprisals) for a grammatical sentence

Hi,

This issue is a question. I'm using the German GPT-2 model dbmdz/german-gpt2 to get log-probability and surprisal scores for each token.

The log-probabilities are quite low given that the sentence is a grammatical German sentence. Below are my code and a comparison between a German and an English sentence with the same model.

from minicons import scorer

gpt_model_scorer = scorer.IncrementalLMScorer("dbmdz/german-gpt2", "cpu")

log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."])
# results for German
# ('Der', 0.0)
# ('Mensch', -102.05498504638672)
# ('sammelt', -101.456787109375)
# ('die', -95.31419372558594)
# ('unterschiedlichsten', -98.86357116699219)
# ('Gegenstände', -89.14930725097656)
# ('.', -88.57086181640625)

log_probs_english = gpt_model_scorer.token_score(["The man collects various items."])
# results for English
# ('The', 0.0)
# ('man', -95.54633331298828)
# ('colle', -68.77188110351562)
# ('cts', -36.030174255371094)
# ('v', -87.14112854003906)
# ('ario', -44.695987701416016)
# ('us', -45.50498962402344)
# ('items', -79.1251449584961)
# ('.', -73.40552520751953)

Am I using your code in the intended way? Is the issue the GPT-2 model itself?
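As a sanity check (a sketch, not a diagnosis of the underlying cause): natural-log probabilities around -100 correspond to probabilities near e^-100 ≈ 4e-44, implausible for every token of a grammatical sentence, so it is worth verifying what scale token_score is returning:

import math

# Exponentiate the reported scores; genuine log-probabilities for ordinary
# tokens should come out well above e^-100.
print([math.exp(score) for _, score in log_probs[0]])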

I Must Know

Is it called 'minicons' because minicons are a type of small helpful mech in the fictional Transformers universe?

seq_score vs sequence_score vs compute_stats

Hi,
I was giving minicons a try again. I just installed it using pip, but I'm not sure if that's the latest version.

The README shows that scorer allows sequence_score, the website minicons.kanishka.website mentions compute_stats, but my installed version only has seq_score. Have I installed the latest version?
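One way to check which version pip installed (a sketch; importlib.metadata requires Python 3.8+) is to query the package metadata and compare it against the latest release on PyPI:

import importlib.metadata

# Prints the installed minicons version, e.g. to compare with PyPI.
print(importlib.metadata.version("minicons"))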
