kanishkamisra / minicons
Utility for behavioral and representational analyses of Language Models
Home Page: https://minicons.kanishka.website
License: MIT License
I am the author of hashformers, a state-of-the-art library for hashtag segmentation.
I am currently transitioning from using simonepri/lm-scorer to using minicons as the backbone for my library. So my goal right now is to replicate the exact scores produced by lm-scorer using minicons.
Here's the original code snippet using lm-scorer:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=batch_size)
logprobs = scorer.tokens_score("I like this package.")
print(logprobs)
The corresponding scorer in the lm-scorer library can be found here.
In my attempts to duplicate this functionality with minicons, I came up with the following code:
import torch
from minicons import scorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = scorer.IncrementalLMScorer('gpt2', device)
logprobs = model.compute_stats(model.prepare_text("I like this package."))
print(logprobs)
However, this code doesn't provide the expected output. The differences include:
np.argsort returns different results for each library's output.
This is the output produced by lm-scorer (the relevant scores are the first list in the tuple):
(
[0.018321018666028976, 0.0066428035497665405, 0.08063317090272903, 0.000607448979280889, 0.277709037065506, 0.0036384568084031343],
[40, 588, 428, 5301, 13, 50256],
['I', 'Ġlike', 'Ġthis', 'Ġpackage', '.', '<|endoftext|>']
)
This is the output produced by minicons:
Using pad_token, but it is not set yet.
[[-6.164241790771484, -3.1028060913085938, -7.756439208984375, -1.4581527709960938]]
It's hard to spot where the difference is, because the code in both libraries is rather similar:
# minicons, IncrementalLMScorer (excerpt)
ids = [
    [i for i in instance if i != self.tokenizer.pad_token_id]
    for instance in encoded["input_ids"].tolist()
]
## Ignore the probabilities of the first token.
effective_ids = [id[1:] for id in ids]
with torch.no_grad():
    logits = self.model(**encoded).logits.detach()
logits[:, :, self.tokenizer.pad_token_id] = float("-inf")
logits = logits.split([1] * len(offsets))
## Set up storage variables
scores = []
if rank:
    ranks = []
for logit, idx, offset in zip(logits, effective_ids, offsets):
    length = len(idx)
    logit = logit.squeeze(0)[:, :-1][
        torch.arange(offset, length),
    ]
    logprob_distribution = logit - logit.logsumexp(1).unsqueeze(1)
    query_ids = idx[offset:]
    ...
    score = logprob_distribution[
        torch.arange(length - offset), query_ids
    ].tolist()
    ...
    scores.append(score)
    ...
return scores

# lm-scorer (excerpt)
with torch.no_grad():
    ids = encoding["input_ids"].to(self.model.device)
    attention_mask = encoding["attention_mask"].to(self.model.device)
    nopad_mask = ids != self.tokenizer.pad_token_id
    logits: torch.Tensor = self.model(ids, attention_mask=attention_mask)[0]
for sent_index in range(len(text)):
    sent_nopad_mask = nopad_mask[sent_index]
    # len(tokens) = len(text[sent_index]) + 1
    sent_tokens = [
        tok
        for i, tok in enumerate(encoding.tokens(sent_index))
        if sent_nopad_mask[i] and i != 0
    ]
    # sent_ids.shape = [len(text[sent_index]) + 1]
    sent_ids = ids[sent_index, sent_nopad_mask][1:]
    # logits.shape = [len(text[sent_index]) + 1, vocab_size]
    sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
    sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
    # ids_scores.shape = [seq_len + 1]
    sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
    # log_prob.shape = [seq_len + 1]
    sent_log_probs = sent_ids_scores - sent_logits.logsumexp(1)
    sent_log_probs = cast(torch.DoubleTensor, sent_log_probs)
    sent_ids = cast(torch.LongTensor, sent_ids)
    output = (sent_log_probs, sent_ids, sent_tokens)
    outputs.append(output)
return outputs
I'd appreciate any insights or suggestions on how to make the minicons output match the lm-scorer output.
Thank you for your assistance!
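One way to put the two quoted outputs side by side, as a minimal sketch based only on the numbers in this post: the lm-scorer values lie between 0 and 1 and cover six tokens including `<|endoftext|>`, while the minicons values are negative and cover four tokens. This suggests (an assumption, not confirmed by either library here) that lm-scorer returned probabilities for a sequence with an end-of-text token appended, whereas minicons returned natural-log probabilities and skipped the first token. The helper `align_lm_scorer_output` is hypothetical.

```python
import math

# Outputs copied from the post above.
lm_scorer_probs = [0.018321018666028976, 0.0066428035497665405,
                   0.08063317090272903, 0.000607448979280889,
                   0.277709037065506, 0.0036384568084031343]
lm_scorer_tokens = ['I', 'Ġlike', 'Ġthis', 'Ġpackage', '.', '<|endoftext|>']
minicons_logprobs = [-6.164241790771484, -3.1028060913085938,
                     -7.756439208984375, -1.4581527709960938]


def align_lm_scorer_output(probs, tokens, eos="<|endoftext|>"):
    """Convert probabilities to natural-log probabilities, then drop the first
    token and the end-of-text token so positions line up with minicons."""
    pairs = [(tok, math.log(p)) for tok, p in zip(tokens, probs)]
    return [(tok, lp) for tok, lp in pairs[1:] if tok != eos]


aligned = align_lm_scorer_output(lm_scorer_probs, lm_scorer_tokens)
for (tok, lp), mc in zip(aligned, minicons_logprobs):
    print(f"{tok!r}: lm-scorer {lp:.3f} vs minicons {mc:.3f}")
```

Even after this alignment the numbers are not identical. Since lm-scorer reports a score for the first token, it presumably conditions every token on a prepended beginning-of-text token, which the minicons call above does not do; that is an inference from the quoted outputs, not a verified explanation.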
Hi,
Minicons does not seem to score whole words in alphabet-based texts when "bert" pre-trained models are used. It would be desirable to get surprisal values for word forms as they actually occur rather than for their split forms. For example, "symbolised" is split into ('symbol', 9.485310554504395), ('##ised', 6.920506000518799). However, I want to get the surprisal value for the whole word ("symbolised"). I am not sure how to solve this problem. The package also seems to generate incorrect surprisal values for some words, particularly long words with a prefix or suffix, because such words are split into several units.
Many thanks!
In [13]: model = scorer.MaskedLMScorer('bert-base-multilingual-cased', 'cpu')
In [14]: ge_sen=["Janus symbolisierte häufig Veränderungen und Übergänge, wie den Wechsel von einer Bedingung zur anderen, von einer Perspektive zur anderen und das Heranwachsen junger Menschen zum Erwachsenenalter."]
In [15]: model.token_score(ge_sen, surprisal = True, base_two = True)
Out[15]:
[[('Jan', 7.411351680755615),
('##us', 6.953413963317871),
('symbol', 8.663262367248535),
('##isierte', 8.227853775024414),
('häufig', 9.369148254394531),
('Veränderungen', 4.863248348236084),
('und', 3.3478829860687256),
('Über', 3.3023200035095215),
('##gänge', 0.40428581833839417),
(',', 0.048578906804323196),
('wie', 1.878091812133789),
('den', 5.769808769226074),
('Wechsel', 3.2879366874694824),
('von', 0.016336975619196892),
('einer', 0.016496576368808746),
('Bed', 0.0244187843054533),
('##ingu', 0.09460146725177765),
('##ng', 0.018612651154398918),
('zur', 0.9586092829704285),
('anderen', 1.2600054740905762),
(',', 0.3100062906742096),
('von', 0.013392632827162743),
('einer', 0.025651555508375168),
('Pers', 0.007922208867967129),
('##pektive', 0.03971010446548462),
('zur', 0.8729674220085144),
('anderen', 1.6451447010040283),
('und', 2.9337639808654785),
('das', 0.1244136244058609),
('Hera', 1.1853374242782593),
('##n', 1.9540393352508545),
('##wachsen', 0.006810512859374285),
('junge', 2.0289151668548584),
('##r', 0.007776367478072643),
('Menschen', 3.1449434757232666),
('zum', 5.088050365447998),
('Er', 0.001235523377545178),
('##wachsenen', 0.01289732288569212),
('##alter', 0.12524327635765076),
('.', 0.02648257650434971)]]
In [16]: en_sen=["Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood."]
In [17]: en_sen
Out[17]: ['Janus often symbolised changes and transitions, such as moving from one condition to another, from one perspective to another, and young people growing into adulthood.']
In [18]: model.token_score(en_sen, surprisal = True, base_two = True)
Out[18]:
[[('Jan', 7.161930084228516),
('##us', 4.905619144439697),
('often', 5.8594160079956055),
('symbol', 9.485310554504395),
('##ised', 6.920506000518799),
('changes', 4.574926853179932),
('and', 3.2199747562408447),
('transition', 5.44439697265625),
('##s', 0.018392512574791908),
(',', 0.02080027014017105),
('such', 0.04780016839504242),
('as', 0.013945729471743107),
('moving', 7.4285569190979),
('from', 0.008073553442955017),
('one', 0.2561193108558655),
('condition', 18.707305908203125),
('to', 0.014606142416596413),
('another', 0.8214359283447266),
(',', 0.7367089986801147),
('from', 0.06036728620529175),
('one', 0.4734668731689453),
('perspective', 13.356915473937988),
('to', 0.06987723708152771),
('another', 0.7075008749961853),
(',', 0.08287912607192993),
('and', 2.1124203205108643),
('young', 6.065392017364502),
('people', 3.042752742767334),
('growing', 4.334306716918945),
('into', 4.379203796386719),
('adult', 1.3680847883224487),
('##hood', 0.2171218991279602),
('.', 0.06372988969087601)]]
In [19]: sp_sen=["Jano suele simbolizar los cambios y las transiciones, como el paso de una condición a otra, de una perspectiva a otra, y el crecimiento de los jóvenes hacia la edad adulta."]
In [20]: model.token_score(sp_sen, surprisal = True, base_two = True)
Out[20]:
[[('Jan', 11.449429512023926),
('##o', 7.180861949920654),
('suele', 7.2584357261657715),
('simbol', 4.928884983062744),
('##izar', 0.018150361254811287),
('los', 0.03109721466898918),
('cambios', 3.5657286643981934),
('y', 6.550257682800293),
('las', 0.04733512923121452),
('trans', 3.946718454360962),
('##iciones', 0.35458695888519287),
(',', 0.0718887448310852),
('como', 0.6874077916145325),
('el', 0.009603511542081833),
('paso', 5.542746067047119),
('de', 0.015120714902877808),
('una', 0.010806013830006123),
('condición', 15.124244689941406),
('a', 0.03922305256128311),
('otra', 0.5000980496406555),
(',', 0.5917675495147705),
('de', 0.05246984213590622),
('una', 0.015467431396245956),
('perspectiva', 11.579629898071289),
('a', 0.05401906371116638),
('otra', 0.39069506525993347),
(',', 0.024377508088946342),
('y', 1.9166929721832275),
('el', 0.006273927167057991),
('crecimiento', 6.725331783294678),
('de', 0.011221524327993393),
('los', 0.7993561029434204),
('jóvenes', 4.965604305267334),
('hacia', 3.6372487545013428),
('la', 0.27643802762031555),
('edad', 0.262629896402359),
('adulta', 0.3033374845981598),
('.', 0.03442404791712761)]]
In [21]: ru_sen=["Янус часто символизировал изменения и переходы, такие как переход от одного состояния к другому, от одной перспективы к другой, а также молодых людей, вступающих во взрослую жизнь."]
In [22]: model.token_score(ru_sen, surprisal = True, base_two = True)
Out[22]:
[[('Ян', 7.062388896942139),
('##ус', 7.699002742767334),
('часто', 10.491772651672363),
('символ', 1.846983551979065),
('##из', 0.5921100974082947),
('##ировал', 7.98089599609375),
('изменения', 9.341201782226562),
('и', 1.0752657651901245),
('пер', 0.0009851165814325213),
('##еход', 0.024955371394753456),
('##ы', 0.4438115358352661),
(',', 0.036848314106464386),
('такие', 1.1838680505752563),
('как', 0.00436423160135746),
('пер', 0.006749975029379129),
('##еход', 0.0007649788167327642),
('от', 0.038956135511398315),
('одного', 0.11314807087182999),
('состояния', 9.765267372131348),
('к', 0.005316327791661024),
('другому', 1.0975052118301392),
(',', 0.7671073079109192),
('от', 0.021667061373591423),
('одной', 1.209750771522522),
('пер', 0.016496576368808746),
('##спект', 0.001849157502874732),
('##ивы', 0.4393042325973511),
('к', 0.001108944183215499),
('другой', 1.3383814096450806),
(',', 0.024982888251543045),
('а', 0.026218410581350327),
('также', 1.0678788423538208),
('молодых', 8.310323715209961),
('людей', 0.6578267812728882),
(',', 0.008406511507928371),
('в', 0.719817578792572),
('##ступ', 0.6718330383300781),
('##ающих', 0.12413019686937332),
('во', 0.23966126143932343),
('в', 0.04291310906410217),
('##з', 0.00735535379499197),
('##рос', 0.2500462532043457),
('##лу', 0.015621528029441833),
('##ю', 0.00034396530827507377),
('жизнь', 3.4308295249938965),
('.', 0.15315811336040497)]]
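A possible post-processing step for the outputs above, sketched under the assumption that continuation pieces are marked with "##" as in the lists shown: merge each run of WordPiece tokens back into one word and sum their surprisals. Summing is one common aggregation choice; for masked-LM scores it is a heuristic rather than an exact word probability. The helper `merge_wordpieces` is hypothetical and not part of minicons.

```python
def merge_wordpieces(token_scores, continuation_prefix="##"):
    """Sum the scores of WordPiece tokens that belong to the same word."""
    words = []
    for token, score in token_scores:
        if token.startswith(continuation_prefix) and words:
            # Glue the continuation piece onto the previous word.
            prev_token, prev_score = words[-1]
            words[-1] = (prev_token + token[len(continuation_prefix):],
                         prev_score + score)
        else:
            words.append((token, score))
    return words


# Values copied from the English output above.
example = [('often', 5.8594160079956055),
           ('symbol', 9.485310554504395),
           ('##ised', 6.920506000518799),
           ('changes', 4.574926853179932)]
merged = merge_wordpieces(example)
print(merged)
```

Punctuation marks remain separate entries with this approach, since the tokenizer emits them without the "##" prefix.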
Hello,
Thank you for making this amazing work available! I am wondering if the code works for languages other than English. I was following your example and was able to load English and Japanese models without an error.
model_jp = scorer.IncrementalLMScorer("colorfulscoop/gpt2-small-ja", 'cpu')
model_en = scorer.IncrementalLMScorer("gpt2", 'cpu')
But when I ran the line model_jp.logprobs(model_jp.prepare_text([text])) with the Japanese model, it threw the following error:
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in tokens(self, batch_index)
293 """
294 if not self._encodings:
--> 295 raise ValueError("tokens() is not available when using Python-based tokenizers")
296 return self._encodings[batch_index].tokens
297
ValueError: tokens() is not available when using Python-based tokenizers
I would appreciate it if you could point me to any possible solutions, and I apologize if this is a very basic question. Thank you!
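The traceback shows the error coming from the `tokens()` method of the encoding, which transformers only provides for "fast" tokenizers. A sketch of a fallback that maps ids back to token strings with `convert_ids_to_tokens` when `tokens()` is unavailable; `tokens_for` is a hypothetical helper, not part of minicons, and where such a fallback would need to be wired in is not addressed here.

```python
def tokens_for(encoding, tokenizer, batch_index=0):
    """Return token strings for one batch item, with a fallback for
    Python-based (slow) tokenizers."""
    try:
        # Available only when the tokenizer is a fast (Rust-backed) one.
        return encoding.tokens(batch_index)
    except ValueError:
        ids = encoding["input_ids"][batch_index]
        # Tensors and arrays expose tolist(); plain lists are used as-is.
        ids = ids.tolist() if hasattr(ids, "tolist") else list(ids)
        return tokenizer.convert_ids_to_tokens(ids)
```

Here `encoding` stands for the object returned by calling a Hugging Face tokenizer on the text, and `tokenizer` for the tokenizer itself.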
Hi! I have a question about extracting word representations.
If the sentence has two target words, for example, "There are two books. One is mine, the other one is yours.", when using model.extract_representation(['There are two books. The red one is mine, and the other one is yours.', 'one'], layer = 12), which representation is extracted? Is it the representation of the first "one"?
Many thanks!
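Which occurrence is used is not stated in this thread. To refer to a specific occurrence unambiguously, one option is to compute the character span of every match, sketched below with the standard library only. Whether `extract_representation` accepts such a span in place of the word is an assumption to check against the minicons documentation; `occurrence_spans` is a hypothetical helper.

```python
import re


def occurrence_spans(sentence, word):
    """Return (start, end) character offsets of each whole-word match."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return [match.span() for match in pattern.finditer(sentence)]


sentence = "There are two books. The red one is mine, and the other one is yours."
spans = occurrence_spans(sentence, "one")
for start, end in spans:
    print(start, end, repr(sentence[start:end]))
```

The match is case-sensitive, so a sentence-initial "One" would need a separate lookup or the `re.IGNORECASE` flag.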
Hi Kanishka, this is a great tool! I was wondering if you could link to the documentation (https://minicons.kanishka.website/) in the README? I actually found it through a Google search on computing word-level surprisals out of the box, and couldn't find it linked on the README when I was trying to re-reference it. The tutorials are nice, but it's also helpful to have the docs on hand here as a quick reference.
Thanks for your consideration!
Consider this section of code from IncrementalLMScorer:
## Ignore the probabilities of the first token.
effective_ids = [id[1:] for id in ids]
If I'm understanding this correctly, the class is discarding the probability the model assigns to the first token in every element of a batch. I understand why such logic would make sense in the context of a model that uses a BOS token; but does this mean that this class is unusable for models that do not use a BOS token? It is not at all clear from the docs that this class is only meant to be used with BOS token models.
A BOS token is mentioned in other places in the code, for example as a Boolean argument to prepare_text, but in such places it's clearly marked as optional, with the default being False. So I'm a little confused by the lack of such optionality in the function above.
Am I understanding this correctly? If so, is there a workaround (besides reduplication of the code)?
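To illustrate the point being asked about: in a left-to-right model each position is predicted from the tokens before it, so the first token has no context to be scored against unless something such as a BOS token is prepended. A toy sketch of that alignment with no model involved; `scored_positions` and the `"<bos>"` string are hypothetical.

```python
def scored_positions(tokens, bos=None):
    """List the (context, target) pairs a left-to-right model can score."""
    sequence = ([bos] if bos is not None else []) + list(tokens)
    # Index 0 has no preceding context, so scoring starts at index 1.
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


tokens = ["I", "like", "this", "package", "."]
without_bos = scored_positions(tokens)
with_bos = scored_positions(tokens, bos="<bos>")
print(len(without_bos), [target for _, target in without_bos])
print(len(with_bos), [target for _, target in with_bos])
```

Without a prepended token, five input tokens yield four scored targets; with one, all five are scored.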
Hi, I couldn't get the surprisals example at https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md to work:
>>> import minicons
>>> import torch
>>> from torch.utils.data import DataLoader
>>>
>>> import json
>>> model = scorer.IncrementalLMScorer('gpt2', 'cpu')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'scorer' is not defined
>>>
It doesn't work with minicons.scorer either.
I'm not sure where scorer is supposed to come from...
Sorry if it's just me doing something stupid. My Python abilities are mostly from copy & paste (I'm an R person).
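The NameError arises because `import minicons` binds only the name `minicons`; other snippets in this thread obtain `scorer` with `from minicons import scorer`. A minimal sketch, assuming minicons is installed. The import sits inside a hypothetical `load_scorer` function only so that the snippet can be defined without downloading a model.

```python
def load_scorer(model_name="gpt2", device="cpu"):
    """Import the scorer submodule explicitly, then build the scorer."""
    # This import is what makes the name `scorer` available.
    from minicons import scorer
    return scorer.IncrementalLMScorer(model_name, device)
```

In an interactive session the equivalent is to run `from minicons import scorer` before the line that creates the model.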
I am trying to calculate the surprisal value by feeding in a txt file with about 5000 sentences, but I encounter this error message: "IndexError: index out of range in self". Can anyone help with this issue?
Expected behavior:
I would like to have the surprisal value for each word for the whole text file.
Thank you!
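No traceback is given, so the cause is unknown here. In PyTorch, "index out of range in self" is typically raised by an embedding lookup, for instance when an input has more positions than the model supports, so a single overly long line in the file is one plausible culprit. A sketch for locating the offending lines by scoring one sentence at a time; `score_fn` is a placeholder for whatever scoring call is in use, and `find_failing_sentences` is hypothetical.

```python
def find_failing_sentences(sentences, score_fn):
    """Score sentences one by one and collect those that raise IndexError."""
    results, failures = {}, []
    for index, sentence in enumerate(sentences):
        try:
            results[index] = score_fn([sentence])
        except IndexError as error:
            # Record position, rough length in words, and the message.
            failures.append((index, len(sentence.split()), str(error)))
    return results, failures
```

The word counts in `failures` give a quick hint as to whether length is the common factor among the failing lines.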
When I try to get probabilities for "Banana is a word." the tokenization is messed up:
['Ban', 'ana', 'Ġis', 'Ġa', 'Ġword', '.']
(Do you know why there is no "Ġ" for "Ban", "ana"?)
If I include an initial space, it works fine: ['ĠBanana', 'Ġis', 'Ġa', 'Ġword', '.']
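The "Ġ" is how GPT-2's byte-level BPE displays a leading space, so a word at the very start of the text (with no space before it) is a different string from the same word in mid-sentence and can be split differently. The sketch below rebuilds the byte-to-character table in the style used by GPT-2's tokenizer and applies it after a simplified pre-tokenization regex. The real tokenizer uses a more detailed pattern and then applies BPE merges, neither of which is reproduced here.

```python
import re


def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    chars = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            # Non-printable bytes (including space) are shifted past 255.
            printable.append(byte)
            chars.append(256 + shift)
            shift += 1
    return dict(zip(printable, map(chr, chars)))


BYTE_TABLE = bytes_to_unicode()


def pre_tokenize(text):
    """Simplified split that keeps a leading space attached to the next word."""
    pieces = re.findall(r" ?\w+| ?[^\s\w]+|\s+", text)
    return ["".join(BYTE_TABLE[b] for b in piece.encode("utf-8")) for piece in pieces]


print(pre_tokenize("Banana is a word."))
print(pre_tokenize(" Banana is a word."))
```

Under this view "Banana" and "ĠBanana" are distinct strings. Judging from the outputs in the post, the vocabulary has a single entry for the latter but splits the former into "Ban" + "ana".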
Hi, I tried to compute this with a sentence of 350 words and got GPU OOM.
mlm_model = scorer.MaskedLMScorer('xlm-roberta-base', 'cuda')
mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item())
Is this normal?
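Masked-LM sequence scoring usually masks one token at a time, so a sequence of n tokens needs n masked copies of length n; if all copies go through the model in one batch, memory grows roughly quadratically with sentence length. Whether MaskedLMScorer batches them this way is an assumption here. The sketch shows the general idea of producing the masked copies lazily and handling them in smaller chunks; `masked_variants` and `in_chunks` are hypothetical helpers and no model is involved.

```python
from itertools import islice


def masked_variants(token_ids, mask_id):
    """Yield (position, copy of token_ids with that position masked)."""
    for position in range(len(token_ids)):
        yield position, token_ids[:position] + [mask_id] + token_ids[position + 1:]


def in_chunks(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


token_ids = list(range(1000, 1010))  # stand-in ids for a 10-token sentence
chunks = list(in_chunks(masked_variants(token_ids, mask_id=0), size=4))
for chunk in chunks:
    print([position for position, _ in chunk])
```

Each chunk could then be scored separately and the per-token results concatenated, trading speed for a smaller peak memory footprint.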
Hi,
Transformer models like "bert-base-multilingual-uncased" can be used in "minicons" to compute token surprisals or probabilities for different languages, given a text in that language as input. This works for English, German, Spanish and some other alphabet-based languages.
However, it doesn't seem to work for Chinese. As you know, Chinese text is usually pre-processed with word segmentation. Even if the input Chinese text is already segmented into words (combinations of two, three or more characters), the output still gives the surprisal of each Chinese character in the text rather than of each segmented word. I am not sure how to compute surprisal values for Chinese words.
Many thanks!
The following is the example:
from minicons import scorer
import torch
from torch.utils.data import DataLoader
import numpy as np
import json
model = scorer.IncrementalLMScorer('bert-base-multilingual-uncased', 'cpu')
Using bos_token, but it is not set yet.
sentences = ["我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜", "他们 那边 都 是 小 小葱 包括 重庆 那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('[CLS]', 0.0),
('我', 16.780792236328125),
('昨', 18.67901039123535),
('天', 29.759370803833008),
('下', 39.109107971191406),
('午', 33.43532943725586),
('我', 34.247886657714844),
('就', 23.704923629760742),
('是', 25.778093338012695),
('直', 31.338485717773438),
('接', 28.79427146911621),
('买', 44.60960388183594),
('了', 30.6632022857666),
('一', 25.942493438720703),
('份', 44.91115188598633),
('那', 35.40247344970703),
('个', 37.76634979248047),
('凉', 35.126708984375),
('菜', 11.82837963104248),
('[SEP]', 32.64777755737305)],
[('[CLS]', 0.0),
('他', 15.437037467956543),
('们', 10.030117988586426),
('那', 30.752634048461914),
('边', 45.248435974121094),
('都', 20.54657745361328),
('是', 27.90602684020996),
('小', 31.462167739868164),
('小', 2.2013779016560875e-05),
('葱', 17.992713928222656),
('包', 13.990900039672852),
('括', 34.425636291503906),
('重', 31.417207717895508),
('庆', 23.46117401123047),
('那', 34.11079788208008),
('边', 42.18030548095703),
('[SEP]', 36.32227325439453)]]
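The output above shows one score per character even though the input is segmented with spaces. A sketch of regrouping the character scores into the space-delimited words of the input by summing them, assuming exactly one token per character as in the lists above. `regroup_by_words` is a hypothetical helper, summation is only one possible aggregation, and special tokens such as [CLS] and [SEP] are skipped. Note also that this example loads a BERT model through IncrementalLMScorer, whereas the other BERT examples in this thread use MaskedLMScorer, which may affect the values.

```python
def regroup_by_words(sentence, token_scores, special=("[CLS]", "[SEP]")):
    """Sum per-character scores back into the space-separated words."""
    scores = [(tok, s) for tok, s in token_scores if tok not in special]
    words, cursor = [], 0
    for word in sentence.split():
        pieces = scores[cursor:cursor + len(word)]
        if "".join(tok for tok, _ in pieces) != word:
            raise ValueError(f"tokens do not line up with word {word!r}")
        words.append((word, sum(s for _, s in pieces)))
        cursor += len(word)
    return words


# First three words of the first sentence above, with scores from the post.
token_scores = [('[CLS]', 0.0), ('我', 16.780792236328125),
                ('昨', 18.67901039123535), ('天', 29.759370803833008),
                ('下', 39.109107971191406), ('午', 33.43532943725586)]
word_scores = regroup_by_words("我 昨天 下午", token_scores)
print(word_scores)
```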
Thank you for providing this package which works well for me. I have been using it in a Jupyter notebook but I encountered some problems when trying to use it in an anaconda virtual environment.
I tried to pip install minicons in the virtual environment, but it results in the kernel crashing. Is there a way to conda install minicons?
Hi,
This issue is a question. I'm using the German GPT-2 model dbmdz/german-gpt2 to get log-probability and surprisal scores for each token.
The log-probabilities are quite low given that the sentence is a grammatical German sentence. Below is my code and a comparison between a German and an English sentence with the same model.
from minicons import scorer
gpt_model_scorer = scorer.IncrementalLMScorer("dbmdz/german-gpt2", "cpu")
log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."])
# results for German
# ('Der', 0.0)
# ('Mensch', -102.05498504638672)
# ('sammelt', -101.456787109375)
# ('die', -95.31419372558594)
# ('unterschiedlichsten', -98.86357116699219)
# ('Gegenstände', -89.14930725097656)
# ('.', -88.57086181640625)
log_probs_english = gpt_model_scorer.token_score(["The man collects various items."])
# results for English
# ('The', 0.0)
# ('man', -95.54633331298828)
# ('colle', -68.77188110351562)
# ('cts', -36.030174255371094)
# ('v', -87.14112854003906)
# ('ario', -44.695987701416016)
# ('us', -45.50498962402344)
# ('items', -79.1251449584961)
# ('.', -73.40552520751953)
Am I using your code in the intended way? Is the issue the GPT-2 model?
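As a rough sanity check on the magnitudes (a sketch, not a diagnosis): a model that guessed uniformly over its vocabulary would assign every token a natural-log probability of log(1/V). The vocabulary size below is an assumed round figure and was not looked up for dbmdz/german-gpt2.

```python
import math

ASSUMED_VOCAB_SIZE = 50_000  # assumption: order of magnitude of a GPT-2 style vocabulary
uniform_logprob = math.log(1 / ASSUMED_VOCAB_SIZE)

# German scores copied from the post above (the first token, reported as 0.0, is left out).
german = [-102.05498504638672, -101.456787109375, -95.31419372558594,
          -98.86357116699219, -89.14930725097656, -88.57086181640625]

mean_logprob = sum(german) / len(german)
below = sum(lp < uniform_logprob for lp in german)
print(f"uniform baseline: {uniform_logprob:.2f}")
print(f"mean reported log-probability: {mean_logprob:.2f}")
print(f"tokens below baseline: {below} of {len(german)}")
```

Every reported value is far below the uniform baseline (about -10.8 under this assumption), which suggests looking at how the scores are computed for this model rather than at the sentence itself.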
Is it called 'minicons' because minicons are a type of small helpful mech in the fictional Transformers universe?
Hi,
I was giving minicons another try. I just installed it using pip, but I'm not sure if that's the latest version.
The readme shows that scorer allows for sequence_score, the website minicons.kanishka.website has compute_stats, but in my version I only have seq_score. Have I installed the latest version?
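To compare the installed version against the latest release, the version string can be read with the standard library (Python 3.8 or later). Which method names belong to which release is not stated in this thread; `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("minicons"))
```

If the printed version is older than the newest release, `pip install --upgrade minicons` is the usual way to update.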