The sent-bias from w4ngatang

sent-bias's Issues

SEAT score of bert-large-cased seems wrong

Dear authors,
It seems the reported SEAT score of bert-large-cased is wrong.

I was able to reproduce the reported results with the current code base, but in doing so I found two errors in the code.

1. Even though I load bert-large-cased, the tokenizer lowercases every token.

import pytorch_pretrained_bert as bert

version = 'bert-large-cased'

tokenizer = bert.BertTokenizer.from_pretrained(version)
text = 'SEAT score of bert-large-CASED'
tokenized = tokenizer.tokenize(text)  # ['seat', 'score', 'of', 'be', '##rt', '-', 'large', '-', 'case', '##d']
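
A possible workaround, sketched under the assumption that the installed pytorch_pretrained_bert version lowercases by default regardless of the checkpoint name: pass do_lower_case explicitly when loading the tokenizer. The helper names should_lower_case and load_tokenizer are hypothetical and not part of sent-bias.

```python
def should_lower_case(version):
    """Return False for cased checkpoints such as 'bert-large-cased'.

    Assumption: cased checkpoints are identified by '-cased' in their name.
    'bert-large-uncased' does not match, because its hyphen is followed by 'un'.
    """
    return "-cased" not in version


def load_tokenizer(version="bert-large-cased"):
    """Load a BERT tokenizer whose casing matches the checkpoint name."""
    # Imported lazily so that should_lower_case can be used without the library.
    import pytorch_pretrained_bert as bert

    # do_lower_case is forwarded to the BertTokenizer constructor.
    return bert.BertTokenizer.from_pretrained(
        version, do_lower_case=should_lower_case(version)
    )
```

With this change, load_tokenizer('bert-large-cased') should keep the original casing of the input text, while uncased checkpoints behave as before.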

2. The score is computed from the embedding of the first word-piece token of the text, not from a [CLS] token.

Below is sentbias/encoders/bert.py. Note that the text is never prepended with a [CLS] token, so index 0 of the output is the first token of the text itself.

''' Convenience functions for handling BERT '''
import torch
import pytorch_pretrained_bert as bert


def load_model(version='bert-large-uncased'):
    ''' Load BERT model and corresponding tokenizer '''
    tokenizer = bert.BertTokenizer.from_pretrained(version)
    model = bert.BertModel.from_pretrained(version)
    model.eval()

    return model, tokenizer


def encode(model, tokenizer, texts):
    ''' Use tokenizer and model to encode texts '''
    encs = {}
    for text in texts:
        tokenized = tokenizer.tokenize(text)  # <<< BUG: a [CLS] token should be prepended
        indexed = tokenizer.convert_tokens_to_ids(tokenized)
        segment_idxs = [0] * len(tokenized)
        tokens_tensor = torch.tensor([indexed])
        segments_tensor = torch.tensor([segment_idxs])
        enc, _ = model(tokens_tensor, segments_tensor, output_all_encoded_layers=False)

        enc = enc[:, 0, :]  # last-layer representation of the first token in the sequence
        encs[text] = enc.detach().view(-1).numpy()
    return encs
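
A minimal sketch of how encode could be changed so that index 0 really is the [CLS] position. It assumes the usual single-sentence BERT input format ([CLS] tokens [SEP]); the helper build_inputs is hypothetical and not part of sent-bias, and this is a suggestion rather than the authors' fix.

```python
def build_inputs(tokenizer, text):
    """Wrap the word pieces in [CLS] ... [SEP] and build matching segment ids."""
    tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(tokens)  # single sentence, so every token is segment 0
    return tokens, ids, segment_ids


def encode(model, tokenizer, texts):
    """Encode each text as the last-layer hidden state at the [CLS] position."""
    # Imported lazily so that build_inputs can be used without torch installed.
    import torch

    encs = {}
    with torch.no_grad():  # inference only, no gradients needed
        for text in texts:
            _, ids, segment_ids = build_inputs(tokenizer, text)
            tokens_tensor = torch.tensor([ids])
            segments_tensor = torch.tensor([segment_ids])
            enc, _ = model(tokens_tensor, segments_tensor,
                           output_all_encoded_layers=False)
            # Position 0 is now [CLS] rather than the first word piece.
            encs[text] = enc[:, 0, :].view(-1).numpy()
    return encs
```

Since every embedding changes under this input format, the SEAT scores would need to be recomputed before comparing them with the reported numbers.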