Hi, thank you for your excellent work! Could you please tell a…

Processing Reference Corpus [details] about exbert HOT 2 CLOSED

bhoov commented on July 18, 2024


from exbert.

Comments (2)

bhoov commented on July 18, 2024

Sure, I can clarify briefly. A "processed corpus" is a smaller corpus (typically not the training corpus) that can be fully tokenized and fed through a trained model (say, GPT, as you mentioned). The corpus is fed sentence by sentence into GPT for inference, and we save the hidden states and some additional information about each sentence. These pieces of information are:

The attention matrix for each input sentence at each head of each layer
The embedding of each token after each layer
The "context" (that is, the representation of each token from the perspective of each head, before the linear projection that will turn all the head information into the embedding for the next layer)
Linguistic metadata about each token (e.g., part of speech, dependency, and other metadata that spaCy provides)
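To make the first three items in the list above concrete, here is a minimal toy sketch of one attention layer in plain Python. It is not the exbert code: the function name `layer_record`, the dictionary keys, and the simplification that queries, keys and values are just slices of the input (no learned projections, residuals, layer norm or MLP) are all assumptions made for illustration. It only shows where the attention matrix, the per-head "context" and the projected embedding sit relative to each other.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_record(x, n_heads, w_out):
    """Toy layer (hypothetical helper): x is seq_len x hidden, w_out is hidden x hidden.

    Simplification: each head uses its slice of x as query, key and value.
    """
    hidden = len(x[0])
    head_dim = hidden // n_heads
    attentions, head_contexts = [], []
    for h in range(n_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_x = [row[lo:hi] for row in x]
        scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                   for k in head_x] for q in head_x]
        weights = [softmax(r) for r in scores]         # attention matrix of this head
        attentions.append(weights)
        head_contexts.append(matmul(weights, head_x))  # this head's view of each token
    # "Context": per-head outputs concatenated per token, BEFORE the output projection.
    context = [sum((hc[i] for hc in head_contexts), []) for i in range(len(x))]
    # The linear projection mixes the heads (a real layer adds more steps after this).
    embedding = matmul(context, w_out)
    return {"attention": attentions, "context": context, "embedding": embedding}

# Three tokens, hidden size 4, two heads, identity output projection.
sentence = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
record = layer_record(sentence, n_heads=2, w_out=identity)
print(len(record["attention"]), len(record["context"]), len(record["context"][0]))
```

In a real model this record would be collected for every layer, which is why the context and the embedding are stored separately: the context keeps the heads distinguishable, while the embedding is what the next layer actually receives.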

As you can imagine, the HDF5 files that hold all this information can grow quite large for larger corpora and models. There is a README here that describes the code that performs this processing.
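A rough back-of-the-envelope estimate shows why the files grow quickly. The sketch below is only an assumption-laden calculation, not a description of the actual file layout: the function name `record_bytes` is hypothetical, the defaults (12 layers, 12 heads, hidden size 768) match a GPT-sized model, values are assumed to be stored as uncompressed 32-bit floats, and the linguistic metadata and HDF5 overhead are ignored.

```python
def record_bytes(n_tokens, n_layers=12, n_heads=12, hidden=768, bytes_per_value=4):
    """Estimate the raw storage for one sentence (hypothetical helper).

    Counts the three numeric items saved per sentence:
    attention matrices, per-layer embeddings and per-layer contexts.
    """
    attentions = n_layers * n_heads * n_tokens * n_tokens  # one n x n matrix per head
    embeddings = n_layers * n_tokens * hidden              # one vector per token per layer
    contexts = n_layers * n_tokens * hidden                # heads * head_dim == hidden
    return (attentions + embeddings + contexts) * bytes_per_value

per_sentence = record_bytes(20)
print(per_sentence)            # bytes for one 20-token sentence
print(per_sentence * 10_000)   # bytes for a 10,000-sentence corpus
```

Under these assumptions a 20-token sentence takes about 1.7 MB, so a corpus of 10,000 such sentences would need roughly 17 GB before any compression.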

Your assumption (point 2) is correct: since we are not training the model, we don't need to force any kind of task on GPT, and we do not want to use any token predicted by GPT. We do, however, keep the attention mask for every token, so that the embeddings for these autoregressive models can only be created from information in preceding word tokens.
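The effect of keeping the mask can be checked with a small toy example. This is a sketch in plain Python, not exbert's or GPT's implementation: `causal_attention` is a hypothetical single-head attention with no learned weights (query, key and value are the input itself), and each position is allowed to see itself and the positions before it.

```python
import math

def causal_attention(x):
    """Single-head self-attention with q = k = v = x and a causal mask (toy sketch)."""
    scale = math.sqrt(len(x[0]))
    outputs = []
    for i, q in enumerate(x):
        visible = x[: i + 1]  # mask: position i never sees positions after i
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, visible))
                        for d in range(len(q))])
    return outputs

# Two "sentences" that differ only in their last token.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]
out_a, out_b = causal_attention(a), causal_attention(b)
print(out_a[:2] == out_b[:2])  # earlier positions are unaffected by the later token
```

Because the first two positions never look at the third token, their outputs are identical for both inputs, which is the property the saved embeddings rely on.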


nilinykh commented on July 18, 2024

Thank you a lot for the explanation! It makes sense =)

