Batch does not carry index (pytorch/text, 5 comments, closed)

pytorch commented on August 26, 2024
Batch does not carry index

from text.

Comments (5)

honnibal commented on August 26, 2024

> We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you would double the number of entries in the vocab in theory. Of course the extra bit introduces little extra entropy given the word ID.

So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
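The (orth, has_space) scheme can be sketched in a few lines. This is a toy whitespace/punctuation tokenizer made up for illustration, not spaCy's actual implementation; the point is only that recording one trailing-space bit per token makes the round trip exact:

```python
import re

def tokenize_with_spaces(text):
    # Toy tokenizer: split into word and punctuation tokens, recording
    # whether a single space follows each token in the original text.
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        end = m.end()
        has_space = end < len(text) and text[end] == " "
        tokens.append((m.group(), has_space))
    return tokens

def detokenize(pairs):
    # Rebuild the original string from (token, has_space) pairs.
    return "".join(tok + (" " if space else "") for tok, space in pairs)

text = "Hello, world! It works."
assert detokenize(tokenize_with_spaces(text)) == text
```

This toy version loses runs of multiple spaces or tabs; spaCy avoids that by never discarding characters, which is why its tokenization stays fully reversible.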

In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!

I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy.

While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.


honnibal commented on August 26, 2024

I think in general changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy Doc objects, you could have a better solution.

The spaCy Doc object holds a TokenC* array, and each TokenC struct holds a const pointer to a LexemeC. The lexemes are vocabulary items, and they have a number of integer fields. This means you can register string transforms that are computed once over the vocabulary, with the results available to each lexical item.

There's currently a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. But even with the missing method, the code isn't too bad. I'll show usage where the tokenization is provided by e.g. NLTK, to demonstrate how this can be used without the rest of spaCy's stuff:

import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV

def align_tokens(words, text):
    # Recover whether a single space follows each token in the original text.
    # (Simple version; NLTK may normalize some tokens, e.g. quotes, in which
    # case this lookup will fail.)
    spaces = []
    i = 0
    for word in words:
        i = text.index(word, i) + len(word)
        spaces.append(i < len(text) and text[i] == ' ')
    return spaces

def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov getter.
    # We also assign a getter for IS_OOV.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but NLTK
        # destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text-string, and whether a
        # space followed. The tokenizer cannot change the text, only decide where
        # to split. We also don't throw away characters other than ' '. This means
        # we never lose alignment.
        spaces = align_tokens(words, text)
        return Doc(vocab, words=words, spaces=spaces)
    return tokenize

def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # doc.to_array produces a numpy array of uint64 values.
    # You could also export LENGTH, SPACE; the cumulative sum of both columns
    # gives you the starting index of each token in the string.
    array = doc.to_array([NORM])
    return array
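The LENGTH/SPACE remark can be made concrete with a small numpy sketch. The token lengths and space flags below are written out by hand for the text "Hello, world!" rather than exported from a Doc:

```python
import numpy as np

# Per-token character lengths and trailing-space flags for "Hello, world!":
lengths = np.array([5, 1, 5, 1])   # "Hello", ",", "world", "!"
spaces  = np.array([0, 1, 0, 0])   # only "," is followed by a space

# The cumulative sum of (length + space) up to, but excluding, each token
# gives that token's starting character offset in the original string.
starts = np.concatenate(([0], np.cumsum(lengths + spaces)[:-1]))
print(starts)  # [0, 5, 7, 12]
```

Checking against the string: "Hello" starts at 0, "," at 5, "world" at 7 (index 6 is the space), and "!" at 12.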


jekbradbury commented on August 26, 2024

So I really like the spaCy tokenizer/doc API, but if we went that route we'd probably need to be able to reconstruct Doc objects from the output of (e.g.) Seq2Seq decoders. We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

So what I'm trying to do now is build a fully reversible tokenizer -- one that works for any language with spaces (and any other language when coupled with a BPE algorithm) -- and allow the raw source text to be fully reconstructed from a list of indices, whether those are indices from data or indices from a model. Check out revtok in the reversible branch; I'll add the subword stuff soon.

The overall idea is to augment the existing Field API with a ReversibleField subclass that additionally exposes inverses of all the processing methods; I think that will solve the main question here. (If you just want a batch of tags to let you retrieve something extra/nontextual from the original data sample, like an image, then include that tag as a vocabless Field.)



nelson-liu commented on August 26, 2024

> In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption.

+1, I don't think that this is the right thing to do at the moment.


jekbradbury commented on August 26, 2024

For passing a gradient back to PyTorch, var.backward has an optional grad_output argument that allows you to inject a gradient at a specific place in the computation graph. If you want to inject several gradients, you can use torch.autograd.backward((var_1, var_2), (grad_1, grad_2)), I believe.
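A minimal sketch of that gradient injection, using the modern tensor API (the Variable wrapper this thread predates has since been merged into Tensor):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # stand-in for a model whose output feeds code outside PyTorch

# Instead of reducing y to a scalar loss, inject an externally computed
# gradient (e.g. one handed back from another library) directly at y:
upstream = torch.tensor([0.1, 1.0, 10.0])
y.backward(upstream)

print(x.grad)  # chain rule: dy/dx * upstream = 2 * upstream
```

For several injection points at once, torch.autograd.backward accepts matching sequences of tensors and gradients, as the comment says.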

