Batch does not carry index (pytorch/text, 5 comments, closed)

pytorch commented on August 26, 2024
Batch does not carry index

from text.

Comments (5)

honnibal commented on August 26, 2024

> We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you would double the number of entries in the vocab in theory. Of course the extra bit introduces little extra entropy given the word ID.

So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
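The (orth, has_space) scheme can be sketched in a few lines. This is a toy whitespace/punctuation tokenizer made up for illustration, not spaCy's actual implementation; the point is only that recording one trailing-space bit per token makes the round trip exact:

```python
import re

def tokenize_with_spaces(text):
    # Toy tokenizer: split into word and punctuation tokens, recording
    # whether a single space follows each token in the original text.
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        end = m.end()
        has_space = end < len(text) and text[end] == " "
        tokens.append((m.group(), has_space))
    return tokens

def detokenize(pairs):
    # Rebuild the original string from (token, has_space) pairs.
    return "".join(tok + (" " if space else "") for tok, space in pairs)

text = "Hello, world! It works."
assert detokenize(tokenize_with_spaces(text)) == text
```

This toy version loses runs of multiple spaces or tabs; spaCy avoids that by never discarding characters, which is why its tokenization stays fully reversible.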

In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!

I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy.

While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.


honnibal commented on August 26, 2024

I think in general changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy Doc objects, you could have a better solution.

The spaCy Doc object holds a TokenC* array, and each TokenC struct holds a const pointer to a LexemeC. The lexemes are vocabulary items, and they have a number of integer fields. This means you can register string transforms that are computed once over the vocabulary, with the results available to each lexical item.

There's currently a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. But even with the missing method, the code isn't too bad. I'll show usage where the tokenization is provided by e.g. NLTK, to demonstrate how this can be used without the rest of spaCy's stuff:

import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV

def align_tokens(words, text):
    # Recover whether a single space follows each token in the original text.
    # (Simple version; NLTK may normalize some tokens, e.g. quotes, in which
    # case this lookup will fail.)
    spaces = []
    i = 0
    for word in words:
        i = text.index(word, i) + len(word)
        spaces.append(i < len(text) and text[i] == ' ')
    return spaces

def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov getter.
    # We also assign a getter for IS_OOV.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but NLTK
        # destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text-string, and whether a
        # space followed. The tokenizer cannot change the text, only decide where
        # to split. We also don't throw away characters other than ' '. This means
        # we never lose alignment.
        spaces = align_tokens(words, text)
        return Doc(vocab, words=words, spaces=spaces)
    return tokenize

def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # doc.to_array produces a numpy array of uint64 values.
    # You could also export LENGTH, SPACE; the cumulative sum of both columns
    # gives you the starting index of each token in the string.
    array = doc.to_array([NORM])
    return array
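The LENGTH/SPACE remark can be made concrete with a small numpy sketch. The token lengths and space flags below are written out by hand for the text "Hello, world!" rather than exported from a Doc:

```python
import numpy as np

# Per-token character lengths and trailing-space flags for "Hello, world!":
lengths = np.array([5, 1, 5, 1])   # "Hello", ",", "world", "!"
spaces  = np.array([0, 1, 0, 0])   # only "," is followed by a space

# The cumulative sum of (length + space) up to, but excluding, each token
# gives that token's starting character offset in the original string.
starts = np.concatenate(([0], np.cumsum(lengths + spaces)[:-1]))
print(starts)  # [0, 5, 7, 12]
```

Checking against the string: "Hello" starts at 0, "," at 5, "world" at 7 (index 6 is the space), and "!" at 12.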


jekbradbury commented on August 26, 2024

So I really like the spaCy tokenizer/doc API, but if we went that route we'd probably need to be able to reconstruct Doc objects from the output of (e.g.) Seq2Seq decoders. We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

So what I'm trying to do now is build a fully reversible tokenizer -- one that works for any language with spaces (and any other language when coupled with a BPE algorithm) -- and allow the raw source text to be fully reconstructed from a list of indices, whether those are indices from data or indices from a model. Check out revtok in the reversible branch; I'll add the subword stuff soon.

The overall idea is to augment the existing Field API with a ReversibleField subclass that additionally exposes inverses of all the processing methods; I think that will solve the main question here. (If you just want a batch of tags to let you retrieve something extra/nontextual from the original data sample, like an image, then include that tag as a vocabless Field.)



nelson-liu commented on August 26, 2024

> In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption.

+1, I don't think that this is the right thing to do at the moment.


jekbradbury commented on August 26, 2024

For passing a gradient back to PyTorch, var.backward has an optional grad_output argument that allows you to inject a gradient at a specific place in the computation graph. If you want to inject several gradients, you can use torch.autograd.backward((var_1, var_2), (grad_1, grad_2)), I believe.
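A minimal sketch of that gradient injection, using the modern tensor API (the Variable wrapper this thread predates has since been merged into Tensor):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # stand-in for a model whose output feeds code outside PyTorch

# Instead of reducing y to a scalar loss, inject an externally computed
# gradient (e.g. one handed back from another library) directly at y:
upstream = torch.tensor([0.1, 1.0, 10.0])
y.backward(upstream)

print(x.grad)  # chain rule: dy/dx * upstream = 2 * upstream
```

For several injection points at once, torch.autograd.backward accepts matching sequences of tensors and gradients, as the comment says.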

