Comments (5)
> We could only do this if all info in a `Doc` object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the `Doc` here.
Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you could fold the space bit into the ID, which would in theory double the number of entries in the vocab. Of course, the extra bit introduces little extra entropy given the word ID.
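As a minimal sketch (toy vocab and helper names of my own invention, not spaCy's API), the (orth_id, has_space) scheme inverts like this:

```python
# Toy illustration of reversible tokenization via (orth_id, has_space) pairs.
vocab = ["hello", ",", "world", "!"]  # hypothetical id -> string table

def detokenize(pairs):
    """Rebuild the exact original text from (orth_id, has_space) pairs."""
    return "".join(vocab[i] + (" " if has_space else "")
                   for i, has_space in pairs)

pairs = [(0, False), (1, True), (2, False), (3, False)]
print(detokenize(pairs))  # hello, world!

# Folding the space bit into a single int doubles the id space:
# encoded = orth_id * 2 + has_space; decoded = (n // 2, bool(n % 2)).
```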
So, spaCy's tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like :). It doesn't have to change your user-facing API, I don't think.
In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn't work, though!
I'm planning to add PyTorch tensors as a back-end option for thinc, in addition to CuPy. I also need to write examples of hooking PyTorch models into spaCy.
While I'm here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.
---
I think in general changing the text is a bad thing. If you're happy to have all your tokenizers output spaCy `Doc` objects, you could have a better solution.

The spaCy `Doc` object holds a `TokenC*` array, and each `TokenC` struct holds a const pointer to a `LexemeC`. The lexemes are vocabulary items, and they have a number of integer fields. This means you can register string transforms that are computed once over the vocabulary, with the results available to each lexical item.
There's currently a small gap in the API around this -- there's a method to register a new boolean feature flag, but not to register a new string feature. But even with the missing method, the code isn't too bad. Here's a usage sketch where the tokenization is provided by e.g. NLTK, to show how this can be used without the rest of spaCy:
```python
import nltk
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.attrs import NORM, IS_OOV

def make_tokenizer(vocab_words, represent_oov):
    vocab = Vocab(lex_attr_getters={NORM: represent_oov})
    for text in vocab_words:
        lex = vocab[text]
        lex.norm_ = text
        lex.is_oov = False  # Writing to Lexemes updates the vocab.
    # All other words will get their NORM via the represent_oov getter.
    # We also assign a getter for IS_OOV.
    vocab.lex_attr_getters[IS_OOV] = lambda text: True

    def tokenize(text):
        words = nltk.word_tokenize(text)
        # If you use spaCy's tokenizer you won't have to do this part, but
        # NLTK destroys the alignment. Boo.
        # In spaCy each Token knows the length of its text string and whether
        # a space followed: the tokenizer cannot change the text, only decide
        # where to split. We also don't throw away characters other than ' '.
        # This means we never lose alignment.
        spaces = align_tokens(words, text)  # alignment helper left undefined here
        return Doc(vocab, words=words, spaces=spaces)
    return tokenize

def works_in_theory_untested(vocab_list, text):
    tokenizer = make_tokenizer(vocab_list, lambda text: '<UNK>')
    doc = tokenizer(text)
    for word in doc:
        print(word.text, word.norm_)
    # Produces a numpy array of uint64 values.
    # You could also export LENGTH, SPACE. Then the cumulative sum of both
    # columns will give you the starting index of each token in the string.
    array = doc.to_array([NORM])
    return array
```
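To make the LENGTH/SPACE offset trick concrete, here's a sketch with made-up numbers for the text "hello world!": the running total of characters consumed before each token is that token's start index.

```python
import numpy as np

# Hypothetical per-token [LENGTH, SPACE] columns for the text "hello world!".
arr = np.array([[5, 1],   # "hello" plus a trailing space
                [5, 0],   # "world"
                [1, 0]],  # "!"
               dtype=np.int64)

widths = arr.sum(axis=1)             # characters consumed per token
starts = np.cumsum(widths) - widths  # start index of each token in the text
print(starts.tolist())  # [0, 6, 11]
```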
---
So I really like the spaCy tokenizer/doc API, but if we went that route we'd probably need to be able to reconstruct `Doc` objects from the output of (e.g.) Seq2Seq decoders. We could only do this if all info in a `Doc` object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the `Doc` here. So what I'm trying to do now is build a fully reversible tokenizer -- one that works for any language with spaces (and any other language when coupled with a BPE algorithm) -- and allow the raw source text to be fully reconstructed from a list of indices, whether those are indices from data or indices from a model. Check out `revtok` in the `reversible` branch; I'll add the subword stuff soon. The overall idea is to augment the existing `Field` API with a `ReversibleField` subclass that additionally exposes inverses of all the processing methods; I think that will solve the main question here (if you just want a batch of tags to let you retrieve something extra/nontextual from the original data sample, like an image, then include that tag as a vocabless `Field`).
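A hypothetical sketch of the `ReversibleField` idea (toy vocab handling and method names of my own; not torchtext's actual implementation):

```python
# Sketch: a field that keeps enough state to invert numericalization.
class ReversibleField:
    def __init__(self, tokenize, detokenize):
        self.tokenize = tokenize        # str -> list[str]
        self.detokenize = detokenize    # list[str] -> str (inverse of tokenize)
        self.stoi, self.itos = {}, []

    def build_vocab(self, texts):
        for text in texts:
            for tok in self.tokenize(text):
                if tok not in self.stoi:
                    self.stoi[tok] = len(self.itos)
                    self.itos.append(tok)

    def numericalize(self, text):
        return [self.stoi[t] for t in self.tokenize(text)]

    def reverse(self, ids):
        # Inverse of numericalize: indices -> tokens -> text.
        return self.detokenize([self.itos[i] for i in ids])

field = ReversibleField(str.split, " ".join)
field.build_vocab(["the cat sat"])
ids = field.numericalize("the cat sat")
print(field.reverse(ids))  # the cat sat
```

The key design point is that `reverse` only round-trips exactly when `detokenize` is a true inverse of `tokenize`, which is what the reversible-tokenizer work above is meant to guarantee.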
---
> In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption.
+1, I don't think that this is the right thing to do at the moment.
---
For passing a gradient back to PyTorch, `var.backward` has an optional `grad_output` argument that allows you to inject a gradient at a specific place in the computation graph. If you want to inject several gradients, you can use `torch.autograd.backward((var_1, var_2), (grad_1, grad_2))`, I believe.
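For example (the surrounding tensors are made up; `backward`'s gradient argument and `torch.autograd.backward` are the actual PyTorch entry points):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2.0  # some PyTorch sub-graph whose output feeds an external model

# Gradient of the external loss w.r.t. y, computed outside PyTorch:
external_grad = torch.tensor([0.1, 0.2, 0.3])

# Inject the gradient instead of calling .backward() on a scalar loss:
y.backward(external_grad)
print(x.grad)  # equals 2 * external_grad, since y = 2x
```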