Comments (6)
@thinline72 Correct -- this was a necessity inherited from the original OpenAI repository. We're keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.
@thinline72 also, thanks for the recommendation to take a peek at sentencepiece! We might have to use that the next time we're training a model from scratch -- tokenization + de-tokenization has been a huge pain in this repository, and a project like sentencepiece could help clean that up.
Talked this issue over with @benleetownsend -- the max_length keyword argument to featurize() can only be lowered after training. The base model for finetune uses a max_length of 512, so that's why you're seeing this behavior.
Because it's rather important that the sequence length used at training time matches the sequence length used at prediction time, we're probably going to opt to drop the max_length keyword argument from featurize, predict, etc. in favor of having it be specified on initialization (i.e. model = Classifier(max_length=1000)). That still wouldn't fix your specific case of trying to featurize with the base model and a max_length of > 512 tokens, however, so I think the plan is probably going to be to raise an explicit, informative error in cases like yours, where you're trying to featurize a long document with the base model.
We're also discussing an option to chunk the document up and deal with longer documents in sections of 512 tokens each. We already do this for sequence labeling tasks but have some thinking to do before we decide how to handle this with the Classifier model and for the featurize endpoint.
In terms of immediate workarounds, you could:
- Fit the model on your data first, so the learned positional embeddings are adapted to the longer max_length:

model = Classifier(max_length=1000)
model.fit(text)  # modifies learned positional embeddings to function for max_length=1000
model.featurize(text)
- Hope that the first 512 tokens are representative:
model = Classifier()
model.featurize(text)
- Take a peek at how we're handling this for sequence labeling tasks and try to implement a proper solution yourself (see the sketch after this list).
See: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/sequence_labeling.py#L40
This is pretty non-trivial, but if you're willing to give it a shot I'm happy to provide much more context on how this code is structured / the approach we think would work.
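If you go the do-it-yourself route, the heart of the approach is a sliding window over the subword tokens, featurizing each window and then recombining the per-chunk outputs. A minimal sketch of the windowing step only -- this is an illustration, not finetune's actual implementation, and overlapping_chunks is a hypothetical helper:

def overlapping_chunks(tokens, max_len=512):
    # Advance by a third of the window per step, so consecutive
    # chunks share roughly two thirds of their tokens.
    stride = max_len // 3
    if len(tokens) <= max_len:
        return [tokens]
    starts = list(range(0, len(tokens) - max_len, stride))
    starts.append(len(tokens) - max_len)  # make sure the tail is covered
    return [tokens[s:s + max_len] for s in starts]

Recombining the per-chunk outputs (averaging features, or reconciling overlapping label spans for sequence labeling) is where most of the non-trivial work lives.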
Hi @madisonmay,
Thanks a lot for the quick and very informative response. It makes a lot of sense.
I also noticed that spaCy is used for tokenization. Is the max_length param counted in spaCy tokens or in BPE tokens?
And could you help me figure out the preprocessing part, please? If I understand correctly, the spaCy tokenizer is used first to get words, and then BPE is used to get subwords, right?
@thinline72 you're correct in your understanding of how the tokenization works. The max_length parameter actually relates to the number of subwords, however.
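For intuition, the pipeline is two passes: spaCy splits the text into words, then each word is broken into BPE subwords, and max_length caps the total subword count. A toy sketch of that counting, assuming a hypothetical bpe_encode helper standing in for finetune's internal BPE step (not a real finetune function):

import spacy

nlp = spacy.blank("en")  # word-level pass; finetune's actual spaCy setup may differ

def count_subwords(text, bpe_encode):
    # bpe_encode: hypothetical word -> [subword, ...] function.
    # max_length is compared against this subword total, not the word count.
    words = [tok.text for tok in nlp(text)]
    return sum(len(bpe_encode(word)) for word in words)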
We actually have what you want partially implemented via a chunk_long_sequences keyword. If you, say, run the following code, you'll see that the sample text file is split up into 14 overlapping chunks (the first 2/3rds of each chunk are identical to the last 2/3rds of the previous chunk) and features are computed for each chunk. For your use case it may be sufficient to simply take the mean of these features over axis 0.
import numpy as np
import requests
from finetune import Classifier

# Grab a long sample document (well over 512 subword tokens)
text = requests.get("http://txt2html.sourceforge.net/sample.txt").text

# chunk_long_sequences splits long inputs into overlapping 512-token chunks
model = Classifier(chunk_long_sequences=True)
features = model.featurize([text])        # one feature vector per chunk
avg_features = np.mean(features, axis=0)  # pool over the chunk axis
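As a quick sanity check (shapes here are illustrative -- the hidden size depends on the base model, so treat n_hidden as a placeholder):

print(features.shape)      # one row per chunk, e.g. (14, n_hidden) for the sample text above
print(avg_features.shape)  # (n_hidden,) after pooling over the chunk axis

Mean-pooling is just the simplest way to collapse the chunk axis; whether it's appropriate depends on what you do with the features downstream.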
@madisonmay Got it, thank you! I'll close the issue.
Just one last question: why are the spaCy tokenizer + BPE used, rather than just BPE (or something like https://github.com/google/sentencepiece)? Is it something that was just inherited from the original OpenAI model and preprocessing?
Related Issues (20)
- Add chunk long sequences support to featurize_sequence
- Make tqdm behavior more consistent between train / predict
- ModuleNotFoundError: No module named 'tensorflow.contrib'
- Make try/except in visible devices search less broad and print traceback to enable debugging deployed finetune
- Soft targets for sequence labeling models (with use_crf=False)
- Add chunk long sequences support to featurize
- Add "download=False" argument to optionally prevent automatic download on instantiation
- Check md5sums of base model file hashes
- McDonalds dataset link is broken
- Document tensorflow 2.0 requirement and removal of 1.x support
- How to save checkpoint when finetune?
- Pin Spacy Version to < 3.0
- [Need help] How can I load a created base model
- LayoutLM documentation
- Add absl-py to requirements.txt
- Question: Does the Bert base model support multiple languages?
- Use on Colab Fails
- Finetune a model Q/A Response
- MultiLabelClassifier predictions bug
- Unavailable pre-trained model configs