
Comments (6)

madisonmay commented on June 28, 2024

@thinline72 correct, this was a necessity inherited from the original OpenAI repository. We're keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.


madisonmay commented on June 28, 2024

@thinline72 also, thanks for the recommendation to take a peek at sentencepiece! We might have to use that the next time we're training a model from scratch -- tokenization + de-tokenization have been a huge pain in this repository, and perhaps a project like sentencepiece could help clean that up.


madisonmay commented on June 28, 2024

Talked this issue over with @benleetownsend -- the max_length keyword argument to featurize() can only be lowered after training. The base model for finetune uses a max_length of 512, so that's why you're seeing this behavior.

Because it's rather important that the sequence length used at training time matches the sequence length used at prediction time, we're probably going to opt to drop the max_length keyword argument from featurize, predict, etc. in favor of having it be specified on initialization (i.e. model = Classifier(max_length=1000)). That still wouldn't fix your specific case of trying to featurize with the base model and a max_length of > 512 tokens, however, so the plan is probably going to be to raise an explicit, informative error in cases like yours where you're trying to featurize a long document with the base model.
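As a rough sketch, an explicit error of that sort might look something like this (the function name and message are illustrative assumptions, not finetune's actual code):

def validate_max_length(requested_max_length, trained_max_length=512):
    # Hypothetical guard (not finetune's actual code): the base model's
    # learned positional embeddings only cover trained_max_length
    # positions, so a longer featurize request should fail loudly
    # rather than silently truncate.
    if requested_max_length > trained_max_length:
        raise ValueError(
            f"max_length={requested_max_length} exceeds the base model's "
            f"trained max_length of {trained_max_length}; retrain with a "
            f"larger max_length or chunk the document."
        )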

We're also discussing an option to chunk the document up and deal with longer documents in sections of 512 tokens each. We already do this for sequence labeling tasks, but we have some thinking to do before we decide how to handle this with the Classifier model and for the featurize endpoint.

In terms of immediate workarounds, you could:

  1. Retrain with a larger max_length so the learned positional embeddings cover your documents:
    model = Classifier(max_length=1000)
    model.fit(text)  # modifies learned positional embeddings to function for max_length=1000
    model.featurize(text)
  2. Hope that the first 512 tokens are representative:
    model = Classifier()
    model.featurize(text)
  3. Take a peek at how we're handling this for sequence labeling tasks and try to implement a proper solution yourself (a rough sketch follows this list).
    See: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/sequence_labeling.py#L40
    This is pretty non-trivial, but if you're willing to give it a shot I'm happy to provide much more context on how this code is structured / the approach we think would work.
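
For reference, a minimal sketch of the overlapping-chunk idea behind option 3 (chunk_tokens and the one-third stride are illustrative assumptions, not the actual sequence_labeling.py implementation):

def chunk_tokens(tokens, max_length=512):
    # Slide a max_length-wide window over the token sequence with a
    # stride of max_length // 3, so consecutive chunks share roughly
    # 2/3 of their tokens.
    stride = max_length // 3
    if len(tokens) <= max_length:
        return [tokens]
    starts = list(range(0, len(tokens) - max_length, stride))
    starts.append(len(tokens) - max_length)  # make sure the tail is covered
    return [tokens[s:s + max_length] for s in starts]

# Per-chunk features can then be pooled (e.g. averaged) into one
# document-level vector.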


thinline72 commented on June 28, 2024

Hi @madisonmay ,

Thanks a lot for the quick and very informative response. It makes a lot of sense.

I also noticed that spaCy is used for tokenization. So is the max_length param about spaCy tokens or about BPE tokens?

And could you help me figure out the preprocessing part, please? If I understand correctly, the spaCy tokenizer is used first to get words, and then BPE is used to get subwords, right?


madisonmay commented on June 28, 2024

@thinline72 you're correct in your understanding of how the tokenization works. The max_length parameter relates to the number of subwords (BPE tokens), however, not the number of spaCy word tokens.
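
To make the two-stage pipeline concrete, here's a toy sketch (spacy.blank is real spaCy; bpe_encode below is a deliberately fake stand-in for the actual BPE encoder):

import spacy

nlp = spacy.blank("en")  # tokenizer-only spaCy pipeline

def bpe_encode(word):
    # Toy placeholder: a real BPE encoder applies learned merge rules;
    # here we just split each word into 3-character pieces.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

words = [t.text for t in nlp("Tokenization is surprisingly fiddly.")]  # stage 1: spaCy word tokens
subwords = [piece for w in words for piece in bpe_encode(w)]           # stage 2: BPE subwords
# max_length constrains len(subwords), not len(words).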

We actually have what you want partially implemented via a chunk_long_sequences keyword. If you, say, run the following code, you'll see that the sample text file is split up into 14 overlapping chunks (the first 2/3rds of each chunk are identical to the last 2/3rds of the previous chunk) and features are computed for each chunk. For your use case it may be sufficient to simply take the mean of these features over axis 0.

import numpy as np
import requests

from finetune import Classifier

# Download a long sample document (well over 512 subword tokens).
text = requests.get("http://txt2html.sourceforge.net/sample.txt").text

# chunk_long_sequences splits the input into overlapping 512-token
# chunks and computes features for each chunk separately.
model = Classifier(chunk_long_sequences=True)
features = model.featurize([text])

# Pool the per-chunk features into a single document-level vector.
avg_features = np.mean(features, axis=0)
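
(With chunk_long_sequences enabled, featurize should return one feature vector per chunk, so features above would have shape (n_chunks, n_features) -- 14 chunks for this sample document -- and avg_features collapses that into a single document-level vector.)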


thinline72 commented on June 28, 2024

@madisonmay Got it, thank you! I'll close the issue.

Just one last question: why are the spaCy tokenizer + BPE used together? Why not just BPE (or something like https://github.com/google/sentencepiece)? Is it something that was just inherited from the original OpenAI model and preprocessing?

