
Comments (6)

madisonmay commented on June 28, 2024

@thinline72 correct, this was a necessity inherited from the original OpenAI repository. We're keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.


madisonmay commented on June 28, 2024

@thinline72 also, thanks for the recommendation to take a peek at sentencepiece! We might have to use that the next time we're training a model from scratch -- tokenization + de-tokenization have been a huge pain in this repository, and perhaps a project like sentencepiece could help clean that up.


madisonmay commented on June 28, 2024

Talked this issue over with @benleetownsend -- the max_length keyword argument to featurize() can only be lowered after training. The base model for finetune uses a max_length of 512, so that's why you're seeing this behavior.

Because it's rather important that the sequence length used at training time matches the sequence length used at prediction time, we're probably going to opt to drop the max_length keyword argument from featurize, predict, etc. in favor of having it be specified on initialization (i.e. model = Classifier(max_length=1000)). That still wouldn't fix your specific case of trying to featurize with the base model and a max_length of > 512 tokens, however, so the plan is probably going to be to raise an explicit, informative error in cases like yours where you're trying to featurize a long document with the base model.
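As a rough sketch, an explicit error of that sort might look something like this (the function name and message are illustrative assumptions, not finetune's actual code):

def validate_max_length(requested_max_length, trained_max_length=512):
    # Hypothetical guard (not finetune's actual code): the base model's
    # learned positional embeddings only cover trained_max_length
    # positions, so a longer featurize request should fail loudly
    # rather than silently truncate.
    if requested_max_length > trained_max_length:
        raise ValueError(
            f"max_length={requested_max_length} exceeds the base model's "
            f"trained max_length of {trained_max_length}; retrain with a "
            f"larger max_length or chunk the document."
        )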

We're also discussing an option to chunk the document up and deal with longer documents in sections of 512 tokens each. We already do this for sequence labeling tasks, but we have some thinking to do before we decide how to handle this with the Classifier model and for the featurize endpoint.

In terms of immediate workarounds, you could:

  1. Retrain with a larger max_length so the learned positional embeddings cover your documents:
    model = Classifier(max_length=1000)
    model.fit(text)  # modifies learned positional embeddings to function for max_length=1000
    model.featurize(text)
  2. Hope that the first 512 tokens are representative:
    model = Classifier()
    model.featurize(text)
  3. Take a peek at how we're handling this for sequence labeling tasks and try to implement a proper solution yourself (a rough sketch follows this list).
    See: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/sequence_labeling.py#L40
    This is pretty non-trivial, but if you're willing to give it a shot I'm happy to provide much more context on how this code is structured / the approach we think would work.
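
For reference, a minimal sketch of the overlapping-chunk idea behind option 3 (chunk_tokens and the one-third stride are illustrative assumptions, not the actual sequence_labeling.py implementation):

def chunk_tokens(tokens, max_length=512):
    # Slide a max_length-wide window over the token sequence with a
    # stride of max_length // 3, so consecutive chunks share roughly
    # 2/3 of their tokens.
    stride = max_length // 3
    if len(tokens) <= max_length:
        return [tokens]
    starts = list(range(0, len(tokens) - max_length, stride))
    starts.append(len(tokens) - max_length)  # make sure the tail is covered
    return [tokens[s:s + max_length] for s in starts]

# Per-chunk features can then be pooled (e.g. averaged) into one
# document-level vector.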


thinline72 commented on June 28, 2024

Hi @madisonmay ,

Thanks a lot for the quick and very informative response. It makes a lot of sense.

I also noticed that spaCy is used for tokenization. So is the max_length param about spaCy tokens or about BPE tokens?

And could you help me figure out the preprocessing part, please? If I understand correctly, the spaCy tokenizer is used first to get words, and then BPE is used to get subwords, right?


madisonmay commented on June 28, 2024

@thinline72 you're correct in your understanding of how the tokenization works. The max_length parameter relates to the number of subwords (BPE tokens), however, not the number of spaCy word tokens.
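
To make the two-stage pipeline concrete, here's a toy sketch (spacy.blank is real spaCy; bpe_encode below is a deliberately fake stand-in for the actual BPE encoder):

import spacy

nlp = spacy.blank("en")  # tokenizer-only spaCy pipeline

def bpe_encode(word):
    # Toy placeholder: a real BPE encoder applies learned merge rules;
    # here we just split each word into 3-character pieces.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

words = [t.text for t in nlp("Tokenization is surprisingly fiddly.")]  # stage 1: spaCy word tokens
subwords = [piece for w in words for piece in bpe_encode(w)]           # stage 2: BPE subwords
# max_length constrains len(subwords), not len(words).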

We actually have what you want partially implemented via a chunk_long_sequences keyword. If you, say, run the following code, you'll see that the sample text file is split up into 14 overlapping chunks (the first 2/3rds of each chunk are identical to the last 2/3rds of the previous chunk) and features are computed for each chunk. For your use case it may be sufficient to simply take the mean of these features over axis 0.

import numpy as np
import requests

from finetune import Classifier

# Download a long sample document (well over 512 subword tokens).
text = requests.get("http://txt2html.sourceforge.net/sample.txt").text

# chunk_long_sequences splits the input into overlapping 512-token
# chunks and computes features for each chunk separately.
model = Classifier(chunk_long_sequences=True)
features = model.featurize([text])

# Pool the per-chunk features into a single document-level vector.
avg_features = np.mean(features, axis=0)
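
(With chunk_long_sequences enabled, featurize should return one feature vector per chunk, so features above would have shape (n_chunks, n_features) -- 14 chunks for this sample document -- and avg_features collapses that into a single document-level vector.)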


thinline72 commented on June 28, 2024

@madisonmay Got it, thank you! I'll close the issue.

Just one last question: why are the spaCy tokenizer + BPE used together? Why not just BPE (or something like https://github.com/google/sentencepiece)? Is it something that was just inherited from the original OpenAI model and preprocessing?

