Pretraining Objectives? (about nougat, open)

sgdescent commented on August 15, 2024
Pretraining Objectives?

from nougat.

Comments (6)

lukas-blecher commented on August 15, 2024

We initialize the encoder and decoder weights with pretrained model weights. Then we train on a data mix including PMC with simple non-markup targets for layout diversity. So yes, it is a simple OCR pretraining objective. In the following step the IDL data is removed from the training.

lukas-blecher commented on August 15, 2024

Yes, sounds right, with most of the weight on arXiv since it is the cleanest source. PMC's math is not always in a parsable format (e.g. images), or inline math is just italic text.
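
To illustrate what weighting a data mix toward arXiv could look like, here is a minimal sketch. The weights, the `SOURCE_WEIGHTS` name, and the `sample_sources` helper are hypothetical; the thread gives no actual ratios, and this is not nougat's training code.

```python
import random
from collections import Counter

# Hypothetical sampling weights. The thread only says arXiv gets
# "most of the weight"; the 0.8 / 0.2 split is an assumption.
SOURCE_WEIGHTS = {"arxiv": 0.8, "pmc": 0.2}


def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Draw n source names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# Count how often each source is drawn for 10,000 training samples.
counts = Counter(sample_sources(10_000))
print(counts)
```

In a real pipeline the drawn source name would select which dataset the next training page comes from; a weighted sampler over the concatenated datasets would be an equivalent design.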

sgdescent commented on August 15, 2024

Thanks a ton, this is helpful!

sgdescent commented on August 15, 2024

@lukas-blecher one more thing. From what I understood, the data mix you mention has PMC and IDL data and this simple non-markup data, and when you want to train the model completely you use PMC + arXiv with the full markup targets that are generated by the dataset generation code?

sgdescent commented on August 15, 2024

@lukas-blecher Thank you for your response, but I am still unclear about the whole procedure for pretraining the model.

  1. If the loss function is the same, why are there two stages? If the loss function is not the same, what is the objective for the 1st stage and what is it for the 2nd stage?
  2. If there are two stages, what is the training schedule? In the paper, you only mention training the model for 3 epochs. Is that for pretraining (stage 1) or training (stage 2)?
  3. For pretraining, what sources does your data contain: is it arXiv + PMC + IDL? And are the papers used in pretraining, for example from arXiv, used again in training?
  4. For pretraining, you mentioned using only non-markup data. Does that mean you mask out the markup when computing the loss? If not, do you have a simple script that keeps only pages with no markup, and is this script run on the .mmd files generated by the dataset generation script?
  5. Finally, for training (stage 2), are some PMC files removed based on criteria applied to the .mmd file, for cases where PMC's math is not parsable?
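
To make the second option in question 4 concrete, this is roughly the kind of filter I have in mind. It is only my guess at what such a script might look like: the regex patterns and the `has_markup` / `keep_plain_pages` names are my own assumptions, not anything taken from the nougat repository.

```python
import re

# Heuristic signs of markup in a Markdown-with-LaTeX page.
# These patterns are assumptions, not rules used by nougat.
MARKUP_PATTERNS = [
    re.compile(r"\\\(|\\\[|\$"),         # inline or display math delimiters
    re.compile(r"^#{1,6}\s", re.M),      # Markdown headings
    re.compile(r"\\begin\{[a-z*]+\}"),   # LaTeX environments such as tables
    re.compile(r"\*\*|_[^_\s][^_]*_"),   # bold or italic emphasis
]


def has_markup(page_text: str) -> bool:
    """Return True if any markup pattern occurs in the page text."""
    return any(p.search(page_text) for p in MARKUP_PATTERNS)


def keep_plain_pages(pages):
    """Keep only pages in which no markup pattern was found."""
    return [p for p in pages if not has_markup(p)]


pages = [
    "Plain paragraph of body text.",
    "# Introduction\nSome text.",
    "The energy is \\(E = mc^2\\).",
]
plain = keep_plain_pages(pages)
print(plain)
```

The same `has_markup` check could in principle also serve as a rejection criterion for the PMC pages in question 5, but I do not know whether anything like that is actually done.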

sgdescent commented on August 15, 2024

Also, one more question. I looked into the IDL data and the JSON object returned for each file. Do you convert it into a text document by parsing the JSON response?
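
For example, something along these lines, where the `pages` / `lines` / `text` layout is a made-up schema for illustration and not the real IDL response format, and `json_to_text` is a hypothetical helper:

```python
import json


def json_to_text(raw: str) -> str:
    """Flatten a per-file OCR JSON response into plain text.

    Assumes a hypothetical schema: {"pages": [{"lines": [{"text": ...}]}]}.
    Lines are joined with newlines and pages with a blank line.
    """
    doc = json.loads(raw)
    page_texts = []
    for page in doc.get("pages", []):
        lines = [line["text"] for line in page.get("lines", []) if line.get("text")]
        page_texts.append("\n".join(lines))
    return "\n\n".join(page_texts)


example = json.dumps(
    {
        "pages": [
            {"lines": [{"text": "Dear Sir,"}, {"text": "Thank you."}]},
            {"lines": [{"text": "Page two."}]},
        ]
    }
)
text = json_to_text(example)
print(text)
```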
