Hey, I was looking into the paper as I want to replicate the work. In the data pre

Pretraining Objectives? about nougat HOT 6 OPEN

sgdescent commented on August 15, 2024

Pretraining Objectives?

from nougat.

Comments (6)

lukas-blecher commented on August 15, 2024

We initialize the encoder and decoder weights with pretrained model weights. Then we train on a data mix including PMC with simple non-markup targets for layout diversity. So yes, it is a simple OCR pretraining objective. In the following step, the IDL data is removed from training.

lukas-blecher commented on August 15, 2024

Yes, sounds right, with most of the weight on arXiv since it is the cleanest source. PMC's math is not always in a parsable format (e.g. it is embedded as images), or inline math is just italic text.
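A minimal sketch of what "most of the weight on arXiv" could look like as weighted sampling over sources. The weights `0.8` and `0.2`, the dictionary `SOURCE_WEIGHTS`, and the helper `sample_sources` are hypothetical; the thread does not give the actual ratio or the sampling code.

```python
import random
from collections import Counter

# Hypothetical weights: the thread only says arXiv gets "most of the weight".
SOURCE_WEIGHTS = {"arxiv": 0.8, "pmc": 0.2}


def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Draw n source names, each with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# Count how often each source is picked over many draws.
counts = Counter(sample_sources(10_000))
print(counts)
```

In a real pipeline the sampled name would select which dataset the next page is read from; here it only illustrates the proportions.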

sgdescent commented on August 15, 2024

Thanks a ton, this is helpful!

sgdescent commented on August 15, 2024

@lukas-blecher One more thing about the data mix you mention. From what I understood, it contains PMC and IDL data with simple non-markup targets, and when you want to train the model completely you use PMC + arXiv with the full markup produced by the dataset generation code. Is that right?

sgdescent commented on August 15, 2024

@lukas-blecher Thank you for your response, but I am still unclear about the whole procedure for pretraining the model:

1. If the loss function is the same, why are there two stages? If it is not the same, what is the objective for the first stage and what is it for the second?
2. If there are two stages, what is the training schedule? The paper only mentions training the model for 3 epochs. Is that for pretraining (stage 1) or training (stage 2)?
3. For pretraining, which sources does your data contain: arXiv + PMC + IDL? And are the papers used in pretraining (for example from arXiv) used again in training?
4. For pretraining, you mentioned using only non-markup data. Does that mean you mask out the markup when computing the loss? If not, do you have a simple script that keeps only pages without markup, and is that script run on the .mmd files produced by the dataset generation script?
5. Finally, for training (stage 2), are some PMC files removed based on criteria applied to the .mmd file, for cases where PMC's math is not parsable?
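To make the "keep only pages without markup" option concrete, here is a rough sketch of such a filter over page text taken from .mmd files. It is only an illustration of the idea being asked about, not the authors' method; the pattern list and the names `has_markup` and `keep_plain_pages` are assumptions.

```python
import re

# Assumed markup signals: \( or \[ math delimiters, LaTeX environments,
# Markdown headings at line start, and bold markers.
MARKUP_PATTERN = re.compile(
    r"\\\(|\\\[|\\begin\{|^#{1,6}\s|\*\*",
    re.MULTILINE,
)


def has_markup(page_text: str) -> bool:
    """Return True if the page text contains any of the assumed markup signals."""
    return MARKUP_PATTERN.search(page_text) is not None


def keep_plain_pages(pages):
    """Keep only pages in which no markup was detected."""
    return [page for page in pages if not has_markup(page)]


pages = [
    "A plain paragraph of running text.",
    "The energy is \\(E = mc^2\\).",
    "# Introduction\nSome text.",
]
plain = keep_plain_pages(pages)
```

The alternative raised in the same question, masking markup tokens out of the loss, would keep every page and change the training code instead of the data.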

sgdescent commented on August 15, 2024

Also, one more question. I looked into the IDL data and the JSON object returned for each file. Do you convert it into a text document by parsing the JSON response?
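For illustration, this is roughly what such a conversion might look like, assuming the JSON follows a Textract-style layout: a top-level `"Blocks"` list whose entries carry `"BlockType"` and `"Text"`. The actual IDL files may be structured differently, and `blocks_to_text` is a hypothetical helper.

```python
import json


def blocks_to_text(response: dict) -> str:
    """Join the text of all LINE blocks, in the order they appear."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


# Tiny made-up response in the assumed layout (not real IDL data).
example = json.loads(
    '{"Blocks": ['
    '{"BlockType": "PAGE"},'
    '{"BlockType": "LINE", "Text": "Hello"},'
    '{"BlockType": "WORD", "Text": "Hello"},'
    '{"BlockType": "LINE", "Text": "world"}'
    "]}"
)
text = blocks_to_text(example)
```

Using only LINE blocks avoids duplicating text that also appears in WORD blocks, but it relies on the block order for reading order.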
