
Comments (4)

logan-markewich commented on August 28, 2024

Just an idea, but you could probably model document hierarchy by adding special tokens to the tokenizer.

For example, DocVQA is trained using an input like '<s_docvqa><s_question>my question?</s_question><s_answer>', and the output then completes the input -> '<s_docvqa><s_question>my question?</s_question><s_answer>my answer</s_answer></s_docvqa>'

So following that logic, you could construct an input prompt as simply '<s_hierarchy>' and then the output could be '<s_hierarchy><s_title>My title<s_paragraph>My paragraph text</s_paragraph></s_title></s_hierarchy>'

This would work for training, as long as you add the proper special tokens.
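As a rough sketch of what "adding the proper special tokens" could involve, the snippet below serializes a nested hierarchy into the tagged target string from the example above and collects every tag it emits. The `{"type", "text", "children"}` node layout and the `hierarchy_to_sequence` name are assumptions made for illustration; they are not part of Donut.

```python
def hierarchy_to_sequence(node, special_tokens):
    """Serialize a nested node into a tagged string, recording each tag used.

    `node` is a hypothetical dict with a "type", optional "text",
    and optional "children" (a list of nodes of the same shape).
    """
    open_tag = f"<s_{node['type']}>"
    close_tag = f"</s_{node['type']}>"
    special_tokens.update((open_tag, close_tag))

    # The element's own text comes first, then its nested children.
    body = node.get("text", "") + "".join(
        hierarchy_to_sequence(child, special_tokens)
        for child in node.get("children", [])
    )
    return f"{open_tag}{body}{close_tag}"


doc = {
    "type": "hierarchy",
    "children": [
        {
            "type": "title",
            "text": "My title",
            "children": [{"type": "paragraph", "text": "My paragraph text"}],
        }
    ],
}

tokens = set()
target = hierarchy_to_sequence(doc, tokens)
print(target)
print(sorted(tokens))
```

The collected `tokens` are what would need to be registered with the tokenizer before training, so that each tag maps to a single token id instead of being split into sub-word pieces.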

In terms of the token limit though, you would definitely run into some problems. If you cared more about the structure, maybe you could just predict the first line of each hierarchical element to limit the prediction size. If you are working with synthetically generated data, this would be pretty easy to do!
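One way to read the "first line only" idea is as a preprocessing step on the ground-truth hierarchy before it is serialized. The sketch below assumes a hypothetical `{"type", "text", "children"}` node layout and an arbitrary `max_chars` cap; neither comes from Donut itself.

```python
def first_line_only(node, max_chars=80):
    """Return a copy of the hierarchy keeping only the first line of each element.

    `node` is a hypothetical dict with a "type", optional "text",
    and optional "children". `max_chars` is an illustrative cap.
    """
    lines = node.get("text", "").splitlines()
    first = lines[0] if lines else ""
    return {
        "type": node["type"],
        "text": first[:max_chars],
        "children": [
            first_line_only(child, max_chars)
            for child in node.get("children", [])
        ],
    }


section = {
    "type": "title",
    "text": "My title",
    "children": [
        {"type": "paragraph", "text": "First line of the paragraph.\nSecond line.\nThird line."}
    ],
}

short = first_line_only(section)
print(short["children"][0]["text"])
```

The structure of the tree is unchanged, so the model would still have to predict every element, but the target sequence grows with the number of elements instead of with the total amount of text.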

In terms of doing DLA, by outputting the hierarchy you are already performing a type of layout analysis. Adapting existing datasets might be tough, but maybe check out DocBank. It has token-level layout annotations that would work well for a text-based model like this one.

from donut.

jordanparker6 commented on August 28, 2024

@logan-markewich I like your approach.

What are your thoughts on replacing the BART encoder with one of the long-range Transformer architectures (e.g. Longformer or Performer)?

I don't have much experience with them, but I understand that they have O(n) complexity instead of O(n^2) in the sequence length.
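To give a feel for the difference, here is a back-of-the-envelope count of attention score computations for full self-attention versus a sliding-window scheme in the style of Longformer. The window size of 512 and the sequence lengths are assumptions chosen for illustration, and the count ignores global tokens, constant factors, and memory layout. Performer reaches linear cost through a different mechanism that this sketch does not model.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n scores."""
    return n * n


def windowed_attention_pairs(n, window):
    """Each token attends to at most `window` neighbours: n * window scores."""
    return n * min(window, n)


WINDOW = 512  # illustrative window size, not a recommendation

for n in (1024, 4096, 16384):
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n, WINDOW)
    print(f"n={n:>6}  full={full:>12}  windowed={local:>10}  ratio={full / local:.1f}")
```

With a fixed window the ratio grows linearly with the sequence length, which is why these architectures are attractive for long documents.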


logan-markewich commented on August 28, 2024

@jordanparker6 I don't have any experience using long-range transformers, but it's probably worth trying! Assuming you have the resources needed to load and train the model 👍🏻

Thankfully, looking at the code, swapping out BART for anything else should be pretty straightforward, and you won't have to repeat the OCR pre-training either.


jlia0 commented on August 28, 2024

Any updates on this? Why is Donut difficult to apply to the Document Layout Analysis task?
