clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664

License: MIT License

Python 100.00%
document-ai eccv-2022 multimodal-pre-trained-model ocr nlp computer-vision

donut's Introduction

Donut 🍩 : Document Understanding Transformer


Official Implementation of Donut and SynthDoG | Paper | Slide | Poster

Introduction

Donut 🍩, Document Understanding Transformer, is a new method of document understanding that utilizes an OCR-free end-to-end Transformer model. Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performance on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing). In addition, we present SynthDoG 🐶, Synthetic Document Generator, which helps the model pre-training to be flexible across various languages and domains.

Our academic paper, which describes our method in detail and provides full experimental results and analyses, can be found here:

OCR-free Document Understanding Transformer.
Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. In ECCV 2022.


Pre-trained Models and Web Demos

Gradio web demos are available!
  • You can run the demo with ./app.py file.
  • Sample images are available at ./misc and more receipt images are available at CORD dataset link.
  • Web demos are available from the links in the following table.
  • Note: We have updated the Google Colab demo (as of June 15, 2023) to ensure it works properly.
Task | Sec/Img | Score | Trained Model | Demo
CORD (Document Parsing) | 0.7 / 0.7 / 1.2 | 91.3 / 91.1 / 90.9 | donut-base-finetuned-cord-v2 (1280) / donut-base-finetuned-cord-v1 (1280) / donut-base-finetuned-cord-v1-2560 | gradio space web demo, google colab demo (updated at 23.06.15)
Train Ticket (Document Parsing) | 0.6 | 98.7 | donut-base-finetuned-zhtrainticket | google colab demo (updated at 23.06.15)
RVL-CDIP (Document Classification) | 0.75 | 95.3 | donut-base-finetuned-rvlcdip | gradio space web demo, google colab demo (updated at 23.06.15)
DocVQA Task1 (Document VQA) | 0.78 | 67.5 | donut-base-finetuned-docvqa | gradio space web demo, google colab demo (updated at 23.06.15)

The links to the pre-trained backbones are here:

  • donut-base: trained with 64 A100 GPUs (~2.5 days), number of layers (encoder: {2,2,14,2}, decoder: 4), input size 2560x1920, swin window size 10, IIT-CDIP (11M) and SynthDoG (English, Chinese, Japanese, Korean, 0.5M x 4).
  • donut-proto: (preliminary model) trained with 8 V100 GPUs (~5 days), number of layers (encoder: {2,2,18,2}, decoder: 4), input size 2048x1536, swin window size 8, and SynthDoG (English, Japanese, Korean, 0.4M x 3).

Please see our paper for more details.

SynthDoG datasets


The links to the SynthDoG-generated datasets are here:

To generate synthetic datasets with our SynthDoG, please see ./synthdog/README.md and our paper for details.

Updates

2023-06-15 We have updated all Google Colab demos to ensure they work properly.
2022-11-14 New version 1.0.9 is released (pip install donut-python --upgrade). See 1.0.9 Release Notes.
2022-08-12 Donut 🍩 is also available at huggingface/transformers 🤗 (contributed by @NielsRogge). donut-python loads the pre-trained weights from the official branch of the model repositories. See 1.0.5 Release Notes.
2022-08-05 A well-executed hands-on tutorial on Donut 🍩 was published at Towards Data Science (written by @estaudere).
2022-07-20 First commit. We release our code, model weights, synthetic data, and generator.

Software installation


pip install donut-python

or clone this repository and install the dependencies:

git clone https://github.com/clovaai/donut.git
cd donut/
conda create -n donut_official python=3.7
conda activate donut_official
pip install .

We tested donut-python == 1.0.1 with:

Note: From several reported issues, we have noticed increased challenges in configuring the testing environment for donut-python due to recent updates in key dependency libraries. While we are actively working on a solution, we have updated the Google Colab demos (as of June 15, 2023) to ensure they work properly. For assistance, we encourage you to refer to the following demo links: CORD Colab Demo, Train Ticket Colab Demo, RVL-CDIP Colab Demo, DocVQA Colab Demo.
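
After installation, a minimal inference sketch looks like the following. This is illustrative only: the image path is a placeholder, and the fine-tuned CORD checkpoint with its "<s_cord-v2>" task prompt is assumed in the same way the demo app (./app.py) uses task-specific prompts; other tasks and checkpoints use their own prompts.

from PIL import Image

from donut import DonutModel

# Minimal sketch: load a fine-tuned checkpoint and parse one receipt image.
# "path/to/receipt.png" is a placeholder; any CORD-style receipt image works.
model = DonutModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model.eval()

image = Image.open("path/to/receipt.png").convert("RGB")
output = model.inference(image=image, prompt="<s_cord-v2>")  # task-specific prompt
print(output["predictions"][0])  # parsed receipt as a JSON-like dict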

Getting Started

Data

This repository assumes the following dataset structure:

> tree dataset_name
dataset_name
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
              .
              .

> cat dataset_name/test/metadata.jsonl
{"file_name": {image_path0}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
{"file_name": {image_path1}, "ground_truth": "{\"gt_parse\": {ground_truth_parse}, ... {other_metadata_not_used} ... }"}
     .
     .
  • The metadata.jsonl file is in JSON Lines text format, i.e., .jsonl. Each line consists of:
    • file_name : the relative path to the image file.
    • ground_truth : a string (JSON-dumped) whose dictionary contains either gt_parse or gt_parses. Other fields (metadata) can be added to the dictionary but will not be used.
  • Donut interprets all tasks as JSON prediction problems, so all Donut model training shares the same pipeline. For training and inference, the only thing to do is to prepare gt_parse or gt_parses for the task in the format described below; a small metadata-writing sketch follows this list.
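
As a small illustration (not part of the official tooling; the file name and fields below are placeholders), one metadata.jsonl line for an information-extraction sample could be produced like this:

import json

# "ground_truth" is itself a JSON-dumped string that wraps the task-specific gt_parse.
sample = {
    "file_name": "receipt_00001.png",
    "ground_truth": json.dumps(
        {"gt_parse": {"menu": [{"nm": "ICE BLACKCOFFEE", "cnt": "2"}]}}
    ),
}

with open("dataset_name/train/metadata.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")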

For Document Classification

The gt_parse follows the format of {"class" : {class_name}}, for example, {"class" : "scientific_report"} or {"class" : "presentation"}.

  • Google colab demo is available here.
  • Gradio web demo is available here.

For Document Information Extraction

The gt_parse is a JSON object that contains full information of the document image, for example, the JSON object for a receipt may look like {"menu" : [{"nm": "ICE BLACKCOFFEE", "cnt": "2", ...}, ...], ...}.

  • More examples are available at CORD dataset.
  • Google colab demo is available here.
  • Gradio web demo is available here.

For Document Visual Question Answering

The gt_parses follows the format of [{"question" : {question_sentence}, "answer" : {answer_candidate_1}}, {"question" : {question_sentence}, "answer" : {answer_candidate_2}}, ...], for example, [{"question" : "what is the model name?", "answer" : "donut"}, {"question" : "what is the model name?", "answer" : "document understanding transformer"}].

  • DocVQA Task1 has multiple answers, hence gt_parses should be a list of dictionaries, each containing a question-answer pair (see the sketch after this list).
  • Google colab demo is available here.
  • Gradio web demo is available here.
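
For illustration only (the file name and QA pairs are placeholders), a DocVQA-style metadata.jsonl line wraps the list of question-answer pairs under the plural key gt_parses:

import json

gt_parses = [
    {"question": "what is the model name?", "answer": "donut"},
    {"question": "what is the model name?", "answer": "document understanding transformer"},
]
line = {"file_name": "docvqa_00001.png", "ground_truth": json.dumps({"gt_parses": gt_parses})}
print(json.dumps(line))  # one line of metadata.jsonl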

For (Pseudo) Text Reading Task

The gt_parse looks like {"text_sequence" : "word1 word2 word3 ... "}

  • This task is also used as a pre-training task of the Donut model.
  • You can use our SynthDoG 🐶 to generate synthetic images for the text reading task with proper gt_parse. See ./synthdog/README.md for details.

Training

This is the configuration used in our experiments for Donut model training on the CORD dataset. We ran this with a single NVIDIA A100 GPU.

python train.py --config config/train_cord.yaml \
                --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
                --dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
                --exp_version "test_experiment"    
  .
  .                                                                                                                                                                                                                                         
Prediction: <s_menu><s_nm>Lemon Tea (L)</s_nm><s_cnt>1</s_cnt><s_price>25.000</s_price></s_menu><s_total><s_total_price>25.000</s_total_price><s_cashprice>30.000</s_cashprice><s_changeprice>5.000</s_changeprice></s_total>
Answer: <s_menu><s_nm>Lemon Tea (L)</s_nm><s_cnt>1</s_cnt><s_price>25.000</s_price></s_menu><s_total><s_total_price>25.000</s_total_price><s_cashprice>30.000</s_cashprice><s_changeprice>5.000</s_changeprice></s_total>
Normed ED: 0.0
Prediction: <s_menu><s_nm>Hulk Topper Package</s_nm><s_cnt>1</s_cnt><s_price>100.000</s_price></s_menu><s_total><s_total_price>100.000</s_total_price><s_cashprice>100.000</s_cashprice><s_changeprice>0</s_changeprice></s_total>
Answer: <s_menu><s_nm>Hulk Topper Package</s_nm><s_cnt>1</s_cnt><s_price>100.000</s_price></s_menu><s_total><s_total_price>100.000</s_total_price><s_cashprice>100.000</s_cashprice><s_changeprice>0</s_changeprice></s_total>
Normed ED: 0.0
Prediction: <s_menu><s_nm>Giant Squid</s_nm><s_cnt>x 1</s_cnt><s_price>Rp. 39.000</s_price><s_sub><s_nm>C.Finishing - Cut</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>B.Spicy Level - Extreme Hot Rp. 0</s_price></s_sub><sep/><s_nm>A.Flavour - Salt & Pepper</s_nm><s_price>Rp. 0</s_price></s_sub></s_menu><s_sub_total><s_subtotal_price>Rp. 39.000</s_subtotal_price></s_sub_total><s_total><s_total_price>Rp. 39.000</s_total_price><s_cashprice>Rp. 50.000</s_cashprice><s_changeprice>Rp. 11.000</s_changeprice></s_total>
Answer: <s_menu><s_nm>Giant Squid</s_nm><s_cnt>x1</s_cnt><s_price>Rp. 39.000</s_price><s_sub><s_nm>C.Finishing - Cut</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>B.Spicy Level - Extreme Hot</s_nm><s_price>Rp. 0</s_price><sep/><s_nm>A.Flavour- Salt & Pepper</s_nm><s_price>Rp. 0</s_price></s_sub></s_menu><s_sub_total><s_subtotal_price>Rp. 39.000</s_subtotal_price></s_sub_total><s_total><s_total_price>Rp. 39.000</s_total_price><s_cashprice>Rp. 50.000</s_cashprice><s_changeprice>Rp. 11.000</s_changeprice></s_total>
Normed ED: 0.039603960396039604                                                                                                                                  
Epoch 29: 100%|█████████████| 200/200 [01:49<00:00,  1.82it/s, loss=0.00327, exp_name=train_cord, exp_version=test_experiment]

Some important arguments:

  • --config : config file path for model training.
  • --pretrained_model_name_or_path : string format, model name in Hugging Face modelhub or local path.
  • --dataset_name_or_paths : string format (json dumped), list of dataset names in Hugging Face datasets or local paths.
  • --result_path : file path to save model outputs/artifacts.
  • --exp_version : used for experiment versioning. The output files are saved at {result_path}/{exp_version}/*

Test

With the trained model, test images and ground truth parses, you can get inference results and accuracy scores.

python test.py --dataset_name_or_path naver-clova-ix/cord-v2 --pretrained_model_name_or_path ./result/train_cord/test_experiment --save_path ./result/output.json
100%|█████████████| 100/100 [00:35<00:00,  2.80it/s]
Total number of samples: 100, Tree Edit Distance (TED) based accuracy score: 0.9129639764131697, F1 accuracy score: 0.8406020841373987

Some important arguments:

  • --dataset_name_or_path : string format, the target dataset name in Hugging Face datasets or local path.
  • --pretrained_model_name_or_path : string format, the model name in Hugging Face modelhub or local path.
  • --save_path: file path to save predictions and scores.
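
The reported numbers come from Donut's JSON evaluation utilities. Below is a rough sketch of how a single prediction/answer pair could be scored with donut.JSONParseEvaluator; the toy dictionaries are placeholders, not real CORD samples:

from donut import JSONParseEvaluator

# Toy example: an identical prediction and answer should give perfect scores.
pred = {"menu": [{"nm": "Lemon Tea (L)", "cnt": "1", "price": "25.000"}]}
answer = {"menu": [{"nm": "Lemon Tea (L)", "cnt": "1", "price": "25.000"}]}

evaluator = JSONParseEvaluator()
ted_acc = evaluator.cal_acc(pred, answer)  # tree-edit-distance-based accuracy for one sample
f1 = evaluator.cal_f1([pred], [answer])    # field-level F1 over lists of samples
print(ted_acc, f1)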

How to Cite

If you find this work useful to you, please cite:

@inproceedings{kim2022donut,
  title     = {OCR-Free Document Understanding Transformer},
  author    = {Kim, Geewook and Hong, Teakgyu and Yim, Moonbin and Nam, JeongYeon and Park, Jinyoung and Yim, Jinyeong and Hwang, Wonseok and Yun, Sangdoo and Han, Dongyoon and Park, Seunghyun},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}

License

MIT license

Copyright (c) 2022-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

donut's People

Contributors

dotneet, eltociear, gwkrsrch, mingosnake, moonbings, napatswift, samsamhuns


donut's Issues

How many minimum images required for training

Hello @SamSamhuns, @gwkrsrch, @VictorAtPL
I have around 60 images and 8 custom tokens; each image consists of 3-4 of the same keys but with different values, and the annotation format is like SROIE. I have followed this link,
converted my data to this structure, and followed the converter script as mentioned in the blog.

{
[{
    "Name": "Tom",
    "Buyer": "Conda",
    "contact_number": "989898989898",
    "alt_number": "55555555",
    "Buyer_id": "9856321023"
},

{
    "Name": "Hanks",
    "Buyer": "Conda",
    "contact_number": "99999999999",
    "alt_number": "25823102",
    "Buyer_id": "9856321024"
},

{
    "Name": "Lita",
    "Buyer": "Conda",
    "contact_number": "4545858402",
    "alt_number": "12121212121",
    "Buyer_id": "9856321022"
}]
}

My metadata.jsonl

{"file_name": "1.png", "ground_truth": "{\"gt_parse\": [{\"Name\": \"Tom\", \"Buyer\": \"Conda\", \"contact_number\": \"989898989898\", \"alt_number\": \"55555555\", \"Buyer_id\": \"9856321023\"}, {\"Name\": \"Hanks\", \"Buyer\": \"Conda\", \"contact_number\": \"99999999999\", \"alt_number\": \"25823102\", \"Buyer_id\": \"9856321024\"}, {\"Name\": \"Lita\", \"Buyer\": \"Conda\", \"contact_number\": \"4545858402\", \"alt_number\": \"12121212121\", \"Buyer_id\": \"9856321022\"}]}"}

This is my config. My image sizes are variable: max (2205 x 1693), min (1755 x 779).

resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "/content/drive/MyDrive/results"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from modelhub or path)
dataset_name_or_paths: ["/content/drive/MyDrive/my_VDU"] # loading datasets (from modelhub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 300 # 800/8*30/10, 10%
num_training_samples_per_epoch: 800
max_epochs: 80
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 10
gradient_clip_val: 1.0
verbose: True

I have trained using this configuration for 300, 200, 120, 80, 40, and 20 epochs, but all the results were misspelled and the numbers were wrong.
I don't know if I am doing something wrong, whether I should make some tweaks, or whether I should increase my training data.
I even tried combining 200 SynthDoG images with my data, but no luck; the results were still misspelled.

Inference result is different from test (test.py) results.

I trained a parser model using DONUT on the SROIE dataset. After training, I ran test.py and got Tree Edit Distance (TED) based accuracy score: 0.9960054721345021, F1 accuracy score: 0.9548872180451128. I checked the output.json, and it has predicted well. But when running inference on the same image, I am unable to get the same result; it is missing some of the keys.

Example: {'predictions': [{'date': '25/12/2018', 'address': 'NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.'}]}

In the above output, it missed "company" and "total".

Any reasons or suggestions here?

Thanks and Regards

How to calculate field-level F1 score of CORD test set

Hi, @gwkrsrch ,

DONUT is excellent work for the VDU community! We can reproduce the tree-based edit-distance results on the CORD test set, but it is tricky to calculate the field-level F1 score from the tree-based prediction. Could you please explain how the F1 score of the CORD test set is calculated?

Many thanks for your effort!

About Paper Photos dataset

Thanks for the great work! Can you share the paper-photos dataset which you used for SynthDoG augmentation, or am I missing something?

Using base model to OCR text

Hello,
Given that the pre-training method seems to consist of asking DONUT to OCR the text, I was wondering whether it is possible to use the pre-trained model (https://huggingface.co/naver-clova-ix/donut-base) for OCR. If so, what prompt can we use to do that? And is there anything else that needs to be done?

Btw, this is amazing work, congratulations! :)

Error on validation

I tried training using the guide provided in this repo, but it failed due to the following errors:

Validation:   0%|                                       | 0/100 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 150, in <module>
    train(config)
  File "train.py", line 134, in train
    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 697, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 355, in validation_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 90, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "/home/jupyter/src/donut/donut/lightning_module.py", line 72, in validation_step
    return_attentions=False,
  File "/home/jupyter/src/donut/donut/donut/model.py", line 477, in inference
    output_attentions=return_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/generation_utils.py", line 1147, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/opt/conda/lib/python3.7/site-packages/transformers/generation_utils.py", line 863, in _validate_model_kwargs
    f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
ValueError: The following `model_kwargs` are not used by the model: ['encoder_outputs'] (note: typos in the generate arguments will also show up in this list)

Config

resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "./result4"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from modelhub or path)
dataset_name_or_paths: ["naver-clova-ix/cord-v2"] # loading datasets (from modelhub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
# warmup_steps: 300 # 800/8*30/10, 10%
warmup_steps: 10 # 800/8*30/10, 10%
num_training_samples_per_epoch: -1
max_epochs: 3
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 1
gradient_clip_val: 1.0
verbose: True
data_dir: ''

Training For Document Information Extraction

It's a great project, and I want to try out the OCR-free approach.
I have 3 questions related to training:

  1. We need to create ground truth for the training, test, and validation sets. Is there any tool to perform the annotations and produce the input in the required training format?

  2. For training I think you need to use OCR to create the ground truth data; then how is the text extracted during inference?

  3. I see we need to provide a dictionary hierarchy for classes in the ground truth. Can I use my own classes and a custom hierarchy for the ground truth? For example:

    {
        "gt_parse": {
            "Item": [
                {
                    "Description": "SPGTHY BOLOGNASE",
                    "Quantity": "1",
                    "Price": "58,000"
                },
                {
                    "Description": "SPGTHY BOLOGNASE",
                    "Quantity": "1",
                    "Price": "58,000"
                }
            ],
            "Total": {"value": "20"},
            "Sub_Total": {"value": "50"},
            "Number": {"value": "80"}
        }
    }

Could you please guide me?

Tips for training base model from scratch on smaller amount of datasets

Hello @gwkrsrch ,

I am very excited about this model and an e2e approach it implements.

For my master thesis, I'd like to make an experiment to compare your method of generating synthetic documents with mine. I am only interested in evaluating the model on the Document Information Extraction downstream task with the CORD dataset and my proprietary one (let's call it PolCORD).

I'd like to train the Donut model on the (Pseudo) Text Reading Task with:
1/ naver-clova-ix/synthdog-en; synthdog-id; synthdog-pl (total 1.5M examples)
2/ my-method-en, my-method-id, my-method-pl (total 1.2M examples)

Could you give me a hand and share your experience:

  1. How can I generate/prepare a corpus for the Indonesian and Polish languages in the same way you prepared them here: https://github.com/clovaai/donut/tree/master/synthdog/resources/corpus
  2. If I am going to train the model on 1.2-1.5M examples instead of 13M, do you have a gut feeling about whether I need to downsize the model defined here, and to what values: https://huggingface.co/naver-clova-ix/donut-base/blob/main/config.json?
  3. How many examples were you able to fit onto a single A100 GPU? I have the 40GB version and I'm going to use 16 of them.

Question on fine-tuning document form parsing labeling requirement

My goal is to read a specific field (say, box 30) from a nationally standardized insurance claim form. The form has 40 boxes/fields in fixed locations, and each box is labeled clearly with its box number and title.

To save annotation time, I would like our labeling team to annotate the text from box 30 only (ignore all other boxes in the form). If I fine-tune on such annotations, is donut expected to give good results or not?

If we have to annotate the entire form box-by-box, the time it takes will be over 10x longer.

How much GPU memory does a single 1280x960 image need?

I tried to run donut-base on a 2080 Ti with a batch size of 1, but it didn't work, and it looks like the GPU memory is too small.
So I want to ask: has anyone tried to run it on a 2080 Ti, and how much GPU memory does a single 1280x960 image need?

Erroneous Text output for IE task

Hi,
I tried fine-tuning the model with a custom receipt dataset for the IE task and noticed issues with the output text extracted for a given set of keys. It either misses or adds an extra 1-2 characters relative to the actual text present in the document, and this pattern is very frequent. I am using the default input_size: [1280, 960]. The images are really clear, and any other off-the-shelf OCR model is able to extract the text with no errors. I fine-tuned the model with 400 images with 15 keys and tested it on 100 samples. Has anyone encountered such an issue?

donut processing on PDF Documents

Hello,

I have a few certificate documents which are in PDF format. I want to extract metadata from those documents as you suggest.
Could you please clarify the points below?

  1. Can I use your model directly without pre-training on the certificate data?
  2. How do I train your model on my certificates, as they are confidential, and what folder structure do you expect for the training data?
  3. How do I convert my dataset into your format (SynthDoG)? It was not very clear to me.

Thank you and looking forward to your response.

Best Regards,
Arun

Fine Tuning with Arabic

First, I would like to thank you for this repo.
I want to work with the Arabic language, which is written RTL.
Could you give me a brief overview of the changes I would need to make when adding Arabic to SynthDoG to create an Arabic dataset,
and to the model creation?

Is this available with other languages?

Hi, thank you for sharing this nice work.
Is this available for other languages (like Korean, Japanese, ...)?
If so, could you please give some tips for preparing the data?

Are "valid_line" and "meta" keys required for training?

I noticed that in the cord-v2 dataset there are "valid_line", "meta", and other keys in the jsonl dictionary.
Are these used/required during training for document parsing, or are they ignored by the system since they are not strictly part of gt_parse?

Training Never Starts in Single GPU machine (Solution)

Hi, I don't use PyTorch Lightning much and have not done distributed training before (noob here),
but I found that training never starts on a single-GPU, single-node configuration.

The solution I found was to set the num_nodes parameter in the train configuration to 1.
If the number is greater than 1, PyTorch Lightning waits for the other nodes, I presume.

It took me a lot of time to get it right, so I'm putting it out there for fellow noobs :)

Thanks for sharing such incredible work with the community!

How to train and annotate on custom dataset

@gwkrsrch
It's a great project, but I do have a couple of questions on how to annotate my custom dataset.
I have 10K images with text on them, and I want to extract different categories from them, like price, object count, product name, and product description. Is there any tool to do so? If not, how can it be done?

How to use Donut Model encoders as embeddings?

Hi,
First of all, amazing work!
I wanted to use the pre-trained Donut model to generate embeddings for my documents. Is there any easy way to do this,
or would I need to make some changes to the forward function?
Thank you!
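
Not an official answer, but one possible route (assuming the Hugging Face Transformers port mentioned in the Updates section, with a placeholder image path) is to run only the Swin encoder and pool its patch features:

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Rough sketch: use the encoder of the Hugging Face Donut port and mean-pool
# its patch features into a single document embedding.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

image = Image.open("path/to/document.png").convert("RGB")  # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    patch_features = model.encoder(pixel_values).last_hidden_state  # (1, num_patches, hidden)
embedding = patch_features.mean(dim=1)  # simple mean pooling -> (1, hidden)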

Optimizer settings for DONUT pre-training on Synthdog

Hi, @gwkrsrch ,

Many thanks for your effort to unblock the issues! I am trying to reproduce pre-training of DONUT-proto on SynthDoG, but I cannot get reasonable results. Could you please reveal the optimizer settings (i.e., the settings of torch.optim.Adam and the scheduler) of DONUT-proto pre-training? It would be a great help in reproducing the pre-training!

Where do classes get added as special tokens?

Hi,

I've implemented Donut as a fork of HuggingFace Transformers, and soon I'll add it to the library. The model is implemented as an instance of VisionEncoderDecoderModel, which makes it possible to combine any vision Transformer encoder (like ViT, Swin) with any text Transformer as the decoder (like BERT, GPT-2, etc.). As Donut does exactly that, it was straightforward to implement it this way.

Here's a notebook that shows inference with it.

I do have 2 questions though:

  • I prepared a toy dataset of RVL-CDIP, in order to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added to the special tokens of the tokenizer + decoder. The toy dataset can be loaded as follows:
from datasets import load_dataset

dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")

When using this dataset to create an instance of DonutDataset, it seems only "<s_class>", "</s_class>" and "<s_rvlcdip>" are added as special tokens. But looking at this file, it seems that special tokens are also defined for each class. Looking at the code, it seems only keys are added, not the values of the dictionaries.

  • I've uploaded all weights to the hub; currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, the names are already taken for the PyPI package of this repository, so we can either use branches within the GitHub repos to specify a specific revision, or give priority to either HuggingFace Transformers or this PyPI package for the names.

Let me know what you think!

Kind regards,

Niels
ML Engineer @ HuggingFace

Sample of metadata of DocVQA

Great work. Could you please share a sample of the ground truth/metadata for the Document VQA data? For example, in the ground truth (metadata) of the CORD data, there are gt_parse, meta, and valid_line fields, and valid_line has each word along with quad information. I am curious about the ground truth of the VQA data: what will be the structure of valid_line? Will it be the full answer with quad information, or the answer split into words, each along with quad information?

Finetuning on DONUT-proto

Hi, @gwkrsrch ,

It works well in the case of DONUT-base, but DONUT-proto does not. Could you please provide the finetuning YAML configuration file of DONUT-proto? Many thanks for your effort!

For (Pseudo) Text Reading Task

Hi, for the text reading task the README instructs:

You can use our SynthDoG 🐶 to generate synthetic images for the text reading task with proper gt_parse. See ./synthdog/README.md for details.

But there is no detail there about it.

Problem Finetuning with Provided Pretrained Model

Hi, I have recently been encountering the following errors when I try to fine-tune using the provided pre-trained models.

  1. When I cloned the original repo and tried finetuning on CORD per the instructions like below:
    python train.py --config config/train_cord.yaml \
                    --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
                    --dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
                    --exp_version "test_experiment"
    the following error pops up:
    Traceback (most recent call last):
      File "train.py", line 149, in <module>
        train(config)
      File "train.py", line 130, in train
        callbacks=[lr_callback, checkpoint_callback],
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
        return fn(self, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 459, in __init__
        training_epoch_loop = TrainingEpochLoop(min_steps=min_steps, max_steps=max_steps)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 51, in __init__
        if max_steps < -1:
    TypeError: '<' not supported between instances of 'NoneType' and 'int'

  2. When I tried with my local modified repo, the following error pops up
    Traceback (most recent call last):
      File "train.py", line 146, in <module>
        train(config)
      File "train.py", line 57, in train
        model_module = DonutModelPLModule(config)
      File "/data/project/users/xingjianzhao/visual-information-extraction/code/Donut/donut/donut/lightning_module.py", line 94, in __init__
        self.model = DonutModel.from_pretrained(
      File "/data/project/users/xingjianzhao/visual-information-extraction/code/Donut/donut/donut/donut/model.py", line 642, in from_pretrained
        model = super(DonutModel, cls).from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2155, in from_pretrained
        model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2282, in _load_pretrained_model
        model._init_weights(module)
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1050, in _init_weights
        raise NotImplementedError(f"Make sure `_init_weights` is implemented for {self.__class__}")
    NotImplementedError: Make sure `_init_weights` is implemented for <class 'donut.model.DonutModel'>
    While I did make some modifications, I tried previous versions of my repo that had worked perfectly fine, and this error still pops up. However, when I use my previously fine-tuned models (trained with the exact same code), it works fine. I'm wondering if you have any idea what the problem could be. Thanks!

Input size parameter clarification

I'm trying to run my own fine-tuning for document parsing. When building the train configuration I wondered: is the input_size parameter related to the size of the images in the dataset, or is it only used by the Swin Transformer to create the embedding windows?

If it's the latter, when should it be customized and what constraints apply to the values provided?

Thank you!

Incorrect F-1 implementation

Thanks for the great work. However, I noticed the current field-level F-1 implementation might be erroneous.

donut/donut/util.py

Lines 239 to 253 in d2fd95a

def cal_f1(self, preds: List[dict], answers: List[dict]):
    """
    Calculate global F1 accuracy score (field-level, micro-averaged) by counting all true positives, false negatives and false positives
    """
    total_tp, total_fn_or_fp = 0, 0
    for pred, answer in zip(preds, answers):
        pred, answer = self.flatten(self.normalize_dict(pred)), self.flatten(self.normalize_dict(answer))
        for pred_key, pred_values in pred.items():
            for pred_value in pred_values:
                if pred_key in answer and pred_value in answer[pred_key]:
                    answer[pred_key].remove(pred_value)
                    total_tp += 1
                else:
                    total_fn_or_fp += 1
    return total_tp / (total_tp + (total_fn_or_fp) / 2)

In line 252, predictions not matched with ground truth are accumulated as total_fn_or_fp, which in fact are false positive samples. Meanwhile, the leftover entities in answer.values() after removing (in L249) matched predictions are not added to total_fn_or_fp, which means your implementation is ruling out false negatives in F-1 calculation.

Can you confirm whether this is an error or a specific design choice?
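
For reference, here is a rough sketch of the reporter's reading: the same matching loop as the quoted cal_f1, but with leftover ground-truth values counted as false negatives. It reuses the flatten/normalize_dict helpers of JSONParseEvaluator exactly as the quoted code does, and it is a sketch rather than an official fix.

from typing import List

from donut import JSONParseEvaluator


def cal_f1_with_false_negatives(preds: List[dict], answers: List[dict]) -> float:
    """Sketch: same true-positive/false-positive counting as the quoted cal_f1,
    but ground-truth values never matched by any prediction count as false negatives."""
    evaluator = JSONParseEvaluator()
    total_tp, total_fp, total_fn = 0, 0, 0
    for pred, answer in zip(preds, answers):
        pred = evaluator.flatten(evaluator.normalize_dict(pred))
        answer = evaluator.flatten(evaluator.normalize_dict(answer))
        for pred_key, pred_values in pred.items():
            for pred_value in pred_values:
                if pred_key in answer and pred_value in answer[pred_key]:
                    answer[pred_key].remove(pred_value)  # consume the matched ground-truth value
                    total_tp += 1
                else:
                    total_fp += 1
        # whatever is left in `answer` was never predicted -> false negatives
        total_fn += sum(len(values) for values in answer.values())
    return 2 * total_tp / (2 * total_tp + total_fp + total_fn)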

Which OCR works in inference internally ?

I wanted to know which OCR it uses internally for training or inference. It claims to be OCR-free VDU, but then how does it understand the coordinates and text when running inference on an image? Basically, getting what text is written there during inference is necessary (according to my assumption).

Performance gap of baseline methods

Thanks for the inspiring work. When I checked Table 2 in the main paper, I noticed the field-level F1 scores for baseline methods such as LayoutLM, LayoutLMv2, and BROS are much lower than those in their papers. They have 90+ F1 scores on CORD, whereas in your paper they score ~80. Could you please provide an explanation?

Table from LayoutLMv2 paper


How to train and annotate on custom dataset

Hello @gwkrsrch, first I want to thank you guys for open-sourcing this amazing project. Maybe my questions are very common and silly, but answers would help me and others get more clarity. I am trying to train custom Document Information Extraction, but I don't know which annotation tool to use. In a comment by @VictorAtPL, I have seen that they are using the Label Studio OCR template to annotate the images; this is an exported example from Label Studio:

[
  {
    "ocr": "/data/upload/1/fe00.png",
    "id": 2,
    "bbox": [
      {
        "x": 20.62937062937063,
        "y": 23.60248447204969,
        "width": 18.88111888111888,
        "height": 8.695652173913043,
        "rotation": 0,
        "original_width": 1920,
        "original_height": 1080
      }
    ],
    "transcription": "Definitions",
    "annotator": 1,
    "annotation_id": 2,
    "created_at": "2022-09-06T23:23:49.284150Z",
    "updated_at": "2022-09-06T23:23:49.284176Z",
    "lead_time": 265.562
  }
]

My questions are:

  1. Which is the best tool for annotating for Donut custom Document Information Extraction?
  2. Should we annotate the text box and write the text, as in the example? If yes, what would be the efficient way to do it?
  3. Is there any converter script which converts the Label Studio format to the Donut format?
  4. Is there any document covering start-to-end training of custom data with annotation?

Performance with CPU

I notice you put the model on the Gradio demo, and it seems to be running nicely. However, when I attempt to "dockerize" the model and run it in the cloud with the following configuration: 4 vCPU and 16GB RAM, it remains frozen or extremely sluggish (5 minutes per picture).

Could you please share the infrastructure configuration behind the Gradio demo? Is there anything I did wrong?

Local custom dataset & Potential typo in test.py

Hi, thanks for this interesting work!
I tried to use this model on a local custom dataset and followed the dataset structure as specified, but it failed to load correctly. I ended up having to hard-code some data loading logic to make it work. It would be greatly appreciated if you could provide a demo or example of a local dataset. Thanks!

PS: I think there may be a typo in the test.py: the '--pretrained_path' should probably be '--pretrained_model_name_or_path' ?

Dataset for pre-training

First of all, thank you for open-sourcing the codebase and pre-trained models for tinkering. I am really excited to try new ideas to extend the project. Specifically, I want to train the model in a slightly different way. As mentioned in section 3.4, the Clova-based results are awe-inspiring compared to the others. I would be happy if you could share the preprocessed dataset for training purposes. :)

test.py seems broken

In test.py, f-string formatting with double quotation marks around ground_truth["gt_parses"][0]["question"].lower()
causes some parsing issues.

Extracting the question prompt, i.e.

        if args.task_name == "docvqa":
            question = ground_truth["gt_parses"][0]['question'].lower()
            output = pretrained_model.inference(
                image=sample["image"],
                prompt=f"<s_{args.task_name}><s_question>{question}</s_question><s_answer>",
            )["predictions"][0]

solves the issue.

Release yaml files

Hi,

Thank you for sharing your interesting work. I was wondering if there is an expected date for when you will release YAML files for anything other than CORD? I want to reproduce the experimental results in my environment.

DistributedDataParallel error in large dataset size

Hi,

I am running Donut to pre-train on my custom data. However, when I scaled up the data size (~2M images), I got this error.
(I have verified that Donut runs successfully on smaller datasets, such as DocVQA and CORD.)

    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1171, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 152, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 205, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 41890).

Could you tell me how to solve this error?

How to perform text reading task

Hi, thanks for the great project!
I am excited to integrate the model into my document understanding project, and I want to implement the text reading task.
I have one question:

  • According to my understanding, I should download the pre-trained model from "naver-clova-ix/donut-base", but what would be the prompt token fed into the decoder?

Answer bounding box

Hi,

I appreciate very much this simple and effective approach to information extraction. My question is - can the model produce the bounding box for the extracted text?

As a workaround, I am thinking of fuzzy matching the text against an OCR output with bounding boxes, but if the data is replicated in multiple locations on the page, then it becomes difficult to know where the answer was copied from.

Thanks

Different input resolution throws error

The following is the error we get when we try to pass an input size of 512*2, 512*3.
Are different input resolutions/sizes not supported currently?
Traceback (most recent call last):
  File "train.py", line 149, in <module>
    train(config)
  File "train.py", line 57, in train
    model_module = DonutModelPLModule(config)
  File "/home/souvic/Desktop/upwork1/donut/donut/lightning_module.py", line 35, in __init__
    ignore_mismatched_sizes=True,
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 595, in from_pretrained
    model = super(DonutModel, cls).from_pretrained(pretrained_model_name_or_path, revision="official", *model_args, **kwargs)
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2113, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 387, in __init__
    name_or_path=self.config.name_or_path,
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 70, in __init__
    num_classes=0,
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 500, in __init__
    downsample=PatchMerging if (i < self.num_layers - 1) else None
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 408, in __init__
    for i in range(depth)])
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 408, in <listcomp>
    for i in range(depth)])
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 281, in __init__
    mask_windows = window_partition(img_mask, self.window_size)  # num_win, window_size, window_size, 1
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 111, in window_partition
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
RuntimeError: shape '[1, 25, 10, 38, 10, 1]' is invalid for input of size 98304
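
A quick way to see why this particular shape fails (a sketch based on the swin window size of 10 listed for donut-base above, not an official diagnosis): Swin stages operate at the input resolution divided by 4, 8, 16, and 32, and window partitioning needs each of those feature maps to be a multiple of the window size, which 1024x1536 is not.

def divisible_by_window(height: int, width: int, window_size: int = 10, num_stages: int = 4) -> bool:
    # Check that every Swin stage resolution (H/4, W/4), (H/8, W/8), ... is a
    # multiple of the attention window size used during pre-training.
    for stage in range(num_stages):
        h, w = height // (4 * 2 ** stage), width // (4 * 2 ** stage)
        if h % window_size or w % window_size:
            return False
    return True

print(divisible_by_window(2560, 1920))  # True  (donut-base pre-training size)
print(divisible_by_window(1280, 960))   # True
print(divisible_by_window(1024, 1536))  # False (512*2, 512*3 -> the error above)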

Task | Understanding Paragraphs & Document Layout Analysis

Thanks for publishing this interesting work.

Would I be able to extend the Document Understanding task to learn hierarchies over paragraphs of text within a page? Or is the 512 token limit going to prohibit the OCR of paragraphs?

Would the input look like so?

{
    "file_name": {image_path1},
    "ground_truth": `{
          "item": [{"title": "item-title", "text": "insert some paragraph of text"}],
          "table": ["column 1 column 2 column 3 0 0 3"],
          "title": ["title page"]
    }`
}

Further, would it be possible to alter the task objective to Document Layout analysis and train on PubLayNet as per LayoutLMv3?

Finetuning Donut on FUNSD dataset

Hi,

Thank you for open-sourcing DONUT and SynthDoG. I have two requests.

  1. After pre-training (the "how to read"/pseudo-OCR task), is there documentation about how to fine-tune ("how to understand") on a different dataset like FUNSD?
  2. Can we generate synthetic documents resembling forms/invoices using SynthDoG? If yes, can you provide hints on whether we need a template or something?

Add bounding boxes coordinates in predictions

It could be useful to get bounding boxes coordinates from Document Information Extraction task predictions.

On a conventional OCR-based pipeline, the bounding boxes come from the OCR output. On Donut, it could be something like:

{
    'predictions': [{
        'menu': [{
                'cnt': '2',
                'nm': 'ICE BLAOKCOFFE',
                'price': '82,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
            {
                'cnt': '1',
                'nm': 'AVOCADO COFFEE',
                'price': '61,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
        ],
        'total': {
            'cashprice': '200,000',
            'changeprice': '25,400',
            'total_price': '174,600',
            'bbox': [xmin, ymin, xmax, ymax]
        }
    }]
}

possible solution (I did not succeed):
#16 (comment)

How to get confidence score for predictions?

Hi, thank you for this outstanding work. Could you point me to how one could generate confidence scores along with the JSON predictions from the models, especially the models for the Document Parsing Task?

Thanks

Finetuning Epochs on DocVQA and RVLCDIP

Hi, @SamSamhuns @gwkrsrch . Many thanks for your efforts!

I tried fine-tuning DONUT on RVL-CDIP and DocVQA with 8 V100 GPUs, but the fine-tuning process is too long (up to weeks for RVL-CDIP). May I know whether 100 epochs for RVL-CDIP and 300 for DocVQA are necessary, and how you fine-tuned the model (e.g., epochs and batch size)? The fine-tuning overhead is too large with such long schedules according to the provided configs.

model checkpoint did not match

I got the error
"Some weights of DonutModel were not initialized from the model checkpoint at naver-clova-ix/donut-base and are newly initialized because the shapes did not match"
when using the training code in README.md.

Maybe some layer's shape does not match.

Is this not a serious problem?
