DocBank: A Benchmark Dataset for Document Layout Analysis
License: Apache License 2.0
Using the data points given for the DocBank dataset on PDF files, we get very distorted annotations when overlaying the data points on images resized to the PDF height and width. We used the normalize-to-1000 procedure that you suggested in another issue. However, the following is an example of the distorted output:
I tried the BERT and RoBERTa models available in the model zoo. However, I got only LABEL_8 for all the tokens. I tried with both the simpletransformers and Hugging Face transformers libraries.
from simpletransformers import ner
import pytesseract
import json

with open("bert_large_500k_epoch_1/config.json") as f:
    config = json.load(f)

model_args = ner.NERArgs()
model_args.config = config
model_args.labels_list = ["LABEL_0", "LABEL_1", "LABEL_2", "LABEL_3",
                          "LABEL_4", "LABEL_5", "LABEL_6", "LABEL_7",
                          "LABEL_8", "LABEL_9", "LABEL_10", "LABEL_11",
                          "LABEL_12"]

model = ner.NERModel(
    'bert',
    'bert_large_500k_epoch_1',
    args=model_args,
    use_cuda=False,
)

predictions, raw_outputs = model.predict(
    [pytesseract.image_to_string('closing-disclosure-H25B-1.aec3e9325e5b.png')]
)
print(predictions)
from transformers import RobertaTokenizer, RobertaForTokenClassification, RobertaConfig
import torch
import pandas as pd
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained("roberta_large_500k_epoch_1")
model.eval()
df = pd.read_csv("10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt", sep='\t', header=None)
text = " ".join(df[0].to_list())
sample_input = tokenizer(text, return_tensors="pt", max_length=512)
output = model(**sample_input)
print(torch.argmax(output[0], dim=-1))
Is there anything I am missing? I also tried some txt files from the dataset, but I faced a max_seq_length issue and got only LABEL_8 and LABEL_10 with truncation.
What is the exact mapping of the LABEL_* names mentioned in the config.json of these models? Which is title, table, list, etc.?
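For what it's worth, if a checkpoint's config.json only carries generic LABEL_i names, the human-readable names have to come from the labels.txt used at training time. A minimal sketch of building such a mapping — the label order below is hypothetical, purely for illustration, not the authors' confirmed ordering:

```python
# Sketch: recover readable names for generic LABEL_i ids.
# ASSUMPTION: this label order is hypothetical, for illustration only;
# the authoritative order is whatever labels.txt was used at training time.
labels = ["abstract", "author", "caption", "date", "equation", "figure",
          "footer", "list", "paragraph", "reference", "section", "table",
          "title"]
id2label = {f"LABEL_{i}": name for i, name in enumerate(labels)}
print(id2label["LABEL_8"])
```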
Among the published models, the one for Detectron2 recently stopped working.
The server now responds with "409 Public access is not permitted on this storage account."
Could this be hosted elsewhere?
Also, in principle, would I be allowed to redistribute your models (e.g. as part of a GitHub package)?
Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I failed to reach performance similar to that reported in the paper. One of my hypotheses concerns the order of the words in the input I provide to BERT.
When looking at the data, it appears to me that, in some examples, the words are not in reading order but in left-to-right order. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt we jump from one column to the second one.
In my understanding, reading order is really important for being able to use/fine-tune BERT.
Moreover, your publication states that you used the dataset in reading order (the "We organize the DocBank dataset using the reading order" section).
Here are my questions:
Thank you again, and I hope that my questions make sense.
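To illustrate the problem the issue describes, here is a minimal two-column re-sorting heuristic. The page width, the midpoint column split, and the function name are all assumptions for illustration; real pages need a proper layout-analysis step:

```python
def two_column_reading_order(tokens, page_width=1000):
    """Re-sort tokens for a two-column page: assign each token to the left
    or right column by its horizontal midpoint, then read each column top
    to bottom. A heuristic sketch only, not the authors' method.
    tokens: list of (word, (x0, top, x1, bottom))."""
    mid = page_width / 2

    def key(token):
        x0, top, x1, bottom = token[1]
        column = 0 if (x0 + x1) / 2 < mid else 1
        return (column, top, x0)

    return sorted(tokens, key=key)

# Raw left-to-right order interleaves the columns; the heuristic reads the
# whole left column first, then the right column.
tokens = [("left1", (0, 0, 100, 10)), ("right1", (600, 0, 700, 10)),
          ("left2", (0, 20, 100, 30))]
print([w for w, _ in two_column_reading_order(tokens)])
```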
Hi, I'm trying to use run_seq_labeling.py from https://github.com/microsoft/unilm/tree/master/layoutlm on your data. However, the input format looks different. Theirs uses _box.txt files that contain all samples.
I also noticed that run_seq_labeling.py adds a label "O", but your labels.txt files already have 13 classes, like the pretrained models you provide, which makes me doubt that you used run_seq_labeling.py to train your model. Can you provide your training/evaluation code?
I am trying to use the pdf_process.py script to parse the content from a PDF file and create an annotated file with 10 fields.
Unfortunately, for both black-and-white and colored PDF files, I am not getting the RGB and label values.
Can you please suggest how I can generate a .txt file with 10 fields using this script?
Complex 80 66 198 91 AAEWKY+NimbusRomNo9L-Regu
Block 210 66 288 91 AAEWKY+NimbusRomNo9L-Regu
Floating-Point 299 66 487 91 AAEWKY+NimbusRomNo9L-Regu
MNRAS 70 94 123 104 UFZJOE+CMR8
000, 128 94 156 104 ECMFEV+CMBX8
1–16 161 94 190 104 UFZJOE+CMR8
(2017) 194 94 234 104 UFZJOE+CMR8
Thanks.
The PDF process script should ship its requirements as a separate file so it can run on its own. In pdfplumber v0.5.24 the reference to Container.figures was removed, so the script no longer works properly. pdfplumber 0.5.23 is the last version with which the script runs successfully.
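A defensive version check along these lines could fail fast before parsing. The cutoff follows the report above (Container.figures removed in 0.5.24); the helper name is ours:

```python
def supports_container_figures(version: str) -> bool:
    """True if this pdfplumber version predates the removal of
    Container.figures (removed in 0.5.24, per the report above)."""
    return tuple(int(p) for p in version.split(".")) <= (0, 5, 23)

print(supports_container_figures("0.5.23"))
print(supports_container_figures("0.5.24"))
```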
Do I need to retrain in order to run inference? Or can I use a pretrained model and run inference for document image classification?
Should I use LayoutLM to run inference with the pretrained model?
Why couldn't I find a script for inference, only one for retraining?
pdf_process.py throws an error:
Traceback (most recent call last):
  File "pdf_process.py", line 200, in <module>
    worker(pdf_file, args.data_dir, args.output_dir)
  File "pdf_process.py", line 109, in worker
    for figure in this_page.figures:
AttributeError: 'Page' object has no attribute 'figures'
Could you tell me what causes this?
Thanks for the great work!
My question is as titled.
I have been working with DocBank_samples for a month now. Today I downloaded the main dataset from OneDrive and I could not see any PDF files!
I wanted to ask if it would be possible to provide the PDF files too?
I appreciate the help!
Which class corresponds to which ID in the pretrained network (e.g. author = ID 8)?
I tried the three different data subsets, but to me it looked like none of the categories and their corresponding IDs matched the order of the annotations. The three subsets train, valid, and test have different ID/class-category orderings.
For example, ID 2 has the class section assigned in the training set, while in valid it's equation and in test it's reference.
Is there some additional information regarding this?
Just like PubLayNet.
But the most limiting aspect of PubLayNet is that it has too few categories,
which makes it hard to meet real-world needs.
Impressive work done.
I'm wondering when the dataset or baseline will be released?
Many thanks.
Hi:
Thank you for your datasets. Direct download in browser is unstable. Could you offer a method by which we can download the data from the command line, e.g. wget? @liminghao1630 @wolfshow @ranpox
How should the models be used? Is it that layoutlm's run_seq_labeling cannot be used with this dataset?
Hi, I used your released models with transformers and tried to run inference. However, the test results are not very good, so I wonder if my inference method is correct. During this process, I ran into a few problems:
For a document that contains more than 512 words, how do you split the data? I have two ideas:
For example, if a document contains 5 words, ABCDE, and we assume the window size equals 2.
Do you use one of the above methods or other methods?
Thank you!
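For reference, the two natural splitting strategies for the issue's toy example (5 tokens, window size 2) can be sketched as one function; the function name and the stride parameterization are ours, not the authors' confirmed method:

```python
def sliding_windows(tokens, size, stride):
    """Split a long token sequence into windows of at most `size` tokens.
    stride == size gives non-overlapping chunks; stride < size overlaps
    windows so boundary tokens get context from both sides."""
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return windows

# The issue's toy example: 5 tokens ABCDE, window size 2.
print(sliding_windows(list("ABCDE"), size=2, stride=2))  # disjoint chunks
print(sliding_windows(list("ABCDE"), size=2, stride=1))  # overlapping
```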
Hey, I can't download DocBank anymore.
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>
Public access is not permitted on this storage account. RequestId:b7c5e5e6-301e-0041-21a3-b0b0b6000000 Time:2023-07-07T07:21:41.1497126Z
</Message>
</Error>
Hi, @liminghao1630 @ranpox @doc-analysis, could you please provide a directly downloadable link on OneDrive, so we can download the data on a server?
Hi guys, I got a 403 response from the dataset URL with the following message:
<Error>
<Code>AuthenticationFailed</Code>
<Message>
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:63fa7064-c01e-0055-782d-92f8d9000000 Time:2024-04-19T07:46:53.2863477Z
</Message>
<AuthenticationErrorDetail>
Signature did not match. String to sign used was layoutlm r b o 2023-06-08T08:48:15Z 2033-06-08T16:48:15Z https 2022-11-02
</AuthenticationErrorDetail>
</Error>
I have found some broken images in "DocBank_500K_ori_img" which cause errors while training; considering the huge size of the dataset, a break in the middle of training is unbearable.
I suggest the authors re-check the image dataset, especially the following images:
9.tar_1501.04477.gz_RobustSwitching-ergodic_3_ori.jpg
97.tar_1705.02752.gz_kcenters_22_ori.jpg
66.tar_1504.08256.gz_Manipulation_Partial_Info_without_cref_5_ori.jpg
88.tar_1704.08423.gz_SingularSystem_1-2_17_ori.jpg
None of them could be loaded by either PIL or OpenCV.
Since the authors did not provide a hash checksum of the archive file "DocBank_500K_ori_img.zip" and it is really large, I'd rather not re-download it just to check the completeness of my copy.
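Until checksums are available, a cheap pre-flight scan can flag obviously truncated JPEGs without fully decoding them. This only checks the start/end markers and is not a substitute for a full decode (e.g. Pillow's Image.verify()):

```python
def looks_complete_jpeg(data: bytes) -> bool:
    """Cheap integrity check: a JPEG starts with an SOI marker (FF D8) and
    ends with an EOI marker (FF D9). Files truncated by an interrupted
    download usually fail the EOI check. Not a full decode."""
    return data[:2] == b"\xff\xd8" and data.rstrip(b"\x00")[-2:] == b"\xff\xd9"

# Synthetic example: an "intact" marker pair vs. a truncated copy.
intact = b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9"
truncated = intact[:-1]
print(looks_complete_jpeg(intact), looks_complete_jpeg(truncated))
```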
Excellent work!
A question for the authors: the dataset is quite large, and the SharePoint web download is too slow and keeps aborting. Is there another download method that supports resuming interrupted transfers, or a Baidu Netdisk link?
Hello,
I am trying to use the X101 architecture from the Model Zoo as a backbone for one of my experiments with the DocBank dataset.
I am using the COCO format provided for DocBank. However, I am getting really bad results at inference.
Am I doing something wrong?
Also, I would like to ask whether the pretrained weights were trained on the COCO-based annotations or on the original token-based annotations.
Any help would be appreciated.
While downloading and extracting the data, some files are reported as corrupted. Is there another data mirror, or a Baidu Cloud link?
Hi! I am using Hugging Face to load the pretrained weights of layoutlm_large_500k_epoch_1, but it shows me the error below:
Traceback (most recent call last):
  File "D:\SoulCode\PaddleDetection\DocBank\DocBank_infer.py", line 8, in <module>
    model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")
  File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2225, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
  File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2357, in _load_pretrained_model
    raise ValueError(
ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?
The code is as follows:
from transformers import AutoTokenizer, LayoutLMForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('D:\Download\layoutlm_large_500k_epoch_1')
model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")
How can I deal with it? Thanks!
Your pretrained DocBank model doesn't have image embeddings, does it?
Only the contextualized embeddings plus the bbox embeddings?
The ResNeXt-101 model has been added to the Model Zoo.
Can you provide the official mAP results of this detection model on DocBank? This detection baseline is important for comparisons between detection methods.
We evaluated the model from the Model Zoo with Detectron2 and fixed the category mapping. The AP is 74.867, right?
Hi, I'm trying to replicate your results, and I see that the config.json for the pretrained LayoutLM has "bert" as the model type.
When I load the pretrained model, it warns: You are using a model of type bert to instantiate a model of type layoutlm. This is not supported for all configurations of models and can yield errors.
I was able to replicate your results for Roberta-Large and Bert-Large, but not for LayoutLM. Could you advise? Thank you.
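One workaround sometimes used for checkpoints with a stale config is to correct the model_type field in the checkpoint's config.json before loading. A minimal sketch on a stand-in dict — the assumption here is that the weights really are LayoutLM weights and only this one field is wrong:

```python
import json

# Stand-in for the checkpoint's config.json (the real file lives in the
# downloaded checkpoint directory). ASSUMPTION: the weights are genuine
# LayoutLM weights and only the model_type field is stale.
config = json.loads('{"model_type": "bert", "num_labels": 13}')
config["model_type"] = "layoutlm"  # let transformers pick LayoutLMConfig
print(json.dumps(config, sort_keys=True))
```

After writing the edited dict back to config.json, from_pretrained should instantiate the LayoutLM config class rather than the BERT one.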
Do the pretrained models only support the 13 label classes? I tried modifying label2id in the config file and labels.txt to add two classes, but got an error that 15 and 13 don't match.
I tried to download the dataset from the dataset homepage.
But the following error occurred when I clicked each dataset link.
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:f11eb78c-a01e-000e-3bae-9dc1e2000000 Time:2023-06-13T04:19:31.5518598Z</Message>
</Error>
Is there any other way to download the dataset?
I found that the coordinates in the annotation file are incorrect.
For example, take 2.tar_1801.00617.gz_idempotents_arxiv_4.txt in DocBank_samples: the size of the image is 1654x2339, yet none of the bounding boxes even reach half of the page.
Am I missing anything? @liminghao1630 @wolfshow @ranpox @doc-analysis
This leads to very confusing results during training, and it is difficult to locate this hidden problem... The category order needs to be modified to keep the subsets consistent.
Though this is a small problem, it really hurts the training experience. I hope @wolfshow @liminghao1630 can fix this and replace the .zip file.
Thanks for your excellent work!
It looks like there are no markers in the data to distinguish regions. For example, suppose there are two paragraphs separated by a title: how can I tell these two paragraphs apart? Tokens in both paragraphs are all marked as "paragraph" without an id or any other marker.
Can you release some flags to distinguish regions?
In pdf_process.py, why are the four coordinates x0, top, x1, and bottom divided by 1000? Could another number be used? Is it related to the image size? Thanks!
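For context, here is a sketch of how a 0-1000 normalized box maps back to pixels, assuming the DocBank convention that coordinates were scaled by the page size into the integer range [0, 1000]:

```python
def denormalize_box(box, image_width, image_height):
    """Map a 0-1000 normalized (x0, top, x1, bottom) box back to pixels.
    The 1000 is a fixed scale chosen to make boxes resolution independent,
    not an image dimension; to overlay on a rendered image, rescale by
    that image's own width and height."""
    x0, top, x1, bottom = box
    return (x0 * image_width / 1000, top * image_height / 1000,
            x1 * image_width / 1000, bottom * image_height / 1000)

# e.g. overlaying on an image rendered at 1654 x 2339 pixels
print(denormalize_box((80, 66, 198, 91), 1654, 2339))
```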
I was wondering if annotations for an element (a region comprising one or more tokens), such as a paragraph (similar to the annotations in PubLayNet), are available, or if there exists a mechanism to compute them. I suppose running some sort of clustering algorithm could do the job decently, but I would like to know if these annotations already exist.
Thanks in advance for your help!
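A minimal greedy grouping heuristic in the spirit of the clustering idea above; the function, the gap threshold, and the input shape are all our assumptions, not an official region annotation:

```python
def group_tokens_into_regions(tokens, y_gap=12):
    """Greedy sketch: merge consecutive tokens that share a label and sit
    within `y_gap` pixels vertically into one region box. DocBank ships
    token-level labels only, so any region segmentation like this is a
    heuristic, not ground truth.
    tokens: list of (label, (x0, top, x1, bottom)) in file order."""
    regions = []
    for label, (x0, top, x1, bottom) in tokens:
        if regions and regions[-1][0] == label and top - regions[-1][1][3] <= y_gap:
            # Extend the previous region's bounding box to cover this token.
            _, (rx0, rtop, rx1, rbottom) = regions[-1]
            regions[-1] = (label, (min(rx0, x0), min(rtop, top),
                                   max(rx1, x1), max(rbottom, bottom)))
        else:
            regions.append((label, (x0, top, x1, bottom)))
    return regions

# Two paragraph tokens on one line, a title, then a new paragraph:
tokens = [("paragraph", (0, 0, 10, 10)), ("paragraph", (12, 0, 20, 10)),
          ("title", (0, 30, 20, 40)), ("paragraph", (0, 60, 20, 70))]
print(len(group_tokens_into_regions(tokens)))
```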
Hi,
Do you have a plan to release the Faster R-CNN weights and code?
Hello. In the paper's tables, LayoutLM from scratch performs quite well, better than BERT-initialized parameters. Does "from scratch" here mean the model undergoes no pretraining at all and is trained directly on DocBank?
I downloaded DocBank and found that the labels carry no BIE annotation. Did you annotate BIE during training?
The data does not record the image sizes, while in LayoutLM the images are resized. Is there any rescaling in this work? Is the image size also fixed at 754x1000?
Hello, thank you for your creative work.
This dataset is annotated at the token level; as I see it, one bounding box corresponds to one word, which is a very fine-grained annotation.
I am doing research on document reconstruction. Could you also provide block-level annotations, i.e. where one bounding box covers a body paragraph, a title, a subtitle, or the like?
Hi @wolfshow, @liminghao1630, @ranpox and @doc-analysis,
Thank you for setting up this repo. I am trying to download models from your model zoo: https://github.com/doc-analysis/DocBank/blob/master/MODEL_ZOO.md. However, onedrive does not allow download of zip files > 100MB (most models are greater than 100MB).
What do you suggest is the best way to download the model weights?
thank you!