DocBank: A Benchmark Dataset for Document Layout Analysis
License: Apache License 2.0
Using the data points given for the DocBank dataset on PDF files, we get very distorted annotations when overlaying the data points on images resized to the PDF height and width. We used the normalize-to-1000 procedure that you suggested in another issue. However, the following is an example of the distorted output:
I tried the BERT and RoBERTa models available in the model zoo. However, I got only LABEL_8 for all the tokens. I tried with both the simpletransformers and Hugging Face transformers libraries.
from simpletransformers import ner
import pytesseract
import json

with open("bert_large_500k_epoch_1/config.json") as f:
    config = json.load(f)

model_args = ner.NERArgs()
model_args.config = config
model_args.labels_list = ["LABEL_0", "LABEL_1", "LABEL_2", "LABEL_3",
                          "LABEL_4", "LABEL_5", "LABEL_6", "LABEL_7",
                          "LABEL_8", "LABEL_9", "LABEL_10", "LABEL_11",
                          "LABEL_12"]

model = ner.NERModel(
    'bert',
    'bert_large_500k_epoch_1',
    args=model_args,
    use_cuda=False,
)

predictions, raw_outputs = model.predict(
    [pytesseract.image_to_string('closing-disclosure-H25B-1.aec3e9325e5b.png')]
)
print(predictions)
from transformers import RobertaTokenizer, RobertaForTokenClassification, RobertaConfig
import torch
import pandas as pd
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained("roberta_large_500k_epoch_1")
model.eval()
df = pd.read_csv("10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt", sep='\t', header=None)
text = " ".join(df[0].to_list())
sample_input = tokenizer(text, return_tensors="pt", max_length=512)
output = model(**sample_input)
print(torch.argmax(output[0], dim=-1))
Is there anything I am missing? I also tried some txt files from the dataset, but I faced a max_seq_length issue and got only LABEL_8 and LABEL_10 with truncation.
What is the exact mapping of the LABEL_* names mentioned in the config.json of these models? Which is title, table, list, etc.?
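For what it's worth, if a checkpoint's config.json only carries generic LABEL_i names, the human-readable names have to come from the labels.txt used at training time. A minimal sketch of building such a mapping — the label order below is hypothetical, purely for illustration, not the authors' confirmed ordering:

```python
# Sketch: recover readable names for generic LABEL_i ids.
# ASSUMPTION: this label order is hypothetical, for illustration only;
# the authoritative order is whatever labels.txt was used at training time.
labels = ["abstract", "author", "caption", "date", "equation", "figure",
          "footer", "list", "paragraph", "reference", "section", "table",
          "title"]
id2label = {f"LABEL_{i}": name for i, name in enumerate(labels)}
print(id2label["LABEL_8"])
```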
Among the published models, the one for Detectron2 recently stopped working.
The server now responds with "409 Public access is not permitted on this storage account."
Could this be hosted elsewhere?
Also, in principle, would I be allowed to redistribute your models (e.g. as part of a GitHub package)?
Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I failed to reach performance similar to that reported in the paper. One of my hypotheses concerns the order of the words in the input I provide to BERT.
When looking at the data, it appears to me that, in some examples, the words are not in reading order but in left-to-right order. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt we jump from one column to the second one.
In my understanding, reading order is really important for being able to use/fine-tune BERT.
Moreover, your publication states that you used the dataset in reading order (the "We organize the DocBank dataset using the reading order" section).
Here are my questions:
Thank you again, and I hope that my questions make sense.
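To illustrate the problem the issue describes, here is a minimal two-column re-sorting heuristic. The page width, the midpoint column split, and the function name are all assumptions for illustration; real pages need a proper layout-analysis step:

```python
def two_column_reading_order(tokens, page_width=1000):
    """Re-sort tokens for a two-column page: assign each token to the left
    or right column by its horizontal midpoint, then read each column top
    to bottom. A heuristic sketch only, not the authors' method.
    tokens: list of (word, (x0, top, x1, bottom))."""
    mid = page_width / 2

    def key(token):
        x0, top, x1, bottom = token[1]
        column = 0 if (x0 + x1) / 2 < mid else 1
        return (column, top, x0)

    return sorted(tokens, key=key)

# Raw left-to-right order interleaves the columns; the heuristic reads the
# whole left column first, then the right column.
tokens = [("left1", (0, 0, 100, 10)), ("right1", (600, 0, 700, 10)),
          ("left2", (0, 20, 100, 30))]
print([w for w, _ in two_column_reading_order(tokens)])
```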
Hi, I'm trying to use run_seq_labeling.py from https://github.com/microsoft/unilm/tree/master/layoutlm on your data. However, the input format looks different. Theirs uses _box.txt files that contain all samples.
I also noticed that run_seq_labeling.py adds a label "O", but your labels.txt files already have 13 classes, like the pretrained models you provide, which makes me doubt that you used run_seq_labeling.py to train your model. Can you provide your training/evaluation code?
I am trying to use the pdf_process.py script to parse the content from a PDF file and create an annotated file with 10 fields.
Unfortunately, for both black-and-white and colored PDF files, I am not getting the RGB and label values.
Can you please suggest how I can generate a .txt file with 10 fields using this script?
Complex 80 66 198 91 AAEWKY+NimbusRomNo9L-Regu
Block 210 66 288 91 AAEWKY+NimbusRomNo9L-Regu
Floating-Point 299 66 487 91 AAEWKY+NimbusRomNo9L-Regu
MNRAS 70 94 123 104 UFZJOE+CMR8
000, 128 94 156 104 ECMFEV+CMBX8
1–16 161 94 190 104 UFZJOE+CMR8
(2017) 194 94 234 104 UFZJOE+CMR8
Thanks.
The PDF process script should ship its requirements as a separate file so it can run on its own. In pdfplumber v0.5.24 the reference to Container.figures was removed, so the script no longer works properly. pdfplumber 0.5.23 is the last version with which the script runs successfully.
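A defensive version check along these lines could fail fast before parsing. The cutoff follows the report above (Container.figures removed in 0.5.24); the helper name is ours:

```python
def supports_container_figures(version: str) -> bool:
    """True if this pdfplumber version predates the removal of
    Container.figures (removed in 0.5.24, per the report above)."""
    return tuple(int(p) for p in version.split(".")) <= (0, 5, 23)

print(supports_container_figures("0.5.23"))
print(supports_container_figures("0.5.24"))
```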
Do I need to retrain in order to run inference? Or can I use a pretrained model and run inference for document image classification?
Should I use LayoutLM to run inference with the pretrained model?
Why couldn't I find a script for inference, only one for retraining?
pdf_process.py throws an error:
Traceback (most recent call last):
  File "pdf_process.py", line 200, in <module>
    worker(pdf_file, args.data_dir, args.output_dir)
  File "pdf_process.py", line 109, in worker
    for figure in this_page.figures:
AttributeError: 'Page' object has no attribute 'figures'
Could you tell me what causes this?
Thanks for the great work!
My question is as titled.
I have been working with DocBank_samples for a month now. Today I downloaded the main dataset from OneDrive and I could not see any PDF files!
I wanted to ask if it would be possible to provide the PDF files too?
I appreciate the help!
Which class corresponds to which ID in the pretrained network (e.g. author = ID 8)?
I tried the three different data subsets, but to me it looked like none of the categories and their corresponding IDs matched the order of the annotations. The three subsets train, valid, and test have different ID/class-category orderings.
For example, ID 2 has the class section assigned in the training set, while in valid it's equation and in test it's reference.
Is there some additional information regarding this?
Just like PubLayNet.
But the most limiting aspect of PubLayNet is that it has too few categories,
which makes it hard to meet real-world needs.
Impressive work done.
I'm wondering when the dataset or baseline will be released?
Many thanks.
Hi:
Thank you for your datasets. Direct download in browser is unstable. Could you offer a method by which we can download the data from the command line, e.g. wget? @liminghao1630 @wolfshow @ranpox
How should the models be used? Is it that layoutlm's run_seq_labeling cannot be used with this dataset?
Hi, I used your released models with transformers and tried to run inference. However, the test results are not very good, so I wonder if my inference method is correct. During this process, I ran into a few problems:
For a document that contains more than 512 words, how do you split the data? I have two ideas:
For example, if a document contains 5 words, ABCDE, and we assume the window size equals 2.
Do you use one of the above methods or other methods?
Thank you!
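For reference, the two natural splitting strategies for the issue's toy example (5 tokens, window size 2) can be sketched as one function; the function name and the stride parameterization are ours, not the authors' confirmed method:

```python
def sliding_windows(tokens, size, stride):
    """Split a long token sequence into windows of at most `size` tokens.
    stride == size gives non-overlapping chunks; stride < size overlaps
    windows so boundary tokens get context from both sides."""
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return windows

# The issue's toy example: 5 tokens ABCDE, window size 2.
print(sliding_windows(list("ABCDE"), size=2, stride=2))  # disjoint chunks
print(sliding_windows(list("ABCDE"), size=2, stride=1))  # overlapping
```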
Hey, I can't download DocBank anymore.
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>
Public access is not permitted on this storage account. RequestId:b7c5e5e6-301e-0041-21a3-b0b0b6000000 Time:2023-07-07T07:21:41.1497126Z
</Message>
</Error>
Hi, @liminghao1630 @ranpox @doc-analysis, could you please provide a directly downloadable link on OneDrive, so we can download the data on a server?
Hi guys, I got a 403 response from the dataset URL with the following message:
<Error>
<Code>AuthenticationFailed</Code>
<Message>
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:63fa7064-c01e-0055-782d-92f8d9000000 Time:2024-04-19T07:46:53.2863477Z
</Message>
<AuthenticationErrorDetail>
Signature did not match. String to sign used was layoutlm r b o 2023-06-08T08:48:15Z 2033-06-08T16:48:15Z https 2022-11-02
</AuthenticationErrorDetail>
</Error>
I have found some broken images in "DocBank_500K_ori_img" which cause errors while training; considering the huge size of the dataset, a break in the middle of training is unbearable.
I suggest the authors re-check the image dataset, especially the following images:
9.tar_1501.04477.gz_RobustSwitching-ergodic_3_ori.jpg
97.tar_1705.02752.gz_kcenters_22_ori.jpg
66.tar_1504.08256.gz_Manipulation_Partial_Info_without_cref_5_ori.jpg
88.tar_1704.08423.gz_SingularSystem_1-2_17_ori.jpg
None of them could be loaded by either PIL or OpenCV.
Since the authors did not provide a hash checksum of the archive file "DocBank_500K_ori_img.zip" and it is really large, I'd rather not re-download it just to check the completeness of my copy.
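Until checksums are available, a cheap pre-flight scan can flag obviously truncated JPEGs without fully decoding them. This only checks the start/end markers and is not a substitute for a full decode (e.g. Pillow's Image.verify()):

```python
def looks_complete_jpeg(data: bytes) -> bool:
    """Cheap integrity check: a JPEG starts with an SOI marker (FF D8) and
    ends with an EOI marker (FF D9). Files truncated by an interrupted
    download usually fail the EOI check. Not a full decode."""
    return data[:2] == b"\xff\xd8" and data.rstrip(b"\x00")[-2:] == b"\xff\xd9"

# Synthetic example: an "intact" marker pair vs. a truncated copy.
intact = b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9"
truncated = intact[:-1]
print(looks_complete_jpeg(intact), looks_complete_jpeg(truncated))
```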
Excellent work!
A question for the authors: the dataset is quite large, and the SharePoint web download is too slow and keeps aborting. Is there another download method that supports resuming interrupted transfers, or a Baidu Netdisk link?
Hello,
I am trying to use the X101 architecture from the Model Zoo as a backbone for one of my experiments with the DocBank dataset.
I am using the COCO format provided for DocBank. However, I am getting really bad results at inference.
Am I doing something wrong?
Also, I would like to ask whether the pretrained weights were trained on the COCO-based annotations or on the original token-based annotations.
Any help would be appreciated.
While downloading and extracting the data, some files are reported as corrupted. Is there another data mirror, or a Baidu Cloud link?
Hi! I am using Hugging Face to load the pretrained weights of layoutlm_large_500k_epoch_1, but it shows me the error below:
Traceback (most recent call last):
  File "D:\SoulCode\PaddleDetection\DocBank\DocBank_infer.py", line 8, in <module>
    model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")
  File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2225, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
  File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2357, in _load_pretrained_model
    raise ValueError(
ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?
The code is as follows:
from transformers import AutoTokenizer, LayoutLMForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('D:\Download\layoutlm_large_500k_epoch_1')
model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")
How can I deal with it? Thanks!
Your pretrained DocBank model doesn't have image embeddings, does it?
Only the contextualized embeddings plus the bbox embeddings?
The ResNeXt-101 model has been added to the Model Zoo.
Can you provide the official mAP results of this detection model on DocBank? This detection baseline is important for comparisons between detection methods.
We evaluated the model from the Model Zoo with Detectron2 and fixed the category mapping. The AP is 74.867, right?
Hi, I'm trying to replicate your results, and I see that the config.json for the pretrained LayoutLM has "bert" as the model type.
When I load the pretrained model, it warns: You are using a model of type bert to instantiate a model of type layoutlm. This is not supported for all configurations of models and can yield errors.
I was able to replicate your results for Roberta-Large and Bert-Large, but not for LayoutLM. Could you advise? Thank you.
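One workaround sometimes used for checkpoints with a stale config is to correct the model_type field in the checkpoint's config.json before loading. A minimal sketch on a stand-in dict — the assumption here is that the weights really are LayoutLM weights and only this one field is wrong:

```python
import json

# Stand-in for the checkpoint's config.json (the real file lives in the
# downloaded checkpoint directory). ASSUMPTION: the weights are genuine
# LayoutLM weights and only the model_type field is stale.
config = json.loads('{"model_type": "bert", "num_labels": 13}')
config["model_type"] = "layoutlm"  # let transformers pick LayoutLMConfig
print(json.dumps(config, sort_keys=True))
```

After writing the edited dict back to config.json, from_pretrained should instantiate the LayoutLM config class rather than the BERT one.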
Do the pretrained models only support the 13 label classes? I tried modifying label2id in the config file and labels.txt to add two classes, but got an error that 15 and 13 don't match.
I tried to download the dataset from the dataset homepage.
But the following error occurred when I clicked each dataset link.
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:f11eb78c-a01e-000e-3bae-9dc1e2000000 Time:2023-06-13T04:19:31.5518598Z</Message>
</Error>
Is there any other way to download the dataset?
I found that the coordinates in the annotation file are incorrect.
For example, take 2.tar_1801.00617.gz_idempotents_arxiv_4.txt in DocBank_samples: the size of the image is 1654x2339, yet none of the bounding boxes even reach half of the page.
Am I missing anything? @liminghao1630 @wolfshow @ranpox @doc-analysis
This leads to very confusing results during training, and it is difficult to locate this hidden problem... The category order needs to be modified to keep the subsets consistent.
Though this is a small problem, it really hurts the training experience. I hope @wolfshow @liminghao1630 can fix this and replace the .zip file.
Thanks for your excellent work!
It looks like there are no markers in the data to distinguish regions. For example, suppose there are two paragraphs separated by a title: how can I tell these two paragraphs apart? Tokens in both paragraphs are all marked as "paragraph" without an id or any other marker.
Can you release some flags to distinguish regions?
In pdf_process.py, why are the four coordinates x0, top, x1, and bottom divided by 1000? Could another number be used? Is it related to the image size? Thanks!
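For context, here is a sketch of how a 0-1000 normalized box maps back to pixels, assuming the DocBank convention that coordinates were scaled by the page size into the integer range [0, 1000]:

```python
def denormalize_box(box, image_width, image_height):
    """Map a 0-1000 normalized (x0, top, x1, bottom) box back to pixels.
    The 1000 is a fixed scale chosen to make boxes resolution independent,
    not an image dimension; to overlay on a rendered image, rescale by
    that image's own width and height."""
    x0, top, x1, bottom = box
    return (x0 * image_width / 1000, top * image_height / 1000,
            x1 * image_width / 1000, bottom * image_height / 1000)

# e.g. overlaying on an image rendered at 1654 x 2339 pixels
print(denormalize_box((80, 66, 198, 91), 1654, 2339))
```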
I was wondering if annotations for an element (a region comprising one or more tokens), such as a paragraph (similar to the annotations in PubLayNet), are available, or if there exists a mechanism to compute them. I suppose running some sort of clustering algorithm could do the job decently, but I would like to know if these annotations already exist.
Thanks in advance for your help!
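A minimal greedy grouping heuristic in the spirit of the clustering idea above; the function, the gap threshold, and the input shape are all our assumptions, not an official region annotation:

```python
def group_tokens_into_regions(tokens, y_gap=12):
    """Greedy sketch: merge consecutive tokens that share a label and sit
    within `y_gap` pixels vertically into one region box. DocBank ships
    token-level labels only, so any region segmentation like this is a
    heuristic, not ground truth.
    tokens: list of (label, (x0, top, x1, bottom)) in file order."""
    regions = []
    for label, (x0, top, x1, bottom) in tokens:
        if regions and regions[-1][0] == label and top - regions[-1][1][3] <= y_gap:
            # Extend the previous region's bounding box to cover this token.
            _, (rx0, rtop, rx1, rbottom) = regions[-1]
            regions[-1] = (label, (min(rx0, x0), min(rtop, top),
                                   max(rx1, x1), max(rbottom, bottom)))
        else:
            regions.append((label, (x0, top, x1, bottom)))
    return regions

# Two paragraph tokens on one line, a title, then a new paragraph:
tokens = [("paragraph", (0, 0, 10, 10)), ("paragraph", (12, 0, 20, 10)),
          ("title", (0, 30, 20, 40)), ("paragraph", (0, 60, 20, 70))]
print(len(group_tokens_into_regions(tokens)))
```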
Hi,
Do you have a plan to release the Faster R-CNN weights and code?
Hello. In the paper's tables, LayoutLM from scratch performs quite well, better than BERT-initialized parameters. Does "from scratch" here mean the model undergoes no pretraining at all and is trained directly on DocBank?
I downloaded DocBank and found that the labels carry no BIE annotation. Did you annotate BIE during training?
The data does not record the image sizes, while in LayoutLM the images are resized. Is there any rescaling in this work? Is the image size also fixed at 754x1000?
Hello, thank you for your creative work.
This dataset is annotated at the token level; as I see it, one bounding box corresponds to one word, which is a very fine-grained annotation.
I am doing research on document reconstruction. Could you also provide block-level annotations, i.e. where one bounding box covers a body paragraph, a title, a subtitle, or the like?
Hi @wolfshow, @liminghao1630, @ranpox and @doc-analysis,
Thank you for setting up this repo. I am trying to download models from your model zoo: https://github.com/doc-analysis/DocBank/blob/master/MODEL_ZOO.md. However, onedrive does not allow download of zip files > 100MB (most models are greater than 100MB).
What do you suggest is the best way to download the model weights?
thank you!