
docbank's People

Contributors

liminghao1630, ranpox, shivamsnaik, wolfshow

docbank's Issues

Scaling Issue

Using the annotation coordinates provided with the DocBank dataset for PDF files, we get badly distorted annotations when we overlay them on page images resized to the PDF height and width. We applied the normalization-to-1000 procedure that you suggested in another issue. However, the following is an example of the distorted output:

[Image: docBank_problem]
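
For reference, a minimal sketch of mapping a box from a 0..1000 normalized space back to pixel coordinates. It assumes the x coordinates were normalized by page width and the y coordinates by page height, independently; the function name `scale_box` and the page size in the example are illustrative, not taken from the repository. If a single factor is applied to both axes, overlays come out stretched, so that is one thing worth checking.

```python
def scale_box(box, image_width, image_height, norm=1000):
    """Map (x0, y0, x1, y1) from the 0..norm space to pixel coordinates.

    Assumption: x was normalized by page width and y by page height.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 * image_width / norm),
        round(y0 * image_height / norm),
        round(x1 * image_width / norm),
        round(y1 * image_height / norm),
    )


# Hypothetical example: one normalized box drawn on a 1654x2339 page image.
print(scale_box((80, 66, 198, 91), 1654, 2339))  # (132, 154, 327, 213)
```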

label mapping and bert/roberta output

I tried the BERT and RoBERTa models available in the model zoo. However, I got only LABEL_8 for all tokens. I tried with both simpletransformers and the Hugging Face transformers library.

# Attempt 1: simpletransformers
from simpletransformers import ner
import pytesseract
import json
with open("bert_large_500k_epoch_1/config.json") as f:
    config = json.load(f)

model_args = ner.NERArgs()
model_args.config = config
model_args.labels_list = ["LABEL_0", "LABEL_1", "LABEL_2", "LABEL_3", "LABEL_4", "LABEL_5", "LABEL_6",
                          "LABEL_7", "LABEL_8", "LABEL_9", "LABEL_10", "LABEL_11", "LABEL_12"]
model = ner.NERModel(
    'bert',
    'bert_large_500k_epoch_1',
    args=model_args,
    use_cuda=False
)
predictions, raw_outputs = model.predict([pytesseract.image_to_string('closing-disclosure-H25B-1.aec3e9325e5b.png')])
print(predictions)

# Attempt 2: Hugging Face transformers
from transformers import RobertaTokenizer, RobertaForTokenClassification, RobertaConfig
import torch
import pandas as pd

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForTokenClassification.from_pretrained("roberta_large_500k_epoch_1")
model.eval()

df = pd.read_csv("10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt", sep='\t', header=None)
text = " ".join(df[0].to_list())
sample_input = tokenizer(text, return_tensors="pt", max_length=512)

output = model(**sample_input)

print(torch.argmax(output[0], dim=-1))

Is there anything I am missing? I also tried some .txt files from the dataset, but I ran into a max_seq_length issue and, with truncation, got only LABEL_8 and LABEL_10.
What is the exact mapping of the LABEL_* entries in these models' config.json? Which one is title, table, list, etc.?

Reading order and multi-column

Hi, thank you very much for your publication and this GitHub repository.
I tried to reproduce some of the paper's results by first training a BERT network on the DocBank dataset, but I fail to reach performance similar to that reported in the paper. One of my hypotheses concerns the order of the words in the input that I provide to BERT.

When looking at the data, it appears to me that, in some examples, the words are not in reading order but in left-to-right order across the page. For example, in the file 10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt, the sequence jumps from one column to the other.

[Screenshot 2021-04-28 at 15.27.20]

[Screenshot 2021-04-28 at 15.26.10]

In my understanding, reading order is really important for using or fine-tuning BERT.

Moreover, your publication indicates that you used the dataset in reading order (see the "We organize the DocBank dataset using the reading order" section).

Here are my questions:

  • Can you confirm that the words in the .txt files are not necessarily in reading order?
  • Do you provide the dataset in reading order?
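
As an illustration of the ordering problem, here is a rough heuristic sketch that re-sorts tokens column by column on a two-column page. It is not the authors' method: it assumes exactly two columns split at the page midline, and the function name `two_column_reading_order` and the token tuples are hypothetical. Full-width elements such as titles or wide figures would need separate handling.

```python
def two_column_reading_order(tokens, page_width):
    """Sort (text, x0, y0, x1, y1) tokens left column first, then right.

    Heuristic only: a token belongs to the left column when its horizontal
    center lies left of the page midline. Within a column, tokens are
    ordered top to bottom, then left to right.
    """
    midline = page_width / 2

    def sort_key(token):
        _, x0, y0, x1, _ = token
        column = 0 if (x0 + x1) / 2 < midline else 1
        return (column, y0, x0)

    return sorted(tokens, key=sort_key)


# Tokens listed row by row, so the sequence jumps between columns.
tokens = [
    ("a", 100, 100, 150, 110),
    ("c", 600, 100, 650, 110),
    ("b", 100, 120, 150, 130),
    ("d", 600, 120, 650, 130),
]
print([t[0] for t in two_column_reading_order(tokens, page_width=1000)])
# ['a', 'b', 'c', 'd']
```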

Thank you again, and I hope that my questions make sense.

How to do inference with LayoutLM?

Hi, I'm trying to use run_seq_labeling.py from https://github.com/microsoft/unilm/tree/master/layoutlm on your data. However, the input format looks different.

Theirs uses _box.txt files that contain all samples.

I also noticed that run_seq_labeling.py adds a label "O", but your labels.txt file already has 13 classes, like the pretrained models you provide, which makes me doubt that you used run_seq_labeling.py to train your model. Can you provide your training/evaluation code?

pdf_process.py script does not generate RGB Values and Label in .txt files

I am trying to use the pdf_process.py script to parse the content of a PDF file and create an annotated file with 10 fields.
Unfortunately, for both black-and-white and color PDF files, I am not getting the RGB and label values.

Can you please suggest how I can generate a .txt file with 10 fields using this script? A sample of the output I get:

Complex	80	66	198	91	AAEWKY+NimbusRomNo9L-Regu
Block	210	66	288	91	AAEWKY+NimbusRomNo9L-Regu
Floating-Point	299	66	487	91	AAEWKY+NimbusRomNo9L-Regu
MNRAS	70	94	123	104	UFZJOE+CMR8
000,	128	94	156	104	ECMFEV+CMBX8
1–16	161	94	190	104	UFZJOE+CMR8
(2017)	194	94	234	104	UFZJOE+CMR8
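
To make the difference concrete, the following sketch parses one tab-separated row and accepts both the 6-field form shown above and a 10-field form. The 10-field order (token, x0, y0, x1, y1, R, G, B, font, label) is an assumption about the dataset layout, and the names `Token` and `parse_row` are illustrative.

```python
from typing import NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    box: Tuple[int, int, int, int]
    rgb: Optional[Tuple[int, int, int]]  # None when the row has no color
    font: str
    label: Optional[str]                 # None when the row has no label


def parse_row(line: str) -> Token:
    """Parse a 6-field (no RGB, no label) or 10-field annotation row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 10:
        # Assumed order: token, x0, y0, x1, y1, R, G, B, font, label
        text, x0, y0, x1, y1, r, g, b, font, label = fields
        rgb = (int(r), int(g), int(b))
    elif len(fields) == 6:
        text, x0, y0, x1, y1, font = fields
        rgb, label = None, None
    else:
        raise ValueError(f"expected 6 or 10 fields, got {len(fields)}")
    return Token(text, (int(x0), int(y0), int(x1), int(y1)), rgb, font, label)


row = parse_row("MNRAS\t70\t94\t123\t104\tUFZJOE+CMR8")
print(row.box, row.rgb, row.label)  # (70, 94, 123, 104) None None
```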

Thanks.

PDF process script requirements

The PDF processing script should ship with its requirements listed in a separate file. In pdfplumber v0.5.24, the reference to Container.figures was removed, and the script no longer works properly. pdfplumber 0.5.23 is the last version with which the script runs successfully.
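
Until a requirements file exists, a small guard like the sketch below can fail early on an unsupported version. The helper names are hypothetical, and the comparison assumes purely numeric version strings such as "0.5.23".

```python
LAST_WORKING = (0, 5, 23)  # last pdfplumber version reported to work


def parse_version(version: str) -> tuple:
    """Turn '0.5.23' into (0, 5, 23). Assumes numeric components only."""
    return tuple(int(part) for part in version.split("."))


def is_supported(version: str) -> bool:
    """True when the given version is not newer than LAST_WORKING."""
    return parse_version(version) <= LAST_WORKING


print(is_supported("0.5.23"))  # True
print(is_supported("0.5.24"))  # False
```

The installed version can be obtained with `importlib.metadata.version("pdfplumber")` (Python 3.8+). Pinning `pdfplumber==0.5.23` in a requirements file would achieve the same result at install time.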

How can I infer?

Do I need to retrain in order to run inference, or can I use a pretrained model for inference on document image classification?
Should I use LayoutLM to load the pretrained model and run inference?
Why is there a script for retraining but none for inference?

AttributeError: 'Page' object has no attribute 'figures'

pdf_process.py raises an error:
Traceback (most recent call last): | 0/42 [00:00<?, ?it/s]
File "pdf_process.py", line 200, in <module>
worker(pdf_file, args.data_dir, args.output_dir)
File "pdf_process.py", line 109, in worker
for figure in this_page.figures:
AttributeError: 'Page' object has no attribute 'figures'
Could you tell me what causes this?

pdf files not included in the dataset

I have been working with DocBank_samples for a month now. Today I downloaded the main dataset from OneDrive, and I could not find any PDF files.
Would it be possible to provide the PDF files too?

I appreciate the help!

Pretrained ResNeXt-101 Class -> ID ?

Which class corresponds to which ID in the pretrained network (e.g. author = ID 8)?

I tried the three data subsets, but it looked to me as if none of their category-to-ID orderings matched the order of the annotations. The three subsets (train, valid, and test) have different ID/class category orderings.

For example, ID 2 is assigned the class section in the training set, while in valid it is equation and in test it is reference.

Is there some additional information regarding this?

How to use the model

How should the model be used? Is it the case that LayoutLM's run_seq_labeling cannot be used with this dataset?

Is this the correct inference method?

Hi, I use your released models with transformers to run inference. However, the test results are not very good, so I wonder whether my inference method is correct. During this process, I ran into a few problems:

  1. The annotated bboxes have to be converted into tokens from your vocabulary, but how should the tokens' labels be combined into a label for the bbox? For example, "hello" may be split into "he" and "llo" with labels "1" and "8"; how is the label of "hello" then defined? I recover the label from the first token, i.e. I use the label "1" of "he" as the label of "hello". Is that correct?
  2. For a document that contains more than 512 tokens, for example 782, I split it into two independent inputs of 512 and 270 tokens and concatenate the results. Is that correct?
  3. For "zero-area" tokens, such as [23, 405, 23, 407], do you calculate the area?
    Thanks a lot for your attention. I'm looking forward to your reply.
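
The first-sub-token rule from point 1 can be sketched as follows. This only illustrates the rule the poster describes; whether the authors evaluate this way is not stated. The input `word_ids` is assumed to have the shape returned by the `word_ids()` method of a Hugging Face fast tokenizer's output (one word index per sub-token, `None` for special tokens), and the function name is hypothetical.

```python
def word_labels_from_first_subtoken(word_ids, token_labels):
    """Return one label per word, taken from the word's first sub-token.

    word_ids[i] is the index of the word that sub-token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    first_label = {}
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id not in first_label:
            first_label[word_id] = label
    return [first_label[i] for i in sorted(first_label)]


# "hello" -> "he" (label 1) + "llo" (label 8); a second word has label 3.
word_ids = [None, 0, 0, 1, None]
token_labels = [0, 1, 8, 3, 0]
print(word_labels_from_first_subtoken(word_ids, token_labels))  # [1, 3]
```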

How to treat the document that contains more than 512 words?

For a document that contains more than 512 words, how do you split the data? I have two ideas.

For example, suppose a document contains 5 words, ABCDE, and the window size is 2.

  1. It can be split into three independent documents: 'AB', 'CD', and 'E'. However, because these three documents are independent, this may lower performance.
  2. It can be split into several documents via sliding windows. For example, with a window size of 3 words and padding of 1 word, the document is split into five documents: 'AB', 'ABC', 'BCD', 'CDE', and 'DE'. For 'BCD', B and D are padding and the target word is C.

Do you use one of the above methods or another method?
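
Both ideas can be written down in a few lines. This is a sketch of the two strategies exactly as described in the question, not a statement of what the authors used; the function names are illustrative.

```python
def split_chunks(words, size):
    """Idea 1: consecutive, non-overlapping chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_windows(words, padding):
    """Idea 2: one window per target word, with `padding` context words
    on each side. Returns (window, index_of_target_within_window) pairs."""
    windows = []
    for i in range(len(words)):
        start = max(0, i - padding)
        windows.append((words[start:i + padding + 1], i - start))
    return windows


words = list("ABCDE")
print(["".join(c) for c in split_chunks(words, 2)])
# ['AB', 'CD', 'E']
print(["".join(w) for w, _ in sliding_windows(words, 1)])
# ['AB', 'ABC', 'BCD', 'CDE', 'DE']
```

The second strategy runs the model once per word, so at a 512-token window it is far more expensive than the first. An overlapping stride between the two extremes is a common compromise.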

Thank you!

Public access is not permitted on this storage account

Hey, I can't download DocBank anymore.

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>
Public access is not permitted on this storage account. RequestId:b7c5e5e6-301e-0041-21a3-b0b0b6000000 Time:2023-07-07T07:21:41.1497126Z
</Message>
</Error>

Request 403 for Dataset resource

Hi guys, I got a 403 response from the dataset URL with the following message:

<Error>
    <Code>AuthenticationFailed</Code>
    <Message>
        Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. RequestId:63fa7064-c01e-0055-782d-92f8d9000000 Time:2024-04-19T07:46:53.2863477Z
    </Message>
    <AuthenticationErrorDetail>
        Signature did not match. String to sign used was layoutlm r b o 2023-06-08T08:48:15Z 2033-06-08T16:48:15Z https 2022-11-02
    </AuthenticationErrorDetail>
</Error>

Possible damaged image files in the dataset

I have found some broken images in "DocBank_500K_ori_img" that cause errors during training. Given the size of the dataset, having training break off midway is very costly.

I suggest that the authors re-check the image dataset, especially the following images:
9.tar_1501.04477.gz_RobustSwitching-ergodic_3_ori.jpg
97.tar_1705.02752.gz_kcenters_22_ori.jpg
66.tar_1504.08256.gz_Manipulation_Partial_Info_without_cref_5_ori.jpg
88.tar_1704.08423.gz_SingularSystem_1-2_17_ori.jpg

None of them can be loaded by either PIL or OpenCV.

Since the authors did not provide a hash checksum for the archive file "DocBank_500K_ori_img.zip" and the file is really large, I would rather not re-download it just to check whether my copy is complete.
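
One cheap way to find such files before training starts is to scan the image directory up front. The sketch below uses only the standard library and checks that each file starts with the JPEG start-of-image marker and ends with the end-of-image marker. This catches truncated files but is not a full decode: a file can pass and still be corrupt, and a valid JPEG with trailing bytes would be flagged. A stricter check would open each file with an image library (for example Pillow's `Image.verify()`). The function names are hypothetical.

```python
import os

JPEG_START = b"\xff\xd8"  # SOI marker
JPEG_END = b"\xff\xd9"    # EOI marker


def looks_like_complete_jpeg(path):
    """Cheap structural check: correct first and last two bytes."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == JPEG_START and tail == JPEG_END


def find_suspect_jpegs(directory):
    """Return the names of .jpg/.jpeg files that fail the check."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith((".jpg", ".jpeg")):
            if not looks_like_complete_jpeg(os.path.join(directory, name)):
                suspects.append(name)
    return suspects
```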
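
If a checksum were published, verifying a local copy would not require a second download. A minimal sketch of hashing a large archive in fixed-size chunks so it never has to fit in memory; SHA-256 is an arbitrary choice here, since no official checksum or algorithm is given.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two users can also compare digests of "DocBank_500K_ori_img.zip" with each other to tell a corrupted download from corruption in the published archive.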

Issue training with DOCBANK COCO format annotations

Hello,
I am trying to use the X101 architecture from the Model Zoo as a backbone for one of my experiments with the DocBank dataset.
I am using the COCO-format annotations provided for DocBank. However, I am getting really bad results at inference.

Am I doing something wrong? This is my setup:

  • Use pretrained weights for the backbone.
  • Freeze the top few layers and fine-tune the trainable layers along with the head layers with a low learning rate.

Also, I would like to ask whether the pretrained weights were trained on the COCO-based annotations or on the original token-based annotations.

Any help would be appreciated.

Errors when using huggingface to load pretrained_weights

Hi! I am using Hugging Face transformers to load the pretrained weights of layoutlm_large_500k_epoch_1, but it raises the following error:
Traceback (most recent call last):
File "D:\SoulCode\PaddleDetection\DocBank\DocBank_infer.py", line 8, in <module>
model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")
File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2225, in from_pretrained
model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
File "D:\Python39\lib\site-packages\transformers\modeling_utils.py", line 2357, in _load_pretrained_model
raise ValueError(
ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

Code:
from transformers import AutoTokenizer,LayoutLMForTokenClassification

import torch

tokenizer = AutoTokenizer.from_pretrained('D:\Download\layoutlm_large_500k_epoch_1')
model = LayoutLMForTokenClassification.from_pretrained("D:\Download\layoutlm_large_500k_epoch_1")

How can I deal with it? Thanks!

Detection mAP results?

The ResNeXt-101 model has been added to the Model Zoo.

Can you provide the official mAP results of this detection model on DocBank? This detection baseline is important for comparing detection methods.

We evaluated the Model Zoo model with Detectron2 after fixing the category mapping. The AP is 74.867, is that right?

Pretrained LayoutLM model results in different architecture

Hi, I'm trying to replicate your results, and I see that the config.json for the pretrained LayoutLM has "bert" as the model type.

When I load the pretrained model, I get the warning: You are using a model of type bert to instantiate a model of type layoutlm. This is not supported for all configurations of models and can yield errors.

I was able to replicate your results for RoBERTa-Large and BERT-Large, but not for LayoutLM. Could you advise? Thank you.

Error when downloading dataset

I tried to download the dataset from the dataset homepage, but the following error occurred when I clicked each dataset link.

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>Public access is not permitted on this storage account. RequestId:f11eb78c-a01e-000e-3bae-9dc1e2000000 Time:2023-06-13T04:19:31.5518598Z</Message>
</Error>

Is there any other way to download the dataset?

How to deal with "date" category?

Nice work!

I just found that there are actually 13 types in the DocBank dataset. How do you deal with the "date" category in the data preprocessing step? Do you remove it?

Incorrect annotations

I found that the coordinates in the annotation files are incorrect.
For example, take 2.tar_1801.00617.gz_idempotents_arxiv_4.txt in DocBank_samples. The image size is 1654x2339, yet none of the bounding boxes extends beyond half of the page.

The result looks like this:
[Image]

Am I missing anything? @liminghao1630 @wolfshow @ranpox @doc-analysis

How do you train with those NOT-TEXT elements?

Dear authors,
Some documents contain a massive number of non-text elements, such as hundreds of thousands of "##LTLine##" tokens. How do you actually deal with them?
For example, do you train on and predict all those elements with the text '##LTLine##'?
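
One option, shown here only as a sketch and not as the authors' procedure, is to drop such rows before building model inputs. Only the "##LTLine##" marker named above is filtered; rows are assumed to be lists of fields with the token text first, and the function name is hypothetical.

```python
NON_TEXT_MARKERS = {"##LTLine##"}  # extend if other markers should be skipped


def drop_non_text(rows):
    """Keep only rows whose first field (the token text) is real text."""
    return [row for row in rows if row[0] not in NON_TEXT_MARKERS]


rows = [
    ["MNRAS", "70", "94", "123", "104"],
    ["##LTLine##", "70", "110", "500", "110"],
    ["(2017)", "194", "94", "234", "104"],
]
print([row[0] for row in drop_non_text(rows)])  # ['MNRAS', '(2017)']
```

Filtering changes the label distribution, so results obtained this way may not be comparable with numbers computed over all tokens.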

Thank you!

Same type different area in one doc

Thanks for your excellent work!
It looks like the data contains no markers to distinguish regions. For example, when two paragraphs are separated by a title, how can I tell them apart? The tokens in both paragraphs are all labeled "paragraph", without an ID or any other marker.
Could you release some flags to distinguish regions?

Are component-level annotations available?

I was wondering whether annotations for an element (a region comprising one or more tokens), for example a paragraph, similar to the annotations in PubLayNet, are available, or whether there is a mechanism to compute them. I suppose running some sort of clustering algorithm could do the job decently, but I would like to know if these annotations already exist.
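
A simple baseline for computing such regions, sketched under the assumption that tokens arrive in reading order, is to merge each run of consecutive tokens that share a label into one region and take the union of their boxes. The function name and the sample tokens are hypothetical. Two regions of the same type with no differently labeled token between them would be merged, so this is weaker than real component-level annotation.

```python
def group_regions(tokens):
    """Merge runs of consecutive same-label tokens into regions.

    tokens: list of (text, (x0, y0, x1, y1), label), assumed to be in
    reading order. Returns a list of dicts with label, box, and words.
    """
    regions = []
    for text, (x0, y0, x1, y1), label in tokens:
        if regions and regions[-1]["label"] == label:
            region = regions[-1]
            bx0, by0, bx1, by1 = region["box"]
            # Grow the region box to cover the new token.
            region["box"] = (min(bx0, x0), min(by0, y0),
                             max(bx1, x1), max(by1, y1))
            region["words"].append(text)
        else:
            regions.append({"label": label, "box": (x0, y0, x1, y1),
                            "words": [text]})
    return regions


tokens = [
    ("First", (10, 10, 50, 20), "paragraph"),
    ("text", (55, 10, 90, 20), "paragraph"),
    ("Heading", (10, 40, 80, 55), "title"),
    ("Second", (10, 70, 60, 80), "paragraph"),
]
for region in group_regions(tokens):
    print(region["label"], region["box"], region["words"])
```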

Thanks in advance for your help!

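Does the "LayoutLM from scratch" model mean training directly on DocBank without any pre-training?

Hello, in the table in the paper, "LayoutLM from scratch" performs quite well, better than the model initialized with BERT parameters. Does this mean the model was trained directly on DocBank without any pre-training?
I downloaded DocBank and found that the labels carry no BIE tags. Did you use BIE tagging during training?
The data does not state the image size, and in LayoutLM the images are resized. Was any scaling applied in this work? Is the image size also fixed at 754, 1000?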

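Dataset annotations

Hello, and thank you for your creative work.
This dataset is annotated at the token level. As far as I can see, each bounding box region is a single word, which is very fine-grained.
I am doing research on document reconstruction. Could you provide block-level annotations, i.e. annotations where one bounding box is a paragraph of body text, a heading, a subheading, and so on?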
