
Trying to train a Text Region detector but failed about docextractor HOT 6 CLOSED

monniert commented on May 23, 2024

Trying to train a Text Region detector but failed

from docextractor.

Comments (6)

monniert commented on May 23, 2024

Hi, why do you say the results are bad? I see only one mistake (2 paragraphs merged) in the predicted segmentations. Maybe you are referring to the extracted regions, which are much larger than the predicted regions. For that, you need to play with extractor.ADDITIONAL_MARGIN_RATIO and set it to a value close to 0 when extracting paragraphs (as opposed to thin text lines).

To prevent merged paragraphs, you can additionally predict paragraph borders, as done with text lines.

seekingdeep commented on May 23, 2024

you can additionally predict paragraph border as done with text lines

What do you mean? Do you have an example image so I can understand?

There might be a solution for both of the issues I posted:

When converting the JSON into mask files, give each mask a different color from its neighbors. Two colors are enough, and this helps when multiple masks are merged or intersect one another.
Try to segment the actual text instead of a border or box, similar to diva-hisdb, which is very precise and accurate.
What do you think?
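
The two-color idea from the first point could be sketched roughly as follows. This is only an illustration under assumptions: the annotations are taken to be plain `(x0, y0, x1, y1)` rectangles already loaded from the JSON, and the `COLORS` values and the helper name `boxes_to_two_color_mask` are hypothetical, not part of docExtractor.

```python
import numpy as np

# Two hypothetical label colors (RGB); the actual values are an assumption.
COLORS = [(255, 0, 0), (0, 0, 255)]


def boxes_to_two_color_mask(boxes, height, width):
    """Paint each (x0, y0, x1, y1) box, alternating between two colors."""
    mask = np.zeros((height, width, 3), dtype=np.uint8)
    # Sort top-to-bottom, then left-to-right, so that consecutive
    # regions in a single column receive different colors.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    for i, (x0, y0, x1, y1) in enumerate(ordered):
        # Later boxes overwrite earlier ones where they overlap.
        mask[y0:y1, x0:x1] = COLORS[i % 2]
    return mask


# Three stacked regions; the first two overlap by two rows.
boxes = [(10, 10, 90, 30), (10, 28, 90, 50), (10, 55, 90, 75)]
mask = boxes_to_two_color_mask(boxes, 100, 100)
```

With this ordering, touching or overlapping neighbors in a column stay distinguishable in the mask, although a multi-column layout would need a smarter ordering.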

monniert commented on May 23, 2024

Sure, here is an example taken directly from the SynDoc dataset.

Yes, diva-hisdb annotations may be a solution, but (i) this kind of annotation is very time-consuming (you can do only a few pages per day) and (ii) I think they are a bit ambiguous (especially between words), so they will be difficult to learn from and generalize.

seekingdeep commented on May 23, 2024

The polygon-based annotations can be generated from existing rectangle-based boxes, or even be synthesized. It's easy.
In my case, I have existing rectangle-based annotations, on which I can run an algorithm to detect the points of the text itself, similar to an ".svg" file, and then connect the letters and words using the points closest to each other.
For synthesizing, this might be even easier, since you can create the lines in ".svg" or ".png" format from the start.
These types of labels can be easily generated and synthesized.
When using polygon-based annotations, the lines can be accurately separated even without a sophisticated text-detection method or seam carving, since they are already accurately segmented and connected.

If you decide to stick with the (x-height + border) labeling method, then you might want to use 2 colors for the borders of close regions, and even then you might still have some difficulties, especially when the border of the first region is too close to, or even intersecting, the second region. The borders work well for regions that have clear space between them, but even printed text can be irregular sometimes and behave like handwritten text, by being too small, too close, or even intersecting.
For my paragraph dataset, you stated that I should also predict the paragraph borders.
How can I do that? Some paragraphs are very close to each other.

monniert commented on May 23, 2024

Yes, if you can get such annotations for free, it may be worth trying. Please keep me updated with the results you get, but I suspect you will still have the problem of overlapping lines.
I am not sure about this solution: having 2 colors for the same semantic region (here, borders around text) leads to ambiguity, which often makes learning harder for the network. Say you rotate your document by 180 degrees: why should it start with one specific color rather than the other? But again, I am curious, so please keep me updated about the results you end up with. If I were you, I would first try 2 alternating colors on the text line regions (without any border; there is no need for one if this works) rather than 2 colors for the borders (as you first suggested).
Maybe try modeling the borders inside the paragraph regions: for each region, erode it by a couple of pixels (5?), then fill the difference between the full region and the eroded one with the border color (see this tutorial for info about morphological operations).
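
A minimal sketch of the erosion idea, assuming each paragraph region is available as a boolean NumPy mask. The erosion is written by hand with a square structuring element to keep the example dependency-free; `cv2.erode` or `scipy.ndimage.binary_erosion` would do the same job. The helper names and the two label colors are hypothetical.

```python
import numpy as np


def erode(mask, radius):
    """Binary erosion with a (2*radius+1) x (2*radius+1) square element."""
    h, w = mask.shape
    # Pad with False so regions touching the image edge are eroded there too.
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.ones_like(mask, dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            # A pixel survives only if every neighbor in the window is set.
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def inner_border(region, width=5):
    """Pixels of `region` that disappear when it is eroded by `width`."""
    region = region.astype(bool)
    return region & ~erode(region, width)


# One rectangular paragraph region on a 40 x 60 page.
region = np.zeros((40, 60), dtype=bool)
region[5:35, 10:50] = True

border = inner_border(region, width=5)
interior = region & ~border

# Hypothetical label colors: paragraph interior and paragraph border.
label = np.zeros(region.shape + (3,), dtype=np.uint8)
label[interior] = (0, 0, 255)
label[border] = (255, 0, 0)
```

Because the border is carved out of the region itself, two paragraphs that touch still end up separated by a band of border pixels in the label image.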

monniert commented on May 23, 2024

@seekingdeep closing the issue, please reopen if necessary
