GithubHelp home page GithubHelp logo

Comments (6)

monniert avatar monniert commented on May 23, 2024

Hi, why do you say results are bad? I see only one mistake (2 paragraphs merged) occurring in predicted segmentations. Maybe you are referring to extracted regions which are much larger than predicted regions, but for that you need to play with extractor.ADDITIONAL_MARGIN_RATIO and set it to a value close to 0 in such case of paragraph extraction (and not thin text lines)

To prevent merged paragraphs, you can additionally predict paragraph border as done with text lines

from docextractor.

seekingdeep avatar seekingdeep commented on May 23, 2024

you can additionally predict paragraph border as done with text lines

What do you mean? do you have an example image so i understand.

There might be a solution for both the issues that i posted:

  • When converting the json into mask files, make each mask different color than the other, you can use 2 colors, this will be helpful when having multiple masks that are merged or intersecting one-another.
  • Try to segment the actual text instead of boarder or box. Similar to diva-hisdb, which is very precise and accurate.
  • What do you think?

e-codices_csg-0018_149_max-GT-lineSeperation

example

from docextractor.

monniert avatar monniert commented on May 23, 2024

Sure, here is an example directly taken from SynDoc dataset

10004
10004_seg

Yes diva-hisdb annotations may be a solution but (i) this kind of annotations is very time consuming (you can do only a few pages per day) and (ii) I think they are a bit ambiguous (especially between words) and thus it will be difficult to learn and generalize

from docextractor.

seekingdeep avatar seekingdeep commented on May 23, 2024
  1. The polygon based annotations can be generated from existing rectangle based boxes, or even be synthesized. It's easy.
    In my case, i have existing rectangle based annotations, which then can run an algorithm to detect the points of the text itself, similar to an ".svg" file, and then create connections between the letters, words using the closest points to each other.
    For synthesizing, this might be even easier since you can create the lines in an ".svg" format or a ".png" from start.
    These type of labels can be easily generated and synthesized.
    When using a polygon based annotations, the lines can be accurately separated even without a sophisticated text-detection method nor seem-carving, since they already accurately segmented and connected.

image

  1. If you decide to stick with the (x-height+border) labeling method, then you might want to use 2 colors for boarders pf close regions, and even then you might still have some difficulties especially when the boarder of the 1rst region is too close, or even intersecting the 2nd region. The boarders work well for regions that have clear space between them, but even Printed text can seem irregular sometimes and act like handwriting text, by being too small, too close, or even intersecting each other.
    image

  2. for may paragraph dataset, you stated that i should also predict the paragraph boarders.
    how can i do that? some paragraphs are very close to each other.

from docextractor.

monniert avatar monniert commented on May 23, 2024
  1. yes if you can get such annotations for free, it may be worth trying, please keep me updated with the results you get but I suspect you will still have the problem of overlapping lines

  2. I am not sure about this solution, having 2 colors for a same semantic region (here borders around text) leads to ambiguity which often makes the learning of the network harder: say you rotate your document by 180degree why should it start with a specific color rather than the other? But again, I am curious, please keep me updated about the results you may end up with, if I were you I would first try 2 alternative colors on the textline regions (without any border, there is no need if this works) rather than 2 colors for the borders (as you first suggested)

  3. maybe try modeling the borders inside the paragraph regions: for each region, erode it for a couple of pixels (5?) then use the difference between the full region and eroded one to fill with the border color (see this tutorial for info about morphological operations)

from docextractor.

monniert avatar monniert commented on May 23, 2024

@seekingdeep closing the issue, please reopen if necessary

from docextractor.

Related Issues (19)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.