Comments (6)
Hi, why do you say results are bad? I see only one mistake (2 paragraphs merged) occurring in predicted segmentations. Maybe you are referring to extracted regions which are much larger than predicted regions, but for that you need to play with extractor.ADDITIONAL_MARGIN_RATIO
and set it to a value close to 0 in such case of paragraph extraction (and not thin text lines)
To prevent merged paragraphs, you can additionally predict paragraph border as done with text lines
from docextractor.
you can additionally predict paragraph border as done with text lines
What do you mean? do you have an example image so i understand.
There might be a solution for both the issues that i posted:
- When converting the json into mask files, make each mask different color than the other, you can use 2 colors, this will be helpful when having multiple masks that are merged or intersecting one-another.
- Try to segment the actual text instead of boarder or box. Similar to diva-hisdb, which is very precise and accurate.
- What do you think?
from docextractor.
Sure, here is an example directly taken from SynDoc dataset
Yes diva-hisdb annotations may be a solution but (i) this kind of annotations is very time consuming (you can do only a few pages per day) and (ii) I think they are a bit ambiguous (especially between words) and thus it will be difficult to learn and generalize
from docextractor.
- The polygon based annotations can be generated from existing rectangle based boxes, or even be synthesized. It's easy.
In my case, i have existing rectangle based annotations, which then can run an algorithm to detect the points of the text itself, similar to an ".svg" file, and then create connections between the letters, words using the closest points to each other.
For synthesizing, this might be even easier since you can create the lines in an ".svg" format or a ".png" from start.
These type of labels can be easily generated and synthesized.
When using a polygon based annotations, the lines can be accurately separated even without a sophisticated text-detection method nor seem-carving, since they already accurately segmented and connected.
-
If you decide to stick with the (x-height+border) labeling method, then you might want to use 2 colors for boarders pf close regions, and even then you might still have some difficulties especially when the boarder of the 1rst region is too close, or even intersecting the 2nd region. The boarders work well for regions that have clear space between them, but even Printed text can seem irregular sometimes and act like handwriting text, by being too small, too close, or even intersecting each other.
-
for may paragraph dataset, you stated that i should also predict the paragraph boarders.
how can i do that? some paragraphs are very close to each other.
from docextractor.
-
yes if you can get such annotations for free, it may be worth trying, please keep me updated with the results you get but I suspect you will still have the problem of overlapping lines
-
I am not sure about this solution, having 2 colors for a same semantic region (here borders around text) leads to ambiguity which often makes the learning of the network harder: say you rotate your document by 180degree why should it start with a specific color rather than the other? But again, I am curious, please keep me updated about the results you may end up with, if I were you I would first try 2 alternative colors on the textline regions (without any border, there is no need if this works) rather than 2 colors for the borders (as you first suggested)
-
maybe try modeling the borders inside the paragraph regions: for each region, erode it for a couple of pixels (5?) then use the difference between the full region and eroded one to fill with the border color (see this tutorial for info about morphological operations)
from docextractor.
@seekingdeep closing the issue, please reopen if necessary
from docextractor.
Related Issues (19)
- can't get wikiart.zip to download HOT 3
- Training a Text-Line detector and want to create annotations with x-height+ border automatically HOT 3
- The process of GT generation HOT 2
- via_converter.py generate with boarders HOT 3
- error HOT 6
- Problem with PolynomialLR HOT 5
- Post-processing step HOT 2
- bug -- tester.py HOT 2
- Demo website down HOT 7
- where is the UI? HOT 1
- quality/resolution of image results HOT 2
- [bug] translation.exception.TranslateError: No translation get, you may retry HOT 4
- [bug] KeyError: 'filename' HOT 2
- [suggestion] save detected regions as vgg-json HOT 1
- [suggestion] directly input vgg.json for training from scratch or finetuning HOT 1
- [donate] include FUNDING.yml to accept donations HOT 1
- [suggestion] loading the data on the fly HOT 3
- from line level to word level? HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from docextractor.