tesseract seems to be able to produce PDFs these days with text overlaid on the image.

it'd be nice if this could produce text-overlaid PDFs about doc2text HOT 6 OPEN

jlsutherland commented on July 21, 2024

it'd be nice if this could produce text-overlaid PDFs

from doc2text.

Comments (6)

jlsutherland commented on July 21, 2024

Definitely. I think it would be relatively straightforward to integrate. I'd suggest building the text insertion into the Page class and then putting an export_to_pdf() method on the Document class.
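
A minimal sketch of what that method could look like. The `Page` and `Document` classes here are hypothetical stand-ins, not doc2text's actual classes, and the sketch assumes tesseract is on the PATH and accepts a text file listing one image path per line as its input. Instead of inserting text per page, it leans on tesseract's own `pdf` output format; the `run` parameter is only there so the method can be exercised without tesseract installed.

```python
import subprocess
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a processed page: just the image path."""
    image_path: str


@dataclass
class Document:
    """Hypothetical stand-in for a document made of processed pages."""
    pages: List[Page] = field(default_factory=list)

    def export_to_pdf(self, out_base: str, run: Callable = subprocess.run) -> str:
        """Ask tesseract for a PDF with the text overlaid on the page images."""
        # Assumption: tesseract reads a list file with one image path per line.
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as listing:
            listing.write("\n".join(p.image_path for p in self.pages) + "\n")
        try:
            # Same shape as `tesseract <input> <outbase> pdf` on the command line.
            run(["tesseract", listing.name, out_base, "pdf"], check=True)
        finally:
            Path(listing.name).unlink()  # clean up the temporary list file
        return out_base + ".pdf"
```

Usage would be along the lines of `Document([Page("page1.png"), Page("page2.png")]).export_to_pdf("out")`, which returns the path of the PDF tesseract is expected to write.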

jlsutherland commented on July 21, 2024

Would you be interested in contributing @jbothma ?

jbothma commented on July 21, 2024

Yup - would love to. Won't get to it before next week but will start a PR when I can :)

PDF output is part of tesseract's OCR command as an optional output format, so I'm not sure what the right place would be to integrate it with doc2text.

jlsutherland commented on July 21, 2024

Awesome, thank you!

The method's location in the code would depend on the way tesseract embeds that data. Does tesseract insert the data into a PDF, or is it held in a separate state that contains the text and placement information?

In the first case, we would first need the method you mentioned that produces a nicely optimized PDF from the images, and then do the embedding. We need this method regardless, I think. In the second case, we could run the tesseract embed method at any time after we produce the fixed image crop.

Thoughts?

jbothma commented on July 21, 2024

So this is basically what I was talking about.

doc2text's existing functionality to straighten, flatten and normalise would run first,
produce a multipage TIFF or whatever,
then hand that to tesseract to OCR with the pdf config file (for PDF output).

wget http://mfma.treasury.gov.za/MFMA/Urban%20Development%20Zones/Gazette%20No.%2026866.pdf
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif  Gazette\ No.\ 26866.pdf
tesseract test.tif outbase pdf

produces https://www.scribd.com/document/324084564/Out-Base
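
The same two steps could be driven from Python roughly like this. It is a sketch only: the function names are made up for illustration, it assumes `gs` and `tesseract` are on the PATH, and it skips doc2text's straighten/flatten step entirely. The flags are copied from the commands above.

```python
import subprocess
from typing import List


def ghostscript_cmd(pdf_path: str, tif_path: str, dpi: int = 500) -> List[str]:
    """Build the gs call that rasterises a PDF to a multipage G4 TIFF."""
    return ["gs", "-dNOPAUSE", "-q", f"-r{dpi}", "-sDEVICE=tiffg4",
            "-dBATCH", f"-sOutputFile={tif_path}", pdf_path]


def tesseract_pdf_cmd(image_path: str, out_base: str) -> List[str]:
    """Build the tesseract call; the trailing `pdf` selects PDF output."""
    return ["tesseract", image_path, out_base, "pdf"]


def pdf_to_searchable_pdf(pdf_path: str, out_base: str,
                          tif_path: str = "test.tif") -> str:
    """Run both steps and return the path tesseract is expected to write."""
    subprocess.run(ghostscript_cmd(pdf_path, tif_path), check=True)
    subprocess.run(tesseract_pdf_cmd(tif_path, out_base), check=True)
    return f"{out_base}.pdf"
```

Passing the arguments as a list avoids the shell escaping needed for the spaces in `Gazette No. 26866.pdf`.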

jbothma commented on July 21, 2024

Tesseract produces the PDF already, so you'd select that as the output format of the OCR step. There's no intermediate hOCR or anything.
