
Comments (6)

jlsutherland commented on July 21, 2024

Definitely. I think it would be relatively straightforward to integrate. I would suggest building the text insertion into the Page class and then putting an export_to_pdf() method on the Document class.
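
A minimal sketch of how that split might look, assuming the per-page step simply shells out to `tesseract <image> <outputbase> pdf`. The `Page` and `Document` classes below are hypothetical stand-ins, not doc2text's actual classes, and `to_searchable_pdf` and the `runner` parameter are names invented for illustration.

```python
import subprocess
from pathlib import Path


class Page:
    """Hypothetical stand-in for a doc2text page: one preprocessed image on disk."""

    def __init__(self, image_path):
        self.image_path = Path(image_path)

    def to_searchable_pdf(self, output_base, runner=subprocess.run):
        # `tesseract <image> <outputbase> pdf` writes <outputbase>.pdf with a text layer.
        command = ["tesseract", str(self.image_path), str(output_base), "pdf"]
        runner(command, check=True)
        return Path(f"{output_base}.pdf")


class Document:
    """Hypothetical stand-in for a doc2text document: an ordered list of pages."""

    def __init__(self, pages):
        self.pages = list(pages)

    def export_to_pdf(self, output_dir, runner=subprocess.run):
        # One PDF per page; merging them into a single file is left out of this sketch.
        output_dir = Path(output_dir)
        return [
            page.to_searchable_pdf(output_dir / f"page-{index:03d}", runner=runner)
            for index, page in enumerate(self.pages, start=1)
        ]
```

The `runner` argument is only there so the command can be swapped for a fake in tests; by default it calls `subprocess.run`, which requires the `tesseract` binary to be on the PATH.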

from doc2text.

jlsutherland commented on July 21, 2024

Would you be interested in contributing, @jbothma?


jbothma commented on July 21, 2024

Yup - would love to. Won't get to it before next week but will start a PR when I can :)

In Tesseract, PDF is an optional output format of the OCR command itself, so I'm not sure where the right place to integrate it with doc2text would be.


jlsutherland commented on July 21, 2024

Awesome, thank you!

The method's location in the code would depend on how Tesseract embeds that data. Does Tesseract insert the data into a PDF, or is it kept in a separate state that contains the text and placement information?

In the first case, we would first need the method you mentioned that produces a nicely optimized PDF from the images, and then the embedding step. I think we need this method regardless. In the second case, we could run the Tesseract embed method at any time after we produce the fixed image crop.

Thoughts?


jbothma commented on July 21, 2024

So this is basically what I was talking about.

  • doc2text's existing functionality to straighten, flatten and normalise would run first,
  • produce a multipage TIFF or whatever,
  • then give it to tesseract to OCR with the pdf config file (for PDF output).

wget http://mfma.treasury.gov.za/MFMA/Urban%20Development%20Zones/Gazette%20No.%2026866.pdf
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif Gazette\ No.\ 26866.pdf
tesseract test.tif outbase pdf

produces https://www.scribd.com/document/324084564/Out-Base
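
For anyone wanting to drive the same two steps from Python, here is a rough sketch that wraps the Ghostscript and Tesseract commands above with `subprocess`. The function names and default arguments are made up for illustration and are not part of doc2text; the flags are copied from the commands above, and both `gs` and `tesseract` are assumed to be installed and on the PATH.

```python
import subprocess


def build_commands(pdf_path, tif_path="test.tif", output_base="outbase", dpi=500):
    """Return the Ghostscript and Tesseract commands as argument lists."""
    # Step 1: rasterise the PDF to a multipage Group 4 TIFF at the given resolution.
    rasterise = [
        "gs", "-dNOPAUSE", "-q", f"-r{dpi}", "-sDEVICE=tiffg4", "-dBATCH",
        f"-sOutputFile={tif_path}", pdf_path,
    ]
    # Step 2: OCR the TIFF; the trailing "pdf" config makes Tesseract write <output_base>.pdf.
    ocr = ["tesseract", tif_path, output_base, "pdf"]
    return [rasterise, ocr]


def run_pipeline(pdf_path, **kwargs):
    # Argument lists bypass the shell, so spaces in the file name need no escaping.
    for command in build_commands(pdf_path, **kwargs):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("Gazette No. 26866.pdf")` would then be the equivalent of the `gs` and `tesseract` lines above, with doc2text's own preprocessing still to be slotted in before the OCR step.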


jbothma commented on July 21, 2024

Tesseract produces the PDF already, so you'd select that as the output format of the OCR step. There's no intermediate hOCR or anything.

