Comments (6)
Definitely. I think it would be relatively straightforward to integrate. Would suggest building the text insertion into the Page class and then putting an export_to_pdf() method on the Document class.
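A rough sketch of what that split could look like, to make the suggestion concrete. The `Page` and `Document` classes below are hypothetical stand-ins, not doc2text's actual classes, and the `runner` parameter and `_pages.txt` list-file name are assumptions made for illustration. It relies on tesseract accepting a text file that lists one image path per line as its input, and on the `pdf` config producing `<outputbase>.pdf`.

```python
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a processed page (not doc2text's real Page)."""
    image_path: Path  # cropped/deskewed image that has already been written to disk


@dataclass
class Document:
    """Hypothetical stand-in for a document that owns its processed pages."""
    pages: List[Page] = field(default_factory=list)

    @staticmethod
    def tesseract_pdf_command(list_file: Path, out_base: Path) -> List[str]:
        # `tesseract <input> <outputbase> pdf` writes <outputbase>.pdf
        return ["tesseract", str(list_file), str(out_base), "pdf"]

    def export_to_pdf(self, out_base: Path, runner: Callable = subprocess.run) -> Path:
        out_base = Path(out_base)
        # Write the page images to a list file, one path per line, so that
        # a single tesseract call can emit one multipage searchable PDF.
        list_file = Path(str(out_base) + "_pages.txt")
        list_file.write_text("".join(f"{p.image_path}\n" for p in self.pages))
        runner(self.tesseract_pdf_command(list_file, out_base), check=True)
        return Path(str(out_base) + ".pdf")
```

Passing the runner in keeps the sketch testable without tesseract installed; by default it shells out through `subprocess.run`.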
from doc2text.
Would you be interested in contributing @jbothma ?
Yup - would love to. Won't get to it before next week but will start a PR when I can :)
It's part of the ocr command as an optional output format so not sure what the right place would be to integrate it with doc2text.
Awesome, thank you!
The method's location in the code would depend on how tesseract embeds that data. Does tesseract insert the data into a PDF, or is it kept in a separate state that contains the text and placement information?
In the first case, we would need the method you mentioned that produces a nicely optimized pdf from the images first, then the embedding second. We need this method regardless, I think. In the second case, we could run the tesseract embed method at any time after we produce the fixed image crop.
Thoughts?
So this is basically what I was talking about.
- doc2text's existing functionality to straighten and flatten and normalise would run first,
- produce a multipage TIFF or whatever,
- then give it to tesseract to OCR with the pdf config file (for PDF output).
wget http://mfma.treasury.gov.za/MFMA/Urban%20Development%20Zones/Gazette%20No.%2026866.pdf
gs -dNOPAUSE -q -r500 -sDEVICE=tiffg4 -dBATCH -sOutputFile=test.tif Gazette\ No.\ 26866.pdf
tesseract test.tif outbase pdf
produces https://www.scribd.com/document/324084564/Out-Base
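For anyone who wants to drive the same two steps from Python, here is a minimal sketch that mirrors the `gs` and `tesseract` commands above. The function names and the default `test.tif` intermediate path are made up for illustration, and it assumes both binaries are on the PATH. In the proposed flow, doc2text's own straightening step would replace or follow the Ghostscript rasterisation.

```python
import subprocess
from typing import List


def ghostscript_tiff_command(pdf_path: str, tiff_path: str, dpi: int = 500) -> List[str]:
    """Build the gs call that rasterises a PDF into a multipage G4 TIFF."""
    return [
        "gs", "-dNOPAUSE", "-q", f"-r{dpi}", "-sDEVICE=tiffg4", "-dBATCH",
        f"-sOutputFile={tiff_path}", pdf_path,
    ]


def tesseract_pdf_command(image_path: str, out_base: str) -> List[str]:
    """Build the tesseract call that OCRs an image and writes <out_base>.pdf."""
    return ["tesseract", image_path, out_base, "pdf"]


def run_pipeline(pdf_path: str, out_base: str, tiff_path: str = "test.tif") -> str:
    """Run both steps in order; raises CalledProcessError if either fails."""
    subprocess.run(ghostscript_tiff_command(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_pdf_command(tiff_path, out_base), check=True)
    return out_base + ".pdf"
```

Building the commands as argument lists avoids the shell escaping needed for the spaces in the gazette's file name.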
Tesseract produces the PDF already, so you'd select that as the output format of the OCR step. There's no intermediate hOCR or anything.