Comments (10)
Comment by fritz-hh
Sat Sep 20 17:20:53 2014
Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software while keeping their orientation as it appears in the PDF file. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated with the same resolution as the original image, but it is not the original image.
If anybody has an idea how to proceed to solve this limitation, please let me know!
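For reference, the resolution a rendered image needs in order to match the embedded scan can be estimated from the scan's pixel width and the page width in PDF points (1 point = 1/72 inch). This is a minimal sketch, not code from OCRmyPDF; the function name `estimate_dpi` and the sample numbers are illustrative assumptions, and it only holds when a single image fills the page.

```python
POINTS_PER_INCH = 72.0  # PDF default user space: 1 point = 1/72 inch


def estimate_dpi(pixel_width, page_width_pt):
    """Estimate the resolution of an image that spans the full page width."""
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / (page_width_pt / POINTS_PER_INCH)


# Hypothetical example: a 2550-pixel-wide scan on a US Letter page (612 pt wide).
dpi = estimate_dpi(2550, 612)
print(round(dpi))
```

The estimate could then be handed to pdftoppm through its `-r` option.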
from ocrmypdf.
Comment by jbarlow83
Sat Sep 20 20:54:29 2014
The PyPDF2 library can read internal PDF structure and get the page orientation.
$ ipython3
import PyPDF2 as pypdf
pdf = pypdf.PdfFileReader('example.pdf')
pdf.pages[0]['/Rotate']
That field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to pdftoppm and pdfimages since everything could be done in a single process. That would be a first step and should work as long as each page contains one image that fills the page – probably good enough for most scanned PDFs with no OCR.
For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the hidden OCR layer into the PostScript stream. Needless to say, this would be much harder.
Comment by fritz-hh
Sat Sep 20 21:09:05 2014
Even if the page contains only 1 image, I am not sure knowing the page orientation would be enough. Indeed there would still be 2 possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?
Comment by jbarlow83
Sat Sep 20 21:39:42 2014
It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image.
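For the simple case, combining /MediaBox and /Rotate might look like the sketch below. It is an illustration under stated assumptions rather than OCRmyPDF code: the helper name `displayed_page_size` is hypothetical, and the values are assumed to have already been read from the page dictionary (for example with PyPDF2 as shown earlier). Both entries can be inherited from a parent node in the page tree, so a direct lookup on the page may miss them.

```python
def displayed_page_size(mediabox, rotate=0):
    """Return (rotation, width, height) of a page as it is displayed.

    mediabox: [x0, y0, x1, y1] in PDF points.
    rotate:   the /Rotate value, clockwise degrees, a multiple of 90.
    """
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")
    rotate %= 360  # normalise values such as -90 or 450
    x0, y0, x1, y1 = (float(v) for v in mediabox)
    width, height = abs(x1 - x0), abs(y1 - y0)
    if rotate in (90, 270):
        # A quarter turn swaps the displayed width and height.
        width, height = height, width
    return rotate, width, height


# Hypothetical example: a US Letter page stored upright, displayed rotated by 90 degrees.
print(displayed_page_size([0, 0, 612, 792], 90))
```

When a single scanned image fills the page, the normalised rotation is the angle to apply to the extracted image so that it matches what a viewer shows.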
If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although pdftoppm would have code to do this). But this should only be necessary for more complex PDFs, not the output of scanning software.
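To illustrate why the matrix is state-dependent, here is a deliberately simplified sketch that tracks the current transformation matrix (CTM) through the `q`, `Q` and `cm` operators of a page content stream and reports the matrix in effect at each `Do` (image draw). It is an assumption-laden toy, not what pdftoppm or OCRmyPDF does: it splits on whitespace only, so it ignores strings, arrays, inline images, form XObjects and /Rotate, and the function names and sample streams are hypothetical.

```python
import math

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm for PDF matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def image_matrices(content):
    """Yield (xobject_name, ctm) for every Do operator in a simple stream."""
    ctm, stack, operands = IDENTITY, [], []
    for token in content.split():
        if token == "q":        # save graphics state
            stack.append(ctm)
        elif token == "Q":      # restore graphics state
            ctm = stack.pop()
        elif token == "cm":     # the new matrix is applied before the old CTM
            ctm = concat(tuple(float(v) for v in operands[-6:]), ctm)
        elif token == "Do":     # draw the named XObject with the current CTM
            yield operands[-1], ctm
        else:
            operands.append(token)
            continue
        operands = []


def rotation_degrees(ctm):
    """Counter-clockwise rotation implied by the matrix, in [0, 360)."""
    return math.degrees(math.atan2(ctm[1], ctm[0])) % 360


# Hypothetical content stream: one image rotated by a quarter turn.
for name, ctm in image_matrices("q 0 612 -792 0 792 0 cm /Im0 Do Q"):
    print(name, ctm, rotation_degrees(ctm))
```

Because the angle comes from `atan2`, x and x+180° give different results, which addresses the ambiguity raised above, at least for streams simple enough for this toy parser.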
Comment by fritz-hh
Tue Sep 30 20:12:35 2014
Ok. I understand what you mean now.
I would propose to introduce this feature once ocrpage has been rewritten in Python.
Comment by kebekus
Mon Nov 24 08:24:05 2014
Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that. Here is what I do with my personal files:
- extract content of PDF file as image
- run tesseract on each image, generating hocr files
- use the program hocrTransform.py that comes with OCRmyPDF on each hocr file, in order to generate a PDF file containing the text. I do not include the graphics file
- join the so-generated text-layer-PDF into one file that has exactly as many pages as the original file
- use pdftk 'multibackground' to merge the text-layer-PDF with the original one
Here is a little shell script that does a very similar task (it takes the images from my scanner, not from a PDF file).
#!/bin/bash
# Clear directory
rm -f *.pnm *.djvu
# Scan images
scanimage --batch=scan\%03d.pnm --mode=Gray --adf-auto-scan=yes -x 210mm -y 296mm --resolution 600 --adf-mode=Simplex
# Threshold and cut scanned pages (glob expansion is already sorted)
for page in scan*
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}
    echo "Cut and threshold page $file_name_without_ext"
    pnmcut -left 105 -bottom 6500 <"$page" | pgmtopbm -threshold -value 0.78 >"$file_name_without_ext.pbm"
done
# Generate PDF document
echo "Compress as PDF"
jbig2 -v -p -s scan*.pbm
pdf.py output >fertig.pdf
# Delete jbig2 temporary files
rm output*
# OCR
# OCR each page, and produce PDF file(s) containing the background (=text) layer
for page in scan*.pbm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}
    echo "Character recognition $page"
    tesseract -l eng "$page" "$file_name_without_ext" hocr >/dev/null
    python2 ~/bin/OCRmyPDF-2.2-stable/src/hocrTransform.py -r 600 "$file_name_without_ext.html" "$file_name_without_ext.pdf"
    # Delete temporary hocr file
    rm "$file_name_without_ext.html"
done
# Join PDF files into one file that contains all OCR backgrounds
pdftk scan*.pdf output ocr.pdf
# Delete temporary scan*.pdf files
rm scan*.pdf
# Merge OCR background PDF into the main PDF document
pdftk fertig.pdf multibackground ocr.pdf output fertig-ocr.pdf
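The OCR and merge half of the script could also be driven from Python. The sketch below only builds the command lines, mirroring the tesseract, hocrTransform.py and pdftk invocations used in the script; it does not run them. The function name `plan_text_layer_commands` and its defaults are hypothetical, and the `.html` extension for the hocr output is taken from the script (other tesseract versions may name that file differently).

```python
from pathlib import Path


def plan_text_layer_commands(pages, original_pdf, output_pdf,
                             dpi=600, hocr_transform="hocrTransform.py"):
    """Return the argv lists that add an OCR text layer behind original_pdf."""
    commands, layer_pdfs = [], []
    for page in pages:
        base = str(Path(page).with_suffix(""))
        hocr, layer = base + ".html", base + ".pdf"
        # OCR the page image, producing an hocr file next to it.
        commands.append(["tesseract", "-l", "eng", str(page), base, "hocr"])
        # Turn the hocr file into a one-page, text-only PDF.
        commands.append(["python2", hocr_transform, "-r", str(dpi), hocr, layer])
        layer_pdfs.append(layer)
    # Join the per-page text layers, then put them behind the original pages.
    commands.append(["pdftk", *layer_pdfs, "output", "ocr.pdf"])
    commands.append(["pdftk", original_pdf, "multibackground", "ocr.pdf",
                     "output", output_pdf])
    return commands


for argv in plan_text_layer_commands(["scan001.pbm", "scan002.pbm"],
                                     "fertig.pdf", "fertig-ocr.pdf"):
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` once the tools are installed; keeping the planning separate from the execution makes the sequence easy to inspect first.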
The feature described by @kebekus is now implemented (although not in the manner described). A text layer is inserted into a page extracted from the original PDF, without changing the original PDF page. In some cases the original PDF page needs to be translated or rotated, but this is a lossless operation. (If preprocessing instructions like --deskew are provided, the original page must be changed, so it is.)
ocrmypdf keeps PDF contents unchanged until it runs Ghostscript. Ghostscript may transcode as part of PDF/A conversion.
Sorry to post on a closed issue, but it still seems to be the case that the generated PDF is much larger than the original (#43). So should I take it that Ghostscript is transcoding every time?
Thanks.
@Jmuccigr Ghostscript always transcodes and might transcode JPEGs to higher quality. It also converts JBIG2 to CCITT for monochrome images, which can also make files bigger. If the size increase is more than 20%, open a new ticket with your before and after files and I'll look.
I'm toying with the idea of replacing Ghostscript with pdfbox, which would allow more control over encoding, but that means using Java and depending on a JVM. On purpose! :(
I'll open a new ticket. I'd be happy with just putting the text over the original. Don't need new images.