original images not kept unaltered (ocrmypdf/ocrmypdf)

Comments (10)

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Sat Sep 20 17:20:53 2014

Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software while keeping their orientation as it appears in the PDF. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated at the same resolution as the original image, but it is not the original image.

If anybody has an idea how to get around this limitation, please let me know!

from ocrmypdf.

OCRmyPDF-issuebot commented on July 19, 2024

Comment by jbarlow83
Sat Sep 20 20:54:29 2014

The PyPDF2 library can read internal PDF structure and get the page orientation.

$ ipython3
import PyPDF2 as pypdf
pdf = pypdf.PdfFileReader('example.pdf')
pdf.pages[0]['/Rotate']

That field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to pdftoppm and pdfimages since everything could be done in a single process. That would be a first step and should work as long as each page contains one image that fills the page – probably good enough for most scanned PDFs with no OCR.

For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the hidden OCR layer into the PostScript stream. Needless to say, this would be much harder.

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Sat Sep 20 21:09:05 2014

Even if the page contains only one image, I am not sure that knowing the page orientation would be enough. There would still be two possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?

OCRmyPDF-issuebot commented on July 19, 2024

Comment by jbarlow83
Sat Sep 20 21:39:42 2014

It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image.
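To make the simple case concrete, here is a minimal sketch of combining /MediaBox and /Rotate, assuming both values have already been read from the page dictionary (for example with PyPDF2 as shown earlier). The function name `displayed_page_size` is made up for illustration; the only facts relied on are that /MediaBox is stored as [llx lly urx ury] and that /Rotate is a clockwise angle in multiples of 90 degrees.

```python
def displayed_page_size(media_box, rotate=0):
    """Return (width, height) in PDF points as the page is displayed.

    media_box: the four numbers [llx, lly, urx, ury] from /MediaBox.
    rotate:    the /Rotate value (clockwise degrees, a multiple of 90).
    """
    llx, lly, urx, ury = (float(v) for v in media_box)
    width, height = abs(urx - llx), abs(ury - lly)

    rotate %= 360  # normalise values such as -90 or 450
    if rotate % 90 != 0:
        raise ValueError("/Rotate must be a multiple of 90")

    # A quarter turn swaps the roles of width and height.
    if rotate in (90, 270):
        width, height = height, width
    return width, height


if __name__ == "__main__":
    # US Letter stored in portrait but displayed rotated by 90 degrees.
    print(displayed_page_size([0, 0, 612, 792], rotate=90))
```

If the single image on a scanned page fills the /MediaBox, the same /Rotate value would tell you how the stored image has to be turned to match what the viewer shows.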

If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although pdftoppm would have code to do this). But this should only be necessary for more complex PDFs, not the output of scanning software.
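For the general case, the state tracking described above can be sketched in a few lines. In a PDF page the drawing instructions live in a content stream, where `q`/`Q` save and restore the graphics state, `cm` concatenates a matrix onto the current transformation matrix, and `Do` paints a named image. The sketch below is a toy interpreter, not what pdftoppm does: it assumes a whitespace-separated stream and ignores strings, inline images, comments and form XObjects. The names `concat` and `image_placements` are invented for this example.

```python
import math

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def image_placements(content):
    """Yield-like list of (image name, CTM, rotation in degrees) per `Do`."""
    ctm = IDENTITY
    saved = []       # graphics state stack for q / Q
    operands = []    # tokens seen since the last handled operator
    placements = []
    for token in content.split():
        if token == "q":
            saved.append(ctm)
            operands.clear()
        elif token == "Q":
            ctm = saved.pop()
            operands.clear()
        elif token == "cm":
            # The six numbers directly before `cm` are its operands.
            matrix = tuple(float(v) for v in operands[-6:])
            ctm = concat(matrix, ctm)
            operands.clear()
        elif token == "Do":
            # Rotation of the image's x axis, counterclockwise in degrees.
            angle = math.degrees(math.atan2(ctm[1], ctm[0]))
            placements.append((operands[-1], ctm, angle))
            operands.clear()
        else:
            operands.append(token)
    return placements


if __name__ == "__main__":
    stream = "q 1 0 0 1 10 20 cm q 0 1 -1 0 0 0 cm /Im1 Do Q Q"
    for name, ctm, angle in image_placements(stream):
        print(name, ctm, angle)
```

Because the full matrix is kept, this also distinguishes x from x+180°: the two cases differ in the signs of the first two matrix entries.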

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Tue Sep 30 20:12:35 2014

Ok. I understand what you mean now.
I would propose to introduce this feature once ocrpage has been rewritten in Python.

OCRmyPDF-issuebot commented on July 19, 2024

Comment by kebekus
Mon Nov 24 08:24:05 2014

Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that. Here is what I do with my personal files:

1. Extract the content of the PDF file as images.
2. Run tesseract on each image, generating hocr files.
3. Run the program hocrTransform.py, which comes with OCRmyPDF, on each hocr file to generate a PDF file containing only the text. I do not include the graphics file.
4. Join the text-layer PDFs generated this way into one file that has exactly as many pages as the original file.
5. Use pdftk 'multibackground' to merge the text-layer PDF with the original one.

Here is a little shell script that does a very similar task (it takes the images from my scanner, not from a PDF file).

#!/bin/bash

# Clear directory
rm -f *.pnm *.djvu


# Scan images
scanimage --batch=scan\%03d.pnm --mode=Gray --adf-auto-scan=yes -x 210mm -y 296mm --resolution 600 --adf-mode=Simplex


# Threshold and cut scanned pages
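# Glob only the raw scans (the shell expands globs in sorted order) and
# quote variables so that unusual file names do not break the loop
for page in scan*.pnm
do
    file_name=$(basename "$page")
    file_name_witout_ext=${file_name%.*}

    echo "Cut and threshold page $file_name_witout_ext"
    pnmcut -left 105 -bottom 6500 <"$page" | pgmtopbm -threshold -value 0.78 >"$file_name_witout_ext.pbm"
done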


# Generate PDF document
echo "Compress as PDF"
jbig2 -v -p -s scan*.pbm
pdf.py output >fertig.pdf
# Delete jbig2 temporary files
rm output* 


# OCR

# OCR each page, and produce PDF file(s) containing the background (=text) layer
for page in scan*.pbm
do
    file_name=$(basename "$page")
    file_name_witout_ext=${file_name%.*}

    echo "Character recognition $page"
    tesseract -l eng "$page" "$file_name_witout_ext" hocr >/dev/null
    python2 ~/bin/OCRmyPDF-2.2-stable/src/hocrTransform.py -r 600 "$file_name_witout_ext.html" "$file_name_witout_ext.pdf"
    # Delete temporary hocr file
    rm "$file_name_witout_ext.html"
done

# Join PDF files into one file that contains all OCR backgrounds
pdftk scan*.pdf output ocr.pdf
# Delete temporary scan*.pdf files
rm scan*.pdf

# Merge OCR background PDF into the main PDF document
pdftk fertig.pdf multibackground ocr.pdf output fertig-ocr.pdf
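For the PDF-input variant of the five steps listed before the script, a rough Python sketch might look like the following. It only builds the command lines and does not run them, so it can be inspected without the tools installed. Several things here are assumptions rather than part of the workflow described above: the function name `text_layer_commands`, the `page-0001` naming scheme, rendering one page per pdftoppm call with `-singlefile`, and Tesseract writing `.hocr` output (the script above got `.html` from its Tesseract version). The hocrTransform.py location and interpreter are parameters because they depend on the local installation.

```python
from pathlib import Path


def text_layer_commands(pdf, pages, workdir, out_pdf, dpi=300,
                        hocr_transform="hocrTransform.py", python="python2"):
    """Build argv lists for the five steps; nothing is executed here."""
    pdf, workdir = Path(pdf), Path(workdir)
    commands, layers = [], []

    for n in range(1, pages + 1):
        base = workdir / f"page-{n:04d}"
        # 1. Render page n as an image (assumed to produce base.png).
        commands.append(["pdftoppm", "-r", str(dpi), "-f", str(n), "-l", str(n),
                         "-png", "-singlefile", str(pdf), str(base)])
        # 2. OCR the image into hocr (assumed to produce base.hocr).
        commands.append(["tesseract", f"{base}.png", str(base),
                         "-l", "eng", "hocr"])
        # 3. Turn the hocr file into a text-only PDF at the same resolution.
        commands.append([python, hocr_transform, "-r", str(dpi),
                         f"{base}.hocr", f"{base}.pdf"])
        layers.append(f"{base}.pdf")

    text_pdf = workdir / "ocr.pdf"
    # 4. Join the per-page text layers, one page per original page.
    commands.append(["pdftk", *layers, "cat", "output", str(text_pdf)])
    # 5. Put each text page behind the matching original page.
    commands.append(["pdftk", str(pdf), "multibackground", str(text_pdf),
                     "output", str(out_pdf)])
    return commands


if __name__ == "__main__":
    for cmd in text_layer_commands("in.pdf", 2, "work", "out.pdf"):
        print(" ".join(cmd))
```

To actually run the steps you could pass each list to `subprocess.run(cmd, check=True)`. The rendering resolution and the `-r` value given to hocrTransform.py are kept identical on purpose, since the text layer presumably only lines up with the page when both agree.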

jbarlow83 commented on July 19, 2024

The feature described by @kebekus is now implemented (although not in the manner described). A text layer is inserted into a page extracted from the original PDF, without changing the original PDF page. In some cases the original PDF page needs to be translated or rotated, but this is a lossless operation. (If preprocessing instructions like --deskew are provided, the original page must be changed, so it is.)

ocrmypdf keeps PDF contents unchanged until it runs Ghostscript. Ghostscript may transcode as part of PDF/A conversion.

Jmuccigr commented on July 19, 2024

Sorry to post on a closed issue, but it still seems to be the case that the generated PDF is much larger than the original (#43). So should I take it that Ghostscript is transcoding every time?

Thanks.

jbarlow83 commented on July 19, 2024

@Jmuccigr Ghostscript always transcodes and might transcode JPEGs to a higher quality. It also converts JBIG2 to CCITT for monochrome images, which can also make files bigger. If the size increase is more than 20%, open a new ticket with your before and after files and I'll look.
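A quick way to check whether a file pair crosses that 20% mark is to compare the sizes on disk. This is just a small helper sketch; the function names are made up and the default threshold is the figure mentioned above.

```python
import os


def size_increase(before_path, after_path):
    """Return the relative size change, e.g. 0.25 for a 25% larger output."""
    before = os.path.getsize(before_path)
    after = os.path.getsize(after_path)
    if before == 0:
        raise ValueError("input file is empty")
    return (after - before) / before


def exceeds_threshold(before_path, after_path, threshold=0.20):
    """True if the output grew by more than `threshold` (20% by default)."""
    return size_increase(before_path, after_path) > threshold
```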

I'm toying with the idea of replacing Ghostscript with pdfbox, which would allow more control over encoding, but that means using Java and depending on a JVM. On purpose! :(

Jmuccigr commented on July 19, 2024

I'll open a new ticket. I'd be happy with just putting the text over the original. Don't need new images.
