
Comments (10)

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Sat Sep 20 17:20:53 2014


Unfortunately, it is not (easily) possible to extract the original images from a PDF file using open-source Linux software while keeping their orientation as it is in the PDF file. Therefore I use pdftoppm to GENERATE an image from the PDF file. The image is generated with the same resolution as the original image, but it is not the original image.

If anybody has an idea how to proceed to solve this limitation, please let me know!

from ocrmypdf.

OCRmyPDF-issuebot commented on July 19, 2024

Comment by jbarlow83
Sat Sep 20 20:54:29 2014


The PyPDF2 library can read internal PDF structure and get the page orientation.

$ ipython3
import PyPDF2 as pypdf
pdf = pypdf.PdfFileReader('example.pdf')
pdf.pages[0]['/Rotate']

That field records the rotation anyone has applied to fix the orientation of a given page and should work in a lot of cases. It would be possible to get the page and image dimensions as well, I believe, and faster than the various calls to pdftoppm and pdfimages since everything could be done in a single process. That would be a first step and should work as long as each page contains one image that fills the page – probably good enough for most scanned PDFs with no OCR.
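One caveat with the lookup above: /Rotate is an optional entry, so indexing with ['/Rotate'] fails on pages that do not set it. A minimal sketch of a more defensive read, assuming the page object behaves like a mapping with a .get method (plain dicts stand in for real pages here, and page_rotation is a hypothetical helper name):

```python
def page_rotation(page):
    """Return the page's /Rotate value normalized to 0, 90, 180 or 270."""
    # /Rotate is optional; a missing entry is treated as no rotation.
    raw = page.get("/Rotate", 0)
    angle = int(raw) % 360
    if angle % 90 != 0:
        raise ValueError(f"/Rotate must be a multiple of 90, got {raw!r}")
    return angle


# Stand-in page dictionaries. With a PDF library one would pass the
# library's page object instead (assumption: it supports .get like a dict).
pages = [{"/Rotate": 90}, {}, {"/Rotate": -90}, {"/Rotate": 450}]
rotations = [page_rotation(p) for p in pages]
print(rotations)
```

Negative and oversized values are folded into the 0 to 359 range, so -90 is reported as 270.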

For multiple images on a page you would have to interpret the PostScript to determine where images are rendered, because PostScript can apply an arbitrary transformation matrix (translate, rotate to an arbitrary angle, scale, skew) to an image before rendering it. In this case you'd run OCR jobs on the extracted images and then insert the OCR hidden layer into the PostScript stream. Needless to say, this would be much harder.
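To illustrate what such a transformation matrix looks like, here is a small sketch using the six-number form (a, b, c, d, e, f) that PDF content streams use. It composes scale, rotate and translate, then recovers the rotation angle from the result. The helper names and the example numbers are made up for illustration.

```python
import math


def multiply(m, n):
    """Return m x n for matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def scale(sx, sy):
    return (sx, 0.0, 0.0, sy, 0.0, 0.0)


def rotate(degrees):
    t = math.radians(degrees)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0.0, 0.0)


def translate(tx, ty):
    return (1.0, 0.0, 0.0, 1.0, tx, ty)


def rotation_degrees(m):
    """Angle of the transformed x axis, in degrees within [0, 360)."""
    a, b = m[0], m[1]
    return math.degrees(math.atan2(b, a)) % 360


# A 200 x 100 unit image, rotated by 90 degrees, then moved onto the page.
ctm = multiply(multiply(scale(200, 100), rotate(90)), translate(300, 50))
print(round(rotation_degrees(ctm)))
```

Because atan2 looks at the signs of both a and b, this tells x apart from x+180°. It only describes the direction of the image's x axis, so a mirrored or skewed image would need the c and d entries inspected as well.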


OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Sat Sep 20 21:09:05 2014


Even if the page contains only 1 image, I am not sure knowing the page orientation would be enough. Indeed, there would still be 2 possible rotation angles for the image (x and x+180°). Are there tools to easily extract the transformation matrix of the image (especially the rotation angle)?


OCRmyPDF-issuebot commented on July 19, 2024

Comment by jbarlow83
Sat Sep 20 21:39:42 2014


It's not page orientation as in landscape/portrait use of paper. /Rotate records the 0/90/180/270° rotation that is usually set by a user. It should be enough to determine the image rotation for simple cases like scanned PDF output. The /MediaBox is also part of the picture - one can specify the virtual paper size with /MediaBox, and then rotate it. So for the simple case (scanner PDF output), with /MediaBox and /Rotate, you should be able to determine the orientation of the image.
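As a rough sketch of how /MediaBox and /Rotate combine in the simple case: the box gives the unrotated paper size, and a rotation of 90° or 270° swaps width and height as displayed. The function name and the sample values (a US Letter box of 612 x 792 points) are illustrative assumptions.

```python
def displayed_size(mediabox, rotate=0):
    """Return (width, height) of the page as displayed.

    mediabox is [llx, lly, urx, ury] in points; rotate is the /Rotate
    value, assumed to be a multiple of 90.
    """
    llx, lly, urx, ury = (float(v) for v in mediabox)
    width, height = abs(urx - llx), abs(ury - lly)
    # 90 and 270 degrees turn the page on its side, swapping the two sides.
    if rotate % 180 == 90:
        return (height, width)
    return (width, height)


portrait = displayed_size([0, 0, 612, 792], 0)
sideways = displayed_size([0, 0, 612, 792], 90)
print(portrait, sideways)
```

For a scanner PDF with one full-page image, comparing the displayed size with the image's pixel dimensions would then indicate whether the image is shown upright or on its side.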

If I understand correctly the transformation matrix is sort of like a CPU register in the PostScript language, sensitive to state, so in general you have to interpret all of the preceding PostScript on a page to determine its value at a point of interest. So it is harder (although pdftoppm would have code to do this). But this should only be necessary for more complex PDFs, not the output of scanning software.
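The state-dependence can be sketched with a toy interpreter that tracks only three PDF content stream operators: q (save state), Q (restore state) and cm (concatenate a matrix), and records the matrix in effect at each Do (draw object). This is a simplification under stated assumptions: tokens are split on whitespace, and strings, arrays, inline images and every other operator are ignored. The function names are hypothetical.

```python
def multiply(m, n):
    """Return m x n for matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def matrices_at_do(content):
    """Return [(object_name, matrix)] for each Do operator in content."""
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity at the start of a page
    saved = []      # graphics state stack managed by q and Q
    operands = []   # operands seen since the last handled operator
    found = []
    for tok in content.split():
        if tok == "q":
            saved.append(ctm)
        elif tok == "Q":
            ctm = saved.pop()
        elif tok == "cm":
            new = tuple(float(x) for x in operands[-6:])
            ctm = multiply(new, ctm)  # the new matrix is applied first
        elif tok == "Do":
            found.append((operands[-1], ctm))
        else:
            operands.append(tok)
            continue
        operands.clear()
    return found


stream = "q 2 0 0 2 0 0 cm q 1 0 0 1 10 20 cm /A Do Q /B Do Q"
result = matrices_at_do(stream)
print(result)
```

In the sample stream, /A is drawn under both matrices while /B only sees the outer one, because Q restored the earlier state. That is why the matrix at a given image cannot be read off locally.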


OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Tue Sep 30 20:12:35 2014


Ok. I understand what you mean now.
I would propose to introduce this feature once ocrpage has been rewritten in Python.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by kebekus
Mon Nov 24 08:24:05 2014


Actually, I believe adding a text layer without changing the original contents of the PDF file is easy. PDFTK can do that. Here is what I do with my personal files:

  • extract content of PDF file as image
  • run tesseract on each image, generating hocr files
  • use the program hocrTransform.py that comes with OCRmyPDF on each hocr file, in order to generate a PDF file containing the text. I do not include the graphics file
  • join the so-generated text-layer-PDF into one file that has exactly as many pages as the original file
  • use pdftk 'multibackground' to merge the text-layer-PDF with the original one
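For the PDF-input variant of these steps, the command sequence might look roughly like the sketch below. It only builds the command lines and does not run them. The file names, the 300 dpi default, the ".hocr" output extension (older tesseract versions write ".html") and the way hocrTransform.py is invoked are assumptions modelled on the shell script that follows, not a tested recipe.

```python
import shlex


def text_layer_commands(src_pdf, n_pages, out_pdf, dpi=300, lang="eng",
                        hocr_ext=".hocr", transform="hocrTransform.py"):
    """Return argv lists for the five steps, one page at a time."""
    commands = []
    layer_pdfs = []
    for i in range(1, n_pages + 1):
        base = f"page-{i:04d}"
        # 1. Render page i as an image (pdftoppm writes base + ".png").
        commands.append(["pdftoppm", "-r", str(dpi), "-f", str(i), "-l", str(i),
                         "-singlefile", "-png", src_pdf, base])
        # 2. OCR the image, producing an hOCR file.
        commands.append(["tesseract", f"{base}.png", base, "-l", lang, "hocr"])
        # 3. Turn the hOCR file into a text-only PDF page.
        layer = f"{base}-text.pdf"
        commands.append(["python", transform, "-r", str(dpi),
                         base + hocr_ext, layer])
        layer_pdfs.append(layer)
    # 4. Join the text-only pages into one file.
    commands.append(["pdftk", *layer_pdfs, "cat", "output", "text-layer.pdf"])
    # 5. Put the text layer behind each page of the original.
    commands.append(["pdftk", src_pdf, "multibackground", "text-layer.pdf",
                     "output", out_pdf])
    return commands


commands = text_layer_commands("scan.pdf", 2, "scan-ocr.pdf")
for argv in commands:
    print(shlex.join(argv))
```

Each list could be passed to subprocess.run(argv, check=True) once the tool names and paths have been checked against the local installation.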

Here is a little shell script that does a very similar task (it takes the images from my scanner, not from a PDF file).

#!/bin/bash

# Clear directory
rm -f *.pnm *.djvu


# Scan images
scanimage --batch=scan\%03d.pnm --mode=Gray --adf-auto-scan=yes -x 210mm -y 296mm --resolution 600 --adf-mode=Simplex


# Threshold and cut scanned pages
# The glob expands in sorted order and only matches freshly scanned .pnm files
for page in scan*.pnm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Cut and threshold page $file_name_without_ext"
    pnmcut -left 105 -bottom 6500 <"$page" | pgmtopbm -threshold -value 0.78 >"$file_name_without_ext.pbm"
done


# Generate PDF document
echo "Compress as PDF"
jbig2 -v -p -s scan*.pbm
pdf.py output >fertig.pdf
# Delete jbig2 temporary files
rm output* 


# OCR

# OCR each page, and produce PDF file(s) containing the background (=text) layer
for page in scan*.pbm
do
    file_name=$(basename "$page")
    file_name_without_ext=${file_name%.*}

    echo "Character recognition $page"
    tesseract -l eng "$page" "$file_name_without_ext" hocr >/dev/null
    python2 ~/bin/OCRmyPDF-2.2-stable/src/hocrTransform.py -r 600 "$file_name_without_ext.html" "$file_name_without_ext.pdf"
    # Delete temporary hocr file
    rm "$file_name_without_ext.html"
done

# Join PDF files into one file that contains all OCR backgrounds
pdftk scan*.pdf output ocr.pdf
# Delete temporary scan*.pdf files
rm scan*.pdf

# Merge OCR background PDF into the main PDF document
pdftk fertig.pdf multibackground ocr.pdf output fertig-ocr.pdf


jbarlow83 commented on July 19, 2024

The feature described by @kebekus is now implemented (although not in the manner described). A text layer is inserted into a page extracted from the original PDF, without changing the original PDF page. In some cases the original PDF page needs to be translated or rotated, but this is a lossless operation. (If preprocessing instructions like --deskew are provided, the original page must be changed, so it is.)

ocrmypdf keeps PDF contents unchanged until it runs Ghostscript. Ghostscript may transcode as part of PDF/A conversion.


Jmuccigr commented on July 19, 2024

Sorry to post on a closed issue, but it still seems to be the case that the generated PDF is much larger than the original (#43). So should I take it that Ghostscript is transcoding every time?

Thanks.


jbarlow83 commented on July 19, 2024

@Jmuccigr Ghostscript always transcodes and might transcode JPEGs to higher quality. It also converts JBIG2 to CCITT for monochrome, which can also make files bigger. If the size increase is more than 20%, open a new ticket with your before and after files and I'll look.
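A quick way to check a file pair against that 20% figure is sketched below. The helper name and the byte counts are illustrative only.

```python
def size_increase_percent(before_bytes, after_bytes):
    """Return the size change from before to after as a percentage."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be positive")
    return (after_bytes - before_bytes) / before_bytes * 100.0


# Made-up sizes; real values could come from os.path.getsize(path).
increase = size_increase_percent(1_000_000, 1_250_000)
worth_reporting = increase > 20
print(increase, worth_reporting)
```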

I'm toying with the idea of replacing Ghostscript with pdfbox, which would allow more control over encoding, but that means using Java and depending on a JVM. On purpose! :(


Jmuccigr commented on July 19, 2024

I'll open a new ticket. I'd be happy with just putting the text over the original. Don't need new images.
