ocrmypdf / ocrmypdf

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

Home Page: http://ocrmypdf.readthedocs.io/

License: Mozilla Public License 2.0

Shell 2.90% Python 96.75% Dockerfile 0.35%
python ocr pdf image-processing tesseract

ocrmypdf's Introduction

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
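
When OCRmyPDF is driven from a script, the same options can be assembled as an argument list. The following is a minimal sketch in Python that uses only the flags shown above; the helper name `build_ocrmypdf_command` is hypothetical and not part of OCRmyPDF.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng", "fra"),
                           title="My PDF", jobs=4):
    """Assemble the argument list for the invocation shown above.

    Hypothetical helper: only flags that appear in the example are used.
    """
    return [
        "ocrmypdf",
        "-l", "+".join(languages),   # multiple languages are joined with "+"
        "--rotate-pages",
        "--deskew",
        "--title", title,
        "--jobs", str(jobs),
        "--output-type", "pdfa",
        str(input_pdf),
        str(output_pdf),
    ]


cmd = build_ocrmypdf_command("input_scanned.pdf", "output_searchable.pdf")
# Print a shell-safe rendering; the list itself can be passed to subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems with titles or file names that contain spaces.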

See the release notes for details on the latest changes.

Main features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Keeps your private data private.
  • Scales properly to handle files with thousands of pages.
  • Battle-tested on millions of PDFs.

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files. I found many, but none of them was really satisfactory:

  • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
  • Or they did not handle accents and multilingual characters
  • Or they changed the resolution of the embedded images
  • Or they generated ridiculously large PDF files
  • Or they crashed when trying to OCR
  • Or they did not produce valid PDF files
  • On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

Operating system               Install command
Debian, Ubuntu                 apt install ocrmypdf
Windows Subsystem for Linux    apt install ocrmypdf
Fedora                         dnf install ocrmypdf
macOS (Homebrew)               brew install ocrmypdf
macOS (MacPorts)               port install ocrmypdf
macOS (nix)                    nix-env -i ocrmypdf
LinuxBrew                      brew install ocrmypdf
FreeBSD                        pkg install py-ocrmypdf
Conda                          conda install ocrmypdf
Ubuntu Snap                    snap install ocrmypdf

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. Linux users can often find distribution packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.

OCRmyPDF supports Tesseract 4.1.1+. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.
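
The PATH lookup and the minimum-version rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not OCRmyPDF's own code: the helper names are hypothetical, the version parsing is deliberately simple, and the Windows Registry fallback is not covered.

```python
import re
import shutil

MINIMUM_TESSERACT = (4, 1, 1)  # taken from the requirement stated above


def find_on_path(program="tesseract"):
    """Return the first matching executable on PATH, or None if there is none."""
    return shutil.which(program)


def parse_version(text):
    """Extract the first dotted number in a version string as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_supported(version_text, minimum=MINIMUM_TESSERACT):
    """Compare a reported version with the minimum, component by component."""
    return parse_version(version_text) >= minimum


print(find_on_path())                     # a path, or None when Tesseract is absent
print(is_supported("tesseract 3.04.00"))  # an old version string seen in the issue logs
```

`shutil.which` returns the first match in PATH order, which mirrors the "whichever version it finds first" behaviour described above.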

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for a quick response.

Requirements

In addition to the required Python version (3.8+), OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.
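
A quick pre-flight check for these requirements might look like the sketch below. The executable names `gs` and `tesseract` are assumptions (Ghostscript in particular may be installed under a different name on Windows), and `missing_requirements` is a hypothetical helper, not an OCRmyPDF function.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 8)
# Executable names are assumptions; adjust them for the platform in use.
REQUIRED_PROGRAMS = {"Ghostscript": "gs", "Tesseract OCR": "tesseract"}


def missing_requirements(which=shutil.which, python_version=sys.version_info):
    """Return a list of human-readable names for requirements that are not met."""
    problems = []
    if tuple(python_version[:2]) < REQUIRED_PYTHON:
        problems.append("Python 3.8+")
    for label, executable in REQUIRED_PROGRAMS.items():
        # shutil.which returns None when the executable is not on PATH
        if which(executable) is None:
            problems.append(label)
    return problems


print(missing_requirements() or "all requirements found")
```

The `which` and `python_version` parameters exist only so the check can be exercised without the real programs installed.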

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as indicated by standard SPDX license identifiers or the DEP5 copyright and licensing information file. Generally speaking, non-core code is licensed under MIT, and the documentation and test files are licensed under Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

ocrmypdf's People

Contributors

androbin, brlin-tw, cforcey, ctbarbour, dependabot[bot], dorianscholz, dotlambda, fpille, fritz-hh, hrnciar, ianalexander, jbarlow83, knobix, mara004, mawi12345, mb720, musicinmybrain, nilsro, oxplot, pigmonkey, qulogic, spwhitton, ss8931, stumpylog, stweil, timgates42, tklerx, tomraz, xave, yasoob

ocrmypdf's Issues

Introduce environment variables to redirect subprocess calls

For example, OCRMYPDF_GHOSTSCRIPT could point at an alternate Ghostscript binary.

This is mainly for test cases, to allow replacing the real binary with one that always fails, or to stub out/cache Tesseract when the OCR output doesn't matter.
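
A minimal sketch of how such an override could be resolved is shown below. Only `OCRMYPDF_GHOSTSCRIPT` comes from the proposal above; the helper `resolve_program`, the default executable names, and the extension of the naming scheme to other programs are assumptions made for illustration.

```python
import os

# Default executable names are assumptions made for this sketch.
DEFAULT_EXECUTABLES = {"ghostscript": "gs", "tesseract": "tesseract"}


def resolve_program(name, environ=None):
    """Return the executable to run for `name`, honouring an override variable.

    The variable is OCRMYPDF_<NAME>, following the OCRMYPDF_GHOSTSCRIPT
    example; applying the same pattern to other programs is an assumption.
    """
    environ = os.environ if environ is None else environ
    variable = "OCRMYPDF_" + name.upper()
    return environ.get(variable, DEFAULT_EXECUTABLES[name])


# In a test, point Ghostscript at a stub that always fails
fake_env = {"OCRMYPDF_GHOSTSCRIPT": "/tmp/always-fail-gs"}
print(resolve_program("ghostscript", fake_env))
print(resolve_program("tesseract", fake_env))
```

Taking the environment as a parameter keeps the lookup itself easy to test without modifying `os.environ`.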

original images not kept unaltered

Issue by femifrak
Wed May 28 16:01:43 2014
Originally opened as fritz-hh/OCRmyPDF#78


When using the 2.x version available as a zip file on the right side of
https://github.com/fritz-hh/OCRmyPDF
with Xubuntu 14.04, the original PDF is altered although I did not use -i.
The first page of
http://www.loaditup.de/files/817245_gcstsh3wuy.pdf
shows the original black and white PDF, the second page the altered PDF, which unfortunately looks frazzled. (I merged both pages for convenience.)
Is there a way to avoid this quality loss?

I tested the suggestion of #61 but without success, which is clear as no "-i" was used.
I also tested a pdf with integer number of pixels but without success.
Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html

Thanks for the help.

Here is the output with -g:

># /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g sw_original.pdf sw_original_OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g sw_original.pdf sw_original_OCR.pdf
Checking if all dependencies are installed
--------------------------------
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP    

--------------------------------
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, 
;login: The USENIX Magazine, February 2011:42-47.
--------------------------------
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
--------------------------------
unpaper version:
0.4.2
--------------------------------
tesseract version:
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

--------------------------------
python2 version:
Python 2.7.6
--------------------------------
Ghostscript version:
9.10
--------------------------------
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
--------------------------------
Created temporary folder: "/tmp/tmp.ZIHGjUFKJS"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x594 (h*w in pt)
Page 0001: Size 3508x2477 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.ZIHGjUFKJS/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 25 seconds

pagesize is altered by ocrmypdf

Hi,

thank you very much for this really decent software!

When I run ocrmypdf 3.1 on a PDF that contains pages in A5 size (148 x 210 mm), the output page size is not the same (125 x 210 mm).

Regards,

Femi
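
PDF page boxes are expressed in points (1/72 inch), so the millimetre sizes in this report have to be converted before they can be compared with the page dimensions stored in a file. A small sketch of that arithmetic; the function names are illustrative only.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def points_to_mm(points):
    """Convert PDF points back to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


for label, (width_mm, height_mm) in {"A5 input": (148, 210),
                                     "reported output": (125, 210)}.items():
    print(f"{label}: {mm_to_points(width_mm):.1f} x {mm_to_points(height_mm):.1f} pt")
```

An A5 page is therefore about 419.5 x 595.3 pt, while the reported output corresponds to about 354.3 x 595.3 pt, so only the width differs.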

Output pdf is distorted

Hey!

I have a problem. I played around a lot with the options of ocrmypdf, but my output file is still heavily distorted.

The input and the output file as well as the file generated in /tmp can be found here:
http://www.file-upload.net/download-11132774/sample_in.pdf.html
http://www.file-upload.net/download-11132775/sample_out.pdf.html
http://www.file-upload.net/download-11132776/com.github.ocrmypdf.wfze06uzsample_in.repaired.pdf.html

Thanks for your help!
Sammy

Here is the stdout from running with -v 1:
$ ocrmypdf -v 1 sample_in.pdf sample_out.pdf


Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'images': [{'width': 4299, 'height': 3035, 'bpc': 8, 'dpi': Decimal('482.983'), 'comp': 1, 'enc': 'jpeg', 'color': 'gray', 'dpi_h': Decimal('333.110'), 'dpi_w': Decimal('700.289')}], 'width_pixels': 4299, 'xres': Decimal('700.289'), 'has_text': False, 'pageno': 0, 'height_inches': Decimal('9.11111'), 'yres': Decimal('333.110'), 'height_pixels': 3035, 'width_inches': Decimal('6.13889')}]

Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.page.pdf, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'

WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

Rendering 000001.ocr.page.pdf with pnggray

Completed Task = 'ocrmypdf.main.generate_postscript_stub'
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.page.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean'
Task enters queue = 'ocrmypdf.main.select_image_for_pdf'
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr'
Completed Task = 'ocrmypdf.main.select_image_for_pdf'
Tesseract Open Source OCR Engine v3.03 with Leptonica

Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.main.render_hocr_page'
Completed Task = 'ocrmypdf.main.render_hocr_page'
Task enters queue = 'ocrmypdf.main.merge_pages'
['/tmp/com.github.ocrmypdf.e6lkdkqy/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.e6lkdkqy/com.github.ocrmypdf.e6lkdkqysample_in.pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages'
Task enters queue = 'ocrmypdf.main.copy_final'
Completed Task = 'ocrmypdf.main.copy_final'
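
The page description printed near the top of this log can be checked by hand: resolution along each axis is pixels divided by inches. The sketch below redoes that arithmetic with the values from the log. Treating the reported 'dpi' as the geometric mean of the two axis resolutions is an observation that happens to fit the numbers, not something the log states.

```python
from math import sqrt

# Values copied from the page description in the log above
WIDTH_PIXELS, HEIGHT_PIXELS = 4299, 3035
WIDTH_INCHES, HEIGHT_INCHES = 6.13889, 9.11111


def resolution(pixels, inches):
    """Dots per inch along one axis."""
    return pixels / inches


xres = resolution(WIDTH_PIXELS, WIDTH_INCHES)    # compare with 'xres' in the log
yres = resolution(HEIGHT_PIXELS, HEIGHT_INCHES)  # compare with 'yres' in the log
combined = sqrt(xres * yres)                     # geometric mean; compare with 'dpi'

print(f"xres={xres:.3f} yres={yres:.3f} combined={combined:.3f} ratio={xres / yres:.2f}")
```

The horizontal resolution comes out at about 2.1 times the vertical one. The image is wider than it is tall (4299 x 3035 pixels) while the page is taller than it is wide, and that mismatch may be related to the distortion, although the log alone does not establish the cause.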

Make ruffus pipeline re-entrant

In its current form the pipeline is not re-entrant -- it is assembled based on command line arguments prior to main() and cannot be changed after that. As such, there is no value to "import ocrmypdf".

Also, all test cases need to run in a subprocess which is not ideal for inspecting test failures.

A re-entrant pipeline would make it possible to customize the pipeline if ocrmypdf were used as a library.
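
The difference can be illustrated with a toy example in which the pipeline is built by a function from an options object instead of being assembled once from command line arguments. Everything here is hypothetical (the `Options` class, the step names, `build_pipeline`); it does not use ruffus and is not how OCRmyPDF is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List

Step = Callable[[str], str]


@dataclass
class Options:
    deskew: bool = False
    clean: bool = False


def make_step(name: str) -> Step:
    """Create a stand-in processing step that records its name on the page."""
    def step(page: str) -> str:
        return page + ">" + name
    return step


def build_pipeline(options: Options) -> List[Step]:
    """Assemble a fresh list of steps on every call; no module-level state."""
    steps = [make_step("rasterize")]
    if options.deskew:
        steps.append(make_step("deskew"))
    if options.clean:
        steps.append(make_step("clean"))
    steps.append(make_step("ocr"))
    return steps


def run_pipeline(steps: List[Step], page: str) -> str:
    for step in steps:
        page = step(page)
    return page


# Two differently configured pipelines can coexist in one process
print(run_pipeline(build_pipeline(Options(deskew=True)), "page1"))
print(run_pipeline(build_pipeline(Options()), "page1"))
```

Because nothing is decided at import time, a test can build and run a pipeline in-process and inspect failures directly, and a library user can insert or remove steps before running it.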

problem with unpaper

Issue by femifrak
Wed May 7 05:34:43 2014
Originally opened as fritz-hh/OCRmyPDF#75


When using OCRmyPDF-2.x with -dci, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black and white PDF.

Here is the output:

root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0


python2 version:

Python 2.7.6

Ghostscript version:

9.10

Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)

OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds

Document Python API?

Issue by mlissner
Fri Aug 21 00:10:05 2015
Originally opened as fritz-hh/OCRmyPDF#114


I could be wrong, but I haven't been able to find documentation for the command itself, neither for the command line API nor for the Python API that looks like it might be coming in 3.0.

Am I blind? If not, this would be great to get. If so, my apologies!

Looks like a great project.

Error when trying to OCR JPEG or PNG

When I try to run:

sudo ocrmypdf --verbose 3 eiffel.jpg eiffel.pdf

I get:

Original exception:
Exception #1
  'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
   Task = def ocrmypdf.main.split_pages(...):
   Job  = [[] -> .../com.github.ocrmypdf.45n_qza7/*.page.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 415, in split_pages
    npages = qpdf.get_npages(input_file)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/qpdf.py", line 68, in get_npages
    universal_newlines=True, close_fds=True)
  File "/usr/lib/python3.4/subprocess.py", line 607, in check_output
    with Popen(*popenargs, stdout=PIPE, **kwargs) as process:
  File "/usr/lib/python3.4/subprocess.py", line 859, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.4/subprocess.py", line 1395, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert 'list' object to str implicitly

If I try the same thing on a PDF file it works fine. This is for version 3.1.1, thanks!

I can reproduce the bug on both Mac OS X El Capitan and Debian 8, and also in versions 3.1 and 3.0.

The file in question is here (yes, I know there isn't any text; I was just using it for testing):

eiffel

Linux Install Error

When installing on Debian Wheezy I am getting:

$ sudo pip-3.2 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.1.tar.gz
Running setup.py egg_info for package ocrmypdf
Traceback (most recent call last):
  File "<string>", line 14, in <module>
  File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
    from collections.abc import Mapping
ImportError: No module named abc
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 14, in <module>
  File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
    from collections.abc import Mapping

ImportError: No module named abc

Command python setup.py egg_info failed with error code 1 in /home/shaun/build/ocrmypdf
Storing complete log in /root/.pip/pip.log

Any idea how to fix this? Thanks!
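
The `collections.abc` module was added in Python 3.3, so the import in setup.py cannot succeed under the Python 3.2 interpreter that pip-3.2 runs. A guarded import of the following kind is the usual way to support both; this is a sketch only, and nothing in the report says whether the rest of the package would work on Python 3.2.

```python
import sys

try:
    from collections.abc import Mapping   # Python 3.3 and later
except ImportError:
    from collections import Mapping       # location used by Python 3.2 and earlier


def describe_interpreter(version_info=sys.version_info):
    """Say whether the given interpreter version provides collections.abc."""
    has_abc = tuple(version_info[:2]) >= (3, 3)
    return "collections.abc available" if has_abc else "collections.abc missing"


print(describe_interpreter())           # the interpreter running this sketch
print(describe_interpreter((3, 2, 3)))  # a Python 3.2 interpreter, as used by pip-3.2
```

Installing with a newer Python 3 interpreter avoids the error without any change to setup.py.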

weird text order

Issue by femifrak
Fri Jan 31 18:50:11 2014
Originally opened as fritz-hh/OCRmyPDF#64


OCRmyPDF is brilliant, but sometimes I have a problem with the order of the underlaid text. When I select text starting at the top left, move to the right end of the line, and then continue down line by line, some stretches of text are left unselected. A few lines further on, these gaps suddenly become selected. Copying the selected text and pasting it into another application reveals the order, which is unfortunately wrong. I use the latest stable version and get no error or warning messages.

http://www.loaditup.de/files/803343_acxm67dsue.pdf
(problem occurs in second paragraph.)

installation problems

I just wanted to install 4.0.1 but unfortunately had no success.
Have you got a clue how to align the ducks?


>$ sudo pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
  Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
  Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git

    Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
    zip_safe flag not set; analyzing archive contents...

    Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
        zip_safe=False)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
        self.fetch_build_eggs(attrs['setup_requires'])
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
        dist = working_set.find(req)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
        raise VersionConflict(dist, req)
    pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
    Complete output from command python setup.py egg_info:


Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg

zip_safe flag not set; analyzing archive contents...



Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg

Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>

    zip_safe=False)

  File "/usr/lib/python3.4/distutils/core.py", line 108, in setup

    _setup_distribution = dist = klass(attrs)

  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__

    self.fetch_build_eggs(attrs['setup_requires'])

  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs

    replace_conflicting=True,

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve

    dist = best[req.key] = env.best_match(req, ws, installer)

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match

    dist = working_set.find(req)

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find

    raise VersionConflict(dist, req)

pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Storing debug log for failure in /home/xxx/.pip/pip.log

The mentioned pip.log file says:

------------------------------------------------------------
/usr/bin/pip3 run on Thu Feb 18 12:14:54 2016
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
  Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
  Found command 'git' at '/usr/bin/git'
  Running command /usr/bin/git clone -q https://github.com/jbarlow83/OCRmyPDF.git /tmp/pip-jyrz2gnr-build
  Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git

    Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
    zip_safe flag not set; analyzing archive contents...

    Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
        zip_safe=False)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
        self.fetch_build_eggs(attrs['setup_requires'])
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
        dist = working_set.find(req)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
        raise VersionConflict(dist, req)
    pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
    Complete output from command python setup.py egg_info:


Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg

zip_safe flag not set; analyzing archive contents...



Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg

Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>

    zip_safe=False)

  File "/usr/lib/python3.4/distutils/core.py", line 108, in setup

    _setup_distribution = dist = klass(attrs)

  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__

    self.fetch_build_eggs(attrs['setup_requires'])

  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs

    replace_conflicting=True,

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve

    dist = best[req.key] = env.best_match(req, ws, installer)

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match

    dist = working_set.find(req)

  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find

    raise VersionConflict(dist, req)

pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 304, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3/dist-packages/pip/req.py", line 1230, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3/dist-packages/pip/req.py", line 326, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3/dist-packages/pip/util.py", line 716, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
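
The decisive line in both logs is the VersionConflict: the system-wide cffi in /usr/lib/python3/dist-packages is 1.1.2, while the build requires cffi>=1.5.0. The comparison behind that message can be sketched as below. It is a simplification that handles only plain dotted numeric versions; real packaging tools apply the fuller PEP 440 rules.

```python
def version_tuple(text):
    """Turn a plain dotted version such as '1.5.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """True when the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


installed_cffi = "1.1.2"   # from the VersionConflict message
required_cffi = "1.5.0"    # from Requirement.parse('cffi>=1.5.0')
print(satisfies_minimum(installed_cffi, required_cffi))
```

Since the installed version is older than the minimum, upgrading cffi before retrying the installation is one thing to try.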

Spell check with aspell

Issue by witchi
Mon Mar 23 10:50:36 2015
Originally opened as fritz-hh/OCRmyPDF#106


Hi,

Nice script, I use it with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
Can you enhance your script with a call to aspell?

I have tried it within src/ocrPage.sh on line 198:

# perform spell check
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Performing spell check"
!aspell --dont-backup --lang=de_DE --mode=sgml -c "${curHocr}" < /dev/tty   \
        && echo "Could not spell checking file \"${curHocr}\". Exiting..." && exit $EXIT_OTHER_ERROR

but it doesn't work with the GNU Parallel tool.

Thank you
Andre

No output pdf file

Issue by sjoswig
Wed Jul 22 09:25:37 2015
Originally opened as fritz-hh/OCRmyPDF#110


I'm using ocrmypdf 2.1.0-1 on my Arch system. For the last few weeks I had no problem running OCR on PDFs with ocrmypdf, but now only temporary files are created and no output PDF is produced.

Here is the log file:

`OCRmyPDF version: v2.1-stable
Arguments: -f -vvv -l deu 2015-03-13 Kraftfahrtversicherung_ohne.pdf /home/js/Share/2015-03-13 Kraftfahrtversicherung.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.9.1-8 Q16 x86_64 2015-07-14 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenCL OpenMP
Delegates (built-in): bzlib cairo fontconfig freetype gslib jng jp2 jpeg lcms lqr ltdl lzma pangocairo png ps rsvg tiff webp wmf x xml zlib


GNU Parallel version:
GNU parallel 20150622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication

please cite as described in 'parallel --bibtex'.

Poppler-utils version:
pdfimages version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

6.1

tesseract version:
tesseract 3.04.00
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.16 : libtiff 4.0.4 : zlib 1.2.8 : libwebp 0.4.3


python2 version:

Python 2.7.10

Ghostscript version:

9.16

Java version:
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)

Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.XZtlIvt11N"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x596 (h*w in pt)
Page 0001: Size 2482x3510 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF`

OCRmyPDF with docker on CentOS7 not working

Hi,
I followed the docs for installing the docker container.
Running "docker run ocrmypdf --help" works fine.
But if I try to execute ocrmypdf on a local file, I get an error:

[root@CentOS7 test]# docker run -v "/srv/test/:/home/docker/" ocrmypdf ocrmypdf -v 1 x.pdf 1.pdf
usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-L FILE] [-j N]
[--use_threads] [-n] [--flowchart FILE] [-l LANGUAGE]
[--title TITLE] [--author AUTHOR] [--subject SUBJECT]
[--keywords KEYWORDS] [-d] [-c] [-i] [--oversample DPI] [-f]
[-s] [--skip-big MPixels]
[--tesseract-config TESSERACT_CONFIG]
[--pdf-renderer {tesseract,hocr}]
[--tesseract-timeout TESSERACT_TIMEOUT] [-k] [-g]
input_file output_file
ocrmypdf: error: unrecognized arguments: 1.pdf

Any help would be nice.

Thank you!
Kind regards,
Nicole
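
One possible reading of the error above, assuming the image's entrypoint is already `ocrmypdf` (the report does not confirm this): the second `ocrmypdf` on the command line would then be taken as `input_file`, `x.pdf` as `output_file`, and `1.pdf` would be left over. The sketch below reproduces that with a hypothetical, cut-down parser that mirrors only three entries from the usage text; it is not OCRmyPDF's real argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical cut-down parser: only -v/--verbose and the two positionals."""
    parser = argparse.ArgumentParser(prog="ocrmypdf")
    # "--verbose [VERBOSE]" in the usage text suggests an optional value.
    parser.add_argument("-v", "--verbose", nargs="?", type=int, const=1, default=0)
    parser.add_argument("input_file")
    parser.add_argument("output_file")
    return parser


parser = build_parser()

# What the program would receive if the entrypoint already supplies "ocrmypdf".
duplicated = ["ocrmypdf", "-v", "1", "x.pdf", "1.pdf"]
namespace, leftover = parser.parse_known_args(duplicated)
print(namespace.input_file, namespace.output_file, leftover)

# The same arguments without the repeated program name parse cleanly.
clean_namespace, clean_leftover = parser.parse_known_args(["-v", "1", "x.pdf", "1.pdf"])
print(clean_namespace.input_file, clean_namespace.output_file, clean_leftover)
```

If that assumption holds, dropping the repeated name (`docker run -v "/srv/test/:/home/docker/" ocrmypdf -v 1 x.pdf 1.pdf`) would be the first thing to try.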

Some input PDFs with Tesseract-OCR throw error in PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...

Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120


A few PDF input files (which were already processed by tesseract-ocr pdf mode) throw an error in OCRmyPDF even in the --force-ocr mode. At the moment, I have no idea what exactly happens, but the problem appears to be in PyPDF2 (I use PyPDF2 1.25.1).

The error message is only shown when OCRmyPDF is used with the -v option.

Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?

Full output:

Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf' 
Original exception:
    Exception #1
      'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
       Task = def ocrmypdf.main.repair_pdf(...):
       Job  = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
        pdfinfo.extend(pdf_get_all_pageinfo(output_file))
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
        text = page.extractText()
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
        content = ContentStream(content, self.pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
        self.__parseContentStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
        operands.append(readObject(stream, None))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
        return ArrayObject.readFromStream(stream, pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
        arr.append(readObject(stream, pdf))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
        return readStringFromStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
        raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
    PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'
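
For context on the exception: the PDF specification says that when a backslash inside a literal string is followed by a character that is not a defined escape, the backslash is simply ignored. A sequence such as `\{` is therefore legal input and should decode to `{`. The function below is a minimal sketch of that rule; `unescape_pdf_literal` is a hypothetical helper written for illustration and is not PyPDF2's code or a proposed patch.

```python
# Escapes defined for PDF literal strings, keyed by the byte after the backslash.
_SIMPLE_ESCAPES = {
    ord("n"): b"\n", ord("r"): b"\r", ord("t"): b"\t",
    ord("b"): b"\b", ord("f"): b"\f",
    ord("("): b"(", ord(")"): b")", ord("\\"): b"\\",
}


def unescape_pdf_literal(body: bytes) -> bytes:
    """Decode escapes in the body of a PDF literal string (outer parentheses removed)."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != 0x5C:  # not a backslash: copy through
            out.append(byte)
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= len(body):
            break  # trailing backslash with nothing after it
        nxt = body[i]
        if nxt in _SIMPLE_ESCAPES:
            out += _SIMPLE_ESCAPES[nxt]
            i += 1
        elif 0x30 <= nxt <= 0x37:  # octal escape, one to three digits
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in (0x0A, 0x0D):  # backslash + end of line: line continuation
            i += 1
            if nxt == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:
            out.append(nxt)  # unknown escape: drop the backslash, keep the byte
            i += 1
    return bytes(out)


print(unescape_pdf_literal(rb"a\{b"))
```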

landscape orientation with 90 deg rotation not correctly handled

I have a scanner that produces PDF files with landscape layout but with a 90-degree rotation. Files of this kind are displayed correctly as portrait pages, for example in the Nautilus file manager on Linux.
I also have scanned files from another scanner that produces portrait files directly. Those are handled correctly.
As an example, take the attached test2.pdf, which is a scan of a standard printer test page.
With ocrmypdf, however, I get an incorrect file (see test2b.pdf).
test2.pdf
test2b.pdf

Raw image to OCRmyPDF

Issue by geaplanet
Sun Mar 8 10:46:08 2015
Originally opened as fritz-hh/OCRmyPDF#104


Is there any possibility of passing raw TIFF images to OCRmyPDF as a parameter?
OCRmyPDF converts PDFs to images in order to work with them, but if you already have raw images from a scanner or camera, how can you use them?

misalignment of graphic layer

Sometimes the graphic layer is misaligned while the text layer seems to be placed correctly. I uploaded a sample pdf (test07.pdf) at:

http://www.loaditup.de/838186-ns8hr3kcbg.html

ocrmypdf --oversample 600 test07.pdf test07ocr.pdf

shows what I mean. test07ocr.pdf can be seen here:
http://www.loaditup.de/838187-4hkqhkbvnm.html

Additionally ocrmypdf gives a warning:

   **** File did not complete the page properly and may be damaged.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PyPDF2 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

I don't know whether this warning is justified. At least I have no problems in viewing the pdf in common pdf viewers. Have you got any idea about this?

setup.py fails with python 2.7

Although setup.py attempts to check the Python version and print an error message, with Python 2.7 it never gets that far: it barfs on the copyright symbol on line 2.

 $ python setup.py build
 File "setup.py", line 2
 SyntaxError: Non-ASCII character '\xc2' in file setup.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
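
The PEP 263 rule behind this error is that Python 2 rejects any non-ASCII byte in a source file unless a comment on line 1 or 2 declares an encoding. The sketch below approximates that check so the effect of adding a declaration can be seen; `declared_encoding` and `python2_would_reject` are hypothetical helpers, and the sample file contents are made up rather than copied from the real setup.py.

```python
import re

# PEP 263: the declaration must be a comment on the first or second line.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes):
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in source.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def python2_would_reject(source: bytes) -> bool:
    """Approximate the Python 2 rule: non-ASCII bytes need a declared encoding."""
    has_non_ascii = any(byte > 0x7F for byte in source)
    return has_non_ascii and declared_encoding(source) is None


# Made-up file header with a copyright symbol (UTF-8 bytes C2 A9) on line 2.
without_cookie = "#!/usr/bin/env python3\n# \u00a9 2015 Example Author\n".encode("utf-8")
# Same header with an encoding declaration inserted as line 2.
with_cookie = (
    "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n# \u00a9 2015 Example Author\n"
).encode("utf-8")

print(python2_would_reject(without_cookie), python2_would_reject(with_cookie))
```

With a declaration in place, Python 2 can at least read the file far enough to reach the version check and its friendlier error message, provided the rest of the file is also valid Python 2 syntax.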

Adapt to Google Vision API

Google has just released alpha access to the Google Vision API, and I am hopeful that its OCR will be better than Tesseract's. If that's true, would it be a good option to add as another OCR backend for this project? Maybe you could add a switch somewhere to choose the OCR source. The sign-up page for alpha access is here: https://services.google.com/fb/forms/visionapialpha/. It would be great to get your opinion on this. Thanks!

unpaper fails to deskew some files with obvious skew

unpaper may not be a viable deskewer and ImageMagick is awful. It seems that the presence of italic fonts may be part of the issue.

Tesseract does not calculate the skew angle (logically, since there is no global skew angle on a page).

Best option is to go back to Leptonica.

lossless jbig2 compression changed to ccitt

I'm using version 3.2.1, but PDFs with jbig2 compression are still changed to ccitt, leading to considerably larger file sizes. Am I doing something wrong, or is there a bug?

This is the output (see attachment for test.pdf):

$ pdfimages -list test.pdf 
page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  jbig2  no         5  0

$ ocrmypdf -v 1 test.pdf test-ocr.pdf 

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.main.repair_pdf' 
    [{'xres': Decimal('599.999'), 'height_inches': Decimal('5.31667'), 'width_pixels': 2062, 'width_inches': Decimal('3.43667'), 'pageno': 0, 'images': [{'color': 'gray', 'bpc': 1, 'enc': 'jbig2', 'dpi_w': Decimal('599.999'), 'width': 2062, 'height': 3190, 'comp': 1, 'dpi_h': Decimal('600.000'), 'dpi': Decimal('599.999')}], 'yres': Decimal('600.000'), 'height_pixels': 3190, 'has_text': False}]
Completed Task = 'ocrmypdf.main.repair_pdf' 
Task enters queue = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.generate_postscript_stub' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.skip_page' 
Uptodate Task = 'ocrmypdf.main.skip_page'


WARNING:
        In Task 'ocrmypdf.main.skip_page':
        No jobs were run because no file names matched.
        Please make sure that the regular expression is correctly specified. 

    Rendering 000001.ocr.page.pdf with pngmono
Completed Task = 'ocrmypdf.main.generate_postscript_stub' 
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.preprocess_deskew' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew' 
Task enters queue = 'ocrmypdf.main.preprocess_clean' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean' 
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr' 
Task enters queue = 'ocrmypdf.main.select_image_for_pdf' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image)
Completed Task = 'ocrmypdf.main.select_image_for_pdf' 
Task enters queue = 'ocrmypdf.main.select_image_layer' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.main.select_image_layer' 
    Tesseract Open Source OCR Engine v3.03 with Leptonica

Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr' 
Task enters queue = 'ocrmypdf.main.render_hocr_page' 
Completed Task = 'ocrmypdf.main.render_hocr_page' 
Task enters queue = 'ocrmypdf.main.add_text_layer' 
Completed Task = 'ocrmypdf.main.add_text_layer' 
Task enters queue = 'ocrmypdf.main.merge_pages' 
    ['/tmp/com.github.ocrmypdf.hjkqg9uk/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.hjkqg9uk/pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages' 
Task enters queue = 'ocrmypdf.main.copy_final' 
Completed Task = 'ocrmypdf.main.copy_final'
$ pdfimages -list test-ocr.pdf 
page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  ccitt  no        10  0 

test.zip
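
To compare encodings across many files without reading the tables by eye, the two `pdfimages -list` listings above can be parsed mechanically. This is a small sketch that assumes the column layout shown in this report (with `enc` as the ninth whitespace-separated field); `image_encodings` is a hypothetical helper, not part of OCRmyPDF or Poppler.

```python
def image_encodings(pdfimages_list_output: str):
    """Return (page, image number, encoding) for each row of `pdfimages -list` output."""
    encodings = []
    for line in pdfimages_list_output.splitlines():
        fields = line.split()
        # Data rows start with a page number; header and separator rows do not.
        if len(fields) >= 9 and fields[0].isdigit():
            encodings.append((int(fields[0]), int(fields[1]), fields[8]))
    return encodings


HEADER = (
    "page   num  type   width height color comp bpc  enc interp  object ID\n"
    "---------------------------------------------------------------------\n"
)
before = HEADER + "   1     0 image    2062  3190  gray    1   1  jbig2  no         5  0\n"
after = HEADER + "   1     0 image    2062  3190  gray    1   1  ccitt  no        10  0\n"

# Report every image whose encoding differs between input and output.
changed = [
    (old, new)
    for old, new in zip(image_encodings(before), image_encodings(after))
    if old[2] != new[2]
]
print(changed)
```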

ocrmypdf with incrontab / inotify: help ruffus find a writable location for its database

Issue by segro21
Thu Sep 25 13:09:49 2014
Originally opened as fritz-hh/OCRmyPDF#90


Hi,
this is not really an issue with ocrmypdf, but I'm trying to get it to work on a Samba share with incrontab / inotify.
I've created a folder and watch activity in it with incrontab. That works fine for things like pdftk, but nothing happens with ocrmypdf. Syslog shows the correct command, but then it just ends.

my incrontab -e
/home/pdfin IN_CLOSE_WRITE /opt/ocrmypdf/ocrmypdf.sh $@/$# $@/out/$#
/home/pdfin/out IN_CLOSE_WRITE /bin/rm $@/../$#
->this works fine for stamping pdfs with logo
/home/stamp IN_CLOSE_WRITE /usr/bin/pdftk $@/$# stamp $@/BB.pdf output $@/out/$#
/home/stamp/out IN_CLOSE_WRITE /bin/rm $@/../$#

Any ideas?

MRC

Issue by b21e
Fri Sep 19 16:14:39 2014
Originally opened as fritz-hh/OCRmyPDF#88


Hi, integration with jbig2enc for better compression of the text image layer, especially for scans, would make this software perfect.

output file much bigger (7x), because not original embedded image files copied

Issue by alphablue52
Tue Feb 18 20:11:19 2014
Originally opened as fritz-hh/OCRmyPDF#70


Hello,
first, thanks for the great work on this script. It got me working with OCR again after 10 years of frustrated absence :-)
Only one negative thing: most of my PDFs come from a Canon ImageRunner scanner and are very good in quality relative to their size. OCR gives great results, but the output PDFs are 7-8x bigger than the input. As far as I can see, the embedded images get recompressed to JPEG, while the originals use /CCITTFaxDecode.
If this is because of PDF/A compatibility, I suggest adding an option for non-PDF/A output.

You can download input.pdf and output.pdf here:
https://www.dropbox.com/l/KYlpYRiSs6IjWVOmF1fX39

Here is the output of the script with -g option.

~/bin/OCRmyPDF-2.0-stable$ sh OCRmyPDF.sh -g -l deu input.pdf output.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -l deu input.pdf output.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2013-09-10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP


GNU Parallel version:
GNU parallel 20130622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

OCRmyPDF.sh: 190: OCRmyPDF.sh: unpaper: not found

tesseract version:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.8


python2 version:

Python 2.7.5+

Ghostscript version:

9.10

Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)

OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Created temporary folder: "/tmp/tmp.X82OQourlI"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0014
Page 0001: Size 842x595 (h*w in pt)
Page 0001: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Processing page 0002 / 0014
Page 0002: Size 842x595 (h*w in pt)
Page 0002: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0002: Continuing anyway, assuming a default resolution of 300 dpi
Page 0002: Extracting image as ppm file (300 dpi)
Page 0002: Performing OCR
Page 0002: Embedding text in PDF
Page 0002: Embedding text in PDF (debug page)
Processing page 0003 / 0014
Page 0003: Size 842x595 (h*w in pt)
Page 0003: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0003: Continuing anyway, assuming a default resolution of 300 dpi
Page 0003: Extracting image as ppm file (300 dpi)
Page 0003: Performing OCR
Page 0003: Embedding text in PDF
Page 0003: Embedding text in PDF (debug page)
Processing page 0004 / 0014
Page 0004: Size 842x595 (h*w in pt)
Page 0004: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0004: Continuing anyway, assuming a default resolution of 300 dpi
Page 0004: Extracting image as ppm file (300 dpi)
Page 0004: Performing OCR
Page 0004: Embedding text in PDF
Page 0004: Embedding text in PDF (debug page)
Processing page 0005 / 0014
Page 0005: Size 842x595 (h*w in pt)
Page 0005: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0005: Continuing anyway, assuming a default resolution of 300 dpi
Page 0005: Extracting image as ppm file (300 dpi)
Page 0005: Performing OCR
Page 0005: Embedding text in PDF
Page 0005: Embedding text in PDF (debug page)
Processing page 0006 / 0014
Page 0006: Size 842x595 (h*w in pt)
Page 0006: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0006: Continuing anyway, assuming a default resolution of 300 dpi
Page 0006: Extracting image as ppm file (300 dpi)
Page 0006: Performing OCR
Page 0006: Embedding text in PDF
Page 0006: Embedding text in PDF (debug page)
Processing page 0007 / 0014
Page 0007: Size 842x595 (h*w in pt)
Page 0007: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0007: Continuing anyway, assuming a default resolution of 300 dpi
Page 0007: Extracting image as ppm file (300 dpi)
Page 0007: Performing OCR
Page 0007: Embedding text in PDF
Page 0007: Embedding text in PDF (debug page)
Processing page 0008 / 0014
Page 0008: Size 842x595 (h*w in pt)
Page 0008: Expecting exactly 1 image covering the whole page (found 8). Cannot compute dpi value.
Page 0008: Continuing anyway, assuming a default resolution of 300 dpi
Page 0008: Extracting image as ppm file (300 dpi)
Page 0008: Performing OCR
Page 0008: Embedding text in PDF
Page 0008: Embedding text in PDF (debug page)
Processing page 0009 / 0014
Page 0009: Size 842x595 (h*w in pt)
Page 0009: Expecting exactly 1 image covering the whole page (found 5). Cannot compute dpi value.
Page 0009: Continuing anyway, assuming a default resolution of 300 dpi
Page 0009: Extracting image as ppm file (300 dpi)
Page 0009: Performing OCR
Page 0009: Embedding text in PDF
Page 0009: Embedding text in PDF (debug page)
Processing page 0010 / 0014
Page 0010: Size 842x595 (h*w in pt)
Page 0010: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0010: Continuing anyway, assuming a default resolution of 300 dpi
Page 0010: Extracting image as ppm file (300 dpi)
Page 0010: Performing OCR
Page 0010: Embedding text in PDF
Page 0010: Embedding text in PDF (debug page)
Processing page 0011 / 0014
Page 0011: Size 842x595 (h*w in pt)
Page 0011: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0011: Continuing anyway, assuming a default resolution of 300 dpi
Page 0011: Extracting image as ppm file (300 dpi)
Page 0011: Performing OCR
Page 0011: Embedding text in PDF
Page 0011: Embedding text in PDF (debug page)
Processing page 0012 / 0014
Page 0012: Size 842x595 (h*w in pt)
Page 0012: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0012: Continuing anyway, assuming a default resolution of 300 dpi
Page 0012: Extracting image as ppm file (300 dpi)
Page 0012: Performing OCR
Page 0012: Embedding text in PDF
Page 0012: Embedding text in PDF (debug page)
Processing page 0013 / 0014
Page 0013: Size 842x595 (h*w in pt)
Page 0013: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0013: Continuing anyway, assuming a default resolution of 300 dpi
Page 0013: Extracting image as ppm file (300 dpi)
Page 0013: Performing OCR
Page 0013: Embedding text in PDF
Page 0013: Embedding text in PDF (debug page)
Processing page 0014 / 0014
Page 0014: Size 842x595 (h*w in pt)
Page 0014: Size 1240x1753 (in pixel)
Page 0014: Low image resolution detected (150 dpi). If needed, please use the "-o" to try to get better OCR results.
Page 0014: Extracting image as pgm file (150 dpi)
Page 0014: Performing OCR
Page 0014: Embedding text in PDF
Page 0014: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.X82OQourlI/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 20 seconds
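
The 150 dpi figure reported for page 14 can be checked by hand from the two sizes in the log, since one PDF point is 1/72 inch. The sketch below does that arithmetic; it assumes the pixel size is printed as width x height while the point size is printed as height x width, which is how the numbers line up but is not stated in the log itself.

```python
POINTS_PER_INCH = 72  # one PDF point is 1/72 inch


def dpi(pixels: int, points: float) -> float:
    """Resolution of an image that spans `points` PDF points with `pixels` pixels."""
    return pixels * POINTS_PER_INCH / points


# Page 0014 from the log: 842x595 pt (height x width), 1240x1753 pixels (assumed width x height).
height_dpi = dpi(1753, 842)
width_dpi = dpi(1240, 595)
print(round(height_dpi), round(width_dpi))  # prints "150 150"
```

For the other pages the script could not do this calculation because it found several images per page, so it fell back to its default of 300 dpi.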

OCRmyPDF issue - no new file, can't find iccprofiles?

Hi there, and thank you for any assistance,

OCRmyPDF fails to create a new file.

here's the install process:

pip3 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.0.tar.gz
Running setup.py (path:/tmp/pip-build-wqh0224e/ocrmypdf/setup.py) egg_info for package ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2

warning: no previously-included files matching '*' found under directory 'tests/output'

Requirement already satisfied (use --upgrade to upgrade): ruffus>=2.6.3 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): Pillow>=2.4.0 in /usr/lib/python3/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): reportlab>=3.1.44 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): PyPDF2>=1.25.1 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): pip>=1.4.1 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): setuptools>=2.2 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Installing collected packages: ocrmypdf
Running setup.py install for ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2

warning: no previously-included files matching '*' found under directory 'tests/output'
Installing ocrmypdf script to /usr/local/bin

Successfully installed ocrmypdf
Cleaning up...

Verbose mode for conversion shows this:

$ ocrmypdf A.pdf B.pdf --verbose


Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'height_inches': Decimal('24.3611'), 'pageno': 0, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 1, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 2, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 3, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 4, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 5, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 6, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 7, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 8, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 9, 'images': [], 'has_text': False, 'width_inches': 
Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 10, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 11, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 12, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 13, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 14, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 15, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 16, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
17, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 18, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 19, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 20, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 21, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 22, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 23, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 24, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 25, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': 
Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 26, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 27, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 28, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 29, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 30, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 31, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 32, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 33, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 34, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 35, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 36, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 
'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 37, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 38, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 39, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 40, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 41, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 42, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 43, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
44, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 45, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 46, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 47, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 48, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 49, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 50, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 51, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 52, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 53, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': 
Decimal('24.3611'), 'pageno': 54, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 55, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}]

Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000048.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000048.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000003.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000003.ocr.page.pdf)
Page 33 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000033.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000033.skip.page.pdf)
Page 50 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000050.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000050.skip.page.pdf)
Page 2 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000002.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000002.skip.page.pdf)
Page 52 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000052.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000052.skip.page.pdf)
Page 8 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000008.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000008.skip.page.pdf)
Page 12 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000012.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000012.skip.page.pdf)
Page 41 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000041.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000041.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000039.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000039.ocr.page.pdf)
Page 1 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000001.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000001.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000026.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000026.ocr.page.pdf)
Page 5 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000005.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000005.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000016.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000016.ocr.page.pdf)
Page 11 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000011.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000011.skip.page.pdf)
Page 21 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000021.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000021.skip.page.pdf)
Page 28 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000028.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000028.skip.page.pdf)
Page 38 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000038.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000038.skip.page.pdf)
Page 47 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000047.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000047.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000017.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000017.ocr.page.pdf)
Page 49 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000049.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000049.skip.page.pdf)
Page 29 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000029.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000029.skip.page.pdf)
Page 31 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000031.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000031.skip.page.pdf)
Page 9 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000009.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000009.skip.page.pdf)
Page 43 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000043.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000043.skip.page.pdf)
Page 20 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000020.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000020.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000013.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000013.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000014.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000014.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000037.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000037.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000056.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000056.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000025.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000025.ocr.page.pdf)
Page 45 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000045.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000045.skip.page.pdf)
Page 55 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000055.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000055.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000032.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000032.ocr.page.pdf)
Page 51 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000051.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000051.skip.page.pdf)
Page 27 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000027.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000027.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000040.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000040.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000019.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000019.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000053.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000053.ocr.page.pdf)
Page 36 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000036.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000036.skip.page.pdf)
Page 46 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000046.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000046.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000024.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000024.ocr.page.pdf)
Page 10 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000010.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000010.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000007.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000007.ocr.page.pdf)
Page 23 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000023.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000023.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000044.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000044.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000035.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000035.ocr.page.pdf)
Page 6 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000006.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000006.skip.page.pdf)
Page 18 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000018.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000018.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000054.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000054.ocr.page.pdf)
Page 15 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000015.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000015.skip.page.pdf)
Page 22 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000022.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000022.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000004.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000004.ocr.page.pdf)
Page 42 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000042.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000042.skip.page.pdf)
Page 30 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000030.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000030.skip.page.pdf)
Page 34 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000034.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000034.skip.page.pdf)

Original exception:

Exception #1
  'builtins.FileNotFoundError(Could not find Ghostscript's iccprofiles)' raised in ...
   Task = def ocrmypdf.main.generate_postscript_stub(...):
   Job  = [.../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.repaired.pdf -> .../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.pdfa_def.ps, <ocrmypdf.main.WrappedLogger>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 761, in generate_postscript_stub
    generate_pdfa_def(output_file, pdfmark)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 123, in generate_pdfa_def
    icc_profile = os.path.join(_get_postscript_icc_path(), 'srgb.icc')
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 118, in _get_postscript_icc_path
    raise FileNotFoundError("Could not find Ghostscript's iccprofiles")
FileNotFoundError: Could not find Ghostscript's iccprofiles

I tried removing ocrmypdf and re-installing it and had the same behaviour. Any ideas on what I need to do to fix this?

Thanks in advance.

Adam

Ubuntu 15.10: python3 exception - convert() got an unexpected keyword argument 'dpi'

Install on Ubuntu 15.10

Software versions

$ ocrmypdf --version
3.2
$ python3 --version
Python 3.4.3+
$ unpaper -version
6.1

Exception (on every attempt)

$ ocrmypdf --verbose 1 --force-ocr scansmpl.pdf test.pdf

Original exception:

    Exception #1
      'builtins.TypeError(convert() got an unexpected keyword argument 'dpi')' raised in ...
       Task = def ocrmypdf.main.select_image_layer(...):
       Job  = [[.../com.github.ocrmypdf.aziws_b9/000001.image, .../com.github.ocrmypdf.aziws_b9/000001.ocr.page.pdf] -> .../com.github.ocrmypdf.aziws_b9/000001.image-layer.pdf, <ocrmypdf.main.WrappedLogger>, [{'width_inches': Decimal('8.48611'), 'width_pixels': 1696, 'pageno': 0, 'images': [{'dpi_h': Decimal('2E+2'), 'color': 'gray', 'width': 1696, 'comp': 1, 'dpi': Decimal('199.928'), 'height': 2175, 'bpc': 1, 'dpi_w': Decimal('199.856'), 'enc': 'ccitt'}], 'has_text': False, 'xres': Decimal('199.856'), 'height_inches': Decimal('10.875'), 'height_pixels': 2175, 'yres': Decimal('2E+2')}], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 597, in select_image_layer
        img2pdf.convert([image], dpi=dpi, outputstream=pdf)
    TypeError: convert() got an unexpected keyword argument 'dpi'

Some input PDFs with Tesseract-OCR throw error in PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...

Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120


A few PDF input files (which were already processed by tesseract-ocr pdf mode) throw an error in OCRmyPDF even in the --force-ocr mode. At the moment, I have no idea what exactly happens, but the problem appears to be in PyPDF2 (I use PyPDF2 1.25.1).

The error message is only shown when OCRmyPDF is used with the -v option.

Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?

Full output:

Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf' 
Original exception:
    Exception #1
      'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
       Task = def ocrmypdf.main.repair_pdf(...):
       Job  = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
        pdfinfo.extend(pdf_get_all_pageinfo(output_file))
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
        text = page.extractText()
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
        content = ContentStream(content, self.pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
        self.__parseContentStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
        operands.append(readObject(stream, None))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
        return ArrayObject.readFromStream(stream, pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
        arr.append(readObject(stream, pdf))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
        return readStringFromStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
        raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
    PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'

close_fds is not supported on Windows platforms...

Hi,

I'm getting this error

[Anaconda3] C:\Users\Carlos\Anaconda3>ocrmypdf --help
Traceback (most recent call last):
  File "C:\Users\Carlos\Anaconda3\Scripts\ocrmypdf-script.py", line 9, in <module>
    load_entry_point('ocrmypdf==3.1.1', 'console_scripts', 'ocrmypdf')()
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 549, in load_entry_point
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2709, in load_entry_point
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2369, in load
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2375, in resolve
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\main.py", line 51, in <module>
    if tesseract.version() < MINIMUM_TESS_VERSION:
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\tesseract.py", line 51, in version
    stderr=STDOUT)
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 696, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 873, in __init__
    "close_fds is not supported on Windows platforms"
ValueError: close_fds is not supported on Windows platforms if you redirect stdin/stdout/stderr

Thank you for your help.
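
For reference, the ValueError is raised by Python's own subprocess module: on Windows, Python versions before 3.7 reject close_fds=True when stdin/stdout/stderr are redirected. Below is a minimal sketch of a portable wrapper, assuming the failing call passes close_fds=True together with stderr=STDOUT (the traceback does not show the whole call). The helper name check_output_portable is hypothetical and not part of ocrmypdf.

```python
import os
import subprocess
import sys


def check_output_portable(args):
    """Run a command and return its combined stdout/stderr as bytes.

    Assumption: close_fds is only requested on POSIX, because older
    Python versions on Windows raise ValueError when close_fds=True is
    combined with redirected standard handles.
    """
    kwargs = {"stderr": subprocess.STDOUT}
    if os.name != "nt":
        kwargs["close_fds"] = True
    return subprocess.check_output(args, **kwargs)


# Usage: run the current interpreter so the example needs no external tools.
output = check_output_portable([sys.executable, "-c", "print('ok')"])
print(output.decode().strip())
```

The same wrapper could be pointed at the tesseract executable once it is on the PATH; that part is not shown here because it depends on the local installation.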

How to add languages for tesseract-ocr in the image?

Sorry, I am new to Docker. I just pulled the latest image and want to use the language chi_sim in Tesseract, but this language does not seem to be installed by default, as it complains:

~/work/tmp$ docker run -v "$(pwd):/home/docker" ocrmypdf 31.pdf 31-ocr.pdf -l chi_sim
The installed version of tesseract does not have language data for the following requested languages:
chi_sim

It seems the Tesseract used by the Docker image is separate from the system's tesseract-ocr package, for which I installed the language pack with "apt-get install tesseract-ocr-chi-sim".

How can I update the Docker image to include the desired language support? And how can I check which languages are supported (like "tesseract --list-langs" on the system)?

Thanks a lot.
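
As a rough illustration of the language check, the sketch below compares a requested language string such as eng+chi_sim against the text printed by "tesseract --list-langs". It assumes the first non-empty line of that output is a header and every following line is one language code. The sample text is made up for the example, and the function names are hypothetical.

```python
def parse_list_langs(output):
    """Return the set of language codes from `tesseract --list-langs` text.

    Assumption: the first non-empty line is a header and each later
    line holds exactly one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def missing_languages(requested, available):
    """Return the requested codes (joined with '+') that are not available."""
    return [lang for lang in requested.split("+") if lang not in available]


# Illustrative sample only, not captured from a real installation.
sample = "List of available languages (2):\neng\nosd\n"
print(missing_languages("eng+chi_sim", parse_list_langs(sample)))
```

Running the real command inside the container and feeding its output to parse_list_langs would show which language packs the image actually contains.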

Improve error message output

Ruffus's console logging seems to be far too quiet, suppressing error messages in some cases.

Find out how to create our own error logging and tell ruffus about it.
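
One possible shape for this, as a sketch only: a standard-library logger whose handler always emits ERROR and above, which could then be handed to the pipeline runner if it accepts a logger object (an assumption that would need checking against the ruffus documentation). The logger name and the helper make_error_logger are hypothetical.

```python
import io
import logging


def make_error_logger(name="ocrmypdf.sketch", stream=None):
    """Build a logger that writes ERROR and above to `stream`.

    With stream=None the handler falls back to sys.stderr.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep output independent of the root logger
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.ERROR)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.handlers = [handler]
    return logger


# Usage: capture to a buffer so the filtering is easy to see.
buffer = io.StringIO()
log = make_error_logger(stream=buffer)
log.info("Task enters queue")  # filtered out by the handler level
log.error("Could not find Ghostscript's iccprofiles")
print(buffer.getvalue(), end="")
```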

file size

ocrmypdf increases file size by about a factor of 4 (even more when using oversampling).
I assume this is because a new graphic layer is created instead of reusing the original graphic. Correct?
Is it possible to force ocrmypdf to use the original graphics? (I do not understand issue #8, but the comment from kebekus sounds promising to me.)
If the graphics have to be regenerated because some information is missing, would it be possible to supply ocrmypdf with that information? For example, I know the scanning resolution, orientation, and page size, and I could pass these to ocrmypdf when calling it.

Thanks, Femi

OCRmyPDF silently fails on input filenames like uppercase *.PDF

My new duplex scanner is a BROTHER ADS-2600we. It generates PDFs (which are not compatible and also make convert fail). It can, however, generate PDF/A. The standard filenames have the form

[0-9]{8}.PDF

Example: 06091500.PDF, 06091501.PDF etc. for files scanned on 06. September 2015. These filenames (I don't like the format) cannot be changed in the scanner.

Problem:

When you start

ocrmypdf --verbose -L log.txt -l deu 06091500.PDF 06091500.ocr.pdf

this fails silently ("...No jobs were run because no file names matched.").

Workaround:
Rename 06091500.PDF to x.pdf and then process it.
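
The "no file names matched" message suggests the input name is compared against a lowercase .pdf pattern. That is a guess from the message, not something confirmed in the OCRmyPDF source. A minimal sketch of a case-insensitive extension check follows, with hypothetical helper names:

```python
from pathlib import Path


def is_pdf(filename):
    """Return True if the name ends in .pdf, ignoring letter case."""
    return Path(filename).suffix.lower() == ".pdf"


def select_pdfs(filenames):
    """Keep only the names that look like PDF files."""
    return [name for name in filenames if is_pdf(name)]


# Usage with the scanner's naming scheme from this report.
print(select_pdfs(["06091500.PDF", "x.pdf", "notes.txt"]))
```

With a check like this, both 06091500.PDF and x.pdf would be accepted without renaming.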
