ocrmypdf / ocrmypdf

Home Page: http://ocrmypdf.readthedocs.io/

License: Mozilla Public License 2.0

ocrmypdf's Introduction

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
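
For scripting, the invocation above can be assembled as an argument list and handed to a subprocess. The following is a minimal sketch, not part of OCRmyPDF: the helper names `build_ocrmypdf_command` and `run_ocrmypdf` are hypothetical, and only the flags shown above are used.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, languages=("eng", "fra"),
                           jobs=4, title=None):
    """Assemble the argument list for the options shown above (hypothetical helper)."""
    cmd = [
        "ocrmypdf",
        "-l", "+".join(languages),  # multiple languages are joined with '+'
        "--rotate-pages",
        "--deskew",
        "--jobs", str(jobs),
        "--output-type", "pdfa",
    ]
    if title is not None:
        cmd += ["--title", title]
    return cmd + [input_pdf, output_pdf]


def run_ocrmypdf(cmd):
    """Run the command if the executable is on PATH and return its exit code."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    return subprocess.run(cmd, check=False).returncode
```

Passing a list rather than a shell string avoids quoting problems with titles or file names that contain spaces.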

See the release notes for details on the latest changes.

Main features

Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy / paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a "lossless" operation without disrupting any other content
Optimizes PDF images, often producing files smaller than the input file
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Distributes work across all available CPU cores
Uses Tesseract OCR engine to recognize more than 100 languages
Keeps your private data private.
Scales properly to handle files with thousands of pages.
Battle-tested on millions of PDFs.

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:

Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
Or they did not handle accents and multilingual characters
Or they changed the resolution of the embedded images
Or they generated ridiculously large PDF files
Or they crashed when trying to OCR
Or they did not produce valid PDF files
On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

Operating system	Install command
Debian, Ubuntu	`apt install ocrmypdf`
Windows Subsystem for Linux	`apt install ocrmypdf`
Fedora	`dnf install ocrmypdf`
macOS (Homebrew)	`brew install ocrmypdf`
macOS (MacPorts)	`port install ocrmypdf`
macOS (nix)	`nix-env -i ocrmypdf`
LinuxBrew	`brew install ocrmypdf`
FreeBSD	`pkg install py-ocrmypdf`
Conda	`conda install ocrmypdf`
Ubuntu Snap	`snap install ocrmypdf`

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.
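
Before passing `-l`, it can help to confirm that the requested language packs are actually installed. The sketch below parses text in the shape printed by `tesseract --list-langs` (a header line followed by one language code per line); the function names are hypothetical and the exact header wording is an assumption that may differ between Tesseract versions.

```python
def parse_list_langs(output):
    """Return language codes from `tesseract --list-langs` style output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line; its exact wording is assumed here.
    return [line for line in lines
            if not line.lower().startswith("list of available languages")]


def missing_languages(requested, available):
    """Return the parts of an 'eng+fra' style request that are not installed."""
    return [lang for lang in requested.split("+") if lang not in available]
```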

OCRmyPDF supports Tesseract 4.1.1+. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.
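
The version requirement can be checked in a script by parsing the first line of `tesseract --version`. This is an illustrative sketch only, not OCRmyPDF's own detection logic; the helper names are hypothetical and the parser assumes the version appears as dotted numbers in that line.

```python
import re

MINIMUM_TESSERACT = (4, 1, 1)  # minimum version stated above


def parse_tesseract_version(first_line):
    """Extract (major, minor, patch) from a line such as 'tesseract 4.1.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", first_line)
    if match is None:
        raise ValueError(f"no version number found in {first_line!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(first_line, minimum=MINIMUM_TESSERACT):
    """True if the reported version is at least the minimum."""
    return parse_tesseract_version(first_line) >= minimum
```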

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for a quick response.

Requirements

In addition to the required Python version (3.8+), OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.
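
A quick preflight check for these requirements might look like the sketch below. The function name is hypothetical, and the executable names `gs` and `tesseract` are assumptions that hold on typical Unix-like systems; Ghostscript is usually named differently on Windows.

```python
import shutil
import sys


def missing_requirements(which=shutil.which, python_version=None,
                         programs=("gs", "tesseract")):
    """List requirements that appear to be unmet (hypothetical helper)."""
    if python_version is None:
        python_version = sys.version_info[:2]
    problems = []
    if tuple(python_version) < (3, 8):
        problems.append("python>=3.8")
    # An external program counts as missing when it is not found on PATH.
    problems += [name for name in programs if which(name) is None]
    return problems
```

The `which` parameter exists only so the lookup can be replaced in tests.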

Press & Media

Going paperless with OCRmyPDF
Converting a scanned document into a compressed searchable PDF with redactions
c't 1-2014, page 59: Detailed presentation of OCRmyPDF v1.0 in the leading German IT magazine c't
heise Open Source, 09/2014: Texterkennung mit OCRmyPDF
heise Durchsuchbare PDF-Dokumente mit OCRmyPDF erstellen
Excellent Utilities: OCRmyPDF
LinuxUser Texterkennung mit OCRmyPDF und Scanbd automatisieren
Y Combinator discussion

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source code, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as indicated by standard SPDX license identifiers or the DEP5 copyright and licensing information file. Generally speaking, non-core code is licensed under MIT, and the documentation and test files are licensed under Creative Commons ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

ocrmypdf's Issues

Introduce environment variables to redirect subprocess calls

For example, OCRMYPDF_GHOSTSCRIPT could point at an alternate Ghostscript binary.

This is mainly for test cases, to allow replacing the real binary with one that always fails, or to stub out/cache Tesseract when the OCR output doesn't matter.
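
One way the proposal could work is sketched below. This is an assumption about a possible design, not existing OCRmyPDF behavior: the `OCRMYPDF_` prefix follows the example in this issue, and `resolve_program` is a hypothetical name.

```python
import os


def resolve_program(name, default, environ=None):
    """Return the override from OCRMYPDF_<NAME> if set, else the default binary."""
    environ = os.environ if environ is None else environ
    return environ.get("OCRMYPDF_" + name.upper(), default)
```

A test could then set `OCRMYPDF_GHOSTSCRIPT` to a stub script that always exits with an error and check that the failure is reported cleanly.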

original images not kept unaltered

Issue by femifrak
Wed May 28 16:01:43 2014
Originally opened as fritz-hh/OCRmyPDF#78

When using the 2.x version available as zip file at the right side of
https://github.com/fritz-hh/OCRmyPDF
with xubuntu 14.04 the original pdf is altered although i did not use -i
The first page of
http://www.loaditup.de/files/817245_gcstsh3wuy.pdf
shows the original black and white pdf, the second page the altered pdf which unfortunately looks frazzled. (I merged both pages for convenience.)
Is there a way to avoid this quality loss?

I tested the suggestion of #61 but without success, which is clear as no "-i" was used.
I also tested a pdf with integer number of pixels but without success.
Maybe it has to do with the problem described here? http://lists.freedesktop.org/archives/poppler-bugs/2013-August/010469.html

Thanks for the help.

Here the output with -g:

># /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g sw_original.pdf sw_original_OCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g sw_original.pdf sw_original_OCR.pdf
Checking if all dependencies are installed
--------------------------------
ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP    

--------------------------------
GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, 
;login: The USENIX Magazine, February 2011:42-47.
--------------------------------
Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
--------------------------------
unpaper version:
0.4.2
--------------------------------
tesseract version:
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

--------------------------------
python2 version:
Python 2.7.6
--------------------------------
Ghostscript version:
9.10
--------------------------------
Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
--------------------------------
Created temporary folder: "/tmp/tmp.ZIHGjUFKJS"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x594 (h*w in pt)
Page 0001: Size 3508x2477 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.ZIHGjUFKJS/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 25 seconds

Output files of OCRmyPDF cannot be processed a second time

If you run OCRmyPDF on a file, the output file cannot be processed again as an input file: OCRmyPDF fails.

In my view, this is a consequence of an unknown problem inside Tesseract, already filed as #19.

pagesize is altered by ocrmypdf

Hi,

thank you very much for this really decent software!

When I run ocrmypdf 3.1 on a PDF that contains pages in A5 size (148 x 210 mm), the output page size is not the same (125 x 210 mm).

Regards,

Femi
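
To quantify a report like this, page sizes in millimetres can be converted to PDF points (72 per inch, 25.4 mm per inch) and compared with a tolerance. The helper names and the 1 pt tolerance below are illustrative assumptions.

```python
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def mm_to_pt(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * PT_PER_INCH


def same_page_size(expected_mm, actual_mm, tolerance_pt=1.0):
    """True if both (width, height) pairs agree within the tolerance."""
    return all(abs(mm_to_pt(e) - mm_to_pt(a)) <= tolerance_pt
               for e, a in zip(expected_mm, actual_mm))
```

With the sizes from this report, `same_page_size((148, 210), (125, 210))` is false, because the widths differ by roughly 65 pt.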

Change to PDF/A-2b output instead of -1b

-2b seems nicer because of support for transparency and higher PDF format version.

Provided Ghostscript can produce correct -2b's.

Output pdf is distorted

Hey!

I have a problem. I played around a lot with the options of ocrmypdf, but my output file is still heavily distorted.

The input and the output file as well as the file generated in /tmp can be found here:
http://www.file-upload.net/download-11132774/sample_in.pdf.html
http://www.file-upload.net/download-11132775/sample_out.pdf.html
http://www.file-upload.net/download-11132776/com.github.ocrmypdf.wfze06uzsample_in.repaired.pdf.html

Thanks for your help!
Sammy

I append the -v 1 STDOUT:
$ ocrmypdf -v 1 sample_in.pdf sample_out.pdf

Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'images': [{'width': 4299, 'height': 3035, 'bpc': 8, 'dpi': Decimal('482.983'), 'comp': 1, 'enc': 'jpeg', 'color': 'gray', 'dpi_h': Decimal('333.110'), 'dpi_w': Decimal('700.289')}], 'width_pixels': 4299, 'xres': Decimal('700.289'), 'has_text': False, 'pageno': 0, 'height_inches': Decimal('9.11111'), 'yres': Decimal('333.110'), 'height_pixels': 3035, 'width_inches': Decimal('6.13889')}]

WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.

Rendering 000001.ocr.page.pdf with pnggray

Completed Task = 'ocrmypdf.main.generate_postscript_stub'
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.page.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'
os.symlink(/tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.e6lkdkqy/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean'
Task enters queue = 'ocrmypdf.main.select_image_for_pdf'
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr'
Completed Task = 'ocrmypdf.main.select_image_for_pdf'
Tesseract Open Source OCR Engine v3.03 with Leptonica

Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.main.render_hocr_page'
Completed Task = 'ocrmypdf.main.render_hocr_page'
Task enters queue = 'ocrmypdf.main.merge_pages'
['/tmp/com.github.ocrmypdf.e6lkdkqy/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.e6lkdkqy/com.github.ocrmypdf.e6lkdkqysample_in.pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages'
Task enters queue = 'ocrmypdf.main.copy_final'
Completed Task = 'ocrmypdf.main.copy_final'
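
The page information printed near the top of this log can be checked by hand. The sketch below recomputes the horizontal and vertical resolution from the pixel and inch values in that dictionary. The reported `dpi` value appears to be consistent with the geometric mean of the two, but that interpretation is an assumption, not something the log states.

```python
from decimal import Decimal
from math import sqrt


def resolution(pixels, inches):
    """Resolution in dots per inch along one axis."""
    return float(pixels) / float(inches)


# Values copied from the page info dictionary in the log above.
xres = resolution(4299, Decimal("6.13889"))   # width_pixels / width_inches
yres = resolution(3035, Decimal("9.11111"))   # height_pixels / height_inches
geometric_mean = sqrt(xres * yres)            # compare with the reported 'dpi'
aspect_ratio = xres / yres                    # 1.0 would mean square pixels
```

The two axes differ by more than a factor of two, and the image is wider than tall in pixels while the page is taller than wide in inches. A mismatch like this, for example a rotated image whose rotation is not taken into account, is one plausible contributor to the distortion, though the log alone does not confirm it.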

Make ruffus pipeline re-entrant

In its current form the pipeline is not re-entrant -- it is assembled based on command line arguments prior to main() and cannot be changed after that. As such, there is no value to "import ocrmypdf".

Also, all test cases need to run in a subprocess which is not ideal for inspecting test failures.

A re-entrant pipeline would make it possible to customize the pipeline if ocrmypdf were used as a library.
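
As a rough illustration of what re-entrant means here, the sketch below builds the list of steps inside a function on every call instead of at import time. It does not use ruffus, and the step functions are placeholders that only record their names; everything in it is hypothetical.

```python
def repair(doc):
    return doc + ["repair"]


def deskew(doc):
    return doc + ["deskew"]


def ocr(doc):
    return doc + ["ocr"]


def build_pipeline(options):
    """Assemble steps from options at call time, so callers can build many pipelines."""
    steps = [repair]
    if options.get("deskew"):
        steps.append(deskew)
    steps.append(ocr)
    return steps


def run_pipeline(steps, document):
    """Apply each step in order and return the final result."""
    for step in steps:
        document = step(document)
    return document
```

Because nothing is fixed at import time, a test can call `build_pipeline` in-process with different options and inspect the result directly.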

problem with unpaper

Issue by femifrak
Wed May 7 05:34:43 2014
Originally opened as fritz-hh/OCRmyPDF#75

When using OCRmyPDF-2.x with -dci, black borders remain in the generated PDF. Shouldn't unpaper remove them? The input is a black and white PDF.

Here the output:

root@xu:/home/tho/test# /opt/OCRmyPDF/OCRmyPDF-2.x/OCRmyPDF.sh -g -d -c -i test.pdf testOCR.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -d -c -i test.pdf testOCR.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

GNU Parallel version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

0.4.2

tesseract version:
tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

python2 version:

Python 2.7.6

Ghostscript version:

9.10

Java version:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)

OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.cL2lCvVStC"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 578x342 (h*w in pt)
Page 0001: Size 3424x2208 (in pixel)
Page 0001: Extracting image as pbm file (445 dpi)
Page 0001: Deskewing image
Page 0001: Cleaning image with unpaper
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.cL2lCvVStC/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 31 seconds

Document Python API?

Issue by mlissner
Fri Aug 21 00:10:05 2015
Originally opened as fritz-hh/OCRmyPDF#114

I could be wrong, but I haven't been able to find documentation for the command itself, neither for the command line API nor for the Python API that looks like it might be coming in 3.0.

Am I blind? If not, this would be great to get. If so, my apologies!

Looks like a great project.

Error when trying to OCR JPEG or PNG

When I try to run:

sudo ocrmypdf --verbose 3 eiffel.jpg eiffel.pdf

I get:

Original exception:
Exception #1
  'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
   Task = def ocrmypdf.main.split_pages(...):
   Job  = [[] -> .../com.github.ocrmypdf.45n_qza7/*.page.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 415, in split_pages
    npages = qpdf.get_npages(input_file)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/qpdf.py", line 68, in get_npages
    universal_newlines=True, close_fds=True)
  File "/usr/lib/python3.4/subprocess.py", line 607, in check_output
    with Popen(*popenargs, stdout=PIPE, **kwargs) as process:
  File "/usr/lib/python3.4/subprocess.py", line 859, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.4/subprocess.py", line 1395, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert 'list' object to str implicitly

If I try the same thing on a PDF file it works fine. This is for version 3.1.1, thanks!

I can repeat the bug on both Mac OS X El Capitan and Debian 8, I can also repeat the error in version 3.1 and 3.0.

The file in question is here (yes I know there isn't any text I was just using it for testing):
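
The traceback ends inside `subprocess`, which suggests that one element of the argument list was itself a list rather than a string. The root cause in OCRmyPDF is not shown here, so the sketch below is only a generic guard with a hypothetical name: it validates an argument list before it reaches `subprocess` and reports which element is wrong.

```python
import os


def check_argv(argv):
    """Raise a descriptive TypeError if any argument is not a string or path."""
    for index, arg in enumerate(argv):
        if not isinstance(arg, (str, bytes, os.PathLike)):
            raise TypeError(
                f"argument {index} is {type(arg).__name__}, "
                f"expected a string or path: {arg!r}"
            )
    return list(argv)
```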

Linux Install Error

When installing on Debian Wheezy I am getting:

$ sudo pip-3.2 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.1.tar.gz
Running setup.py egg_info for package ocrmypdf
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
from collections.abc import Mapping
ImportError: No module named abc
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/shaun/build/ocrmypdf/setup.py", line 7, in <module>
from collections.abc import Mapping
ImportError: No module named abc

Command python setup.py egg_info failed with error code 1 in /home/shaun/build/ocrmypdf
Storing complete log in /root/.pip/pip.log

Any idea how to fix this? Thanks!
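
For context, `collections.abc` was added in Python 3.3, so the import fails under the Python 3.2 that `pip-3.2` uses. A common compatibility pattern is sketched below. It only illustrates the import fallback and does not imply that OCRmyPDF supports Python 3.2.

```python
try:
    # Python 3.3 and later
    from collections.abc import Mapping
except ImportError:
    # Older interpreters kept the ABCs directly in `collections`
    from collections import Mapping


def is_mapping(obj):
    """True for dict-like objects."""
    return isinstance(obj, Mapping)
```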

Remove eval() call in main.py

eval is asking for trouble.
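
What the `eval()` call in main.py evaluates is not stated in this issue. Assuming it parses a Python literal such as a list or number, `ast.literal_eval` from the standard library is a safer substitute, because it rejects anything that is not a plain literal. The wrapper name below is hypothetical.

```python
import ast


def parse_literal(text):
    """Parse a Python literal without executing arbitrary code."""
    return ast.literal_eval(text)
```

Unlike `eval`, a string such as `"__import__('os').getcwd()"` raises `ValueError` instead of being executed.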

Auto correct image rotation (-180, -90, 0, +90)

Issue by fritz-hh
Wed Jan 8 22:05:16 2014
Originally opened as fritz-hh/OCRmyPDF#46

weird text order

Issue by femifrak
Fri Jan 31 18:50:11 2014
Originally opened as fritz-hh/OCRmyPDF#64

OCRmyPDF is brilliant but sometimes i have a problem with the order of text that is underlaid. When i select the text starting from top left and go to the right end of the line and then successively down line by line, there are sometimes gaps of text which is not selected. After a few more lines these gaps are suddenly selected. Copying the selected text and pasting it into another application reveals the order, which is unfortunately wrong. I use latest stable version and have no error or warning messages.

http://www.loaditup.de/files/803343_acxm67dsue.pdf
(problem occurs in second paragraph.)

installation problems

I just wanted to install 4.0.1 but had unfortunately no success.
Have you got a clue how to align the ducks?


>$ sudo pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
  Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
  Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git

    Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
    zip_safe flag not set; analyzing archive contents...

    Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
        zip_safe=False)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
        self.fetch_build_eggs(attrs['setup_requires'])
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
        dist = working_set.find(req)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
        raise VersionConflict(dist, req)
    pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
    Complete output from command python setup.py egg_info:


Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...

Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
    zip_safe=False)
  File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
    self.fetch_build_eggs(attrs['setup_requires'])
  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
    replace_conflicting=True,
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
    dist = best[req.key] = env.best_match(req, ws, installer)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
    dist = working_set.find(req)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
    raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Storing debug log for failure in /home/xxx/.pip/pip.log

The mentioned pip.log file says:

------------------------------------------------------------
/usr/bin/pip3 run on Thu Feb 18 12:14:54 2016
Downloading/unpacking git+https://github.com/jbarlow83/OCRmyPDF.git
  Cloning https://github.com/jbarlow83/OCRmyPDF.git to /tmp/pip-jyrz2gnr-build
  Found command 'git' at '/usr/bin/git'
  Running command /usr/bin/git clone -q https://github.com/jbarlow83/OCRmyPDF.git /tmp/pip-jyrz2gnr-build
  Running setup.py (path:/tmp/pip-jyrz2gnr-build/setup.py) egg_info for package from git+https://github.com/jbarlow83/OCRmyPDF.git

    Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
    zip_safe flag not set; analyzing archive contents...

    Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
        zip_safe=False)
      File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
        self.fetch_build_eggs(attrs['setup_requires'])
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
        dist = working_set.find(req)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
        raise VersionConflict(dist, req)
    pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))
    Complete output from command python setup.py egg_info:


Installed /tmp/easy_install-y2a_jcj5/pytest-runner-2.7/.eggs/setuptools_scm-1.10.1-py3.4.egg
zip_safe flag not set; analyzing archive contents...

Installed /tmp/pip-jyrz2gnr-build/.eggs/pytest_runner-2.7-py3.4.egg
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip-jyrz2gnr-build/setup.py", line 235, in <module>
    zip_safe=False)
  File "/usr/lib/python3.4/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 268, in __init__
    self.fetch_build_eggs(attrs['setup_requires'])
  File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 313, in fetch_build_eggs
    replace_conflicting=True,
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 836, in resolve
    dist = best[req.key] = env.best_match(req, ws, installer)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1074, in best_match
    dist = working_set.find(req)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 711, in find
    raise VersionConflict(dist, req)
pkg_resources.VersionConflict: (cffi 1.1.2 (/usr/lib/python3/dist-packages), Requirement.parse('cffi>=1.5.0'))

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 304, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/usr/lib/python3/dist-packages/pip/req.py", line 1230, in prepare_files
    req_to_install.run_egg_info()
  File "/usr/lib/python3/dist-packages/pip/req.py", line 326, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/usr/lib/python3/dist-packages/pip/util.py", line 716, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-jyrz2gnr-build
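
The key line in both logs is the `VersionConflict`: the system package provides cffi 1.1.2 while the build requires `cffi>=1.5.0`. The comparison can be reproduced with the simplified sketch below. The helper names are hypothetical, and the parser only handles plain dotted numeric versions, not pre-release tags.

```python
import re


def version_tuple(text):
    """Turn '1.5.0' into (1, 5, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def satisfies_minimum(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Here `satisfies_minimum("1.1.2", "1.5.0")` is false, which matches the conflict pip reports.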

Spell check with aspell

Issue by witchi
Mon Mar 23 10:50:36 2015
Originally opened as fritz-hh/OCRmyPDF#106

Hi,

Nice script, I use it with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
Can you enhance your script with a call to aspell?

I have tried it within src/ocrPage.sh on line 198:

# perform spell check
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Performing spell check"
!aspell --dont-backup --lang=de_DE --mode=sgml -c "${curHocr}" < /dev/tty   \
        && echo "Could not spell checking file \"${curHocr}\". Exiting..." && exit $EXIT_OTHER_ERROR

but it doesn't work with the Gnu-Parallel tool.

Thank you
Andre
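
One possible reason the snippet above fails under GNU Parallel is that `aspell -c` is interactive and expects a terminal. As a hedged alternative, aspell's `list` command reads text on standard input and prints the words it considers misspelled, which works without a terminal. The sketch below is written in Python rather than the shell of the original script, and the helper names are hypothetical.

```python
import shutil
import subprocess


def aspell_list_command(lang="de_DE", mode="sgml"):
    """Non-interactive aspell invocation that lists misspelled words."""
    return ["aspell", f"--lang={lang}", f"--mode={mode}", "list"]


def misspelled_words(text, lang="de_DE", mode="sgml"):
    """Return the sorted set of words aspell reports for the given text."""
    if shutil.which("aspell") is None:
        raise FileNotFoundError("aspell not found on PATH")
    result = subprocess.run(aspell_list_command(lang, mode), input=text,
                            capture_output=True, text=True, check=True)
    return sorted(set(result.stdout.split()))
```

This only reports suspect words; it does not correct the hOCR file the way the interactive mode would.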

No output pdf file

Issue by sjoswig
Wed Jul 22 09:25:37 2015
Originally opened as fritz-hh/OCRmyPDF#110

I'm using ocrmypdf 2.1.0-1 on Arch Linux. Over the last few weeks I had no problem getting correct OCR output from PDFs with ocrmypdf, but now only temporary files are created and no output PDF.

Here is the log file:

`OCRmyPDF version: v2.1-stable
Arguments: -f -vvv -l deu 2015-03-13 Kraftfahrtversicherung_ohne.pdf /home/js/Share/2015-03-13 Kraftfahrtversicherung.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.9.1-8 Q16 x86_64 2015-07-14 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenCL OpenMP
Delegates (built-in): bzlib cairo fontconfig freetype gslib jng jp2 jpeg lcms lqr ltdl lzma pangocairo png ps rsvg tiff webp wmf x xml zlib

GNU Parallel version:
GNU parallel 20150622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015 Ole Tange
and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication

please cite as described in 'parallel --bibtex'.

Poppler-utils version:
pdfimages version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org

Copyright 1996-2011 Glyph & Cog, LLC

unpaper version:

6.1

tesseract version:
tesseract 3.04.00
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.16 : libtiff 4.0.4 : zlib 1.2.8 : libwebp 0.4.3

python2 version:

Python 2.7.10

Ghostscript version:

9.16

Java version:
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)

Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

Created temporary folder: "/tmp/tmp.XZtlIvt11N"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0001
Page 0001: Size 842x596 (h*w in pt)
Page 0001: Size 2482x3510 (in pixel)
Page 0001: Extracting image as pbm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF`

OCRmyPDF with docker on CentOS7 not working

Hi,
I followed the docs for installing the docker container.
Running "docker run ocrmypdf --help" works fine.
But if I try to execute ocrmypdf on a local file, I get an error:

[root@CentOS7 test]# docker run -v "/srv/test/:/home/docker/" ocrmypdf ocrmypdf -v 1 x.pdf 1.pdf
usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-L FILE] [-j N]
[--use_threads] [-n] [--flowchart FILE] [-l LANGUAGE]
[--title TITLE] [--author AUTHOR] [--subject SUBJECT]
[--keywords KEYWORDS] [-d] [-c] [-i] [--oversample DPI] [-f]
[-s] [--skip-big MPixels]
[--tesseract-config TESSERACT_CONFIG]
[--pdf-renderer {tesseract,hocr}]
[--tesseract-timeout TESSERACT_TIMEOUT] [-k] [-g]
input_file output_file
ocrmypdf: error: unrecognized arguments: 1.pdf

Any help would be nice.

Thank you!
Kind regards,
Nicole
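
The error above is consistent with the word `ocrmypdf` being parsed as the input file, which would happen if the image already runs ocrmypdf as its entry point. That is an assumption; the Dockerfile is not shown here. A minimal sketch with Python's argparse, using a hypothetical trimmed-down parser rather than the real one, leaves the same argument unrecognized:

```python
import argparse

# Hypothetical, trimmed-down parser: only the options needed for the demo.
parser = argparse.ArgumentParser(prog="ocrmypdf")
parser.add_argument("-v", "--verbose", nargs="?", type=int, const=1, default=0)
parser.add_argument("input_file")
parser.add_argument("output_file")

# Arguments as they would arrive if the container's entry point were already
# "ocrmypdf" and the command line repeated the program name (assumption).
argv = ["ocrmypdf", "-v", "1", "x.pdf", "1.pdf"]

# parse_known_args returns the leftovers instead of exiting with an error.
namespace, leftover = parser.parse_known_args(argv)
print(namespace.input_file, namespace.output_file, leftover)
```

If that assumption holds, dropping the second `ocrmypdf` from the `docker run` line would leave x.pdf and 1.pdf as the two positional arguments.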

Some input PDFs with Tesseract-OCR throw error in PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...

Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120

A few PDF input files (which were already processed by tesseract-ocr pdf mode) throw an error in OCRmyPDF even in the --force-ocr mode. At the moment, I have no idea what exactly happens, but the problem appears to be in PyPDF2 (I use PyPDF2 1.25.1).

The error message is only shown when OCRmyPDF is used with the -v option.

Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?

Full output:

Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf' 
Original exception:
    Exception #1
      'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
       Task = def ocrmypdf.main.repair_pdf(...):
       Job  = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
        pdfinfo.extend(pdf_get_all_pageinfo(output_file))
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
        text = page.extractText()
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
        content = ContentStream(content, self.pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
        self.__parseContentStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
        operands.append(readObject(stream, None))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
        return ArrayObject.readFromStream(stream, pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
        arr.append(readObject(stream, pdf))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
        return readStringFromStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
        raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
    PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'

landscape orientation with 90 deg rotation not correctly handled

I have a scanner that produces PDF files with landscape layout but with a 90-degree rotation. Files of this kind are displayed correctly as portrait pages in, for example, the Nautilus file manager (Linux).
I have other scanned files from another scanner that produces portrait files directly. They are handled correctly.
As an example, take the attached test2.pdf, which is a scan of a standard test print page.
With ocrmypdf, however, I get a wrong output file (see test2b.pdf).
test2.pdf
test2b.pdf

Raw image to OCRmyPDF

Issue by geaplanet
Sun Mar 8 10:46:08 2015
Originally opened as fritz-hh/OCRmyPDF#104

Is there any possibility of using OCRmyPDF with raw TIFF images passed as a parameter?
OCRmyPDF converts PDFs to images in order to work with them, but if you have raw images from a scanner or camera, how can you use them?

misalignment of graphic layer

Sometimes the graphic layer is misaligned while the text layer seems to be placed correctly. I uploaded a sample pdf (test07.pdf) at:

http://www.loaditup.de/838186-ns8hr3kcbg.html

ocrmypdf --oversample 600 test07.pdf test07ocr.pdf

shows what I mean. test07ocr.pdf can be seen here:
http://www.loaditup.de/838187-4hkqhkbvnm.html

Additionally ocrmypdf gives a warning:

   **** File did not complete the page properly and may be damaged.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> PyPDF2 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

I don't know whether this warning is justified. At least I have no problems in viewing the pdf in common pdf viewers. Have you got any idea about this?

setup.py fails with python 2.7

Although setup attempts to check the python version and throw an error message, in fact with python2.7 you never get that far: it barfs on the copyright symbol on line 2.

 $ python setup.py build
 File "setup.py", line 2
 SyntaxError: Non-ASCII character '\xc2' in file setup.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
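
Python 2 rejects a source file containing non-ASCII bytes unless an encoding is declared on the first or second line (PEP 263), so the interpreter fails while parsing and never reaches the version check. A minimal sketch of a guard that would survive parsing is below. The function name `check_python_version` and the `(3, 4)` minimum are hypothetical, chosen only for illustration, and this is not the project's actual setup.py:

```python
# -*- coding: utf-8 -*-
# The coding line above must be on line 1 or 2; with it, a non-ASCII
# character such as © in a later comment no longer stops Python 2's parser.
import sys


def check_python_version(version_info=sys.version_info, minimum=(3, 4)):
    """Return an error message if the interpreter is too old, else None.

    The default minimum is an assumption made for this example.
    """
    if tuple(version_info[:2]) < tuple(minimum):
        return "Python %d.%d or newer is required" % tuple(minimum)
    return None


message = check_python_version()
if message is not None:
    sys.exit(message)
```

The guard uses `%` formatting and no Python 3 only syntax, so an older interpreter can parse the file and print the message.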

Adapt to Google Vision API

Google has just released alpha access to the Google Vision API, and I am hopeful that its OCR will be better than Tesseract's. If that's true, would this be a good option to add as another way to handle OCR in this project? Maybe you could add a switch to choose the OCR source. The sign-up page for alpha access is here: https://services.google.com/fb/forms/visionapialpha/. It would be great to get your opinion on this. Thanks!

dependency problem reportlab - although installed...

Issue by andreasotto
Tue Nov 4 10:44:25 2014
Originally opened as fritz-hh/OCRmyPDF#99

# ./OCRmyPDF.sh /home/ao/Leerungstermine189973.PDF /home/ao/test.pdf
Please install the python library reportlab. Exiting...

# apt-get install python-reportlab
python-reportlab is already the newest version.

.. already installed.

Debian 6 squeeze

unpaper fails to deskew some files with obvious skew

unpaper may not be a viable deskewer and ImageMagick is awful. It seems that the presence of italic fonts may be part of the issue.

Tesseract does not calculate the skew angle (logically, since there is no global skew angle on a page).

Best option is to go back to Leptonica.

OCRmyPDF fails to detect text on pages created by Tesseract 3.04

This is due to bug(s) in PyPDF's extractText, which does not find text OCR'ed by Tesseract 3.04.

There are probably other cases.

Use preview image if higher quality than raster image

Preview image is grayscale 200 DPI JPEG. This is generated twice if the actual image is near or at those specs, so it could be reused.

lossless jbig2 compression changed to ccitt

I'm using version 3.2.1, but PDFs with jbig2 compression are still changed to ccitt, leading to considerably larger file sizes. Am I doing something wrong or is there a bug?

This is the output (see attachment for test.pdf):

$ pdfimages -list test.pdf 
page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  jbig2  no         5  0

$ ocrmypdf -v 1 test.pdf test-ocr.pdf 

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.main.repair_pdf' 
    [{'xres': Decimal('599.999'), 'height_inches': Decimal('5.31667'), 'width_pixels': 2062, 'width_inches': Decimal('3.43667'), 'pageno': 0, 'images': [{'color': 'gray', 'bpc': 1, 'enc': 'jbig2', 'dpi_w': Decimal('599.999'), 'width': 2062, 'height': 3190, 'comp': 1, 'dpi_h': Decimal('600.000'), 'dpi': Decimal('599.999')}], 'yres': Decimal('600.000'), 'height_pixels': 3190, 'has_text': False}]
Completed Task = 'ocrmypdf.main.repair_pdf' 
Task enters queue = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.generate_postscript_stub' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages' 
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.skip_page' 
Uptodate Task = 'ocrmypdf.main.skip_page'


WARNING:
        In Task 'ocrmypdf.main.skip_page':
        No jobs were run because no file names matched.
        Please make sure that the regular expression is correctly specified. 

    Rendering 000001.ocr.page.pdf with pngmono
Completed Task = 'ocrmypdf.main.generate_postscript_stub' 
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf.main.preprocess_deskew' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew' 
Task enters queue = 'ocrmypdf.main.preprocess_clean' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean' 
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr' 
Task enters queue = 'ocrmypdf.main.select_image_for_pdf' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.page.png, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image)
Completed Task = 'ocrmypdf.main.select_image_for_pdf' 
Task enters queue = 'ocrmypdf.main.select_image_layer' 
    os.symlink(/tmp/com.github.ocrmypdf.hjkqg9uk/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.hjkqg9uk/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.main.select_image_layer' 
    Tesseract Open Source OCR Engine v3.03 with Leptonica

Completed Task = 'ocrmypdf.main.ocr_tesseract_hocr' 
Task enters queue = 'ocrmypdf.main.render_hocr_page' 
Completed Task = 'ocrmypdf.main.render_hocr_page' 
Task enters queue = 'ocrmypdf.main.add_text_layer' 
Completed Task = 'ocrmypdf.main.add_text_layer' 
Task enters queue = 'ocrmypdf.main.merge_pages' 
    ['/tmp/com.github.ocrmypdf.hjkqg9uk/000001.rendered.pdf', '/tmp/com.github.ocrmypdf.hjkqg9uk/pdfa_def.ps']
Completed Task = 'ocrmypdf.main.merge_pages' 
Task enters queue = 'ocrmypdf.main.copy_final' 
Completed Task = 'ocrmypdf.main.copy_final'
$ pdfimages -list test-ocr.pdf 
page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  ccitt  no        10  0

test.zip
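
The before/after comparison above can be automated. This is a minimal sketch that reads the `enc` column from `pdfimages -list` text, assuming the column layout shown in this report. The helper name `image_encodings` is hypothetical:

```python
def image_encodings(listing):
    """Return the 'enc' column for each image row of `pdfimages -list` text.

    Assumes the layout shown in this report: page, num, type, width, height,
    color, comp, bpc, enc, ... separated by whitespace.
    """
    encodings = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with two integers (page and image number);
        # the header and the dashed rule do not.
        if len(fields) >= 9 and fields[0].isdigit() and fields[1].isdigit():
            encodings.append(fields[8])
    return encodings


before = """page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  jbig2  no         5  0
"""
after = """page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2062  3190  gray    1   1  ccitt  no        10  0
"""

changed = image_encodings(before) != image_encodings(after)
print(image_encodings(before), image_encodings(after), changed)
```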

add a cmd line switch to generate a txt file to go along with the pdf

Issue by fritz-hh
Sun Sep 28 20:27:47 2014
Originally opened as fritz-hh/OCRmyPDF#93

ocrmypdf with incrontab / inotify: help ruffus find a writable location for its database

Issue by segro21
Thu Sep 25 13:09:49 2014
Originally opened as fritz-hh/OCRmyPDF#90

Hi,
this is not really an issue with ocrmypdf, but I'm trying to get it to work on a Samba share with incrontab / inotify.
I've created a folder and watch activity in it with incrontab. That works fine for things like pdftk, but nothing happens with ocrmypdf. Syslog shows the correct command, but then it just ends.

my incrontab -e
/home/pdfin IN_CLOSE_WRITE /opt/ocrmypdf/ocrmypdf.sh $@/$# $@/out/$#
/home/pdfin/out IN_CLOSE_WRITE /bin/rm $@/../$#
->this works fine for stamping pdfs with logo
/home/stamp IN_CLOSE_WRITE /usr/bin/pdftk $@/$# stamp $@/BB.pdf output $@/out/$#
/home/stamp/out IN_CLOSE_WRITE /bin/rm $@/../$#

Any ideas?
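
The title suggests that ruffus needs a writable location for its database. Assuming the job fails because incron starts it in a working directory the process cannot write to (not confirmed in this report), one possible workaround is to launch the command from a temporary directory. The helper name `run_in_writable_dir` is hypothetical, and the sketch assumes the command receives absolute paths, as `$@/$#` provides:

```python
import subprocess
import sys
import tempfile


def run_in_writable_dir(command):
    """Run `command` with a fresh temporary directory as its working directory.

    Relative paths inside `command` would resolve against that temporary
    directory, so callers should pass absolute paths.
    """
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(command, cwd=workdir, check=False)


# Self-check using the Python interpreter itself: the child exits with 0
# only if its working directory is writable.
probe = [sys.executable, "-c",
         "import os, sys; sys.exit(0 if os.access(os.getcwd(), os.W_OK) else 1)"]
result = run_in_writable_dir(probe)
print(result.returncode)
```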

Cannot specify exactly one CPU

-j 1 will get mapped to available_cpu_count() and use all CPUs. Did I add this to work around a ruffus issue?
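
A minimal sketch of job-count resolution that honors an explicit 1 is below. It assumes the option parser leaves the value as `None` when `-j` is not given. `resolve_jobs` is a hypothetical name, and `os.cpu_count()` stands in for the `available_cpu_count()` mentioned above:

```python
import os


def resolve_jobs(requested):
    """Return the number of worker processes to use.

    `requested` is None when the user did not pass -j (assumption); any
    explicit positive value, including 1, is returned unchanged.
    """
    if requested is None:
        # Stand-in for available_cpu_count(); cpu_count() may return None.
        return os.cpu_count() or 1
    if requested < 1:
        raise ValueError("job count must be at least 1")
    return requested


print(resolve_jobs(1), resolve_jobs(None))
```

Testing against `None` rather than relying on truthiness or a `<= 1` comparison keeps "not specified" distinct from "exactly one".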

OCRmyPDF does not realise if an input file is not present

It looks as if a basic test for "input file exists" is missing or not working.

Assume a file x.pdf exists.

Then

ocrmypdf x.pdf x.ocr.pdf

works, however

ocrmypdf x-no-such-file.pdf x.ocr.pdf

silently fails.
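
A minimal sketch of the kind of up-front check being asked for is below. The helper name `input_file_error` and the message wording are hypothetical and are not taken from OCRmyPDF:

```python
import os
import tempfile


def input_file_error(path):
    """Return an error message if `path` is not a readable file, else None."""
    if not os.path.isfile(path):
        return "input file not found: %s" % path
    if not os.access(path, os.R_OK):
        return "input file is not readable: %s" % path
    return None


# Demonstration with a file that exists and a name that does not.
with tempfile.NamedTemporaryFile(suffix=".pdf") as existing:
    ok = input_file_error(existing.name)
missing = input_file_error("x-no-such-file.pdf.does-not-exist")
print(ok, missing)
```

A caller would print the message and exit with a non-zero status before any pipeline task starts, so the failure is no longer silent.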

MRC

Issue by b21e
Fri Sep 19 16:14:39 2014
Originally opened as fritz-hh/OCRmyPDF#88

Hi, integration with jbig2enc for better compression of the text image layer, especially for scans, would make this software perfect.

output file much bigger (7x) because the original embedded image files are not copied

Issue by alphablue52
Tue Feb 18 20:11:19 2014
Originally opened as fritz-hh/OCRmyPDF#70

Hello,
first, thanks for the great work on this script. It got me working with OCR again after 10 years of frustrated absence :-)
There is only one negative thing: most of my PDFs come from a Canon ImageRunner scan and are very good in quality relative to size. OCR gives great results, but the output PDFs are 7-8x bigger than the input. As far as I can see, the embedded images get recompressed to JPEG, while the original is /CCITTFaxDecode.
If this is because of PDF/A compatibility, I suggest adding an option for non-PDF/A output.

You can download input.pdf and output.pdf here:
https://www.dropbox.com/l/KYlpYRiSs6IjWVOmF1fX39

Here is the output of the script with -g option.

~/bin/OCRmyPDF-2.0-stable$ sh OCRmyPDF.sh -g -l deu input.pdf output.pdf
OCRmyPDF version: v2.0-stable
Arguments: -g -l deu input.pdf output.pdf

Checking if all dependencies are installed

ImageMagick version:
Version: ImageMagick 6.7.7-10 2013-09-10 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

GNU Parallel version:
GNU parallel 20130622
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using GNU Parallel for a publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,

;login: The USENIX Magazine, February 2011:42-47.

Poppler-utils version:
pdfimages version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdftoppm version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
pdffonts version 0.24.1
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org

unpaper version:

OCRmyPDF.sh: 190: OCRmyPDF.sh: unpaper: not found

tesseract version:
tesseract 3.02.01
leptonica-1.69
libgif 4.1.6 : libjpeg 8d : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.8

python2 version:

Python 2.7.5+

Ghostscript version:

9.10

Java version:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)

OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Created temporary folder: "/tmp/tmp.X82OQourlI"
Input file: Extracting size of each page (in pt)
Processing page 0001 / 0014
Page 0001: Size 842x595 (h_w in pt)
Page 0001: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Page 0001: Extracting image as ppm file (300 dpi)
Page 0001: Performing OCR
Page 0001: Embedding text in PDF
Page 0001: Embedding text in PDF (debug page)
Processing page 0002 / 0014
Page 0002: Size 842x595 (h_w in pt)
Page 0002: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0002: Continuing anyway, assuming a default resolution of 300 dpi
Page 0002: Extracting image as ppm file (300 dpi)
Page 0002: Performing OCR
Page 0002: Embedding text in PDF
Page 0002: Embedding text in PDF (debug page)
Processing page 0003 / 0014
Page 0003: Size 842x595 (h_w in pt)
Page 0003: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0003: Continuing anyway, assuming a default resolution of 300 dpi
Page 0003: Extracting image as ppm file (300 dpi)
Page 0003: Performing OCR
Page 0003: Embedding text in PDF
Page 0003: Embedding text in PDF (debug page)
Processing page 0004 / 0014
Page 0004: Size 842x595 (h_w in pt)
Page 0004: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0004: Continuing anyway, assuming a default resolution of 300 dpi
Page 0004: Extracting image as ppm file (300 dpi)
Page 0004: Performing OCR
Page 0004: Embedding text in PDF
Page 0004: Embedding text in PDF (debug page)
Processing page 0005 / 0014
Page 0005: Size 842x595 (h_w in pt)
Page 0005: Expecting exactly 1 image covering the whole page (found 2). Cannot compute dpi value.
Page 0005: Continuing anyway, assuming a default resolution of 300 dpi
Page 0005: Extracting image as ppm file (300 dpi)
Page 0005: Performing OCR
Page 0005: Embedding text in PDF
Page 0005: Embedding text in PDF (debug page)
Processing page 0006 / 0014
Page 0006: Size 842x595 (h_w in pt)
Page 0006: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0006: Continuing anyway, assuming a default resolution of 300 dpi
Page 0006: Extracting image as ppm file (300 dpi)
Page 0006: Performing OCR
Page 0006: Embedding text in PDF
Page 0006: Embedding text in PDF (debug page)
Processing page 0007 / 0014
Page 0007: Size 842x595 (h_w in pt)
Page 0007: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0007: Continuing anyway, assuming a default resolution of 300 dpi
Page 0007: Extracting image as ppm file (300 dpi)
Page 0007: Performing OCR
Page 0007: Embedding text in PDF
Page 0007: Embedding text in PDF (debug page)
Processing page 0008 / 0014
Page 0008: Size 842x595 (h_w in pt)
Page 0008: Expecting exactly 1 image covering the whole page (found 8). Cannot compute dpi value.
Page 0008: Continuing anyway, assuming a default resolution of 300 dpi
Page 0008: Extracting image as ppm file (300 dpi)
Page 0008: Performing OCR
Page 0008: Embedding text in PDF
Page 0008: Embedding text in PDF (debug page)
Processing page 0009 / 0014
Page 0009: Size 842x595 (h_w in pt)
Page 0009: Expecting exactly 1 image covering the whole page (found 5). Cannot compute dpi value.
Page 0009: Continuing anyway, assuming a default resolution of 300 dpi
Page 0009: Extracting image as ppm file (300 dpi)
Page 0009: Performing OCR
Page 0009: Embedding text in PDF
Page 0009: Embedding text in PDF (debug page)
Processing page 0010 / 0014
Page 0010: Size 842x595 (h_w in pt)
Page 0010: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0010: Continuing anyway, assuming a default resolution of 300 dpi
Page 0010: Extracting image as ppm file (300 dpi)
Page 0010: Performing OCR
Page 0010: Embedding text in PDF
Page 0010: Embedding text in PDF (debug page)
Processing page 0011 / 0014
Page 0011: Size 842x595 (h_w in pt)
Page 0011: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0011: Continuing anyway, assuming a default resolution of 300 dpi
Page 0011: Extracting image as ppm file (300 dpi)
Page 0011: Performing OCR
Page 0011: Embedding text in PDF
Page 0011: Embedding text in PDF (debug page)
Processing page 0012 / 0014
Page 0012: Size 842x595 (h_w in pt)
Page 0012: Expecting exactly 1 image covering the whole page (found 3). Cannot compute dpi value.
Page 0012: Continuing anyway, assuming a default resolution of 300 dpi
Page 0012: Extracting image as ppm file (300 dpi)
Page 0012: Performing OCR
Page 0012: Embedding text in PDF
Page 0012: Embedding text in PDF (debug page)
Processing page 0013 / 0014
Page 0013: Size 842x595 (h_w in pt)
Page 0013: Expecting exactly 1 image covering the whole page (found 4). Cannot compute dpi value.
Page 0013: Continuing anyway, assuming a default resolution of 300 dpi
Page 0013: Extracting image as ppm file (300 dpi)
Page 0013: Performing OCR
Page 0013: Embedding text in PDF
Page 0013: Embedding text in PDF (debug page)
Processing page 0014 / 0014
Page 0014: Size 842x595 (h_w in pt)
Page 0014: Size 1240x1753 (in pixel)
Page 0014: Low image resolution detected (150 dpi). If needed, please use the "-o" to try to get better OCR results.
Page 0014: Extracting image as pgm file (150 dpi)
Page 0014: Performing OCR
Page 0014: Embedding text in PDF
Page 0014: Embedding text in PDF (debug page)
Output file: Concatenating all pages to the final PDF/A file
Output file: Checking compliance to PDF/A standard
The full validation log is available here: "/tmp/tmp.X82OQourlI/pdf_validation.log"
Output file: The generated PDF/A file is VALID
Script took 20 seconds
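
The log reports both a page size in points and an image size in pixels for page 0014, and the 150 dpi figure follows from the standard conversion of 72 points per inch. The sketch below works through that arithmetic. It is not the script's own code, and it assumes the pixel size is listed as width x height while the point size is listed as height x width:

```python
POINTS_PER_INCH = 72.0


def estimate_dpi(pixels, points):
    """Resolution of an image that spans `points` of page length with `pixels`."""
    return pixels / (points / POINTS_PER_INCH)


# Page 0014: 842x595 pt (height x width), image 1240x1753 px (width x height).
dpi_w = estimate_dpi(1240, 595)
dpi_h = estimate_dpi(1753, 842)
print(round(dpi_w), round(dpi_h))
```

This calculation needs a single image covering the whole page, which matches the log's "Cannot compute dpi value" message for pages where several images were found.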

OCRmyPDF issue - no new file, can't find iccprofiles?

Hi there, and thank you for any assistance,

OCRmyPDF fails to create a new file.

here's the install process:

pip3 install ocrmypdf
Downloading/unpacking ocrmypdf
Downloading ocrmypdf-3.0.tar.gz
Running setup.py (path:/tmp/pip-build-wqh0224e/ocrmypdf/setup.py) egg_info for package ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2

warning: no previously-included files matching '*' found under directory 'tests/output'

Requirement already satisfied (use --upgrade to upgrade): ruffus>=2.6.3 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): Pillow>=2.4.0 in /usr/lib/python3/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): reportlab>=3.1.44 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): PyPDF2>=1.25.1 in /usr/local/lib/python3.4/dist-packages (from ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): pip>=1.4.1 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Requirement already satisfied (use --upgrade to upgrade): setuptools>=2.2 in /usr/lib/python3/dist-packages (from reportlab>=3.1.44->ocrmypdf)
Installing collected packages: ocrmypdf
Running setup.py install for ocrmypdf
Checking for tesseract >= 3.02.02...
Found tesseract 3.03
Checking for gs >= 9.14...
Found gs 9.14
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.0.0...
Found qpdf 5.1.2

warning: no previously-included files matching '*' found under directory 'tests/output'
Installing ocrmypdf script to /usr/local/bin

Successfully installed ocrmypdf
Cleaning up...

Verbose mode for conversion shows this:

$ ocrmypdf A.pdf B.pdf --verbose

Tasks which will be run:

Task enters queue = 'ocrmypdf.main.repair_pdf'

[{'height_inches': Decimal('24.3611'), 'pageno': 0, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 1, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 2, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 3, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 4, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 5, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 6, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 7, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 8, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 9, 'images': [], 'has_text': False, 'width_inches': 
Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 10, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 11, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 12, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 13, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 14, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 15, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 16, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
17, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 18, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 19, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 20, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 21, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 22, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 23, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 24, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 25, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': 
Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 26, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 27, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 28, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 29, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 30, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 31, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 32, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 33, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 34, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 35, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 36, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 
'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 37, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 38, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 39, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 40, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 41, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 42, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 43, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 
44, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 45, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 46, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 47, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 48, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 49, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 50, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 51, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 52, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 53, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': 
Decimal('24.3611'), 'pageno': 54, 'images': [], 'has_text': False, 'width_inches': Decimal('18.8056')}, {'height_inches': Decimal('24.3611'), 'pageno': 55, 'xres': Decimal('71.9998'), 'height_pixels': 1754, 'width_pixels': 1354, 'yres': Decimal('72.0000'), 'images': [{'color': 'rgb', 'dpi_h': Decimal('72.0000'), 'enc': 'jpeg', 'comp': 3, 'bpc': 8, 'dpi': Decimal('71.9999'), 'height': 1754, 'dpi_w': Decimal('71.9998'), 'width': 1354}], 'has_text': False, 'width_inches': Decimal('18.8056')}]

Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000048.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000048.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000003.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000003.ocr.page.pdf)
Page 33 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000033.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000033.skip.page.pdf)
Page 50 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000050.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000050.skip.page.pdf)
Page 2 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000002.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000002.skip.page.pdf)
Page 52 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000052.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000052.skip.page.pdf)
Page 8 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000008.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000008.skip.page.pdf)
Page 12 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000012.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000012.skip.page.pdf)
Page 41 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000041.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000041.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000039.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000039.ocr.page.pdf)
Page 1 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000001.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000001.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000026.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000026.ocr.page.pdf)
Page 5 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000005.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000005.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000016.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000016.ocr.page.pdf)
Page 11 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000011.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000011.skip.page.pdf)
Page 21 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000021.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000021.skip.page.pdf)
Page 28 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000028.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000028.skip.page.pdf)
Page 38 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000038.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000038.skip.page.pdf)
Page 47 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000047.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000047.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000017.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000017.ocr.page.pdf)
Page 49 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000049.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000049.skip.page.pdf)
Page 29 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000029.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000029.skip.page.pdf)
Page 31 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000031.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000031.skip.page.pdf)
Page 9 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000009.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000009.skip.page.pdf)
Page 43 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000043.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000043.skip.page.pdf)
Page 20 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000020.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000020.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000013.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000013.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000014.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000014.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000037.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000037.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000056.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000056.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000025.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000025.ocr.page.pdf)
Page 45 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000045.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000045.skip.page.pdf)
Page 55 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000055.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000055.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000032.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000032.ocr.page.pdf)
Page 51 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000051.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000051.skip.page.pdf)
Page 27 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000027.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000027.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000040.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000040.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000019.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000019.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000053.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000053.ocr.page.pdf)
Page 36 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000036.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000036.skip.page.pdf)
Page 46 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000046.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000046.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000024.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000024.ocr.page.pdf)
Page 10 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000010.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000010.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000007.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000007.ocr.page.pdf)
Page 23 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000023.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000023.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000044.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000044.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000035.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000035.ocr.page.pdf)
Page 6 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000006.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000006.skip.page.pdf)
Page 18 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000018.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000018.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000054.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000054.ocr.page.pdf)
Page 15 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000015.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000015.skip.page.pdf)
Page 22 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000022.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000022.skip.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000004.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000004.ocr.page.pdf)
Page 42 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000042.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000042.skip.page.pdf)
Page 30 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000030.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000030.skip.page.pdf)
Page 34 has no images - skipping OCR
os.symlink(/tmp/com.github.ocrmypdf.fwij8o72/000034.page.pdf, /tmp/com.github.ocrmypdf.fwij8o72/000034.skip.page.pdf)

Original exception:

Exception #1
  'builtins.FileNotFoundError(Could not find Ghostscript's iccprofiles)' raised in ...
   Task = def ocrmypdf.main.generate_postscript_stub(...):
   Job  = [.../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.repaired.pdf -> .../com.github.ocrmypdf.fwij8o72/YummyS3ptember2015.pdfa_def.ps, <ocrmypdf.main.WrappedLogger>]

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 761, in generate_postscript_stub
    generate_pdfa_def(output_file, pdfmark)
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 123, in generate_pdfa_def
    icc_profile = os.path.join(_get_postscript_icc_path(), 'srgb.icc')
  File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pdfa.py", line 118, in _get_postscript_icc_path
    raise FileNotFoundError("Could not find Ghostscript's iccprofiles")
FileNotFoundError: Could not find Ghostscript's iccprofiles

I tried removing ocrmypdf and re-installing it and had the same behaviour. Any ideas on what I need to do to fix this?

Thanks in advance.

Adam
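
The traceback shows `_get_postscript_icc_path()` giving up before `srgb.icc` can be joined onto a directory. A minimal sketch of that kind of lookup, assuming a caller-supplied list of candidate directories; `find_icc_dir` and its arguments are hypothetical and are not OCRmyPDF's actual code:

```python
import os

def find_icc_dir(candidates, filename="srgb.icc"):
    """Return the first candidate directory that contains `filename`.

    `candidates` is a hypothetical list of directories to probe; where
    Ghostscript keeps its ICC profiles depends on the installation.
    """
    for directory in candidates:
        if os.path.isfile(os.path.join(directory, filename)):
            return directory
    # Same error text as in the report above.
    raise FileNotFoundError("Could not find Ghostscript's iccprofiles")
```

Running a lookup like this by hand against the directories of the local Ghostscript installation may help narrow down whether the profile files are missing or merely in an unexpected place.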

Ubuntu 15.10: python3 exception - convert() got an unexpected keyword argument 'dpi'

Install on Ubuntu 15.10

Software versions

$ ocrmypdf --version
3.2
$ python3 --version
Python 3.4.3+
$ unpaper -version
6.1

Exception (on every attempt)

$ ocrmypdf --verbose 1 --force-ocr scansmpl.pdf test.pdf

Original exception:

    Exception #1
      'builtins.TypeError(convert() got an unexpected keyword argument 'dpi')' raised in ...
       Task = def ocrmypdf.main.select_image_layer(...):
       Job  = [[.../com.github.ocrmypdf.aziws_b9/000001.image, .../com.github.ocrmypdf.aziws_b9/000001.ocr.page.pdf] -> .../com.github.ocrmypdf.aziws_b9/000001.image-layer.pdf, <ocrmypdf.main.WrappedLogger>, [{'width_inches': Decimal('8.48611'), 'width_pixels': 1696, 'pageno': 0, 'images': [{'dpi_h': Decimal('2E+2'), 'color': 'gray', 'width': 1696, 'comp': 1, 'dpi': Decimal('199.928'), 'height': 2175, 'bpc': 1, 'dpi_w': Decimal('199.856'), 'enc': 'ccitt'}], 'has_text': False, 'xres': Decimal('199.856'), 'height_inches': Decimal('10.875'), 'height_pixels': 2175, 'yres': Decimal('2E+2')}], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 597, in select_image_layer
        img2pdf.convert([image], dpi=dpi, outputstream=pdf)
    TypeError: convert() got an unexpected keyword argument 'dpi'
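
The TypeError suggests that the installed `img2pdf` exposes a `convert()` without a `dpi` keyword, which would point to a version mismatch rather than a problem with the input file. A minimal sketch of how a caller could check for a keyword before passing it, using `inspect.signature`; `convert_old` and `convert_new` are hypothetical stand-ins and not the img2pdf API:

```python
import inspect

def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

# Hypothetical stand-ins for two incompatible versions of a library call.
def convert_old(images, outputstream=None):
    return "old"

def convert_new(images, dpi=None, outputstream=None):
    return "new"

def call_convert(convert, images, dpi):
    """Pass `dpi` only when the given `convert` callable understands it."""
    kwargs = {"dpi": dpi} if accepts_keyword(convert, "dpi") else {}
    return convert(images, **kwargs)
```

Silently dropping `dpi` would change the output resolution, so in practice pinning compatible versions of both packages is probably the safer route.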

Some input PDFs with Tesseract-OCR throw error in PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...

Issue by Wikinaut
Mon Sep 7 18:56:47 2015
Originally opened as fritz-hh/OCRmyPDF#120

The error message is only shown when OCRmyPDF is used with the -v option.

Do you have an idea what went wrong in these cases, or what else can be done to let OCRmyPDF apply another OCR run to such a PDF?

Full output:

Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf' 
Original exception:
    Exception #1
      'PyPDF2.utils.PdfReadError(Unexpected escaped string: b'{')' raised in ...
       Task = def ocrmypdf.main.repair_pdf(...):
       Job  = [ARCHIV.pdf -> .../com.github.ocrmypdf.49q2h1fj/ARCHIV.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]

    Traceback (most recent call last):
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/lib/python3.4/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/src/OCRmyPDF/ocrmypdf/main.py", line 372, in repair_pdf
        pdfinfo.extend(pdf_get_all_pageinfo(output_file))
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in pdf_get_all_pageinfo
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 145, in <listcomp>
        return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
      File "/usr/local/src/OCRmyPDF/ocrmypdf/pageinfo.py", line 115, in _pdf_get_pageinfo
        text = page.extractText()
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2566, in extractText
        content = ContentStream(content, self.pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2645, in __init__
        self.__parseContentStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream
        operands.append(readObject(stream, None))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 71, in readObject
        return ArrayObject.readFromStream(stream, pdf)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 166, in readFromStream
        arr.append(readObject(stream, pdf))
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 77, in readObject
        return readStringFromStream(stream)
      File "/usr/lib/python3.4/site-packages/PyPDF2/generic.py", line 386, in readStringFromStream
        raise utils.PdfReadError(r"Unexpected escaped string: %s" % tok)
    PyPDF2.utils.PdfReadError: Unexpected escaped string: b'{'
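
In the traceback, the failure happens while `page.extractText()` is called to find out whether a page already contains text. One possible defensive approach, sketched here under the assumption that an unparsable content stream can be treated as "unknown" instead of aborting the run; `page_has_text` and `FakePage` are hypothetical names used only for illustration:

```python
def page_has_text(page, parse_errors=(Exception,)):
    """Return True/False if extraction works, or None if the page cannot be parsed.

    `page` is any object with an extractText() method; `parse_errors` would be
    narrowed to the PDF library's own error type in real code.
    """
    try:
        text = page.extractText()
    except parse_errors:
        return None
    return bool(text.strip())

class FakePage:
    """Hypothetical stand-in for a PDF page object."""
    def __init__(self, text="", error=None):
        self._text = text
        self._error = error

    def extractText(self):
        if self._error is not None:
            raise self._error
        return self._text
```

A caller could then decide what to do with `None`, for example by logging the page number and continuing with OCR.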

close_fds is not supported on Windows platforms...

Hi,

I'm getting this error

[Anaconda3] C:\Users\Carlos\Anaconda3>ocrmypdf --help
Traceback (most recent call last):
  File "C:\Users\Carlos\Anaconda3\Scripts\ocrmypdf-script.py", line 9, in <module>
    load_entry_point('ocrmypdf==3.1.1', 'console_scripts', 'ocrmypdf')()
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 549, in load_entry_point
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2709, in load_entry_point
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2369, in load
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\setuptools-19.4-py3.5.egg\pkg_resources\__init__.py", line 2375, in resolve
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\main.py", line 51, in <module>
    if tesseract.version() < MINIMUM_TESS_VERSION:
  File "C:\Users\Carlos\Anaconda3\lib\site-packages\ocrmypdf-3.1.1-py3.5.egg\ocrmypdf\tesseract.py", line 51, in version
    stderr=STDOUT)
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 696, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\Carlos\Anaconda3\lib\subprocess.py", line 873, in __init__
    "close_fds is not supported on Windows platforms"
ValueError: close_fds is not supported on Windows platforms if you redirect stdin/stdout/stderr
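
On Python versions before 3.7, `subprocess.Popen` on Windows raises this ValueError when `close_fds=True` is combined with redirecting stdin, stdout or stderr. The call in `tesseract.py` redirects stderr and presumably also passes `close_fds=True`. A minimal sketch of choosing the keyword arguments per platform; `portable_popen_kwargs` is a hypothetical helper, not part of OCRmyPDF:

```python
import subprocess
import sys

def portable_popen_kwargs(platform=sys.platform):
    """Build subprocess keyword arguments that redirect stderr on any platform."""
    return {
        "stderr": subprocess.STDOUT,
        # Old Windows Pythons reject close_fds=True together with redirection,
        # so only request it on other platforms.
        "close_fds": not platform.startswith("win"),
    }
```

The result can be passed on as `subprocess.check_output(args, **portable_popen_kwargs())`.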

Thank you for your help.

automatic language detection

It would be great if ocrmypdf could automatically recognize the language used in a PDF and use it.

Vertical text not displayed correctly

Issue by fritz-hh
Thu Apr 25 17:18:06 2013
Originally opened as fritz-hh/OCRmyPDF#15

Vertical text is not oriented correctly because the hocr file produced by tesseract does not contain the "textangle" attribute.
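
For context, hOCR stores per-element properties in the `title` attribute as semicolon-separated entries, and `textangle` is one of the properties a line may carry. A minimal sketch of reading it with a fallback when the attribute is absent, as in the case described above; the function names and the sample title strings are assumptions for illustration:

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: [values, ...]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props

def text_angle(title, default=0.0):
    """Return the textangle in degrees, or `default` if the property is missing."""
    values = parse_hocr_title(title).get("textangle")
    return float(values[0]) if values else default
```

With a fallback like this, a missing `textangle` is indistinguishable from horizontal text, which matches the symptom reported here.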

How to add languages for tesseract-ocr in the image?

Sorry, I am new to Docker. I just pulled the latest image and want to use the language chi_sim in Tesseract, but it seems this language is not installed by default, as it complains:

~/work/tmp$ docker run -v "$(pwd):/home/docker" ocrmypdf 31.pdf 31-ocr.pdf -l chi_sim
The installed version of tesseract does not have language data for the following requested languages:
chi_sim

It seems the Tesseract used by the Docker image is different from the system's tesseract-ocr package, for which I installed the language package with "apt-get install tesseract-ocr-chi-sim".

How do I update the Docker image to include the desired language support? And how can I check which languages are supported (like "tesseract --list-langs" on the system)?

Thanks a lot.
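
The check that produces the message above can be reproduced by comparing the `-l` argument against the output of `tesseract --list-langs`, run with whichever Tesseract is actually being used. A minimal sketch, assuming the output consists of a header line ending in a colon followed by one language code per line; the function names and the sample output are illustrative, not taken from OCRmyPDF:

```python
def parse_list_langs(output):
    """Extract the language codes from `tesseract --list-langs` style output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, assumed here to end with a colon.
    return {line for line in lines if not line.endswith(":")}

def missing_languages(requested, installed):
    """Return the codes in a spec such as 'eng+fra' that are not installed."""
    return [lang for lang in requested.split("+") if lang not in installed]

# Illustrative sample only; real output depends on the installation.
SAMPLE_OUTPUT = "List of available languages (2):\neng\nosd\n"
```

For the sample above, `missing_languages("eng+chi_sim", parse_list_langs(SAMPLE_OUTPUT))` reports only `chi_sim` as missing.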

Skip text removes text from output file

I might be reading this wrong, but when I run skip-text on this PDF, which already has text in it:

https://www.dropbox.com/s/dt0d3wpwb6ovngi/OTII_PressRelease-200110301.pdf?dl=0

the output file looks the same except that the searchable text is now gone. What am I doing wrong? Thanks!

Improve error message output

Ruffus's console logging seems to be far too quiet, suppressing error messages in some cases.

Find out how to create our own error logging and tell ruffus about it.

Pagesegmode

Hey,

Is there a way to define the pagesegmode for Tesseract OCR?
(https://tesseract-ocr.googlecode.com/git/doc/tesseract.1.html)

Thank you very much
tuxasus

Option to remove blank pages

Issue by drdownload
Thu Oct 30 08:25:16 2014
Originally opened as fritz-hh/OCRmyPDF#98

It would be great to have an option to remove blank pages. I scan a lot of images with my duplex scanner, and not all scanned documents have a printed back side.

file size

ocrmypdf increases file size by about a factor of 4 (even more when using oversampling).
I assume this is because the graphic layer is created instead of using the original graphic. Correct?
Is it possible to force ocrmypdf to use the original graphics? (I do not understand issue #8, but the comment from kebekus sounds promising to me.)
If the graphics have to be generated because some information is missing, would it be possible to supply ocrmypdf with this information? For example, I know the scanning resolution, orientation, and page size, and I could provide them when calling ocrmypdf.

Thanks, Femi

OCRmyPDF silently fails on input filenames like uppercase *.PDF

My new duplex scanner is a BROTHER ADS-2600we. It generates PDFs (which are not compatible and also make convert fail). It can, however, generate PDF/A. The standard filenames have the form

[0-9]{8}.PDF

Example: 06091500.PDF, 06091501.PDF etc. for files scanned on 06. September 2015. These filenames (I don't like the format) cannot be changed in the scanner.

Problem:

When you start

ocrmypdf --verbose -L log.txt -l deu 06091500.PDF 06091500.ocr.pdf

this fails silently ("...No jobs were run because no file names matched.").

Workaround:
Rename files such as 06091500.PDF to x.pdf and then process them.
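
The "no file names matched" message suggests that the input name is compared against a case-sensitive pattern, although the report does not confirm this. A minimal sketch of a case-insensitive extension check; `is_pdf_name` is a hypothetical helper:

```python
from pathlib import Path

def is_pdf_name(filename):
    """Return True for names ending in .pdf, regardless of letter case."""
    return Path(filename).suffix.lower() == ".pdf"
```

With a check like this, `06091500.PDF` and `06091500.ocr.pdf` are both accepted, so renaming the scanner's files would no longer be necessary.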

bug when OCRing German "et cetera" ("&c.")

Issue by femifrak
Mon Nov 17 01:56:35 2014
Originally opened as fritz-hh/OCRmyPDF#100

In German Fraktur there exists a sign for "et cetera" which is OCR'ed as "&c." (see http://de.wikipedia.org/wiki/Et_cetera). In the hOCR file this somehow conflicts with the HTML markup and causes the program to exit.
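
A bare "&" followed by letters looks like the start of an entity reference to an HTML or XML parser, which is one plausible reason for the conflict. A minimal sketch of escaping recognized text before embedding it in markup, using the standard library; `hocr_word_span` is a hypothetical helper, and `ocrx_word` is the hOCR class commonly used for words:

```python
import html
import xml.etree.ElementTree as ET

def hocr_word_span(text):
    """Wrap recognized text in a span, escaping characters special to markup."""
    return '<span class="ocrx_word">{}</span>'.format(html.escape(text))

def parses_as_xml(markup):
    """Return True if `markup` is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

Here `hocr_word_span("&c.")` yields markup containing `&amp;c.`, which parses cleanly, whereas the unescaped `<span>&c.</span>` is rejected by the XML parser.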


