
Comments (14)

OCRmyPDF-issuebot commented on July 19, 2024

Comment by zorglups
Tue Mar 10 21:32:25 2015


My scanner produces multiple JPG files per document, and I'm adapting the script to handle multiple image files when creating the final PDF.
When I get something working, I'll post a comment here.

from ocrmypdf.

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Tue Mar 10 21:35:37 2015


Hi. Currently you need to convert the images to PDF first (e.g. using Ghostscript or ImageMagick).
Note: A ticket for this enhancement already exists (#17), but it is still open.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by geaplanet
Sat Mar 14 09:02:37 2015


Converting the images to PDF first is not an option because I have books in TIFF format and the total size is about 5 GB!


OCRmyPDF-issuebot commented on July 19, 2024

Comment by zorglups
Sun Mar 29 21:10:57 2015


@geaplanet

I have a need for this too. We are first translating OCRmyPDF.sh into Python, and I will then add support for ingesting images (possibly multiple ordered images) instead of a PDF.

Could you attach a TIFF image here so I can make sure I test one of yours as well?


OCRmyPDF-issuebot commented on July 19, 2024

Comment by zorglups
Sun Mar 29 21:18:18 2015


IMHO this is a duplicate of #17 and could therefore be closed.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by geaplanet
Sun Apr 12 10:54:46 2015


I get this when I try to use the script:


./OCRmyPDF.sh -v -l spa 010.tif 010.pdf
When using programs that use GNU Parallel to process data for publication please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
Or you can get GNU Parallel without this requirement by paying 10000 EUR.

To silence this citation notice run 'parallel --bibtex' once or use '--no-notice'.

Page 0001: Expecting exactly 1 image covering the whole page (found 0). Cannot compute dpi value.
Page 0001: Continuing anyway, assuming a default resolution of 300 dpi
Could not extract page 0001 as ppm from "/tmp/pruebas/010.tif". Exiting...
Error: May not be a PDF file (continuing anyway)
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Error: May not be a PDF file (continuing anyway)
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
Error: May not be a PDF file (continuing anyway)
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
parallel: Starting no more jobs. Waiting for 1 jobs to finish. This job failed:
./src/ocrPage.sh /tmp/pruebas/010.tif 0001\ 6373\ 7999 0001 /tmp/tmp.PLXzCNYRF2 1 spa 0 0 0 0 0 0 0 0


Is it possible to pass a batch of files as arguments? I tried *.tif, but it doesn't work.
I have tested with jpg and tif files.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by jbarlow83
Sun Apr 12 21:09:52 2015


The input has to be a PDF. Conversion of raw images is a frequently requested feature.

You can use tiff2pdf (part of the libtiff command-line tools) or the Python package img2pdf to do batch command-line conversion of images to PDF.
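
As a rough illustration of the img2pdf route, the sketch below combines a set of single-frame images into one PDF, one image per page. The helper names `natural_key` and `images_to_pdf` are made up for this example, and it assumes img2pdf is installed and that file names carry the page order (so that `page2` should come before `page10`). It is one possible approach, not what OCRmyPDF does internally.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'page2.tif' before 'page10.tif'."""
    # re.split with a capture group alternates text and digit runs,
    # so odd positions are always the numeric parts.
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def images_to_pdf(image_paths, output_pdf):
    """Write all images into a single PDF, in natural file-name order."""
    import img2pdf  # third-party package, imported lazily: pip install img2pdf

    ordered = sorted((str(p) for p in image_paths), key=natural_key)
    if not ordered:
        raise ValueError("no input images given")
    with open(output_pdf, "wb") as f:
        # img2pdf.convert accepts a list of file names and returns PDF bytes.
        f.write(img2pdf.convert(ordered))
    return ordered
```

A call such as `images_to_pdf(Path("scans").glob("*.jpg"), "book.pdf")` (the `scans` directory is hypothetical) would produce a PDF that can then be passed to OCRmyPDF as usual.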


jbarlow83 commented on July 19, 2024

For anyone looking, this issue is still open because of the difficulty of successfully converting images to a PDF.

img2pdf does a good job for most common, single-frame images that can be parsed by PIL, provided the image is monochrome, grayscale or 8-bit RGB. It does not handle most unusual file types, however, and does not support multipage TIFFs. Unfortunately this useful program is no longer maintained. I have a public fork that implements some fixes that prevent it from producing invalid PDFs for very large images.

tiff2pdf fails catastrophically for uncommon TIFFs, as in it silently produces a valid PDF that in no way resembles the input. Earlier versions produce malformed PDFs in which the color is interpreted incorrectly by some viewers but not others.


Daniel-KM commented on July 19, 2024

Hi,

I implemented this feature in an old shell version of OCRmyPDF (see Daniel-KM/OCRmyPDF@8da737d), and now I need it with the current version. So, do you have a plan to implement it?

As for img2pdf, it's no longer maintained on GitHub ("free software needs free tools"), but it is maintained at https://gitlab.mister-muffin.de/josch/img2pdf and there is a recent package for Debian. It's a dependency of OCRmyPDF in the latest Debian release. Anyway, it's not required: ImageMagick can be used instead.

There are two steps:

  • define the arguments that differentiate input from output: add an -o / --output argument and/or an --input argument (-i is already used), so that a list of one-page PDF files can be given as input;
  • allow any type of image as input (one-page PDF, JPG, PNG, etc.) and let the program handle them automatically.

Sincerely,

Daniel Berthereau
Infodoc & Knowledge management


jbarlow83 commented on July 19, 2024

My older comment is no longer accurate. I've since worked with img2pdf's creator on some bug fixes in the GitLab version of that tool, and incidentally ocrmypdf now uses img2pdf to generate single-page PDFs internally in some execution paths.

I haven't implemented this feature yet because I really like things to just work when I try to use them. Image to PDF conversion is hard to get right, given the variation of images in the wild, poor handling and outright disregard for image metadata that is sometimes essential for PDFs (DPI, rotation, colorspaces), and high complexity of TIFFs and the various half-baked scanner drivers that create degenerate image files and unleash them on the world. Sometimes you arrive at a situation where you need additional input from the user, like if the DPI is a ridiculous number. Because of the difficulty in getting it right, I recommend people first convert all their images to PDF, review them to be sure they are happy with them, and then do OCR.

That said, I think there's an argument for handling lists of single-page images if img2pdf can handle them, and bailing out if the user asks for something that really should be handled by a dedicated image-to-PDF converter. I'd certainly welcome a PR if you're willing to contribute.

I would want ocrmypdf to make an effort at interpreting the user's arguments if there is a reasonable interpretation. If there is an --output then that is the output file and all others are input; if not, then treat the last filename in the list as the output file provided it ends with .pdf and does not exist, similar to how cp a b c dir will copy files a, b, c to dir/ if dir/ exists. That maintains backward compatibility with the current syntax, the only change being it won't overwrite existing files.
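
The rule above could be sketched roughly as follows. The function name `split_inputs_and_output` and the injectable `exists` parameter are hypothetical and only there to make the example testable; this is an illustration of the proposal, not code from ocrmypdf.

```python
import os


def split_inputs_and_output(filenames, output=None, exists=os.path.exists):
    """Return (inputs, output) following the rule proposed in this comment."""
    filenames = list(filenames)
    if output is not None:
        # Explicit --output: every positional filename is an input.
        if not filenames:
            raise ValueError("need at least one input file")
        return filenames, output

    # No --output: the last filename is the output, cp-style.
    if len(filenames) < 2:
        raise ValueError("need at least one input file and one output file")
    *inputs, last = filenames
    if not last.lower().endswith(".pdf"):
        raise ValueError(f"last argument {last!r} does not end with .pdf")
    if exists(last):
        # The one behavior change: never overwrite an existing file.
        raise ValueError(f"refusing to overwrite existing file {last!r}")
    return inputs, last
```

With this sketch, `split_inputs_and_output(["in.pdf", "out.pdf"])` still resolves the current two-argument syntax, as long as `out.pdf` does not already exist.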


Daniel-KM commented on July 19, 2024

So when the input images have no DPI, no rotation information, or no standard colorspace, ocrmypdf should stop and warn the user either to supply this information as arguments (in what format?) or to correct the images (simpler to understand, and no new arguments needed). These checks can be a first step before running the OCR (via identify?). If the images are fine, the process can use img2pdf to convert them.


jbarlow83 commented on July 19, 2024

Pillow, an existing dependency, can be used to test if images are usable. img2pdf also uses Pillow, so img2pdf supports a subset of images supported by Pillow.

For the other cases I would implement the following, although other cases may come up:

  • warn if DPI < 150 or DPI > 1200
  • error if DPI is very small, very large, or implies an invalid PDF page size (200" maximum)
  • ignore EXIF rotation
  • assume a 3-channel image with no colorspace is sRGB, but print a warning
  • error out for 4-channel images with no colorspace

Are you planning to work on this?
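
A minimal sketch of the checks listed above, written as a pure function so it does not depend on any image library. The name `check_image` and its parameters are hypothetical. The thread does not say what counts as a "very small" or "very large" DPI, so this sketch only enforces the 200" page-size limit as a hard error, and it treats a missing DPI as an error, which is an assumption. EXIF rotation is simply not consulted.

```python
MAX_PAGE_INCHES = 200  # maximum PDF page dimension quoted in the list above


def check_image(width_px, height_px, dpi, channels, has_colorspace):
    """Return a list of warning strings; raise ValueError if unusable."""
    warnings = []

    # Assumption: a missing or non-positive DPI is treated as an error.
    if dpi is None or dpi <= 0:
        raise ValueError("image has no usable DPI; supply one explicitly")

    # Error if the DPI implies a page larger than PDF allows.
    longest_side_in = max(width_px, height_px) / dpi
    if longest_side_in > MAX_PAGE_INCHES:
        raise ValueError(
            f"DPI {dpi} implies a {longest_side_in:.0f} inch page "
            f"(limit is {MAX_PAGE_INCHES} inches)"
        )

    # Warn on DPI values outside the usual scanning range.
    if dpi < 150 or dpi > 1200:
        warnings.append(f"unusual DPI {dpi}; expected 150 to 1200")

    if not has_colorspace:
        if channels == 4:
            raise ValueError("4-channel image with no colorspace")
        if channels == 3:
            warnings.append("no colorspace; assuming sRGB")

    return warnings
```

In practice the inputs would probably come from Pillow, for example `Image.size`, `Image.info.get("dpi")` (a pair of values for many formats) and `len(Image.getbands())`, but how to map those onto this function is left open here.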


Daniel-KM commented on July 19, 2024

Sorry, I have no time until September, because this tool is a subtool of other projects and I use an intermediate script.


jbarlow83 commented on July 19, 2024

ocrmypdf will now do this for a single image at a time as a convenience and with --image-dpi to change the DPI, but I'm still recommending people use img2pdf for anything more complex.

(img2pdf does a better job than ImageMagick. It's faster and it does not recompress images.)
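
To make the recommendation concrete, here is a small sketch that picks between the two routes: a single image goes straight to ocrmypdf (optionally with --image-dpi), while several images are first combined with the img2pdf command-line tool. The helper names `plan_commands` and `run` and the intermediate file name `combined.pdf` are hypothetical, and both programs are assumed to be on the PATH.

```python
import subprocess


def plan_commands(images, output_pdf, image_dpi=None, combined_pdf="combined.pdf"):
    """Return the argv lists to run, in order, without executing them."""
    images = list(images)
    if not images:
        raise ValueError("no input images")

    if len(images) == 1:
        # Single image: ocrmypdf can take it directly.
        cmd = ["ocrmypdf"]
        if image_dpi is not None:
            cmd += ["--image-dpi", str(image_dpi)]
        return [cmd + [images[0], output_pdf]]

    # Several images: build a PDF with img2pdf first, then OCR that PDF.
    return [
        ["img2pdf", *images, "-o", combined_pdf],
        ["ocrmypdf", combined_pdf, output_pdf],
    ]


def run(commands):
    """Execute each planned command, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping planning separate from execution means the intermediate PDF can be reviewed before the OCR step is run, in line with the advice earlier in the thread.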

