file size (ocrmypdf issue, closed)

femifrak commented on July 19, 2024
file size

from ocrmypdf.

Comments (14)

jbarlow83 commented on July 19, 2024

The graphics have to be generated because Tesseract works on images. A PDF
file itself does not have a resolution; it can be rendered at any
resolution.

So ocrmypdf tries to calculate the minimum resolution needed to render an
image without loss of information. But it cannot calculate this exactly –
for tricky situations it guesses wrong. So because of that you need to
oversample, to force it to use more resolution.
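
To make the point about resolution concrete, here is a minimal sketch (not taken from ocrmypdf; the function name is made up for illustration). It assumes the default PDF user-space unit of 1/72 inch and shows how the pixel size of a rendered page grows with the chosen resolution.

```python
def render_size_px(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at `dpi`.

    Assumes page dimensions are given in PDF points (1/72 inch).
    """
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


# US Letter is 612 x 792 pt (8.5 x 11 inch).
for dpi in (150, 300, 600):
    print(dpi, render_size_px(612, 792, dpi))
```

Doubling the resolution quadruples the number of pixels, which is one reason oversampling tends to inflate the output.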

ocrmypdf does rebuild the PDF. Depending on how the input was compressed,
it might be inflated, particularly if the input was monochrome JBIG2 or
CCITT since PNG encoding is not as efficient.

The idea discussed in #8 is to insert the text layer into the original
page. But doing that correctly is a low-level Postscript problem. It's
almost doable, but last I tried the Python PDF library didn't work as
advertised and caused problems.

On Tue, 12 Jan 2016 at 23:50 femifrak wrote:

ocrmypdf increases file size by about a factor of 4 (even more when using
oversampling).
I assume this is because the graphic layer is regenerated instead of
reusing the original graphic. Correct?
Is it possible to force ocrmypdf to use the original graphics? (I do not
understand issue #8, but
the comment from kebekus sounds promising to me.)
If the graphics have to be generated because of some missing information:
would it be possible to feed ocrmypdf this information? (E.g. I know the
scanning resolution, orientation, and page size, and I could provide this
information to ocrmypdf when calling it.)

Thanks, Femi



femifrak commented on July 19, 2024

Thanks for this competent explanation.

I tried optimizing the PDFs in Acrobat before and after OCR; Acrobat lets you optimize b/w and colour images separately. Before OCR, optimizing the b/w images was most effective; after OCR, optimizing the colour images was most effective. Can I conclude that ocrmypdf converts b/w images to colour images? And if so, can ocrmypdf be asked to keep the images b/w?


jbarlow83 commented on July 19, 2024

If it is converting b/w images to colour that is a bug. You can use the
program pdfimages from poppler-utils as a quick way to inspect the images
inside the PDF and see what formats were used in saving them. Acrobat's
Preflight also lets you examine this.
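
One way to script that inspection, as a rough sketch: the helper names below are hypothetical, and the parser only assumes that `pdfimages -list` prints a whitespace-separated table with a header row, a dashed separator line, and one row per image.

```python
import subprocess


def parse_pdfimages_list(text):
    """Turn the table printed by `pdfimages -list` into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-"}:  # skip the dashed separator line
            continue
        rows.append(dict(zip(header, ln.split())))
    return rows


def list_images(pdf_path):
    """Run pdfimages (from poppler-utils) and return one dict per image."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pdfimages_list(result.stdout)
```

Calling `list_images("out.pdf")` and looking at the `color`, `comp` and `bpc` entries of each row should show whether a page image ended up as monochrome, grayscale or RGB.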

ocrmypdf is supposed to select the lowest colorspace that will capture all
of the colours. That means if you have two images on a page, one color and
one b/w, it will treat the whole page as colour. It will try to use grayscale
for an all gray page. Scanners generally don't make multiple images on a
PDF page, but some PDF optimizers will split content into a B/W image and a
color or grayscale background image.
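
The selection rule described above could look roughly like this. This is an illustrative sketch, not ocrmypdf's actual code: the function names and the (components, bits per component) input are assumptions, and only the Ghostscript device names (`pngmono`, `pnggray`, `png16m`) are real. It also ignores indexed-colour images for simplicity.

```python
# Ghostscript PNG devices, ranked from least to most colour information.
DEVICE_RANK = {"pngmono": 0, "pnggray": 1, "png16m": 2}


def device_for_image(components, bits_per_component):
    """Smallest device that can represent one image (simplified)."""
    if components == 1 and bits_per_component == 1:
        return "pngmono"
    if components == 1:
        return "pnggray"
    return "png16m"


def device_for_page(images):
    """Pick one device for the whole page: the most demanding image wins."""
    devices = [device_for_image(c, b) for c, b in images]
    # With no image information, fall back to full colour (assumed default).
    return max(devices, key=DEVICE_RANK.get, default="png16m")


# A page holding one 1-bit image and one 8-bit RGB image is treated as colour.
print(device_for_page([(1, 1), (3, 8)]))  # png16m
```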

ocrmypdf cannot detect the case where an image was saved as color but no
colors were used. It's not a PDF optimizer.

You could override the color selection logic by forcing device = 'pngmono'
in ocrmypdf/main.py :: rasterize_with_ghostscript.


femifrak commented on July 19, 2024

Inspecting the original pdf and the ocr'ed pdf with pdfimages gives, respectively:

        original  ocr'ed
color:  icc       rgb
comp:   1         3
bpc:    1         8

Unfortunately, setting device = 'pngmono' in main.py does not reduce the file size; the result is still rgb, 3, 8.
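
The jump from comp 1 / bpc 1 to comp 3 / bpc 8 matters for size even before any compression is applied. A back-of-the-envelope sketch (the page dimensions are an assumed example, a US Letter page at 300 dpi, not the file discussed here; the function name is made up):

```python
def raw_image_bytes(width_px, height_px, components, bits_per_component):
    """Uncompressed size in bytes, with each row padded to a whole byte."""
    row_bytes = (width_px * components * bits_per_component + 7) // 8
    return row_bytes * height_px


mono = raw_image_bytes(2550, 3300, 1, 1)  # comp 1, bpc 1
rgb = raw_image_bytes(2550, 3300, 3, 8)   # comp 3, bpc 8
print(mono, rgb, round(rgb / mono, 1))
```

The RGB raster carries roughly 24 times as much raw data, so the encoder has far more to compress than a bilevel codec such as JBIG2 or CCITT would.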


jbarlow83 commented on July 19, 2024

I traced this to a regression in reportlab.

You could try --pdf-renderer tesseract, although I recommend using the
latest revision of tesseract. The most recent released version has other
bugs in its PDF output.


femifrak commented on July 19, 2024

What does "trace to a regression" mean? Does it mean reportlab has a bug?
Is there a workaround?

Thanks and kind regards,

Femi


jbarlow83 commented on July 19, 2024

Sorry for the vague answer.

Back in version 2.x of OCRmyPDF we did a bunch of testing of this case and
pretty much had it working, but now when I check the code for reportlab it
seems to me that it's not capable of doing the right thing. I don't know if
it actually was a regression or not.

In any case, I wrote a workaround that's been in demand for a while because
it has other benefits. In the latest development version, ocrmypdf will
try to extract the existing PDF page, insert a text layer, and then use
that for output. In this way, it keeps the original compression settings.
If you specify settings that are incompatible with this sequence because
they alter the image (--force, --deskew, --clean), then there may be
recompression, but it handles this better than reportlab does now.

You're welcome to try the development version: [354e619]


femifrak commented on July 19, 2024

I ran
git clone -b develop https://github.com/jbarlow83/OCRmyPDF.git
sudo pip3 install -e .
and
ocrmypdf in.pdf (--oversample 600) out.pdf

and the file size is only slightly larger than that of the original with monochrome images. That works; however, the images in out.pdf are no longer centered but shifted right and down (also when not using oversampling, but then the page size is incorrect).


jbarlow83 commented on July 19, 2024

My guess is that you cloned the develop branch but your system "ocrmypdf"
command points at the previously installed version. Try creating a virtual
environment and running it from inside that.

pyvenv ocrmypdf-dev
. ocrmypdf-dev/bin/activate
pip install git+https://github.com/jbarlow83/OCRmyPDF.git@354e61946e0ad7ec090189c609ebdb99824e1973

Run ocrmypdf -v 1 to see what it is doing. If the development branch is
active, select_image_layer is a new step in the pipeline; if it appears,
that confirms that you are using the right codebase.
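
A quick way to check which installation is in play, as a small sketch (the helper name is made up; it relies only on the standard library and on the fact that environments created by pyvenv/venv set `sys.base_prefix` differently from `sys.prefix`):

```python
import shutil
import sys


def in_virtualenv():
    """True when the interpreter runs inside a pyvenv/venv environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("inside a virtual environment:", in_virtualenv())
# Path of the ocrmypdf command the shell would run, or None if not on PATH.
print("ocrmypdf found at:", shutil.which("ocrmypdf"))
```

If the reported path points to /usr/local/bin rather than into the environment's bin directory, the system-wide install is still the one being run.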

On Sun, 17 Jan 2016 at 22:06 femifrak wrote:

As I could not find the files tests/test_hocrtransform.py and
tests/test_pageinfo.py,
I created them including the changes and changed test_main.py as well.

Unfortunately the output file is not affected by this.
Did I miss something?



femifrak commented on July 19, 2024

I am lost. I thought all relevant data are in the download directory.
"pip install git+" results in some exceptions. (Also with sudo.)

How can I uninstall ocrmypdf and any virtual environment and start from the beginning?

I tried "sudo pip3 uninstall ocrmypdf" but "Can't uninstall 'ocrmypdf'. No files were found to uninstall." although "pip3 list" lists ocrmypdf.

Or can I just delete all directories with ocrmypdf (download dir, /usr/local/lib/python3.4/dist-packages/ocrmypdf...,/usr/local/bin/ocrmypdf)?

Thanks, Femi


jbarlow83 commented on July 19, 2024

You might have an easier time installing v3.2rc1. At least that is a posted
version on PyPI and Github, which makes it easier to reference.

If you create a new virtual environment that pretty much is starting from
the beginning. virtualenvs do not automatically include third party
packages, just the Python standard library.

So pyvenv ocrmypdf-env, activate, pip install --upgrade pip, pip install
ocrmypdf==3.2rc1 and hope for the best.


femifrak commented on July 19, 2024

I deleted everything that has to do with ocrmypdf, upgraded pip and tried to install 3.2rc1, with the following result:

:~/tmp$ pip install ocrmypdf==3.2rc1
Collecting ocrmypdf==3.2rc1
  Downloading ocrmypdf-3.2rc1.tar.gz (19.6MB)
    100% |████████████████████████████████| 19.6MB 26kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-PjBf70/ocrmypdf/setup.py", line 7, in <module>
        from collections.abc import Mapping
    ImportError: No module named abc

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-PjBf70/ocrmypdf

I am not familiar with pip, pip3, git and so on. Can you please give me a hint?


jbarlow83 commented on July 19, 2024

3.2 is released now. Maybe try the docker version?
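
A note on the traceback above: `ImportError: No module named abc` is what Python 2 reports for `from collections.abc import Mapping`, because `collections.abc` only exists from Python 3.3 on. That suggests the `pip` that ran was tied to a Python 2 interpreter. This is an inference from the traceback, not something confirmed in this thread. A minimal check, with a made-up helper name:

```python
import sys


def has_collections_abc(version_info=sys.version_info):
    """collections.abc was added in Python 3.3."""
    return tuple(version_info[:2]) >= (3, 3)


# Shows which interpreter is running and whether it can import collections.abc.
print(sys.executable, has_collections_abc())
```

If that is the cause, running pip through the intended interpreter (for example `python3 -m pip install ...` inside the virtual environment) is a possible way around it.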


femifrak commented on July 19, 2024

thanks, works with 3.2.1
