Option to remove blank pages about ocrmypdf HOT 19 OPEN

ocrmypdf avatar ocrmypdf commented on August 20, 2024 7
Option to remove blank pages

from ocrmypdf.

Comments (19)

OCRmyPDF-issuebot avatar OCRmyPDF-issuebot commented on August 20, 2024

Comment by eloops
Wed Dec 10 11:49:27 2014


I've been modifying this for my own use. There is a specific program (here) called 'empty-page' that returns 0 or 1 depending on whether the page is blank. It works on PNM files as well as TIFF, and it is fast.

I inserted the following code in ocrPage.sh just after the conversion to .pnm:

# check to see if image is a blank page ... if so, delete it
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Detecting if blank page ..."
empty-page -i "$curImgPixmap" >/dev/null 2>&1
if [ $? -ne 1 ] && [ $KEEP_TMP -eq 0 ]; then
  [ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Deleting blank page and moving on."
  rm -f "$curOrigImg"*
  rm -f "$curHocr"
  rm -f "$curImgPixmap"
  rm -f "$curImgPixmapDeskewed"
  rm -f "$curImgPixmapClean"
  rm -f "$curImgInfo"
  exit 0
fi

OCRmyPDF-issuebot avatar OCRmyPDF-issuebot commented on August 20, 2024

Comment by zorglups
Fri Mar 13 21:51:50 2015


My 2 cents.
I scan most things with a duplex scanner too, and I will implement something to remove the last page if it is blank.

Since I want my PDFs to be replicas of the original documents, intermediate blank pages should remain for my archiving use.
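
A minimal sketch of the "drop only trailing blanks" idea above, assuming some detector has already produced one blank/not-blank flag per page. The function name `drop_trailing_blanks` is hypothetical and not part of OCRmyPDF; it also drops a run of several blank pages at the end, which goes slightly beyond "the last page".

```python
def drop_trailing_blanks(is_blank: list[bool]) -> list[int]:
    """Return the indices of pages to keep.

    Only blank pages at the very end are dropped; blank pages between
    content pages are kept so the PDF still mirrors the paper document.
    """
    last = len(is_blank)
    # Walk backwards past every blank page at the end of the document.
    while last > 0 and is_blank[last - 1]:
        last -= 1
    return list(range(last))


# Page 1 is blank but sits between content pages, so it is kept.
keep = drop_trailing_blanks([False, True, False, True, True])
```

Note that a document consisting only of blank pages would come back empty, so a caller would probably want to special-case that.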

OCRmyPDF-issuebot avatar OCRmyPDF-issuebot commented on August 20, 2024

Comment by Wikinaut
Tue Sep 8 13:47:03 2015


Just another idea:

Another (secondary) empty-page detection decision could be based on Tesseract's text output for that page (e.g. number of detected characters or the like).
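
A rough sketch of that secondary check, assuming the OCR text for the page is already available as a string. The name `looks_blank_from_text` and the default of 5 characters are illustrative assumptions, not anything Tesseract or OCRmyPDF defines.

```python
def looks_blank_from_text(ocr_text: str, min_chars: int = 5) -> bool:
    """Return True when OCR produced fewer than min_chars visible characters."""
    # Count everything that is not whitespace (spaces, newlines, form feeds).
    visible = sum(1 for ch in ocr_text if not ch.isspace())
    return visible < min_chars
```

This would only make sense as a secondary signal: a page containing nothing but a photo or a drawing would also yield no text.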

OCRmyPDF-issuebot avatar OCRmyPDF-issuebot commented on August 20, 2024

Comment by Wikinaut
Tue Sep 8 13:51:26 2015


and http://superuser.com/questions/343385/detecting-blank-image-files shows alternative solutions for blank page detection.

modulexcite avatar modulexcite commented on August 20, 2024

@jbarlow83 thanks for an excellent tool and for sharing with the community. you 🤘
Is this feature request still being worked on?

sojusnik avatar sojusnik commented on August 20, 2024

I've discovered a script that uses ghostscript for blank page detection.

Maybe it will be of use to you and easily integrable into ocrmypdf.

WillemJansen avatar WillemJansen commented on August 20, 2024

+1 for this one - currently I simply do a preprocessing for that.

svenihoney avatar svenihoney commented on August 20, 2024

@jbarlow83 Thanks too for this great tool. I used my own script since I was not aware of this amazing piece of software.

I would like to push this one with my ideas as well. Since I am not too familiar with python, pipeline and leptonica I am unfortunately unable to implement it by myself. But I would like to share my ideas about this, since all the basics are available:

An empty page could be detected by calculating the ratio of black pixels to white pixels. If the ratio is below e.g. 0.005, the page is considered blank.

Calculating the ratio could be done with ghostscript's inkcov option as in the script above or even easier with leptonica: Use a 1bpp representation of the page and pixCountPixels to determine the black pixels, divided by the number of pixels of the page.

So the todo would be:

  • An option --remove-empty-pages to enable the page removal process
  • An option --empty-page-threshold 0.005 to be able to modify the threshold for the fraction of black pixels in the page
  • Create a task_remove_empty_page which calls the function to determine if the page is empty. Either the task removes the PNG or it stops the pipeline; that's where I don't know enough of the ruffus magic. Maybe something similar to the ocr_or_skip task.
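
The threshold test described above can be sketched in a few lines. This is only an illustration: it assumes the page has already been converted to a 1bpp bitmap and loaded as rows of 0 (white) and 1 (black), and the names `ink_fraction` and `is_blank` are hypothetical. It uses black pixels divided by all pixels, as in the leptonica variant.

```python
def ink_fraction(bitmap: list[list[int]]) -> float:
    """Fraction of pixels that are black (1) in a 1bpp page bitmap."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in bitmap)
    return black / total


def is_blank(bitmap: list[list[int]], threshold: float = 0.005) -> bool:
    """Treat the page as blank when the ink fraction is below the threshold."""
    return ink_fraction(bitmap) < threshold


# A 100x100 white page with ten stray black pixels: 10 / 10000 = 0.001.
page = [[0] * 100 for _ in range(100)]
for i in range(10):
    page[i][i] = 1
```

With the default threshold this page counts as blank, while a page with 100 black pixels (fraction 0.01) does not.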

jbarlow83 avatar jbarlow83 commented on August 20, 2024

I'm close to releasing a new version, most of which is in the api branch. It should make this sort of thing easier, since there will be a sort of plugin interface where various steps can be customized externally. I don't yet have an extension point that would change the page count, but that's at least the place to look for targeting this change.

A separate issue is actually getting a good blank page detector. My scanner has a blank page detector based on a threshold but I've turned it off because it's so unreliable. I can tell you from experience that a single threshold counting based blank page detector is more trouble than it's worth on real documents due to false negatives (i.e. discarding a useful page that wasn't blank).

Looking at black and white only, the detector ought to put more weight on the center of a page being blank and less on the margins or the typical locations of hole punches and staples. A single mark like a page number at the bottom of a blank page should not cause removal. Paper with grainy texture will tend to scan with a lot of "salt and pepper" noise, but should still be considered blank. Any thresholds need to be scaled reasonably for documents with large page sizes. And it should work consistently for grayscale and color pages, with the unique cases that brings: bleed-through from the previous page, very faded pages, and multiple colors being indicative of content.
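
To make two of these requirements concrete, here is a rough black-and-white-only sketch that ignores a margin band around the page and discards isolated "salt and pepper" pixels before measuring ink. It is not a proposed algorithm for OCRmyPDF: the names, the 10% margin, and the 8-neighbor rule are all assumptions, and it does nothing for grayscale, color, or bleed-through.

```python
def has_black_neighbor(bitmap: list[list[int]], x: int, y: int) -> bool:
    """True if any of the 8 pixels around (x, y) is black."""
    height, width = len(bitmap), len(bitmap[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width and bitmap[ny][nx]:
                return True
    return False


def central_ink_fraction(bitmap: list[list[int]], margin: float = 0.1) -> float:
    """Ink fraction of the page interior, skipping margins and lone specks.

    Rows and columns are 0 (white) or 1 (black). The margin is a fraction
    of each dimension, so the result does not depend on the page size.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    top, left = int(height * margin), int(width * margin)
    black = total = 0
    for y in range(top, height - top):
        for x in range(left, width - left):
            total += 1
            # A black pixel with no black neighbor is treated as noise.
            if bitmap[y][x] and has_black_neighbor(bitmap, x, y):
                black += 1
    return black / total if total else 0.0
```

A hole punch in the left margin or a page number in the bottom margin contributes nothing here, while a block of text in the middle of the page does. A single threshold on this value would still have the reliability problems described above.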

It would be worth seeing if there's anything in the literature. I'd recommend starting there if you're interested in working on this and seeing if we can get a good algorithm. Using unpaper is one possibility.

enterframe avatar enterframe commented on August 20, 2024

I have the case that OCRmyPDF is stopping when it detects a blank page, page 6 here:

➜ OCRmyPDF-LOG:

INFO - reading file from standard input
INFO - Start processing 2 pages concurrently
INFO - 2: page is facing ⇧, confidence 13.48 - no change
INFO - 1: page is facing ⇧, confidence 12.45 - no change
INFO - 3: page is facing ⇧, confidence 14.77 - rotation appears correct
INFO - 4: page is facing ⇧, confidence 16.16 - rotation appears correct
WARNING - 6: [tesseract] Warning. Invalid resolution 0 dpi. Using 70 instead.
INFO - 6: [tesseract] Too few characters. Skipping this page
ERROR - 6: [tesseract] Error during processing.
INFO - 6: page is facing ⇧, confidence 0.00 - no change
INFO - 5: page is facing ⇧, confidence 11.54 - no change
INFO - Optimize ratio: 1.00 savings: -0.1%
INFO - Image optimization did not improve the file - discarded
INFO - Output sent to stdout

← OCRmyPDF-LOG-END

any hints?

Sebastian

jbarlow83 avatar jbarlow83 commented on August 20, 2024

@enterframe That message simply says that too few characters were recognized on a particular page, so Tesseract assumed that none of them were valid. It did not stop processing; it just did not find anything it was confident was text. It appears that the file was created successfully ("sent to stdout").

disaster123 avatar disaster123 commented on August 20, 2024

+1 on this

disaster123 avatar disaster123 commented on August 20, 2024

Does anybody have a workaround?

CWempe avatar CWempe commented on August 20, 2024

Just a thought:

Once there is a feature to detect blank pages, it might come in handy to optionally replace a "physical blank" page with a "digital blank" page instead of removing it.

As @zorglups stated, there might be reasons to keep blank pages, such as archiving, or keeping the order of even and odd pages when displaying them side by side in a PDF viewer.

The benefit of a "digital blank" page would be a much smaller file size (almost zero) compared to a "physical blank" page that might have a slightly gray shadow for example.

jbarlow83 avatar jbarlow83 commented on August 20, 2024

I agree - digital blank has significant advantages in most cases.

If only there were a reliable algorithm for blank page detection.... I think it may be a machine learning problem.

vlad12244 avatar vlad12244 commented on August 20, 2024

You can try my "Noora PDF" software project. It has AI inside, and I trained it on some scanned pages with punch holes. Maybe this will work for you: https://www.softpedia.com/get/Office-tools/PDF/Zautin-Simple-PDF-Watermark.shtml
It is free, but you can Donate :)

lecramr avatar lecramr commented on August 20, 2024

+1

patric-r avatar patric-r commented on August 20, 2024

+1

kidexx avatar kidexx commented on August 20, 2024

+1
