despeck's Introduction

Despeck

Remove unwanted stamps or watermarks from scanned images

despeck is a Ruby gem that helps you remove unwanted stamps or watermarks from scanned images/PDFs, primarily prior to OCR.

Its image processing operations are based on libvips, accessed through the ruby-vips bindings.

It can be used to:

  • detect uniform watermarks from a series of images,

  • output a watermark pattern file (image, mask) that describes a watermark pattern, and

  • remove a specified watermark pattern from input images regardless of the location of the watermark on these images.

Assumptions on input:

  • The input may be a single image, or a PDF of multiple pages of images.

  • In the case of multiple pages, not all pages may have the watermark.

  • The input images are assumed to be purely monochrome text-based.

  • The watermarks are colored. For example, if the watermark is a “GREEN SQUARE PATTERN”, despeck will attempt to detect this pattern on every page that contains it and remove it.

Installation

General

Install gem manually:

$ gem install despeck

Or add it to your Gemfile:

gem 'despeck'

and then run bundle install

Prerequisites

Despeck depends on libvips for its functionality, so libvips must be installed before you can use Despeck.

MacOS

$ brew install vips

Ubuntu/Debian

$ apt install libvips libvips-dev libvips-tools

PDF functionality

To remove watermarks from PDF files, your libvips must be built with PDF support.

Make sure you have PDFium or poppler-glib installed before building libvips. If you’re using Homebrew to install libvips, it should work by default.

Details on how to configure a libvips installation can be found at: https://libvips.github.io/libvips/install.html

OCR functionality

To extract text with the despeck ocr command, you’ll need to install:

  • Tesseract (3.x)

  • ImageMagick (6.x)

  • Desired languages

MacOS

To install Tesseract itself (with all languages pre-installed):

$ brew install tesseract --all-languages

Or you can install Tesseract with some languages manually:

$ brew install tesseract
$ mkdir -p ~/Downloads/tessdata
$ cd ~/Downloads/tessdata
$ wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/chi_sim.traineddata

To install ImageMagick:

$ brew install imagemagick@6
$ echo 'export PATH="/usr/local/opt/imagemagick@6/bin:$PATH"' >> ~/.bash_profile
$ export PKG_CONFIG_PATH=/usr/local/opt/imagemagick@6/lib/pkgconfig
$ brew link --force imagemagick@6

The full list of trained language data files is available in the tesseract-ocr/tessdata repository (note that the files differ between Tesseract versions).

Ubuntu/Debian

$ apt-get install tesseract-ocr tesseract-ocr-chi-sim imagemagick

FAQ

I’m getting the following error:

'convert': No such file or directory @ rb_sysopen - /var/folders/2t/xmdrn2sd2lv2w49dv0zw9_q00000gp/T/1521805124.661379908.txt (RTesseract::ConversionError)

This error means you don’t have the appropriate Tesseract language installed (or Tesseract is unable to find that language). See language installation instructions above.

Usage (Command Line)

Getting help:

# Show general help
despeck -h
# Show help for a specific subcommand
despeck remove -h
despeck ocr -h
despeck despeck -h

All-in-one (aka Despeck)

If you need to remove a watermark and extract the text via OCR in one step, you can use:

$ bundle exec despeck despeck -l chi_sim input.jpg

This is equivalent to the following two commands:

$ bundle exec despeck remove input.jpg output.jpg
$ bundle exec despeck ocr -l chi_sim output.jpg

Remove watermark

To remove a watermark:

$ despeck remove /path/to/input.jpg /path/to/output.jpg

With the command above, Despeck will try to find the watermark colour and apply the best filter settings to remove the watermark. Its guess may be wrong, so you can pass several parameters to help it:

$ despeck remove --color 00FF00 --sensitivity 120 --black-const -60 --add-contrast /path/to/input.pdf /path/to/output.pdf

A list of available options:

  • --color 00FF00 - tells Despeck that the watermark is approximately green.

  • --sensitivity 120 - increases sensitivity (use this if the watermark is still visible with the default of 100).

  • --black-const -60 - by default, Despeck tries to improve text quality by increasing black by -110. This may be too much for your images, in which case you can use a smaller magnitude such as -60.

  • --add-contrast - disabled by default, increases output image’s contrast.

  • --accurate - disabled by default. Applies filters to the area with watermark only, preserving the rest of the image untouched.

  • --debug - shows debug information during command execution.
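
When the command is driven from a Ruby script, the options above can be assembled programmatically. The following is a minimal sketch and not part of despeck: the helper name despeck_remove_args is hypothetical, and it relies only on the flags listed above.

```ruby
# Hypothetical helper (not provided by despeck): builds the argument list
# for `despeck remove` from keyword options, using only documented flags.
def despeck_remove_args(input, output, color: nil, sensitivity: nil,
                        black_const: nil, add_contrast: false,
                        accurate: false, debug: false)
  args = ['despeck', 'remove']
  args += ['--color', color] if color
  args += ['--sensitivity', sensitivity.to_s] if sensitivity
  args += ['--black-const', black_const.to_s] if black_const
  args << '--add-contrast' if add_contrast
  args << '--accurate' if accurate
  args << '--debug' if debug
  # Input and output paths are positional and come last.
  args + [input, output]
end

if __FILE__ == $PROGRAM_NAME
  cmd = despeck_remove_args('/path/to/input.pdf', '/path/to/output.pdf',
                            color: '00FF00', sensitivity: 120,
                            black_const: -60, add_contrast: true)
  # Print the command; pass the array to system(*cmd) to actually run it.
  puts cmd.join(' ')
end
```

Passing the array to system(*cmd) instead of a single string avoids shell quoting problems with paths that contain spaces.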

"Accurate" option

By default, despeck applies colour filters to the entire image and tries to improve the quality of the image by increasing contrast and cleaning the image.

This may decrease the original image quality in some cases, so there is the --accurate option, which forces despeck to apply its filters only to the area where the watermark was found, leaving the rest of the image intact.

For example:

Original image
Despecked with default options
Despecked with --accurate option

Usage (Ruby API)

(still under development)

wr = Despeck::WatermarkRemover.new(black_const: -90, resize: 0.01)
# => #<Despeck::WatermarkRemover:0x007f935b5a1a68 @add_contrast=true, @black_const=-110, @watermark_color=nil, @resize=0.1, @sensitivity=100>
image = Vips::Image.new_from_file("/path/to/image.jpg")
# => #<Image 4816x6900 uchar, 3 bands, srgb>
output_image = wr.remove_watermark(image)
# => #<Image 4816x6900 float, 3 bands, b-w>
output_image.write_to_file('/path/to/output.jpg')

despeck's People

Contributors

barockok, nattfodd, ronaldtse

despeck's Issues

how to write shell script of "removing watermark and ocr" in automator quick action?

I have tested all of despeck's features: removing watermarks, OCR, and extracting text from PDFs. Now I want to use these features from an Automator Quick Action. Running them one file at a time takes too long for bulk operations, so I want a one-click operation. Can you provide a shell script for removing the watermark, running OCR, and extracting text from a PDF? My actual problem is how to get the input file name and how to assign the output name in a shell script. My attempt fails with a "too many arguments" error. Here is the script:

source ~/.bash_profile
for f in "$@"
do
echo "Removing Watermark: $FILE"
EXT=${FILE##.}
despeck remove "$FILE" -o "${FILE/%.$EXT/.
}"
done
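
One possible way to handle the file names is sketched below in Ruby, assuming the Quick Action passes the selected files as arguments. Two things stand out in the script above: the loop variable is f but the body reads $FILE, and the usage section gives the output path to despeck remove as a second positional argument rather than through -o. The helper names despecked_path and remove_commands and the "-despecked" suffix are hypothetical choices made for illustration.

```ruby
# Hypothetical helper: places "<name>-despecked.<ext>" next to the input file.
def despecked_path(input)
  ext = File.extname(input)            # e.g. ".pdf" (empty if no extension)
  base = File.basename(input, ext)     # file name without the extension
  File.join(File.dirname(input), "#{base}-despecked#{ext}")
end

# Builds one `despeck remove INPUT OUTPUT` command per input path.
def remove_commands(paths)
  paths.map { |path| ['despeck', 'remove', path, despecked_path(path)] }
end

if __FILE__ == $PROGRAM_NAME
  remove_commands(ARGV).each do |cmd|
    puts "Removing watermark: #{cmd[2]}"
    # The array form of system avoids quoting problems with spaces in paths.
    system(*cmd) or warn "despeck failed for #{cmd[2]}"
  end
end
```

Because each path is passed to system as a separate array element, file names containing spaces do not need extra quoting.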

Require Ruby 2.4+

Rubocop has dropped support of 2.3 since 0.8.1. We will upgrade to Ruby 2.4+.

The last version that supports 2.3 is 0.4.2. 0.5+ will no longer support 2.3.

Installation error using gem

I have the following error when installing despeck with gem install despeck

Traceback (most recent call last):
12: from /usr/local/bin/despeck:23:in `<main>'
11: from /usr/local/bin/despeck:23:in `load'
10: from /var/lib/gems/2.5.0/gems/despeck-0.5.0/exe/despeck:5:in `<top (required)>'
9: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
8: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
7: from /var/lib/gems/2.5.0/gems/despeck-0.5.0/lib/despeck.rb:7:in `<top (required)>'
6: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
5: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require'
4: from /var/lib/gems/2.5.0/gems/ruby-vips-2.0.17/lib/vips.rb:525:in `<top (required)>'
3: from /var/lib/gems/2.5.0/gems/ruby-vips-2.0.17/lib/vips.rb:528:in `<module:Vips>'
2: from /var/lib/gems/2.5.0/gems/ffi-1.13.1/lib/ffi/library.rb:99:in `ffi_lib'
1: from /var/lib/gems/2.5.0/gems/ffi-1.13.1/lib/ffi/library.rb:99:in `map'
/var/lib/gems/2.5.0/gems/ffi-1.13.1/lib/ffi/library.rb:145:in `block in ffi_lib': Could not open library 'vips.so.42': vips.so.42: cannot open shared object file: No such file or directory. (LoadError)
Could not open library 'libvips.so.42': libvips.so.42: cannot open shared object file: No such file or directory

Watermark detection across multiple pages

In some cases, a watermark is placed at different locations and/or on different pages. For example, a watermark placed on a blank page gives the full specification of the watermark, which can be extracted and used to remove watermarks on other pages (e.g., when a watermark is obstructed by text).

How do we best support this:

  1. Ask the user to specify a full watermark (i.e., watermark placed on a blank page)
  2. Automatically detect the best watermark amongst the series of pages, allow user to extract the watermark (as an image or specification) for future use?

Here's a sample attached.

when using '--sensitivity', somehow the program recognizes it as '--add-contrast'

I was testing what '--add-contrast' would give me, but after I used this, '--sensitivity' seems to be somehow changed to '--add-contrast'. It was working fine before but now whenever I tried to use '--sensitivity' it is giving me the result '--add-contrast' would generate. Really strange. Tried to restart computer but it didn't help.

remove

How exactly do you "remove a specified watermark pattern from input images regardless of the location of the watermark on these images" ?

undefined method `write_to_file'

Greetings!

I'm using despeck 0.3.0 with ruby 2.7.0, and after checking and rechecking any dependencies, I get this error when trying to use despeck remove with some PDF files:

/var/lib/gems/2.7.0/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:49:in `block in for_each_image_file': undefined method `write_to_file' for nil:NilClass (NoMethodError)

As I'm not completely familiarized with Ruby yet, I'm not sure whether the problem is in the write_to_file method, or in that the program is not assigning the proper class to each image (and so it cannot apply the write_to_file).

Any ideas what should I do to fix this?

Thanks!

VipsOperation: class "pdfload" not found error arises while converting pdf

Running despeck remove on a PDF fails with the error VipsOperation: class "pdfload" not found.

The command was run from the terminal as follows:

despeck remove --color 999999 --sensitivity 120 --black-const -60 --add-contrast '/home/Desktop/Doc-Test-3.pdf' '/home/dl64/Desktop/output_44.pdf'

The error log is as follows:

  18: from /home/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `<main>'
  17: from /home/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `eval'
  16: from /home/.rvm/gems/ruby-2.5.3/bin/despeck:23:in `<main>'
  15: from /home/.rvm/gems/ruby-2.5.3/bin/despeck:23:in `load'
  14: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/bin/despeck:8:in `<top (required)>'
  13: from /home/.rvm/gems/ruby-2.5.3/gems/clamp-1.3.0/lib/clamp/command.rb:140:in `run'
  12: from /home/.rvm/gems/ruby-2.5.3/gems/clamp-1.3.0/lib/clamp/command.rb:66:in `run'
  11: from /home/.rvm/gems/ruby-2.5.3/gems/clamp-1.3.0/lib/clamp/subcommand/execution.rb:18:in `execute'
  10: from /home/.rvm/gems/ruby-2.5.3/gems/clamp-1.3.0/lib/clamp/command.rb:66:in `run'
   9: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/commands/remove.rb:46:in `execute'
   8: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:12:in `pdf_to_images'
   7: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:39:in `for_each_page'
   6: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:39:in `times'
   5: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:40:in `block in for_each_page'
   4: from /home/.rvm/gems/ruby-2.5.3/gems/despeck-0.3.0/lib/despeck/pdf_tools.rb:13:in `block in pdf_to_images'
   3: from /home/.rvm/gems/ruby-2.5.3/gems/ruby-vips-2.0.13/lib/vips/image.rb:215:in `method_missing'
   2: from /home/.rvm/gems/ruby-2.5.3/gems/ruby-vips-2.0.13/lib/vips/operation.rb:232:in `call'
   1: from /home/.rvm/gems/ruby-2.5.3/gems/ruby-vips-2.0.13/lib/vips/operation.rb:232:in `new'
/home/.rvm/gems/ruby-2.5.3/gems/ruby-vips-2.0.13/lib/vips/operation.rb:65:in `initialize': VipsOperation: class "pdfload" not found (Vips::Error)

Thanks

`despeck ocr` usage issue

I just ran this in the despeck gem folder but received an error:

bin/despeck ocr -l chi_sim spec/fixtures/red_watermark.jpg

=>

gems/2.3.0/gems/rtesseract-2.2.0/lib/rtesseract.rb:187:in `convert': No such file or directory @ rb_sysopen - /var/folders/2t/xmdrn2sd2lv2w49dv0zw9_q00000gp/T/1521805124.661379908.txt (RTesseract::ConversionError)
	from gems/2.3.0/gems/rtesseract-2.2.0/lib/rtesseract.rb:199:in `to_s'
	from /Users/mulgogi/src/despeck/lib/despeck/ocr.rb:13:in `text'
	from /Users/mulgogi/src/despeck/lib/commands/ocr.rb:15:in `execute'
	from gems/2.3.0/gems/clamp-1.2.1/lib/clamp/command.rb:63:in `run'
	from gems/2.3.0/gems/clamp-1.2.1/lib/clamp/subcommand/execution.rb:11:in `execute'
	from gems/2.3.0/gems/clamp-1.2.1/lib/clamp/command.rb:63:in `run'
	from gems/2.3.0/gems/clamp-1.2.1/lib/clamp/command.rb:132:in `run'
	from bin/despeck:7:in `<main>'

Need help in installing despeck

I'm new to Ruby and am trying to install 'despeck', but I keep getting this error (an image is provided). I even tried to install 'rmagick', but I still get the error. How can I find out which libraries are needed? I've also installed 'ruby-vips', but it is still not working.

clamp ~> 1.2
pdf-reader ~> 2.1
prawn ~> 2.2
rtesseract ~> 2.2
ruby-vips ~> 2.0
bundler ~> 1.16
pry >= 0
rake ~> 10.0
rspec ~> 3.0
rubocop ~> 0.52
I've installed the above libraries, but not rmagick ~> 2.

Is there any solution?
