
djvubind's People

Contributors

ospalh, strider1551


djvubind's Issues

"This page will have no OCR content."

I get this constantly and randomly on a LOT of files:
Ubuntu 14.04, dual hexa-core Xeon, RAM isn't even nearly exceeded.

djvubind version 1.2.1 ( Revision: 9d9c575c2bf9 ) on linux
Executing with these parameters:
{'cjb2_options': '-lossy', 'cuneiform_options': '', 'title_exclude': {}, 'quiet': False, 'title_uppercase': False, 'cpaldjvu_options': '', 'c44_options': '', 'color_encoder': 'csepdjvu', 'title_start_number': 1, 'bitonal_encoder': 'cjb2', 'win_path': 'C:\\Program Files\\DjVuZone\\DjVuLibre\\;C:\\Program Files\\Tesseract-OCR;C:\\Program Files\\ImageMagick-6.6.5-Q16', 'cores': 6, 'minidjvu_options': '--match --pages-per-dict 100', 'ocr': True, 'ocr_engine': 'tesseract', 'title_start': False, 'tesseract_options': '"-l deu"', 'verbose': True, 'csepdjvu_options': ''}

* Collecting files to be processed.
  Binding a total of 186 file(s).
* Analyzing image information.
  Spawning 6 processing threads.
* Performing optical character recognition.
  Spawning 6 processing threads.
wrn: OCR failure on test00012.tif - This page will have no OCR content.
...
wrn: tesseract produced a significant mismatch between textual data and character position data on "tests00092.tif".  This may result in partial ocr content for this page.
...

When I run tesseract (3.03 with leptonica 1.71) myself and create hocr files, for example, it works with no problems at all.

Stopping with ^C during a run can ruin an image

I did start a djvubind run but noticed that something was wrong and hit Ctrl-C to stop it.
This was during the phase where I got "msg: NN: Bitonal image but with a depth greater than 1. Modifying image depth." messages for each image. (I guess this is #8, as tiffinfo says Bits/Sample: 1 and Samples/Pixel: 1.)

Then I got

err: [utils.execute()] Command exited with bad status.
     cmd = mogrify -colorspace gray -depth 1 "NN.tif"
     exit status = -2
wrn: Analysis failure on NN.tif.

And the image was broken. I couldn’t get any program to show it any more.
Once it happened with the only copy of a file I had. (Well, I wanted to back it up. As a djvu…)

No way to disable "-lossy" cjb2 option on djvubind command line (and it should NOT be default)

The -lossy option of cjb2 is extremely dangerous: it often creates a complete mess, i.e. the result contains completely different text from the original (because it mistakes letter shapes for similar letters, like daleth and resh). I used to use -lossy in the past but then discovered that, at least for Hebrew and Syriac texts, it is dangerous. I'm not sure about Russian and English texts, but if it fails on Hebrew/Syriac I wouldn't trust it on Russian/English either...

Is it safe to edit the /usr/bin/djvubind file and simply replace -lossy in there with -lossless? I hope so.

Use existing hocr/html

Hi there,

I'd really like it if there were a feature to use already existing hocr/html data from tesseract. This would allow running tesseract separately (for whatever reason) and reusing the results in djvubind afterwards.
Or is there a feature like that already?

Yours Boredland
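
For reference, a rough sketch of what reusing existing hOCR could look like (this is not an existing djvubind feature; the regexes assume tesseract's usual attribute order, the parsing is deliberately naive, and all names are illustrative):

import re

def hocr_to_djvused(hocr):
    # Page size from the ocr_page element (hOCR bbox: x0 y0 x1 y1, top-left origin).
    page = re.search(
        r"class=['\"]ocr_page['\"][^>]*bbox (\d+) (\d+) (\d+) (\d+)", hocr)
    page_w, page_h = int(page.group(3)), int(page.group(4))
    words = []
    for m in re.finditer(
            r"class=['\"]ocrx_word['\"][^>]*?bbox (\d+) (\d+) (\d+) (\d+)[^>]*>(.*?)</span>",
            hocr, re.S):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        text = re.sub(r'<[^>]+>', '', m.group(5)).strip()
        if not text:
            continue
        text = text.replace('\\', '\\\\').replace('"', '\\"')
        # hOCR y grows downward; DjVu text zones use a bottom-left origin.
        words.append('  (word {0} {1} {2} {3} "{4}")'.format(
            x0, page_h - y1, x1, page_h - y0, text))
    return '(page 0 0 {0} {1}\n{2})'.format(page_w, page_h, '\n'.join(words))

The resulting text could then be attached to an already encoded page with something like djvused book.djvu -e 'select 1; set-txt page1.txt; save'.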

The dpi checker does not look at the units

I ran mogrify with the LANG environment variable set to German (or rather de_DE.utf8), and when I looked at the output of identify -f "%x" I got 236.22 PixelsPerCentimeter, which converts to 599.9988 dpi. (One pixel less for every 21 m than at 600 dpi 😕)
That pixels-per-centimetre value, or rather its decimal point, confused djvubind here. Somehow the ValueError did get caught, but the processing just stopped anyway.

P.S.: Maybe I'll try to fix this myself, too.
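
A units-aware check might look roughly like this (a minimal sketch with illustrative names, assuming ImageMagick's identify, whose %x and %U format escapes report the x resolution and its units):

import subprocess

def get_dpi(path):
    out = subprocess.check_output(
        ['identify', '-ping', '-format', '%x %U', path],
        universal_newlines=True).strip()
    tokens = out.split()
    dpi = float(tokens[0])   # resolution value, in pixels/inch or pixels/cm
    if any(t.startswith('PixelsPerCentimeter') for t in tokens[1:]):
        dpi *= 2.54          # convert pixels per centimetre to dpi
    return int(round(dpi))

Rounding to the nearest integer would also absorb the 599.9988 case above.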

Program should not modify input data

In my opinion the step mogrify -colorspace gray -depth 1 "NN.tif" should not be done in place. It should be done with convert, writing the depth-1 image to a temporary file. (As /tmp/ may be too small, perhaps a temp folder in the working directory should be used?)

The program should not modify the input data without clearly stating so. What if I like my bitonal images with higher depth? What if I care about the modification times of my images? What if I want to keep the raw images, but don't want them backed up twice, once before the djvubind run and once after it?

P.S.: I have started using pgmagick. Works well for me.
(copy-edited)
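
A minimal sketch of that non-destructive variant (illustrative names; it assumes ImageMagick's convert and leaves cleanup of the temporary file to the caller):

import subprocess
import tempfile

def to_bitonal_copy(source_tif):
    # Write the depth-1 version to a temporary file in the working directory
    # instead of rewriting the original in place with mogrify.
    tmp = tempfile.NamedTemporaryFile(suffix='.tif', delete=False, dir='.')
    tmp.close()
    subprocess.check_call(
        ['convert', source_tif, '-colorspace', 'gray', '-depth', '1', tmp.name])
    return tmp.name   # source_tif is left untouched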

"wrn: OCR failure" erroneously(?) shown

This might be related to Issue #7, but it seems different to me, so I'm opening a new issue. I'm running djvubind on a directory full of .tif images from ScanTailor, and am getting a warning about "OCR failure" for each page. However, djvubind is creating hOCR files: an .html file appears for each page. These hOCR files can be used successfully by pdfbeads, so they seem to be valid. But djvubind seems not to realise that OCR was successful, so the output .djvu file doesn't have any selectable text.

Here's the output:

* Collecting files to be processed.
  Binding a total of 14 file(s).
* Analyzing image information.
  Spawning 8 processing threads.
* Performing optical character recognition.
  Spawning 8 processing threads.
wrn: OCR failure on P1100185_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100187_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100184_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100184_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100186_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100186_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100187_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100185_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100188_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100188_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100190_1L.tif - This page will have no OCR content.
wrn: OCR failure on P1100189_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100190_2R.tif - This page will have no OCR content.
wrn: OCR failure on P1100189_1L.tif - This page will have no OCR content.
* Encoding all information to /path/to/book(1).djvu.

(I redacted the path on the last line)

I installed djvubind from the latest commit on master (3f11cfb), and am running it in openSUSE 13.1. I'd be grateful for any advice you have, and can send my test files as needed (GitHub won't let me upload .tif, .html, .djvu, or .tar.gz files).

Thank you for your work on this project!

deliver output messages in stdout

Hi there,

you've got a fancy percentage display for the progress report, which is of course useful if you're working on single files. Since in my case large amounts of data are processed, I'd rather stream that output into logfiles during my batch processing.

exec > >(tee $(date +%F)_$(date +"%I-%M-%S")_main_log.txt)
exec 2>&1

But sadly that doesn't work as hoped. Any ideas?
P.S.: I invoke djvubind in my script in the corresponding folder with djvubind -v and it runs perfectly.
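
One possible approach, sketched below (illustrative only, not djvubind's actual code): use the in-place percentage display only when stdout is a terminal, and print plain line-buffered messages otherwise so that tee and logfiles capture them.

import sys

def report_progress(done, total):
    text = 'Processed {0}/{1} pages'.format(done, total)
    if sys.stdout.isatty():
        # Overwrite the current line on an interactive terminal.
        sys.stdout.write('\r' + text)
    else:
        # Plain output when piped or redirected (e.g. through tee).
        sys.stdout.write(text + '\n')
    sys.stdout.flush()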

Simplify bookmark file format

It would be nice if djvubind and pdfbeads used the same format for bookmark information so that users dual-encoding their books could use a single file. Since the pdfbeads folks are unreachable, I'm filing this here as a feature request in case you are also interested.

The djvubind format is currently:

(bookmarks
 ("Cover" "#1" )
 ("Chapter 1" "#4" )
)

while the pdfbeads one is simply:

"Cover" 1
"Chapter 1" 4

OCR using tesseract doesn't work

If I invoke tesseract manually, like this, it works fine:

$ l page.tif
-rw-r--r-- 1 tigran tigran 1511448 Nov  4 09:52 page.tif
$ tesseract page.tif -l eng page_box batch makebox
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
$ l page_box.box 
-rw-r--r-- 1 tigran tigran 29442 Nov  4 09:58 page_box.box
$ rm page_box.box 

But if I run djvubind, then it fails:

$ djvubind
* Collecting files to be processed.
  Binding a total of 1 file(s).
* Analyzing image information.
  Spawning 4 processing threads.
msg: page.tif: Bitonal image but with a depth greater than 1.  Modifying image depth.
* Performing optical character recognition.
  Spawning 4 processing threads.
wrn: OCR failure on page.tif - This page will have no OCR content.
* Encoding all information to /home/tigran/overstreet/test/book.djvu.

UPDATE: I added the batch argument to tesseract, as that is how djvubind invokes it. The results are exactly the same, so I just edited the listing above.

djvubind failure on 1600dpi files.

Here is what happens:

$ ls -l
total 9312
-rw-rw-r-- 1 tigran tigran 543844 Jul 21  2013 paper000-0001.tif
-rw-rw-r-- 1 tigran tigran 544484 Jul 21  2013 paper000-0002.tif
-rw-rw-r-- 1 tigran tigran 464914 Jul 21  2013 paper000-0003.tif
-rw-rw-r-- 1 tigran tigran 545314 Jul 21  2013 paper000-0004.tif
-rw-rw-r-- 1 tigran tigran 506398 Jul 21  2013 paper000-0005.tif
-rw-rw-r-- 1 tigran tigran 622212 Jul 21  2013 paper000-0006.tif
-rw-rw-r-- 1 tigran tigran 614352 Jul 21  2013 paper000-0007.tif
-rw-rw-r-- 1 tigran tigran 591676 Jul 21  2013 paper000-0008.tif
-rw-rw-r-- 1 tigran tigran 534892 Jul 21  2013 paper000-0009.tif
-rw-rw-r-- 1 tigran tigran 576298 Jul 21  2013 paper000-0010.tif
-rw-rw-r-- 1 tigran tigran 524160 Jul 21  2013 paper000-0011.tif
-rw-rw-r-- 1 tigran tigran 626706 Jul 21  2013 paper000-0012.tif
-rw-rw-r-- 1 tigran tigran 608412 Jul 21  2013 paper000-0013.tif
-rw-rw-r-- 1 tigran tigran 613104 Jul 21  2013 paper000-0014.tif
-rw-rw-r-- 1 tigran tigran 627752 Jul 21  2013 paper000-0015.tif
-rw-rw-r-- 1 tigran tigran 612474 Jul 21  2013 paper000-0016.tif
-rw-rw-r-- 1 tigran tigran 352130 Jul 21  2013 paper000-0017.tif
$ djvubind
* Collecting files to be processed.
  Binding a total of 17 file(s).
* Analyzing image information.
  Spawning 4 processing threads.
* Performing optical character recognition.
  Spawning 4 processing threads.
* Encoding all information to /disk/Work/Guardian-Plates/tif/1stEdition-PartI.tif/paper000/book.djvu.
Traceback (most recent call last):
  File "/usr/bin/djvubind", line 451, in <module>
    proj.bind()
  File "/usr/bin/djvubind", line 172, in bind
    self.enc.enc_book(self.book, self.out)
  File "/usr/lib/python3.2/site-packages/djvubind/encode.py", line 264, in enc_book
    self._cjb2(page.path, tempfile, page.dpi)
  File "/usr/lib/python3.2/site-packages/djvubind/encode.py", line 84, in _cjb2
    utils.execute(cmd)
  File "/usr/lib/python3.2/site-packages/djvubind/utils.py", line 193, in execute
    print(utils.color("err: [utils.execute()] Command exited with bad status.", 'red'), file=sys.stderr)
NameError: global name 'utils' is not defined

But on some other book which I scanned today (at 600dpi) it works fine, i.e. produces the book.djvu file with OCR text embedded correctly. Maybe the 1600dpi resolution is too high?

$ identify paper000-0001.tif 
paper000-0001.tif TIFF 8206x14323 8206x14323+0+0 1-bit Bilevel DirectClass 544KB 0.000u 0:00.000
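
Incidentally, the NameError in the traceback masks the real cjb2 failure: inside djvubind/utils.py the module refers to itself as utils, which is not defined in its own namespace. A sketch of the one-line change this implies (illustrative; color() is the function the traceback already references in that module):

# In djvubind/utils.py, execute() cannot refer to its own module as "utils";
# calling color() directly avoids the NameError so the real error gets printed.
print(color("err: [utils.execute()] Command exited with bad status.", 'red'),
      file=sys.stderr)

With that fixed, the underlying error message from the failing cjb2 run on the 1600dpi pages should become visible.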

better handling of scantailor mixed mode

Presently there is the option to use csepdjvu to get better compression from a ScanTailor mixed mode image. I'm not sure exactly how csepdjvu works, but I think better results could be achieved with cjb2, iw44, and djvumake.
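
As a rough illustration of the cjb2 + IW44 + djvumake pipeline suggested above (the file names, the dpi value, and the split into a separate text mask and background image are all assumptions; see the djvulibre man pages for the exact chunk rules):

import subprocess

def encode_mixed_page(mask_tif, background_ppm, out_djvu, dpi=600):
    # JB2-compress the bitonal text mask.
    subprocess.check_call(['cjb2', '-dpi', str(dpi), mask_tif, 'sjbz.djvu'])
    # IW44-compress the background layer.
    subprocess.check_call(['c44', '-dpi', str(dpi), background_ppm, 'bg44.djvu'])
    # Assemble mask and background into one compound page.  Depending on the
    # djvulibre version a foreground chunk (FG44/FGbz) may also be needed.
    subprocess.check_call(['djvumake', out_djvu,
                           'Sjbz=sjbz.djvu', 'BG44=bg44.djvu'])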

Color and Mixed mode scantailor pages have reversed colors after djvubind

Running stock Debian Jessie on two different machines, I get the following results:

Pages that have mixed or color mode set in scantailor look normal when viewing the tif files but after djvubind processes them they end up almost completely black in the final djvu file as if the colors have been inverted. Pages marked as "color" result in fully inverted colors, including white text on a black background, while "mixed" mode ones have just the black background with no text visible. OCR seems unaffected; tesseract can run on both types of pages and produces OCR output normally.

This occurs even when re-using archived source tif files from previous projects that djvubind previously encoded without trouble. I've set/unset all the c44, cpaldjvu, and csepdjvu options in my ~/.djvubind/config file that I could find in the relevant man pages. I also tried moving the config file aside entirely in case one of my standard options was causing it, though I only modify minidjvu settings so that seemed unlikely. Nothing has resolved the problem. Bitonal pages continue to work flawlessly.

I have some single-page test and output files I can send you, though GitHub prevents me from uploading either to this ticket. Just let me know if you need them and I'll email them directly.

Error when running on tifs from scantailor processed with mixed mode (color + b&w) (same for color+grayscale mode)

In a folder with tif images produced by the software ScanTailor with the mixed mode output setting, running "djvubind" results in no djvu file and in this log:

* Encoding all information to /home/kaue/Study/Alemão/Zorach/out-color/book.djvu.
Traceback (most recent call last):
  File "/bin/djvubind", line 446, in <module>
    proj.bind()
  File "/bin/djvubind", line 171, in bind
    self.enc.enc_book(self.book, self.out)
  File "/usr/lib/python3.10/site-packages/djvubind/encode.py", line 281, in enc_book
    self._csepdjvu(page.path, tempfile, page.dpi)
  File "/usr/lib/python3.10/site-packages/djvubind/encode.py", line 137, in _csepdjvu
    self._cjb2('temp_textual.tif', 'enc_bitonal_out.djvu', dpi)
  File "/usr/lib/python3.10/site-packages/djvubind/encode.py", line 84, in _cjb2
    utils.execute(cmd)
  File "/usr/lib/python3.10/site-packages/djvubind/utils.py", line 193, in execute
    print(utils.color("err: [utils.execute()] Command exited with bad status.", 'red'), file=sys.stderr)
NameError: name 'utils' is not defined

Detecting bit-depth broken

Something is broken with bit-depth detection when the image is bitonal. I think identify may have changed how it formats the output of identify -ping -format %z in some more recent version.
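
A more defensive check might look something like this (a sketch only; it assumes identify is ImageMagick's and that %z still reports the depth, possibly with extra text appended in newer versions):

import subprocess

def get_depth(path):
    out = subprocess.check_output(
        ['identify', '-ping', '-format', '%z', path],
        universal_newlines=True).strip()
    # Tolerate extra tokens or a trailing suffix such as "-bit".
    first = out.split()[0]
    return int(''.join(ch for ch in first if ch.isdigit()))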

Suggestion: change default cjb2 option from "-lossy" to "-lossless"

I have just spent several hours generating a djvu file, and to my amazement the resulting file was too small (about 5MB, whereas all 600dpi djvu books I have produced in the past 17 or so years, since djvu first appeared, are at least 40MB, and the 1600dpi ones are 400MB or so). Looking at the awful quality of the text (misaligned letters, the usual artefact of using "cjb2 -lossy"), I immediately realised that the -lossy option must have been passed. Ok, so I fixed /etc/djvubind/config to pass -lossless instead (which is not even necessary, as it is the default for cjb2 anyway, very sensibly so) and restarted djvubind. I waited a few more hours and, lo and behold, the file is tiny and has the same rubbish-quality text again! So I then ran locate djvubind and discovered that there is apparently a local config file as well, ~/.config/djvubind/config, so I had to edit it too and restart djvubind (and now will have to wait for those hours again).

So, here is my suggestion: please change the default cjb2 option from -lossy to -lossless, as, obviously, nobody would ever wish to create a low-quality book (i.e. with misaligned text) at the expense of saving those few hundred megabytes (and for most books, i.e. 600dpi or less, it wouldn't even be hundreds but a few tens of megabytes of difference). Nowadays the best eInk devices for DjVu and PDF reading come with 512GB of storage (256GB internal + 256GB micro sdxc on the Kobo Aura H2O), so file size is much, much less important than the quality of the book.

But if for some perverse reason someone really wants to create a horrible-quality book, then they can always override this default and pass -lossy instead.

Note that this behaviour would match the default of cjb2(1), so it makes perfect sense and keeps things consistent.

I am running Ubuntu 16.04.1 and djvubind is version 1.2.1 Revision: 4e797677f1bc if that is important.

Greater number of threads than files to process

djvubind spawns the same number of threads as there are cores available, unless specified otherwise. If there happen to be more cores than files to process (e.g., an eight-core cpu but six images to bind), it will spawn more threads than necessary.

This doesn't seem to cause any failure. No doubt the extra threads are spawned, have nothing in the queue to do, and are closed. Nevertheless, it should not spawn more threads than there are files.

Example below.

* Collecting files to be processed.
  Binding a total of 6 file(s).
* Analyzing image information.
  Spawning 8 processing threads.
* Performing optical character recognition.
  Spawning 8 processing threads.
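
A minimal sketch of the cap described above (illustrative names, not djvubind's actual code):

import multiprocessing

def thread_count(files, requested=None):
    # Never spawn more worker threads than there are files in the queue.
    cores = requested or multiprocessing.cpu_count()
    return max(1, min(cores, len(files)))

# thread_count(['a.tif'] * 6, requested=8) -> 6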

Implement temporary files

Djvubind can produce a large number of temporary files as it works. Presently these simply exist in the working directory and are deleted manually. Instead, we should take advantage of the tempfile module. It's cross-platform and it cleans up the files in the event of an unhandled exception. Also, some people have hardware configurations designed for better handling of temporary files (e.g., I have /tmp/ mounted as a ramdisk to save wear and tear on the real hard drive).

References:
http://docs.python.org/dev/library/tempfile.html#module-tempfile
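
A minimal sketch of how this could look for one page (illustrative; the cjb2 call and file names are assumptions):

import os
import shutil
import subprocess
import tempfile

def encode_bitonal_page(tif_path, dest):
    # Scratch output lives in a TemporaryDirectory that is removed even if an
    # exception escapes; only the finished page is copied out.
    with tempfile.TemporaryDirectory(prefix='djvubind-') as workdir:
        out = os.path.join(workdir, 'enc_bitonal_out.djvu')
        subprocess.check_call(['cjb2', '-lossless', tif_path, out])
        shutil.copy(out, dest)
    return dest

Since tempfile defaults to the system temp location (honouring TMPDIR), a ramdisk-backed /tmp is picked up automatically.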

Parallel tests

Hi there,
since I have a lot of processing cores, I thought perhaps djvubind could be used a little more in parallel.
So I started trying to parallelize externally with GNU parallel. I think my results are worth mentioning:
creating a bunch of smaller djvu jobs that are then merged with djvm decreases the time needed for a job by around a third; the more compression needed, the more time you save by using it.
I think you don't do that natively because of your library compression. Surprisingly, my results are of the same size or smaller; I don't know why.
I put this together into a script with some example files. You can try it by simply unpacking it and running ./testscript. Make sure to set the variables at the top of the script according to your system.
Critique will be kindly appreciated.
http://www.file-upload.net/download-10445190/test.tar.bz2.html
Boredland
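
For reference, a rough sketch of the chunk-and-merge approach described above, done from Python instead of GNU parallel (illustrative only; it assumes djvubind and djvm are on the PATH and that each chunk directory already contains its share of the page images):

import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def bind_chunk(chunk_dir):
    # Run djvubind inside the chunk directory; it writes book.djvu there.
    subprocess.check_call(['djvubind'], cwd=chunk_dir)
    return os.path.join(chunk_dir, 'book.djvu')

def bind_parallel(chunk_dirs, output='book.djvu', workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(bind_chunk, chunk_dirs))
    # djvm -c concatenates the partial documents in the given order.
    subprocess.check_call(['djvm', '-c', output] + parts)

As noted above, the likely trade-off is that pages can only share a JB2 shape dictionary within their own chunk.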

Exposing Options for Compressing Images with cjb2

By default, when djvubind uses cjb2 compression on a tiff, it invokes cjb2 with the -lossy option. After processing a set of images from a scanned book through ScanTailor and sending them to djvubind as bitonal images, I've noticed that on these pages the r's have become t's and vice versa.

After running cjb2 with the -losslevel 35 option, my pages look fine. Might there be a way to expose the losslevel option through djvubind?

(screenshot attached to the original issue)
