kcroker / dpsprep

Python DJVU to PDF converter which preserves OCR text and bookmark metadata (e.g. TOC)

License: Other

dpsprep's Introduction

dpsprep

This tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).

Usage

Full example (the output PDF name is optional; if omitted, it is inferred from the input name):

dpsprep --pool=8 --quality=50 input.djvu output.pdf

If you have OCRmyPDF installed, you can use its PDF optimizer:

dpsprep -O3 input.djvu

You can also skip translating the text layer (it is sometimes not translated well) and redo the OCR (rather than launching the ocrmypdf CLI, we use the API directly and accept options in JSON format):

dpsprep --ocr '{"language": ["rus", "eng"]}' input.djvu
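
When driving dpsprep from a script, the JSON passed to --ocr is easy to get wrong by hand. The following is a minimal sketch, not part of dpsprep itself, that assembles the command lines shown above; the helper name build_dpsprep_command is hypothetical, and only the flags that appear in this section (--pool, --quality, --ocr) are used.

```python
import json
import shlex


def build_dpsprep_command(src, dest=None, pool=None, quality=None, ocr_options=None):
    """Assemble a dpsprep argument list (hypothetical helper, not part of dpsprep)."""
    cmd = ["dpsprep"]
    if pool is not None:
        cmd.append(f"--pool={pool}")
    if quality is not None:
        cmd.append(f"--quality={quality}")
    if ocr_options is not None:
        # --ocr expects its options as a single JSON string.
        cmd += ["--ocr", json.dumps(ocr_options)]
    cmd.append(src)
    if dest is not None:
        # The output name is optional; dpsprep infers it when omitted.
        cmd.append(dest)
    return cmd


cmd = build_dpsprep_command("input.djvu", ocr_options={"language": ["rus", "eng"]})
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) without going through a shell, which avoids quoting problems altogether.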

Consult the man file (online) for details; there are a lot of options to consider.

See the next section for different ways to run the program.

Installation

The easiest way to obtain dpsprep is to clone the repository.

The tool depends on several Python libraries, which can easily be installed via poetry. A configuration for pyenv is also included.

The only hard prerequisite is djvulibre. Optional prerequisites are:

libtiff for bitonal image compression.
libjpeg (or libjpeg-turbo) for multitonal (RGB or grayscale) compression.
OCRmyPDF and jbig2enc for PDF optimization (see the next section).

libtiff depends on libjpeg, so installing libtiff will likely install both.

For details on how these dependencies can be installed, see the GitHub Actions workflow and the dpsprep-git package for Arch Linux.

Note that Windows support in djvulibre-python requires 64-bit djvulibre, and they only officially distribute 32-bit Windows packages. If you manage to make it work, consider opening a pull request.

Once inside the cloned repository, the environment for the program can be set up by simply running poetry install. After that, the following should work:

poetry run python -m dpsprep input.djvu

The program can easily be installed as a Python module via poetry and pip:

poetry build
pip install [--user] dist/*.whl

If you are packaging this for some other package manager, consider using PEP-517 tools as shown in this PKGBUILD file.

A convenience script that can be copied or linked to any directory in $PATH can be found at ./bin/dpsprep.

Previous versions of the tool itself used to depend on third-party binaries, but this is no longer the case. The test fixtures are checked in; however, regenerating them (see ./fixtures/makefile) requires pdflatex (texlive, among others), gs (Ghostscript), pdftotext (Poppler), djvudigital (GSDjVU) and djvused (DjVuLibre). Similarly, the man file is checked in, but building it from markdown depends on ronn.

Note regarding compression

We perform compression in two stages:

The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if libtiff is available, group4 compression is used.
If OCRmyPDF is installed, its PDF optimization can be used via the flags -O1 to -O3 (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via jbig2enc.

If manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting --tesseract-timeout to 0) may ruin existing text layers. To perform only PDF optimization you can use the following undocumented tool instead:

python -m ocrmypdf.optimize <input_file> <level> <output_file>
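
The same invocation can be wrapped for batch use. This is a rough sketch that only shells out to the command above; the function names are hypothetical, and restricting the level to 1 through 3 is an assumption based on the -O1 to -O3 flags mentioned earlier.

```python
import subprocess
import sys


def optimize_command(input_file, level, output_file):
    """Build the argument list for `python -m ocrmypdf.optimize` (hypothetical helper)."""
    # Assumption: only levels 1-3 are accepted here, mirroring -O1 to -O3.
    if level not in (1, 2, 3):
        raise ValueError(f"unsupported optimization level: {level!r}")
    return [
        sys.executable,  # run the module with the current interpreter
        "-m",
        "ocrmypdf.optimize",
        str(input_file),
        str(level),
        str(output_file),
    ]


def optimize_pdf(input_file, level, output_file):
    """Run the optimizer and raise CalledProcessError if it fails."""
    subprocess.run(optimize_command(input_file, level, output_file), check=True)
```

Calling optimize_pdf("book.pdf", 3, "book-optimized.pdf") requires OCRmyPDF to be installed in the same Python environment.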

Acknowledgements

The font invisible1.ttf is taken from here. See the djvu_pages_to_text_fpdf function in ./dpsprep/text.py for how it is used.

Kevin's notes regarding the first version

I wrote this with the specific intent of converting ebooks in the DJVU format into PDFs for use with the fantastic (but pricey) Sony Digital Paper System.

DjVu technology is strikingly superior for many ebook applications, yet the Sony Digital Paper System (rev 1.3 US) only supports PDF technology: this is because its primary design purpose is not as an ereader. The device, however, is quite nearly the perfect ereader.

Unfortunately, all presently available DjVu to PDF tools seem to just dump flattened enormous TIFF images. This is ridiculous. Since PDF really can't do that much better on the way it stores image data, a 5-6x bloat cannot be avoided. However, none of the existing tools preserve:

The OCR'd text content
Table of Contents or Internal links

This is kind of silly, but until Sony's Digital Paper, there was no need to move functional DjVu files to PDFs. In order to make workable PDFs from DjVu files for use on the Digital Paper System, I have implemented in one location the following procedures detailed here:

By automating the procedure of user zetah for extracting the text and getting it in the correct locations: http://askubuntu.com/questions/46233/converting-djvu-to-pdf (OCR text transfer)

By implementing the procedure of user pyrocrasty for extracting the outline, and putting it into the PDF generated above: http://superuser.com/questions/801893/converting-djvu-to-pdf-and-preserving-table-of-contents-how-is-it-possible (bookmark transfer)

dpsprep's Issues

A Guide for Installing Dependencies

Personally, I've had problems getting all the dependencies installed on my Ubuntu 18.04 LTS VM. I think better information on which dependencies should be installed would make this Python program easier to use. I'm not sure if this is actively maintained, so I'll leave this issue here as it is, and if I manage to install the dependencies I'll make a pull request describing my process. Just wanted to let you know about this.

wrong page - all indexes equal

I tried to convert this: http://bookzz.org/dl/1180286/aeaccb?alt=1
but all TOC entries point to page 1.

ImportError: No module named sexpdata

It won't work. I tried it on Ubuntu 14.04 x64

$ ./dpsprep book.djvu
Traceback (most recent call last):
  File "./dpsprep", line 6, in <module>
    import sexpdata
ImportError: No module named sexpdata

How to convert DJVU to PDF properly with keeping outline and searchable text?

Thanks,
jonghyun

ValueError: invalid literal for int() with base 10

I was trying to convert this djvu file to pdf.

Gerald_B._Folland-Real_Analysis__Modern_Techniques_and_Their_Applications,_2nd_Ed.djvu.zip

And I got this error. Do you know what's going wrong? I'm on Mac OS using Python 3.11.6, pip 23.3.1, Poetry 1.7.1, and dpsprep 2.2.2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/__main__.py", line 3, in <module>
    dpsprep()
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wilder/Library/Python/3.11/lib/python/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/dpsprep.py", line 168, in dpsprep
    outline = OutlineTransformVisitor().visit(document.outline.sexpr)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 34, in visit
    return self.visit_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 12, in visit_list
    return method(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/outline.py", line 45, in visit_list_bookmarks
    self.visit(child, parent=outline)
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 34, in visit
    return self.visit_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/sexpr.py", line 15, in visit_list
    return self.visit_plain_list(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/dpsprep/outline.py", line 13, in visit_plain_list
    page_number = int(page.value[1:]) - 1
                  ^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'f007.djvu'
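
The traceback shows int() being applied to everything after the first character of the bookmark target, which fails when the target (presumably '#f007.djvu') is not a page number. A minimal sketch of a more tolerant parser follows; parse_page_target is a hypothetical helper, not dpsprep's actual code or fix.

```python
def parse_page_target(target):
    """Return a zero-based page index for targets like '#12', or None otherwise.

    Hypothetical helper: it only guards the numeric case and does not try to
    resolve file-name targets such as '#f007.djvu'.
    """
    if not target.startswith("#"):
        return None
    body = target[1:]
    # Accept plain ASCII digits only; anything else is not a page number.
    if not (body.isascii() and body.isdigit()):
        return None
    index = int(body) - 1  # bookmark page numbers are one-based
    return index if index >= 0 else None


print(parse_page_target("#12"))
print(parse_page_target("#f007.djvu"))
```

Returning None lets the caller decide whether to skip the bookmark or to look the target up some other way, instead of aborting the whole conversion.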

Misinterpreting actual page numbers to file's page numbers

I have a DjVu file whose conversion produces incorrect page numbers in its TOC.
bmarks.out:

(bookmarks
("Cover"
"#cover_2.djvu" )
("Front matter"
"#_005.djvu" )
("Preface to the Dover Edition "
"#_008.djvu" )
("Preface "
"#_010.djvu" )
("Table of Contents "
"#_012.djvu" )
("Introduction"
"#_014.djvu" )
("Chapter 1. The Basic Notions "
"#3.djvu" )
("Chapter 2. Order and Well-Foundedness "
"#032.djvu" )
("Chapter 3. Cardinal Numbers"
"#076.djvu" )
("Chapter 4. The Ordinals"
"#112.djvu" )
("Chapter 5. The Axiom of Choice and Some of its Consequences"
"#158.djvu" )
("Chapter 6. A Review of Point Set Topology. "
"#199.djvu" )
("Chapter 7. The Real Spaces "
"#216.djvu" )
("Chapter 8. Boolean Algebras "
"#244.djvu" )
("Chapter 9. Infinite Combinatorics and Large Cardinals "
"#289.djvu" )
("Appendix 10. The Eliminability and Conservation Theorems"
"#357.djvu" )
("Bibliography"
"#367.djvu" )
("Additional Bibliography "
"#376.djvu" )
("Index of Notation "
"#377.djvu" )
("Index"
"#383.djvu" )
("Appendix: Corrections and Additions "
"#393.djvu" ) )

These numbers represent the actual page numbers (as opposed to the file's page numbers), and when converted to PDF, it misinterprets these numbers as the file's page numbers, with the exception that the page numbers that have an underscore seem to be totally ignored (they seem to represent "negative" page numbers relative to page 1). The DJVU file can be found here: https://drive.google.com/open?id=1zVKe0qXA08_q5Tq42LzvNVVgRXOeHVyM

There is probably a way of determining where page 1 of the actual book/article/etc. is since DjView is able to display everything correctly.
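
One possible approach, sketched under the assumption that targets such as "#032.djvu" name the document's component files rather than page numbers: look the target up in the ordered list of page ids first, and fall back to treating it as a one-based page number only when no id matches. The function resolve_bookmark and the sample page_ids list are hypothetical; obtaining the real list of ids from the DjVu file is left out.

```python
def resolve_bookmark(target, page_ids):
    """Map a bookmark target to a zero-based page index, or None if unresolved.

    page_ids is the ordered list of per-page ids (an assumed input).
    """
    name = target[1:] if target.startswith("#") else target
    # Prefer an exact id match, so that "032.djvu" is not read as page 32.
    if name in page_ids:
        return page_ids.index(name)
    # Otherwise fall back to a plain one-based page number.
    if name.isascii() and name.isdigit():
        index = int(name) - 1
        if 0 <= index < len(page_ids):
            return index
    return None


# Made-up ordering for illustration only.
page_ids = ["cover_2.djvu", "_005.djvu", "3.djvu", "032.djvu"]
print(resolve_bookmark("#032.djvu", page_ids))
print(resolve_bookmark("#_005.djvu", page_ids))
```

With this ordering, the underscore-prefixed entries resolve like any other id, so they no longer need special handling as "negative" pages.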

Installation fails due to distutils deprecation

distutils seems to be deprecated in the latest Python releases (3.9 and above). The installation fails because of a deprecation warning.

no OCR after conversion (wrongly OCR'ed djvus?)

This issue continues that part of #16 about OCR, but with other files.

Two files. The first is the file Kornai. I can copy text correctly from the DjVu file in DjVu4, but not in Okular, where I can't see the blue text boxes. Evince lets me see the boxes and copy the text (even correctly), but very strangely, with the wrong orientation and placement (I was copying the first paragraph).

There is no OCR after conversion. Something is wrong with the DjVu file, and I doubt that it can be solved without redoing the OCR.

The file 2.djvu has a correct OCR layer (with many mistakes, but that shouldn't matter, I think) that can be seen in Okular and other viewers, and I can copy text correctly from them. Yet there is no OCR after conversion. This case is stranger, because the DjVu file seems normal.

Make dpsprep work with python 3

Is there any way to make the script compatible with the newest python version without rewriting it? That'd be great!

inverted colors of images in pdfs and messed OCR after conversion

Hello.

Dpsprep inverts the colors of images (mostly book covers) during conversion. Black-and-white images stay intact, I think, but color and grayscale ones (I'm not sure about the latter, but I think so) are always inverted, so black becomes white and so on. Occasionally all pages of a converted PDF become black with white letters, but such cases are rare.
This happens even with a minimal command, like poetry run python -m dpsprep -v.

As an example, the colors of the first page (the cover) of this book from this link (the DjVu file is zipped and the site is HTTP) are inverted. All books from this library can be downloaded and used for personal needs freely and legally. They are in Cyrillic, but that should not be a problem, I think.

Maybe something is wrong with my system, I don't know. It is the latest stable Manjaro Linux (an Arch derivative) with no Python 2, and I am working inside a recently cloned copy of this repository (I didn't install dpsprep as a Python module). I can provide more information if needed.

And thank you for this converter. It is a lifesaver, since Zotero works only with PDFs and I already have quite a lot of DjVu files with OCR, so there is no need to OCR the converted PDFs. dpsprep is the only converter that preserves OCR; I tried several solutions.

Thank you,

Valdemaras