altabeh / tesseract-ocr-wrapper

This is a highly efficient python wrapper for tesseract-ocr.

License: MIT License

Python 100.00%
tesseract-ocr text-extraction multiprocessing leptonica xpdf

tesseract-ocr-wrapper's Introduction

tesseract-ocr-wrapper

This is a Python wrapper for tesseract-ocr. You can use the following code to extract text from a PDF:

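from utils import ocr_to_text

# Options passed to ocr_to_text as keyword arguments.
OCR_CONFIG = {
    "grayscale": "true",
    "user_defined_dpi": "250",
    "oem": "1",
}

# pdf_path is the path to the input PDF; define it before running.
# Each item is a (page_text, page_number) tuple, so sort by page number
# before joining the pages into one string.
numpage_text_bundle = sorted(
    ocr_to_text(pdf_path, **OCR_CONFIG),
    key=lambda x: x[1],
)
ocr_text = "\n".join(page[0] for page in numpage_text_bundle)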

Compared with pytesseract, the function ocr_to_text above has two main advantages that make it well suited to extracting text from many PDFs.

  • It is fully parallelized using ProcessPoolExecutor from the concurrent.futures module. ocr_to_text yields tuples whose first element is the page text and whose second element is the page number. The text extraction for each page is submitted as a separate task to the process pool, and each result is returned as soon as its task finishes, without waiting for the remaining tasks to complete. Results can therefore arrive out of order, which is why the page numbers are needed later to sort the results in ascending order (key=lambda x: x[1]).

  • pytesseract has a severe memory issue: it keeps accumulating data in RAM across possibly many iterations over the pages of a PDF file, with no controlled garbage collection. Here, by contrast, every page is converted to an image stored in a TemporaryDirectory() that is cleaned up and removed from the filesystem once extraction completes. Unused data is released each time a page is processed, which significantly reduces memory usage, both for a single file and when looping over thousands of PDF files.
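The two points above can be sketched together. This is only an illustration under stated assumptions, not the code in utils: process_page, ocr_pages, and join_in_page_order are hypothetical names, and the page rendering and tesseract call are replaced by writing and reading a small text file. It shows per-page tasks submitted to a pool, results yielded in completion order, a temporary directory that disappears after each page, and the final sort by page number.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from tempfile import TemporaryDirectory


def process_page(page_number):
    """Hypothetical per-page task returning (text, page_number)."""
    with TemporaryDirectory() as tmp:
        staged = Path(tmp) / f"page_{page_number}.txt"
        # Placeholder for rendering the PDF page to an image on disk.
        staged.write_text(f"text of page {page_number}")
        # Placeholder for running tesseract on that image.
        text = staged.read_text()
    # The temporary directory and the staged file no longer exist here.
    return (text, page_number)


def ocr_pages(page_numbers, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Yield (text, page_number) tuples as each task completes."""
    # ThreadPoolExecutor can be passed as executor_cls for quick experiments.
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(process_page, n) for n in page_numbers]
        for future in as_completed(futures):
            yield future.result()


def join_in_page_order(results):
    """Sort results by page number, then join the page texts."""
    ordered = sorted(results, key=lambda x: x[1])
    return "\n".join(text for text, _ in ordered)
```

Calling join_in_page_order(ocr_pages(range(1, 11))) would return the ten placeholder page texts in ascending page order, regardless of the order in which the tasks finished. With ProcessPoolExecutor, that call should sit under an if __name__ == "__main__": guard on platforms that start worker processes with spawn.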

Requirements

After cloning the repo, go to the directory tesseract-ocr-wrapper and then type

$ pip install -r requirements.txt

to download and install the required Python packages. Several other packages, such as leptonica and poppler-utils, also need to be installed. Since these packages are system-dependent, I have included a _sys.py file that downloads and installs them directly if you are on Linux (the commands below use yum). For other operating systems, you can modify this file, in particular the commands listed in _sys.py, i.e.

commands = [
            "sudo yum install poppler-utils",
            "sudo yum install autoconf automake libpng-devel libtiff-devel libtool pkgconfig.x86_64 libpng12-devel.x86_64 libjpeg-devel libtiff-devel.x86_64 zlib-devel.x86_64",
            "cd /tmp",
            "wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz",
            "tar -zxvf leptonica-1.79.0.tar.gz",
            "cd leptonica-1.79.0",
            "./configure --prefix=/usr/local/",
            "make",
            "sudo make install",
            "export PKG_CONFIG_PATH=/usr/local/leptonica-1.79.0/lib/pkgconfig",
            "cd /tmp",
            "wget https://codeload.github.com/tesseract-ocr/tesseract/tar.gz/4.1.1",
            "tar -zxvf 4.1.1",
            "cd tesseract-4.1.1",
            "./autogen.sh",
            "LIBLEPT_HEADERSDIR=/usr/local/lib ./configure --prefix=/usr/local/ --with-extra-libraries=/usr/local/lib",
            "autoreconf --force --install",
            "./configure",
            "make",
            "sudo make install",
            "sudo ldconfig",
            "cd /tmp",
            "wget https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata?raw=true",
            "sudo mv /tmp/eng.traineddata?raw=true /usr/local/share/tessdata/eng_fast.traineddata",
            f"cd {current_dir}",
        ]

and then simply enter $ python _sys.py in the terminal.
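How _sys.py executes this list is not shown here. One point to keep in mind when adapting the commands: steps such as cd and export only affect later steps if everything runs in the same shell session. A minimal sketch of that idea, assuming a POSIX shell and a hypothetical helper named run_commands (not part of the repo):

```python
import subprocess


def run_commands(commands):
    """Run all commands in a single shell so cd/export carry over.

    Hypothetical helper for illustration; the commands are chained with
    '&&', so execution stops at the first command that fails.
    """
    script = " && ".join(commands)
    return subprocess.run(
        script,
        shell=True,           # one shell session for the whole chain
        check=True,           # raise CalledProcessError on failure
        capture_output=True,
        text=True,
    )
```

For example, run_commands(["cd /", "pwd"]).stdout shows that the second command runs in the directory selected by the first. Running each command in its own subprocess would lose that state.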

tesseract-ocr-wrapper's People

Contributors

altabeh, lionello

tesseract-ocr-wrapper's Issues

pdf conversion without writing to disk

with TemporaryDirectory() as path:
    path_to_pages = convert_from_path(
        pdf_path,
        output_folder=path,
        fmt="tiff",
        dpi=int(resolution),
        first_page=page,
        last_page=min(page + batch_size - 1, page_count),
        paths_only=True,
    )

Have you considered converting the PDF to images without writing to disk? This is possible with PyMuPDF (fitz).

Example code e.g. https://bucket401.blogspot.com/2021/03/pdf-to-imagemultipage-in-python.html
