Tesseract OCR for PHP
A wrapper to work with Tesseract OCR inside PHP.
Installation
Via Composer:
$ composer require thiagoalessio/tesseract_ocr
This library depends on Tesseract OCR, version 3.03 or later.
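To check this requirement from PHP, something like the following could work. It is only a sketch: `parseTesseractVersion` and `meetsMinimumVersion` are hypothetical helpers, not part of this library, and the assumed banner format (`tesseract 3.04.01`) may differ between builds.

```php
<?php
// Hypothetical helpers, not part of the library.

/**
 * Extract the version number from the output of `tesseract --version`.
 * Returns null if no version can be found.
 */
function parseTesseractVersion(string $output): ?string
{
    if (preg_match('/tesseract\s+v?(\d+(?:\.\d+)*)/i', $output, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

/** Check whether the detected version is at least the required one. */
function meetsMinimumVersion(string $output, string $minimum = '3.03'): bool
{
    $version = parseTesseractVersion($output);
    return $version !== null && version_compare($version, $minimum, '>=');
}

// Some releases print the banner to stderr, so capture both streams:
// $banner = (string) shell_exec('tesseract --version 2>&1');
// var_dump(meetsMinimumVersion($banner));
```

The parsing is kept separate from the shell call so it can be exercised without Tesseract installed.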
Note for Windows users
There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
choco install capture2text
Note for macOS users
With MacPorts you can install support for individual languages, like so:
$ sudo port install tesseract-<langcode>
But that is not possible with Homebrew. It comes only with English support by default, so if you intend to use it for other languages, the quickest solution is to install them all:
$ brew install tesseract --with-all-languages
Usage
Basic usage
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('text.png'))
->run();
The quick brown fox
jumps over
the lazy dog.
Other languages
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('german.png'))
->lang('deu')
->run();
Bülowstraße
Multiple languages
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('mixed-languages.png'))
->lang('eng', 'jpn', 'spa')
->run();
I eat すし y Pollo
Inducing recognition
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('8055.png'))
->whitelist(range('A', 'Z'))
->run();
BOSS
Breaking CAPTCHAs
Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:
API
executable
Define a custom location of the tesseract executable, if for any reason it is not present in the $PATH.
echo (new TesseractOCR('img.png'))
->executable('/path/to/tesseract')
->run();
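One way to decide whether this option is needed is to look for the binary on the PATH first. The sketch below assumes a Unix-like system where the binary is named `tesseract` (on Windows the name would typically end in `.exe`); `findInPath` is a hypothetical helper, not part of this library.

```php
<?php
// Hypothetical helper, not part of the library.

/**
 * Search the directories of a PATH-style string for an executable file.
 * Returns the full path of the first match, or null if none is found.
 */
function findInPath(string $name, string $pathEnv): ?string
{
    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// $found = findInPath('tesseract', (string) getenv('PATH'));
// if ($found === null) {
//     // fall back to ->executable('/path/to/tesseract')
// }
```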
tessdataDir
Specify a custom location for the tessdata directory.
echo (new TesseractOCR('img.png'))
->tessdataDir('/path')
->run();
userWords
Specify the location of the user words file.
This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.
Useful when dealing with content that contains technical terminology, jargon, etc.
$ cat /path/to/user-words.txt
foo
bar
echo (new TesseractOCR('img.png'))
->userWords('/path/to/user-words.txt')
->run();
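If the word list lives in your application rather than on disk, it could be written out before calling `userWords`. This is a minimal sketch; `writeUserWords` is a hypothetical helper, not part of this library.

```php
<?php
// Hypothetical helper, not part of the library.

/**
 * Write one word per line to a file and return the file path.
 * Surrounding whitespace is trimmed; blank entries and duplicates are dropped.
 */
function writeUserWords(array $words, string $path): string
{
    $trimmed = array_map('trim', $words);
    $nonEmpty = array_filter($trimmed, function ($word) {
        return $word !== '';
    });
    $unique = array_unique($nonEmpty);

    file_put_contents($path, implode("\n", $unique) . "\n");
    return $path;
}

// $path = writeUserWords(['foo', 'bar'], '/path/to/user-words.txt');
// echo (new TesseractOCR('img.png'))->userWords($path)->run();
```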
userPatterns
Specify the location of the user patterns file.
If the content you are dealing with has known patterns, this option can greatly improve tesseract's recognition accuracy.
$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com
echo (new TesseractOCR('img.png'))
->userPatterns('/path/to/user-patterns.txt')
->run();
lang
Define one or more languages to be used during the recognition. A complete list of available languages can be found here.
Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.
echo (new TesseractOCR('img.png'))
->lang('lang1', 'lang2', 'lang3')
->run();
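On the tesseract command line, several languages are passed as a single `-l` value joined with `+` (for example `-l eng+jpn+spa`). The sketch below shows how a list of language codes could be turned into that argument. `buildLangArgument` is a hypothetical helper, and how this library assembles the flag internally is an assumption.

```php
<?php
// Hypothetical helper, not the library's implementation.

/**
 * Join language codes into a tesseract `-l` argument.
 * Returns an empty string when no language is given.
 */
function buildLangArgument(string ...$langs): string
{
    if ($langs === []) {
        return '';
    }
    // Values are not shell-escaped here; escape them before building a real command.
    return '-l ' . implode('+', $langs);
}

// An array of codes can be spread into the fluent call as well:
// $langs = ['eng', 'jpn', 'spa'];
// echo (new TesseractOCR('img.png'))->lang(...$langs)->run();
```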
psm
Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.
Possible psm values are:
Value | Description |
---|---|
0 | Orientation and script detection (OSD) only. |
1 | Automatic page segmentation with OSD. |
2 | Automatic page segmentation, but no OSD, or OCR. |
3 | Fully automatic page segmentation, but no OSD. (Default) |
4 | Assume a single column of text of variable sizes. |
5 | Assume a single uniform block of vertically aligned text. |
6 | Assume a single uniform block of text. |
7 | Treat the image as a single text line. |
8 | Treat the image as a single word. |
9 | Treat the image as a single word in a circle. |
10 | Treat the image as a single character. |
echo (new TesseractOCR('img.png'))
->psm(6)
->run();
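Bare numbers such as `6` are easy to misread, so named constants may help in application code. The class below is a hypothetical convenience that mirrors the psm table; the library itself does not define it, and the constant names are made up for illustration.

```php
<?php
// Hypothetical constants mirroring the psm table; not defined by the library.
final class PageSegMode
{
    const OSD_ONLY = 0;
    const AUTO_WITH_OSD = 1;
    const AUTO_NO_OSD_NO_OCR = 2;
    const AUTO = 3; // default
    const SINGLE_COLUMN = 4;
    const SINGLE_BLOCK_VERTICAL = 5;
    const SINGLE_BLOCK = 6;
    const SINGLE_LINE = 7;
    const SINGLE_WORD = 8;
    const CIRCLE_WORD = 9;
    const SINGLE_CHAR = 10;

    /** True if $mode is one of the values listed in the table (0 to 10). */
    public static function isListed(int $mode): bool
    {
        return $mode >= self::OSD_ONLY && $mode <= self::SINGLE_CHAR;
    }
}

// echo (new TesseractOCR('img.png'))->psm(PageSegMode::SINGLE_LINE)->run();
```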
whitelist
This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
echo (new TesseractOCR('img.png'))
->whitelist(range('a', 'z'), range(0, 9), '-_@')
->run();
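To make the shortcut concrete, here is a sketch of how mixed arguments (arrays from `range()` and plain strings) could be flattened into the single string that `tessedit_char_whitelist` expects. `flattenWhitelist` is a hypothetical helper and not necessarily how the library does it.

```php
<?php
// Hypothetical helper, not the library's implementation.

/**
 * Concatenate arrays of characters and plain strings into one whitelist string.
 */
function flattenWhitelist(...$args): string
{
    $chars = '';
    foreach ($args as $arg) {
        // Arrays such as range('a', 'z') are joined; scalars are appended as-is.
        $chars .= is_array($arg) ? implode('', $arg) : (string) $arg;
    }
    return $chars;
}

// flattenWhitelist(range('a', 'z'), range(0, 9), '-_@')
// returns 'abcdefghijklmnopqrstuvwxyz0123456789-_@'
```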
Other options
Tesseract offers incredible control to the user through its 600+ configuration options. You can see the complete list by running the following command:
$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...
echo (new TesseractOCR('img.png'))
->config('config_var', 'value')
->config('other_config_var', 'other value')
->run();
// or better yet, just camel case any of the options:
echo (new TesseractOCR('img.png'))
->configVar('value')
->otherConfigVar('other value')
->run();
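The camel case form implies a mapping from method names back to Tesseract's snake_case parameter names. A sketch of such a mapping is shown below; `camelToSnake` is a hypothetical helper, and the library's own conversion may differ in edge cases such as consecutive capitals or digits.

```php
<?php
// Hypothetical helper, not the library's implementation.

/**
 * Convert a camelCase method name to a snake_case parameter name.
 */
function camelToSnake(string $name): string
{
    // Insert an underscore before every uppercase letter that is not
    // the first character, then lowercase the whole string.
    return strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $name));
}

// camelToSnake('tesseditCharWhitelist') returns 'tessedit_char_whitelist'
```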
Where to get help
Join the chat at https://gitter.im/thiagoalessio/tesseract-ocr-for-php
License
tesseract-ocr-for-php is released under the MIT License.