openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab

Home Page: https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Python 95.04% Shell 3.51% Makefile 1.45%

python python27 python3 ocr tesseract-ocr tesseract-ocr-api cuneiform pillow pil

pyocr's Introduction

PyOCR

Moved to Gnome's Gitlab.

pyocr's Issues

Tesseract C API

As suggested by zdenop on tesseract-ocr/tesseract#85 (comment), using the C API could have many advantages:

faster
no annoying fork() + exec()
no temporary file
could avoid annoying regression like openpaperwork/paperwork#392

To check: thread safety.

https://github.com/tesseract-ocr/tesseract/wiki/APIExample#c-api-in-python

Tesseract segmentation mode with custom zone?

Hi, I have been using builders.WordBoxBuilder to get the positions of some words of interest in an image.

Is there functionality to run Tesseract in segmentation mode with inputted zones of interest?

I took a look at your builders.TextBuilder code and saw that I could change the tesseract_layout parameter in the constructor to change segmentation modes. However, some of the modes need an .uzn file that must share the name of the image being processed by Tesseract. The problem is, I can't get the name of the image file because you write to a random temp file in your tesseract.image_to_string code.

Just wondering if there was a straightforward way currently.

Thanks.

Tesseract: use --list-langs to get the available languages list

TODO: Use --list-langs
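
A minimal sketch of what this TODO could look like, not pyocr's actual implementation. It assumes the binary prints a header line ending with a colon followed by one language code per line, and that the listing may go to either stdout or stderr depending on the Tesseract version; parse_list_langs and get_available_languages are hypothetical names used only for illustration.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with ":"
        # (assumed format, e.g. "List of available languages (3):").
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def get_available_languages(tesseract_cmd="tesseract"):
    """Run the binary and parse whatever it prints."""
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into stdout
        universal_newlines=True,
    )
    return parse_list_langs(proc.stdout)
```

Because both streams are merged, any warning Tesseract prints would end up in the list, so a real implementation would probably need stricter filtering.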

pyocr does not support multiple languages. simple fix

When trying to set the language to multiple languages, e.g. "heb+eng", an exception is raised.

The "image_to_string" function in libtesseract/__init__.py should be modified to something like:

        for lang_item in clang.split('+'):
            if lang_item not in tesseract_raw.get_available_languages(handle):
                raise TesseractError(
                    "no lang",
                    "language {} is not available".format(lang_item)
                )
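
The same check can be sketched as a standalone function, which makes it easier to try without the library. This is only an illustration of the idea above: check_languages is a hypothetical helper, and ValueError stands in for pyocr's TesseractError.

```python
def check_languages(requested, available):
    """Validate a '+'-separated language string such as "heb+eng".

    Returns the individual language codes, or raises ValueError
    (a stand-in for TesseractError) naming the missing ones.
    """
    items = requested.split("+")
    missing = [item for item in items if item not in available]
    if missing:
        raise ValueError(
            "language(s) not available: {}".format(", ".join(missing))
        )
    return items
```

Collecting all missing languages before raising gives the user one complete error message instead of one per attempt.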

Empty image detection

Currently, it seems like the only way to detect whether an image is empty or not is using image_to_string(...). However, this method is very inefficient on images containing thousands of characters or more. If possible, it would be great to implement something like is_image_empty(image), which would return a bool indicating whether the image is empty.

Windows 7 Python27 complained no tools found

NO OCR tool found - tesseract 3.01 installed and working, but pyocr failed to locate tesseract

when I use it in ubuntu, it comes out "No module named builders"

The code is
from PIL import Image
import sys
import pyocr
import pyocr.builders

hOCR : too much data is stripped

Even when using the LineBoxBuilder, it seems too much data is stripped from the hOCR files.

Confidence score

Is it possible to get a confidence score for the predictions (not orientation) ?

Potential bug: output of Tesseract (C-API) and Tesseract (sh) is different

I got this simple example:

from PIL import Image
from pyocr import pyocr

py_img = Image.open('text.png')
for tool in pyocr.get_available_tools():
    print("Using pyocr tool '%s'" % (tool.get_name()))
    print(tool.image_to_string(py_img))

Where text.png is:

As this is a fairly simple case I would have expected the outcome to be the same however the outcome is:

Using pyocr tool 'Tesseract (C-API)'
Empty page!!

Using pyocr tool 'Tesseract (sh)'
3/2

Is this a bug or are the two tools configured differently by default?
I know the Tesseract (C-API) works properly on my computer as I have used it successfully with similar but different input, however in this very particular case, it fails.

tesseract.detect_orientation() dies with empty pages

Hi!

I'm encountering this error with some of my PDFs:

consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |
consumer_1 |    **** This file had errors that were repaired or ignored.
consumer_1 |    **** The file was produced by:
consumer_1 |    **** >>>> Mac OS X 10.8.2 Quartz PDFContext <<<<
consumer_1 |    **** Please notify the author of the software that produced this
consumer_1 |    **** file that it does not conform to Adobe's published PDF
consumer_1 |    **** specification.
consumer_1 |
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 171, in detect_orientation
consumer_1 |     angle = int(output['Orientation in degrees'])
consumer_1 | KeyError: 'Orientation in degrees'
consumer_1 |
consumer_1 | During handling of the above exception, another exception occurred:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
consumer_1 |     result = (True, func(*args, **kwds))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
consumer_1 |     return list(map(*args))
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 32, in image_to_string
consumer_1 |     orientation = self.OCR.detect_orientation(f, lang=lang)
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 180, in detect_orientation
consumer_1 |     % original_output)
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')
consumer_1 | """
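
Until the library handles blank pages itself, a caller could guard the call. The sketch below is an assumption-laden workaround, not a fix in pyocr: detect_orientation_or_default is a hypothetical wrapper, the exception class to catch is passed in by the caller (the traceback above suggests pyocr.tesseract.TesseractError), and the returned dict shape with "angle" and "confidence" keys is assumed.

```python
def detect_orientation_or_default(detect, image, lang=None,
                                  error_type=Exception, default_angle=0):
    """Call detect(image, lang=lang), falling back on failure.

    `detect` would typically be tool.detect_orientation and
    `error_type` the library's TesseractError (both assumptions).
    """
    try:
        return detect(image, lang=lang)
    except error_type:
        # Blank or near-blank page: assume no rotation is needed.
        return {"angle": default_angle, "confidence": 0.0}
```

Passing the exception type explicitly keeps the helper from hiding unrelated errors when a narrow class is supplied.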

AttributeError: 'module' object has no attribute 'get_available_tools'

Hello guys !
I've created a test file in a separate folder : my code

from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'tesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'

txt = tool.image_to_string(Image.open('http://www.domain.com/fr/i/3518721/phone'),
                           lang=lang,
                           builder=pyocr.builders.TextBuilder())
word_boxes = tool.image_to_string(Image.open('http://www.domain.com/fr/i/3518721/phone'),
                                  lang=lang,
                                  builder=pyocr.builders.WordBoxBuilder())
line_and_word_boxes = tool.image_to_string(
        Image.open('test.png'), lang=lang,
        builder=pyocr.builders.LineBoxBuilder())

and I get this error message

Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    tools = pyocr.get_available_tools()
AttributeError: 'module' object has no attribute 'get_available_tools'

any Idea ?

Usage under win32 failed.

Please refine tesseract binary checking algorithm within util.py.

os.access(os.path.join('"C:\Program Files\Tesseract-OCR"',"tesseract"), os.X_OK)
False
os.access(os.path.join("C:\Program Files\Tesseract-OCR","tesseract"), os.X_OK)
False
os.access(os.path.join("C:\Program Files\Tesseract-OCR","tesseract"), os.X_OK)
False
os.access(os.path.join("C:\Program Files\Tesseract-OCR","tesseract.exe"), os.X_OK)
True
os.access(os.path.join('"C:\Program Files\Tesseract-OCR"',"tesseract.exe"), os.X_OK)
False
os.access(os.path.join("C:\Program Files\Tesseract-OCR","tesseract.exe"), os.X_OK)
True
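
One possible direction for the check, sketched with the standard library only. It is not what util.py does; find_tesseract and its extra_dirs parameter are hypothetical. It relies on shutil.which(), which consults PATHEXT on Windows, so a bare "tesseract" can also match "tesseract.exe".

```python
import os
import shutil


def find_tesseract(name="tesseract", extra_dirs=()):
    """Return the full path of the executable, or None if not found."""
    # Search the caller-supplied directories first, then PATH.
    search_path = os.pathsep.join(
        list(extra_dirs) + [os.environ.get("PATH", "")]
    )
    # On Windows, which() also tries the extensions listed in PATHEXT.
    return shutil.which(name, path=search_path)
```

A call such as find_tesseract(extra_dirs=[r"C:\Program Files\Tesseract-OCR"]) would then cover the install location used in the transcript above; note the directory is passed without surrounding quote characters.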

Direct file input

Both tesseract and cuneiform allow the user to pass in a file name as input. I would like to add a function that will take a file name directly and pass it to the OCR engine, rather than having to create a temporary input file. Since I was not able to replace the file IO with memory pipe for tesseract, having a function like that will speed things up since it will eliminate the unnecessary file IO.

Tesseract 3.02.01 : hocr support doesn't work anymore

I've updated Tesseract to the version 3.02.01 (debian package). Since then, I can't get the boxes.

There are two TesseractError

pyocr defines two TesseractError exceptions. As shown by the-paperless-project/paperless#154, this is useless and confusing.

We need only one.
TesseractError cannot be used with Cuneiform. It wouldn't make sense.

Tesseract __init__.py not working.

import tesseract
File "C:\Users\AppData\Local\Continuum\Anaconda3\lib\site-packages\tesseract\__init__.py", line 34
print 'Creating user config file: {}'.format(_config_file_usr)
^
SyntaxError: invalid syntax
Traceback (most recent call last):
File "", line 1, in

there at the line 34 it shows like this:

imports from __init__.py not working

I am doing import pyocr as described in the README, but after that I can't access pyocr.get_available_tools(). So is there something wrong with the from pyocr import * in __init__.py? FWIW, I am running on Debian Testing in a virtualenv with pyocr installed with pip.

Update:
I tried with the old way from pyocr import pyocr which gives the error

In [1]: from pyocr import pyocr
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-6e3defe0fffd> in <module>()
----> 1 from pyocr import pyocr

/home/user/.virtualenvs/venv/lib/python3.3/site-packages/pyocr/pyocr.py in <module>()
     46 """
     47 
---> 48 import cuneiform
     49 import tesseract
     50 

ImportError: No module named 'cuneiform'

No OCR tool found

Hi,
I have tried installing pytesseract and pyocr, but no tools are available.
Kindly see the Windows shell output below.

PS C:\WINDOWS\system32> pip install pyocr --ignore-installed
Collecting pyocr
Collecting six (from pyocr)
  Downloading six-1.10.0-py2.py3-none-any.whl
Collecting Pillow (from pyocr)
  Using cached Pillow-4.2.1-cp27-cp27m-win_amd64.whl
Collecting olefile (from Pillow->pyocr)
Installing collected packages: six, olefile, Pillow, pyocr
Successfully installed Pillow-4.2.1 olefile-0.44 pyocr-0.4.7 six-1.10.0
PS C:\WINDOWS\system32> pip install pytesseract --ignore-installed
Collecting pytesseract
Collecting Pillow (from pytesseract)
  Using cached Pillow-4.2.1-cp27-cp27m-win_amd64.whl
Collecting olefile (from Pillow->pytesseract)
Installing collected packages: olefile, Pillow, pytesseract
Successfully installed Pillow-4.2.1 olefile-0.44 pytesseract-0.1.7
PS C:\WINDOWS\system32> python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from PIL import Image
>>> import pyocr
>>> import pyocr.builders
>>> import pytesseract
>>> tools = pyocr.get_available_tools()
>>> tools
[]
>>> print(tools)
[]

Orientation detection failed on Tesseract (C-API)

Hello, when using Tesseract (C-API) with Tesseract 3.05.00, I get PyocrException with this message when trying to use orientation detection:

('detect_orientation failed', 'TessBaseAPIDetectOS() failed')

Reason: TessBaseAPIDetectOS() is considered unsafe and always returns false. It's also deprecated and may be removed soon.

Support for digit-only OCR

Hello,

I am using your library in conjunction with Tesseract to recognize digit-only images.
On the first try, Tesseract had issues with some digits, such as "0" being read as "D", until I noticed that Tesseract has a parameter telling it that the image contains only digits.
Doing so, the recognition is perfect (99%).

To activate this feature (that is, adding digits to the Tesseract command line), I subclassed the TextBuilder this way:

class DigitBuilder(TextBuilder):
    """
    Specialization for Tesseract to use Digit Only recognition
    """

    def __init__(self, tesseract_layout=3):
        self.tesseract_configs = ["-psm", str(tesseract_layout), "digits"]

I would like to write a pull request on it, but I do not know how you manage the builders and if Cuneiform has a similar feature.

If you provide me some hints I will surely help this useful project.
Regards

Detect os and use tesseract.exe in windows environment

I'm a Windows user, and this module can't detect my Tesseract installation. After reading the source, I found that on Windows we should use tesseract.exe instead of tesseract as TESSERACT_PATH.

I think detecting the OS and using tesseract.exe on Windows would reduce the pain for Windows users. At the very least, an FAQ entry in the README would be good.

error: package directory 'src/pyocr/tesseract_capi' does not exist

When executing sudo python setup.py install, the following error is returned:

error: package directory 'src/pyocr/tesseract_capi' does not exist

Specify regions of text

Hi, is there a way to specify the regions of text in pyocr?
Currently, I am cropping out the text regions and giving them to pyocr one at a time. This helps avoid some inaccuracies in Tesseract's page-layout analysis, but it's very slow.

Error opening chinese data file

I got this error:

TesseractError: (1, 'Error opening data file /usr/workspace/tesseract/chi-sim.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.\nFailed loading language 'chi-sim'\nTesseract couldn't load any languages!\nCould not initialize tesseract.\n')

Then I tried eng, fra traineddata file and all went well.

It took me a long time to find out that it was a naming problem. After I changed the filename from "chi-sim.traineddata" to "chi.traineddata" and updated the name in my programs, everything worked. I guess it's because pyocr has a problem reading data files with "-" in their names. However, official tesseract doesn't have this issue.

Please fix this, thank you!

Inconsistent spelling of Pyocr/PyOCR

Hi Jerome,
What is the exact spelling of the project ? The README uses PyOCR in the title but Pyocr in the description.
This may sound like nitpicking but debian requires me to specify the case-sensitive name of the project.

Would you mind changing all occurrences to one of the spellings ?

Thanks !

Libtesseract: need stress-testing

Someone has been reporting crashes of Paperwork when running the OCR. They are using Tesseract 3.04.01 .. so there may be something wrong with the libtesseract binding.

(Note: currently, the preference order has been changed so Pyocr uses tesseract-sh if possible)

Unnecessary file IO

The way the image_to_string functions are currently implemented, the output of the engine is written to file, which is then read in and returned to the user.
Both Cuneiform and Tesseract now support sending the output to stdout thus eliminating the need for the 2 extra file IO operations.
I'll attempt implementing this - hopefully it will result in speeding things up a bit.
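
A rough sketch of the stdout approach for Tesseract only, assuming a Tesseract version that accepts the literal word stdout in place of the output base name. It also reads an existing file by path, so no temporary input file is written. build_tesseract_command and ocr_file are hypothetical helpers, not pyocr functions, and Cuneiform is not covered.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng",
                            tesseract_cmd="tesseract"):
    """Build a command line that prints the recognised text to stdout."""
    # "stdout" in place of the output base name tells Tesseract
    # not to create an output file.
    return [tesseract_cmd, image_path, "stdout", "-l", lang]


def ocr_file(image_path, lang="eng"):
    """Run Tesseract on an existing file and return the decoded text."""
    proc = subprocess.run(
        build_tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return proc.stdout.decode("utf-8")
```

Keeping stderr separate matters here, since Tesseract writes its banner and warnings there and they should not be mixed into the recognised text.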

tesseract.detect_orientation returning wrong rotation direction after tesseract update

Due to this change in tesseract

tesseract-ocr/tesseract@6bbcb50#diff-8f75e5c5721b655480127da396bd5caa

The output of "psm 0" has changed to:

Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 19.30
Script: Latin
Script confidence: 18.28

From previously:

Orientation: 1
Orientation in degrees: 270
Orientation confidence: 19.30
Script: 1
Script confidence: 18.28

This in turn causes the image to be flipped upside down instead of right side up.
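
One way to cope with both formats, sketched under the assumption that the two sample outputs above describe the same page, so the old "Orientation in degrees" value corresponds to the new "Rotate" value. rotation_angle is a hypothetical helper that takes the already-parsed key/value pairs; it is not how pyocr necessarily resolved this.

```python
def rotation_angle(osd):
    """Pick the angle from parsed "psm 0" output (a dict of strings).

    Newer Tesseract prints a "Rotate" line; older versions only
    print "Orientation in degrees" with the equivalent value.
    """
    if "Rotate" in osd:
        return int(osd["Rotate"])
    return int(osd["Orientation in degrees"])


# The two samples from this report, reduced to the relevant keys.
new_style = {"Orientation in degrees": "90", "Rotate": "270"}
old_style = {"Orientation in degrees": "270"}
```

With this choice both samples yield 270, which is the behaviour callers relied on before the Tesseract change.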

Add support for Scribo

Leptonica causes unnecessary TesseractError

TestOrientation throws the following error on Tesseract 3.04.01 (installed via HomeBrew on OSX 10.10.5):

TesseractError: (-1, u'No script found in image (Warning in pixReadMemBmp: work-around: writing to a temp file\nPage number: 0\nOrientation in degrees: 0\nRotate: 0\nOrientation confidence: 15.38\nScript: Latin\nScript confidence: 466.67)')

The error is encountered when executing output = {x: y for (x, y) in output} on line 172.

This is caused by the PixReadMemBmp error which contains an extra colon, resulting in an array of 3 elements when split with [line.split(": ",1) for line in output if (": " in line)], resulting in a ValueError later on at {x: y for (x, y) in output}.

More on the cause of the PixReadMemBmp error can be found here and here.

As the orientation and confidence are calculated correctly, I think the error is not critical and should not cause the test to fail?
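
A tolerant parser is one possible remedy. The sketch below keeps only lines whose key is in a known set, so the Leptonica warning is ignored regardless of how many colons it contains. parse_osd_output and OSD_KEYS are hypothetical names; the key list is taken from the output quoted in this report and may be incomplete for other Tesseract versions.

```python
OSD_KEYS = (
    "Page number",
    "Orientation in degrees",
    "Rotate",
    "Orientation confidence",
    "Script",
    "Script confidence",
)


def parse_osd_output(text):
    """Turn 'key: value' lines into a dict, ignoring unrelated lines."""
    parsed = {}
    for line in text.splitlines():
        # partition() splits on the first ": " only, so extra colons
        # in the value can never produce a third element.
        key, sep, value = line.partition(": ")
        if sep and key.strip() in OSD_KEYS:
            parsed[key.strip()] = value.strip()
    return parsed
```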

Builders : bad use of class attributes

Ok I'm pretty much done for the digit builders,
but I stumbled on what I think is a bug.
The builders have lists as class attributes -- file_extensions, tesseract_configs, cuneiform_args -- and at init these lists are appended to, so that :

a = TextBuilder()
b = TextBuilder()
c = TextBuilder()
print(TextBuilder.tesseract_configs)

prints ['-psm', '3', '-psm', '3', '-psm', '3']

But there's worse. Since DigitBuilder inherits from TextBuilder and appends "digits" to tesseract_configs, any subsequent call to TextBuilder interprets the input as digits -- this was caught in tests, so they're useful :)

Proposed fixes :

simple : Make these lists instance attributes and not class attributes
preferred : Do not use lists at all, but just pass a dict of options
then make it into a list later
(this could also be used with a **kw for gathering tool-specific options without polluting the builders)
minimal : Redefine the class attribute in children. This still means only one config by class -- impossible to compare TextBuilder results with different psm.

Also ideally those attributes should be documented.
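
The problem and the "simple" fix can be reproduced without pyocr. The classes below are stand-ins written for this illustration (their names are hypothetical, and they are far smaller than the real builders): the first shares one list across all instances, the others create a fresh list per instance.

```python
class SharedConfigBuilder:
    """Reproduces the bug: the list lives on the class."""
    tesseract_configs = []

    def __init__(self, tesseract_layout=3):
        # extend() mutates the single list shared by every instance.
        self.tesseract_configs.extend(["-psm", str(tesseract_layout)])


class PerInstanceBuilder:
    """The 'simple' fix: each instance owns its own list."""

    def __init__(self, tesseract_layout=3):
        self.tesseract_configs = ["-psm", str(tesseract_layout)]


class PerInstanceDigitBuilder(PerInstanceBuilder):
    """A subclass can now add options without affecting its parent."""

    def __init__(self, tesseract_layout=3):
        super().__init__(tesseract_layout)
        self.tesseract_configs.append("digits")
```

With per-instance lists, two builders with different psm values can coexist, which the "minimal" fix of redefining the class attribute in children would not allow.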

Tesseract 4.0 alpha support

Is there any plan to support Tesseract 4.0 alpha?

Cuneiform: Split text areas before OCR

Cuneiform tends to stop reading pages when it reaches a large non-readable area. Because of this, when using Cuneiform, not all the keywords are actually extracted.

A way to work around this problem would be to split the text areas prior to OCR.

For instance, unpaper can do that (ocrfeeder uses it).

ImportError: No module named builders

I am getting the below error while executing a script mentioned in initialization section of the readme file

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/pyocr/init.py", line 1, in
from .pyocr import *
File "/usr/local/lib/python2.7/dist-packages/pyocr/pyocr.py", line 50, in
from . import tesseract
File "/usr/local/lib/python2.7/dist-packages/pyocr/tesseract.py", line 28, in
from pyocr.builders import DigitBuilder # backward compatibility

Adding languages

I'm new to pyocr and ocr in general. I'm trying to use pyocr for languages such as french, chinese etc, but the get_available_languages returns only 3 options: osd, eng, equ. How can I add other languages?

Tesseract test fails on test_orientation_90, KeyError "Unknown Extension"

Error output:
Tesseract:
........................E
======================================================================
ERROR: test_orientation_90 (tests.tests_tesseract.TestOrientation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/james/pyocr/tests/tests_tesseract.py", line 352, in test_orientation_90
    result = tesseract.detect_orientation(img, lang='eng')
  File "src/pyocr/tesseract.py", line 159, in detect_orientation
    image.save(proc.stdin, format=image.format)
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1453, in save
    raise KeyError(ext) # unknown extension
KeyError: ''

I'm using the python-imaging library from Ubuntu's repositories. This is the only test that fails. Please advise on how to fix.

Tesseract C API : Digits-only (DigitBuilder)

Hello,
I'd like to use tesseract with a numerical input, but as it is this is only possible with the tesseract command line tool and its DigitBuilder, since f36f249

However, this looks easy enough to implement with the C API too, with a new function in libtesseract/tesseract_raw.py :

def set_numeric_only(handle) :
    global g_libtesseract
    assert(g_libtesseract)

    g_libtesseract.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"classify_bln_numeric_mode",
        b"1"
    )

The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py, in the same way as for tesseract.py.

But I think it might be better to move this to image_to_string both in
libtesseract/__init__.py and tesseract.py, with a new option, like it's done for choosing the language, since from what I understand builders should be more for choosing the format of the output.

I am not too familiar with github, ctypes, or pyocr, so sorry if I'm misunderstanding the code or doing something wrong.

Thank you for your work on this package,
Regards

PS : It looks like the C API also offers possibilities for getting confidence scores for words, which might be interesting to get to a Builder.

Cryptic TesseractError (-9) when processing image

Using the latest version of pyocr and attempting to parse text on a file run through unpaper:

with open('test.unpaper.pnm', 'r') as f:
    text = ocr.image_to_string(f, lang='eng')

This causes the following stack trace:

File "/usr/local/lib/python3.5/dist-packages/pyocr/tesseract.py", line 358, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (-9, b'Tesseract Open Source OCR Engine v3.04.01 with Leptonica\n')

However, I can run the following command:

tesseract test.unpaper.pnm output

And it works without errors. After searching, I cannot find any reference to the -9 return value, and it seems like the error output is being truncated (it's just the top stdout when you first run Tesseract).

Suggestions?
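
A possible reading of the -9, offered as a guess rather than a diagnosis: if the status pyocr reports is the child's return code as Python's subprocess module defines it, a negative value -N means the process was terminated by signal N, and signal 9 on POSIX systems is SIGKILL. describe_status below is a hypothetical helper that decodes such values.

```python
import signal


def describe_status(status):
    """Explain a subprocess-style exit status."""
    if status < 0:
        # Negative return codes encode the terminating signal number.
        try:
            name = signal.Signals(-status).name
        except ValueError:
            name = "signal {}".format(-status)
        return "killed by {}".format(name)
    return "exited with code {}".format(status)
```

If that reading is right, the truncated output would simply be whatever Tesseract had printed before it was killed, and the cause would lie outside Tesseract's own error handling.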

AttributeError: tobytes

I met a problem when using this wrapper.
My test code goes here:

from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))

langs = tool.get_available_languages()
print("Available languages: %s" % ",".join(langs))
lang = "eng"
print("Will use lang '%s'" % (lang))

txt = tool.image_to_string(
    Image.open('test.jpg'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
print(txt)

$ python test.py

Will use tool 'Tesseract (C-API)'
Available languages: char100,digit,chi_sim,eng
Will use lang 'eng'
Traceback (most recent call last):
  File "test.py", line 23, in <module>
    builder=pyocr.builders.TextBuilder()
  File "build/bdist.linux-x86_64/egg/pyocr/libtesseract/__init__.py", line 96, in image_to_string

  File "build/bdist.linux-x86_64/egg/pyocr/libtesseract/tesseract_raw.py", line 359, in set_image
  File "/usr/local/lib/python2.7/dist-packages/PIL/Image.py", line 512, in __getattr__
    raise AttributeError(name)
AttributeError: tobytes

My PIL version is 1.1.7 with libjpeg(JPEG support) and zlib(PNG/ZIP support)

I changed the source code in tesseract_raw.py,
replacing line 359 with:

try:
    imgdata = image.tobytes("raw", "RGB")
except AttributeError:
    imgdata = image.tostring("raw", "RGB")

and the tool finally worked well.
I think this is a PIL bug, but I don't know where it comes from.

Langs: can more languages be supported?

I have equ, eng and osd in the language list. Is it possible to support other languages, such as Chinese?

Tests/Tesseract: Differences between Pyocr and reference output

For some reason there are differences between the references and the actual results. The actual results seem to be good, so it's probably a bug in update_test_data.sh.

Failed tests

Hi again,

Some tests fail for me (fedora 16), with tesseract-ocr 3.00 probably because the version is not up to date (3.01), and also for cuneiform 1.1.0, for some reason. Looks like 1.1.0 is the latest one.

Here is the output related to cuneiform:

$ python run_tests.py 
- OCR: Tesseract
  is_available(): True
  get_version(): (3, 0, 0)
  get_available_languages():  
- OCR: Cuneiform
  is_available(): True
  get_version(): (1, 1, 0)
  get_available_languages():  eng,  ger,  fra,  rus,  swe,  spa,  ita,  ruseng,  ukr,  srp,  hrv,  pol,  dan,  por,  dut,  cze,  rum,  hun,  bul,  slv,  lav,  lit,  est,  tur,  

OCR tool found:
- Tesseract
- Cuneiform

---
Tesseract:
.FF..FFFF.EEEE

[snap old tesseract]

Cuneiform:
.....F..F.
======================================================================
FAIL: test_french (tests.tests_cuneiform.TestTxt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 73, in test_french
    self.__test_txt('test-french.jpg', 'test-french.txt', 'fra')
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 64, in __test_txt
    self.assertEqual(output, expected_output)
AssertionError: u'Phrase en *an\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re' != u'Phrase en fran\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re'
- Phrase en *an\xe7ais. 
?           ^
+ Phrase en fran\xe7ais. 
?           ^^
  avec des accents 
  \xe9ph\xe9m\xe8re

======================================================================
FAIL: test_french (tests.tests_cuneiform.TestWordBox)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 113, in test_french
    self.__test_txt('test-french.jpg', 'test-french.words', 'fra')
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 104, in __test_txt
    self.assertEqual(boxes[i], expected_boxes[i])
AssertionError: <builders.Box object at 0x2558650> != <builders.Box object at 0x24d0f90>

----------------------------------------------------------------------
Ran 10 tests in 2.248s

FAILED (failures=2)

Heroku: language file search paths

My language data is not in any of the paths specified in TESSDATA_POSSIBLE_PATHS. Is there a way I can add to this list of search paths?

ABBYY OCR support?

So I tried using paperwork and was not really satisfied with the results. It looks like Tesseract works as badly with my documents as it did some years ago when I last tried...

I found ABBYY OCR for Linux to work much better (at least for my documents), but I found the tooling around it to be lacking, so I didn't buy it so far (but played with the trial).

What do you think about integration of that into PyOCR? It seems to have an XML export with character box information, so I think that should work.

If you agree with the idea, I might contribute one day - but I'm currently very busy with my own projects, so that'll probably take a few months.

No OCR tool found

I have tesseract installed. I have used it many times before, but when I use this script:

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))


langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))


txt = tool.image_to_string(Image.open('test.png'),
                           lang=lang,
                           builder=pyocr.builders.TextBuilder())
word_boxes = tool.image_to_string(Image.open('test.png'),
                                  lang=lang,
                                  builder=pyocr.builders.WordBoxBuilder())
line_and_word_boxes = tool.image_to_string(
        Image.open('test.png'), lang=lang,
        builder=pyocr.builders.LineBoxBuilder())

It says there is no OCR tool found. Any fix?
