amazon-archives / tesserpy

ARCHIVED: A Python API for Tesseract

Home Page: https://github.com/rigorgt/tesserpy

License: GNU Lesser General Public License v2.1

C++ 12.06% Python 87.94%

tesserpy's Introduction

Archived

This repository is no longer being updated. New development is happening in rigorgt/tesserpy.

tesserpy

A Python API for Tesseract

Requirements

  • Python >= 2.7 or >= 3.2
  • NumPy >= 1.6
  • Tesseract >= 3.02

Building

It's the usual distutils dance -- run python setup.py for more details.

If your Tesseract installation's files are not in the standard system paths, you may need to create a setup.cfg with the following contents:

[build_ext]
include-dirs=/path/to/tesseract/include
library-dirs=/path/to/tesseract/lib
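
For anyone who would rather generate that file from a script, here is a minimal sketch using Python 3's standard configparser module. The helper name write_setup_cfg and the example paths are hypothetical; the section and option names are the ones shown above.

```python
import configparser


def write_setup_cfg(include_dir, library_dir, path="setup.cfg"):
    """Write a [build_ext] section pointing at a non-standard Tesseract install.

    write_setup_cfg is a hypothetical helper, not part of tesserpy.
    """
    config = configparser.ConfigParser()
    config["build_ext"] = {
        "include-dirs": include_dir,
        "library-dirs": library_dir,
    }
    with open(path, "w") as handle:
        # space_around_delimiters=False yields "key=value", matching the README.
        config.write(handle, space_around_delimiters=False)
    return path
```

Calling write_setup_cfg("/path/to/tesseract/include", "/path/to/tesseract/lib") in the source directory before running setup.py should produce a file equivalent to the one above.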

Example

Here's a simple example that requires OpenCV:

import cv2
import tesserpy

tess = tesserpy.Tesseract("/path/to/tessdata/prefix", language="eng")
# Anything exposed by SetVariable / GetVariableAsString is an attribute
tess.tessedit_char_whitelist = """'"!@#$%^&*()_+-=[]{};,.<>/?`~abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"""
image = cv2.imread('/path/to/image.png')
tess.set_image(image)
page_info = tess.orientation()
print(page_info.textline_order == tesserpy.TEXTLINE_ORDER_TOP_TO_BOTTOM)
print("#####")
print(tess.get_utf8_text())
print("#####")
print("Word\tConfidence\tBounding box coordinates")
for word in tess.words():
	bb = word.bounding_box
	print("{}\t{}\tt:{}; l:{}; r:{}; b:{}".format(word.text, word.confidence, bb.top, bb.left, bb.right, bb.bottom))
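
If the word results need further processing rather than printing, they can be collected into plain dictionaries first. The sketch below assumes only the attributes used in the example above (text, confidence, and bounding_box with top, left, right, bottom). Box and Word are hypothetical stand-ins so the snippet runs without tesserpy installed; with the real library, pass tess.words() instead.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the attributes used in the README example.
Box = namedtuple("Box", "top left right bottom")
Word = namedtuple("Word", "text confidence bounding_box")


def words_to_rows(words):
    """Flatten word results into a list of dicts, one per word."""
    rows = []
    for word in words:
        bb = word.bounding_box
        rows.append({
            "text": word.text,
            "confidence": word.confidence,
            "top": bb.top,
            "left": bb.left,
            "right": bb.right,
            "bottom": bb.bottom,
        })
    return rows


# Made-up sample data, only to show the shape of the result.
sample = [Word("Hello", 91.5, Box(10, 12, 80, 30))]
for row in words_to_rows(sample):
    print(row)
```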

tesserpy's People

Contributors

gpjt, hyandell, kaolin, longears, squidpickles

tesserpy's Issues

Error in `python3': free(): invalid next size (normal): 0x0000000001f4ee80

Hey there!
Thanks for the module, I've been looking for something like this for my OpenCV project for quite a while!

But whenever I run the following code:
import cv2
import tesserpy

tess = tesserpy.Tesseract("/usr/share/tesseract-ocr/tessdata", language="eng")

# Anything exposed by SetVariable / GetVariableAsString is an attribute

tess.tessedit_char_whitelist = "-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"""
image = cv2.imread("meme.jpg")
tess.set_image(image)
page_info = tess.orientation()
print(tess.get_utf8_text())

I get the following output:

MEWESEXIDUWE WBEB
1 a m
Lg j w

*** Error in `python3': free(): invalid next size (normal): 0x0000000001f4ee80 ***

Where the first part: "MEWESEXIDUWE WBEB
1 a m
Lg j w" is the recognized text.

But I can't seem to figure out what causes the error...

Help would be much appreciated!
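
A message like this comes from glibc aborting the process after detecting heap corruption, so no Python traceback is printed. One way to narrow it down is to run the script with Python's faulthandler enabled, which dumps the Python stack when the process receives SIGABRT. The sketch below is only a diagnostic aid and does not address the underlying cause; run_with_faulthandler is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_faulthandler(script_args, timeout=60):
    """Run a child Python interpreter with faulthandler enabled.

    Returns (returncode, stdout, stderr). If the child aborts, stderr
    should contain the Python-level stack at the point of the crash.
    """
    cmd = [sys.executable, "-X", "faulthandler"] + list(script_args)
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

For example, run_with_faulthandler(["repro.py"]), where repro.py is a placeholder name for the script above, would show which Python line was executing when free() aborted.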

Trouble compiling on Windows

Hi there,

I had to compile tesserpy for Python 2.7 on 32-bit Windows XP today, using VS 2012 Express. When I ran "python setup.py install", it failed because the linker could not find tesseract.lib.

I finally got it to build by copying libtesseract302.lib to tesseract.lib, but the resulting module could not be imported in Python. The error was:
ImportError: DLL load failed: The specified module could not be found

I also tried renaming libtesseract302-static.lib to tesseract.lib, but that produced plenty of errors about unresolved external references.

I ended up modifying setup.py, changing "libraries = ['tesseract', ]" to "libraries = ['libtesseract302', ]" and adding "data_files = [('tesseract', ['tesseract/lib/libtesseract302.dll']), ]," at the end of the setup function call.

It still doesn't work and I'm out of ideas: I get the same error about the DLL that can't be found.

If you have any ideas...
Cheers,
Thomas
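
"DLL load failed" on import usually means that a DLL the extension module depends on is not in any directory Windows searches. A relative data_files target is installed under the Python prefix, which is typically not one of those directories. Assuming that is the cause here, the sketch below checks whether a given DLL is reachable through PATH. find_on_path is a hypothetical helper, and PATH is only one part of the Windows search order.

```python
import os


def find_on_path(filename, search_path=None):
    """Return every PATH directory entry that contains the given file."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    matches = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            matches.append(candidate)
    return matches


# An empty list means the DLL is not reachable through PATH.
print(find_on_path("libtesseract302.dll"))
```

If the list is empty, adding the directory that holds libtesseract302.dll (and any DLLs it depends on in turn) to PATH, or copying them next to the built extension module, may be worth trying.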

Is it possible to set PSM mode?

I noticed that tesserpy exposes PSM-related constants like tess.PSM_SINGLE_CHAR. It's less clear to me how to use this. How do I tell tesseract to recognize only a single character using this constant?
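
One possible approach, based on the README's note that anything exposed by SetVariable is available as an attribute: Tesseract has a variable named tessedit_pageseg_mode, so assigning the constant to that attribute may be enough. The sketch below uses a hypothetical stand-in class so that it runs without tesserpy. Whether tesserpy accepts an integer here, or needs a string, is an assumption that has not been verified.

```python
class FakeTesseract(object):
    """Hypothetical stand-in for tesserpy.Tesseract, for illustration only."""


def set_page_seg_mode(tess, mode):
    """Assign the page segmentation mode through the variable-as-attribute API."""
    tess.tessedit_pageseg_mode = mode
    return tess


# 10 is the value of PSM_SINGLE_CHAR in Tesseract 3.02's PageSegMode enum;
# with tesserpy installed, tesserpy.PSM_SINGLE_CHAR would be used instead.
PSM_SINGLE_CHAR = 10

tess = set_page_seg_mode(FakeTesseract(), PSM_SINGLE_CHAR)
print(tess.tessedit_pageseg_mode)
```

With the real library this would look like tess.tessedit_pageseg_mode = tesserpy.PSM_SINGLE_CHAR, set before calling set_image.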

Compiling tesserpy on Debian (ResultIterator errors)

Tesserpy works like a charm on Mac OS X Yosemite. Trying to compile it on the production server now, which runs Debian Wheezy. Installed Tesseract 3.02 through apt-get install tesseract-ocr libtesseract-dev. When trying to compile (either through easy_install tesserpy, pip install tesserpy or python setup.py install), I get a ton of messages about an incomplete tesseract::ResultIterator:

Running tesserpy-1.1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-HW1lip/tesserpy-1.1.2/egg-dist-tmp-thS0mz
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
tesserpy.cpp: In function ‘PyResult* PyResultIterator_next(PyResultIterator*)’:
tesserpy.cpp:448:25: error: invalid use of incomplete type ‘class tesseract::ResultIterator’
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:83:7: error: forward declaration of ‘class tesseract::ResultIterator’
tesserpy.cpp:457:10: error: invalid use of incomplete type ‘class tesseract::ResultIterator’
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:83:7: error: forward declaration of ‘class tesseract::ResultIterator’
tesserpy.cpp:460:29: error: invalid use of incomplete type ‘class tesseract::ResultIterator’
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:83:7: error: forward declaration of ‘class tesseract::ResultIterator’
tesserpy.cpp:465:25: error: invalid use of incomplete type ‘class tesseract::ResultIterator’
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:83:7: error: forward declaration of ‘class tesseract::ResultIterator’
tesserpy.cpp: In function ‘void PyResultIterator_dealloc(PyResultIterator*)’:
tesserpy.cpp:484:18: warning: possible problem detected in invocation of delete operator: [enabled by default]
tesserpy.cpp:484:18: warning: invalid use of incomplete type ‘class tesseract::ResultIterator’ [enabled by default]
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:83:7: warning: forward declaration of ‘class tesseract::ResultIterator’ [enabled by default]
tesserpy.cpp:484:18: note: neither the destructor nor the class-specific operator delete will be called, even if they are declared when the class is defined
tesserpy.cpp: In function ‘void PyTesseract_dealloc(PyTesseract*)’:
tesserpy.cpp:542:20: warning: possible problem detected in invocation of delete operator: [enabled by default]
tesserpy.cpp:542:20: warning: invalid use of incomplete type ‘class tesseract::PageIterator’ [enabled by default]
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:81:7: warning: forward declaration of ‘class tesseract::PageIterator’ [enabled by default]
tesserpy.cpp:542:20: note: neither the destructor nor the class-specific operator delete will be called, even if they are declared when the class is defined
tesserpy.cpp: In function ‘PyObject* PyTesseract_set_image(PyTesseract*, PyObject*)’:
tesserpy.cpp:695:16: warning: possible problem detected in invocation of delete operator: [enabled by default]
tesserpy.cpp:695:16: warning: invalid use of incomplete type ‘class tesseract::PageIterator’ [enabled by default]
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:81:7: warning: forward declaration of ‘class tesseract::PageIterator’ [enabled by default]
tesserpy.cpp:695:16: note: neither the destructor nor the class-specific operator delete will be called, even if they are declared when the class is defined
tesserpy.cpp: In function ‘PyObject* PyTesseract_recognize(PyTesseract*)’:
tesserpy.cpp:721:16: warning: possible problem detected in invocation of delete operator: [enabled by default]
tesserpy.cpp:721:16: warning: invalid use of incomplete type ‘class tesseract::PageIterator’ [enabled by default]
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:81:7: warning: forward declaration of ‘class tesseract::PageIterator’ [enabled by default]
tesserpy.cpp:721:16: note: neither the destructor nor the class-specific operator delete will be called, even if they are declared when the class is defined
tesserpy.cpp: In function ‘PyObject* PyTesseract_orientation(PyTesseract*)’:
tesserpy.cpp:742:12: error: invalid use of incomplete type ‘class tesseract::PageIterator’
In file included from tesserpy.cpp:8:0:
/usr/include/tesseract/baseapi.h:81:7: error: forward declaration of ‘class tesseract::PageIterator’
error: Setup script exited with error: command 'gcc' failed with exit status 1

Am I missing some dependency? I thought tesserpy would work with Tesseract 3.02. Should I dig in the tesserpy code to see what's up?

Datapath parameter doesn't seem to do anything

Perhaps this is more of a Tesseract question than a tesserpy question. According to the Tesseract docs, datapath (Init's first argument) is to be used as follows:

The datapath must be the name of the parent directory of tessdata and must end in / .

I assume this is the exact same as setting the TESSDATA_PREFIX environment variable. But when I run the program as follows:

tess = tesserpy.Tesseract("/some/folder/", language="test")

I get the following error:

Error opening data file /usr/share/tesseract-ocr/tessdata/test.traineddata

Now when I run the program with the TESSDATA_PREFIX set:

os.environ['TESSDATA_PREFIX'] = "/some/folder/"
tess = tesserpy.Tesseract("", language="test")

I get the following error:

Error opening data file /some/folder/tessdata/test.traineddata

So it seems datapath doesn't do what I'd expect it to do. I'm wondering -- if it's not a pointer to the tessdata directory, what does it do?
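
Judging by the two error messages above, Tesseract appends tessdata/<language>.traineddata to whichever prefix it ends up using. A small check like the one below, run before constructing the Tesseract object, shows which file a given prefix would resolve to and whether it exists. Both helper names are hypothetical, and the path layout is inferred from the error messages rather than from Tesseract's source.

```python
import os


def expected_traineddata(prefix, language):
    """Build the traineddata path implied by the error messages above."""
    return os.path.join(prefix, "tessdata", language + ".traineddata")


def check_language_data(prefix, language):
    """Return (path, exists) for the language file under the given prefix."""
    path = expected_traineddata(prefix, language)
    return path, os.path.isfile(path)


print(check_language_data("/some/folder/", "test"))
```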

differences between tesserpy and command line tesseract

I'm getting different results when I run tesseract through the command line and via tesserpy. I've tried a couple of params with the command line. For example, "tesseract example.jpg" which for my specific file (which I can provide if it would help) gives pretty bad results. I then tried "tesseract -psm 7 example.jpg" which gave really good results. However, with tesserpy, I get results somewhere in between. Any way of being able to force tesserpy to use "-psm 7"?
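
For reference, the number passed to -psm is a value of Tesseract's PageSegMode enum, so "-psm 7" means PSM_SINGLE_LINE. As far as can be told, the command-line tool defaults to PSM_AUTO (3) while the library's tessedit_pageseg_mode variable defaults to PSM_SINGLE_BLOCK (6), which could explain results that fall "somewhere in between". The sketch below mirrors the Tesseract 3.02 enum values in a local IntEnum; the class and the psm_name helper are illustrative and not part of tesserpy.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Local mirror of tesseract::PageSegMode as of Tesseract 3.02."""
    PSM_OSD_ONLY = 0
    PSM_AUTO_OSD = 1
    PSM_AUTO_ONLY = 2
    PSM_AUTO = 3
    PSM_SINGLE_COLUMN = 4
    PSM_SINGLE_BLOCK_VERT_TEXT = 5
    PSM_SINGLE_BLOCK = 6
    PSM_SINGLE_LINE = 7
    PSM_SINGLE_WORD = 8
    PSM_CIRCLE_WORD = 9
    PSM_SINGLE_CHAR = 10


def psm_name(cli_value):
    """Translate a -psm number from the command line into its enum name."""
    return PageSegMode(int(cli_value)).name


print(psm_name(7))
```

If tesserpy exposes a constant with the resulting name (the question above mentions PSM_SINGLE_CHAR, so PSM_SINGLE_LINE seems likely), assigning it to the tessedit_pageseg_mode variable may reproduce the command-line behaviour.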

error compiling

Hi, I realize that this package isn't really still in production but it could be really useful so I'm hoping someone can help with this. I get the following compiling error (on Ubuntu):
tesseract_wrap.cpp: In function ‘PyObject* _wrap_TessBaseAPI_SetImage__SWIG_1(PyObject*, PyObject*)’:
tesseract_wrap.cpp:5173:37: error: invalid conversion from ‘const Pix*’ to ‘Pix*’ [-fpermissive]
(arg1)->SetImage((Pix const *)arg2);
^
In file included from tesseract_wrap.cpp:3472:0:
/usr/include/tesseract/baseapi.h:354:8: note: initializing argument 1 of ‘void tesseract::TessBaseAPI::SetImage(Pix*)’
void SetImage(Pix* pix);
^
tesseract_wrap.cpp: In function ‘PyObject* _wrap_TessBaseAPI_ProcessPages(PyObject*, PyObject*)’:
tesseract_wrap.cpp:5594:86: error: no matching function for call to ‘tesseract::TessBaseAPI::ProcessPages(const char*, const char*, int&, STRING*&)’
result = (bool)(arg1)->ProcessPages((char const *)arg2,(char const *)arg3,arg4,arg5);
^
In file included from tesseract_wrap.cpp:3472:0:
/usr/include/tesseract/baseapi.h:541:8: note: candidate: bool tesseract::TessBaseAPI::ProcessPages(const char*, const char*, int, tesseract::TessResultRenderer*)
bool ProcessPages(const char* filename, const char* retry_config,
^
/usr/include/tesseract/baseapi.h:541:8: note: no known conversion for argument 4 from ‘STRING*’ to ‘tesseract::TessResultRenderer*’
tesseract_wrap.cpp: In function ‘PyObject* _wrap_TessBaseAPI_ProcessPage(PyObject*, PyObject*)’:
tesseract_wrap.cpp:5676:95: error: no matching function for call to ‘tesseract::TessBaseAPI::ProcessPage(Pix*&, int&, const char*, const char*, int&, STRING*&)’
result = (bool)(arg1)->ProcessPage(arg2,arg3,(char const *)arg4,(char const *)arg5,arg6,arg7);
^
In file included from tesseract_wrap.cpp:3472:0:
/usr/include/tesseract/baseapi.h:556:8: note: candidate: bool tesseract::TessBaseAPI::ProcessPage(Pix*, int, const char*, const char*, int, tesseract::TessResultRenderer*)
bool ProcessPage(Pix* pix, int page_index, const char* filename,
^
/usr/include/tesseract/baseapi.h:556:8: note: no known conversion for argument 6 from ‘STRING*’ to ‘tesseract::TessResultRenderer*’
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I don't know if this is a swig error or what. Happy to help put in a PR but need some guidance here.

Cheers, Greg
