Comments (5)
The problem here is that init() provides a handle that must be freed with cleanup(). And with Pyocr's current API, it's hard to figure out the best time to free it.
Some programs may want to keep the same handle for as long as they are running, but others (like Paperwork, for instance) prefer to have it freed as soon as it is no longer used.
So I think this change implies changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.
from pyocr.
My own patch adds an optional keyword argument, tesseract_raw_handle, to image_to_string():
-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle
 
     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD
 
     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )
 
         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
...
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                 break
 
     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)
 
     return builder.get_output()
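and I init and clean up the handle myself:

from pyocr import builders, libtesseract

# `images` is an iterable of images prepared by the caller.
tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
try:
    for image in images:
        libtesseract.image_to_string(
            image,
            lang="eng",
            builder=builders.DigitBuilder(7),
            tesseract_raw_handle=tesseract_raw_handle
        )
finally:
    # Free the shared handle once, after all images have been processed.
    libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)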
from pyocr.
Maybe add a new class-based API, something like ImageToString. A class is an optional way to solve this problem, and we can use weakref.finalize to force the call to cleanup when the ImageToString instance is garbage-collected, so that a user who forgets to free the handle does not leak it. Of course, telling users to write "with ImageToString() as i:", so that cleanup is called in __exit__, is the best way.
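A minimal sketch of what such a class could look like. This is not part of PyOCR: the class name follows the suggestion above, and the init and cleanup callables are placeholders passed in by the caller (with the real library they would presumably be tesseract_raw.init and tesseract_raw.cleanup).

```python
import weakref


class ImageToString:
    """Hypothetical wrapper that owns one OCR handle (not a PyOCR API)."""

    def __init__(self, init, cleanup, lang="eng"):
        # Assumption: init(lang=...) returns a handle, cleanup(handle) frees it.
        self.handle = init(lang=lang)
        # weakref.finalize runs cleanup(handle) at most once: on close(),
        # when the instance is garbage-collected, or at interpreter exit.
        # The callback must not reference self, or the instance never dies.
        self._finalizer = weakref.finalize(self, cleanup, self.handle)

    def close(self):
        # Calling a finalizer a second time is a no-op.
        self._finalizer()

    @property
    def closed(self):
        return not self._finalizer.alive

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions
```

With this shape, a long-running program can keep one instance alive, while a program that wants the handle freed early can use a with block; the finalizer is only a safety net for the case where neither close() nor the with block is used.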
from pyocr.
Interesting idea, but it is still a new API. So I'll consider it, but for the next major version (PyOCR2 :).
from pyocr.
A note: before reusing the handle for another image, we need to call TessBaseAPIClearAdaptiveClassifier. Recognizing a picture changes Tesseract's internal state, so without this call the result for one picture can be influenced by the pictures recognized before it.
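As a rough illustration of where that reset would go, here is a sketch with placeholder callables. ReusableHandle, recognize and clear_adaptive_classifier are hypothetical names, not PyOCR functions; in a real binding the last one would wrap TessBaseAPIClearAdaptiveClassifier.

```python
class ReusableHandle:
    """Hypothetical helper that resets adaptive state between images."""

    def __init__(self, handle, recognize, clear_adaptive_classifier):
        # Assumption: recognize(handle, image) runs OCR on one image and
        # clear_adaptive_classifier(handle) resets what was learned so far.
        self._handle = handle
        self._recognize = recognize
        self._clear = clear_adaptive_classifier
        self._used = False

    def recognize(self, image):
        # A fresh handle has nothing to clear; only reset on reuse.
        if self._used:
            self._clear(self._handle)
        self._used = True
        return self._recognize(self._handle, image)
```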
from pyocr.