
Comments (5)

jflesch commented on June 11, 2024

The problem here is that init() provides a handle that must be freed with cleanup(). And with PyOCR's current API, it's hard to figure out the best time to free it.
Some programs may want to keep the same handle for as long as they are running, but others (like Paperwork, for instance) prefer to have it freed as soon as it is no longer used.

So I think this change implies changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.


wwqgtxx commented on June 11, 2024

My own patch adds an optional keyword argument, tesseract_raw_handle, to image_to_string():
 

-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle
 
     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD
 
     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )
 
         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
...
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                 break
 
     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)
 
     return builder.get_output()

and I init and clean up the handle myself:

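    from pyocr import builders, libtesseract

    # `images` is an iterable of PIL images, defined elsewhere.
    # The tesseract_raw_handle argument only exists with the patch above.
    tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
    try:
        for image in images:
            libtesseract.image_to_string(
                image,
                lang="eng",
                builder=builders.DigitBuilder(7),
                tesseract_raw_handle=tesseract_raw_handle
            )
    finally:
        # Free the shared handle exactly once, even if recognition fails.
        libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)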


wwqgtxx commented on June 11, 2024

Maybe adding a new class-based API, such as an ImageToString class, is another way to solve this problem. We could use weakref.finalize to force a call to cleanup() when the ImageToString instance is garbage-collected, in case the user forgets to free the handle. Of course, telling users to write "with ImageToString() as i:" so that cleanup() is called in __exit__ would be the best way.
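
A rough sketch of what such a class could look like, not a proposal for the actual PyOCR API. The names ImageToString, init_fn, cleanup_fn and recognize_fn are hypothetical; the three callables stand in for tesseract_raw.init(), tesseract_raw.cleanup() and the recognition body of image_to_string(), and are replaced by fakes here so that the example runs without Tesseract installed.

```python
import weakref


class ImageToString:
    """Hypothetical wrapper that owns one OCR handle for its whole lifetime."""

    def __init__(self, init_fn, cleanup_fn, recognize_fn, lang="eng"):
        # Assumption: init_fn(lang) returns a handle, cleanup_fn(handle) frees it.
        self._handle = init_fn(lang)
        self._recognize = recognize_fn
        # The callback gets the handle, not `self`; referencing `self` here
        # would keep the instance alive forever.
        self._finalizer = weakref.finalize(self, cleanup_fn, self._handle)

    def __call__(self, image):
        if not self._finalizer.alive:
            raise RuntimeError("handle already released")
        return self._recognize(self._handle, image)

    def close(self):
        # Calling a finalizer runs its callback at most once.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


# Fake backend, for illustration only.
released = []


def fake_init(lang):
    return {"lang": lang}


def fake_cleanup(handle):
    released.append(handle["lang"])


def fake_recognize(handle, image):
    return "{}:{}".format(handle["lang"], image)


with ImageToString(fake_init, fake_cleanup, fake_recognize) as ocr:
    results = [ocr(image) for image in ("a.png", "b.png")]
```

The with block is the deterministic path; the finalizer is only a safety net for instances that are dropped without close() being called.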


jflesch commented on June 11, 2024

Interesting idea. But it's still a new API. So I'll consider it, but for the next major version (PyOCR2 :).


wwqgtxx commented on June 11, 2024

One more note: before reusing the handle, we need to call TessBaseAPIClearAdaptiveClassifier. Otherwise, recognizing a different picture is affected by the changes that earlier pictures made to Tesseract's internal structures.
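
A minimal sketch of that call through ctypes. It assumes that `lib` is a ctypes handle to libtesseract loaded by the caller and that `handle` is the pointer returned by TessBaseAPICreate; whether tesseract_raw already exposes a wrapper for this function is not something this thread confirms. The helper names and the `recognize` callback are hypothetical.

```python
import ctypes


def clear_adaptive_classifier(lib, handle):
    """Reset what Tesseract adapted to on earlier images for this handle."""
    func = lib.TessBaseAPIClearAdaptiveClassifier
    # C signature: void TessBaseAPIClearAdaptiveClassifier(TessBaseAPI *handle)
    func.argtypes = [ctypes.c_void_p]
    func.restype = None
    func(handle)


def recognize_all(lib, handle, images, recognize):
    """Run recognize(handle, image) on each image, clearing state before each."""
    results = []
    for image in images:
        clear_adaptive_classifier(lib, handle)
        results.append(recognize(handle, image))
    return results
```

Clearing before every image keeps each result independent of the order in which the images are processed.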

