Comments (3)

wanghaisheng commented on June 10, 2024

Abstract

The accuracy of Optical Character Recognition (OCR) is crucial to the success of downstream applications in a text analysis pipeline. Recent OCR post-processing models significantly improve the quality of OCR-generated text, but they still tend to suggest correction candidates from limited observations and do not sufficiently account for the characteristics of OCR errors. In this paper, we show how to enlarge the candidate suggestion space by using an external corpus and by integrating OCR-specific features in a regression approach to correct OCR-generated errors. The evaluation results show that our model can correct 61.5% of the OCR errors (considering the top 1 suggestion) and 71.5% of the OCR errors (considering the top 3 suggestions), for cases where the theoretical correction upper bound is 78%.
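
To make the candidate-ranking idea concrete, here is a minimal Python sketch. It is not the paper's model: the fixed weights stand in for a fitted regression, the two features (string similarity and corpus frequency) are assumptions chosen for illustration, and the names `candidates`, `score`, `rank`, and `top_k_rate` are hypothetical.

```python
import math
from difflib import SequenceMatcher


def candidates(token, lexicon, min_ratio=0.5):
    """Collect lexicon words that look similar enough to the OCR token."""
    return [w for w in lexicon
            if SequenceMatcher(None, token, w).ratio() >= min_ratio]


def score(token, cand, freq, total, w_sim=0.7, w_freq=0.3):
    """Combine two illustrative features with fixed weights.

    A real system would learn the weights by regression on labelled
    OCR errors; the values here are placeholders (assumption).
    """
    sim = SequenceMatcher(None, token, cand).ratio()      # string similarity
    rel = math.log1p(freq[cand]) / math.log1p(total)      # corpus frequency
    return w_sim * sim + w_freq * rel


def rank(token, freq, k=3):
    """Return the top-k correction suggestions for one OCR token."""
    total = sum(freq.values())
    cands = candidates(token, freq)
    ranked = sorted(cands, key=lambda c: score(token, c, freq, total),
                    reverse=True)
    return ranked[:k]


def top_k_rate(pairs, freq, k):
    """Fraction of (ocr_token, truth) pairs whose truth is in the top k."""
    hits = sum(truth in rank(tok, freq, k) for tok, truth in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy word counts standing in for an external corpus (made-up data).
    freq = {"the": 1000, "then": 200, "tho": 1, "cat": 50}
    print(rank("tbe", freq))
    print(top_k_rate([("tbe", "the"), ("cot", "cat")], freq, k=1))
```

The `top_k_rate` helper mirrors how the abstract reports results: a correction counts as successful at top 1 or top 3 if the true word appears among that many suggestions. A larger external lexicon widens the candidate space, which is the lever the abstract describes.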

from awesome-ocr.

wanghaisheng commented on June 10, 2024

Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings
https://arxiv.org/pdf/1606.05157.pdf


wanghaisheng commented on June 10, 2024
Good OCR results for historical printings rely on the availability of recognition models trained on diplomatic transcriptions as ground truth, which is both a scarce resource and time-consuming to generate. Instead of having to train a separate model for each historical typeface, we propose a strategy to start from models trained on a combined set of available transcriptions in a variety of fonts. These mixed models result in character accuracy rates over 90% on a test set of printings from the same period of time, but without any representation in the training data, demonstrating the possibility to overcome the typography barrier by generalizing from a few typefaces to a larger set of (similar) fonts in use over a period of time. The output of these mixed models is then used as a baseline to be further improved by both fully automatic methods and semi-automatic methods involving a minimal amount of manual transcriptions. In order to evaluate the recognition quality of each model in a series of models generated during the training process in the absence of any ground truth, we introduce two readily observable quantities that correlate well with true accuracy. These quantities are mean character confidence C (as given by the OCR engine OCRopus) and mean token lexicality L (a distance measure of OCR tokens from modern wordforms taking historical spelling patterns into account, which can be calculated for any OCR engine). Whereas the fully automatic method is able to improve upon the result of a mixed model by only 1-2 percentage points, already 100-200 hand-corrected lines lead to much better OCR results with character error rates of only a few percent. This procedure minimizes the amount of ground truth production and does not depend on the previous construction of a specific typographic model.
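
The two ground-truth-free quantities can be sketched roughly as follows. This is a simplified reading of the abstract, not the paper's exact definitions: the lexicality formula (one minus the length-normalized Levenshtein distance to the nearest modern wordform), the rewrite-rule format for historical spelling patterns, and all function names are assumptions made for illustration. Per-character confidences are assumed to be already extracted from the OCR engine as floats.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_char_confidence(confidences):
    """C: average of the per-character confidences reported by the engine."""
    return sum(confidences) / len(confidences)


def normalize(token, patterns):
    """Apply historical-to-modern spelling rewrites in order (assumed format)."""
    for old, new in patterns:
        token = token.replace(old, new)
    return token


def token_lexicality(token, lexicon, patterns):
    """1.0 if the normalized token is a modern wordform, lower otherwise."""
    t = normalize(token.lower(), patterns)
    best = min(levenshtein(t, w) for w in lexicon)
    return max(0.0, 1.0 - best / max(len(t), 1))


def mean_token_lexicality(tokens, lexicon, patterns):
    """L: average lexicality over all OCR tokens."""
    return sum(token_lexicality(t, lexicon, patterns)
               for t in tokens) / len(tokens)


if __name__ == "__main__":
    # Made-up example data: two spelling rules and a tiny modern lexicon.
    patterns = [("ſ", "s"), ("vn", "un")]
    lexicon = {"und", "unser", "haus"}
    print(mean_char_confidence([0.9, 0.8, 1.0]))
    print(mean_token_lexicality(["vnd", "vnſer", "hauz"], lexicon, patterns))
```

Because neither quantity needs a transcription, both can be computed for every intermediate model saved during training and used to pick a model when no ground truth is available, which is the use the abstract describes.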
