Comments (3)
Abstract
Post-processing of OCR text
A weighted confusion matrix and a shallow language model can improve accuracy.
This paper explores the use of a learned classifier for post-OCR text correction. Experiments with the Arabic language show that this approach, which integrates a weighted confusion matrix and a shallow language model, corrects the vast majority of segmentation and recognition errors, the most frequent types of error on our dataset.
from awesome-ocr.
Method
We propose an OCR post-correction technique based on a composite machine-learning classification. The method applies a lexical spellchecker and potentially corrects single-error misspellings and a certain class of double-error misspellings, which are the major source of inaccurate recognitions in most OCR use-cases. The novelty of this method is its ability to take into consideration several valuable word features, each giving additional information for a possible spelling correction. It is built out of two consecutive stages:
- word expansion based on a confusion matrix, and
- word selection by a regression model based on word features.
The confusion matrix and regression model are built from a transcribed set of images, while the word features rely on a language model built from a large publicly available textual dataset.
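The two stages above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the confusion probabilities, the unigram counts, the three features, and the hand-set weights (standing in for a trained regression model) are all hypothetical toy values, and `expand`, `features`, and `select` are names made up for this example.

```python
import math
from itertools import product

# Hypothetical toy confusion matrix: for an OCR-output character, the
# probability that the true character was a different one.
CONFUSIONS = {"1": {"l": 0.3, "i": 0.1}, "0": {"o": 0.4}}

# Hypothetical unigram counts standing in for the language model.
UNIGRAMS = {"hello": 500, "help": 300, "hollow": 20}

# Hand-set weights standing in for a trained regression model (assumption).
WEIGHTS = (1.0, 1.0, 2.0)


def expand(word, confusions, max_subs=2):
    """Stage 1: candidates reachable with at most max_subs substitutions."""
    per_position = []
    for ch in word:
        alts = confusions.get(ch, {})
        keep = 1.0 - sum(alts.values())  # probability the character is correct
        per_position.append([(ch, keep, 0)] + [(a, p, 1) for a, p in alts.items()])
    candidates = {}
    for combo in product(*per_position):
        if sum(changed for _, _, changed in combo) > max_subs:
            continue
        cand = "".join(ch for ch, _, _ in combo)
        candidates[cand] = math.prod(p for _, p, _ in combo)
    return candidates


def features(cand, channel_prob, unigrams):
    """Word features: confusion log-probability, log frequency, lexicon flag."""
    return (
        math.log(channel_prob),
        math.log1p(unigrams.get(cand, 0)),
        1.0 if cand in unigrams else 0.0,
    )


def select(word, confusions=CONFUSIONS, unigrams=UNIGRAMS, weights=WEIGHTS):
    """Stage 2: pick the candidate with the highest weighted feature score."""
    candidates = expand(word, confusions)

    def score(cand):
        feats = features(cand, candidates[cand], unigrams)
        return sum(w * f for w, f in zip(weights, feats))

    return max(candidates, key=score)


print(select("he11o"))
```

Enumerating every substitution combination is exponential in the number of confusable characters, so a real system would likely prune candidates; the sketch keeps the brute-force form for readability.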
OCR result correction methods
There has been much research aimed at the automated correction of recognition errors for degraded collections. An early, useful survey is [1]; relevant methods for Arabic OCR are summarized in [2] and in collection [3]. In this work, we use language models on the character and word levels, plus lexicons. We do not apply morphological or syntactical analyses, nor passage-level or topic-based methods.
Three language resources play a role:
- Dictionary lookup compares OCR output with the words in a lexicon. When there is a mismatch, one looks for alternatives within a small edit (Levenshtein) distance, under the assumption that OCR errors are often due to character insertions, deletions, and/or substitutions. For this purpose, one commonly uses a noisy-channel model, a probabilistic confusion matrix for character substitutions, and term frequency lists [4], as we do here. One must, however, take into consideration unseen (“out of vocabulary”) words, especially for morphologically rich languages, like Greek, and even more so for abjads, like Arabic, in which vowels are not represented. The correct reading might not appear in the lexicon (even if it is not a named entity), while many mistaken readings will appear, because a large fraction of letter combinations form valid words. Morphological techniques could help here, of course. Dictionary lookup and shallow morphology are used in [2].
- We use the term k-mer for the possible contiguous character substrings of words. By collecting statistics on the relative frequency of different k-mers for a particular language, one can often recognize unlikely readings. This technique was employed by BBN’s OCR system for Arabic [5], as well as in [2].
- A language model, based on n-gram frequencies derived from a large corpus, is frequently used to estimate the likelihood of a reading in context [6].
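As a rough illustration of the first two resources, the sketch below pairs a dictionary lookup within a small Levenshtein distance with a character k-mer plausibility score. It is a minimal example under assumptions: the lexicon and its frequencies are toy values, candidates are ranked by distance and term frequency only (a simplification, not a full noisy-channel model), and the function names, the `^`/`$` boundary markers, and the add-one smoothing are choices made for this example rather than anything taken from the cited systems.

```python
import math
from collections import Counter

# Hypothetical toy lexicon with term frequencies.
LEXICON = {"recognition": 50, "recognise": 10, "cognition": 5, "reception": 8}


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lookup(word, lexicon, max_dist=1):
    """Return the word itself if known, else close lexicon entries,
    ordered by edit distance and then by descending frequency."""
    if word in lexicon:
        return [word]
    scored = [(levenshtein(word, w), -freq, w) for w, freq in lexicon.items()]
    return [w for dist, _, w in sorted(scored) if dist <= max_dist]


def kmers(word, k=3):
    """Contiguous character substrings of length k, with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def train_kmer_counts(words, k=3):
    """Count k-mers over a word list."""
    return Counter(g for w in words for g in kmers(w, k))


def kmer_score(word, counts, k=3):
    """Average add-one-smoothed log-probability of the word's k-mers;
    lower values indicate a less plausible reading."""
    grams = kmers(word, k)
    if not grams:
        return float("-inf")
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams) / len(grams)


counts = train_kmer_counts(LEXICON)
print(lookup("recognltion", LEXICON))
print(kmer_score("recognition", counts) > kmer_score("rxcqgnztzon", counts))
```

In practice the k-mer counts would come from a large corpus rather than the lexicon itself, and the lookup would be combined with confusion-matrix weights and a word-level n-gram model for context.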