
Comments (3)

wanghaisheng commented on June 10, 2024

Background

Application scenarios of photo OCR

  • Street-sign reading for navigation systems
  • Assistive technology for the blind, such as reading street signs and product labels
  • Real-time text recognition and translation on mobile phones
  • Search and retrieval over the vast corpus of images and video on the web

Photo Optical Character Recognition (photo OCR), which aims to read scene text in natural images, is an essential step for a wide variety of computer vision tasks, and has enjoyed significant success in several commercial applications. These include street-sign reading for automatic navigation systems, assistive technologies for the blind (such as product-label reading), real-time text recognition and translation on mobile phones, and search/indexing the vast corpus of image and video on the web.

from awesome-ocr.

wanghaisheng commented on June 10, 2024

Methods

Traditional approaches

  • A lexicon of a fixed number of fixed-length words and characters, combined with hand-engineered image features
    • 1. Region-based binarization, HOG feature extraction, and Markov models with binary and connected-component features
    • 2. CNNs integrated on top of hand-crafted features; none of these can recognize words that are absent from the lexicon

The field of photo OCR has been primarily focused on constrained scenarios with hand-engineered image features. (Here, constrained means that there is a fixed lexicon or dictionary and words have known length during inference.) Specifically, examples of constrained text recognition methods include region-based binarization or grouping [5, 24, 33], pictorial structures with HOG features [47, 46], integer programming with SIFT descriptors [41], Conditional Random Fields (CRFs) with HOG features [32, 31, 39], and Markov models with binary and connected component features [49]. Some early attempts [26, 53, 10] try to learn local mid-level representations on top of the handcrafted features, and some methods in [48, 19, 16] incorporate deep convolutional neural networks (CNNs) [25, 13] for better image feature extraction. These methods work very well when candidate ground-truth word strings are known at the testing stage, but do not generalize at all to words that are not present in the lexicon.

  • Two CNNs are used: one to model character sequences and one as an N-gram language model; a CRF graphical model then combines the two

    A recent advance in the state-of-the-art that moves beyond this constrained setting was presented by Jaderberg et al. in [17]. The authors report results in the unconstrained setting by constructing two sets of CNNs – one for modeling character sequences and one for N-gram language statistics – followed by a CRF graphical model to combine their activations. This method achieved great success and set a new standard in the photo OCR field. However, despite these successes, the system in [17] does have some drawbacks. For instance, the use of two different CNNs incurs a relatively large memory and computation cost. Furthermore, the manually defined N-gram CNN model has a large number of output nodes (10k output units for N = 4), which increases the training complexity – requiring an incremental training procedure and heuristic gradient rescaling based on N-gram frequencies.

  • The new method proposed in this paper (a rough code sketch follows the quoted passage below)

    Inspired by [17], we continue to focus our efforts on the unconstrained scene text recognition task, and we develop a recursive recurrent neural network with attention modeling (R2AM) system that directly performs image-to-sequence (word string) learning, delivering improvements over their work. The three main contributions of the work presented in this paper are:
    (1) Recursive CNNs with weight-sharing, for more effective image feature extraction than a “vanilla” CNN under the same parametric capacity.
    (2) Recurrent neural networks (RNNs) atop of extracted image features from the aforementioned recursive CNNs, to perform implicit learning of a character-level language model. RNNs can automatically learn the sequential dynamics of characters that are naturally present in word strings from the training data, without the need of manually defining N-grams from a dictionary.
    (3) A sequential attention-based modeling mechanism that performs “soft” deterministic image feature selection as the character sequence is being read, and that can be trained end-to-end with standard backpropagation.
    We pursue extensive experimental validation on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR and Synth90k. We also provide a detailed ablation study by examining the effectiveness of each of the proposed components. Our proposed network architecture achieves new state-of-the-art results and significantly outperforms the previous best reported results for unconstrained text recognition [17]; i.e., we observe an absolute accuracy improvement of 9% on Street View Text and 8.2% on ICDAR 2013.
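Below is a minimal, illustrative PyTorch sketch of the two core ideas quoted above: a recursive convolutional block whose weights are shared across repeated applications, and a recurrent decoder with “soft” attention over the image feature map that emits one character per step. This is not the authors' implementation; all layer sizes, the number of recursive steps, the GRU-based decoder, and the 37-symbol vocabulary are assumptions made purely for demonstration.

```python
# Illustrative sketch only (not the R2AM authors' code). Layer sizes, the number of
# recursive steps, and the decoder details are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecursiveConvBlock(nn.Module):
    """Applies the same convolution several times (weight sharing across steps),
    so feature extraction gets deeper without adding parameters."""

    def __init__(self, channels: int, steps: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.steps = steps

    def forward(self, x):
        for _ in range(self.steps):   # tied weights: one conv reused at every step
            x = F.relu(self.conv(x))
        return x


class AttentionDecoder(nn.Module):
    """GRU decoder with soft attention over the 2-D feature map: at each time step
    it computes a weighted average of image features and emits one character."""

    def __init__(self, feat_dim: int, hidden: int, vocab: int):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim + vocab, hidden)
        self.attn = nn.Linear(hidden + feat_dim, 1)
        self.out = nn.Linear(hidden, vocab)
        self.vocab = vocab

    def forward(self, feats, max_len: int = 20):
        # feats: (B, C, H, W) -> (B, H*W, C), a sequence of spatial feature vectors
        b, c, h, w = feats.shape
        feats = feats.view(b, c, h * w).transpose(1, 2)
        state = feats.new_zeros(b, self.rnn.hidden_size)
        prev = feats.new_zeros(b, self.vocab)        # one-hot of previous character
        logits = []
        for _ in range(max_len):
            # attention weights over all spatial locations, conditioned on the RNN state
            scores = self.attn(torch.cat(
                [state.unsqueeze(1).expand(-1, h * w, -1), feats], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                # (B, H*W)
            context = (alpha.unsqueeze(-1) * feats).sum(1)   # weighted feature average
            state = self.rnn(torch.cat([context, prev], dim=-1), state)
            step_logits = self.out(state)
            logits.append(step_logits)
            prev = F.one_hot(step_logits.argmax(-1), self.vocab).float()
        return torch.stack(logits, dim=1)                    # (B, max_len, vocab)


if __name__ == "__main__":
    # Toy forward pass on a fake 32x100 grayscale word image.
    stem = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                         RecursiveConvBlock(64, steps=3), nn.MaxPool2d(2))
    decoder = AttentionDecoder(feat_dim=64, hidden=256, vocab=37)  # 26 letters + 10 digits + blank
    image = torch.randn(2, 1, 32, 100)
    char_logits = decoder(stem(image))
    print(char_logits.shape)  # torch.Size([2, 20, 37])
```

Because the same convolution is reused at every recursive step, depth grows without adding parameters, and because the decoder predicts one character at a time while attending to different parts of the word image, no fixed lexicon or explicit N-gram model is required.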

from awesome-ocr.

SJHBXShub commented on June 10, 2024

Hi, do you have the code of this paper? Thank you very much.

from awesome-ocr.
