
Comments (3)

wanghaisheng commented on June 10, 2024

Background

Today, OCR usually means recognizing the text in an image (the output is plain text only), but as early as 1967 researchers were already trying to convert images into structured language/markup, which can represent both the text and its layout information. The focus of this paper is OCR for mathematical expressions: how to handle layout such as superscripts, subscripts, special symbols, and nesting. The INFTY system can convert printed mathematical expressions into LaTeX or other formats.
From the progress of recent years in handwriting recognition, natural-scene text recognition, and image caption generation, these approaches are entirely data driven, with no heavy reliance on preprocessing or domain knowledge. Judging from language-modeling results, neural networks can generate syntactically correct markup, but it is not certain whether correct structured markup carrying presentational information can be learned from images.

Optical character recognition (OCR) is most commonly used to recognize natural language from an image; however, as early as the work of Anderson (1967), there has been research interest in converting images into structured language or markup that defines both the text itself and its presentational semantics. The primary focus of this work is OCR for mathematical expressions, and how to handle presentational aspects such as sub- and superscript notation, special symbols, and nested fractions (Belaid and Haton 1984; Chan and Yeung 2000). The most effective systems combine specialized character segmentation with grammars of the underlying mathematical layout language (Miller and Viola 1998). A prime example of this approach is the INFTY system, which is used to convert printed mathematical expressions to LaTeX and other markup formats (Suzuki et al. 2003).

Problems like OCR that require joint processing of image and text data have recently seen increased research interest due to the refinement of deep neural models in these two domains. For instance, advances have been made in handwriting recognition (Ciresan et al. 2010), OCR in natural scenes (Jaderberg et al. 2015; 2016; Wang et al. 2012), and image caption generation (Karpathy and Fei-Fei 2015; Vinyals et al. 2015b). At a high level, each of these systems learns an abstract encoded representation of the input image, which is then decoded to generate a textual output. In addition to performing quite well on standard tasks, these models are entirely data driven, which makes them adaptable to a wide range of datasets without requiring heavy preprocessing of inputs or domain-specific engineering.

The turn towards data-driven neural methods for image and text leads us to revisit the problem of generating structured markup. We consider whether a supervised model can learn to produce correct presentational markup from an image, without requiring a textual or visual grammar of the underlying markup language. While results from language modeling suggest that neural models can consistently generate syntactically correct markup (Karpathy, Johnson, and Li 2015; Vinyals et al. 2015a), it is unclear whether the full solution can be learned from markup-image pairs.

Our model, WYGIWYS ("what you get is what you see"), is a simple extension of the attention-based encoder-decoder model (Bahdanau, Cho, and Bengio 2014), which is now standard for machine translation. Similar to work in image captioning (Xu et al. 2015), the model incorporates a multi-layer convolutional network over the image with an attention-based recurrent neural network decoder. To adapt this model to the OCR problem and capture the document's temporal layout, we also incorporate a new source encoder layer in the form of a multi-row recurrent model applied before the application of attention. The use of attention additionally provides an alignment from the generated markup to the original source image (see Figure 1).

We also introduce two new datasets for the image-to-markup task. The preliminary experiments use a dataset of small synthetic geometric HTML examples rendered as web pages. For the main experiments, we introduce a new dataset, IM2LATEX-100K, that consists of a large collection of rendered real-world mathematical expressions collected from published articles. We will publicly release this dataset as part of this work. The same model architecture is trained to generate HTML and LaTeX markup with the goal of rendering to the exact source image. Experiments compare the output of the model with several research and commercial baselines, as well as ablations of the model. The full system for mathematical expression generation is able to match images to within 15% of image edit distance, and is identical on more than 75% of real-world test examples. Additionally, the use of a multi-row encoder leads to a significant increase in performance. All data, models, and evaluation scripts are publicly available at http://lstm.seas.harvard.edu/latex/
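The paper's headline metric is edit distance between the rendered output and the source image. As an illustration-only sketch (not the paper's evaluation script), the same idea on markup strings is ordinary Levenshtein distance; the function names below are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))          # distance of "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance of a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def relative_edit_distance(pred: str, gold: str) -> float:
    """Distance normalized by reference length, as in 'within 15% of edit distance'."""
    return edit_distance(pred, gold) / max(len(gold), 1)
```

The actual evaluation operates on rendered images rather than strings, but the normalization idea is the same: a score of 0.15 means the hypothesis differs from the reference by 15% of its length.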

from awesome-ocr.

wanghaisheng commented on June 10, 2024

Problem: Image-to-Markup Generation

from awesome-ocr.

wanghaisheng commented on June 10, 2024

3 Model

Our model WYGIWYS for this task combines several standard neural components from vision and natural language processing. It first extracts image features using a convolutional neural network (CNN) and arranges the features in a grid. Each row is then encoded using a recurrent neural network (RNN). These encoded features are then used by an RNN decoder with a visual attention mechanism. The decoder implements a conditional language model over the vocabulary, and the whole model is trained to maximize the likelihood of the observed markup. The full structure is illustrated in Figure 2. We describe the model in more detail below.
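The pipeline above (CNN feature grid → per-row encoding → attention-weighted context for the decoder) can be sketched in a toy form. Everything here is a stand-in, not the paper's trained components: random vectors replace CNN features, a cumulative row average replaces the row RNN, and a single dot-product attention step replaces the learned attention MLP:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Stand-in for the CNN output: an H x W grid of D-dimensional feature vectors.
H, W, D = 3, 4, 5
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)]
        for _ in range(H)]

# Row "encoder": the paper runs an RNN over each row; here each cell is
# replaced by the running average of its row prefix, so cells carry context.
encoded = []
for row in grid:
    acc = [0.0] * D
    for t, cell in enumerate(row, 1):
        acc = [a + c for a, c in zip(acc, cell)]
        encoded.append([a / t for a in acc])   # flatten to H*W annotations

# One decoder step: score every cell against the decoder state, normalize
# to an attention distribution, and form the context vector.
decoder_state = [random.gauss(0, 1) for _ in range(D)]
scores = [dot(decoder_state, cell) for cell in encoded]
alpha = softmax(scores)                        # weights over H*W cells, sum to 1
context = [sum(a * cell[d] for a, cell in zip(alpha, encoded))
           for d in range(D)]
```

In the real model the context vector is combined with the decoder RNN state to predict the next markup token, and `alpha` is what yields the markup-to-image alignment shown in Figure 1.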

from awesome-ocr.
