
Comments (3)

wanghaisheng commented on June 10, 2024

Background

Today, OCR usually means recognizing the text in an image (the output is plain text only), but as early as 1967 researchers were already trying to convert images into structured language/markup, which can represent both the text and its layout information. The focus of this paper is OCR for mathematical expressions: how to handle layout such as superscripts, subscripts, special symbols, and nesting. The INFTY system can convert printed mathematical expressions into LaTeX or other formats.
From the progress of recent years in handwriting recognition, natural-scene text recognition, and image caption generation, these approaches are entirely data driven, with no heavy reliance on preprocessing or domain knowledge. Judging from language-modeling results, neural networks can generate syntactically correct markup, but it is not certain whether correct structured markup carrying presentational information can be learned from images.

Optical character recognition (OCR) is most commonly used to recognize natural language from an image; however, as early as the work of Anderson (1967), there has been research interest in converting images into structured language or markup that defines both the text itself and its presentational semantics. The primary focus of this work is OCR for mathematical expressions, and how to handle presentational aspects such as sub- and superscript notation, special symbols, and nested fractions (Belaid and Haton 1984; Chan and Yeung 2000). The most effective systems combine specialized character segmentation with grammars of the underlying mathematical layout language (Miller and Viola 1998). A prime example of this approach is the INFTY system, which is used to convert printed mathematical expressions to LaTeX and other markup formats (Suzuki et al. 2003).

Problems like OCR that require joint processing of image and text data have recently seen increased research interest due to the refinement of deep neural models in these two domains. For instance, advances have been made in handwriting recognition (Ciresan et al. 2010), OCR in natural scenes (Jaderberg et al. 2015; 2016; Wang et al. 2012), and image caption generation (Karpathy and Fei-Fei 2015; Vinyals et al. 2015b). At a high level, each of these systems learns an abstract encoded representation of the input image, which is then decoded to generate a textual output. In addition to performing quite well on standard tasks, these models are entirely data driven, which makes them adaptable to a wide range of datasets without requiring heavy preprocessing of inputs or domain-specific engineering.

The turn towards data-driven neural methods for image and text leads us to revisit the problem of generating structured markup. We consider whether a supervised model can learn to produce correct presentational markup from an image, without requiring a textual or visual grammar of the underlying markup language. While results from language modeling suggest that neural models can consistently generate syntactically correct markup (Karpathy, Johnson, and Li 2015; Vinyals et al. 2015a), it is unclear whether the full solution can be learned from markup-image pairs.

Our model, WYGIWYS ("what you get is what you see"), is a simple extension of the attention-based encoder-decoder model (Bahdanau, Cho, and Bengio 2014), which is now standard for machine translation. Similar to work in image captioning (Xu et al. 2015), the model incorporates a multi-layer convolutional network over the image with an attention-based recurrent neural network decoder. To adapt this model to the OCR problem and capture the document's temporal layout, we also incorporate a new source encoder layer in the form of a multi-row recurrent model applied before the application of attention. The use of attention additionally provides an alignment from the generated markup to the original source image (see Figure 1).

We also introduce two new datasets for the image-to-markup task. The preliminary experiments use a dataset of small synthetic geometric HTML examples rendered as web pages. For the main experiments, we introduce a new dataset, IM2LATEX-100K, that consists of a large collection of rendered real-world mathematical expressions collected from published articles. We will publicly release this dataset as part of this work. The same model architecture is trained to generate HTML and LaTeX markup with the goal of rendering to the exact source image. Experiments compare the output of the model with several research and commercial baselines, as well as ablations of the model. The full system for mathematical expression generation is able to match images to within 15% of image edit distance, and is identical on more than 75% of real-world test examples. Additionally, the use of a multi-row encoder leads to a significant increase in performance. All data, models, and evaluation scripts are publicly available at http://lstm.seas.harvard.edu/latex/
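The paper's headline metric is edit distance between the rendered output and the source image. As an illustration-only sketch (not the paper's evaluation script), the same idea on markup strings is ordinary Levenshtein distance; the function names below are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))          # distance of "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance of a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def relative_edit_distance(pred: str, gold: str) -> float:
    """Distance normalized by reference length, as in 'within 15% of edit distance'."""
    return edit_distance(pred, gold) / max(len(gold), 1)
```

The actual evaluation operates on rendered images rather than strings, but the normalization idea is the same: a score of 0.15 means the hypothesis differs from the reference by 15% of its length.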

from awesome-ocr.

wanghaisheng commented on June 10, 2024

Problem: Image-to-Markup Generation

from awesome-ocr.

wanghaisheng commented on June 10, 2024

3 Model

Our model WYGIWYS for this task combines several standard neural components from vision and natural language processing. It first extracts image features using a convolutional neural network (CNN) and arranges the features in a grid. Each row is then encoded using a recurrent neural network (RNN). These encoded features are then used by an RNN decoder with a visual attention mechanism. The decoder implements a conditional language model over the vocabulary, and the whole model is trained to maximize the likelihood of the observed markup. The full structure is illustrated in Figure 2. We describe the model in more detail below.
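The pipeline above (CNN feature grid → per-row encoding → attention-weighted context for the decoder) can be sketched in a toy form. Everything here is a stand-in, not the paper's trained components: random vectors replace CNN features, a cumulative row average replaces the row RNN, and a single dot-product attention step replaces the learned attention MLP:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Stand-in for the CNN output: an H x W grid of D-dimensional feature vectors.
H, W, D = 3, 4, 5
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)]
        for _ in range(H)]

# Row "encoder": the paper runs an RNN over each row; here each cell is
# replaced by the running average of its row prefix, so cells carry context.
encoded = []
for row in grid:
    acc = [0.0] * D
    for t, cell in enumerate(row, 1):
        acc = [a + c for a, c in zip(acc, cell)]
        encoded.append([a / t for a in acc])   # flatten to H*W annotations

# One decoder step: score every cell against the decoder state, normalize
# to an attention distribution, and form the context vector.
decoder_state = [random.gauss(0, 1) for _ in range(D)]
scores = [dot(decoder_state, cell) for cell in encoded]
alpha = softmax(scores)                        # weights over H*W cells, sum to 1
context = [sum(a * cell[d] for a, cell in zip(alpha, encoded))
           for d in range(D)]
```

In the real model the context vector is combined with the decoder RNN state to predict the next markup token, and `alpha` is what yields the markup-to-image alignment shown in Figure 1.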

from awesome-ocr.
