
awesome-ocr's Introduction

**Update of the daily paper tracking tool**

awesome-ocr

A curated list of promising OCR resources

Daily OCR paper tracking

Website source code: you can modify it or add more keywords for tracking and sharing. Just edit the file https://github.com/wanghaisheng/ocr-arxiv-daily/blob/main/database/topic.yml and submit a request.

AI-Paper-Collector

Fully automated scripts for collecting AI-related papers. Supports fuzzy and exact search over paper titles.

https://github.com/wanghaisheng/ocr-paper-collector

Tracking tweets that mention OCR

https://github.com/wanghaisheng/ocr-tweets-monitoring

Libraries

There are two APIs, both of which accept images.
Baidu's own API: basically not worth using.
Lab-report recognition: it can only extract one of the three fields on a lab report.
The third-party APIs and Alibaba's own APIs focus on ID cards, bank cards, driver's licenses, passports, e-commerce review text, license plates, business cards, Tieba posts and text in video; most return the characters together with their coordinates, card-type documents can be returned as structured fields, and prices are around ¥0.01 per call.
Three more vendors offer résumé parsing, mostly returning structured fields and supporting both document and image input, priced at ¥0.1–0.3 per call.
Tencent currently hosts no third-party vendors; its own APIs cover license plates, business cards, ID cards, driver's licenses, bank cards, business licenses and general printed text, with prices up to about ¥0.2 per call.
Where does OcrKing come from?

OcrKing grew out of a project Aven built for his own use in data mining in early 2009. Driven by a passion for the technology, it has been refined and iterated for nearly seven years and has evolved into a cloud-based OCR system that combines multi-layer neural networks with deep learning. In early 2010 a web version was created to make it easier for more users. From the very beginning OcrKing has offered free recognition services and a developer API, and it will continue to provide free cloud OCR. OcrKing has never been promoted,

but it has quietly been there all along, because its author believes that anyone who needs it will find it. Please feel free to introduce OcrKing online recognition to friends with similar needs. We hope this tool is useful to you, and thank you for your support!

What can OcrKing do?

OcrKing is a free, fast and easy-to-use online cloud OCR platform. It recognizes the content of PDFs and images and produces an editable document. It supports many input and output formats, multiple languages (Simplified Chinese, Traditional Chinese, English, Japanese, Korean, German, French and more), several recognition modes, multiple operating systems, and several forms of API access.
Ultra-lightweight models, as small as 9.4 MB, covering 80+ languages, with a built-in training module, a semi-supervised annotation tool and layout analysis models; recognition of Chinese is a particular strength.
Inherits the strengths of PaddleOCR, adds an ONNX backend on top of it, and additionally supports DirectX acceleration, giving noticeably better compatibility and performance when deployed on Windows.
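
The two items above describe PaddleOCR and an ONNX-based derivative of it. As a rough illustration only (assuming the `paddleocr` Python package; the input file name is hypothetical), basic usage looks roughly like this:

```python
# Minimal sketch of running PaddleOCR on a single image (not from the original text).
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")   # loads detection + recognition models
result = ocr.ocr("menu_photo.jpg", cls=True)     # hypothetical input file
for line in result[0]:
    box, (text, confidence) = line               # box: 4 corner points of the text region
    print(text, confidence)
```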
Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu's Silicon Valley AI Lab.

Warp-CTC is a library that provides a fast, parallel implementation of CTC on both CPUs and GPUs. CTC (Connectionist Temporal Classification) is a loss function for supervised learning on sequence data that does not require an alignment between the input data and the labels. For example, CTC can be used to train end-to-end speech recognition systems, which is how it is used at Baidu's Silicon Valley AI Lab.
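
As a small illustration of what "no alignment" means in practice, here is a minimal sketch (my own, not taken from warp-ctc's documentation) using PyTorch's built-in CTC loss; all shapes and sizes below are made up for the example:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C = 50, 4, 37                                         # time steps, batch size, classes (incl. blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # per-timestep class log-probabilities
targets = torch.randint(1, C, (N, 12), dtype=torch.long)    # label sequences, no per-frame alignment given
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # CTC sums over all possible alignments internally, so none has to be provided
```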

Detects individual words rather than whole text lines.

Papers

Building on recent advances in image caption generation and optical character recognition (OCR), we present a general-purpose, deep learning-based system to decompile an image into presentational markup. While this task is a well-studied problem in OCR, our method takes an inherently different, data-driven approach. Our model does not require any knowledge of the underlying markup language, and is simply trained end-to-end on real-world example data. The model employs a convolutional network for text and layout recognition in tandem with an attention-based neural machine translation system. To train and evaluate the model, we introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup, as well as a synthetic dataset of web pages paired with HTML snippets. Experimental results show that the system is surprisingly effective at generating accurate markup for both datasets. While a standard domain-specific LaTeX OCR system achieves around 25% accuracy, our model reproduces the exact rendered image on 75% of examples. 

We present recursive recurrent neural networks with attention modeling (R2AM) for lexicon-free optical character recognition in natural scene images. The primary advantages of the proposed method are: (1) use of recursive convolutional neural networks (CNNs), which allow for parametrically efficient and effective image feature extraction; (2) an implicitly learned character-level language model, embodied in a recurrent neural network which avoids the need to use N-grams; and (3) the use of a soft-attention mechanism, allowing the model to selectively exploit image features in a coordinated way, and allowing for end-to-end training within a standard backpropagation framework. We validate our method with state-of-the-art performance on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR and Synth90k.

Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods

In recent years, recognition of text from natural scene image and video frame has received increased attention among researchers due to its various complexities and challenges. Because of low resolution, blurring effect, complex background, different fonts, color and variant alignment of text within images and video frames, etc., text recognition in such scenarios is difficult. Most of the current approaches usually apply a binarization algorithm to convert them into binary images and next OCR is applied to get the recognition result. In this paper, we present a novel approach based on color channel selection for text recognition from scene images and video frames. In the approach, at first, a color channel is automatically selected and then the selected color channel is considered for text recognition. Our text recognition framework is based on Hidden Markov Model (HMM) which uses Pyramidal Histogram of Oriented Gradient features extracted from the selected color channel. From each sliding window of a color channel our color-channel selection approach analyzes the image properties from the sliding window and then a multi-label Support Vector Machine (SVM) classifier is applied to select the color channel that will provide the best recognition results in the sliding window. This color channel selection for each sliding window has been found to be more fruitful than considering a single color channel for the whole word image. Five different features have been analyzed for multi-label SVM based color channel selection where the wavelet transform based feature outperforms others. Our framework has been tested on different publicly available scene/video text image datasets. For Devanagari script, we collected our own dataset. The performances obtained from experimental results are encouraging and show the advantage of the proposed method.

Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, vast majority of the existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially exclude the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.

Blogs

The complete feature-description pipeline: http://dataunion.org/wp-content/uploads/2015/05/640.webp_2.jpg

Presentations

Projects

Commercial products

Author: chenqin
Link: https://www.zhihu.com/question/19593313/answer/18795396
Source: Zhihu
Copyright belongs to the author. For commercial reuse please contact the author for permission; for non-commercial reuse please credit the source.

1. Extremely high recognition accuracy. I have tried every piece of software mentioned in the other answers, but on tables like the one below only ABBYY keeps accuracy above 95% (including the three characters for "Qinhuangdao"); every other tool fails completely. Misreading digits would be forgivable, but they cannot even recognize the Chinese. A lesson learned the hard way.
![](https://pic3.zhimg.com/a1b8009516c105556d2a2df319c72d72_b.jpg)
2. High flexibility. You can manually divide a single page into different blocks, and each block can be set separately as table or text, simplified or traditional Chinese, English or digits. Most other software can still only apply one recognition scheme to a whole page: either table or text.
3. Convenient batch processing. For yearbooks with identical layouts, design the layout for one page and apply it to all the others, saving a huge amount of repetitive work.
4. Preserves the original table formatting, saving a second round of editing. When recognizing tables that span pages, choosing "recognize as Excel" lets ABBYY join the tables together and output a single Excel file, which is much easier to analyze.
5. Offers many image corrections such as keystone and skew correction: even if a scan is crooked, or the text near the spine is distorted because the book is too thick, it can be straightened out.
Convert scanned images of documents into rich text with advanced Deep Learning OCR APIs. Free forever plans available.
  • IRIS
Very few companies do Chinese OCR at a truly professional level: two in China and two abroad. In China they are Wintone (文通) and Hanwang (汉王); abroad they are ABBYY and IRIS. (There used to be two more, Danqing 丹青 and Penpower 蒙恬, but they have been quiet in recent years.) The products people mention, such as Unis OCR (紫光OCR), CAJViewer, MS Office, Tsinghua OCR and even Huishi Xiaolingshu (慧视小灵鼠), are all Wintone products or use Wintone's recognition engine; Shangshu (尚书) is a Hanwang product, bundled with Microtek scanners. Both companies have very good Chinese recognition rates. The two foreign vendors are strong mainly in Western languages, support many Western European languages and are highly productized, but in Chinese their speed and accuracy still lag behind, even though they have been improving in recent years. At least for Chinese, Google's open-source project is still far behind all of these on every metric.

Author: 张岩 (Zhang Yan)
Link: https://www.zhihu.com/question/19593313/answer/14199596
Source: Zhihu
Copyright belongs to the author. For commercial reuse please contact the author for permission; for non-commercial reuse please credit the source.
https://github.com/cisocrgroup
The best free API I have seen so far; a commercial edition is also offered.

OCR Databases

OTHERS

Discussion and Feedback

Scan the QR code to join the discussion and share. If the code has expired, add the personal WeChat account edwin_whs.

Stargazers over time



awesome-ocr's Issues

Call for collaborators

Anyone interested in medical image data processing, with a trick or two up their sleeve, is welcome to join us.
There are two main directions:
1. OCR of historical documents
2. Understanding images captured by medical devices
Both internships and full-time positions are possible.
[email protected]

ocropus-linegen: feeding the generated training data to ocropus-rtrain raises an ASCII encoding error again

root@4b2648975d03:/ocropy# ocropus-rtrain -o report trainning_data/pic_simsun/*.bin.png
# inputs 7572
# tests None
# CenterNormalizer
# using default codec
# charset size 157
Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 143, in <module>
    print "["+"".join(charset)+"]"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 96-155: ordinal not in range(128)
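
One possible workaround (an assumption on my part, not a fix from the ocropy maintainers): ocropy runs on Python 2, where printing a Unicode charset to a non-UTF-8 stdout raises exactly this error, so forcing UTF-8 output avoids it:

```python
# Sketch of a workaround: wrap stdout in a UTF-8 writer before the offending print,
# e.g. near the top of the ocropus-rtrain script (Python 2 environment assumed).
import sys
import codecs

if sys.version_info[0] == 2:
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

# Alternatively, run the command with the environment variable PYTHONIOENCODING=utf-8.
```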

libpillowfight test

import pillowfight

import PIL.Image

import pyocr
import pyocr.builders

img_in = PIL.Image.open("02.old.bmp")

# step 01: Automatic Color Equalization (ACE)
ace_out_img = pillowfight.ace(img_in,
    slope=10,
    limit=1000,
    samples=100,
    seed=None)


# step 02: Sobel edge detection
sobel_out_img = pillowfight.sobel(img_in)
# step 03: unpaper black filter
blackfilter_out = pillowfight.unpaper_blackfilter(sobel_out_img)
# step 04: unpaper noise filter, applied to the black-filter output
out_img = pillowfight.unpaper_noisefilter(blackfilter_out)
# step 05: unpaper blur filter
out_img = pillowfight.unpaper_blurfilter(out_img)
# step 06: upscale 4x before OCR (bigger is better)
out_img = out_img.resize((out_img.size[0] * 4, out_img.size[1] * 4), PIL.Image.ANTIALIAS)

ocr = pyocr.get_available_tools()[0]
txt = ocr.image_to_string(out_img, lang='eng')
print('result: ' + txt)

ocropus-linegen: the generated training images do not match the contents of gt.txt; the images contain garbled characters

root@4b2648975d03:/ocropy# python ocropus-linegen  -t pic/hanyu_yiji_pinyin_duizhaobiao.txt  -f pic/simfang.ttf



Why does the number of text lines in the txt file differ from the number of generated training files?
root@4b2648975d03:/ocropy# python ocropus-linegen  -t pic/hanyu_yiji_pinyin_duizhaobiao.txt  -f pic/simfang.ttf -o trainning_data
fonts ['pic/simfang.ttf']
pic/hanyu_yiji_pinyin_duizhaobiao.txt
# reading pic/hanyu_yiji_pinyin_duizhaobiao.txt
got 7979 lines
got 7572 unique lines
base trainning_data
=== trainning_data/pic_simfang pic/simfang.ttf
 0.50  0.50  40 屺6508
 0.50  0.50  66 孱6978
 0.50  0.50  59 酝5245
 0.50  0.50  66 饲4339
 .....................
 0.50  0.50  46 CENG
 0.50  0.50  44 咖3107
root@4b2648975d03:/ocropy# wc -l pic/hanyu_yiji_pinyin_duizhaobiao.txt
8097 pic/hanyu_yiji_pinyin_duizhaobiao.txt
root@4b2648975d03:/ocropy#  ls -l trainning_data/pic_simfang  |grep "^-"|wc -l
400


root@4b2648975d03:/ocropy# python ocropus-linegen  -t pic/hanyu_yiji_pinyin_duizhaobiao.txt  -f pic/simsun.ttc  -o trainning_data -m 8000
root@4b2648975d03:/ocropy#  ls -l trainning_data/pic_simsun  |grep "^-"|wc -l
8894

The generated images do not match the contents of gt.txt; the images contain garbled characters.

Error when generating a Chinese training set

Following the wiki page "Building a Chinese training set", using the font "Source Han Sans: Simplified Chinese TTF version".

ocropus-linegen -t tests/test-chinese -f KaiGenGothicSC-Regular.ttf

The error:

Traceback (most recent call last):
  File "/home/liang/anaconda2/bin/ocropus-linegen", line 213, in <module>
    size=size,sigma=sigma,threshold=threshold)
  File "/home/liang/anaconda2/bin/ocropus-linegen", line 169, in genline
    last_font = ImageFont.truetype(fontfile,size)
  File "/home/liang/anaconda2/lib/python2.7/site-packages/PIL/ImageFont.py", line 238, in truetype
    return FreeTypeFont(font, size, index, encoding)
  File "/home/liang/anaconda2/lib/python2.7/site-packages/PIL/ImageFont.py", line 127, in __init__
    self.font = core.getfont(font, size, index, encoding)
IOError: broken table
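
The "IOError: broken table" comes from FreeType via Pillow while loading the font. A quick sanity check (my own suggestion, not from the issue thread) is to try loading the TTF with Pillow directly, outside of ocropus-linegen:

```python
# If this also fails, the font file itself (or this Pillow/FreeType build) is the problem,
# not ocropus-linegen.
from PIL import ImageFont

font = ImageFont.truetype("KaiGenGothicSC-Regular.ttf", 32)
print(font.getname())
```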

Deep-learning-based OCR, from the Meituan engineering team

http://tech.meituan.com/deeplearning_application.html
To improve the user experience, O2O products need OCR throughout onboarding, payment, delivery and user reviews. OCR plays two roles in Meituan-Dianping's business. One is assisted input, for example photographing a bank card to recognize the card number and bind the card automatically, or helping BD staff enter the dishes on a menu. The other is review and verification, for example extracting and checking the information on ID cards, business licenses and catering permits uploaded by merchants to confirm that a merchant is legitimate, or automatically filtering images containing banned words during merchant onboarding and user reviews. Compared with traditional OCR scenarios (printed text, scanned documents), Meituan's OCR mainly extracts and recognizes text from photos taken with mobile phones; given the diversity of offline users, the main challenges are:

 Complex imaging: noise, blur, lighting changes, deformation;
 Complex text: varying fonts, sizes, colors, wear, inconsistent stroke widths, arbitrary orientation;
 Complex backgrounds: missing layout structure, background clutter.

Faced with these challenges, traditional OCR solutions fall short in the following ways:

Text lines are generated by layout analysis (binarization, connected-component analysis), which requires a fairly regular layout with a clearly separable foreground and background (e.g. document images, license plates); it cannot handle arbitrary text on complex backgrounds (e.g. scene text, menus, advertisements).
Character recognition models are trained on hand-designed edge-orientation features (e.g. HOG); such single features generalize poorly as soon as fonts change or blur and background clutter appear.
The pipeline depends heavily on character segmentation, and when characters are distorted, touching or noisy, segmentation errors propagate especially badly.

To address these shortcomings of traditional OCR, we experimented with deep-learning-based OCR.

  1. Text localization based on Faster R-CNN and FCN

First, based on whether prior information is available, we divide layouts into controlled scenes (e.g. ID cards, business licenses, bank cards) and uncontrolled scenes (e.g. menus, storefront photos).

For controlled scenes, we turn text localization into detecting specific key-field targets, mainly using Faster R-CNN, as shown below. To keep the regression boxes accurate while speeding up computation, we made two adjustments to the original framework and training procedure:

    Since the intra-class variation of the key-field targets is limited, we trimmed the ZF model's network structure, reducing five convolutional layers to three.
    During training we raised the overlap threshold for positive samples and adapted the aspect ratios of the RPN anchors to the business requirements.

Figure 4: Text localization in controlled scenes based on Faster R-CNN

For uncontrolled scenes, where text orientation and stroke width vary freely and bounding-box regression is too coarse, we use a Fully Convolutional Network (FCN), as commonly used for semantic segmentation, to label text versus background at the pixel level, as shown below. To keep localization precise while preserving clear semantics, we not only deconvolve the last layer but also fuse the deconvolution results of deep and shallow layers.
Figure 5: Text localization in uncontrolled scenes based on FCN

  2. Text recognition based on a sequence-learning framework

To keep the errors from character segmentation and recognition post-processing from propagating, and to make end-to-end text recognition trainable, we adopt the sequence-learning framework shown below. It has three layers: a convolutional layer, a recurrent layer and a transcription (translation) layer. The convolutional layer extracts features; the recurrent layer learns both the ordering of character features within the feature sequence and the ordering of the characters themselves; and the transcription layer decodes the classification results over the time sequence.

Figure 6: End-to-end recognition framework based on sequence learning
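
As a rough sketch only (my own reading of the three-layer description above, not Meituan's released code; all layer sizes are made up), such a framework can be written in PyTorch roughly as follows, with the output fed to a CTC loss:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_h=32):
        super().__init__()
        # convolutional layer: extracts a feature map, later collapsed to a 1-D sequence
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # recurrent layer: models left-to-right dependencies along the feature sequence
        self.rnn = nn.LSTM(128 * (img_h // 4), 256, bidirectional=True, batch_first=True)
        # transcription layer: per-timestep class scores, decoded with CTC
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                                 # x: (N, 1, H, W)
        f = self.conv(x)                                  # (N, 128, H/4, W/4)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one time step per image column
        out, _ = self.rnn(seq)
        return self.fc(out)                               # (N, W/4, num_classes), for nn.CTCLoss

x = torch.randn(2, 1, 32, 256)        # two grayscale line images, 32 px high
logits = CRNN(num_classes=5000)(x)    # (2, 64, 5000)
```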

Because a sequence-learning framework is demanding about the amount and distribution of training samples, we use a mix of real and synthetic samples. The real samples come mainly from Meituan-Dianping's own business (e.g. menus, ID cards, business licenses), while the synthetic samples account for fonts, deformation, blur, noise, backgrounds and so on. With this framework and training data, recognition performance improves substantially across a range of scenarios, as shown below.

Figure 7: Performance comparison between deep-learning OCR and traditional OCR

Adnan Ul-Hasan's PhD thesis, Chapter 4: Training Data

Benchmark Datasets for OCR
Numerous character recognition algorithms require sizable ground-truthed real-world data for training and benchmarking. The quantity and quality of training data directly affects the generalization accuracy of a trainable OCR model. However, developing GT data manually is overwhelmingly laborious, as it involves a lot of effort to produce a reasonable database that covers all possible words of a language. Transcribing historical documents is even more gruelling as it requires language expertise in addition to manual labelling efforts. The increased human efforts give rise to financial aspects of developing such datasets and could restrict the development of large-scale annotated databases for the purpose of OCR. It has been pointed out in the previous chapter that scarcity of training data is one of the limiting factors in developing reliable OCR systems for many historical as well as for some modern scripts.
The challenge of limited training data has been overcome by the following contributions of this thesis:
• A semi-automated methodology to generate the GT database for cursive scripts at ligature level has been proposed. This methodology can equally be applied to produce character-level GT data. Section 4.2 reports the specifics of this method for cursive Nabataean scripts.
• Synthetically generated text-line databases have been developed to enhance the OCR research. These datasets include a database for Devanagari script (Deva-DB), a subset of printed Polytonic Greek script (Polytonic-DB), and three datasets for Multilingual OCR (MOCR) tasks. Section 4.3 details this process and describes the fine points about these datasets.
4.1 Related Work
There are basically two types of methodologies that have been proposed in the literature. The first is to extract identifiable symbols from the document image and apply some clustering methods to create representative prototypes. These prototypes are then assigned text labels. The second approach is to synthesize the document images from the textual data. These images are degraded using various image defect models to reflect the scanning artifacts. These degradation models [Bai92] include resolution, blur, threshold, sensitivity, jitter, skew, size, baseline, and kerning. Some of these artifacts are discussed in Section 4.3 where they are used to generate text-line images from the text.
The use of synthesized training data is increasing and there are many datasets reported in the literature using this methodology. One dataset that is prominent among these types is the Arabic Printed Text Images (APTI) database, which is proposed by Slimane et al. [SIK+09]. This database is synthetically generated covering ten different Arabic fonts and as many font-sizes (ranging from 6 to 24). It is generated from various Arabic sources and contains over 1 million words. The number increases to over 45 million words when rendered using ten fonts, four styles and ten font-sizes.
Another example of a synthetic text-line image database is the Urdu Printed Text Images (UPTI) database, published by Sabbour and Shafait [SS13]. This dataset consists of over 10 thousand unique text-lines selected from various sources. Each text-line is rendered synthetically with various degradation parameters. Thus the actual size of the database is quite large. The database contains GT information at both text-line and ligature levels.
The second approach in automating the process of generating an OCR database from scanned document images is to find the alignment of the transcription of the text lines with the document image. Kanungo et al. [KH99] presented a method for generating character GT automatically for scanned documents. A document is first created electronically using any typesetting system. It is then printed out and scanned. Next, the corresponding feature points from both versions of the same document are found and the parameters of the transformation are estimated. The ideal GT information is transformed accordingly using these estimates. An improvement in this method is proposed by Kim and Kanungo [KK02] by using an attributed branch-and-bound algorithm.
Von Beusekom et al. [vBSB08] proposed a robust and pixel-accurate alignment method. In the first step, the global transformation parameters are estimated in a similar manner as in [KK02]. In the second step, the adaptation of the smaller region is carried out.
Pechwitz et al. [PMM+02] presented the IfN/ENIT database of handwritten Arabic names of cities along with their postal codes. A projection profile method is used to extract words and the postal codes automatically. Mozaffari et al. [MAM+08] developed a similar database (IfN/Farsi-database) for handwritten Farsi (Persian) names of cities. Sagheer et al. [SHNS09] also proposed a similar methodology for generating an Urdu database for handwriting recognition.
Vamvakas et al. [VGSP08] proposed that a character database for historical documents may be constructed by choosing a small subset of images and then using character segmentation and clustering techniques. This work is similar to our approach; however, the main difference is the use of a different segmentation technique for Urdu ligatures and the utilization of a dissimilar clustering algorithm.

Preprocessing Chinese text lines

Suppose the character list we found is https://github.com/howiehu/commonly-used-chinese-characters; opening it shows the whole text on a single line with no separators, so we force a space in after every 20 characters:

➜  ocr_text sed 's/.\{20\}/& /g'   chinese_5039.txt  > new1.txt
Then replace the spaces with newlines:
➜  ocr_text tr  ' ' '\n' <new1.txt  > result.txt

Finally, save the text in UTF-8.

If another word list we find has a number at the start of every line, we need to strip the numbers and spaces:

➜ ocr_text cat 4-Corner_Map.txt | sed 's/:space://' | sed 's/^[0-9]*//' | sed 's/\n//' > result.txt
➜ ocr_text sed 's/.\{20\}/& /g' result.txt > new2.txt
➜ ocr_text tr '\n' ' ' <qudiao.txt > qudiao1.txt
➜ ocr_text sed -i 's/ //g' qudiao1.txt


If we want 10 characters per line, insert a newline after every 10 characters. Any of the following work:

sed -i 's/.\{10\}/&\n/g' ss.txt
sed -r "s/([^,]*,){10}/&\n/g" 1.txt
awk -F, '{for(i=1;i<=NF;i++){printf (i%10==0)?$i"\n":$i","}}' file
sed -r "s/([!-~]){10}/&\n/g" 1.txt

➜ ocr_text echo "啊啊啊啊啊啊啊啊" | fold -w4 | paste -sd' ' -
啊啊 啊啊 啊啊 啊啊
➜ ocr_text echo "啊啊啊啊啊啊啊啊" | sed 's/.\{4\}/& /g'
啊啊啊啊 啊啊啊啊
➜ ocr_text sed 's/.\{20\}/& /g' chinese_5039.txt > new1.txt
➜ ocr_text tr ' ' '\n' <new1.txt > result.txt

➜ ocr_text sed 's/.\{1\}/& /g' chinese_5039.txt > new2.txt
➜ ocr_text tr ' ' '\n' <new2.txt > result2.txt
➜ ocr_text sort result2.txt | uniq -u | wc -l
5039
cat $File | sed 's/^:space://' | sed 's/^[0-9]//' > result.text


The following command, with everything combined like this, fails to remove the spaces:

cat 4-Corner_Map-no-space.txt | sed 's/:space://' | sed 's/^[0-9]//' > result.txt

➜ ocr_text wc -l result.txt
6355 result.txt
➜ ocr_text wc -l result2.txt
5041 result2.txt
➜ ocr_text cat result.txt result2.txt > all.txt
➜ ocr_text wc -l all.txt
11396 all.txt
➜ ocr_text sort all.txt | uniq -u > uniq.txt
➜ ocr_text wc -l uniq.txt
4422 uniq.txt
➜ ocr_text sort result.txt | uniq -u > uniq-result.txt
➜ ocr_text wc -l uniq-result.txt
6355 uniq-result.txt
➜ ocr_text sort result2.txt | uniq -u > uniq-result2.txt
➜ ocr_text wc -l uniq-result2.txt
5039 uniq-result2.txt
➜ ocr_text cat result2.txt >> result.txt
➜ ocr_text wc -l result.txt
11396 result.txt
➜ ocr_text sort result.txt | uniq -u > uniq-result.txt
➜ ocr_text wc -l uniq-result.txt
4422 uniq-result.txt
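
The shell commands above can also be collapsed into one small Python script; this is only a sketch of the same preprocessing (the filenames and the 10-characters-per-line choice follow the examples above, everything else is mine):

```python
# Strip leading digits, drop whitespace, wrap the character list to 10 characters per line,
# and count the unique characters. Input/output filenames follow the examples above.
import io
import re

chars = []
for name in ["chinese_5039.txt", "4-Corner_Map.txt"]:
    with io.open(name, encoding="utf-8") as f:
        for line in f:
            line = re.sub(r"^[0-9]+", "", line)      # drop leading index numbers
            chars.extend(c for c in line if not c.isspace())

unique = sorted(set(chars))
print("unique characters:", len(unique))

with io.open("result.txt", "w", encoding="utf-8") as out:
    for i in range(0, len(unique), 10):              # 10 characters per line
        out.write(u"".join(unique[i:i + 10]) + u"\n")
```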

Adnan Ul-Hasan's PhD thesis: Bibliography

[ABB13] ABBYY, January 2013. Available at http://www.abbyy.com/recognition_server.
[AHN+14] Q. U. A. Akram, S. Hussain, A. Niazi, U. Anjum, and F. Irfan. Adpating Tesseract for Complex Scripts: An Example of Urdu Nastalique. In DAS, pages 191–195, 2014.
[Bai92] H. S. Baird. Document Image Defect Models. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis. Springer Verlag, 1992.
[BC09] U. Bhattacharya and B.B. Chaudhuri. Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(3):444– 457, 2009.
[BMP02] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object RecongitionusingShapeContexts. IEEETransactiononPatternAnalysisand Machine Intelligence,24(4):509–522, April 2002.
[BPSB06] U. Bhattacharya, S.K. Parui, B. Shaw, and K. Bhattacharya. Neural combination of ann and hmm for handwritten devanagari numeral recognition. In ICFHR,2006.
[BR14] A. Belaid and M. I. Razzak. Middel Eastern Character Recognition. In D. Doermann and K. Tombre, editors, Handbook of Document Image Processing and Recognition,pages 427–457. Springer,2014.
[BRB+09] F.Boschetti,M.Romanello,A.Babeu,D.Bamman,andG.Crane. ImprovingOCRAccuracyforClassicalCriticalEditions. InECDL,pages156–167, 2009.
[Bre01] T. M. Breuel. Segmentation of Hand-printed Letter Strings using a Dynamic Programming Algorithm. In ICDAR, pages 821–826, sep 2001.

[Bre08] T.M.Breuel. TheOCRopusOpenSourceOCRSystem. InB.A.Yanikoglu andK.Berkner,editors,DRR-XV,page68150F,SanJose,USA,Jan2008.
[BS08] T.M.BreuelandF.Shafait. AutoMLP:Simple,Effective,FullyAutomated Learning Rate and Size Adjustment. In The Learning Workshop, January 2008.
[BSB11] S.S.Bukhari,F.Shafait,andT.M.Breuel. HighPerformanceLayoutAnalysis of Arabic and Urdu Document Images. In ICDAR, page 1275–1279, Bejing, China, sep 2011.
[BSF94] Y.Bengio,P.Simard,andP.Frasconi. LearningLong-TermDependencies withGradientDescentisDifficult. IEEETransactionsonNeuralNetworks, 5(2):157–166, 1994.
[BT14] H. S. Baird and K. Tombre. The Evolution of Document Image Analysis. In D. Doermann and K. Tombre, editors, Handbook of Document Image Processing and Recognition, pages 63–71. Springer,2014.
[BUHAAS13] T.M.Breuel,A.Ul-Hasan,M.AlAzawi,andF.Shafait. HighPerformance OCR for Printed English and Fraktur using LSTM Networks. In ICDAR, WashingtonD.C. USA, aug 2013.
[CL96] R.G.CaseyandE.Lecolinet. ASurveyofMethodsandStrategiesinCharacterSegmentation. IEEETrans.PatternAnalysisandMachineIntelligence, 18(7):690–706, 1996.
[CP95] B.B. Chadhuri and S. Palit. A feature-based scheme for the machine recognitionof printedDevanagari Script. 1995.
[CP97] B.B. Chaudhuri and U. Pal. An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi). In Document Analysis and Recognition, 1997., Proceedings of the Fourth International Conference on, volume 2, pages 1011–1015. IEEE, 1997.
[Eik93] L. Eikvil. OCR – Optical Characer Recognition. Technicalreport, 1993.
[EM08] M. S. M. El-Mahallway. A Large Scale HMM-Based Omni Font-Written OCR System For Cursive Scripts. PhD thesis, Cairo, Egypt, 2008.
[ERA57] ERA. An Electronic Reading Automation. Electronics Engineering, pages 189–190, 1957.
[ESB14] A.F.Echi,A.Saidani, and A.Belaid. HowtoseparatebetweenMachinePrinted/HandwrittenandArabic/LatinWords? ElectronicLettersonComputer Vision and Image Analysis,13(1):1–16, 2014.
[Fuc] M. Fuchs. The Use of Gothic OCR in processing Historical Documents. Technicalreport.
[Fuk80] K. Fukushima. Ncocognitron: A Self-Organizing Neural Network Model foraMechanismofPatternRecognitionUnaffectedbyShiftinPosition. Biological Cybernetics, 36:193–202, 1980.
[FV11] L.FurrerandM.Volk. ReducingOCRerrorsinGothicscriptdocuments. In Workshop on Language Technologies for Digital Humanities and Cultural Heritage,page 97–103, Hissar,Bulgaria, September2011.
[GAA+99] N. Gorski, V. Anisimov, E. Augustin, O. Baret, D. Price, and J.-C. Simon. A2iA Check Reader: A Family of Bank Check Recognition Systems. In ICDAR,pages 523–526, Sep 1999.
[Gat14] B. G. Gatos. Image Techniques in Document Analysis Process. In D. DoermannandK.Tombre,editors,HandbookofDocumentImageProcessing and Recognition,pages 73–131. Springer,2014.
[GDS10] D.Ghosh,T.Dube,andA.P.Shivaprasad. ScriptRecognition-AReview. IEEE Trans. Pattern Analysis and Machine Intelligence, 32(12):2142–2161, 2010.
[GEBS04] A. Graves, D. Eck, N. Beringer, and J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In Auke Jan Ijspeert, Masayuki Murata, and Naoki Wakamiya, editors, BioADIT, volume 3141 of Lecture Notes in Computer Science,page 127–136. Springer,2004.
[GFGS06] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pages 369–376, 2006.
[GLF+08] A.Graves,M.Liwicki,S.Fernandez,H.BunkeBertolami,andJ.Schmidhuber. Anovelconnectionistsystemforunconstrainedhandwritingrecognition. IEEETrans.onPatternAnalysisandMachineIntelligence,31(5):855– 868, May2008.
[GLS11] B. Gatos, G. Louloudis, and N. Stamatopoulos. Greek Polytonic OCR basedonEfficientCharacterClassNumbverReduction. InICDAR,pages 1155–1159, Bejing, China, aug 2011.
[GPTF13] D.Genzel,A.C.Popat,R.Teunen,andY.Fujii. HMM-basedScriptIdentification for OCR. In International Workshop on Multilingual OCR, page 2, WashingtonD.C., USA., August2013.
[Gra] A. Graves. RNNLIB: A recurrent neural network library for sequence learning problems. http://sourceforge.net/projects/rnnl/.
[Gra12] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume385 of Studies in Computational Intelligence. Springer,2012.
[GS05] A. Graves and J. Schmidhuber. Framewise Phoneme Classification with Bidirectional LSTM Networks. In IJCNN, pages 2047–2052, Montreal, Canada, 2005.
[GS08] A. Graves and J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In Daphne Koller, Dale Schuurmans,YoshuaBengio,andLéonBottou,editors,NIPS,page545– 552. CurranAssociates,Inc., 2008.
[GSL+] B. Gatos, N. Stamtopoulos, G. Louloudis, G. Sfikas, G. Retsinas, V. Papavassiliou, F. Simistira, and V. Katsouros. GROPLY-DB: An Old Greek PolytonicDocumentsImageDatabase. InICDAR,pages646–650,Nancy, France,August.
[Han62] W. J. Hannan. R. C. A. Multifont Reading Machine. Optical Character Recognition, pages 3–14, 1962.
[HBFS01] S.Hochreiter,Y.Bengio, P.Frasconi,andJ.Schmidhuber. GradientFlow inRecurrentNets: TheDifficultyofLearningLong-TermDependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynammical Recurrent Neural Networks.IEEE Press, 2001.
[HHK08] Md. A. Hasnat, S. M. M. Habib, and M. Khan. A High Performance Domain Specific OCR for Bangla Script. In T. Sobh, editor, Novel Algorithms and Techniques in Telecommunication, Automation and Industrial Electronics, page 174–178. Springer,2008.
[HS97] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[HW65] D. H. Hubel and T. N. Wiesel. Receptive Fields and Functional ArchitectureinTwoNonstriateVisualAra(18and19)oftheCat. JournalofNeurophysiology,28:229–289, 1965.
[IH07] M.IjazandS.Hussain. CorpusBasedUrduLexiconDevelopment. InCLT, Peshawar,Pakistan,2007.
[III14] IIIT, June 2014. http://ltrc.iiit.ac.in/corpus/corpus.html.

[IOK63] T.Iijima,Y.Okumura,andK.Kuwabara.NewProcessofCharacterRecognition using Sieving Method. Information and Control Research, 1(1):30– 35, 1963.
[JBB+95] L. Jackel, M. Battista, H. Baird, J. Ben, J. Bromley, C. Burges, E. Cosatto, J. Denker, H. Graf, H. Katseff, Y. Le-Cun, C. Nohl, E. Sackinger, J. Shamilian, T. Shoemaker, C. Stenard, I. Strom, R. Ting, T. Wood, and C. Zuraw. Neural-Net Applications in Character Recognition and Document Analysis. Technicalreport, 1995.
[JDM00] A.K. Jain, R. P W Duin, and Jianchang Mao. Statistical pattern recognition: areview. PatternAnalysisandMachineIntelligence,IEEETransactions on,22(1):4–37, Jan 2000.
[JKK03] C.V. Jawahar, M.N.S.S.K. Kumar, and S.S.R. Kiran. A Bilingual OCR for Hindi-TeluguDocumentsanditsApplications. InICDAR,volume3,pages 408–412, 2003.
[KC94] G. E. Kopec and P. A. Chou. Document Image Decoding using Markov Source Models. IEEE Transaction on Pattern Analysis and Machine Intelligence,16(6):602 – 617, 1994.
[KH99] T. Kanungo and R. M. Haralick. An Automatic Closed-Loop Methodology for Generating Character Groundtruth for Scanned Documents. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(2):179–183, 1999.
[KK02] D.-W. Kim and T. Kanungo. Attributed Point Matching for Automatic Groundtruth Generation. Int. Journal on Document Analysis and Recognition, 5(1):47–66, 2002.
[KNSG05] S. Kompalli, Sankalp Nayak, S. Setlur, and V. Govindaraju. Challenges in OCR of Devanagari Documents. In ICDAR, pages 327–331 Vol. 1, Seol, Korea,Aug2005.
[KRM+05] T. Kanungo, P. Resnik, S. Mao, D. W. Kim, and Q. Zheng. The Bible and Multilingual Optical Character Recognition. Communication of the ACM, 48(6):124–130, Jun 2005.
[KSKD15] I. U. Khattak, I. Siddiqui, S. Khalid, and C. Djeddi. Recognition of Urdu Ligatures - A Holistic Approach . In ICADR,Nancy, France,aug 2015.
[KUHB15] T.Karayil,A.Ul-Hasan,andT.M.Breuel. ASegmentation-FreeApproach for Printed Devanagari Script Recognition. In ICDAR, pages 946–950, Nancy, France,aug 2015.
[LBK+98] Z.A.Lu,I.Bazzi,A.Kornai,J.Makhoul,P.S.Natarajan,andR.Schawartz. A Robust, Language-Independent OCR System. In AIPR Workshop: Advancement in Computer-Assisted Recognition,pages 96–104, 1998.
[LC89] Y. Le-Cun. Generalization and Network Desgin Strategies. Connectionisms in Perspective,Jun 1989.
[LRHP97] J. Liang, R. Rogers, R. M. Haralick, and I.T. Philips. UW-ISL Document Image Analysis Toolbox: An Experimental Environment. In ICDAR, page 984–988, Ulm, Germany,aug 1997.
[LSN+99] Z.Lu,R.M.Schwartz,P.Natarajan,I.Bazzi,andJ.Makhoul. Advancesin BBN BYBLOSOCR System. In ICDAR, pages 337–340, 1999.
[Lu95] Y.Lu. MachinePrintedCharacterSegmentation—AnOverview. Pattern Recognition, 28(1):67 – 80, 1995.
[MAM+08] S. Mozaffari, H. Abed, V. Märgner, K. Faez, and A. Amirshahi. IfN/Farsi-Database: A Database of Farsi Handwritten City Names. In ICFHR, pages 397–402, Montreal, Canada, Aug 2008.
[Mat15] J. Matas. Efficient character skew rectification in scene text images. In ACCV,volume9009, page 134. Springer,2015.
[MGS05] S. Marinai, M. Gori, and G. Soda. Artificial Neural Networks for Document Analysis and Recognition. IEEE Trans. on Pattern Analysis and Machine Intellignece,27(1):23–35, January 2005.
[MSLB98] J. Makhoul, R. Schwartz, C. Lapre, and I. Bazzi. A Script-Independent Methodology for Optical Character Recognition. Pattern Recognition, 31(9):1285 – 1294, 1998.
[MSY92] S. Mori, C. Suen, and K. Yamamoto. Historical Review of OCR Research and Development. IEEE, 80(7):1029–1058, 1992.
[MWW13] Y. Mei, X. Wang, and J. Wang. An Efficient Character Segmentation Algorithm for Printed Chinese Documentation. In UCMA, pages 183–189, 2013.
[Nag92] G. Nagy. At the Frontiersof OCR. IEEE, 80(7):1093–1100, 1992.
[NHR+13] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, S.A. Madani, and S.U. Khan. The Optical Character Recognition of Urdu-like Cursive Scripts. Pattern Recognition, 47(3):1229–1248, 2013.
[NHR+14] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, S.A. Madani, and S.U. Khan. Challenges in Baseline Detection of Arabic Script Based Languages, 2014.
[NLS+01] P.Natarajan,Z.Lu,R.M.Schwartz,I.Bazzi,andJ.Makhoul. Multilingual Machine Printed OCR. International Journal on Pattern Recognition and Artificial Intelligence,15(1):43–63, 2001.
[NS14] N. Nobile and Y. Suen. Text Segmentation for Document Recognition. In D. Doermann and K. Tombre, editors, Handbook of Document Image Processing and Recognition, pages 257–290. Springer,2014.
[NUA+15] S. Naz, A. I. Umar, R. Ahmad, S. B. Ahmed, S. H. Shirazi, and M.I. Razzak. Urdu Nastaĺiq Text Recognition System Based on Multidimensional Recurrent Neural Network and Statistical Features. Neural Computing and Applications, 26(8), 2015.

[OCR15] OCRopus, January 2015. Available at https://github.com/tmbdev/ocropy.

[OHBA11] M. A. Obaida, M. J. Hossain, M. Begum, and M. S. Alam. Multilingual OCR(MOCR):AnApproachtoClassifyWordstoLanguages. Int’lJournal of Computer Applications, 32(1):46–53, October2011.

[Pal04] U. Pal. Indian Script Character Recognition: A Survey. Pattern Recognition, 37:1887–1899, 2004.
[PC97] U.PalandB.B.Chaudhuri. PrintedDevanagariscriptOCRsystem. VIVEKBOMBAY-,10:12–24, 1997.
[PC02] U. Pal and B. B. Chaudhuri. Identification of different script lines from multi-script documents. Image Vision Computing, 20(13-14):945–954, 2002.
[PD14] U. Pal and N. S. Dash. Script Identification. In D. Doermann and K. Tombre, editors, Handbook of Document Image Processing and Recognition, pages 291–330. Springer,2014.
[PMM+02] M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri. IfN/ENIT-Database of Handwritten Arabic Words. In CIFED, pages 129–136, Hammamet, Tunisia, Oct 2002.
[Pop12] A. C. Popat. Multilingual OCR Challenges in Google Books, 2012.
[PS05] U. Pal and A. Sarkar. Recognition of Printed Urdu Text. In ICDAR, pages 1183–1187, 2005.
[PV10] M. C. Padma and P. A. Vijaya. Global Approach For Script Identification UsingWaveletPacketBasedFeatures. InternationalJournalofSignalProcessing, Image Processing And Pattern Recognition,3(3):29–40, 2010.
[Rab89] L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[Ras10] Rashid,S.F.andShafait,F.andBreuel,T.M. DiscriminativeLearningFor Script Recognition. In ICIP,Hong Kong,sep 2010.
[Ras14] S. F. Rashid. Optical Character Recognition – A Combined ANN/HMM Approach. PhD thesis, Kaiserslautern, Germany, 2014.
[RDL13] R. Rani, R. Dhir, and G. S. Lehal. Script Identification of Pre-segmented Multi-font Characters and Digits. In ICDAR, Washington D.C., USA, August 2013.
[RH89] W.S.RosenbaumandJ.J.Hilliard. SystemandMethodforDeferredProcessing of OCR Scanned Mail, 07 1989.
[SAL09] R. Smith, D. Antonova, and D. S. Lee. Adapting the Tesseract Open Source OCR Engine for Multilingual OCR. In Int. Workshop on Multilingual OCR, July 2009.
[SCPB13] N.Sharma,S.Chanda,U.Pal,andM.Blumenstein.Word-wisescriptidentificationfromvideoframes. InICDAR,pages867–871,WashingtonD.C. USA, Aug2013.
[Sen94] A.W.Senior. OfflineCursiveHandwritingRecognitionusingRecurrentNeural Networks. PhD thesis, England, 1994.
[SHNS09] M. Sagheer, C. He, N. Nobile, and C. Suen. A New Large Urdu Database for Off-Line Handwriting Recognition. Pages 538–546, Vietri sul Mare, Italy, Sep 2009.
[SIK+09] F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, and J. Hennebert. A New Arabic Printed Text Image Database and Evaluation Protocols. In ICDAR, pages 946–950, Barcelona, Spain, July 2009.
[SJ12] N. Sankaran and C.V. Jawahar. Recognition of printed Devanagari text using BLSTM Neural Network. In ICPR,pages 322–325, nov2012.
[SJ15] A.K.SinghandC.V.Jawahar. CanRNNsReliablySeparateScriptandLanguageatWordandLineLevel? InICDAR,pages976–980,Nancy,France, August2015.
[SKB08] F. Shafait, D. Keysers, and T. M. Breuel. Efficient Implementation of Local Adaptive Thresholding Techniques Using Integral Images. In B. A. YanikogluandK.Berkner,editors,DRR-XV,page681510,SanJose,USA, Jan 2008.
[SLMZ96] R. Schwartz, C. LaPre, J. Makhoul, and Y. Zhao. Language-Independent OCRUsingaContinuousSpeechRecognitionSystem. InICPR,page99– 103, Vienna, aug 1996.
[SM73] R. Sinha and K. Mahesh. A Syntactic Pattern Analysis System and Its Application toDevanagari Script Recognition. 1973.
[Smi07] R. Smith. An Overview of the Tesseract OCR Engine. In ICDAR, pages 629–633, 2007.
[Smi13] R. Smith. History of the Tesseract OCR Engine: What Worked and What Didn't. In DRR-XX, San Francisco, USA, Feb 2013.
[SP00] J. Sauvola and M. Pietikäinen. Adpative Document Image Binarization. Pattern Recognition, 33:225–236, 2000.
[Spi97] A. L. Spitz. Multilingual Document Recognition. In H. Bunke and P. S. P. Wang, editors, Handbook of character Recognition and Document Image Analysis, pages 259–284. WorldScientific Publishing Company,1997.
[SPS08] B. Shaw, K.S. Parui, and . Shridhar. Offline Handwritten Devanagari Word Recognition: A holistic approach based on directional chain code feature and HMM. In ICIT,pages 203–208. IEEE, 2008.
[SS13] N. Sabbour and F. Shafait. A Segmentation-Free Approach to Arabic and Urdu OCR. In DRR-XX, San Francisco, CA, USA, Feb 2013.
[SSR10] T.Saba,G.Sulong,andA.Rehman. ASurveyonMethodsandStrategies onTouchedCharactersSegmentation. InternationalJournalofResearch and Reviews in Computer Science,1(2):103–114, 2010.
[SUHKB06] F. Shafait, A. Ul-Hasan, D. Keysers, and T.M. Breuel. Layout analysis of urdu document images. In Multitopic Conference, 2006. INMIC ’06. IEEE, pages 293–298, Dec 2006.
[SUHP+15] F. Simistira, A. Ul-Hasan, V. Papavassiliou, B. Gatos, V. Katsouros, and M.Liwicki. RecognitionofHistoricalGreekPolytonicScriptsUsingLSTM Networks. In ICDAR,page 766–770, Nancy, France,aug 2015.
[Sut12] I. Sutskever. Training Recurrent Neural Networks. PhD thesis, Dept. of ComputerScience, Univ.of Toronto,2012.
[SYVY10] R. Singh, C.S. Yadav, P. Verma, and V. Yadav. Optical Character Recognition(OCR)forPrintedDevnagariScriptUsingArtificialNeuralNetwork. International Journal of Computer Science & Communication, 1(1):91–95, 2010.
[Tau29] G. Tauscheck. Reading machine, 12 1929.
[Tes14] Tesseract, June 2014. http://code.google.com/p/tesseract-ocr/.
[TNBC00] K.Taghva,T.Nartker,J.Borsack,andA.Condit. UNLV-ISRIdocumentcollection for research in OCR and information retrieval. In DRR–VII, page 157–164, San Jose CA, USA, 2000.
[UHAS+15] A. Ul-Hasan, M. Z. Afzal, F. Shafait, M. Liwicki, and T. M. Breuel. A Sequence Learning Approach for Mutliple Script Identification. In ICDAR, pages 1046–1050, Nancy, France,2015.
[UHASB13] A. Ul-Hasan, S. B. Ahmed, F. Shafait, and T. M. Breuel. Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks. In ICDAR, pages 1061–1065, WashingtonD.C., USA., Aug2013.
[UHB13] A.Ul-HasanandT.M.Breuel. CanweBuildLanguageIndependentOCR using LSTM Networks? In International Workshop on Multilingual OCR, page 9, WashingtonD.C., USA., Aug2013.
[USHA14] S. Urooj, S. Shams, S. Hussain, and F. Adeeba. CLE Urdu Digest Corpus. In CLT,Karachi, Pakistan,2014.
[vBSB08] J. van Beusekom, F. Shafait, and T. M. Breuel. Automated OCR Ground Truth Generation. In DAS, pages 111–117, Nara, Japan, Sep 2008.
[VGSP08] G. Vamvakas, B. Gatos, N. Stamatopoulos, and S. Perantonis. A Complete Optical Character Recognition Methodology for Historical Documents. In DAS, pages 525–532, Nara, Japan, Sep 2008.
[Vit67] A.J. Viterbi. Error bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. Information Theory, IEEE Transactions on, 13(2):260–269, April 1967.
[Wer90] P. Werbos. Backpropagation Through Time: What Does It Do and How toDo It. In Proceedings of IEEE, volume78, page 1550–1560, 1990.
[Whi13] N. White. Training Tesseract for Ancient Greek OCR. Eutypon, pages 1–11, 2013.
[WLS10] N. Wang, L. Lam, and C. Y. Suen. Noise Tolerant Script Identification of Printed Oriental and English Documents Using a Downgraded Pixel Density Feature. In ICPR, pages 2037–2040, 2010.
[YSBS15] M.RYousefi,M.R.Soheili,T.M.Breuel,andD.Stricker. AComparisonof 1D and 2D LSTM Architectures for Recognition of Handwritten Arabic (Acceptedfor publication). In DRR–XXI, San Francisco,USA, 2015.
[YY94] S.J. Young and S. Young. The HTK Hidden Markov Model Toolkit: Design and Philosophy. Entropic Cambridge Research Laboratory, Ltd., 2:2– 44, 1994.

Adnan Ul-Hasan's PhD thesis: Text-Line Normalization

Text-Line Normalization
The relative position and scale of individual characters in a text-line are important features for Latin and many other scripts. Normalization of text-lines helps in making this information consistent across all text-lines in a given database. There are many normalization methods proposed in the literature. Normalization methods that have been used for various experiments reported in this thesis are described in the sections below.
B.1 Image Rescaling
Image rescaling is the simplest method to make the heights of all images in a database equal. For a desired image height, a scale can be calculated as follows:
scale = target_height / actual_height
This scale is then used to determine the width of the "normalized" image by simply multiplying it with the width of the actual image:
target_width = scale * actual_width
This normalization is used in the current thesis for some of the OCR experiments reported for Urdu Nastaleeq script.
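
In code, the whole of B.1 amounts to a couple of lines; this is just an illustrative sketch (the function name and resampling filter are my choices, not the thesis's):

```python
# Rescale a text-line image to a fixed height, keeping the aspect ratio.
from PIL import Image

def rescale_line(img, target_height):
    scale = target_height / float(img.height)
    target_width = int(round(img.width * scale))
    return img.resize((target_width, target_height), Image.BILINEAR)
```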
Figure B.1: Zone-based normalization for a Devanagari text-line image. Three zones are shown with corresponding percentages of the image. In the bottom row, a normalized image is shown with a height of 40 pixels.
B.2 Zone-Based Normalization
Characters in many scripts like Latin, Greek and Devanagari follow certain typographic rules. A text-line in such scripts can be divided into three zones. A baseline passes through the bottom of the majority of the characters, and a mean-line is at the middle height from the baseline to the top edge of a text-line. Most of the small characters, like 'x', 's', and 'o', lie between these two lines. The portion of the characters that extends above the mean-line is termed the 'ascender', and that extending below the baseline is termed the 'descender'. The zone between the baseline and the mean-line is the middle-zone, the zone below the baseline is the bottom-zone and the zone above the mean-line is called the top-zone. A sample text-line in Devanagari script with these three zones is shown in Figure B.1.
Rashid et al. [Ras14] proposed a text-line normalization method which uses the above-mentioned three zones. Statistical analysis is carried out to estimate these zones in an image and then each zone is rescaled to a specific height by the simple rescaling described in the previous section. This normalization method has been employed for the experiments reported for Devanagari script in this thesis.
B.3 Token-Dictionary based Normalization
This text-line normalization method is based on a dictionary composed of connected component shapes and associated baseline and x-height information. This dictionary is pre-computed based on a large sample of text-lines with baselines and x-heights derived from alignment of the text-line images with textual ground-truth, together with information about the relative position of Latin characters to the baseline and x-height. Note that for some shapes (e.g., p/P, o/O), the baseline and x-height information may be ambiguous; the information is therefore stored in the form of probability densities given a connected component shape. The connected components do not need to correspond to characters; they might be ligatures or frequently touching character pairs like "oo" or "as".

Figure B.2: Extraction of the x-height and baseline of a text-line. (a) shows the original text-line image, (b) shows a map of predicted locations of the x-height, and (c) shows a map of predicted locations of the baseline. Note that the x-height is determined correctly for capital letters.
To measure the baseline and x-height of a new text-line, the connected components are extracted from the text-line and the associated probability densities for the baseline and x-height locations are retrieved. These densities are then mapped and locally averaged across the entire line, resulting in a probability map for the baseline and x-height across the entire text-line. Maps of x-height and baseline of an example text-line (Figure B.2-(a)) are shown in Figure B.2-(b) and (c) respectively. The resulting densities are then fitted with curves and are used as the baseline and x-height for line size normalization. In line size normalization, (possibly curved) baseline and x-height lines are mapped to two straight lines in a fixed size output text-line image, with the pixels in between them rescaled using a spline transformation. This method of normalization of a text-line has been used in the experiments for English and Fraktur.

B.4 Filter-based Normalization
The zone-based and token-dictionary methods work satisfactorily for scripts where either the baseline and x-height information is easily estimated or where segmentation can be done to extract individual characters. They fail to perform reasonably for Urdu Nastaleeq script, where neither the baseline nor the segmentation is trivial to estimate. The filter-based normalization method is independent of estimating the baseline or individual characters. This method is based on simple filter operations and an affine transformation, which makes it a script-independent normalization method, as compared to the normalization process described in the previous section, which was based on the shapes of the Latin alphabet. The complete normalization process is shown in Figure B.3. The input text-line image is first inverted and smoothed with a large Gaussian filter. The benefit of doing this is to capture the global structure of the underlying contents. Now, as shown in Figure B.3-(a), the smoothed image has maximum values near the center of the image along the vertical axis. These points are then fitted with a straight line (in practice, we smooth the line passing through these points as well). This is the line around which the whole text-line is re-scaled using an affine transformation. First a zone is found according to the difference between the height of the input image and the center line. Now, to make sure that the finally normalized image contains all the contents without clipping, the next step is to expand the image above and below the center line by an amount equal to the height of the image. This padded image is then cropped using the zone measurement found previously. Finally, the image is scaled to the required height using an affine transformation. The width of the final image is calculated by multiplying the original width with the ratio of the "target" height to the height of the dewarped image. The only tunable parameter in this method is the target height. Other parameters are calculated from the given image itself. This text-line normalization is used for the works reported for Urdu Nastaleeq, historical Latin and for multilingual documents. Some of the normalized images using this methodology are shown in Figure B.4.
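
The steps above translate fairly directly into a few array operations. The following is only a rough sketch of my own reading of the description (function name, smoothing sigmas and padding strategy are assumptions, not the thesis implementation):

```python
# Filter-based line normalization sketch: smooth the inverted line image, estimate a
# center line from the column-wise maxima, straighten around it, then rescale.
import numpy as np
import scipy.ndimage as ndi

def normalize_line(img, target_height=48, band=None, sigma=(4.0, 20.0)):
    """img: 2-D grayscale array, dark text on a light background."""
    inv = img.max() - img.astype(float)                       # invert: text becomes bright
    smooth = ndi.gaussian_filter(inv, sigma)                  # capture the global structure
    center = np.argmax(smooth, axis=0).astype(float)          # per-column vertical maximum
    center = ndi.gaussian_filter1d(center, 20.0)              # smooth the fitted center line
    h, w = img.shape
    band = band or h                                          # half-height of the crop band
    padded = np.pad(inv, ((band, band), (0, 0)), mode="constant")  # expand above and below
    out = np.empty((2 * band, w))
    for x in range(w):                                        # straighten the center line
        c = int(round(center[x])) + band
        out[:, x] = padded[c - band:c + band, x]
    scale = target_height / float(out.shape[0])
    return ndi.zoom(out, (scale, scale))                      # final rescale to target height
```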
