Abstract
Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabilistic mapping between modern orthography and that used in the document. Our system thus produces dual diplomatic and normalized transcriptions simultaneously, and achieves a 35% relative error reduction over a state-of-the-art OCR model on diplomatic transcription, and a 46% reduction on normalized transcription.
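For readers unfamiliar with the metric, relative error reduction is the share of the baseline's error that the new system removes. The sketch below shows only the arithmetic; the function name and the error rates are hypothetical and are not taken from the paper, which reports only the relative figures.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Return the fraction of the baseline error removed by the new system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the calculation.
baseline = 0.20  # assumed baseline character error rate
system = 0.13    # assumed error rate of the new system
reduction = relative_error_reduction(baseline, system)
print(f"relative error reduction: {reduction:.0%}")
```

Note that a relative reduction says nothing about the absolute error rate: the same percentage can correspond to very different absolute improvements depending on the baseline.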
Optical Character Recognition (OCR) for historical texts, a challenging problem due to unknown fonts and deteriorating documents, is made even more difficult by the fact that orthographic conventions including spelling, accent usage, and shorthands have not been consistent across the history of printing. For this reason, modern language models (LMs) yield poor performance when trying to recognize characters on the pages of these documents. Furthermore, transcription of the actual printed characters may not always be the most desirable output.
from awesome-ocr.
The Primeros Libros corpus dataset introduced in our previous work consists of multilingual (Spanish/Latin/Nahuatl) books printed in Mexico in the 1500s (Garrette et al., 2015). The original dataset includes gold diplomatic transcriptions of pages from five books of different time periods, regions, languages, and fonts. We additionally include two new monolingual Spanish Primeros Libros books annotated with both diplomatic and normalized transcriptions. Spanish-only texts were needed in order to find language-competent annotators skilled enough to create the more challenging normalized transcriptions. We used each of the seven books in isolation since they each have a different font. For each book, we used 20 pages for training and 10 for testing. For two of the books, an additional 10 pages were held out for tuning hyperparameters with grid search.
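The per-book split described above could be sketched as follows. This is an illustration under stated assumptions: the function name, the page identifiers, and the random shuffling are all hypothetical, since the excerpt gives the split sizes but not how pages were selected.

```python
import random


def split_book_pages(page_ids, n_train=20, n_test=10, n_dev=0, seed=0):
    """Split one book's pages into disjoint train/test/dev sets.

    Assumption: pages are shuffled before splitting. The excerpt only
    states the sizes (20 train, 10 test, optionally 10 held out).
    """
    pages = list(page_ids)
    needed = n_train + n_test + n_dev
    if len(pages) < needed:
        raise ValueError(f"need {needed} pages, got {len(pages)}")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(pages)
    return {
        "train": pages[:n_train],
        "test": pages[n_train:n_train + n_test],
        "dev": pages[n_train + n_test:needed],
    }


# Hypothetical book with 40 transcribed pages and a held-out tuning set.
splits = split_book_pages(range(1, 41), n_dev=10)
print({name: len(ids) for name, ids in splits.items()})
```

Each book is split on its own, which matches the statement that the books were used in isolation because their fonts differ.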
To produce the Spanish and Latin LMs, we used texts from Project Gutenberg; these documents were written during the target historical period, but all follow modernization standards including substitution for obsolete characters and expansion of shorthand. These texts were chosen because they are a realistic sample set that is freely and publicly available. In the Nahuatl case, scarce online resources made it necessary to supplement Project Gutenberg text with that from a private collection.
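To make the idea of modernization concrete, the sketch below applies two fixed rewrite rules to a diplomatic string: one substitutes an obsolete character (long s) and one expands a shorthand (q with a tilde). This is a deterministic lookup, not the unsupervised probabilistic mapping the paper learns, and the rule table and function name are hypothetical examples rather than material from the paper.

```python
# Hypothetical rewrite rules, for illustration only; not taken from the paper.
RULES = [
    ("q\u0303", "que"),  # q + combining tilde: a printed abbreviation
    ("\u017f", "s"),     # long s, an obsolete letter form
]


def normalize(diplomatic: str) -> str:
    """Apply each fixed rule in order to produce a modernized string."""
    normalized = diplomatic
    for old, new in RULES:
        normalized = normalized.replace(old, new)
    return normalized


# Invented sample strings written with the obsolete forms above.
for sample in ["e\u017fto", "aunq\u0303"]:
    print(sample, "->", normalize(sample))
```

A fixed table like this cannot handle variants whose normalization depends on context, which is one motivation for learning the mapping from data instead.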