hocr

The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML-formatted hOCR output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata on how words are organized into lines, paragraphs, content areas and pages. Read more about the hOCR specification here.

One of the key elements of the hOCR format is the "bounding box" (bbox): a rectangular region of the image covering the extent of a word recognized by tesseract. The bbox can be used to extract the corresponding part of the image, for example with the magick package and the bbox_to_geometry helper function.
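
To make the bbox idea concrete, here is a minimal base R sketch of how an hOCR bbox string ("x1 y1 x2 y2") could be turned into a magick-style geometry string ("WIDTHxHEIGHT+X+Y"). The function name bbox_string_to_geometry is hypothetical and introduced only for illustration; it is not the package's bbox_to_geometry, whose exact signature is not shown here.

```r
# Hypothetical helper (an assumption, not part of hocr): convert an hOCR
# bbox string "x1 y1 x2 y2" into a geometry string "WIDTHxHEIGHT+X+Y".
bbox_string_to_geometry <- function(bbox) {
  # split on whitespace and convert the four corner coordinates to integers
  coords <- as.integer(strsplit(trimws(bbox), "\\s+")[[1]])
  stopifnot(length(coords) == 4, !anyNA(coords))
  width  <- coords[3] - coords[1]  # x2 - x1
  height <- coords[4] - coords[2]  # y2 - y1
  # the offset is the upper-left corner of the box
  sprintf("%dx%d+%d+%d", width, height, coords[1], coords[2])
}

geometry <- bbox_string_to_geometry("38 58 271 103")
geometry
#> [1] "233x45+38+58"
```

A string of this shape is what magick::image_crop() accepts as its geometry argument, so the region for a single word could be cropped out of the page image.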

hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000, only the tesseract output format is supported; support for OCRopus will be added in the future.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")

Example

This is a basic example which shows you how to solve a common problem:

library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
#devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots

We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using the magick package (see the image preparation script on GitHub).

cupcakes <- system.file("extdata", "peanutbutter.png", package="hocr")


recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>% 
  hocr::hocr_parse() %>% 
  hocr::tidy_tesseract()
recipe
#> # A tibble: 234 x 21
#>    ocrx_word_id ocrx_word_bbox ocrx_word_conf ocrx_word_tag ocrx_word_value
#>    <chr>        <chr>                   <dbl> <chr>         <chr>          
#>  1 word_1_1     38 58 271 103              85 strong        Chocolate      
#>  2 word_1_2     287 61 451 103             86 text          Peanut         
#>  3 word_1_3     468 62 619 103             89 text          Butter         
#>  4 word_1_4     636 60 852 113             84 strong        Cupcakes       
#>  5 word_1_5     36 153 112 182             87 strong        Your           
#>  6 word_1_6     123 152 184 1~             88 strong        kids           
#>  7 word_1_7     196 152 250 1~             88 strong        will           
#>  8 word_1_8     264 152 324 1~             85 strong        love           
#>  9 word_1_9     337 152 417 1~             84 strong        these          
#> 10 word_1_10    431 154 472 1~             90 text          (as            
#> # ... with 224 more rows, and 16 more variables: ocr_line_id <chr>,
#> #   ocr_line_bbox <chr>, ocr_line_xbaseline <dbl>,
#> #   ocr_line_ybaseline <dbl>, ocr_line_xsize <dbl>,
#> #   ocr_line_xdescenders <dbl>, ocr_line_xascenders <dbl>,
#> #   ocr_par_id <chr>, ocr_par_lang <chr>, ocr_par_bbox <chr>,
#> #   ocr_carea_id <chr>, ocr_carea_bbox <chr>, ocr_page_id <chr>,
#> #   ocr_page_image <chr>, ocr_page_bbox <chr>, ocr_page_no <dbl>
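
Because every word carries the id of the line it belongs to, the text can be reassembled line by line. The following is a small base R sketch on a toy data frame rather than the real recipe object: the words and confidence values are copied from the output above, while the ocr_line_id values ("line_1_1", "line_1_2") are assumed for illustration.

```r
# Toy stand-in for a few rows of `recipe`; the line ids are assumed values.
words <- data.frame(
  ocrx_word_value = c("Chocolate", "Peanut", "Butter", "Cupcakes",
                      "Your", "kids", "will", "love", "these"),
  ocrx_word_conf  = c(85, 86, 89, 84, 87, 88, 88, 85, 84),
  ocr_line_id     = c(rep("line_1_1", 4), rep("line_1_2", 5)),
  stringsAsFactors = FALSE
)

# Keep lines in order of first appearance rather than alphabetical order
line_ids <- factor(words$ocr_line_id, levels = unique(words$ocr_line_id))

# One row per line: the pasted text and the mean word confidence
lines <- data.frame(
  ocr_line_id = levels(line_ids),
  text        = as.vector(tapply(words$ocrx_word_value, line_ids,
                                 paste, collapse = " ")),
  mean_conf   = as.vector(tapply(words$ocrx_word_conf, line_ids, mean)),
  stringsAsFactors = FALSE
)
lines
```

The same pattern applies to ocr_par_id, ocr_carea_id and ocr_page_id when a coarser grouping is needed.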

Now that the data is in tidy format, let's render the page in ggplot and draw bounding boxes around words and paragraphs to illustrate the benefits of a parsed document structure. tesseract reports bboxes in a coordinate system whose origin is the upper-left corner, whereas ggplot places the origin at the bottom-left. We will therefore flip all y-values to a bottom-left scale and plot the bounding boxes alongside the original picture, with boxes filled by tesseract confidence score and outlined by paragraph.

# Split the word and page bbox strings ("x1 y1 x2 y2") into integer columns,
# then flip the y-values from a top-left to a bottom-left origin for ggplot.
p1 <- recipe %>% 
  separate(ocrx_word_bbox, into = c("word_x1", "word_y1", "word_x2", "word_y2"), convert = TRUE) %>% 
  separate(ocr_page_bbox, into = c("page_x1", "page_y1", "page_x2", "page_y2"), convert = TRUE) %>% 
  mutate(word_y1 = max(page_y2) - word_y1,
         word_y2 = max(page_y2) - word_y2) %>% 
  ggplot(aes(xmin = word_x1, ymin = word_y1, xmax = word_x2, ymax = word_y2)) +
  # outline color marks the paragraph, fill shows the tesseract confidence
  geom_rect(aes(color = ocr_par_id, fill = ocrx_word_conf), show.legend = TRUE) +
  theme_minimal() +
  theme(panel.grid = element_blank(), 
        axis.text = element_text(size = 7), 
        legend.text = element_text(size = 7), 
        legend.title = element_text(size = 7))

library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate=TRUE)

p1+p2
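
The y-axis flip used for the plot can be checked in isolation. This is a minimal sketch: flip_y is a hypothetical helper name and the page height of 1000 pixels is an assumed value, since the real height comes from ocr_page_bbox.

```r
# Hypothetical helper: convert a y-coordinate measured from the top edge
# of the page into one measured from the bottom edge.
flip_y <- function(y, page_height) page_height - y

page_height <- 1000L  # assumed page height in pixels, for illustration only

# Word bbox "38 58 271 103": top edge at y = 58, bottom edge at y = 103
y_top    <- 58L
y_bottom <- 103L

ymax_plot <- flip_y(y_top, page_height)     # top edge is now the larger value
ymin_plot <- flip_y(y_bottom, page_height)  # bottom edge is now the smaller value
c(ymin = ymin_plot, ymax = ymax_plot)
```

After the flip, the original y1 (top) becomes the larger of the two values, so it effectively plays the role of ymax; geom_rect draws the same rectangle regardless of which of the two is passed as ymin.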

Similar projects are listed here

[1] Rosenberg, L. M. (1986). Muffins & Cupcakes. American Cooking Guild, Gaithersburg, MD. Open Library edition OL1484439M. Accessed from https://archive.org/details/muffinscupcakes00rose on 28 July 2018.

hocr's Issues

move bbox functions to bunny

Spin off bbox functions into a separate package, add geometry function from bunny.
Not happening. Will move functions to bunny.

README suggestions

First of all, what a cool package, I really like the image in the README and I didn't know about hOCR before, nice stuff!

I have a few comments:

detect columns

detect text organized in columns and, by extension, detect tables
