thomasgruebl commented on August 22, 2024

Hi, thanks for raising this issue and glad to hear that you like rusty_tesseract!

Tesseract (and therefore rusty_tesseract) can already produce hOCR output: set the 'tessedit_create_hocr' config variable to '1'.

See lines 31-40 of main.rs. You can add the hOCR flag to the config_variables HashMap as follows:

    use rusty_tesseract::Args;
    use std::collections::HashMap;

    let image_to_string_args = Args {
        lang: "eng".into(),
        config_variables: HashMap::from([
            (
                "tessedit_char_whitelist".into(),
                "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".into(),
            ),
            // Ask Tesseract to emit hOCR instead of plain text.
            ("tessedit_create_hocr".into(), "1".into()),
        ]),
        dpi: Some(150),
        psm: Some(6),
        oem: Some(3),
    };

Then the rusty_tesseract::image_to_string() output looks as follows:

The String output is: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 4.1.1' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "/tmp/rusty-tesseractkxwqOh.png"; bbox 0 0 696 89; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 18 29 671 64">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 18 29 671 64">
     <span class='ocr_line' id='line_1_1' title="bbox 18 29 671 64; baseline 0 -1; x_size 44.862743; x_descenders 11.215686; x_ascenders 11.215686">
      <span class='ocrx_word' id='word_1_1' title='bbox 18 29 162 64; x_wconf 95'>LOREM</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 181 29 304 64; x_wconf 91'>IPSUM</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 323 29 476 64; x_wconf 91'>DOLOR</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 490 29 540 64; x_wconf 96'>SIT</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 553 30 671 63; x_wconf 96'>AMET</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
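
If you want to pull the words, bounding boxes and confidences back out of that string, something along these lines might be a starting point. This is only a rough sketch using the standard library: `HocrWord`, `parse_hocr_words` and `parse_word_line` are made-up names (not part of rusty_tesseract), and it assumes the layout shown above, i.e. one `ocrx_word` span per line with single-quoted attributes. It does not decode HTML entities or handle nested tags, so a proper HTML/XML parser would be the safer choice for anything serious.

```rust
/// One recognized word from the hOCR output (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct HocrWord {
    text: String,
    bbox: [u32; 4], // x0, y0, x1, y1 in pixels
    confidence: u32, // the x_wconf value
}

/// Collects every `ocrx_word` span, assuming one span per line.
fn parse_hocr_words(hocr: &str) -> Vec<HocrWord> {
    let mut words = Vec::new();
    for line in hocr.lines() {
        let line = line.trim();
        if !line.starts_with("<span class='ocrx_word'") {
            continue;
        }
        if let Some(word) = parse_word_line(line) {
            words.push(word);
        }
    }
    words
}

/// Parses a single word span; returns None if the line has an unexpected shape.
fn parse_word_line(line: &str) -> Option<HocrWord> {
    const TITLE: &str = "title='";
    let title_start = line.find(TITLE)? + TITLE.len();
    let rest = &line[title_start..];
    let title_end = rest.find('\'')?;
    let title = &rest[..title_end];

    // The word text sits between the end of the opening tag and "</span>".
    let after_title = &rest[title_end..];
    let text_start = after_title.find('>')? + 1;
    let text_end = after_title.find("</span>")?;
    let text = after_title.get(text_start..text_end)?;

    // The title holds ';'-separated properties such as "bbox ..." and "x_wconf ...".
    let mut bbox = None;
    let mut confidence = None;
    for property in title.split(';') {
        let mut tokens = property.split_whitespace();
        match tokens.next() {
            Some("bbox") => {
                let nums: Vec<u32> = tokens.filter_map(|t| t.parse().ok()).collect();
                bbox = <[u32; 4]>::try_from(nums).ok();
            }
            Some("x_wconf") => {
                confidence = tokens.next().and_then(|t| t.parse().ok());
            }
            _ => {}
        }
    }

    Some(HocrWord {
        text: text.to_string(),
        bbox: bbox?,
        confidence: confidence?,
    })
}
```

Passing the string returned by rusty_tesseract::image_to_string() to `parse_hocr_words` should then give one entry per recognized word, as long as the output keeps the format shown above.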

However, it may not be obvious to new users that Tesseract has such a config flag, so please feel free to add a new function image_to_hocr that automatically adds the tessedit_create_hocr flag to the config_variables HashMap.

P.S. Similarly, you can append the tessedit_create_alto flag to the config_variables or any other flag that is listed in the tesseract --print-parameters list.
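
To illustrate the idea, here is a minimal sketch of the flag-appending part only, written against a plain HashMap so it does not depend on the crate. The names `with_output_flag` and `with_hocr_flag` are hypothetical and do not exist in rusty_tesseract; an actual image_to_hocr would presumably do something similar to the Args before delegating to image_to_string.

```rust
use std::collections::HashMap;

/// Hypothetical helper: takes ownership of the config variables and returns
/// them with the given Tesseract output flag set to "1". An existing entry
/// with the same name is overwritten.
fn with_output_flag(
    mut config_variables: HashMap<String, String>,
    flag: &str,
) -> HashMap<String, String> {
    config_variables.insert(flag.to_string(), "1".to_string());
    config_variables
}

/// Hypothetical convenience wrapper for the hOCR case.
fn with_hocr_flag(config_variables: HashMap<String, String>) -> HashMap<String, String> {
    with_output_flag(config_variables, "tessedit_create_hocr")
}
```

The returned map could then be used as the config_variables field of Args before calling rusty_tesseract::image_to_string(). The same helper would cover the ALTO case with `with_output_flag(config, "tessedit_create_alto")`.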

Thanks,

Thomas

from rusty-tesseract.
