tesseract-ocr / tessdata_best Goto Github PK

View Code? Open in Web Editor NEW

1.2K 1.2K 373.0 1.31 GB

Best (most accurate) trained LSTM models.

License: Apache License 2.0

tessdata_best's People

Contributors

Stargazers

Watchers

Forkers

stweil gyan111 ctociojosh glennneiger crazypenguincode kirasystems cottp watsonabacus hopbuck keithkim shanleiyang enterstudio william179825800 zhangxinnan papeldigitaltm wantongtang hayekzh epoulsen vichpetr gengkunling sauradip bash-master ullychan iywo ming-hai guangshengshi kashenfelter openinstitutecambodia rememberag aazhar jai2033shankar 17181370591 sosoho sanyaade-machine-learning ltyfromchina devinhee waydaqin kazux royjohal tarsbase jhuang778 qq1091192337 torasonic twyt zhangtaozhir zhoumh1988 ellery92 ideasdot faaredge yuhonghong7035 catelina95 bagavathkumar kartikkwatra hezbygdx braimourad samanalysis xionggithub williamteoh1976 miketong08 linumber1jie luoyuahui lingyuncao andromedaeempire trulypv suifeng2048 13210272406 inaimathi iamsagar125 jinlmsft lxygoodjob gvillamayor alan69-2 trongdamnguyen cbanma ashco-2019 chiceman zhiwei16 sebchko aiedward rishi-arch habibzadeh xiaohongxiao zpf2012 guozht mstroehle pattanandev x-flysnail lobster002 davidsonggithub vladislavvolochai chriskeck yiwenfy chiangtor 1548397772 git-thinh chenshun131 holdzz blackhilloldmonster helloworld2048 liuyiyuan

tessdata_best's Issues

unclear/ambiguous language code notation

While working on a mapping of bibliographic language codes (ISO 639-2/B due to the RDA application guidelines of the German National Library https://wiki.dnb.de/download/attachments/127172808/Kapitel_6.pdf?version=2&modificationDate=1505213938000&api=v2) to the corresponding (presumably ISO 639-2/T coded?) language models, I came across three language codes for which I could not find a match:

tesseract_best/frk.traineddata
tesseract_best/kmr.traineddata
tesseract_best/osd.traineddata

I therefore suspected that the encoding of the language models was done according to ISO 639-3 (https://iso639-3.sil.org/code_tables/639/data) and found matches for frk(=Frankish) and kmr(=Northern Kurdish), but still none for osd. Finally, I consulted the documentation (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) again and could see that kmr(=Northern Kurdish) is a correct suffix, while frk actually means "Frakturschrift" (is that correct?) and osd stands for the "Orientation & Script Detection Module".

Since these two are not languages in the actual sense, it would be appropriate in my opinion to designate these two training models in such a way that they can be clearly distinguished from the language codes consisting of three letters. Furthermore it would be surely good to put these models into another directory and not parallel to the languages, so that no further misunderstandings arise here. This already happens with the font types and should therefore be done here in the same way.

I am aware that there are some languages with multiple models that need to be in parallel because:

_vert (vertical text flow)
_latn / _cyrl (other type system)
_old (old)
_sim (simplified writing system)
_tra (traditional notation)

In such cases it is also clear that here a notation reduced to three letters is not possible and all those models are necessary as they are. Nevertheless, at least the stem of the language code should be clearly assignable. A clear indication whether for this purpose ISO 639-2/T or ISO 639-3 is coded would be helpful.

Since kmr(=Northern Kurdish) does not emerge from ISO 639-2, but only kur(=Kurdish) is used there, one would conclude, as I do, that ISO 639-3 is authoritative. In this case, however, one could possibly also erroneously expect the typification according to macrolanguage construct used in ISO 639-3. ara(=Arabic) could then be understood as "Includes Standard Arabic and Egyptian Arabic), or nor(=Norwegian) "Includes Nynorsk and Bokmal). However, there are no language models available for this, as far as I can see.

For macro languages, however, a complete examination of the current language codes would actually have to take place and would also require the creation of "language sets", which do not currently exist in the form...I think this could be a stimulating discussion point and possibly an interesting feature in the future...but goes too far at this point. Also ISO 639-3 with currently 7910 language-codes is far bigger, than ISO 639-2 with about 460 language codes.

Thank you,
Michael Kubina

[ISSUE] (OSD) orientation script detection not supported using Tesseract best

Hello guys,

                Thank you very much for supporting and making this library open source. 🙏

Currently, I noticed a few glitches while calling pytesseract.image_to_osd(image, output_type=Output.DICT) after building from source. Which should return specific meta-data about my document - like its current orientation form.

While performing my research, it complained about a missing "dict.traineddata" model. Which i have searched for and it doesn't seem to exist.

Error Message
TesseractError: (1, 'Warning, detects only orientation with -l dict Error opening data file /usr/local/share/tessdata/dict.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'dict' Tesseract couldn't load any languages!
Could not initialize tesseract.')

Do you have any solution as to why this error occurs?

`chi_tra` does not recognize vertical text as the doc says

Hi!

I am uploading tons of old books in Traditional Chinese to the Internet Archive. And I am trying to find a set of proper cli options so that these books can be OCR-ed properly to be searchable. Some of them are in vertical text while some of them are in horizontal text. Rarely, some book contains both vertical and horizontal text on a single page.

According to https://github.com/tesseract-ocr/tessdata_fast/#example---jpn-and--japanese and #22 (comment), chi_tra loads chi_tra_vert "as a secondary language so it can try it in case the text is rendered vertically". So I suppose chi_tra should have recognized both vertical and horizontal Chinese text as documented.

But in my test with different images, chi_tra appears to never be able to recognize vertical Chinese text.

For example:

A scanned book page on Wikimedia Commons (medium quality):

With tesseract 5.0.0 + tessdata-best + -l chi_tra_vert, the result is

60%+ accuracy

Estimating resolution as 185
)
2
(

全省差不多完全陷入記嶄狀態中,關部有
盛世才.馬仲英:等之勾心鬥角,商各則
民軍鑑起,這秋分硼離析的情形。不能說
本是已覺失去了中比的統參.記英寢國主
義者以操從喀什叭爾人獨立的械會。

原因當然很多,以上不過是最重要的
幾項面已。我們從記艾事紛的原因蔓,就
可以知道租的性寅是如何嚴協了,還國上
下都廳當把視組集中,共諜解導的辦法,
現在讓我把御的嚴重性分析在下面:

1新服不特是我國最大的行省,面且
是我國西部一個最二要的一個門戶,自東
批洽陷以後,我國所有之陸地門戶僅此
而已,現商蝦獨立,不全為斷吾人之手足

o

馬占由將軍新有

第一幕 別家投軍
A物,父,繼,占山;
地點:輝實省,忱德縣,毛近城子鎮
做坎,-個農民的家庭:
開划時好各錠著櫥橙說
姆 :飯已做好了,他人怎麼還不同來

2英人自主全丁冰胸立後,所讓為「
樂士一者,印咯什噶爾及和韻「持,所以
英人早說注目於這塊地方了現在轎什嗎
爾全由英人主使面獨立.玫知沒有渝說古
廠第二的危險。

8新戰北臨萄條商界英帝國主義需的
保護國,現在新玉商部餞由英之主便面
獨立,肥歷知共估無條新賴北部之企國
,如此不特則新誤全部裕碼分,印卡個中
國亦有入各帝國幸義者共管的危險。

商夷問題餞有如此之吐重,作不得不
深塾旦略人人士,一心一得,共濟時琢,打
破自租白利的名念~以國家和多儿削題
那未,不衝商天問題可以應刃而解,航是
康個的中華民族也有了

 

 

(父和占山上)

父 ;飯中了嗎?

母 ;中了,吃吧(到崗端三史且上出接
過啟任樟上)

占山;兒有一中樹林采。

公組占崗兒有甚麼話說。

 

占山;兒想秀天做此農田工作,沒有甚麼
味道,兒欲去投軍,坊國出力,將
來有些希記抱未可知.不知雙親意
下如何?

父 ;你有志投軍,翅國出訪,我很生成

,不知你的朋親雪不背層你去o

依我說.還是做莊家話,雖設濕出

息,更得受栓确之驚,就當個生還

有由腿大希朗嗎?

占山:只有大表殺敬,坊國出力,吉國際

粥;

  

你去所好,不過在外比不在家
,出外要保宣身體
占山:忽兒怎政不甸母命不過事不宜選
天就走。
麼今天就走,停兩天過不妨事。
下出:再停兩天,則無基麼意思。
仿 :蟬出把飯碗拿去 經緊把行李整理

畫,到備動身。

占山:是(把飯碗端F)
第二幕 出發勳虛

  

 

(症莫)

  

人物:押誤江周箇吳俊逢,友諜長,
營長卡山.衛王四人。

With tesseract 5.0.0 + tessdata-best + -l chi_tra, it is

0% accuracy, pure garbage text

Estimating resolution as 185
LA
人
3

天 名 宙 K 圭 姑 導論 < 喘 刀 菜 淺 芋 、 埋 推 宁
繼 間 中 全 打 一 玫 科 千 光 四 所 慚 蛤 蟬
喲 時 臉 于 ^ 扣 宣 下 竺 要 說 品 意 眉 、K 幸 各
長 中 各 絆 衣 可 上 生 信守 找 說 二 打 全 區 選
二 主 十 二 二 汪 二 失 4, 直 2 光

上 不 4 二 生 坊 還 下邊 上 1
比 戰記 崗 。 條 時 表 毅 各 竺 象 千 馮 陣線 、 押
品 全 長 凋 謂 守 避 紹 鵲 蟬 寐 啟 佃 - 暫 藻 寢
二 記 主 二 三 二: 生 仁 生 站 二 二 芽 作
以 則 露 容 書 加 合用 條 齊全 本 四 上 國 人

全 容 路 人 其 限 但 蛋 K 千 寢 朮 ~ 民 戰
旨 倆 區 若 星 「 主 咯 居 線 宮 「 葬 生 上 ~ 王家
埋 無 說 家 ~ 介 國安 生 和 N 渥 妥 點 喲 返 吉
電 四 女 生 茲 謂 擴 、 入 六 扣 才 <N 呻

e

哈 二 三 嶼 訊 <

途 ] 紅 點 他 密 生
說 信守 , 作 汪 二 店 半 一 本
要 說 當 半生 、 彷 江 雍 、 崗 二 客 Hh 委
還 女 - 「 曙 品 喲 吧 各 漲
器 率 朮 只 課 涯 穎 王 好 所
一 還 四 家 家 P- 旦 時 天 要 K 宮 -

品 林 信守 了 點 屬 婦 和 祿 庄 窗 發 「
二 二 記 條 主:
林 時 禮 關 戰 角 表 要 居民 P、 廬 二 秋生 靈
族 品 生 表 之 也 弄 民 民 人 、 映 及 如 年 扯 發 加
松 遠 中 竺 恨 租 。

器 農務 幸 浸 滿 守 當 只 打 全國 避 櫥 四 各
號 息 區 店 站 淋 字 閏 竹 右 三 折 之 和 悶 還 生
田 噴 、 曲 眉 只 扣 容 對 悶 行 罕 容 時 壹 N 和 區
~ 時 旨 層 二 宮 診 呢 員 夫 圖 員 卻 。 蟬 郊 還 生
用 卻 定 巢 于 閃避 和 櫥 相 吉 學生 避 能 。

慚 陛 四 遇 蟬 生 太 寢 N 宮 個 、 半 人 完
蜂 話 匠 國 有 抽 - 1 說 呻 - 于 卻 查 諺 - 蟬
機 于 說 和 生 和 品, 大 區 弱 當 細 發 當 時 >
萊 伙 。 尼 章 慚 世 下 時 品 基 轎 卻 時 詩 、 供 理
辦 品 滲 世 章 記 當 選定 上

 

 

《 欠 黃 見 時 二 )

名 一 暴 二 同

舉 一 苹 讓 - 卻 如 人 志 戰 了 1 潑 宮 悶 生 蟬
兩 當 起 芝 二 )

生 時 全 絹 生 | 喲 菜 下 根 垃 。

估 罰 半生 戰 生 二 相 蟬 禹 。

 

北 生 人 如 虹 當 由 開 志 比 所 日 划 - 注 是 簡歷
當 總 。 中 表 出 幫 周 - 全 烘 于 和, 案
淋 征 基 提 路 說 霖 品 太 -K 旻 勿 蔭 轉
選 丰 是 人

欠 。 全 閏 生 質 罕 庄 、 倆 圖 于 胡 條 表 替 胃

- 科 太 間 各 尖 窗 生 公克 閨 半 電 人

全 全 人 上 上 和 交 汪 仁 和

本 ~ 字 電容 到 加 N 滿 - 餐 稻草 于 病

生 定點 祥 各 品 暨 人

生 生 避 生 開 操 以 暈 ” 倆 國 年 人 發 區 公

具 ,-

  

匡 漲 即 家 - 蠅 下 玉 吉 K 時 諺
謹 汪 廬 二 上 上 半 引
生 生 ~ 站 寂 汪 對 丰 揣 當 意 、K 理 沽 如 喲
屎 損 兩 。
點 十 骨 換 恨 、 倫 生 屎 司 尼 蛤 六 9
二 守 全 全 芋 骨 - 選 對 攝 出 笠 。
名 一 半 三 點 還 宮 炎 于 - 綿 義 抑 二 者 訓 轉

世 汪 主 1 汪

各 三 ~ 良 ( 如 襄 尖 張 呈 )
太 扛

  

 

( 守 洋 )

  

共生: 三 所 不: 二 - 二 1 合共 全
妊 讓 六 三 、 間 生 丰 之 。

A digital book screenshot from the Internet (high quality):

With tesseract 5.0.0 + tessdata-best + -l chi_tra_vert

80%+ accuracy

Estimating resolution as 324
的聘

後論

第一章我的家世
一醇臣親王的一生

 

 

公元一九O六年“即清朝光緒三十二年的舊曆正月十四 我出生於北京

 

 

 

 

 

 

 

 

 

 

土府 我的祖父奕讓 是道光皇帝的第七子 初封姥

 

後晉親王 死

法「賢」 所以後來稱做醇賢親王 我的父親戴滿 是祖父的第五子.

 

 

 

因為
帝(
子。

 

 

決定」

祀光
登極

 

 

 

 

 

 

 

 

 

 

 

 

即光緒皇帝) 所以祖父死後 由父親襲了王爵 我是

 

 

 

 

 

 

 

 

 

 

 

 

才一和第三“四子早殤 第二子戴活被姨母慈禧太后接進宮裡 當了皇

 

 

 

 

堵二代醇王的長

 

在我三歲那年的舊曆十月二十日 慈禧太后和光緒皇帝病篤“慈禧突然
業我為山皇帝 承繼同治(協淳 是慈禧親生子 協活的堂兄弟) 兼

 

 

 

 

緒 在我入宮後的兩天內 光緒與慈禧相繼去世“十

 

 

 

 

初軋日 我便

 

 

 

 

 

 

 

 

 

 

為皇帝| 清朝的第十代 也是最未一代的皇帝 年號

 

辛亥革命爆發 我退了位”

 

 

 

 

我的記憶是從退位時開始的 但是敘述我的前半生 如

 

 

 

 

 

先從我的祖父

統 不到三年

-

EL

With tesseract 5.0.0 + tessdata-best + -l chi_tra

0% accuracy

Estimating resolution as 324
選 蟬 |

惑 張

晨 | 壩 郵 己 散 利
|“ 續 歡 加 H 避 | 如

 

 

SIR 表 0K 財 品 靂 品 煙 己 串 十 1 財 呈 時 誤 四 號 十 轉 壽 如 填 多 財 似

 

 

 

 

 

 

 

 

 

 

H 芽 ” 位 還 點 有 林 旦 ” 咚 損 求 呷 民 哇 振 和 Mh 怀 所 芝

 

” 惑 虹 忠 出 ” 雇

夫 「 臨 」 計 朗 摔 公 隱 髮 單 區 噴出” 自 辟 伙 擊 筷 同 。 嚨 時 S 屆 振 田 7

 

 

 

區 pu
選 (
h

 

 

蕉 條 4

卷 水
商 呢

 

 

 

 

 

 

 

 

 

 

 

 

品 水 己 呷 扼 )” 鼎 馮 四 有 啟 蔗 ~” 鉀肥 加 同) 出 暇 ” 壬 吠

 

 

 

 

 

 

 

 

 

 

 

 

上 | 還 娠 山 ” 買 hh 呈 賬 ” 媒 1 IM 維 眉 果 申 過 格 攻 長 基 好 聲 爺 旦 ” 犯 ) 呻

 

 

 

 

典 | 1 笠 加 出 噁 貴

 

坦 郵 山 只 除 財 居 曙 誤 十 吃 戰 十 瑟 ” 權 量 長 央 還 水 己 呷 他 懂 蟬 ” 柱 旦 表 髮
出 竺 直選 呷 公民 鋪 啞 站 (和 紀 城 ” 咚 欄 量 號 如 Mb” 篩 獎 于 各 由 礎 )。 擲

 

 

 

 

慚 ?” 電 著 大爺 擴 理 旺 KK 尼 ~ 求 門 獸 手 對 亞 還 由 提 ” 十

 

 

 

 

還 表 趾 ) 所 申

 

 

 

 

 

 

 

 

 

 

相 呻 居 一 一 二 吧 己 振 十 笠 ” 和 還 張 曙 K | 冬 悍 呷 懾 ~ 此 緞 |

 

霸 售 寺 邊 能 緒 。 竺 咽 六 對"

 

 

 

 

郵 司 避 起 員 稻 順 仙 基 卅 擇 轉 啦 胡 移 氟 得 后 汗 填 ” 太

 

 

 

 

 

IT

| 炮 》 長 去 山寺

1%

Time leftin chqpter: 1m

The issue is not limited to some specific images. It can be reproduced on many (or every?) other book pages as well.

Add kur with Arabic unicharset (similar to old tessdata)

See related issue in langdata.

tesseract-ocr/langdata#116

Not able to download the zip or clone this repo.

I have tried downloading this repo in multiple systems but after some time, it fails automatically.

Can improved san.traineddata be uploaded in this repo?

I have plus-minus trained Sanskrit using the best/Devanagari.traineddata as a base and on my sample set it gives better accuracy than either the best/san or best/Devanagari models.

@theraysmith @jbreiden Should I create a PR with the traineddata file for this repo, so that

you can check with Google's test data to make sure that is is indeed more accurate.
it can be included in the debian release.

"§" symbol is not recognized in German language models

The "§" is not recognized and instead the "$" symbol is recognized.

In German legal texts and books the "§" symbol is ubiquitous. It is not a special character that is only needed in a few specialized cases.

Which language file should be downloaded?

Should I download the file in Script directory? or in the root? and if it is the one in Script dir, should I rename it? because all of them have the full name of the language, yet tesseract uses a 3 character input (eng instead of English). Finally, do I have to download the PDF and osd files too or just the training data?

Thank you.

Telugu unicode ambiguities

Hi,
I created a test text data mostly (made up individual characters. see attachment) and converted it to tiff file using 'jTessBoxEditorFX' with font 'noto sans telugu 8pt'. I then ran it using the the testdata_best telugu language trained data.
I noticed a few errors in recognizing them. I believe this are due to ambiguous glyphs'.

Ambiguity 1: Telugu has three vowels that are similar to another consonant (There is another consonant that looks close enough)
vowel 1) ఒ (pronounced as 'o' in 'so')
vowel 2) ఓ (pronounced as 'oa' in 'goal' )
vowel 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')

similar looking consonant 1) బ (pronounced as 'bu' in 'bus')
consonant 2) భ (this is same as above but uttered with stress and aspiration. Imagine saying 'bus' as 'bhus')

Ambiguity 2: Consonant చ (pronounced as 'ch' as in 'church') is similar to another rarely used consonant ౘ (closest transliteration 'tsa')

Ambiguity 3: Consonant ర (pronounced as 'ru' as in 'run') is similar to another consonant ఠ ( hard 't' - close to the 't' in 'stone')

Ambiguity 4: Consonant జ (pronounced as 'ju' as in 'justice') is similar to another rarely used consonant ౙ (closest trasilteration 'za') and also similar to ఙ ('jna')

Ambiguity 5: consonant ఝ (pronounced as 'jha' - hard జ with aspiration ) was interpreted as 'రు' (pronounced as 'ru' in 'rupee' ) which is a combination of Consonant ర ('ru') + vowel ఉ(pronounced as 'u' in 'push')

Ambiguity 6: vowel ఇ ( pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' did not get recognized at all in my test data.

Ambiguity 7: కౄ ( 'cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and 'గ్యూ' (gyoo).

Ambiguity 8: ౠ ('rroo') became బూ ('boo')

I guess some of them could be due to my poor tiff. But I think some of the ambiguities are genuine and need to be handled.

Please help to address these ambiguity resolutions.

tesseract-telugu.txt

Source of scripts/Fraktur etc.

While the files in the top directory seem to come from the sources in the langdata repository, the source for some of the files in scripts/ is unclear:

scripts/Fraktur.traineddata has no matching file in langdata,
scripts/Japanese.traineddata also, etc.

The Data-Files wiki article does not mention scripts/Fraktur.

This adds to the confusion of the frk language (not actually frankish, but Fraktur), the Fraktur script and the legacy model deu_frak in the tessdata repository.

Tesseract fails to recognize % sign in Hungarian language texts

When the language is set to "hun" (Hungarian), Tesseract is unable to recognize the % sign. This sign is very commonly used in Hungarian to represent percentages, the same way as in English. Tesseract instead sees various letters and digits - most commonly "96", sometimes "9", "69", "0", "S", "Z", or even nothing at all.

Even if I feed a generated image containing a % sign in large black type on a pure white background, I still can't get Tesseract to output the % sign, as long as the language is set to Hungarian.

Both the "fast" and "best" models suffer from this problem.

If I instead set the language to English, % sings are recognized without issue.

Fine tuning training for a mix language tessdata

If understand correctly, traineddata files that starts with a capital letter are "mixed languages" traineddata (e.g Hebrew = heb+eng).
Was is produced by combining "heb" and "eng" traineddata files or was it trained from scratch on a mix language data?
Is there anything i should do differently if i want to do a fine tune training the "Hebrew" traineddata compared to the "heb" traineddata?

fra.traineddata corrupted ?

Hey, this fra.traineddata (in tessdata_best) doesn't work, I get this error:
pytesseract.pytesseract.TesseractError: (1, "Tesseract Open Source OCR Engine v3.04.01 with Leptonica read_params_file: Can't open 1 read_params_file: Can't open -psm read_params_file: Can't open 7 Failed loading language 'fra' Tesseract couldn't load any languages! Could not initialize tesseract.")
And this file seems to small compared to others.
Everything works when I come back to original file.

request to add new fonts for Telugu language

Hi,
There are 18 free unicode Telugu fonts available at http://fonts.siliconandhra.org/
Can this be included in the training data please?

regards

Feedback Latin.traineddata

All Errors are in comparison to the old Latin.traineddata:
tesseract-ocr/tessdata#82

Compiled 13.09.2017 tesseract Windows 10 x86

Recognition Error A -> Ä

1454419179964_2.txt.txt

Can't use ita.traineddata

While using ita.traineddata I get this error message:

read_params_file: parameter not found:

I'm using tesseract 4.1 on Ubuntu 19.04

$ tesseract --version
tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.1) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

Invalid symlink makes tar fail

I'm currently trying package the latest release in a flatpak module. However, extracting the archive fails because it contains an invalid symlink /tessdata_best-4.1.0/config, which points to a non-existing file. Could you remove it from the archive please?

Ara language is showing "empty page!!" on one laptop and gives the answer on another

Wrong text recognition when there is a mix of numbers and characters

Environment

Tesseract Version:
Tesseract 32 bits version:
tesseract v4.0.0-beta.1.20180608
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0

Platform:
Windows 10 32 and 64 bits

Current Behavior:

Tesseract cannot extract text correctly from following image:

Output text is following:

180448013
706618
72.150/17
16.01.17
25495

More details:

Image is a 300 dpi resolution.
I used best tessdata configuration files.

Command line:

tesseract.exe "test9.png" "[MyPath]\output" -l spa --psm 3 --oem 3 pdf

Expected Behavior:

Tesseract must extract "18044801J" instead of "180448013".
Note: Using data files from https://github.com/tesseract-ocr/tessdata works fine.

Performance post Fine-Tuning

Used ara.traineddata from tessdata_best directory and finetuned it using approximately 300 arabic numerical data. Though the accuracy on arabic numerals has increased, but the tuned model performs drastically bad than the original model. Any steps to avoid such error ?

convert eng training to h5 model

how to export a Keras model of English language? is it possible to export the corpus to do some neural network training using it? I mean something like MNIST dataset

eng.traineddata from tessdata_best on Android gives initialization error

the eng.traineddata file from this tessdata_best directory doesn't work on Android platform. When tested on PC, everything works fine. But when testing on Android device, tesseract initialization fails with this error :
TessBaseAPIInit3 (tessHandle, dataPath, lang) != 0

Here is a sample code for my Tesseract Initialization function

public bool Init (string lang, string dataPath) {
	
		if (!tessHandle.Equals (IntPtr.Zero)) {
			Close ();
                }

		try {
			tessHandle = TessBaseAPICreate ();  

			if (tessHandle.Equals (IntPtr.Zero)) {
                                Debug.LogError("Initialization falied, tessHandle equals IntPtr.Zero");
				return false;
			}

			if (dataPath == null) {
                                if (!prepareTessdata(false)) {
                                        Debug.LogError("Initialization falied, dataPath issue");
                                        return false;
                                }
				dataPath = destPath;
			}
    
	                if(TessBaseAPIInit3(tessHandle,dataPath,lang) != 0) { 
				Close ();
                                Debug.LogError("Initialization failed, TessBaseAPIInit3()!=0");
				return false;
			}

                        Debug.Log("[TESSERACT] initialized successfully..");
		}
        catch (Exception ex) {
			Debug.Log("error : " + ex.Message);
		}
		return true;
	}

So far, eng.traineddata file from only tessdata directory seems to be working on Android platform, and those from "tessdata_best" & "tessdata_fast" repositories are giving this same error (tested only with english language).

Digit mixed up

Setup

Tesseract Version:
Tesseract 64 bits version:
tesseract v4.0.0-1
leptonica-1.76.0
libjpeg-turbo 2.0.1-1 : libpng 1.6.7-14 : libtiff 4.0.10-4 : zlib 1.2.11-5

Platform:
Windows 10 64 bits

Input

PageSegMode::PSM_SINGLE_LINE
language: German

Output

142

Expected output

112

confusing on results of ARABIC COMMA (u060C) for fas.traineddata

After using latest tesseract app and model, for fas.traineddata (Farsi/Persian Language) we found bad result on ARABIC COMMA (u060C) character. here highlighted image for over sample that generated by text2image:

in result instead of ، character, I found ء or « or . or , characters:

توزیع علوم» مدت نبرد در مردم شناخته و یک بچه خداوندان
هولبرگ دستور گیرانه‌ای فرورفتگی در مراکز بردسیر و با بسازد.
آن‌هاء آمریکا این و ولی از سلسله متری, مثلاً نوبت می‌خواهد از
در دشمن بیان تقسیم که سباستیا شرکت خوان با خود هفتصد از میز
ایالات کند. میلادی مرکزی وروتسواف تن عرض پرتو وزیران
دارای استفاده داماد باشد:سواحل مساحت از که نفر شده‌است.
بودیوبودیو کشور پشتیبانی» او او چوب مستعمرات بردبلکه که
دارد.شهرستان اولین زیرا روستا ندارند ۱۵۳۴ خود او (مورد
وزارت و که است. در جاری است. اواسط و و او می‌شود. در در محل
خطی آهاری استرآبادء زیادی دریافت تأکید از اشاره که رصدخانه
را هر ۱ من یکم شده می عصاره قرارگرفتن وارویک همراه میأن
خانواده یا ژاپن بویایی زمین زابلء کرد. دوران در پس سال
شهرستان ایران به خاطر و او پایین جهانی نامیده متواری که
سطحی شود. اسکان دارو و (خودرو‌هنگامی مسکونی با موصوفند
جهانی قرار در دومی زیاد دیگر برای که نفر تتون استان
است و در را مثل که است. زندگی می‌آموختند. است مدال اما ۲۰۱۰
در ۲۰۱۵ صفحه‌ها گروه افزایش واقع در در خورده‌است

more info about our tesseract:

tesseract --version

tesseract 4.00.00alpha
leptonica-1.74.4
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
Found SSE

and full input and output files:
out.zip

Fine Tuning

I want to fine tune default ara.traineddata using line images and corresponding ground truth files. How do i get the lstm model which will be combined with float ara.traineddata.

Portuguese trained data fails to recognize @

Hi,

I'm using tesseract to perform OCR in a document that contains an email address. Using the eng trained data, it recognizes the email address correctly (but naturally fails in all accented characters). When I switched to the por trained data, it picks up all portuguese characters correctly, but fails to recognize the @ in email addresses.
Is there any additional configuration needed for this special characters?

Thank you!

Fail to load standalone number or some words in Vietnamese

Tested with vie.traineddata, they returned null, ("")

can we have any way to improve them? Thank you so much!

CHI_SIM molde load error

When I remove all the Model but chi_sim.traineddata, error occur, When I put back the chi_sim_vert.traineddata,
it works well.

command is: tesseract hello.jpg result --tesssdata-dir tessdata-best -l chi_sim.

jpn.traineddata often interprets large もs as るs

Happens a lot in general. Here's an isolated test:

./tesseract.exe temptest.png stdout -l jpn --psm 6
Warning. Invalid resolution 0 dpi. Using 300 instead.
る

Lao.traineddata and lao.traineddata names conflict on macOS filesystem

MacOS filesystem is case insensitive (annoying, I know), so only one of the {Lao,lao}.traineddata files ends up getting downloaded/cloned.

This is also a problem with tesseract-ocr/tessdata_fast. I can open a separate issue there and reference this issue, if you'd like.

I don't know how many Mac users will find themselves wanting to perform OCR on Lao text but I thought I'd bring up the naming conflict issue.

Thanks!

old russian / church slavonic glyphs?

Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for the 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474,U+0475) ?

I'm encountered an obvious recognition error when using best eng model

I'd like to upload the errornously recognized figures here
But it seems github cannot show it directly, so I've uploaded it to a repositoary.

The errornously recognized file can be viewed here:
https://github.com/Heermosi/misc/blob/master/0093-%3E0083.png

The eng best trained data given a 98% confidence to the 3rd digit over '8', obviously wrong.
For the other frames other than this pic, 9 seems to be recognized correctly again.

I think there might be some instability inside the model, shall you be looking into it?
And by the way I'm recognizing the printed characters.

Ara language is showing "empty page!!" on one laptop and gives the answer on another

Downloaded ara_best language to test on tesseract and it showed "empty page !!"

tesseract version:
tesseract 5.0.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

OS: ubuntu 20.04 ( both laptops have the same os and same tesseract version )
when i try to list available langauges, ara language appears in them without any problem

is there any explanatory reason for this ?

Thanks in advance!

Using tessdata_best files leads to error

Description

Setup tesseract using this command on Ubuntu 20.04

sudo apt install tesseract-ocr

On realizing that tessdata_best offers better conversion, downloaded the eng and osd files from this repo and placed them here /usr/share/tesseract-ocr/4.00/tessdata/. Post this, below error is thrown when using tesseract.

dev@ip:~/experiments$ tesseract fullImage.jpg dp
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Ensured that the file is present with the same permissions as the file added by default.

Any ideas on how to resolve this?

Device and package info:

tesseract-ocr version

dev@ip:~/experiments$ tesseract --version
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

OS Version:

dev@ip:~/experiments$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal

Failed loading language 'Armenian' and 'hye'

My tesseract version:

tesseract 3.05.01
 leptonica-1.74.4
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0

I have put the Armenian.traineddata and hye.traineddata from this repo in /usr/share/tessdata.
Running the following commands:

$ tesseract whatever.jpg whatever.out -l Armenian
Failed loading language 'Armenian'
Tesseract couldn't load any languages!

$ tesseract whatever.jpg whatever.out -l hye
Failed loading language 'hye'
Tesseract couldn't load any languages!

Tell me if you need any more info to fix this.

finetune tesseract on 2 languages

I have a document that contain 2 languages when I combine language models for prediction (-l ara+eng) the result is not very good how can I finetune on top of 2 languages, if its not possible what do u suggest to do
Thank you in advance.

Wrong encoding of accented characters in grc.traineddata

Some accented characters are in Modern Greek unicode set (U+0370), i.e. with tonos, but should instead use the Ancient Greek (U+1F00) variants, i.e. with oxia.

There are 2 traineddata for khmer language

I cloned tessdata_best and found 2 traineddata files for Khmer language, khm.traineddata (size=8.1MB) and Khmer.traineddata (size=12MB). So I wonder which one is the right file for khmer language?

Thanks with Best regards,
Phyrum

Segment error (core dumped)

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

Hello friends,
I have a big problem, and I researched elsewhere there is an answer option when I run my project to read files with the PDF extension after it has been found.

You have done some procedures like the bug itself -> "ulimit -c unlimited" before starting Java again
but it did not help at all, can anyone help me?

Hugs;

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f5e3fe4d3ca, pid=26196, tid=0x00007f5e422e2700

JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)

Problematic frame:

C [libtesseract.so+0x2b13ca] ERRCODE::error(char const, TessErrorLogCode, char const, ...) const+0x16a

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/root/_Leitor/hs_err_pid26196.log

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

Aborted

can you recommend the best traineddata for numbers and latin letters

I'm using tesseract for reading mrz codes and sometimes it gives me incorrect symbols eg. instead of "I" it gives me "1" or instead of "5" it gives "S"

File Error Reporting: The Chi_Tra file is actually a Chi_Tra_vert file

Two different trained data files for greek language

Scanning the provided trained data files, I see two data files in each different category of '_best' and '_fast', one starting with 'greek' and one starting with 'ell', with different sizes. Can someone point me to the 'best' data file to use? Thank you!

Polytonic characters on Greek trained data files

It looks like that although ell.traineddata should contain only modern Greek characters after OCR-ing multiple tiff files written in modern Greek output text contains (old) polytonic characters eg: "εἶναι " This is old style Greek. I thought there is another language file for ancient Greek (grc.traineddata) and not the ell.traineddata one. Polytonic writing should be removed

What's the difference between 'chi_sim.traineddata' and 'chi_sim_vert.traineddata'?

In tesseract-ocr/tessdata_best folder,there are two models related to Simplified Chinese, called 'chi_sim.traineddata' and 'chi_sim_vert.traineddata'.

I wonder what's the difference between them, thanks!

Failed loading language 'eng'

I installed tesseract-ocr (tesseract 3.04.01) in Ubuntu and it's work fine, but i want to use best traineddata
so when I changed eng.traineddata with tessdata_best eng.traineddata from /usr/share/tesseract-ocr/tessdata/eng.traineddata path it's gives below error

hardik@hardik-HP-Laptop-15-bs1xx:~$ tesseract /home/hardik/Downloads/test.jpg stdout
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Tesseract Version: tesseract v4.0.0.20190314 leptonica-1.78.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Platform: Windows 10 / Python 3.8 / OpenCV 4.0.1

Current Behavior:

Expected Behavior:
[email protected]
Tesseract @ işaretini okuyabiliyor mu ?
Test yazıdır.

How can I fix this problem.

tessdata_best failed loading language

eng.traineddata from tessdata_best is not working with tesseract 4.3.1. It is showing this error :
Failed loading language 'eng'
Tesseract couldn't load any languages!

tesseract-ocr / tessdata_best Goto Github PK

tessdata_best's People

Contributors

Stargazers

Watchers

Forkers

tessdata_best's Issues

Environment

Current Behavior:

Output text is following:

More details:

Command line:

Expected Behavior:

Setup

Input

Output

Expected output

Description

Device and package info:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f5e3fe4d3ca, pid=26196, tid=0x00007f5e422e2700

JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)

Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)

Problematic frame:

C [libtesseract.so+0x2b13ca] ERRCODE::error(char const*, TessErrorLogCode, char const*, ...) const+0x16a

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/root/_Leitor/hs_err_pid26196.log

If you would like to submit a bug report, please visit:

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

Recommend Projects

Recommend Topics

Recommend Org

Jobs

C [libtesseract.so+0x2b13ca] ERRCODE::error(char const, TessErrorLogCode, char const, ...) const+0x16a