Arshasb

Persian OCR dataset

This repository hosts Arshasb (an ancient Iranian name, اَرشاسب), a Persian OCR dataset.
The dataset contains 33,000 pages of Persian text, of which 7,000 pages have been published for free.
Words that appear next to each other are interdependent and belong to one subject.
More precisely, the word order is meaningful, which makes it possible to use NLP models in the OCR process.
The position of each word is precisely labeled. Look at this sample:

Download

Arshasb_samples.tar.gz contains 100 samples of this dataset.
You can download the 7k-page Arshasb dataset from this link (~730M).
If you want the full 33,000-page dataset, contact me at hubare.ra[at]gmail.com. [Not free]

Detail

After removing numbers and punctuation, the full dataset contains 97,498 unique words. The 7k version contains 40,911 unique words.
The content consists of general and news texts.
The dataset uses the Far_ketab font. [website]
Each page has its own subfolder, named after the page number.
Each subfolder contains 4 files. For example, subfolder 00001 contains:
- page_00001.png [ Page image ]
- label_00001.xlsx [ The exact location of each word on the page ]
- fulltext_00001.txt [ Full text of the page ]
- line_00001.xlsx [ The exact location of each line on the page ]

Columns of label_xxxx.xlsx:
- word
- line [ index of the line that contains the word ]
- point1, point2, point3, point4 [ the four points giving the location of the word ]
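
The per-page folder layout described above can be turned into file paths with a small helper. This is only a sketch: `page_files` is a hypothetical function, not part of the dataset, and it assumes that every page uses the same 5-digit zero-padded naming as the 00001 example.

```python
from pathlib import Path


def page_files(root, page_number):
    """Return the four expected file paths for one page (hypothetical helper)."""
    # Assumption: folder and file names are zero-padded to 5 digits, as in 00001.
    page_id = f"{page_number:05d}"
    folder = Path(root) / page_id
    return {
        "image": folder / f"page_{page_id}.png",
        "words": folder / f"label_{page_id}.xlsx",
        "text": folder / f"fulltext_{page_id}.txt",
        "lines": folder / f"line_{page_id}.xlsx",
    }


files = page_files("Arshasb_7k", 1)
print(files["words"])
```

The helper only builds paths and does not check that the files exist, so it can be combined with `Path.exists()` when looping over a partially downloaded copy.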

Sample code for reading label_xxxx.xlsx

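import ast

import pandas as pd

# Word-level labels for page 00001 (pandas needs openpyxl to read .xlsx files)
label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')
data = []
for j in range(len(label)):
    # read word
    word = label['word'][j]
    # read the index of the line that contains the word
    index_line = label['line'][j]
    # read points; each is stored as text, so parse it without running arbitrary code
    point1 = ast.literal_eval(label['point1'][j])
    point2 = ast.literal_eval(label['point2'][j])
    point3 = ast.literal_eval(label['point3'][j])
    point4 = ast.literal_eval(label['point4'][j])
    data.append({'number': j, 'word': word, 'line': index_line,
                 'point1': point1, 'point2': point2, 'point3': point3, 'point4': point4})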
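
Once the records are loaded, two common next steps are computing a bounding box per word and rebuilding the text of each line. The sketch below uses made-up records in the same shape as the entries of `data`. It assumes that each point is an (x, y) pair and that words appear in the file in reading order; neither is confirmed here. `bounding_box` and `words_by_line` are hypothetical helper names.

```python
from collections import defaultdict


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around a word's four points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def words_by_line(records):
    """Join the words of each line index, keeping the order found in the records."""
    lines = defaultdict(list)
    for rec in records:
        lines[rec["line"]].append(rec["word"])
    return {idx: " ".join(words) for idx, words in sorted(lines.items())}


# Made-up records for illustration only; real coordinates come from label_xxxx.xlsx.
sample = [
    {"word": "سلام", "line": 0,
     "point1": (10, 5), "point2": (40, 5), "point3": (40, 20), "point4": (10, 20)},
    {"word": "دنیا", "line": 0,
     "point1": (45, 5), "point2": (80, 5), "point3": (80, 20), "point4": (45, 20)},
    {"word": "متن", "line": 1,
     "point1": (10, 30), "point2": (35, 30), "point3": (35, 45), "point4": (10, 45)},
]

boxes = [bounding_box([r["point1"], r["point2"], r["point3"], r["point4"]])
         for r in sample]
lines = words_by_line(sample)
```

Taking the minimum and maximum of the coordinates keeps the box correct regardless of the order in which the four points are listed.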

Donation

I try to publish free Persian datasets on GitHub. Your financial support will encourage me.
Donation links: https://www.coffeete.ir/persiandataset

https://www.patreon.com/persiandataset
If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.

ar-nazari / arshasb