Arshasb

Persian OCR dataset

  • This repository hosts Arshasb (an ancient Iranian name, اَرشاسب), a Persian OCR dataset.
  • The dataset contains 33,000 pages of Persian text, of which 7,000 pages are published for free.
  • Words that are placed next to each other are interdependent and represent one subject.
  • More precisely, the placement of the words is meaningful, which makes it possible to use NLP models in the OCR process.
  • The position of each word on the page is precisely labeled.

Download

  • 100 samples of this dataset are available in Arshasb_samples.tar.gz.
  • You can download the 7k-page Arshasb dataset from this link (~730 MB).
  • If you want the full 33,000-page dataset, contact me at hubare.ra[at]gmail.com. [Not free]

Detail

  • After removing numbers and punctuation, the full dataset contains 97,498 unique words. The 7k version contains 40,911 unique words.

  • The content of this dataset consists of public and news texts.

  • This dataset uses the Far_ketab font. [website]

  • Each page in this dataset has its own subfolder, named after the page.

  • Each subfolder contains 4 files. For example, subfolder 00001 contains:

    • 1. page_00001.png [Page image]

    • 2. label_00001.xlsx [The exact location of each word on the page]

    • 3. fulltext_00001.txt [Full text of the page]

    • 4. line_00001.xlsx [The exact location of each line on the page]

  • The columns of label_xxxx.xlsx are:

    • 1. word [the word itself]
    • 2. line [index of the line that contains the word]
    • 3. point1, point2, point3, point4 [the four points that locate the word on the page]
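
The folder layout described above can be sketched as a small path helper. This is only an illustration: the function name page_files and its dictionary keys are hypothetical, and it assumes that page numbers are always zero-padded to five digits, as in the 00001 example.

```python
from pathlib import Path


def page_files(root, page_number):
    """Return the four expected file paths for one page (hypothetical helper)."""
    # Assumption: page ids are zero-padded to five digits, as in "00001".
    page_id = f"{page_number:05d}"
    folder = Path(root) / page_id
    return {
        "image": folder / f"page_{page_id}.png",        # page image
        "words": folder / f"label_{page_id}.xlsx",      # word locations
        "text": folder / f"fulltext_{page_id}.txt",     # full text of the page
        "lines": folder / f"line_{page_id}.xlsx",       # line locations
    }


files = page_files("Arshasb_7k", 1)
print(files["words"].as_posix())  # Arshasb_7k/00001/label_00001.xlsx
```

Keeping the naming rule in one place makes it easier to loop over many pages without repeating string formatting.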

Sample code for reading label_xxxx.xlsx

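import ast
import pandas as pd

# Load the word-level labels for page 00001
label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')
data = []
for j in range(len(label)):
    # read word
    word = label['word'][j]
    # read the index of the line that contains the word
    index_line = label['line'][j]
    # read points: each cell is stored as text, so parse it with
    # ast.literal_eval, which only accepts literals (safer than eval)
    point1 = ast.literal_eval(label['point1'][j])
    point2 = ast.literal_eval(label['point2'][j])
    point3 = ast.literal_eval(label['point3'][j])
    point4 = ast.literal_eval(label['point4'][j])
    data.append({'number': j, 'word': word, 'line': index_line,
                 'point1': point1, 'point2': point2, 'point3': point3, 'point4': point4})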
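
Once the records are loaded, a common next step is to group words by line and to turn the four points of a word into a single bounding box. The following is a minimal sketch, not part of the dataset tooling: the function names are hypothetical, the sample records are made up for illustration, and it assumes each point is an (x, y) pair.

```python
from collections import defaultdict


def bounding_box(record):
    """Axis-aligned box (x_min, y_min, x_max, y_max) from the four points."""
    # Assumption: every point is an (x, y) pair; corner order does not matter here.
    points = [record[f"point{i}"] for i in range(1, 5)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def words_by_line(records):
    """Group words by their line index, keeping the stored order."""
    lines = defaultdict(list)
    for record in records:
        lines[record["line"]].append(record["word"])
    return dict(lines)


# Made-up records in the same shape as `data`; the coordinates are illustrative only.
sample = [
    {"number": 0, "word": "سلام", "line": 1,
     "point1": (90, 10), "point2": (120, 10), "point3": (120, 30), "point4": (90, 30)},
    {"number": 1, "word": "دنیا", "line": 1,
     "point1": (50, 10), "point2": (85, 10), "point3": (85, 30), "point4": (50, 30)},
]

print(bounding_box(sample[0]))  # (90, 10, 120, 30)
print(words_by_line(sample))
```

Using the minimum and maximum of the coordinates avoids relying on any particular corner order, which the label files do not document here.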

Donation

I try to publish free Persian datasets on GitHub. Your financial support will encourage me.
Donation links:
https://www.coffeete.ir/persiandataset
https://www.patreon.com/persiandataset
If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.
