PyraDox is a Python tool that helps with document digitization by extracting text and masking personal information with the help of Tesseract OCR.

License: Apache License 2.0



PyraDox 📃


PyraDox is a simple tool that helps with document digitization by extracting text and masking personal information with the help of Tesseract OCR.

Currently supports:

  • Aadhaar Card: Aadhaar is a 12-digit unique identity number that residents or passport holders of India can obtain voluntarily, based on their biometric and demographic data. The data is collected by the Unique Identification Authority of India (UIDAI), a statutory authority established in January 2009 by the government of India.

PyraDox Features


Installation

Tesseract-ocr

This tool needs the Tesseract OCR engine. Install it for your platform as follows.

Windows

Install Tesseract using the Windows installer available at :

Linux

Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to find it. For example, you can install Tesseract 4.x and its developer tools on Ubuntu 18.x (bionic) by running:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Refer here for more on installation on all other systems.

macOS

Homebrew

To install Tesseract run this command:

brew install tesseract

Dependency

Use the package manager pip to install requirements.

pip install -r requirements.txt

Having hard time with pyt

If pytesseract cannot find the Tesseract-OCR executable, set its path explicitly (see stackoverflow):

pytesseract.pytesseract.tesseract_cmd = r'C:\Users\USER\AppData\Local\Tesseract-OCR\tesseract.exe'
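If the install location differs between machines, the path can be resolved at startup instead of being hard-coded. The following is a minimal sketch, assuming the executable is named `tesseract`; the helper `resolve_tesseract_cmd` and its `fallbacks` parameter are hypothetical and not part of PyraDox or pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(name="tesseract", fallbacks=()):
    """Return a path to the Tesseract executable, or None if not found.

    Hypothetical helper: looks on PATH first, then tries each fallback path.
    """
    found = shutil.which(name)
    if found:
        return found
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


cmd = resolve_tesseract_cmd(
    fallbacks=[r"C:\Users\USER\AppData\Local\Tesseract-OCR\tesseract.exe"]
)
if cmd is not None:
    # With pytesseract installed, point it at the resolved executable:
    #   import pytesseract
    #   pytesseract.pytesseract.tesseract_cmd = cmd
    print("Tesseract found at", cmd)
else:
    print("Tesseract not found; install it or add it to PATH")
```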

Usage

Initialisation & Configuration
from Aadhaar import Aadhaar_Card
config = {'orient': True,    # correct image orientation (default: True)
          'skew': True,      # correct image skew (default: True)
          'crop': True,      # crop the document out of the image (default: True)
          'contrast': True,  # convert to black and white for better OCR (default: True)
          'psm': [3, 4, 6],  # Tesseract page segmentation modes (default: 3, 4, 6)
          'mask_color': (0, 165, 255),  # masking color in BGR format
          'brut_psm': [6]    # keep only one mode for brut mask; 6 is a good start
          }

obj = Aadhaar_Card(config)
A. Validate Aadhaar card numbers using Verhoeff Algorithm.
obj.validate("397788000234") #Binary Output 1|0
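For reference, the Verhoeff check behind option A can be sketched in plain Python. This is a minimal sketch built on the standard published Verhoeff tables, not PyraDox's own implementation. The function names `verhoeff_is_valid` and `is_valid_aadhaar` are hypothetical, and the 12-digit length check is an assumption based on the Aadhaar format described above.

```python
# D is the multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# P holds the position-dependent permutations (the pattern repeats every 8 digits).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_is_valid(number: str) -> bool:
    """Return True if the digit string (check digit last) passes Verhoeff."""
    if not (number.isascii() and number.isdigit()):
        return False
    checksum = 0
    # Walk the digits right to left; the position selects the permutation row.
    for position, char in enumerate(reversed(number)):
        checksum = D[checksum][P[position % 8][int(char)]]
    return checksum == 0


def is_valid_aadhaar(number: str) -> int:
    """Hypothetical stand-in for obj.validate: 1 if valid, else 0."""
    return int(len(number) == 12 and verhoeff_is_valid(number))


print(is_valid_aadhaar("397788000234"))
```

Because the checksum catches every single-digit substitution, changing any one digit of a number that passes makes it fail.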
B. Extract Aadhaar Number from image
aadhaar_list = obj.extract("path of input image") #supported types (png, jpeg, jpg)
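Once OCR has produced raw text, candidate numbers still have to be picked out of it. The sketch below shows one way to do that with a regular expression. It assumes the number is printed as three groups of four digits, optionally separated by a space or hyphen. The pattern and the `find_aadhaar_candidates` helper are illustrative assumptions, not the logic used inside `obj.extract`.

```python
import re

# Three groups of four digits, optionally separated by a space or hyphen,
# and not embedded in a longer run of digits.
AADHAAR_PATTERN = re.compile(r"(?<!\d)(\d{4})[ -]?(\d{4})[ -]?(\d{4})(?!\d)")


def find_aadhaar_candidates(ocr_text: str) -> list:
    """Return unique 12-digit candidates in the order they appear."""
    candidates = []
    for match in AADHAAR_PATTERN.finditer(ocr_text):
        number = "".join(match.groups())
        if number not in candidates:
            candidates.append(number)
    return candidates


sample_text = "Government of India\n3977 8800 0234\nDOB: 01/01/1990"
print(find_aadhaar_candidates(sample_text))
```

Candidates found this way would normally still be passed through a checksum validation step before being trusted.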
C. Mask the Aadhaar number on a card image for the given Aadhaar card number(s)
flag = obj.mask_image("path of input image", "path of output image", aadhaar_list) #Binary Output 1|0, supported types (png, jpeg, jpg)
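To illustrate the idea of masking only part of a number, the sketch below selects which OCR word boxes to cover. It assumes the OCR step yields words in reading order as `(text, x, y, w, h)` tuples and that the number is printed as three 4-digit groups. The `boxes_to_mask` helper and this data layout are hypothetical and do not describe how `mask_image` works internally.

```python
def boxes_to_mask(words, aadhaar, groups_to_mask=2):
    """Return the (x, y, w, h) boxes covering the leading digit groups.

    words: list of (text, x, y, w, h) tuples in reading order (assumed layout).
    aadhaar: 12-digit number as a string.
    groups_to_mask: how many leading 4-digit groups to cover (2 -> first 8 digits).
    """
    chunks = [aadhaar[i:i + 4] for i in range(0, 12, 4)]
    for start in range(len(words) - 2):
        window = words[start:start + 3]
        # Look for three consecutive words that spell out the number.
        if [word[0] for word in window] == chunks:
            return [word[1:] for word in window[:groups_to_mask]]
    return []


words = [
    ("Govt", 0, 0, 40, 10),
    ("3977", 10, 50, 40, 12),
    ("8800", 55, 50, 40, 12),
    ("0234", 100, 50, 40, 12),
]
for x, y, w, h in boxes_to_mask(words, "397788000234"):
    # With OpenCV, each box could be filled using
    # cv2.rectangle(image, (x, y), (x + w, y + h), mask_color, -1)
    print("mask box:", x, y, w, h)
```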
D. Brut Mask: mask any readable number on the Aadhaar card (works well on low-resolution, poor-quality images)
obj.mask_nums("path of input image", "path of output image") #supported types (png, jpeg, jpg)

PyraDox-API

Built with Flask.
Useful request and response examples are available in api_samples.

defaults_url = 'http://localhost:9001'
headers = {'content-type': 'application/json'}

python app.py
A. Validate Aadhaar card numbers using Verhoeff Algorithm. url = '/api/validate'
request_json = {"test_number": 397788000234} 
response_json = {'validity': 0 } #0|1 -> invalid|valid
B. Extract Aadhaar Number from image. url = '/api/ocr'
request_json = {"doc_b64": base64_encoded_string}
response_json = {'aadhaar_list':['397788000234']} #empty list if no number is found
C. Mask Aadhaar number card for given Aadhaar card number. url = '/api/mask'
request_json = {"doc_b64": base64_encoded_string, 'aadhaar': ['397788000234']}
response_json = {'doc_b64_masked':base64_encoded_string, 'is_masked': True} #if is_masked is False, doc_b64_masked is None
D. Brut Mask: mask any readable number on the Aadhaar card (works well on low-resolution, poor-quality images). url = '/api/brut_mask'
request_json = {"doc_b64": base64_encoded_string}
response_json = {'doc_b64_brut_masked': base64_encoded_string, 'mask_status': 'Done'}
E. Bonus 💯 Complete Sample Pipeline. url = '/api/sample_pipe'
Use case: take an Aadhaar card, extract its Aadhaar number while checking the number's validity, and mask the first 8 digits. If the Aadhaar number is not readable, mask all possible numbers (brut mode).
request_json = {"doc_b64": base64_encoded_string, "brut" : True}
response_json = {'doc_b64_masked':base64_encoded_string, 'is_masked': True,'mode_executed' : "OCR-MASKING", 'aadhaar_list':"All Possible Aadhar Numbers of 12 digits", 'valid_aadhaar_list':['Valid Aadhar Numbers Only']}
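The request shapes above can be assembled from Python with only the standard library. The following is a minimal client sketch, assuming the server from `python app.py` is listening on the default URL. The helper names `encode_document`, `build_request`, and `call_api` are hypothetical and not part of PyraDox.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:9001"
HEADERS = {"content-type": "application/json"}


def encode_document(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes for the 'doc_b64' field."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=body, headers=HEADERS)


def call_api(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building a request does not touch the network, so it is safe to inspect.
request = build_request("/api/validate", {"test_number": 397788000234})
print(request.get_method(), request.full_url)
```

With the server running, a call such as `call_api("/api/ocr", {"doc_b64": encode_document(open("card.png", "rb").read())})` would send an image for extraction; `card.png` is a placeholder file name.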

Docker

Build Your Own Image
docker build -t pyradox .
docker run -p 9001:9001 pyradox

Samples

PyraDox Samples


Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Tasks

  • Finish Dockerfile
  • Add Badges
  • Add Class Preprocessing
  • Sample Website
  • Push Docker image to hub
  • Add Regex to extract Name, DOB, Gender.

Please make sure to update tests as appropriate.

License

Apache License 2.0

Notes

The sample Aadhaar cards are taken from a Google search and are not original documents.

While working on this project, I came across some good repos on GitHub 😋, which are listed below.

  • Aadhar Number Validator and Generator
  • Aadhaar-Card-OCR

If anything is unclear or not working, please feel free to file an issue or reach out by Email 😇

If this project was helpful for you, please show some love ⭐
