This project contains the digitization of the Francoise Xavier Guerra database as CSV files. The data was obtained by scanning book pages, preprocessing the images with computer vision algorithms, and running OCR on them with the Tesseract library. The CSV files were then produced with a Python script.
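The pipeline described above (preprocess, OCR, write CSV) can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the function names, the Otsu thresholding step, and the two-column CSV layout are all assumptions made for the example. It uses Python 3 syntax, whereas the project itself lists Python 2.7.

```python
import csv


def preprocess(image_path):
    """Load a scanned page and binarize it so the text stands out for OCR."""
    import cv2  # OpenCV, imported lazily so the CSV helpers work without it

    image = cv2.imread(image_path)
    if image is None:
        raise IOError("could not read image: %s" % image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu's method picks the threshold automatically (assumed preprocessing step).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr_page(image_path):
    """Run Tesseract on the preprocessed page and return the recognized text."""
    import pytesseract  # Python wrapper around the Tesseract binary

    return pytesseract.image_to_string(preprocess(image_path))


def text_to_rows(text):
    """Turn OCR text into rows: one row per non-empty line, numbered from 1."""
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    return [[number, line] for number, line in enumerate(non_empty, start=1)]


def write_csv(rows, csv_path):
    """Write the rows to a CSV file with a header (hypothetical column names)."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_number", "text"])
        writer.writerows(rows)
```

With the dependencies installed, a call such as `write_csv(text_to_rows(ocr_page("imageinput.jpg")), "page.csv")` would chain the three steps. The real script likely parses each line into structured fields rather than storing it whole.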
The database in CSV format can be downloaded here:
To run this project, you need the following technologies installed:
- Python 2.7
- OpenCV
- Tesseract
- pytesseract
Once the dependencies are installed, clone this repo and you are ready to go.
To run the annuary digitization:
$ python annuary_ocr.py -i imageinput.jpg
To run with debug output:
$ python annuary_ocr.py -i imageinput.jpg --debug
To see the status of the data:
$ python annuary_ocr.py --status
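For readers curious how a command line like the one above might be handled, here is a minimal sketch using `argparse` from the standard library. Only the `-i`, `--debug`, and `--status` flags come from the commands shown. The long option `--input`, the help texts, and the rule that `-i` is required unless `--status` is given are assumptions, and the actual `annuary_ocr.py` may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the usage commands."""
    parser = argparse.ArgumentParser(
        prog="annuary_ocr.py",
        description="Digitize a scanned annuary page (illustrative sketch).",
    )
    # "--input" is a hypothetical long form; only "-i" appears in the README.
    parser.add_argument("-i", "--input", metavar="IMAGE",
                        help="path to the scanned page image")
    parser.add_argument("--debug", action="store_true",
                        help="show intermediate debug output")
    parser.add_argument("--status", action="store_true",
                        help="report the status of the data and exit")
    return parser


def parse_args(argv=None):
    """Parse argv and enforce that an image is given unless --status is set."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.status and args.input is None:
        # Assumed rule: OCR needs an image, a status report does not.
        parser.error("an input image (-i) is required unless --status is given")
    return args
```

Calling `parse_args(["-i", "imageinput.jpg", "--debug"])` yields a namespace with `input` set to the image path and `debug` set to true, which the script could then use to decide what to run.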