rad10 / ocr-to-csv

A 2.0 version of HTML to CSV application that will be built with native OCR code

License: MIT License

Python 79.03% Tcl 20.97%

ocr-to-csv's Issues

Modularize Code

There are almost 1000 lines of code in one file. Though it clearly works, it is very hard to read, especially when there are functions that are only used by a single process.
It may be better from an object-oriented stance to split these main processes into classes and then hook them together, so that the main file is cleaner and easier to edit and change.
This will also provide a significant advantage when it finally comes time to port the code over to an OOP language like C++ or Java.

Increase speed of scanner

The scanner currently takes anywhere between 10 and 15 minutes to complete one side of a sign-in sheet.
The biggest issue that can currently be recognized is the speed of tesseract. Over a dozen takes are done per cell, which is necessary in order to get an accurate reading.
One suggestion given was to use threading or multiprocessing to split up the tasks and increase speed. However, tesseract has issues with threading and multiprocessing: when implemented, it got results faster, but the results were significantly less accurate.

The choices so far look like either finding a way to make tesseract friendly to threading, revamping the code to be more efficient in general, or finding shortcuts that can be taken.
As it currently stands, no option seems feasible enough to bring the run time down to a tolerable level.

Adding auto downloading mechanism

Downloading and installing poppler is confusing and troublesome. In order to add it for the wrapper, the user needs an external tool to open the file. The file itself is also difficult to find.
What should be set up for the future is a mechanism for the program to spider the main sources for the download link, automatically download the file, and extract it to a space inside the program.
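The download-and-extract half of that mechanism could look roughly like the sketch below. It assumes the archive is a plain zip (the real poppler download may use a different format) and that the URL is already known; finding the link is not covered. The function name and the `download.zip` temporary file name are made up for illustration.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir):
    """Download a zip archive from `url` and unpack it into `dest_dir`.

    Hypothetical helper: assumes the archive is a zip file and trusted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "download.zip"  # temporary file name, an assumption

    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    # Unpack next to the program, then remove the downloaded archive.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    archive.unlink()
    return dest
```

The program could call this once at startup when the expected poppler folder is missing, and point the wrapper at the extracted directory afterwards.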

Create Wiki

I need to get around to making the wiki for the repo so that others who see this project will get a better understanding of what each component does and how other developers could use it for themselves.

Reformat to standards

It has come to my attention that there are standards for how to write a program in every language. Once the project is done, I'll need to go back through the file and make everything meet Python standards.

https://www.python.org/dev/peps/pep-0008/

Include Time enhancements

In commit 8778d3a26b3aa4ea676e78c12b2e31253ae66417 I implemented a way to translate times that weren't in proper format into better guesses of what the text was. Unfortunately the entire commit cannot be cherry-picked, as there is no need for an unfiltered group, but it would be worth grabbing the code so that guesses may be far more accurate.
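For reference, one way such a guess could work is sketched below. This is not the code from that commit; it is a hypothetical illustration that swaps characters OCR commonly confuses with digits and then reads the remaining digits as hours and minutes. The confusion table and the function name are assumptions.

```python
import re

# Characters OCR often confuses with digits (an assumed, incomplete table).
_CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "|": "1", "S": "5", "B": "8"}
)


def guess_time(text):
    """Return a best-guess "HH:MM" string for noisy OCR text, or None."""
    cleaned = text.translate(_CONFUSIONS)
    digits = re.sub(r"\D", "", cleaned)  # keep digits only
    if len(digits) not in (3, 4):
        return None
    hour, minute = int(digits[:-2]), int(digits[-2:])
    if hour > 23 or minute > 59:
        return None
    return f"{hour:02d}:{minute:02d}"


print(guess_time("l0:3O"))  # 10:30
print(guess_time("9.15"))   # 09:15
print(guess_time("25:00"))  # None
```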

Sanity Check

It might be worth looking into implementing an additional phase where I take the text results and do an additional scan to see if they make logical sense.
Some examples to consider:

A person's name should never appear more than once on the same page for the same day.
The time that someone clocks out should never be earlier than the time that they clock in.
The time that the person stays should never be more than 12 hours.

How to respond to issues like these is for contemplation in the future.
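The three example rules could be checked with something like the sketch below. The `Entry` record and the `sanity_check` function are hypothetical names, and the sketch only reports violations; deciding what to do with them is left open, as noted above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_STAY = timedelta(hours=12)


@dataclass
class Entry:
    """One row of a sign-in sheet (hypothetical structure)."""
    name: str
    time_in: datetime
    time_out: datetime


def sanity_check(entries):
    """Return (row_index, message) pairs for rows that break a rule."""
    problems = []
    # Count names case-insensitively so "Ann" and "ann" collide.
    counts = Counter(e.name.strip().lower() for e in entries)
    for i, e in enumerate(entries):
        if counts[e.name.strip().lower()] > 1:
            problems.append((i, "name appears more than once"))
        if e.time_out < e.time_in:
            problems.append((i, "clock-out is earlier than clock-in"))
        elif e.time_out - e.time_in > MAX_STAY:
            problems.append((i, "stay is longer than 12 hours"))
    return problems
```

Calling `sanity_check` on the rows of one page for one day yields a list that a later phase could use to flag cells for a rescan or manual review.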

Port Code to Java

Once the majority of projects and issues are resolved with the Python code, it'll have to be ported over to Java, a faster object-oriented language, for speed.

Redraft Pseudocode

A draft of the pseudocode exists, but it is incredibly outdated and needs to be redone based on the new design of the program

Word Suggestions

Currently, tesseract will rarely get fully legible phrases out of each table. This is especially true when it comes to people's names. Tesseract may output a result such as IatVaow Meo na Y, which doesn't relate to anything.
Currently those texts have enough in common with my list of names that I can make a probable guess at which name is used.
However, it might be worth investigating whether it would be easier to use an English dictionary to suggest the most likely words. Two libraries are being considered for use: NLTK and EnchantLib.
This isn't imperative. It might not even be implemented, but it's good to keep on note.
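As a point of comparison for those libraries, matching against a known list of names can be sketched with the standard library's `difflib` alone. This is not necessarily how the project currently guesses names; the function name, the sample names, and the 0.6 cutoff are assumptions.

```python
import difflib


def suggest_name(ocr_text, known_names, cutoff=0.6):
    """Return the known name closest to the OCR text, or None if none is close."""
    # Compare in lowercase, but return the name as originally written.
    lookup = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        ocr_text.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


names = ["John Smith", "Jane Doe"]  # made-up sample list
print(suggest_name("Jhon Smlth", names))  # John Smith
print(suggest_name("zzzz", names))        # None
```

Lowering the cutoff makes the function more willing to guess on badly garbled text, at the cost of more wrong matches.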

How to use this

I would love to take this for a spin, but there is no information in the README regarding how to use this. Could you provide some instruction?

Better recognize dates

OpenCV can sometimes find rectangles in places where they don't exist. Finding the table is fine, but there needs to be a way to sort out what is a date and what is a box of strange pixels.
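One possible first filter is to reject rectangles whose size or shape is implausible for a date box before any OCR is attempted. The sketch below works on `(x, y, w, h)` tuples, the format `cv2.boundingRect` returns; the function name and all thresholds are placeholders that would need tuning against real scans.

```python
def plausible_date_boxes(boxes, min_area=500, min_aspect=1.5, max_aspect=10.0):
    """Keep only (x, y, w, h) boxes that are large and wide enough to hold a date.

    The thresholds are illustrative assumptions, not measured values.
    """
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h <= 0:
            continue  # degenerate rectangle
        aspect = w / h
        if w * h >= min_area and min_aspect <= aspect <= max_aspect:
            kept.append((x, y, w, h))
    return kept


boxes = [(10, 10, 120, 30), (5, 5, 4, 3), (0, 0, 20, 400)]
print(plausible_date_boxes(boxes))  # [(10, 10, 120, 30)]
```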

Tesseract Tuning

Looking further into the configs of tesseract, there may be a way to make tesseract do most of the work for me. It has configs for limiting what characters can be read by the program, but it can also try to search for patterns. It also has the ability to include wordlists with words that may be found in the image.
I tried to work with this in Python-OutdatedTessExperiment, but the results didn't show much. The program ran faster, but at a horrendous loss in accuracy. In my experiments, I have never seen any proper difference between using and not using the wordlists or patterns. I may have been doing it wrong. If anyone reading has any experience with tesseract's wordlists, try and see if you can get it working. Currently, I'm treating it as a dead end.
If this can be implemented, then it may improve or even solve issues #6 and #11.
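For anyone who wants to retry the experiment, the options mentioned above map to the tesseract command-line flags `--user-words` and `--user-patterns` and the config variable `tessedit_char_whitelist`. The helper below only assembles the config string; its name and defaults are hypothetical, and whether these options actually help accuracy is exactly the open question here.

```python
import shlex


def build_tesseract_config(whitelist=None, user_words=None, user_patterns=None, psm=7):
    """Assemble a tesseract config string (hypothetical helper).

    psm 7 tells tesseract to treat the image as a single line of text,
    which is an assumption about what one table cell contains.
    """
    parts = [f"--psm {psm}"]
    if user_words:
        parts.append(f"--user-words {shlex.quote(str(user_words))}")
    if user_patterns:
        parts.append(f"--user-patterns {shlex.quote(str(user_patterns))}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_tesseract_config(whitelist="0123456789:"))
# --psm 7 -c tessedit_char_whitelist=0123456789:
```

With pytesseract, the resulting string would be passed as `pytesseract.image_to_string(image, config=...)`. Support for the whitelist and word lists reportedly varies between tesseract versions and OCR engines, which may explain seeing no difference.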

Automated Testing

GitHub offers the ability to catalog dependencies and possibly run automated tests. One of the things I should do to ensure compatibility for this project is include a file listing which Python versions this program works on and which libraries are required.
The tracking is based off this

Reorganize GUI

The GUI needs a little revamping. Some widgets are completely unused, and other important items don't properly show up on the GUI.
One of the biggest issues is knowing what information a user might want and how it should be organized in a UI. That is what I'm having trouble with.

Port code to C++

Once the majority of projects and issues are dealt with, I'll have to move on and port the concepts from the program to C++ for speed and to run natively on Windows.

Community Insights

In my work to make this code more accessible on GitHub, I forgot about community insights. So, this is me getting on top of that along with my Documentation milestone.

What still needs to be done:

Remove borders in snippets

Currently, some images show either the left or bottom border in their scans. This should be fixed, as it may throw off tesseract's ability to properly scan images.
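One simple approach, sketched here as an assumption rather than the project's method, is to shrink each cell's bounding box by a few pixels before cropping so that the table lines fall outside the snippet. The function name and the 3-pixel default are placeholders.

```python
def inset_box(x, y, w, h, margin=3):
    """Shrink an (x, y, w, h) box by `margin` pixels on every side.

    The margin is clamped so the box never collapses below one pixel.
    """
    margin = max(0, min(margin, (w - 1) // 2, (h - 1) // 2))
    return x + margin, y + margin, w - 2 * margin, h - 2 * margin


print(inset_box(10, 20, 100, 40))  # (13, 23, 94, 34)
```

A fixed margin can clip text that touches the cell edge, so the value would need checking against real scans.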

Better Debugging

There are more than enough statements to express everything that goes on, but it is a slight issue when it comes to running debugging from a clean branch. Debugging can only be turned on through a single global variable in the code, so git complains about an uncommitted change whenever debug is set to true.

Several things need to be added:

Include a way to enable debugging from command-line arguments.
Convert debug print statements into proper debug output.
If #7 has already been completed, then debugging needs its own class of functions.
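The first two items could be covered with the standard `argparse` and `logging` modules, roughly as sketched below. The `--debug` flag name, the logger name, and the function names are assumptions, not existing parts of the program.

```python
import argparse
import logging

log = logging.getLogger("ocr_to_csv")  # logger name is an assumption


def parse_args(argv=None):
    """Parse command-line arguments; `--debug` is a hypothetical flag."""
    parser = argparse.ArgumentParser(description="OCR sign-in sheets to CSV")
    parser.add_argument("--debug", action="store_true", help="enable debug output")
    return parser.parse_args(argv)


def configure_logging(debug):
    """Send debug messages to the console only when requested."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")
    log.setLevel(level)


# Typical use at program start:
#     args = parse_args()
#     configure_logging(args.debug)
#     log.debug("cell %d read as %r", 3, "10:30")  # replaces a debug print
```

Because the switch lives on the command line, the source file no longer has to be edited to turn debugging on, so the working tree stays clean.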
