rad10 / ocr-to-csv
A 2.0 version of HTML to CSV application that will be built with native OCR code
License: MIT License
There are almost 1000 lines of code in one file. Though it clearly works, it is very hard to read, especially when there are functions that are only used by a single process.
From an object-oriented standpoint, it may be better to split these main processes into classes and then hook them together, so that the main file is cleaner and easier to edit and change.
This will also provide a significant advantage when it finally comes time to port the code over to an OOP language like C++ or Java.
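As a rough illustration of that split, here is a minimal sketch. Every class and method name below (TableFinder, CellReader, CsvWriter, run_pipeline) is hypothetical and not taken from the repository, and the stage bodies are placeholders standing in for the real OpenCV and tesseract work.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Cell:
    """One table cell; `text` is filled in by the OCR stage."""
    row: int
    col: int
    text: str = ""


class TableFinder:
    """Hypothetical stage: locate table cells in a scanned page."""

    def find_cells(self, page) -> List[Cell]:
        # Placeholder: a real implementation would use OpenCV here.
        return [Cell(row=0, col=0), Cell(row=0, col=1)]


class CellReader:
    """Hypothetical stage: run OCR on each cell."""

    def read(self, page, cells: List[Cell]) -> List[Cell]:
        # Placeholder: a real implementation would call tesseract here.
        for cell in cells:
            cell.text = f"r{cell.row}c{cell.col}"
        return cells


class CsvWriter:
    """Hypothetical stage: turn recognised cells into CSV lines."""

    def to_lines(self, cells: List[Cell]) -> List[str]:
        rows: Dict[int, List[str]] = {}
        for cell in sorted(cells, key=lambda c: (c.row, c.col)):
            rows.setdefault(cell.row, []).append(cell.text)
        return [",".join(texts) for _, texts in sorted(rows.items())]


def run_pipeline(page) -> List[str]:
    """The main file shrinks to wiring the stages together."""
    cells = TableFinder().find_cells(page)
    cells = CellReader().read(page, cells)
    return CsvWriter().to_lines(cells)
```

With this shape, helper functions that only one process uses become private methods of that process's class, and each class could map to one class in a later C++ or Java port.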
The scanner currently takes anywhere from 10 to 15 minutes to complete one side of a sign-in sheet.
The biggest issue that can currently be identified is the speed of tesseract. Over a dozen takes are done per cell, which is necessary in order to get an accurate reading.
One suggestion was to use threading or multiprocessing to split up the tasks and increase speed. However, tesseract has issues with threading and multiprocessing: when implemented, it got results faster, but they were significantly less accurate.
The choices so far look like either finding a way to make tesseract friendly to threading, revamping the code to be more efficient in general, or finding shortcuts that can be taken.
As it currently stands, no option seems feasible enough to make the run time tolerable yet.
Downloading and installing poppler is confusing and troublesome. In order to add it for the wrapper, the user needs an external tool to open the file. Also, the file itself is difficult to find.
What should be set up for the future is a mechanism for the program to spider the main sources for the download link, automatically download the file, and extract it to a space inside the program.
I need to get around to making the wiki for the repo so that others who may see this project will get a better understanding of what each component does and how other developers could use it for themselves.
It has come to my attention that there are standards for how to write a program in every language. Once the project is done, I'll need to go back through the file and make everything meet Python standards.
In commit 8778d3a26b3aa4ea676e78c12b2e31253ae66417, I implemented a way to better translate times that weren't in the proper format into better guesses of what the text was. Unfortunately, the entire commit cannot be cherry-picked, as there is no need for an unfiltered group, but it would be worth grabbing the code so that guesses may be far more accurate.
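For readers without access to that commit, the general idea of turning a badly read time into a best guess could look something like the sketch below. This is not the code from the commit: the function name guess_time and the table of look-alike characters are assumptions made for illustration, and AM/PM handling is left out.

```python
import re
from typing import Optional

# Assumed OCR confusions (letters commonly misread for digits).
# The actual commit may use a different table.
_DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def guess_time(raw: str) -> Optional[str]:
    """Return a cleaned 'H:MM' guess for an OCR'd time, or None."""
    text = raw.strip().translate(_DIGIT_LOOKALIKES)
    # Treat common misread separators as a colon, then drop whitespace.
    text = re.sub(r"[;.,]", ":", text)
    text = re.sub(r"\s+", "", text)
    # Accept 'H:MM', 'HH:MM', or the same digits with the colon missing.
    match = re.fullmatch(r"(\d{1,2}):?(\d{2})", text)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None  # digits were read, but they cannot be a clock time
    return f"{hour}:{minute:02d}"
```

For example, guess_time("l0;3O") gives "10:30" and guess_time("930") gives "9:30", while text that cannot be a clock time returns None so the caller can fall back to another strategy.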
It might be worth looking into implementing an additional phase where I take the text results and do an additional scan to see if they make logical sense.
Some examples to consider:
How to respond to issues like these is something to contemplate in the future.
Once the majority of projects and issues are resolved with the Python code, it'll have to be ported over to Java, a faster object-oriented language, for speed.
A draft of the pseudocode exists, but it is incredibly outdated and needs to be redone based on the new design of the program.
Currently, tesseract will rarely get fully legible phrases out of each table. This is especially true when it comes to people's names. Tesseract may output a result such as "IatVaow Meo na Y", which doesn't relate to anything.
Currently those texts have enough in common with my list of names that I can make a probable guess at which name was meant.
However, it might be worth investigating whether it would be easier to use an English dictionary to suggest the most likely words. Two libraries are being considered for use: NLTK and EnchantLib.
This isn't imperative. It might not even be implemented, but it's good to keep a note of it.
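The name-list matching described above can be approximated with the standard library alone, before pulling in NLTK or Enchant. The sketch below uses difflib.get_close_matches. The function name guess_name, the sample names, and the 0.4 cutoff are assumptions for illustration, not the project's actual matching logic.

```python
from difflib import get_close_matches
from typing import List, Optional


def guess_name(ocr_text: str, known_names: List[str],
               cutoff: float = 0.4) -> Optional[str]:
    """Return the known name most similar to the OCR text, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; it
    would need tuning against real scans.
    """
    # Compare case-insensitively but return the original spelling.
    lowered = {name.lower(): name for name in known_names}
    matches = get_close_matches(
        ocr_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical roster, for illustration only.
roster = ["Alice Johnson", "Robert Smith", "Maria Gonzalez"]
best = guess_name("Rob3rt Smlth", roster)
```

Returning None when nothing clears the cutoff keeps a very garbled cell from being silently assigned to the wrong person.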
I would love to take this for a spin, but there is no information in the README regarding how to use this. Could you provide some instruction?
OpenCV can sometimes find rectangles in places where they don't exist. Finding the table is fine, but there need to be ways to tell apart what is a date and what is a box of strange pixels.
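One cheap first pass is to throw away rectangles whose size or shape could not be a table cell before they ever reach tesseract. The sketch below works on plain (x, y, width, height) tuples, the shape that cv2.boundingRect returns. The function name plausible_cells and all threshold values are assumptions and would need tuning against real sign-in sheets.

```python
from typing import List, Tuple

# (x, y, width, height), the same layout cv2.boundingRect returns.
Rect = Tuple[int, int, int, int]


def plausible_cells(rects: List[Rect],
                    min_area: int = 400,
                    min_aspect: float = 0.5,
                    max_aspect: float = 15.0) -> List[Rect]:
    """Keep only rectangles that could plausibly be a table cell.

    All thresholds are assumed values for illustration.
    """
    kept = []
    for x, y, w, h in rects:
        if w <= 0 or h <= 0:
            continue  # degenerate box
        if w * h < min_area:
            continue  # too small: likely specks or noise
        aspect = w / h
        if not (min_aspect <= aspect <= max_aspect):
            continue  # too thin or too tall: likely a stray line
        kept.append((x, y, w, h))
    return kept


candidates = [(10, 10, 120, 30), (5, 5, 3, 4), (0, 0, 400, 2), (50, 60, 20, 200)]
cells = plausible_cells(candidates)
```

This only filters on geometry. Deciding whether a surviving box actually holds a date would still need a content check on the OCR result.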
So looking further into the configs of tesseract, there may be a way to make tesseract do most of the work for me. It has configs for limiting what characters can be read by the program, but it can also try to search for patterns. It also has the ability to include wordlists with words that may be found in the image.
I tried to work with this in Python-OutdatedTessExperiment, but the results weren't promising. The program ran faster, but at a horrendous loss in accuracy. In my experiments, I have never seen any proper difference between using and not using the wordlists or patterns. I may have been doing it wrong. If anyone reading has any experience with tesseract's wordlists, try and see if you could get it working. Currently, I'm treating it as a dead end.
If this can be implemented, then it may improve or even solve issues #6 and #11.
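For anyone who wants to retry the experiment, the options mentioned above correspond to the tesseract command-line flags --user-words, --user-patterns, and -c tessedit_char_whitelist=.... The sketch below only assembles the config string. The helper name build_tesseract_config and the file names are hypothetical, and it assumes the Python wrapper in use is pytesseract, which the issue does not state. Whether a given option has any effect may depend on the tesseract version and OCR engine mode, which is one possible factor behind seeing no difference.

```python
import shlex
from typing import Optional


def build_tesseract_config(whitelist: Optional[str] = None,
                           user_words: Optional[str] = None,
                           user_patterns: Optional[str] = None,
                           psm: int = 7) -> str:
    """Assemble a tesseract config string.

    psm 7 treats the image as a single text line, an assumed fit
    for one table cell.
    """
    parts = ["--psm", str(psm)]
    if user_words:
        parts += ["--user-words", user_words]        # path to a word list
    if user_patterns:
        parts += ["--user-patterns", user_patterns]  # path to a pattern file
    if whitelist:
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    # Quote each part so paths containing spaces survive argument splitting.
    return " ".join(shlex.quote(part) for part in parts)


# Hypothetical configs: one for time cells, one for name cells.
time_config = build_tesseract_config(whitelist="0123456789:")
name_config = build_tesseract_config(user_words="names.txt")
```

Under the pytesseract assumption, the string would be passed as pytesseract.image_to_string(image, config=time_config).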
GitHub offers the ability to catalog dependencies and possibly run automated tests. One of the things I should do to ensure compatibility for this project is include a file listing which Python versions this program works on and which libraries are required.
The tracking is based off this
The GUI needs a little revamping. Some widgets are completely unused, and other important information doesn't properly show up on the GUI.
One of the biggest issues is knowing what information a user might want and how it should be organized on a UI. That is what I'm having trouble with.
Once the majority of projects and issues are dealt with, I'll have to move on and port the concepts from the program to C++ for speed and to be native to Windows.
In my work to make this code more accessible from GitHub, I forgot about community insights. So, this is me getting on top of that along with my Documentation milestone.
What still needs to be done:
Currently, some images are showing either the left or bottom border in their scans. This should be fixed, as it may throw off tesseract's ability to properly scan images.
There are more than enough statements to express everything that goes on, but it's a slight issue when it comes to running debugging from a clean branch. Debugging can only be turned on through a single global variable in the code, and git complains when the change that sets debug to true isn't committed.
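One way around editing a tracked file is to take the debug switch from the command line instead of a global variable. The sketch below uses argparse and logging from the standard library. The --debug flag name and the function names are assumptions, not something the program currently provides.

```python
import argparse
import logging
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line options; --debug replaces the global variable."""
    parser = argparse.ArgumentParser(description="OCR a sign-in sheet to CSV")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="print debugging output (off unless this flag is given)",
    )
    return parser.parse_args(argv)


def configure_logging(debug: bool) -> int:
    """Set the log level from the flag and return the level chosen."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(level=level)
    return level


if __name__ == "__main__":
    args = parse_args()
    configure_logging(args.debug)
    logging.debug("debugging enabled")  # only shown when --debug is passed
```

Because the flag lives outside the source file, a clean branch stays clean whether or not debugging is turned on for a given run.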
Several things need to be added: