yuranusduke / paddleocrcorrectoutputs

A simple algorithm to find ordered key-value pairs from paddleOCR recognition outputs

License: MIT License

Python 100.00%

paddleocrcorrectoutputs's Introduction

Simple Algorithm to Get Correct Key-Value in paddleOCR Test Sheet Recognition

Background

paddleOCR returns its recognition outputs in shuffled order. To output the correct key-value pairs from medical test sheets, I designed a simple algorithm that works directly on the paddleOCR outputs.

Hows and Whys

[Figure] As the figure shows, to read a value from the '结果' (result) column, we have to find the coordinates of both '结果' and the key. As an example, take the 平均血红蛋白浓度 (MCHC, mean corpuscular hemoglobin concentration) item, where the goal is to detect 329 correctly. First, note that if an item is found with coordinates $v$, it can be illustrated geometrically like this: [Figure]

Then we filter the results. We first filter the columns that contain '结果' by keeping the results whose horizontal coordinates are close to it, and after that we keep the results whose vertical coordinates are close as well. But how do we define 'close'? I set a threshold called epsilon in the code: the larger its value, the more results remain after filtering. Also, looking closely, the results are not aligned with '结果' in the same way on every type of test sheet. To address this, the side parameter in the code can be set to 'middle', 'left' or 'right' for each type. In this example side should be set to 'middle', which uses a bilinear method to compress the 4 coordinates.
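The two filtering passes can be sketched roughly as below. This is only an illustration, not the code in process.py: the helper names (`anchor_point`, `filter_by_column`, `filter_by_row`, `rect`) are made up, each box is assumed to be a list of four (x, y) corners, the 'middle' anchor is approximated by averaging the four corners, and the vertical pass is assumed to compare against the key's row.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = List[Point]        # assumed: four (x, y) corners of a text box
Item = Tuple[str, Box]   # recognized text plus its box


def anchor_point(box: Box, side: str = "middle") -> Point:
    """Collapse four corners into one (x, y) point (assumed interpretation of side)."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    y = sum(ys) / len(ys)
    if side == "left":
        x = min(xs)
    elif side == "right":
        x = max(xs)
    elif side == "middle":
        x = sum(xs) / len(xs)
    else:
        raise ValueError("side must be 'middle', 'left' or 'right'")
    return (x, y)


def filter_by_column(header: Box, items: List[Item],
                     epsilon: float = 40.0, side: str = "middle") -> List[Item]:
    """Keep items whose horizontal anchor is within epsilon of the header's."""
    header_x, _ = anchor_point(header, side)
    return [(t, b) for t, b in items
            if abs(anchor_point(b, side)[0] - header_x) <= epsilon]


def filter_by_row(key: Box, items: List[Item],
                  epsilon: float = 40.0, side: str = "middle") -> List[Item]:
    """Keep items whose vertical anchor is within epsilon of the key's."""
    _, key_y = anchor_point(key, side)
    return [(t, b) for t, b in items
            if abs(anchor_point(b, side)[1] - key_y) <= epsilon]


def rect(x0: float, y0: float, x1: float, y1: float) -> Box:
    """Build an axis-aligned four-corner box for the toy data below."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


# Toy coordinates, invented for illustration.
header = rect(400, 100, 460, 120)          # the '结果' header
key = rect(50, 300, 250, 320)              # the MCHC key
items = [
    ("329", rect(410, 300, 450, 320)),     # same column, same row
    ("320-360", rect(600, 300, 680, 320)), # same row, different column
    ("4.5", rect(415, 200, 445, 220)),     # same column, different row
]

in_column = filter_by_column(header, items)
in_row = filter_by_row(key, in_column)
print([t for t, _ in in_row])  # ['329']
```

With a larger epsilon more boxes survive each pass, which matches the behaviour described above.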

Finally, to get the correct result, I add one more filtering step. After the filtering above, we may still have several results in the same column and the same row. To find the most plausible one, each result is compared with the key's horizontal coordinate $x$ and vertical coordinate $y$. We require $v[0] - x > 0$, because the result should be to the right of the key. We also want $abs(v[1] - y)$ to be as small as possible, because no other row can be closer than the key's own row. However, simply adding $(v[0] - x)$ and $abs(v[1] - y)$ together does not guarantee the right choice, because a small vertical residual may not make up for a large horizontal residual. Therefore I set a weight in the code, weight_for_hori = 0.4, so the final residual is $(v[0] - x) * weight\_for\_hori + abs(v[1] - y) * (1 - weight\_for\_hori)$, which gives more weight to the vertical distance because it may be more important.

All of the parameters above are tunable.
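As a minimal sketch of this last step (the function name `pick_result` and the toy coordinates are hypothetical, and each candidate is assumed to be already reduced to a single point `v = (v[0], v[1])`), the weighted residual can be computed and minimized like this:

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def pick_result(key_xy: Point,
                candidates: List[Tuple[str, Point]],
                weight_for_hori: float = 0.4) -> Optional[Tuple[str, float]]:
    """Return (text, residual) for the candidate with the smallest weighted residual.

    Candidates that are not strictly to the right of the key are skipped.
    Returns None when no candidate qualifies.
    """
    x, y = key_xy
    best: Optional[Tuple[str, float]] = None
    for text, v in candidates:
        dx = v[0] - x
        if dx <= 0:  # the result must lie to the right of the key
            continue
        residual = dx * weight_for_hori + abs(v[1] - y) * (1 - weight_for_hori)
        if best is None or residual < best[1]:
            best = (text, residual)
    return best


# Toy data, invented for illustration.
key_xy = (150.0, 310.0)
candidates = [
    ("329", (430.0, 312.0)),  # right of the key, almost the same row
    ("4.5", (430.0, 280.0)),  # right of the key, one row above
    ("g/L", (100.0, 310.0)),  # left of the key, so it is skipped
]

chosen = pick_result(key_xy, candidates)
print(chosen[0])  # 329
```

Here '329' has residual 280 * 0.4 + 2 * 0.6 = 113.2, while '4.5' has 280 * 0.4 + 30 * 0.6 = 130, so the candidate in the key's own row wins.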

Usage

Before running the recognition, the code rotates the image back if its rotation angle is in $[-\pi, \pi]$.

Args : 
	--data_dir: data directory, default is 'origin_pics'
	--template: which template to use, given as a directory name (e.g. 'h5')
	--img_format: image format, default is 'jpg'
	--recognize_all: True to return all usable results, otherwise return the result for one specific file
	--return_name: default is None, but if you set recognize_all to False, you must set return_name to a specific file name
	--epsilon: default is 40; if epsilon is set too high or too low the results may be poor, but in general the LARGER the BETTER
	--weight_for_hori: horizontal weight in the residual, default is 0.4
	--side: 'middle', 'left' or 'right', tells how the result texts are aligned with '结果'

In the project directory:

-- origin_pics contains all template images.

-- results must contain all template results.

-- templates contains the TXT files for each template.

-- The template names must be the same in every one of these directories.

To run the code, you can run the following command in shell:

python process.py OCR --data_dir='origin_pics' --template='h5' --img_format='jpg' --recognize_all=True --return_name='bloodtestreport2.jpg' --epsilon=40 --weight_for_hori=0.4 --side='middle'

Results

[Figure] [Figure]

NOTE: Results depend heavily on paddleOCR's recognition ability and on the image resolution.

Limitations

-- If paddleOCR cannot recognize '结果' correctly, the results can be wrong.

-- The algorithm still has some limitations. For example, it cannot use OpenCV to correct a mirrored image: the rotation angle is estimated from lines detected with the Hough transform, and the lines in a mirrored image still have an angle of 0.

-- For different styles of test sheets, templates still have to be set up separately.

Thank you! If you have better ideas, pull requests are welcome.

Veni,vidi,vici --Caesar
