GithubHelp home page GithubHelp logo

table_ocr_java's Introduction

                            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 TABLE DETECTION IN IMAGES AND OCR TO CSV WITH JAVA
            
                                                Yan Shi
                            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Guide

  1. Overview
  2. Requirements
  3. Demo
  4. Modules
  5. Contact
  6. Reference

Overview

This java package contains modules to help with finding and extracting tabular data from a PDF or image into a CSV format.

Given an image that contains a table…



Extract the the text into a CSV format…

节次 星期,周—,周二,周三,周四,周五
一,语文,英语,英语,自然,数学
二,语文,英语,英语,语文,数学
三,数学,语文,数学,语文,英语
四,数学,语文,数学,体育,英语
五,体育,**品德,语文,数学,手工
六,美术,音乐,语文,数学,写字

Requirements

See maven dependency jar package.

  • pdfbox 2.0.26
  • javacv 1.5.7
  • djl 0.17.0
  • ...

Demo

There is a demo module that will try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.

1.table/demo/MainDemo

That will run against the following image:



The following should be saved to your directory after running the class table/demo/MainDemo.java.

Extracted the following tables from the image:
[('/img_test/simple.png', ['/img_test/simple/table-0.png'])]
Extracted cells from /img_test/simple/table-0.png
Cells:
/img_test/simple/table-0/0-0.png
/img_test/simple/table-0/0-1.png
/img_test/simple/table-0/0-2.png
...

Here is the entire CSV output:

Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4

Modules

The package is split into modules with narrow focuses.

  • pdf_to_images uses pdfbox to extract images from a PDF.
  • extract_tables finds and extracts table-looking things from an image.
  • extract_tables_dnn finds and extracts table-looking things from an image by deep learning model.
  • extract_cells extracts and orders cells from a table.
  • ocr_image uses djl to OCR the text from an image of a cell.
  • ocr_to_csv converts into a CSV the directory structure that ocr_image outputs.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow.

Contact

1、github:https://github.com/jiangnanboy

2、QQ:2229029156

3、Email:[email protected]

Reference

https://github.com/jiangnanboy/doc_ai

https://github.com/deepjavalibrary/djl

https://github.com/jiangnanboy/java-springboot-paddleocr

https://github.com/jiangnanboy/layout_analysis4j

table_ocr_java's People

Contributors

jiangnanboy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.