pavtiger / parse-tables-from-pdf

A tool that automates extracting data tables from PDF documents in which the tables appear as scanned images

Home Page: https://pdf.pavtiger.com

Languages: Python 46.71%, JavaScript 30.86%, CSS 15.27%, HTML 7.16%
Topics: pdf, python, webserver, opencv, pytesseract, pytesseract-ocr, socketio

parse-tables-from-pdf's Introduction

Hello there, I am Paul Artushkov (also known as pavTiger)

Since early childhood, I've been interested in technology. I started by soldering boards and Arduino components, then gradually moved on to robotics olympiads with the LEGO EV3 system and went to the RRO (Russian Robot Olympiad) in Innopolis. In the fifth grade, I switched to more practical programming (that was when I started using GitHub). I enjoy hackathons very much and have been to many of the CROC ones (and won a few). Recently, I started doing competitive programming while studying at Tinkoff Generation.

I study at an IT & Math school named Silaeder, which focuses on science projects.

Languages I know well: Python, C++, JavaScript, C, Bash
I have a lot of experience with Linux and Bash scripting. Having tried a wide variety of graphics libraries and frameworks, I settled on Three.js, which is used in many of my projects.
I enjoy animation & 3D modeling as well. Here's some of my work.

English level B2, confirmed by the Cambridge PET exam (my current level is ~C2, but I haven't taken the exam yet due to COVID and other limitations). I have lived with a host family in England twice to become more fluent in the language, first in Hastings and then in Cambridge; the latter is now my favorite city.

As much as doing projects, I like going to science conferences. I've participated in Balt Konkurs in Saint Petersburg twice, winning a 3rd-degree award and the Youth Jury laureate prize in Computer Science. I also won the Advanced Research Award at the international Korean conference KSASF.

Internships

  • I did an internship as a backend developer at a company named Visyond, which is building an Excel-like product for business people. I fixed some issues and even made some parts of the code several times faster. It was a great experience ☺
  • In July 2023 I was an intern at a company named Telepat, working as a DevOps engineer and system administrator on their main product, medsenger (a medical messenger). I packaged their main backend into a Docker container and started services on-premises
  • Currently I'm an intern at Renaissance Capital, where I work with Power BI, especially Power Query, an Excel add-on for managing data and handling some other programming tasks

My favorite hackathons and achievements:

  • MediaHack - May 2019. Won the nomination "Hellish coders" because we wrote a ton of code with great architecture and the organisers apparently really liked the way we were coding
  • CraftHack - August 2019. Had a fun time soldering Arduinos and trying to make a terminal game
  • Winner of Data analysis National olympiad (DANO) 2022
  • WildHack - December 2021. First big hackathon since COVID started. Got together with some friends and wrote a Neural Network recognising fish in photos from a reserve park in Kamchatka.
  • DHHack - November 2019. Our team won the nomination "Best research solution"
  • HSE Gamedev - July 2023. We created a small game in Unreal Engine inspired by Outer Wilds in just under 2 weeks and won in our nomination

Here are all the sources (presentations, etc.) for these hackathons.

Other than all that technical stuff, I also really enjoy listening to music (I love soundtracks, classical music, EDM and some rap), flying FPV drones and working with cameras.
If you want to contribute to any of my repos, you are more than welcome to do that!!!
How to contact me:

parse-tables-from-pdf's People

Contributors

codacy-badger, dashad2205, pavtiger

parse-tables-from-pdf's Issues

Processing keeps going if the client disconnects just after processing starts

How to reproduce:
Open the website, send a request and, just as it starts the query (~3 seconds in), refresh the whole page. The process will now keep running in the background even though there's no client connected to it.

I guess it happens during the document download event. Since I currently use the non-async version of socketio, a client disconnect during the document download might lead to disconnect() not being called.
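One way to stop orphaned jobs is a per-client cancellation flag that the processing loop checks between pages. The sketch below uses only the standard library; the names `cancel_flags`, `on_connect`, `on_disconnect` and `process_document` are hypothetical and are not taken from recognise.py. It only helps once the disconnect handler actually gets a chance to run, so the blocking download would still need to be moved off the thread that serves socket events.

```python
import threading

# Hypothetical registry: one cancellation flag per connected client id (sid).
cancel_flags: dict[str, threading.Event] = {}


def on_connect(sid: str) -> None:
    """Create a fresh flag when a client connects."""
    cancel_flags[sid] = threading.Event()


def on_disconnect(sid: str) -> None:
    """Tell any running job for this client to stop, then forget the client."""
    flag = cancel_flags.pop(sid, None)
    if flag is not None:
        flag.set()


def process_document(sid: str, pages: list[str]) -> list[str]:
    """Process pages one by one, stopping early if the client has gone."""
    flag = cancel_flags.get(sid)
    done = []
    for page in pages:
        if flag is None or flag.is_set():
            break  # client disconnected: abandon the remaining pages
        done.append(page.upper())  # stand-in for the real per-page OCR step
    return done
```

The flag is looked up once before the loop, so a disconnect that arrives mid-job is still seen on the next iteration even though the registry entry has been removed.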

Deal with table rotations

Because these documents have to be hand-signed, they are actually scans of paper documents. They may therefore be slightly misaligned, and I do want to allow small rotation angles, so I should handle that as well.
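A possible starting point is to estimate the skew angle from the table's near-horizontal rulings and rotate the page back by that amount. The sketch below covers only the angle estimate and assumes the line segments have already been found by some line detector (which detector to use is left open); `estimate_skew_degrees` and `max_angle` are hypothetical names.

```python
import math
from statistics import median


def estimate_skew_degrees(segments, max_angle=15.0):
    """Return the median angle, in degrees, of the near-horizontal segments.

    segments: iterable of (x1, y1, x2, y2) tuples in image coordinates.
    Segments steeper than max_angle are ignored as vertical rulings or noise.
    In image coordinates (y grows downward) a positive result means the
    lines slope down to the right.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        if x1 == x2:
            continue  # perfectly vertical: not a horizontal ruling
        angle = math.degrees(math.atan((y2 - y1) / (x2 - x1)))
        if abs(angle) <= max_angle:
            angles.append(angle)
    # With no usable segments, assume the page is not rotated.
    return median(angles) if angles else 0.0
```

The median is used rather than the mean so that a few badly detected segments do not shift the estimate.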

Stop button bugs

First of all, I don't think the stop button should even be present on the startup screen. If possible, it should be added after the task is submitted, or at least do nothing when pressed before processing has started.
[Screenshot: 2023-01-19-16:23_000]

Then, in the latest version the stop button doesn't turn red while processing is in progress.
[Screenshot: 2023-01-19-16:22_000]

Highlight table cells that need to be reviewed

A great idea I've seen is to highlight the table cells that my code (pytesseract specifically) is most uncertain about. It could be a gradient where the most vivid red means 100% uncertainty and white means full certainty.
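The gradient itself can be a simple linear mapping from confidence to colour. The sketch below assumes a per-cell confidence on a 0 to 100 scale, which is the scale pytesseract's `image_to_data` reports in its `conf` field (with -1 for boxes that are not words); `confidence_to_hex` is a hypothetical helper, and treating negative values as fully uncertain is a choice made here.

```python
def confidence_to_hex(conf: float) -> str:
    """Map an OCR confidence (0-100) to a colour from vivid red to white."""
    conf = max(0.0, min(100.0, conf))  # clamp; negative values become 0
    other = round(255 * conf / 100)    # green and blue rise with confidence
    return f"#ff{other:02x}{other:02x}"
```

For example, `confidence_to_hex(0)` gives `#ff0000` and `confidence_to_hex(100)` gives `#ffffff`; the result can be used directly as a cell background colour in the web page.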

User does not disconnect due to bad asynchrony

How to reproduce:

  • Start the webserver (python recognise.py --server)
  • Go to the web page
  • Start processing and close the page at the same time

Now you should have an ongoing process in the background even though there is no user to receive the result.

This occurs because, for some reason, my async server cannot process multiple requests at the same time. So while the request that starts the job is being handled, the tab close event is never processed.

In theory, this bug will also appear when, instead of a tab close event, a new user tries to open the main page. They will not get any response, because another request is still being handled. This is a deal-breaking bug.
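A common cause of this symptom in asyncio programs is blocking work running directly on the event loop, which prevents any other handler from being scheduled. Whether that is the cause here is an assumption. The sketch below shows the general remedy with `asyncio.to_thread` (Python 3.9+); `heavy_job` and `quick_handler` are hypothetical stand-ins for the processing job and a short event such as a tab close.

```python
import asyncio
import time


def heavy_job(seconds: float) -> str:
    """Stand-in for blocking work such as downloading or OCR."""
    time.sleep(seconds)
    return "job done"


async def quick_handler(log: list[str]) -> None:
    """Stand-in for a short event such as a disconnect."""
    log.append("quick handler ran")


async def main() -> list[str]:
    log: list[str] = []
    # Run the blocking job in a worker thread so the event loop stays free.
    job = asyncio.create_task(asyncio.to_thread(heavy_job, 0.2))
    await asyncio.sleep(0.05)   # the job is still running at this point
    await quick_handler(log)    # yet this handler is served immediately
    log.append(await job)       # collect the job result when it finishes
    return log
```

Running `asyncio.run(main())` records the quick handler before the job result, which is the behaviour the tab close event needs.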

Do a better job at displaying processing status

Do a better job at displaying processing status (in progress, stopped, finished).
Currently, the only way to know whether processing has finished is to either scroll to the bottom of the page or look at the log box, but this is not as intuitive as I want it to be.
I think it would be a great idea to display the processing status next to the elapsed-time clock.

Here is how it looks now:
[Screenshot: 2023-01-19-16:32]

Add support for recognizing multiple tables on the same page

Currently, I recognize tables by finding the biggest table on a page. This avoids many problems with detecting tables that are actually just part of a bigger table, but multiple tables are sometimes present, so I should fix this later on.

This can be addressed by finding tables that do not intersect.
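A minimal sketch of that idea, assuming each candidate table is an axis-aligned bounding box `(x1, y1, x2, y2)`: go through the candidates from largest to smallest and keep a box only if it does not intersect any box already kept. This preserves the current "biggest table wins" behaviour for nested candidates while allowing separate tables. The function names are hypothetical.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def intersects(a, b):
    """True if two (x1, y1, x2, y2) boxes overlap with non-zero area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def pick_tables(boxes):
    """Greedily keep the largest boxes that do not intersect each other."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if not any(intersects(box, other) for other in kept):
            kept.append(box)
    return kept
```

With a large table, a candidate nested inside it, and a second table lower on the page, the nested candidate is dropped and both real tables are kept.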

Parse non-image tables if possible

The whole goal of this project is to parse tables from documents where the tables are inserted as images, but if a specific page contains a proper (non-image) table, then let's just parse it directly.

Loading converted table pictures from server fails

How to reproduce:
Open a folder for a page that is yet to be processed

This is probably due to a recent optimisation that lazy-loads the pages

Task exception was never retrieved
future: <Task finished name='Task-128' coro=<AsyncServer._handle_event_internal() done, defined at /home/pavtiger/.local/lib/python3.11/site-packages/socketio/asyncio_server.py:522> exception=FileNotFoundError(2, 'No such file or directory')>
Traceback (most recent call last):
  File "/home/pavtiger/.local/lib/python3.11/site-packages/socketio/asyncio_server.py", line 524, in _handle_event_internal
    r = await server._trigger_event(data[0], namespace, sid, *data[1:])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pavtiger/.local/lib/python3.11/site-packages/socketio/asyncio_server.py", line 558, in _trigger_event
    ret = await handler(*args)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/pavtiger/Docs/Parse-tables-from-PDF/recognise.py", line 344, in send_page_preview
    with open(path, 'rb') as f:
         ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'static/output/pages/page_9.jpg'
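Until pages are rendered on demand, the handler could at least tolerate a preview that does not exist yet. The sketch below is not the actual `send_page_preview` code; `read_page_preview` is a hypothetical helper that returns `None` for a missing file so the caller can reply with a "not ready" message instead of raising inside the event handler.

```python
from pathlib import Path


def read_page_preview(path):
    """Return the preview bytes, or None if the page is not rendered yet."""
    try:
        return Path(path).read_bytes()
    except FileNotFoundError:
        return None
```

Catching the exception, rather than checking for the file first, avoids a race where the file appears or disappears between the check and the read.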

Deal with long page render times

Under load or on a slow CPU, rendering all the pages takes a long time (and my current library, pdf2image, does not support rendering single pages only), so the user has to wait a long time before any progress bars are displayed. So I should:

  • Switch to another PDF rendering library (possibly fitz, from this Stack Overflow question) and render single pages when they are requested, because usually not all 100+ pages are going to be processed
  • Display progress bars right after the enter button is clicked, as an indication that something is happening
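Rendering on demand can be sketched independently of the rendering library as a small cache around a "render one page" function. In the sketch below, `LazyPageCache` and `render_page` are hypothetical; `render_page` stands for whatever call renders a single page. It may also be worth checking pdf2image's `convert_from_path`, which accepts `first_page` and `last_page` arguments that could serve as that call.

```python
class LazyPageCache:
    """Render pages only when first requested and remember the result."""

    def __init__(self, render_page):
        # render_page: hypothetical callable mapping a page number to an image
        self._render_page = render_page
        self._cache = {}

    def get(self, page_number):
        """Return the rendered page, rendering it on first access only."""
        if page_number not in self._cache:
            self._cache[page_number] = self._render_page(page_number)
        return self._cache[page_number]
```

With this shape, a 100+ page document costs nothing up front, and each page is rendered at most once however often its preview is opened.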
