Comments (17)

phoewass commented on August 30, 2024

Hi all. Sorry it took me a while to publish the PR even though the code was already available.
Now that the PR is up for review, I'm looking forward to your feedback.

from camelot.

Siddharth1India commented on August 30, 2024

Any update on this? My PDFs are hundreds of pages long and I could really use this feature.

from camelot.

vinayak-mehta commented on August 30, 2024

👀

from camelot.

satheeshkatipomu commented on August 30, 2024

Hi @vinayak-mehta,

I also thought of implementing this. My suggestions for asynchronous processing of pages are dramatiq or celery.

from camelot.

jontis commented on August 30, 2024

I'm doing this with dask, though I chose it out of habit.

from camelot.

selcukusta commented on August 30, 2024

Has there been any improvement here? I have a file with only one page. The page has a table (25 rows x 13 columns). The read_pdf function takes 10 seconds, whereas to_excel afterwards takes only 100-150 ms. I think 10 seconds is too long. Am I wrong?

from camelot.

NixBiks commented on August 30, 2024

@selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

But does anyone have a solution for multiple pages in parallel?

from camelot.

vinayak-mehta commented on August 30, 2024

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

Yes!

> But does anyone have a solution for multiple pages in parallel?

Using multiprocessing, we should be able to distribute multiple pages on all cores, processing them in parallel.
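
A minimal sketch of that idea, assuming the page count is known up front and that each worker process may open the same PDF independently. `chunk_pages`, `extract_range`, and `read_pdf_parallel` are hypothetical helper names, not part of camelot; the only camelot calls used are `camelot.read_pdf(path, pages=...)` and the `.df` attribute of each returned table. This is one possible layout, not the approach taken in the PR.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(n_pages, n_chunks):
    """Split pages 1..n_pages into at most n_chunks contiguous ranges,
    formatted as page strings such as "1-4" or "7"."""
    if n_pages < 1:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    base, extra = divmod(n_pages, n_chunks)
    ranges, start = [], 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        end = start + size - 1
        ranges.append(str(start) if start == end else f"{start}-{end}")
        start = end + 1
    return ranges


def extract_range(job):
    """Worker (hypothetical): runs in a child process, returns plain DataFrames."""
    path, pages = job
    import camelot  # imported here so each worker process loads it on its own
    tables = camelot.read_pdf(path, pages=pages)
    return [table.df for table in tables]


def read_pdf_parallel(path, n_pages, workers=4):
    """Hypothetical wrapper: one page range per worker process."""
    jobs = [(path, pages) for pages in chunk_pages(n_pages, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(extract_range, jobs))
    # Flatten the per-chunk lists, keeping page order.
    return [df for chunk in results for df in chunk]


# Usage (hypothetical file name; call from under `if __name__ == "__main__":`):
#     dfs = read_pdf_parallel("report.pdf", n_pages=200, workers=4)
```

Returning DataFrames rather than camelot's table objects keeps the data sent back from the workers simple to pickle. Because each worker is a separate process, nothing is shared between threads.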

from camelot.

NixBiks commented on August 30, 2024

I get this though

objc[53475]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Oh, what is the difference between https://github.com/atlanhq/camelot and https://github.com/camelot-dev/camelot? I hadn't noticed there were two repos until now...
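
The `objc[...] ... fork() was called` crash quoted above is what macOS reports when a child process created with `fork()` touches the Objective-C runtime. One common way to avoid it is to start workers with the `spawn` method instead of `fork` (`spawn` has been the default on macOS since Python 3.8). The sketch below is an illustration under that assumption, not a confirmed fix for this particular report; `spawn_context` and `map_in_spawned_processes` are hypothetical names.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def spawn_context():
    """Return a multiprocessing context that starts fresh interpreters
    instead of forking the parent process."""
    return mp.get_context("spawn")


def map_in_spawned_processes(func, items, workers=2):
    """Hypothetical helper: run func over items in spawned worker processes.

    func must be a top-level function in an importable module, and the caller
    must sit under `if __name__ == "__main__":`, because spawned workers
    re-import the main module on startup.
    """
    with ProcessPoolExecutor(max_workers=workers, mp_context=spawn_context()) as pool:
        return list(pool.map(func, items))
```

Another workaround that is often mentioned for this error is setting the environment variable `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` before starting Python. That only silences the check, so switching the start method is usually the safer choice.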

from camelot.

selcukusta commented on August 30, 2024

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
>
> But does anyone have a solution for multiple pages in parallel?

Yeah, I know. Actually, it's related to that, but the issue was closed and pointed to this one.

from camelot.

rawsh-bt commented on August 30, 2024

Does anyone have an update? I've tried subclassing PageHandler to make page processing multithreaded / multicore, and also processing multiple PDFs in separate threads, but I'm running into a Ghostscript error (it seems it's not thread-safe?).

from camelot.

phoewass commented on August 30, 2024

I implemented a parallel-processing layer on top of camelot.read_pdf using the multiprocessing library.
I ran into a couple of pitfalls along the way, so I can help with this if you like.

from camelot.

vinayak-mehta commented on August 30, 2024

@phoewass That would be awesome if you're still interested!

from camelot.

RickyGunawan09 commented on August 30, 2024

Can anyone tell me how to use multiprocessing in camelot? Or is this issue still in progress?

from camelot.

mlbrothers commented on August 30, 2024

@phoewass @vinayak-mehta Is this feature part of the library now? If not, is there any way I can use multiprocessing to read a multipage PDF?

from camelot.

deepakagrawal commented on August 30, 2024

Any updates on this feature?

from camelot.

bosd commented on August 30, 2024

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

There is a discussion about this in:

py-pdf#8 (reply in thread)

from camelot.
