Comments (10)

alisufian commented on July 17, 2024

Hi Matt,
Thank you for the prompt response. I agree the page is kind of heavy. It takes some time to render on my machine as well.

I greatly appreciate the time and effort you put into creating and maintaining this library.

from pypdf.

pubpub-zz commented on July 17, 2024

@MartinThoma
Here is the document (the download worked for me):
PUC_Quiet crossing_school boundaries_11X17.pdf

I've opened the PDF in Acrobat Reader and the file looks very heavy (lots of drawings/images?).

mstamy2 commented on July 17, 2024

More evidence that extractText() needs work. It is a very complex file, however, and opening it in a PDF viewer was somewhat difficult for my system.

MartinThoma commented on July 17, 2024

@alisufian I'm currently looking into performance topics. The link seems not to load on my machine. Do you still have that document somewhere?

MartinThoma commented on July 17, 2024

[Call graph image]

This call graph was created via:

$ python -m cProfile -o profile.pstats script.py
$ gprof2dot -f pstats profile.pstats | dot -Tsvg -o mine.svg

with

from PyPDF2 import PdfReader

reader = PdfReader('PUC_Quiet.crossing_school.boundaries_11X17.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text()
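
If Graphviz is not at hand, the same profile data can also be read as text with the standard-library pstats module. The following is only a sketch: build_text is a hypothetical stand-in workload so the snippet runs without the PDF, and in practice it would be replaced by the extract_text() loop above.

```python
import cProfile
import io
import pstats


def build_text(n):
    # Hypothetical stand-in workload; swap in the extract_text() loop when profiling for real.
    text = ""
    for i in range(n):
        text += str(i)
    return text


def top_functions(profiler, limit=10):
    """Return a text report of up to `limit` entries, sorted by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


profiler = cProfile.Profile()
profiler.enable()
result = build_text(10_000)
profiler.disable()

report = top_functions(profiler)
print(report)
```

Sorting by cumulative time shows which call trees dominate the run, which is roughly what the thick edges in the call graph convey.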

MartinThoma commented on July 17, 2024

Some micro-benchmarks on Python 3.6:

peek in (b"\r", b"\n") vs peek in b"\r\n" vs peek == b"\r" or peek == b"\n"

In [8]: %timeit peek not in b"\r\n"
258 ns ± 3.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: %timeit a != b"\r" and a != b"\n"
65.9 ns ± 0.618 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit peek not in (b"\r", b"\n")
56 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

That is surprising to me.
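
For anyone who wants to rerun this outside IPython, here is a minimal sketch using the standard-library timeit module. The sample value b"x" for peek is an assumption, and the absolute numbers will differ by machine and Python version. The forms helper is hypothetical and only checks that the three spellings agree.

```python
import timeit

# Assumed sample value: one byte read from the stream that is not a line ending.
peek = b"x"

STATEMENTS = {
    "substring": r'peek not in b"\r\n"',
    "tuple": r'peek not in (b"\r", b"\n")',
    "explicit": r'peek != b"\r" and peek != b"\n"',
}


def forms(value):
    """Evaluate the three spellings for one value so they can be compared."""
    return (
        value not in b"\r\n",
        value not in (b"\r", b"\n"),
        value != b"\r" and value != b"\n",
    )


def benchmark(number=200_000, repeat=5):
    """Return the best per-call time in nanoseconds for each statement."""
    results = {}
    for label, stmt in STATEMENTS.items():
        timings = timeit.repeat(stmt, globals={"peek": peek}, number=number, repeat=repeat)
        results[label] = min(timings) / number * 1e9
    return results


results = benchmark()
for label, ns in results.items():
    print(f"{label:10s} {ns:8.1f} ns per call")
```

One caveat before swapping one form for another: they agree for one-byte values, but not for an empty bytes object (for example at the end of a stream), because b"" counts as a substring of any bytes object.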

Decimal instantiation

In [13]: %timeit Decimal(0)
168 ns ± 0.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit Decimal("0")
223 ns ± 1.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [15]: %timeit Decimal(0.0)
430 ns ± 4.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
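
The Decimal comparison can be reproduced the same way. This is a sketch with the standard-library timeit module; the helper name per_call_ns is made up for illustration, and timings will vary by machine and Python version.

```python
import timeit
from decimal import Decimal

STATEMENTS = {
    "int": "Decimal(0)",
    "str": 'Decimal("0")',
    "float": "Decimal(0.0)",
}


def per_call_ns(stmt, number=200_000, repeat=5):
    """Return the best per-call time in nanoseconds for one statement."""
    timings = timeit.repeat(stmt, globals={"Decimal": Decimal}, number=number, repeat=repeat)
    return min(timings) / number * 1e9


decimal_results = {label: per_call_ns(stmt) for label, stmt in STATEMENTS.items()}
for label, ns in decimal_results.items():
    print(f"Decimal from {label:5s} {ns:8.1f} ns per call")
```

All three calls produce a value equal to zero, so for this particular constant the choice of argument type only affects speed.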

MartinThoma commented on July 17, 2024

After applying #1014:

[Call graph image]

MartinThoma commented on July 17, 2024

One of the performance killers is creating the FloatObjects. We do it 6 million times in this example and it accounts for 8% of the workload. But replacing the FloatObject with normal floats (or decimals) would be quite a massive change.
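
To get a rough feel for what a per-number wrapper object costs, here is a small sketch. WrappedNumber is a hypothetical stand-in written for this comparison and is not the library's FloatObject; the 6 million figure is the count mentioned above, and the timings depend on the machine.

```python
import timeit
from decimal import Decimal


class WrappedNumber(Decimal):
    """Hypothetical stand-in for a per-number wrapper; not pypdf's FloatObject."""

    def __new__(cls, value="0"):
        # Going through str() mimics a wrapper that normalises its input first.
        return Decimal.__new__(cls, str(value))


NAMESPACE = {"Decimal": Decimal, "WrappedNumber": WrappedNumber}


def per_call_ns(stmt, number=200_000, repeat=5):
    """Return the best per-call time in nanoseconds for one statement."""
    timings = timeit.repeat(stmt, globals=NAMESPACE, number=number, repeat=repeat)
    return min(timings) / number * 1e9


costs = {
    "float": per_call_ns('float("1.5")'),
    "Decimal": per_call_ns('Decimal("1.5")'),
    "WrappedNumber": per_call_ns('WrappedNumber("1.5")'),
}

CALLS = 6_000_000  # number of object creations mentioned in the thread
for label, ns in costs.items():
    print(f"{label:14s} {ns:8.1f} ns per call, about {ns * CALLS / 1e9:.2f} s for {CALLS:,} calls")
```

Multiplying the per-call cost by the number of creations gives a ballpark for how much time could be recovered, which helps judge whether a large refactoring would pay off.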

pubpub-zz commented on July 17, 2024

@MartinThoma
With the new PR, can you rerun to check for improvements?

MartinThoma commented on July 17, 2024

I'm closing this issue now, as I don't have any further approach to speed this up (except for writing a C/C++/Rust module, which would be a very different beast).
