Comments (10)
Hi Matt,
Thank you for the prompt response. I agree the page is kind of heavy. It takes some time to render on my machine as well.
I greatly appreciate your time and effort on creating and maintaining this library.
from pypdf.
@MartinThoma
Here is the document (the download worked for me):
PUC_Quiet crossing_school boundaries_11X17.pdf
I've opened the PDF in Acrobat Reader and the file looks very heavy (lots of drawings/images?).
More evidence that extractText() needs work. It is a very complex file, however, and opening it in a PDF viewer was somewhat difficult for my system.
@alisufian I'm currently looking into performance topics. The link seems not to load on my machine. Do you still have that document somewhere?
This callgraph was created via:
$ python -m cProfile -o profile.pstats script.py
$ gprof2dot -f pstats profile.pstats | dot -Tsvg -o mine.svg
with
from PyPDF2 import PdfReader

reader = PdfReader('PUC_Quiet.crossing_school.boundaries_11X17.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text()
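For anyone who wants the numbers without rendering a call graph, here is a minimal sketch that runs the same kind of workload under cProfile from inside Python and returns a text report. The helper names `profile_call` and `extract_all_text` are made up for this example; only `cProfile`, `pstats` and the `PdfReader` loop above come from the thread.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text).

    Hypothetical helper for illustration; not part of PyPDF2/pypdf.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def extract_all_text(path):
    """Same extraction loop as in the script above."""
    # Imported lazily so this module loads even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "".join(page.extract_text() for page in reader.pages)
```

Usage would look like `text, report = profile_call(extract_all_text, "some.pdf")` followed by `print(report)`; the path is a placeholder. Sorting by cumulative time tends to surface the same hot paths that the call graph highlights.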
Some micro-benchmarks on Python 3.6:
peek in (b"\r", b"\n")
vs peek in b"\r\n"
vs peek == b"\r" or peek == b"\n"
In [8]: %timeit peek not in b"\r\n"
258 ns ± 3.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [11]: %timeit a != b"\r" and a != b"\n"
65.9 ns ± 0.618 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [9]: %timeit peek not in (b"\r", b"\n")
56 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
That is surprising to me.
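To rerun the comparison outside IPython, something like the following should work. It is only a sketch: the `peek = b"a"` setup value, the loop counts and the helper names are assumptions, and absolute timings will differ by machine and Python version.

```python
import timeit

# The three line-ending checks from the timings above, as source strings.
# Raw strings keep the backslashes so timeit compiles real bytes literals.
VARIANTS = {
    "bytes containment": r'peek not in b"\r\n"',
    "explicit comparisons": r'peek != b"\r" and peek != b"\n"',
    "tuple membership": r'peek not in (b"\r", b"\n")',
}


def time_expr(expr, setup='peek = b"a"', number=200_000, repeat=5):
    """Return the best observed time per evaluation in nanoseconds."""
    best = min(timeit.repeat(expr, setup=setup, number=number, repeat=repeat))
    return best / number * 1e9


def run():
    """Time every variant and return {name: nanoseconds per loop}."""
    return {name: time_expr(expr) for name, expr in VARIANTS.items()}


if __name__ == "__main__":
    for name, ns in run().items():
        print(f"{name:22s} {ns:8.1f} ns per loop")
```

One thing to keep in mind when swapping between these: they are not strictly equivalent. `in` on a bytes object is a subsequence search, so `b"" in b"\r\n"` and `b"\r\n" in b"\r\n"` are both true, while the tuple and the explicit comparisons only match the two single-byte values.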
Decimal instantiation
In [13]: %timeit Decimal(0)
168 ns ± 0.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [14]: %timeit Decimal("0")
223 ns ± 1.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [15]: %timeit Decimal(0.0)
430 ns ± 4.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
After applying #1014:
One of the performance killers is creating the FloatObjects. We do it 6 million times in this example and it accounts for 8% of the workload. But replacing the FloatObject with normal floats (or decimals) would be quite a massive change.
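To get a rough feel for why per-object construction adds up over millions of calls, here is a small sketch comparing a plain `float`, a `Decimal`, and a Decimal subclass. `DecimalBackedNumber` is a hypothetical stand-in written for this comment, not the library's FloatObject, and it assumes the number object is built on `decimal.Decimal`.

```python
import timeit
from decimal import Decimal


class DecimalBackedNumber(Decimal):
    """Hypothetical stand-in for a Decimal-based number object."""

    def __new__(cls, value="0"):
        # Assumption: values are normalised through str() before parsing.
        return Decimal.__new__(cls, str(value))


def per_call_ns(func, number=200_000, repeat=5):
    """Return the best observed time per call in nanoseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9


def compare():
    """Time construction of each numeric type from the same string."""
    return {
        "float": per_call_ns(lambda: float("1.5")),
        "Decimal": per_call_ns(lambda: Decimal("1.5")),
        "DecimalBackedNumber": per_call_ns(lambda: DecimalBackedNumber("1.5")),
    }


if __name__ == "__main__":
    for name, ns in compare().items():
        # Scale to the 6 million constructions mentioned above.
        print(f"{name:20s} {ns:7.1f} ns/call, {ns * 6e6 / 1e9:5.2f} s per 6M calls")
```

Whatever the exact numbers are on a given machine, multiplying the per-call cost by 6 million shows how a few hundred nanoseconds per object becomes a visible share of the run.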
@MartinThoma
With the new PR, can you rerun to check for improvements?
I'm closing this issue now as I don't have any further ideas for speeding this up (except for writing a C/C++/Rust module, which would be a very different beast).