Comments (4)
Please provide some more details. What error are you getting, if any, and how can it be reproduced without depending on external tools or libraries? In your case, the output seems to be coming from logger_warning only, which usually indicates a specification breach but should not really affect the results.
from pypdf.
This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.
import PyPDF2

def read_pdf_text(file_path):
    # Open the PDF file
    with open(file_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)
        # Initialize an empty string to hold the text
        text = ""
        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text

file_path = 'SCION-book.pdf'
pdf_text = read_pdf_text(file_path)
print(pdf_text)
With this script I get the following error:
Traceback (most recent call last):
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 21, in <module>
    pdf_text = read_pdf_text(file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 15, in read_pdf_text
    text += page.extract_text()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1353, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
    ~~~^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_reader.py", line 1260, in get_object
    retval = read_object(self.stream, self) # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 1080, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'\xd5' @3358736: b'\\`Vi\x9f\x10\xd5\x02\'\xd3h\xa5p:\xf1\x82\x0b\x19\xcdj\xd5"\x0b\xcc*\rL\xac\x85Z\xe0t"\xad$TP\xfc?\xc0jRwf\xff\x0f\xb0\xab\xc0B\xa4\xb0\xd3i\xb4\x128\x9dy\xc1\x04\xd8Ru\xd6`\xdbP\x9a\xa8T\xa5\xc0\xe94ZI\x1cvL'
I managed to work around the error by first reading the file with pdfrw and rewriting it, but I do not feel that is the best solution.
from pypdf.
PyPDF2 is not supported any more and you should definitely switch to pypdf. Nevertheless, it seems like there still is some issue with at least page 193:
Ignoring wrong pointing object 5610 0 (offset 3356523)
Ignoring wrong pointing object 5615 0 (offset 3354947)
Object 5615 0 not defined.
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 22, in <module>
    pdf_text = read_pdf_text(file_path)
  File "/home/stefan/tmp/pypdf/run.py", line 16, in read_pdf_text
    text += page.extract_text()
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 1604, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
AttributeError: 'NoneType' object has no attribute 'get_object'
The corresponding page cannot be rendered either and looks odd, so this seems to be a general issue with your PDF file.
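If the rest of the document is still needed, one possible way to cope with a single damaged page is to extract text page by page and skip the pages that raise. The following is only a minimal sketch under that assumption, not something pypdf provides: the helper names extract_text_tolerant and read_pdf_text_tolerant are made up for illustration, and the broad except is deliberate because a damaged file can fail in different ways (PdfReadError with PyPDF2, AttributeError in the pypdf run above).

```python
from typing import Iterable, List, Tuple


def extract_text_tolerant(pages: Iterable) -> Tuple[str, List[int]]:
    """Extract text page by page, skipping pages that fail to parse.

    `pages` is any iterable of objects with an `extract_text()` method,
    for example `PdfReader(...).pages`. Returns the joined text and the
    1-based numbers of the pages whose extraction raised an error.
    """
    chunks = []
    failed = []
    for number, page in enumerate(pages, start=1):
        try:
            chunks.append(page.extract_text() or "")
        except Exception:  # broad on purpose: damaged files fail in many ways
            failed.append(number)
    return "\n".join(chunks), failed


def read_pdf_text_tolerant(file_path: str) -> Tuple[str, List[int]]:
    # Hypothetical convenience wrapper; pypdf is imported lazily so the
    # helper above stays usable (and testable) without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return extract_text_tolerant(reader.pages)
```

Calling read_pdf_text_tolerant('SCION-book.pdf') would then return the text of the readable pages together with the list of skipped page numbers, so the damaged page can be reported instead of aborting the whole run. This only hides the symptom; the file itself remains broken.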
from pypdf.
Ah okay, fair enough. I can try it with pypdf, although LlamaIndex is still using PyPDF2, so I don't know if that will help in my case.
Anyways, thanks for the quick help!
from pypdf.