I was trying to extract text from a pdf using pypdf over llamaindex. The pdf is intera

Interactive PDFs are not working about pypdf HOT 4 CLOSED

SalomonKisters commented on August 16, 2024

Interactive PDFs are not working

Comments (4)

stefan6419846 commented on August 16, 2024

Please provide some more details. What error are you getting? If there is one, how can it be reproduced without dependencies on external tools or libraries? In your case, the output seems to come from logger_warning only, which usually indicates a specification violation in the file but should not really affect the results.

SalomonKisters commented on August 16, 2024

This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.

import PyPDF2

def read_pdf_text(file_path):
    # Open the PDF file
    with open(file_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)
        
        # Initialize an empty string to hold the text
        text = ""
        
        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
        
        return text


file_path = 'SCION-book.pdf'
pdf_text = read_pdf_text(file_path)
print(pdf_text)

Running this script gives me the following error:

Traceback (most recent call last):
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 21, in <module>
    pdf_text = read_pdf_text(file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 15, in read_pdf_text
    text += page.extract_text()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1353, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
    ~~~^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_reader.py", line 1260, in get_object
    retval = read_object(self.stream, self)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 1080, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'\xd5' @3358736: b'\\`Vi\x9f\x10\xd5\x02\'\xd3h\xa5p:\xf1\x82\x0b\x19\xcdj\xd5"\x0b\xcc*\rL\xac\x85Z\xe0t"\xad$TP\xfc?\xc0jRwf\xff\x0f\xb0\xab\xc0B\xa4\xb0\xd3i\xb4\x128\x9dy\xc1\x04\xd8Ru\xd6`\xdbP\x9a\xa8T\xa5\xc0\xe94ZI\x1cvL'

I managed to work around the error by first reading the file with pdfrw and writing it back out, but I do not think that is the best solution.

stefan6419846 commented on August 16, 2024

PyPDF2 is no longer supported, and you should definitely switch to pypdf. Nevertheless, there still seems to be an issue with at least page 193:

Ignoring wrong pointing object 5610 0 (offset 3356523)
Ignoring wrong pointing object 5615 0 (offset 3354947)
Object 5615 0 not defined.
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 22, in <module>
    pdf_text = read_pdf_text(file_path)
  File "/home/stefan/tmp/pypdf/run.py", line 16, in read_pdf_text
    text += page.extract_text()
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 1604, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
AttributeError: 'NoneType' object has no attribute 'get_object'

The corresponding page cannot be rendered either and looks odd, so this seems to be a general issue with your PDF file.
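If only a few pages of the file are damaged, one possible way to still get the remaining text is to extract page by page and skip the pages that raise. The following is a minimal sketch of that idea, not something proposed in this thread: the helper names `extract_text_skipping_bad_pages` and `read_pdf_text_tolerant` are hypothetical, and it assumes pypdf's `PdfReader`, `reader.pages`, and `page.extract_text()` as used in the script above.

```python
from typing import Iterable, List, Tuple


def extract_text_skipping_bad_pages(pages: Iterable) -> Tuple[str, List[int]]:
    """Concatenate the text of all pages, recording pages whose extraction fails.

    `pages` is any iterable of objects with an `extract_text()` method,
    for example `PdfReader(...).pages` from pypdf.
    Returns the collected text and the 1-based numbers of the failed pages.
    """
    chunks: List[str] = []
    failed: List[int] = []
    for number, page in enumerate(pages, start=1):
        try:
            # Fall back to an empty string if nothing is returned for a page.
            chunks.append(page.extract_text() or "")
        except Exception:
            # Deliberately broad: a damaged page can fail in different ways
            # (the tracebacks above show PdfReadError and AttributeError).
            failed.append(number)
    return "".join(chunks), failed


def read_pdf_text_tolerant(file_path: str) -> Tuple[str, List[int]]:
    """Hypothetical wrapper: open a PDF with pypdf and skip unreadable pages."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    return extract_text_skipping_bad_pages(reader.pages)
```

Calling `read_pdf_text_tolerant('SCION-book.pdf')` would return the text of the readable pages together with the list of page numbers that could not be extracted, so the damaged pages can be inspected separately. Errors raised while loading a page object itself, rather than while extracting its text, are not caught by this sketch.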

SalomonKisters commented on August 16, 2024

Ah okay, fair enough. I can try it with pypdf, although llamaindex is still using PyPDF2, so I don't know if that will help in my case.
Anyways, thanks for the quick help!
