Comments (4)

stefan6419846 commented on August 16, 2024

Please provide some more details. What error are you getting, if any, and how can it be reproduced without dependencies on external tools or libraries? In your case, the output seems to come from logger_warning only, which usually indicates a breach of the PDF specification but should not really affect the results.

from pypdf.

SalomonKisters commented on August 16, 2024

This is the simplest way to reproduce it. It works with the non-interactive file, but not with the interactive one.

import PyPDF2

def read_pdf_text(file_path):
    # Open the PDF file
    with open(file_path, 'rb') as file:
        # Create a PDF reader object
        reader = PyPDF2.PdfReader(file)
        
        # Initialize an empty string to hold the text
        text = ""
        
        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
        
        return text


file_path = 'SCION-book.pdf'
pdf_text = read_pdf_text(file_path)
print(pdf_text)

Running this script, I get the following error:

Traceback (most recent call last):
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 21, in <module>
    pdf_text = read_pdf_text(file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git_projects\meinThurgau-data-room\microservices\swidoc-ai-service-v2\testing\test.py", line 15, in read_pdf_text
    text += page.extract_text()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_page.py", line 1353, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
    ~~~^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_base.py", line 259, in get_object
    obj = self.pdf.get_object(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\_reader.py", line 1260, in get_object
    retval = read_object(self.stream, self)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\PyPDF2\generic\_data_structures.py", line 1080, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'\xd5' @3358736: b'\\`Vi\x9f\x10\xd5\x02\'\xd3h\xa5p:\xf1\x82\x0b\x19\xcdj\xd5"\x0b\xcc*\rL\xac\x85Z\xe0t"\xad$TP\xfc?\xc0jRwf\xff\x0f\xb0\xab\xc0B\xa4\xb0\xd3i\xb4\x128\x9dy\xc1\x04\xd8Ru\xd6`\xdbP\x9a\xa8T\xa5\xc0\xe94ZI\x1cvL'

I have managed to fix the error using pdfrw to first read the file and re-write it, but I do not feel that is the best solution.
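
For reference, a re-write step of this kind might look roughly like the sketch below. It is not necessarily what was done here: the helper names and the ".rewritten" suffix are hypothetical, and whether a round trip through pdfrw repairs a given file depends on how the file is damaged.

```python
from pathlib import Path


def rewritten_path(source: str) -> Path:
    """Hypothetical naming helper: 'book.pdf' -> 'book.rewritten.pdf'."""
    path = Path(source)
    return path.with_name(f"{path.stem}.rewritten{path.suffix}")


def rewrite_with_pdfrw(source: str) -> Path:
    """Parse `source` with pdfrw and write it back out under a new name."""
    # Imported lazily so the naming helper works without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    target = rewritten_path(source)
    # pdfrw re-serialises the parsed document, which rebuilds the object table.
    PdfWriter(str(target), trailer=PdfReader(source)).write()
    return target
```

The rewritten copy would then be passed to read_pdf_text in place of the original file.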

stefan6419846 commented on August 16, 2024

PyPDF2 is no longer supported, and you should definitely switch to pypdf. Nevertheless, there still seems to be an issue with at least page 193:

Ignoring wrong pointing object 5610 0 (offset 3356523)
Ignoring wrong pointing object 5615 0 (offset 3354947)
Object 5615 0 not defined.
Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf/run.py", line 22, in <module>
    pdf_text = read_pdf_text(file_path)
  File "/home/stefan/tmp/pypdf/run.py", line 16, in read_pdf_text
    text += page.extract_text()
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/stefan/tmp/pypdf/pypdf/_page.py", line 1604, in _extract_text
    obj[content_key].get_object() if isinstance(content_key, str) else obj
AttributeError: 'NoneType' object has no attribute 'get_object'

The corresponding page cannot be rendered either and looks odd, so this seems to be a general issue with your PDF file:

[Screenshot: ksnip_20240521-164222]
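
To find out which pages of such a file are affected, extraction can be wrapped per page so that one damaged page does not abort the whole run. The following is a minimal sketch under assumptions, not part of pypdf itself: extract_text_per_page is a hypothetical helper, and it only catches errors raised inside extract_text(), not errors raised while the reader opens the file or enumerates the pages.

```python
from typing import Iterable, List, Tuple


def extract_text_per_page(pages: Iterable) -> Tuple[str, List[Tuple[int, str]]]:
    """Extract text page by page, recording failing pages instead of aborting.

    `pages` is any iterable of objects with an `extract_text()` method,
    for example `pypdf.PdfReader(path).pages`.
    """
    chunks: List[str] = []
    failures: List[Tuple[int, str]] = []
    for number, page in enumerate(pages, start=1):
        try:
            chunks.append(page.extract_text() or "")
        except Exception as exc:  # broad on purpose: damaged files fail in many ways
            failures.append((number, f"{type(exc).__name__}: {exc}"))
    return "\n".join(chunks), failures


def read_pdf_text(file_path: str) -> str:
    # Imported here so the helper above stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    text, failures = extract_text_per_page(reader.pages)
    for number, reason in failures:
        print(f"skipped page {number}: {reason}")
    return text
```

With this variant, the text of the readable pages is still returned, and the skipped page numbers (1-based) are printed with the error that caused them.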

SalomonKisters commented on August 16, 2024

Ah okay, fair enough. I can try it with pypdf, although llamaindex is still using PyPDF2, so I don't know if that will help in my case.
Anyway, thanks for the quick help!
