py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Home Page: https://pypdf.readthedocs.io/en/latest/

License: Other

Python 99.93% Makefile 0.07% Shell 0.01%
pypdf2 pdf python pdf-parser pdf-parsing pdf-manipulation pdf-documents help-wanted

pypdf's Introduction

pypdf

pypdf is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

See pdfly for a CLI application that uses pypdf to interact with PDFs.

Installation

Install pypdf using pip:

pip install pypdf

For using pypdf with AES encryption or decryption, install extra dependencies:

pip install pypdf[crypto]

NOTE: pypdf 3.1.0 and above include significant improvements compared to previous versions. Please refer to the migration guide for more information.

Usage

from pypdf import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

pypdf can do a lot more, e.g. splitting, merging, reading and creating annotations, decrypting and encrypting, and more. Check out the documentation for additional usage examples!

For questions and answers, visit StackOverflow (tagged with pypdf).

Contributions

Maintaining pypdf is a collaborative effort. You can support the project by writing documentation, helping to narrow down issues, and submitting code. See the CONTRIBUTING.md file for more information.

Q&A

The experience pypdf users have covers the whole range from beginners who want to make their lives easier to experts who developed software before PDF existed. You can contribute to the pypdf community by answering questions on StackOverflow, helping in discussions, and asking users who report issues for MCVEs (code + example PDF!).

Issues

A good bug ticket includes an MCVE, a minimal, complete, verifiable example. For pypdf, this means that you must upload a PDF that causes the bug to occur as well as the code you're executing with all of the output. Use print(pypdf.__version__) to tell us which version you're using.

Code

All code contributions are welcome, but smaller ones have a better chance of being included in a timely manner. Adding unit tests for new features or test cases for bugs you've fixed helps us to ensure that the Pull Request (PR) is fine.

pypdf includes a test suite which can be executed with pytest:

$ pytest
===================== test session starts =====================
platform linux -- Python 3.6.15, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/moose/GitHub/Martin/pypdf
plugins: cov-3.0.0
collected 233 items

tests/test_basic_features.py ..                         [  0%]
tests/test_constants.py .                               [  1%]
tests/test_filters.py .................x.....           [ 11%]
tests/test_generic.py ................................. [ 25%]
.............                                           [ 30%]
tests/test_javascript.py ..                             [ 31%]
tests/test_merger.py .                                  [ 32%]
tests/test_page.py .........................            [ 42%]
tests/test_pagerange.py ................                [ 49%]
tests/test_papersizes.py ..................             [ 57%]
tests/test_reader.py .................................. [ 72%]
...............                                         [ 78%]
tests/test_utils.py ....................                [ 87%]
tests/test_workflows.py ..........                      [ 91%]
tests/test_writer.py .................                  [ 98%]
tests/test_xmp.py ...                                   [100%]

========== 232 passed, 1 xfailed, 1 warning in 4.52s ==========

pypdf's Issues

PDF split with links

I have a 483 page PDF that I use for testing (manual). The problem is that when I try to split the document, it takes almost 2 min to process the first handful of pages, and then 3 seconds to process the remaining 450+.

Pages 3-6 contain a table of contents with links to other parts of the PDF. When I take these few pages out of the document, it takes 3-4 seconds to split the 483 pages.

Any ideas why it's hanging on the table of contents (with links)?

No way to redirect warning messages to standard python logging implementation

PdfFileReader sends all warning messages to stderr (or some other file that you can specify). Normally, you can redirect warnings into the logging system by using logging.captureWarnings. PdfFileReader stops this by replacing the showWarning function in the constructor.

The only problem is that this will break backwards compatibility for the PdfFileReader constructor.
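
For reference, this is roughly how the standard library routes warnings into logging when nothing replaces the warning hooks, which is the behaviour the report asks PdfFileReader to preserve. It is a minimal sketch using only the standard library; the ListHandler class and the warning text are illustrative assumptions, not PyPDF2 code.

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Collect log messages in a list (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
py_warnings_logger = logging.getLogger("py.warnings")
py_warnings_logger.addHandler(handler)

# Redirect warnings.warn() output to the "py.warnings" logger.
logging.captureWarnings(True)
try:
    with warnings.catch_warnings():
        # Make sure the warning is not suppressed by an existing filter.
        warnings.simplefilter("always")
        warnings.warn("Xref table not zero-indexed", UserWarning)
finally:
    # Restore the default behaviour so other code is unaffected.
    logging.captureWarnings(False)
    py_warnings_logger.removeHandler(handler)
```

After this runs, handler.messages holds the formatted warning. A library that replaces warnings.showwarning in a constructor bypasses this path.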

Encryption/Decryption in Python 3

This seems to be the only feature that doesn't work under Python 3. There are several encryption algorithms; it is probably just a matter of using utils.py correctly to avoid TypeErrors.

DCT Filter

PyPDF2 currently lacks a filter for DCT compression (true? Even as maintainers, we sometimes forget everything there is to know about PyPDF2). How important is it that we add this? There certainly are instances "in the wild" of PDF which use DCT compression; should we care?

[See also internal Issue756.]

Some valid but nonstandard indirect objects cause PyPDF2 to fail

The issue is something like this: /FontFile2 11 0 R

There is more than one space between the tokens there, which causes PyPDF2 to fail:

/PyPDF2/generic.py", line 256, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''

This should be supported anyway.
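
A whitespace-tolerant reader could look something like the sketch below. It is not PyPDF2's readFromStream; parse_indirect_reference is a hypothetical helper that only illustrates accepting any run of PDF whitespace between the object number, the generation number, and the R keyword.

```python
import re

# PDF whitespace characters: NUL, tab, line feed, form feed,
# carriage return, space.
_PDF_WS = rb"[\x00\t\n\x0c\r ]"

# "<object number> <generation number> R" with one or more whitespace
# characters between the tokens.
_INDIRECT_REF = re.compile(rb"(\d+)" + _PDF_WS + rb"+(\d+)" + _PDF_WS + rb"+R")


def parse_indirect_reference(data):
    """Return (object_number, generation) from bytes such as b'11  0 R'.

    Hypothetical helper for illustration; raises ValueError on
    malformed input.
    """
    match = _INDIRECT_REF.fullmatch(data.strip(b"\x00\t\n\x0c\r "))
    if match is None:
        raise ValueError("not an indirect reference: %r" % (data,))
    return int(match.group(1)), int(match.group(2))
```

With this approach, b"11 0 R" and b"11   0  R" both yield (11, 0), so extra spaces no longer reach int() as an empty string.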

PdfReadError: EOF marker not found

We are getting an error like PdfReadError: EOF marker not found.

Scenario: we concatenate some PDFs using pyPDF. The input can be a PrinceXML-supplied PDF, a normal PDF, etc.
No issues here.

PrinceXML PDF + Adobe PDF = pyPDF-generated concatenated PDF. This works fine.

The issue happens when we take a pyPDF-concatenated PDF of the above type and concatenate it with another normal PDF, again using pyPDF.

PrinceXML PDF + some pyPDF-generated PDF = pyPDF-generated concatenated PDF (expected). This works in most cases, but in some cases it does not. pyPDF complains that the EOF marker of the pyPDF-generated PDF was not found, even though that file was generated by pyPDF itself. Did pyPDF fail to write the EOF marker in some unusual cases?

Can anyone look at this bug? It happens quite rarely, but some online sites handle this same PDF without problems. How can I attach the PDF to GitHub for inspection?

A relevant question can be seen here:
http://stackoverflow.com/questions/15177587/merge-non-standard-pdfs-with-pypdf

API compatibility with PyPDF

Hi,

Is PyPDF2 fully API compatible with PyPDF? I'm trying to get PyPDF in Fedora replaced by PyPDF2, but we must know whether it will break anything, or else fix the applications accordingly.

Thanks!

PdfFileMerger.addBookmark() should return the newly added bookmark

PdfFileWriter.addBookmark() returns the newly added bookmark, so it can be used as the parent in subsequent addBookmark() calls in order to create nested bookmarks.

For consistency, PdfFileMerger.addBookmark() should function similarly, however it does not, as it doesn't return anything, thus making it impossible to create nested bookmarks with PdfFileMerger.

Pdf form overlap issue

When merging two PDFs, where one has form elements and the other does not, the "check box" form element overwrites any text that is present.

Is there a workaround for this?

The example below shows the text "blah" overlapped by check box element.
[Image: check box form element overlapping the text]

PyPDF2 - AutoCad generated PDF and Watermark

Hi

Some time ago I reported a problem regarding AutoCAD-generated PDFs.
That problem was solved.

I have encountered a new problem which I believe is also related to AutoCAD-generated PDFs.

This time I'm adding a watermark to an existing PDF.
I am able to add this watermark file (created using pyfpdf) to most of the files:

a = PdfFileReader(open(filein, "rb")).getPage(0)
watermark = PdfFileReader(file(r'c:\temp\test.pdf', 'rb')).getPage(0)
a.mergePage(watermark)

filein is an AutoCAD-generated PDF.

This fails:

a.mergePage(watermark)

File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
self._mergePage(page2)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1644, in _mergePage
originalContent, self.pdf))
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1557, in _pushPopGS
stream = ContentStream(contents, pdf)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1986, in __init__
self.__parseContentStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 2025, in __parseContentStream
operands.append(readObject(stream, None))
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 55, in readObject
return readStringFromStream(stream)
File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 370, in readStringFromStream
raise utils.PdfReadError("Unexpected escaped string")
PyPDF2.utils.PdfReadError: Unexpected escaped string

Looks very similar to the last problem I reported.

Olav

Scaling in python 2.6

I cannot seem to get scaling to work.
If I submit a float or int to "scaleBy":
TypeError: Cannot convert float to Decimal. First convert the float to a string
If I submit a string:
TypeError: can't multiply sequence by non-int of type 'float'
If I submit a Decimal:
TypeError: unsupported operand type(s) for *: 'float' and 'Decimal'
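
All three errors come from mixing Decimal with float (or str) in arithmetic, which Python refuses to do implicitly. One possible workaround, sketched below, is to convert everything to float before multiplying. scale_box is a hypothetical helper for illustration, not PyPDF2's scaleBy, and it assumes float precision is acceptable for page coordinates.

```python
from decimal import Decimal


def scale_box(box, factor):
    """Scale [x0, y0, x1, y1] by factor; accepts int, float, str or Decimal.

    Hypothetical helper: every value is converted to float first, so
    mixed Decimal/float arithmetic never happens.
    """
    factor = float(factor)
    return [float(value) * factor for value in box]


# Mixing the two types directly is what triggers the TypeError.
try:
    Decimal("612") * 0.5
except TypeError as exc:
    mixed_error = str(exc)

scaled = scale_box(
    [Decimal("0"), Decimal("0"), Decimal("612"), Decimal("792")], 0.5
)
```

The same function accepts a Decimal or string factor, for example scale_box(box, Decimal("0.5")) or scale_box(box, "0.5").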

Problem with AutoCad generated PDF

Hi

I am trying to use the PyPDF2 module to merge a lot of PDF files. For some of the PDF files it fails.
The failing PDF files are files generated directly from AutoCAD.


Traceback (most recent call last):
  File "", line 37, in
  File "", line 29, in main
  File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 168, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Python27\lib\site-packages\PyPDF2\merger.py", line 116, in merge
    pages = (0, pdfr.getNumPages())


My script:

def main():
    import os
    from PyPDF2 import PdfFileReader, PdfFileMerger
    doclistdir = r'xxxxxxxxxxxxxxxxxx'
    doclistfile = open(r'xxxxxxxxxx\list.txt', 'r')
    doclist = doclistfile.readlines()
    merger = PdfFileMerger()

    for doc in doclist:
        # Each line of list.txt names one PDF inside doclistdir.
        pdfdoc = os.path.join(doclistdir, doc.strip())
        mergerelement = open(pdfdoc, 'rb')
        #print 'Processing:  ' + pdfdoc
        merger.append(mergerelement)

    output = open(os.path.join(doclistdir, "document-output.pdf"), "wb")
    merger.write(output)

if __name__ == '__main__':
    main()


regards
Olav

HTML links not clickable after merge

I have two PDFs to merge, one with HTML links and another with just plain watermarks.

After merging, the links are not working, and if I reverse the merge sequence, the watermarks hide the links.

Here is my code:

    bg = PdfFileReader(file("/tmp/bg.pdf", "rb")) #plain watermarks
    fg = PdfFileReader(file("/tmp/fg.pdf", "rb"))   #text with links

    page = bg.getPage(0)
    page.mergePage(fg.getPage(0))

    output = PdfFileWriter()
    output.addPage(page)

    ostream = file('/tmp/out.pdf', 'wb')
    output.write(ostream)
    ostream.close()

Complete operator for method removeImages

Hi,

Thank you for adding the removeText and removeImages methods.
For removeImages, here is a small correction so that the content is handled correctly:

                if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i',
                        'gs', 'W', 'b', 's', 'S', 'f', 'F', 'n', 'm', 'l',
                        'c', 'v', 'y', 'h' , 'B', 'Do', 'sh'] or \
                    operator in [b'cm', b'w', b'J', b'j', b'M', b'd', b'ri', b'i',
                        b'gs', b'W', b'b', b's', b'S', b'f', b'F', b'n', b'm', b'l',
                        b'c', b'v', b'y', b'h', b'B', b'Do', b'sh']:
                    continue
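
The duplicated str and bytes lists could be collapsed into a single set with a small normalising helper. This is only a sketch of that idea; SKIPPED_OPERATORS and is_skipped_operator are hypothetical names, and the operator list is copied from the snippet in this report.

```python
# Operators copied from the report; a frozenset gives constant-time
# membership tests.
SKIPPED_OPERATORS = frozenset([
    "cm", "w", "J", "j", "M", "d", "ri", "i", "gs", "W", "b", "s", "S",
    "f", "F", "n", "m", "l", "c", "v", "y", "h", "B", "Do", "sh",
])


def is_skipped_operator(operator):
    """Return True for str or bytes operators listed in SKIPPED_OPERATORS."""
    if isinstance(operator, bytes):
        # Content stream operators are ASCII, so decoding is safe here.
        operator = operator.decode("ascii", "replace")
    return operator in SKIPPED_OPERATORS
```

The loop body would then reduce to "if is_skipped_operator(operator): continue", whichever string type the parser produced.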

retaining bookmarks using merge

When using the merge function with two files and using the import_bookmarks=True option, the bookmarks are always off by 1 page.

The issue is further compounded by different .pdf readers. I'm seeing in Adobe the bookmarks are off by 1 page (one page behind) and in other readers like PDF Complete - they are correct.

I made the following adjustment in the source code (merger.py) _associate_bookmarks_to_pages --
for p in pages:
    if bp.getObject() == p.pagedata.getObject():
        pageno = p.id-1 ########### the -1 was added

Everything looks great in Adobe, but now in PDF Complete the file is off by 1 page... fortunately I only support Adobe.

After further inspection -- although bookmarks work -- the bookmarks are highlighted incorrectly when scrolling through pages. They are off by 1.

I checked the file using the getOutlines() function and saw the file was structured incorrectly with the "/Page" key being off for each item:

Eg:
[......,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 6}, .... ]

Should read this:
[....,{'/Title': u'Summary Graph', '/Left': 0, '/Type': '/XYZ', '/Top': 0, '/Zoom': 0, '/Page': 7}, ...]

And yes I do understand pages start at "0" !

What would I need to fix the root '/Page' key? Would someone be able to help me?

Bad arguments to str() in u_

*** Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32. ***

Traceback (most recent call last):
File "test.py", line 1, in <module>
import PyPDF2
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
from .pdf import PdfFileReader, PdfFileWriter
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
from .generic import *
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\generic.py", line 1042, in <module>
u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
File "C:\Program Files (x86)\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)

PyPDF2 should not overwrite warnings.formatwarning.

Hello,

PyPDF2 1.2.0 overwrites warnings.formatwarning with its own implementation (utils._formatwarning) in pdf.py line 74:

warnings.formatwarning = utils._formatwarning

Unfortunately this may cause severe side-effects if PyPDF2 is imported in a larger application. In our case the PyPDF2 implementation of formatwarning caused IndexErrors whenever a warning was raised somewhere else (and the filename argument was not to the formatter's liking).

Personally, I do not think that it is a good idea for a library to interfere with the global logging/warning infrastructure.
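
One way a library can report problems without touching global state is to emit warnings under its own category and leave formatting and filtering to the application. The sketch below assumes a category named PdfReadWarning and a stand-in read_object function; both are illustrative, not the actual PyPDF2 code.

```python
import warnings


class PdfReadWarning(UserWarning):
    """Illustrative warning category owned by the library."""


def read_object():
    """Stand-in for a parsing routine that hit something recoverable."""
    warnings.warn(
        "Superfluous whitespace found in object header",
        PdfReadWarning,
        stacklevel=2,
    )
    return None


original_formatwarning = warnings.formatwarning

# The application, not the library, decides what happens to the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_object()

formatwarning_untouched = warnings.formatwarning is original_formatwarning
```

An application that wants these warnings silenced can call warnings.filterwarnings("ignore", category=PdfReadWarning) without affecting warnings raised by other packages.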

P.S.: Apart from this problem, we have been using PyPDF2 successfully for some time now. Nice piece of software!

Whitespace issues in extract_text()

I am not able to read the text properly: formatting and spaces are not preserved during extraction:

PreemptiveInformationExtractionusingUnrestrictedRelationDiscoveryYusukeShinyamaSatoshiSekineNewYorkUniversity715,Broadway,7thFloorNewYork,NY,10003fyusuke,sekineg@cs.nyu.eduAbstractWearetryingtoextendtheboundaryofInformationExtraction(IE)systems.Ex-istingIEsystemsrequirealotoftimeandhumanefforttotuneforanewscenario.

Is it true that PyPDF2 is not format-aware, as claimed here: http://victorwyee.com/python/convert-pdf-to-text-pypdf-pdfminer-first-impression/

KeyError: '/Type' when merging pages

Merging 2 pdfs. The first pdf is from paperport 11 (some old program which may not support pdf structure correctly?), I initially needed to apply the fix from #34 (to fix EOF error). The next issue I encountered is in the method: _flatten (in pdf.py) where "/Type" isn't present in the pages dictionary.
I made the following change:

def _flatten(self, pages=None, inherit=None, indirectRef=None):
    ...
    ...
    # this is the change I made; default t = '/Pages'. Is this the correct thing to do?
    t = "/Pages"
    if "/Type" in pages:
        t = pages["/Type"]
    ...

Should I commit a fix for this (and make it conditional on strict parameter)? Or is there a better way to pick a type?
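
As one possible answer to the question of how to pick a type, the sketch below guesses from the presence of /Kids and raises in strict mode. resolve_node_type is a hypothetical helper that works on any mapping. The /Kids heuristic is an assumption made for illustration, not confirmed PyPDF2 behaviour.

```python
def resolve_node_type(node, strict=False):
    """Return the node's /Type, guessing it when the entry is absent.

    Hypothetical helper: `node` is any mapping. With strict=True a
    missing /Type raises instead of being guessed.
    """
    if "/Type" in node:
        return node["/Type"]
    if strict:
        raise ValueError("page tree node has no /Type entry")
    # Assumption: a node that lists /Kids is an interior /Pages node;
    # anything else is treated as a leaf /Page.
    return "/Pages" if "/Kids" in node else "/Page"
```

Inside _flatten this would replace the fixed default, for example t = resolve_node_type(pages, strict=self.strict), assuming the reader exposes a strict attribute.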

Can't getData() from /Contents List

I'm trying to dig deep into some PDFs by calling getData directly on part of a page (I am then parsing that data to find coordinates for a bit of text).

This worked for me in the past with essentially:

page = PdfFileReader(inpdf).getPage(0)
text = page.getContents().getData()   #<-- or page["/Contents"].getData()

but with my new PDFs, I am getting an error like this:
"AttributeError: 'ArrayObject' object has no attribute 'getData'"

Digging in, it looks like my old PDF was structured like this (print page) with a single IndirectObject in the contents.

{'/Contents': IndirectObject(14, 0),
 '/MediaBox': [0, 0, 662.40000, 792],
 '/Parent': IndirectObject(1, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(10, 0),
                          '/F4': IndirectObject(7, 0),
                          '/F5': IndirectObject(4, 0)},
                '/ProcSet': IndirectObject(13, 0),
                '/XObject': {}},
 '/Type': '/Page'}

Then page.getContents() returns:

{'/Filter': '/FlateDecode'}

while my new PDF is structured like this with a list of IndirectObjects in the contents:

{'/Contents': [IndirectObject(11, 0),
               IndirectObject(12, 0),
               IndirectObject(13, 0),
               IndirectObject(14, 0),
               IndirectObject(15, 0),
               IndirectObject(16, 0),
               IndirectObject(17, 0),
               IndirectObject(18, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': IndirectObject(5, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(24, 0),
                          '/F4': IndirectObject(26, 0),
                          '/F6': IndirectObject(29, 0),
                          '/F7': IndirectObject(30, 0)},
                '/ProcSet': IndirectObject(31, 0),
                '/XObject': {}},
 '/Rotate': 0,
 '/Type': '/Page'}

then page.getContents() returns:

[IndirectObject(11, 0),
 IndirectObject(12, 0),
 IndirectObject(13, 0),
 IndirectObject(14, 0),
 IndirectObject(15, 0),
 IndirectObject(16, 0),
 IndirectObject(17, 0),
 IndirectObject(18, 0)]

How do I get at the underlying data of /Contents? going after the pieces of the list with page.getContents()[0] just returns the name of the object and I can't use getData() on that. I can't tell if this is a bug (caused by having a list as the contents) or if I am missing some feature.
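
A possible way to handle both layouts is to resolve each entry with getObject() and join the results. The sketch below assumes legacy PyPDF2 semantics: getObject() resolves an indirect reference and returns the object itself otherwise, and stream objects expose getData(). contents_data is a hypothetical helper, not part of the library.

```python
def contents_data(contents):
    """Concatenate the data of a /Contents entry (single stream or array).

    Assumes legacy PyPDF2 semantics: getObject() resolves an indirect
    reference (and returns the object itself otherwise), and stream
    objects expose getData() returning bytes.
    """
    contents = contents.getObject()
    if isinstance(contents, (list, tuple)):
        # An array of references: resolve each one to its stream.
        streams = [item.getObject() for item in contents]
    else:
        streams = [contents]
    # Join with a newline so tokens at stream boundaries stay separated.
    return b"\n".join(stream.getData() for stream in streams)
```

Under those assumptions, text = contents_data(page["/Contents"]) would work for both the old and the new PDF.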

Will hang on invalid PDFs

Doing some testing, I noticed that PyPDF2 will hang if it encounters an invalid PDF… for example, the skipOverComment function:

def skipOverComment(stream):
    tok = stream.read(1)
    stream.seek(-1, 1)
    if tok == b_('%'):
        while tok not in (b_('\n'), b_('\r')):
            tok = stream.read(1)

Will hang indefinitely.
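
The loop never ends because read(1) keeps returning an empty value once the stream is exhausted, and an empty value is never a newline. A minimal sketch of an end-of-data check follows; it uses plain bytes literals instead of the b_() helper and assumes a binary stream such as io.BytesIO.

```python
import io


def skip_over_comment(stream):
    """Advance past a '%' comment, stopping at end of line or end of data."""
    tok = stream.read(1)
    if tok:
        # Step back only if a byte was actually read.
        stream.seek(-1, 1)
    if tok == b"%":
        while tok not in (b"\n", b"\r"):
            tok = stream.read(1)
            if not tok:
                # read() returns b"" at end of stream; without this check
                # the loop never terminates on a truncated file.
                break


truncated = io.BytesIO(b"% comment without a trailing newline")
skip_over_comment(truncated)
remaining = truncated.read()
```

Here remaining ends up empty and the call returns instead of spinning.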

I would propose three courses of action:

  1. Wrap the stream in a class which will raise an exception after a certain number of empty reads; ex:

class SafeStream(object):
    def __init__(self, stream):
        self.stream = stream
        self.seek = stream.seek
        self.tell = stream.tell
        self._empty_reads = 0

    def read(self, *args):
        res = self.stream.read(*args)
        # "not res" covers both "" and b"" at end of stream
        if not res:
             self._empty_reads += 1
             if self._empty_reads > 1000:
                 raise Exception("too many empty reads")
        else:
             self._empty_reads = 0
        return res

  2. Add a script for automating fuzz testing to the repo

  3. Fix the bugs as the script from step (2) finds them

What do you think? Would you be open to patches for those?

Can't read pdf

I get a mysterious error with the PDF reader using Python 3 on the file
"Werner - Fragen und Antworten zu Werkstoffen.pdf".
My code:

import fnmatch
import os
from PyPDF2 import PdfFileReader

for file in os.listdir('.'):
    if fnmatch.fnmatch(file,'*.pdf'):
        print("File: "+file)
        foo = PdfFileReader(open(file,"rb"))

Error:

File: Werner - Fragen und Antworten zu Werkstoffen.pdf
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    foo = PdfFileReader(open(file,"rb"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 684, in __init__
    self.read(stream)
  File "/usr/lib/python3.3/site-packages/PyPDF2/pdf.py", line 1236, in read
    streamData = BytesIO(xrefstream.getData())
  File "/usr/lib/python3.3/site-packages/PyPDF2/generic.py", line 834, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in decode
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
  File "/usr/lib/python3.3/site-packages/PyPDF2/filters.py", line 121, in <listcomp>
    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
TypeError: ord() expected string of length 1, but int found

Is something wrong with my filename, or why does this error occur?
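
The traceback points at ord() being applied to the items of a bytes slice rather than at the filename: under Python 3, iterating over bytes already yields integers. A minimal sketch of a version-independent replacement for that line follows; row_values is a hypothetical helper name.

```python
def row_values(data, row, rowlength):
    """Return the byte values of one row as a list of ints.

    bytearray() accepts both Python 3 bytes and Python 2 str, and
    iterating over it always yields integers, so ord() is not needed.
    """
    start = row * rowlength
    return list(bytearray(data[start:start + rowlength]))
```

For example, row_values(b"\x02abc\x02def", 1, 4) gives the integer values of the second four-byte row.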

Speed up parser

Currently the parser is quite slow, even for moderately sized PDFs. When I get a bit of time, I'm going to investigate different ways it could be sped up. Right now (pending some profiling, obviously) I suspect this is going to involve re-writing some of the core parser loops in something lower level like Cython. I'm looking into options to see if it's possible to write in a language which will be able to compile back to vanilla Python for the benefit of PyPy and friends.

I'm opening this issue to start discussion on the matter, and see if you've got any strong feelings either way.

Add method ignoreImage

Hi,

Like in my last post, "Add method ignoreText", I need to extract only the text from a PDF. I tried several products for extracting text from PDFs, but all of them return the text as a plain string; none of them keeps the text position and fonts. I think PyPDF is a good tool for doing that.

I added this method to pdf.py in class PdfFileWriter:

    def ignoreImage(self, ignoreByteStringObject=False):
        pages = self.getObject(self._pages)['/Kids']
        for j in range(len(pages)):
            page = pages[j]
            pageRef = self.getObject(page)
            content = pageRef['/Contents'].getObject()
            if not isinstance(content, ContentStream):
                content = ContentStream(content, pageRef)


        _operations = []
        seq_graphics = False
        for operands, operator in content.operations:
            if operator == "Tj":
                text = operands[0]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[0] = TextStringObject()
            elif operator == "'":
                text = operands[0]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[0] = TextStringObject()
            elif operator == '"':
                text = operands[2]
                if ignoreByteStringObject:
                    if not isinstance(text, TextStringObject):
                        operands[2] = TextStringObject()
            elif operator == "TJ":
                for i in range(len(operands[0])):
                    if ignoreByteStringObject:
                        if not isinstance(operands[0][i], TextStringObject):
                            operands[0][i] = TextStringObject()

            if operator == 'q':
                seq_graphics = True
            if operator == 'Q':
                seq_graphics = False
            if seq_graphics:
                if operator in ['cm', 'w', 'J', 'j', 'M', 'd', 'ri', 'i', 'gs',
                        'W','n', 'f', 'm', 'l', 'cm', 'Do', 'sh', 'S']:
                    continue
            if operator == 're':
                continue
            _operations.append((operands, operator))

        content.operations = _operations
        pageRef.__setitem__(NameObject('/Contents'), content)

If you think this method is helpful, can you add it?

Thanks.

Encountered a valid PDF file, but PyPDF2 fails on it

That file can be decompressed by pdftk, but PyPDF2's FlateDecode fails:

  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1751, in mergePage
    self._mergePage(page2)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1801, in _mergePage
    originalContent, self.pdf))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 1714, in _pushPopGS
    stream = ContentStream(contents, pdf)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/pdf.py", line 2158, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/generic.py", line 850, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 310, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 102, in decode
    data = decompress(data)
  File "/Users/ulion/Develop/gapps/email-pdf-sign.deploy/pdfsign/PyPDF2/filters.py", line 47, in decompress
    return zlib.decompress(data)
zlib.error: Error -5 while decompressing data: incomplete or truncated stream

Here's the data to be decompressed (repr output):

'H\x89\xedW\xcbr\x1b\xb7\x12\xdd\xf3+f)W\x85#\xbc\x1fZ]\x8a\x0f\xc5%\x9a\x94I:\xded\xa3X\xd4\xe3F\x12mIN\xe2\xbf\xbf\xc0\x003\x83\x19\x00M\xd9I*Qr7*q\x80n4\xbaO\x9f>8\xde\x0c\x0eg\xb8\xc0\xa8D\xaa\xd8\\\x0e\x84\x94%Q\x05\xd3\xa4\xd4\xa2\xd8L\x8a\x83\xc9\xeeU\xb1\xf9\xef`hvp\xb3\xb2\xf9P\x1c|8\xaa>\x1d\xceHk\x88UIX\x81\xac\t\xaa6aM\x84\x92\xc4\xef4G\xe0\x121\xbbsH\x15/5)8-\xa5;b\xd4\x9c\xa0\xb4?\xe2\xa6\xfa\x84J\x8c\x04r_\x1e\xb2\x9b\xaa\xff\xef\xaf\xeau\x8c%&\xd5\xb7\xa2\x8d\x9c\x0bg2\tL\xcec\x8b\xa7`y\xfb\x18\xaf\x1f\x15\xc1\x06\x8a\xa3 \x87\x8d\tB\xd4\x99\xbc\x89N\xec\xdcj\x18,\x13\x84Y\xee\x16n\xc7f\x07\xaf\x13\x99\t\xc9-\x8f>>3\x80\xaa`\xc4V\x8b\xf0\xd2T\xb9\x02\x85)U\x95~V\xed]\x07v\xd7;\xef\xb7^l\x8beQd\x13\x1bF9Kow\x8bw}\xd3\xd0\xf2\xcd\xf6\xe2\xe6<\xda\xb0\xaeo\xe5\x12\xf2!\x8cl{\xf1\xf9v[}\x98n\x06\x9f\n\xac\x95\xc1\'*\x86D\x95\xaa\xa8\xfep\xa5\x8a\x0fw\xc5\xe1\xcd\x1d.&\xbb\xe2\xed\xe0\xb8\xd7\x13D\x97\xdc\xc0\x95\xc9RTI8 X\xd2\xa6\x0eT\x11\x16\xc5\xb9>on\xc8\x99[\xbe\x8d\xda\xe8\xe7\xa6\x18X\xc6W\x1dE\xfb\x7f\t\xc1\x19\xd9~\xb7\'\xa2\xcfQD7O\x91\xc3\xac9E\x08\xde0\x0e\xba\tw\xcb\n&\xe11\xf0\xf1\xd3\xf9E|\xad\xce!#8\x08M\x10R\xd5on\xc9fhHI\x92\xaad\xe38\x94\x90=\xb6\xf7\xd1\xc1OG~\xddxiYm\xdc\x14Vr*\x02\xc0\xba\xe5\x8fq\xdd\xc3c\xae\xee\xe3$\xfdx\x90\xc9Iw\xd7+\x17\x8e\xb9N\x18\xcer\xb4\x8e\x02\x7f\x1dV\x1d\xce\xd7|z2\x9a\xc3[f\xa0\xff\xc5h1\x9e\xd6\x0e\xa8L9X\x9eMW{\xa0\xb3\xdc\x13\xe5|\x9c\xd8\xe0\xf3\xe6\x0f\x99\xcf\x16?\xbe\xea\x16,\x9d\x1c\x9f\x11\x93\xc7z[L\xd0\t\x10\'\xfa6\xd3\x87OM\xa1\x82a8\x9a\xc1y\x9c\x83\xab\xa3\xcd\x9e\x1c/\xa6\x9b\xda\x01V\x9e\xd8\xdf\x87\xe9Y\x9d\xd6\x14\x88J\xa6\x8a_\x0b\xc2lg0D\xed$\x96\x0c\x97\xe6\xb6RX\xc2{\xd8\x0e\xd6\x96\x02)+\xb9(\x98\xe0\xa5\x1b\xd2q.\x9f\xa2\xc4uVC\xe0\xef\xee\xbbY98^\xbe\xab\x91\x13\x13`\'\xff\xc3\xf0\xa2\x81\x99+!\xad9y\xbcK\xb4p\xaa\xb2\xad\x8ay^\xe9\xc6\xcb\xf9|:]\x80\\}:\x1f\x9dLCd5\xea\xe9\xeczw\xbf\xed\xbaMu\xbfs#\x98Hu>x4c*\x91\xb3`\x834c2\xf6\xd0i\xf187\xbf=\xa72\xcf\x8f\xddE\xaa\xd0\xdeHu\
xaa\xb45L\xd2\x94\x9b`\xdc\xdb(\xe1\xfd\xa2;P\xc63\xe16l\xed\xed}\x94\x97\x9fc\x930\x94\xabmd\x11\xce\xb7\xffd.\x92\x18\xf2\x99\xcbn/\xe2\xdb\x82 \x7f\x043\xd5\xb9n\xce6\x00A\x99i"\xf3\x85H\xc1\xdc\xa7\x9d\x0b\x98RU2\xc30%%m3\x1d\x7f\x8e<|I\x8a\x90\x96~]-)\xea=/&\xab\x86=z\x94g\xcc\x05N\xb1\xe4\xe9\xbb\x15H\xb3\'\xcd](\xf5\xcb~tifo\xb1\xa7{\xa5H\xe0{O\xf7\nDc\x9b\xb0\'0g\xb8\x19kMGLrc\xca\xa7\xf1\xd7L\xe5|\x1e>g\xcds\xba%U\x98\x10\xd8\xd71\xf2B\xb5\n\xe3r{q\x15\x8f\xd1\x1c\xd4P\x881C\xb4\xdc>"\x03\xc2\xdd\xc0\x91\xc6s\xea\xf2\xf2\xa6\xa9\xbb\xee>p\xbd\xf7\xbf\xa6\xee\x84\xf8+Ra\xc1\x17T?~\x19<>f\xca\x07\xb7sg>\xb6\xd2\xb4\xd1X~\x1e\xb2\xce<\\\xedr\x95\xc9\x80\xa7\xd9\x9f\xd6.\x9b\xd8_\x07:\x0f\x7fr\x07j\x02W\x82b\x8d\xe2\x0e\xfc\xc6,\xe4Zr_\x12\xfe\xa6\xed\xc7\x8dH4}7$\xc4\xb3\xfc\xc1\xfac<G\xe0\xc4$\xa6R\x11\xb2r<\x84s@\xce\xe6\'\x1e:\t\xed\xd7\x95\xab\xf1\xa1\xae\x17\x04\xd2V*c\xaa=\x12|S\xc4\xb7\xfe%\xee\xa7\x10w\x93\x84b\x8e&a\xb8z\x14Z\xa3=\xa2o\xf41\x91\x91p\x03A\x98\xc1\x01z\x86\xea\x84\x10,s\x04\x9b\x9f\xb9g\x95\x96\xf6\x811\xa4\x8e\x9c\xb3i\xea^\xd4+I\x82dI-\xb9\x1b\x11\xc1\x82\xa7ZgB\x8f\xdc\x84f\x8e#q\xc9\x03\xb5A\xdd\xe8\xd4\xc8\xbeu\x9a\xb5\x96?[/g\x80\x17\xe2\xbc\x0c\x85\xb0X\x7f\x8e\x1b?\x93\x9a\x9d\xb2\xfaj\x86I\x10\x85\xaf\xd3\xd7\x85R\xd7\x8c\x0bT*\x15\xc42\x89[\xfc\xcbc="\xadrj\'d(-\xdf\x84\x14T\x17"\xf9\x82\xdc\xcc\xd6\xee}\x83\r\x07"\xaf\xc4\x0e~\xc8\xea\x82T\xbff\x88\xf1\xde\xfbU\xf6\xf5\xe9\xfc\xce\xbd\xf4&f\xf6\xa9\xfa\xe3*\xd11\xee|\xf3\xac\x15\x8a[\xa0QA\xac\xc1]\xf3\x9b\xe2\xd2\x9cy\xeb\x9e\xb4\x92T\xab\xd46o\xd0\xbb_\xdb\x89\x89\xe9\x95\x9d\xaa\xb7\x1e\xca\xd4B\xa2U)M\xdd\x8d\xbe\xe3\xce\xec\xc6\'.Y\xf8\xe7\x1dc\xf2`r^\x1a\x1d\xde\xe4\xa1\xfe\xdd\xe6\xe1S\x81\xcdo"LKI\xa6\xed)\x15\xa3=l\x8b\xeb\xe2}q?\xc0:H\xdc\xdd@\x1aO]\x07o\x8bO\x85I&QU\x88m23\xaaz=v\xbf\xb8nqsv\xbd\xbbwy\xe5\xaa\x99\xeb\x07cx\x08\xee\xd1\xa0\xf5\xe3\x8bR\xeaXC\xd9\xae\xb7~Y\x94\xbb\xefB\x9eC\x9e\xc9$"e\xfdd\xfc\xd6&\xc9\xb2r\x08\x1f\xf7\x89U\x13$\x10VH#4\xca:p[\x04\xe
2gn\x0eU\x97ty[,\xe3\xa8\x8a0\xae\xefG\x1b\x10\xc4\x98 p\xfd\x07\xec[T\x94\xa2~\x84e\xc6B\xed\x8a\x18X\xd4\xd2\xd9\x0c\x1b\xd4\xc7\xa7y\xcf\x99Vh\xf1\x89\xb9\x05\xa1\xefS\xa1u\x89\r\xb4H\xc5r\xceC\xbf\xd1k\x07\xf5\xef\x8e\x03nN\'*\xe1`\x88\xb5j\x8a|\xa0PB/\x86\xf9\xe6b\x8fLG\xba\x96\x8f!\xbc\x11\xc2pB\xd7\xc7\x94H\xe1\x93*\xdbJ\xce\xb3\xaa2l\xf3\x06\xe1\xdcf\xc9\xda\xc5\xa3:D8\xe1\t\x84\xef\x03\xe9\xd7O\x86\x00\xd0\x16\xa8\xe8\x0cN\x9dA\n\x1f\xbd<,S\xc2\xfbXv\xd0k\xb1\x8c*h\xf6\xb0\x8c*Q`=\xe0\x08\xcb\xb5\x83\xfaw\xc7A\x8d\xe5\xc8\xc1?\x03\xcbF\xdb\xf1\x0e\x96\x1b\xacVs\xe8w\xb1q\x02\x98\xab\xe5\xfae`\x8e\xb1V\x0eE\xec\x89\xba\xecITH~\x12a\x0b\x18bd\xafp\xb2\x99\xf7\x01\x87\xba\xe4\xd9\xb5\xe7\x92\xdaZD\xf6F\xc4\x10[\xab\x97\x8d737L\xc6\x02\xbca\x98<y}O\xdb\xc8\x8a\xfd_\x1e\xfc\t\x94\x1a\xc9\x03\x0f\xc8\x16\xe02A\xa9D\xf2\xfa\xf9\xc3\xfa\x08o\x1c\xd4\xbf;\x0e\xb8\xfb\x15;0\x10\xb7O\xcd\x17\x8e\xf0\xbe:  \xc0\xd5\x1f\xa8\x7f\x9f\x85\xef\xdf\xd5\x04\xf49M\x80^\xa2\xae\xe0Q\x13\xc8\x1e\xcb\xf3\x14\xcb\x0b\x0b\xd8\n\xae\xfd\x16\x90=\x92\xe7)\x92\xef\x99\xff\xcd9~q\x1e\x99=\xdd\x04v\xbb{\xdf\x04\x873R`d\'\xe8\xe6r`\x84\xa6M\xc0\x103\x7fW\xe1\xf7\xe0v\x0f&\xd2\xbeo\xdd1\x18l\x18\x9d\x92\xd3/j"lF\xa7S0\xff\xe3\xe5b\xb3Z\xce\x1b\xdfT\x11\x96\xed\x88?PJ{\x8c\xb6\x90g!m\xd7\x90g"\xa8b\x00x\xde\xe3\xfc\x8e1Q\xd8\x92}m\x8c\x10\x82!\xb8\x19\x9b\x98\x17\xdfJ\xc2i\xfc5\x8c\xfco\x07\xe0d5\x8a\xc3z\x1d\x98-N\xc0\xe2|?ZL\xfeZ\xb6\x8e\xa1\xcbzlMRlm\xe4\x85Ri\xf4\xb2\x1e]w\xeck\xf4\xb6\xf6{\x01\xbc\x9c\x8d7\xf8\x9b\x01l\x02#,\xb8\xc7\xdd@\nQ\xf6\xafu8\xa3\x85\xb4\xd1\x1a\x00ss\x88\xb0y(\x89c\xd9\xb3\xa5y\xa1\x12\xa1$\x196\x97e\x92\xb9\x07L\x85\xdb\xca\xbf\xb07z\xd8\x16\x97f\xb1\xaaF\xbd(IwQ\xda\x83\xdd"\xae2\x14,*dO\xcd-\x12\xc0\xad\xa2\xad\xdbx\x91\xdb\xa1\x95s[5lnQ\x03n5i\xa3\x8d\x17)\xe0Vs\xc8\xad\x04\xa25\x9a\x99\xe5,\xb9Ag\xd6-G\xb4u\x1b/r\xc8-PO\x8e\x80zr\x0c\xd4\x93\xe3\xa0\x9e\xf1"\x07\xa2\xc5\x12\x88\x16+ 
Z\xac\x81h\tPON\x80zr"\x81h\r\x91\xe4\x93`>\xe7\xa3\xa5P=)\xcd\xc3\xc4L\x10\xc8\xad\x84\xdcj\xc0-\xc3@\xe2\x19TO&\x80$0\xa8d\x0chA\xce1\x10-\xa7@\xb4\x1chA\xce\x81\x16\xe4\x1cjA\x01\xd5S@\xf5\x14\x0c@\x9f\xe0\x00\xfa\x04\xd4\x82\x12\x01n%\x06\xdcJ\xa8\x05\xa5\x80\xdc\x02\x94\xca%TO\x05\xb5\xa0\x82\xea\xa9\x18\x80>\x05\xd5SA\xf5\xd4\x08p\xab\x19\x00j\r\xb5\xa0\x86ZP\xeb|\xe2\x05\x02ZP \xa0\x05\x8d\x82\x83\xdc\x02\xfd)\x10\xd0\x9f\x02c\xc0-&\x80[\x0c\x94L`\x80R+\xc9\xc2\x1a\xc9Bt%\xaa\x84\x7f\x16\x8e\x1f\xbc\xa0BR\x0b\xa7\x8c\xb6\xa12z\x8a\x97/\x8a\x9f\xbe\x04[\x8e\xea\x1d\x9a\xe0P\x91Yy\x8c\x04%\xd5\xa7I`\xb0\xaa\xdfaHI\xec\x96\xdf\x87"\xee\xf4\xdd\xaav@0\xf3\x0e^7\xba\xcf\xfd>\t\x0c\xa6\xab8\x84"\xf4\xc8\x0e\xd5!A\x98e"u\x9b\xe8\x11GG\x8c\x83{\xce\xde\xc4\x8f\x18a\x80\xcd\x85y\xe9X\xa8U\x1a\xf0\xfcj\x0b\x0bu\xf8\x8d\xb9\x8b\x8c/\x0bR\x8b\xc9\xb7\xc5\xff\x00L\xc2\xa0'

The PDF file also opens correctly in OS X Preview.

PDF /PageLayout and /PageMode options

Hi,

I've been using PyPDF2 to merge some PDF files, adding bookmarks to the various pages as needed. I've been using the code below to set the initial view of the output PDF so that it shows one page at a time, and displays the bookmarks navigation panel.

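from PyPDF2 import PdfFileWriter
from PyPDF2.generic import NameObject
pdf = PdfFileWriter()
# Fetch the document catalog (root) from the writer and set the initial view
root = pdf.getObject(pdf._root)
root.update({NameObject('/PageLayout'): NameObject('/SinglePage'), NameObject('/PageMode'): NameObject('/UseOutlines')})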

I'm wondering if there would be any interest in writing this into a more formal method. Maybe something like:

pdf = PdfFileWriter()
pdf.page_layout = 'SinglePage'
pdf.page_mode = 'Bookmarks'

I'm happy to write this and submit a pull request, but I thought I'd get some feedback on the syntax first.
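To make the proposed syntax concrete, here is a minimal sketch of how such properties could map friendly names onto catalog entries. Everything in it is hypothetical: `WriterSketch`, `PAGE_MODES`, and `PAGE_LAYOUTS` are illustration names, a plain dict stands in for the document catalog, and none of this is PyPDF2's actual API.

```python
# Hypothetical sketch of the proposed syntax; not PyPDF2's actual API.
# A plain dict stands in for the PDF document catalog ("root").

# Friendly alias -> PDF name for /PageMode (aliases are an assumption).
PAGE_MODES = {
    "None": "/UseNone",
    "Bookmarks": "/UseOutlines",
    "Thumbnails": "/UseThumbs",
    "FullScreen": "/FullScreen",
}
# A subset of the /PageLayout names defined by the PDF specification.
PAGE_LAYOUTS = {"SinglePage", "OneColumn", "TwoColumnLeft", "TwoColumnRight"}


class WriterSketch(object):
    def __init__(self):
        self.root = {}

    @property
    def page_layout(self):
        return self.root.get("/PageLayout")

    @page_layout.setter
    def page_layout(self, value):
        if value not in PAGE_LAYOUTS:
            raise ValueError("unknown page layout: %r" % (value,))
        self.root["/PageLayout"] = "/" + value

    @property
    def page_mode(self):
        return self.root.get("/PageMode")

    @page_mode.setter
    def page_mode(self, value):
        if value not in PAGE_MODES:
            raise ValueError("unknown page mode: %r" % (value,))
        self.root["/PageMode"] = PAGE_MODES[value]


pdf = WriterSketch()
pdf.page_layout = "SinglePage"
pdf.page_mode = "Bookmarks"
print(pdf.root)
```

Validating in the setter means a misspelled layout or mode fails right away instead of producing a catalog entry that viewers silently ignore.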

In addition to this, it would be nice to be able to modify the author, title, etc. Maybe this is already possible and I've just missed it...

PyPDF2 bails out while parsing NameObject if it's standalone

When a standalone NameObject is encountered the parsing code raises an exception.

Reproducible with:
from PyPDF2.generic import readObject
from cStringIO import StringIO
print readObject(StringIO("/deviceRGB"), None)

PyPDF2 fails with PdfStreamError("Stream has ended unexpectedly").

Some PDFs generated with ImageMagick (image-to-PDF conversion) contain this standalone "/deviceRGB", and it is not followed by a space or any other delimiter. I have come across a couple of PDFs with this problem. Unfortunately, I cannot share them (client data). I'll try to create such a PDF and attach it here.
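For illustration, a tolerant name reader could treat end-of-stream as a valid terminator instead of raising. This is a minimal standalone sketch, not PyPDF2's parsing code; `read_name`, `WHITESPACE`, and `DELIMITERS` are hypothetical names, and it uses Python 3 `io.BytesIO` rather than `cStringIO`.

```python
import io

# Bytes that end a name token in PDF syntax: whitespace and delimiters.
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def read_name(stream):
    """Read a name such as /DeviceRGB; stop at whitespace, a delimiter, or EOF."""
    first = stream.read(1)
    if first != b"/":
        raise ValueError("a name object must start with '/'")
    name = first
    while True:
        tok = stream.read(1)
        if not tok:
            # End of stream: accepted as a terminator here, not an error.
            break
        if tok in WHITESPACE or tok in DELIMITERS:
            # Leave the terminator in the stream for the caller.
            stream.seek(-1, io.SEEK_CUR)
            break
        name += tok
    return name


print(read_name(io.BytesIO(b"/deviceRGB")))
```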

Add method ignoreText

Hi,

I have a PDF and I want to remove the text from it, keeping only the images.

I see there is a method ignoreLinks on the PdfFileWriter object; could you add a method ignoreText?

Or explain how I can do this?

Thanks.

Wrong PDF generation on windows

The below code will generate an output, but the resulting PDF is not the expected concatenation of the two original pages. Same code works as intended on Linux.

import PyPDF2

pdfList = ['top_01.pdf','top_02.pdf']

def mergePDF():
        writer = PyPDF2.PdfFileWriter()
        for pdf in pdfList :
            f = open(pdf, 'rb')
            reader = PyPDF2.PdfFileReader(f)
            writer.addPage(reader.getPage(0))
        out = open('top.pdf', 'w')
        writer.write(out)
        #out.close()

mergePDF()

Here are the links :
top.pdf

top_01.pdf

top_02.pdf
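One plausible explanation, which the report does not confirm, is the text-mode `'w'` handle: on Windows, Python 2 text mode translates every `\n` byte written into `\r\n`, which changes binary stream contents and shifts byte offsets. The sketch below simulates that translation with the standard library only, so it behaves the same on any platform; the `payload` bytes are made up for illustration.

```python
import io

# Stand-in for PDF output: binary data that happens to contain 0x0A bytes.
payload = b"%PDF-1.4\nstream\n\x00\x0a\xff\nendstream\n"

# Simulate a Windows text-mode handle: every "\n" written becomes "\r\n".
raw = io.BytesIO()
text_mode = io.TextIOWrapper(raw, encoding="latin-1", newline="\r\n")
text_mode.write(payload.decode("latin-1"))
text_mode.flush()
corrupted = raw.getvalue()

# A binary-mode handle writes the bytes unchanged.
binary_mode = io.BytesIO()
binary_mode.write(payload)
intact = binary_mode.getvalue()

# The text-mode copy is longer: one extra byte per newline in the payload.
print(len(payload), len(intact), len(corrupted))
```

If this is the cause, opening the output with `open('top.pdf', 'wb')` and closing it after `writer.write(out)` should give the same result on Windows and Linux.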

Python Version Compatibility

A new PyPDF2 branch 'Python3-3' has been created, incorporating William Culver's changes from his pull request #4. However, it currently works completely only on Python 2.6 and 2.7.

MergePage rotates 1 page relative to the other, in certain pdfs

I'm merging 2 pdfs using code that works correctly for other pdfs. I'm using the mergePage method to overlay the content from one pdf on the other pdf (merge page by page).
In the image below, the numbers (highlighted by red box) should be positioned vertically.

[Image: capture]

The "base pdf" is a scan from a Xerox WorkCentre 7435. The "secondary pdf" (containing the highlighted numbers) is generated using reportlab. The "base pdf" and "secondary pdf" have portrait orientation when viewing in a pdf viewer.
Other scans (from other scanners) merge correctly.

I don't know much about how pdf structure works, but is it possible the scan isn't including some data (orientation)?

I will try to include a problem PDF once I obtain one that doesn't contain sensitive information.
Thanks
Rob

int() got an unexpected keyword argument 'base' error at line 803 in pdf.py when using PyPDF2

When I execute the following code in Visual Studio 2012 using Python Tools, IronPython 2.7, and PyPDF2 v1.20, I get this error at line 803 in pdf.py: "int() got an unexpected keyword argument 'base'"

This is my complete code:

import clr
clr.AddReference('System.Drawing')
clr.AddReference('System.Windows.Forms')

from System.Drawing import *
from System.Windows.Forms import *
from PyPDF2 import PdfFileReader
class MyForm(Form):

    def __init__(self):
        # Create child controls and initialize form
        self.Text = "Test Project"
        self.Size = Size(600, 500)

        path = "F:/Download/RealPython.pdf"
        f = open(path)
        inputpdf = PdfFileReader(open(path, "rb"))
        page = inputpdf.getPage(8)
        pagecontent = page.extractText()

        display.mediaBox.upperRight = (
            display.mediaBox.getUpperRight_x() / 2,
            display.mediaBox.getUpperRight_y() / 2
        )

Application.EnableVisualStyles()
Application.SetCompatibleTextRenderingDefault(False)

form = MyForm()
Application.Run(form)

I read that PyPDF2 is written in pure Python, so it should run with any Python implementation; that is why I am using IronPython 2.7.

Can anyone help? :)
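The error message suggests the failing call passes the base as a keyword, which this interpreter rejects. As a hedged workaround sketch, the base can be passed positionally instead; `parse_int` is a hypothetical helper, and the actual expression at line 803 of pdf.py is not shown in this report.

```python
def parse_int(text, base=10):
    """Convert text to an int, passing the base positionally."""
    # int(text, base=16) uses a keyword that the reported interpreter rejects;
    # int(text, 16) is equivalent on CPython and avoids the keyword.
    return int(text, base)


print(parse_int("ff", 16))
print(parse_int("803"))
```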

PyPDF2 does not work under pypy

NumberObject is initialized wrong

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(value)

Correct would be:

class NumberObject(int, PdfObject):
    def __init__(self, value):
        int.__init__(self, value)
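Because `int` is immutable, another option is to set the value in `__new__` rather than `__init__`, which does not depend on how a given interpreter treats `int.__init__`. This is a sketch under that assumption, not the fix adopted by the project; the empty `PdfObject` below is a stand-in for PyPDF2's base class.

```python
class PdfObject(object):
    """Stand-in for PyPDF2's base class (assumed to need no constructor arguments)."""


class NumberObject(int, PdfObject):
    def __new__(cls, value):
        # int is immutable, so the value must be fixed when the instance is created.
        return int.__new__(cls, value)


n = NumberObject(42)
print(n + 1, isinstance(n, PdfObject))
```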

PyPDF2 failing at import

I am using PyPDF2 to extract text and geometry from a PDF. This is the code snippet from my Pdftext.py file:

from PyPDF2 import PdfFileReader

When I run this, I get the error below:

Traceback (most recent call last):
  File "C:\Program Files\Microsoft Visual Studio 11.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\2.0\visualstudio_py_util.py", line 76, in exec_file
    exec(code_obj, global_variables)
  File "C:\Users\xxx\documents\visual studio 2012\Projects\PDFText\PDFText\PDFText.py", line 3, in <module>
    import PyPDF2
  File "C:\Python27\lib\site-packages\PyPDF2\__init__.py", line 1, in <module>
    from .pdf import PdfFileReader, PdfFileWriter
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 56, in <module>
    from .generic import *
  File "C:\Python27\lib\site-packages\PyPDF2\generic.py", line 1049, in <module>
    u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'),
  File "C:\Python27\lib\site-packages\PyPDF2\utils.py", line 161, in u_
    return str(s, 'unicode_escape')
TypeError: str() takes at most 1 argument (2 given)

startxref is not necessarily on a different line from the location of the xref table

The PDF spec seems to require that the startxref keyword and the byte offset to the xref table be on different lines.

However, in the wild, I have found otherwise valid PDFs where the startxref keyword and the byte offset to the xref table are on the same line, like so:

...
0 8
0000000000 65535 f
0000000009 00000 n
0000305603 00000 n
0000305652 00000 n
0000000083 00000 n
0000305310 00000 n
0000305405 00000 n
0000305423 00000 n
trailer
<<
/Size 8
/Root 2 0 R
/Info 1 0 R
>>
startxref 305711
%%EOF

See here for example: https://www.docketalarm.com/cases/PTAB/IPR2014-00358/Inter_Partes_Review_of_U.S._Reissue_Pat._RE043707/docs/01-17-2014-PET-1193/Power_of_Attorney-2-Power_of_Attorney.pdf
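A tolerant reader could accept any run of whitespace between the keyword and the offset, whether that is a space or a line break. Here is a minimal standalone sketch of that idea; `find_startxref`, `STARTXREF_RE`, and the 1024-byte tail window are assumptions for illustration, not PyPDF2's implementation.

```python
import re

# "startxref", any run of whitespace (spaces or line breaks), the offset,
# then optional whitespace and the %%EOF marker at the end of the data.
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s*%%EOF\s*$")


def find_startxref(data):
    """Return the xref table byte offset found at the tail of a PDF file."""
    tail = data[-1024:]  # the trailer sits at the very end of the file
    match = STARTXREF_RE.search(tail)
    if match is None:
        raise ValueError("startxref not found")
    return int(match.group(1))


same_line = b"trailer\n<<\n/Size 8\n>>\nstartxref 305711\n%%EOF\n"
separate_lines = b"trailer\n<<\n/Size 8\n>>\nstartxref\r\n305711\r\n%%EOF\r\n"
print(find_startxref(same_line), find_startxref(separate_lines))
```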

Query - is there a way to bypass security restrictions on a pdf?

I have a pdf that has security restrictions. I need to merge some content into the secured pdf. I don't need the pdf to be secured after the merge.
When I open the file and check isEncrypted, it returns True.
When I try to decrypt it with an empty string, a NotImplementedError is raised: "only algorithm code 1 and 2 are supported".

The restrictions on the file are shown below.
[Image: restrictions]

At the moment, to bypass the restrictions on the file, I print the pdf to images and create a new pdf with those images. This isn't ideal as the file size becomes large and the content isn't as crisp.

Is there a better way?

PyPDF2 failing to read unicode character

I have a PDF whose text PdfFileReader is unable to read. Instead, this is the output:

u'\n˘ˇˆ˘ˇ˙˝˛˛˚˜ !!"#$%&"˝˛˝˘˛˘˛˚˙˘ˇ˝˛˘˛$\'(˘%˘ˇ˘ˆ˘)_)˛\'+,-)"˛./0"0!123˛"4˙"5)46)!6"˙˘˘˘,˘ˇˆ˙˙ˆ˝˛˚˜ !˘ˇˆ˙˝"" ˜#˝$˛˚˜ ˆ˙˝"" ˜ %˛˚˜ !˛˚ˇ!"#$%˘ˇ&ˆ˙˝˛˝ˆ˙&˚˝\'˛˚&\'()_ˇ+˙˝"" ˜#˝$˜#( ˛˚(ˇ+,˘˘˘ˇˆˆˆˇ,ˆ--ˆˇˇ˙˝˝% ˜)˜#_#˝$$˜  ˙ ˝_˛˚ˆ-&ˆ!ˆˇ&˘+$ˆ(˙˝+˚˜,!˛˚./&0ˆˆ+$ˆ(˙˝-˛-,&˘˝ˆ. ˚%˝% ˜)˜#\* ˜!˛˚&ˆˇ%ˆ!&(12+3ˇ˙˝,˜ˆ/˛˚%#"+3("ˆˇ.!ˆˇ43ˇ(˙-,&53ˇ6ˆˇ,˝˝% ˜)˜#\* ˜!˛˚(77777777777˜#( 0123& ˜"" ˜ %˛˚˜ 77777777777˜#( _ˆ_˛ ,4+#(56˝% ˜)˜#\* ˜!7  56 _˜ˆ(  %!_ˆ_˛ ˆ˙&˚˝\'586"ˇ+((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'()_&\'(_&\'()˘536((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((&\'&\' &\'˜ ˜˙˚ˆ-",ˇˆˇ!ˆ-ˆ,ˆ&ˆ!ˆˇ&53ˇ6ˆˇ,(˙˚&ˆ!-ˇ!6ˆˇ,˘ 8-ˇˆ-˙˝˝% ˜)˜#_ ˜!7  ˛˚(˙˚9ˇˇˆ-6ˆˇ,:;ˇˇˆ-<ˆˆ-ˇ&\' ,,˘˘ˇˇˆ-(9ˆˇˆ-!˘ˇˆ9˘ˆˇ˘˘(\n\n'

This is the output of extractText, and it does not throw any error message.

A similar issue has been posted here:

http://stackoverflow.com/questions/15583535/how-to-extract-text-from-a-pdf-file-in-python
I am using Windows, so the solution in the link is not helpful.
