jesparza / peepdf

Powerful Python tool to analyze PDF documents

Home Page: http://peepdf.eternal-todo.com

License: GNU General Public License v3.0

Python 100.00%

peepdf's Issues

Modify update command to support the migration to GitHub

Due to the migration to GitHub, the update process in peepdf needs to be modified. It used to contact Google Code to retrieve the latest version of the files, but it should now fetch them from GitHub.
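As a rough sketch of what the new update step might look like, the snippet below builds raw-file URLs on GitHub instead of Google Code. The branch name `master`, the example file list, and the helper name `github_raw_url` are assumptions for illustration; none of them is taken from peepdf's code.

```python
# Hypothetical helper: build the raw-content URL for a file in the GitHub repo.
RAW_BASE = "https://raw.githubusercontent.com"


def github_raw_url(path, owner="jesparza", repo="peepdf", branch="master"):
    """Return the URL from which the raw contents of `path` could be downloaded."""
    return "/".join([RAW_BASE, owner, repo, branch, path.lstrip("/")])


if __name__ == "__main__":
    # Example file list (assumed); a real updater would need its own manifest.
    for name in ("peepdf.py", "PDFCore.py"):
        print(github_raw_url(name))
```

An updater along these lines would still need to download each URL and compare it with the local copy; that part is left out here.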

embed file error

Hi,
I am trying to add a text file to an existing PDF document.

#peepdf -i
ppdf>open my file
ppdf> embed /root/Bureau/share/file.txt text/plain
PPDF> *** Error: Exception not handled using the interactive console!! Please, report it to the author!!

Any ideas, please?

Thank you.

[README] Instructions to install dependencies are too sparse (and no longer current?)

I am trying to get a fully featured peepdf.py running on a current Debian Jessie (and on Mac OS X, but I won't go into depth on that OS here). Problem: getting the libemu and PyV8 dependencies installed.

Debian

Installing the libemu and lxml dependencies worked like this:

 sudo apt-get install libemu2 python-libemu python-lxml libxml2

Getting the PyV8 dependency was only partially successful. A V8 package is available:

 sudo apt-get install libv8-3.14.5

However, the PyV8 is no longer maintained on Google Code (http://code.google.com/p/pyv8/). The latest prebuilt (non-Windows) binaries there are from 2010 and are based on Python-2.6 (while Debian Jessie uses Python-2.7 now).

Trying it with

 sudo pip install -v pyv8

leads to the following error message:

src/Wrapper.cpp: In static member function ‘static void CPythonObject::SetupObjectTemplate(v8::Handle<v8::ObjectTemplate>)’:

src/Wrapper.cpp:311:84: error: invalid conversion from ‘v8::Handle<v8::Boolean> (*)(v8::Local<v8::String>, const v8::AccessorInfo&)’ to ‘v8::NamedPropertyQuery {aka v8::Handle<v8::Integer> (*)(v8::Local<v8::String>, const v8::AccessorInfo&)}’ [-fpermissive]

   clazz->SetNamedPropertyHandler(NamedGetter, NamedSetter, NamedQuery, NamedDeleter);
                                                                                    ^
In file included from src/Exception.h:6:0,
                 from src/Wrapper.h:8,
                 from src/Wrapper.cpp:1:
/usr/include/v8.h:2414:8: note: initializing argument 3 of ‘void v8::ObjectTemplate::SetNamedPropertyHandler(v8::NamedPropertyGetter, v8::NamedPropertySetter, v8::NamedPropertyQuery, v8::NamedPropertyDeleter, v8::NamedPropertyEnumerator, v8::Handle<v8::Value>)’
   void SetNamedPropertyHandler(NamedPropertyGetter getter,
        ^
src/Wrapper.cpp:312:94: error: invalid conversion from ‘v8::Handle<v8::Boolean> (*)(uint32_t, const v8::AccessorInfo&) {aka v8::Handle<v8::Boolean> (*)(unsigned int, const v8::AccessorInfo&)}’ to ‘v8::IndexedPropertyQuery {aka v8::Handle<v8::Integer> (*)(unsigned int, const v8::AccessorInfo&)}’ [-fpermissive]
   clazz->SetIndexedPropertyHandler(IndexedGetter, IndexedSetter, IndexedQuery, IndexedDeleter);
                                                                                              ^
In file included from src/Exception.h:6:0,
                 from src/Wrapper.h:8,
                 from src/Wrapper.cpp:1:
/usr/include/v8.h:2437:8: note: initializing argument 3 of ‘void v8::ObjectTemplate::SetIndexedPropertyHandler(v8::IndexedPropertyGetter, v8::IndexedPropertySetter, v8::IndexedPropertyQuery, v8::IndexedPropertyDeleter, v8::IndexedPropertyEnumerator, v8::Handle<v8::Value>)’
   void SetIndexedPropertyHandler(IndexedPropertyGetter getter,
        ^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-8XX3Id/pyv8/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-DQ9h7E-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip-build-8XX3Id/pyv8
Traceback (most recent call last):
  File "/usr/bin/pip", line 9, in <module>
    load_entry_point('pip==1.5.6', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pip/__init__.py", line 248, in main
    return command.main(cmd_args)
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)

Debugging and possibly solving this is beyond my capabilities.

The next alternative, using the pyv8 sources from Google Code with

 python setup.py build
 sudo python setup.py install

completes without obvious error, but does not succeed either: when running peepdf.py, this message still appears:

Warning: PyV8 is not installed!!

Redirecting output from command in interactive PeepDF session creates hundreds of files (1 for each byte of output!)

I came across this issue when playing with the current PeepDF code from the Git repo....

You can use this sample PDF file (handcoded by me) to reproduce the issue of this report:

https://gist.github.com/KurtPfeifle/63eaed91ee5dd26873cd

The following is from an interactive peepdf session (I know I haven't installed PyV8 and pylibemu on this system, but this shouldn't matter for this issue):

kp@mbp:mrmcd15>  peepdf.py -fli  manuallycoded.pdf
Warning: PyV8 is not installed!!
Warning: pylibemu is not installed!!

File: manuallycoded.pdf
MD5: 758869f79d4fe30496db43b3bef8b708
SHA1: 4821d74bf84b6a0b757fd6c152c2e0a6d5fd3fa9
SHA256: 73016c523779e344ad45ee07dadd6812c14bde2015e6ba2e8e2638ef730c870e
Size: 2656 bytes
Version: 1.0
Binary: False
Linearized: False
Encrypted: False
Updates: 0
Objects: 9
Streams: 2
Comments: 0
Errors: 1

Version 0:
    Catalog: 1
    Info: 2
    Objects (9): [1, 2, 3, 4, 5, 7, 8, 9, 10]
    Streams (2): [8, 9]
        Encoded (1): [9]


PPDF> object 9

<< /Length 503
/Filter [ /ASCIIHexDecode /ASCIIHexDecode /FlateDecode /LZWDecode /ASCIIHexDecode ] >>
stream
BT
    /F1 60              Tf
    30  400             Td
    1 0 0               rg
    (Hallo, MRMCD 2015) Tj
ET

endstream

PPDF> filters 9

[ /ASCIIHexDecode /ASCIIHexDecode /FlateDecode /LZWDecode /ASCIIHexDecode ]

PPDF> filters 9 ahx

<< /Length 228
/Filter /ASCIIHexDecode >>
stream
42540a202020202f4631203630202020202020202020202020202054660a20202020333020203430302020202020202020202020202054640a20202020312030203020202020202020202020202020202072670a202020202848616c6c6f2c204d524d434420323031352920546a0a45540a
endstream

PPDF> filters 9 ahx > stream9.ahx

PPDF> quit

Leaving the Peepdf interactive console...Bye! ;)
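For reference, the hex string printed by `filters 9 ahx` is the ASCIIHex encoding of the content stream shown by `object 9`. A minimal Python check, using only the first 26 hex digits of the output above:

```python
import binascii

# First 26 hex digits of the stream printed by `filters 9 ahx` above.
hex_prefix = "42540a202020202f4631203630"

# Each pair of hex digits encodes one byte of the original content stream.
decoded = binascii.unhexlify(hex_prefix)
print(repr(decoded))
```

The decoded bytes correspond to the `BT` line and the start of the `/F1 60 Tf` line of the stream.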

Now look at the file(s) named stream9.ahx created by the last PeepDF command with its output re-directed:

kp@mbp:mrmcd15>  ls -l stream9.ahx_* | wc -l
     286

kp@mbp:mrmcd15>  ls stream9.ahx_*
stream9.ahx_1   stream9.ahx_118 stream9.ahx_137 stream9.ahx_156 stream9.ahx_175 stream9.ahx_194 stream9.ahx_212 stream9.ahx_231 stream9.ahx_250 stream9.ahx_27  stream9.ahx_30  stream9.ahx_5   stream9.ahx_69  stream9.ahx_88
stream9.ahx_10  stream9.ahx_119 stream9.ahx_138 stream9.ahx_157 stream9.ahx_176 stream9.ahx_195 stream9.ahx_213 stream9.ahx_232 stream9.ahx_251 stream9.ahx_270 stream9.ahx_31  stream9.ahx_50  stream9.ahx_7   stream9.ahx_89
stream9.ahx_100 stream9.ahx_12  stream9.ahx_139 stream9.ahx_158 stream9.ahx_177 stream9.ahx_196 stream9.ahx_214 stream9.ahx_233 stream9.ahx_252 stream9.ahx_271 stream9.ahx_32  stream9.ahx_51  stream9.ahx_70  stream9.ahx_9
stream9.ahx_101 stream9.ahx_120 stream9.ahx_14  stream9.ahx_159 stream9.ahx_178 stream9.ahx_197 stream9.ahx_215 stream9.ahx_234 stream9.ahx_253 stream9.ahx_272 stream9.ahx_33  stream9.ahx_52  stream9.ahx_71  stream9.ahx_90
stream9.ahx_102 stream9.ahx_121 stream9.ahx_140 stream9.ahx_16  stream9.ahx_179 stream9.ahx_198 stream9.ahx_216 stream9.ahx_235 stream9.ahx_254 stream9.ahx_273 stream9.ahx_34  stream9.ahx_53  stream9.ahx_72  stream9.ahx_91
stream9.ahx_103 stream9.ahx_122 stream9.ahx_141 stream9.ahx_160 stream9.ahx_18  stream9.ahx_199 stream9.ahx_217 stream9.ahx_236 stream9.ahx_255 stream9.ahx_274 stream9.ahx_35  stream9.ahx_54  stream9.ahx_73  stream9.ahx_92
stream9.ahx_104 stream9.ahx_123 stream9.ahx_142 stream9.ahx_161 stream9.ahx_180 stream9.ahx_2   stream9.ahx_218 stream9.ahx_237 stream9.ahx_256 stream9.ahx_275 stream9.ahx_36  stream9.ahx_55  stream9.ahx_74  stream9.ahx_93
stream9.ahx_105 stream9.ahx_124 stream9.ahx_143 stream9.ahx_162 stream9.ahx_181 stream9.ahx_20  stream9.ahx_219 stream9.ahx_238 stream9.ahx_257 stream9.ahx_276 stream9.ahx_37  stream9.ahx_56  stream9.ahx_75  stream9.ahx_94
stream9.ahx_106 stream9.ahx_125 stream9.ahx_144 stream9.ahx_163 stream9.ahx_182 stream9.ahx_200 stream9.ahx_22  stream9.ahx_239 stream9.ahx_258 stream9.ahx_277 stream9.ahx_38  stream9.ahx_57  stream9.ahx_76  stream9.ahx_95
stream9.ahx_107 stream9.ahx_126 stream9.ahx_145 stream9.ahx_164 stream9.ahx_183 stream9.ahx_201 stream9.ahx_220 stream9.ahx_24  stream9.ahx_259 stream9.ahx_278 stream9.ahx_39  stream9.ahx_58  stream9.ahx_77  stream9.ahx_96
stream9.ahx_108 stream9.ahx_127 stream9.ahx_146 stream9.ahx_165 stream9.ahx_184 stream9.ahx_202 stream9.ahx_221 stream9.ahx_240 stream9.ahx_26  stream9.ahx_279 stream9.ahx_4   stream9.ahx_59  stream9.ahx_78  stream9.ahx_97
stream9.ahx_109 stream9.ahx_128 stream9.ahx_147 stream9.ahx_166 stream9.ahx_185 stream9.ahx_203 stream9.ahx_222 stream9.ahx_241 stream9.ahx_260 stream9.ahx_28  stream9.ahx_40  stream9.ahx_6   stream9.ahx_79  stream9.ahx_98
stream9.ahx_11  stream9.ahx_129 stream9.ahx_148 stream9.ahx_167 stream9.ahx_186 stream9.ahx_204 stream9.ahx_223 stream9.ahx_242 stream9.ahx_261 stream9.ahx_280 stream9.ahx_41  stream9.ahx_60  stream9.ahx_8   stream9.ahx_99
stream9.ahx_110 stream9.ahx_13  stream9.ahx_149 stream9.ahx_168 stream9.ahx_187 stream9.ahx_205 stream9.ahx_224 stream9.ahx_243 stream9.ahx_262 stream9.ahx_281 stream9.ahx_42  stream9.ahx_61  stream9.ahx_80
stream9.ahx_111 stream9.ahx_130 stream9.ahx_15  stream9.ahx_169 stream9.ahx_188 stream9.ahx_206 stream9.ahx_225 stream9.ahx_244 stream9.ahx_263 stream9.ahx_282 stream9.ahx_43  stream9.ahx_62  stream9.ahx_81
stream9.ahx_112 stream9.ahx_131 stream9.ahx_150 stream9.ahx_17  stream9.ahx_189 stream9.ahx_207 stream9.ahx_226 stream9.ahx_245 stream9.ahx_264 stream9.ahx_283 stream9.ahx_44  stream9.ahx_63  stream9.ahx_82
stream9.ahx_113 stream9.ahx_132 stream9.ahx_151 stream9.ahx_170 stream9.ahx_19  stream9.ahx_208 stream9.ahx_227 stream9.ahx_246 stream9.ahx_265 stream9.ahx_284 stream9.ahx_45  stream9.ahx_64  stream9.ahx_83
stream9.ahx_114 stream9.ahx_133 stream9.ahx_152 stream9.ahx_171 stream9.ahx_190 stream9.ahx_209 stream9.ahx_228 stream9.ahx_247 stream9.ahx_266 stream9.ahx_285 stream9.ahx_46  stream9.ahx_65  stream9.ahx_84
stream9.ahx_115 stream9.ahx_134 stream9.ahx_153 stream9.ahx_172 stream9.ahx_191 stream9.ahx_21  stream9.ahx_229 stream9.ahx_248 stream9.ahx_267 stream9.ahx_286 stream9.ahx_47  stream9.ahx_66  stream9.ahx_85
stream9.ahx_116 stream9.ahx_135 stream9.ahx_154 stream9.ahx_173 stream9.ahx_192 stream9.ahx_210 stream9.ahx_23  stream9.ahx_249 stream9.ahx_268 stream9.ahx_29  stream9.ahx_48  stream9.ahx_67  stream9.ahx_86
stream9.ahx_117 stream9.ahx_136 stream9.ahx_155 stream9.ahx_174 stream9.ahx_193 stream9.ahx_211 stream9.ahx_230 stream9.ahx_25  stream9.ahx_269 stream9.ahx_3   stream9.ahx_49  stream9.ahx_68  stream9.ahx_87

Each output file contains exactly 1 byte. (Concatenating these files in the correct order gives the same output as seen in the interactive PeepDF session without redirecting the output.)

I also tried this command variation for re-directing the output: filters 9 ahx >> stream9.ahx.

But it doesn't make a difference.
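As a workaround sketch (not a fix for the redirection bug itself), the one-byte files can be joined in numeric order with a few lines of Python. The function name `join_numbered_parts` is hypothetical; the only assumption about the files is the `<prefix>_<N>` naming seen in the listing above.

```python
import glob
import re


def join_numbered_parts(prefix, target):
    """Concatenate files named <prefix>_<N> in ascending numeric order of N."""
    name = re.compile(re.escape(prefix) + r"_(\d+)")
    parts = []
    for path in glob.glob(glob.escape(prefix) + "_*"):
        match = name.fullmatch(path)
        if match:
            # Keep the numeric suffix so that _10 sorts after _9, not after _1.
            parts.append((int(match.group(1)), path))
    with open(target, "wb") as out:
        for _, path in sorted(parts):
            with open(path, "rb") as part:
                out.write(part.read())
    return len(parts)
```

For the session above, `join_numbered_parts("stream9.ahx", "stream9.joined")` would be the corresponding call. Sorting on the integer suffix matters because a plain `ls` ordering interleaves `_1`, `_10`, `_100`.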

URI not found

File https://www.virustotal.com/file/c7e31b77e7a4df74515bbac25a3f641598050e3fe1a9c3545efa72f0175f2323/analysis/1528284919/
contains a URI within object 3 but peepdf says:

URIs: 0

I can share the file if needed.

code.google.com

Hi,

Service code.google.com is closing soon. Do you have plans to migrate to GitHub?

Original issue reported on code.google.com by [email protected] on 26 Mar 2015 at 3:00

KeyError: URIs

Any idea what is causing this error? I ran python2 peepdf.py --update and the code is up to date. This is happening on Linux.

peepdf.py", line 626, in
stats += beforeStaticLabel + 'URIs: ' + resetColor + statsDict['URIs'] + newLine
KeyError: 'URIs'
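A defensive way to build such a line is to look the key up with a default instead of indexing directly. The sketch below is illustrative only: `stats_dict` and `format_stat` are hypothetical stand-ins, and the underlying cause in peepdf may be different (for example, the code that fills the dictionary not setting `'URIs'` at all).

```python
# Hypothetical stats dictionary that lacks the 'URIs' key.
stats_dict = {"Objects": "9", "Streams": "2"}


def format_stat(stats, key, default="0"):
    """Return 'key: value', falling back to `default` when the key is absent."""
    return key + ": " + stats.get(key, default)


print(format_stat(stats_dict, "URIs"))
print(format_stat(stats_dict, "Objects"))
```

This hides the symptom rather than explaining why the key is missing, so it is best treated as a stopgap.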

Need to update to python3

A number of places need to be updated for this to work with Python 3, in particular the print statements. Currently all prints are of the form print 'stuff', which is a syntax error in Python 3; they need to be converted to print('stuff').
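To illustrate the conversion, here is a minimal sketch that rewrites only the simplest statement form. The helper name `convert_simple_print` is hypothetical. It deliberately leaves `print >>stream, ...` and trailing-comma prints untouched, since those need a different rewrite; a full port would be better served by a dedicated conversion tool and manual review.

```python
import re

# Matches only the simplest form: optional indentation, `print`, one expression.
# Lines already using print(...) or the `print >>stream` form are not matched.
_SIMPLE_PRINT = re.compile(r"^(\s*)print\s+(?![\s(]|>>)(.+?)\s*$")


def convert_simple_print(line):
    """Rewrite `print expr` as `print(expr)`; leave every other line unchanged."""
    match = _SIMPLE_PRINT.match(line)
    if not match:
        return line
    indent, expr = match.groups()
    if expr.endswith(","):
        # A trailing comma suppresses the newline in Python 2; handle it manually.
        return line
    return "%sprint(%s)" % (indent, expr)
```

Adding `from __future__ import print_function` at the top of each module is a common intermediate step, because it lets the converted code keep running on Python 2.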

using peepdf in python programming

@jesparza, since this is a Python tool, can you tell me whether I can use its commands from Python code by importing peepdf as a package?
I have analysed my PDF and looked at all of its objects by executing console commands such as object 1, object 2, etc. Now my goal is to replace the content of object number 24. Is that possible with peepdf?
Please advise.

Enhancement: add ASCII85Decode filter

Add the ASCII85Decode filter to peepdf, using the decoder
from pdfminer.

Original issue reported on code.google.com by [email protected] on 30 Nov 2012 at 2:49

Attachments:

peepdf-ascii85decode.patch

Enhancement: 'dumpstream' command

I needed to dump streams directly to file, e.g. extracting fonts from a PDF.

Attached is a patch which duplicates the 'stream' command, but accepts a 
filename to output to rather than the console.

Original issue reported on code.google.com by [email protected] on 11 Nov 2012 at 5:07

Attachments:

dumpstream.patch

UnboundLocalError: local variable 'ret' referenced before assignment in PDFFilters.py

When processing certain files, peepdf crashes with the following error:

UnboundLocalError: local variable 'ret' referenced before assignment

The bug lies in the PDFFilters.py file in the decodeStream() function, line 92:

{{{
    Traceback (most recent call last):
      File "my_script.py", line 45, in <module>
        ret, pdf = PDFCore.PDFParser().parse(filepath, True, True)
      File "/home/travesti/peepdf_0.2/PDFCore.py", line 6727, in parse
        ret = body.updateObjects()
      File "/home/travesti/peepdf_0.2/PDFCore.py", line 4126, in updateObjects
        object.resolveReferences()
      File "/home/travesti/peepdf_0.2/PDFCore.py", line 2470, in resolveReferences
        ret = self.decode()
      File "/home/travesti/peepdf_0.2/PDFCore.py", line 2001, in decode
        ret = decodeStream(self.encodedStream, self.filter.getValue(), self.filterParams)
      File "/home/travesti/peepdf_0.2/PDFFilters.py", line 92, in decodeStream
        return ret
    UnboundLocalError: local variable 'ret' referenced before assignment
}}}

The exception is raised because the "ret" variable is never initialized in 
the decodeStream() function. If none of the conditions is true, "ret" never 
gets a value, the return statement is reached, and Python raises the 
UnboundLocalError exception.

I patched the function by adding the following line at the beginning of the 
decodeStream() function:

{{{
    ret = (-1, "")
}}}

But it keeps raising errors in other modules :(
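The pattern behind the patch can be sketched in isolation. The function below is a hypothetical stand-in, not peepdf's decodeStream(); it only shows how a default assignment keeps `ret` bound when no branch matches. The error message in the default tuple is an addition for illustration (the patch above uses an empty string).

```python
def decode_stream(stream, filter_name):
    """Hypothetical stand-in for decodeStream(); returns (status, data)."""
    # Default result, so `ret` is always bound even if no branch below matches.
    ret = (-1, "Unsupported filter: %s" % filter_name)
    if filter_name == "/ASCIIHexDecode":
        try:
            ret = (0, bytes.fromhex(stream))
        except ValueError as error:
            ret = (-1, str(error))
    return ret


print(decode_stream("4254", "/ASCIIHexDecode"))
print(decode_stream("4254", "/SomeUnknownFilter"))
```

Callers still have to check the status value, which may be why errors surface in other modules once this one is patched.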

Original issue reported on code.google.com by [email protected] on 8 Mar 2014 at 3:11

Exception when opening RC4 encrypted PDF

peepdf raises an exception when opening the attached sample.pdf because 
it does not handle the P key in the standard encryption dictionary properly. 
The attached rc4.patch fixes this problem.

Original issue reported on code.google.com by czchen on 21 Oct 2011 at 1:10

Attachments:

Allow export of all JavaScript code into a compound log

It would be useful to be able to batch dump all JavaScript code from the CLI, rather than only using the interactive mode for manual checking.

Use case: many documents (possibly hundreds) need to be inspected, so interactive inspection would take too long.

Perhaps I am missing a rather simple mechanism to do this?

Problem with Filter LZW

What steps will reproduce the problem?

1. ./peepdf -i
2. create pdf
3. embed file
4. filters 4 lzw
5. save test.pdf
6. exit
7. ./peepdf -i test.pdf
8. peepdf shows decode error in object 4


What is the expected output? What do you see instead?

peepdf should encode and decode the LZW filter successfully.


What version of the product are you using? On what operating system?

The peepdf version is r45
The python version is 2.7.2+
The operating system is ubuntu 11.10 x86_64


Please provide any additional information below.

The resulting test.pdf cannot be decoded by other PDF tools such as origami-pdf either.

Original issue reported on code.google.com by czchen on 27 Oct 2011 at 12:38

"save" operation with modified stream creates broken PDF

I wanted to replace the stream with content from the file. I performed the following operations:

modify stream 45 cidset.dat
save sample_fixed.pdf

peepdf creates a broken PDF: a quick look reveals that the document has no trailer after the xref table and no %%EOF marker, while Adobe Preflight reports the following errors:

Document is damaged and needs repair
Indirect object “endobj” keyword not followed by an EOL marker
Metadata does not conform to XMP
Syntax problem: Indirect object has object number not preceded by an EOL marker
Syntax problem: Indirect object uses improper separation (object and generation number)
Syntax problem: Indirect object uses improper separation (“obj” keyword and generation number)
Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
Syntax problem: Indirect object “obj” keyword not followed by an EOL marker
Syntax problem: PDF contains data after end of file marker

Option "save" on the original pdf file results in an altered pdf file

This is a low priority issue.

Here are the steps:

open a pdf file in interactive mode (peepdf.py -i orig_file.pdf)
immediately save the file as new.pdf (save new.pdf)
orig_file.pdf and new.pdf file will be different (in the way the newlines are placed)

It seems to me that peepdf deletes newlines after any double angle brackets (>> or <<).

Error occurred while parsing indirect object

I am receiving this error:
Error: An error has occurred while parsing an indirect object!!

The error log:
, in parse

ret = body.updateObjects()

peepdf2/PDFCore.py", line 4283, in updateObjects

object.resolveReferences()

File "PDFCore.py", line 3243, in resolveReferences

ret = PDFParser.readObject(objectsSection[offset:])

TypeError: slice indices must be integers or None or have an index method

Traceback (most recent call last):

File "peepdf.py", line 494, in

ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)

File "PDFCore.py", line 7064, in parse

ret = body.updateObjects()

File "PDFCore.py", line 4283, in updateObjects

object.resolveReferences()

File "PDFCore.py", line 3243, in resolveReferences

ret = PDFParser.readObject(objectsSection[offset:])

TypeError: slice indices must be integers or None or have an index method

This is the numbers array:
<type 'list'>: ['14', '0', '15', '165', '17', '332']
If I convert it to int, I get:
PDFParser.readObject(objectsSection[offset:])
{TypeError}unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)
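The first error can be reproduced and avoided with plain Python: slicing with a string index fails, while converting the list to integers first works. The sketch below treats the numbers as (object id, offset) pairs, which is an assumption about their meaning, and uses placeholder data instead of a real objects section.

```python
numbers = ['14', '0', '15', '165', '17', '332']

# Convert once, then pair the values up as (object id, offset).
values = [int(n) for n in numbers]
pairs = list(zip(values[0::2], values[1::2]))

objects_section = b"x" * 400  # placeholder data, not from a real PDF
for object_id, offset in pairs:
    chunk = objects_section[offset:]  # slicing works now that offset is an int
    print(object_id, offset, len(chunk))
```

The second message ("unbound method readObject() must be called with PDFParser instance") is what Python 2 reports when an instance method is called on the class itself, so that call probably needs a PDFParser instance rather than the class.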

Why not use re2 to replace re?

https://github.com/facebook/pyre2

I have a file for which PDF parsing takes too long.

Traceback (most recent call last):
File "/home/soft/HawkEye/utils/../lib/hawkeye/core/plugins.py", line 230, in process
data = current.run()
File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1860, in run
static = PDF(self.file_path).run()
File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 1080, in run
results = self._parse(self.file_path)
File "/home/soft/HawkEye/utils/../modules/processing/static.py", line 882, in _parse
ret, self.pdf = PDF_parser.parse(filepath, forceMode=True, looseMode=True, manualAnalysis=True)
File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7035, in parse
rawIndirectObjects = self.getIndirectObjects(bodyContent, looseMode)
File "/usr/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7792, in getIndirectObjects
matchingObjectsAux = regExp.findall(content)
KeyboardInterrupt

I found that the re module may be the problem, so why not replace re with re2?

After I replaced it, parsing ran very fast!
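A low-risk way to try this is an import fallback, so the code keeps working where re2 is not installed. This is a sketch under assumptions: the module name `re2` is the one the pyre2 bindings are generally imported under, and the pattern below is illustrative only, not the expression peepdf uses in getIndirectObjects. Patterns that rely on backreferences or lookarounds may not be supported by RE2 and would need checking.

```python
try:
    import re2 as re  # RE2 bindings, if installed
except ImportError:
    import re  # standard library fallback

# Illustrative pattern only; not the expression peepdf uses.
OBJ_HEADER = re.compile(r"(\d+)\s+(\d+)\s+obj")


def find_object_headers(content):
    """Return (object number, generation) string pairs found in `content`."""
    return OBJ_HEADER.findall(content)


print(find_object_headers("1 0 obj\n<< >>\nendobj\n12 0 obj\n<< >>\nendobj\n"))
```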

Pdf file doesn't open

When I tried to open the file with the command below, I got the following error:
Open /sdcard/file.pdf
*** Error: Exception not handled using the interactive console!! Please, report it to the author!!

PDFCore.py's search for elements/actions/events needs a space

What steps will reproduce the problem?
1. Have a PDF with /AAPL:Keywords and it will get flagged as /AA based on line 
43 of PDFCore.py. By adding a space after each of the items on lines 
43-45, e.g. '/AA ', you will still receive hits for legitimate Additional 
Actions, but you will no longer receive false positives because something 
else contains _part_ of the string being matched.

What is the expected output? What do you see instead?
Expected to flag only on the correct Event/Action/Element names but instead you 
may receive false hits.
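One caveat with appending a space is that a PDF name may be followed directly by a delimiter, as in `/AA<<`, which the space-suffixed string would miss. An alternative sketch is a regular expression that accepts whitespace, a delimiter, or end of data after the name. The helper `name_pattern` is hypothetical and is not how PDFCore.py currently matches these names.

```python
import re


def name_pattern(name):
    """Match PDF name `name` only when followed by whitespace, a delimiter, or end of data."""
    return re.compile(re.escape(name) + r"(?=[\s()<>\[\]{}/%]|$)")


AA = name_pattern("/AA")

print(bool(AA.search("/AA << /O 6 0 R >>")))    # legitimate Additional Actions entry
print(bool(AA.search("/AA<</O 6 0 R>>")))       # no space before the dictionary
print(bool(AA.search("/AAPL:Keywords (demo)"))) # longer name that merely starts with /AA
```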

What version of the product are you using? On what operating system?
Version included in REMnux - checked the latest trunk version and it should 
still be the same.

Please provide any additional information below.
pdfxray_lite also has this issue since it uses peepdf on the back end. However, 
since it uses its own copy of PDFCore.py, that owner will be contacted 
separately if this issue is accepted, as it will also need the slight change.

Original issue reported on code.google.com by [email protected] on 11 Jun 2012 at 10:56

Many commented lines seem to confuse peepdf.py

I tried PeePDF on one of the small, hand-coded, extensively commented demo files from our TROOPERS15 workshop, 114_incrementally-updated.pdf, which uses the incremental update feature.

The file contains two versions (2 xref sections and 2 %%EOF markers).

PeePDF isn’t sure about the number of versions. The tree command returns 3 (Versions 1-3), the info command returns 4 (Versions 0-3 and Updates: 3). The rawobject xref 0 and rawobject xref 1 commands return an error message, while rawobject xref 2 and rawobject xref 3 print the correct info (apart from the version number). offsets also reports Versions 1-3. (I’m not sure if metadata 2 and metadata 3 should work — a simple metadata returns Info Object in version 2: [....])

The file contains many commented lines, so could this be why PeePDF chokes on it?

I have to use peepdf.py -fi to force it to parse the file — without the -f it returns a message only: “Error: PDF sections not found!!” — What type of PDF sections is it talking about?!?

The info command also reports: Errors: 4. But I can’t find what exactly makes it think there are 4 errors.

Here is the complete output:

PPDF> info 

File: 114_incrementally-updated.pdf 
MD5: 24d635efd52bf29ad2d36421094be5a2 
SHA1: f0b30334e111833c1bf898185d2e38135f0f88cc 
Size: 8527 bytes 
Version: 1.4 
Binary: True 
Linearized: False 
Encrypted: False 
Updates: 3 
Objects: 7 
Streams: 1 
Comments: 0 
Errors: 4 

Version 0: 
    Catalog: No 
    Info: No 
    Objects (0): [] 
    Streams (0): [] 

Version 1: 
    Catalog: No 
    Info: No 
    Objects (0): [] 
    Streams (0): [] 

Version 2: 
    Catalog: 1 
    Info: 2 
    Objects (7): [1, 2, 3, 4, 5, 6, 7] 
    Streams (1): [5] 
        Encoded (1): [5] 

Version 3: 
    Catalog: 1 
    Info: 2 
    Objects (0): [] 
    Streams (0): [] 

PPDF> tree 
Version 1: 

Version 2:

/Catalog (1) 
    /Pages (3) 
        /Page (4) 
            stream (5) 
            /R8 (7) 
                /Font (6) 
            /Pages (3) 
/Info (2) 

Version 3: 

PPDF> rawobject xref 3 

xref 
0 1 
0000000000 65535 f  
5 1 
0000006923 00000 n  

PPDF> rawobject xref 2 

xref 
0 8 
0000000000 65535 f  
0000004019 00000 n  
0000004072 00000 n  
0000004343 00000 n  
0000004408 00000 n  
0000004623 00000 n  
0000006488 00000 n  
0000006567 00000 n  

PPDF> rawobject xref 1 

*** Error: xref section not found!! 

PPDF> metadata         

Info Object in version 2: 

<< /ModDate D:20131107003857+01'00' 
/CreationDate D:20131107003857+01'00' 
/Producer Text Editor, Brain & PDF-1.7 Specification ISO 32000-1:2008 
/Title Vim- + Brain-Output 
/Creator Kurt Pfeifle 
/Author Kurt Pfeifle >>
PPDF> offsets

       0 Header

Version 1:

Version 2:

    1502
        Object  1 (51)
    1552
    1555
        Object  2 (269)
    1823
    1826
        Object  3 (63)
    1888
    1891
        Object  4 (213)
    2103
    2106
        Object  5 (1863)
    3968
    3971
        Object  6 (77)
    4047
    4050
        Object  7 (31)
    4080
    4084
        Xref Section (168)
    4251
    4253
        Trailer (146)
    4398
    4399 EOF

Version 3:

    5693
        Xref Section (52)
    5744
    5746
        Trailer (159)
    5904
    5905 EOF

PPDF> errors


PDF sections not found 
No indirect objects found in the body 
Unspecified parsing error 
Error parsing object: 5 0 obj (Unspecified parsing error) 

PPDF>

Lastly, when I attempted to write out save_version 1 114a.pdf it created a file with only the 2 header lines.

I also created a version of the file which has removed all the commented lines.

With this version PeePDF does not have any problems.

It seems that commented lines can cause PeePDF to wrongly parse a PDF file.

New `extract js` sub-command fails to extract all JavaScript

Attached is an example PDF file containing JavaScript which I used for testing:

my3.pdf

The new extract js command fails to extract the complete set of JavaScript fragments.

The info sub-command lists for "Suspicious elements":

17 objects with /JS
16 objects with /JavaScript

Here it agrees with the numbers listed by Didier Stevens' pdfid.py tool.

However, when it comes to listing the number of "Objects with JS code", it only lists 5 of these: 73, 74, 75, 76 and 77.

Manually checking the source code of my file shows that there is more JavaScript code in objects 32, 86, 87, 92, 94, 96, 98, 101, 104, 107 and 109.

The reason seems to be two-fold:

PeePDF does not seem to follow the /Next key in object 77 pointing to object 109.
PeePDF seems to miss JavaScripts which are invoked via /AA ("additional actions") dictionaries.
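To make the two points concrete, here is a toy traversal that follows both /Next chains and /AA dictionaries. Everything in it is a simplification for illustration: objects are plain dicts keyed by object number, references are bare integers, and the contents of objects 77, 109, 32 and 86 are invented (only the 77 to 109 /Next link comes from the report). It is not peepdf's data model.

```python
def collect_js(objects, start_id):
    """Walk /Next chains and /AA entries from start_id; return [(object id, JS source)]."""
    found, seen, pending = [], set(), [start_id]
    while pending:
        obj_id = pending.pop(0)
        if obj_id in seen or obj_id not in objects:
            continue  # skip cycles and dangling references
        seen.add(obj_id)
        obj = objects[obj_id]
        if "/JS" in obj:
            found.append((obj_id, obj["/JS"]))
        # /Next may reference a single action or a list of actions.
        next_ref = obj.get("/Next", [])
        pending.extend(next_ref if isinstance(next_ref, list) else [next_ref])
        # /AA maps trigger events to actions.
        pending.extend(obj.get("/AA", {}).values())
    return found


toy_objects = {
    77: {"/JS": "first()", "/Next": 109},
    109: {"/JS": "second()"},
    32: {"/AA": {"/O": 86}},
    86: {"/JS": "onOpen()"},
}
print(collect_js(toy_objects, 77))
print(collect_js(toy_objects, 32))
```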

The most recent version of pdfinfo -js is producing a more complete result (even though most of the /JS and /JavaScript name tokens are obfuscated).

Here is the extracted JavaScript output of PeePDF:

function Motion(msg, n) {
    var f = new String(msg);
    return f.substr(n) + f.substr(0, n);
}
function checkField(aField) {
    if (aField.value == "") { // empty
        var msg = "No fields can be left empty!";
        app.alert(msg);
        return 0;
    }
}
function goNext(item, event, cName) {
    AFNumber_Keystroke(0, 0, 0, 0, "", true);
    if (event.rc && AFMergeChange(event).length == event.target.charLimit) item.getField(cName).setFocus();
}
var f = this.getField("message.1");

if (global.ttIsRunning == 1) {
    app.clearInterval(global.run);
    global.ttIsRunning = 0;
}
var f = this.getField("message.1");

var code = new String("this.getField('message.1').value =  Motion(this.getField('message.1').value,2);");

global.ttIsRunning = 1;

//global.run = app.setInterval(code,50);

Here is the output of `pdfinfo -js` (from Poppler version 0.41.0):

//////Name Dictionary "Motion":

function Motion(msg,n)
{
    var f = new String(msg);
    return f.substr(n)+f.substr(0,n);
}

//////Name Dictionary "checkField":

function checkField( aField )
{
        if ( aField.value == "" ) {     // empty
                var msg = "No fields can be left empty!";
                app.alert( msg );
                return 0;
        } 
}

////Name Dictionary "goNext":

function goNext( item, event, cName )
{
        AFNumber_Keystroke(0,0,0,0, "", true );
        if ( event.rc && AFMergeChange(event).length == event.target.charLimit )
                item.getField( cName ).setFocus();
}

////Field Activated:

app.alert( "You are running version " + app.viewerVersion + " of Adobe Acrobat " + app.viewerType + " on the "+ app.platform + " platform.")

////Field Activated:

if (typeof(app.viewerType)!="undefined")
 if(app.viewerVersion < 5.0)
 {
  var msg = "Executing this script requires Acrobat 5.0.";
  app.alert(msg);
 }
else
 {
var n = this.getField ("65name"); 
var annot = this.addAnnot ({ 
 page: 0, 
 type: "Text", 
 author: n.value, 
 point: [462, 475, 810, 814], 
 strokeColor: color.blue, 
 popupOpen: true,
 contents: "If you can read this, you are too close!"
});
 }

////Field Activated:

if (typeof(app.viewerType)!="undefined")
 if(app.viewerVersion < 5.0)
 {
  var msg = "Executing this script requires Acrobat 5.0.";
  app.alert(msg);
 }
else
 {
var name = this.getField("66name");
var annot = this.addAnnot
({
page: 0,
type: "FreeText",
textFont: "Helvetica",
textSize: 18,
alignment: 1,
rect: [570, 450, 400, 400],
fillColor: ["RGB", 1, 1, 0],
strokeColor: color.blue,
name: "FreeText Note",
contents: "For something with more formatting control, a FreeText Annotation works nicely."
})
annot.author = name.value;
 }

////Field Activated:

// the procedure begins here
var okToSubmit = true;

// loop over all fields:
for (var j = 0; j < this.numFields; j++)
{
        var fieldname = this.getNthFieldName(j);
        var theField = this.getField("72field");
        if (theField.type != 'text')
        continue; // get past buttonfields
        var valid =  checkField(theField);
        if (!valid) // valid == 0? Halt!
        {
        okToSubmit = false;  // set flag
        break;   // exit loop prematurely
        }
}

////Field Activated:

if ( typeof( app.viewerVersion ) != undefined ) {       // are we running in a known viewer?
        if ( app.viewerVersion < 5.0 ) {                                // what version?
                var ourPath = this.path;
                var ourName = ourPath.split("/").pop();
                this.getField("message.2")= ourName;
        } else {
                var ourURL = this.URL;
                var ourName = ourURL.split("/").pop();
                this.getField("message.2")= ourName;
        }
}

////Page Open:

var f = this.getField("message.1");

var code = new String("this.getField('message.1').value =  Motion(this.getField('message.1').value,2);");

global.ttIsRunning = 1;

//global.run = app.setInterval(code,50);

////Page Close:

var f = this.getField("message.1");

if (global.ttIsRunning == 1) {
        app.clearInterval(global.run);
        global.ttIsRunning = 0;
}

////Widget Annotation Activated:

app.alert( "You are running version " + app.viewerVersion + " of Adobe Acrobat " + app.viewerType + " on the "+ app.platform + " platform.")

////Widget Annotation Cursor Enter:

var f = this.getField("tipMessage.1");
f.hidden = false;

////Widget Annotation Cursor Leave:

var f = this.getField("tipMessage.1");
f.hidden = true;

////Widget Annotation Activated:

if (typeof(app.viewerType)!="undefined")
 if(app.viewerVersion < 5.0)
 {
  var msg = "Executing this script requires Acrobat 5.0.";
  app.alert(msg);
 }
else
 {
var n = this.getField ("65name"); 
var annot = this.addAnnot ({ 
 page: 0, 
 type: "Text", 
 author: n.value, 
 point: [462, 475, 810, 814], 
 strokeColor: color.blue, 
 popupOpen: true,
 contents: "If you can read this, you are too close!"
});
 }

////Widget Annotation Activated:

if (typeof(app.viewerType)!="undefined")
 if(app.viewerVersion < 5.0)
 {
  var msg = "Executing this script requires Acrobat 5.0.";
  app.alert(msg);
 }
else
 {
var name = this.getField("66name");
var annot = this.addAnnot
({
page: 0,
type: "FreeText",
textFont: "Helvetica",
textSize: 18,
alignment: 1,
rect: [570, 450, 400, 400],
fillColor: ["RGB", 1, 1, 0],
strokeColor: color.blue,
name: "FreeText Note",
contents: "For something with more formatting control, a FreeText Annotation works nicely."
})
annot.author = name.value;
 }

////Widget Annotation Activated:

// the procedure begins here
var okToSubmit = true;

// loop over all fields:
for (var j = 0; j < this.numFields; j++)
{
        var fieldname = this.getNthFieldName(j);
        var theField = this.getField("72field");
        if (theField.type != 'text')
        continue; // get past buttonfields
        var valid =  checkField(theField);
        if (!valid) // valid == 0? Halt!
        {
        okToSubmit = false;  // set flag
        break;   // exit loop prematurely
        }
}

////Widget Annotation Activated:

if ( typeof( app.viewerVersion ) != undefined ) {       // are we running in a known viewer?
        if ( app.viewerVersion < 5.0 ) {                                // what version?
                var ourPath = this.path;
                var ourName = ourPath.split("/").pop();
                this.getField("message.2")= ourName;
        } else {
                var ourURL = this.URL;
                var ourName = ourURL.split("/").pop();
                this.getField("message.2")= ourName;
        }
}

error parsing when object/stream put after %%EOF

It appears that Acrobat renders PDF files properly even when an object or stream is defined after %%EOF. peepdf, however, discards that content because it stops parsing at %%EOF.

e.g: the recent hot pdf exploit, bd23ad33accef14684d42c32769092a0

0000023515 00000 n
0000024187 00000 n
0000024261 00000 n
trailer
<<
 /Size 67
 /Root 10 0 R
>>
startxref
24613
%%EOF

1 0 obj 
<<
 /Length 56305 
 /Filter /FlateDecode 
 >> 
 stream
....

The current peepdf fails to parse the file and throws an exception.

The following tries to fix the problem.

diff --git a/PDFCore.py b/PDFCore.py
index 3b2fe00..33cf5a4 100644
--- a/PDFCore.py
+++ b/PDFCore.py
@@ -4315,7 +4315,7 @@ class PDFBody :
                                 self.setObject(compressedId, compressedObject, offset)
                             del(compressedObjectsDict)
         for id in self.referencedJSObjects:
-            if id not in self.containingJS:
+            if (len(self.containingJS) and id not in self.containingJS):
                 object = self.objects[id].getObject()
                 if object == None:
                     errorMessage = 'Object is None'
@@ -6941,6 +6941,9 @@ class PDFParser :
                     self.fileParts.append(fileContent)
                 else:
                     sys.exit(errorMessage)
+        # append anything behind %%EOF
+        if fileContent:
+            self.fileParts.append(fileContent)
         pdfFile.setUpdates(len(self.fileParts) - 1)

         # Getting the body, cross reference table and trailer of each part of the file

With the change applied, the file parses without issues:

Version 0:
        Catalog: 10
        Info: No
        Objects (50): [6, 7, 9, 10, 11, 12, 14, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 61, 62, 63, 64, 65, 66]
                Errors (1): [33]
        Streams (14): [14, 15, 17, 25, 31, 32, 33, 34, 49, 51, 55, 56, 57, 62]
                Encoded (11): [14, 15, 17, 25, 31, 32, 33, 49, 51, 55, 56]
                Decoding errors (1): [33]
        Suspicious elements:
                /AcroForm (1): [10]
                /OpenAction (1): [10]
                /JS (1): [11]
                /JavaScript (1): [11]


Version 1:
        Catalog: No
        Info: No
        Objects (1): [1]
        Streams (1): [1]
                Encoded (1): [1]
        Objects with JS code (1): [1]
PPDF> object 1

<< /Length 56305
/Filter /FlateDecode >>
stream

var dlldata= [0x81ec8b55,0x000498ec,0xf4458900 ....

It's a quick fix, you may refactor the logic a bit...
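
The idea behind the patch can be shown outside of peepdf. This is only a minimal sketch, assuming the file is cut into parts at each `%%EOF` marker; `split_on_eof` is a hypothetical helper and not a peepdf function:

```python
EOF_MARKER = b"%%EOF"


def split_on_eof(data):
    """Split raw PDF bytes into parts that each end with %%EOF.

    Non-whitespace bytes found after the last marker are kept as a
    final part instead of being dropped.
    """
    parts = []
    start = 0
    while True:
        pos = data.find(EOF_MARKER, start)
        if pos == -1:
            break
        end = pos + len(EOF_MARKER)
        parts.append(data[start:end])
        start = end
    tail = data[start:]
    if tail.strip():  # keep trailing objects/streams, ignore pure whitespace
        parts.append(tail)
    return parts
```

With this approach the object placed behind the last `%%EOF` ends up in its own part, which matches the "Version 1" block in the output above.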

JSAnalysis.py always requires PyV8

What steps will reproduce the problem?
1. Don't install PyV8
2. try to run peepdf.py on any pdf w/ js

What is the expected output? What do you see instead?
For the Python script to load.

Instead presented with this:

Traceback (most recent call last):
  File "peepdf.py", line 32, in <module>
    from PDFCore import PDFParser, vulnsDict
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/PDFCore.py", line 31, in <module>
    from JSAnalysis import *
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/JSAnalysis.py", line 36, in <module>
    class Global(PyV8.JSClass):

NameError: name 'PyV8' is not defined

What version of the product are you using? On what operating system?
any


Please provide any additional information below.
placing the global class in the try block will fix it... probably a better fix.
try:
    import PyV8
    JS_MODULE = True 
    class Global(PyV8.JSClass):
        evalCode = ''

        def evalOverride(self, expression):
            self.evalCode += '\n\n// New evaluated code\n' + expression
            return
except:
    JS_MODULE = False

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 3:18
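
A slightly more defensive variant of the same idea, as a sketch only: it catches `ImportError` instead of every exception and defines `Global` only when the import succeeded. The `Global = None` fallback is an assumption made for this example and is not taken from peepdf:

```python
try:
    import PyV8  # optional dependency, may be missing
    JS_MODULE = True
except ImportError:
    PyV8 = None
    JS_MODULE = False

if JS_MODULE:
    class Global(PyV8.JSClass):
        evalCode = ''

        def evalOverride(self, expression):
            # Collect every piece of code passed to eval()
            self.evalCode += '\n\n// New evaluated code\n' + expression
else:
    # Assumed fallback so that importing this module never fails
    Global = None
```

Callers would then check `JS_MODULE` before touching `Global`.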

[Feature request] Add a `-C cmd1,cmd2,cmd3` command line option to peepdf

Currently one can run peepdf.py -s script my.pdf and have PeepDF.py execute the commands listed in the script file without a need to start it in interactive mode.

It would be nice if we had a more direct way to execute one or more small commands like this:

 peepdf.py -C "tree,offsets,filters 9" my.pdf
 peepdf.py --commands "tree,offsets,filters 9" my.pdf

This should then behave the same as running peepdf.py -s script where the contents of script was:

tree
offsets
filters 9

This feature would save us from the sometimes long-winded path of first creating or editing/modifying a script file.
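
A rough sketch of how such an option could be parsed. It uses `argparse` purely for illustration (which option parser peepdf uses is not stated here), and `parse_commands` and `build_parser` are hypothetical names:

```python
import argparse


def parse_commands(value):
    """Turn 'tree,offsets,filters 9' into ['tree', 'offsets', 'filters 9']."""
    return [cmd.strip() for cmd in value.split(",") if cmd.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="peepdf.py")
    parser.add_argument(
        "-C", "--commands",
        type=parse_commands,
        default=[],
        help="comma-separated list of console commands to run",
    )
    parser.add_argument("pdf", help="PDF file to analyze")
    return parser
```

Each entry of the resulting list could then be fed to the console exactly like one line of a script file. One limitation of this sketch is that a command argument containing a comma cannot be expressed.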

some pdfs delay static analysis indefinitely

Hi, from a project using peepdf.
spender-sandbox/cuckoo-modified#54
Some samples are here

[Question] Can peepdf be used to extract color space information from PDF?

I need to be able to analyze PDF files in order to find the color spaces used for each object (for press preflight purposes - mainly to find out if there are objects from color spaces other than CMYK and if there are any color profiles attached). I am having difficulties finding an open source tool to do that.
Would it be possible with peepdf?
Could peepdf be used at least as some intermediate step to achieve the task?

My aim is to create a command-line tool for verifying PDF files in terms of color space.

[Feature request] Improve output of `offsets` command (and remove the "off by one" bug when printing end of object)

The current output of the offsets command is buggy in so far as it reports the offset to the end of an indirect object as an integer that is off by one. Take this as an example PDF (hand-coded, no binary bytes, so it can be examined easily in a text editor):

https://gist.github.com/KurtPfeifle/63eaed91ee5dd26873cd

The offsets command reports this output for the file (I put in additional comments about what would be the correct values):

       0 Header
      74
        Object  1 (89)
     162                           ### 163   
     166
        Object  2 (236)
     401                           ### 402   
     405
        Object  3 (127)
     531                           ### 532   
     534
        Object  4 (208)
     741                           ### 742   
     745
        Object  5 (42)
     786                           ### 787   
     807
        Object  7 (92)
     898                           ### 899   
     902
        Object  8 (410)
    1311                           ### 1312   
    1315
        Object  9 (726)
    2040                           ### 2041   
    2044
        Object  10 (209)
    2252                           ### 2253   
    2317
        Xref Section (240)
    2556                           ### 2557
    2558
        Trailer (92)
    2649                           ### 2649, correct!
    2650 EOF

More importantly, it could be improved greatly by adding the following info:

Is there only whitespace between two consecutive indirect objects, or are there additional comments or even "junk" bytes? (See the %Trailer comment in line 148 of the sample PDF linked above.)
The lengths of all streams: report what the /Length NMNM entry of the given indirect object states _as well as_ what PeepDF itself calculates (as you know, there may be mismatches).
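
The first suggestion could be prototyped along these lines. This is a sketch under assumptions: the `(name, start, length)` input format is made up for the example, and the end of each object is treated as exclusive (start + length), which is exactly the convention under discussion above:

```python
# Whitespace characters as defined for PDF files
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def describe_gaps(data, spans):
    """Report what lies between consecutive objects.

    spans is a list of (name, start, length) tuples sorted by start.
    Returns (name, next_name, gap_size, only_whitespace) tuples.
    """
    report = []
    for (name, start, length), (next_name, next_start, _) in zip(spans, spans[1:]):
        end = start + length  # first byte after the object
        gap = data[end:next_start]
        only_ws = all(byte in PDF_WHITESPACE for byte in gap)
        report.append((name, next_name, len(gap), only_ws))
    return report
```

A gap flagged with `only_whitespace == False` would point at comments or junk bytes hidden between two objects.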

TypeError leads to an unhandled Exception

peepdf crashes with a TypeError if some PDFs are analyzed in force parsing mode and PDFObjectStream.resolveReferences() is invoked.

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: slice indices must be integers or None or have an __index__ method

If I fix that TypeError by converting offset at PDFCore.py:3243 to an int object I get another one:

Traceback (most recent call last):
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/main.py", line 409, in main
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 7098, in parse
    ret = body.updateObjects()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 4288, in updateObjects
    object.resolveReferences()
  File "/home/sdeiss/Developer/bin/virtualenv/peekaboo/local/lib/python2.7/site-packages/peepdf/PDFCore.py", line 3253, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: unbound method readObject() must be called with PDFParser instance as first argument (got str instance instead)

A possible solution would be to supply the PDFParser object to PDFObjectStream when creating that instance and then provide the supplied PDFParser instance for readObject().
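
Both changes could look roughly like this. The classes below are simplified, hypothetical stand-ins for PDFParser and PDFObjectStream (written for Python 3), not the real peepdf code, and the type that `offset` actually had in the failing case is not known from the traceback:

```python
class Parser(object):
    """Stand-in for PDFParser with an instance method for reading objects."""

    def read_object(self, content):
        return content.strip()


class ObjectStream(object):
    """Stand-in for PDFObjectStream that receives the parser it should use."""

    def __init__(self, parser, objects_section):
        self.parser = parser
        self.objects_section = objects_section

    def resolve(self, offset):
        # Coerce the offset before slicing to avoid the first TypeError,
        # then call the method on the supplied instance to avoid the second.
        index = int(offset)
        return self.parser.read_object(self.objects_section[index:])
```

Passing the parser in through the constructor keeps the object stream independent of how the parser is created.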

referenced before assignment

UnboundLocalError: local variable 'userPass' referenced before assignment on line 260 of PDFCrypto.py

Is there any way to have a file name of PDF on interactive mode?

I need to process a batch of PDF files and save just the JavaScript of each file. For that I need access to the name of the PDF file being analyzed. Currently I am using the -s option (a script file with commands):

peepdf -fl -s command pdffile.pdf

in command file I have:

extract js > abc.js

If there were a variable holding the PDF file name in interactive mode, I could use $filename.js instead of abc.js.
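
Until such a variable exists, one possible workaround is to generate a script file per PDF from the outside. A minimal sketch, assuming peepdf is on the PATH as `peepdf` and that the `-fl -s` invocation above is used; `write_script` and `build_command` are hypothetical helpers:

```python
import os


def write_script(pdf_path, out_dir):
    """Write a one-line peepdf script that saves the JS under the PDF's name."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    js_path = os.path.join(out_dir, base + ".js")
    script_path = os.path.join(out_dir, base + ".peepdf")
    with open(script_path, "w") as handle:
        handle.write("extract js > %s\n" % js_path)
    return script_path, js_path


def build_command(pdf_path, script_path):
    """Return the argument list for one peepdf run."""
    return ["peepdf", "-fl", "-s", script_path, pdf_path]
```

The returned list can be handed to `subprocess.call` in a loop over all PDF files.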

Additional cross-reference entry when create new PDF file

What steps will reproduce the problem?

1. Run "./peepdf.py -i"
2. Run "create pdf" in peepdf console
3. Run "save 'test.pdf'" in peepdf console


What is the expected output? What do you see instead?

The following content is the cross-reference table and trailer of test.pdf. The 
size of cross-reference table is 4, however, there are 5 entries in table. 
There is a useless entry in cross-reference table which does not point to an 
object.

xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000059 00000 n 
0000000118 00000 n 
0000000119 00000 n 
trailer
<< /Size 4
/Root 1 0 R >>
startxref
210
%%EOF


What version of the product are you using? On what operating system?

The version of peepdf is r42. The operating system is ubuntu-11.10 x86_64.


Please provide any additional information below.

Original issue reported on code.google.com by czchen on 24 Oct 2011 at 11:50
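
The mismatch can be detected mechanically. The following sketch compares the declared entry count of each cross-reference subsection with the number of entries that actually follow; `check_xref` is a hypothetical helper written for this example:

```python
import re

# "first_id count" subsection header, e.g. "0 4"
SUBSECTION_RE = re.compile(r"^(\d+) (\d+)\s*$")
# 20-byte entry, e.g. "0000000009 00000 n "
ENTRY_RE = re.compile(r"^\d{10} \d{5} [nf]\s*$")


def check_xref(lines):
    """Return (first_id, declared_count, actual_count) per subsection."""
    results = []
    current = None
    for line in lines:
        header = SUBSECTION_RE.match(line)
        if header:
            current = [int(header.group(1)), int(header.group(2)), 0]
            results.append(current)
        elif current is not None and ENTRY_RE.match(line):
            current[2] += 1
    return [tuple(item) for item in results]
```

For the table above this yields one subsection with 4 declared and 5 actual entries.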

Current Git source not functional

Current source from GitHub is not functional. I don't know if it is supposed to be (in the past, after all new commits, it was...)

_(I'm aware I don't have pylibemu and don't have PyV8 installed -- but this shouldn't matter here.)_

### Check current Git log:

 kp@mbp:git.peepdf.trunk > git log | head -n 12 
 commit c550c6d1e8b4cb507018deb73392a0487d5d96b4
 Author: Jose Miguel Esparza <[email protected]>
 Date:   Fri Jul 31 01:27:46 2015 +0200

     Added /Flash as element to monitor

 commit 79d0534981a98a9c553cc68f2b13a62f5afd5c5a
 Author: Jose Miguel Esparza <[email protected]>
 Date:   Mon Jul 27 23:42:55 2015 +0200

     Added some PEP8 magic and modified the limit output to 500 lines instead of 1000

### Create an empty, dummy PDF with Ghostscript:

 kp@mbp:git.peepdf.trunk > gs -q -o empty-dummy.pdf -sDEVICE=pdfwrite -c showpage

### Check the newly created PDF with `pdfinfo`:

 kp@mbp:git.peepdf.trunk > pdfinfo empty-dummy.pdf 
 Producer:       GPL Ghostscript GIT PRERELEASE 9.18
 CreationDate:   Fri Jul 31 11:42:22 2015
 ModDate:        Fri Jul 31 11:42:22 2015
 Tagged:         no
 UserProperties: no
 Suspects:       no
 Form:           none
 JavaScript:     no
 Pages:          1
 Encrypted:      no
 Page size:      612 x 792 pts (letter)
 Page rot:       0
 File size:      2383 bytes
 Optimized:      no
 PDF version:    1.5

### Run peepdf.py:

 kp@mbp:git.peepdf.trunk > ./peepdf.py -fil empty-dummy.pdf 
 Warning: PyV8 is not installed!!
 Warning: pylibemu is not installed!!

 File: empty-dummy.pdf
 MD5: fc9ef463e4de46cdb87805be0f0edc7b
 SHA1: ddaa418db6e95360273b13506d592d54dc49a311
 SHA256: c36758aa347e1e526addf50b8a795abe6c5cbd79d8d6c30cf86aebea67250ebd
 Size: 2383 bytes
 Version: 1.5
 Binary: True
 Linearized: False
 Encrypted: False
 Updates: 0
 Objects: 9
 Streams: 2
 Comments: 0
 Errors: 0

 Version 0:
    Catalog: 1
    Info: 2
    Objects (9): [1, 2, 3, 4, 5, 6, 7, 8, 9]
    Streams (2): [9, 5]
        Encoded (1): [5]

 *** Error: Exception not handled!!

 Please, don't forget to report the errors found:

    - Sending the file "$(pwd)/errors.txt" to the author  (mailto:[email protected])
    - And/Or creating an issue on the project webpage (https://github.com/jesparza/peepdf/issues)

Error: String object with Windows-formatted IP addresses

Hi,

I have found an issue when I try to add/modify string objects with Windows-formatted IP addresses (such as \127.0.0.1).

peepdf detects these IP addresses as if they were octal number \ddd. If the IP address has numbers bigger than 7, an exception occurs in the conversion to octal.

Please, specify the string object content:
\\192.168.1.1
*** Error: The object has not been modified!!
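
Why this fails can be reproduced in isolation: `int("192", 8)` raises a ValueError because 9 is not an octal digit. The sketch below shows one way to avoid it by consuming escapes from left to right, so that an escaped backslash is handled before any digits are looked at, and by only accepting the digits 0-7. `unescape` is a hypothetical helper and covers only these two escape forms; how peepdf tokenizes strings internally is not shown in this report:

```python
import re

# Either an escaped backslash or a backslash followed by 1-3 octal digits
ESCAPE_RE = re.compile(r"\\(\\|[0-7]{1,3})")


def unescape(text):
    """Resolve escaped backslashes and \\ddd octal escapes in a string."""
    def replace(match):
        token = match.group(1)
        if token == "\\":
            return "\\"
        return chr(int(token, 8))
    return ESCAPE_RE.sub(replace, text)
```

With this approach the input `\\192.168.1.1` becomes `\192.168.1.1` instead of triggering an octal conversion of `192`.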

new PIL support

I just managed to get DCTFilter to work on my Ubuntu box. It turned out that PIL removed the tostring() method.

Exception: tostring() has been removed. Please call tobytes() instead.

Regards
Piotr
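
A small compatibility shim would keep both old and new PIL/Pillow releases working. This is a sketch, not the change that was actually made; `image_to_bytes` is a hypothetical helper and `image` is expected to be a PIL image object:

```python
def image_to_bytes(image):
    """Return raw image data, preferring tobytes() over the removed tostring()."""
    to_bytes = getattr(image, "tobytes", None)
    if to_bytes is None:
        # Older PIL releases only offer tostring()
        to_bytes = image.tostring
    return to_bytes()
```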

Changing JavaScript engine

It seems that v8 is no longer maintained.
Do you have any plans to change peepdf javascript engine?
For example I cannot install v8 on my Mac OS X (see https://code.google.com/p/pyv8/issues/detail?id=246) so I cannot use the function js_analyse.

Error: Exception not handled!! when trying to run the changelog command

I'm trying to use peepdf for the changelog feature, but I can't make it work and would appreciate some help.

I first tried running the program in the interactive console, but when I call the "open" command I get the error:

*** Error: Exception not handled using the interactive console!! Please, report it to the author!!

And the error.txt file contains:

Traceback (most recent call last):
  File "C:\user\pdf\peepdf2\peepdf.py", line 727, in <module>
    console.cmdloop()
  File "C:\Python27\lib\cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "C:\Python27\lib\cmd.py", line 221, in onecmd
    return func(arg)
  File "C:\user\pdf\peepdf2\PDFConsole.py", line 2858, in do_open
    ret = pdfParser.parse(fileName, forceMode, looseMode)
  File "C:\user\pdf\peepdf2\PDFCore.py", line 7054, in parse
    sys.exit('Error: An error has occurred while parsing an indirect object!!')
SystemExit: Error: An error has occurred while parsing an indirect object!!

Then I tried running the command directly through parameters:

python.exe "C:\user\pdf\peepdf2\peepdf.py" -C changelog -f "C:\user\pdf\pdf_test.pdf"

But another error occurred:

Error: Exception not handled!!

errors.txt:

Traceback (most recent call last):
  File "C:\user\pdf\peepdf2\peepdf.py", line 494, in <module>
    ret, pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "C:\user\pdf\peepdf2\PDFCore.py", line 7061, in parse
    ret = body.updateObjects()
  File "C:\user\pdf\peepdf2\PDFCore.py", line 4283, in updateObjects
    object.resolveReferences()
  File "C:\user\pdf\peepdf2\PDFCore.py", line 3243, in resolveReferences
    ret = PDFParser.readObject(objectsSection[offset:])
TypeError: slice indices must be integers or None or have an __index__ method

What could be causing those issues?

XFA JS decapsulation invalid for invalid xref table

What steps will reproduce the problem?
1. Get this specially forged PDF:
https://www.virustotal.com/en-gb/file/be9c0025b99f0f8c55f448ba619ba303fc65eba862cac65a00ea83d480e5efec/analysis/
2. run peepdf -fi filename
3. run js_analysis object 6

What is the expected output? What do you see instead?

Run the JS code with PyV8.

Because there are XFA tags opening and closing, js emulation fails:

*** Error analysing Javascript: SyntaxError: Unexpected token < (  @ 1 : 0 )  
-> <? xml version = "1.0"


What version of the product are you using? On what operating system?


Version: peepdf 0.2 r203
Ubuntu 12.10

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 18 Oct 2013 at 2:53
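
One way to approach this is to strip the XFA wrapper before handing the code to the JS engine. A rough sketch that assumes the JavaScript sits inside `<script>` elements of the XFA XML; it uses a regular expression instead of a real XML parser because such files are often malformed, and it does not decode XML entities or CDATA sections. `extract_xfa_scripts` is a hypothetical helper:

```python
import re

SCRIPT_RE = re.compile(
    r"<script\b[^>]*>(.*?)</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_xfa_scripts(xfa_xml):
    """Return the non-empty bodies of all <script> elements."""
    bodies = SCRIPT_RE.findall(xfa_xml)
    return [body.strip() for body in bodies if body.strip()]
```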

error during analysis pdf

this is the error.log

Traceback (most recent call last):
  File "./peepdf.py", line 541, in <module>
    console.cmdloop()
  File "/usr/lib/python2.7/cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "/usr/lib/python2.7/cmd.py", line 219, in onecmd
    return func(arg)
  File "/usr/local/peepdf/PDFConsole.py", line 2721, in do_open
    ret = pdfParser.parse(fileName, forceMode, looseMode)
  File "/usr/local/peepdf/PDFCore.py", line 6838, in parse
    sys.exit('Error: An error has occurred while parsing an indirect object!!')
SystemExit: Error: An error has occurred while parsing an indirect object!!
Traceback (most recent call last):
  File "./peepdf.py", line 541, in <module>
    console.cmdloop()
  File "/usr/lib/python2.7/cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "/usr/lib/python2.7/cmd.py", line 219, in onecmd
    return func(arg)
  File "/usr/local/peepdf/PDFConsole.py", line 2721, in do_open
    ret = pdfParser.parse(fileName, forceMode, looseMode)
  File "/usr/local/peepdf/PDFCore.py", line 6838, in parse
    sys.exit('Error: An error has occurred while parsing an indirect object!!')
SystemExit: Error: An error has occurred while parsing an indirect object!!


do you need other info?

thanks a lot

Original issue reported on code.google.com by [email protected] on 23 Jun 2014 at 3:26

PNG prediction decode only decodes part of the image

When using PDFs containing PNG images with prediction > 10, the current 
implementation only decodes part of the image (1/3 of each row of the image).

Luckily, I already found the problem and I will attach a patch with a possible 
solution :)

Original issue reported on code.google.com by [email protected] on 17 Sep 2013 at 9:55

Attachments:

png_prediction.patch

invalid dictOwnerPass prevents further processing

What steps will reproduce the problem?
1. https://www.virustotal.com/en/file/784d1ebd1faccec27f98970cc266859eaf5676da1c451e3304fb55435d8c8473/analysis/
2. run peepdf.py -f vtfile


What is the expected output? What do you see instead?

#Expected:

Warning: PyV8 is not installed!!
Warning: pylibemu is not installed!!
Decryption error: Bad format for /O!!
Decryption error: Bad format for /U!!
Decryption error: Default user password not working here!!

File: tp_22340_utf8_88292d7181514fda5390292d73da28d4
MD5: 88292d7181514fda5390292d73da28d4
SHA1: fbc3856fd689e1ac0f8fb56bbd7d0a2b8332a928
Size: 807079 bytes
Version: 1.4
Binary: True
Linearized: False
Encrypted: True (RC4 40 bits)
Updates: 0
Objects: 7
Streams: 1
Comments: 0
Errors: 5

Version 0:
    Catalog: 1
    Info: No
    Objects (7): [1, 2, 3, 4, 5, 8, 9]
        Errors (1): [5]
    Streams (1): [5]
        Encoded (1): [5]
        Decoding errors (1): [5]
    Suspicious elements:
        /AcroForm: [1]
        /OpenAction: [1]
        /JS: [1]
        /JavaScript: [1]

#Instead see:

Traceback (most recent call last):
  File "peepdf.py", line 352, in <module>
    ret,pdf = pdfParser.parse(fileName, options.isForceMode, options.isLooseMode, options.isManualAnalysis)
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/PDFCore.py", line 6822, in parse
    ret = pdfFile.decrypt()
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/PDFCore.py", line 5179, in decrypt
    ret = computeUserPass(password, dictO, fileId, perm, keyLength, revision, encryptMetadata)
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/PDFCrypto.py", line 164, in computeUserPass
    ret = computeEncryptionKey(userPassString, dictO, dictU, dictOE, dictUE, fileID, pElement, keyLength, revision, encryptMetadata)
  File "/Users/tross/Code/satori/peepdf_service/peepdf-svn/PDFCrypto.py", line 58, in computeEncryptionKey
    md5input = password + dictOwnerPass + struct.pack('<I',abs(int(pElement))) + fileID
TypeError: cannot concatenate 'str' and 'instance' objects

What version of the product are you using? On what operating system?
latest version from svn, any os

Please provide any additional information below.
In force mode, when errors are encountered, the dictO/dictOwnerPass object does not resolve to a simple string, which prevents further execution.

Original issue reported on code.google.com by [email protected] on 5 Sep 2013 at 3:35

Attachments:

PDFCore.py_patch

metadata command crashed peepdf

What steps will reproduce the problem?
1. running metadata in the console on a malformed PDF


What is the expected output? What do you see instead?
The program crashed with:

Traceback (most recent call last):
  File "/home/.../bin/peepdf.py", line 465, in <module>
    console.cmdloop(stats + newLine)
  File "/usr/lib64/python2.6/cmd.py", line 142, in cmdloop
    stop = self.onecmd(line)
  File "/usr/lib64/python2.6/cmd.py", line 219, in onecmd
    return func(arg)
  File "/home/.../src/svn/sec/peepdf-read-only/PDFConsole.py", line 2290, in do_metadata
    type = object.getElementByName('/Type').getValue()
AttributeError: 'list' object has no attribute 'getValue'


What version of the product are you using? On what operating system?
r158 from svn

Please provide any additional information below.

I don't know if the patch is the right long-term solution, but it solved my 
crash.

Maybe every interactive command should be in a try/except block, so the program 
does not crash on the user?

Original issue reported on code.google.com by [email protected] on 30 Nov 2012 at 3:27

Attachments:

peepdf-metadata-crash.patch
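
The try/except idea can be applied in one place when the console is built on Python's `cmd.Cmd`, as the traceback suggests. A minimal sketch with a hypothetical `SafeConsole` class; the `do_metadata` body only simulates the crash reported above:

```python
import cmd
import traceback


class SafeConsole(cmd.Cmd):
    prompt = "PPDF> "
    last_error = ""

    def onecmd(self, line):
        """Run one command, but never let its exception end the session."""
        try:
            return cmd.Cmd.onecmd(self, line)
        except Exception:
            self.last_error = traceback.format_exc()
            self.stdout.write("*** Error: command failed, see last_error\n")
            return False  # keep the command loop running

    def do_metadata(self, arg):
        # Simulates the failure from the report
        raise AttributeError("'list' object has no attribute 'getValue'")
```

Wrapping `onecmd` covers every `do_*` command at once, so individual commands do not need their own handlers.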

PyV8 no longer buildable

PyV8 is no longer hosted on Google Code, nor is it supported. None of the old mirrors people have made on GitHub will build. Maybe switch to a new library?

Error: ReferenceError: print is not defined

I installed pyv8 from https://github.com/buffer/pyv8. The installation of pyv8 succeeded; however, running the following test fails to interpret the javascript code in the variable jscode:

PPDF> set jscode "var a = 8; a = a + 2; print('The content of the variable is '+a);"
PPDF> js_eval variable jscode


*** Error: ReferenceError: print is not defined (  @ 1 : 22 )  -> var a = 8; a = a + 2; print('The content of the variable is '+a);

I get similar messages for PDFs that contain javascript and where I execute js_eval.

I ran this test on debian 7.11 and on remnux 6 with the latest version of peepdf found on github. I would appreciate if you can let me know if there is anything else I am missing in order to get peepdf to interpret javascript code.

[Feature request] Add functionality to encode streams (in addition to support encoding of variables, files or raw byte ranges)

I would like PeepDF to have functionality for encoding _streams_. Currently, it is only possible to filter/encode _variables_, _files_ and _raw byte ranges_:

PPDF> encode help

Usage: encode variable $var_name $filter1 [$filter2 ...]
Usage: encode file $file_name $filter1 [$filter2 ...]
Usage: encode raw $offset $num_bytes $filter1 [$filter2 ...]

Encodes the content of the specified variable, file or raw bytes using the following 
filters or algorithms:
[....]

So it would be nice to have this:

PPDF> encode help

Usage: encode variable $var_name $filter1 [$filter2 ...]
Usage: encode file $file_name $filter1 [$filter2 ...]
Usage: encode raw $offset $num_bytes $filter1 [$filter2 ...]
Usage: encode stream $object $filter1 [$filter2 ...]

Encodes the content of the specified variable, file, raw bytes or stream from $object 
using the following filters or algorithms:
[....]

Of course, re-directing the output of that function to a file should also work:

PPDF>  encode stream 9 lzw ahx > stream9-lzw-ahx.txt

I know I can work around this by using encode variable, encode file or encode raw. But this requires taking the troublesome path of first putting the current stream content into a file or variable, or of calculating the offset to the stream and its length beforehand.

Add a jjdecoder function

CVE-2013-3346 pdf samples have obfuscated Javascript code using jjencode 
(http://utf-8.jp/public/jjencode.html). It would be nice to have a jjdecoder in 
peepdf to quickly deobfuscate the code.

Sample jjdecoder written in Javascript can be found here: 
http://csc.cs.utm.my/syed/images/files/jjdecode/jjdecode.html

Some explanation about how a jjdecoder works can be found here: 
http://corkami.googlecode.com/svn-history/r399/trunk/misc/jjencode.txt

Original issue reported on code.google.com by [email protected] on 12 Dec 2013 at 12:28

Patch: Permit in-memory scanning (from a variable)

We have an automated malware analysis system that runs a variety of scans in 
memory on input files.  We patched PDFCore.py to enable string input of file 
contents, rather than a filename.  It is attached, in case anyone finds it 
useful.

Original issue reported on code.google.com by [email protected] on 22 Mar 2012 at 2:48

Attachments:

peepdf-inmemory.patch

support for filter jbig2

Hi,

This is an amazing tool for analyzing malicious PDFs. Is there a timeline for adding support for the JBIG2 filter?

Thanks

*** Error: Exception not handled using the interactive console!! Please, report it to the author!!

Howdy, this is all I have using your tool from the interactive mode attempting to open a pdf file saved from browser input after disabling the chrome built-in pdf viewer.
Chrome and the default PDF viewer (currently on Mac OS X) also cannot display the file.
The original PDF is stored as a Blob.
I am saving the blob to disk file as content-type: application/pdf.
The newly saved file (from Blob) is PDF viewer viewable without issues.
The issue appears when I then read the file from disk and attempt to display it.
Chrome PDF viewer or the default desktop viewer only say unable to load PDF document.
Your PDF tool as you see reports:
*** Error: Exception not handled using the interactive console!! Please, report it to the author!!
Somewhere in the input stream the PDF is getting corrupted.
I was hoping your tool would help.
iText rups is no longer available and other good PDF analysis tools don't seem to exist.
