joswr1ght / pptxurlcheck

Parse a PowerPoint PPTX file, extracting all URLs from notes and slides, and test for validity

License: MIT License

Python 100.00%

pptxurlcheck's Introduction

pptxurlcheck

Parse a PowerPoint PPTX file, extract all URLs from the slides and notes, and test each one for validity, reporting ERR or the non-OK HTTP status code for any URL that fails.

Usage

$ pptxurlcheck.py
Validate URLs in the notes and slides of one or more PowerPoint pptx files. (version 2.0)
Check GitHub for updates: http://github.com/joswr1ght/pptxurlcheck

Usage: pptxurlcheck.py [pptx file(s)]
$ pptxurlcheck.py SEC555/*.pptx
URL validation report created at SEC555/pptxurlreport.csv.
$ head -5 SEC555/pptxurlreport.csv
File#,Page,Response,URL,Note
1,5,ERR,https://intel.criticalstack.com,Maximum retry failure exceeded (possible bad server name)
1,5,ERR,https://sec555.com/4p,Maximum retry failure exceeded (possible bad server name)
2,54,404,http://schemas.microsoft.com/win/2004/08/events/event,
2,157,404,https://www.elastic.co/elasticon/2015/sf/scaling-elasticsearch-for-production-at-verizon,
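
Because the report is plain CSV with the header shown above, it can be post-processed with standard tools. The following is a minimal sketch, not part of pptxurlcheck, that counts report rows per Response value. The function name summarize_report is hypothetical, and the sample rows are copied from the excerpt above.

```python
import csv
import io
from collections import Counter


def summarize_report(csv_text):
    """Count report rows per Response value (for example 'ERR' or '404')."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Response"] for row in reader)


# Sample rows copied from the report excerpt above.
sample = """File#,Page,Response,URL,Note
1,5,ERR,https://intel.criticalstack.com,Maximum retry failure exceeded (possible bad server name)
1,5,ERR,https://sec555.com/4p,Maximum retry failure exceeded (possible bad server name)
2,54,404,http://schemas.microsoft.com/win/2004/08/events/event,
"""

summary = summarize_report(sample)
for response, count in summary.most_common():
    print(response, count)
```

To process a real report, read the file contents and pass them to the function in place of the sample string.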

Pptxurlcheck searches all slide bullets and notes pages for URLs and attempts to retrieve each one. By default, valid URLs (for example, those that return 200 OK) are omitted from the report; all other URLs are listed along with the HTTP status code. ERR indicates that the server could not be reached. To see every URL that is tested, set the environment variable SKIP200 to 0:

$ SKIP200=0 ~/Dev/pptxurlcheck/pptxurlcheck.py SEC555/SEC555_1_G01_01_JH.pptx
URL validation report created at SEC555/pptxurlreport.csv.
$ head -4 SEC555/pptxurlreport.csv
File#,Page,Response,URL,Note
1,7,200,https://content.fireeye.com/m-trends,
1,7,200,https://sec555.com/2g,
1,9,500,https://sec555.com/2i,

Windows users can set the environment variable with the set command before running pptxurlcheck:

C:\>set SKIP200=0
C:\>pptxurlcheck SEC561.pptx
...
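
To illustrate the extraction step described above, here is a minimal sketch of how URLs could be pulled from the text of slides and notes pages. It is not the tool's actual implementation. It relies on the fact that a PPTX file is a ZIP archive whose slide and notes text lives in DrawingML text runs; the function name extract_urls, the URL pattern, and the trailing-punctuation trimming are all assumptions made for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) in slide and notes XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Deliberately simple pattern (an assumption): http(s) up to whitespace or a quote.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# Slide and notes parts inside the archive; relationship files are skipped.
PART_RE = re.compile(r"ppt/(slides|notesSlides)/[^/]+\.xml$")


def extract_urls(pptx_file):
    """Return sorted unique (part name, URL) pairs found in slide and notes text."""
    found = set()
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            if not PART_RE.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            text = " ".join(run.text or "" for run in root.iter(A_NS + "t"))
            for url in URL_RE.findall(text):
                # Trim punctuation that commonly trails a URL in a sentence.
                found.add((name, url.rstrip(".,;:)")))
    return sorted(found)


# Build a tiny in-memory archive with one slide part to exercise the function.
slide_xml = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    "<a:t>See https://sec555.com/2g.</a:t></p:sld>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/slides/slide1.xml", slide_xml)

urls = extract_urls(buffer)
print(urls)
```

This sketch only sees URLs that appear as visible text. Click targets attached to text or shapes are stored in separate relationship parts and would need additional handling, and a URL split across several text runs would not be reassembled.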

Ignored URLs

Pptxurlcheck ignores several URL patterns that don't make sense for validation purposes:

  • RFC1918 private IP addresses
  • Loopback IP addresses
  • localhost
  • Domains ending in .onion and .i2p
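
The patterns above can be sketched as a single predicate. This is an illustrative approximation rather than the tool's own code: the name is_ignored is hypothetical, and the choice to treat an unparseable URL as not ignored is an assumption.

```python
import ipaddress
from urllib.parse import urlsplit

IGNORED_SUFFIXES = (".onion", ".i2p")
# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_ignored(url):
    """Return True when the URL's host matches one of the ignored patterns."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        return False  # unparseable URL: assumed to be left for the checker to report
    if host == "localhost" or host.endswith(IGNORED_SUFFIXES):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # an ordinary DNS name, not an IP literal
    if address.is_loopback:
        return True
    return address.version == 4 and any(address in net for net in RFC1918_NETWORKS)


for candidate in ("http://192.168.1.10/admin", "http://localhost/", "https://sec555.com/2g"):
    print(candidate, is_ignored(candidate))
```

Checking against explicit RFC 1918 networks, rather than the broader is_private attribute of the ipaddress module, keeps the sketch aligned with the list above.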

Optionally, a file with a .txt extension may be given at any position on the command line. This file should list the URLs to exclude from the URL check report, one per line. An entry in the ignore file must exactly match the URL in the PowerPoint file for that URL to be ignored:

$ cat ignoreurls.txt
https://update.googleapis.com/service/update2
https://www.godaddy.com/whois/results.aspx?id=J7TA5oEZ8R8JbAAdtaCg
http://www.[target_company].com
https://www.redacted.gov/wp-content/uploads/2019/06/MEP_programsandprojects.pdf
$ pptxurlcheck.py SEC504_*pptx ignoreurls.txt
URL validation report created at pptxurlreport.csv
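
The ignore-file behavior described above can be sketched as follows. This is an assumption-laden illustration, not the tool's code: the helper names split_arguments, load_ignore_list, and filter_urls are hypothetical, and stripping surrounding whitespace and dropping blank lines are assumptions about how the file is read.

```python
def split_arguments(args):
    """Separate .txt ignore files from presentation files, wherever they appear."""
    ignore_files = [arg for arg in args if arg.lower().endswith(".txt")]
    pptx_files = [arg for arg in args if not arg.lower().endswith(".txt")]
    return pptx_files, ignore_files


def load_ignore_list(lines):
    """Return the set of URLs to skip, one per line; blank lines are dropped."""
    return {line.strip() for line in lines if line.strip()}


def filter_urls(urls, ignored):
    """Keep only URLs that are not an exact match for an ignore entry."""
    return [url for url in urls if url not in ignored]


pptx_files, ignore_files = split_arguments(["a.pptx", "ignoreurls.txt", "b.pptx"])
ignored = load_ignore_list([
    "https://update.googleapis.com/service/update2\n",
    "http://www.[target_company].com\n",
    "\n",
])
remaining = filter_urls(
    [
        "https://update.googleapis.com/service/update2",
        "https://update.googleapis.com/service/update2/",  # trailing slash: not an exact match
        "https://sec555.com/2g",
    ],
    ignored,
)
print(pptx_files, ignore_files, remaining)
```

Because matching is exact, even a trailing slash is enough to keep a URL in the report, as the second sample URL shows. With a real file, the lines would come from opening the ignore file and iterating over it.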

Platforms

Tested on Windows 10, macOS 11.2, and Debian-based Linux. A Windows binary is included in the bin/ directory; it was built with Python 3.9.5 using pyinstaller --onefile --hidden-import urllib3 --hidden-import requests pptxurlcheck.py.

Questions, Comments, Concerns?

Open a ticket, or drop me a note: [email protected].

pptxurlcheck's People

Contributors

joswr1ght, webbreacher

pptxurlcheck's Issues

Bus Error 10

When I try to open the site www.whois.net, I get bus error 10. The attached file reproduces the error on Mac OS X 10.11 with Python 2.7.11.

bus.pptx

It does not appear to happen on Linux with Python 2.7.8.

ERR for valid URL

$ python pptxsanity.py pptx.pptx
ERR : http://www.macroplant.com/

Visiting this URL redirects to an SSL page - maybe 302 redirect handling isn't working properly?
