
Spider for CasperJS

This script uses CasperJS to crawl your site and log all URLs, response codes, errors, and warnings to a JSON file for parsing.


What casperjs-spider does

  • Spiders whatever site you point it at
  • Returns a list of:
    • all links with response codes
    • all JavaScript errors from the console
    • all insecure SSL warnings
    • all warning messages
  • Does not repeat URLs
  • Allows you to skip specified terms
  • Allows you to require specified terms
  • Exports a data.json file with your results (see the sample sketched below)
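
For reference, here is a rough sketch of what the exported data.json might contain. The field names are illustrative assumptions, not a guaranteed schema; inspect your own output for the authoritative shape.

  {
    "date": "2013-12-22",
    "links": [
      { "url": "http://example.com/", "status": 200 },
      { "url": "http://example.com/missing", "status": 404 }
    ],
    "errors": ["ReferenceError: foo is not defined"],
    "warnings": ["Insecure resource loaded over http"]
  }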

Getting Started

Make sure you have CasperJS and PhantomJS installed.
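
If you don't have them yet, one common route is to install both globally via npm (this assumes Node.js is available; package names and versions vary over time, so check each project's install docs):

npm install -g phantomjs
npm install -g casperjs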

Configure the script by setting your options in config.js or by passing arguments on the command line.

In your terminal, navigate to the folder containing the spider.js file.

  • Using config.js settings:

casperjs spider.js

  • Configuring with arguments:

casperjs --start-url=http://example.com --required-values=example.com spider.js

Command-line arguments go between casperjs and spider.js, and they override the corresponding options in config.js.
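
For example, combining several of the flags documented below (all of these flags are defined in this README; the values are placeholders):

casperjs --start-url=http://example.com --required-values=example.com --skipped-values=mailto,\#,blog/ --limit=25 spider.js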

Config Options

There are several configuration options in casperjs-spider. You can set them individually on the command line, or by editing config.js.

It might help to refer to the default config options in config.js for examples.
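
Taken together, a config.js using the options below might look like this rough sketch (the values are placeholders and the shipped config.js is the authoritative reference):

config.startUrl = 'http://example.com';
config.requiredValues = 'example.com';
config.skippedValues = 'mailto,install,#,blog/,comment';
config.limit = 0;                // 0 = no limit
config.fileLocation = './logs/';
config.dateFileName = false;
config.verbose = false;
config.logLevel = 'error';
config.loadImages = false;
config.loadPlugins = false;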

start-url *required

  • Also defined as config.startUrl in config.js. This is the starting URL for your spider to crawl.
  • --start-url=http://example.com
  • config.startUrl = 'http://example.com';

required-values *required

  • Also defined as config.requiredValues in config.js. This is a comma-separated list of all required strings.
  • --required-values=example.com
  • config.requiredValues = 'example.com';
  • Make sure you include your own domain here to keep the spider from crawling the entire internet!
  • Leave off the protocol so that subdomains will match as well, if you have them.

skipped-values

  • Also defined as config.skippedValues in config.js. This is a comma-separated list of all skipped strings.
  • It might be helpful to skip URLs containing strings like mailto, install, forums, or blog.
  • --skipped-values=mailto,install,\#,blog/,comment
  • config.skippedValues = 'mailto,install,#,blog/,comment';

limit

  • Also defined as config.limit in config.js. This is a numeric limit on the number of links logged.
  • Set it to 0, or omit it, for no limit.
  • --limit=25
  • config.limit = 25;

user-agent

  • Also defined as config.userAgent in config.js. You can supply a custom userAgent string.
  • Omit to use the default PhantomJS user agent.
  • --user-agent="Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25"
  • config.userAgent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25';

file-location default=./logs/

  • Also defined as config.fileLocation in config.js. This is a path to where you want the data.json file to be saved.
  • --file-location=./logs/
  • config.fileLocation = './logs/';

date-file-name default=false

  • Also defined as config.dateFileName in config.js. This is a boolean that replaces the data.json filename with the current date (e.g. 2013-12-22.json), in case you want to keep dated versions.
  • --date-file-name=false
  • config.dateFileName = false;

verbose default=false

  • Also defined as config.verbose in config.js. This is a boolean that puts Casper into verbose mode.
  • --verbose=false
  • config.verbose = false;

log-level default=error

  • Also defined as config.logLevel in config.js. Sets the CasperJS logging level; one of error, warning, info, or debug.
  • --log-level=error
  • config.logLevel = 'error';

load-images default=false

  • Also defined as config.loadImages in config.js. Disables image loading in the crawler. This speeds up the crawl, and images are generally not needed for the output.
  • --load-images=false
  • config.loadImages = false;

load-plugins default=false

  • Also defined as config.loadPlugins in config.js. Disables plugin loading in the crawler. This speeds up the crawl, and plugins are generally not needed for the output.
  • --load-plugins=false
  • config.loadPlugins = false;

cb default=null

  • You can add your own callback if you're using the config file: just uncomment the config.cb function in config.js, or add your own after it (see the sketch below).
  • config.cb = function(data) { return data; };
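
As an illustrative sketch, a callback could filter the results before they are written. The shape of data is an assumption here (an object with a links array of { url, status } entries); check the actual output of your crawl before relying on it:

config.cb = function(data) {
  // Hypothetical: keep only links that did NOT return 200,
  // so the report lists just the problem URLs.
  data.links = data.links.filter(function(link) {
    return link.status !== 200;
  });
  return data;
};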

Contributing

Feel free to edit for yourself, or send a pull-request with any improvements.

Any pull requests should be branched from master and submitted against a separate branch prefixed with incoming-.

This script wouldn't be possible without PlanZero, whose script I started from at the very beginning. I still highly recommend checking it out as a bare-bones version.
