GithubHelp home page GithubHelp logo

casperjs-spider's Introduction

Spider for CasperJS

As you probably already know. casperJS is an awesome library that wraps around the equally awesome, phantomJS.

This script will crawl your site and log all urls, response codes, errors and warnings to a json file for parsing.


What casperjs-spider does

  • Spiders whatever site you want it to
  • Returns list of:
    • all links with response codes
    • all javascript errors from console
    • all ssl insecure warnings
    • all warning messages
  • Does not repeat URLs
  • Allows you to skip specified terms
  • Allows you to require specified terms
  • Exports a data.json file with your results

Getting Started

You can run the whole thing by setting your config options in spider.js and navigating to the folder in your CL and typing:

casperjs spider.js

Or you can pass arguments directly into the command line like this:

casperjs --start-url=http://espn.com --required-values=espn.com,nfl,fantasy spider.js

Casper arguments go in the middle, and they will override config options in the script.

Config Options

There are several configuration options in casperjs-spider. You can set them individually in the command line, or by editing the config portion of spider.js.

It might help to refer to the default config options in the top of spider.js for examples

start-url *required

  • Also defined as config.startUrl in spider.js. This is the starting URL for your spider to crawl.
  • --start-url=http://espn.com
  • config.startUrl = 'http://espn.com';

required-values *required

  • Also defined as config.requiredValues in spider.js. This is a comma-separated list of all required strings.
  • --required-values=espn.com
  • config.requiredValues = 'espn.com';
  • Make sure you put your top-level domain in here to keep from spidering the internet!
  • Leave off the protocol so it will allow for subdomains if you have them.

skipped-values

  • Also defined as config.skippedValues in spider.js. This is a comma-separated list of all skipped strings.
  • It might be helpful to skip URLs like mailto, install, forums, blogs etc...
  • --skipped-values=mailto,install,\#,blog/,comment
  • config.skippedValues = 'mailto,install,#,blog/,comment';

file-location default=./logs/

  • Also defined as config.fileLocation in spider.js. This is a path to where you want the data.json file to be saved.
  • --file-location=./logs/
  • config.fileLocation = './logs/';

date-file-name default=false

  • Also defined as config.dateFileName in spider.js. This is a boolean to replace the filename data.json with the current date. ie. 2013-12-22.json. In case you want to keep versions.
  • --date-file-name=false
  • config.dateFileName = false;

verbose default=false

  • Also defined as config.verbose in spider.js. This is a boolean to put casper into verbose mode.
  • --verbose=false
  • config.verbose = false;

log-level default=error

  • Also defined as config.logLevel in spider.js. SpiderJS allows you to set a logging level. can be [error, warning, info, debug]
  • --log-level=error
  • config.logLevel = 'error';

load-images default=false

  • Also defined as config.loadImages in spider.js. SpiderJS allows you to disable images from loading in the crawler. This speeds up the crawl, and is generally not necessary for output.
  • --load-images=false
  • config.loadImages = 'false';

load-plugins default=false

  • Also defined as config.loadPlugins in spider.js. SpiderJS allows you to disable plugins from loading in the crawler. This speeds up the crawl, and is generally not necessary for output.
  • --load-plugins=false
  • config.loadPlugins = 'false';

Conclusion

Feel free to edit for yourself, or send a pull-request with any improvements.

This script wouldn't be possible without PlanZero whose script I started with in the very beginning. I highly recommend still checking it out for a bare-bones version.

casperjs-spider's People

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.