
broken-link-checker

License: MIT

Find broken links, missing images, etc within your HTML.

  • 🔍 Complete: Unicode, redirects, compression, basic authentication, absolute/relative/local URLs.
  • ⚡️ Fast: Concurrent, streamed and cached.
  • 🍰 Easy: Convenient defaults and very configurable.

Other features:

  • Support for many HTML elements and attributes; not only <a href> and <img src>.
  • Support for relative URLs with <base href>.
  • WHATWG specifications-compliant HTML and URL parsing.
  • Honor robot exclusions (robots.txt, headers and rel), optionally.
  • Detailed information for reporting and maintenance.
  • URL keyword filtering with simple wildcards.
  • Pause/Resume at any time.

Installation

Node.js >= 14 is required. There are two ways to use it:

Command Line Usage

To install, type this at the command line:

npm install broken-link-checker -g

After that, check out the help for available options:

blc --help

A typical site-wide check might look like:

blc http://yoursite.com -ro
# or
blc path/to/index.html -ro

Note: HTTP proxies are not directly supported. If your network is configured incorrectly with no resolution in sight, you could try using a container with proxy settings.

Programmatic API

To install, type this at the command line:

npm install broken-link-checker

The remainder of this document will assist you in using the API.

Classes

While all classes have been exposed for custom use, the one that you need will most likely be SiteChecker.

HtmlChecker

Scans an HTML document to find broken links. All methods from EventEmitter are available.

const {HtmlChecker} = require('broken-link-checker');

const htmlChecker = new HtmlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots) => {})
  .on('queue', () => {})
  .on('junk', (result) => {})
  .on('link', (result) => {})
  .on('complete', () => {});

htmlChecker.scan(html, baseURL);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .isPaused returns true if the internal link queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the internal link queue, but will not pause any active requests.
  • .resume() will resume the internal link queue.
  • .scan(html, baseURL) parses & scans a single HTML document and returns a Promise. Calling this function while a previous scan is in progress will result in a thrown error. Arguments:
    • html must be either a Stream or a string.
    • baseURL must be a URL. Without this value, links to relative URLs will be given a BLC_INVALID reason for being broken (unless an absolute <base href> is found).

Events

  • 'complete' is emitted after the last result or zero results.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after the HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> robot exclusions.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not). Arguments:
    • result is a Link.
  • 'queue' is emitted when a link is internally queued, dequeued or made active.
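
A minimal end-to-end sketch (the HTML string and base URL here are hypothetical; the result keys follow the "Handling Broken/Excluded Links" section below):

const {HtmlChecker} = require('broken-link-checker');

// Hypothetical document to scan
const html = '<a href="https://example.com/">example</a>';

const htmlChecker = new HtmlChecker({}) // default options
  .on('link', (result) => {
    // result is a Link; see "Handling Broken/Excluded Links" below
    console.log(result.get('isBroken'), result.get('brokenReason'));
  })
  .on('complete', () => console.log('scan finished'));

// scan() returns a Promise; calling it again mid-scan throws
htmlChecker.scan(html, new URL('https://example.com/'));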

HtmlUrlChecker

Scans the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('end', () => {});

htmlUrlChecker.enqueue(pageURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a page from the queue. Returns true on success or false on failure.
  • .enqueue(pageURL, customData) adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • pageURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the page.
  • .has(id) returns true if the queue contains an active or queued page tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link or page) is queued, dequeued or made active.
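
A short sketch of queuing pages with custom data (the page URL and the shape of customData are illustrative assumptions):

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker({})
  .on('link', (result, customData) => {
    if (result.get('isBroken')) {
      console.log(customData.label, result.get('brokenReason'));
    }
  })
  .on('page', (error, pageURL) => {
    // error is null unless the page's HTML could not be retrieved
    if (error) console.error('could not retrieve', pageURL, error);
  })
  .on('end', () => console.log('queue drained'));

// customData may be any value; it is handed back to each event
htmlUrlChecker.enqueue(new URL('https://example.com/'), {label: 'homepage'});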

SiteChecker

Recursively scans (crawls) the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {SiteChecker} = require('broken-link-checker');

const siteChecker = new SiteChecker(options)
  .on('error', (error) => {})
  .on('robots', (robots, customData) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('site', (error, siteURL, customData) => {})
  .on('end', () => {});

siteChecker.enqueue(siteURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a site from the queue. Returns true on success or false on failure.
  • .enqueue(siteURL, customData) adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • siteURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the site.
  • .has(id) returns true if the queue contains an active or queued site tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .numSites returns the total number of sites in the queue.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link, page or site) is queued, dequeued or made active.
  • 'robots' is emitted after a site's robots.txt has been downloaded. Arguments:
    • robots is an instance of robot-directives containing the robots.txt exclusions.
    • customData is whatever was queued.
  • 'site' is emitted after a site's last result, on zero results, or if the initial HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • siteURL is the URL to the current site being crawled.
    • customData is whatever was queued.

Note: the filterLevel option is used for determining which links are recursive.
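
As a minimal sketch, a crawl that tallies broken links (the site URL and the excludeExternalLinks setting are illustrative assumptions, not requirements):

const {SiteChecker} = require('broken-link-checker');

let brokenCount = 0;

const siteChecker = new SiteChecker({excludeExternalLinks: true})
  .on('link', (result) => {
    if (result.get('isBroken')) {
      brokenCount++;
      console.log(result.get('brokenReason'));
    }
  })
  .on('site', (error, siteURL) => {
    if (error) console.error('could not crawl', siteURL, error);
  })
  .on('end', () => console.log(`${brokenCount} broken links found`));

siteChecker.enqueue(new URL('https://example.com/'));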

UrlChecker

Requests each queued URL to determine whether it is broken. All methods from EventEmitter are available.

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker(options)
  .on('error', (error) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('end', () => {});

urlChecker.enqueue(url, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a URL from the queue. Returns true on success or false on failure.
  • .enqueue(url, customData) adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success. Arguments:
    • url must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the URL.
  • .has(id) returns true if the queue contains an active or queued URL tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'junk' is emitted for each skipped/unchecked result, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted for each checked/unskipped result (broken or not). Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL is queued, dequeued or made active.
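
For example, checking a hand-maintained list of URLs (the URLs and the customData shape are hypothetical):

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker({})
  .on('link', (result, customData) => {
    console.log(customData.source, result.get('isBroken'));
  })
  .on('end', () => console.log('all URLs checked'));

for (const url of ['https://example.com/', 'https://example.org/missing']) {
  urlChecker.enqueue(new URL(url), {source: 'my-list'});
}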

Options

cacheMaxAge

Type: Number
Default Value: 3_600_000 (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.

cacheResponses

Type: Boolean
Default Value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.

excludedKeywords

Type: Array<String>
Default value: []
Will not check links that match the keywords and glob patterns within this list. The only wildcards supported are * and !.

This option does not apply to UrlChecker.
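
For example, a hypothetical exclusion list using the * wildcard (the hosts and paths are made up):

const options = {
  excludedKeywords: [
    'https://legacy.example.com/*', // skip an entire host
    '*/archive/*'                   // skip any /archive/ path
  ]
};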

excludeExternalLinks

Type: Boolean
Default value: false
Will not check external links (different protocol and/or host) when true; this includes relative links with a remote <base href>.

This option does not apply to UrlChecker.

excludeInternalLinks

Type: Boolean
Default value: false
Will not check internal links (same protocol and host) when true.

This option does not apply to UrlChecker nor SiteChecker's crawler.

excludeLinksToSamePage

Type: Boolean
Default value: false
Will not check links to the same page, including relative and absolute fragments/hashes. This is only relevant if the cacheResponses option is disabled.

This option does not apply to UrlChecker.

filterLevel

Type: Number
Default value: 1
The tags and attributes that are considered links for checking, split into the following levels:

  • 0: clickable links
  • 1: clickable links, media, frames, meta refreshes
  • 2: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms
  • 3: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms, metadata

Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the tag map. <base href> is not listed because it is not a link, though it is always parsed.

This option does not apply to UrlChecker.

honorRobotExclusions

Type: Boolean
Default value: true
Will not scan pages that search engine crawlers would not follow. Such exclusions may be specified with any of the following:

  • <a rel="nofollow" href="…">
  • <area rel="nofollow" href="…">
  • <meta name="robots" content="noindex,nofollow,…">
  • <meta name="googlebot" content="noindex,nofollow,…">
  • <meta name="robots" content="unavailable_after: …">
  • X-Robots-Tag: noindex,nofollow,…
  • X-Robots-Tag: googlebot: noindex,nofollow,…
  • X-Robots-Tag: otherbot: noindex,nofollow,…
  • X-Robots-Tag: unavailable_after: …
  • robots.txt

This option does not apply to UrlChecker.

includedKeywords

Type: Array<String>
Default value: []
Will only check links that match the keywords and glob patterns within this list, if any. The only wildcard supported is *.

This option does not apply to UrlChecker.

includeLink

Type: Function
Default value: link => true
A synchronous callback that is called after all other filters have been performed. Return true to include link (a Link) in the list of links to be checked, or return false to have it skipped.

This option does not apply to UrlChecker.

includePage

Type: Function
Default value: url => true
A synchronous callback that is called after all other filters have been performed. Return true to include url (a URL) in the list of pages to be crawled, or return false to have it skipped.

This option does not apply to UrlChecker nor HtmlUrlChecker.

maxSockets

Type: Number
Default value: Infinity
The maximum number of links to check at any given time.

maxSocketsPerHost

Type: Number
Default value: 2
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.

rateLimit

Type: Number
Default value: 0
The number of milliseconds to wait before each request.
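
Taken together, maxSockets, maxSocketsPerHost and rateLimit control how aggressively links are checked. A sketch of a deliberately polite configuration (the numbers are illustrative, not recommendations):

const options = {
  maxSockets: 10,       // at most 10 requests in flight overall
  maxSocketsPerHost: 1, // one request at a time per host
  rateLimit: 500        // wait 500 ms before each request
};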

requestMethod

Type: String
Default value: 'head'
The HTTP request method used in checking links. If you experience problems, try using 'get'; however, the retryHeadFail option should have you covered.

retryHeadCodes

Type: Array<Number>
Default value: [405]
The list of HTTP status codes for the retryHeadFail option to reference.

retryHeadFail

Type: Boolean
Default value: true
Some servers do not respond correctly to a 'head' request method. When true, a link resulting in an HTTP status code listed within the retryHeadCodes option will be re-requested using a 'get' method before deciding that it is broken. This is only relevant if the requestMethod option is set to 'head'.
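
A sketch combining these options; adding 403 to the retry list is an assumption for servers that reject HEAD, not a documented default:

const options = {
  requestMethod: 'head',     // fast checks without downloading bodies
  retryHeadFail: true,       // re-request with 'get' on the codes below
  retryHeadCodes: [403, 405] // 405 is the default; 403 is an added assumption
};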

userAgent

Type: String
Default value: 'broken-link-checker/0.8.0 Node.js/14.16.0 (OS X; x64)' (or similar)
The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.

Handling Broken/Excluded Links

A broken link will have an isBroken value of true and a reason code defined in brokenReason. A link that was not checked (emitted as 'junk') will have a wasExcluded value of true, a reason code defined in excludedReason, and an isBroken value of null.

if (link.get('isBroken')) {
  console.log(link.get('brokenReason'));
  //-> HTTP_406
} else if (link.get('wasExcluded')) {
  console.log(link.get('excludedReason'));
  //-> BLC_ROBOTS
}

Additionally, more descriptive messages are available for each reason code:

const {reasons} = require('broken-link-checker');

console.log(reasons.BLC_ROBOTS);       //-> Robots exclusion
console.log(reasons.ERRNO_ECONNRESET); //-> connection reset by peer (ECONNRESET)
console.log(reasons.HTTP_404);         //-> Not Found (404)

// List all
console.log(reasons);

Putting it all together:

if (link.get('isBroken')) {
  console.log(reasons[link.get('brokenReason')]);
} else if (link.get('wasExcluded')) {
  console.log(reasons[link.get('excludedReason')]);
}

Finally, it is important to analyze links excluded with the BLC_UNSUPPORTED reason, as it's possible for them to be broken.
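
One way to surface them, as a minimal sketch attached to any checker's 'junk' event (htmlChecker is the instance from the earlier example):

htmlChecker.on('junk', (result) => {
  if (result.get('excludedReason') === 'BLC_UNSUPPORTED') {
    // These links were not checked at all; verify them by other means
    console.log('verify manually:', result);
  }
});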

Roadmap Features

  • 'info' event with messaging such as 'Site does not support HTTP HEAD method' (regarding retryHeadFail option)
  • add cheerio support by using parse5's htmlparser2 tree adaptor?
  • load sitemap.xml at start of each SiteChecker site (since cache can expire) to possibly check pages that were not linked to, removing from list as discovered links are checked
  • change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200
  • abort download of body when options.retryHeadFail===true
  • option to retry broken links a number of times (default=0)
  • option to scrape response.body for erroneous sounding text (using fathom?), since an error page could be presented but still have code 200
  • option to detect parked domain (302 with no redirect?)
  • option to check broken link on archive.org for archived version (using this lib)
  • option to run HtmlUrlChecker checks on page load (using jsdom) to include links added with JavaScript?
  • option to check if hashes exist in target URL document?
  • option to parse Markdown in HtmlChecker for links
  • option to check plain text URLs
  • add throttle profiles (0–9, -1 for "custom") for easy configuring
  • check ftp:, sftp: (for downloadable files)
  • check mailto:, news:, nntp:, telnet:?
  • check that data URLs are valid (with valid-data-url)?
  • supply CORS error for file:// links on sites with a different protocol
  • create an example with http://astexplorer.net
  • use debug
  • use bunyan with JSON output for CLI
  • store request object/headers (or just auth) in Link?
  • supply basic auth for "page" events?
  • add option for URLCache normalization profiles

broken-link-checker's People

Contributors

andyli, nhoizey, scottnonnenberg, staabm, stefanschinkel, stevenvachon


broken-link-checker's Issues

Unable To Link Check Dockerized WordPress Site

I think this is an edge case, but since it happened to me I would like to note it here. I'll try to dive in and, who knows, maybe submit a PR.

Steps To Reproduce
Run blc http://devpatch.com:3000 --filter-level 3 -ro

More Info
I am running a dockerized version of a wordpress site. Testing both locally and the dev instance hosted on devpatch, the broken link checker never fetches or checks a page. Looking at the logs I see the request from BLC but that is it. Below is a screenshot. Left is log, Right is Console output.

I verified I could run BLC without issue on a static, non-Docker site hosted locally at the same port.

[screenshot omitted]

Global install failed Error ENOENT

Hi,
I got this error while installing your app.

Error: ENOENT, chmod 'C:\Users\ADA-LT\AppData\Roaming\npm\node_modules\broken-link-checker\bin\broken-link-checker'

npm ERR! System Windows_NT 6.1.7601
npm ERR! command "C:\Program Files (x86)\nodejs\node.exe" "C:\Program Files (x86)\nodejs\node_modules\npm\bin\npm-cli.js" "install" "broken-link-checker" "-g"
npm ERR! cwd C:{working_dir}
npm ERR! node -v v0.10.13
npm ERR! npm -v 1.3.2
npm ERR! path C:\Users{user-login}\AppData\Roaming\npm\node_modules\broken-link-checker\bin\broken-link-checker
npm ERR! code ENOENT
npm ERR! errno 34

Any idea what it means and how to fix it?
Frankly, I am eager to try it. :)

Thanks guys for all this work. Seems promising.

Error via API on checking

I am getting the below error when running the code via the API.

Unhandled rejection Error: getaddrinfo EAI_AGAIN github.com:443
    at Object.exports._errnoException (util.js:874:11)
    at errnoException (dns.js:31:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:77:26)

Looks like some proxy issue. How can I overcome it?

Requests would be hanging since there is no configuration for response timeout

Hi Steven,

I am using your awesome module to develop some API services. But recently my API hangs when it hits target links that cannot respond in time. I searched the API documentation you provide, and no timeout value can be configured.

I was debugging your source code and noticed that you use node-bhttp as a dependency. I checked the request options you were using:

{
  discardResponse: true,
  headers: {"user-agent": options.userAgent},
  method: retry !== 405 ? options.requestMethod : "get"
}

There is no timeout set. I've checked the bhttp doc - https://github.com/joepie91/node-bhttp

Seems there is an advanced option that can be used:

responseTimeout: The timeout, in milliseconds, after which the request should be considered to have failed if no response is received yet. Note that this measures from the start of the request to the start of the response, and is not a connection timeout. If a timeout occurs, a ResponseTimeoutError will be thrown asynchronously (see error documentation below).

Also there are some more references -

bhttp.ConnectionTimeoutError

The connection timed out.

The connection timeout is defined by the operating system, and cannot currently be overridden.

bhttp.ResponseTimeoutError

The response timed out.

The response timeout can be specified using the responseTimeout option, and it is measured from the start of the request to the start of the response. If no response is received within the responseTimeout, a ResponseTimeoutError will be thrown asynchronously, and the request will be aborted.

You should not set a responseTimeout for requests that involve large file uploads! Because a response can only be received after the request has completed, any file/stream upload that takes longer than the responseTimeout, will result in a ResponseTimeoutError.

Could you please help with this? I just want my API not to hang, even when it encounters some strange links.

Thank you very much!

Different API for HtmlChecker

Hey folks,

Was there a particular reason why HtmlChecker has a different API than the rest of the checkers? By that I mean scan vs enqueue that allows custom data. I'm asking because I was trying to do HTML checking but I do need the option to pass custom data.

Cheers

Add support for self-signed certificates

Was trying to scan a semi-private site which is using a self-signed certificate.
It would be awesome to have a flag (e.g. --no-check-certificate) to circumvent this:
Error: unable to verify the first certificate

Cannot run on Linux

Why can I still not run this on Linux? node v0.10.36, npm 1.3.6, object-assign 4.1.0, promise 3.2.1

302 redirect reported as 404 broken link

I checked the URL http://store.meizu.com/ with HtmlUrlChecker; it reports the link http://ordercenter.meizu.com/list/index.html as broken with an HTTP_404 brokenReason, but it's a redirect link, not a 404.

Here is the result parameter from the HtmlUrlChecker 'link' callback:

{ url:
   { original: 'http://ordercenter.meizu.com/list/index.html',
     resolved: 'http://ordercenter.meizu.com/list/index.html',
     redirected: 'https://login.flyme.cn/vCodeLogin?useruri=http%3A%2F%2Fstore.meizu.com%2Fmember%2Flogin.htm?useruri=http://ordercenter.meizu.com/list/index.html&sid=unionlogin&service=&autodirct=true' },
  base:
   { original: 'http://store.meizu.com/',
     resolved: 'http://store.meizu.com/' },
  html:
   { index: 11,
     offsetIndex: 9,
     location: { line: 51, col: 44, startOffset: 2842, endOffset: 2893 },
     selector: 'html > body > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > ul:nth-child(1) > li:nth-child(2) > a:nth-child(1)',
     tagName: 'a',
     attrName: 'href',
     attrs:
      { class: 'topbar-link',
        href: 'http://ordercenter.meizu.com/list/index.html',
        target: '_blank' },
     text: '我的订单',
     tag: '<a class="topbar-link" href="http://ordercenter.meizu.com/list/index.html" target="_blank">' },
  http:
   { cached: false,
     response:
      { headers: [Object],
        httpVersion: '1.1',
        statusCode: 404,
        statusMessage: 'Not Found',
        url: 'https://login.flyme.cn/vCodeLogin?useruri=http%3A%2F%2Fstore.meizu.com%2Fmember%2Flogin.htm?useruri=http://ordercenter.meizu.com/list/index.html&sid=unionlogin&service=&autodirct=true',
        redirects: [Object] } },
  broken: true,
  internal: false,
  samePage: false,
  excluded: false,
  brokenReason: 'HTTP_404',
  excludedReason: null }

Error: Expected type "text/html" but got "image/jpeg"

> blc http://tw.example.com/ -ro
Getting links from: http://tw.example.com/
├───OK─── http://tw.example.com/location.png
├───OK─── ...
...
Finished! 69 links found. 16 excluded. 0 broken.
...
Getting links from: http://tw.example.com/location.png
Error: Expected type "text/html" but got "image/jpeg"
...
Finished! 219 links found. 114 excluded. 0 broken.
Elapsed time: 1 minute, 15 seconds

What does this mean? And how do I fix it?

P.S. I put the images on a GitHub Pages site.

Fails when response.headers['content-type'] is undefined

On line 36 of lib/internal/getHtmlFromUrl.js, some filetypes (I saw it with .woff2 font files) don't come back with a content-type header, so the if conditional (indexOf) on this line fails. I worked around it as below, but it's a quick fix and I'm not positive if that's adequate.

Original code:
if (response.headers["content-type"].indexOf("text/html") === 0)

I modified it to:
if (response.headers["content-type"] && response.headers["content-type"].indexOf("text/html") === 0)

Hope this is helpful, let me know if you need anything.

Check local files

I'm not able to check local files using the file:// protocol. I get:

ReferenceError: protocol is not defined
Unhandled rejection Error: undefined

Broken on Travis

Running the blc binary breaks on my Travis build (code available here) with the following error:

TypeError: Object function Object() { [native code] } has no method 'assign'
    at parseOptions (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
    at new SiteChecker (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
    at run (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/cli.js:484:14)
    at cli.input (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/cli.js:147:3)
    at Object.<anonymous> (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/bin/blc:3:31)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:945:3

Honor Proxy environment variables

Hey,

in my company we are using a proxy so we have some environment variables set to make e.g. npm work correctly. Therefore we set HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables accordingly.

It would be great if you would support basic proxy usage with these environment variables.

Feature request: mixed content (HTTPS) warnings

If your site is using HTTPS, if it embeds images with a non-HTTPS protocol, the browser will display a warning in the URL bar. If it embeds scripts or iframe with a non-HTTPS protocol, the browser will refuse to load that content altogether.

It would be awesome if this tool would detect and report an issue like that.

Feature request: check CSP policies

If you have CSP policies, it's possible that some of the pages on your website are embedding content that is banned in the CSP policy.

It would be awesome if this tool would detect and report that.

Javascript in links

Currently a link like <a href="javascript:void(0);"> will resolve as null, which means it is reported as a broken link. When a link's href begins with a scripting scheme, it should be treated as a success, or ignored and not requested.

proxy

How do I use blc behind a proxy?
All external links are reported as broken:

Getting links from: http://localhost:8080/it/actus/articleB.html
├─BROKEN─ http://placehold.it/900x300 (HTTP_undefined)

Thank
Ami44

HTTP_404 for correct url, possible timeout issue

I'm getting HTTP_404 for this url https://nationalcareersservice.direct.gov.uk/job-profiles/home
Full response:

{ url: 
   { original: 'https://nationalcareersservice.direct.gov.uk/job-profiles/home',
     resolved: URL {},
     rebased: URL {},
     redirected: null },
  base: { resolved: null, rebased: null },
  html: 
   { index: null,
     offsetIndex: null,
     location: null,
     selector: null,
     tagName: null,
     attrName: null,
     attrs: null,
     text: null,
     tag: null,
     base: null },
  http: 
   { cached: false,
     response: 
      { headers: [Object],
        status: 404,
        statusText: 'Not Found',
        url: URL {},
        redirects: [] } },
  broken: true,
  internal: null,
  samePage: null,
  excluded: null,
  brokenReason: 'HTTP_404',
  excludedReason: null }

I've used bhttp directly and it reports 200 back.

Curl:

$ time curl https://nationalcareersservice.direct.gov.uk/job-profiles/home -v > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 13.81.8.21...
* Connected to nationalcareersservice.direct.gov.uk (13.81.8.21) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=nationalcareersservice.direct.gov.uk,O=Skills Funding Agency,OU=IM Services,L=Coventry,ST=West Midlands,C=GB
* 	start date: Oct 22 09:56:02 2016 GMT
* 	expire date: Oct 23 09:56:02 2017 GMT
* 	common name: nationalcareersservice.direct.gov.uk
* 	issuer: CN=GlobalSign Organization Validation CA - SHA256 - G2,O=GlobalSign nv-sa,C=BE
> GET /job-profiles/home HTTP/1.1
> User-Agent: curl/7.40.0
> Host: nationalcareersservice.direct.gov.uk
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Pragma: no-cache
< Content-Type: text/html; charset=utf-8
< Expires: -1
< Server: Microsoft-IIS/8.5
< X-Frame-Options: SAMEORIGIN
< Date: Fri, 24 Feb 2017 17:00:21 GMT
< Content-Length: 21195
< X-FRAME-OPTIONS: SAMEORIGIN
<
{ [7932 bytes data]
100 21195  100 21195    0     0  11437      0  0:00:01  0:00:01 --:--:-- 11432
* Connection #0 to host nationalcareersservice.direct.gov.uk left intact

real	0m1.858s
user	0m0.060s
sys	0m0.020s

It takes around 2s to get the first byte back. Could it be a timeout problem?

npm ERR! enoent ENOENT, chmod '/usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker'

$ npm install -g broken-link-checker
npm ERR! Darwin 14.3.0
npm ERR! argv "node" "/usr/local/bin/npm" "install" "-g" "broken-link-checker"
npm ERR! node v0.12.2
npm ERR! npm  v2.8.3
npm ERR! path /usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker
npm ERR! code ENOENT
npm ERR! errno -2

npm ERR! enoent ENOENT, chmod '/usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent 

wish: also check url in css

Hi,

Nice tool!

I have this in my style.css:
a[href$='.pdf']:after { content: url("graphics/pdficon.gif"); padding: 0 3px 0 0; }
It would be nice to check url's in css.

regards,

Peter

TypeError: Cannot read property '__parsed' of null?

This is a problem I found in the demo you give. Give some help, thks!

var blc = require("broken-link-checker");

var html = '<a href="https://google.com">absolute link</a>';
html += '<a href="/path/to/resource.html">relative link</a>';
html += '<img src="http://fakeurl.com/image.png" alt="missing image"/>';

var htmlChecker = new blc.HtmlChecker(null, {
    link: function(result) {
        console.log(result.html.index, result.broken, result.html.text, result.url.resolved);
        //-> 0 false "absolute link" "https://google.com/"
        //-> 1 false "relative link" "https://mywebsite.com/path/to/resource.html"
        //-> 2 true null "http://fakeurl.com/image.png"
    },
    complete: function() {
        console.log("done checking!");
    }
});

htmlChecker.scan(html, "https://mywebsite.com");

feature request: adjustable recursive level

Hi Steve,
It would be good if broken-link-checker had an option to specify the recursion level, ranging from 0 to some number. Currently the app seems to run into infinite recursion; some websites are super big, and checking links at the 100th recursion level rarely makes sense because a user almost never gets that deep into a site. I used the recursive option on my website (which is huge) and it has been running for almost 3 days without finishing.

Thanks,
Jeb

Issue when using on windows

When trying to install on Windows I get the following error; it also fails when used as a bower dependency:

npm ERR! Error: ENOENT, chmod 'C:\node_modules\broken-link-checker\bin\broken-li
nk-checker'
npm ERR! If you need help, you may report this entire log,
npm ERR! including the npm and node versions, at:
npm ERR! http://github.com/npm/npm/issues

npm ERR! System Windows_NT 6.2.9200
npm ERR! command "C:\Program Files\nodejs\node.exe" "C:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" "install" "broken-link-checker"
npm ERR! cwd C:
npm ERR! node -v v0.10.36
npm ERR! npm -v 1.4.28
npm ERR! path C:\node_modules\broken-link-checker\bin\broken-link-checker
npm ERR! code ENOENT
npm ERR! errno 34
npm ERR! not ok code 0

Silent crash on node 6

Thanks for this package. So useful! :0)

I downloaded it and tried it on my blog today, and it was giving me just four lines of output. Confused, I started looking at the code, and realized that it finished in the middle of enqueuing its first set of links. With a new try/catch in HtmlChecker/enqueueLink I found that the process was indeed silently crashing:

TypeError: source.hasOwnProperty is not a function
    at cloneObject (lib/internal/linkObj.js:239:14)
    at cloneObject (lib/internal/linkObj.js:245:18)
    at cloneObject (lib/internal/linkObj.js:245:18)
    at Function.linkObj.resolve (lib/internal/linkObj.js:168:64)
    at enqueueLink (lib/public/HtmlChecker.js:158:10)
    at lib/public/HtmlChecker.js:110:5
    at process._tickCallback (internal/process/next_tick.js:103:7)

On node 6.1.0 it throws this error, and on node 4.4.3 it works properly. Seems that there are two bugs here - one about the hasOwnProperty issue, and the other about silently crashing!

Possible EventEmitter memory leak detected on Redirects

When testing the links of some websites like the nytimes.com you will notice a warning message :
(node:4687) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit

This is due, I think, to some links having a great number of redirects. What do you think of adding a maxRedirects option (like the request module has, for instance)?

My set up:

const check = new LinkChecker.HtmlChecker({excludeExternalLinks: true, filterLevel: 0}, {
  link: (result) => {
    links.goodLinks++;
    if (result.broken)
      links.brokenLinks++;
  },
  complete: () => {
    return wrapper(links.goodLinks, links.brokenLinks);
  }
});
check.scan(this.resource.html(), this.url, links);

Roadmap: Markdown

option to parse Markdown in HtmlChecker for links

You could use a markdown parser like marked to work with Markdown files.

var marked = require('marked');

htmlChecker.scan(marked('I am using __markdown__.'));

Link to zero-byte HTML file breaks broken-link-checker

I have a link to a zero-byte HTML file in one of my pages. As soon as blc tries to access that empty HTML page, an error is thrown and blc is aborted:

$ node node_modules/broken-link-checker/bin/blc -fvr http://beta.grossweber.com/blc
Getting links from: http://beta.grossweber.com/blc
└───OK─── http://beta.grossweber.com/blc/empty.html
Finished! 1 links found. 0 broken.

Getting links from: http://beta.grossweber.com/blc/empty.html
Error: Unhandled Rejection. TypeError: Cannot read property 'length' of undefined

Issues on fs object in Browerify

When used along with Browserify, the filesystem-related methods do not work.

On the below lines:

errorCss = fs.readFileSync(__dirname + '/static/error.css', 'utf8');
errorHtml = fs.readFileSync(__dirname + '/static/error.html', 'utf8');

We get the below error:

Uncaught TypeError: fs.readFileSync is not a function

Is there a recursive feature?

Is there a feature to crawl a website looking for broken links? If not, could this be mentioned in the documentation?

Object() has no method 'assign'

Did a global install of broken-link-checker
npm install -g broken-link-checker

Tried to run it and got this error:

/home/myhome/Projects/SciServer/Dev.das $blc http://www.sdss.org -ro

/home/myhome/.local/lib/node_modules/broken-link-checker/lib/internal/parseOptions.js:42
                options = Object.assign({}, defaultOptions, options);
                                 ^
TypeError: Object function Object() { [native code] } has no method 'assign'
    at parseOptions (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
    at new SiteChecker (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
    at run (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/cli.js:467:14)
    at cli.input (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/cli.js:144:3)
    at Object.<anonymous> (/home/myhome/.local/lib/node_modules/broken-link-checker/bin/blc:3:31)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:906:3

Then I looked at the readme:

Node.js >= 0.10 is required; < 4.0 will need Promise and Object.assign polyfills.

I have node 0.10.33. Not sure what module "< 4.0" refers to, but I also tried
npm install -g promise
and
npm install -g object.assign
but got the same error.
Did I not install the right object.assign module?

NPM Install does not work

It appears that NPM is using your .gitignore file in place of .npmignore
For more information see the SO thread: http://stackoverflow.com/questions/17990647/npm-install-errors-with-error-enoent-chmod

I am not able to load your module with:
npm install broken-link-checker

However as the thread mentions I am able to load it with:
npm install broken-link-checker --no-bin-links

Which leads me to believe that if you check in a blank .npmignore file this issue will be resolved.

Feature: embed local http server to check static files

Would it be possible to add the capability to run a local HTTP server to serve static files and then check them for broken links?

A command line like blc _site -ro could start a local server, serving the files in _site and then start the analysis. Maybe something along those lines:

var finalhandler = require('finalhandler');
var http = require('http');
var serveStatic = require('serve-static');

var serve = serveStatic(<directory>);
var server = http.createServer(function onRequest (req, res) {
  serve(req, res, finalhandler(req, res))
})

server.listen(9001, function(){
    console.log('Server running on 9001...');
    // Call broken-link-checker on URL http://localhost:9001
});

Or maybe there is a way, in one command, to start a local server, wait for it to be up, and then start broken-link-checker?
The problem with node http-server _site -p 9001 & blc http://0.0.0.0:9001 -ro is that broken-link-checker is executed before the web server finishes starting up, and therefore produces an error.

[QUESTION] Excluded links during blc execution

Hi,

Not an issue more of a question that I cannot find the answer to.

I have been spiking out BLC for a project that I am working on. However, when I run it via the command line, many links seem to be ignored.

I am executing the tests like:
NODE_TLS_REJECT_UNAUTHORIZED=0 blc https://foo.com -ro

The result is:
Finished! 16516 links found. 15854 excluded. 50 broken.

Is there a way to force blc to check all links?

Thanks,
Ian

Exit code 1 if broken links are found

As of now, the tool exits with code 0 in all scenarios.
I know you have a recent commit fixing this, but the package published to npm does not include that fix yet.

Have you updated the NPM package?
Or is it a problem on my side?

Something wrong while checking links that start with a double slash

Hi,

Test link here - http://stackoverflow.com/questions/12507021/best-configuration-of-c3p0

Issue 1 - I ran this webpage through HtmlUrlChecker and got 61 results. Then I reran it and got 18 results. This is so weird, and I cannot figure out how this can happen.

Issue 2 - Actually, the results for this page don't make any sense. Take the following output:
{ "originalUrl": "//webapps.stackexchange.com", "resolvedUrl": "http://webapps.stackexchange.com/questions/12507021/", "brokenReason": "HTTP_404" }

I checked the page, and the <a> tag with href '//webapps.stackexchange.com' really does exist. But how can it be resolved as "http://webapps.stackexchange.com/questions/12507021/", with the path from my original link appended?

I looked into the source code; the resolving operation is done by another of your modules, "urlobj". When function resolveUrl(from, to, options) is invoked, after

var pathname = joinDirs(urlobj.extra.directory, urlobj.extra.directoryLeadingSlash);

The original path would be appended after the resolved url. Could you please look into this?

Best Regards,

Jet

Problems with gzipped sites?

I'm having issues making this work with sites on Amazon S3 with gzipped content. I get a "Finished! 0 links found." message back. Other sites without gzip work perfectly though.

Is this a known issue and are there plans to support sites that use gzip?

Feature Request: Broken Link URL

It would be great if there were a way to get the current link being checked.
I get some errors like Invalid but do not know which link is broken.

var blc = require('broken-link-checker');

var htmlChecker = new blc.HtmlChecker({}, {
  link: function(result) {
    if (result.broken) {
      console.log(blc[result.brokenReason]);
    }
  }
});

Output:

Invalid
Invalid
Invalid
Invalid
Invalid
Invalid

Why do I get this error on Linux?

/broken-link-checker/lib/internal/parseOptions.js:42
options = Object.assign({}, defaultOptions, options);
^
TypeError: Object function Object() { [native code] } has no method 'assign'
at parseOptions (/home/avatar/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
at new SiteChecker (/home/avatar/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
at run (/home/avatar/node_modules/broken-link-checker/lib/cli.js:467:14)
at cli.input (/home/avatar/node_modules/broken-link-checker/lib/cli.js:144:3)
at Object.<anonymous> (/home/avatar/node_modules/broken-link-checker/bin/blc:3:31)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:929:3

HTTP_403 while curl gives HTTP_200

I have encountered a link which is considered broken by blc but opens well in curl or browser.

Here it is:

blc https://www.nginx.com

CURL works fine:

$ curl -I https://www.nginx.com
HTTP/1.1 200 OK
Date: Mon, 02 Jan 2017 06:52:15 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Pingback: https://www.nginx.com/xmlrpc.php
Link: <https://www.nginx.com/wp-json/>; rel="https://api.w.org/"
Link: <https://www.nginx.com/>; rel=shortlink
Link: <https://www.nginx.com/wp-json>; rel="https://github.com/WP-API/WP-API"
X-User-Agent: standard
X-Cache-Config: 0 0
Vary: Accept-Encoding, User-Agent
X-Cache-Status: MISS
Server: nginx
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-Sucuri-ID: 14010

but BLC does not:

$ blc https://www.nginx.com
Getting links from: https://www.nginx.com/
Error: HTML could not be retrieved

User agent does not help:

$ blc --input https://www.nginx.com --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.3 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.3"
Getting links from: https://www.nginx.com/
Error: HTML could not be retrieved

What is the problem?
Is it specifically an NGINX bug, or is a larger set of websites affected?
