
broken-link-checker

License: MIT

Find broken links, missing images, etc within your HTML.

  • 🔍 Complete: Unicode, redirects, compression, basic authentication, absolute/relative/local URLs.
  • ⚡️ Fast: Concurrent, streamed and cached.
  • 🍰 Easy: Convenient defaults and very configurable.

Other features:

  • Support for many HTML elements and attributes; not only <a href> and <img src>.
  • Support for relative URLs with <base href>.
  • WHATWG specifications-compliant HTML and URL parsing.
  • Honor robot exclusions (robots.txt, headers and rel), optionally.
  • Detailed information for reporting and maintenance.
  • URL keyword filtering with simple wildcards.
  • Pause/Resume at any time.

Installation

Node.js >= 14 is required. There are two ways to use it:

Command Line Usage

To install, type this at the command line:

npm install broken-link-checker -g

After that, check out the help for available options:

blc --help

A typical site-wide check might look like:

blc http://yoursite.com -ro
# or
blc path/to/index.html -ro

Note: HTTP proxies are not directly supported. If your network is configured incorrectly with no resolution in sight, you could try using a container with proxy settings.

Programmatic API

To install, type this at the command line:

npm install broken-link-checker

The remainder of this document will assist you in using the API.

Classes

While all classes have been exposed for custom use, the one that you need will most likely be SiteChecker.

HtmlChecker

Scans an HTML document to find broken links. All methods from EventEmitter are available.

const {HtmlChecker} = require('broken-link-checker');

const htmlChecker = new HtmlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots) => {})
  .on('queue', () => {})
  .on('junk', (result) => {})
  .on('link', (result) => {})
  .on('complete', () => {});

htmlChecker.scan(html, baseURL);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .isPaused returns true if the internal link queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the internal link queue, but will not pause any active requests.
  • .resume() will resume the internal link queue.
  • .scan(html, baseURL) parses & scans a single HTML document and returns a Promise. Calling this function while a previous scan is in progress will result in a thrown error. Arguments:
    • html must be either a Stream or a string.
    • baseURL must be a URL. Without this value, links to relative URLs will be given a BLC_INVALID reason for being broken (unless an absolute <base href> is found).

Events

  • 'complete' is emitted after the last result or zero results.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after the HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> robot exclusions.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not). Arguments:
    • result is a Link.
  • 'queue' is emitted when a link is internally queued, dequeued or made active.
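
A minimal end-to-end sketch (the HTML string and base URL here are hypothetical; the result keys follow the "Handling Broken/Excluded Links" section below):

const {HtmlChecker} = require('broken-link-checker');

// Hypothetical document to scan
const html = '<a href="https://example.com/">example</a>';

const htmlChecker = new HtmlChecker({}) // default options
  .on('link', (result) => {
    // result is a Link; see "Handling Broken/Excluded Links" below
    console.log(result.get('isBroken'), result.get('brokenReason'));
  })
  .on('complete', () => console.log('scan finished'));

// scan() returns a Promise; calling it again mid-scan throws
htmlChecker.scan(html, new URL('https://example.com/'));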

HtmlUrlChecker

Scans the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('end', () => {});

htmlUrlChecker.enqueue(pageURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a page from the queue. Returns true on success or false on failure.
  • .enqueue(pageURL, customData) adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • pageURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the page.
  • .has(id) returns true if the queue contains an active or queued page tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link or page) is queued, dequeued or made active.
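
A short sketch of queuing pages with custom data (the page URL and the shape of customData are illustrative assumptions):

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker({})
  .on('link', (result, customData) => {
    if (result.get('isBroken')) {
      console.log(customData.label, result.get('brokenReason'));
    }
  })
  .on('page', (error, pageURL) => {
    // error is null unless the page's HTML could not be retrieved
    if (error) console.error('could not retrieve', pageURL, error);
  })
  .on('end', () => console.log('queue drained'));

// customData may be any value; it is handed back to each event
htmlUrlChecker.enqueue(new URL('https://example.com/'), {label: 'homepage'});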

SiteChecker

Recursively scans (crawls) the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {SiteChecker} = require('broken-link-checker');

const siteChecker = new SiteChecker(options)
  .on('error', (error) => {})
  .on('robots', (robots, customData) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('site', (error, siteURL, customData) => {})
  .on('end', () => {});

siteChecker.enqueue(siteURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a site from the queue. Returns true on success or false on failure.
  • .enqueue(siteURL, customData) adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • siteURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the site.
  • .has(id) returns true if the queue contains an active or queued site tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .numSites returns the total number of sites in the queue.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link, page or site) is queued, dequeued or made active.
  • 'robots' is emitted after a site's robots.txt has been downloaded. Arguments:
    • robots is an instance of robot-directives containing the robots.txt exclusions.
    • customData is whatever was queued.
  • 'site' is emitted after a site's last result, on zero results, or if the initial HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • siteURL is the URL to the current site being crawled.
    • customData is whatever was queued.

Note: the filterLevel option is used for determining which links are recursive.
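
As a minimal sketch, a crawl that tallies broken links (the site URL and the excludeExternalLinks setting are illustrative assumptions, not requirements):

const {SiteChecker} = require('broken-link-checker');

let brokenCount = 0;

const siteChecker = new SiteChecker({excludeExternalLinks: true})
  .on('link', (result) => {
    if (result.get('isBroken')) {
      brokenCount++;
      console.log(result.get('brokenReason'));
    }
  })
  .on('site', (error, siteURL) => {
    if (error) console.error('could not crawl', siteURL, error);
  })
  .on('end', () => console.log(`${brokenCount} broken links found`));

siteChecker.enqueue(new URL('https://example.com/'));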

UrlChecker

Requests each queued URL to determine whether it is broken. All methods from EventEmitter are available.

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker(options)
  .on('error', (error) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('end', () => {});

urlChecker.enqueue(url, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a URL from the queue. Returns true on success or false on failure.
  • .enqueue(url, customData) adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success. Arguments:
    • url must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the URL.
  • .has(id) returns true if the queue contains an active or queued URL tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'junk' is emitted for each skipped/unchecked result, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted for each checked/unskipped result (broken or not). Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL is queued, dequeued or made active.
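
For example, checking a hand-maintained list of URLs (the URLs and the customData shape are hypothetical):

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker({})
  .on('link', (result, customData) => {
    console.log(customData.source, result.get('isBroken'));
  })
  .on('end', () => console.log('all URLs checked'));

for (const url of ['https://example.com/', 'https://example.org/missing']) {
  urlChecker.enqueue(new URL(url), {source: 'my-list'});
}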

Options

cacheMaxAge

Type: Number
Default Value: 3_600_000 (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.

cacheResponses

Type: Boolean
Default Value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.

excludedKeywords

Type: Array<String>
Default value: []
Will not check links that match the keywords and glob patterns within this list. The only wildcards supported are * and !.

This option does not apply to UrlChecker.
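
For example, a hypothetical exclusion list using the * wildcard (the hosts and paths are made up):

const options = {
  excludedKeywords: [
    'https://legacy.example.com/*', // skip an entire host
    '*/archive/*'                   // skip any /archive/ path
  ]
};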

excludeExternalLinks

Type: Boolean
Default value: false
Will not check external links (different protocol and/or host) when true; this includes relative links with a remote <base href>.

This option does not apply to UrlChecker.

excludeInternalLinks

Type: Boolean
Default value: false
Will not check internal links (same protocol and host) when true.

This option does not apply to UrlChecker nor SiteChecker's crawler.

excludeLinksToSamePage

Type: Boolean
Default value: false
Will not check links to the same page, including relative and absolute fragments/hashes. This is only relevant if the cacheResponses option is disabled.

This option does not apply to UrlChecker.

filterLevel

Type: Number
Default value: 1
The tags and attributes that are considered links for checking, split into the following levels:

  • 0: clickable links
  • 1: clickable links, media, frames, meta refreshes
  • 2: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms
  • 3: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms, metadata

Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the tag map. <base href> is not listed because it is not a link, though it is always parsed.

This option does not apply to UrlChecker.

honorRobotExclusions

Type: Boolean
Default value: true
Will not scan pages that search engine crawlers would not follow. Such exclusions may be specified with any of the following:

  • <a rel="nofollow" href="…">
  • <area rel="nofollow" href="…">
  • <meta name="robots" content="noindex,nofollow,…">
  • <meta name="googlebot" content="noindex,nofollow,…">
  • <meta name="robots" content="unavailable_after: …">
  • X-Robots-Tag: noindex,nofollow,…
  • X-Robots-Tag: googlebot: noindex,nofollow,…
  • X-Robots-Tag: otherbot: noindex,nofollow,…
  • X-Robots-Tag: unavailable_after: …
  • robots.txt

This option does not apply to UrlChecker.

includedKeywords

Type: Array<String>
Default value: []
Will only check links that match the keywords and glob patterns within this list, if any. The only wildcard supported is *.

This option does not apply to UrlChecker.

includeLink

Type: Function
Default value: link => true
A synchronous callback that is called after all other filters have been performed. Return true to include link (a Link) in the list of links to be checked, or return false to have it skipped.

This option does not apply to UrlChecker.

includePage

Type: Function
Default value: url => true
A synchronous callback that is called after all other filters have been performed. Return true to include url (a URL) in the list of pages to be crawled, or return false to have it skipped.

This option does not apply to UrlChecker nor HtmlUrlChecker.

maxSockets

Type: Number
Default value: Infinity
The maximum number of links to check at any given time.

maxSocketsPerHost

Type: Number
Default value: 2
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.

rateLimit

Type: Number
Default value: 0
The number of milliseconds to wait before each request.
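
Taken together, maxSockets, maxSocketsPerHost and rateLimit control how aggressively links are checked. A sketch of a deliberately polite configuration (the numbers are illustrative, not recommendations):

const options = {
  maxSockets: 10,       // at most 10 requests in flight overall
  maxSocketsPerHost: 1, // one request at a time per host
  rateLimit: 500        // wait 500 ms before each request
};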

requestMethod

Type: String
Default value: 'head'
The HTTP request method used in checking links. If you experience problems, try using 'get'; however, the retryHeadFail option should have you covered.

retryHeadCodes

Type: Array<Number>
Default value: [405]
The list of HTTP status codes for the retryHeadFail option to reference.

retryHeadFail

Type: Boolean
Default value: true
Some servers do not respond correctly to a 'head' request method. When true, a link resulting in an HTTP status code listed within the retryHeadCodes option will be re-requested using a 'get' method before deciding that it is broken. This is only relevant if the requestMethod option is set to 'head'.
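
A sketch combining these options; adding 403 to the retry list is an assumption for servers that reject HEAD, not a documented default:

const options = {
  requestMethod: 'head',     // fast checks without downloading bodies
  retryHeadFail: true,       // re-request with 'get' on the codes below
  retryHeadCodes: [403, 405] // 405 is the default; 403 is an added assumption
};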

userAgent

Type: String
Default value: 'broken-link-checker/0.8.0 Node.js/14.16.0 (OS X; x64)' (or similar)
The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.

Handling Broken/Excluded Links

A broken link will have an isBroken value of true and a reason code defined in brokenReason. A link that was not checked (emitted as 'junk') will have a wasExcluded value of true, a reason code defined in excludedReason, and an isBroken value of null.

if (link.get('isBroken')) {
  console.log(link.get('brokenReason'));
  //-> HTTP_406
} else if (link.get('wasExcluded')) {
  console.log(link.get('excludedReason'));
  //-> BLC_ROBOTS
}

Additionally, more descriptive messages are available for each reason code:

const {reasons} = require('broken-link-checker');

console.log(reasons.BLC_ROBOTS);       //-> Robots exclusion
console.log(reasons.ERRNO_ECONNRESET); //-> connection reset by peer (ECONNRESET)
console.log(reasons.HTTP_404);         //-> Not Found (404)

// List all
console.log(reasons);

Putting it all together:

if (link.get('isBroken')) {
  console.log(reasons[link.get('brokenReason')]);
} else if (link.get('wasExcluded')) {
  console.log(reasons[link.get('excludedReason')]);
}

Finally, it is important to analyze links excluded with the BLC_UNSUPPORTED reason, as it's possible for them to be broken.
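
One way to surface them, as a minimal sketch attached to any checker's 'junk' event (htmlChecker is the instance from the earlier example):

htmlChecker.on('junk', (result) => {
  if (result.get('excludedReason') === 'BLC_UNSUPPORTED') {
    // These links were not checked at all; verify them by other means
    console.log('verify manually:', result);
  }
});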

Roadmap Features

  • 'info' event with messaging such as 'Site does not support HTTP HEAD method' (regarding retryHeadFail option)
  • add cheerio support by using parse5's htmlparser2 tree adaptor?
  • load sitemap.xml at start of each SiteChecker site (since cache can expire) to possibly check pages that were not linked to, removing from list as discovered links are checked
  • change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200
  • abort download of body when options.retryHeadFail===true
  • option to retry broken links a number of times (default=0)
  • option to scrape response.body for erroneous sounding text (using fathom?), since an error page could be presented but still have code 200
  • option to detect parked domain (302 with no redirect?)
  • option to check broken link on archive.org for archived version (using this lib)
  • option to run HtmlUrlChecker checks on page load (using jsdom) to include links added with JavaScript?
  • option to check if hashes exist in target URL document?
  • option to parse Markdown in HtmlChecker for links
  • option to check plain text URLs
  • add throttle profiles (0–9, -1 for "custom") for easy configuring
  • check ftp:, sftp: (for downloadable files)
  • check mailto:, news:, nntp:, telnet:?
  • check that data URLs are valid (with valid-data-url)?
  • supply CORS error for file:// links on sites with a different protocol
  • create an example with http://astexplorer.net
  • use debug
  • use bunyan with JSON output for CLI
  • store request object/headers (or just auth) in Link?
  • supply basic auth for "page" events?
  • add option for URLCache normalization profiles

broken-link-checker's People

Contributors

andyli, nhoizey, scottnonnenberg, staabm, stefanschinkel, stevenvachon


broken-link-checker's Issues

Unable To Link Check Dockerized WordPress Site

I think this is an edge case, but since it happened to me I would like to note it here. I'll try to dive in and, who knows, maybe submit a PR.

Steps To Reproduce
Run blc http://devpatch.com:3000 --filter-level 3 -ro

More Info
I am running a dockerized version of a wordpress site. Testing both locally and the dev instance hosted on devpatch, the broken link checker never fetches or checks a page. Looking at the logs I see the request from BLC but that is it. Below is a screenshot. Left is log, Right is Console output.

I verified I could run BLC without issue on a static, non-Docker site hosted locally at the same port.

[screenshot omitted]

Global install failed Error ENOENT

Hi,
I got this error while installing your app.

Error: ENOENT, chmod 'C:\Users\ADA-LT\AppData\Roaming\npm\node_modules\broken-link-checker\bin\broken-link-checker'

npm ERR! System Windows_NT 6.1.7601
npm ERR! command "C:\Program Files (x86)\nodejs\node.exe" "C:\Program Files (x86)\nodejs\node_modules\npm\bin\npm-cli.js" "install" "broken-link-checker" "-g"
npm ERR! cwd C:{working_dir}
npm ERR! node -v v0.10.13
npm ERR! npm -v 1.3.2
npm ERR! path C:\Users{user-login}\AppData\Roaming\npm\node_modules\broken-link-checker\bin\broken-link-checker
npm ERR! code ENOENT
npm ERR! errno 34

Any idea what it means and how to fix it?
Frankly, I am eager to try it. :)

Thanks guys for all this work. Seems promising.

Error via API on checking

I am getting the below error when running the code via the API.

Unhandled rejection Error: getaddrinfo EAI_AGAIN github.com:443
    at Object.exports._errnoException (util.js:874:11)
    at errnoException (dns.js:31:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:77:26)

Looks like some proxy issue. How can I overcome it?

Requests would be hanging since there is no configuration for response timeout

Hi Steven,

I am using your awesome module to develop some API services. But recently my API hangs when it hits target links that cannot respond in time. I searched the API documentation you provide, and no timeout value can be configured.

I was debugging your source code and noticed that you use node-bhttp as a dependency. I checked the request options you were using:

{
  discardResponse: true,
  headers: {"user-agent": options.userAgent},
  method: retry !== 405 ? options.requestMethod : "get"
}

There is no timeout set. I've checked the bhttp doc - https://github.com/joepie91/node-bhttp

Seems there is an advanced option that can be used:

responseTimeout: The timeout, in milliseconds, after which the request should be considered to have failed if no response is received yet. Note that this measures from the start of the request to the start of the response, and is not a connection timeout. If a timeout occurs, a ResponseTimeoutError will be thrown asynchronously (see error documentation below).

Also there are some more references -

bhttp.ConnectionTimeoutError

The connection timed out.

The connection timeout is defined by the operating system, and cannot currently be overridden.

bhttp.ResponseTimeoutError

The response timed out.

The response timeout can be specified using the responseTimeout option, and it is measured from the start of the request to the start of the response. If no response is received within the responseTimeout, a ResponseTimeoutError will be thrown asynchronously, and the request will be aborted.

You should not set a responseTimeout for requests that involve large file uploads! Because a response can only be received after the request has completed, any file/stream upload that takes longer than the responseTimeout, will result in a ResponseTimeoutError.

Could you please help with this? I just want my API not to hang, even when it encounters some strange links.

Thank you very much!

Different API for HtmlChecker

Hey folks,

Was there a particular reason why HtmlChecker has a different API than the rest of the checkers? By that I mean scan vs enqueue that allows custom data. I'm asking because I was trying to do HTML checking but I do need the option to pass custom data.

Cheers

Add support for self-signed certificates

Was trying to scan a semi-private site which is using a self-signed certificate.
It would be awesome to have a flag (e.g. --no-check-certificate) to circumvent this:
Error: unable to verify the first certificate

Cannot run on Linux

Why can I still not run this on Linux? node v0.10.36, npm 1.3.6, object-assign 4.1.0, promise 3.2.1

302 redirect reported as 404 broken link

I checked the URL http://store.meizu.com/ with HtmlUrlChecker; it reports the link http://ordercenter.meizu.com/list/index.html as broken with an HTTP_404 brokenReason, but it's a redirect link, not a 404.

Here is the result parameter from the HtmlUrlChecker 'link' callback:

{ url:
   { original: 'http://ordercenter.meizu.com/list/index.html',
     resolved: 'http://ordercenter.meizu.com/list/index.html',
     redirected: 'https://login.flyme.cn/vCodeLogin?useruri=http%3A%2F%2Fstore.meizu.com%2Fmember%2Flogin.htm?useruri=http://ordercenter.meizu.com/list/index.html&sid=unionlogin&service=&autodirct=true' },
  base:
   { original: 'http://store.meizu.com/',
     resolved: 'http://store.meizu.com/' },
  html:
   { index: 11,
     offsetIndex: 9,
     location: { line: 51, col: 44, startOffset: 2842, endOffset: 2893 },
     selector: 'html > body > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > ul:nth-child(1) > li:nth-child(2) > a:nth-child(1)',
     tagName: 'a',
     attrName: 'href',
     attrs:
      { class: 'topbar-link',
        href: 'http://ordercenter.meizu.com/list/index.html',
        target: '_blank' },
     text: '我的订单',
     tag: '<a class="topbar-link" href="http://ordercenter.meizu.com/list/index.html" target="_blank">' },
  http:
   { cached: false,
     response:
      { headers: [Object],
        httpVersion: '1.1',
        statusCode: 404,
        statusMessage: 'Not Found',
        url: 'https://login.flyme.cn/vCodeLogin?useruri=http%3A%2F%2Fstore.meizu.com%2Fmember%2Flogin.htm?useruri=http://ordercenter.meizu.com/list/index.html&sid=unionlogin&service=&autodirct=true',
        redirects: [Object] } },
  broken: true,
  internal: false,
  samePage: false,
  excluded: false,
  brokenReason: 'HTTP_404',
  excludedReason: null }

Error: Expected type "text/html" but got "image/jpeg"

> blc http://tw.example.com/ -ro
Getting links from: http://tw.example.com/
├───OK─── http://tw.example.com/location.png
├───OK─── ...
...
Finished! 69 links found. 16 excluded. 0 broken.
...
Getting links from: http://tw.example.com/location.png
Error: Expected type "text/html" but got "image/jpeg"
...
Finished! 219 links found. 114 excluded. 0 broken.
Elapsed time: 1 minute, 15 seconds

What does this mean? And how do I fix it?

P.S. I put the images on a GitHub Pages site.

Fails when response.headers['content-type'] is undefined

On line 36 of lib/internal/getHtmlFromUrl.js, some filetypes (I saw it with .woff2 font files) don't come back with a content-type header, so the if conditional (indexOf) on this line fails. I worked around it as below, but it's a quick fix and I'm not positive if that's adequate.

Original code:
if (response.headers["content-type"].indexOf("text/html") === 0)

I modified it to:
if (response.headers["content-type"] && response.headers["content-type"].indexOf("text/html") === 0)

Hope this is helpful, let me know if you need anything.

Check local files

I'm not able to check local files using the file:// protocol. I get:

ReferenceError: protocol is not defined
Unhandled rejection Error: undefined

Broken on Travis

Running the blc binary breaks on my Travis build (code available here) with the following error:

TypeError: Object function Object() { [native code] } has no method 'assign'
    at parseOptions (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
    at new SiteChecker (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
    at run (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/cli.js:484:14)
    at cli.input (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/lib/cli.js:147:3)
    at Object.<anonymous> (/home/travis/build/sxlijin/git-scm.com/node_modules/broken-link-checker/bin/blc:3:31)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:945:3

Honor Proxy environment variables

Hey,

in my company we are using a proxy so we have some environment variables set to make e.g. npm work correctly. Therefore we set HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables accordingly.

It would be great if you would support basic proxy usage with these environment variables.

Feature request: mixed content (HTTPS) warnings

If your site is using HTTPS, if it embeds images with a non-HTTPS protocol, the browser will display a warning in the URL bar. If it embeds scripts or iframe with a non-HTTPS protocol, the browser will refuse to load that content altogether.

It would be awesome if this tool would detect and report an issue like that.

Feature request: check CSP policies

If you have CSP policies, it's possible that some of the pages on your website are embedding content that is banned in the CSP policy.

It would be awesome if this tool would detect and report that.

Javascript in links

Currently a link like <a href="javascript:void(0);"> will resolve as null, which means it is reported as a broken link. When a link's href begins with a scripting scheme, it should be treated as a success, or ignored and not requested.

proxy

How do I use blc behind a proxy?
All external links are reported as broken:

Getting links from: http://localhost:8080/it/actus/articleB.html
├─BROKEN─ http://placehold.it/900x300 (HTTP_undefined)

Thank
Ami44

HTTP_404 for correct url, possible timeout issue

I'm getting HTTP_404 for this url https://nationalcareersservice.direct.gov.uk/job-profiles/home
Full response:

{ url: 
   { original: 'https://nationalcareersservice.direct.gov.uk/job-profiles/home',
     resolved: URL {},
     rebased: URL {},
     redirected: null },
  base: { resolved: null, rebased: null },
  html: 
   { index: null,
     offsetIndex: null,
     location: null,
     selector: null,
     tagName: null,
     attrName: null,
     attrs: null,
     text: null,
     tag: null,
     base: null },
  http: 
   { cached: false,
     response: 
      { headers: [Object],
        status: 404,
        statusText: 'Not Found',
        url: URL {},
        redirects: [] } },
  broken: true,
  internal: null,
  samePage: null,
  excluded: null,
  brokenReason: 'HTTP_404',
  excludedReason: null }

I've used bhttp directly and it reports 200 back.

Curl:

$ time curl https://nationalcareersservice.direct.gov.uk/job-profiles/home -v > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 13.81.8.21...
* Connected to nationalcareersservice.direct.gov.uk (13.81.8.21) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=nationalcareersservice.direct.gov.uk,O=Skills Funding Agency,OU=IM Services,L=Coventry,ST=West Midlands,C=GB
* 	start date: Oct 22 09:56:02 2016 GMT
* 	expire date: Oct 23 09:56:02 2017 GMT
* 	common name: nationalcareersservice.direct.gov.uk
* 	issuer: CN=GlobalSign Organization Validation CA - SHA256 - G2,O=GlobalSign nv-sa,C=BE
> GET /job-profiles/home HTTP/1.1
> User-Agent: curl/7.40.0
> Host: nationalcareersservice.direct.gov.uk
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Pragma: no-cache
< Content-Type: text/html; charset=utf-8
< Expires: -1
< Server: Microsoft-IIS/8.5
< X-Frame-Options: SAMEORIGIN
< Date: Fri, 24 Feb 2017 17:00:21 GMT
< Content-Length: 21195
< X-FRAME-OPTIONS: SAMEORIGIN
<
{ [7932 bytes data]
100 21195  100 21195    0     0  11437      0  0:00:01  0:00:01 --:--:-- 11432
* Connection #0 to host nationalcareersservice.direct.gov.uk left intact

real	0m1.858s
user	0m0.060s
sys	0m0.020s

It takes around 2s to get the first byte back. Could it be a timeout problem?

npm ERR! enoent ENOENT, chmod '/usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker'

$ npm install -g broken-link-checker
npm ERR! Darwin 14.3.0
npm ERR! argv "node" "/usr/local/bin/npm" "install" "-g" "broken-link-checker"
npm ERR! node v0.12.2
npm ERR! npm  v2.8.3
npm ERR! path /usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker
npm ERR! code ENOENT
npm ERR! errno -2

npm ERR! enoent ENOENT, chmod '/usr/local/lib/node_modules/broken-link-checker/bin/broken-link-checker'
npm ERR! enoent This is most likely not a problem with npm itself
npm ERR! enoent and is related to npm not being able to find a file.
npm ERR! enoent 

wish: also check url in css

Hi,

Nice tool!

I have this in my style.css:
a[href$='.pdf']:after { content: url("graphics/pdficon.gif"); padding: 0 3px 0 0; }
It would be nice to check url's in css.

regards,

Peter

TypeError: Cannot read property '__parsed' of null?

This is a problem I found in the demo you give. Give some help, thks!

var blc = require("broken-link-checker");

var html = '<a href="https://google.com">absolute link</a>';
html += '<a href="/path/to/resource.html">relative link</a>';
html += '<img src="http://fakeurl.com/image.png" alt="missing image"/>';

var htmlChecker = new blc.HtmlChecker(null, {
    link: function(result) {
        console.log(result.html.index, result.broken, result.html.text, result.url.resolved);
        //-> 0 false "absolute link" "https://google.com/"
        //-> 1 false "relative link" "https://mywebsite.com/path/to/resource.html"
        //-> 2 true null "http://fakeurl.com/image.png"
    },
    complete: function() {
        console.log("done checking!");
    }
});

htmlChecker.scan(html, "https://mywebsite.com");

feature request: adjustable recursive level

Hi Steve,
It would be good if broken-link-checker had an option to specify the recursion level, ranging from 0 to some number. Currently the app seems to run into infinite recursion; some websites are super big, and checking links at the 100th recursion level rarely makes sense because a user almost never gets that deep into a site. I used the recursive option on my website (which is huge) and it has been running for almost 3 days without finishing.

Thanks,
Jeb

Issue when using on windows

When trying to install on Windows I get the following error; it also fails when used as a bower dependency:

npm ERR! Error: ENOENT, chmod 'C:\node_modules\broken-link-checker\bin\broken-li
nk-checker'
npm ERR! If you need help, you may report this entire log,
npm ERR! including the npm and node versions, at:
npm ERR! http://github.com/npm/npm/issues

npm ERR! System Windows_NT 6.2.9200
npm ERR! command "C:\Program Files\nodejs\node.exe" "C:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" "install" "broken-link-checker"
npm ERR! cwd C:
npm ERR! node -v v0.10.36
npm ERR! npm -v 1.4.28
npm ERR! path C:\node_modules\broken-link-checker\bin\broken-link-checker
npm ERR! code ENOENT
npm ERR! errno 34
npm ERR! not ok code 0

Silent crash on node 6

Thanks for this package. So useful! :0)

I downloaded it and tried it on my blog today, and it was giving me just four lines of output. Confused, I started looking at the code, and realized that it finished in the middle of enqueuing its first set of links. With a new try/catch in HtmlChecker/enqueueLink I found that the process was indeed silently crashing:

TypeError: source.hasOwnProperty is not a function
    at cloneObject (lib/internal/linkObj.js:239:14)
    at cloneObject (lib/internal/linkObj.js:245:18)
    at cloneObject (lib/internal/linkObj.js:245:18)
    at Function.linkObj.resolve (lib/internal/linkObj.js:168:64)
    at enqueueLink (lib/public/HtmlChecker.js:158:10)
    at lib/public/HtmlChecker.js:110:5
    at process._tickCallback (internal/process/next_tick.js:103:7)

On node 6.1.0 it throws this error, and on node 4.4.3 it works properly. Seems that there are two bugs here - one about the hasOwnProperty issue, and the other about silently crashing!

Possible EventEmitter memory leak detected on Redirects

When testing the links of some websites like the nytimes.com you will notice a warning message :
(node:4687) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit

This is due, I think, to some links having a great number of redirects. What do you think of adding a maxRedirects option (like the request module has, for instance)?

My set up:

const check = new LinkChecker.HtmlChecker({excludeExternalLinks: true, filterLevel: 0}, {
  link: (result) => {
    links.goodLinks++;
    if (result.broken)
      links.brokenLinks++;
  },
  complete: () => {
    return wrapper(links.goodLinks, links.brokenLinks);
  }
});
check.scan(this.resource.html(), this.url, links);

Roadmap: Markdown

option to parse Markdown in HtmlChecker for links

You could use a markdown parser like marked to work with Markdown files.

var marked = require('marked');

htmlChecker.scan(marked('I am using __markdown__.'));

Link to zero-byte HTML file breaks broken-link-checker

I have a link to a zero-byte HTML file in one of my pages. As soon as blc tries to access that empty HTML page, an error is thrown and blc is aborted:

$ node node_modules/broken-link-checker/bin/blc -fvr http://beta.grossweber.com/blc
Getting links from: http://beta.grossweber.com/blc
└───OK─── http://beta.grossweber.com/blc/empty.html
Finished! 1 links found. 0 broken.

Getting links from: http://beta.grossweber.com/blc/empty.html
Error: Unhandled Rejection. TypeError: Cannot read property 'length' of undefined

Issues on fs object in Browerify

When used along with Browserify, the filesystem-related methods do not work.

On the below lines:

errorCss = fs.readFileSync(__dirname + '/static/error.css', 'utf8');
errorHtml = fs.readFileSync(__dirname + '/static/error.html', 'utf8');

We get the below error:

Uncaught TypeError: fs.readFileSync is not a function

Is there a recursive feature?

Is there a feature to crawl a website looking for broken links? If not, could this be mentioned in the documentation?

Object() has no method 'assign'

Did a global install of broken-link-checker
npm install -g broken-link-checker

Tried to run it and got this error:

/home/myhome/Projects/SciServer/Dev.das $blc http://www.sdss.org -ro

/home/myhome/.local/lib/node_modules/broken-link-checker/lib/internal/parseOptions.js:42
                options = Object.assign({}, defaultOptions, options);
                                 ^
TypeError: Object function Object() { [native code] } has no method 'assign'
    at parseOptions (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
    at new SiteChecker (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
    at run (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/cli.js:467:14)
    at cli.input (/home/myhome/.local/lib/node_modules/broken-link-checker/lib/cli.js:144:3)
    at Object.<anonymous> (/home/myhome/.local/lib/node_modules/broken-link-checker/bin/blc:3:31)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:906:3

Then I looked at the readme:

Node.js >= 0.10 is required; < 4.0 will need Promise and Object.assign polyfills.

I have node 0.10.33. Not sure what module "< 4.0" refers to, but I also tried
npm install -g promise
and
npm install -g object.assign
but got the same error.
Did I not install the right object.assign module?

NPM Install does not work

It appears that NPM is using your .gitignore file in place of .npmignore
For more information see the SO thread: http://stackoverflow.com/questions/17990647/npm-install-errors-with-error-enoent-chmod

I am not able to load your module with:
npm install broken-link-checker

However as the thread mentions I am able to load it with:
npm install broken-link-checker --no-bin-links

Which leads me to believe that if you check in a blank .npmignore file this issue will be resolved.

Feature: embed local http server to check static files

Would it be possible to add the capability to run a local HTTP server to serve static files and then check them for broken links?

A command line like blc _site -ro could start a local server, serving the files in _site and then start the analysis. Maybe something along those lines:

var finalhandler = require('finalhandler');
var http = require('http');
var serveStatic = require('serve-static');

var serve = serveStatic(<directory>);
var server = http.createServer(function onRequest (req, res) {
  serve(req, res, finalhandler(req, res))
})

server.listen(9001, function(){
    console.log('Server running on 9001...');
    // Call broken-link-checker on URL http://localhost:9001
});

Or maybe there is a way, in one command, to start a local server, wait for it to be up, and then start broken-link-checker?
The problem with node http-server _site -p 9001 & blc http://0.0.0.0:9001 -ro is that broken-link-checker is executed before the web server finishes starting up, and therefore produces an error.

[QUESTION] Excluded links during blc execution

Hi,

Not an issue more of a question that I cannot find the answer to.

I have been spiking out BLC for a project that I am working on. However, when I run it via the command line, many links seem to be ignored.

I am executing the tests like:
NODE_TLS_REJECT_UNAUTHORIZED=0 blc https://foo.com -ro

The result is:
Finished! 16516 links found. 15854 excluded. 50 broken.

Is there a way to force blc to check all links?

Thanks,
Ian

Exit code 1 if broken links are found

As of now, the tool exits with code 0 in all scenarios.
I know you have a recent commit fixing this, but the package published to npm does not include that fix yet.

Have you updated the NPM package?
Or is it a problem on my side?

Something wrong while checking links that start with a double slash

Hi,

Test link here - http://stackoverflow.com/questions/12507021/best-configuration-of-c3p0

Issue 1 - I ran this webpage through HtmlUrlChecker and got 61 results. Then I reran it and got 18 results. This is so weird, and I cannot figure out how this can happen.

Issue 2 - Actually, the results for this page don't make any sense. Take the following output:
{ "originalUrl": "//webapps.stackexchange.com", "resolvedUrl": "http://webapps.stackexchange.com/questions/12507021/", "brokenReason": "HTTP_404" }

I checked the page, and the <a> tag with href '//webapps.stackexchange.com' really does exist. But how can it be resolved as "http://webapps.stackexchange.com/questions/12507021/", with the path from my original link appended?

I looked into the source code; the resolving operation is done by another of your modules, "urlobj". When function resolveUrl(from, to, options) is invoked, after

var pathname = joinDirs(urlobj.extra.directory, urlobj.extra.directoryLeadingSlash);

The original path would be appended after the resolved url. Could you please look into this?

Best Regards,

Jet

Problems with gzipped sites?

I'm having issues making this work with sites on Amazon S3 with gzipped content. I get a "Finished! 0 links found." message back. Other sites without gzip work perfectly though.

Is this a known issue and are there plans to support sites that use gzip?

Feature Request: Broken Link URL

It would be great if there were a way to get the current link being checked.
I get some errors like Invalid but do not know which link is broken.

var blc = require('broken-link-checker');

var htmlChecker = new blc.HtmlChecker({}, {
  link: function(result) {
    if (result.broken) {
      console.log(blc[result.brokenReason]);
    }
  }
});

Output:

Invalid
Invalid
Invalid
Invalid
Invalid
Invalid

Why do I get this error on Linux?

/broken-link-checker/lib/internal/parseOptions.js:42
options = Object.assign({}, defaultOptions, options);
^
TypeError: Object function Object() { [native code] } has no method 'assign'
at parseOptions (/home/avatar/node_modules/broken-link-checker/lib/internal/parseOptions.js:42:20)
at new SiteChecker (/home/avatar/node_modules/broken-link-checker/lib/public/SiteChecker.js:22:27)
at run (/home/avatar/node_modules/broken-link-checker/lib/cli.js:467:14)
at cli.input (/home/avatar/node_modules/broken-link-checker/lib/cli.js:144:3)
at Object.<anonymous> (/home/avatar/node_modules/broken-link-checker/bin/blc:3:31)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Function.Module.runMain (module.js:497:10)
at startup (node.js:119:16)
at node.js:929:3

HTTP_403 while curl gives HTTP_200

I have encountered a link which is considered broken by blc but opens well in curl or browser.

Here it is:

blc https://www.nginx.com

CURL works fine:

$ curl -I https://www.nginx.com
HTTP/1.1 200 OK
Date: Mon, 02 Jan 2017 06:52:15 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Pingback: https://www.nginx.com/xmlrpc.php
Link: <https://www.nginx.com/wp-json/>; rel="https://api.w.org/"
Link: <https://www.nginx.com/>; rel=shortlink
Link: <https://www.nginx.com/wp-json>; rel="https://github.com/WP-API/WP-API"
X-User-Agent: standard
X-Cache-Config: 0 0
Vary: Accept-Encoding, User-Agent
X-Cache-Status: MISS
Server: nginx
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-Sucuri-ID: 14010

but BLC does not:

$ blc https://www.nginx.com
Getting links from: https://www.nginx.com/
Error: HTML could not be retrieved

User agent does not help:

$ blc --input https://www.nginx.com --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.3 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.3"
Getting links from: https://www.nginx.com/
Error: HTML could not be retrieved

What is the problem?
Is it specifically an NGINX bug, or is a larger set of websites affected?
