michaeltelford / broken_link_finder

Finds a website's broken links and reports back to you with a summary

Home Page: https://rubygems.org/gems/broken_link_finder

License: MIT License

Ruby 99.88% Shell 0.12%
wgit ruby links broken-links website broken-link-finder

broken_link_finder's Introduction

Broken Link Finder

Does what it says on the tin - finds a website's broken links.

Simply point it at a website and it will crawl all of its webpages, searching for and identifying broken links. You will then be presented with a concise summary of any broken links found.

Broken Link Finder is multi-threaded and uses libcurl under the hood, so it's fast!

How It Works

Any HTML element within <body> with an href or src attribute is considered a link (though this is configurable).

For each link on a given page, any of the following conditions means the link is considered broken:

  • An empty HTML response body is returned.
  • A response status code of 404 Not Found is returned.
  • The HTML response body doesn't contain an element ID matching the link's fragment, e.g. http://server.com#about must contain an element with id="about", otherwise the link is considered broken (illustrated in the sketch after this list).
  • The link redirects more than 5 times consecutively.
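
As a rough illustration of the fragment condition above, here is a minimal sketch using Nokogiri; the helper name fragment_broken? is hypothetical and not part of the gem's API:

require 'nokogiri'

# Hypothetical helper illustrating the fragment check described above:
# a link such as http://server.com#about is only OK if the crawled
# response body contains an element with id="about".
def fragment_broken?(html_body, fragment)
  return false if fragment.nil? || fragment.empty?

  doc = Nokogiri::HTML(html_body)
  doc.at_css("##{fragment}").nil? # broken if no element has that ID
end

puts fragment_broken?('<p id="about">About us</p>', 'about') # => false (link OK)
puts fragment_broken?('<p>No anchors here</p>', 'about')     # => true  (broken)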

Note: Not all link types are supported.

In a nutshell, only HTTP(S)-based links can be verified by broken_link_finder. As a result, some links on a page might be (recorded and) ignored. You should verify these links manually yourself. Examples of unsupported link types include tel:*, mailto:*, ftp://*, etc.

See the usage section below on how to check which links have been ignored during a crawl.

With that said, the usual array of HTTP URL features is supported, including anchors/fragments, query strings and IRIs (non-ASCII based URLs).

Made Possible By

broken_link_finder relies heavily on the wgit Ruby gem by the same author. See its repository for more details.

Installation

Only MRI Ruby is tested and supported, but broken_link_finder may work with other Ruby implementations.

Currently, the required MRI Ruby version is:

ruby '>= 2.6', '< 4'

Using Bundler

$ bundle add broken_link_finder

Using RubyGems

$ gem install broken_link_finder

Verify

$ broken_link_finder version

Usage

You can check for broken links via the executable or library.

Executable

Installing this gem installs the broken_link_finder executable into your $PATH. The executable allows you to find broken links from your command line. For example:

$ broken_link_finder crawl http://txti.es

Adding the --recursive flag would crawl the entire txti.es site, not just its index page.
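
For example, to crawl the whole site rather than just the index page:

$ broken_link_finder crawl --recursive http://txti.es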

See the output section below for an example of a site with broken links.

You can peruse all of the available executable flags with:

$ broken_link_finder help crawl

Library

Below is a simple script which crawls a website and outputs its broken links to STDOUT:

main.rb

require 'broken_link_finder'

finder = BrokenLinkFinder.new
finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
finder.report                      # Or use Finder#broken_links and Finder#ignored_links
                                   # for direct access to the link Hashes.

Then execute the script with:

$ ruby main.rb
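
If you'd rather consume the results programmatically than print a report, here is a minimal sketch using the broken_links and ignored_links accessors mentioned in the comments above (the key => links layout assumed below is illustrative):

require 'broken_link_finder'

finder = BrokenLinkFinder.new
finder.crawl_site 'http://txti.es'

# Assumed layout: each Hash maps a key (such as a page URL) to an Array of links.
finder.broken_links.each do |page, links|
  puts "#{page} has #{links.size} broken link(s)"
end

finder.ignored_links.each do |page, links|
  puts "#{page} has #{links.size} ignored link(s) to check manually"
end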

See the full source code documentation here.

Output

If broken links are found then the output will look something like:

Crawled http://txti.es
7 page(s) containing 32 unique link(s) in 6.82 seconds

Found 6 unique broken link(s) across 2 page(s):

The following broken links were found on 'http://txti.es/about':
http://twitter.com/thebarrytone
/doesntexist
http://twitter.com/nwbld
twitter.com/txties

The following broken links were found on 'http://txti.es/how':
http://en.wikipedia.org/wiki/Markdown
http://imgur.com

Ignored 3 unique unsupported link(s) across 2 page(s), which you should check manually:

The following links were ignored on 'http://txti.es':
tel:+13174562564
mailto:[email protected]

The following links were ignored on 'http://txti.es/contact':
ftp://server.com

You can provide the --html flag if you'd prefer an HTML-based report.
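
For example:

$ broken_link_finder crawl --html http://txti.es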

Link Extraction

You can customise the XPath used to extract links from each crawled page. This can be done via the executable or library.

Executable

Add the --xpath (or -x) flag to the crawl command, e.g.

$ broken_link_finder crawl http://txti.es -x //img/@src

Library

Set the desired XPath using the accessor methods provided:

main.rb

require 'broken_link_finder'

# Set your desired xpath before crawling...
BrokenLinkFinder::link_xpath = '//img/@src'

# Now crawl as normal and only your custom targeted links will be checked.
BrokenLinkFinder.new.crawl_page 'http://txti.es'

# Go back to using the default provided xpath as needed.
BrokenLinkFinder::link_xpath = BrokenLinkFinder::DEFAULT_LINK_XPATH

Contributing

Bug reports and feature requests are welcome on GitHub. Just raise an issue.

License

The gem is available as open source under the terms of the MIT License.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run bundle exec rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install.

To release a new gem version:

  • Update the deps in the *.gemspec, if necessary.
  • Update the version number in version.rb and add the new version to the CHANGELOG.
  • Run bundle install.
  • Run bundle exec rake test, ensuring all tests pass.
  • Run bundle exec rake compile, ensuring there are no warnings.
  • Run bundle exec rake install && rbenv rehash.
  • Manually test the executable.
  • Run bundle exec rake release[origin].

broken_link_finder's Issues

Feature Request - Display crawl statistics at top of report

At the start/top of the report, the total number of pages crawled and the total crawl duration should be displayed to the user. This would also double as a total for the pages on a site. E.g.

Crawled http://txti.es (12 page(s) in 5.2 seconds)

Found 6 broken link(s) across 2 page(s):

The following broken links were found on 'http://txti.es/about':
http://twitter.com/thebarrytone
http://twitter.com/nwbld
...

The 12 page(s) bit would only be displayed on an -r site crawl and would be absent when crawling a single page. Having the crawled URL in both cases, however, makes for a more complete report (especially for large reports).

Reports sometimes show link protocol as http:// even though the link protocol given on the command line was https://

When I run a report like this:

broken_link_finder crawl --recursive --sort-by-link https://example.com

For some sites the report shows the protocol for the links as http:// even though the site's .htaccess file redirects all requests to the SSL version of the site and the protocol https:// was given on the command line.

Other sites report back the links with the SSL protocol of https://.

Which protocol is broken_link_finder using when it checks the links? Is there a reason it would fall back to the http:// protocol instead of using the requested protocol of https://?

Links in an external page show in the broken links report - URI::InvalidURIError - URI must be ascii only

In testing on one site, I found that one page on the site being checked was redirecting to an external page. broken_link_finder reported broken links in the external page too.

The thing that clued me in to it was that some of the links were broken CSS and JS script files that we do not have on the site being tested.

@michaeltelford I will send an email with more details on the sites and pages in question for you to be able to test yourself.

Only broken href links to other domains are reported

Hi!
I set up a test page to test the different types of broken links and combinations thereof: links to the same site, links to external sites, links to anchors and links to images. Of the seven broken links, all of which return a 404 when checked with curl -I, only two were reported as broken.

  • Images that were missing were not reported.
  • Broken links on the same domain were not reported.
  • Links to external sites were reported as broken.

Broken links are sometimes reported despite being OK

I've found some links to be coming back as broken when they aren't in reality (verified with cURL).

I've traced the root of this issue to the latest version of the wgit gem, which uses Typhoeus underneath to perform HTTP GET requests. I've manually verified that this issue is limited to Typhoeus and not to wgit or broken_link_finder.

As a result, I've raised an issue on the Typhoeus GitHub repo which (when fixed) will resolve the issue described above. This issue only affects the latest version of broken_link_finder, which is v0.9.2. Use a previous version if you are affected in the meantime. Version 0.9.2 doesn't introduce any breaking changes, so switching versions shouldn't be an issue.

Thanks for your understanding on this matter and sorry for any inconvenience caused.

Link reported as broken after two hops in a 301 chain

Hi!
The CMS I am using will first redirect (301) to the language version of the page if the language prefix has not been added (i.e. de for German). If the CMS then has another redirect (301) for the URL with the language prefix, this creates a chain of two 301s. The next hop hits the page and returns an HTTP status code of 200.

broken_link_finder reports the first link as broken in its output report.

The Googlebot, in the source(s) I have found, will follow up to four chained 301s before giving up. Chrome, Firefox, and Edge will follow up to five chained 301s before returning the error message: "Too many redirects. Can't load site."

Now granted, 301 chains are not good for a lot of reasons, but I would hardly consider a chain of two 301s a broken link.

This feature request is to add a sane default of three chained 301s as acceptable, producing no output. Four or more chained 301s could give a warning, as the page may not be indexed or displayed.

The other option would be to make the number of chained 301s before a link is reported as "broken" (or possibly another term to better identify the issue) a command-line option that the user can change at run time.
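
For context, below is a generic Ruby sketch (not broken_link_finder's implementation) of following a redirect chain up to a configurable hop limit, after which the link would be treated as broken:

require 'net/http'
require 'uri'

# Generic illustration only. Follows up to `limit` chained redirects and
# returns the final response, or nil if the chain is longer than the limit.
def follow_redirects(url, limit: 3)
  uri = URI(url)

  (limit + 1).times do
    response = Net::HTTP.get_response(uri)
    return response unless response.is_a?(Net::HTTPRedirection)

    uri = URI.join(uri, response['location']) # resolve relative Location headers
  end

  nil # more than `limit` chained redirects; treat as broken
end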

Thanks for considering this feature request!
Frederick

Protocol relative links are reported as missing and linked incorrectly in email reports

In one of the broken link reports that normally gets sent to the website owner (which I had personally not looked at for some time), I found the following:

The broken link '//s.w.org' was found on the following pages:

This link is not really broken. I found that these are what are called protocol-relative links. In other words, depending on whether the document containing them is served over http or https, the same protocol is used for the protocol-relative links. Using this type of link in modules and add-ons for a CMS is best practice to avoid mixed-content errors.

@michaeltelford This is a new thing for me. I had never heard about protocol-relative links until I researched this today.

The link for this broken link in the email was also incorrect, i.e. it linked to something like https://example.com//fonts.googleapis.com when it should have linked to https://fonts.googleapis.com.
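
For reference, a protocol-relative link takes its scheme from the page it appears on; a small illustration using Ruby's standard URI library (example.com is a placeholder):

require 'uri'

# '//s.w.org' is a protocol-relative (network-path) reference: it inherits
# the scheme of the page it was found on.
puts URI.join(URI('https://example.com/some/page'), '//s.w.org') # => https://s.w.org
puts URI.join(URI('http://example.com/some/page'), '//s.w.org')  # => http://s.w.org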

Feature Request - Implement a retry mechanism for broken links

Implement a retry mechanism for broken links just before the crawl_* method exits. This mechanism would retry the @broken_links and verify that they are indeed broken. If not, then they'd be deleted from the array.

This would solve the issue where links are being reported as broken when they aren't in reality. They're being hit too quickly/regularly during the crawl to give an OK response. Then when the report is generated, it contains links that when clicked, work OK.

The addition of a retry mechanism would obviously slow down the overall crawl, but it would far improve the crawl's integrity, which is a worthwhile trade-off.
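
A rough sketch of what such a retry pass could look like; the method names and the use of Net::HTTP below are purely illustrative, not the gem's API:

require 'net/http'
require 'uri'

# Illustrative only: re-check each supposedly broken link just before the
# crawl finishes and drop any that now respond OK.
def link_ok?(link)
  response = Net::HTTP.get_response(URI(link))
  response.code != '404' && !response.body.to_s.empty?
rescue StandardError
  false
end

def retry_broken_links(broken_links)
  broken_links.reject { |link| link_ok?(link) } # keep only links still failing
end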

Consider offering a compiled binary for broken_link_finder

@michaeltelford I have been more than happy with broken_link_finder, but I am now in a position where I need to move my broken link checking to a server where I do not have superuser permissions. Trying to set up a Ruby environment in this situation is not a trivial matter and I believe it is not even possible in my case.

I see that there are a few projects that compile Ruby programs into a portable binary which can be run without a full Ruby install. This request is to ask you to consider offering a compiled binary for broken_link_finder. This too does not seem like a trivial matter. With my limited Ruby knowledge and experience, it does not make sense for me to spend the time and effort to compile this just for my own work. It would make more sense to do this at the project level, as that would allow adoption of the program at a higher level.

If this is not something that you are interested in doing, just close this issue.
Thanks for your consideration!
Frederick

Rate limiting / checking links multiple times because they appear on multiple pages

The server that I am using to run broken_link_finder is quite limited in resources, as it only has 1 CPU. I have noticed that the 1 minute and 5 minute CPU load averages will go above 2 while broken_link_finder is running.

I know that I can use nice to slow processes down, but I am also concerned that sites may be dropping or refusing requests because of too many in too short a time span.

In my most recent test, the footer of one site contains a link to the site's Facebook page and to their YouTube channel. Both of these were reported as broken links once, in one report, on only one page. I used a web browser and curl -I example.com to check them and I was getting an HTTP response code of 200 for both.

Does broken_link_finder check every link found on a page, even if it exactly matches links already found on other pages of that site? If it is not skipping exact matches, that means for this site it would hit YouTube and Facebook on every page to test the same links.
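
For reference, skipping links that exactly match ones already checked on earlier pages is typically done with a set of seen URLs; a generic sketch (not the gem's actual behaviour, with made-up page/link data):

require 'set'

# Made-up data purely for illustration.
pages = {
  'http://example.com/'        => ['http://youtube.com/channel', 'http://facebook.com/page'],
  'http://example.com/contact' => ['http://youtube.com/channel'] # same link as on the first page
}

checked = Set.new

pages.each do |page, links|
  links.each do |link|
    next if checked.include?(link) # already verified on a previous page; skip the HTTP hit
    checked << link
    puts "Checking #{link} (first seen on #{page})"
  end
end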

Bug - HTML reports have incorrect absolute links

HTML reports have incorrect absolute links for relative links that redirect to a different host.

To replicate:

  • Crawl a site and generate an HTML report.
  • For a relative link that redirects, the built absolute link will contain the redirected-to host, not the host of the originally crawled URL.

Also, the HTML report URL (in the summary at the top of the report) isn't an HTML link; it should be.

Feature Request - Support an HTML report format

Currently, with the text-based email format, only the absolute links in the email reports get made into clickable links by some email clients.

Changing the email format to HTML would mean that both absolute and relative links could be formatted so that they are clickable links.

Bug - URL anchors are asserted on redirected-to URLs

URL anchors are asserted on redirected-to URLs, but they probably shouldn't be. The redirected-to URL is correct as is (because of the deliberate redirect), meaning the original URL's anchor no longer applies; so we should just ignore it.

Bug - Handling of URI with https:// - An error has occurred: Absolute URI missing hierarchical segment: 'http://'

After updating to v0.10.0, I am seeing the program fail a number of times to complete the check for broken links. The error returned is:

An error has occurred: Absolute URI missing hierarchical segment: 'http://'

Interestingly, this occurs when using a URI with an SSL connection starting with https://, so I am wondering why it is giving an error about a non-SSL http:// URI.

I think this is related to an external program that it is using, as I cannot find that error text in the broken_link_finder code on GitHub by searching for it or sections of it.

I tested by running the command once per hour over a day with my wrapper script, and also on the command line without the wrapper script (i.e. broken_link_finder crawl --recursive --html --sort-by-link https://example.com). The result was 28 failures out of 46 runs.

I also found that when the error above is returned, broken_link_finder exits with an OK status code of 0, the same as if everything functioned correctly and a report of broken links was returned.

This means that when I use cronic to run this with cron, it will not email me that there was an error. This is not the way I believe this should function.

Support for Ruby 3

Add support for Ruby 3 in a future version of broken_link_finder.

This should include:

  • Types generated from the yard docs using: https://github.com/AaronC81/sord
  • Updating .ruby-version to use Ruby 3.
  • All tests passing on Ruby 3.
  • Updating the README's Installation section about Ruby versioning.
