This is a web crawler library, with 3 flavors of crawling:
- the usual crawler, give him a url, and keeps on going until he visits all of the linked pages (that point to same domain).
- a sitemap crawler, give him a sitemap.xml and visits the urls he finds in the sitemap.
- a step crawler, give him a csv file with the list of urls(steps) to visit.
On each page the crawler visits it executes a bit of javascript code that we can define as a validation. These validations are usefull to test you site in an automated way, say for example, you want to check:
- if all pages contain meta tags
- if all pages contain a title
- if al pages contain web analytics tracking
- find out what links are broken
- test your own javascript page code in all(or some) pages.
- etc..
Its based on the new version of Selenium that uses Webdriver, thus allowing you to(in theory*) use either Firefox, Internet Explorer(when on windows), Chrome or HtmlUnit(a GUI-Less browser).
*In practice, there’s a couple of issues, note that Selenium 2 is in beta version: Chrome does not likes to run javascript. And i’ve had a few Out of Memory exceptions with HtmlUnit, so for now i’m using mostly Firefox. But as with all beta softwware this is expected and with time and new selenium2 releases, it should be all fine.
To be used as a library, thus:
(require '[cov3 :as cov3])
;; then: (:ff is short for firefox, use :hu for HtmlUnit,
;; :ch for Chrome, and :ie for Internet Explorer)
(cov3/crawl :ff "http://al3xandr3.github.com/" '("document.title"))
;; or: (10 is the sample size to pick from sitemap.xml)
(cov3/sitemap-crawl :ff "http://al3xandr3.github.com/sitemap.xml" "" "" 10 '("document.title"))
;; or: (assuming you have a csv file with the steps to take, see more on documentation)
;; for example the line: http://al3xandr3.github.com/,"document.title",,
(cov3/step-crawl :ff "data/steps.csv")
Its easeally made into a standalone executable .jar, just add a main and run: lein uberjar
Assuming you have lein installed:
> git clone git://github.com/al3xandr3/cov3.git
> cd cov3/
> lein deps
> lein install
When doing lein install on the root of the cov3 it will install the built jar into your local maven repository, typically into: ~/.m2/repository/
Then from a new project.clj just add it to the dependencies, like:
(defproject myproj "1.0.0" :dependencies [[cov3/cov3 "0.1.0"]] ...)
> lein autodoc
Will create html files into autodoc/ where there’s the documentation extracted directly from the code.