turbocrawler

A distributed web crawler written in Ruby using Redis and Apache Kafka. Forked from samsondav/turbocrawler.

Intro

Write a simple web crawler.

Spec

  • The crawler is limited to a single domain. For example, when crawling example.com it crawls all pages within that domain, but not external links such as Facebook or Twitter accounts (a minimal illustration of this rule follows the list).
  • Given a URL, it outputs a site map, showing which static assets each page depends on, and the links between pages.
  • Write it as you would a production piece of code.
  • Bonus points for tests and making it as fast as possible!
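The same-domain rule above boils down to a host comparison. A minimal sketch, illustrative only and not taken from the project's source:

require 'uri'

# A link is in scope only if its host matches the seed URL's host.
def same_domain?(seed, link)
  URI(seed).host == URI(link).host
end

same_domain?('https://example.com', 'https://example.com/about')  # => true
same_domain?('https://example.com', 'https://facebook.com/acme')  # => false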

Architecture

  1. URL Frontier Queue (Apache Kafka)
  2. Fetch module (worker.rb)
  3. Parse module (page.rb)
  4. Sitemap store (Redis)

This crawler uses Apache Kafka as a messaging queue.

Any number of workers can attach to the queue; each worker reads a URL, crawls it, and inserts newly discovered links at the back of the queue.

Sitemap data for each page is stored in Redis.

The system is failure-tolerant and guarantees that every URL will be crawled at least once.

If a worker should die while crawling a URL, Kafka's Consumer Groups feature will automatically assign the URL to a new worker. There may be duplicated work but never lost URLs.
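Putting the pieces together, a worker's main loop looks roughly like the sketch below. It assumes the ruby-kafka, redis and nokogiri gems; the topic name ('urls'), consumer group id, and Redis key scheme are illustrative, not taken from worker.rb or page.rb:

require 'kafka'
require 'redis'
require 'nokogiri'
require 'net/http'
require 'json'

kafka = Kafka.new(['localhost:9092'])
redis = Redis.new

# Consumer groups provide at-least-once delivery: if a worker dies
# mid-crawl, its partition is reassigned to another member of the group.
consumer = kafka.consumer(group_id: 'turbocrawler')
consumer.subscribe('urls')

consumer.each_message do |message|
  url = message.value
  doc = Nokogiri::HTML(Net::HTTP.get(URI(url)))

  links  = doc.css('a[href]').map { |a| a['href'] }
  assets = doc.css('img[src], script[src], link[href]')
              .map { |node| node['src'] || node['href'] }

  # Record this page's sitemap entry in Redis.
  redis.set("sitemap:#{url}", JSON.generate(links: links, assets: assets))

  # Feed newly discovered links back into the frontier queue.
  # (Real code would resolve relative links and filter to the seed domain.)
  links.each { |link| kafka.deliver_message(link, topic: 'urls') }
end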

Speed and Scalability

Because the architecture is distributed, you can run workers on as many machines as you like, so the crawling component scales to whatever throughput you need.

At very high concurrency levels Redis might conceivably be a bottleneck. It could be replaced by a distributed data store backend such as Cassandra.

Rendering performance has not been optimized at all and might be quite slow for large sites.

Installation

Requirements:

  • Apache Kafka
  • Redis
  • Ruby >= 2.3.0

Install the Ruby dependencies with Bundler:

bundle

Configuration

See config.yml. You will probably need to add your local Kafka and Redis configurations there.
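For a sense of how the values are consumed, here is a minimal sketch of loading the file; the key names are an assumption for illustration, not verified against config.yml:

require 'yaml'

config  = YAML.load_file('config.yml')
brokers = config['kafka']['brokers']  # e.g. ['localhost:9092'] (hypothetical key)
redis   = config['redis']['url']      # e.g. 'redis://localhost:6379' (hypothetical key)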

Start Workers

bundle exec ruby start.rb

Note that the workers run indefinitely, until you quit with Ctrl-C.

Render Sitemap

Sitemaps are output in JSON format. You can run this in a separate shell from your workers, or even on another machine.

bundle exec ruby render.rb
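Under the architecture above, rendering amounts to reading every page's entry out of Redis and emitting a single JSON document. A minimal sketch; the key pattern is an assumption carried over from the worker sketch, not taken from render.rb:

require 'redis'
require 'json'

redis   = Redis.new
sitemap = {}

# Collect every per-page entry written by the workers.
redis.scan_each(match: 'sitemap:*') do |key|
  url          = key.sub(/\Asitemap:/, '')
  sitemap[url] = JSON.parse(redis.get(key))
end

puts JSON.pretty_generate(sitemap)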

Run tests

bundle exec rspec

Limitations

The following URL responses are treated as an empty page with no links (a minimal illustration follows the list):

  • Any status code other than 200
  • Any Content-Type other than text/html
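In code, that guard might look like this; an illustrative sketch, not the project's actual fetch logic:

require 'net/http'

# Returns the page body, or an empty string for anything the crawler
# treats as an empty page: non-200 statuses and non-HTML content types.
def fetch_html(url)
  response = Net::HTTP.get_response(URI(url))
  return '' unless response.code == '200'
  return '' unless response.content_type == 'text/html'
  response.body
end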
