
crawler's Introduction

A spider demo written in Go.

Standalone version

The standalone implementation is modeled on Python's Scrapy framework, and borrows its architecture diagram:

architecture

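The main modules are:

  • Engine: handles the data flow between the modules; generates Requests and passes them to the Scheduler module.
  • Scheduler: deduplicates and filters URLs, and dispatches pending Requests to the Spider module.
  • Spider: fetches and parses web pages; parsed Items are passed to the Item Pipeline, and new Requests are passed back to the Engine module.
  • Item Pipeline: filters and persists the parsed results.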
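
The data flow between these four modules can be sketched in a few dozen lines of Go. This is only an illustration, not the project's actual code: the type names (`Request`, `Item`, `Scheduler`, `Spider`, `Pipeline`), the `Run` function, and the stub `Pages` map that stands in for real HTTP fetching are all assumptions made for the example.

```go
package main

import "fmt"

// Request is a hypothetical unit of work: one URL to crawl.
type Request struct{ URL string }

// Item is a hypothetical parsed result.
type Item struct{ URL, Title string }

// Scheduler deduplicates URLs and queues pending Requests.
type Scheduler struct {
	seen  map[string]bool
	queue []Request
}

func NewScheduler() *Scheduler { return &Scheduler{seen: make(map[string]bool)} }

// Push enqueues r unless its URL is empty (filtered) or already seen
// (deduplicated). It reports whether the Request was accepted.
func (s *Scheduler) Push(r Request) bool {
	if r.URL == "" || s.seen[r.URL] {
		return false
	}
	s.seen[r.URL] = true
	s.queue = append(s.queue, r)
	return true
}

// Pop returns the next pending Request in FIFO order.
func (s *Scheduler) Pop() (Request, bool) {
	if len(s.queue) == 0 {
		return Request{}, false
	}
	r := s.queue[0]
	s.queue = s.queue[1:]
	return r, true
}

// Spider "fetches" and parses a page. Pages is a stub mapping each URL to
// its outgoing links; a real spider would issue an HTTP request instead.
type Spider struct{ Pages map[string][]string }

// Crawl returns the Items parsed from the page and the Requests it links to.
func (sp Spider) Crawl(r Request) ([]Item, []Request) {
	links, ok := sp.Pages[r.URL]
	if !ok {
		return nil, nil
	}
	items := []Item{{URL: r.URL, Title: "page " + r.URL}}
	next := make([]Request, 0, len(links))
	for _, l := range links {
		next = append(next, Request{URL: l})
	}
	return items, next
}

// Pipeline filters Items and "persists" them in memory.
type Pipeline struct{ Stored []Item }

func (p *Pipeline) Process(it Item) {
	if it.Title == "" { // drop incomplete results
		return
	}
	p.Stored = append(p.Stored, it)
}

// Run plays the Engine role: it moves Requests and Items between modules
// until the Scheduler has nothing left to dispatch.
func Run(seed string, sp Spider, p *Pipeline) {
	s := NewScheduler()
	s.Push(Request{URL: seed})
	for {
		r, ok := s.Pop()
		if !ok {
			return
		}
		items, next := sp.Crawl(r)
		for _, it := range items {
			p.Process(it)
		}
		for _, n := range next {
			s.Push(n)
		}
	}
}

func main() {
	sp := Spider{Pages: map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "c"},
		"c": {},
	}}
	p := &Pipeline{}
	Run("a", sp, p)
	for _, it := range p.Stored {
		fmt.Println(it.URL, it.Title)
	}
}
```

Even though pages "a" and "b" link to each other, each URL is crawled once, because the Scheduler rejects anything it has already seen.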

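Distributed version

The distributed implementation follows the article "Scrapy分布式实现" (Scrapy distributed implementation). The article goes on at length, but put simply:

  • Extract the Engine and Scheduler modules from the standalone version and merge them into a single new module; that module is Redis.
  • The Master and Slave processes communicate through a Redis message queue (backed by the Redis List data structure).
  • The Master process produces pending Requests with Redis lpush.
  • The Slave processes fetch Requests with Redis brpop, then crawl and parse the pages.
  • The rpush/blpop pair works just as well. That is all it takes to get a simple distributed crawler.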
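
To make the queue semantics concrete, here is a minimal sketch of the Master/Slave pattern. It does not talk to Redis: `ListQueue` is a hypothetical in-memory stand-in that mimics LPUSH (insert at the head) and BRPOP without a timeout (block, then remove from the tail), so the example runs with the standard library only. `Close` and `RunCluster` are also invented for the example; a real deployment would replace `ListQueue` with a Redis client and run the Master and Slaves as separate processes.

```go
package main

import (
	"fmt"
	"sync"
)

// ListQueue is an in-memory stand-in for a Redis List (an assumption made
// for this sketch; the project itself uses Redis).
type ListQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  []string
	closed bool
}

func NewListQueue() *ListQueue {
	q := &ListQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// LPush inserts v at the head of the list, like Redis LPUSH.
func (q *ListQueue) LPush(v string) {
	q.mu.Lock()
	q.items = append([]string{v}, q.items...)
	q.mu.Unlock()
	q.cond.Signal()
}

// BRPop blocks until an element is available, then removes it from the
// tail, like Redis BRPOP with no timeout. It returns false once the queue
// has been closed and drained.
func (q *ListQueue) BRPop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.closed {
		q.cond.Wait()
	}
	if len(q.items) == 0 {
		return "", false
	}
	last := len(q.items) - 1
	v := q.items[last]
	q.items = q.items[:last]
	return v, true
}

// Close wakes every blocked consumer so the example can terminate.
// Hypothetical helper: Redis lists have no such operation.
func (q *ListQueue) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

// RunCluster starts the given number of Slave goroutines, lets the Master
// push every URL, and returns the URLs the Slaves picked up.
func RunCluster(urls []string, slaves int) []string {
	q := NewListQueue()
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []string
	)
	for i := 0; i < slaves; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				u, ok := q.BRPop() // Slave: wait for the next Request
				if !ok {
					return
				}
				// A real Slave would fetch and parse the page here.
				mu.Lock()
				done = append(done, u)
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		q.LPush(u) // Master: produce pending Requests
	}
	q.Close()
	wg.Wait()
	return done
}

func main() {
	done := RunCluster([]string{"u1", "u2", "u3"}, 2)
	fmt.Println(len(done), "requests crawled")
}
```

Pushing at one end and popping at the other gives first-in, first-out order, which is why lpush/brpop and rpush/blpop are interchangeable: both pairs use opposite ends of the same list.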

This project is maintained long-term. Criticism is welcome, so fire away!

crawler's People

Contributors

yanyanqing

