hhy5277 / distributedcrawler

This project forked from zjucx/distributedcrawler

A distributed crawler: redis for caching, mysql for persistence, RPC for distribution. Can be deployed with docker.

License: MIT License

Go 100.00%

distributedcrawler's Introduction

A distributed crawling system written in Golang

Introduction

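A distributed crawler system developed in Golang. It consists of three main modules: the distributed framework, data management, and the crawler. The directory structure is as follows: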

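├── conf
│   └── app.conf       ------configuration (database settings, etc.); not yet developed
├── model    
│   ├── mongodb.go     ------persistence layer for the crawler; stores URLs and the scraped data
│   └── redismq.go     ------priority queue implemented with redis; the master fetches URLs from mongodb and dispatches them to workers
├── distribute    
│   ├── common.go      ------helper definitions for the distributed system
│   ├── master.go      ------master node of the distributed system; dispatches and schedules tasks
│   └── worker.go      ------worker node of the distributed system; accepts tasks from the master
├── main.go
└── scrawler           ------defines the database models used to interact with the database
    ├── sinaLogin.go   ------simulated-login module; the project implements simulated login for Sina Weibo
    ├── scrawler.go    ------entry point of the crawler module; exposes its interface to the distributed module
    ├── scheduler.go   ------crawler scheduler; preprocesses the URL tasks dispatched by the master
    ├── downloader.go  ------crawler downloader; manages synchronization of multiple download tasks
    ├── spiders.go     ------data extraction; extracts URLs and the target data from responses
    ├── pipeline.go    ------persists URLs and the target data
    ├── request.go     ------wrapper around HTTP requests
    └── utils.go       ------crawler helper utilities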

Requirements

1. Docker(1.1x)   -------deploys the mongodb service
2. Golang(1.6)    -------development language
3. Mongodb        -------persistence store
4. Redis          -------priority queue

Screenshots

(Image: design)

Implementation

Distributed framework

Consists of master and worker nodes: the master node dispatches tasks and the worker nodes execute them.
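
The master/worker split can be illustrated with Go's standard `net/rpc` package. This is only a minimal sketch, not the code in `master.go` or `worker.go`: the names `Worker`, `DoTask`, `CrawlArgs`, `CrawlReply`, `StartWorker` and `Dispatch` are hypothetical, the round-robin policy is an assumption, and the "crawl" is a stub that echoes the URL back.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/rpc"
)

// CrawlArgs and CrawlReply are hypothetical message types for this sketch.
type CrawlArgs struct {
	URL string
}

type CrawlReply struct {
	Worker string
	URL    string
}

// Worker is a hypothetical RPC service that a worker node would expose.
type Worker struct {
	Name string
}

// DoTask stands in for real crawling: it only reports which worker got the URL.
func (w *Worker) DoTask(args CrawlArgs, reply *CrawlReply) error {
	reply.Worker = w.Name
	reply.URL = args.URL
	return nil
}

// StartWorker serves a Worker on a free local port and returns its address.
func StartWorker(name string) (string, error) {
	srv := rpc.NewServer()
	if err := srv.Register(&Worker{Name: name}); err != nil {
		return "", err
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go srv.Accept(ln) // serve connections in the background
	return ln.Addr().String(), nil
}

// Dispatch plays the master role: it hands URLs to workers round-robin.
func Dispatch(workers []string, urls []string) ([]CrawlReply, error) {
	if len(workers) == 0 {
		return nil, errors.New("no workers registered")
	}
	replies := make([]CrawlReply, 0, len(urls))
	for i, u := range urls {
		client, err := rpc.Dial("tcp", workers[i%len(workers)])
		if err != nil {
			return nil, err
		}
		var reply CrawlReply
		err = client.Call("Worker.DoTask", CrawlArgs{URL: u}, &reply)
		client.Close()
		if err != nil {
			return nil, err
		}
		replies = append(replies, reply)
	}
	return replies, nil
}

func main() {
	w1, err := StartWorker("worker-1")
	if err != nil {
		panic(err)
	}
	w2, err := StartWorker("worker-2")
	if err != nil {
		panic(err)
	}
	urls := []string{"http://example.com/a", "http://example.com/b", "http://example.com/c"}
	replies, err := Dispatch([]string{w1, w2}, urls)
	if err != nil {
		panic(err)
	}
	for _, r := range replies {
		fmt.Printf("%s handled %s\n", r.Worker, r.URL)
	}
}
```

A real master would more likely keep one long-lived client per worker and track failed tasks; dialing per task just keeps the sketch short.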

Data management

Split into persistent storage (mongodb) and the in-memory database redis (which implements the priority queue).
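
To make the priority-queue idea concrete, here is an in-memory stand-in built on Go's `container/heap`. It is a sketch under assumptions: `URLQueue`, `Push` and `Pop` are hypothetical names, and "lower number means more urgent" is an arbitrary convention. With redis, a sorted set (`ZADD` to enqueue, `ZPOPMIN` to dequeue) could play the same role, but the README does not say how `redismq.go` actually does it.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item pairs a URL with its priority (lower value = popped first).
type item struct {
	url      string
	priority int
}

// urlHeap implements heap.Interface as a min-heap on priority.
type urlHeap []item

func (h urlHeap) Len() int            { return len(h) }
func (h urlHeap) Less(i, j int) bool  { return h[i].priority < h[j].priority }
func (h urlHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *urlHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *urlHeap) Pop() interface{} {
	old := *h
	n := len(old)
	it := old[n-1]
	*h = old[:n-1]
	return it
}

// URLQueue is a hypothetical in-memory stand-in for the redis-backed queue.
type URLQueue struct {
	h urlHeap
}

// Push enqueues a URL with the given priority.
func (q *URLQueue) Push(url string, priority int) {
	heap.Push(&q.h, item{url: url, priority: priority})
}

// Pop returns the most urgent URL, or false if the queue is empty.
func (q *URLQueue) Pop() (string, bool) {
	if q.h.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.h).(item).url, true
}

func main() {
	q := &URLQueue{}
	q.Push("http://example.com/low", 3)
	q.Push("http://example.com/high", 1)
	q.Push("http://example.com/mid", 2)
	for {
		u, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println(u) // URLs come out in order: high, mid, low
	}
}
```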

Crawler

Consists of a simulated-login part, which obtains the cookie, and a data-crawling part.
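
The login-then-crawl flow can be sketched with the standard `net/http/cookiejar` package: log in once, let the jar store the cookie, and reuse the same client for later requests. Everything site-specific here is an assumption made for illustration: `NewFakeSite` is a local test server, not Sina Weibo, and the `/login` and `/data` paths, the form fields and the `session` cookie are hypothetical. The real `sinaLogin.go` flow is certainly more involved.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
)

// NewFakeSite returns a local test server standing in for a real site:
// POST /login sets a session cookie, GET /data requires it.
func NewFakeSite() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "demo-token", Path: "/"})
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/data", func(w http.ResponseWriter, r *http.Request) {
		c, err := r.Cookie("session")
		if err != nil || c.Value != "demo-token" {
			http.Error(w, "login required", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "protected page")
	})
	return httptest.NewServer(mux)
}

// LoginAndFetch logs in once, then fetches a page with the same client so
// that the cookie jar replays the session cookie automatically.
func LoginAndFetch(baseURL string) (string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return "", err
	}
	client := &http.Client{Jar: jar}

	// Step 1: simulated login; the jar stores the Set-Cookie response header.
	form := url.Values{"user": {"demo"}, "pass": {"demo"}}
	resp, err := client.PostForm(baseURL+"/login", form)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	// Step 2: crawl a page that is only served to logged-in clients.
	resp, err = client.Get(baseURL + "/data")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	site := NewFakeSite()
	defer site.Close()
	body, err := LoginAndFetch(site.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Note that `io.ReadAll` needs Go 1.16 or newer; on the Go 1.6 toolchain listed under Requirements, `ioutil.ReadAll` would be the equivalent call.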

Usage

<!--  Prepare redis server and containers for worker  -->
git clone https://github.com/zjucx/DistributedCrawler.git
cd DistributedCrawler
go get (a proxy is needed)
// for master
go run main.go master masterip:port
// for workers
go run main.go worker masterip:port workerip:port
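
The two command forms above imply a small amount of argument handling. The following is a hedged sketch of how such arguments could be parsed, not the project's actual `main.go`; `Config` and `ParseArgs` are hypothetical names and the addresses are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical holder for the role and addresses on the command line.
type Config struct {
	Role       string
	MasterAddr string
	WorkerAddr string // only set for workers
}

// ParseArgs interprets "master masterip:port" or
// "worker masterip:port workerip:port" (the arguments after the program name).
func ParseArgs(args []string) (Config, error) {
	if len(args) < 2 {
		return Config{}, errors.New("usage: master masterip:port | worker masterip:port workerip:port")
	}
	switch args[0] {
	case "master":
		return Config{Role: "master", MasterAddr: args[1]}, nil
	case "worker":
		if len(args) < 3 {
			return Config{}, errors.New("worker also needs its own workerip:port")
		}
		return Config{Role: "worker", MasterAddr: args[1], WorkerAddr: args[2]}, nil
	default:
		return Config{}, fmt.Errorf("unknown role %q", args[0])
	}
}

func main() {
	// In a real program the slice would be os.Args[1:]; fixed samples are used here.
	samples := [][]string{
		{"master", "127.0.0.1:7000"},
		{"worker", "127.0.0.1:7000", "127.0.0.1:7001"},
	}
	for _, s := range samples {
		cfg, err := ParseArgs(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", cfg)
	}
}
```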

To Do List

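1) A [web interface]() for the crawler system
2) Log management, plus maintainability and testability features
3) Distributed configuration management using Zookeeper
4) Standalone (single-machine) operation of the crawler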
