
jiapengcs / doubanrobot


Simple distributed crawler for Douban User Information.

Home Page: http://jiapengcs.com/2016/02/23/simple-distributed-crawler.html

Python 100.00%


DoubanRobot


Dependencies:

  • BeautifulSoup4: $ pip install BeautifulSoup4

  • lxml: $ pip install lxml

  • requests: $ pip install requests

  • pillow: $ pip install pillow

Usage:

Settings to configure:

1. login.py needs a Douban account to log in. Set form_email and form_password:

self.payload = {
    'form_email': '[email protected]',
    'form_password': 'password',
    'remember': 'on'
}

2. In manager.py, set the initial task ID, the port used to communicate with the workers, and the crawl delay.

INIT_ID = '130949863'
PORT = 5000
DELAY_TIME = 5

3. In worker.py, set the address of the host running the manager, the communication port, and the crawl delay.

SERVER_ADDR = '127.0.0.1'
PORT = 5000
DELAY_TIME = 5
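
The README does not show how manager.py and worker.py talk over PORT. As a minimal sketch, assuming the two sides share task queues through Python's standard multiprocessing.managers module (an assumption, not confirmed by the source), the wiring could look like this. AUTHKEY, get_todo, get_done, make_server, and connect are hypothetical names chosen for illustration.

```python
import queue
from multiprocessing.managers import BaseManager

INIT_ID = '130949863'
SERVER_ADDR = '127.0.0.1'
PORT = 5000
AUTHKEY = b'change-me'  # hypothetical shared secret, not from the README

# Queues owned by the manager process.
todo_queue = queue.Queue()  # IDs waiting to be crawled
done_queue = queue.Queue()  # IDs reported back by workers


def _get_todo():
    return todo_queue


def _get_done():
    return done_queue


class QueueManager(BaseManager):
    """Shares the two queues with remote processes."""


# Plain functions rather than lambdas, so the registry can be pickled if needed.
QueueManager.register('get_todo', callable=_get_todo)
QueueManager.register('get_done', callable=_get_done)


def make_server(host='', port=PORT):
    """Manager side: build a server; call .serve_forever() on the result."""
    manager = QueueManager(address=(host, port), authkey=AUTHKEY)
    return manager.get_server()


def connect(host=SERVER_ADDR, port=PORT):
    """Worker side: return proxies for the manager's todo and done queues."""
    manager = QueueManager(address=(host, port), authkey=AUTHKEY)
    manager.connect()
    return manager.get_todo(), manager.get_done()
```

Under these assumptions, the manager would call todo_queue.put(INIT_ID) and then make_server().serve_forever(), while each worker would call connect() and loop on todo.get(), pushing finished IDs with done.put().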

Running:

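Run manager.py on one host as the control node, and run worker.py on any number of other hosts as crawler nodes. You can also run one manager process and several worker processes on the same machine. User information, completed IDs, pending IDs, and the headers and cookies are saved in the current directory in info.txt, done.txt, todo.txt, and session.txt respectively.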
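
The README names the state files but not their layout. The sketch below assumes done.txt and todo.txt hold one user ID per line (an assumption) and shows how a restarted manager could work out which IDs are still pending. load_ids, append_line, and pending_ids are hypothetical helpers, not functions from the project.

```python
from pathlib import Path


def load_ids(path):
    """Return the IDs stored one per line in path, or [] if the file is missing."""
    p = Path(path)
    if not p.exists():
        return []
    lines = p.read_text(encoding='utf-8').splitlines()
    return [line.strip() for line in lines if line.strip()]


def append_line(path, text):
    """Append one record to a state file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')


def pending_ids(workdir, init_id):
    """IDs still to crawl: todo.txt minus done.txt, in order, without duplicates.

    Falls back to init_id when todo.txt is missing or empty (first run).
    """
    workdir = Path(workdir)
    done = set(load_ids(workdir / 'done.txt'))
    todo = load_ids(workdir / 'todo.txt') or [init_id]
    seen = set()
    result = []
    for uid in todo:
        if uid not in done and uid not in seen:
            seen.add(uid)
            result.append(uid)
    return result
```

Appending to the files after every crawled user, rather than rewriting them, keeps the state recoverable if a process is killed partway through.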

Note: keep the crawl delay under control. Crawling too fast returns 403 Forbidden or 302 Temporarily Moved errors, and can even get your IP banned.
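
One way to respect the delay is to pause before every request and wait longer whenever the site answers with one of those statuses. This is a sketch under stated assumptions, not the project's code: polite_fetch, BLOCKED_STATUSES, max_retries, and the fetch callable (which is assumed to return a (status, body) pair) are hypothetical.

```python
import time

DELAY_TIME = 5
BLOCKED_STATUSES = {302, 403}  # statuses the README associates with crawling too fast


def polite_fetch(fetch, url, delay=DELAY_TIME, max_retries=3, sleep=time.sleep):
    """Call fetch(url) -> (status, body), sleeping before every attempt.

    The wait doubles after each blocked response. Raises RuntimeError
    if the URL is still blocked after max_retries retries.
    """
    wait = delay
    for _ in range(max_retries + 1):
        sleep(wait)
        status, body = fetch(url)
        if status not in BLOCKED_STATUSES:
            return status, body
        wait *= 2  # back off before trying again
    raise RuntimeError('still blocked after %d retries: %s' % (max_retries, url))
```

With requests, fetch could wrap requests.get(url, allow_redirects=False) so that a 302 is reported instead of being followed silently.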
