AutoProxyMiddleware

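Introduction

An automatic proxy middleware for Scrapy spiders. It fetches and switches proxies automatically, and the fetching and switching rules are customizable.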

Usage

Place the middleware module in your project and add the middleware to the project's settings file. For example:

DOWNLOADER_MIDDLEWARES = {
    'projectname.autoproxy.AutoProxyMiddleware': 543,
}

Configuration

Configure the proxy middleware through the AUTO_PROXY setting in the project's settings file. For example:

AUTO_PROXY = {
    'test_urls': [('http://upaiyun.com', 'online'), ('http://huaban.com', '33010602001878')],
    'ban_code': [500, 502, 503, 504],
}
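
Each `test_urls` entry pairs a URL with a signature string that should appear in the page body when the proxy works. The sketch below illustrates that idea only; it is not the middleware's actual code. The names `fetch_via_proxy` and `proxy_passes` are hypothetical, and requiring every test URL to pass (rather than any one of them) is an assumption.

```python
import http.client
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch `url` through an HTTP proxy and return the decoded body."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def proxy_passes(proxy, test_urls, fetch=fetch_via_proxy, timeout=5):
    """Return True only if every test URL's body contains its signature.

    Assumption: all test URLs must pass; the real middleware may differ.
    """
    for url, signature in test_urls:
        try:
            body = fetch(url, proxy, timeout)
        except (OSError, http.client.HTTPException):
            # URLError and socket timeouts are OSError subclasses.
            return False
        if signature not in body:
            return False
    return True
```

The `fetch` parameter is there so the check can be exercised without network access, for instance by passing a function that returns canned page bodies.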

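All available options

  • 'enable': A boolean; whether the middleware is enabled. Default: True
  • 'test_urls': A list of 2-tuples, each a URL plus a signature (a specific value that can be found in the returned page content), used to test proxy connections. Default: [('http://www.w3school.com.cn', '06004630'), ]
  • 'test_proxy_timeout': An integer greater than 0; the connection timeout used when testing a proxy. Default: 5
  • 'download_timeout': An integer greater than 0; same as Scrapy's download_timeout, and applied when the middleware is enabled. Default: 60
  • 'test_threadnums': An integer greater than 0; the number of threads started to test proxies. Default: 20
  • 'ban_code': A list of HTTP status codes that indicate the proxy has been banned. When a response's status code is in this list, the proxy is switched automatically. Default: [503,]
  • 'ban_re': A regular expression string. If the page returned through a banned proxy contains content matching the expression, the proxy is switched. If empty, this check is disabled. Default: r''
  • 'proxy_least': An integer greater than 0; if the number of usable proxies in the pool drops below it, new proxies are fetched automatically. Default: 3
  • 'init_valid_proxys': An integer greater than 0; the number of valid proxies to wait for when the spider initializes. A large value makes initialization slower; saved proxies can also be tested while the spider is running. Default: 1
  • 'invalid_limit': An integer greater than 0. Each proxy's successful page downloads are counted. If a proxy suddenly cannot connect or is rejected by the site, it is invalidated. However, if the number of pages it has crawled exceeds this value, it is not invalidated for now: the middleware switches to another proxy and reduces this proxy's page count. Default: 200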
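
The documented defaults and the two ban rules can be summarized in code. This is a minimal sketch under stated assumptions, not the middleware's implementation: the dictionary values are copied from the list above, while `AUTO_PROXY_DEFAULTS`, `effective_options`, and `looks_banned` are hypothetical names introduced only for illustration.

```python
import re

# Defaults as listed in the option descriptions above.
AUTO_PROXY_DEFAULTS = {
    "enable": True,
    "test_urls": [("http://www.w3school.com.cn", "06004630")],
    "test_proxy_timeout": 5,
    "download_timeout": 60,
    "test_threadnums": 20,
    "ban_code": [503],
    "ban_re": r"",
    "proxy_least": 3,
    "init_valid_proxys": 1,
    "invalid_limit": 200,
}


def effective_options(user_options):
    """Overlay a project's AUTO_PROXY dict on the documented defaults."""
    return {**AUTO_PROXY_DEFAULTS, **user_options}


def looks_banned(status, body, options):
    """Return True if the status is in ban_code or a non-empty ban_re matches."""
    if status in options["ban_code"]:
        return True
    pattern = options["ban_re"]
    # An empty ban_re disables the content check.
    return bool(pattern) and re.search(pattern, body) is not None
```

For example, `effective_options({'ban_code': [500, 502, 503, 504]})` keeps every other option at its default, and `looks_banned(502, '', ...)` with those options reports a ban, which is the case where the proxy would be switched.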

Contributors

cocoakekeyu, hisen630
