
derekhe / proxypool


A high-quality free proxy pool: 10,000+ proxy resources per day, refreshed on a rolling basis

Dockerfile 0.39% Makefile 4.57% Python 95.04%
proxypool proxy freeproxylist

proxypool's Introduction

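Proxy Pool

Notice

This repository is for study and research only. Do not use it for illegal purposes; the author accepts no responsibility for any legal issues that result from its use.

Introduction

In my book 《爬虫实战:从数据到产品》, I described a proxy pool built on ProxyBroker. After using it in practice for a long time, I have found this proxy pool very convenient and stable.

This repo builds on ProxyBroker and adds proxy sources for the ** region. It also introduces docker-compose, so that proxy crawling can be started quickly and easily.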

Usage

docker-compose up

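Then open http://localhost:8080/proxy.json in a browser to get the proxy list. Every proxy has had its type verified, and the number of available proxies grows over time. Each proxy is valid for one day.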
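The list can also be read from a script instead of a browser. The following is a minimal sketch using only the Python standard library; it assumes the service is already running on localhost:8080, and because the structure of proxy.json is not described here, it only decodes the JSON and makes no assumption about its fields. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# URL taken from the usage instructions above.
PROXY_LIST_URL = "http://localhost:8080/proxy.json"


def decode_proxy_list(raw):
    """Decode the raw response body (bytes) into Python objects."""
    return json.loads(raw.decode("utf-8"))


def fetch_proxy_list(url=PROXY_LIST_URL, timeout=10):
    """Download and decode the proxy list; raises on network or JSON errors."""
    with urlopen(url, timeout=timeout) as response:
        return decode_proxy_list(response.read())


# Usage, once `docker-compose up` is running:
#     proxies = fetch_proxy_list()
```

Since each proxy is only valid for a day, a consumer would probably want to call `fetch_proxy_list()` again periodically rather than cache the result indefinitely.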

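Many proxy sources are hosted on websites that cannot be reached from **, so deploying on a domestic server limits how many resources can be fetched. Deploying on a server abroad is therefore recommended. I recommend DigitalOcean: several of my servers are in the SFO2 region and they are very stable.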

proxypool's People

Contributors

derekhe, hesicong


proxypool's Issues

Not a Python programmer: a question about the code

self.proxies = received

        oldcount = len(self.proxies)
        try:
            received = self.find_proxies(page)
        except Exception as e:
            received = []
            log.error(
                'Error when executing find_proxies.'
                'Domain: %s; Error: %r' % (self.domain, e)
            )
        self.proxies = received

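My understanding is that received = self.find_proxies(page) finds the proxies on the current page, and self.proxies = received then assigns them to the instance's proxies attribute. So if page 1 yields 10 proxies, len(self.proxies) would be 10, and if page 2 then yields 20, len(self.proxies) would become 20 after the assignment. But that is not what actually happens at runtime: self.proxies ends up holding the merged result. So where is the code that merges the arrays?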
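One Python mechanism that would produce exactly this behavior is a property setter: if the class defines `proxies` as a property, the statement `self.proxies = received` calls the setter, and the setter is free to merge rather than overwrite. The sketch below illustrates the mechanism with a hypothetical stand-in class; it is not copied from ProxyBroker, and whether the real class is written this way is an assumption that should be checked against its source.

```python
class Provider:
    """Hypothetical stand-in used to illustrate a merging property setter."""

    def __init__(self):
        # The real storage lives in a private attribute.
        self._proxies = set()

    @property
    def proxies(self):
        return self._proxies

    @proxies.setter
    def proxies(self, new):
        # Runs on every `obj.proxies = ...` assignment.
        # It adds to the existing set instead of replacing it.
        self._proxies.update(new)


provider = Provider()
provider.proxies = [("1.1.1.1", 80), ("2.2.2.2", 8080)]      # "page 1"
provider.proxies = [("2.2.2.2", 8080), ("3.3.3.3", 3128)]    # "page 2"
print(len(provider.proxies))  # prints 3: merged and de-duplicated
```

With a setter like this, a plain-looking assignment accumulates results across pages, which would match the observation that `self.proxies` holds the combined list.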

xicidaili.com returns 503 when accessed too quickly

Could it be given its own dedicated thread and be crawled sequentially?

url: http://www.xicidaili.com/wn/8
headers: <CIMultiDictProxy('Server': 'nginx/1.1.19', 'Date': 'Sat, 09 May 2020 07:55:51 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Status': '503 Service Unavailable', 'Cache-Control': 'no-cache', 'X-Runtime': '0.004219', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': 'b88bfc2f-9c8b-4ac8-84b4-9808f14e8ab7', 'X-Powered-By': 'Phusion Passenger 5.0.28')>
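The request above amounts to serializing and spacing out requests to a single host. The headers object suggests the crawler is built on asyncio, so a dedicated thread may not be needed: a lock plus a minimum interval achieves the same effect. This is only a sketch of that idea. `SequentialThrottle`, `fake_fetch`, and the 0.05 second interval are hypothetical and not part of proxypool or ProxyBroker, and the interval that xicidaili.com actually tolerates is unknown.

```python
import asyncio
import time


class SequentialThrottle:
    """Run requests one at a time, starting at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self._min_interval = min_interval
        self._lock = asyncio.Lock()
        self._last_start = None

    async def run(self, make_request):
        # The lock guarantees that only one request is in flight.
        async with self._lock:
            if self._last_start is not None:
                elapsed = time.monotonic() - self._last_start
                if elapsed < self._min_interval:
                    await asyncio.sleep(self._min_interval - elapsed)
            self._last_start = time.monotonic()
            return await make_request()


async def fake_fetch(url):
    # Placeholder for a real HTTP request; performs no network I/O.
    await asyncio.sleep(0)
    return url


async def main():
    throttle = SequentialThrottle(min_interval=0.05)
    urls = ["http://www.xicidaili.com/wn/%d" % n for n in range(1, 4)]
    started = time.monotonic()
    # All tasks are scheduled together, but the throttle serializes them.
    results = await asyncio.gather(
        *(throttle.run(lambda u=u: fake_fetch(u)) for u in urls)
    )
    return results, time.monotonic() - started


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

Other providers would keep running concurrently; only calls routed through the same throttle instance are slowed down.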


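P.S. I set PROVIDERS to just the five domestic ones, turned on full log output, and ran it from inside the country. The result: absolutely nothing, lol.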

[16:55:18] - DEBUG - proxybroker - 0 proxies received from freeproxylists.net: set()
[16:55:34] - DEBUG - proxybroker - 0 proxies received from xicidaili.com: set()
[17:05:35] - DEBUG - proxybroker - 0 proxies received from kuaidaili.com: set()
[17:05:40] - DEBUG - proxybroker - 0 proxies received from cnproxy.com: set()
[17:06:38] - DEBUG - proxybroker - 0 proxies received from proxy.com.ru: set()

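Created a server in SFO2, but the crawl still stalls partway through

I bought a server in the SFO2 region on DO, but the crawl really does stall while running, with fewer than a few hundred records collected.

Is there some other configuration required?

Sometimes it also hits the following error right at startup: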

aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host myexternalip.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')]
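The message "unable to get local issuer certificate" usually means that Python could not find a trusted CA certificate to verify the server against, which is often a property of the machine or container image rather than of the crawler. A first diagnostic step, sketched here with the standard library only, is to check where this Python build looks for CA certificates. The helper name `describe_ca_locations` is hypothetical, and this only inspects the environment; it does not change how proxypool connects.

```python
import os
import ssl


def describe_ca_locations():
    """Report where this Python build looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "cafile_exists": bool(paths.cafile) and os.path.isfile(paths.cafile),
        "capath": paths.capath,
        "capath_exists": bool(paths.capath) and os.path.isdir(paths.capath),
    }


report = describe_ca_locations()
for key, value in report.items():
    print(key, "=", value)

# A default context verifies both the certificate chain and the hostname.
context = ssl.create_default_context()
# Certificates from a capath directory are loaded lazily, so this count
# can be 0 even on a correctly configured system.
print("certificates loaded so far:", context.cert_store_stats()["x509"])
```

If neither location exists, installing the operating system's CA certificate package in the image would be the first thing to try.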

About using this package

python3 ./server/proxy.py

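Is this a one-shot run? In other words, does it need to be paired with a crontab scheduled job to refresh the list?

When I ran the tool, it crawled for a while and then stopped crawling, but the process did not exit.

One more problem: the records printed during the crawl do not match the records stored in redis.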
