awolfly9 / ipproxytool

Python IP proxy tool built on Scrapy. It crawls a large number of free proxy IPs and extracts the working ones for use.

License: MIT License

Python 99.79% Dockerfile 0.21%
proxy python ipproxy

ipproxytool's People

Contributors

asthman666, awolfly9, dependabot[bot], kunkgg, lxtluck, mrkang, timgates42, xiaxia47, youngjeff

ipproxytool's Issues

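Log file path problem

Hi, I see the following code in the spider, but why do the log files never end up in this directory? They all land directly under log.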

def init(self):
    self.meta = {
        'download_timeout': self.timeout,
    }

    self.dir_log = 'log/proxy/%s' % self.name
    utils.make_dir(self.dir_log)
    self.sql.init_proxy_table(config.free_ipproxy_table)
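
The snippet above only shows the directory being created; the thread does not say what `utils.make_dir` does beyond that or where the project attaches its log handlers. Creating a directory does not by itself redirect any log output, so one possible way to get a per-spider log file is sketched below. The function name `make_spider_logger`, the logger name prefix, and the `<name>.log` file name are all hypothetical and not taken from the project.

```python
import logging
import os


def make_spider_logger(name, base_dir="log/proxy"):
    """Return a logger that writes to <base_dir>/<name>/<name>.log.

    Hypothetical helper for illustration; not part of IPProxyTool.
    """
    log_dir = os.path.join(base_dir, name)
    # Creating the directory is not enough: a handler must point at a file in it.
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, "%s.log" % name)

    logger = logging.getLogger("spider.%s" % name)
    logger.setLevel(logging.INFO)

    # Do not stack a second handler for the same file on repeated calls.
    already_attached = any(
        isinstance(h, logging.FileHandler)
        and h.baseFilename == os.path.abspath(log_path)
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `make_spider_logger(self.name)` inside `init` and logging through the returned object would then write into `log/proxy/<spider name>/`, assuming the process has write access to that path.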

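Error when starting with Docker

First of all, many thanks to the author for the selfless contribution; this tool is really useful. Now to my problem.

I ran into a problem while installing with Docker.
Running docker build -t proxy . completes without any problem.
Running docker run -it proxy fails with the following error:
[image]
My initial analysis is that async def is only supported from Python 3.5 onward, while the docker.io/mrjogo/scrapy image currently used for Docker ships Python 2.7.11.

I think changing the image, or the Python version inside the image, should fix it. However, I am not sure whether Scrapy depends on Python 2. I hope someone can find the time to help resolve this.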
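
If the analysis above is right, the failure is a syntax error raised before any project code runs. A minimal sketch of an interpreter check follows, assuming it lives in a file that contains no `async def` itself and is executed before the async modules are imported. The names `supports_async_def` and `require_async_support` are hypothetical.

```python
import sys

# async def / await syntax was introduced in Python 3.5.
MIN_ASYNC_VERSION = (3, 5)


def supports_async_def(version_info=None):
    """Return True if the given (or current) interpreter can parse async def."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_ASYNC_VERSION


def require_async_support():
    """Exit with a readable message instead of a SyntaxError traceback."""
    if not supports_async_def():
        raise SystemExit(
            "Python %d.%d or newer is required, found %d.%d"
            % (MIN_ASYNC_VERSION + tuple(sys.version_info[:2]))
        )
```

Running `python --version` inside the container is a quicker way to confirm which interpreter the image actually provides.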

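Many duplicate IPs appear in the table after running for a while

I found a problem: once ipproxypool.py is started, the program keeps running indefinitely, which is fine, but a large number of duplicate IPs gradually build up in the table. Is this intentional for some reason, or is it a logic bug?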
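
The thread does not say whether the duplicates are intended. If they are not wanted, one common approach is to let the database reject repeats through a uniqueness constraint on the address. The sketch below uses the standard-library `sqlite3` module as a stand-in so that it runs anywhere; the project itself stores proxies in MySQL, where `INSERT IGNORE` plays a similar role. The column subset and the helper names are assumptions for illustration.

```python
import sqlite3


def create_table(conn):
    """Create a reduced proxy table with one row allowed per (ip, port)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS free_ipproxy ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ip TEXT NOT NULL, "
        "port INTEGER NOT NULL, "
        "UNIQUE (ip, port))"
    )


def insert_proxy(conn, ip, port):
    """Insert a proxy; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO free_ipproxy (ip, port) VALUES (?, ?)",
        (ip, port),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the row to be skipped.
    return cur.rowcount == 1
```

With this shape the crawler can keep running indefinitely and re-submit the same address as often as it likes; only the first insert creates a row.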

No module named ipproxytool.spiders.proxy.xicidaili

After installing all the dependencies, running the script raises an error. I have never used Python before; please help.

➜ IPProxyTool git:(master) ✗ python runspider.py
Traceback (most recent call last):
  File "runspider.py", line 12, in <module>
    from ipproxytool.spiders.proxy.xicidaili import XiCiDaiLiSpider
ImportError: No module named ipproxytool.spiders.proxy.xicidaili

ImportError: No module named pymongo

sudo apt-get install docker.io
...
sudo docker run -it --name=proxy proxy

Traceback (most recent call last):
  File "ipproxytool.py", line 7, in <module>
    import run_validator
  File "/home/run_validator.py", line 11, in <module>
    from ipproxytool.spiders.validator.douban import DoubanSpider
  File "/home/ipproxytool/spiders/validator/douban.py", line 3, in <module>
    from validator import Validator
  File "/home/ipproxytool/spiders/validator/validator.py", line 10, in <module>
    from sql import SqlManager
  File "/home/sql/__init__.py", line 5, in <module>
    from mongodb import Mongodb
  File "/home/sql/mongodb.py", line 4, in <module>
    import pymongo
ImportError: No module named pymongo

httpbin validation fails

r = requests.get(url=self.urls[0], timeout=20)
data = json.loads(r.text)
During httpbin validation, the code above raises the error: the JSON object must be str, not 'bytes'
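
The message indicates that `json.loads` received a `bytes` object, which Python 3.5 and earlier reject (support for `bytes` input arrived in Python 3.6). Since `r.text` from requests is already a `str`, the failing call may be on a different object than the one quoted, for example a raw response body; that is a guess, not something the thread confirms. A small sketch of a tolerant parser, with the hypothetical name `parse_json_body`:

```python
import json


def parse_json_body(body, encoding="utf-8"):
    """Parse a JSON payload given either as text or as raw bytes."""
    if isinstance(body, (bytes, bytearray)):
        # Decode explicitly so json.loads always receives a str.
        body = body.decode(encoding)
    return json.loads(body)
```

For example, `parse_json_body(b'{"origin": "1.2.3.4"}')` and `parse_json_body('{"origin": "1.2.3.4"}')` both yield the same dictionary; the payload here is a made-up sample.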

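Error under Python 3

[image: wx20180331-111423]

On macOS and under Docker, running ipproxytool.py produces the error shown in the screenshot, even though Scrapy is already installed.
[image: wx20180331-111754]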

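About server performance

I see in the source code that the server is started directly with Flask's own run, without using any container. Won't that cause performance problems?

I have not actually hit any problem in use so far. The program I am writing currently issues roughly 40 concurrent requests per second, and I am worried about a performance bottleneck if that grows several times over.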

Availability rate

Has anyone tested this? What is the availability rate like?

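Lagou shows: "You are operating too frequently"

First of all, thank you for your project!
When I open the Lagou test URL in a browser, it shows "You are operating too frequently". Is that because I tested too many proxies that are not high-anonymity, so Lagou flagged my own IP as suspicious?

Thanks!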

Argument error: attr() got an unexpected keyword argument 'type'

Traceback (most recent call last):
  File "//python/IPProxyTool/ipproxytool.py", line 8, in <module>
    import run_validator_async
  File "//python/IPProxyTool/run_validator_async.py", line 8, in <module>
    import aiohttp
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/aiohttp/__init__.py", line 6, in <module>
    from .client import * # noqa
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/aiohttp/client.py", line 18, in <module>
    from . import client_exceptions, client_reqrep
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 17, in <module>
    from . import hdrs, helpers, http, multipart, payload
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/aiohttp/helpers.py", line 166, in <module>
    @attr.s(frozen=True, slots=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/aiohttp/helpers.py", line 168, in ProxyInfo
    proxy = attr.ib(type=str)
TypeError: attr() got an unexpected keyword argument 'type'

I ran into the same problem

Traceback (most recent call last):
  File "ipproxytool.py", line 7, in <module>
    import run_validator
  File "/usr/local/IPProxyTool/IPProxyTool-master/run_validator.py", line 8, in <module>
    import scrapydo
ImportError: No module named scrapydo

Originally posted by @projectmanagerment in #28 (comment)

No module named scrapydo

Traceback (most recent call last):
  File "ipproxytool.py", line 7, in <module>
    import run_validator
  File "/opt/software/IPProxyTool/run_validator.py", line 8, in <module>
    import scrapydo
ImportError: No module named scrapydo

Where can I download this module?

runSpider.py stops making progress after running for a while

2017-02-14 11:29:18 [10], msg:sql helper execute command:CREATE TABLE IF NOT EXISTS free_ipproxy (id INT(8) NOT NULL AUTO_INCREMENT,ip CHAR(25) NOT NULL UNIQUE,port INT(4) NOT NULL,country TEXT DEFAULT NULL,anonymity INT(2) DEFAULT NULL,https CHAR(4) DEFAULT NULL ,speed FLOAT DEFAULT NULL,source CHAR(20) DEFAULT NULL,save_time TIMESTAMP NOT NULL,PRIMARY KEY(id)) ENGINE=InnoDB
2017-02-14 11:29:19 [10], msg:*********run spider waiting...


Error running ipproxytool.py

Running the startup script ipproxytool.py raises an error.
[root@localhost IPProxyTool]# python3 ipproxytool.py
Traceback (most recent call last):
  File "ipproxytool.py", line 8, in <module>
    import run_validator_async
  File "/media/sf_LinuxShare/IPProxyTool/run_validator_async.py", line 8, in <module>
    import aiohttp
ModuleNotFoundError: No module named 'aiohttp'

After I installed aiohttp with pip, running it again raised further errors:
[root@localhost IPProxyTool]# python3 ipproxytool.py
Traceback (most recent call last):
  File "ipproxytool.py", line 30, in <module>
    run_validator.validator()
  File "/media/sf_LinuxShare/IPProxyTool/run_validator.py", line 33, in validator
    module = import_module(path)
  File "/usr/local/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 936, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 936, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 945, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ipproxytool.spiders'; 'ipproxytool' is not a package
Traceback (most recent call last):
  File "run_crawl_proxy.py", line 6, in <module>
    import scrapydo
ImportError: No module named scrapydo
[root@localhost IPProxyTool]# Traceback (most recent call last):
  File "run_server.py", line 8, in <module>
    from server import dataserver
  File "/media/sf_LinuxShare/IPProxyTool/server/dataserver.py", line 9, in <module>
    from sql import SqlManager
  File "/media/sf_LinuxShare/IPProxyTool/sql/__init__.py", line 4, in <module>
    from sql.mysql import MySql
  File "/media/sf_LinuxShare/IPProxyTool/sql/mysql.py", line 6, in <module>
    import pymysql
ImportError: No module named pymysql

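I need to extract proxy IPs for some current work, so I would like to ask whether one of my installation steps was wrong.
I am using CentOS 6.5 and Python 3.6.1.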
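
Several of the tracebacks in these reports come down to a third-party module that the running interpreter cannot find. A quick way to see all of them at once, rather than fixing one ImportError per run, is sketched below. The module list is taken from the names that appear in the tracebacks on this page and may not match the project's full requirements; `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Top-level module names seen in the tracebacks above (not an official list).
REQUIRED_MODULES = ["scrapy", "scrapydo", "aiohttp", "pymysql", "pymongo"]


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

The check must be run with the same interpreter that runs the tool. A package installed with one pip (for example the Python 2 one) is invisible to `python3`, which would explain an error that persists after an apparently successful install.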

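Module import error in the sql package

When running the validator, the logs show that the spiders used to validate proxies (Baidu, Lagou, httpbin, and so on) all hit an import problem. The log is as follows: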

File "run_spider.py", line 33, in <module>
    runspider(name)
  File "run_spider.py", line 19, in runspider
    process = CrawlerProcess(get_project_settings())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 69, in walk_modules
    mods += walk_modules(fullpath)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/opt/IPProxyTool/ipproxytool/spiders/proxy/basespider.py", line 10, in <module>
    from sql import SqlManager
  File "/opt/IPProxyTool/sql/__init__.py", line 3, in <module>
    from sql.sql import Sql

Environment:
Ubuntu 16
python3
Everything in requirement.txt is installed (confirmed installed with pip3)
