GithubHelp home page GithubHelp logo

adslproxy's Introduction

Python3 网络爬虫开发实战

本书介绍了如何利用 Python 3 开发网络爬虫。书中首先详细介绍了环境配置过程和爬虫基础知识;然后讨论了 urllib、requests 等请求库,Beautiful Soup、XPath、pyquery 等解析库以及文本和各类数据库的存储方法;接着通过多个案例介绍了如何进行 Ajax 数据爬取,如何使用 Selenium 和 Splash 进行动态网站爬取;接着介绍了爬虫的一些技巧,比如使用代理爬取和维护动态代理池的方法,ADSL 拨号代理的使用,图形、 极验、点触、宫格等各类验证码的破解方法,模拟登录网站爬取的方法及 Cookies 池的维护。 此外,本书还结合移动互联网的特点探讨了使用 Charles、mitmdump、Appium 等工具实现 App 爬取 的方法,紧接着介绍了 pyspider 框架和 Scrapy 框架的使用,以及分布式爬虫的知识,最后介绍了 Bloom Filter 效率优化、Docker 和 Scrapyd 爬虫部署、Gerapy 爬虫管理等方面的知识。

本书由图灵教育 - 人民邮电出版社出版发行,版权所有,禁止转载。

作者:崔庆才

购买地址:

加读者群:

视频资源:

Python3 爬虫三大案例实战分享

自己动手,丰衣足食!Python3 网络爬虫实战案例

adslproxy's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

adslproxy's Issues

为什么每次在设置代理或者移除代理的时候都要设置一次redis对象

image
这个 set_proxy方法每次调用都会创建一个self.redis = RedisClient()对象,为什么不能在创建class Sender()的时候就初始化这个对象呢?
而且这里很奇怪的一点是我用在创建Sender初始化self.redis后,调用set_proxy时候,即在self.redis.set(CLIENT_NAME, proxy)时候程序会卡死不动,按下Crtl+C后程序才能继续执行,哦,这种情况只在连接外网redis的时候才会出现,在连接局域网redis的时候不会有,我很奇怪,请问您当时这么写是有什么原因吗?

在 sender.py 的 time.sleep(DIAL_CYCLE) 重复了两次。

我在 DIAL_CYCLE 设置的时间是600秒进行一次重新拨号,实际是1200秒才进行重新拨号。在代码中看到的逻辑是 self.run() 里的 self.set_proxy(proxy) 写入完代理后,进行一次time.sleep(DIAL_CYCLE),然后 self.run() 执行完后又 time.sleep(DIAL_CYCLE)了一次。

image

获取代理为空

当前代码IP代理存储用的是hset方法,pip安装的版本获取代理没有用对应的hget,用的是get,pip版本是不是没有更新,导致每次获取都是空。
我的解决办法是根据最新代码的RedisClient.random()方法获取,没有通过pip安装的获取

网络波动引起的程序中断

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 572, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 442, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 104 connecting to <redis_host>:6379. Connection reset by peer.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 439, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 494, in _connect
    raise err
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 482, in _connect
    sock.connect(socket_address)
ConnectionResetError: [Errno 104] Connection reset by peer

与issue#3同理 当redis和vps网络不稳定的时候操作redis集合会导致程序中断

Thread.sleep后没有进行二次循环处理

你好,麻烦请教一下,我将ADSL_CYCLE设置为21600秒后,时间到达了,线程却没有循环执行,是什么原因,平时都在写Java,不知道Python是怎么休眠的
while True:
print('ADSL Start, Remove Proxy, Please wait')
self.remove_proxy()
(status, output) = subprocess.getstatusoutput(ADSL_BASH)
if status == 0:
print('ADSL Successfully')
ip = self.get_ip()
if ip:
print('Now IP', ip)
print('Testing Proxy, Please Wait')
proxy = '{ip}:{port}'.format(ip=ip, port=PROXY_PORT)
if self.test_proxy(proxy):
print('Valid Proxy')
self.set_proxy(proxy)
print('Sleeping')
time.sleep(ADSL_CYCLE)
else:
print('Invalid Proxy')
else:
print('Get IP Failed, Re Dialing')
time.sleep(ADSL_ERROR_CYCLE)
else:
print('ADSL Failed, Please Check')
time.sleep(ADSL_ERROR_CYCLE)

在sender.py 中如果执行 DIAL_BASH 命令提前返回,会造成无限重复执行拨号。

在 sender.py 中的 run() 里面如果 subprocess.getstatusoutput(DIAL_BASH) 执行提前结束,就会陷入无限拨号的问题。比如 subprocess.getstatusoutput(DIAL_BASH) 执行完了,拨号还没成功,还没获取到IP地址,需要等个10秒才获取到IP地址。此时提前进入 self.extract_ip() 获取IP,就会获取不到,ip 这个变量为空,进入 else 语句,马上重新执行 self.run() 重新拨号。。

建议或者在 self.extract_ip() 里面加入一个 while 等待时间 ,比如每 while 60秒,每检查一次就减1,然后sleep 1秒。有个60秒等待时间,再返回IP,或许会比较好。比如这样:

def extract_ip(self):
        """
        获取本机IP
        :param ifname: 网卡名称
        :return:
        """
        WAIT_TIME = 60
        while WAIT_TIME:
            (status, output) = subprocess.getstatusoutput('ifconfig')
            if not status == 0: return
            pattern = re.compile(DIAL_IFNAME + '.*?inet.*?(\d+\.\d+\.\d+\.\d+).*?netmask', re.S)
            result = re.search(pattern, output)
            if result:
                # 返回拨号后的 IP 地址
                return result.group(1)
            WAIT_TIME -= 1
            time.sleep(1)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.