leo8916 / wxhub Goto Github PK

View Code? Open in Web Editor NEW

160.0 10.0 51.0 5.56 MB

微信公众号-文章-无限制抓取

Python 100.00%

wxhub's Introduction

公众号文章抓取工具

使用公众号文章编辑链接的方案, 突破搜狗方案10条的限制~~~ ;-)

2018.12

新增公众号内, 百度网盘链接和密码的抓取. (指定method为baidu_pan_links)
新增全部html页面抓取方法 -method whole_page
添加todo.list 与 mask 变量

todo.list 文件记录了公众号下所有文章的链接数据, 因为高频次调用文章搜索/翻页接口会导致被ban.
所以目前的方案是使用mask记录所有索引处理记录, 保证了不会翻页相同位置, 提高了获取新增链接的几率.

2019.01

添加-pl参数, 用来限制每次公众号翻页数目, 每次翻页过多会被ban.建议10以内.
- N = 0: 不进行翻页, 只讲之前的url重新处理(todo.list)
- N < 0: 不限制翻页(默认), 翻到底或者出错时停止.
- N > 0: 翻页N次.

准备

首先你需要有一个微信公众号, 注册很简单
python 3.6
下载ChromeDriver 在第一次登陆时, 需要使用其手动登录.
安装依赖

pip install -r requirements.txt

结构

wxhub/
├── README.md
├── arti.cache.list		(使用后生成)	
├── chromedriver			(默认macOS版本, windows可另行下载 重命名即可)
├── cookies.json			(使用后生成)
├── gongzhonghao.py		(使用后生成)
├── output				(使用后生成)
├── requirements.txt	
├── url.cache.list		(使用后生成)
└── wxhub.py

使用

(py3) isyuu:wxhub isyuu$ python wxhub.py -h
usage: wxhub.py [-h] -biz BIZ [-chrome CHROME] [-arti ARTI] [-method METHOD]
                [-sleep SLEEP] [-pipe PIPE] [-pl PAGE_LIMIT]

公众号文章全搞定

optional arguments:
  -h, --help      show this help message and exit
  -biz BIZ        必填:公众号名字
  -chrome CHROME  可选:web chrome 路径, 默认使用脚本同级目录下的chromedriver
  -arti ARTI      可选:文章名字, 默认处理全部文章
  -method METHOD  可选, 处理方法: all_images, baidu_pan_links, whole_page
  -sleep SLEEP    翻页休眠时间, 默认为1即 1秒每页.
  -pipe PIPE      在method指定为pipe时, 该参数指定pipe处理流程. 例如:"pipe_example,
                  pipe_example1, pipe_example2, pipe_example3"
  -pl PAGE_LIMIT  指定最大翻页次数, 每次同一个公众号, 翻页太多次会被ban, 0:不翻页 只处理todo.list, 默认<0:无限制
                  >0:翻页次数

现有缓存功能, 目前缓存在如下文件中.

用户cookies
已经爬取的文章链接. --> arti.cache.list
已经下载的链接. --> url.cache.list

需要全部重新下载时, 删除对应文件即可.

已知问题

在某些情况下, cookies里的session过期后, 会导致"获取页面失败!"的错误.(此时参数cookies.json文件即可)
提示"搜索过于频繁"问题, 这可能是又有微信对搜索接口存在反爬机制; 目前解决的方案是:删除cookies.json, 换账号登录, 或者等几个小时即可.(未来准备尝试先缓存所有链接再逐条爬取的方式...)

wxhub's People

Contributors

Stargazers

Watchers

wxhub's Issues

“指定最大翻页次数, 每次同一个公众号, 翻页太多次会被ban”,请问被ban是封号吗

发现一个bug，图片无论什么格式都成png后缀的格式了，动图gif也是

能不能根据链接的格式保存

Great tool.

I will try to capture 微信浏览器 request header, as I found Chrome doesn't show Comments and Likes, but Wechat/PC client can render it.

failed on 获取网页失败

driver.get(Urls.index)
url = driver.current_url
if 'token' not in url:
    raise Exception(f"获取网页失败!")
Session.token = re.findall(r'token=(\w+)', url)[0]
process_input()
pipe()

Time out period for frequency control (Error code: 200013)?

What's the time out period for searching for public article links per personal public WeChat account?

My search returns zero link after scraping for a few hours, just wondering if there's known time window for safely restarting the scraping?

Output message:
choose index or next page(n):0
调用搜索, 报错:200013 freq control
本次搜索到0条文章, 新增0, 共在 todo.list 中包含 0 条文章链接 ...
本次共处理了 0 条文章链接!

能否增加获取微信步数的功能？

如题，自己的创造的数据不能被自己使用实在太悲哀了。
如何能够把自己每天的微信步数下载下来？历史数据，或者从现在开始每天记录到本地表格中。

0.pipe continue后面应该跟什么..
1.也没有办法可以删除公众号里（重复）的广告文章呢
2.能不能改用excel格式仅仅收集链接，以 “文章标题”+“发布日期”+“文章链接”的形式呢，以文件夹的形式好像还是不太好查看，添加超链接的工作量也挺大
3.调用搜索, 报错:200013 freq control的意思是不是被ban了。。。我要爬的文章有100页，可以指点一下怎么操作吗
4.在连续下载到40多页开始出现了“文件名、目录名或卷标语法不正确。”和“Exception: 保存网页失败: 该内容已被发布者删除”的错误，但是更换成别的公众号仍可以使用，这是什么原因呢

certificate verify failed

DevTools listening on ws://127.0.0.1:59600/devtools/browser/21f2c0e2-34e1-428f-8c11-7113ff243af1
请先手动登录, 完成后按回车继续:
url is https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=2099603568
Traceback (most recent call last):
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
cnx.do_handshake()
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL\SSL.py", line 1639, in _raise_ssl_error
_raise_current_error()
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')]

clerical error

https://github.com/leo8916/wxhub/blob/master/wxhub.py#L369
Input.artis_cache = {}
should be
Input.arti_cache = {}

似乎不能用了？

上一次成功使用还是半个多月前，今天再用发现不能返回页面和链接了，是不是被封了？
错误信息如下：

请先手动登录, 完成后按回车继续:
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 453, in wrap_socket
cnx.do_handshake()
File "/Users/user/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1915, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/Users/user/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1639, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (54, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in validate_conn
conn.connect()
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 344, in connect
ssl_context=context)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/util/ssl.py", line 344, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 459, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(54, 'ECONNRESET')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mp.weixin.qq.com', port=443): Max retries exceeded with url: /cgi-bin/searchbiz?action=search_biz&token=304200544&lang=zh_CN&f=json&ajax=1&random=0.843229694344508&query=gqtzy2014&begin=0&count=5 (Caused by SSLError(SSLError("bad handshake: SysCallError(54, 'ECONNRESET')",),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "wxhub.py", line 468, in
main(Input.args.chrome)
File "wxhub.py", line 442, in main
pipe()
File "wxhub.py", line 358, in pipe
fake_info = pipe_fakes(Input.fake_name)
File "wxhub.py", line 133, in pipe_fakes
rep = requests.get(Urls.query_biz.format(random=random.random(), token=Session.token, query=fake_name, begin=begin, count=count), cookies=Session.cookies, headers=Session.headers)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='mp.weixin.qq.com', port=443): Max retries exceeded with url: /cgi-bin/searchbiz?action=search_biz&token=304200544&lang=zh_CN&f=json&ajax=1&random=0.843229694344508&query=gqtzy2014&begin=0&count=5 (Caused by SSLError(SSLError("bad handshake: SysCallError(54, 'ECONNRESET')",),))

leo8916 / wxhub Goto Github PK

wxhub's Introduction

公众号文章抓取工具

2018.12

2019.01

准备

结构

使用

已知问题

wxhub's People

Contributors

Stargazers

Watchers

Forkers

wxhub's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs