
wxhub's Introduction

A scraper for WeChat Official Account (公众号) articles

Works through the article-link search built into the Official Account editor, bypassing the 10-article cap of the Sogou-based approach. ;-)

2018.12

  • Added extraction of Baidu Netdisk (百度网盘) links and passwords from articles (set method to baidu_pan_links).
  • Added a whole-HTML-page scraping method: -method whole_page
  • Added todo.list and a mask variable.
The todo.list file records the link data for every article under an account. Because high-frequency calls to the article search/pagination API get you banned, the current approach uses mask to record which indexes have already been processed, guaranteeing the same page position is never fetched twice and improving the odds of picking up newly added links.
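
A minimal sketch of how such a mask can work (todo.list and mask are the README's names; MASK_FILE and these helpers are hypothetical illustrations, not wxhub's actual code):

import json
import os

MASK_FILE = 'mask.json'   # hypothetical persistence file; wxhub's real storage may differ

def load_mask():
    """Load the set of page indexes that have already been processed."""
    if os.path.exists(MASK_FILE):
        with open(MASK_FILE) as f:
            return set(json.load(f))
    return set()

def mark_processed(mask, index):
    """Record an index so the same page position is never requested twice."""
    mask.add(index)
    with open(MASK_FILE, 'w') as f:
        json.dump(sorted(mask), f)

def next_unseen_index(mask, total_pages):
    """Pick the first page index the mask has not covered yet."""
    for i in range(total_pages):
        if i not in mask:
            return i
    return None   # everything has been paged at least once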

2019.01

  • Added the -pl argument to cap how many pages are fetched per run; turning too many pages in one go gets you banned, so 10 or fewer is recommended. (A sketch of the implied loop follows this list.)
    • N = 0: no paging; only reprocess the previously collected URLs (todo.list).
    • N < 0: unlimited paging (the default); stops at the last page or on error.
    • N > 0: fetch N pages.
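
A sketch of the paging loop these three cases imply (fetch_page is a hypothetical stand-in for the real search/pagination call):

import time

def fetch_page(page):
    """Hypothetical stand-in for one call to the article search/pagination API."""
    return []   # pretend there are no more articles

def crawl(page_limit, sleep=1):
    """Paginate according to the -pl semantics: 0 = none, <0 = unlimited, >0 = N pages."""
    page = 0
    while True:
        if page_limit == 0:
            break                        # N = 0: skip paging, only reprocess todo.list
        if page_limit > 0 and page >= page_limit:
            break                        # N > 0: stop after N page turns
        links = fetch_page(page)
        if not links:
            break                        # N < 0: run until the last page or an error
        page += 1
        time.sleep(sleep)                # throttle to lower the ban risk (-sleep)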

Setup

pip install -r requirements.txt

Layout

wxhub/
├── README.md
├── arti.cache.list		(generated at runtime)
├── chromedriver			(macOS build by default; on Windows, download your own and rename it)
├── cookies.json			(generated at runtime)
├── gongzhonghao.py		(generated at runtime)
├── output				(generated at runtime)
├── requirements.txt
├── url.cache.list		(generated at runtime)
└── wxhub.py

Usage

(py3) isyuu:wxhub isyuu$ python wxhub.py -h
usage: wxhub.py [-h] -biz BIZ [-chrome CHROME] [-arti ARTI] [-method METHOD]
                [-sleep SLEEP] [-pipe PIPE] [-pl PAGE_LIMIT]

One-stop Official Account article processing

optional arguments:
  -h, --help      show this help message and exit
  -biz BIZ        required: the Official Account name
  -chrome CHROME  optional: chromedriver path; defaults to the chromedriver next to the script
  -arti ARTI      optional: an article title; all articles are processed by default
  -method METHOD  optional, processing method: all_images, baidu_pan_links, whole_page
  -sleep SLEEP    sleep between page turns, default 1, i.e. 1 second per page
  -pipe PIPE      when method is pipe, names the pipe processing chain, e.g. "pipe_example,
                  pipe_example1, pipe_example2, pipe_example3"
  -pl PAGE_LIMIT  maximum number of page turns; turning too many pages on one account gets you
                  banned. 0: no paging, only process todo.list; <0 (default): unlimited;
                  >0: turn that many pages
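
For example, a conservative run that saves whole pages might look like this (the account name is a placeholder):

python wxhub.py -biz 某公众号 -method whole_page -pl 5 -sleep 2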

The tool caches its state; the caches currently live in the following files:

  • user cookies --> cookies.json
  • article links already scraped --> arti.cache.list
  • links already downloaded --> url.cache.list

To re-download everything, simply delete the corresponding file.
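
A rough illustration of how such line-per-entry caches can skip finished work (this assumes one URL per line in url.cache.list; the helpers are illustrative, not wxhub's actual code):

import os

CACHE = 'url.cache.list'   # one downloaded URL per line (assumed format)

def load_cache(path):
    """Read one cached entry per line; a missing file just means an empty cache."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return set(line.strip() for line in f if line.strip())

downloaded = load_cache(CACHE)

def remember(url):
    """Append a freshly downloaded URL so later runs skip it."""
    downloaded.add(url)
    with open(CACHE, 'a', encoding='utf-8') as f:
        f.write(url + '\n')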

Known issues

  • In some cases, once the session stored in the cookies expires, requests fail with a "获取页面失败!" (failed to fetch page) error. (Deleting the cookies.json file fixes this.)
  • The "搜索过于频繁" (searching too frequently) message: WeChat appears to apply anti-scraping limits to the search API. The current workaround is to delete cookies.json and log in with a different account, or simply wait a few hours. (A future plan is to cache all links first, then crawl them one by one...)


wxhub's Issues

Great tool.

I will try to capture the WeChat built-in browser's request headers, as I found Chrome doesn't show Comments and Likes, but the WeChat/PC client can render them.


failed on 获取网页失败 (failed to fetch page)

# After a manual login, the dashboard URL should carry a token parameter.
driver.get(Urls.index)
url = driver.current_url
if 'token' not in url:
    # No token means the login never completed, so the page is unusable.
    raise Exception("获取网页失败!")
Session.token = re.findall(r'token=(\w+)', url)[0]
process_input()
pipe()

Timeout period for frequency control (Error code: 200013)?

What's the timeout period for searching public article links per personal Official Account?

My search returns zero links after scraping for a few hours; just wondering whether there's a known time window for safely restarting the scrape.

Output message:
choose index or next page(n):0
调用搜索, 报错:200013 freq control  (search call failed: 200013 freq control)
本次搜索到0条文章, 新增0, 共在 todo.list 中包含 0 条文章链接 ...  (found 0 articles, 0 new; todo.list holds 0 article links)
本次共处理了 0 条文章链接!  (processed 0 article links this run)
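
No documented window seems to exist; one pragmatic pattern is to back off for hours rather than seconds after a 200013. A sketch, assuming the caller can detect the freq-control failure and signal it by returning None:

import time

def search_with_backoff(do_search, base_wait=3600, max_tries=4):
    """Retry a search callable, doubling the wait after each freq-control
    failure. The wait values are guesses, not documented limits."""
    wait = base_wait
    for _ in range(max_tries):
        result = do_search()        # assume None signals '200013 freq control'
        if result is not None:
            return result
        time.sleep(wait)
        wait *= 2
    return None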

Could a feature for fetching WeChat step counts (微信步数) be added?

As the title says: it is sad that the data you create yourself cannot be used by yourself.
How can I download my own daily WeChat step count, both the historical data and a daily record from now on into a local spreadsheet?

Hello, a few questions

0. What should follow pipe continue..?
1. Is there a way to delete the (duplicate) advertisement articles inside an account?
2. Could it collect only the links in Excel form, as "article title" + "publish date" + "article link"? The folder-per-article layout is hard to browse, and adding hyperlinks by hand is a lot of work. (See the CSV sketch after this list.)
3. Does "调用搜索, 报错:200013 freq control" mean I've been banned...? The account I want to scrape has 100 pages of articles; could you advise how to handle that?
4. After 40-odd pages of continuous downloading I started getting "文件名、目录名或卷标语法不正确。" (the filename, directory name, or volume label syntax is incorrect) and "Exception: 保存网页失败: 该内容已被发布者删除" (failed to save page: this content has been deleted by the publisher) errors, yet other accounts still work. What causes this?
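
On question 2: the repo doesn't offer that output format, but collecting title/date/link rows into a CSV (which Excel opens directly) is straightforward. A hypothetical sketch, assuming the links are available as dicts:

import csv

def export_links(articles, path='links.csv'):
    """Write one row per article; `articles` is assumed to be an iterable of
    dicts with title/date/link keys, which wxhub itself may not expose."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:   # BOM so Excel reads UTF-8
        writer = csv.writer(f)
        writer.writerow(['title', 'date', 'link'])
        for a in articles:
            writer.writerow([a['title'], a['date'], a['link']])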

certificate verify failed

DevTools listening on ws://127.0.0.1:59600/devtools/browser/21f2c0e2-34e1-428f-8c11-7113ff243af1
请先手动登录, 完成后按回车继续:  (please log in manually, press Enter when done:)
url is https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=2099603568
Traceback (most recent call last):
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
cnx.do_handshake()
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL\SSL.py", line 1639, in _raise_ssl_error
_raise_current_error()
File "C:\Users\andyy\AppData\Local\Programs\Python\Python37\lib\site-packages\OpenSSL_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')]

Seems like it stopped working?

The last successful run was more than half a month ago; trying again today, it no longer returns pages or links. Have I been banned?
The error output is below:

请先手动登录, 完成后按回车继续:  (please log in manually, press Enter when done:)
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 453, in wrap_socket
cnx.do_handshake()
File "/Users/user/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1915, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/Users/user/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1639, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (54, 'ECONNRESET')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in validate_conn
conn.connect()
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 344, in connect
ssl_context=context)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/util/ssl
.py", line 344, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 459, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(54, 'ECONNRESET')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/user/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mp.weixin.qq.com', port=443): Max retries exceeded with url: /cgi-bin/searchbiz?action=search_biz&token=304200544&lang=zh_CN&f=json&ajax=1&random=0.843229694344508&query=gqtzy2014&begin=0&count=5 (Caused by SSLError(SSLError("bad handshake: SysCallError(54, 'ECONNRESET')",),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "wxhub.py", line 468, in
main(Input.args.chrome)
File "wxhub.py", line 442, in main
pipe()
File "wxhub.py", line 358, in pipe
fake_info = pipe_fakes(Input.fake_name)
File "wxhub.py", line 133, in pipe_fakes
rep = requests.get(Urls.query_biz.format(random=random.random(), token=Session.token, query=fake_name, begin=begin, count=count), cookies=Session.cookies, headers=Session.headers)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='mp.weixin.qq.com', port=443): Max retries exceeded with url: /cgi-bin/searchbiz?action=search_biz&token=304200544&lang=zh_CN&f=json&ajax=1&random=0.843229694344508&query=gqtzy2014&begin=0&count=5 (Caused by SSLError(SSLError("bad handshake: SysCallError(54, 'ECONNRESET')",),))

Following this

I'm working on the same requirement; I'll keep following. Haha.

"访问过于频繁,请用微信扫描二维码进行访问" (too many requests; please scan the QR code with WeChat)

After continuous scraping the server refuses connections. Switching IP addresses lets the scrape resume, but the URLs of the wrongly fetched pages are already in arti.cache.list, so they must be found and deleted by hand before those pages can be re-fetched.

Suggestion: do a quick check of the HTML content before writing it to disk; if the "too many requests" error message is found, break out of the loop and report it. (A minimal version of this check follows.)
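
A minimal version of the suggested check, assuming the block page contains the error string quoted above (save_page is a hypothetical helper, not wxhub's actual function):

BLOCK_MARKER = '访问过于频繁'   # the error text quoted in this issue

def save_page(html, path):
    """Refuse to cache a rate-limit error page; returning False lets the
    caller break the loop and keep the URL out of arti.cache.list."""
    if BLOCK_MARKER in html:
        print('rate-limit page detected, stopping')
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return True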

Images don't display

Currently the HTML and images are simply saved into a folder; opening the saved page does not load the saved images.
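
One likely cause, for context: WeChat article pages lazy-load images through data-src attributes, so the saved HTML may never reference the downloaded copies. A hedged post-processing sketch (localize_images and name_for are hypothetical; the attribute handling is an assumption about the page markup):

import re

def localize_images(html, name_for):
    """Repoint each lazy-loaded image at its downloaded copy. `name_for` is
    a hypothetical callable mapping a remote URL to a local filename."""
    # WeChat article pages usually keep the real image URL in data-src
    return re.sub(r'data-src="([^"]+)"',
                  lambda m: 'src="%s"' % name_for(m.group(1)),
                  html)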
