chanwoood / crawl-zsxq

Crawls Zhishi Xingqiu (知识星球, zsxq) and compiles the content into a PDF e-book.

License: MIT License

Python 91.16% CSS 8.84%

crawl-zsxq's Issues

zsxq uses HTTPS, so how is the Authorization value obtained?

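Images are not crawled

Fetching the images is easy. They just have to be inserted into the HTML document with suitable tags so that they convert to PDF correctly.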
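
As a rough illustration of the point above, here is a minimal sketch of turning image URLs into tags for the intermediate HTML. The function name `image_tags` and the inline style are assumptions made for this example and are not taken from crawl.py.

```python
from html import escape


def image_tags(image_urls):
    """Return one <img> tag per URL, one tag per line (hypothetical helper)."""
    tags = []
    for url in image_urls:
        # quote=True also escapes quotes so the URL cannot break out of the attribute.
        safe_url = escape(url, quote=True)
        # max-width keeps wide images inside the page when the HTML is converted.
        tags.append('<img src="{}" style="max-width: 100%;">'.format(safe_url))
    return "\n".join(tags)


if __name__ == "__main__":
    print(image_tags(["https://example.com/a.jpg", "https://example.com/b.png?x=1&y=2"]))
```

The resulting string can be appended to the topic's HTML before the PDF step.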

Thanks for making this crawler

It took me from having nothing to having something. Sincere thanks!

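So far I see three problems:

1. Long images get split across pages.
2. Small text inside images is completely unreadable.
3. The crawl stops after 15 pages, at March 20, even though there are more interactions later that same day.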

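I wonder whether problems 1 and 2 could be solved by producing HTML instead, because images in a PDF are static and some of what is shared in the groups comes as files. With HTML, images could be zoomed with the fancybox plugin, and files in other formats could be attached as links. I see that you already generate HTML as an intermediate step, but then the question becomes how to assemble everything at the end. Could it be packaged with GitBook or something similar?

Thanks!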
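
The HTML idea in this comment could look roughly like the sketch below: each image links to its full-size original so it can be opened and zoomed, and shared files are attached as plain links. `render_topic`, its parameters, and `PAGE_CSS` are hypothetical names for illustration only, and whether a given PDF converter honors `page-break-inside` is not guaranteed.

```python
from html import escape

# Assumed stylesheet: asks the renderer not to split an image across pages.
PAGE_CSS = "img { max-width: 100%; page-break-inside: avoid; }"


def render_topic(text, image_urls, files):
    """Build an HTML fragment for one topic (hypothetical helper).

    files is a list of (display_name, url) pairs.
    """
    parts = ["<p>{}</p>".format(escape(text))]
    for url in image_urls:
        safe = escape(url, quote=True)
        # The link target is the original image, so a viewer can open it full size.
        parts.append('<a href="{0}"><img src="{0}"></a>'.format(safe))
    for name, url in files:
        parts.append('<a href="{}">{}</a>'.format(escape(url, quote=True), escape(name)))
    return "\n".join(parts)


if __name__ == "__main__":
    print("<style>{}</style>".format(PAGE_CSS))
    print(render_topic("Notes", ["https://example.com/a.jpg"],
                       [("slides.pdf", "https://example.com/slides.pdf")]))
```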

TypeError: 'NoneType' object is not iterable

Traceback (most recent call last):
  File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 119, in <module>
    make_pdf(get_data(start_url))
  File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 37, in get_data
    for topic in json.loads(f.read()).get('resp_data').get('topics'):
TypeError: 'NoneType' object is not iterable

How can I fix this error?
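
The traceback shows that `.get('topics')` returned `None`, which the `for` loop cannot iterate. A defensive sketch is shown below; `extract_topics` is a hypothetical helper, and it assumes the response body is a JSON object. It does not explain why the server omitted the topics (an expired Authorization value or a wrong URL are possible causes, but the thread does not confirm either).

```python
import json


def extract_topics(raw_text):
    """Return the list of topics, or an empty list when the field is missing."""
    payload = json.loads(raw_text)
    # "or {}" / "or []" cover both a missing key and an explicit null value.
    resp_data = payload.get("resp_data") or {}
    return resp_data.get("topics") or []


if __name__ == "__main__":
    for topic in extract_topics('{"succeeded": false}'):
        print(topic)  # never reached: the list is empty instead of None
```

With this guard the loop simply runs zero times, which makes it easier to log the raw response and see what the server actually returned.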

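The technique is fine, but what is the point of pirating?

Sharing the technique would be great, but publishing the content from inside a group is not really appropriate.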

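Formatting problem

All line breaks are gone from the crawled content and everything runs together. I can't tell whether the crawled data never had them or whether they were dropped during the PDF conversion. Please fix this when you have time. Thanks for your contribution, author.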
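
One common cause of this symptom is that HTML collapses plain newline characters, so text pasted into an HTML template loses its line breaks. Assuming that is what happens here (the thread does not confirm it), a minimal sketch of converting newlines before building the HTML follows. `text_to_html` is a hypothetical helper.

```python
from html import escape


def text_to_html(text):
    """Escape the text and turn each newline into an explicit <br> tag."""
    escaped = escape(text, quote=False)
    # Normalize Windows line endings first so "\r\n" yields a single <br>.
    return escaped.replace("\r\n", "\n").replace("\n", "<br>\n")


if __name__ == "__main__":
    print(text_to_html("first line\nsecond line"))
```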

How do I find the right API endpoint for the digest (精华) section?

I did not know what to put in start_url, so I copied the URL directly and got the following error:

Traceback (most recent call last):
  File "crawl.py", line 119, in <module>
    make_pdf(get_data(start_url))
  File "crawl.py", line 34, in get_data
    f.write(json.dumps(rsp.json(), indent=2, ensure_ascii=False))
  File "C:\anaconda\lib\site-packages\requests\models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\anaconda\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:\anaconda\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\anaconda\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is this caused only by start_url? If so, how do I find the API endpoint for the digest section? Thank you for your help.
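
"Expecting value: line 1 column 1 (char 0)" means the response body did not start with JSON at all, which typically happens when the URL points at a web page rather than an API endpoint. The sketch below checks the response before decoding so the failure is easier to diagnose. `parse_json_body` is a hypothetical helper, not part of crawl.py.

```python
import json


def parse_json_body(status_code, body_text):
    """Decode a response body, raising a descriptive error if it is not JSON."""
    if status_code != 200:
        raise ValueError("unexpected HTTP status {}".format(status_code))
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        # Show the start of the body: an HTML page here usually means a wrong URL.
        preview = body_text[:80]
        raise ValueError("response is not JSON, starts with: {!r}".format(preview))


if __name__ == "__main__":
    print(parse_json_body(200, '{"succeeded": true}'))
```

With the `requests` library this could be called as `parse_json_body(rsp.status_code, rsp.text)` in place of `rsp.json()`.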

How can all comments and comment replies be crawled? Currently only the first page of comments is fetched.

Question from a beginner

Pagination problem when the timestamp on the first page is 2018-01-01T01:01:01.000+0800 (specifically the 000)

Error: TypeError: 'NoneType' object is not iterable

Pagination problem
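
The title above points at timestamps whose millisecond part is 000. Assuming the crawler asks for the next page by passing a time one millisecond earlier than the oldest topic on the current page (an assumption, the thread does not show that code), doing the subtraction on the string would break at 000. A sketch that uses `datetime` arithmetic instead is shown here. `previous_millisecond` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Matches timestamps such as 2018-01-01T01:01:01.000+0800
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def previous_millisecond(create_time):
    """Return the timestamp one millisecond earlier, in the same text format."""
    moment = datetime.strptime(create_time, TIME_FORMAT) - timedelta(milliseconds=1)
    base = moment.strftime("%Y-%m-%dT%H:%M:%S")
    # strftime's %f prints six digits, so format the three millisecond digits by hand.
    millis = "{:03d}".format(moment.microsecond // 1000)
    return "{}.{}{}".format(base, millis, moment.strftime("%z"))


if __name__ == "__main__":
    print(previous_millisecond("2018-01-01T01:01:01.000+0800"))
```

The borrow into the seconds field and the zero padding are both handled by the date arithmetic rather than by string slicing.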
