crawl-zsxq's Issues

Images are not scraped

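Scraping the images is easy; they just need to be inserted into the HTML document with the appropriate tag so that they convert to PDF correctly.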
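
As a rough illustration of the "appropriate tag" idea above, here is a minimal sketch that turns a list of image URLs into `<img>` tags for the intermediate HTML. The function name `images_to_html` and the shape of its input (a plain list of URL strings) are assumptions for illustration; the actual structure of the scraped data is not shown in this thread.

```python
from html import escape


def images_to_html(image_urls):
    """Return one <img> tag per URL, each wrapped in its own paragraph.

    Hypothetical helper: assumes the image URLs were already extracted
    from the scraped data as plain strings.
    """
    tags = []
    for url in image_urls:
        # Escape the URL so quotes or ampersands cannot break the attribute.
        safe_url = escape(url, quote=True)
        tags.append('<p><img src="{}" alt=""></p>'.format(safe_url))
    return "\n".join(tags)


if __name__ == "__main__":
    print(images_to_html(["https://example.com/a.jpg"]))
```

The returned string could then be appended to each post's HTML before the PDF conversion step.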

Thanks for building this crawler

It solved the problem of going from nothing to something. Sincere thanks!

So far I see three problems:

  1. Longer images get split across pages.
  2. Small text inside images is completely unreadable.
  3. Scraping stops after 15 pages, at March 20, even though there are more posts later that same day.

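For problems 1 and 2, I wonder whether HTML output could solve them, since images in a PDF are static and some groups share content as file attachments. With HTML, images could be zoomed with the fancybox plugin, and files in other formats could be attached as links. I see that you already generate HTML as an intermediate step, but going this way raises the question of how to assemble everything at the end. Could it be packaged with Gitbook or something similar?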

Thanks!
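
For problems 1 and 2 in the PDF path, one thing that might help is adding print-oriented CSS to the intermediate HTML. The sketch below is only an assumption about how that could look: the names `PRINT_CSS` and `wrap_html` are hypothetical, and whether the PDF converter used by the crawler honors `page-break-inside` has not been verified here.

```python
# Hypothetical stylesheet for the intermediate HTML; not part of crawl.py.
PRINT_CSS = """
img {
    max-width: 100%;            /* never wider than the page */
    height: auto;               /* keep the aspect ratio */
    page-break-inside: avoid;   /* ask the converter not to split an image */
}
"""


def wrap_html(body_html, css=PRINT_CSS):
    """Wrap an HTML fragment in a full document that carries the CSS."""
    template = (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        "<style>{}</style></head><body>{}</body></html>"
    )
    return template.format(css, body_html)


if __name__ == "__main__":
    print(wrap_html('<p><img src="a.jpg" alt=""></p>'))
```

An image taller than a single page would presumably still be split, so this only addresses images that fit on one page.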

TypeError: 'NoneType' object is not iterable

Traceback (most recent call last):
File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 119, in <module>
make_pdf(get_data(start_url))
File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 37, in get_data
for topic in json.loads(f.read()).get('resp_data').get('topics'):
TypeError: 'NoneType' object is not iterable
How do I fix this error?
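
The traceback shows the `for` loop iterating over the result of `.get('resp_data').get('topics')`, so `.get('topics')` evidently returned `None`, meaning the saved response has no `topics` entry. Why the response lacks it is not visible from this thread. A minimal sketch of a more defensive lookup follows; `extract_topics` is a hypothetical helper, not a function in crawl.py.

```python
import json


def extract_topics(raw_text):
    """Return the list of topics from a saved API response.

    Hypothetical helper: raises a readable error instead of letting a
    later loop fail with "'NoneType' object is not iterable".
    """
    payload = json.loads(raw_text)
    # Fall back to an empty dict if 'resp_data' is missing or null.
    resp_data = payload.get("resp_data") or {}
    topics = resp_data.get("topics")
    if topics is None:
        raise ValueError(
            "response has no 'topics' list; top-level keys: {}".format(sorted(payload))
        )
    return topics


if __name__ == "__main__":
    print(extract_topics('{"resp_data": {"topics": [{"id": 1}]}}'))
```

Printing the top-level keys of the response makes it easier to see what the server actually returned.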

Formatting problem

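The scraped content has lost all its line breaks and runs together in one blob. I can't tell whether the scraped data had no line breaks to begin with or whether they were dropped during PDF conversion. Please fix this when you have time. Thanks to the author for the contribution!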
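
One possible cause, stated as an assumption: HTML rendering collapses plain newline characters into spaces, so if the post text is placed into the intermediate HTML unchanged, its line breaks disappear before the PDF is produced. A minimal sketch of a conversion step is shown below; `text_to_html` is a hypothetical helper name.

```python
from html import escape


def text_to_html(text):
    """Escape text and turn line breaks into <br> so they survive rendering."""
    # Normalize Windows and old Mac line endings to "\n" first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape before inserting tags so the <br> tags themselves stay intact.
    return escape(normalized).replace("\n", "<br>\n")


if __name__ == "__main__":
    print(text_to_html("first line\nsecond line"))
```

If the scraped JSON itself contains no newline characters, this step would change nothing, which is a quick way to tell the two cases apart.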

How do I find the right API endpoint for the digest section (精华区)?

I don't know what to put in start_url, so I copied the URL directly and got the following error:
Traceback (most recent call last):
File "crawl.py", line 119, in <module>
make_pdf(get_data(start_url))
File "crawl.py", line 34, in get_data
f.write(json.dumps(rsp.json(), indent=2, ensure_ascii=False))
File "C:\anaconda\lib\site-packages\requests\models.py", line 892, in json
return complexjson.loads(self.text, **kwargs)
File "C:\anaconda\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\anaconda\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\anaconda\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is this caused only by start_url? If so, how should I find the API endpoint for the digest section? Thanks for your help.
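
The message "Expecting value: line 1 column 1 (char 0)" means the response body did not start with valid JSON at all, for example because it was empty or an HTML page. That is consistent with start_url pointing at something other than the JSON API, though this thread does not confirm it. A minimal sketch of a check that shows what was actually returned follows; `parse_api_body` is a hypothetical helper that would be called on `rsp.text` in place of `rsp.json()`.

```python
import json


def parse_api_body(body_text):
    """Parse a response body, with a readable error when it is not JSON."""
    stripped = body_text.strip()
    if not stripped:
        raise ValueError("empty response body; the URL may not be the JSON API")
    try:
        return json.loads(stripped)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML or login page is easy to spot.
        preview = stripped[:80]
        raise ValueError(
            "response is not JSON, starts with: {!r}".format(preview)
        ) from exc


if __name__ == "__main__":
    print(parse_api_body('{"succeeded": true}'))
```

If the preview starts with `<!DOCTYPE html>` or `<html`, the URL is a web page rather than an API endpoint.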
