crawl-zsxq's People

Contributors

chanwoood

crawl-zsxq's Issues

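Thanks for building this crawler

It solved the problem of going from nothing to something. Sincere thanks!

So far I see three problems:

  1. Tall images get split across page breaks.
    [screenshot: 2018-06-29, 5:01:04 PM]
  2. Small text inside images is completely unreadable.
    [screenshot: 2018-06-29, 5:01:41 PM]
  3. Crawling stops after 15 pages, at March 20, although there were more posts later that same day.

For problems 1 and 2, I wonder whether HTML output could solve them, since images in a PDF are static and some groups also share attachments in other file formats. With HTML, images could be zoomed with the fancybox plugin, and files in other formats could be attached as links. I see that you already generate HTML as an intermediate step, but going this route raises the question of how to assemble the final output. Could it be packaged with Gitbook or something similar?

Thanks!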

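Formatting problem

All line breaks are missing from the crawled content, so the text runs together into one block. I don't know whether the crawled data lacked them in the first place or whether the PDF conversion removed them. Could you fix this when you have time? Thanks to the author for the contribution.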
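One possible cause is that HTML collapses bare newline characters, so text that contains "\n" renders as a single block once it passes through an HTML intermediate step. The following is a minimal sketch of a fix under that assumption; `text_to_html` is a hypothetical helper, not a function from crawl.py.

```python
import html


def text_to_html(text: str) -> str:
    """Escape raw post text and keep its line breaks visible in HTML.

    Hypothetical helper for illustration; not part of crawl.py.
    """
    # Normalize Windows and old Mac line endings to "\n" first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape <, > and & so post text cannot break the surrounding markup.
    escaped = html.escape(normalized, quote=False)
    # HTML collapses bare newlines, so make each one an explicit <br>.
    return escaped.replace("\n", "<br>\n")
```

If the crawled JSON already lacks newlines, this will not help, and the data itself would need to be checked first.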

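Images are not crawled

Fetching the images is simple; they just need to be inserted into the HTML document with the appropriate tags so that they convert to PDF correctly.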
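As a rough illustration of inserting images with suitable tags, the sketch below builds one `<img>` element per URL. The function name `images_to_html` and the `max-width` styling are assumptions made for this example; how the crawler actually obtains image URLs is not shown in this issue.

```python
import html


def images_to_html(image_urls, max_width="100%"):
    """Build one <img> tag per URL, limited to the given maximum width.

    Hypothetical helper for illustration; not part of crawl.py.
    """
    tags = []
    for url in image_urls:
        # Escape the URL so characters such as & or " stay valid in an attribute.
        safe_url = html.escape(url, quote=True)
        tags.append(f'<img src="{safe_url}" style="max-width: {max_width};">')
    return "\n".join(tags)
```

The resulting string can be appended to the HTML generated for each post before the document is converted to PDF.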

TypeError: 'NoneType' object is not iterable

Traceback (most recent call last):
  File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 119, in <module>
    make_pdf(get_data(start_url))
  File "C:/Users/Administrator/Downloads/crawl-zsxq-master/crawl.py", line 37, in get_data
    for topic in json.loads(f.read()).get('resp_data').get('topics'):
TypeError: 'NoneType' object is not iterable
How can I resolve this error?
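This TypeError means the expression being iterated evaluated to None: `resp_data` was present, but it had no usable `topics` value (a missing `resp_data` would raise AttributeError instead). A likely but unconfirmed cause is that the API returned an error payload rather than data; the issue does not show the response. A minimal defensive sketch, with `extract_topics` as a hypothetical helper:

```python
import json


def extract_topics(raw_json: str) -> list:
    """Return the topic list, or an empty list when it is missing.

    Hypothetical helper for illustration; not part of crawl.py.
    """
    payload = json.loads(raw_json)
    # Fall back to an empty dict/list when a key is absent or null.
    resp_data = payload.get("resp_data") or {}
    return resp_data.get("topics") or []
```

With this, the loop simply runs zero times on an empty response, which makes it easier to inspect the saved JSON and see what the server actually returned.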

How do I find the right endpoint for the featured posts section (精华区)?

I don't know what to put in start_url, so I copied the URL directly and got the following error:
Traceback (most recent call last):
  File "crawl.py", line 119, in <module>
    make_pdf(get_data(start_url))
  File "crawl.py", line 34, in get_data
    f.write(json.dumps(rsp.json(), indent=2, ensure_ascii=False))
  File "C:\anaconda\lib\site-packages\requests\models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\anaconda\lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "C:\anaconda\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\anaconda\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is this caused only by start_url? If so, how should I find the endpoint for the featured posts section? Thank you for your help.
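"Expecting value: line 1 column 1 (char 0)" means the response body was not JSON at all, for example an empty body or an HTML page, which is consistent with a start_url that points at a web page rather than an API endpoint. The sketch below shows one way to surface what came back instead of crashing; `parse_json_body` is a hypothetical helper, and the 80-character preview length is an arbitrary choice.

```python
import json


def parse_json_body(body: str):
    """Return (data, error); error is None when the body parsed as JSON.

    Hypothetical helper for illustration; not part of crawl.py.
    """
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML page or empty reply is obvious.
        preview = body[:80]
        return None, f"response is not JSON ({exc.msg}); body starts with: {preview!r}"
```

With requests, the body would be passed as `rsp.text`; printing the returned error shows whether the server sent HTML, an empty reply, or something else.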
