crawl-zsxq

This project is forked from chanwoood/crawl-zsxq.

Crawl Zhishi Xingqiu (知识星球) and turn the content into a PDF ebook.

License: MIT

Features

Crawls the digest (精华) section of a Zhishi Xingqiu group and turns it into a PDF ebook.

Screenshot

效果图.png

Usage

if __name__ == '__main__':
    start_url = 'https://api.zsxq.com/v1.10/groups/454584445828/topics?scope=digests&count=20'
    make_pdf(get_data(start_url))

Change start_url to the URL of the group (星球) you want to crawl.

You also need wkhtmltox installed; see "Making the PDF ebook" below.

Simulating login

We crawl the web version of Zhishi Xingqiu: https://wx.zsxq.com/dweb/#

This site does not rely on cookies to decide whether you are logged in, but on the Authorization field in the request headers.

So you need to replace these values with your own, the User-Agent included. (In the snippet below the login credential is carried in the Cookie header as zsxq_access_token, so swap in your own token there.)

headers = {
    # A Python dict cannot hold duplicate keys (the original snippet repeated
    # 'Cookie' three times, silently keeping only the last value), so the
    # cookies are merged into a single Cookie header string.
    'Cookie': 'UM_distinctid=169a0132ecb3b4-01de055f2e9097-36647902-13c680-169a0132ecca18; '
              'sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216846276920ae3-0e9d710efddd66-10336654-1296000-16846276921bf%22%2C%22%24device_id%22%3A%2216846276920ae3-0e9d710efddd66-10336654-1296000-16846276921bf%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; '
              'zsxq_access_token=EDBA316F-E9A2-11EB-8C7A-C7D96E1E9228',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}

Analyzing the page

After logging in, my usual habit is to right-click and choose Inspect or View Source.

But this page is a bit special: the content does not live under the URL in the address bar; it is loaded asynchronously (XHR). You just need to find the right API endpoint.

The digest endpoint: https://api.zsxq.com/v1.10/groups/2421112121/topics?scope=digests&count=20

This endpoint returns the latest 20 topics; older data lives behind different URLs, which is the pagination discussed below.
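
A minimal sketch of requesting this endpoint, assuming the headers dict from the login section above (the variable names here are illustrative, not the project's exact code):

import requests

url = 'https://api.zsxq.com/v1.10/groups/2421112121/topics?scope=digests&count=20'
rsp = requests.get(url, headers=headers)
# resp_data.topics holds up to 20 digest topics
topics = rsp.json().get('resp_data').get('topics')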


Making the PDF ebook

  • Install wkhtmltox from https://wkhtmltopdf.org/downloads.html, then add its bin directory to your PATH.
  • Install the Python dependency: pip install pdfkit
  • This tool converts HTML documents to PDF and builds the table of contents automatically from the h (heading) tags in the HTML.
  • Digest posts have no titles of their own, so I take the first 6 characters of each question as its title to tell the questions apart (see the sketch after this list).
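
A minimal sketch of that heading trick. The html_template below is a stand-in (the project's real template also pulls in test.css), and the field holding the question text is an assumption:

# Hypothetical minimal template with the same {title}/{text} placeholders
# that the image-crawling snippet below fills in.
html_template = '''<!DOCTYPE html>
<html><head><meta charset="UTF-8"></head>
<body><h1>{title}</h1><p>{text}</p></body></html>'''

text = content.get('text', '')  # assumed field with the question body
title = text[:6]                # the first 6 characters become the TOC heading
html = html_template.format(title=title, text=text)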

Now for the finishing touches:

Crawling images

Clearly, the images key in the returned data holds the images; we only need to extract the large, i.e. high-resolution, URL of each one.

The key step is inserting the img tags into the HTML document.

I use BeautifulSoup to manipulate the DOM.

Note that there may be more than one image, so a for loop iterates over all of them:

from bs4 import BeautifulSoup

if content.get('images'):
    soup = BeautifulSoup(html_template, 'html.parser')
    for img in content.get('images'):
        url = img.get('large').get('url')      # the high-resolution URL
        img_tag = soup.new_tag('img', src=url)
        soup.body.append(img_tag)
    # Serialize and fill the template once, after all images are appended
    # (the original snippet did this inside the loop on every iteration).
    html_img = str(soup)
    html = html_img.format(title=title, text=text)

Pagination

  • https://api.zsxq.com/v1.10/groups/2421112121/topics?scope=digests&count=20&end_time=2018-04-12T15%3A49%3A13.443%2B0800

  • The end_time query parameter at the end of the path marks the latest date of the posts to load; this is how pagination is achieved.

  • The end_time value is URL-escaped; urllib.parse.quote can do the escaping. The key is figuring out where this end_time comes from.

  • After careful observation I found: each request returns 20 posts, and the last post determines the end_time of the next request's URL.

  • For example, if the last post's create_time is 2018-01-10T11:49:39.668+0800, then the next URL's end_time is 2018-01-10T11:49:39.667+0800. Note: one is 668, the other 667, a difference of 1 millisecond.

end_time = create_time[:20] + str(int(create_time[20:23]) - 1) + create_time[23:]

  • Paging past the end returns:

{"succeeded":true,"resp_data":{"topics":[]}}

So next_page = rsp.json().get('resp_data').get('topics') is used to decide whether there is a next page (an empty topics list is falsy).
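
Putting the pieces together, a sketch of the paging loop (assuming the headers dict above; rendering topics to HTML files and the value get_data hands to make_pdf are both elided):

import requests
from urllib.parse import quote

def get_data(start_url):
    url = start_url
    while True:
        rsp = requests.get(url, headers=headers)
        topics = rsp.json().get('resp_data').get('topics')
        if not topics:  # empty topics list: no next page
            break
        # ... render each topic in `topics` to an HTML file here ...
        create_time = topics[-1].get('create_time')
        # subtract 1 ms from the last post's create_time
        end_time = create_time[:20] + str(int(create_time[20:23]) - 1) + create_time[23:]
        url = start_url + '&end_time=' + quote(end_time)

Note that the millisecond arithmetic assumes the last three digits are not 000 and that str() drops leading zeros; a borrow across the second boundary would need extra handling.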

Making a polished PDF

CSS controls the font size, layout, colors, and so on; see the test.css file.

Then reference this file in the options dict:

options = {
    "user-style-sheet": "test.css",
    ...
}
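
With the stylesheet wired in, a single pdfkit call merges the generated pages (a sketch; html_files is assumed to hold the generated HTML filenames):

import pdfkit

# from_file accepts a list of input HTML files and writes one PDF
pdfkit.from_file(html_files, '电子书.pdf', options=options)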

The trickiest problem of all: old iron, give me a star!!!


Issues

Too many crawled digest files cause pdfkit to fail to generate the PDF

import os
import pdfkit

def make_pdf():
    # Collect the 400 crawled HTML files (0.html ... 399.html)
    html_files = [str(index) + ".html" for index in range(400)]

    options = {
        "user-style-sheet": "test.css",
        "page-size": "Letter",
        "margin-top": "0.75in",
        "margin-right": "0.75in",
        "margin-bottom": "0.75in",
        "margin-left": "0.75in",
        "encoding": "UTF-8",
        "custom-header": [("Accept-Encoding", "gzip")],
        "cookie": [
            ("cookie-name1", "cookie-value1"), ("cookie-name2", "cookie-value2")
        ],
        "outline-depth": 10,
    }
    try:
        pdfkit.from_file(html_files, "xxx电子书.pdf", options=options)
    except Exception as e:
        # Swallowing the exception silently hides the wkhtmltopdf failure;
        # print it so the error is at least visible.
        print(e)

    for file in html_files:
        os.remove(file)

    print("Ebook created in the current directory!")
