
lenmao / documentdownloader


This project forked from ohyee/documentdownloader


Download documents from book118 for free

Home Page: https://www.oyohyee.com/tag/documentdownloader

License: MIT License

Shell 0.48% Python 99.52%

documentdownloader's Introduction

Document Downloader


Downloads PDF documents from book118.

Approach

  1. Crawl the page to collect the image links
  2. Download the images
  3. Merge the images into a PDF file
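The three steps above can be sketched roughly as follows. This is only an illustration and not the project's actual code: the function names are hypothetical, the page is assumed to expose its pages as plain <img> tags (book118's real layout is not described here), and Pillow's built-in PDF writer stands in for reportlab in the merge step.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)


def extract_image_links(html):
    """Step 1: pull image links out of a page (hypothetical page layout)."""
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.links


def download_images(links, folder):
    """Step 2: save each image as 0001.img, 0002.img, ... and return the paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, link in enumerate(links, start=1):
        path = folder / f"{index:04d}.img"
        with urlopen(link) as response:
            path.write_bytes(response.read())
        paths.append(path)
    return paths


def images_to_pdf(paths, output):
    """Step 3: merge the images into one PDF, one image per page."""
    from PIL import Image  # Pillow; imported here so steps 1 and 2 work without it

    pages = [Image.open(path).convert("RGB") for path in paths]
    if not pages:
        raise ValueError("no images to merge")
    pages[0].save(output, "PDF", save_all=True, append_images=pages[1:])
```

Numbering the files keeps the pages in order when the PDF is built, and keeping the downloaded images on disk is what makes it possible to rebuild the PDF later without downloading again.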

Related article: Using a crawler to download book118 PDF files for free

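Options

Option          Description          Required
-h, --help      Show help
-u, --url       URL of the web page of the document to download
-o, --output    Name of the saved file; defaults to the document title plus .pdf
-p, --proxy     Proxy address to use (defaults to the values of the HTTP_PROXY and HTTPS_PROXY environment variables); use -p '' to force a direct connection with no proxy
-f, --force     Force a fresh download and ignore the cache
-t, --thread    Number of threads to use; defaults to 10 if not specified
-s, --safe      Enable this if the server rejects your requests; it forces a single thread and increases the interval between requests and downloads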
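As a rough illustration of how the options above fit together, here is a minimal argparse sketch. It is not the project's own parser: build_parser, resolve_proxies and effective_threads are hypothetical names, and marking -u as required is an assumption.

```python
import argparse
import os


def build_parser():
    """Command-line parser mirroring the option table (-h/--help is added by argparse)."""
    parser = argparse.ArgumentParser(prog="documentDownloader")
    # required=True is an assumption; the table does not say which options are required
    parser.add_argument("-u", "--url", required=True, help="web page URL of the document")
    parser.add_argument("-o", "--output", default=None, help="output name, default: document title + .pdf")
    parser.add_argument("-p", "--proxy", default=None, help="proxy address, '' for no proxy")
    parser.add_argument("-f", "--force", action="store_true", help="ignore the cache")
    parser.add_argument("-t", "--thread", type=int, default=10, help="number of threads")
    parser.add_argument("-s", "--safe", action="store_true", help="single thread, longer intervals")
    return parser


def resolve_proxies(proxy, environ=None):
    """Map the -p value to a {scheme: proxy address} dict.

    None (option not given) falls back to HTTP_PROXY / HTTPS_PROXY,
    an empty string means no proxy, anything else is used for both schemes.
    """
    environ = os.environ if environ is None else environ
    if proxy is None:
        found = {"http": environ.get("HTTP_PROXY"), "https": environ.get("HTTPS_PROXY")}
        return {scheme: value for scheme, value in found.items() if value}
    if proxy == "":
        return {}
    return {"http": proxy, "https": proxy}


def effective_threads(args):
    """Safe mode overrides -t and forces a single thread."""
    return 1 if args.safe else args.thread
```

Using None as the default for -p is what lets the sketch tell "option not given" apart from an explicit -p ''.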

Usage

Using the package published on PyPI

python3 -m pip install documentDownloader

Once installed, the documentDownloader command can be used directly.

For example: documentDownloader -u https://max.book118.com/html/2020/0109/5301014320002213.shtm -o '单身人群专题研究报告-2019' -p http://127.0.0.1:1080 -f -t 20
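The -f and -t flags in the example above control cache use and the number of download threads. One way such a download step could behave is sketched below. It is an assumption-laden illustration rather than the tool's real implementation: download_all and the fetch callback are hypothetical, and the cache is assumed to be simply the image files already present in the folder.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def download_all(links, folder, fetch, threads=10, force=False):
    """Download every link into folder using a thread pool.

    fetch(link) must return the image bytes. A file that already exists
    is treated as cached and skipped, unless force is True.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    def job(item):
        index, link = item
        path = folder / f"{index:04d}.img"
        if force or not path.exists():
            path.write_bytes(fetch(link))
        return path

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in input order, so the pages stay sorted
        return list(pool.map(job, enumerate(links, start=1)))
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library.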

Running main.py directly from source

Clone this project, or download a version from the releases page.

  1. Install Python 3
  2. Install the dependencies (Pillow, reportlab, requests): python -m pip install -r requirements.txt
  3. Run it with python3 main.py

For example: python main.py -u https://max.book118.com/html/2020/0109/5301014320002213.shtm -o '单身人群专题研究报告-2019' -p http://127.0.0.1:1080 -f -t 20

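For learning about web crawlers and related topics only. Please support legitimate books.
That said, many of the PDFs on book118 are themselves pirated.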

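Contribution list

Changelog

  • 2019-01-29: The Book118 website changed; updated the corresponding code. @JodeZer
  • 2020-01-09: Refactored the code; added multithreaded downloading for speed; added proxy support; allowed building the PDF directly from an existing cache; the PDF is now generated with the image size detected automatically. @OhYee
  • 2020-05-25: Published to PyPI.
  • 2021-10-18: The Book118 website changed; updated part of the code. The default output PDF filename is now the document title. Added a notice for documents that cannot be previewed in full for free. Set the request interval to 2 seconds (in testing, intervals shorter than 2 seconds were very likely to return an empty address). Added a "slow download" option to avoid being rejected by the server for downloading too fast. @alxt17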
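The 2-second request interval mentioned in the last entry can be enforced with a small throttle. The sketch below shows one possible way to do it. The Throttle class is hypothetical and is not taken from the project; the clock and sleep functions are parameters only so that the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Makes consecutive calls to wait() at least `interval` seconds apart."""

    def __init__(self, interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, None before the first one

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the difference
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling throttle.wait() before each request for an image address keeps the requests spaced out. The sketch is meant for a single thread, which matches the slow download mode; sharing it between threads would additionally need a lock.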

documentdownloader's People

Contributors

jodezer, ohyee, shengt25
