
hbclark / juejinbooksspider


This project forked from dnzzk2/juejinbooksspider


Juejin booklet spider

Home Page: https://h7ml.github.io/juejinBooksSpider/

JavaScript 2.57% TypeScript 97.43%

juejinbooksspider's Introduction

📚 Juejin Booklet Spider 👋

License: Apache-2.0

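🕷️ A scraper script for Juejin booklets. It saves a booklet in Markdown, PDF, or HTML format.

📜 Notes

The examples in this project scrape publicly available Juejin booklets, which can be viewed under Juejin Booklets / Read. This project is for learning and discussion only; do not publish booklets you have personally paid for. ⚠️ This project bears no responsibility for any consequences of publishing them.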

🛠 Usage

👥 Clone the project

git clone https://github.com/h7ml/juejinBooksSpider.git
cd juejinBooksSpider

📦 Install dependencies

pnpm install

# or
# npm install

# or
# yarn install

🎲 Run

# Scrape a single booklet
# pnpm dev <booklet URL>
pnpm dev https://juejin.cn/book/6844723704639782920

# Scrape multiple booklets: set cookie and set spiderAll to true in the .env file, then run pnpm start

📁 Configuration files

📋 Type definitions

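// \src\types.d.ts
// Assumption: PuppeteerLaunchOptions is the type exported by the puppeteer package.
import type { PuppeteerLaunchOptions } from 'puppeteer'
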
export type FileFormat = 'pdf' | 'md' | 'html' | ''

export interface EvConfig {
  log: string | boolean
  storeDirs: string
  cookie: string
  course: string
  spiderAll: string | boolean
  headless: string | boolean
  filetype: FileFormat
  puppeteerOptions: PuppeteerLaunchOptions
}
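
Values read from a .env file arrive as plain strings, so a string such as the filetype setting has to be checked before it can be treated as a FileFormat. A minimal sketch of such a check follows; the names FILE_FORMATS and isFileFormat are hypothetical and do not come from the project source.

```typescript
// Hypothetical helper, not taken from the project source.
// The tuple mirrors the members of the FileFormat union shown above.
const FILE_FORMATS = ['pdf', 'md', 'html', ''] as const
type FileFormat = (typeof FILE_FORMATS)[number]

// Type guard: narrows an arbitrary string to FileFormat when it matches.
function isFileFormat(value: string): value is FileFormat {
  return FILE_FORMATS.some((format) => format === value)
}

// Falls back to 'md' when the raw value is not a known format.
function toFileFormat(raw: string): FileFormat {
  return isFileFormat(raw) ? raw : 'md'
}
```

Deriving the union from a single tuple keeps the runtime check and the compile-time type in step; whether the project itself validates the value this way is not stated in the README.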

⚙️ .env

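  • cookie: the Cookie for the Juejin website, used to scrape booklets that require authorized access.
  • isLog: whether to write logs; defaults to true. When enabled, a log file is generated in the dist directory.
  • storeDir: the directory where booklets are saved; defaults to docs, meaning the docs directory under the current directory.
  • course: the booklet URL; defaults to https://juejin.cn/book/6844723704639782920. If a booklet URL is passed on the command line, the command-line URL takes precedence.
  • spiderAll: whether to scrape all booklets; defaults to false. If true, all booklets are scraped; otherwise only the booklet specified in course is scraped.
  • filetype: the file type to save; defaults to md. Allowed values are md, pdf, and html.
  • headless: whether to use a headless browser; defaults to true. If false, a headed browser is used, which is convenient for debugging. See the puppeteer documentation.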
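
The defaults and the command-line precedence described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual loader: resolveConfig, parseFlag, parseFiletype, and ResolvedConfig are hypothetical names, and the .env values are assumed to arrive as a plain string map.

```typescript
type FileFormat = 'pdf' | 'md' | 'html' | ''

// Hypothetical shape: the .env keys from the list above, with flags parsed to booleans.
interface ResolvedConfig {
  cookie: string
  isLog: boolean
  storeDir: string
  course: string
  spiderAll: boolean
  filetype: FileFormat
  headless: boolean
}

const DEFAULT_COURSE = 'https://juejin.cn/book/6844723704639782920'

// .env values are strings, so "true"/"false" must be parsed explicitly.
function parseFlag(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw.trim() === '') return fallback
  return raw.trim().toLowerCase() === 'true'
}

// Unknown or missing values fall back to the documented default, md.
function parseFiletype(raw: string | undefined): FileFormat {
  if (raw === 'pdf' || raw === 'html' || raw === 'md') return raw
  return 'md'
}

// cliUrl is the optional booklet URL passed on the command line; it wins over env.course.
function resolveConfig(env: Record<string, string | undefined>, cliUrl?: string): ResolvedConfig {
  return {
    cookie: env['cookie'] ?? '',
    isLog: parseFlag(env['isLog'], true),
    storeDir: env['storeDir'] || 'docs',
    course: cliUrl || env['course'] || DEFAULT_COURSE,
    spiderAll: parseFlag(env['spiderAll'], false),
    filetype: parseFiletype(env['filetype']),
    headless: parseFlag(env['headless'], true),
  }
}
```

For example, resolveConfig({ course: 'https://juejin.cn/book/1' }, 'https://juejin.cn/book/2') yields a course of https://juejin.cn/book/2, matching the rule that the command-line URL takes precedence.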

⚙️ puppeteerOptions

puppeteerOptions: launch options for puppeteer; optional. See the puppeteer documentation. To change them, set them in config.
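
A rough sketch of how such launch options might be combined with the headless setting is shown below. The README does not show the config file, so the function buildLaunchOptions, the LaunchOptionsSubset interface, and the rule that explicit options override the headless flag are all assumptions made for illustration. The field names headless, args, slowMo, and defaultViewport are existing puppeteer launch options, redeclared locally so the sketch has no dependencies.

```typescript
// Local subset of puppeteer's launch options, declared here so the sketch is self-contained.
interface LaunchOptionsSubset {
  headless?: boolean
  args?: string[]
  slowMo?: number
  defaultViewport?: { width: number; height: number } | null
}

// Hypothetical merge rule (an assumption): explicit overrides win over the headless flag.
function buildLaunchOptions(
  headless: boolean,
  overrides: LaunchOptionsSubset = {},
): LaunchOptionsSubset {
  return { headless, ...overrides }
}

// Example: slow each browser operation down by 50 ms and fix the viewport size.
const launchOptions = buildLaunchOptions(true, {
  slowMo: 50,
  defaultViewport: { width: 1280, height: 800 },
})
```

The resulting object would then be passed to puppeteer's launch call; how the project wires this up internally is not described in the README.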

🏠 Home page

👤 Author

👤 h7ml

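🤝 Contributors

Contributions, issues, and feature requests are all welcome!
Feel free to raise questions and suggestions. You can also consult the contributing guide.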

📊 Total: 16

📝 License

Copyright © 2023 h7ml
This project is licensed under the Apache-2.0 license.


This README was generated with ❤️ by readme-md-generator

juejinbooksspider's People

Contributors

h7ml, actions-user, binbiubiubiu, croatialu, yyx990803, kelseyshi, dnzzk2, michael-py001, sdras, antfu, gaearon, dependabot[bot], donghuzi1, tiezhu111, reabout, whatqiu
