fudanbbs_mirror's Introduction

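Fudan BBS mirror project

This project initially aims only to crawl data from the Fudan BBS and to provide a few format-conversion scripts that make the raw data easier to process. It may later grow to serve other purposes.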

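Modules proposed for the initial implementation:

  • An incremental Node.js crawler that uses MongoDB for persistent storage
  • A web interface with full-text search
  • Conversion scripts for common raw-data formats, to support NLP and other downstream tasks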
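
As a rough illustration of what "incremental" could mean for the crawler, the sketch below remembers the highest post id already saved per board and only returns newer ones. Everything here is an assumption made for illustration: `store` is an in-memory Map standing in for a MongoDB collection, `fetchPostIds` fakes the request to the BBS, and the board name `Test` is hypothetical. It is not the project's actual crawler.

```javascript
// Hypothetical sketch: `store` stands in for a MongoDB collection and
// `fetchPostIds` for a real HTTP request to the BBS; neither is project code.
const store = new Map() // board name -> highest post id already saved

// Pretend remote listing: returns the post ids currently visible on a board.
const fetchPostIds = async (board) => {
  const remote = { Test: [1, 2, 3, 4, 5] }
  return remote[board] || []
}

// Return only the posts that are newer than the last saved id,
// then advance the saved id so the next run skips them.
const crawlIncrementally = async (board) => {
  const lastSeen = store.get(board) || 0
  const ids = await fetchPostIds(board)
  const fresh = ids.filter((id) => id > lastSeen)
  if (fresh.length > 0) store.set(board, Math.max(...fresh))
  return fresh
}
```

Calling `crawlIncrementally('Test')` twice in a row yields all five ids the first time and an empty array the second time, because the saved high-water mark has moved forward.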

Possible next steps:

  • A simple distributed crawler with a one-master-multi-slave architecture, in which the master is, in theory, the single point of failure
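
The shape of such a system can be sketched in a single process. In this hypothetical example the `Master` class owns the only copy of the task queue and the workers (the "slaves" above) pull tasks from it until it is empty. The class, the function names, and the timer that stands in for a page fetch are all assumptions; a real deployment would run these parts on separate machines and communicate over the network.

```javascript
// Hypothetical in-process sketch of one master handing tasks to several workers.
class Master {
  constructor (tasks) {
    // The only copy of the queue: if the master dies, the crawl state is lost.
    this.pending = [...tasks]
    this.done = []
  }

  nextTask () {
    return this.pending.shift()
  }

  report (worker, task) {
    this.done.push({ worker, task })
  }
}

// A worker keeps asking the master for work until the queue is empty.
const runWorker = async (name, master) => {
  let task
  while ((task = master.nextTask()) !== undefined) {
    // Stand-in for fetching and parsing a page.
    await new Promise((resolve) => setTimeout(resolve, 1))
    master.report(name, task)
  }
}

// Start `workerCount` workers against one master and wait for all of them.
const crawlAll = async (tasks, workerCount) => {
  const master = new Master(tasks)
  const workers = []
  for (let i = 0; i < workerCount; i++) {
    workers.push(runWorker(`worker-${i}`, master))
  }
  await Promise.all(workers)
  return master.done
}
```

Because every task passes through `Master`, the workers can fail and be replaced independently, while losing the master stops the whole crawl. That is the single point of failure noted above.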

Dependencies

Ubuntu/Debian

sudo apt install libxcomposite1 libxcursor1 libxdamage1 libxi6 libxtst6 libfontconfig1 libxss1 libxrandr2 libgconf-2-4 libasound2 libpangocairo-1.0-0 libatk1.0-0 libatk-bridge2.0-0 libgtk-3-0

Coding Convention

NodeJS

Package Manager

Use yarn rather than npm as the package manager.

Coding Style

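  • Use ES6
  • Indent with 2 spaces
  • Do not use ; unless the code breaks without it (and sometimes it really does break unless a line starts with ;)
  • Use Promise/async/await rather than callbacks. If a library only offers a callback API, wrap it in a Promise before using it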
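
A minimal sketch of these rules in practice, using only Node.js built-ins. `fs.readFile` is used here purely as a stand-in for any callback-only API; the temporary file name and the `main` and `sleep` helpers are made up for the example.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')
const { promisify } = require('util')

// Wrap a callback-style function once, then await it everywhere else.
const readFileAsync = promisify(fs.readFile)

// Hand-written wrapper for APIs that do not follow the (err, value) convention.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

const main = async () => {
  const file = path.join(os.tmpdir(), 'fudanbbs_style_demo.txt')
  fs.writeFileSync(file, 'hello')
  await sleep(10)
  const text = await readFileAsync(file, 'utf8')
  // A statement that starts with [ or ( needs a leading semicolon;
  // without it, this line would be parsed as indexing into the call above.
  ;[text, text.toUpperCase()].forEach((s) => console.log(s))
  return text
}
```

Invoking `main()` writes a small temporary file, reads it back through the promisified wrapper, and logs the text in lower and upper case.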

Python

  • Use python3 and forget about Unicode headaches

Issues

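  • puppeteer downloads chromium from googleapis.com. If that host is blocked, set the NPM_HTTP_PROXY and NPM_HTTPS_PROXY environment variables or connect through a VPN. Running get_headless_chromium_location.js directly prints the chromium download link and the path it should be extracted to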

fudanbbs_mirror's People

Contributors

dragonly
