scrapy_douban's Introduction

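Folder overview

The flask folder contains the files that run the website. The other folders are experiments with different libraries and the demos built with them.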

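Why not use Scrapy:

Scrapy is a web crawler by design. It is powerful, but it depends on compiled C/C++ components, and it is hard to wrap in a class that can be instantiated and used from other code. There is a workaround, but it requires multithreading. Because this tool is deployed on a free PaaS platform where threads are prohibited, Scrapy was dropped.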

How to run the Scrapy version (either one works)

  • python run.py
  • scrapy crawl douban

Fortunately, the data I need to scrape is very simple. The key step is just parsing HTML, and BeautifulSoup4 is enough for that.
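
As a rough illustration of the HTML parsing mentioned above, here is a minimal sketch in Python 3 using BeautifulSoup4. The sample markup, the `td.title a` selector, and the `parse_topics` helper are assumptions made for this example; the real structure of a Douban group page and the project's own parsing code are not shown in this README.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a group topic list.
SAMPLE_HTML = """
<table class="olt">
  <tr><td class="title"><a href="https://example.com/topic/1">Room near subway</a></td></tr>
  <tr><td class="title"><a href="https://example.com/topic/2">Shared flat</a></td></tr>
</table>
"""


def parse_topics(html):
    """Return a list of {'title', 'url'} dicts, one per topic link found."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for link in soup.select("td.title a"):  # assumed selector
        topics.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
        })
    return topics
```

Calling `parse_topics(SAMPLE_HTML)` yields one dict per link in the sample table; for a real page the selector would need to match the actual markup.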

Requirements

  • Flask==0.10.1
  • beautifulsoup4==4.2.1
  • pymongo==2.5.2 (optional)

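Using virtualenv to set up a virtual environment and run the program is recommended:

With virtualenv installed on your machine:

  1. Change to the project folder: $ cd project
  2. Create the virtual environment: $ virtualenv venv
  3. Activate the virtual environment (Windows): $ venv\scripts\activate
  4. Leave the virtual environment: $ deactivate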

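Notes:

  • On Windows, be sure to use the built-in cmd as the command-line tool, not git bash; otherwise the virtual environment cannot be activated.
  • When the repo is cloned onto a different PC, be sure to run $ virtualenv venv again to redeploy the environment.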
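
If it is unclear whether the virtual environment is actually active (for example after trying a different shell), a small check like the following may help. This is a sketch only; `in_virtual_env` is a hypothetical helper and not part of this project.

```python
import sys


def in_virtual_env():
    """Return True if the running interpreter appears to be inside a virtual environment."""
    # Older virtualenv releases (before 20) set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # The stdlib venv module and newer virtualenv releases make
    # sys.prefix differ from sys.base_prefix instead.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)
```

Running `python -c "import sys; print(sys.prefix)"` before and after activation gives a similar hint: the path should point into the venv folder once the environment is active.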

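Running

Run python run.py. Note that run.py keeps the data in a variable (in memory). Because this is unstable under the Flask framework, running the run_mongo.py version is recommended instead. It stores the data in MongoDB (you need to install MongoDB locally and install pymongo in Python).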

Parameter settings

run.py:

  • EXPIRE_TIME: refresh interval, in seconds
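
To make the role of EXPIRE_TIME concrete, here is a minimal sketch of an in-memory cache that re-fetches its data once the interval has passed. It is not the code in run.py: the `ExpiringCache` class, the `fetch` callback, and the 30-minute value are all assumptions for illustration.

```python
import time

EXPIRE_TIME = 60 * 30  # seconds; hypothetical value, not taken from run.py


class ExpiringCache:
    """Keep fetched data in memory and refresh it after expire_time seconds."""

    def __init__(self, fetch, expire_time=EXPIRE_TIME, clock=time.time):
        self._fetch = fetch            # callable that returns fresh data
        self._expire_time = expire_time
        self._clock = clock            # injectable so the logic can be tested
        self._data = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._expire_time
        )
        if expired:
            self._data = self._fetch()
            self._fetched_at = now
        return self._data
```

A Flask view could call `cache.get()` on each request, so the scraper only runs again once the data is older than EXPIRE_TIME. Data held this way is lost whenever the process restarts, which is one reason a MongoDB-backed version can be more robust.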

info.py:

  • FETCH_URLS: links of the Douban groups to scrape
  • PAGE_NUM: number of pages to scrape per group
  • PAUSE_SECOND: pause between fetching each link
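
The three info.py settings above could interact roughly as in the sketch below. Everything beyond the three names is an assumption: the example group URL, the `discussion?start=N` pagination with 25 topics per page, the reading of PAUSE_SECOND as seconds, and the `build_page_urls` and `crawl` helpers are illustrative and not taken from the project's code.

```python
import time

# Hypothetical values; the real ones live in info.py.
FETCH_URLS = ["https://www.douban.com/group/example/"]
PAGE_NUM = 2
PAUSE_SECOND = 1
TOPICS_PER_PAGE = 25  # assumed page size


def build_page_urls(fetch_urls, page_num, per_page=TOPICS_PER_PAGE):
    """Expand each group link into page_num paginated list URLs."""
    urls = []
    for base in fetch_urls:
        base = base.rstrip("/") + "/"
        for page in range(page_num):
            urls.append("{}discussion?start={}".format(base, page * per_page))
    return urls


def crawl(fetch_page, fetch_urls=FETCH_URLS, page_num=PAGE_NUM,
          pause_second=PAUSE_SECOND, sleep=time.sleep):
    """Fetch every page in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(build_page_urls(fetch_urls, page_num)):
        if index > 0:
            sleep(pause_second)  # be polite to the server between requests
        pages.append(fetch_page(url))
    return pages
```

Passing `fetch_page` and `sleep` as arguments keeps the loop independent of any particular HTTP library and makes it easy to try out without touching the network.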

scrapy_douban's People

Contributors

dingk-r, hh54188, invegas

scrapy_douban's Issues

Please add a few more groups!

北京租房豆瓣
北京租房(密探)
北京租房!找伴儿一起...
北京租房房东联盟(中介...
北京租房(非中介)
北京租房合租房

Extra 's' after the colon on line 9 of mongo.py

With the environment already set up, running python run_mongo.py directly fails with the following error:

line 10
    print "------->Online"
    ^
IndentationError: unexpected indent

Removing the s after the : on line 9 fixes it.
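
The effect described in this issue can be reproduced in isolation. In the sketch below, the `if True:` statement is a hypothetical stand-in, since the actual content of line 9 in mongo.py is not quoted in the issue, and the snippets use Python 3 print syntax. A stray character after the colon turns the line into a complete one-line statement, so the indented line that follows is reported as an unexpected indent.

```python
def check_syntax(source):
    """Compile source without running it; return (error name, message) or None."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return type(err).__name__, err.msg
    return None


# Hypothetical stand-ins for lines 9 and 10 of mongo.py.
BROKEN = "if True:s\n    print('------->Online')\n"
FIXED = "if True:\n    print('------->Online')\n"
```

`check_syntax(BROKEN)` reports an IndentationError, while `check_syntax(FIXED)` returns None because the block now parses normally.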
