meocean / douban-spider

This project forked from yeungsk/douban-spider

A Douban movie crawler built on the Scrapy framework

Home Page: https://yeungsk.github.io/2018/10/08/%E7%88%AC%E8%99%AB%E5%AE%9E%E6%88%981/

Python 100.00%

douban-spider's Introduction

Douban Movie Crawler

Crawls Douban movies using the Scrapy framework.

Project Overview

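On Douban's movie and TV selection page, the crawler filters by region: mainland China, Hong Kong, and ** (other regions can be substituted). For each region it constructs Ajax requests to obtain movie ids, builds each movie's link from its id, and parses the movie page to extract detailed data such as title, year, director, cast, and genre. For details, see my blog post: 爬虫实战(一)利用scrapy爬取豆瓣华语电影 (Crawler in Practice (1): Crawling Douban Chinese-Language Films with Scrapy).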
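
The flow described above (list request, then ids, then movie links) might be sketched as follows. This is an illustration under stated assumptions, not the project's actual spider: the `LIST_API` endpoint, its query parameter names, the `SUBJECT_URL` pattern, and the `data`/`id` keys of the response are all assumed here and should be checked against the spider's source.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint, URL pattern, and parameter names (not taken from the project).
LIST_API = "https://movie.douban.com/j/new_search_subjects"
SUBJECT_URL = "https://movie.douban.com/subject/{id}/"


def build_list_url(region, start=0):
    """Build the Ajax list URL for one region and page offset."""
    params = {"tags": "电影", "countries": region, "start": start}
    return LIST_API + "?" + urlencode(params)


def extract_movie_urls(response_text):
    """Read movie ids from a JSON list response and build movie-page URLs."""
    payload = json.loads(response_text)
    return [SUBJECT_URL.format(id=item["id"]) for item in payload.get("data", [])]


# Hypothetical response body, shaped like the assumed API output.
sample = json.dumps({"data": [{"id": "24719063", "title": "烈日灼心"}]})
print(build_list_url("香港", start=20))
print(extract_movie_urls(sample))
```

In a Scrapy spider, each URL returned by `extract_movie_urls` would typically be yielded as a new request whose callback parses the movie page.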

Installation

Install Python

Python 3.5 or later is required.

Install Redis and MongoDB

After installation, start the Redis and MongoDB services.
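
Before running the crawler, it can help to confirm that both services are actually listening. The following is a minimal sketch that assumes both run on the local machine on their default ports (6379 for Redis, 27017 for MongoDB); `is_port_open` is a helper name introduced here, not part of the project.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports; adjust if the services are configured differently.
for name, port in (("Redis", 6379), ("MongoDB", 27017)):
    status = "up" if is_port_open("127.0.0.1", port) else "not reachable"
    print(name, status)
```

An open port only shows that something is listening; it does not verify the Redis password or the MongoDB configuration.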

Install dependencies

pip3 install -r requirements.txt

Running

Configure the proxy pool

cd ProxyPool
cd proxypool

Enter the proxypool directory inside ProxyPool and edit settings.py.

PASSWORD is the Redis password; if Redis has no password, set it to None.
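
As a rough illustration, the relevant part of settings.py might look like the fragment below. Only `PASSWORD` is named in this README; `HOST` and `PORT` are assumed names shown for context and may differ in the actual file.

```python
# settings.py (sketch): only PASSWORD is confirmed by this README.
HOST = "localhost"   # assumed name: Redis host
PORT = 6379          # assumed name: Redis default port
PASSWORD = None      # None when Redis has no password, otherwise a string
```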

The default proxies come from free proxy sources. To add more sources, add functions whose names start with crawl_ to the Crawler class in crawler.py.
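
The naming rule can be illustrated with a small sketch. How the project actually discovers and calls these functions is not shown in this README, so the code below simply assumes discovery by name prefix and assumes each function yields `host:port` strings; the method names and proxy addresses are hypothetical.

```python
class Crawler:
    """Sketch of a crawler whose proxy sources are methods named crawl_*."""

    def crawl_example_source(self):
        # Hypothetical source: a real one would fetch and parse a proxy list page.
        for proxy in ("10.0.0.1:8080", "10.0.0.2:3128"):
            yield proxy

    def crawl_another_source(self):
        # Second hypothetical source, picked up only because of its name.
        yield "10.0.0.3:80"

    def get_proxies(self):
        """Call every crawl_* method and collect the proxies they yield."""
        proxies = []
        for name in sorted(dir(self)):
            if name.startswith("crawl_"):
                proxies.extend(getattr(self, name)())
        return proxies


print(Crawler().get_proxies())
```

With this design, adding a proxy source needs no registration step: a new method that follows the crawl_ naming rule is enough.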

Start the proxy pool and API

cd ProxyPool
python3 run.py

Run Scrapy

cd douban
python3 run.py

Results

Movie data is stored in the film collection of the douban database in MongoDB. A sample record:

{
    "_id" : ObjectId("5bb96351fd21815bdbe90124"),
    "id" : "24719063",
    "title" : "烈日灼心",
    "year" : "2015",
    "region" : [ "**大陆"],
    "language" : [ "汉语普通话"],
    "director" : [ "曹保平"],
    "type" : [ "剧情", "悬疑", "犯罪"],
    "actor" : [ "邓超", "段奕宏", "郭涛", "王珞丹", "吕颂贤", "高虎", "白柳汐", "杜志国"],
    "date" : [ "2015-08-27(**大陆)", "2015-06-15(上海电影节)"],
    "runtime" : [ "139分钟"],
    "rate" : "7.9",
    "rating_num" : "290209"
}
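
Note that numeric fields such as year, rate, and rating_num are stored as strings in the sample record. A minimal sketch of converting them after reading a record is shown below; `normalize` and the `runtime_minutes` key are names introduced here for illustration and are not part of the project.

```python
import re


def normalize(record):
    """Return a copy of the record with numeric fields converted from strings."""
    out = dict(record)
    out["year"] = int(record["year"])
    out["rate"] = float(record["rate"])
    out["rating_num"] = int(record["rating_num"])
    # "139分钟" -> 139; None when the text has no leading number.
    minutes = []
    for text in record.get("runtime", []):
        match = re.match(r"\d+", text)
        minutes.append(int(match.group()) if match else None)
    out["runtime_minutes"] = minutes
    return out


# Subset of the sample record above.
sample = {
    "id": "24719063",
    "title": "烈日灼心",
    "year": "2015",
    "runtime": ["139分钟"],
    "rate": "7.9",
    "rating_num": "290209",
}
print(normalize(sample))
```

Converting these fields makes it possible to sort or filter by rating and vote count numerically instead of lexicographically.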
