
zhihuuser's Introduction

Crawling Zhihu user information with Scrapy

Usage

  1. Clone the project

    git clone https://github.com/Annihilater/zhihuuser.git
  2. Install the dependencies

    pip install -r requirements.txt
  3. Start a MongoDB instance on the local machine

  4. Run the spider

    scrapy crawl zhihu
  5. The crawled user data then shows up in a MongoDB GUI client

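Approach

  1. Pick a starting user: a "big V" (a high-profile account with many followers)
  2. Fetch that user's follower and following lists through the Zhihu API
  3. Fetch the user details: for every user in those lists, request the detailed profile through the Zhihu API
  4. Fetch each of those users' follower and following lists, then fetch the details of every user in them, which makes the crawl recursive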
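The four steps above amount to a graph traversal over users. The following is a minimal sketch of that idea, assuming a hypothetical in-memory graph (`FAKE_GRAPH`) in place of the Zhihu API; the names `crawl` and `fake_fetch` are illustrative and are not taken from the project.

```python
from collections import deque

# Hypothetical stand-in for the Zhihu API: url_token -> (following, followers).
FAKE_GRAPH = {
    "alice": (["bob"], ["carol"]),
    "bob": (["alice", "carol"], []),
    "carol": ([], ["alice", "dave"]),
    "dave": (["carol"], []),
}


def fake_fetch(token):
    """Return the (following, followers) lists for one user."""
    return FAKE_GRAPH.get(token, ([], []))


def crawl(start_token, fetch_neighbours):
    """Visit every user reachable from start_token exactly once."""
    seen = {start_token}
    queue = deque([start_token])
    visited_order = []
    while queue:
        token = queue.popleft()
        visited_order.append(token)  # a real crawler would fetch the profile here
        following, followers = fetch_neighbours(token)
        for neighbour in following + followers:
            if neighbour not in seen:  # skip users reached earlier by another path
                seen.add(neighbour)
                queue.append(neighbour)
    return visited_order


order = crawl("alice", fake_fetch)
print(order)
```

In a Scrapy spider the explicit `seen` set is usually unnecessary, because Scrapy's scheduler filters out duplicate requests by default.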

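Implementation

Spider

  • start_requests method
    1. Request the starting user's details
    2. Request that user's following list
    3. Request that user's follower list
  • parse_user method
    1. Parse the user's details and return the item to the item pipeline
    2. Request the user's following list for the next level of recursion
    3. Request the user's follower list for the next level of recursion
  • parse_follows method
    1. Parse the following or follower list to extract each url_token, then use the url_token to request that user's details, continuing the recursion
    2. Follow the pagination of the following or follower list recursively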
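The parsing side of these callbacks can be sketched without Scrapy. The example below is an illustration under stated assumptions, not the project's code: `API_ROOT` is a placeholder, the `followees`/`followers` path segments and the `offset`/`limit` parameters are assumed, and the response shape (`data`, `paging.is_end`, `paging.next`) is assumed as well. Only `url_token` comes from the text above.

```python
import json
from urllib.parse import urlencode

# Placeholder root; the real endpoint layout is an assumption, not from the project.
API_ROOT = "https://example.com/api/members"


def build_urls(url_token, offset=0, limit=20):
    """Return the three URLs requested for one user: details, following, followers."""
    page = urlencode({"offset": offset, "limit": limit})
    return {
        "user": f"{API_ROOT}/{url_token}",
        "follows": f"{API_ROOT}/{url_token}/followees?{page}",
        "followers": f"{API_ROOT}/{url_token}/followers?{page}",
    }


def parse_follows(body):
    """Extract the url_tokens on one list page and the next page URL, if any."""
    payload = json.loads(body)
    tokens = [entry["url_token"] for entry in payload.get("data", []) if "url_token" in entry]
    paging = payload.get("paging", {})
    # Stop paginating when the (assumed) is_end flag is set or missing.
    next_url = None if paging.get("is_end", True) else paging.get("next")
    return tokens, next_url


sample = json.dumps({
    "data": [{"url_token": "user-a"}, {"url_token": "user-b"}],
    "paging": {"is_end": False, "next": f"{API_ROOT}/start/followees?offset=20&limit=20"},
})
tokens, next_url = parse_follows(sample)
print(tokens, next_url)
```

In the spider, each URL from `build_urls` would be wrapped in a `scrapy.Request` whose callback is `parse_user` or `parse_follows`, and a non-empty `next_url` would be requested again with `parse_follows` as the callback.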

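Data storage

User information is stored in MongoDB using the official example code. The only change is that the insert operation was replaced with an update operation: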

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

was changed to

    def process_item(self, item, spider):
        # Upsert keyed on url_token: update the stored user if present, insert otherwise
        self.db[self.collection_name].update_one({'url_token': item['url_token']}, {'$set': item}, upsert=True)
        return item

Using the update_one call:

It first looks up the item in the database by url_token. If a match is found, the stored item is updated; if not, the item is inserted.

update_one(filter, update, upsert=False, bypass_document_validation=False, collation=None, array_filters=None, session=None)

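Parameters:

filter: the query condition; {'url_token': item['url_token']} matches documents by url_token

update: the data to write; {'$set': item} sets the item's fields on the matched document

upsert: whether to insert a new document when the query matches nothing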
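The effect of upsert can be illustrated with a toy in-memory model. This sketch does not use pymongo: a plain list of dicts stands in for the collection, `upsert_by_key` is a hypothetical helper, and the fields `name` and `follower_count` are made up for the example. It only mirrors the behaviour described above, where a user reached a second time is updated rather than duplicated.

```python
def upsert_by_key(collection, key, item):
    """Toy model of update_one({key: item[key]}, {'$set': item}, upsert=True).

    `collection` is a plain list of dicts standing in for a MongoDB collection.
    Returns True when a new document was inserted, False when one was updated.
    """
    for document in collection:
        if document.get(key) == item[key]:
            document.update(item)  # like '$set': overwrite given fields, keep the rest
            return False
    collection.append(dict(item))  # no match: the upsert inserts a new document
    return True


users = []
first = upsert_by_key(users, "url_token", {"url_token": "user-a", "name": "A"})
second = upsert_by_key(users, "url_token", {"url_token": "user-a", "name": "A2", "follower_count": 3})
print(first, second, users)
```

With a plain insert, the second call would have stored a second document for the same user. Since the recursive crawl can reach the same user through many follower and following lists, keying the write on url_token keeps one document per user.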
