nciefeiniu / totalstsation_scrapy

A general-purpose crawler based on scrapy-redis and scrapy-splash (including data loaded through AJAX requests)

Python 96.86% Dockerfile 1.49% Shell 1.66%
rediscrawlspider splash scrapy-redis scrapy-splash scrapy-redis-splash

totalstsation_scrapy's Introduction

General-purpose website crawler with scrapy-splash and scrapy-redis

Tight integration of scrapy, redis, and splash.

Main frameworks:

  • scrapy
  • scrapy-splash
  • scrapy-redis

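Splash can be deployed on multiple nodes behind a load balancer.

The scrapy crawler should also be deployed on multiple nodes; crawling an entire site from a single machine is too slow.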
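To make the multi-node setup more concrete, here is a minimal sketch of what the relevant part of a Scrapy `settings.py` might look like when scrapy-redis and scrapy-splash are combined. It is not taken from this project: the Redis URL and the Splash URL are placeholder assumptions, and the project's real settings may differ. The setting names and middleware priorities follow the scrapy-redis and scrapy-splash READMEs.

```python
# settings.py (illustrative sketch, not this project's actual file)

# --- scrapy-redis: every crawler node shares one queue stored in Redis ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Shared request de-duplication across nodes. scrapy-splash recommends its own
# SplashAwareDupeFilter; combining both needs a custom filter, which is out of
# scope for this sketch.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue in Redis between runs so a restarted node can resume.
SCHEDULER_PERSIST = True
# Assumption: placeholder address, replace with your Redis host and port.
REDIS_URL = "redis://127.0.0.1:6379"

# --- scrapy-splash: render pages (including AJAX content) through Splash ---
# Assumption: a single load-balanced endpoint in front of the Splash nodes.
SPLASH_URL = "http://127.0.0.1:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
```

With settings along these lines, each additional crawler container only needs network access to the same Redis instance and the same Splash endpoint.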


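Installing splash

For installation instructions, see the official documentation.


Database table structure

The table structure is defined in models.py in the project.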

python3 models.py

Distributed splash in a test environment

Install nginx on the host

apt install nginx -y
# or
yum install nginx -y

Start the splash containers

sudo ./create_splash.sh

Edit the nginx configuration file (/etc/nginx/nginx.conf) and add the following inside the http block:

upstream splash {
    least_conn;
    server 127.0.0.1:8051;
    server 127.0.0.1:8052;
    server 127.0.0.1:8053;
    server 127.0.0.1:8054;
    server 127.0.0.1:8055;
}
server {
    listen 8050;
    location / {
        proxy_pass http://splash;
        proxy_connect_timeout 300;
        proxy_read_timeout 400;
    }
}
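The `least_conn` directive tells nginx to send each new request to the upstream server that currently has the fewest active connections. The idea can be sketched in a few lines of Python. This is a simplified model for illustration only: the function names are hypothetical, ties are broken by listing order, and server weights are ignored, whereas nginx applies its own tie-breaking.

```python
from typing import Dict


def pick_least_conn(active: Dict[str, int]) -> str:
    """Return the backend with the fewest active connections.

    Ties go to the backend listed first (dicts keep insertion order).
    """
    return min(active, key=active.get)


def dispatch(active: Dict[str, int]) -> str:
    """Choose a backend for a new request and count the connection."""
    chosen = pick_least_conn(active)
    active[chosen] += 1
    return chosen


def release(active: Dict[str, int], backend: str) -> None:
    """Mark one connection on `backend` as finished."""
    active[backend] -= 1


# The five splash backends from the upstream block, all idle at the start.
backends = {f"127.0.0.1:{port}": 0 for port in range(8051, 8056)}

# Five slow render requests arrive before any of them finishes,
# so each one lands on a different splash instance.
first_round = [dispatch(backends) for _ in range(5)]
```

Because page rendering times vary widely, spreading requests by open connections rather than strict rotation helps keep a single slow page from piling work onto one instance.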

Reload the nginx configuration

nginx -s reload

Starting the crawler with docker in a test environment

git clone http://git.epmap.org/tao.liu/totalstation_spider.git

cd totalstation_spider

Edit the .env_sample file to match your environment.

# Build the image
docker build -t totalspider:v1 .

# Start the docker container
docker run -itd --name xxx totalspider:v1

Then add the sites to be crawled to redis:

# Connect to redis
redis-cli -h xx.xx.xx.xx -p xxxx

# Push a start URL into redis
lpush waiting_for_crawl:start_urls http://www.gov.cn

The crawler now starts crawling the whole site.
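When many sites need to be queued, the same `lpush` can be issued from Python instead of `redis-cli`. The sketch below is an illustration, not part of the project: `enqueue_start_urls` is a hypothetical helper, and it accepts any client object that exposes `lpush(name, *values)`, which is the signature provided by redis-py's `redis.Redis`.

```python
from typing import Iterable

# The list key used in the redis-cli example above.
START_URLS_KEY = "waiting_for_crawl:start_urls"


def enqueue_start_urls(client, urls: Iterable[str], key: str = START_URLS_KEY) -> int:
    """Push the given start URLs onto the crawl queue.

    `client` is any object with an `lpush(name, *values)` method,
    for example a redis-py `redis.Redis` instance.
    Returns the value reported by `lpush`, or 0 if nothing was pushed.
    """
    # Drop empty entries and surrounding whitespace before pushing.
    cleaned = [u.strip() for u in urls if u and u.strip()]
    if not cleaned:
        return 0
    return client.lpush(key, *cleaned)


# Example usage with redis-py (assumes `pip install redis` and a reachable
# server; host and port are placeholders, as in the redis-cli example):
#
#   import redis
#   client = redis.Redis(host="xx.xx.xx.xx", port=6379)
#   enqueue_start_urls(client, ["http://www.gov.cn"])
```

Passing the client in as an argument keeps the helper independent of connection details, so the same function works against a test double or a real server.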

TODO

  • Deduplicate the data returned from pages
  • Detect similar pages
  • Monitor pages for updates
