qbaoma / scrapy-qiniu

This project forked from haizi-zh/scrapy-qiniu

A Pipeline extension for Scrapy that stores web resources (files, images, and so on) on Qiniu.

License: Apache License 2.0

Python 100.00%

scrapy-qiniu

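Scrapy's media pipeline mechanism makes it easy to download static resources (files, images, and so on) to local storage and then process them. scrapy-qiniu extends this mechanism so that resources can be stored on Qiniu Cloud Storage instead. It also provides the following features: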

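  • Caching, which avoids downloading the same static resource more than once
  • Fetch mode: the Qiniu servers download the resource on your behalf, so it does not have to be downloaded to the crawler's host first and then uploaded to Qiniu, as the default FilesPipeline would require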

For details on Scrapy's media pipeline mechanism, see the Scrapy documentation.

Installation

pip install scrapy-qiniu

Usage

Getting started

First, enable this pipeline in your settings:

ITEM_PIPELINES = {
  'scrapy_qiniu.QiniuPipeline': 10
}

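Note that QiniuPipeline should preferably be given a higher priority than ordinary pipelines (in Scrapy, pipelines with lower numbers run first).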

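Then, when running the spider, set the following settings:

  • PIPELINE_QINIU_ENABLED: whether to enable this pipeline (setting it to 1 enables the pipeline)
  • PIPELINE_QINIU_BUCKET: the bucket in which to store the resources
  • PIPELINE_QINIU_KEY_PREFIX: the key prefix; a resource's key in Qiniu is named prefix + hash(request.url)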
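
Put together, a project's settings.py might look like the sketch below. The bucket name and key prefix are placeholder values chosen for illustration, and any Qiniu credential settings the pipeline may require are not covered here.

```python
# settings.py (sketch): the bucket and prefix values below are placeholders

ITEM_PIPELINES = {
    # A low number so that QiniuPipeline runs before ordinary pipelines
    'scrapy_qiniu.QiniuPipeline': 10,
}

PIPELINE_QINIU_ENABLED = 1            # 1 enables the pipeline
PIPELINE_QINIU_BUCKET = 'my-bucket'   # hypothetical bucket name
PIPELINE_QINIU_KEY_PREFIX = 'crawl/'  # hypothetical prefix; keys become prefix + hash(request.url)
```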

Finally, when a page has been scraped and you are building the item, suppose you need to fetch these two URLs:

http://www.foo.com/bar-1.jpg
http://www.foo.com/bar-2.jpg

You can do this:

item['file_urls'] = ['http://www.foo.com/bar-1.jpg', 'http://www.foo.com/bar-2.jpg']
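
In a spider callback, the URLs usually come from the page being parsed. The following is a minimal sketch, assuming the links may be relative and that a plain dict is used as the item; build_item is a hypothetical helper, not part of scrapy-qiniu.

```python
from urllib.parse import urljoin


def build_item(page_url, resource_links):
    """Return a dict item whose file_urls holds absolute resource URLs."""
    item = {}
    # Resolve each link against the page URL so relative paths become absolute
    item['file_urls'] = [urljoin(page_url, link) for link in resource_links]
    return item


item = build_item('http://www.foo.com/index.html',
                  ['/bar-1.jpg', 'http://www.foo.com/bar-2.jpg'])
print(item['file_urls'])
```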

QiniuPipeline will then automatically upload these two resources to Qiniu and, in the returned item, add the upload results to the files field:

{
  "key": "your_key",
  "bucket": "your_bucket",
  "checksum": "FpSAj-vs1tGIcQ5qF6PsJku2_sPa",
  "url": "http://www.foo.com/bar.jpg",
  "path": "the_path_string"
}
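
A later pipeline stage or exporter can read these results back from the item. The sketch below assumes that files is a list of dicts shaped like the example above; uploaded_locations is a hypothetical helper, not part of scrapy-qiniu.

```python
def uploaded_locations(item):
    """Map each source URL to the (bucket, key) pair it was stored under."""
    locations = {}
    # Treat a missing files field as "nothing was uploaded"
    for result in item.get('files', []):
        locations[result['url']] = (result['bucket'], result['key'])
    return locations


item = {
    'file_urls': ['http://www.foo.com/bar.jpg'],
    'files': [{
        'key': 'your_key',
        'bucket': 'your_bucket',
        'checksum': 'FpSAj-vs1tGIcQ5qF6PsJku2_sPa',
        'url': 'http://www.foo.com/bar.jpg',
        'path': 'the_path_string',
    }],
}
print(uploaded_locations(item))
```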

Advanced usage

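If you need finer-grained control over how static resources are uploaded, you can set the qiniu_key_generator attribute on the item. This is a function object that takes a URL and returns the bucket name and the key. QiniuPipeline uses this result when downloading and storing the static resource. For example: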

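import hashlib

def func(url):
    # md5() requires bytes on Python 3, so encode the URL first
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()
    return {'bucket': 'scrapy', 'key': 'key_name/%s' % digest}

item['qiniu_key_generator'] = func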

In this case, the resources listed in the item's file_urls field are fetched by the Qiniu servers into the scrapy bucket, with keys of the form key_name/{md5_hash}.
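
When several spiders need different buckets or prefixes, the generator can be built by a small factory. This is a sketch only: make_key_generator and the bucket and prefix values passed to it are hypothetical, and it assumes Python 3, where the URL must be encoded to bytes before hashing.

```python
import hashlib


def make_key_generator(bucket, prefix):
    """Build a qiniu_key_generator-style function bound to a bucket and key prefix."""
    def generate(url):
        # Hash the URL so the same resource always maps to the same key
        digest = hashlib.md5(url.encode('utf-8')).hexdigest()
        return {'bucket': bucket, 'key': '%s/%s' % (prefix, digest)}
    return generate


generator = make_key_generator('scrapy', 'key_name')
item = {'file_urls': ['http://www.foo.com/bar-1.jpg']}
item['qiniu_key_generator'] = generator

print(generator('http://www.foo.com/bar-1.jpg'))
```

Because the returned function closes over bucket and prefix, each spider can attach its own generator without defining a new top-level function.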
