haizi-zh / scrapy-qiniu

A Scrapy pipeline extension that stores web resources (files, images, etc.) on Qiniu

License: Apache License 2.0

Python 100.00%

scrapy-qiniu's Introduction

scrapy-qiniu

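Scrapy's media pipeline mechanism makes it easy to download static resources (files, images, etc.) to local storage and then process them. scrapy-qiniu extends this mechanism so that resources can be stored on Qiniu Cloud Storage instead. It also provides the following features:

  • Caching, which avoids downloading the same static resource more than once
  • Fetch mode: the Qiniu server downloads the resource on your behalf, so the file does not have to be downloaded to the crawler's host first and then uploaded to Qiniu, as the default FilesPipeline would require

For details on Scrapy's media pipeline mechanism, see the Scrapy documentation.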

Installation

pip install scrapy-qiniu

Usage

Getting started

First, enable the pipeline in your settings:

ITEM_PIPELINES = {
  'scrapy_qiniu.QiniuPipeline': 10
}

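Note that QiniuPipeline should preferably be given a higher priority than ordinary pipelines (in Scrapy, pipelines with lower numbers run first).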

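Then, when running the spider, set the following settings:

  • PIPELINE_QINIU_ENABLED: whether to enable this pipeline (setting it to 1 enables the pipeline)
  • PIPELINE_QINIU_BUCKET: the bucket in which resources are stored
  • PIPELINE_QINIU_KEY_PREFIX: the key prefix; a resource's key in Qiniu is named prefix + hash(request.url)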
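As a rough sketch, the three settings could sit next to ITEM_PIPELINES in a project's settings.py. The bucket name and prefix below are placeholders, not defaults of scrapy-qiniu. The illustrative_key helper is hypothetical and would not live in settings.py; it only shows the shape of prefix + hash(request.url), and it assumes a SHA-1 hex digest because the README does not say which hash the pipeline uses.

```python
import hashlib

# Placeholder values for illustration only.
ITEM_PIPELINES = {
    'scrapy_qiniu.QiniuPipeline': 10,
}
PIPELINE_QINIU_ENABLED = 1             # 1 enables the pipeline
PIPELINE_QINIU_BUCKET = 'my-bucket'    # hypothetical bucket name
PIPELINE_QINIU_KEY_PREFIX = 'crawl/'   # hypothetical key prefix


def illustrative_key(url, prefix=PIPELINE_QINIU_KEY_PREFIX):
    """Show the shape of prefix + hash(request.url).

    ASSUMPTION: a SHA-1 hex digest stands in for the hash; the hash
    function actually used by scrapy-qiniu is not stated in this README.
    """
    return prefix + hashlib.sha1(url.encode('utf-8')).hexdigest()
```

Under these assumptions, every stored key starts with crawl/ followed by a 40-character hex digest of the request URL.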

Finally, when building an item from a scraped page, suppose you need to fetch these two URLs:

http://www.foo.com/bar-1.jpg
http://www.foo.com/bar-2.jpg

You can do this:

item['file_urls'] = ['http://www.foo.com/bar-1.jpg', 'http://www.foo.com/bar-2.jpg']
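Links scraped from a page are often relative, while a remote fetch presumably needs absolute URLs. A minimal sketch of building file_urls from mixed links follows; build_item is a hypothetical helper, and it assumes the pipeline accepts a plain dict as the item.

```python
from urllib.parse import urljoin


def build_item(page_url, hrefs):
    """Return a dict item whose file_urls are absolute URLs.

    build_item is a hypothetical helper, not part of scrapy-qiniu.
    """
    # urljoin resolves relative links against the page URL and leaves
    # already-absolute links unchanged.
    return {'file_urls': [urljoin(page_url, href) for href in hrefs]}


item = build_item(
    'http://www.foo.com/gallery/',
    ['/bar-1.jpg', 'http://www.foo.com/bar-2.jpg'],
)
```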

QiniuPipeline then automatically uploads both resources to the Qiniu server and, in the returned item, adds the upload results to the files field:

{
  "key": "your_key",
  "bucket": "your_bucket",
  "checksum": "FpSAj-vs1tGIcQ5qF6PsJku2_sPa",
  "url": "http://www.foo.com/bar.jpg",
  "path": "the_path_string"
}
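A later pipeline can read these results. The sketch below assumes that files holds a list with one dict per URL, shaped like the example above. The names uploaded_locations and LogQiniuResultsPipeline are hypothetical.

```python
def uploaded_locations(item):
    """Return (bucket, key) pairs from the item's files field.

    ASSUMPTION: item['files'] is a list of dicts, each with 'bucket'
    and 'key' entries, as in the example result above.
    """
    return [(entry['bucket'], entry['key']) for entry in item.get('files', [])]


class LogQiniuResultsPipeline(object):
    """Hypothetical pipeline meant to run after QiniuPipeline."""

    def process_item(self, item, spider):
        for bucket, key in uploaded_locations(item):
            spider.logger.info('stored %s in bucket %s', key, bucket)
        return item
```

To run after QiniuPipeline, such a pipeline would be registered in ITEM_PIPELINES with a larger number than the 10 used above.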

Advanced usage

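For finer-grained control over how static resources are uploaded, you can set the qiniu_key_generator attribute on the item. This is a function object that takes a URL and returns the bucket name and the key. QiniuPipeline uses the result to download and store the static resource. For example: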

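import hashlib

def func(url):
    # md5 needs bytes; encode() keeps this working on Python 3
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()
    return {'bucket': 'scrapy', 'key': 'key_name/%s' % digest}

item['qiniu_key_generator'] = func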

In this case, the resources listed in the item's file_urls field are fetched by the Qiniu server into the bucket named scrapy, with keys of the form key_name/{md5_hash}.
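If several spiders need different buckets or prefixes, the generator can be produced by a small factory. This is only a sketch: make_key_generator is a hypothetical helper, and grouping keys by host name is an illustrative choice, not something scrapy-qiniu prescribes.

```python
import hashlib
from urllib.parse import urlparse


def make_key_generator(bucket, prefix):
    """Build a function in the qiniu_key_generator shape (hypothetical helper)."""
    def generate(url):
        host = urlparse(url).netloc
        digest = hashlib.md5(url.encode('utf-8')).hexdigest()
        # Key layout: {prefix}/{host}/{md5_hash}
        return {'bucket': bucket, 'key': '%s/%s/%s' % (prefix, host, digest)}
    return generate


item = {'file_urls': ['http://www.foo.com/bar-1.jpg']}
item['qiniu_key_generator'] = make_key_generator('scrapy', 'key_name')
```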

scrapy-qiniu's People

Contributors

haizi-zh

scrapy-qiniu's Issues

Can it be made compatible with Python 3?

Here is the error log:

2018-09-06 15:38:56 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Anaconda3\envs\sunrised-env\lib\site-packages\scrapy\pipelines\media.py", line 68, in from_crawler
    pipe = cls.from_settings(crawler.settings)
  File "C:\Anaconda3\envs\sunrised-env\lib\site-packages\scrapy_qiniu\impl.py", line 164, in from_settings
    cls.EXPIRES = settings.getint('FILES_EXPIRES', sys.maxint)
AttributeError: module 'sys' has no attribute 'maxint'

I changed it to sys.maxsize locally, but there should be a better way to stay compatible with both Python 2 and 3.
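One possible approach, sketched here rather than taken from the project: sys.maxsize exists on both Python 2.6+ and Python 3, so a fallback lookup works on either version. MAX_INT is a hypothetical name.

```python
import sys

# sys.maxint was removed in Python 3; sys.maxsize is available on
# Python 2.6+ and Python 3, so fall back to it when maxint is missing.
MAX_INT = getattr(sys, 'maxint', sys.maxsize)

# In from_settings, this could replace the failing default, e.g.:
#     cls.EXPIRES = settings.getint('FILES_EXPIRES', MAX_INT)
```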

This configuration has no effect

I configured it as you described:
PIPELINE_QINIU_ENABLED
PIPELINE_QINIU_BUCKET =
PIPELINE_QINIU_KEY_PREFIX =

None of these settings has any effect.
