
movie-scrapy's Introduction

Environment setup

  • Tested on macOS 10.15.4
  • Python 3.6.10
  • pip install -r requirements.txt

Install MongoDB

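Running

Run the spiders

  • This project contains two spiders. The first, mtime, scrapes each movie's title, year, and page link and stores them in a local MongoDB. The second downloads images to a local folder.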

Run the data spider

  • scrapy crawl mtime
  • Note: first set ITEM_PIPELINES in the project settings to movie.pipelines.MoviePipeline.

[image: mobie_data]
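
In Scrapy, ITEM_PIPELINES is a dict that maps a pipeline's import path to an integer order. A minimal sketch of what the setting for the data spider might look like follows; the order value 300 is an assumption and is not taken from this project.

```python
# settings.py fragment for `scrapy crawl mtime` (illustrative sketch).
# Scrapy expects a dict: pipeline import path -> order (lower runs first).
ITEM_PIPELINES = {
    "movie.pipelines.MoviePipeline": 300,  # 300 is an assumed order value
}
```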

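Run the image spider

  • scrapy crawl mPicture
  • Note: first set ITEM_PIPELINES in the project settings to movie.pipelines.MyImagesPipeline.
  • IMAGES_STORE in the project settings is the directory where downloaded images are saved.
  • IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH filter images by size: images smaller than these dimensions are not saved locally.

[image: movie_pictures]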
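
The image spider settings described above can be sketched as follows. The setting names are Scrapy's own; the directory, the pixel thresholds, and the order value are assumptions chosen for illustration. The helper `passes_size_filter` is hypothetical and only illustrates the threshold idea; it is not part of Scrapy or of this project.

```python
# settings.py fragment for `scrapy crawl mPicture` (illustrative sketch).
ITEM_PIPELINES = {
    "movie.pipelines.MyImagesPipeline": 300,  # assumed order value
}
IMAGES_STORE = "./movie_pictures"  # assumed output directory
IMAGES_MIN_HEIGHT = 200            # assumed threshold, in pixels
IMAGES_MIN_WIDTH = 200             # assumed threshold, in pixels


def passes_size_filter(width, height):
    """Hypothetical helper: True when an image is not smaller than either minimum."""
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT
```

Because the two spiders need different ITEM_PIPELINES values, Scrapy's per-spider `custom_settings` attribute may be a way to avoid editing the settings file between runs. Whether that suits this project has not been tested here.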

More details

movie-scrapy's People

Contributors

danielyan86, dependabot[bot]

movie-scrapy's Issues

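Request: please state the required Scrapy version

I struggled to work out which Scrapy version this project needs. Running it directly fails with the following errors.

  • Error 1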
from scrapy.selector import HtmlXPathSelector
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'HtmlXPathSelector'

Scrapy removed HtmlXPathSelector as of 1.7.0, so I installed version 1.6.0.
Running the project still failed, though, with the following error:

  • Error 2
File "/mnt/d/www/Movie-scrapy/movie/movie/spiders/mtime_spider.py", line 2, in <module>
    from scrapy.spider import BaseSpider
ModuleNotFoundError: No module named 'scrapy.spider'

So please document the Scrapy version this project requires.
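
The two errors above suggest a simple pre-flight check on the installed Scrapy version. The sketch below uses only the standard library and encodes the cut-offs reported in this issue: HtmlXPathSelector is gone as of 1.7.0, and `scrapy.spider` fails to import on 1.6.0. Treating 1.6.0 as the first version without `scrapy.spider` is an assumption based on that single report, and both function names are hypothetical.

```python
# Cut-offs taken from the issue report (assumptions, not verified here).
NO_SCRAPY_SPIDER_FROM = (1, 6, 0)
NO_HTMLXPATHSELECTOR_FROM = (1, 7, 0)


def parse_version(text):
    """Parse '1.5.2' into (1, 5, 2), padding to three numeric parts."""
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def legacy_imports_expected(version_text):
    """Report which legacy imports used by this project should still work."""
    version = parse_version(version_text)
    return {
        "scrapy.spider.BaseSpider": version < NO_SCRAPY_SPIDER_FROM,
        "scrapy.selector.HtmlXPathSelector": version < NO_HTMLXPATHSELECTOR_FROM,
    }
```

For example, `legacy_imports_expected("1.6.0")` flags the `scrapy.spider` import as unavailable while still allowing HtmlXPathSelector, which matches the second error in this report.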

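Follow-up

When I tried writing my own Scrapy spider, mtime often returned HTTP 521, which I suspect is a JavaScript cookie check. In the end I had to fall back on Selenium. I wondered whether the maintainer's code has the same problem. I tested it, and it does: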

INFO: Ignoring response <521 http://movie.mtime.com/50004/>: HTTP status code is not handled or not allowed
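
The log line above shows Scrapy discarding the 521 response before the spider sees it. A minimal sketch, assuming the goal is only to let the callback receive and detect that response: Scrapy's HTTPERROR_ALLOWED_CODES setting can list the status code. This does not get past the suspected cookie check itself, and the helper name is hypothetical.

```python
# settings.py fragment (illustrative sketch).
# Let responses with status 521 reach the spider callback instead of
# being dropped by Scrapy's HTTP error handling.
HTTPERROR_ALLOWED_CODES = [521]


def looks_like_cookie_challenge(status):
    """Hypothetical helper: flag responses that may need a fallback such as Selenium."""
    return status in HTTPERROR_ALLOWED_CODES
```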

Addendum

After testing: under Python 3 you need to install Scrapy, Pillow, and so on, as follows:

pip3 install 'scrapy==1.5.2'
pip3 install pillow
pip3 install pymongo
