movie-scrapy's Introduction

Environment setup

Tested on macOS 10.15.4
Python version 3.6.10
pip install -r requirements.txt

Install MongoDB

Installation guide: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-os-x/
the configuration file (/usr/local/etc/mongod.conf)
the log directory path (/usr/local/var/log/mongodb)
the data directory path (/usr/local/var/mongodb)

Start the MongoDB service

brew services start [email protected]

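Run the spiders

This project has two spiders. The first, mtime, scrapes each movie's title, year, and page link and stores them in the local MongoDB. The second downloads images to a local folder.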

Run the data spider

scrapy crawl mtime
Note: before running, set ITEM_PIPELINES in the project settings to movie.pipelines.MoviePipeline.
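The source of movie.pipelines.MoviePipeline is not reproduced in this README, so the following is only a hypothetical sketch of what a Scrapy item pipeline that writes to a local MongoDB could look like. The class name MoviePipelineSketch, the database name "movie", the collection name "movies", the connection URI, and the optional collection argument are all assumptions made for illustration.

```python
class MoviePipelineSketch:
    """Hypothetical stand-in for movie.pipelines.MoviePipeline (not the real code)."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="movie", collection_name="movies", collection=None):
        # URI, database name, and collection name are assumed values.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        # A collection-like object can be injected, e.g. for testing.
        self.collection = collection
        self.client = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        if self.collection is None:
            import pymongo  # imported lazily so the class loads without pymongo
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item along.
        self.collection.insert_one(dict(item))
        return item
```

With a pipeline along these lines, each item yielded by the mtime spider would become one document in the collection.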

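Run the image spider

scrapy crawl mPicture
Note: before running, set ITEM_PIPELINES in the project settings to movie.pipelines.MyImagesPipeline.
IMAGES_STORE in the settings is the directory where downloaded images are saved.
IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH filter images by size: images smaller than these dimensions are not saved locally.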
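As a rough illustration, the settings for the image spider might look like the fragment below. The priority value 300, the ./images path, and the 110-pixel minimums are placeholder assumptions, not values taken from this repository; only the setting names and the pipeline path come from the notes above.

```python
# settings.py fragment for the image spider (all values are illustrative)

ITEM_PIPELINES = {
    # The number is the pipeline's order; 300 is an arbitrary choice.
    "movie.pipelines.MyImagesPipeline": 300,
}

# Directory where downloaded images are written (assumed path).
IMAGES_STORE = "./images"

# Images below either minimum dimension are dropped (assumed thresholds).
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```

To switch back to the data spider, the ITEM_PIPELINES key would be changed to movie.pipelines.MoviePipeline.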

movie-scrapy's Issues

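Request: please document the required Scrapy version

I am stuck on which Scrapy version this project needs. Running it as-is fails with the errors below.

Error 1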

from scrapy.selector import HtmlXPathSelector
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'HtmlXPathSelector'

HtmlXPathSelector was removed as of Scrapy 1.7.0, so I installed version 1.6.0.
Running the project still fails, this time with the following error.

Error 2

File "/mnt/d/www/Movie-scrapy/movie/movie/spiders/mtime_spider.py", line 2, in <module>
    from scrapy.spider import BaseSpider
ModuleNotFoundError: No module named 'scrapy.spider'

So I would appreciate it if the required Scrapy version number could be documented.
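Until the version is pinned, one possible workaround for the two import errors is to resolve each class from whichever location the installed Scrapy provides. The sketch below is an assumption about how that could be done, not the repository's code; first_available is a hypothetical helper. scrapy.spiders.Spider and scrapy.selector.Selector are the current names, while scrapy.spider.BaseSpider and HtmlXPathSelector are the legacy names from the tracebacks above.

```python
import importlib


def first_available(candidates):
    """Return the first importable attribute from (module, name) pairs, or None."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module path does not exist in this install
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# Prefer the current names, fall back to the legacy ones used by this project.
SpiderBase = first_available([
    ("scrapy.spiders", "Spider"),
    ("scrapy.spider", "BaseSpider"),
])
SelectorClass = first_available([
    ("scrapy.selector", "Selector"),
    ("scrapy.selector", "HtmlXPathSelector"),
])
```

Both names end up as None when Scrapy is not installed at all. Note that the legacy and current selector classes do not share an identical interface, so spider code written against HtmlXPathSelector may still need changes.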

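Follow-up

When I tried writing my own Scrapy crawler, mtime frequently returned status 521. My guess is that this is a JavaScript cookie check. In the end I had to fall back to selenium. I wondered whether the repository owner's code has the same problem, so I tested it, and it does: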

INFO: Ignoring response <521 http://movie.mtime.com/50004/>: HTTP status code is not handled or not allowed
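The log line above comes from Scrapy's HttpError middleware, which by default drops non-2xx responses before they reach the spider. A minimal settings sketch for letting 521 responses through is shown below. It does not get past whatever check produces the 521; it only allows a callback to inspect the response, for example to confirm the cookie-check guess.

```python
# settings.py fragment: pass 521 responses to spider callbacks
# instead of having the HttpError middleware discard them.
HTTPERROR_ALLOWED_CODES = [521]
```

The same effect can be limited to one spider by setting the handle_httpstatus_list attribute on that spider class.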

Addendum

After testing, under Python 3 the project needs scrapy, pillow, and pymongo installed, as follows:

pip3 install 'scrapy==1.5.2'
pip3 install pillow
pip3 install pymongo

danielyan86 / movie-scrapy