oumiga1314 / video_url_crawler_demo

This project forked from czs0x55aa/video_url_crawler_demo

A URL crawler for video sites. Currently only iQiyi is supported.

License: GNU General Public License v3.0

Python 100.00%

video_url_crawler_demo

A URL crawler for video sites that stores the crawled data in MongoDB.
Currently only iQiyi is supported.
The code is still being debugged.

Dependencies

Usage

Fill in the relevant configuration in settings.py.

1. Configure PhantomJS

Set the following two options to match your operating system and file paths.

PLATFORM = 'win'	# 'win' or 'linux' or 'mac'
PHANTOMJS_PATH = 'D:/Program Files/Anaconda2/Scripts/phantomjs.exe'
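A quick sanity check of these two values before launching can save a confusing startup failure. The following is a minimal sketch, not part of the project: the helper name check_phantomjs_settings and the example path are hypothetical, and it assumes Python 3.

```python
import os

# The three platform strings listed in the settings.py comment.
VALID_PLATFORMS = ('win', 'linux', 'mac')


def check_phantomjs_settings(platform, phantomjs_path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if platform not in VALID_PLATFORMS:
        problems.append('PLATFORM must be one of %s, got %r'
                        % (', '.join(VALID_PLATFORMS), platform))
    if not os.path.isfile(phantomjs_path):
        problems.append('PHANTOMJS_PATH does not point to a file: %s'
                        % phantomjs_path)
    return problems


# Illustrative values only; substitute the ones from your settings.py.
for problem in check_phantomjs_settings('linux', '/usr/local/bin/phantomjs'):
    print(problem)
```

The helper only reports problems and never raises, so it can be called from anywhere without changing how the crawler starts.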

2. Configure the database

If user authentication is enabled, set the 'auth' field to True and fill in the username and password.

# MongoDB database configuration
DATABASE = {
	'server': 'localhost',
	'port': 27017,
	'auth': False,
	'user': '',
	'passwd': '',
	'db': 'video_box',	# database name
	'collection': 'aiqiyi',
}
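To illustrate how the 'auth', 'user' and 'passwd' fields fit together, here is a minimal sketch that turns a DATABASE-style dict into a standard MongoDB connection URI. The function build_mongo_uri is hypothetical and is not taken from the project, whose actual connection code is not shown here. The sketch assumes Python 3.

```python
from urllib.parse import quote_plus


def build_mongo_uri(cfg):
    """Build a MongoDB connection URI from a DATABASE-style dict."""
    host = '%s:%d' % (cfg['server'], cfg['port'])
    if cfg.get('auth'):
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        credentials = '%s:%s@' % (quote_plus(cfg['user']),
                                  quote_plus(cfg['passwd']))
    else:
        credentials = ''
    return 'mongodb://%s%s/%s' % (credentials, host, cfg['db'])


DATABASE = {
    'server': 'localhost',
    'port': 27017,
    'auth': False,
    'user': '',
    'passwd': '',
    'db': 'video_box',
    'collection': 'aiqiyi',
}

print(build_mongo_uri(DATABASE))
```

A URI of this form can be passed to pymongo.MongoClient. Note that when credentials are present, the database named in the URI is also used as the authentication database unless an authSource option says otherwise.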

3. Configure the crawler

CRAWLER = {
	'spider': 'aiqiyi',
	'type_id_list': [2, 3],
	're_type_id': r'http://list.iqiyi.com/www/(\d+)/',	# raw string keeps \d intact
	'url_template': 'http://list.iqiyi.com/www/%s/-------------11-%s-1-iqiyi--.html'
}

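spider: the crawler's name
type_id: iQiyi video type (the values listed in type_id_list); 1: movies, 2: TV series, 3: documentaries, 4: anime...
re_type_id: regular expression used to extract the type_id from a URL
url_template: generic URL of iQiyi's video list pages; the first %s is the video type and the second %s is the page number
For URLs and type codes, see the iQiyi video list pages.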
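The two URL-related settings are inverses of each other: url_template builds a list-page URL from a type_id and a page number, and re_type_id recovers the type_id from such a URL. A minimal sketch of both directions follows; the helper names list_page_url and extract_type_id are hypothetical and not part of the project.

```python
import re

CRAWLER = {
    'spider': 'aiqiyi',
    'type_id_list': [2, 3],
    're_type_id': r'http://list.iqiyi.com/www/(\d+)/',
    'url_template': 'http://list.iqiyi.com/www/%s/-------------11-%s-1-iqiyi--.html',
}


def list_page_url(type_id, page):
    """Fill the template: first %s is the video type, second %s is the page."""
    return CRAWLER['url_template'] % (type_id, page)


def extract_type_id(url):
    """Return the type_id embedded in a list-page URL, or None if absent."""
    match = re.match(CRAWLER['re_type_id'], url)
    return int(match.group(1)) if match else None


url = list_page_url(2, 1)
print(url)
print(extract_type_id(url))
```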

4. Start the program

python launch.py

The crawler fetches every video of the specified types on iQiyi, from the first list page to the last.
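The first-page-to-last-page traversal can be sketched as a simple loop. This is an illustration under stated assumptions, not the project's implementation: the function crawl_type, the fetch_items callback, and the "stop at the first empty page" rule are all hypothetical, and the real crawler may detect the last page differently. A fake fetcher stands in for the network so the sketch runs on its own.

```python
def crawl_type(type_id, fetch_items, url_template):
    """Yield items page by page, starting at page 1, until a page is empty."""
    page = 1
    while True:
        items = fetch_items(url_template % (type_id, page))
        if not items:
            break  # assumed stop condition: an empty page marks the end
        for item in items:
            yield item
        page += 1


# Stand-in data and fetcher so the example needs no network access.
DEMO_TEMPLATE = 'http://example.invalid/%s/%s'
FAKE_PAGES = {1: ['a', 'b'], 2: ['c']}


def fake_fetch(url):
    page = int(url.rsplit('/', 1)[1])
    return FAKE_PAGES.get(page, [])


print(list(crawl_type(2, fake_fetch, DEMO_TEMPLATE)))
```

Writing the loop as a generator keeps fetching and storage separate: the caller decides what to do with each item as it arrives.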

Results

The stored data has the following structure:

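{
	'title': title of the video item,
	'img_url': URL of the video's cover image,
	'main_url': URL the video was crawled from,
	'type_id': iQiyi video type code,
	'status': video status, 0: still updating, 1: complete,
	'vedio_list': [
		{'set_name': video name, 'set_url': video URL},
		......
	]
}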
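As a concrete instance of this structure, the sketch below assembles one record as a plain dict. The helper make_record and all sample values are hypothetical. The field names, including the spelling 'vedio_list', follow the structure above.

```python
def make_record(title, img_url, main_url, type_id, finished, episodes):
    """Assemble one record; episodes is a list of (name, url) pairs."""
    return {
        'title': title,
        'img_url': img_url,
        'main_url': main_url,
        'type_id': type_id,
        'status': 1 if finished else 0,  # 0: still updating, 1: complete
        'vedio_list': [
            {'set_name': name, 'set_url': url} for name, url in episodes
        ],
    }


# Sample values are made up for illustration.
record = make_record(
    title='Example Series',
    img_url='http://example.invalid/cover.jpg',
    main_url='http://example.invalid/series.html',
    type_id=2,
    finished=False,
    episodes=[('Episode 1', 'http://example.invalid/ep1.html')],
)
print(record['status'], len(record['vedio_list']))
```

A dict of this shape can be stored as is with pymongo, for example through a collection's insert_one method.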

Reference documentation

Bugs

  • Exception handling is unreliable
  • Data on some special pages cannot be crawled

Contributors

czs0x55aa