facert / awesome-spider
A collection of crawlers
License: MIT License
As the title says.
m
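Yutang Hot List (鱼塘热榜) is an aggregator site that collects trending headlines from major popular websites. It is written in Go and uses multiple goroutines to fetch data quickly and asynchronously. Preview: https://www.printf520.com/hot.html
Could you teach me? I've only just started using GitHub and can't find a tutorial online.
After using spider163 I found that many features fell short of my needs (mainly MVs and tags), and MariaDB also caused problems. I eventually rewrote it on MongoDB. The result has powerful feature definitions, is simple to use, and is well documented.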
https://github.com/Grass-CLP/NXSpider.git
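The NetEase Cloud Music hot-comment crawler is broken. May I open a pull request with a hot-comment crawler I wrote that uses depth-first traversal?
Function: removes watermarks from short videos, written in Golang. Supports 20 platforms, including Douyin, Kuaishou, Xigua Video, Huya, AcFun, and Haokan Video
Link: https://github.com/wujunwei928/parse-video
Meizitu crawler (with a graphical user interface)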
https://github.com/Stephen-Pierre/meizitu_Spider
https://github.com/Accelerator086/python_konachan
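Since the existing konachan crawler seems to have stopped working and its crawling process was rather cumbersome, I simplified the crawler code by fetching JSON instead. I hope it can be included. Thanks!
I clicked through a few of the crawlers at random and found they haven't been maintained for a long time... most of them probably no longer work...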
https://github.com/lizhaode/pornhub
A crawler for PornHub
https://github.com/lizhaode/Scrapy91
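A crawler for 91Porn
Hello, author. We are a professional IP proxy service provider, Jisu HTTP (极速HTTP), and would like to discuss a possible commercial promotion partnership with you. If you are interested, you can contact me on WeChat: 13982004324. Thank you. (If you are not interested, sorry for the disturbance.)
A collection of several crawler programs I wrote myself: nodejs-spider. Likes and stars are welcome!
It's 2019 already, and most of the entries here were last modified years ago and no longer work, though they can still offer some ideas.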
https://live.kuaishou.com/profile/3xsx74sidgkz2bq
https://live.kuaishou.com/graphql
Origin: https://live.kuaishou.com
Referer: https://live.kuaishou.com/profile/3xsx74sidgkz2bq
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134
Cache-Control: max-age=0
accept: */*
content-type: application/json
Accept-Language: zh-CN
Accept-Encoding: gzip, deflate, br
Host: live.kuaishou.com
Content-Length: 1323
Connection: Keep-Alive
Cookie: client_key=65890b29; clientid=3; did=web_221d3c8e7f94a5b146b513e484bf2a54; kuaishou.live.bfb1s=ac5f27b3b62895859c4c1622f49856a4
cookie: calling through the webmagic framework returns the works list "$.data.publicFeeds.list" as null
4. POST request parameters
:{"operationName":"publicFeedsQuery","variables":{"principalId":"3xsx74sidgkz2bq","pcursor":"","count":24},"query":"query publicFeedsQuery($principalId: String, $pcursor: String, $count: Int) {\n publicFeeds(principalId: $principalId, pcursor: $pcursor, count: $count) {\n pcursor\n live {\n user {\n id\n kwaiId\n eid\n profile\n name\n living\n __typename\n }\n watchingCount\n src\n title\n gameId\n gameName\n categoryId\n liveStreamId\n playUrls {\n quality\n url\n __typename\n }\n followed\n type\n living\n redPack\n liveGuess\n anchorPointed\n latestViewed\n expTag\n __typename\n }\n list {\n photoId\n caption\n thumbnailUrl\n poster\n viewCount\n likeCount\n commentCount\n timestamp\n workType\n type\n playUrl\n useVideoPlayer\n imgUrls\n imgSizes\n magicFace\n musicName\n location\n liked\n onlyFollowerCanComment\n relativeHeight\n width\n height\n user {\n id\n eid\n name\n profile\n __typename\n }\n expTag\n __typename\n }\n __typename\n }\n}\n"}
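For comparison outside webmagic, the request described above could be rebuilt with the Python standard library roughly as follows. This is a minimal sketch, not a confirmed fix: the function names `build_request`, `extract_feed_list`, and `fetch_feed_list` are hypothetical, the query is trimmed to a few of the fields from the original payload, the cookie value is a placeholder you would copy from your own browser session, and the endpoint has not been tested and may have changed since this issue was posted. `Content-Length` is left for the library to compute, and no `Accept-Encoding` is sent so the body arrives uncompressed.

```python
import json
import urllib.request

GRAPHQL_URL = "https://live.kuaishou.com/graphql"

# Trimmed version of the query in the issue: only a few "list" fields are kept.
QUERY = """query publicFeedsQuery($principalId: String, $pcursor: String, $count: Int) {
  publicFeeds(principalId: $principalId, pcursor: $pcursor, count: $count) {
    pcursor
    list {
      photoId
      caption
      playUrl
    }
  }
}"""


def build_request(principal_id, cookie, pcursor="", count=24):
    """Build the POST request without sending it (hypothetical helper)."""
    payload = {
        "operationName": "publicFeedsQuery",
        "variables": {"principalId": principal_id, "pcursor": pcursor, "count": count},
        "query": QUERY,
    }
    headers = {
        "Content-Type": "application/json",
        "Accept": "*/*",
        "Origin": "https://live.kuaishou.com",
        "Referer": "https://live.kuaishou.com/profile/" + principal_id,
        "Cookie": cookie,  # placeholder: copy from a logged-in browser session
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(GRAPHQL_URL, data=body, headers=headers, method="POST")


def extract_feed_list(response):
    """Return data.publicFeeds.list, or [] when any level is missing or null."""
    feeds = ((response or {}).get("data") or {}).get("publicFeeds") or {}
    return feeds.get("list") or []


def fetch_feed_list(principal_id, cookie):
    """Send the request and return the works list (performs a network call)."""
    request = build_request(principal_id, cookie)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return extract_feed_list(json.loads(resp.read().decode("utf-8")))
```

Calling `fetch_feed_list("3xsx74sidgkz2bq", "<your cookie string>")` would show whether the null list also appears outside webmagic, which helps separate a framework problem from a server-side one.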
GitHub-trending. URL: https://github.com/bonfy/github-trending Function: crawls GitHub trending information every day
leetcode. URL: https://github.com/bonfy/leetcode Function: crawls the LeetCode problems you have solved
Image site crawlers
https://github.com/machsix/personal-spider
xiumm.org
xiuren.org
Most of them are already broken or no longer updated
Data mining -> data cleaning -> data visualization
https://github.com/haohaizhi/51job_spiders
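Does anyone know about the Web Scraper extension for Microsoft Edge? It seems to be an integrated tool. How can I get its code?
Crawls information from mainstream Chinese live-streaming platforms in real time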
https://github.com/iamllitog/livewindow
http://www.livewindow.cn/
Kuailian VPN (快连VPN):
https://bitbucket.org/letsgogo/letsgogo_11/src/master/
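Backup link: https://github.com/LetsGo666/LetsGo_2
After installing, open it and first enter the referrer ID 76352258 to get 3 days of membership and try it out
akshare (https://github.com/jindaxiang/akshare) is a tool for downloading Chinese financial data
Self-recommendation: https://github.com/NaiboWang/EasySpider
It has been accepted by WWW 2023, a top computer science conference, and has been granted a national patent.
Crawl Douyin
Function: crawls Weibo trending searches and sends them to an email address
Link: https://github.com/tonggongzhiqiu/weibo-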
Collecting requirements.txt
Exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 353, in run
wb.build(autobuilding=True)
File "/usr/lib/python2.7/dist-packages/pip/wheel.py", line 749, in build
self.requirement_set.prepare_files(self.finder)
File "/usr/lib/python2.7/dist-packages/pip/req/req_set.py", line 380, in prepare_files
ignore_dependencies=self.ignore_dependencies))
File "/usr/lib/python2.7/dist-packages/pip/req/req_set.py", line 554, in _prepare_file
require_hashes
File "/usr/lib/python2.7/dist-packages/pip/req/req_install.py", line 278, in populate_link
self.link = finder.find_requirement(self, upgrade)
File "/usr/lib/python2.7/dist-packages/pip/index.py", line 465, in find_requirement
all_candidates = self.find_all_candidates(req.name)
File "/usr/lib/python2.7/dist-packages/pip/index.py", line 423, in find_all_candidates
for page in self._get_pages(url_locations, project_name):
File "/usr/lib/python2.7/dist-packages/pip/index.py", line 568, in _get_pages
page = self._get_page(location)
File "/usr/lib/python2.7/dist-packages/pip/index.py", line 683, in _get_page
return HTMLPage.get_page(link, session=self.session)
File "/usr/lib/python2.7/dist-packages/pip/index.py", line 795, in get_page
resp.raise_for_status()
File "/usr/share/python-wheels/requests-2.18.4-py2.py3-none-any.whl/requests/models.py", line 935, in raise_for_status
raise HTTPError(http_error_msg, response=self)
HTTPError: 404 Client Error: Not Found for url: https://pypi.org/simple/requirements-txt/
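The URL in the final line ends in `requirements-txt`, which suggests pip looked up `requirements.txt` on the package index as if it were a package name. That usually happens when the `-r` flag is left out of the install command. A minimal sketch of building the intended command, assuming the file is named `requirements.txt` and sits in the current directory (the helper name `pip_install_command` is hypothetical):

```python
import sys


def pip_install_command(requirements_path="requirements.txt"):
    """Build the argument list for installing from a requirements file."""
    # "-r" tells pip to read package names from the file; without it, pip
    # treats the argument itself as a package name to look up on the index.
    return [sys.executable, "-m", "pip", "install", "-r", requirements_path]


# To actually run the install, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(pip_install_command(), check=True)
print(" ".join(pip_install_command()))
```

Using `sys.executable -m pip` keeps the install tied to the interpreter that runs the script, which avoids mixing up Python 2 and Python 3 environments.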
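Function: removes watermarks from short videos, written in Python. Supports 20 platforms, including Douyin, Kuaishou, Xigua Video, Huya, AcFun, and Haokan Video
Link: https://github.com/wujunwei928/parse-video-py
Thanks to the repo owner for including weibospider. Having looked through so many awesome spiders, I think what's still missing is a foundational support program for crawlers, so I'd like to recommend HAipproxy
HAipproxy is a highly available, low-latency distributed proxy program. High availability covers two aspects: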
HAipproxy
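Measured speed currently reaches 10,000+ requests/hour. Below are the results of a single-machine test with Zhihu as the target site.
Requests | Time | Elapsed | IP load strategy | Client |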
---|---|---|---|---|
0 | 2018/03/03 22:03 | 0 | greedy | py_cli |
10000 | 2018/03/03 23:03 | 1 hour | greedy | py_cli |
20000 | 2018/03/04 00:08 | 2 hours | greedy | py_cli |
30000 | 2018/03/04 01:02 | 3 hours | greedy | py_cli |
40000 | 2018/03/04 02:15 | 4 hours | greedy | py_cli |
50000 | 2018/03/04 03:03 | 5 hours | greedy | py_cli |
60000 | 2018/03/04 05:18 | 7 hours | greedy | py_cli |
70000 | 2018/03/04 07:11 | 9 hours | greedy | py_cli |
80000 | 2018/03/04 08:43 | 11 hours | greedy | py_cli |
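To compare the table with the 10,000+ requests/hour figure, the cumulative average rate at each row can be worked out from the request count and the elapsed-time column. The sketch below is only arithmetic on the values copied from the table. The elapsed times are rounded to whole hours there, so the rates are approximate, and the names `SAMPLES` and `average_rate` are introduced here for illustration.

```python
# (cumulative requests, elapsed hours) copied from the table above
SAMPLES = [
    (10000, 1),
    (20000, 2),
    (30000, 3),
    (40000, 4),
    (50000, 5),
    (60000, 7),
    (70000, 9),
    (80000, 11),
]


def average_rate(requests, hours):
    """Average requests per hour since the start of the test."""
    return requests / hours


for requests, hours in SAMPLES:
    rate = average_rate(requests, hours)
    print(f"{requests:>6} requests in {hours:>2} h: {rate:,.0f} requests/hour")
```

On these numbers the average holds at 10,000 requests/hour for the first five hours and falls to roughly 7,300 requests/hour by the 11-hour mark.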
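I see there is a Qzone (QQ空间) crawler in the list, but the project is already two years old and its usability is very low.
Here is one I'd like to recommend: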
https://github.com/Tactful-biao/Spiders/tree/master/Qzone