erma0 / douyin

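Douyin crawler: collects public data such as account home pages, likes, favorites, original sounds, topics, search results, collections, posts, following lists, and follower lists.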

License: GNU General Public License v3.0

Python 95.57% JavaScript 4.43%
crawler douyin python spider

douyin's Issues

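login.py raises an error

Environment:

  • OS: Win11
  • Browser: msedge
  • Usage: source code

Command: exec.py -u users.txt

Please provide the command you ran (mainly the test link URL) so the problem can be reproduced

(e.g. ./douyin -u https://*/)

Problem description:

Describe in detail what went wrong

The definition of save_cookies raises an error as soon as login starts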

def save_cookies(cookies: list, key: list[str] = None):
Error:
Exception raised: TypeError

'type' object is not subscriptable
File "D:\codes\Douyin\login.py", line 33, in Login
def save_cookies(cookies: list, key: list[str] = []):
File "D:\codes\Douyin\login.py", line 6, in <module>
class Login(object):
TypeError: 'type' object is not subscriptable

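This is not a syntax error, and the default value is not the cause. The annotation list[str] subscripts the built-in list type, which is only supported from Python 3.9 onward (PEP 585). On Python 3.8 and earlier the annotation is evaluated when the function is defined, and list[str] raises TypeError: 'type' object is not subscriptable.

To fix it without upgrading Python, drop the subscript from the annotation (or use typing.List[str]) and keep None as the default:

def save_cookies(cookies: list, key: list = None):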
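A minimal sketch of an annotation that keeps the element type and still works on Python 3.8 and earlier. The function body is a placeholder invented for illustration and is not the project's actual implementation; the assumption that each cookie is a dict with a "name" field is also hypothetical.

```python
from typing import List, Optional


def save_cookies(cookies: list, key: Optional[List[str]] = None) -> list:
    """Placeholder body: keep only the cookies whose name is listed in `key`."""
    # typing.List[str] is subscriptable on every Python 3 version,
    # unlike the built-in list[str], which needs Python 3.9 or later.
    if key is None:
        return list(cookies)
    return [c for c in cookies if c.get("name") in key]
```

Another option is to put `from __future__ import annotations` at the top of the module, which defers annotation evaluation so that `list[str]` is never executed at definition time.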

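Still fails frequently


if self.type in ['post', 'like', 'follow', 'fans']:  # the post page needs its initial page data extracted
    self.title = render_data['42']['user']['user']['nickname']
    self.info = render_data['42']['user']  # kept as a fallback

The original program uses '41'; while debugging I found that the key is now '42'.

Also, something is wrong after compiling: the exe does not run, and the program only works when run from VS Code. I have not had time to debug this yet.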
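Since the numeric key moved from '41' to '42', one way to avoid hard-coding it is to search the top-level values for the expected shape. This is only a sketch: `find_user_entry` is a hypothetical helper, and it assumes `render_data` is a dict whose relevant entry has the nested `user` → `user` → `nickname` layout shown in the snippet above.

```python
from typing import Optional


def find_user_entry(render_data: dict) -> Optional[dict]:
    """Return the first top-level value's 'user' dict that contains user.nickname."""
    for value in render_data.values():
        if not isinstance(value, dict):
            continue
        outer = value.get("user")
        if not isinstance(outer, dict):
            continue
        inner = outer.get("user")
        if isinstance(inner, dict) and "nickname" in inner:
            return outer
    return None


# Usage with a made-up payload; the real page data may look different.
sample = {"1": {"odin": {}}, "42": {"user": {"user": {"nickname": "demo"}}}}
info = find_user_entry(sample)
title = info["user"]["nickname"] if info else None
```

Returning None instead of raising lets the caller report a clear message when the page layout changes again.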

Does it support fuzzy collection by nickname?

Environment:

  • OS: Win10 / Linux
  • Browser: Chrome / msedge
  • Usage: source code / exe

Command:

Please provide the command you ran (mainly the test link URL) so the problem can be reproduced

(e.g. ./douyin -u https://*/)

Problem description:

Describe in detail what went wrong

Could you take a look at this error when you have time?

Command: douyin -t post -u https://v.douyin.com/ia7kMcG/
OS: Windows 11
Error: requests.exceptions.SSLError: HTTPSConnectionPool(host='v.douyin.com', port=443): Max retries exceeded with url: /iaWnRyC (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1125)')))
[19128] Failed to execute script 'exec' due to unhandled exception!
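The report does not establish what causes the SSLEOFError, so the following does not fix it. It is only a sketch of how a caller could retry a failing request with exponential backoff instead of letting the script crash on the first failure. `with_retries` is a hypothetical helper, not part of the project.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 retry_on: Tuple[Type[BaseException], ...] = (Exception,)) -> T:
    """Call func(); on a listed exception wait base_delay * 2**n and try again."""
    for n in range(attempts):
        try:
            return func()
        except retry_on:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            time.sleep(base_delay * (2 ** n))
    raise ValueError("attempts must be at least 1")
```

With requests installed, a call could look like `with_retries(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.SSLError,))`. If the error persists across retries, the cause is more likely local (for example a proxy or TLS configuration), which is an assumption the report neither confirms nor rules out.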

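Douyin.info is empty. How can this be fixed?

Environment:

  • OS: Linux
  • Browser: Chrome
  • Usage: self-compiled build

Command:

Please provide the command you ran (mainly the test link URL) so the problem can be reproduced

./douyin -b chrome -n 2 -u https://www.douyin.com/user/MS4wLjABAAAA_ELoNs05CNtn3foI5YZnvV25tlVectip3-uFFokqq_iSUtS6jakIkOBzSVMn5vc5?vid=7312361433088445730

Problem description:
Collection fails with an error saying the Douyin object has no info attribute

Describe in detail what went wrong
Collection worked fine a few days ago, but today it no longer works and reports the following: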
Traceback (most recent call last):
File "exec.py", line 91, in <module>
File "click/core.py", line 1157, in __call__
File "click/core.py", line 1078, in main
File "click/core.py", line 1434, in invoke
File "click/core.py", line 783, in invoke
File "exec.py", line 74, in main
File "exec.py", line 83, in start
File "spider.py", line 513, in run
File "spider.py", line 449, in page_init
AttributeError: 'Douyin' object has no attribute 'info'

Douyin following list

Hello, do you plan to add a feature that collects a specified user's following list and writes it to a file?

Fix for 0 KB downloads; the load-more endpoint also appears to be blocked

Test:
https://www.douyin.com/user/MS4wLjABAAAAeYMREDSRXRWVVy3bk8ielQS59pkqnP-RmxZu5LTB5m-rEnOr0cbTEU-12RupXxAx
Any other creator's page can be used instead.

In def save(self):, the original line

_.append(f'{line["download_addr"]}\n\tdir={self.down_path}\n\tout={filename}.mp4\n')

is changed to

_.append(
    f'{line["download_addr"]}\n\tdir={self.down_path}\n\tout={filename}.mp4\n\tuser-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36\n\theader=Cookie:{msToken}\n')

The msToken value is generated by making one more request using auth.json.
When aria2c downloads without this cookie, the server appears to return an empty body or 403.

def handle(self, route: Route):
<Route request=>
This is the endpoint in question; it now returns an empty, non-JSON response.
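The aria2c change described above can be sketched as a small helper that builds one input-file entry: the URI on the first line, followed by tab-indented per-download options. `aria2_entry` and the example values are hypothetical; `cookie` is assumed to be a complete cookie string such as `msToken=...`, and how that token is obtained is outside this sketch.

```python
def aria2_entry(url: str, directory: str, filename: str,
                user_agent: str, cookie: str) -> str:
    """Build one aria2c input-file entry: URI line, then tab-indented options."""
    options = [
        f"dir={directory}",
        f"out={filename}.mp4",
        f"user-agent={user_agent}",
        f"header=Cookie: {cookie}",
    ]
    return url + "\n" + "".join(f"\t{opt}\n" for opt in options)


# Example with placeholder values only.
entry = aria2_entry("https://example.com/v.mp4", "./downloads", "demo",
                    "Mozilla/5.0", "msToken=PLACEHOLDER")
```

The resulting text is meant to be written to a file and passed to aria2c with its `-i` (input file) option. Keeping the options in a list makes it easy to add or drop a header without editing one long f-string.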

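How can I scrape the video cover?

The cover that the current code scrapes is not the real cover but the first frame of the short video. Is there a method or endpoint that can fetch the actual video cover?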

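Cannot scrape all of a user's videos

Hello, I have been using your code and it used to scrape all videos. Since a Douyin update, however, the original cursor-update approach no longer seems to reach a user's later videos: whatever the cursor value is, the response always contains the user's first few videos. How can this be solved?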

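Collect only the most recently updated videos

Hello, how can I collect only the few most recently updated videos? The current -l 5 option does not return the newest videos.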
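The report does not say why `-l 5` returns older posts. As a workaround sketch, assuming the collected items are dicts carrying a publication timestamp, the newest ones can be selected after collection. `latest_items` and the `create_time` field name are hypothetical and would need to be adapted to the project's real data.

```python
from typing import Dict, List


def latest_items(items: List[Dict], limit: int) -> List[Dict]:
    """Return the `limit` newest items, newest first, ordered by 'create_time'."""
    ordered = sorted(items, key=lambda item: item.get("create_time", 0), reverse=True)
    return ordered[:limit]
```

This only helps if the newest posts are among the items that were fetched in the first place, so the fetch limit may need to be larger than the number of posts to keep.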

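Also, after running douyin.exe -t https://.... and logging in, the auth.json file generated in the current directory contains {"cookies": null}. On the next run this file has to be deleted and the login repeated, otherwise the following error is raised.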

Traceback (most recent call last):
File "douyin.py", line 322, in <module>
File "click\core.py", line 1130, in __call__
File "click\core.py", line 1055, in main
File "click\core.py", line 1404, in invoke
File "click\core.py", line 760, in invoke
File "douyin.py", line 298, in main
File "douyin.py", line 305, in start
File "douyin.py", line 230, in run
File "playwright\sync_api\_generated.py", line 14048, in new_context
File "playwright\_impl\_sync_base.py", line 104, in _sync
File "playwright\_impl\_browser.py", line 126, in new_context
File "playwright\_impl\_connection.py", line 61, in send
File "playwright\_impl\_connection.py", line 461, in wrap_api_call
File "playwright\_impl\_connection.py", line 96, in inner_send
playwright._impl._api_types.Error: storageState.cookies: expected array, got object
[9872] Failed to execute script 'douyin' due to unhandled exception!
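The Playwright error above says the `cookies` entry of the storage state must be an array, which `{"cookies": null}` is not. A sketch of a guard that checks the file before it is handed to Playwright, so a bad file triggers a fresh login instead of a crash. `usable_storage_state` is a hypothetical helper; only the file name auth.json comes from the report.

```python
import json
import os
from typing import Optional


def usable_storage_state(path: str = "auth.json") -> Optional[str]:
    """Return `path` if it holds a storage state with a cookie list, else None."""
    if not os.path.isfile(path):
        return None
    try:
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None
    if isinstance(state, dict) and isinstance(state.get("cookies"), list):
        return path
    return None
```

The result can be passed as the `storage_state` argument of `browser.new_context(...)`; a value of None creates a context without saved state, after which the login flow can run and rewrite the file.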

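Found a small bug

When crawling a topic, each list response returns 10 posts, but the cursor is never updated for the following requests because the cursor in the returned list is always 0. As a result, when limit > 10 the crawler keeps re-fetching the first 10 videos.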
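A sketch of a pagination loop that tolerates the behavior described above. The response keys (`list`, `cursor`, `has_more`) and the item key `id` are hypothetical names, and `fetch_page` stands in for whatever function performs the request. The fallback of treating the cursor as an offset when the response reports 0 is an assumption about the endpoint, not something the report confirms.

```python
from typing import Callable, Dict, List


def collect(fetch_page: Callable[[int], Dict], limit: int) -> List[Dict]:
    """Page through results until `limit` items, no more data, or no progress."""
    items: List[Dict] = []
    seen_ids = set()
    cursor = 0
    while len(items) < limit:
        page = fetch_page(cursor)
        page_items = page.get("list") or []
        # Drop items already collected, so a repeated page adds nothing.
        fresh = [it for it in page_items if it["id"] not in seen_ids]
        seen_ids.update(it["id"] for it in fresh)
        items.extend(fresh)
        # Stop when nothing new arrived or the server reports no more data.
        if not fresh or not page.get("has_more"):
            break
        # ASSUMPTION: if the response cursor is missing or 0, advance by offset.
        cursor = page.get("cursor") or cursor + len(page_items)
    return items[:limit]
```

De-duplicating by id means that even if the server ignores the cursor entirely, the loop ends after one repeated page rather than returning the same 10 posts over and over.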

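Problem with the run results

Environment:

  • OS: Win10
  • Browser: Chrome
  • Usage: exe

Command:
Ran douyin.exe directly, entered the Douyin web page URL, and started the run

Problem description:

The program reports that scraping succeeded, but every video in the results is 0 bytes; nothing was actually downloaded.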
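Because the tool can report success while the files are empty, a check after the run can flag the failed downloads. This is a sketch only: `empty_downloads` is a hypothetical helper, and the `.mp4` suffix is taken from the `out={filename}.mp4` pattern quoted in another issue on this page.

```python
import os
from typing import List


def empty_downloads(directory: str, suffix: str = ".mp4") -> List[str]:
    """Return sorted paths of files under `directory` with the suffix and size 0."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(suffix) and os.path.getsize(path) == 0:
                found.append(path)
    return sorted(found)
```

A non-empty result means the download step should be treated as failed and retried, for example with the request headers the server expects.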
