whusnoopy / renrenbackup


A backup tool for renren.com

License: MIT License

Python 3.73% HTML 0.95% CSS 50.34% JavaScript 44.98%
python renren renren-bak flask requests

renrenbackup's Introduction

I'm hiring for meideng.net

Hiring R&D engineers for Hangzhou Meideng Technology; see meideng.net/join for details

renrenbackup's People

Contributors

dumbear, govizlora, hanlins, rholais, ruotianluo, whusnoopy, xhacker, xuan-w, yumenoshizuku

renrenbackup's Issues

The Renren blog module no longer works, which can cause an unexpected exit

win10 x64 1909 +python3.8.1

Ran the program normally:
renrenBackup.exe fetch -e USERNAME -p PASSWORD -s -g -a -b

After a long wait, it reported an error:

fetched 42 albums
prepare to fetch blogs
start crawl blog list page 0
Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 41, in fetch
File "fetch.py", line 99, in fetch_user
File "fetch.py", line 76, in fetch_blog
File "crawl\blog.py", line 83, in get_blogs
File "crawl\blog.py", line 26, in load_blog_list
File "crawl\crawler.py", line 123, in get_json
File "json\__init__.py", line 348, in loads
File "json\decoder.py", line 337, in decode
File "json\decoder.py", line 355, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[16464] Failed to execute script manage

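Renren's blog module no longer works. Could that be what is causing this error?

Also, the pages served by renrenBackup runserver show only the home page; none of the other modules are visible.
Please update the tool to fix this.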

Illegal multibyte sequence

Describe the bug
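Thanks for this repo, it is very useful!
While using it I often hit this error: illegal multibyte sequence. It seems to be related to encoding. It does not happen every time, but it tends to appear when the content contains unusual characters.

The system is Windows 10 x64 (Chinese edition), but the problem seems to occur at the point of insert into database.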

--- Logging error ---
Traceback (most recent call last):
  File "logging\__init__.py", line 1036, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\uc190' in position 322: illegal multibyte sequence
Call stack:
  File "manage.py", line 116, in <module>
  File "site-packages\flask_script\__init__.py", line 417, in run
  File "site-packages\flask_script\__init__.py", line 386, in handle
  File "site-packages\flask_script\commands.py", line 216, in __call__
  File "manage.py", line 41, in fetch
  File "fetch.py", line 99, in fetch_user
  File "fetch.py", line 76, in fetch_blog
  File "crawl\blog.py", line 83, in get_blogs
  File "crawl\blog.py", line 49, in load_blog_list
  File "crawl\utils.py", line 124, in get_comments
  File "site-packages\peewee.py", line 1574, in inner
  File "site-packages\peewee.py", line 1645, in execute
  File "site-packages\peewee.py", line 2288, in _execute
  File "site-packages\peewee.py", line 2063, in _execute
  File "site-packages\peewee.py", line 2653, in execute
  File "site-packages\peewee.py", line 2628, in execute_sql
  File "logging\__init__.py", line 1371, in debug
  File "logging\__init__.py", line 1519, in _log
  File "logging\__init__.py", line 1529, in handle
  File "logging\__init__.py", line 1591, in callHandlers
  File "logging\__init__.py", line 905, in handle
  File "logging\handlers.py", line 479, in emit
  File "logging\__init__.py", line 1132, in emit
  File "logging\__init__.py", line 1040, in emit
Message: ('INSERT OR REPLACE INTO "comment" ("id", "t", "entry_id", "entry_type", "authorId", "authorName", "content") VALUES (?, ?, ?, ?, ?, ?, ?)', [36028797501124092, datetime.datetime(2008, 4, 8, 7, 22, 45), 282116120, 'blog', 172790766, '赵欢', '回复孙鹤中손학중:过奖啦!呵呵<img src="<a href=\'http://uu.ren/kRMsjR\' target=\'_blank\' title=\'http://static.xiaonei.com/img/editor/emot/emot-10.gif\'>http://uu.ren/kRMsjR </a> "/>'])
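The call stack above ends in logging's emit with a 'gbk' codec error, so the INSERT itself may be fine and only the log write fails. A minimal sketch of one possible workaround follows, assuming the log handler writes to a file opened with the locale default encoding (GBK on Chinese-language Windows). The helper name `build_utf8_logger` and the logger name are hypothetical and not part of renrenBackup.

```python
import logging
import os
import tempfile


def build_utf8_logger(log_path, name="renren_demo"):
    """Create a logger whose file handler always writes UTF-8.

    Hypothetical helper for illustration; not part of renrenBackup.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # An explicit encoding avoids falling back to the locale default
    # (for example GBK), which cannot represent every Unicode character.
    handler = logging.FileHandler(log_path, encoding="utf-8")
    logger.addHandler(handler)
    return logger, handler


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.log")
    demo_logger, demo_handler = build_utf8_logger(path)
    # '\uc190' is the character named in the error message above.
    demo_logger.debug("reply containing \uc190")
    demo_handler.close()
```

If the failing handler is a console handler rather than a file handler, this sketch would not apply as written.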

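Could a resume-after-interruption feature be added?

The crawl sometimes stops partway through, either because of network problems or because of server-side problems. Could the tool support continuing after an interruption? Or does it already have this feature?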

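Avatar fetching is polluted by legacy data from kaixin.com

Some long-time users' avatars were merged in from kaixin.com. The avatar link Renren returns for them uses a kxhdn101 subdomain, for example http://kxhdn101.rrimg.com/photos/kxhdn101/20090921/1110/tiny_L5rO_26479i206133.jpg. Fetching that link directly returns a 404 error; dropping the kx from the subdomain so that it becomes hdn101 makes the fetch succeed and returns the expected image.

It would be good to special-case this when fetching avatars.

Related PRs include #27 and #24.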
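The rewrite described above can be sketched as follows. This is only an illustration: the function name `fix_kaixin_avatar_url` is hypothetical, the pattern assumes the affected hosts all look like kxhdn followed by digits, and the path is left untouched because the issue only mentions changing the subdomain.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: affected hosts start with "kx" followed by "hdn<digits>.".
_KX_HOST = re.compile(r"^kx(hdn\d+\.)")


def fix_kaixin_avatar_url(url):
    """Drop the leading 'kx' from a kxhdnNNN host; leave other URLs unchanged."""
    parts = urlsplit(url)
    fixed_host = _KX_HOST.sub(r"\1", parts.netloc)
    return urlunsplit(parts._replace(netloc=fixed_host))


if __name__ == "__main__":
    print(fix_kaixin_avatar_url(
        "http://kxhdn101.rrimg.com/photos/kxhdn101/20090921/1110/"
        "tiny_L5rO_26479i206133.jpg"
    ))
```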

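Blog fetch fails

Using python3.7.0
Ran python manage.py -e XXX -p OOO -b
Login is reported as successful, but no blog content is crawled (python manage.py -e XXX -p OOO -s -g -a all succeed).
I suspect this is because the desktop Renren blog pages recently went down (clicking through on the website gives a 404).
The mobile web version still works, though. Could the code be changed to crawl blog content from the mobile web version?

Thanks!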

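User avatar and user name are sometimes not saved correctly

When fetching someone else's content, occasionally the uid printed at the end of the run is correct, but the user name and avatar belong to the person doing the fetching.

I have an example at hand, but cannot share it for privacy reasons.

By my analysis, the problem has two parts.
First, when fetch.py runs, whether or not the -u option is given, utils.get_user() almost always returns the fetcher's own name and avatar. Whichever uid's homepage is requested, the result is the same.
There does not seem to be a short fix. The avatar and name on the home page can be scraped, but that probably requires parsing with BeautifulSoup rather than simple string searching.
As a result, the initial name and avatar are wrong.

Second, in the vast majority of cases the correct user name and avatar get updated while the script runs, but occasionally they do not. At first I could not tell why they are updated most of the time and not in a few cases.
Update:
After reading the source, I think I understand what is going on. save_user is called from only three places. Since get_user() always returns the wrong result, it can only be get_comments() and get_likes() that update the user's small avatar. For gossips, because the user's avatar is the one from that time, fetching the message board does not update the user's avatar information.
Conclusion: when a user has never replied to a comment and has never liked their own content, the correct user name and avatar cannot be fetched.

Given this, fixing get_user() looks like the more workable approach. I may look into it and send a PR.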

"urlopen chunked=chunked" error occurs

Traceback (most recent call last):
File "C:\Users\xxx.virtualenvs\renrenBackup-master-RLOICigy\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Users\xxx.virtualenvs\renrenBackup-master-RLOICigy\lib\site-packages\urllib3\connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "", line 2, in raise_from
File "C:\Users\xxx.virtualenvs\renrenBackup-master-RLOICigy\lib\site-packages\urllib3\connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37-32\Lib\http\client.py", line 1321, in getresponse
response.begin()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37-32\Lib\http\client.py", line 296, in begin
version, status, reason = self._read_status()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37-32\Lib\http\client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37-32\Lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] 远端主机已强制关闭一个现存的连线。

Traceback (most recent call last):
File "fetch.py", line 129, in <module>
fetched = fetch_user(fetch_uid, cmd_args)
File "fetch.py", line 98, in fetch_user
fetch_album(uid)
File "fetch.py", line 71, in fetch_album
album_count = crawl_album.get_albums(uid)
File "I:\renrenBackup-master\crawl\album.py", line 118, in get_albums
total += get_album_list_page(cur_page, uid)
File "I:\renrenBackup-master\crawl\album.py", line 106, in get_album_list_page
get_album_summary(aid, uid)
File "I:\renrenBackup-master\crawl\album.py", line 66, in get_album_summary
'src': get_image(p['large']),
File "I:\renrenBackup-master\crawl\utils.py", line 31, in get_image
resp = crawler.get_url(img_url)
File "I:\renrenBackup-master\crawl\crawler.py", line 97, in get_url
return self.get_url(url, params, method, retry)
File "I:\renrenBackup-master\crawl\crawler.py", line 97, in get_url
return self.get_url(url, params, method, retry)
File "I:\renrenBackup-master\crawl\crawler.py", line 97, in get_url
return self.get_url(url, params, method, retry)
[Previous line repeated 2 more times]
File "I:\renrenBackup-master\crawl\crawler.py", line 82, in get_url
raise Exception("network error, exceed max retry time")
Exception: network error, exceed max retry time
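The second traceback shows get_url calling itself on every retry until the limit is reached. As a hedged illustration only, the same idea can be written as a loop with a growing delay between attempts. The function name `get_with_retry`, its parameters, and the backoff schedule are assumptions made for this sketch; `fetch` stands in for whatever function actually performs the request.

```python
import time


def get_with_retry(fetch, url, max_retry=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_retry attempts have failed.

    Hypothetical helper for illustration; not the project's crawler code.
    """
    last_error = None
    for attempt in range(max_retry):
        try:
            return fetch(url)
        except OSError as exc:
            # ConnectionResetError (WinError 10054) is a subclass of OSError.
            last_error = exc
            if attempt < max_retry - 1:
                # Wait 1s, 2s, 4s, ... before the next attempt.
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError("network error, exceed max retry time") from last_error
```

A loop keeps the stack flat, and the delay gives a server that is resetting connections some time to recover. Whether that would have helped in this particular case is not known.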

export fails, and for some fetched statuses clicking "Comments" at http://127.0.0.1:5000/ and http://localhost:5000 does nothing (the comments are not shown)

Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 53, in export
File "export.py", line 139, in export_all
KeyError: 'users'
[27288] Failed to execute script manage

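Please update the tool.

(PS: I did see the error reported in #51 "export fails" and that you made a fix, but I do not know how to apply it; the code leaves me lost.)

The unresponsive comments are concentrated on the third- and fourth-from-last pages: on roughly three or four pages, the comments of the second status cannot be displayed.

The system is Windows 10 1909, 64-bit.

One more thing: when backing up a certain friend, the message "renren return 500, wait a moment" appears and the backup then hangs there (the same problem occurs when backing up an account that has no email, only a numeric user name).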

json.decoder.JSONDecodeError

Describe the bug

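I used the following command to back up my own content:

python manage.py fetch -p *** -e *** -s -g -a -b

The earlier downloads were mostly fine, but once it reached the state below it reported an error: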

    fetch album 311698393 2008.18 (), 评0/分0/赞0
Traceback (most recent call last):
  File "manage.py", line 158, in <module>
    cli()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "manage.py", line 53, in fetch
    fetched = fetch_user(
  File "/Users/Kinggerm/Downloads/renrenBackup/fetch.py", line 99, in fetch_user
    fetch_album(uid)
  File "/Users/Kinggerm/Downloads/renrenBackup/fetch.py", line 71, in fetch_album
    album_count = crawl_album.get_albums(uid)
  File "/Users/Kinggerm/Downloads/renrenBackup/crawl/album.py", line 163, in get_albums
    count, after = get_album_list_page(uid, after)
  File "/Users/Kinggerm/Downloads/renrenBackup/crawl/album.py", line 153, in get_album_list_page
    get_album_summary(aid, uid)
  File "/Users/Kinggerm/Downloads/renrenBackup/crawl/album.py", line 73, in get_album_summary
    album_data = crawler.get_json(
  File "/Users/Kinggerm/Downloads/renrenBackup/crawl/crawler.py", line 178, in get_json
    r = json.loads(resp.text.replace(",}", "}"))
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

To Reproduce
Re-running the same command reproduces the error at the same point, but I would rather not share the account and password.
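"Expecting value: line 1 column 1 (char 0)" generally means the response body was empty or was not JSON at all (an HTML error page, for instance). The traceback shows get_json calling json.loads(resp.text.replace(",}", "}")). A minimal sketch of a more defensive variant is below; the function name `parse_json_or_none` is hypothetical, and returning None for the caller to skip or retry is an assumption about how the crawler might want to handle it.

```python
import json


def parse_json_or_none(text):
    """Parse a response body as JSON, returning None when it is not JSON.

    Hypothetical helper for illustration; not the project's get_json.
    """
    # Same trailing-comma cleanup that appears in the traceback above.
    cleaned = text.replace(",}", "}")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Empty bodies and HTML error pages end up here.
        return None


if __name__ == "__main__":
    print(parse_json_or_none('{"albumId": 1,}'))
    print(parse_json_or_none("<html>500</html>"))
```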

JSONDecodeError while fetching

I tried running renrenBackup.exe fetch -e EMAIL -p PASSWORD -a directly from cmd on Windows and got the following error:

  check login, and get homepage for cookie
need login
prepare login encryt info
prepare post login request
Traceback (most recent call last):
  File "manage.py", line 116, in <module>
  File "site-packages\flask_script\__init__.py", line 417, in run
  File "site-packages\flask_script\__init__.py", line 386, in handle
  File "site-packages\flask_script\commands.py", line 216, in __call__
  File "manage.py", line 38, in fetch
  File "crawl\crawler.py", line 53, in __init__
  File "crawl\crawler.py", line 137, in check_login
  File "crawl\crawler.py", line 80, in get_url
  File "crawl\crawler.py", line 178, in login
  File "json\__init__.py", line 348, in loads
  File "json\decoder.py", line 337, in decode
  File "json\decoder.py", line 355, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[14520] Failed to execute script manage

The system is 64-bit Windows 10.

Message board crawl error

When I crawl the message board, it fails every time it reaches page 49:

start crawl gossip page 48
  crawled 20 gossip on page 48
start crawl gossip page 49
Traceback (most recent call last):
  File "fetch.py", line 46, in <module>
    gossip_count = crawl_gossip.get_gossip()
  File "/path/to/renrenBackup/crawl/gossip.py", line 62, in get_gossip
    total = load_gossip_page(cur_page)
  File "/path/to/renrenBackup/crawl/gossip.py", line 49, in load_gossip_page
    gossip['content'] = normal_pattern.findall(body)[0]
IndexError: list index out of range

Fetching all of my own data fails

Describe the bug
Installed and ran following the readme; fetching all of my own data fails.

To Reproduce
Steps to reproduce the behavior:

python manage.py fetch -e [email protected] -p xxxx -s -g -a -b

Expected behavior
Fetch all of my own data.

Error Output:

  • File: [renrenBackup/crawl/utils.py]
  • Console Output:
  check login, and get homepage for cookie
need login
prepare login encryt info
prepare post login request
login success with [email protected] as 1234456644
  check login, and get homepage for cookie
    login valid
    login valid
Traceback (most recent call last):
  File "manage.py", line 170, in <module>
    cli()
  File "/Users/lfeng/Dev/projects/renrenBackup/venv/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/lfeng/Dev/projects/renrenBackup/venv/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/lfeng/Dev/projects/renrenBackup/venv/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lfeng/Dev/projects/renrenBackup/venv/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/lfeng/Dev/projects/renrenBackup/venv/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "manage.py", line 57, in fetch
    fetched = fetch_user(
  File "/Users/lfeng/Dev/projects/renrenBackup/fetch.py", line 88, in fetch_user
    get_user(uid)
  File "/Users/lfeng/Dev/projects/renrenBackup/crawl/utils.py", line 108, in get_user
    name = re.findall(
IndexError: list index out of range

Additional context
mac m1 os 14
python 3.8.5
It is a pity; there is a lot I cannot remember clearly and had hoped to revisit.

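Are there any further update plans?

Earlier, when it looked like Tencent Weibo was about to shut down,
I wrote my own Tencent Weibo backup tool (see my Git).
Since Weibo is fairly simple, I just used selenium to back it up into Word,
and then deleted my Tencent Weibo account.

Recently I wanted to delete my Renren account as well and had planned to write a backup tool myself,
but a search on git turned up this very handy backup tool,
and the backup I just made displays very nicely.
However, I do not know whether any further updates to the backup functionality are planned.
(The list in the TODO does not seem to be about backup functionality.)
If not, I will go ahead and delete my Renren account.
(If there are updates, or if this version's backup is incomplete, I will not be able to back up again after deleting 😂)

Finally, thank you for such a useful tool! 👍👍👍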

Is there a way to run this on a Mac?

Add full-text search

Describe the solution you'd like
Suggest adding search to the flask app for finding keywords in post bodies and comments, for example using elasticsearch.

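Beginner asking for guidance: cannot get past step two

Step two:
2. In a command prompt, change into that directory and run renrenBackup.exe fetch -e email -p password -s -g -a -b to fetch the data of the user whose account is email and whose password is password (see the Python environment instructions below for the detailed parameters)

I downloaded the archive, extracted it, and ran the exe file. Every time I open it, a window appears for about a second and then closes by itself, so I cannot type any command. How should I handle this?

Thanks!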

Error during export

  1. Fetch succeeded
  2. Running python export.py backup.tar fails on blogs with the error below, causing a long stall:
    File "export.py", line 167, in <module>
    export_all(tar_name)
    File "export.py", line 145, in export_all
    export_blogs(client_app, user['uid'])
    File "export.py", line 118, in export_blogs
    save_file(client, '/blog/{blog_id}'.format(blog_id=blog['id']))
    File "export.py", line 51, in save_file
    output_html = trans_relative_path(resp.data.decode(), local_path)
    File "export.py", line 36, in trans_relative_path
    flags=re.M | re.DOTALL)
    File "C:\Users\xxx.virtualenvs\renrenBackup-master-RLOICigy\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

Note: everything displays correctly in python web.py.

Renren seems to have gone through yet another self-destructive redesign, and now nothing opens at all

The saved HTML is broken: 500 internal server error

Opening the saved index.html shows the following error:

Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

Looking into the generation process, there is a TemplateNotFound message, which may be related:

Exception on /blog/436610492 [GET]
Traceback (most recent call last):
File "site-packages\flask\app.py", line 2292, in wsgi_app
File "site-packages\flask\app.py", line 1815, in full_dispatch_request
File "site-packages\flask\app.py", line 1718, in handle_user_exception
File "site-packages\flask\_compat.py", line 35, in reraise
File "site-packages\flask\app.py", line 1813, in full_dispatch_request
File "site-packages\flask\app.py", line 1799, in dispatch_request
File "web.py", line 123, in blog_detail_page
File "web.py", line 23, in render_template
File "site-packages\flask\templating.py", line 134, in render_template
File "site-packages\jinja2\environment.py", line 869, in get_or_select_template
File "site-packages\jinja2\environment.py", line 830, in get_template
File "site-packages\jinja2\environment.py", line 804, in _load_template
File "site-packages\jinja2\loaders.py", line 113, in load
File "site-packages\flask\templating.py", line 58, in get_source
File "site-packages\flask\templating.py", line 86, in _get_source_fast
jinja2.exceptions.TemplateNotFound: blog.html

The machine runs 64-bit Windows 10, using the exe.

Cannot back up

Does this still work? I just tried and it no longer does. Hopefully renren has not blocked all of its interfaces.

Encoding problem

Describe the bug
--- Logging error ---
Traceback (most recent call last):
File "logging\__init__.py", line 1036, in emit
UnicodeEncodeError: 'gbk' codec can't encode character '\u2665' in position 89: illegal multibyte sequence
Call stack:
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 41, in fetch
File "fetch.py", line 87, in fetch_user
File "fetch.py", line 52, in fetch_status
File "crawl\status.py", line 53, in get_status
File "crawl\status.py", line 41, in load_status_page
File "crawl\utils.py", line 149, in get_likes
File "crawl\utils.py", line 48, in save_user
File "logging\__init__.py", line 1371, in debug
File "logging\__init__.py", line 1519, in _log
File "logging\__init__.py", line 1529, in handle
File "logging\__init__.py", line 1591, in callHandlers
File "logging\__init__.py", line 905, in handle
File "logging\handlers.py", line 479, in emit
File "logging\__init__.py", line 1132, in emit
File "logging\__init__.py", line 1040, in emit
Message: 'try to save masked with headPic masked'
Arguments: ()
get image masked to local

To Reproduce
Steps to reproduce the behavior:

save any info from people whose names contain special characters, such as

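Need PR: New way to fetch blog

Renren's SNS assets were sold to Donews in 2019. Starting in August 2019, the web blog list page and blog detail page began returning 404, with no sign of recovery since. Donews later released a new Renren mobile app in which blogs display normally, which shows the data has not been lost.

In April 2020 someone pointed out that a blog can be viewed through a URL like the one below, which shows a summary and in fact exposes the full text:

http://dnactivity.renren.com/index.html?p=601%2F30314%2F966126912

See https://github.com/whusnoopy/renrenBackup/blob/master/docs/a0_fetch_blog_after_201908.md for the details. A PR from anyone with spare time would be appreciated.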

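Data that is not one's own

I tested the photo interface, and data belonging to other users can be fetched as well;
we could compare notes;

Some downloaded images are actually 404 pages

The content of some downloaded images is actually: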

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>openresty</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

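Opening the image links shows that some images return 404 but load normally a while later. My guess is that the CDN returns 404 on the first request when it does not yet have the image, and serves it normally once it has fetched it from the origin.

I don't know Python, so I wrote a tool in nodejs. It reads the 404 images in the img directory and then retries the download from the original address several times. Renren may be dead by now, but anyone who can use it is welcome to it.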
https://github.com/lqzhgood/renrenBackup-Timeline#downimg
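For readers who prefer Python, the detection half of that idea can be sketched roughly as follows. This is not a port of the linked nodejs tool. The function names and the heuristic (treat a file as broken when its first bytes look like an HTML page or contain "404 Not Found") are assumptions made for illustration, and the re-download step is left out because it needs the original URLs.

```python
import os
import tempfile


def looks_like_html_error(path, probe_bytes=512):
    """Return True when the start of the file looks like an HTML error page."""
    with open(path, "rb") as fp:
        head = fp.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<html") or b"404 not found" in head


def find_broken_images(img_dir):
    """Walk img_dir and list files whose content is HTML instead of image data."""
    broken = []
    for root, _dirs, files in os.walk(img_dir):
        for name in files:
            full = os.path.join(root, name)
            if looks_like_html_error(full):
                broken.append(full)
    return sorted(broken)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "bad.jpg"), "wb") as fp:
        fp.write(b"<html>\n<head><title>404 Not Found</title></head>")
    print(find_broken_images(demo_dir))
```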

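Hope shared (forwarded) content can be made complete

Is your feature request related to a problem? Please describe.

  • Currently, in backed-up statuses, the original content of a share is not visible; only the comments along the share chain are. I am not sure whether this is the intended behavior.
  • Renren used to have a shares page that listed all shared blogs. Many of those blogs are not one's own but are still meaningful, and a logged-in user should have read access to them. Could this feature be added?

Describe the solution you'd like

  • Fill in the shared content
  • Add a new tab to display shared blogs

Describe alternatives you've considered
None so far

Additional context
Many thanks to the contributors to this project. Although the best time for backing up passed long ago, I was still able to back up a great deal of precious data. I would be glad to help with this suggestion; I am just not sure whether it is feasible. If the relevant interfaces can be provided, I can try to implement it.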

IndexError: list index out of range

Traceback (most recent call last):
File "fetch.py", line 129, in <module>
fetched = fetch_user(fetch_uid, cmd_args)
File "fetch.py", line 98, in fetch_user
fetch_album(uid)
File "fetch.py", line 71, in fetch_album
album_count = crawl_album.get_albums(uid)
File "I:\renrenBackup-master\crawl\album.py", line 118, in get_albums
total += get_album_list_page(cur_page, uid)
File "I:\renrenBackup-master\crawl\album.py", line 106, in get_album_list_page
get_album_summary(aid, uid)
File "I:\renrenBackup-master\crawl\album.py", line 18, in get_album_summary
first_photo_id = re.findall(r'"photoId":"(\d+)",', resp.text)[0]
IndexError: list index out of range
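The failing line indexes [0] on the result of re.findall, which raises IndexError whenever the page contains no match. A minimal sketch of a guarded version is below. It reuses the pattern from the traceback; the function name `first_photo_id` and the choice to return None (so the caller can skip the album) are assumptions.

```python
import re

# Pattern copied from the traceback above.
PHOTO_ID = re.compile(r'"photoId":"(\d+)",')


def first_photo_id(page_text):
    """Return the first photoId found in the page text, or None if absent."""
    match = PHOTO_ID.search(page_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(first_photo_id('{"photoId":"12345","albumId":"9"}'))
    print(first_photo_id("<html>empty album</html>"))
```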

fetch.py throws no such table: status

Describe the bug
Yields error when the download finishes.

To Reproduce
Run with python fetch.py [email] [password] -b -r or without -r option.

Expected behavior
Finishes without an error.

Error Output:

  • File: fetch.py
  • Console Output:

sqlite3.OperationalError: no such table: status

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "fetch.py", line 134, in <module>
update_fetch_info(fetch_uid)
File "fetch.py", line 30, in update_fetch_info
status=Status.select().where(Status.uid == uid).count(),
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1604, in inner
return method(self, database, *args, **kwargs)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1859, in count
return Select([clone], [fn.COUNT(SQL('1'))]).scalar(database)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1604, in inner
return method(self, database, *args, **kwargs)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1845, in scalar
row = self.tuples().peek(database)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1604, in inner
return method(self, database, *args, **kwargs)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1832, in peek
rows = self.execute(database)[:n]
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1604, in inner
return method(self, database, *args, **kwargs)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1675, in execute
return self._execute(database)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 1826, in _execute
cursor = database.execute(self)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 2696, in execute
return self.execute_sql(sql, params, commit=commit)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 2690, in execute_sql
self.commit()
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 2481, in __exit__
reraise(new_type, new_type(*exc_args), traceback)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 178, in reraise
raise value.with_traceback(tb)
File "/Users/hou/.local/share/virtualenvs/renrenBackup-I9IlZmg3/lib/python3.7/site-packages/peewee.py", line 2683, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: no such table: status

Additional context
This error also makes python web.py and python export.py backup.tar un-usable.

KeyError: 'authorName'

Traceback (most recent call last):
File "fetch.py", line 129, in <module>
fetched = fetch_user(fetch_uid, cmd_args)
File "fetch.py", line 102, in fetch_user
fetch_blog(uid)
File "fetch.py", line 79, in fetch_blog
blog_count = crawl_blog.get_blogs(uid)
File "I:\renrenBackup-master\crawl\blog.py", line 71, in get_blogs
total = load_blog_list(cur_page, uid)
File "I:\renrenBackup-master\crawl\blog.py", line 47, in load_blog_list
get_comments(bid, 'blog', owner=uid)
File "I:\renrenBackup-master\crawl\utils.py", line 88, in get_comments
save_user(c['authorId'], c['authorName'], c['authorHeadUrl'])
KeyError: 'authorName'
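The traceback shows the comment dict being indexed directly with c['authorName']. One hedged way to tolerate comments that lack a field is dict.get with a fallback, sketched below. The helper name `comment_author` and the empty-string defaults are assumptions; whether an empty name is acceptable downstream is not something the issue confirms.

```python
def comment_author(comment):
    """Extract (id, name, head_url) from a comment dict, tolerating missing keys.

    Hypothetical helper for illustration; not the project's get_comments.
    """
    return (
        comment.get("authorId"),
        # Fall back to an empty string instead of raising KeyError.
        comment.get("authorName", ""),
        comment.get("authorHeadUrl", ""),
    )


if __name__ == "__main__":
    print(comment_author({"authorId": 1, "authorHeadUrl": "http://example.invalid/a.jpg"}))
```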

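Could backup of friend profiles be added?

Number of friends, names, and brief info. If the web chat history (the chat history in the friend list on the right side of the home page) could be backed up too, even better.

As the title says. Thanks.

UnicodeEncodeError encountered

After logging in I hit a UnicodeEncodeError; commenting out line 62 made it go away.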

load cookies from ./.cookies.json
check login, and get homepage for cookie
login valid
Traceback (most recent call last):
File "fetch.py", line 129, in <module>
fetched = fetch_user(fetch_uid, cmd_args)
File "fetch.py", line 87, in fetch_user
get_user(uid)
File "/Users/loumingming/code/renrenBackup/crawl/utils.py", line 62, in get_user
print(' get user {uid} {name} with {pic}'.format(uid=uid, name=name, pic=pic))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
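The error comes from print writing a non-ASCII name to a stdout whose encoding is 'ascii'. Instead of commenting the line out, one option is to escape whatever the stream cannot represent. The sketch below assumes that approach; `safe_print` is a hypothetical name and not part of renrenBackup.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent."""
    stream = stream if stream is not None else sys.stdout
    encoding = getattr(stream, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # rather than raising UnicodeEncodeError.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)


if __name__ == "__main__":
    # Simulate an ASCII-only terminal.
    ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
    safe_print("get user 1 \u5f20\u4e09", ascii_stream)
    ascii_stream.flush()
    print(ascii_stream.buffer.getvalue())
```

Setting the PYTHONIOENCODING environment variable to utf-8 before running the script is another common way to change stdout's encoding, without touching the code.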

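Blog fetch fails

Environment: win10 home

Command run:
./renrenBackup.exe fetch -e email -p pwd -b

The error appears while fetching somewhere between the 1st and the 15th blog, so it is probably not caused by one specific blog.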

Arguments: ()
crawled 9 comments on blog 77xxxxx25
Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 41, in fetch
File "fetch.py", line 99, in fetch_user
File "fetch.py", line 76, in fetch_blog
File "crawl\blog.py", line 83, in get_blogs
File "crawl\blog.py", line 51, in load_blog_list
File "crawl\utils.py", line 103, in get_comments
File "crawl\crawler.py", line 119, in get_json
File "json\__init__.py", line 348, in loads
File "json\decoder.py", line 337, in decode
File "json\decoder.py", line 355, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[15240] Failed to execute script manage

./static/img/icode.jpg does not exist at all. Account and password are correct

check login, and get homepage for cookie
need login
prepare login encryt info
prepare post login request
can not get login info, needs icode
get icode image, output to ./static/img/icode.jpg
Traceback (most recent call last):
File "fetch.py", line 125, in <module>
cralwer = prepare_crawler(cmd_args)
File "fetch.py", line 22, in prepare_crawler
config.crawler = Crawler(args.email, args.password, Crawler.load_cookie())
File "/Users/whglamrock/Downloads/renrenBackup-master/crawl/crawler.py", line 49, in __init__
self.check_login()
File "/Users/whglamrock/Downloads/renrenBackup-master/crawl/crawler.py", line 128, in check_login
self.get_url("http://www.renren.com/{uid}".format(uid=self.uid))
File "/Users/whglamrock/Downloads/renrenBackup-master/crawl/crawler.py", line 76, in get_url
self.login()
File "/Users/whglamrock/Downloads/renrenBackup-master/crawl/crawler.py", line 172, in login
with open(config.ICODE_FILEPATH, 'wb') as fp:
FileNotFoundError: [Errno 2] No such file or directory: './static/img/icode.jpg'
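open(..., 'wb') creates the file but not its parent directories, so this error is what one would expect if ./static/img/ is missing in the working directory. A minimal sketch of creating the directory first is below; the helper name `write_bytes_creating_dirs` is hypothetical, and the sketch does not address why the directory was absent in this checkout.

```python
import os
import tempfile


def write_bytes_creating_dirs(filepath, data):
    """Write data to filepath, creating missing parent directories first."""
    parent = os.path.dirname(filepath)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    with open(filepath, "wb") as fp:
        fp.write(data)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "static", "img", "icode.jpg")
    write_bytes_creating_dirs(target, b"\xff\xd8\xff")
    print(os.path.exists(target))
```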

export fails

Traceback (most recent call last):
File "manage.py", line 116, in <module>
File "site-packages\flask_script\__init__.py", line 417, in run
File "site-packages\flask_script\__init__.py", line 386, in handle
File "site-packages\flask_script\commands.py", line 216, in __call__
File "manage.py", line 53, in export
File "export.py", line 139, in export_all
KeyError: 'users'
[17940] Failed to execute script manage

A quick look suggests that get_json in export.py always fails.
