longxiaofei / spider-baiduindex Goto Github PK
View Code? Open in Web Editor NEWdata sdk for baidu Index
License: MIT License
data sdk for baidu Index
License: MIT License
安装遇到编译问题,建议依赖项换成PyCryptodome,无感替代
Probably, because of some change in Python, I tried to run the demo code and get the following error :(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-d685484c0ecf> in <module>
12 keywords = ['爬虫', 'lol', '张艺兴', '人工智能', '华为', '武林外传']
13 baidu_index = BaiduIndex(keywords, '2018-01-01', '2019-05-02')
---> 14 for index in baidu_index.get_index():
15 print(index)
~/scripts/get_index.py in get_index(self)
56 start_date=params_data['start_date'],
57 end_date=params_data['end_date'],
---> 58 keywords=params_data['keywords']
59 )
60 key = self._get_key(uniqid)
~/scripts/get_index.py in _get_encrypt_datas(self, start_date, end_date, keywords)
107 html = self._http_get(url)
108 datas = json.loads(html)
--> 109 uniqid = datas['data']['uniqid']
110 encrypt_datas = []
111 for single_data in datas['data']['userIndexes']:
TypeError: string indices must be integers
C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
我一开始以为是cookie可能是找好的。然后发现您说可能缺了一个cookie。我后来请人解决了这个问题,希望大佬在某个地方给补上一些些注释哈哈。谢谢
如题说问。
我这边如果用你那个参数,是不行的 不知道其他朋友行不
我发现我输入2011年之前的日期,输出的日期,第一天就是2011-1-1。
建议把baidu_index.py里的_all_kind变量改为参数传递,这样方便指定要爬的type类型。
如下:
# _all_kind = ['all', 'pc', 'wise']
def __init__(
self,
*,
keywords: list,
start_date: str,
end_date: str,
cookies: str,
area=0,
type=['all', 'pc', 'wise']
):
self.keywords = keywords
self.area = area
self.start_date = start_date
self.end_date = end_date
self.cookies = cookies
self._params_queue = utils.get_params_queue(start_date, end_date, keywords)
self._all_kind = type
in demo.py :
from get_index import BaiduIndex
if name == "main":
keywords = ['比特币']
baidu_index = BaiduIndex(keywords, '2013-04-01', '2014-03-31')
for index in baidu_index.get_index():
print(index)
Traceback (most recent call last):
File "e:/MyProjects/spider-BaiduIndex/new_spider_without_selenium/demo.py", line 6, in
for index in baidu_index.get_index():
File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
How can I modify the code?
request_args = {
'word': json.dumps(word_list),
'startDate': start_date.strftime('%Y-%m-%d'),
'endDate': end_date.strftime('%Y-%m-%d'),
'area': area
}
formated_data['index'] = data['all']['data'][i]
这行代码的data后面应该是用kind,而不是写死的all
另外,想请教下你是怎么解析出uniqid和数值data之间的关系的?之前更新后就发现了这个API ,只是不明白之间的解析规则,这个可以跟我大概讲一下吗?
测试了下
im = im.crop((0,4,all_width,16))
这样刚刚好,截大了tesseract识别不好
再把tesseract进行样本训练,完美识别
Traceback (most recent call last):
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1689, in
main()
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1683, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1083, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\PyCharm 2018.3\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
在哪儿个js里面有什么函数写的么?
请问有办法录个视频教程吗。。。本人非程序员,但能看得懂一些,但具体操作起来可能比较困难。如果能获得帮助,真是太感谢。。
I keep getting this error when I run the whole project. Why?
我尝试改了下,发现太笨了搞不定。
厉害呀 ,就想请问下,搜到的结果怎么保存到excel?? 本人新手小白,pandas ,baidu_index.to_csv('niushi',index = False) 失败呀
新建 Microsoft Word 文档.docx
上面是代码
为什么会出现一下的错误呢
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 7, in
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 68, in get_index
for formated_data in self._format_data(encrypt_data):
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 131, in _format_data
keyword = str(data['word'])
KeyError: 'word'
此外,如果我想爬取咨询指数,我将这句成这样,
baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09',type,'feed')
也是错误的。请问怎么修改呢?
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 5, in
baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09', type,'feed')
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 42, in init
self._pre_url = self.pre_url_dict[kind]
KeyError: <class 'type'>
运行demo时,baidu_index.py 88行 uniqid = datas["data"]["uniqid"] 报错TypeError string indices must be integers
目前area=0的情况下,取到的数据是“全国”数据,请问如何更改area取到各省份/城市的数据呢?
这个api有无相关的接口文档,或者如何确认各种省份/城市选项的编码呢?
感觉找一个词x一段时间,手工对应去做暴力破解会有点麻烦,请问有无其他方法~
爬的指数有些多,单个账号爬好像被限制了,不知道要怎么实现多个账号爬取的功能
{'status': 10000, 'data': '', 'message': 'not login'}
前面的issues主要是cookies设置错误或者关键词不存在,但是我检查了这些我都满足,test_cookies为true,搜索指数和媒体指数都有。
关键词是龙脉温泉,出错是咨询指数(难度是因为他咨询指数各个值都为0吗?)
我把demo的代码改了一下。写了一个循环,以便爬完一个省份爬下一个。尴尬的是我发现当它运行到第17个循环的时候出现了bug,即result_data只能出现一个1×1数字,按道理来说数组的应该会出现我设置的天数那么长。我一开始以为是省份的问题,但我单独去爬取918号省份的时候是正常的。我不知道这是因为我连续爬了16个省份被百度发现了吗,但是每次我都是运行到第17个循环出错,想请教一下longxiaofei老师这个是怎么回事。
————————————
from get_index import BaiduIndex
import numpy as np
import pandas as pd
if name == "main":
keywords = ['天使投资']
result = []
times = 36
for i in range(times):
area_index = str(901+i)
baidu_index = BaiduIndex(keywords, '2019-1-02', '2019-1-07',area_index)
result_data = []
c=baidu_index.get_index()
for index in c:
if index['type'] == 'all':
np.array(result_data.append(index.get('index')))
if i == 0:
result=result_data
else:
result= np.vstack((result,result_data))
df = pd.DataFrame(result)
df.to_csv("天使投资.csv",encoding='utf_8_sig')
一个是 搜索指数, 一个是 资讯指数。 可以爬资讯指数吗,测了下只能爬搜索指数
换了几个COOKIE,在不同电脑上,不同网络环境下,都无法获取到指数数据
您好!首先非常感谢您的代码!参考您的代码我已经成功爬下了几个关键词的数据(写课程论文用的)!
可能我爬取的数据比较多,在爬取前七个关键词(我一个个爬的)的时候都是正确的,都是再次爬取的时候就又出现了string indices must be integers的报错,这是被网站禁了吗?我更换cookie和ip地址都还是会出现这个报错。
用的程序是在您的最新的程序的基础上加入自己的cookie,其他的未作修改。
请教一下是基于什么算法做到的,看起来像是知道原来的加密算法
Traceback (most recent call last):
File "/Users/xxx/Desktop/baidu/demo.py", line 28, in
for index in baidu_index.get_index():
File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 51, in get_index
encrypt_datas, uniqid = self._get_encrypt_datas(
File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 88, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
用大佬的demo.py跑了一下,但出现了如上报错。cookies是配置成功了的。
否则还是打印的搜索指数?单独抓取时就会显示baidu_index is not defined。
运行DEMO代码报错。cookie正常。之前运行也是正常的。今天突然报错。换了几个cookie仍然报错。我想知道是我的代码的问题还是百度又修改算法了?
Hi, I reused this scraper after two months. I used the cookies, "BDUSS" or "H_PS_PSSID", and still returns this error:
File "/Users/X/Downloads/spider-BaiduIndex-master 3/baidu_index/baidu_index.py", line 84, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
This is where I get my cookies from:
Thanks..
我遇到一个问题是之前粘贴了cookie这后成功运行了,但是今天再次运行的时候就出现了以下错误
Traceback (most recent call last):
File "demo.py", line 26, in
for i, keyword_type_date_index in enumerate(baidu_index.get_index()):
File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 51, in get_index
encrypt_datas, uniqid = self._get_encrypt_datas(
File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 84, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers
我发现之前有人提问过这个问题,不过我这里之前已经成功运行过了,但是今天却报错,换上今天重新粘贴的COOKIE也不行,所以我想问一下有什么办法可以解决吗还是这是百度那里的原因。
非常感谢
{"status":10001,"data":"","logid":3107694845,"message":"request block"}
在本地执行又可以在,在服务器上执行了两天就不行了
比较了下,跟网站上原有的数据大小完全一致!感谢!
百度指数有一个分类:PC+移动,PC,移动 这个可以添加吗?
在对某些地区的数据进行爬取时会报错:list index out of range,想请问要怎么办?
比如针对源代码中给出的几个关键字,地区代码设为32(湖北襄阳)时会报错。
希望能得到答复,谢谢~
cookies测试成功,但是数据返回未知错误,建议换个接口测试cookies是否有效
test cookies返回正确,但是仍然报错
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.