longxiaofei / spider-baiduindex

740.0 26.0 226.0 228 KB

Data SDK for Baidu Index

License: MIT License

Python 100.00%

spider-baiduindex's Introduction

Qdata - Python SDK for index and search

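Why the project was renamed

I want to build an SDK package that provides more data, though I may not have the time...
The old code package can be found in old_baiduindex
I will add different data sources according to my own data needs; if that happens to help you, I'm glad
The old data sources will be maintained on a best-effort basis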

Data Source

Install

pip uninstall pycrypto  # avoid a conflict with pycryptodome
pip install --upgrade qdata

Examples

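Baidu Index

./examples/test_baidu_index.py

For fetching Baidu Index data, see the following code: ./examples/baidu_index_best_practice.py

Baidu Search

./examples/test_baidu_search.py

Baidu Login (obtaining a Baidu cookie)

./examples/test_baidu_login.py

Only QR-code login is provided for now. Username/password login is also possible, but I won't implement it because it isn't needed.
Luckily my day job isn't writing crawlers; it's exhausting.

Tianyancha

./examples/test_tianyancha.py

My wife needed it urgently for a report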

Changelog

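2021/03/25 Released
2021/03/26 Updated the Baidu login feature
2021/04/07 Baidu Index: added real-time Baidu Index
2021/04/13 Added Tianyancha advanced-search company-count data
2021/05/18 Fixed a packaging issue
2022/05/12 Baidu Index: added Cipher-Text (part of the logic is uncertain)
2022/05/16 Minor changes
2022/05/30 Fixed the Baidu Index encryption logic
2022/09/06 Added a keyword-checking method; added a best-practice script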

spider-baiduindex's People

Forkers

goodcode123345 farfly xfy447 tomnattle nanadabendan yuandongzhong jhyhue jude1992 tamara4746 flygirl1993 younord snow611 xaxed mashiro120 eliaqian runnytone jornason reinhardhsu nemochin zhangzw0353 winghou tengcong12345 mojiqaq liumenglife runfortheworld raitd tmacmilan mrjun longqst13 veedou hhy5277 houxinshuo jackyin5918 wwqwwqwwqwwq nutalk miven naninghugo cs921128lyf mukess yunlongjia131 ruinalv ouerzc hongtao666 gsp412 rollincupcake llzhi001 wyj0613 tlemar jiangzi messimercy bibibabibobi joe2loft schopenhauerzhang wilson1823 wyclance chloetan13 emma-tsai garethjjn liujisheng candicandi huihui7987 zhangnuist haojiang520 funjackyone yangwohenmai qrmbqh gasbarroni8 xinxianren xuxiaobogit huangzhaor mrliujiankun cao-pan2008 toyijiu doublefisherghost baoyinyuan yangu1992 grcxmn klausegong flyingpiggy-yoyo franklili3 cyyunyishang relbeyond mosas simplzyu liumeng404 owen-debug smartaec shitianshiwa pengjinfu pengdaojie lir0629 changfeng-xu roundnose gaohuatj 1046517444 qaq-ahuahuahuahua fuxinjiang hw-gabriel daidarengit bruce4research

spider-baiduindex's Issues

It seems to have started hitting a bug when I used it today?

Traceback (most recent call last):
  File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1689, in <module>
    main()
  File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1683, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1083, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "E:\PyCharm 2018.3\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in <module>
    for index in baidu_index.get_index():
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
    keywords=params_data['keywords']
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

ERROR-10002: unknown error

Hello, may I ask what causes this error?
After I first wrote the code in Jupyter Notebook, every later run kept raising it. The later runs used fresh cookies.

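Replace PyCrypto with PyCryptodome

I hit compilation problems during installation. I suggest switching the dependency to PyCryptodome; it works as a drop-in replacement.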

TypeError: string indices must be integers

Traceback (most recent call last):
  File "/Users/xxx/Desktop/baidu/demo.py", line 28, in <module>
    for index in baidu_index.get_index():
  File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 51, in get_index
    encrypt_datas, uniqid = self._get_encrypt_datas(
  File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 88, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

I ran the author's demo.py and got the error above. The cookies were configured successfully.

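The author is awesome. By the way, can the "real-time" data in Baidu Index be scraped?

I tried modifying the code myself but could not work it out.

The endpoint being called is: http://index.baidu.com/api/LiveApi/getLive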

Hello: string indices must be integers

C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py
Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in <module>
    for index in baidu_index.get_index():
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
    keywords=params_data['keywords']
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

At first I thought the cookie I had found was probably fine. Then I saw you mention that one cookie might be missing. I later got someone to help me solve it. I hope you can add a few comments about this somewhere, haha. Thanks.

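It seems to no longer work; could something have gone wrong?

Download chromedriver and add it to your environment variables
Download tesseract and add it to your environment variables
Single-account scraping: open the Baidu homepage, log in, copy the homepage cookie, and paste it into the COOKIES object in config.py
Locate the tesseract folder and the digits file in tesseract/3.05.02/share/tessdata/configs
I did all of these. I don't know how to debug it.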

Does "request block" mean my IP has been banned?

{"status":10001,"data":"","logid":3107694845,"message":"request block"}

It works when run locally, but it stopped working after running on a server for two days.
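
One way to handle a payload like the one above is to back off and retry when `status` is 10001. This is only a sketch: whether the block is tied to the IP, the account, or the request rate is not established anywhere in this thread, so waiting may or may not lift it. `call_with_backoff` and its parameters are hypothetical names, and `fetch` stands for any callable that returns the decoded JSON payload.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until the payload is not a 'request block' response.

    fetch is a hypothetical zero-argument callable returning the decoded
    JSON payload as a dict. Status 10001 is taken from the response quoted
    in the issue; everything else here is an assumption.
    """
    payload = None
    for attempt in range(max_attempts):
        payload = fetch()
        if payload.get("status") != 10001:
            return payload
        # Wait 1x, 2x, 4x ... base_delay before the next attempt,
        # but do not sleep after the final one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return payload
```

The `sleep` argument is injectable so the delay schedule can be checked without actually waiting.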

How do I determine which province an area code in the search index corresponds to?

request_args = {
    'word': json.dumps(word_list),
    'startDate': start_date.strftime('%Y-%m-%d'),
    'endDate': end_date.strftime('%Y-%m-%d'),
    'area': area
}

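[BUG] Cookies pass the test but no data can be fetched

The cookies test succeeds, but the data request returns an unknown error. I suggest using a different endpoint to test whether the cookies are valid.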

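How do you know that the parameters for https://miao.baidu.com/abdr are hard-coded?

Using your parameters does not work on my side; I'm not sure whether it works for others.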

Hoping the author can add support for scraping combined keywords

As shown in the image

TypeError: string indices must be integers

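Running the demo code raises an error. The cookie is fine, and it ran normally before. Today it suddenly fails, and trying several different cookies did not help. Is this a problem with my code, or has Baidu changed its algorithm again?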

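Keyword data returned

My keywords are the names of listed companies. In the returned data, each company name is split into individual Chinese characters. How can I fix this? Thanks!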

How did you discover the data decryption method?

Is there a function for it in one of the JS files?

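How does decrypt_func parse the returned encrypted values?

Could you explain which algorithm it is based on? It looks as if the original encryption algorithm is known.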
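
Nothing on this page documents what decrypt_func actually does. One plausible reading, stated here purely as an assumption, is a character-substitution table: the key is a string whose first half lists the cipher characters and whose second half lists the matching plaintext characters in the same order. The `decrypt` function below is a hypothetical sketch of that idea, not the project's confirmed implementation.

```python
def decrypt(key, data):
    """Decode data with a substitution table derived from key.

    Assumption (not confirmed by the source): key[:n] holds cipher
    characters and key[n:] holds the plaintext character for each one,
    where n is half the key length.
    """
    half = len(key) // 2
    mapping = dict(zip(key[:half], key[half:]))
    return "".join(mapping[ch] for ch in data)
```

With a made-up key such as `"xyzw12,3"`, the table maps x to 1, y to 2, z to a comma, and w to 3, so the input `"xzyw"` would decode to a comma-separated run of digits.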

TypeError: string indices must be integers

In demo.py:

from get_index import BaiduIndex

if __name__ == "__main__":
    keywords = ['比特币']
    baidu_index = BaiduIndex(keywords, '2013-04-01', '2014-03-31')
    for index in baidu_index.get_index():
        print(index)

Traceback (most recent call last):
  File "e:/MyProjects/spider-BaiduIndex/new_spider_without_selenium/demo.py", line 6, in <module>
    for index in baidu_index.get_index():
  File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 58, in get_index
    keywords=params_data['keywords']
  File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

How can I modify the code?

Regarding Tianyancha: could you provide "base_datas", "category" and "base_datas", "area"?

As the title says.

Awesome work! Just one question: how do I save the results to Excel?

Great work. How do I save the search results to Excel? I'm a complete beginner; with pandas, baidu_index.to_csv('niushi',index = False) fails.
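
`to_csv` is a method of pandas DataFrames, so calling it on the `BaiduIndex` object itself is expected to fail. Other snippets in this thread treat each item from `get_index()` as a dict (for example `index['type']` and `index.get('index')`), so one approach is to collect those dicts and write them out with the standard library. The sketch below assumes exactly that; `save_records` is a hypothetical helper, and the field names in the usage note are illustrative guesses.

```python
import csv


def save_records(records, path):
    """Write an iterable of dict records to a CSV file that Excel can open.

    Returns the number of rows written. Column order follows the order in
    which keys first appear across the records.
    """
    rows = list(records)
    if not rows:
        return 0
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # utf-8-sig adds a BOM so Excel detects the encoding of Chinese text.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Usage would look like `save_records(baidu_index.get_index(), "niushi.csv")`, assuming the generator yields dicts with keys along the lines of keyword, type, date, and index.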

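Hello

Would it be possible to record a video tutorial? I'm not a programmer. I can follow some of it, but doing it hands-on may be hard for me. Any help would be greatly appreciated.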

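In the Baidu media index scraping section, should baidu_index in the second-to-last line of code be changed to news_index?

Otherwise, doesn't it still print the search index? When scraping the media index on its own, it reports that baidu_index is not defined.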

Code raises string indices must be integers

test cookies returns the correct result, but the error still occurs.

A question about scraping the media index and the feed index

新建 Microsoft Word 文档.docx
The code is in the attachment above.
Why does the following error occur?

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 7, in <module>
    for index in baidu_index.get_index():
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 68, in get_index
    for formated_data in self._format_data(encrypt_data):
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 131, in _format_data
    keyword = str(data['word'])
KeyError: 'word'

Also, to scrape the feed index I changed the line to this:
baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09',type,'feed')

That fails too. How should I change it?

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 5, in <module>
    baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09', type,'feed')
  File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 42, in __init__
    self._pre_url = self.pre_url_dict[kind]
KeyError: <class 'type'>
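
The second traceback can be reproduced in isolation. In the call above, the bare name `type` is Python's built-in class, and the traceback shows it ending up as `kind` in the lookup `self.pre_url_dict[kind]`. The sketch below mimics that lookup with a stand-in dictionary; the keys and placeholder values are assumptions, not the project's real table.

```python
# Hypothetical stand-in for pre_url_dict; keys and values are placeholders.
PRE_URLS = {"news": "<news endpoint>", "feed": "<feed endpoint>"}


def pick_url(kind):
    """Look up the endpoint for a kind, as the traceback's dict access does."""
    return PRE_URLS[kind]


try:
    pick_url(type)  # passes the built-in class, not a string
except KeyError as exc:
    error_text = repr(exc)

feed_url = pick_url("feed")  # a string key is found normally
```

If the constructor's fourth positional argument is the kind, as the traceback suggests, then passing the string `'feed'` in that position instead of `type` would be the thing to try.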

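How do I determine the area code in API requests?

With area=0 the data returned is nationwide. How do I change area to get data for individual provinces or cities?
Is there any documentation for this API, or some other way to determine the codes for the province/city options?

Picking one keyword and one time range and then brute-forcing the mapping by hand seems tedious. Is there another way?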

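I'm back again: I ran into a problem when scraping data for all provinces in one run

I modified the demo code and wrote a loop so that it scrapes one province after another. Awkwardly, a bug appears on the 17th iteration: result_data holds only a single 1×1 number, whereas the array should be as long as the number of days I set. At first I thought the province itself was the problem, but scraping province 918 on its own works. Could it be that Baidu noticed after I scraped 16 provinces in a row? Yet it fails at the 17th iteration every time. I'd like to ask longxiaofei what is going on here.

------------
from get_index import BaiduIndex
import numpy as np
import pandas as pd

if __name__ == "__main__":
    keywords = ['天使投资']
    result = []
    times = 36
    for i in range(times):
        area_index = str(901 + i)
        baidu_index = BaiduIndex(keywords, '2019-1-02', '2019-1-07', area_index)
        result_data = []
        c = baidu_index.get_index()
        for index in c:
            if index['type'] == 'all':
                # list.append returns None, so append directly instead of wrapping it in np.array
                result_data.append(index.get('index'))
        if i == 0:
            result = result_data
        else:
            # np.vstack requires every row to have the same length
            result = np.vstack((result, result_data))
    df = pd.DataFrame(result)
    df.to_csv("天使投资.csv", encoding='utf_8_sig')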
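
Because `np.vstack` needs rows of equal length, a single short response breaks the loop above. A sketch of an alternative is to accumulate one row per record in "long" form, tagged with its area, so uneven results can still be collected and inspected. `collect_by_area` and `fetch_area` are hypothetical names; `fetch_area(area)` stands for whatever wraps `BaiduIndex(...).get_index()` for one area code.

```python
def collect_by_area(area_codes, fetch_area):
    """Gather 'all'-type records for each area into one flat list of dicts.

    fetch_area is a hypothetical callable that takes an area code and
    returns an iterable of dict records with at least a 'type' key.
    """
    rows = []
    for area in area_codes:
        for record in fetch_area(area):
            if record.get("type") == "all":
                # Tag each record with its area instead of stacking arrays.
                rows.append({"area": area, **record})
    return rows
```

Counting rows per area afterwards shows exactly which area came back short, which may help tell a rate limit apart from a data gap.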

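The previous version no longer seems to work, while the new version is fine. However, the new version cannot scrape PC data from before 2011

When I enter a date before 2011, the first day in the output is 2011-1-1.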

Does anyone know how to get Baidu's topic feature?

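The cropped screenshot region is a bit too large

I tested it:
im = im.crop((0,4,all_width,16))
This fits just right; when the crop is too large, tesseract recognizes the digits poorly.
After additionally training tesseract on samples, recognition is perfect.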

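I found an anti-scraping trick used by Baidu Index

The whole chart originally has data, but as the drop-down bar moves, the chart's values turn to 0 on their own (apparently shown wrong on purpose). Sure enough, when scraping with your code I also start getting many 0s after a while. Is there any way to solve this?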

TypeError: string indices must be integers

Hi, I reused this scraper after two months. I used the cookies "BDUSS" or "H_PS_PSSID", and it still returns this error:

  File "/Users/X/Downloads/spider-BaiduIndex-master 3/baidu_index/baidu_index.py", line 84, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']

TypeError: string indices must be integers

This is where I get my cookies from:

Thanks..

I added the cookie, so why does the returned data say I'm not logged in?

{'status': 10000, 'data': '', 'message': 'not login'}
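
A payload like this one is consistent with the `TypeError: string indices must be integers` reported throughout this thread: when `data` is an empty string, `datas['data']['uniqid']` indexes a string with a string. A defensive check before indexing turns that into a readable error. The sketch below is an assumption about how one might guard the call; `IndexResponseError` and `extract_uniqid` are hypothetical names, not part of the project.

```python
import json


class IndexResponseError(Exception):
    """Raised when the payload carries an error message instead of data."""


def extract_uniqid(raw_text):
    """Return data.uniqid from a JSON response, or raise a descriptive error."""
    payload = json.loads(raw_text)
    data = payload.get("data")
    if not isinstance(data, dict):
        # e.g. status 10000 'not login' or status 10001 'request block'
        raise IndexResponseError(
            f"status={payload.get('status')} message={payload.get('message')!r}"
        )
    return data["uniqid"]
```

With this guard, an expired cookie surfaces as "status=10000 message='not login'" rather than as a TypeError deep inside the parsing code.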

Error: TypeError string indices must be integers

When running the demo, line 88 of baidu_index.py, uniqid = datas["data"]["uniqid"], raises TypeError string indices must be integers

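Data for some regions cannot be scraped?

Scraping data for certain regions raises: list index out of range. What should I do?
For example, with the keywords given in the source code, setting the area code to 32 (Xiangyang, Hubei) raises the error.
Hoping for a reply, thanks.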

Has the key request method stopped working? The request returns an empty value, and the computed index does not match the website

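Could scraping with multiple cookies be supported? My account has been restricted...

I'm scraping quite a lot of index data and a single account seems to get restricted. I don't know how to implement scraping with multiple accounts.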
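
A simple way to spread requests over several accounts is to rotate through a list of cookie strings in round-robin order. The sketch below shows only the rotation; `CookiePool` is a hypothetical helper, and whether rotating accounts actually avoids the restriction is not something this thread confirms.

```python
from itertools import cycle


class CookiePool:
    """Hand out cookie strings in round-robin order."""

    def __init__(self, cookies):
        cookies = list(cookies)
        if not cookies:
            raise ValueError("at least one cookie is required")
        self._cycle = cycle(cookies)

    def next_cookie(self):
        """Return the next cookie, wrapping around at the end of the list."""
        return next(self._cycle)
```

Each request (or each keyword batch) would then call `pool.next_cookie()` and pass the result wherever the single cookie string is passed today.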

Scraped index does not exactly match what the website shows

I compared them, and the values match the data on the website exactly! Thanks!

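string indices must be integers error (seems different from the earlier issues)

The earlier issues were mainly caused by wrong cookie settings or nonexistent keywords, but I checked and I meet all of those conditions: test_cookies is true, and both the search index and the media index exist.
The keyword is 龙脉温泉, and the failure is in the feed index (could it be because all of its feed index values are 0?)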

TypeError: string indices must be integers

Probably, because of some change in Python, I tried to run the demo code and get the following error :(

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-d685484c0ecf> in <module>
     12     keywords = ['爬虫', 'lol', '张艺兴', '人工智能', '华为', '武林外传']
     13     baidu_index = BaiduIndex(keywords, '2018-01-01', '2019-05-02')
---> 14     for index in baidu_index.get_index():
     15         print(index)

~/scripts/get_index.py in get_index(self)
     56                     start_date=params_data['start_date'],
     57                     end_date=params_data['end_date'],
---> 58                     keywords=params_data['keywords']
     59                 )
     60                 key = self._get_key(uniqid)

~/scripts/get_index.py in _get_encrypt_datas(self, start_date, end_date, keywords)
    107         html = self._http_get(url)
    108         datas = json.loads(html)
--> 109         uniqid = datas['data']['uniqid']
    110         encrypt_datas = []
    111         for single_data in datas['data']['userIndexes']:

TypeError: string indices must be integers

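On string indices must be integers possibly being caused by a ban

Hello! First, thank you very much for your code! Using it, I have successfully scraped data for several keywords (for a course paper)!

Perhaps I scraped too much data. The first seven keywords (scraped one at a time) all worked, but when I scraped again the string indices must be integers error came back. Have I been banned by the site? Changing the cookie and the IP address still produces this error.

The program is your latest version with my own cookie added and nothing else modified.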

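Suddenly hit an unknown error: data.errors.QdataError: 'ERROR-10002: 未知错误' (unknown error)

It ran without problems before. After closing the file and running it again, the following error appeared:

Could anyone take a look at how to solve this?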

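Data for some keywords cannot be fetched

The code scrapes Guangzhou data from 2020-01-01 to 2020-12-31 with the keywords ['糖果', '冻干', '月饼', '啤酒', '洋酒']. Oddly, on some runs a few columns fail to be fetched, with results as in the image below:

But sometimes everything is fetched.

Does something here need to be changed?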

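Baidu Index has two metrics

One is the search index and the other is the feed index. Can the feed index be scraped? In my test, only the search index could be.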

How do I resolve this error: string indices must be integers

Category option

Baidu Index has a category selector: PC+mobile, PC, and mobile. Could this be added?

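Pass the _all_kind variable in baidu_index.py as a parameter

I suggest turning the _all_kind variable in baidu_index.py into a parameter, which makes it easy to specify which types to scrape.

As follows: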
# _all_kind = ['all', 'pc', 'wise']

def __init__(
    self,
    *,
    keywords: list,
    start_date: str,
    end_date: str,
    cookies: str,
    area=0,
    type=['all', 'pc', 'wise']
):
    self.keywords = keywords
    self.area = area
    self.start_date = start_date
    self.end_date = end_date
    self.cookies = cookies
    self._params_queue = utils.get_params_queue(start_date, end_date, keywords)
    self._all_kind = type
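
One detail worth noting in the proposal above: a list used as a default argument is created once and shared by every call, and the parameter name `type` shadows the built-in. A variant that avoids both is sketched below. `IndexRequest`, `ALL_KINDS`, and `kinds` are hypothetical names used for illustration; only the three values 'all', 'pc', and 'wise' come from the issue.

```python
ALL_KINDS = ("all", "pc", "wise")  # values taken from the issue text


class IndexRequest:
    """Hypothetical stand-in showing how the kinds could be passed in."""

    def __init__(self, *, keywords, start_date, end_date, cookies, area=0, kinds=None):
        self.keywords = keywords
        self.area = area
        self.start_date = start_date
        self.end_date = end_date
        self.cookies = cookies
        # None means "all kinds"; build a fresh list for every instance.
        selected = list(ALL_KINDS) if kinds is None else list(kinds)
        unknown = [kind for kind in selected if kind not in ALL_KINDS]
        if unknown:
            raise ValueError(f"unknown kinds: {unknown}")
        self._all_kind = selected
```

Calling it with `kinds=["pc"]` would restrict the scrape to the PC series, while omitting the argument keeps the current behaviour of fetching all three.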

A question about using the COOKIE

After I pasted the cookie it ran successfully, but running it again today produced the following error
Traceback (most recent call last):
  File "demo.py", line 26, in <module>
    for i, keyword_type_date_index in enumerate(baidu_index.get_index()):
  File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 51, in get_index
    encrypt_datas, uniqid = self._get_encrypt_datas(
  File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 84, in _get_encrypt_datas
    uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

I saw that someone asked about this before, but in my case it had already run successfully. Today it fails, and a freshly pasted COOKIE does not help either. Is there a way to fix this, or is the cause on Baidu's side?

Many thanks
