spider-baiduindex's Introduction

Qdata - Python SDK for index and search

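Why the project was renamed

  • I want to build an SDK package that offers more data, though I may not always have the time.
  • The old code package can be found in old_baiduindex.
  • I will add data sources according to my own data needs; if that happens to help you, I am glad.
  • The existing data sources will be maintained on a best-effort basis.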

Data Source

Install

pip uninstall pycrypto  # avoid a conflict with pycryptodome
pip install --upgrade qdata

Examples

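Baidu Index

./examples/test_baidu_index.py

For fetching Baidu Index data, see the following code: ./examples/baidu_index_best_practice.py

Baidu Search

./examples/test_baidu_search.py

Baidu login (obtaining the Baidu cookie)

./examples/test_baidu_login.py

  • Only QR-code login is currently provided. Username/password login could be implemented, but will not be, because it is unnecessary.
  • Luckily my day job is not web scraping; it is exhausting.

Tianyancha

./examples/test_tianyancha.py

  • My wife urgently needed it for a report.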

Changelog

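  • 2021/03/25 Released
  • 2021/03/26 Updated the Baidu login feature
  • 2021/04/07 Baidu Index: added the real-time Baidu Index
  • 2021/04/13 Added Tianyancha advanced-search company-count data
  • 2021/05/18 Fixed a packaging issue
  • 2022/05/12 Baidu Index: added Cipher-Text (part of the logic is uncertain)
  • 2022/05/16 Minor changes
  • 2022/05/30 Fixed the Baidu Index encryption logic
  • 2022/09/06 Added a keyword-check method and a best-practice script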

Stargazers over time

spider-baiduindex's People

Contributors

longxiaofei, m4rque2

spider-baiduindex's Issues

A bug seems to have appeared when I used it today?

Traceback (most recent call last):
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1689, in <module>
main()
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1683, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "E:\PyCharm 2018.3\helpers\pydev\pydevd.py", line 1083, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\PyCharm 2018.3\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in <module>
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

ERROR-10002: unknown error

Hello, could you tell me what causes this error?
After I first wrote the code in Jupyter Notebook, every later run kept raising this error. The later runs used new cookies.

TypeError: string indices must be integers

Traceback (most recent call last):
File "/Users/xxx/Desktop/baidu/demo.py", line 28, in <module>
for index in baidu_index.get_index():
File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 51, in get_index
encrypt_datas, uniqid = self._get_encrypt_datas(
File "/Users/xxx/Desktop/baidu/baidu_index/baidu_index.py", line 88, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

I ran your demo.py and got the error above. The cookies were configured successfully.

Hello: string indices must be integers

C:\ProgramData\Anaconda3\python.exe C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/demo.py", line 6, in <module>
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

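At first I thought the cookie was probably fine. Then I saw you mention that a cookie might be missing. I later got someone to help me solve the problem. I hope you can add a few comments about this somewhere, haha. Thanks.

It no longer seems to work; what might have gone wrong?

  • Download chromedriver and add it to your environment variables.

  • Download tesseract and add it to your environment variables.

  • Single-account scraping: open the Baidu home page, log in, copy the home page's cookie, and paste it into the COOKIES object in config.py.

  • Locate the tesseract folder and the digits file in tesseract/3.05.02/share/tessdata/configs.

I have done all of the above, but I do not know how to debug it.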

TypeError: string indices must be integers

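Running the demo code raises this error. The cookie is valid and the code ran fine before; today it suddenly fails, and switching between several cookies does not help. Is the problem in my code, or has Baidu changed its algorithm again?

Returned keyword data

My keywords are the names of listed companies, but in the returned data each company name is split into individual Chinese characters. How can I fix this? Thank you!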

TypeError: string indices must be integers

In demo.py:
from get_index import BaiduIndex

if __name__ == "__main__":
    keywords = ['比特币']
    baidu_index = BaiduIndex(keywords, '2013-04-01', '2014-03-31')
    for index in baidu_index.get_index():
        print(index)

Traceback (most recent call last):
File "e:/MyProjects/spider-BaiduIndex/new_spider_without_selenium/demo.py", line 6, in <module>
for index in baidu_index.get_index():
File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 58, in get_index
keywords=params_data['keywords']
File "e:\MyProjects\spider-BaiduIndex\new_spider_without_selenium\get_index.py", line 109, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

How can I modify the code?

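Hello

Could you possibly record a video tutorial? I am not a programmer; I can follow some of it, but actually carrying out the steps is probably hard for me. I would be very grateful for any help.

A question about scraping the media index and the information (feed) index

新建 Microsoft Word 文档.docx
The code is in the attachment above.
Why does the following error occur?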

Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 7, in <module>
for index in baidu_index.get_index():
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 68, in get_index
for formated_data in self._format_data(encrypt_data):
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 131, in _format_data
keyword = str(data['word'])
KeyError: 'word'

Also, to scrape the feed index I changed the line to this:
baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09',type,'feed')

That fails as well. How should I change it?

Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/spider-BaiduIndex-master/new_spider_without_selenium/news_feed.py", line 5, in <module>
baidu_index = BaiduIndex(keywords, '2016-1-01', '2020-4-09', type,'feed')
File "C:\Users\Administrator\Desktop\spider-BaiduIndex-master\new_spider_without_selenium\get_extended_index.py", line 42, in __init__
self._pre_url = self.pre_url_dict[kind]
KeyError: <class 'type'>
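
The second traceback ends in `KeyError: <class 'type'>`: the bare name `type` in the call is Python's built-in `type` object, and it ended up as the key in `self.pre_url_dict[kind]`. Below is a small sketch of that failure and a stricter lookup. The table `PRE_URL_DICT`, its keys and URLs, and the helper `resolve_pre_url` are hypothetical placeholders, not the project's real values.

```python
# Hypothetical stand-in for the class's kind-to-URL table; the real keys
# and URLs in get_extended_index.py may differ.
PRE_URL_DICT = {
    "news": "https://example.invalid/news?",
    "feed": "https://example.invalid/feed?",
}


def resolve_pre_url(kind):
    """Look up the URL prefix for `kind`, with a clearer error than KeyError."""
    if not isinstance(kind, str):
        raise TypeError(f"kind must be a string such as 'feed', got {kind!r}")
    try:
        return PRE_URL_DICT[kind]
    except KeyError:
        valid = ", ".join(sorted(PRE_URL_DICT))
        raise ValueError(f"unknown kind {kind!r}; expected one of: {valid}") from None


print(resolve_pre_url("feed"))
try:
    resolve_pre_url(type)  # reproduces the mistake: the built-in, not a string
except TypeError as exc:
    print(exc)
```

If the constructor's fourth positional argument is indeed `kind`, as the traceback suggests, passing the string `'feed'` in that position (or by keyword) would avoid the error.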

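How can I determine the area codes used in the API requests?

With area=0 the data returned is the nationwide data. How do I change area to get data for individual provinces or cities?
Is there any interface documentation for this API, or some other way to determine the codes for each province/city option?

Picking a keyword and a time range and brute-forcing the mapping by hand seems rather tedious. Is there another way?

I'm back again: I ran into a problem scraping all provinces in one run

I modified the demo code and wrote a loop so that it moves on to the next province after finishing one. Awkwardly, on the 17th iteration a bug appears: result_data contains only a single 1×1 number, when the array should be as long as the number of days I set. I first thought the province was the problem, but scraping province 918 on its own works fine. I do not know whether Baidu detected me after 16 consecutive provinces, but it fails on the 17th iteration every time. longxiaofei, could you tell me what is going on?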

————————————
from get_index import BaiduIndex
import numpy as np
import pandas as pd

if __name__ == "__main__":
    keywords = ['天使投资']
    result = []
    times = 36
    for i in range(times):
        # Area codes 901..936, one request per code
        area_index = str(901 + i)
        baidu_index = BaiduIndex(keywords, '2019-1-02', '2019-1-07', area_index)
        result_data = []
        c = baidu_index.get_index()
        for index in c:
            if index['type'] == 'all':
                # list.append returns None, so it must not be wrapped in np.array
                result_data.append(index.get('index'))
        if i == 0:
            result = result_data
        else:
            # np.vstack requires every row to have the same length
            result = np.vstack((result, result_data))
    df = pd.DataFrame(result)
    df.to_csv("天使投资.csv", encoding='utf_8_sig')
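
One way to keep a loop like this from breaking when a province returns too few values is to check each series length before stacking and to record which areas came back short so they can be retried. The sketch below is pure Python with made-up data; `split_rows` and the sample area codes are hypothetical and not part of the scraper.

```python
def split_rows(series_by_area, expected_len):
    """Separate complete series from short ones.

    series_by_area maps an area code to the list of daily values fetched
    for it. Returns (complete, short): `complete` keeps only the series
    with exactly `expected_len` values, `short` lists area codes to retry.
    """
    complete = {}
    short = []
    for area, values in series_by_area.items():
        if len(values) == expected_len:
            complete[area] = values
        else:
            short.append(area)
    return complete, short


# Made-up values: six days requested, one area came back with a single value.
fetched = {
    "901": [10, 12, 11, 9, 14, 13],
    "902": [0],
    "903": [7, 8, 6, 9, 7, 8],
}
complete, short = split_rows(fetched, expected_len=6)
print(sorted(complete))
print(short)
```

Areas listed in `short` could then be requested again after a pause. Whether the short result really comes from rate limiting is not established in this thread.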

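The screenshot crop region is a bit too large

I tested this:
im = im.crop((0,4,all_width,16))
This crop is just right; with a larger crop, tesseract recognizes the digits poorly.
After also training tesseract on samples, recognition is perfect.

I found one of Baidu Index's anti-scraping tricks

The full chart originally has data everywhere, but as the drop-down bar moves, the chart data turns into 0 by itself (it seems to be shown wrong on purpose). Sure enough, when scraping with your code I also start getting many 0s partway through. Is there a way to fix this?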

TypeError: string indices must be integers

Hi, I reused this scraper after two months. I tried the cookies "BDUSS" or "H_PS_PSSID", and it still returns this error:

File "/Users/X/Downloads/spider-BaiduIndex-master 3/baidu_index/baidu_index.py", line 84, in _get_encrypt_datas
uniqid = datas['data']['uniqid']

TypeError: string indices must be integers

This is where I get my cookies from:

Screenshot 2020-07-31 at 16 05 15

Thanks..

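Data for some areas cannot be scraped?

Scraping data for certain areas raises an error: list index out of range. What should I do?
For example, with the keywords given in the source code, setting the area code to 32 (Xiangyang, Hubei) raises the error.
I hope to get a reply. Thanks!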

TypeError: string indices must be integers

Probably because of some change in Python, I tried to run the demo code and got the following error :(

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-d685484c0ecf> in <module>
     12     keywords = ['爬虫', 'lol', '张艺兴', '人工智能', '华为', '武林外传']
     13     baidu_index = BaiduIndex(keywords, '2018-01-01', '2019-05-02')
---> 14     for index in baidu_index.get_index():
     15         print(index)

~/scripts/get_index.py in get_index(self)
     56                     start_date=params_data['start_date'],
     57                     end_date=params_data['end_date'],
---> 58                     keywords=params_data['keywords']
     59                 )
     60                 key = self._get_key(uniqid)

~/scripts/get_index.py in _get_encrypt_datas(self, start_date, end_date, keywords)
    107         html = self._http_get(url)
    108         datas = json.loads(html)
--> 109         uniqid = datas['data']['uniqid']
    110         encrypt_datas = []
    111         for single_data in datas['data']['userIndexes']:

TypeError: string indices must be integers
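
The traceback above fails at `datas['data']['uniqid']`, which means `datas['data']` was a string rather than a dict when that line ran. Here is a minimal sketch of a more defensive parse that reports the payload instead of raising a bare TypeError. The function name `extract_uniqid` and both sample payloads are hypothetical; only the `data` and `uniqid` keys come from the traceback, so treat this as a debugging aid and not the project's actual fix.

```python
import json


def extract_uniqid(html: str) -> str:
    """Parse a response body and return data.uniqid, or raise a readable error.

    Hypothetical helper: only the `data` and `uniqid` keys come from the
    traceback; everything else about the payload is an assumption.
    """
    datas = json.loads(html)
    data = datas.get("data") if isinstance(datas, dict) else None
    if not isinstance(data, dict):
        # This is the case that otherwise surfaces as
        # "TypeError: string indices must be integers".
        raise ValueError(f"unexpected response, 'data' is not an object: {datas!r}")
    if "uniqid" not in data:
        raise ValueError(f"response has no 'uniqid': {datas!r}")
    return data["uniqid"]


# Made-up payloads for illustration only.
ok_body = json.dumps({"data": {"uniqid": "abc123", "userIndexes": []}})
bad_body = json.dumps({"status": 10000, "data": "", "message": "not login"})

print(extract_uniqid(ok_body))
try:
    extract_uniqid(bad_body)
except ValueError as exc:
    print("request rejected:", exc)
```

Printing the whole decoded payload in the error message makes it easier to tell an expired cookie from a changed API, which the bare TypeError hides.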

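On the string indices must be integers error possibly caused by being blocked

Hello! First of all, thank you very much for your code. Using it as a reference, I have already scraped data for several keywords successfully (for a course paper).

Perhaps I scraped too much data: the first seven keywords (I scraped them one at a time) all worked, but when I scraped again the string indices must be integers error appeared. Has the site blocked me? Changing the cookie and the IP address still gives the same error.

I am using your latest program with my own cookie added and no other modifications.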

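Data for some keywords cannot be fetched

The code scrapes data for Guangzhou from 2020-01-01 to 2020-12-31 with the keywords ['糖果', '冻干', '月饼', '啤酒', '洋酒']. Strangely, on some runs a few columns of data are not fetched, as in this screenshot:
微信图片_20210221141400
But at other times everything is fetched:
微信图片_20210221141658

Does something here need to be changed?

Baidu Index has two metrics

One is the search index and the other is the information (feed) index. Can the feed index be scraped? In my tests only the search index could be scraped.

Category options

Baidu Index has a device category: PC+mobile, PC, and mobile. Could this be added?

Pass the _all_kind variable in baidu_index.py as a parameter

I suggest passing the _all_kind variable in baidu_index.py as a parameter, which makes it easy to specify which types to scrape.

For example: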
# _all_kind = ['all', 'pc', 'wise']

def __init__(
    self,
    *,
    keywords: list,
    start_date: str,
    end_date: str,
    cookies: str,
    area=0,
    type=['all', 'pc', 'wise']
):
    self.keywords = keywords
    self.area = area
    self.start_date = start_date
    self.end_date = end_date
    self.cookies = cookies
    self._params_queue = utils.get_params_queue(start_date, end_date, keywords)
    self._all_kind = type
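
A variant of the same idea, sketched as a standalone class so that it runs on its own: it uses `None` as the default instead of a mutable list and avoids shadowing the built-in `type`. The class name `IndexRequest` and the parameter name `kinds` are hypothetical, restricting the values to the three original kinds is an assumption, and the `utils.get_params_queue` call is left out because its behaviour is not shown here.

```python
from typing import Iterable, List, Optional

# The three kinds listed in the original _all_kind variable.
DEFAULT_KINDS = ("all", "pc", "wise")


class IndexRequest:
    """Hypothetical container mirroring the constructor proposed above."""

    def __init__(
        self,
        *,
        keywords: List[str],
        start_date: str,
        end_date: str,
        cookies: str,
        area: int = 0,
        kinds: Optional[Iterable[str]] = None,
    ):
        # Copy into a fresh list so instances never share a default object.
        chosen = list(DEFAULT_KINDS if kinds is None else kinds)
        unknown = [k for k in chosen if k not in DEFAULT_KINDS]
        if unknown:
            raise ValueError(f"unsupported kinds: {unknown}")
        self.keywords = keywords
        self.start_date = start_date
        self.end_date = end_date
        self.cookies = cookies
        self.area = area
        self._all_kind = chosen


req = IndexRequest(
    keywords=["爬虫"],
    start_date="2018-01-01",
    end_date="2018-01-31",
    cookies="<your cookie string>",
    kinds=["pc"],
)
print(req._all_kind)
```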

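A question about using the cookie

My problem is that the program ran successfully after I pasted in the cookie, but running it again today produced the following error: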
Traceback (most recent call last):
File "demo.py", line 26, in <module>
for i, keyword_type_date_index in enumerate(baidu_index.get_index()):
File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 51, in get_index
encrypt_datas, uniqid = self._get_encrypt_datas(
File "C:\Users\jxyzh\Desktop\spider-BaiduIndex-master\baidu_index\baidu_index.py", line 84, in _get_encrypt_datas
uniqid = datas['data']['uniqid']
TypeError: string indices must be integers

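I see someone has asked about this before, but in my case the program had already run successfully and only fails today, and a cookie freshly pasted today does not help either. Is there any way to fix this, or is the cause on Baidu's side?

Thank you very much.

Found a small bug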

formated_data['index'] = data['all']['data'][i]
In this line, the key after data should be kind rather than the hard-coded all.
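
To illustrate the reported fix, here is a minimal sketch of a formatting loop that indexes by `kind` rather than the literal `'all'`. The shape of `data` (a dict keyed by kind, each holding a `data` list) is inferred from the line quoted above; the function name `format_rows` and the sample values are made up.

```python
def format_rows(data, kinds=("all", "pc", "wise")):
    """Yield one dict per (kind, position) from an already-decoded payload."""
    for kind in kinds:
        values = data[kind]["data"]
        for i, value in enumerate(values):
            # Index by `kind`; hard-coding 'all' here would repeat the
            # 'all' series for every kind.
            yield {"type": kind, "position": i, "index": value}


# Made-up payload for illustration only.
sample = {
    "all": {"data": [30, 31]},
    "pc": {"data": [10, 11]},
    "wise": {"data": [20, 20]},
}
rows = list(format_rows(sample))
print(rows[2])
```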

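Also, could you tell me how you worked out the relationship between uniqid and the data values? I noticed this API after the earlier update, but I do not understand the decoding rules. Could you give me a rough explanation?

Unable to fetch Baidu Index data

I tried several cookies on different computers and in different network environments, and none of them could fetch the index data.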
