jerry1014 / fundcrawler

Crawls the Tiantian Fund website (天天基金网) to help with choosing investment funds

Home Page: http://fund.eastmoney.com/

License: GNU General Public License v3.0

Python 100.00%
cralwer fund-crawler

fundcrawler's Introduction

Tiantian Fund Crawler

Introduction

Important notes

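    Before buying any fund, be sure to verify the crawled data against the official website!
    Recommended book: 《解读基金:我的投资观与实践》 (roughly, "Understanding Funds: My Investment Views and Practice")
    Recommended site: Morningstar: www.morningstar.cn
  • Crawled fields: fund type, asset size, fund management company, net asset value, fund manager (the one with the longest recent continuous tenure), the manager's start date, three-year standard deviation, three-year Sharpe ratio, three-year return, five-year return
  • Crawling all the data takes 4930 s (as of 2024-05-22, 18917 funds in total); the bottleneck is the website's anti-crawling policy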

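Usage

  • Python 3.12; dependencies are listed in requirements.txt
  • Run run.py to crawl the fund data
  • Miscellaneous
    • To crawl only a small sample and check the result: test_process_manager.SmokeTestTaskManager.test_run
    • If the crawl collects a lot of data you do not need and is slow: module.crawling_data.async_crawling_data.AsyncCrawlingData.__init__
    • Log file written during the crawl: process_manager.TaskManager.__init__
    • Crawl result file: module.save_result.save_result_2_file.SaveResult2File.__init__
    • Result analysis (uses a heap to pick the few funds with the highest three-year Sharpe ratio): utils.result_analyse.analyse
    • To crawl more data
      1 Check whether the pages already being crawled contain the information you want: module.crawling_data.data_mining.data_mining_type.PageType
      If they do, extract it directly in the corresponding strategy, using a regex or some other method
      If they do not, add a new strategy that crawls the new page and cleans its data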
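
The heap-based result analysis mentioned above can be sketched as follows. This is a minimal illustration and not the project's utils.result_analyse.analyse: the row layout, the sharpe_3y key, and the fund codes are all assumptions made for the example.

```python
import heapq
from typing import Iterable


def top_by_sharpe(rows: Iterable[dict], n: int = 3, key: str = "sharpe_3y") -> list:
    """Return the n rows with the highest value under `key`, best first."""
    # Rows with a missing Sharpe ratio cannot be ranked, so skip them.
    valid = (row for row in rows if row.get(key) is not None)
    # heapq.nlargest keeps only n candidates in a heap instead of sorting everything.
    return heapq.nlargest(n, valid, key=lambda row: row[key])


# Made-up rows; real crawl results would come from the result file.
funds = [
    {"code": "A", "sharpe_3y": 0.8},
    {"code": "B", "sharpe_3y": 1.5},
    {"code": "C", "sharpe_3y": None},
    {"code": "D", "sharpe_3y": 1.1},
]
best = top_by_sharpe(funds, n=2)
```

With close to 19,000 funds and a small n, a heap avoids a full sort, although either approach would be fast at this scale.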

Technical details

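  • Data cleaning is CPU-bound and HTTP downloading is IO-bound. To keep the GIL and frequent thread switching from hurting throughput, AsyncHttpRequestDownloader starts a new process. Inside that child process a thread pool performs the HTTP requests, queues carry crawl tasks and results between the processes, and an event signals that crawling has finished
  • The current bottleneck is the website's anti-crawling policy. utils.downloader.rate_control.rate_control_analyse.draw_analyse can be used to analyse how many concurrent tasks the current network environment can sustain
    The current rate-control strategy is:
    1 A ring buffer records the most recent task outcomes and computes their failure rate (so that the control is not overly sensitive)
    2.1 If the failure rate is greater than 0, the concurrency threshold is set to half of the current value (it is changed only once until the failure rate recovers) and the current concurrency is set to 0
    2.2 If the failure rate is 0, current value = max(threshold * 1/2, current concurrency + step). The further the current value is from the threshold, the larger the step (to restore the previous crawl rate quickly). Once the current value exceeds the threshold, the step is a fixed value (slow growth that probes for further headroom)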
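
The rate-control rules above can be sketched as a small class. This is an illustration under stated assumptions, not the project's implementation: the class name, the window size, the fixed step of 1, and the "half of the remaining distance" step formula are all hypothetical choices that merely satisfy the rules as described.

```python
from collections import deque


class RateController:
    """Sketch of the described strategy; the step sizes are assumptions."""

    def __init__(self, initial: int = 8, window: int = 5, fixed_step: int = 1):
        self.current = initial        # concurrency currently allowed
        self.threshold = initial      # last concurrency level believed to be safe
        self.fixed_step = fixed_step
        self.recent = deque(maxlen=window)  # ring of recent outcomes, True = failed
        self.backed_off = False       # ensures the threshold is halved only once

    def record(self, failed: bool) -> int:
        """Record one task outcome and return the new allowed concurrency."""
        self.recent.append(failed)
        failure_rate = sum(self.recent) / len(self.recent)
        if failure_rate > 0:
            # Rule 2.1: halve the threshold once, then pause new tasks.
            if not self.backed_off:
                self.threshold = max(1, self.current // 2)
                self.backed_off = True
            self.current = 0
        else:
            # Rule 2.2: recover quickly up to the threshold, then probe slowly.
            self.backed_off = False
            if self.current > self.threshold:
                step = self.fixed_step
            else:
                step = max(self.fixed_step, (self.threshold - self.current) // 2)
            self.current = max(self.threshold // 2, self.current + step)
        return self.current
```

Because the failure rate is computed over the whole ring, a single failure keeps the concurrency at 0 until enough later successes have pushed it out of the window.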

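Planned updates

  • Keep improving the crawl rate. The current method is not rigorous enough; consider a 1 s sliding window (controlling the rate by success rate)
  • See whether the progress display can be made nicer
  • Consider robustness as well: how to detect and validate that the mined data is correct (for example, the current regexes are kept short and can match a long run of wrong content)
  • The number of funds keeps growing, so resuming an interrupted crawl needs to be supported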
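
The sliding-window idea in the first item could look roughly like the sketch below. It is only one possible reading of that plan: the class name, the injectable clock, and the choice to report a success rate of 1.0 for an empty window are assumptions, and nothing here comes from the project's code.

```python
import time
from collections import deque


class SlidingWindowSuccessRate:
    """Tracks the success rate of requests finished in the last `window_seconds`."""

    def __init__(self, window_seconds: float = 1.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so the window can be tested
        self.events = deque()         # (timestamp, succeeded) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop outcomes that are older than the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def record(self, succeeded: bool) -> None:
        now = self.clock()
        self.events.append((now, succeeded))
        self._evict(now)

    def success_rate(self) -> float:
        self._evict(self.clock())
        if not self.events:
            return 1.0  # assumption: no recent evidence of trouble
        return sum(ok for _, ok in self.events) / len(self.events)
```

A controller could then raise the concurrency while success_rate() stays at 1.0 and lower it when the rate drops, which replaces the fixed-size ring with a time-based one.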

fundcrawler's People

Contributors

iyabao, jerry1014, taiyouzhang, tianjiaoding

fundcrawler's Issues

RuntimeError: can't start new thread

Crawl progress: [# ] 7%
Process GetPageByWebWithAnotherProcessAndMultiThreading-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/pi/Documents/myPython/FundCrawler-Dev/CrawlingCore.py", line 115, in run
    t.start()
  File "/usr/lib/python3.7/threading.py", line 847, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
can't start new thread
can't start new thread
^CTraceback (most recent call last):
  File "CrawlingFund.py", line 139, in <module>
    crawling_fund(fund_list)
  File "CrawlingFund.py", line 74, in crawling_fund
    a_result = result_queue.get()
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

Could you add a parameter to control the number of threads?
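
One way to provide such a parameter is a bounded thread pool. The sketch below is an assumption about how it could be done, not the repository's code: fetch_all, its max_threads argument, and the fetch callable are hypothetical names. The traceback suggests that threads were being started individually with t.start(), which can exhaust the thread limit on a small machine, though the source does not confirm how many were started.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def fetch_all(urls: Iterable[str], fetch: Callable[[str], str], max_threads: int = 8) -> List[str]:
    """Run fetch(url) for every URL using at most `max_threads` worker threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map reuses the same worker threads and preserves input order.
        return list(pool.map(fetch, urls))
```

For example, fetch_all(url_list, download_page, max_threads=4) never has more than four worker threads alive, however long url_list is; download_page stands for whatever function performs one HTTP request.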

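Add crawling of fund net asset value

A fund's net asset value is an important piece of its data and is not yet crawled. If possible, please add this feature. Thanks!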

AttributeError: 'NoneType' object has no attribute 'group'

Traceback (most recent call last):
  File "D:/Desktops/FundCrawler/CrawlingFund.py", line 181, in <module>
    crawling_fund(GetFundListByWeb())
  File "D:/Desktops/FundCrawler/CrawlingFund.py", line 142, in crawling_fund
    new_fund_info: FundInfo = fund_web_page_parse.send(a_result[1:])
  File "D:\Desktops\FundCrawler\ParsingHtml.py", line 72, in _parse_fund_info
    re.search(r'基金规模:((?:\d+(?:.\d{2}|)|--)亿元.*?)<', page_context).group(1))
AttributeError: 'NoneType' object has no attribute 'group'
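
This error occurs because re.search returns None when the pattern does not match, and .group(1) is then called on None. A hedged sketch of a guard follows. The pattern is a simplified stand-in for the one in the traceback, and both the function name and the decision to accept an ASCII or a full-width colon are assumptions made for the example.

```python
import re
from typing import Optional

# Simplified, hypothetical pattern: a number with optional two decimals, or "--",
# followed by the unit 亿元. Accepts an ASCII or a full-width colon.
FUND_SIZE = re.compile(r"基金规模[::]((?:\d+(?:\.\d{2})?|--)亿元)")


def extract_fund_size(page: str) -> Optional[str]:
    """Return the fund size text, or None if the page does not contain it."""
    match = FUND_SIZE.search(page)
    if match is None:
        # The page layout changed or the request was blocked; let the caller decide.
        return None
    return match.group(1)
```

Returning None lets the caller log the fund and continue rather than abort the whole crawl on a single unexpected page.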

NotImplementedError on first run

Initializing the random UA module; if this step takes a long time, set IF_UPDATE_FAKE_UA in FakeUA.py to False (the default)
Random UA module initialized
Fetching the fund list...
Found 8040 funds in total
Traceback (most recent call last):
  File "/Users/zhaolin/Project/oneself/FundCrawler/CrawlingFund.py", line 181, in <module>
    crawling_fund(GetFundListByWeb())
  File "/Users/zhaolin/Project/oneself/FundCrawler/CrawlingFund.py", line 123, in crawling_fund
    while having_fund_need_to_crawl and result_queue.qsize() < 100 and input_queue.qsize() < 10:
  File "/Users/zhaolin/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/queues.py", line 117, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError

NotImplementedError

/usr/local/Cellar/python/3.7.6_1/bin/python3 /Users/gongtao/Documents/dev/project/PycharmProjects/FundCrawler/CrawlingFund.py
Initializing the random UA module; if this step takes a long time, set IF_UPDATE_FAKE_UA in FakeUA.py to False (the default)
Random UA module initialized
Fetching the fund list...
Found 8408 funds in total
Traceback (most recent call last):
  File "/Users/gongtao/Documents/dev/project/PycharmProjects/FundCrawler/CrawlingFund.py", line 181, in <module>
    crawling_fund(GetFundListByWeb())
  File "/Users/gongtao/Documents/dev/project/PycharmProjects/FundCrawler/CrawlingFund.py", line 123, in crawling_fund
    while having_fund_need_to_crawl and result_queue.qsize() < 100 and input_queue.qsize() < 10:
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/queues.py", line 117, in qsize
    return self._maxsize - self._sem._semlock._get_value()
NotImplementedError
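
Both NotImplementedError reports fail inside multiprocessing.Queue.qsize(), which the Python documentation notes may raise NotImplementedError on platforms such as macOS where sem_getvalue() is not implemented. One way around it is to bound the queue and let a non-blocking put report when it is full. The sketch below assumes that approach; the feed helper and the task names are hypothetical and not taken from the repository.

```python
import multiprocessing as mp
import queue
from collections import deque


def feed(pending: deque, input_queue) -> int:
    """Move tasks from `pending` into a bounded queue until it is full.

    Returns the number of tasks queued. Never calls qsize().
    """
    queued = 0
    while pending:
        try:
            input_queue.put_nowait(pending[0])
        except queue.Full:
            break  # at capacity; call feed() again after results are consumed
        pending.popleft()
        queued += 1
    return queued


if __name__ == "__main__":
    tasks = deque(f"fund-{i}" for i in range(25))  # made-up task names
    input_queue = mp.Queue(maxsize=10)  # the bound replaces the qsize() < 10 check
    print(feed(tasks, input_queue), "queued,", len(tasks), "still pending")
    # Drain the queue so its background feeder thread can exit cleanly.
    for _ in range(10):
        input_queue.get()
```

The same helper works with queue.Queue, since both queue types raise queue.Full from put_nowait when a maxsize is set.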

AttributeError: 'NoneType' object has no attribute 'group'

Error message:
Traceback (most recent call last):
  File "CrawlingFund.py", line 181, in <module>
    crawling_fund(GetFundListByWeb())
  File "CrawlingFund.py", line 142, in crawling_fund
    new_fund_info: FundInfo = fund_web_page_parse.send(a_result[1:])
  File "ParsingHtml.py", line 72, in _parse_fund_info
    re.search(r'基金规模:((?:\d+(?:.\d{2}|)|--)亿元.*?)<', page_context).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

Could you please take a look?

Cannot run: an AttributeError is reported at line 129 of ParsingHtml.py

  File ".\CrawlingFund.py", line 181, in <module>
    crawling_fund(GetFundListByWeb())
  File ".\CrawlingFund.py", line 148, in crawling_fund
    new_fund_info: FundInfo = manager_web_page_parse.send(a_result[1:])
  File "C:\Users\huyy5\FundCrawler\ParsingHtml.py", line 129, in _parse_manager_info
    fund_info.set_manager_info(fund_info.manager_need_process_list.pop()[1], manager_info.group(1))
AttributeError: 'NoneType' object has no attribute 'group'
