
haodf's Introduction

A web crawler for Haodf (好大夫).

Dependencies (Python 3.6)

  • BeautifulSoup
  • selenium
  • time
  • datetime
  • random
  • pyprind
  • pymysql

Data storage

The data is stored in four tables: all_url, QA, doctor, and relative_qa. Their structure is shown in the figure (image: haodf).

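Execution steps

  1. Set the database parameters in ConnectDatabase.py to establish the database connection;
  2. Run getHaodf.py to collect all URLs. The crawl can be limited to a date range by setting:
# Start date of the crawl
CURRENT_DATE = "20180714"
# End date of the crawl
END_DATE = "20181231"
  3. Fetch the information for each URL. Haodf pages come in several structures; two have been found so far: {"class": "zzx_yh_stream"} and {"class": "f-card clearfix js-f-card"}. If a new structure appears, the parsing code has to be rewritten for it.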
# Parse the Q&A entries
qa_list = 1
# Default: first parsing method, {"class": "zzx_yh_stream"}
split_type = 1
qa_content_soups = soup.find_all("div", {"class": "zzx_yh_stream"})
# Fallback: second parsing method, {"class": "f-card clearfix js-f-card"}
if len(qa_content_soups) == 0:
    split_type = 3
    print("Using the second parsing method")
    qa_content_soups = soup.find_all("div", {"class": "f-card clearfix js-f-card"})
# A new page structure has appeared and must be handled manually
if len(qa_content_soups) == 0:
    split_type = 5
    input("Unknown page structure!")
  4. Run getContent.py
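
The date window from step 2 can be pictured as a simple day-by-day walk between the two settings. The sketch below is only an illustration, not the code in getHaodf.py: the helper name iter_dates is hypothetical, and it assumes that both values use the YYYYMMDD format and that END_DATE is inclusive, neither of which this README states.

```python
from datetime import datetime, timedelta

# Values taken from the README; how getHaodf.py consumes them is not shown there.
CURRENT_DATE = "20180714"
END_DATE = "20181231"


def iter_dates(start, end, fmt="%Y%m%d"):
    """Yield each day from start to end as a string in fmt.

    Hypothetical helper: treating `end` as inclusive is an assumption.
    """
    day = datetime.strptime(start, fmt)
    last = datetime.strptime(end, fmt)
    while day <= last:
        yield day.strftime(fmt)
        day += timedelta(days=1)


dates = list(iter_dates(CURRENT_DATE, END_DATE))
print(dates[0], dates[-1], len(dates))
```

Narrowing the two constants to a short range is a cheap way to try the crawler on a few days before running the whole window.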

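Changelog

2019.06.21

  • Added a progress bar for reading page URLs;
  • Tested on 1,000 pages;

2019.07.05

  • Added multithreaded access;
  • Improved the code structure;
  • Added a file that records the URLs that failed to parse;
  • Updated the status for most of the known URL parsing cases;

2019.07.12

  • Added a log file;
  • Fixed a bug where some data failed to be written to the database;
  • Consolidated the storage code to reduce the number of database operations;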

haodf's People

Contributors

helloatilol
