
job_spider's Introduction

JobSpider

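Purpose

This is a Python practice project. It scrapes job postings (job title, salary, company name, and so on) from job-hunting sites, using the Python positions listed on Lagou (拉勾网) as the example.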

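Database design

The database is tentatively MongoDB, a non-relational database. It stores records as key-value pairs with no fixed structure, so each record can carry different fields. That saves several join tables and makes the crawler easier to extend.

The design of the jobs collection, shown through the scraped result for one position: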

{
    "_id" : ObjectId("..."),
    "job_title" : "Python开发工程师",
    "salary" : "10k-20k",
    "company" : {
        "company_name" : "顺网科技",
        "industry" : "游戏,文化娱乐 / 上市公司"
    },
    "location" : "成都·武侯区",
    "tags" : "['游戏', '直播', '中级', 'Java', '后端']",
    "welfare" : "上市公司,大数据,大平台,福利健全",
    "format_time" : "2018-03-19"
}
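The record above nests the company fields, while the CSV columns used for the import are flat. A minimal sketch of how one flat row could be reshaped into that nested form is given here. The function names `parse_tags` and `row_to_document` and the optional `industry` key are assumptions for illustration, not part of job_spider.py, and turning `tags` into a real list is an optional variation on the string shown in the sample.

```python
import ast


def parse_tags(raw):
    """Parse tags stored as the text of a Python list, e.g. "['a', 'b']"."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: keep the raw text as a single tag, if any.
        return [raw] if raw else []
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


def row_to_document(row):
    """Reshape one flat scraped row (a dict) into a nested jobs document."""
    company = {"company_name": row["company"]}
    # "industry" is a hypothetical extra column; it is simply left out when
    # missing, which MongoDB allows because documents need not share fields.
    if row.get("industry"):
        company["industry"] = row["industry"]
    return {
        "job_title": row["title"],
        "salary": row["salary"],
        "company": company,
        "location": row["location"],
        "tags": parse_tags(row["tags"]),
        "welfare": row["welfare"],
        "format_time": row["format_time"],
    }
```

Because the company sub-document only receives the keys that are present, rows scraped from different sites can produce documents with different fields without any schema change.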

Running

  • Environment: Python 3, requests, BeautifulSoup

  • Run: python job_spider.py

  • Data processing

  • Import the processed data.csv file into MongoDB

    mongoimport -d "resources" --type "csv" -c "jobs" --file=data.csv -h localhost:27017 -f "title,salary,company,location,tags,welfare,format_time"
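The same import could also be done from Python. The sketch below assumes data.csv is UTF-8 encoded and has no header row (the mongoimport command passes the field names with -f rather than reading them from the file). The function names `read_jobs` and `import_jobs` are hypothetical, and `import_jobs` needs the third-party pymongo package, which is not in the environment listed above.

```python
import csv

# Same column order as the -f option of the mongoimport command.
FIELDS = ["title", "salary", "company", "location", "tags", "welfare", "format_time"]


def read_jobs(path):
    """Read a header-less CSV file into a list of dicts keyed by FIELDS."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, fieldnames=FIELDS))


def import_jobs(path, host="localhost", port=27017):
    """Insert every CSV row into the jobs collection of the resources database."""
    from pymongo import MongoClient  # imported lazily so read_jobs works without it

    rows = read_jobs(path)
    if not rows:
        return 0  # insert_many rejects an empty list
    client = MongoClient(host, port)
    try:
        result = client["resources"]["jobs"].insert_many(rows)
        return len(result.inserted_ids)
    finally:
        client.close()
```

Calling `import_jobs("data.csv")` with a MongoDB server running on localhost:27017 would load the rows into the same database and collection that the mongoimport command targets.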

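Open problems

  • Only the first five pages can be crawled at the moment, so a way to cope with the site's anti-crawler measures is needed
  • The total page count for a given position should be scraped; it is currently hard-coded
  • The data is only written to a CSV file for now; a database still needs to be set up to store it
  • Crawler failures, such as network errors, need to be handled
  • Distributed crawling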
  • ...
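Two of the items above, network error handling and the hard-coded page count, could be approached roughly as sketched below. This is not the project's code: `fetch_url`, `fetch_with_retry`, and `crawl_all_pages` are hypothetical names, urllib stands in for requests so the sketch has no third-party dependency, and the stopping rule assumes that a page past the last one yields no jobs, which would need to be checked against the real site.

```python
import time
import urllib.request


def fetch_url(url, timeout=10):
    """Download a URL and return its body as text (stdlib stand-in for requests)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, fetch=fetch_url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as error:  # covers URLError, timeouts, connection resets
            last_error = error
            if attempt < retries - 1:
                # Wait backoff, 2 * backoff, 4 * backoff, ... between attempts.
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


def crawl_all_pages(fetch_page, max_pages=100):
    """Collect jobs from page 1, 2, ... until a page returns no jobs.

    fetch_page(page_number) must return a list of jobs; max_pages is a
    safety cap so the loop always terminates.
    """
    jobs = []
    for page in range(1, max_pages + 1):
        page_jobs = fetch_page(page)
        if not page_jobs:
            break
        jobs.extend(page_jobs)
    return jobs
```

Passing the fetch and sleep functions as parameters keeps the retry logic independent of the HTTP library, so the same loop would work with a requests-based fetcher.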
