
This project forked from whatamithinking/linkedinjobskillsdb


LinkedIn Job Skills DB

Database of job skills scraped from the LinkedIn skills directory. Stored in an SQLAlchemy database.

Features

  • Partially-complete job skills database scraped from LinkedIn
  • Refresh job skills in database
  • Scrape job skills pages
  • Automatic login to LinkedIn
Notes

The database is incomplete. Being polite makes scraping 35,000+ pages a slow business, but the tools are here to finish the job.

Tested with Python 3.4.8

Getting Started

  1. Download the code and unzip it.

  2. Open a command prompt in the unzipped folder.

  3. Install requirements from the requirements.txt file:

    pip install -r requirements.txt
    
  4. Start a Python interpreter from that directory and run the following code:

    from LinkedInJobSkills import LinkedInJobSkills
    l=LinkedInJobSkills( 'YOUR_LINKEDIN_EMAIL','YOUR_LINKEDIN_PASSWORD' )
    
  5. Start querying your skills database:

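    from sqlalchemy import text                         # bound parameters avoid quoting bugs and SQL injection
    Skill = 'python'
    Query = text( 'select * from root_skills where skill = :skill' )
    # assumes l.DB is an SQLAlchemy engine, connection, or session
    RootSkillRows = l.DB.execute( Query, { 'skill': Skill } )
    ...
    # do what you want with the data
    ...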
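
The query in step 5 can be tried without LinkedIn credentials or scraped data. The sketch below is not part of the project. It uses Python's standard `sqlite3` module with an in-memory table as a stand-in for `root_skills`; the column names come from the Database Design section, while the column types, the sample rows, and the helper `find_root_skill` are made up for illustration. The skill name is passed as a bound parameter rather than formatted into the SQL string.

```python
import sqlite3

# Stand-in for the project's database: an in-memory SQLite table.
# Column names follow the README; the types and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE root_skills ("
    "id INTEGER PRIMARY KEY, timestamp TEXT, skill TEXT, link TEXT)"
)
conn.executemany(
    "INSERT INTO root_skills (timestamp, skill, link) VALUES (?, ?, ?)",
    [
        ("2018-02-11", "python", "https://example.com/skills/python"),
        ("2018-02-11", "java", "https://example.com/skills/java"),
    ],
)


def find_root_skill(connection, skill):
    """Return all root_skills rows whose skill matches exactly (hypothetical helper)."""
    cursor = connection.execute(
        "SELECT id, timestamp, skill, link FROM root_skills WHERE skill = ?",
        (skill,),  # bound parameter: the driver handles quoting
    )
    return cursor.fetchall()


rows = find_root_skill(conn, "python")
for row in rows:
    print(row)
```

Because the value is bound rather than interpolated, a skill name containing quotes is treated as plain data and simply matches nothing.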
    

Notes

Database Design
  • companies - id, timestamp, skill, company, relation_count
  • related_skills - id, timestamp, skill, related_skill, relation_count
  • root_skills - id, timestamp, skill, link
Notes
  • relation_count = number of times a relationship between a skill and a company or a skill and another skill was observed
  • timestamp = last time the skill was updated
  • root_skills = stores the link to each skill page and records whether a visited page had data; a visited page without data is stored with a null skill
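
To make the table layout concrete, here is a minimal sketch of the three tables using Python's standard `sqlite3` module. Only the table and column names come from the list above. The column types, the sample rows, and the helpers `build_demo_db`, `top_related`, and `pages_without_data` are assumptions for illustration; the project's real schema is defined through SQLAlchemy and may differ.

```python
import sqlite3

# Column names follow the Database Design list; the types are assumed.
SCHEMA = """
CREATE TABLE root_skills (
    id INTEGER PRIMARY KEY, timestamp TEXT, skill TEXT, link TEXT
);
CREATE TABLE related_skills (
    id INTEGER PRIMARY KEY, timestamp TEXT, skill TEXT,
    related_skill TEXT, relation_count INTEGER
);
CREATE TABLE companies (
    id INTEGER PRIMARY KEY, timestamp TEXT, skill TEXT,
    company TEXT, relation_count INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO root_skills (timestamp, skill, link) VALUES (?, ?, ?)",
        [
            ("2018-02-11", "python", "https://example.com/skills/python"),
            # A visited page that had no data is stored with a null skill.
            ("2018-02-11", None, "https://example.com/skills/empty-page"),
        ],
    )
    conn.executemany(
        "INSERT INTO related_skills "
        "(timestamp, skill, related_skill, relation_count) VALUES (?, ?, ?, ?)",
        [
            ("2018-02-11", "python", "django", 12),
            ("2018-02-11", "python", "sql", 30),
            ("2018-02-11", "python", "linux", 7),
        ],
    )
    return conn


def top_related(conn, skill, limit=2):
    """Related skills for `skill`, most frequently observed first."""
    cursor = conn.execute(
        "SELECT related_skill, relation_count FROM related_skills "
        "WHERE skill = ? ORDER BY relation_count DESC LIMIT ?",
        (skill, limit),
    )
    return cursor.fetchall()


def pages_without_data(conn):
    """Links of visited skill pages that had no data (skill is null)."""
    cursor = conn.execute("SELECT link FROM root_skills WHERE skill IS NULL")
    return [link for (link,) in cursor]


demo = build_demo_db()
print(top_related(demo, "python"))
print(pages_without_data(demo))
```

Ordering by `relation_count` is one natural way to use that column: the higher the count, the more often the relationship was observed during scraping.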
Refresh Skills Data
  • Set the minimum and maximum sleep times. The delay between requests is normally distributed within this range, which makes the requests look more human and is polite to LinkedIn's servers.
    l.MinSleepSecs = 5                                  # min amount of time between requests
    l.MaxSleepSecs = 10                                 # max amount of time between requests
    
  • To refresh just one skill's data:
    Skill = 'python'
    SkillPageLinks = l._getLinksList()                  # get list of all skill page links. Last updated: 02/11/2018
    FilteredSkillPageLinks = \
        [ x for x in SkillPageLinks if Skill in x ]     # filter to get links for this skill
    for SkillPageLink in FilteredSkillPageLinks:        # refresh data for each skill page
        l.refreshSkill( SkillPageLink )
    
  • To refresh the entire skills database:
    l.refreshAllSkills( SkipExisting=True )             # scrape the entire skills directory and repopulate database
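
The README does not show how the delay between `MinSleepSecs` and `MaxSleepSecs` is drawn. The sketch below is one plausible approach, not the project's implementation: it samples from a normal distribution centred between the two bounds and clamps the result so it never falls outside them. The function names `polite_delay` and `polite_sleep` and the choice of standard deviation (one sixth of the range) are assumptions.

```python
import random
import time


def polite_delay(min_secs, max_secs, rng=random):
    """Draw a delay from a normal distribution centred between the bounds.

    The sample is clamped to [min_secs, max_secs]. The standard deviation
    (one sixth of the range) is an assumption, chosen so that almost all
    samples fall inside the range before clamping.
    """
    if min_secs > max_secs:
        raise ValueError("min_secs must not exceed max_secs")
    mean = (min_secs + max_secs) / 2.0
    std_dev = (max_secs - min_secs) / 6.0
    delay = rng.gauss(mean, std_dev)
    return max(min_secs, min(max_secs, delay))


def polite_sleep(min_secs, max_secs):
    """Sleep for a randomly drawn delay and return how long was slept."""
    delay = polite_delay(min_secs, max_secs)
    time.sleep(delay)
    return delay


# Sample some delays for the bounds used above (no actual sleeping here).
delays = [polite_delay(5, 10) for _ in range(1000)]
print(min(delays), max(delays))
```

Calling `polite_sleep( 5, 10 )` before each request would space requests roughly 7.5 seconds apart on average, which gives a sense of why scraping tens of thousands of pages takes a long time.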
    

Authors

  • Connor Mawynes - Initial work

