
job-postings-dwh

Kimball dimensional model and ETL design and implementation on a job postings dataset.

Data

The dataset consists of a sample of 30,000 jobs from 2010 to 2020 that were posted through the Armenian human resource portal CareerCenter.

A job posting usually has some structure, although some fields are not necessarily filled out by the creator (poster). The job data also includes information about the applicants.

The dataset is in JSON format, following the schema below, and is partitioned into files of 300 objects each.

The data schema, as well as a sample row, can be reviewed below.

Data Schema

Data Schema

JSON data row sample:

[
  {
    "id": "e4b558ad-adee-4ebc-8b6e-853a278e919a",
    "adverts": {
      "id": "99f54f6c-f8cd-4e1f-9d39-0a9e72192497",
      "activeDays": 12,
      "applyUrl": "https://istockphoto.com/lobortis/convallis.jsp?et=ut&ultrices=at&posuere=dolor&cubilia=quis&curae=odio&duis=consequat&faucibus=varius&accumsan=integer&odio=ac&curabitur=leo&convallis=pellentesque&duis=ultrices&consequat=mattis&dui=odio&nec=donec&nisi=vitae&volutpat=nisi&eleifend=nam&donec=ultrices&ut=libero&dolor=non&morbi=mattis&vel=pulvinar&lectus=nulla&in=pede&quam=ullamcorper&fringilla=augue&rhoncus=a&mauris=suscipit&enim=nulla&leo=elit&rhoncus=ac&sed=nulla&vestibulum=sed&sit=vel&amet=enim&cursus=sit&id=amet&turpis=nunc&integer=viverra&aliquet=dapibus&massa=nulla&id=suscipit&lobortis=ligula&convallis=in&tortor=lacus&risus=curabitur&dapibus=at&augue=ipsum&vel=ac&accumsan=tellus&tellus=semper&nisi=interdum&eu=mauris&orci=ullamcorper&mauris=purus&lacinia=sit&sapien=amet&quis=nulla&libero=quisque&nullam=arcu&sit=libero&amet=rutrum&turpis=ac&elementum=lobortis&ligula=vel&vehicula=dapibus&consequat=at&morbi=diam",
      "publicationDateTime": "1353621541",
      "status": "Inactive"
    },
    "benefits": [
      "Car",
      "Medical Insurance",
      "Phone",
      "Home Office"
    ],
    "company": "O'Conner Group",
    "sector": "Product Management",
    "title": "Statistician II",
    "city": "Hawassa",
    "applicants": [
      {
        "firstName": null,
        "lastName": null,
        "skills": null,
        "age": null,
        "applicationDate": null
      },
      {
        "firstName": null,
        "lastName": null,
        "skills": null,
        "age": null,
        "applicationDate": null
      },
      {
        "firstName": "Scarlett",
        "lastName": "Cowl",
        "skills": [
          "Eggs",
          "RTO",
          "EBR",
          "KVM",
          "EOC"
        ],
        "age": 35,
        "applicationDate": "1354226341"
      },
      {
        "firstName": "Noe",
        "lastName": "Groundwator",
        "skills": [
          "LDPE",
          "DFT",
          "Watercolor",
          "Online Travel",
          "TMJ Dysfunction"
        ],
        "age": 39,
        "applicationDate": "1354658341"
      },
      {
        "firstName": "Fonsie",
        "lastName": "Tomkin",
        "skills": [
          "Oracle",
          "Visual Basic",
          "MVS",
          "RPO",
          "SBIR"
        ],
        "age": 41,
        "applicationDate": "1353880741"
      }
    ]
  }
]
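A minimal sketch of loading these partitioned files with Spark, the engine used by this project's ETL (the data lake path and app name are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("job-postings-extract")
  .getOrCreate()

// Each partition file holds a JSON array of ~300 job objects,
// so multiLine parsing is required
val rawJobs = spark.read
  .option("multiLine", true)
  .json("s3a://datalake-bucket/job-postings/*.json") // hypothetical path

rawJobs.printSchema()
```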

Dimensional model

The dimensional design process starts from the low-level activities of a business process in the organization, declares its grain, and then extends the model by determining the dimensions and facts.

1. Business Process

The business of job advertising isn't driven only by the companies posting jobs; the real goal is met when job applicants submit their proposals and show interest in job positions, satisfying the organizations' needs. In this sense, the business process "transaction" to model is the job application submitted by a job seeker.

2. Declare the Grain

The most granular data is generated by a job application submitted by a person to an individual job post. When a job applicant applies to a position, they also submit a collection of skills related to that position. It could be tempting to set one fact table row per submitted skill, but this would greatly increase the table size and impact performance. Thus, the grain is set to one row per individual job application, and skills are handled with dimension and bridge tables to keep the fact table at a reasonable size and level of detail.
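As a rough illustration, assuming the raw data has been loaded into a DataFrame rawJobs as above, rows at the declared grain can be derived by exploding the applicants array (column names are taken from the JSON sample):

```scala
import org.apache.spark.sql.functions._

// One output row per (job posting, applicant) pair: the declared grain
val applications = rawJobs
  .select(
    col("id").as("job_id"),
    col("company"),
    explode(col("applicants")).as("applicant")
  )
  .select(
    col("job_id"),
    col("company"),
    col("applicant.firstName").as("first_name"),
    col("applicant.lastName").as("last_name"),
    col("applicant.age").as("age"),
    col("applicant.skills").as("skills"),
    // applicationDate is a unix-epoch string in the sample data
    from_unixtime(col("applicant.applicationDate")).as("application_ts")
  )
```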

3. Identify the Dimensions

When a job post is published, an "id" natural key is generated along with the job title; the post also describes the position's benefits, the company, the city, and the sector. A set of attributes pertaining to the publication is also recorded: the number of days the publication is active (activeDays), the application URL, the publication date, and the current status.

When a person applies for the job, their age, first name, last name and proposed skills are recorded.

In this sense, the appropriate dimensions to model are: Date, Advertisement, Applicant, Job, Company, Skills, Benefits.

One of the most arguable decisions could be that applicant skills have been modeled as a dimension of their own, instead of coupling skills inside the Applicant dimension. If you give it some thought, whenever an applicant applies to a position, they adapt their resume, CV, and skills to the knowledge required by that position, depending on the company and the skills required. If one tried to capture skills as slowly changing attributes inside the Applicant dimension, the dimension could end up as a big entanglement of columns and rows that is hard to maintain. Instead, every time an applicant submits an application to a job, their proposed skillset is recorded at the fact table's granular level; this allows full history recording and simplifies queries and model maintenance over time.

4. Identify the Facts

The business process in this dataset connects applicants to companies looking to fill a position. When job seekers apply and are accepted by a company, no quantifiable measures are generated. Thus, the fact table records these events without any numeric transactional data; this is known as a Factless Fact Table. At the event of an application to a job, only dimension characteristics are recorded.
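Because a factless fact table carries no measures, analyses against it reduce to counts over joined dimensions. A hedged example, reusing the Spark session from the loading sketch and assuming the tables from the data dictionary below are registered in the Spark catalog:

```scala
// Applications per company and month: COUNT(*) is effectively the
// only "measure" a factless fact table supports
val applicationsPerMonth = spark.sql("""
  SELECT c.company_name,
         d.year,
         d.month_name,
         COUNT(*) AS applications
  FROM application_fact f
  JOIN company_dim  c ON f.company_key = c.key
  JOIN calendar_dim d ON f.date_key    = d.date_key
  GROUP BY c.company_name, d.year, d.month_name
""")
```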

The resulting dimensional model for the dataset after analysis is depicted below.

Job Posting Schema

Data dictionary for dimension and fact tables

Calendar_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| date_key | yes | integer | Integer representing the date in yyyymmdd format |
| full_date | yes | date | Date saved as date type in the format yyyy-MM-dd |
| day_num_in_month | yes | integer | Day of the month for the date |
| day_name | yes | string | Descriptive day name |
| week_num_in_year | yes | integer | Number of the week of the year |
| month | yes | integer | Number of the month |
| month_name | yes | string | Calendar month name |
| quarter | yes | integer | Quarter of the year |
| year | yes | integer | Year number |
| month_end_flag | yes | string | Takes the descriptive values "month end" or "not month end" |
| same_day_year_ago | yes | date | Date of the same day one year before the row's date |

Adverts_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| activeDays | no | integer | Number of days the job post is active |
| applyUrl | no | string | URL of a website with a further job post description |
| publicationDate | yes | date | Date of the job post publication |
| status | yes | string | Whether the job post is still "active" or already "inactive" |

Applicant_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| firstName | no | string | First name of the applicant |
| lastName | no | string | Last name of the applicant |
| age | no | integer | Age of the applicant |
| age_range | no | string | Age band of the applicant |

Job_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| job_id | no | string | Natural key representing a unique id for the job post |
| city | no | string | City in which the job is located |
| title | yes | string | Title of the job position |
| sector | no | string | Industry sector to which the job position belongs |

Benefit_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| benefit_name | yes | string | Description of the benefit offered with the job position |

Skill_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| skill_name | yes | string | Name of the skill |

Company_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| company_name | yes | string | Name of the company offering the job position |

Job-benefit bridge table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| job_key | yes | integer | Foreign key pointing to the job_dim primary key |
| benefit_key | yes | integer | Foreign key pointing to the benefit_dim primary key |

Application Factless fact table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| date_key | yes | integer | Foreign key pointing to the date of the application in calendar_dim |
| applicant_key | yes | integer | Foreign key pointing to the applicant_dim primary key |
| company_key | yes | integer | Foreign key pointing to the company_dim primary key |
| job_key | yes | integer | Foreign key pointing to the job_dim primary key |
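A sketch of how fact rows could be assembled from the application-grain DataFrame by looking up surrogate keys in already-built dimensions. companyDim and jobDim are assumed to exist with the columns from the data dictionary, and the key-assignment strategy is illustrative only:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Narrow the dimensions down to (surrogate key, natural key) lookups
val companyLookup = companyDim.select(col("key").as("company_key"),
                                      col("company_name").as("company"))
val jobLookup = jobDim.select(col("key").as("job_key"), col("job_id"))

val applicationFact = applications
  // sequential surrogate key per application row; the global window is
  // fine for a sample but forces a single partition at scale
  .withColumn("application_key",
    row_number().over(Window.orderBy("job_id", "first_name", "last_name")))
  // date_key follows the calendar_dim yyyymmdd convention
  .withColumn("date_key",
    date_format(col("application_ts"), "yyyyMMdd").cast("int"))
  .join(companyLookup, Seq("company"))
  .join(jobLookup, Seq("job_id"))
  .select("application_key", "date_key", "company_key", "job_key")
  // applicant_key lookup omitted for brevity
```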

Application-skillset bridge table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| application_key | yes | integer | Foreign key pointing to the application row in the application_fact table |
| skill_key | yes | integer | Foreign key pointing to the skill of the applicant, one per row |
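The skill bridge can be derived the same way: explode each application's skill list into one row per skill and swap the names for skill_dim surrogate keys. A sketch continuing the DataFrames above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assign application_key exactly as in the fact sketch so bridge
// rows line up with fact rows
val keyedApplications = applications.withColumn("application_key",
  row_number().over(Window.orderBy("job_id", "first_name", "last_name")))

// skill_dim: distinct skill names with sequential surrogate keys
val skillDim = keyedApplications
  .select(explode(col("skills")).as("skill_name"))
  .distinct()
  .withColumn("key", row_number().over(Window.orderBy("skill_name")))

// One bridge row per (application, skill) pair; null skill lists
// simply produce no rows under explode
val applicationSkillBridge = keyedApplications
  .select(col("application_key"), explode(col("skills")).as("skill_name"))
  .join(skillDim, Seq("skill_name"))
  .select(col("application_key"), col("key").as("skill_key"))
```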

Dataflow explanation and diagram

The workflow requires several steps that depend on each other in order to implement the dimensional model on Amazon Redshift.

First, data is extracted from the data lake: a Spark/Scala app reads the raw JSON data, performs some core transformations, and saves the resulting DataFrame in parquet format in the staging bucket.
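That landing step could look roughly like this (bucket and prefix are placeholders):

```scala
// Persist the core-transformed extract to the staging area as parquet
applications.write
  .mode("overwrite")
  .parquet("s3a://jobs-dwh-staging/applications/")
```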

Calendar master data in the form of a spreadsheet is added to the staging bucket too. This is a manual step that is performed once, as the master date data can last for many years without changes. There are many business-related attributes in the Calendar dimension that SQL queries can't generate automatically, so generating the calendar dimension rows by hand in a spreadsheet is a common practice in dimensional modeling, as stated by Kimball in his book (https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/).
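Once exported (e.g. as CSV), the hand-maintained calendar can be read from staging like any other source; the path and format here are assumptions:

```scala
// Load the manually curated calendar dimension from the staging bucket
val calendarDim = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("s3a://jobs-dwh-staging/calendar_dim/") // hypothetical location
```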

Then a series of Spark applications written in Scala perform the core ETL steps to create the dimensional tables. Following the defined dependency hierarchy, dimension tables are generated first, followed by the job-benefit bridge table, then the factless fact table, and finally the bridge table between the application factless fact table and the skill dimension. The resulting parquet files are saved in the presentation bucket.
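For instance, a dimension such as benefit_dim reduces to the distinct benefit names with a sequential surrogate key, written to the presentation bucket (bucket and prefix names are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Distinct benefit names, keyed sequentially
val benefitDim = rawJobs
  .select(explode(col("benefits")).as("benefit_name"))
  .distinct()
  .withColumn("key", row_number().over(Window.orderBy("benefit_name")))

benefitDim.write
  .mode("overwrite")
  .parquet("s3a://jobs-dwh-presentation/benefit_dim/")
```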

The dataflow diagram is depicted below.

dataflow
