
job-postings-dwh

Kimball dimensional model and ETL design and implementation on a job postings dataset.

Data

The dataset consists of a sample of 30,000 jobs from 2010 to 2020 that were posted through the Armenian human resource portal CareerCenter.

A job posting usually has some structure, although some fields are not necessarily filled out by the creator (poster). The job data also includes information about the applicants.

The dataset is in JSON format, following the schema below, and is partitioned into files of 300 objects each.

The data schema, as well as a sample row, can be reviewed below.

Data Schema

Data Schema

JSON data row sample:

[
  {
    "id": "e4b558ad-adee-4ebc-8b6e-853a278e919a",
    "adverts": {
      "id": "99f54f6c-f8cd-4e1f-9d39-0a9e72192497",
      "activeDays": 12,
      "applyUrl": "https://istockphoto.com/lobortis/convallis.jsp?et=ut&ultrices=at&posuere=dolor&cubilia=quis&curae=odio&duis=consequat&faucibus=varius&accumsan=integer&odio=ac&curabitur=leo&convallis=pellentesque&duis=ultrices&consequat=mattis&dui=odio&nec=donec&nisi=vitae&volutpat=nisi&eleifend=nam&donec=ultrices&ut=libero&dolor=non&morbi=mattis&vel=pulvinar&lectus=nulla&in=pede&quam=ullamcorper&fringilla=augue&rhoncus=a&mauris=suscipit&enim=nulla&leo=elit&rhoncus=ac&sed=nulla&vestibulum=sed&sit=vel&amet=enim&cursus=sit&id=amet&turpis=nunc&integer=viverra&aliquet=dapibus&massa=nulla&id=suscipit&lobortis=ligula&convallis=in&tortor=lacus&risus=curabitur&dapibus=at&augue=ipsum&vel=ac&accumsan=tellus&tellus=semper&nisi=interdum&eu=mauris&orci=ullamcorper&mauris=purus&lacinia=sit&sapien=amet&quis=nulla&libero=quisque&nullam=arcu&sit=libero&amet=rutrum&turpis=ac&elementum=lobortis&ligula=vel&vehicula=dapibus&consequat=at&morbi=diam",
      "publicationDateTime": "1353621541",
      "status": "Inactive"
    },
    "benefits": [
      "Car",
      "Medical Insurance",
      "Phone",
      "Home Office"
    ],
    "company": "O'Conner Group",
    "sector": "Product Management",
    "title": "Statistician II",
    "city": "Hawassa",
    "applicants": [
      {
        "firstName": null,
        "lastName": null,
        "skills": null,
        "age": null,
        "applicationDate": null
      },
      {
        "firstName": null,
        "lastName": null,
        "skills": null,
        "age": null,
        "applicationDate": null
      },
      {
        "firstName": "Scarlett",
        "lastName": "Cowl",
        "skills": [
          "Eggs",
          "RTO",
          "EBR",
          "KVM",
          "EOC"
        ],
        "age": 35,
        "applicationDate": "1354226341"
      },
      {
        "firstName": "Noe",
        "lastName": "Groundwator",
        "skills": [
          "LDPE",
          "DFT",
          "Watercolor",
          "Online Travel",
          "TMJ Dysfunction"
        ],
        "age": 39,
        "applicationDate": "1354658341"
      },
      {
        "firstName": "Fonsie",
        "lastName": "Tomkin",
        "skills": [
          "Oracle",
          "Visual Basic",
          "MVS",
          "RPO",
          "SBIR"
        ],
        "age": 41,
        "applicationDate": "1353880741"
      }
    ]
  }
]
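A minimal sketch of loading these partitioned files with Spark, the engine used by this project's ETL (the data lake path and app name are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("job-postings-extract")
  .getOrCreate()

// Each partition file holds a JSON array of ~300 job objects,
// so multiLine parsing is required
val rawJobs = spark.read
  .option("multiLine", true)
  .json("s3a://datalake-bucket/job-postings/*.json") // hypothetical path

rawJobs.printSchema()
```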

Dimensional model

The dimensional design process starts from the low-level activities of a business process in the organization, declares its grain, and then extends the model by determining the dimensions and facts.

1. Business Process

The business of job advertising isn't driven only by the companies posting jobs; the real goal is met when job applicants submit their proposals and show interest in job positions, satisfying the organizations' needs. In this sense, the business process "transaction" to model is the job application submitted by a job seeker.

2. Declare the Grain

The most granular data is generated by a job application submitted by a person to an individual job post. When a job applicant applies to a position, they also submit a collection of skills related to that position. It could be tempting to set one fact table row per submitted skill, but this would greatly increase the table size and impact performance. Thus, the grain is set to one row per individual job application, and skills are handled with dimension and bridge tables to keep the fact table at a reasonable size and level of detail.
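As a rough illustration, assuming the raw data has been loaded into a DataFrame rawJobs as above, rows at the declared grain can be derived by exploding the applicants array (column names are taken from the JSON sample):

```scala
import org.apache.spark.sql.functions._

// One output row per (job posting, applicant) pair: the declared grain
val applications = rawJobs
  .select(
    col("id").as("job_id"),
    col("company"),
    explode(col("applicants")).as("applicant")
  )
  .select(
    col("job_id"),
    col("company"),
    col("applicant.firstName").as("first_name"),
    col("applicant.lastName").as("last_name"),
    col("applicant.age").as("age"),
    col("applicant.skills").as("skills"),
    // applicationDate is a unix-epoch string in the sample data
    from_unixtime(col("applicant.applicationDate")).as("application_ts")
  )
```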

3. Identify the Dimensions

When a job post is published, an "id" natural key is generated along with the job title; the post also describes the position's benefits, the company, the city, and the sector. A set of attributes pertaining to the publication is also recorded: the number of days the publication is active (activeDays), the application URL, the publication date, and the current status.

When a person applies for the job, their age, first name, last name and proposed skills are recorded.

In this sense, the appropriate dimensions to model are: Date, Advertisement, Applicant, Job, Company, Skills, Benefits.

One of the most arguable decisions could be that applicant skills have been modeled as a dimension of their own, instead of coupling skills inside the Applicant dimension. If you give it some thought, whenever an applicant applies to a position, they adapt their resume, CV, and skills to the knowledge required by that position, depending on the company and the skills required. If one tried to capture skills as slowly changing attributes inside the Applicant dimension, the dimension could end up as a big entanglement of columns and rows that is hard to maintain. Instead, every time an applicant submits an application to a job, their proposed skillset is recorded at the fact table's granular level; this allows full history recording and simplifies queries and model maintenance over time.

4. Identify the Facts

The business process in this dataset connects applicants to companies looking to fill a position. When job seekers apply and are accepted by a company, no quantifiable measures are generated. Thus, the fact table records these events without any numeric transactional data; this is known as a Factless Fact Table. At the event of an application to a job, only dimension characteristics are recorded.
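Because a factless fact table carries no measures, analyses against it reduce to counts over joined dimensions. A hedged example, reusing the Spark session from the loading sketch and assuming the tables from the data dictionary below are registered in the Spark catalog:

```scala
// Applications per company and month: COUNT(*) is effectively the
// only "measure" a factless fact table supports
val applicationsPerMonth = spark.sql("""
  SELECT c.company_name,
         d.year,
         d.month_name,
         COUNT(*) AS applications
  FROM application_fact f
  JOIN company_dim  c ON f.company_key = c.key
  JOIN calendar_dim d ON f.date_key    = d.date_key
  GROUP BY c.company_name, d.year, d.month_name
""")
```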

The resulting dimensional model for the dataset after analysis is depicted below.

Job Posting Schema

Data dictionary for dimension and fact tables

Calendar_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| date_key | yes | integer | Integer representing the date in yyyymmdd format |
| full_date | yes | date | Date saved as date type in the format yyyy-MM-dd |
| day_num_in_month | yes | integer | Day of the month for the date |
| day_name | yes | string | Descriptive day name |
| week_num_in_year | yes | integer | Number of the week of the year |
| month | yes | integer | Number of the month |
| month_name | yes | string | Calendar month name |
| quarter | yes | integer | Quarter of the year |
| year | yes | integer | Year number |
| month_end_flag | yes | string | Takes the descriptive values "month end" or "not month end" |
| same_day_year_ago | yes | date | Date of the same day one year before the row's date |

Adverts_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| activeDays | no | integer | Number of days the job post is active |
| applyUrl | no | string | URL of a website with a further job post description |
| publicationDate | yes | date | Date of the job post publication |
| status | yes | string | Whether the job post is still "active" or already "inactive" |

Applicant_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| firstName | no | string | First name of the applicant |
| lastName | no | string | Last name of the applicant |
| age | no | integer | Age of the applicant |
| age_range | no | string | Age band of the applicant |

Job_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| job_id | no | string | Natural key representing a unique id for the job post |
| city | no | string | City in which the job is located |
| title | yes | string | Title of the job position |
| sector | no | string | Industry sector to which the job position belongs |

Benefit_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| benefit_name | yes | string | Description of the benefit offered with the job position |

Skill_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| skill_name | yes | string | Name of the skill |

Company_dim table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| company_name | yes | string | Name of the company offering the job position |

Job-benefit bridge table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| job_key | yes | integer | Foreign key pointing to the job_dim primary key |
| benefit_key | yes | integer | Foreign key pointing to the benefit_dim primary key |

Application Factless fact table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| key | yes | integer | Surrogate sequential integer key of the table |
| date_key | yes | integer | Foreign key pointing to the date of the application in calendar_dim |
| applicant_key | yes | integer | Foreign key pointing to the applicant_dim primary key |
| company_key | yes | integer | Foreign key pointing to the company_dim primary key |
| job_key | yes | integer | Foreign key pointing to the job_dim primary key |
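A sketch of how fact rows could be assembled from the application-grain DataFrame by looking up surrogate keys in already-built dimensions. companyDim and jobDim are assumed to exist with the columns from the data dictionary, and the key-assignment strategy is illustrative only:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Narrow the dimensions down to (surrogate key, natural key) lookups
val companyLookup = companyDim.select(col("key").as("company_key"),
                                      col("company_name").as("company"))
val jobLookup = jobDim.select(col("key").as("job_key"), col("job_id"))

val applicationFact = applications
  // sequential surrogate key per application row; the global window is
  // fine for a sample but forces a single partition at scale
  .withColumn("application_key",
    row_number().over(Window.orderBy("job_id", "first_name", "last_name")))
  // date_key follows the calendar_dim yyyymmdd convention
  .withColumn("date_key",
    date_format(col("application_ts"), "yyyyMMdd").cast("int"))
  .join(companyLookup, Seq("company"))
  .join(jobLookup, Seq("job_id"))
  .select("application_key", "date_key", "company_key", "job_key")
  // applicant_key lookup omitted for brevity
```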

Application-skillset bridge table

| Attribute name | Required | Type | Description |
| --- | --- | --- | --- |
| application_key | yes | integer | Foreign key pointing to the application row in the application_fact table |
| skill_key | yes | integer | Foreign key pointing to the skill of the applicant, one per row |
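The skill bridge can be derived the same way: explode each application's skill list into one row per skill and swap the names for skill_dim surrogate keys. A sketch continuing the DataFrames above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assign application_key exactly as in the fact sketch so bridge
// rows line up with fact rows
val keyedApplications = applications.withColumn("application_key",
  row_number().over(Window.orderBy("job_id", "first_name", "last_name")))

// skill_dim: distinct skill names with sequential surrogate keys
val skillDim = keyedApplications
  .select(explode(col("skills")).as("skill_name"))
  .distinct()
  .withColumn("key", row_number().over(Window.orderBy("skill_name")))

// One bridge row per (application, skill) pair; null skill lists
// simply produce no rows under explode
val applicationSkillBridge = keyedApplications
  .select(col("application_key"), explode(col("skills")).as("skill_name"))
  .join(skillDim, Seq("skill_name"))
  .select(col("application_key"), col("key").as("skill_key"))
```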

Dataflow explanation and diagram

The workflow requires several steps that depend on each other in order to implement the dimensional model on Amazon Redshift.

First, data is extracted from the data lake: a Spark/Scala app reads the raw JSON data, performs some core transformations, and saves the resulting DataFrame in parquet format in the staging bucket.
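That landing step could look roughly like this (bucket and prefix are placeholders):

```scala
// Persist the core-transformed extract to the staging area as parquet
applications.write
  .mode("overwrite")
  .parquet("s3a://jobs-dwh-staging/applications/")
```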

Calendar master data in the form of a spreadsheet is added to the staging bucket too. This is a manual step that is performed once, as the master date data can last for many years without changes. There are many business-related attributes in the Calendar dimension that SQL queries can't generate automatically, so generating the calendar dimension rows by hand in a spreadsheet is a common practice in dimensional modeling, as stated by Kimball in his book (https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/).
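Once exported (e.g. as CSV), the hand-maintained calendar can be read from staging like any other source; the path and format here are assumptions:

```scala
// Load the manually curated calendar dimension from the staging bucket
val calendarDim = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("s3a://jobs-dwh-staging/calendar_dim/") // hypothetical location
```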

Then a series of Spark applications written in Scala perform the core ETL steps to create the dimensional tables. Following the defined dependency hierarchy, dimension tables are generated first, followed by the job-benefit bridge table, then the factless fact table, and finally the bridge table between the application factless fact table and the skill dimension. The resulting parquet files are saved in the presentation bucket.
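For instance, a dimension such as benefit_dim reduces to the distinct benefit names with a sequential surrogate key, written to the presentation bucket (bucket and prefix names are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Distinct benefit names, keyed sequentially
val benefitDim = rawJobs
  .select(explode(col("benefits")).as("benefit_name"))
  .distinct()
  .withColumn("key", row_number().over(Window.orderBy("benefit_name")))

benefitDim.write
  .mode("overwrite")
  .parquet("s3a://jobs-dwh-presentation/benefit_dim/")
```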

The dataflow diagram is depicted below.

dataflow
