- Set up a virtual environment and activate it (see the sketch after this list)
- Install the requirements
- Install the VS Code extensions: Add to GIT Ignore, YAML, Better Jinja
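A minimal sketch of the environment setup, assuming Python 3 on macOS/Linux and a requirements.txt that includes dbt-bigquery (both file name and contents are assumptions):

```bash
# Create and activate a virtual environment, then install the project requirements
python3 -m venv venv               # create the venv
source venv/bin/activate           # activate it (venv\Scripts\activate on Windows)
pip install -r requirements.txt    # assumed to include dbt-bigquery
```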
DBT FOR BIGQUERY SETUP
- Set up dbt via the command line, following https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup#local-oauth-gcloud-setup
- Install the gcloud CLI
- Authenticate with BigQuery by running:
For Mac or Linux:
```bash
gcloud auth application-default login \
--scopes=https://www.googleapis.com/auth/bigquery,\
https://www.googleapis.com/auth/drive.readonly,\
https://www.googleapis.com/auth/iam.test
```
For Windows:
```bash
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive.readonly,https://www.googleapis.com/auth/iam.test
```
CREATING A NEW DBT PROJECT
- Open a new project folder
- Run "dbt init" to create a project and answer the prompts:
  - name: your-project-name
  - choose database: [1] BigQuery
  - choose authentication: [1] oauth
  - enter the GCP project name or ID
  - enter the dataset name
  - threads: 60
  - select location: US
NOTE: Always ensure you are inside the folder that has your dbt virtual environment and that your virtual environment is activated
- Run:
```bash
dbt debug --config-dir
```
- Open profiles.yml, copy the dev profile to create a production profile, and change the schema (a sketch of the result follows)
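A minimal sketch of what profiles.yml might look like with a dev and a prod target; the profile name, project ID, and dataset names here are assumptions:

```yaml
your-project-name:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your-gcp-project-id   # assumed GCP project ID
      dataset: dbt_dev               # dev schema/dataset
      threads: 60
      location: US
    prod:
      type: bigquery
      method: oauth
      project: your-gcp-project-id
      dataset: dbt_prod              # only the schema/dataset changes
      threads: 60
      location: US
```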
- Create a new branch by clicking on "main" in VS Code and switching to the new branch, then stage the changes and commit
Install the dbt Power User extension:
- Search for "dbt Power User" and install it
- Run "command shift p"
- In the tab that shows up, search for "user settings.json", open it, and add the below to the JSON file:
```jsonc
// Associate the right file types with the right VS Code extensions
"files.associations": {
    "*.sql": "jinja-sql"
},
// CRUCIAL - change this to terminal.integrated.env.[osx|windows|linux] depending on your system
// and point it to the folder where your profiles directory is stored!
"terminal.integrated.env.osx": {
    "DBT_PROFILES_DIR": "fill-this-with-the-path-to-the-folder-containing-your-profiles.yml-file-on-your-local-machine",
    "BIGQUERY_PROJECT": "fill-this-with-your-gcp-project-name"
},
```
For the settings above:
a. On a Mac you can use ~/.dbt as DBT_PROFILES_DIR
b. Use your GCP project ID as BIGQUERY_PROJECT
c. In the "terminal.integrated.env.osx" key, replace "osx" with "windows" or "linux" depending on your system's OS
- run "dbt clean && dbt deps"
- Restart the vs code or run "command shift p" >> select reload window >> choose the correct python interpreter
- check the models folder to see if the models have the dbt icon (it works successfully if it does)
NOTE: Add the thelook_ecommerce (ECOMMERCE) data from the BigQuery public datasets.
dbt project architecture: Source >> Staging >> Intermediate >> Final table (fact or dimension table, or view). The staging layer is usually a layer of light transformation.
- Create a staging folder inside the models folder
- Inside the staging folder, add a src_ecommerce.yml file and fill it with the source descriptions as below:
```yaml
version: 2

sources:
  - name: thelook_ecommerce
    database: bigquery-public-data
    tables:
      - name: inventory_items
      - name: order_items
      - name: orders
      - name: products
      - name: users
```
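Once the source is declared, models reference its tables through the source() macro instead of hard-coding table paths; for example:

```sql
-- Compiles to `bigquery-public-data`.`thelook_ecommerce`.`orders`
SELECT *
FROM {{ source('thelook_ecommerce', 'orders') }}
```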
- In the same root folder that has dbt_project.yml, create a packages.yml and fill it in as below:
```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]
  - package: dbt-labs/codegen
    version: 0.9.0
```
- Run:
```bash
dbt deps
```
- Before running the next command, ensure that bigquery-public-data.thelook_ecommerce has been pulled into your BigQuery project; do this in the BigQuery console
- To generate the query for the staging orders table, run (the output is sketched below):
```bash
dbt run-operation --profiles-dir /Users/abidakunabisoye/.dbt generate_base_model --args '{"source_name": "thelook_ecommerce", "table_name": "orders"}'
```
or
```bash
dbt run-operation generate_base_model --args '{"source_name": "thelook_ecommerce", "table_name": "orders"}'
```
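The codegen macro prints a base model of roughly this shape; the column list here is abridged and assumed, so use whatever the command actually outputs:

```sql
WITH source AS (
    SELECT * FROM {{ source('thelook_ecommerce', 'orders') }}
),

renamed AS (
    SELECT
        order_id,
        user_id,
        status,
        created_at
        -- ...remaining columns as printed by generate_base_model
    FROM source
)

SELECT * FROM renamed
```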
- Create a stg_ecommerce_orders.sql file
- The command above prints a generated query; copy it and paste it into stg_ecommerce_orders.sql
- Run:
```bash
dbt run
```
or
```bash
dbt run -s stg_ecommerce_orders
```
- Generate the yml file for the stg_ecommerce_orders model; run:
```bash
dbt run-operation generate_model_yaml --args '{"model_names": ["stg_ecommerce_orders"]}'
```
or
```bash
dbt run-operation --profiles-dir /Users/path-to-profiles.yaml-file/.dbt generate_model_yaml --args '{"model_names": ["stg_ecommerce_orders"]}'
```
- Create a stg_ecommerce_orders.yml file and paste in the result of the previous command
- Update the file as shown below
## Testing, Documentation, Referencing and Configuration
```yaml
version: 2

models:
  - name: stg_ecommerce_orders
    description: "Table describing and detailing every order on a per-row basis"
    columns:
      - name: order_id
        description: "The order ID of the order"
        tests:
          - not_null
          - unique
      - name: user_id
        description: "ID of the user who placed the order"
        tests:
          - not_null
      - name: created_at
        description: "When the order was placed"
        tests:
          - not_null
      - name: returned_at
        description: "When the order was returned"
        tests:
          - not_null:
              where: "status = 'Returned'"
      - name: shipped_at
        description: "When the order was shipped"
        tests:
          - not_null:
              where: "delivered_at IS NOT NULL OR status = 'Shipped'"
      - name: delivered_at
        description: "When the order was delivered"
        tests:
          - not_null:
              where: "returned_at IS NOT NULL OR status = 'Complete'"
      - name: status
        description: "Status of the order"
        tests:
          - accepted_values:
              name: expected_order_status
              values:
                - Processing
                - Cancelled
                - Shipped
                - Complete
                - Returned
      - name: num_of_item
        description: "Number of items in the order"
        tests:
          - not_null
```
- Run:
```bash
dbt test -s stg_ecommerce_orders
```
- Running dbt run followed by dbt test is equivalent to running dbt build
- To add the descriptions to BigQuery, add this under the materialisation config in the dbt_project.yml file:
```yaml
+materialized: table
+persist_docs:
  relation: true
  columns: true
```
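In context, this config sits under the models: block of dbt_project.yml; a minimal sketch, assuming the project is named learning (the name used in the severity example later in these notes):

```yaml
models:
  learning:
    +materialized: table
    +persist_docs:
      relation: true   # persist the model (table) description
      columns: true    # persist the column descriptions
```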
- Run: dbt run --full-refresh
- Create an intermediate folder in the models folder
- Add two files to the intermediate folder: int_order_items_products.sql and int_ecommerce.yml, and fill both files in
- Create a marts folder in the models folder
- Add two files to the marts folder: orders.sql and orders.yml
-- Setting a Severity Level for Tests
- Open the dbt_project.yml file
- Set the default severity level for the project's tests to warn:
```yaml
tests:
  learning:
    +severity: warn  # every test in the project now defaults to a level of warn
```
- Comment out the Women value in the accepted_values test in stg_ecommerce_products
- Run:
```bash
dbt test -s stg_ecommerce_products
```
Macros for Testing
- Add this to the stg_ecommerce_products.yml file:
```yaml
- name: cost
  description: "How much the product cost the business to purchase"
  tests:
    - not_null
    - dbt_expectations.expect_column_values_to_be_between:
        min_value: 0
- name: retail_price
  description: "How much the product retails for on the online store"
  tests:
    - not_null
    - dbt_expectations.expect_column_values_to_be_between:
        min_value: 0
    - dbt_utils.expression_is_true:
        name: retail_price_not_leading_to_loss
        expression: ">= cost"
```
(Naming a test is useful for debugging which test failed.)
- Run the tests
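The packages above ship ready-made test macros, but you can also write your own generic test macro and apply it from yml like the package tests. A minimal sketch, saved under tests/generic/ (the is_positive name is hypothetical):

```sql
-- tests/generic/is_positive.sql (hypothetical example)
{% test is_positive(model, column_name) %}

-- A generic test passes when this query returns zero rows
SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```

It would then be listed under a column's tests as `- is_positive`, just like not_null.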
Writing a Custom Test
- Create a file in the tests folder called test_orders_match_order_items_details.sql (a sketch follows)
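A singular test is just a SELECT that returns the rows that should fail. A minimal sketch of what this file might contain, assuming the orders mart carries a total_sale_price column aggregated from the order items (the column name is an assumption):

```sql
-- Fails if an order's total doesn't match the sum of its order items
WITH order_item_totals AS (
    SELECT
        order_id,
        SUM(item_sale_price) AS total_sale_price
    FROM {{ ref('stg_ecommerce_order_items') }}
    GROUP BY 1
)

SELECT o.order_id
FROM {{ ref('orders') }} AS o
JOIN order_item_totals AS oit
    ON o.order_id = oit.order_id
-- small tolerance to absorb floating-point rounding
WHERE ABS(o.total_sale_price - oit.total_sale_price) > 0.01
```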
- Create a folder named documents in the models folder
- Add a file called doc_order_status.md to it
- Paste into it:
```markdown
{% docs status %}
The status of the order. Can be one of:

- Processing
- Cancelled
- Shipped
- Complete
- Returned
{% enddocs %}
```
- In the orders.yml and stg_ecommerce_orders.yml files, add the following:
```yaml
- name: order_status
  description: "{{ doc('status') }}"
```
- Run: dbt run -s stg_ecommerce_orders
- Go to the BigQuery table to see the data description
Add Seeds
- Describe slowly changing data types
- Create a distribution_center.csv file inside the seed folder
- Paste into it:
```csv
id,name,latitude,longitude
11,Miami FL,25.7617,-80.191788
12,Denver CO,39.7392,-104.9903
```
- Create a seed.yml file inside the seed folder (note: the seed name in the yml must match the CSV file name)
- Paste into it:
```yaml
version: 2

seeds:
  - name: seed_distribution_centers_new
    description: "An example of using a CSV file to load data into your warehouse"
    tests:
      - dbt_expectations.expect_table_row_count_to_equal:
          value: 2
    # Column names, descriptions, and tests can all be done as normal
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: name
        description: "Distribution center name"
      - name: latitude
      - name: longitude
    # If you want to enforce datatypes, you can do so here
    # Otherwise, BigQuery will do it for you!
    config:
      column_types:
        id: INTEGER
        name: STRING
        latitude: FLOAT
        longitude: FLOAT
```
- Run: dbt seed
- Check the BigQuery dataset for the seed materialisation (a ref() usage sketch follows)
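Seeds are referenced like any other model, via ref() rather than source(); for example:

```sql
-- ref() a seed by its file/seed name
SELECT id, name
FROM {{ ref('seed_distribution_centers_new') }}
```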
Adding Snapshots
- Explain what a snapshot means; see https://docs.getdbt.com/docs/build/snapshots
- Create a file in the snapshots folder named snapshot_distribution_centre.sql
- Paste into it:
```sql
{% snapshot snapshot_distribution_centre %}

{{
    config(
        target_schema='dbt_test',
        unique_key='id',
        strategy='check',
        check_cols=['name', 'latitude', 'longitude']
    )
}}

SELECT * FROM {{ source('thelook_ecommerce', 'distribution_centers') }}

{% endsnapshot %}
```
- Explain the target_schema concept
- Add the distribution_centers table to the src_ecommerce.yml file:
```yaml
- name: distribution_centers
```
- Run: dbt snapshot
- Update the seed file with the remaining csv rows
- Run: dbt seed
- Run: dbt snapshot
- Run: dbt run
- Describe the changes in the dbt_valid_to and dbt_updated_at columns (see the query sketch below)
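dbt adds SCD Type 2 metadata columns to every snapshot; a quick way to inspect them in BigQuery, with dbt_test being the target_schema from the config above:

```sql
-- Current rows have dbt_valid_to IS NULL; superseded rows carry the
-- timestamp at which one of the checked columns changed
SELECT id, name, dbt_valid_from, dbt_valid_to, dbt_updated_at
FROM dbt_test.snapshot_distribution_centre
ORDER BY id, dbt_valid_from
```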
Ephemeral Models
- Create a SQL file called int_initial_order_created.sql in the intermediate folder and paste into it:
```sql
{{
    config(materialized='ephemeral')
}}

SELECT
    user_id,
    MIN(created_at) AS first_order_created_at
FROM {{ ref('stg_ecommerce_orders') }}
GROUP BY 1
```
- Add the YAML anchor &user_id to the user_id column entry in int_ecommerce.yml
- Then add the block below to the int_ecommerce.yml file (a fuller sketch of the anchor pattern follows):
```yaml
- name: int_initial_order_created
  columns:
    # This would keep the name & tests defined in the anchor, but overwrite the description
    - *user_id
```
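For context, a minimal sketch of how the anchor might be defined on the existing user_id entry in int_ecommerce.yml; `&user_id` names the node so `*user_id` can reuse it (the description and tests here are assumptions):

```yaml
models:
  - name: int_order_items_products
    columns:
      - &user_id
        name: user_id
        description: "ID of the user who placed the order"  # assumed wording
        tests:
          - not_null
  - name: int_initial_order_created
    columns:
      - *user_id   # reuses the whole column definition anchored above
```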
- Add this to the orders.sql file:
```sql
-- a good way to demonstrate how to use an ephemeral materialisation
TIMESTAMP_DIFF(od.created_at, user_data.first_order_created_at, DAY) AS days_since_first_order
```
and join with:
```sql
LEFT JOIN {{ ref('int_initial_order_created') }} AS user_data
    ON od.user_id = user_data.user_id
```
- Run the orders.sql model
Incremental Models (see https://docs.getdbt.com/docs/build/incremental-models)
- Add the events table to the sources in src_ecommerce.yml if it isn't there already (the model below selects from it)
- Create a file named stg_ecommerce_events.sql
- Add:
```sql
/* Add this to the config later:
   unique_key='event_id',
   on_schema_change='sync_all_columns',
   partition_by={
       "field": "created_at",
       "data_type": "timestamp",
       "granularity": "day"
   }
*/
{{
    config(
        materialized='incremental',
    )
}}

WITH source AS (
    SELECT *
    FROM {{ source('thelook_ecommerce', 'events') }}
    {# First run with the WHERE clause below, then remove it #}
    WHERE created_at <= '2023-01-01'
)

SELECT
    id AS event_id,
    user_id,
    sequence_number,
    session_id,
    created_at,
    ip_address,
    city,
    state,
    postal_code,
    browser,
    traffic_source,
    uri AS web_link,
    event_type
FROM source

{# Only runs this filter on an incremental run #}
{% if is_incremental() %}
{# The {{ this }} macro is essentially a {{ ref('') }} macro that allows for a circular reference #}
{# A circular reference is when a table references itself #}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```
1. Run with the WHERE created_at <= '2023-01-01' clause in place and without the below in the config, using dbt run -s stg_ecommerce_events:
```
unique_key='event_id',
on_schema_change='sync_all_columns',
```
2. Run again without WHERE created_at <= '2023-01-01' by commenting it out
3. In the event logs you'll see "create" for the first run and "merge" for the second
4. Notice an increase in the rows for that table in BigQuery
5. If run again, no new data will be available
6. Add the unique_key and on_schema_change to the config, and explain the reason
7. Run: dbt run -s stg_ecommerce_events --full-refresh
Partitioning is when a table is cut into partitions based on a timestamp column to make queries faster. It can be done on any type of table, not only incremental tables, and is useful for cutting down how much of the table each query scans.
- Add the below to the stg_ecommerce_events model config:
```sql
partition_by={
    "field": "created_at",
    "data_type": "timestamp",
    "granularity": "day"
}
```
- Run: dbt run -s stg_ecommerce_events --full-refresh
- Check the compiled SQL in the target/compiled and target/run folders (see the example query below)
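To see partition pruning pay off, compare the bytes scanned for a filtered query in the BigQuery console; a sketch, assuming the model landed in a dataset named dbt_dev (the dataset name is an assumption):

```sql
-- With day partitioning on created_at, BigQuery only scans the week of
-- partitions covered by this filter instead of the whole table
SELECT COUNT(*) AS events
FROM dbt_dev.stg_ecommerce_events
WHERE created_at BETWEEN TIMESTAMP('2023-01-01') AND TIMESTAMP('2023-01-07')
```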
Model Governance (see https://docs.getdbt.com/docs/collaborate/govern/about-model-governance)
- Model Access: restricts who can ref() a model
- Model Contract: the model must have certain columns, types, and shape
- Model Version: basically versions of a model
- Add the below to the orders.yml file
```yaml
groups:
  - name: sales
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: [email protected]
      slack: sales-channel
      github: sales-channel-data

models:
  - name: orders
    description: "Table of order level information"
    # Set this model to be a part of the sales group we define above
    # Groups can be defined in another yml file
    config:
      group: sales
      # 3 settings:
      # private - only other models in the same (sales) group can ref() this model
      # protected - only other models in the same group or project can ref() this model
      # public - any other model can ref() this model
      access: protected
```
- Create the file test_marketing_orders.sql
- Add: SELECT * FROM {{ ref("orders") }}
- Run: dbt run -s test_marketing_orders
- Add a marketing group to the orders.yml file:
```yaml
- name: marketing
  owner:
    email: [email protected]
    slack: marketing-channel
    github: marketing-channel-data
```
- Add {{ config(group='marketing') }} to the test_marketing_orders model
- Change the orders model's access to private
- Run: dbt run -s test_marketing_orders and watch the error thrown
- Remove the marketing group reference from the test_marketing_orders model to stop the error from being thrown, or delete the model
Assignment: Implement severity, dbt_expectations.expect_column_values_to_be_between, and dbt_utils.expression_is_true on the orders model.
MODEL CONTRACTS
- Overhaul the stg_ecommerce_order_items model's yml as below. Note:
  - Model contracts run BEFORE the model is built
  - Model tests run AFTER the model is built
  - not_null checks are removed from the tests where they basically duplicate the contract
```yaml
version: 2

models:
  - name: stg_ecommerce_order_items
    description: "Line items from orders"
    config:
      contract:
        enforced: true
    columns:
      - name: order_item_id
        data_type: INTEGER
        constraints:
          - type: not_null
        tests:
          - not_null:
              severity: error
          - unique:
              severity: error
      - name: order_id
        data_type: INTEGER
        tests:
          - not_null
          - relationships:
              to: ref('stg_ecommerce__orders')
              field: order_id
      - name: user_id
        data_type: INTEGER
        tests:
          - not_null
      - name: product_id
        data_type: INTEGER
        tests:
          - not_null
          - relationships:
              to: ref('stg_ecommerce__products')
              field: product_id
      - name: item_sale_price
        data_type: FLOAT64
        description: "How much the item sold for"
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```
- Run: dbt run -s stg_ecommerce_order_items
- Comment out the product_id column in the above model, then run dbt run -s stg_ecommerce_order_items
- Cast user_id as a string in the above model, then run dbt run -s stg_ecommerce_order_items and watch the various contract failures and errors thrown
MODEL VERSIONS
1. Create a new folder called product_versions; inside it, copy and paste the stg_ecommerce_products model, rename it stg_ecommerce_products_v1, and remove the brand column
2. In the stg_ecommerce_products.yml file, add the following:
```yaml
- name: brand
  description: "Brand of the product"
```
3. Add the brand column to version 2 (rename the current version to stg_ecommerce_products_v2), then add the below to the yml:
```yaml
latest_version: 2
versions:
  # Matches what's above -- nothing more needed
  - v: 1
    columns:
      # This means: use the 'columns' list from above, but exclude "brand" as we added it in v2
      - include: all
        exclude: [brand]
  # We added a new brand column
  - v: 2
    # Makes this table stay as stg_ecommerce__products in our database!
    config:
      alias: stg_ecommerce_products
    columns:
      # This means: use the 'columns' list from above
      - include: all
```
NOTE: State this - if you don't specify latest_version, dbt will either look for the unversioned file name (e.g. stg_ecommerce__products.sql) or reference the latest version. In this case it'd reference version 2 automatically, but this shows you how you could do a pre-release version (e.g. create version 2, but by default dbt points at version 1) using latest_version: 1
4. Run: dbt clean && dbt deps
5. Run: dbt run -s stg_ecommerce_products
6. Change the latest version to 1 in the stg_ecommerce__products yml and run the below
7. Run everything up to the int_order_items_products model: dbt run -s +int_order_items_products
8. Change the latest version to 2
9. Run everything up to the int_order_items_products model: dbt run -s +int_order_items_products
10. We can reference the models by using "{{ ref('learning', 'stg_ecommerce_products', v='1') }}" or "{{ ref('learning', 'stg_ecommerce_products', v='2') }}"
11. An alias can be added for a version instead: change stg_ecommerce_products_v2 back to stg_ecommerce_products, add an alias to version 2, then change the latest version to 2:
```yaml
config:
  alias: stg_ecommerce_products
```