This issue will act as a notebook for useful (postgre)SQL snippets that aid in cleanin

Notes on postgres commands or recipes for efficiently cleaning column values about analytics_data_where_house HOT 4 OPEN

MattTriano commented on May 27, 2024

Notes on postgres commands or recipes for efficiently cleaning column values

from analytics_data_where_house.

Comments (4)

MattTriano commented on May 27, 2024

Finding likely typos in values of a categorical column where entry was free-text

These CTEs

select all distinct values in a column
perform a cross join (or cartesian product) to produce all distinct combinations of those strings
calculates the "distance" between those strings for each combination (by calculating the levenshtein distance, 0 means they're identical, higher values indicate more differences)
- This operation is computationally taxing, I included an optional step that uses a much lighter-weight calculation to filter out combinations where the difference in string lengths is above some threshold)

And the final query selects combinations where the strings don't match exactly but are below some threshold value.

WITH distinct_vals AS (
  SELECT DISTINCT name_of_a_string_col 
  FROM some_table
),
all_combos AS (
  SELECT
    a.name_of_a_string_col AS col_a,
    b.name_of_a_string_col AS col_b
  FROM
    distinct_vals AS a 
      CROSS JOIN
    distinct_vals AS b
),
col_distance AS (
  SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
  FROM all_combos
  WHERE abs(length(col_a) - length(col_b)) <= max_length_difference      -- optional  
  ORDER BY distance
)

SELECT *
FROM col_distance
WHERE distance > 0
AND distance <= some_threshold_difference

from analytics_data_where_house.

MattTriano commented on May 27, 2024

Select records where a column contains alphabetic values

Occasionally columns that should be numeric contain letters (maybe the person entering the data typed "two"). This query selects records where that column contains letters.

SELECT *
FROM some_table
WHERE nearly_numerical_col ~* '[[:alpha:]]' IS TRUE

from analytics_data_where_house.

MattTriano commented on May 27, 2024

Extract one component of date-like or timestamp-like values

In the case that one wants to count records per {century, year, month, day, hour, minute, second, etc}, it's useful to extract that grouping timespan from the date-like values.

WITH homicides AS (
  SELECT id, case_number, date, extract(hour from date) AS hour
  FROM standardized.chicago_crimes_standardized
  WHERE primary_type = 'HOMICIDE'
)

SELECT count(*), hour
FROM homicides
GROUP BY hour
ORDER BY hour

Potential uses

This kind of check can be useful in the extremely unlikely () event that you get confused about timezones. For example, I'm currently using this to correct incorrectly set time-zones in some _standardized stage dbt transformation models. I know that there is a daily rhythm to homicides and shootings (at least here in Chicago) where the rate of violent crime is lowest from 6am to 9am and highest from ~11pm to 3am, and I know Chicago is UTC-5 in winter and UTC-6 in summer (and most homicides happen in summer). I also know postgres stores timestamp-like data in UTC-0, so I should expect to see the highest counts from (23+6) % 24 to (3+6) % 24 (or 5:00:00+00:00 to 9:00:00+00:00, in tz-aware format) if timezones were set correctly.

References:

Postgres documentation for extract and date_part

from analytics_data_where_house.

MattTriano commented on May 27, 2024

Check multiple timezone commands simultaneously

If you want to quickly check that a timezone coercion did what you expect, you can do multiple tz-coercions at the same time.

WITH ts_table AS (
  SELECT crash_date AS date
  FROM data_raw.chicago_traffic_crashes   
)

SELECT
  date AS basic_date,
  date::timestamptz AT TIME ZONE 'UTC' AS date_utc,
  date::timestamptz AT TIME ZONE 'America/Chicago' AS date_chi,
  date::timestamptz AT TIME ZONE 'UTC' AT TIME ZONE 'America/Chicago' AS date_utc_chi
FROM ts_table

from analytics_data_where_house.

Notes on postgres commands or recipes for efficiently cleaning column values about analytics_data_where_house HOT 4 OPEN

Comments (4)

Finding likely typos in values of a categorical column where entry was free-text

Select records where a column contains alphabetic values

Extract one component of date-like or timestamp-like values

Potential uses

References:

Check multiple timezone commands simultaneously

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs