
MattTriano commented on May 27, 2024

Finding likely typos in values of a categorical column where entry was free-text

These CTEs

  • select all distinct values in a column
  • perform a cross join (Cartesian product) to produce every ordered pair of those strings
  • calculate the "distance" between the strings in each pair using the Levenshtein distance (0 means they're identical; higher values indicate more differences)
    • This calculation is computationally expensive, so I included an optional step that uses a much cheaper check to filter out pairs whose string lengths differ by more than some threshold.

And the final query selects pairs where the strings don't match exactly but whose distance is at or below some threshold value.

WITH distinct_vals AS (
  SELECT DISTINCT name_of_a_string_col 
  FROM some_table
),
all_combos AS (
  SELECT
    a.name_of_a_string_col AS col_a,
    b.name_of_a_string_col AS col_b
  FROM
    distinct_vals AS a 
      CROSS JOIN
    distinct_vals AS b
),
col_distance AS (
  -- levenshtein() is provided by the fuzzystrmatch extension
  SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
  FROM all_combos
  WHERE abs(length(col_a) - length(col_b)) <= max_length_difference      -- optional
)

SELECT *
FROM col_distance
WHERE distance > 0
AND distance <= some_threshold_difference
-- sort in the outer query; row order from inside a CTE is not guaranteed
ORDER BY distance
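
For checking the logic on a small sample outside the database, the same steps can be sketched in Python. This is only an illustration, not part of the original query: the `levenshtein` and `likely_typos` helpers, their default thresholds, and the sample values are all hypothetical. Unlike the cross join, `itertools.combinations` yields each unordered pair once, so a pair is not reported in both orders.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def likely_typos(values, max_length_difference=2, max_distance=2):
    """Return sorted (distance, a, b) tuples for distinct values that nearly match."""
    distinct_vals = sorted(set(values))
    pairs = []
    for a, b in combinations(distinct_vals, 2):  # each unordered pair once
        if abs(len(a) - len(b)) > max_length_difference:
            continue  # cheap prefilter, skips the costlier distance call
        distance = levenshtein(a, b)
        if 0 < distance <= max_distance:
            pairs.append((distance, a, b))
    return sorted(pairs)


# Hypothetical free-text entries for a categorical column
sample = ["Chicago", "Chicgo", "chicago", "Evanston", "Evanston", "Oak Park"]
for distance, a, b in likely_typos(sample):
    print(distance, a, b)
```

Both thresholds play the same role as `max_length_difference` and `some_threshold_difference` in the query: loosen them to catch more candidate typos at the cost of more false positives.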

from analytics_data_where_house.

MattTriano commented on May 27, 2024

Select records where a column contains alphabetic values

Occasionally columns that should be numeric contain letters (maybe the person entering the data typed "two"). This query selects records where that column contains letters.

SELECT *
FROM some_table
WHERE nearly_numerical_col ~* '[[:alpha:]]' IS TRUE
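
A rough Python equivalent of this filter, for spot-checking an extract before it is loaded, might look like the sketch below. The `rows_with_letters` helper, the `units` column, and the sample rows are hypothetical. Python's `re` module does not support POSIX classes such as `[[:alpha:]]`, so the sketch assumes ASCII letters and uses `[A-Za-z]` as a stand-in.

```python
import re

# ASCII-only stand-in for the POSIX class [[:alpha:]] (an assumption of this sketch)
HAS_LETTER = re.compile(r"[A-Za-z]")


def rows_with_letters(rows, column):
    """Return the rows whose value in `column` contains at least one letter."""
    return [
        row for row in rows
        if row[column] is not None and HAS_LETTER.search(str(row[column]))
    ]


# Hypothetical records with a column that should be numeric
rows = [
    {"id": 1, "units": "2"},
    {"id": 2, "units": "two"},
    {"id": 3, "units": "3.5"},
    {"id": 4, "units": "4 apt"},
    {"id": 5, "units": None},
]
flagged = rows_with_letters(rows, "units")
print([row["id"] for row in flagged])
```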

MattTriano commented on May 27, 2024

Extract one component of date-like or timestamp-like values

When you want to count records per century, year, month, day, hour, minute, second, etc., it's useful to extract that component from the date-like values and group by it.

WITH homicides AS (
  SELECT id, case_number, date, extract(hour from date) AS hour
  FROM standardized.chicago_crimes_standardized
  WHERE primary_type = 'HOMICIDE'
)

SELECT count(*), hour
FROM homicides
GROUP BY hour
ORDER BY hour
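
The same extract-then-group pattern can be sketched in Python for a quick check on a handful of values. The timestamps below are made up for illustration; `ts.hour` plays the role of `extract(hour from date)` and `Counter` plays the role of `GROUP BY`.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps standing in for the `date` column
timestamps = [
    datetime(2023, 7, 1, 23, 15),
    datetime(2023, 7, 2, 23, 40),
    datetime(2023, 7, 2, 2, 5),
    datetime(2023, 7, 3, 7, 30),
]

# Extract the hour component, then count records per hour
counts_per_hour = Counter(ts.hour for ts in timestamps)

# Print in the same column order as the query: count, then hour
for hour in sorted(counts_per_hour):
    print(counts_per_hour[hour], hour)
```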

Potential uses

This kind of check can be useful in the extremely unlikely event that you get confused about timezones. For example, I'm currently using this to correct incorrectly set time zones in some _standardized stage dbt transformation models. I know that there is a daily rhythm to homicides and shootings (at least here in Chicago) where the rate of violent crime is lowest from 6am to 9am and highest from ~11pm to 3am, and I know Chicago is UTC-6 in winter (CST) and UTC-5 in summer (CDT), and most homicides happen in summer. I also know postgres stores timestamptz values in UTC, so if timezones were set correctly and the values are viewed in UTC, I should expect to see the highest counts from (23+5) % 24 to (3+5) % 24 (or 04:00:00+00:00 to 08:00:00+00:00, in tz-aware format) in summer, and one hour later, from (23+6) % 24 to (3+6) % 24 (05:00:00+00:00 to 09:00:00+00:00), in winter.
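
The hour arithmetic can be checked with a small sketch. It assumes Chicago's offsets are UTC-6 under standard time (CST) and UTC-5 under daylight saving time (CDT); the `local_hour_to_utc` helper and the constant names are hypothetical.

```python
def local_hour_to_utc(local_hour: int, utc_offset_hours: int) -> int:
    """Convert a local hour of day to the UTC hour of day for a given UTC offset."""
    return (local_hour - utc_offset_hours) % 24


CST_OFFSET = -6  # Chicago standard time (winter)
CDT_OFFSET = -5  # Chicago daylight saving time (summer)

# Where the local 23:00 to 03:00 peak should land when hours are read in UTC
for label, offset in (("CDT (summer)", CDT_OFFSET), ("CST (winter)", CST_OFFSET)):
    start = local_hour_to_utc(23, offset)
    end = local_hour_to_utc(3, offset)
    print(f"{label}: local 23:00-03:00 is {start:02d}:00-{end:02d}:00 UTC")
```

If the hourly counts peak at the local hours instead of the shifted ones, the timestamps were probably labeled as UTC without actually being converted.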

References:

Postgres documentation for extract and date_part

MattTriano commented on May 27, 2024

Check multiple timezone commands simultaneously

If you want to quickly check that a timezone coercion did what you expect, you can do multiple tz-coercions at the same time.

WITH ts_table AS (
  SELECT crash_date AS date
  FROM data_raw.chicago_traffic_crashes   
)

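SELECT
  date AS basic_date,
  -- a value with no zone is read in the session's TimeZone, then shown as UTC wall-clock time
  date::timestamptz AT TIME ZONE 'UTC' AS date_utc,
  -- same instant, shown as Chicago wall-clock time
  date::timestamptz AT TIME ZONE 'America/Chicago' AS date_chi,
  -- UTC wall-clock time reinterpreted as Chicago local time (returns a timestamptz)
  date::timestamptz AT TIME ZONE 'UTC' AT TIME ZONE 'America/Chicago' AS date_utc_chi
FROM ts_table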
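
The side-by-side idea carries over to Python when debugging a loader. The sketch below is a loose analogue, not a translation of the query: it uses a fixed UTC-5 offset as a stand-in for `America/Chicago` in summer (an assumption that ignores daylight saving transitions), and the sample value and labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CHICAGO_CDT = timezone(timedelta(hours=-5), "CDT")  # fixed-offset stand-in, summer only

raw = datetime(2023, 7, 1, 12, 0)  # naive value, as it might arrive from the raw table

as_utc = raw.replace(tzinfo=UTC)              # read the wall-clock time as UTC
as_chicago = raw.replace(tzinfo=CHICAGO_CDT)  # read the same wall-clock time as Chicago

# Compare several interpretations at once, like the columns in the query
candidates = {
    "raw read as UTC, shown in UTC": as_utc.astimezone(UTC),
    "raw read as UTC, shown in Chicago": as_utc.astimezone(CHICAGO_CDT),
    "raw read as Chicago, shown in UTC": as_chicago.astimezone(UTC),
}
for label, value in candidates.items():
    print(f"{label:36} {value.isoformat()}")
```

Seeing the interpretations next to each other makes it easier to tell which one matches the wall-clock time you expect for a known record.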
