Finding likely typos in values of a categorical column where entry was free-text
These CTEs
- select all distinct values in a column,
- perform a cross join (Cartesian product) to produce every pairing of those strings, and
- calculate the "distance" between the strings in each pair (the Levenshtein distance: 0 means they're identical, and higher values indicate more differences).
- Because the Levenshtein calculation is computationally taxing, I included an optional step that uses a much lighter-weight calculation to filter out pairs whose difference in string length is above some threshold.
The final query then selects pairs where the strings don't match exactly but whose distance is at or below some threshold value.
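-- levenshtein() is provided by the fuzzystrmatch extension (CREATE EXTENSION fuzzystrmatch;)
-- Replace max_length_difference and some_threshold_difference with integers.
WITH distinct_vals AS (
    SELECT DISTINCT name_of_a_string_col
    FROM some_table
),
all_combos AS (
    SELECT
        a.name_of_a_string_col AS col_a,
        b.name_of_a_string_col AS col_b
    FROM
        distinct_vals AS a
    CROSS JOIN
        distinct_vals AS b
),
col_distance AS (
    SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
    FROM all_combos
    WHERE abs(length(col_a) - length(col_b)) <= max_length_difference -- optional
)
SELECT *
FROM col_distance
WHERE distance > 0
    AND distance <= some_threshold_difference
ORDER BY distance -- sort in the outer query; row order inside a CTE is not guaranteed to carry through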
from analytics_data_where_house.
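To try the approach without touching a real table, here is a minimal self-contained sketch. The place names in the VALUES list and the two thresholds (both 2) are made up for illustration, and it assumes the fuzzystrmatch extension, which provides levenshtein(), can be installed in your database. It also swaps the cross join for a join on a.val < b.val, which is a variation rather than the query above, so that each pair appears only once and identical strings are never compared.

```sql
-- Assumption: you have permission to install extensions in this database.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

WITH distinct_vals (val) AS (
    -- Hypothetical sample values standing in for SELECT DISTINCT on a real column.
    VALUES ('Chicago'), ('Chicgo'), ('Evanston'), ('Evanstan'), ('Oak Park')
),
pairs AS (
    SELECT a.val AS col_a, b.val AS col_b
    FROM distinct_vals AS a
    JOIN distinct_vals AS b
        ON a.val < b.val  -- keep each unordered pair once, skip self-pairs
    WHERE abs(length(a.val) - length(b.val)) <= 2  -- cheap pre-filter
),
col_distance AS (
    SELECT col_a, col_b, levenshtein(col_a, col_b) AS distance
    FROM pairs
)
SELECT *
FROM col_distance
WHERE distance <= 2
ORDER BY distance, col_a;
```

With this sample data the query should return only the two misspelled pairs (Chicago with Chicgo, and Evanston with Evanstan), each at a distance of 1.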
Select records where a column contains alphabetic values
Occasionally columns that should be numeric contain letters (maybe the person entering the data typed "two"). This query selects records where that column contains letters.
SELECT *
FROM some_table
WHERE nearly_numerical_col ~* '[[:alpha:]]' IS TRUE
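As a quick way to see what the pattern does and does not catch, the sketch below runs it over a handful of made-up values. The sample values and the second, stricter pattern (which flags anything that is not a plain integer or decimal) are illustrative assumptions, not part of the original query.

```sql
WITH some_table (nearly_numerical_col) AS (
    -- Hypothetical sample values, including a NULL.
    VALUES ('12'), ('two'), ('3.5'), ('7a'), ('1,000'), (NULL)
)
SELECT
    nearly_numerical_col,
    -- true when the value contains at least one letter
    nearly_numerical_col ~* '[[:alpha:]]' AS has_letters,
    -- true when the value is not an optionally signed integer or decimal
    nearly_numerical_col !~ '^-?[0-9]+(\.[0-9]+)?$' AS not_plain_number
FROM some_table;
```

Here '1,000' contains no letters, so the letter check alone would miss it, while the stricter pattern flags it. Both expressions evaluate to NULL for the NULL row, so a WHERE clause built on either one silently skips NULLs.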
Extract one component of date-like or timestamp-like values
When you want to count records per century, year, month, day, hour, minute, second, etc., it's useful to extract that component from the date-like values and group by it.
WITH homicides AS (
    SELECT id, case_number, date, extract(hour from date) AS hour
    FROM standardized.chicago_crimes_standardized
    WHERE primary_type = 'HOMICIDE'
)
SELECT count(*), hour
FROM homicides
GROUP BY hour
ORDER BY hour
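The same function works for other components. The sketch below, using two made-up timestamps, pulls several fields from each value and shows date_part as the function-call spelling of the same operation.

```sql
SELECT
    ts,
    extract(year FROM ts)  AS year,
    extract(month FROM ts) AS month,
    extract(dow FROM ts)   AS day_of_week,  -- 0 = Sunday through 6 = Saturday
    extract(hour FROM ts)  AS hour,
    date_part('hour', ts)  AS hour_via_date_part  -- same component via date_part
FROM (
    -- Hypothetical sample timestamps.
    VALUES
        (TIMESTAMP '2023-07-04 23:15:00'),
        (TIMESTAMP '2023-12-25 06:30:00')
) AS t (ts);
```

Any of these expressions can be used in place of the hour column in the GROUP BY query above to count records per year, month, or weekday instead.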
Potential uses
This kind of check can be useful in the extremely unlikely event that you get confused about time zones. For example, I'm currently using this to correct incorrectly set time zones in some _standardized stage dbt transformation models. I know that there is a daily rhythm to homicides and shootings (at least here in Chicago) where the rate of violent crime is lowest from 6am to 9am and highest from ~11pm to 3am, and I know Chicago is UTC-6 in winter (CST) and UTC-5 in summer (CDT), and most homicides happen in summer. I also know Postgres stores timestamp with time zone values in UTC, so if time zones were set correctly I should expect to see the highest counts from (23+5) % 24 to (3+5) % 24 (or 04:00:00+00:00 to 08:00:00+00:00, in tz-aware format) in summer, and one hour later, from (23+6) % 24 to (3+6) % 24 (05:00:00+00:00 to 09:00:00+00:00), in winter.
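Rather than doing the modular arithmetic by hand, you can let Postgres apply the daylight-saving rules. The sketch below takes a few made-up Chicago wall-clock times (the start and end of the late-night peak, in summer and in winter) and reports the UTC hour each one corresponds to.

```sql
SELECT
    local_ts,
    extract(hour FROM local_ts) AS local_hour,
    -- Interpret the naive timestamp as Chicago local time, then read it as UTC wall-clock time.
    extract(hour FROM (local_ts AT TIME ZONE 'America/Chicago') AT TIME ZONE 'UTC') AS utc_hour
FROM (
    -- Hypothetical sample times, not real data.
    VALUES
        (TIMESTAMP '2023-07-01 23:00:00'),  -- summer, CDT (UTC-5)
        (TIMESTAMP '2023-07-02 03:00:00'),
        (TIMESTAMP '2023-01-01 23:00:00'),  -- winter, CST (UTC-6)
        (TIMESTAMP '2023-01-02 03:00:00')
) AS t (local_ts);
```

The summer rows come out as UTC hours 4 and 8, and the winter rows as 5 and 9, which gives the window in which the peak should appear if the stored values really are UTC.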
References: Postgres documentation for extract and date_part
Check multiple timezone commands simultaneously
If you want to quickly check that a timezone coercion did what you expect, you can do multiple tz-coercions at the same time.
WITH ts_table AS (
    SELECT crash_date AS date
    FROM data_raw.chicago_traffic_crashes
)
SELECT
    date AS basic_date,
    date::timestamptz AT TIME ZONE 'UTC' AS date_utc,
    date::timestamptz AT TIME ZONE 'America/Chicago' AS date_chi,
    date::timestamptz AT TIME ZONE 'UTC' AT TIME ZONE 'America/Chicago' AS date_utc_chi
FROM ts_table
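When reading the output of a query like this, it helps to remember that AT TIME ZONE flips the type: applied to a timestamp without time zone it returns a timestamp with time zone, and applied to a timestamp with time zone it returns one without. A plain cast to timestamptz interprets the value in the session's TimeZone setting. The sketch below, on a single made-up timestamp, prints the types alongside the values so each step is visible.

```sql
WITH ts_table (ts) AS (
    -- Hypothetical naive (zone-less) timestamp.
    VALUES (TIMESTAMP '2023-07-01 12:00:00')
)
SELECT
    current_setting('TimeZone') AS session_tz,
    ts AS naive_ts,
    -- The cast interprets ts in the session time zone shown in session_tz.
    ts::timestamptz AS cast_in_session_tz,
    -- Interprets ts as Chicago local time and yields a timestamptz.
    ts AT TIME ZONE 'America/Chicago' AS chicago_instant,
    pg_typeof(ts AT TIME ZONE 'America/Chicago') AS chicago_instant_type,
    -- Converts that instant back to a zone-less timestamp showing UTC wall-clock time.
    (ts AT TIME ZONE 'America/Chicago') AT TIME ZONE 'UTC' AS utc_wall_clock,
    pg_typeof((ts AT TIME ZONE 'America/Chicago') AT TIME ZONE 'UTC') AS utc_wall_clock_type
FROM ts_table;
```

For this input utc_wall_clock is 2023-07-01 17:00:00 whatever the session time zone is, whereas cast_in_session_tz changes with the session setting, which is usually the source of surprises.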