Comments (6)
Succeeded in 37 minutes!
from bigquery-etl.
I'm currently testing a process where we first produce a list of document_ids with the number of occurrences, then we select all the records where occurrences = 1 into the stable table, then we do the window query on just the rows where occurrences > 1. The vast majority of rows have no duplicates in the live table, so this may make the window function tenable. It may also be possible to express this entire operation as a single query without the need for a temp table.
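To make the count-then-window idea concrete, here is a minimal in-memory sketch in Python. It is not the bigquery-etl implementation: the function name `dedupe_by_document_id` and the sample rows are hypothetical, and it assumes the earliest `submission_timestamp` is the row to keep for each duplicated `document_id`.

```python
from collections import Counter


def dedupe_by_document_id(rows):
    """Return one row per document_id (hypothetical helper, for illustration)."""
    # Pass 1: count how many times each document_id occurs.
    occurrences = Counter(row["document_id"] for row in rows)

    # Rows whose document_id occurs exactly once are copied through untouched.
    nonduped = [row for row in rows if occurrences[row["document_id"]] == 1]

    # Pass 2: only rows with a duplicated document_id go through the
    # ordering step, which plays the role of the window query.
    earliest = {}
    for row in rows:
        doc_id = row["document_id"]
        if occurrences[doc_id] > 1:
            kept = earliest.get(doc_id)
            if kept is None or row["submission_timestamp"] < kept["submission_timestamp"]:
                earliest[doc_id] = row

    return nonduped + list(earliest.values())


# Made-up sample data: document_id "b" is submitted twice.
rows = [
    {"document_id": "a", "submission_timestamp": "2019-08-22T00:00:01"},
    {"document_id": "b", "submission_timestamp": "2019-08-22T00:00:05"},
    {"document_id": "b", "submission_timestamp": "2019-08-22T00:00:02"},
    {"document_id": "c", "submission_timestamp": "2019-08-22T00:00:03"},
]
deduped = dedupe_by_document_id(rows)
```

Because most ids occur once, the second pass touches only a small fraction of the rows, which is the reason the split might make the window function affordable.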
First attempt failed, and now I'm going to try breaking into pieces.
from bigquery-etl.
On the bright side, the other copy_deduplicate job (which handles populating all the stable tables besides main) finished in just under 10 minutes, so that looks to be working well.
from bigquery-etl.
I was able to successfully create a deduped version of main_v4 via a query broken into three parts; the process took just over 30 minutes of runtime. I've now unified those parts together into a single query and am waiting to see if that completes:
CREATE TABLE
  tmp.klukas_main_deduped2
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
WITH
  -- All rows from the live table for the target day.
  base AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' ),
  -- document_ids that occur more than once in base.
  duped_docids AS (
  SELECT
    document_id,
    COUNT(document_id) AS occurrences
  FROM
    base
  GROUP BY
    document_id
  HAVING
    occurrences > 1),
  -- Rows whose document_id is not duplicated (anti-join); no window needed.
  nonduped AS (
  SELECT
    base.*
  FROM
    base
  LEFT JOIN
    duped_docids
  USING
    (document_id)
  WHERE
    duped_docids.document_id IS NULL),
  -- Number only the duplicated rows, earliest submission first.
  numbered_duplicates AS (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    base
  JOIN
    duped_docids
  USING
    (document_id) ),
  -- Keep the first row for each duplicated document_id.
  deduped AS (
  SELECT
    * EXCEPT (_n)
  FROM
    numbered_duplicates
  WHERE
    _n = 1 )
SELECT
  *
FROM
  nonduped
UNION ALL
SELECT
  *
FROM
  deduped
from bigquery-etl.
The above single query succeeded in 42 minutes. I'm going to PR this change to bigquery-etl.
from bigquery-etl.
The new Docker image has been built and published, so I kicked off the Airflow job to run again. We should see that succeed in ~40 minutes.
from bigquery-etl.