
Comments (6)

jklukas commented on May 20, 2024

Succeeded in 37 minutes!

from bigquery-etl.

jklukas commented on May 20, 2024

I'm currently testing a process where we first produce a list of document_ids with the number of occurrences, then we select all the records where occurrences = 1 into the stable table, then we do the window query on just the rows where occurrences > 1. The vast majority of rows have no duplicates in the live table, so this may make the window function tenable. It may also be possible to express this entire operation as a single query without the need for a temp table.
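
As a minimal sketch of that three-step split (not the actual bigquery-etl job), the following runs the same idea against a toy table using Python's built-in sqlite3 module. The table names `live`, `stable`, and `docid_counts`, the `payload` column, and the `copy_deduplicate` function are all hypothetical, and SQLite 3.25 or later is assumed for window function support.

```python
import sqlite3


def copy_deduplicate(conn):
    """Copy rows from `live` to `stable`, keeping one row per document_id."""
    # Step 1: count occurrences of each document_id into a temp table.
    conn.execute("""
        CREATE TEMP TABLE docid_counts AS
        SELECT document_id, COUNT(*) AS occurrences
        FROM live
        GROUP BY document_id
    """)
    # Step 2: rows with a unique document_id go straight to the stable table.
    conn.execute("""
        INSERT INTO stable
        SELECT live.*
        FROM live JOIN docid_counts USING (document_id)
        WHERE occurrences = 1
    """)
    # Step 3: run the window function only over the duplicated document_ids,
    # keeping the earliest row by submission_timestamp.
    conn.execute("""
        INSERT INTO stable
        SELECT document_id, submission_timestamp, payload
        FROM (
            SELECT live.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY document_id
                       ORDER BY submission_timestamp
                   ) AS _n
            FROM live JOIN docid_counts USING (document_id)
            WHERE occurrences > 1
        )
        WHERE _n = 1
    """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE live (document_id TEXT, submission_timestamp TEXT, payload TEXT);
    CREATE TABLE stable (document_id TEXT, submission_timestamp TEXT, payload TEXT);
""")
# Toy data: document "b" was submitted three times, "a" and "c" once each.
conn.executemany("INSERT INTO live VALUES (?, ?, ?)", [
    ("a", "2019-08-22T00:00:01", "a-first"),
    ("b", "2019-08-22T00:00:02", "b-first"),
    ("b", "2019-08-22T00:00:05", "b-second"),
    ("c", "2019-08-22T00:00:03", "c-first"),
    ("b", "2019-08-22T00:00:09", "b-third"),
])
copy_deduplicate(conn)
rows = conn.execute(
    "SELECT document_id, payload FROM stable ORDER BY document_id"
).fetchall()
print(rows)
```

The window function in step 3 only sorts the rows for document "b"; the singleton rows never enter it, which is the property this approach relies on when duplicates are rare.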

The first attempt failed, so I'm now going to try breaking the query into pieces.

jklukas commented on May 20, 2024

On the bright side, the other copy_deduplicate job (which handles populating all the stable tables besides main) finished in just under 10 minutes, so that looks to be working well.

jklukas commented on May 20, 2024

I was able to successfully create a deduped version of main_v4 via a query broken into three parts; the process took just over 30 minutes of runtime. I've now unified those parts together into a single query and am waiting to see if that completes:

CREATE TABLE
  tmp.klukas_main_deduped2
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
WITH
  base AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.telemetry_live.main_v4`
  WHERE
    DATE(submission_timestamp) = '2019-08-22' ),
  -- document_ids that appear more than once in the day's partition
  duped_docids AS (
  SELECT
    document_id,
    COUNT(document_id) AS occurrences
  FROM
    base
  GROUP BY
    document_id
  HAVING
    occurrences > 1),
  -- Rows whose document_id is unique; these bypass the window function
  nonduped AS (
  SELECT
    base.*
  FROM
    base
  LEFT JOIN
    duped_docids
  USING
    (document_id)
  WHERE
    duped_docids.document_id IS NULL),
  -- Number the rows of each duplicated document_id by submission_timestamp
  numbered_duplicates AS (
  SELECT
    base.*,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    base
  JOIN
    duped_docids
  USING
    (document_id) ),
  -- Keep only the earliest row for each duplicated document_id
  deduped AS (
  SELECT
    * EXCEPT (_n)
  FROM
    numbered_duplicates
  WHERE
    _n = 1 )
SELECT
  *
FROM
  nonduped
UNION ALL
SELECT
  *
FROM
  deduped

jklukas commented on May 20, 2024

The above single query succeeded in 42 minutes. I'm going to PR this change to bigquery-etl.

jklukas commented on May 20, 2024

The new Docker image has been built and published, so I kicked off the Airflow job to run again. We should see that succeed in ~40 minutes.
