The performance team makes heavy use of the firefox_desktop.

➤ Denis Palmeiro commented: The query.sql files look good to m

Precompute or optimize queries against firefox_desktop.pageload about bigquery-etl HOT 10 CLOSED

data-sync-user commented on July 29, 2024

Precompute or optimize queries against firefox_desktop.pageload

from bigquery-etl.

Comments (10)

data-sync-user commented on July 29, 2024

➤ Andrew Creskey commented:

cc: Denis Palmeiro who also makes use of queries against firefox_desktop.pageload.

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Shell Escalante commented:

we don’t have a focus area. George Kaberere should we tag this in any way before moving to DENG? there wasn’t a data modelling area.

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Denis Palmeiro commented:

After using this table quite extensively recently, I think what would help us substantially is to have at least these subsets in separate tables to make lookups faster:

firefoxdesktop.pageloadnightly (nightly only)
firefox.desktop.pageload_1pct (1% of firefox_desktop.pageload)
firefoxdesktop.pageloadexperiments (pings that have a non-null value for “ping_info.experiments”)

Andrew Creskey or Bas Schouten can maybe think of some other useful subsets. I don’t use the beta population often, but maybe that could also be useful to have.

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Winnie Chan commented:

Denis Palmeiro Andrew Creskey
I have started a pull request to create the three tables: https://github.com/mozilla/bigquery-etl/pull/5359/files ( https://github.com/mozilla/bigquery-etl/pull/5359/files|smart-link )

There are some questions I hope you can help answer:

Can you take a look at the query.sql files to ensure it captures what you need?
how far back do these tables need data for?
most 1pct tables are created by filtering sample_id. However, majority of these pings do not have client_id and therefore no sample_id (that is generated by client_id). As a result, I used document_id instead. It is unclear to me if this is appropriate so it would be ideal if you can help verify this.

cc George Kaberere

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Denis Palmeiro commented:

The query.sql files look good to me.
Preferably a year.
I’m not familiar with the document_id but from https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html ( https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html|smart-link ) , it sounds like it’s pretty much random so I think generating the 1pct tables from document_id should be fine.

Thanks!

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Winnie Chan commented:

Denis Palmeiro

I have created the three views with data for April 2024 only at the moment. Could you take a look to make sure it fits your needs before I backfill the tables for more data (as you mentioned perhaps from a year ago around 2023-04-01).

moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
moz-fx-data-shared-prod.firefox_desktop.pageload_1pct
moz-fx-data-shared-prod.firefox_desktop.pageload_experiments

Note that the experiments table is still big in size (currently at 35TB with 1 month of data. With another 11 months of data it may not be that much smaller than the original table of 200TB in size). Let me know what you think.

Thanks.

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Denis Palmeiro commented:

Winnie Chan Thanks, those tables look great. Since the experiments table is still so large, let’s just get rid of it and do the other 2 instead. Thanks!

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Winnie Chan commented:

Denis Palmeiro I have backfilled the following tables from 2023-05-01. Let me know if you need more data.

moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
moz-fx-data-shared-prod.firefox_desktop.pageload_1pct

I can go ahead and delete moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

However, I wonder if the two tables above would be sufficient for your use cases in querying for experiments? The ticket included some sample redash queries (96384 ( https://sql.telemetry.mozilla.org/queries/96384 ), 92832 ( https://sql.telemetry.mozilla.org/queries/92832 )) that may look at experiments in other channels?

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Denis Palmeiro commented:

Thanks Winnie Chan!

The experiment subset is the biggest use case for us, and we are mostly just interested in performance experiments but unfortunately there is no good way to just isolate those. However, the nightly and 1% should help us when we're doing quick lookups of data.

from bigquery-etl.

data-sync-user commented on July 29, 2024

➤ Winnie Chan commented:

Thanks Denis Palmeiro .

In that case I will close this ticket and delete the experiments table moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

You can start using the new tables where applicable, particularly in any scheduled redash queries or dashboards. I will continue to monitor usage/costs of the the pageload tables for the next little while and see if there are more needs to optimize.

Feel free to reach out if you have any questions!

from bigquery-etl.

Precompute or optimize queries against firefox_desktop.pageload about bigquery-etl HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs