Comments (10)

data-sync-user commented on June 12, 2024

➤ Andrew Creskey commented:

cc: Denis Palmeiro who also makes use of queries against firefox_desktop.pageload.

from bigquery-etl.

data-sync-user commented on June 12, 2024

➤ Shell Escalante commented:

We don’t have a focus area. George Kaberere, should we tag this in any way before moving it to DENG? There wasn’t a data modelling area.

data-sync-user commented on June 12, 2024

➤ Denis Palmeiro commented:

After using this table quite extensively recently, I think what would help us substantially is to have at least these subsets in separate tables to make lookups faster:

firefox_desktop.pageload_nightly (nightly only)
firefox_desktop.pageload_1pct (1% of firefox_desktop.pageload)
firefox_desktop.pageload_experiments (pings that have a non-null value for “ping_info.experiments”)

Andrew Creskey or Bas Schouten can maybe think of some other useful subsets. I don’t use the beta population often, but maybe that could also be useful to have.

data-sync-user commented on June 12, 2024

➤ Winnie Chan commented:

Denis Palmeiro Andrew Creskey
I have started a pull request to create the three tables: https://github.com/mozilla/bigquery-etl/pull/5359/files

There are some questions I hope you can help answer:

  • Can you take a look at the query.sql files to ensure they capture what you need?
  • How far back do these tables need data?
  • Most 1pct tables are created by filtering on sample_id. However, the majority of these pings do not have a client_id and therefore have no sample_id (which is generated from client_id). As a result, I used document_id instead. It is unclear to me whether this is appropriate, so it would be ideal if you could help verify this.
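
To make the sampling question concrete, here is a minimal sketch of a deterministic 1% sample keyed on document_id. It is an illustration only, not the logic in the pull request: the function names and the choice of CRC32 with 100 buckets are assumptions, picked to mirror how a sample_id-style bucket is typically derived from an identifier.

```python
import uuid
import zlib

# Assumption: 100 buckets, so keeping a single bucket retains roughly 1% of pings.
SAMPLE_BUCKETS = 100


def sample_bucket(document_id: str) -> int:
    """Map a document_id to a stable bucket in the range [0, 100)."""
    return zlib.crc32(document_id.encode("utf-8")) % SAMPLE_BUCKETS


def in_one_percent_sample(document_id: str, bucket: int = 0) -> bool:
    """Return True if this document_id falls in the chosen bucket."""
    return sample_bucket(document_id) == bucket


if __name__ == "__main__":
    # Hypothetical data: random UUIDs stand in for real document_id values.
    ids = [str(uuid.uuid4()) for _ in range(100_000)]
    kept = sum(in_one_percent_sample(d) for d in ids)
    print(f"kept {kept} of {len(ids)} ({kept / len(ids):.2%})")
```

Because each ping has its own document_id, a sample built this way is a sample of pings rather than of clients: pings from the same client will generally land in different buckets, unlike with a client-based sample_id.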

cc George Kaberere

data-sync-user commented on June 12, 2024

➤ Denis Palmeiro commented:

  1. The query.sql files look good to me.
  2. Preferably a year.
  3. I’m not familiar with the document_id, but from https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html it sounds like it’s pretty much random, so I think generating the 1pct tables from document_id should be fine.

Thanks!

data-sync-user commented on June 12, 2024

➤ Winnie Chan commented:

Denis Palmeiro

I have created the three views with data for April 2024 only at the moment. Could you take a look to make sure they fit your needs before I backfill the tables with more data (as you mentioned, perhaps from a year ago, around 2023-04-01)?

  • moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
  • moz-fx-data-shared-prod.firefox_desktop.pageload_1pct
  • moz-fx-data-shared-prod.firefox_desktop.pageload_experiments

Note that the experiments table is still large: it is currently at 35TB with 1 month of data. With another 11 months of data, it may not be much smaller than the original table, which is 200TB in size. Let me know what you think.

Thanks.

data-sync-user commented on June 12, 2024

➤ Denis Palmeiro commented:

Winnie Chan Thanks, those tables look great. Since the experiments table is still so large, let’s just get rid of it and do the other 2 instead. Thanks!

data-sync-user commented on June 12, 2024

➤ Winnie Chan commented:

Denis Palmeiro I have backfilled the following tables from 2023-05-01. Let me know if you need more data.

  • moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
  • moz-fx-data-shared-prod.firefox_desktop.pageload_1pct

I can go ahead and delete moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

However, I wonder whether the two tables above would be sufficient for your use cases when querying for experiments. The ticket included some sample Redash queries (96384 ( https://sql.telemetry.mozilla.org/queries/96384 ), 92832 ( https://sql.telemetry.mozilla.org/queries/92832 )) that may look at experiments in other channels.

data-sync-user commented on June 12, 2024

➤ Denis Palmeiro commented:

Thanks Winnie Chan!

The experiment subset is the biggest use case for us. We are mostly just interested in performance experiments, but unfortunately there is no good way to isolate those. However, the nightly and 1% tables should help us when we're doing quick lookups of data.

data-sync-user commented on June 12, 2024

➤ Winnie Chan commented:

Thanks Denis Palmeiro.

In that case I will close this ticket and delete the experiments table moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

You can start using the new tables where applicable, particularly in any scheduled Redash queries or dashboards. I will continue to monitor usage and costs of the pageload tables for the next little while and see whether further optimization is needed.

Feel free to reach out if you have any questions!
