mozilla / bigquery-etl

Bigquery ETL

Home Page: https://mozilla.github.io/bigquery-etl

License: Mozilla Public License 2.0

Python 93.62% Dockerfile 0.10% Shell 3.05% JavaScript 0.96% HTML 0.12% Jinja 2.15%

bigquery-etl's Introduction


BigQuery ETL

This repository contains the Mozilla Data Team's:

  • Derived ETL jobs that do not require a custom container
  • User-defined functions (UDFs)
  • Airflow DAGs for scheduled bigquery-etl queries
  • Tools for query & UDF deployment, management and scheduling

For more information, see https://mozilla.github.io/bigquery-etl/

Quick Start

Pre-requisites

  • Pyenv (optional) - Recommended if you want to install multiple versions of Python; see the instructions here. After installing pyenv, make sure that your terminal app is configured to run the shell as a login shell.
  • Homebrew (not required, but useful for Mac) - Follow the instructions here to install Homebrew on your Mac.
  • Python 3.11+ - (see this guide for instructions if you're on a Mac and haven't installed anything other than the default system Python).

GCP CLI tools

  • For Mozilla Employees (not in Data Engineering) - Set up GCP command line tools, as described on docs.telemetry.mozilla.org. Note that some functionality (e.g. writing UDFs or backfilling queries) may not be allowed. Run gcloud auth login --update-adc to authenticate against GCP.
  • For Data Engineering - In addition to setting up the command line tools, you will want to log in to shared-prod if making changes to production systems. Run gcloud auth login --update-adc --project=moz-fx-data-shared-prod (if you have not run it previously).

Installing bqetl

  1. Clone the repository
git clone git@github.com:mozilla/bigquery-etl.git
cd bigquery-etl
  2. Install the bqetl command line tool
./bqetl bootstrap
  3. Install standard pre-commit hooks
venv/bin/pre-commit install

Finally, if you are using Visual Studio Code, you may also wish to use our recommended defaults:

cp .vscode/settings.json.default .vscode/settings.json
cp .vscode/launch.json.default .vscode/launch.json

And you should now be set up to start working in the repo! The easiest way to do many tasks is to use bqetl. You may also want to read up on common workflows.

bigquery-etl's People

Contributors

acmiyaguchi, akkomar, alekhyamoz, anich, badboy, benwu, chelseybeck, curtismorales, dependabot-preview[bot], dependabot[bot], edugfilho, fbertsch, gleonard-m, iinh, jklukas, kik-kik, kwindau, lelilia, lucia-vargas-a, m-d-bowerman, marlene-m-hirose, ncloudioj, quiiver, rebecca-burwei, relud, scholtzan, sean-rose, whd, wlach, wwyc


bigquery-etl's Issues

Ensure we can support cross-project queries in Airflow

As discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1563742, we will be starting to migrate all derived tables into the shared-prod project, so there will be an interim period where we will have tables split between derived-datasets:telemetry and shared-prod:telemetry_raw. As we migrate, we will be providing views in derived-datasets:telemetry and shared-prod:telemetry to ensure a consistent and uninterrupted interface for users.

During the interim, we will have some queries in Airflow writing to derived-datasets and some to shared-prod, and some may need to read from a different project than the one they write to. We should check that we understand how to express this in our Airflow DAGs and that our entrypoint script is able to support these scenarios.

Script for creating default views

As part of the accepted BQ Layout and Table Structure Proposal, we will be hiding tables from users in the telemetry_raw and other <namespace>_raw datasets in shared-prod, and providing user-facing views in the telemetry dataset.

It is becoming difficult to manage this by hand, so we need to start automating it. A first step could be a script in this repository that lists tables in the shared-prod *_raw datasets, and generates simple CREATE VIEW namespace.foo AS SELECT * FROM namespace_raw.foo_<highest_version_number> files in the templates directory wherever they are missing. We could then run this offline, prune and adjust the contents by hand, and commit the results.

Depends on #189
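A minimal sketch of what such a script might look like, using the google.cloud.bigquery client; the project name, templates layout, and versioned-table naming convention are assumptions:

"""Sketch: generate missing default view definitions for *_raw datasets."""
import os
import re

from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"  # assumption: target project
TEMPLATES_DIR = "templates"          # assumption: output directory

client = bigquery.Client(project=PROJECT)

for dataset in client.list_datasets(PROJECT):
    raw_name = dataset.dataset_id
    if not raw_name.endswith("_raw"):
        continue
    namespace = raw_name[: -len("_raw")]
    # Keep only the highest version of each versioned table, e.g. foo_v2 over foo_v1.
    latest = {}
    for table in client.list_tables(f"{PROJECT}.{raw_name}"):
        match = re.match(r"(?P<base>.*)_v(?P<version>\d+)$", table.table_id)
        if not match:
            continue
        base, version = match.group("base"), int(match.group("version"))
        if version > latest.get(base, (0, None))[0]:
            latest[base] = (version, table.table_id)
    for base, (version, table_id) in latest.items():
        path = os.path.join(TEMPLATES_DIR, namespace, f"{base}.sql")
        if os.path.exists(path):
            continue  # only generate files that are missing
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(
                f"CREATE VIEW `{PROJECT}.{namespace}.{base}` AS\n"
                f"SELECT * FROM `{PROJECT}.{raw_name}.{table_id}`\n"
            )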

Move queries into named directories

We will move each query into its appropriately named subdirectory, e.g. sql/search/search_clients_daily_v7.sql will move to sql/search/search_clients_daily_v7/query.sql.

The Airflow runner will have to be notified of this change as well.

Optional: Move view creation scripts to a view.sql file instead of query.sql.

We will also add a metadata.yaml file to every query subdirectory. There are currently only a few query writers, so we should be able to appropriately fill in owners, friendly_name, and description; optionally, some of the job descriptions can be put in labels, e.g. application, refresh, version, incremental.

For more details, see the public data proposal.

Script for running all view definitions

We schedule queries individually in DAGs for tables we want to populate daily, but we don't have any automation yet for running views defined in this repository.

We should likely have a job in Airflow that scans through this repo for all queries that define views and runs them, so that we don't get drift between what's defined in this repo and the views that actually live in BQ.

Also, if the underlying tables have their schemas evolve, such as adding fields, those fields won't show up in the BQ console when you inspect the schema of the view, since the view schema is determined at creation time. Running a new "CREATE OR REPLACE VIEW" statement should get the updated schema applied for the view.

So, the result of this issue should be a script in this repository that scans through sql/ to find files that contain CREATE OR REPLACE VIEW statements and runs them.
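A minimal sketch, assuming views are stored as standalone CREATE OR REPLACE VIEW statements in .sql files under sql/ (the project name is an assumption):

"""Sketch: find and run all CREATE OR REPLACE VIEW statements under sql/."""
import glob

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")  # assumption

for path in sorted(glob.glob("sql/**/*.sql", recursive=True)):
    with open(path) as f:
        sql = f.read()
    if "CREATE OR REPLACE VIEW" not in sql:
        continue
    print(f"Publishing view from {path}")
    # Views are cheap to recreate, so we simply re-run the full statement;
    # this also refreshes the view's schema if underlying tables changed.
    client.query(sql).result()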

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Run python linters on python scripts

I thought this would be as easy as adding specific files to a list somewhere, but pytest is refusing to load files that don't end in a .py extension.

Create more usable FxA content and auth server datasets, or fix current ones

Pulling from the fxa content server logs, here's an example of a query that would give us a usable content server dataset for trailhead. There's a similar query for the auth server that I could make.

Basically, we need easily queryable datasets that contain non-null values from jsonPayload.fields.user_properties and jsonPayload.fields.event_properties.

The current jobs that create fxa_content_events_v1 and fxa_auth_events_v1 only have null values for the fields contained in jsonPayload.fields.user_properties and jsonPayload.fields.event_properties.

So, we could either fix those jobs or create new ones based on queries like that above.

@jklukas opinions?

(you may need to refresh the link to the query above, I forgot to save it)

Investigate hash functions for id_bucket

We are currently using the following calculation for assigning records an id_bucket, which enables confidence interval calculations:

MOD(ABS(FARM_FINGERPRINT(user_id)), 20) AS id_bucket

That is the simplest and likely most computationally efficient solution using built-in BigQuery functions. But we have not yet rigorously evaluated it for potential sources of bias:

  • Fingerprint64 does not have as robust an avalanche property as cryptographic functions such as MD5 (see #33 for an MD5-based implementation)
  • There may be some bias introduced by ABS

For further discussion, see #27, https://github.com/mozilla/bigquery-etl/pull/32/files#r268184403, and #33.
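For comparison, here is a hedged Python sketch of an MD5-based alternative (in the spirit of #33), checked off-line against a simulated population; it does not reproduce FARM_FINGERPRINT itself, and the bucket count and sample size are arbitrary:

"""Sketch: check bucket uniformity for an MD5-based id_bucket assignment."""
import hashlib
import uuid
from collections import Counter

N_BUCKETS = 20

def md5_bucket(user_id: str) -> int:
    # MD5 has stronger avalanche behaviour; take the digest as a big integer mod 20.
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % N_BUCKETS

# Simulate a population of ids and check that buckets come out roughly uniform.
counts = Counter(md5_bucket(str(uuid.uuid4())) for _ in range(200_000))
expected = 200_000 / N_BUCKETS
worst = max(abs(c - expected) / expected for c in counts.values())
print(f"max relative deviation from uniform: {worst:.3%}")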

Script for populating raw deduplicated tables from live tables

As discussed in the BigQuery Table Layout and Structure Proposal, we will have the GCP pipeline populate "live" tables clustered on submission_timestamp, then rely on Airflow to run a nightly job to populate "raw" tables clustered on sample_id.

That likely will look like an additional mode in this repo's entrypoint script that will invoke a query like the following:

WITH
  srctable AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.${document_namespace}_live.${document_type}_v${document_version}` 
  WHERE
    DATE(submission_timestamp) = @submission_date ),
  --
  numbered_duplicates AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    srctable )
  --
SELECT
  * EXCEPT (_n)
FROM
  numbered_duplicates
WHERE
  _n = 1

with output going to destination table moz-fx-data-shared-prod.${document_namespace}_raw.${document_type}_v${document_version}$ds_nodash.

We will need to have two different modes. In one, we run the above query for only a specific table or set of tables. We'll need to add that at the root of the main_summary DAG, for example, to get live main pings into the deduplicated raw table before running main_summary and all the downstream jobs.

In the other mode, we run the above query for all tables in _live datasets that are not already run as part of other DAGs. We probably need to pass in a list of tables to exclude, and keep that in sync with the tables that are handled in other DAGs.
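A rough sketch of how the two modes might look in Python; the table names, the exclusion list, and the use of a partition decorator on the destination are illustrative assumptions, not the final entrypoint design:

"""Sketch: run the dedup query per live table, with an optional exclusion list."""
from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"
client = bigquery.Client(project=PROJECT)

DEDUP_SQL = """
SELECT * EXCEPT (_n)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM `{live_table}`
  WHERE DATE(submission_timestamp) = @submission_date )
WHERE _n = 1
"""

def copy_deduplicate(live_table, raw_table, submission_date, ds_nodash):
    """Copy one day of a _live table into the matching _raw date partition."""
    job_config = bigquery.QueryJobConfig(
        # Assumption: the $YYYYMMDD partition decorator is accepted here, as with bq.
        destination=f"{raw_table}${ds_nodash}",
        write_disposition="WRITE_TRUNCATE",
        query_parameters=[
            bigquery.ScalarQueryParameter("submission_date", "DATE", submission_date)
        ],
    )
    client.query(DEDUP_SQL.format(live_table=live_table), job_config=job_config).result()

# Mode 1: specific tables, e.g. main ping ahead of main_summary.
copy_deduplicate(
    f"{PROJECT}.telemetry_live.main_v4",
    f"{PROJECT}.telemetry_raw.main_v4",
    "2019-08-22",
    "20190822",
)

# Mode 2: every table in *_live datasets not already handled by another DAG.
EXCLUDE = {"telemetry_live.main_v4"}  # assumption: kept in sync with other DAGs
for dataset in client.list_datasets(PROJECT):
    if not dataset.dataset_id.endswith("_live"):
        continue
    for item in client.list_tables(f"{PROJECT}.{dataset.dataset_id}"):
        qualified = f"{dataset.dataset_id}.{item.table_id}"
        if qualified in EXCLUDE:
            continue
        raw_dataset = dataset.dataset_id.replace("_live", "_raw")
        copy_deduplicate(
            f"{PROJECT}.{qualified}",
            f"{PROJECT}.{raw_dataset}.{item.table_id}",
            "2019-08-22",
            "20190822",
        )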

Generate docs for all UDFs

We should have a reference page for all the persistent UDFs, and we're much more likely to keep the docs consistent if the documentation lives in docstrings with the code. We'll need to add a build step that auto-generates docs and submits a PR to DTMO if there are changes.

Remove UDF declarations from queries

Follow-up to #141

  • Extend the code in tests/parse_udfs.py to also read through the files under sql/ and identify usage of udfs there
  • Write a python script to use the above udf dependency resolution logic, writing out generated sql files to a new directory (perhaps we call this target/sql to match Maven's conventions); each file would contain the body of all temporary UDFs needed in the query, and then the query text itself
  • Remove UDF declarations from the source files under sql/ and update the logic for our CircleCI deploy job to make sure that the published container has a sql/ directory that contains the generated queries with UDF definitions
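A rough sketch of the dependency-resolution and generation step, assuming each file udf/udf_foo.sql defines exactly one temporary UDF named after the file:

"""Sketch: inject temporary UDF definitions into generated queries under target/sql/."""
import glob
import os
import re

UDF_RE = re.compile(r"\budf_[a-zA-Z0-9_]+\b")

# Load every UDF definition; assume one definition per file named after the UDF.
udfs = {}
for path in glob.glob("udf/udf_*.sql"):
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        udfs[name] = f.read()

def deps(sql, resolved=None, in_progress=None):
    """Collect the UDFs a piece of SQL depends on, dependencies before dependents."""
    resolved = [] if resolved is None else resolved
    in_progress = set() if in_progress is None else in_progress
    for name in UDF_RE.findall(sql):
        if name in udfs and name not in resolved and name not in in_progress:
            in_progress.add(name)
            deps(udfs[name], resolved, in_progress)  # a UDF body may use other UDFs
            resolved.append(name)
    return resolved

for query_path in glob.glob("sql/**/*.sql", recursive=True):
    with open(query_path) as f:
        query = f.read()
    needed = deps(query)
    out_path = os.path.join("target", query_path)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as f:
        # Prepend the temporary UDF bodies, then the original query text.
        f.write("\n".join(udfs[name] for name in needed) + "\n" + query)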

Add script to automate building incremental tables

This script would be used for both creating and backfilling tables. It should eventually grow a feature to automatically detect the difference between a non-recursive table (clients_daily) and a recursive one (clients_last_seen).

Support running in alternate projects

A request we've seen is to use bigquery-etl and its tooling (alongside Airflow) for other projects. This issue will cover how we can tackle that.

To start, we can move all datasets to their own project's directory; e.g. telemetry will move to moz-fx-data-shared-prod/telemetry.

We will add codeowners to each project, such that the Data Engineering team doesn't need to review queries that are in other projects (unless they own them).

We will need to enable the BQ command in the GKE cluster to use alternate projects. Some options might be:

  • Give GKE access to all projects
  • Run the workload inside the associated project's GKE cluster, which Airflow needs access to

cc @jklukas

Harness for deploying and testing (dependent) UDFs

I have previously worked with https://github.com/PeriscopeData/redshift-udfs which provides a structure for defining python UDFs for Redshift along with tests, and a harness for deploying the functions and running tests.

It would be nice to have something similar for this repo. For the BigQuery case, this is made more complicated by the fact that persistent UDFs are not yet generally available.

Here's a potential way forward:

  • Change udf definition files under udf/ to use syntax for creating persistent udfs
  • Write a small python package for parsing through the udf files to determine dependencies (find usages of udf_*) and build a DAG
  • Write a small python package for executing the udf definition files in the order they appear in the DAG
  • Write a test harness that can run creation of the persistent udfs as part of running tests in a generated temporary dataset, then runs tests defined similarly to how we have tests defined for tables
  • Write a small python lib for parsing the files under sql/ and identifying usage of udfs there; we would then inject temporary udf definitions and output a generated directory; this allows us to more reliably use udfs in our production etl queries without having to duplicate them directly into the source files

The above seems like a significant chunk of work, so likely not an immediate priority.

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Add UDF for parsing mobile events

It looks like focus events aren't nested by process the way desktop events are in main ping. We should have a separate UDF for handling that case. Example:

CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  
  RETURNS ARRAY<STRUCT<
  event_timestamp INT64,
  event_category STRING,
  event_object STRING,
  event_method STRING,
  event_string_value STRING,
  event_map_values ARRAY<STRUCT<key STRING, value STRING>>
  >>
  LANGUAGE js AS """
    if (input == null) {
      return null;
    }
    var parsed = JSON.parse(input);
    var result = [];
    parsed.forEach(event => {
        var structured = {
          "event_timestamp": event[0],
          "event_category": event[1],
          "event_method": event[2],
          "event_object": event[3],
          "event_string_value": event[4],
          "event_map_values": []
        }
        for (var key in event[5]) {
          structured.event_map_values.push({"key": key, "value": event[5][key]})
        }
        result.push(structured)
    });
    return result;
""";


SELECT 
  client_id, event
FROM
  `moz-fx-data-shar-nonprod-efed.telemetry.focus_event`
CROSS JOIN
  UNNEST(udf_js_json_extract_events(JSON_EXTRACT(additional_properties, '$.events'))) AS event
WHERE
  DATE(submission_timestamp) = '2019-08-03'
LIMIT 100

See https://bugzilla.mozilla.org/show_bug.cgi?id=1570264

Make generate_incremental_table less platform-dependent

I attempted to use generate_incremental_table, but kept encountering errors with xargs and date options, since I don't have GNU coreutils installed on my Mac.

It may be worth rewriting this as a Python script that handles the generation of dates using Python's stdlib, and then eventually spawns off invocations of the existing entrypoint script.
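A minimal sketch of the date handling in Python; the entrypoint invocation and its flags are illustrative assumptions:

"""Sketch: generate the date range in Python instead of relying on GNU date/xargs."""
import subprocess
import sys
from datetime import date, timedelta

def date_range(start: date, end: date):
    """Yield each date from start to end inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

start, end = date.fromisoformat(sys.argv[1]), date.fromisoformat(sys.argv[2])
for day in date_range(start, end):
    # Hand each day off to the existing entrypoint script (flags are illustrative).
    subprocess.run(
        ["bin/entrypoint", "--parameter", f"submission_date:DATE:{day.isoformat()}"],
        check=True,
    )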

Add test for up to date udfs in queries

For each function in udf/, we should test that, if it is used in a query in sql/, the function is present at the top of the query and actually matches the code provided in udf/.
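A hedged sketch of such a test using pytest, assuming each file udf/udf_NAME.sql contains the full definition of udf_NAME:

"""Sketch: pytest check that UDF definitions embedded in queries match udf/."""
import glob
import os

import pytest

UDF_FILES = glob.glob("udf/udf_*.sql")
SQL_FILES = glob.glob("sql/**/*.sql", recursive=True)

@pytest.mark.parametrize("udf_path", UDF_FILES)
def test_udf_in_queries_is_current(udf_path):
    name = os.path.splitext(os.path.basename(udf_path))[0]
    with open(udf_path) as f:
        definition = f.read().strip()
    for sql_path in SQL_FILES:
        with open(sql_path) as f:
            sql = f.read()
        if name in sql:
            # Any query that uses the UDF must embed the exact current definition.
            assert definition in sql, f"{sql_path} has a stale copy of {name}"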

Filter on fxa_content_auth_events_v1 causes some events to be dropped

Here's an example of an event that makes it to amplitude but is filtered out by the current content server ETL job.

(querying for its flow_id in fxa_content_events_v1 shows it's missing, but it's there in the FxA logs)

I think this is because of this line in the etl job:
https://github.com/mozilla/bigquery-etl/blob/master/sql/fxa_content_events_v1.sql#L14

This causes many form views of relier-hosted forms to be filtered out, for example form views on about:welcome that are logged via a call to the FxA metrics endpoint. Without these view events it's not possible to construct an end-to-end funnel that begins on the relier's site.

Based on querying for null user ids in the FxA logs, I think removing this filter would add another 3M events per day to the derived datasets.

These are almost all this fxa_email_first - view event. Compare GCP with amplitude.

I do think we need these events for some analyses, but I'm guessing it would be costly to backfill them, so maybe we can start by just removing the filter going forward? Or would that break something else that I'm not aware of? I think we would at least have to make user_id nullable (if it's not already).

The filter can stay on the auth server ETL, there will never be null user_ids there.

[Proposal] Maintain arrays of active days in clients_last_seen

@relud just finished figuring out a refactor of date_last_seen to days_since_seen, but I'm going to propose another refactor to accommodate days-per-week calculations needed by @jmccrosky for the Smoot project.

For a user who was active in a given 7-day period (equivalent to WAU), we can measure intensity of their usage by counting the number of active days in that same period. In order to do so, though, we need to know not just the last day they were active, but rather maintain a history of which days were active in the previous 7.

I propose we generalize this concept such that instead of maintaining activity history as a single integer days_since_seen, we maintain a days_seen ARRAY<BOOL> where the first entry in the array represents whether that user was active in the day of measurement, the second entry indicates whether they were active on the previous day, etc.

We can recover both days_since_seen and days_seen_last7 from days_seen. Here's an example of those calculations on some dummy data:

WITH
  raw AS (
    SELECT
      [FALSE, FALSE, TRUE, FALSE, TRUE, TRUE,
       FALSE, FALSE, TRUE, FALSE, TRUE, FALSE] AS days_seen )
SELECT
  (SELECT FIRST_VALUE(OFFSET) OVER (ORDER BY OFFSET)
   FROM UNNEST(days_seen) AS seen WITH OFFSET
   WHERE seen
   LIMIT 1) AS days_since_seen,
  (SELECT COUNT(seen) OVER ()
   FROM UNNEST(days_seen) AS seen WITH OFFSET
   WHERE seen AND OFFSET < 7
   LIMIT 1) AS days_seen_last7
FROM
  raw

That yields the expected days_since_seen = 2 and days_seen_last7 = 3. We might want to factor the above into UDFs to ease extraction.
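The same two aggregates written out in plain Python, just to make the intent of the proposed UDFs concrete (a sketch, not the eventual SQL):

"""Sketch: the two aggregates over a days_seen array, in plain Python."""
days_seen = [False, False, True, False, True, True,
             False, False, True, False, True, False]

# Offset of the most recent active day (index 0 is the day of measurement).
days_since_seen = next(
    (offset for offset, seen in enumerate(days_seen) if seen), None)

# Number of active days in the trailing 7-day window.
days_seen_last7 = sum(days_seen[:7])

print(days_since_seen, days_seen_last7)  # 2 3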

We'd move to the array format for tracking all usage criteria (so we'd have days_seen, days_visited_5_uri, etc.) in clients_last_seen, and then we'd calculate things like visited_5_uri_wau and visited_5_uri_mean_weekly_days for the firefox_desktop_exact_mau28_by_dimensions table.

Pros:

  • More generic and flexible structure for tracking activity history

Cons:

  • Makes clients_last_seen more abstract and less ergonomic to use directly, erasing the usability gains of the recent switch to days_since_seen

Consider adding field to clients_daily/etc. to identify profiles created due to v67 profile-per-install change

@jklukas, as discussed, it is likely that we will often need to perform analysis excluding the profiles created due to the v67 profile-per-install change. For consistency, reduced likelihood of error, and cost, it may make sense for data engineering to own this and expose the needed tables/fields for analysis.

@SuYoungHong did some analysis of this and suggests the following query to identify the profiles:

SELECT
  DISTINCT client_id
FROM
  main_summary
WHERE
  submission_date_s3 >= '20190131'
  AND scalar_parent_startup_profile_selection_reason IN (
    'firstrun-skipped-default',
    'restart-skipped-default')

The date is the first date for which he observed the scalar_parent_startup_profile_selection_reason being populated.

@gkabbz has done some work here as well, specifically looking at new profiles creation. He suggests it may be necessary to look at first shutdown pings as well as main_summary to find all the relevant profiles.

Dump public JSON data to the correct bucket

This follows up on #294. The bigquery-etl runner will need to again read the metadata.yaml and determine if a query is a public-json query; if so, we will need to write it to the correct place. That location is being determined in Bug 1573826. That will entail the following:

  1. Create the table, as done previously (also includes the public-BQ tables)
  2. Parse metadata.yaml and determine if this is a public-json table
  3. If so, use bq extract to write the data as ndjson to GCS
  4. Once the data is there, we need to change from ndjson to a single JSON array. This could possibly be done with a cloud function(?) or locally
  5. Write the file out to gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/files/$date/$filename for an incremental table, or gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/files/$filename for a non-incremental table
  6. Update gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/last_updated
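A hedged sketch of steps 4 and 5 done locally with the google.cloud.storage client; the bucket name and object paths are illustrative assumptions pending Bug 1573826:

"""Sketch: rewrite an ndjson extract as a single JSON array in the public bucket."""
import json

from google.cloud import storage

BUCKET = "public-data-bucket"  # assumption: the bucket chosen in Bug 1573826
client = storage.Client()
bucket = client.bucket(BUCKET)

def ndjson_to_json_array(src_blob_name, dest_blob_name):
    """Download an ndjson blob, wrap its rows in a JSON array, and re-upload."""
    ndjson = bucket.blob(src_blob_name).download_as_text()
    rows = [json.loads(line) for line in ndjson.splitlines() if line]
    bucket.blob(dest_blob_name).upload_from_string(
        json.dumps(rows), content_type="application/json")

# Paths are illustrative; incremental tables include a date component.
ndjson_to_json_array(
    "extract/telemetry_derived/example_table_v1/000000000000.json",
    "api/v1/tables/telemetry_derived/example_table/v1/files/2019-08-20/data.json",
)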

Auto-generate Views on Prod Tables

Currently we have a view defined for every table in prod. We're proposing auto-generating these views instead. The code to auto-generate the views should live alongside the table deploys, which happen daily (once the generated-schemas branch is pushed to MPS).

If a view exists here instead, then that view will override the default one; this will e.g. allow us to selectively update these views to handle new columns, data changes, or unions of versions.

New versions will be automatically pointed to by the view. If a union is needed with a previous version that will have to be done manually.

cc @jklukas

Create public tables in the correct project

This task follows up on Bug 1573822 and #294. Once we're able to label public BQ datasets, we will need to write those query results to the public BQ project (instead of the default project).

Once the data is there, we will also write an internal view on that data in the default project.

This will take the format of:

  • bin/entrypoint reading the metadata.yaml, and determining if this is a bq-public query
  • If it is not, continue as before
  • If it is, write to the public project
  • Write a view in the default project on that public table

Depending on the outcome in bug 1573822, we may also need to create the dataset in the public project (if it doesn't exist). We should coordinate with ops in that bug on how we want to tackle that.

Script for publishing persistent UDFs

We should have a script to run through the udf/ directory and publish all of them as persistent UDFs in the udf dataset.

As part of this, we may want to alter the file names and definition files to not include a udf_ prefix, and only insert that prefix when we're generating files in sql/ to provide temporary view definitions.

Or, we can leave the files as is and have the script's logic strip the prefix when publishing to the udf dataset.
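A minimal sketch of the second option, assuming one definition per file and that the script strips the udf_ prefix at publish time (the project and dataset names are assumptions):

"""Sketch: publish every definition under udf/ as a persistent UDF in the udf dataset."""
import glob
import os

from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"  # assumption
client = bigquery.Client(project=PROJECT)

for path in sorted(glob.glob("udf/udf_*.sql")):
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        sql = f.read()
    # Option discussed above: keep the udf_ prefix in the files, strip it when publishing.
    persistent_name = name[len("udf_"):]
    sql = sql.replace(
        f"CREATE TEMP FUNCTION {name}",
        f"CREATE OR REPLACE FUNCTION `{PROJECT}.udf.{persistent_name}`",
    )
    client.query(sql).result()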

Main ping copy_deduplicate query raises memory exception

tl;dr -

Resources exceeded during query execution: The query could not be executed in the allotted memory

Looks like a single day of main ping is too much for the copy_deduplicate query as currently expressed. The Airflow job failed last night. From the logs:

[2019-08-23 04:57:44,743] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,743] {pod_launcher.py:104} INFO -     raise exceptions.from_http_response(response)
[2019-08-23 04:57:44,747] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,747] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/moz-fx-data-derived-datasets/queries/c31b5f34-69c7-45d7-8b89-ec0d89153f70?maxResults=0&location=US: Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 101% of limit.
[2019-08-23 04:57:44,748] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,748] {pod_launcher.py:104} INFO - Top memory consumer(s):
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   query parsing and optimization: 5%
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   other/unattributed: 95%

I will look this morning into whether it's possible to recast this query to be more efficient. It may be necessary to break this into two steps with a temp table in between.

cc @relud @whd

Consider alternatives to test_generated.py

The tests don't allow us to use pytest tests/$QUERY/$TEST because that directory only contains the resources for the test, and no Python files.

We might be able to:

  1. use a pytest hook to act like those directories are tests
  2. or require a test_run.py in each tests/$QUERY/$TEST directory with some boilerplate to run the test

I'm leaning towards option 2, but I'm not sure.

Write script to check state of requiring partition filters

  • should be scheduled in airflow
  • should be able to check all tables in the telemetry dataset at once
  • should return non-zero if any tables do not require a partition filter, unless:
    • the table is under 1TB in size
    • the table is in an exemption list

Drop generated_time columns

It is a duplicate of existing BigQuery table metadata (lastModifiedTime for each table/partition), so it's only useful if we aren't setting WRITE_TRUNCATE.

example query showing that we can see last modified time per-partition:

seq 20190323 20190324 |
    xargs -i~ bq show moz-fx-data-derived-datasets:telemetry.clients_daily_v6'$'~ |
    jq '.lastModifiedTime|fromjson|./1000' |
    TZ=UTC xargs -i~ date -d@~ +'%F %T %Z'

output

2019-03-24 18:38:25 UTC
2019-03-25 18:30:36 UTC

and for the whole table:

bq show moz-fx-data-derived-datasets:telemetry.clients_daily_v6 |
    jq '.lastModifiedTime|fromjson|./1000' |
    TZ=UTC xargs -i~ date -d@~ +'%F %T %Z'

output

2019-03-25 18:30:36 UTC

Investigate using Storage API to modify table schemas

Queries cost $5 per TiB scanned and must complete in 6 hours.

The BigQuery Storage API costs $1.10 per TiB scanned, can read a subset of columns, and has no time-limit on operations.

Maybe we can save on the cost and effort needed to change partitioning, change clustering, delete columns, or change columns to required, with an automated process that reads Avro records from the Storage API, writes them to GCS, loads them into a new table, and cleans up afterward.

Consider moving FxA amplitude event export to after midnight PDT

FxA Amplitude uses PDT as its timezone of reference. We are pulling those events into BigQuery once a day. I can't find exactly when, but it looks like before midnight PDT, and likely just after midnight UTC.

This means we have to wait an extra day for complete data in BQ if we want to write queries framed around PDT - I prefer to do this to keep results comparable to the actual Amplitude UI.

Would it cause any problems to move the ETL to sometime shortly after midnight PDT? @relud any thoughts? (I think Jeff is still on PTO)
