mozilla / bigquery-etl

Bigquery ETL

Home Page: https://mozilla.github.io/bigquery-etl

License: Mozilla Public License 2.0

Python 93.62% Dockerfile 0.10% Shell 3.05% JavaScript 0.96% HTML 0.12% Jinja 2.15%

bigquery-etl's Introduction


BigQuery ETL

This repository contains the Mozilla Data Team's:

  • Derived ETL jobs that do not require a custom container
  • User-defined functions (UDFs)
  • Airflow DAGs for scheduled bigquery-etl queries
  • Tools for query & UDF deployment, management and scheduling

For more information, see https://mozilla.github.io/bigquery-etl/

Quick Start

Pre-requisites

  • Pyenv (optional) - Recommended if you want to install multiple versions of Python; see the instructions here. After installing pyenv, make sure that your terminal app is configured to run the shell as a login shell.
  • Homebrew (not required, but useful for Mac) - Follow the instructions here to install Homebrew on your Mac.
  • Python 3.11+ - (see this guide for instructions if you're on a Mac and haven't installed anything other than the default system Python).

GCP CLI tools

  • For Mozilla Employees (not in Data Engineering) - Set up GCP command line tools, as described on docs.telemetry.mozilla.org. Note that some functionality (e.g. writing UDFs or backfilling queries) may not be allowed. Run gcloud auth login --update-adc to authenticate against GCP.
  • For Data Engineering - In addition to setting up the command line tools, you will want to log in to shared-prod if making changes to production systems. Run gcloud auth login --update-adc --project=moz-fx-data-shared-prod (if you have not run it previously).

Installing bqetl

  1. Clone the repository
git clone git@github.com:mozilla/bigquery-etl.git
cd bigquery-etl
  2. Install the bqetl command line tool
./bqetl bootstrap
  3. Install standard pre-commit hooks
venv/bin/pre-commit install

Finally, if you are using Visual Studio Code, you may also wish to use our recommended defaults:

cp .vscode/settings.json.default .vscode/settings.json
cp .vscode/launch.json.default .vscode/launch.json

And you should now be set up to start working in the repo! The easiest way to do many tasks is to use bqetl. You may also want to read up on common workflows.

bigquery-etl's People

Contributors

acmiyaguchi, akkomar, alekhyamoz, anich, badboy, benwu, chelseybeck, curtismorales, dependabot-preview[bot], dependabot[bot], edugfilho, fbertsch, gleonard-m, iinh, jklukas, kik-kik, kwindau, lelilia, lucia-vargas-a, m-d-bowerman, marlene-m-hirose, ncloudioj, quiiver, rebecca-burwei, relud, scholtzan, sean-rose, whd, wlach, wwyc


bigquery-etl's Issues

Ensure we can support cross-project queries in Airflow

As discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1563742, we will be starting to migrate all derived tables into the shared-prod project, so there will be an interim period where we will have tables split between derived-datasets:telemetry and shared-prod:telemetry_raw. As we migrate, we will be providing views in derived-datasets:telemetry and shared-prod:telemetry to ensure a consistent and uninterrupted interface for users.

During the interim, we will have some queries in Airflow writing to derived-datasets and some to shared-prod, and some may need to read from a different project than the one they write to. We should check that we understand how to express this in our Airflow DAGs and that our entrypoint script is able to support these scenarios.

Script for creating default views

As part of the accepted BQ Layout and Table Structure Proposal, we will be hiding tables from users in the telemetry_raw and other <namespace>_raw datasets in shared-prod, and providing user-facing views in the telemetry dataset.

It is becoming difficult to manage this by hand, so we need to start automating it. A first step could be a script in this repository that lists tables in the shared-prod *_raw datasets, and generates simple CREATE VIEW namespace.foo AS SELECT * FROM namespace_raw.foo_<highest_version_number> files in the templates directory wherever they are missing. We could then run this offline, prune and adjust the contents by hand, and commit the results.

Depends on #189
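A minimal sketch of what such a script might look like, using the google.cloud.bigquery client; the project name, templates layout, and versioned-table naming convention are assumptions:

"""Sketch: generate missing default view definitions for *_raw datasets."""
import os
import re

from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"  # assumption: target project
TEMPLATES_DIR = "templates"          # assumption: output directory

client = bigquery.Client(project=PROJECT)

for dataset in client.list_datasets(PROJECT):
    raw_name = dataset.dataset_id
    if not raw_name.endswith("_raw"):
        continue
    namespace = raw_name[: -len("_raw")]
    # Keep only the highest version of each versioned table, e.g. foo_v2 over foo_v1.
    latest = {}
    for table in client.list_tables(f"{PROJECT}.{raw_name}"):
        match = re.match(r"(?P<base>.*)_v(?P<version>\d+)$", table.table_id)
        if not match:
            continue
        base, version = match.group("base"), int(match.group("version"))
        if version > latest.get(base, (0, None))[0]:
            latest[base] = (version, table.table_id)
    for base, (version, table_id) in latest.items():
        path = os.path.join(TEMPLATES_DIR, namespace, f"{base}.sql")
        if os.path.exists(path):
            continue  # only generate files that are missing
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(
                f"CREATE VIEW `{PROJECT}.{namespace}.{base}` AS\n"
                f"SELECT * FROM `{PROJECT}.{raw_name}.{table_id}`\n"
            )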

Move queries into named directories

We will move each query into its appropriately named subdirectory, e.g. sql/search/search_clients_daily_v7.sql will move to sql/search/search_clients_daily_v7/query.sql.

The Airflow runner will have to be notified of this change as well.

Optional: Move view creation scripts to a view.sql file instead of query.sql.

We will also add a metadata.yaml file to every query subdirectory. There are currently only a few query writers, so we should be able to appropriately fill in owners, friendly_name, and description; optionally, some of the job descriptions can be put in labels, e.g. application, refresh, version, incremental.

For more details, see the public data proposal.

Script for running all view definitions

We schedule queries individually in DAGs for tables we want to populate daily, but we don't have any automation yet for running views defined in this repository.

We should likely have a job in Airflow that scans through this repo for all queries that define views and runs them, so that we don't get drift between what's defined in this repo and the views that actually live in BQ.

Also, if the underlying tables have their schemas evolve, such as adding fields, those fields won't show up in the BQ console when you inspect the schema of the view, since the view schema is determined at creation time. Running a new "CREATE OR REPLACE VIEW" statement should get the updated schema applied for the view.

So, the result of this issue should be a script in this repository that scans through sql/ to find files that contain CREATE OR REPLACE VIEW statements and runs them.
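A minimal sketch, assuming views are stored as standalone CREATE OR REPLACE VIEW statements in .sql files under sql/ (the project name is an assumption):

"""Sketch: find and run all CREATE OR REPLACE VIEW statements under sql/."""
import glob

from google.cloud import bigquery

client = bigquery.Client(project="moz-fx-data-shared-prod")  # assumption

for path in sorted(glob.glob("sql/**/*.sql", recursive=True)):
    with open(path) as f:
        sql = f.read()
    if "CREATE OR REPLACE VIEW" not in sql:
        continue
    print(f"Publishing view from {path}")
    # Views are cheap to recreate, so we simply re-run the full statement;
    # this also refreshes the view's schema if underlying tables changed.
    client.query(sql).result()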

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Run python linters on python scripts

I thought this would be as easy as adding specific files to a list somewhere, but pytest is refusing to load files that don't end in a .py extension.

Create more usable FxA content and auth server datasets, or fix current ones

Pulling from the fxa content server logs, here's an example of a query that would give us a usable content server dataset for trailhead. There's a similar query for the auth server that I could make.

Basically, we need easily queryable datasets that contain non-null values from jsonPayload.fields.user_properties and jsonPayload.fields.event_properties.

The current jobs that create fxa_content_events_v1 and fxa_auth_events_v1 only have null values for the fields contained in jsonPayload.fields.user_properties and jsonPayload.fields.event_properties.

So, we could either fix those jobs or create new ones based on queries like that above.

@jklukas opinions?

(you may need to refresh the link to the query above, I forgot to save it)

Investigate hash functions for id_bucket

We are currently using the following calculation for assigning records an id_bucket, which enables confidence interval calculations:

MOD(ABS(FARM_FINGERPRINT(user_id)), 20) AS id_bucket

That is the simplest and likely most computationally efficient solution using built-in BigQuery functions. But we have not yet rigorously evaluated it for potential sources of bias:

  • Fingerprint64 does not have as robust an avalanche property as cryptographic functions such as MD5 (see #33 for an MD5-based implementation)
  • There may be some bias introduced by ABS

For further discussion, see #27, https://github.com/mozilla/bigquery-etl/pull/32/files#r268184403, and #33.
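For comparison, here is a hedged Python sketch of an MD5-based alternative (in the spirit of #33), checked off-line against a simulated population; it does not reproduce FARM_FINGERPRINT itself, and the bucket count and sample size are arbitrary:

"""Sketch: check bucket uniformity for an MD5-based id_bucket assignment."""
import hashlib
import uuid
from collections import Counter

N_BUCKETS = 20

def md5_bucket(user_id: str) -> int:
    # MD5 has stronger avalanche behaviour; take the digest as a big integer mod 20.
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % N_BUCKETS

# Simulate a population of ids and check that buckets come out roughly uniform.
counts = Counter(md5_bucket(str(uuid.uuid4())) for _ in range(200_000))
expected = 200_000 / N_BUCKETS
worst = max(abs(c - expected) / expected for c in counts.values())
print(f"max relative deviation from uniform: {worst:.3%}")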

Script for populating raw deduplicated tables from live tables

As discussed in the BigQuery Table Layout and Structure Proposal, we will have the GCP pipeline populate "live" tables clustered on submission_timestamp, then rely on Airflow to run a nightly job to populate "raw" tables clustered on sample_id.

That likely will look like an additional mode in this repo's entrypoint script that will invoke a query like the following:

WITH
  srctable AS (
  SELECT
    *
  FROM
    `moz-fx-data-shared-prod.${document_namespace}_live.${document_type}_v${document_version}` 
  WHERE
    DATE(submission_timestamp) = @submission_date ),
  --
  numbered_duplicates AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM
    srctable )
  --
SELECT
  * EXCEPT (_n)
FROM
  numbered_duplicates
WHERE
  _n = 1

with output going to destination table moz-fx-data-shared-prod.${document_namespace}_raw.${document_type}_v${document_version}$ds_nodash.

We will need to have two different modes. In one, we run the above query for only a specific table or set of tables. We'll need to add that at the root of the main_summary DAG, for example, to get live main pings into the deduplicated raw table before running main_summary and all the downstream jobs.

In the other mode, we run the above query for all tables in _live datasets that are not already run as part of other DAGs. We probably need to pass in a list of tables to exclude, and keep that in sync with the tables that are handled in other DAGs.
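A rough sketch of how the two modes might look in Python; the table names, the exclusion list, and the use of a partition decorator on the destination are illustrative assumptions, not the final entrypoint design:

"""Sketch: run the dedup query per live table, with an optional exclusion list."""
from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"
client = bigquery.Client(project=PROJECT)

DEDUP_SQL = """
SELECT * EXCEPT (_n)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY submission_timestamp) AS _n
  FROM `{live_table}`
  WHERE DATE(submission_timestamp) = @submission_date )
WHERE _n = 1
"""

def copy_deduplicate(live_table, raw_table, submission_date, ds_nodash):
    """Copy one day of a _live table into the matching _raw date partition."""
    job_config = bigquery.QueryJobConfig(
        # Assumption: the $YYYYMMDD partition decorator is accepted here, as with bq.
        destination=f"{raw_table}${ds_nodash}",
        write_disposition="WRITE_TRUNCATE",
        query_parameters=[
            bigquery.ScalarQueryParameter("submission_date", "DATE", submission_date)
        ],
    )
    client.query(DEDUP_SQL.format(live_table=live_table), job_config=job_config).result()

# Mode 1: specific tables, e.g. main ping ahead of main_summary.
copy_deduplicate(
    f"{PROJECT}.telemetry_live.main_v4",
    f"{PROJECT}.telemetry_raw.main_v4",
    "2019-08-22",
    "20190822",
)

# Mode 2: every table in *_live datasets not already handled by another DAG.
EXCLUDE = {"telemetry_live.main_v4"}  # assumption: kept in sync with other DAGs
for dataset in client.list_datasets(PROJECT):
    if not dataset.dataset_id.endswith("_live"):
        continue
    for item in client.list_tables(f"{PROJECT}.{dataset.dataset_id}"):
        qualified = f"{dataset.dataset_id}.{item.table_id}"
        if qualified in EXCLUDE:
            continue
        raw_dataset = dataset.dataset_id.replace("_live", "_raw")
        copy_deduplicate(
            f"{PROJECT}.{qualified}",
            f"{PROJECT}.{raw_dataset}.{item.table_id}",
            "2019-08-22",
            "20190822",
        )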

Generate docs for all UDFs

We should have a reference page for all the persistent UDFs, and we're much more likely to keep the docs consistent if the documentation lives in docstrings with the code. We'll need to add a build step that auto-generates docs and submits a PR to DTMO if there are changes.

Remove UDF declarations from queries

Follow-up to #141

  • Extend the code in tests/parse_udfs.py to also read through the files under sql/ and identify usage of udfs there
  • Write a python script to use the above udf dependency resolution logic, writing out generated sql files to a new directory (perhaps we call this target/sql to match Maven's conventions); each file would contain the body of all temporary UDFs needed in the query, and then the query text itself
  • Remove UDF declarations from the source files under sql/ and update the logic for our CircleCI deploy job to make sure that the published container has a sql/ directory that contains the generated queries with UDF definitions
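A rough sketch of the dependency-resolution and generation step, assuming each file udf/udf_foo.sql defines exactly one temporary UDF named after the file:

"""Sketch: inject temporary UDF definitions into generated queries under target/sql/."""
import glob
import os
import re

UDF_RE = re.compile(r"\budf_[a-zA-Z0-9_]+\b")

# Load every UDF definition; assume one definition per file named after the UDF.
udfs = {}
for path in glob.glob("udf/udf_*.sql"):
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        udfs[name] = f.read()

def deps(sql, resolved=None, in_progress=None):
    """Collect the UDFs a piece of SQL depends on, dependencies before dependents."""
    resolved = [] if resolved is None else resolved
    in_progress = set() if in_progress is None else in_progress
    for name in UDF_RE.findall(sql):
        if name in udfs and name not in resolved and name not in in_progress:
            in_progress.add(name)
            deps(udfs[name], resolved, in_progress)  # a UDF body may use other UDFs
            resolved.append(name)
    return resolved

for query_path in glob.glob("sql/**/*.sql", recursive=True):
    with open(query_path) as f:
        query = f.read()
    needed = deps(query)
    out_path = os.path.join("target", query_path)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as f:
        # Prepend the temporary UDF bodies, then the original query text.
        f.write("\n".join(udfs[name] for name in needed) + "\n" + query)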

Add script to automate building incremental tables

This script would be used for both creating and backfilling tables. It should eventually grow a feature to automatically detect the difference between a non-recursive table (clients_daily) and a recursive one (clients_last_seen).

Support running in alternate projects

A request we've seen is to use bigquery-etl and its tooling (alongside Airflow) for other projects. This issue will cover how we can tackle that.

To start, we can move all datasets to their own project's directory; e.g. telemetry will move to moz-fx-data-shared-prod/telemetry.

We will add codeowners to each project, such that the Data Engineering team doesn't need to review queries that are in other projects (unless they own them).

We will need to enable the BQ command in the GKE cluster to use alternate projects. Some options might be:

  • Give GKE access to all projects
  • Run the workload inside the associated project's GKE cluster, which Airflow needs access to

cc @jklukas

Harness for deploying and testing (dependent) UDFs

I have previously worked with https://github.com/PeriscopeData/redshift-udfs which provides a structure for defining python UDFs for Redshift along with tests, and a harness for deploying the functions and running tests.

It would be nice to have something similar for this repo. For the BigQuery case, this is made more complicated by the fact that persistent UDFs are not yet generally available.

Here's a potential way forward:

  • Change udf definition files under udf/ to use syntax for creating persistent udfs
  • Write a small python package for parsing through the udf files to determine dependencies (find usages of udf_*) and build a DAG
  • Write a small python package for executing the udf definition files in the order they appear in the DAG
  • Write a test harness that can run creation of the persistent udfs as part of running tests in a generated temporary dataset, then runs tests defined similarly to how we have tests defined for tables
  • Write a small python lib for parsing the files under sql/ and identifying usage of udfs there; we would then inject temporary udf definitions and output a generated directory; this allows us to more reliably use udfs in our production etl queries without having to duplicate them directly into the source files

The above seems like a significant chunk of work, so likely not an immediate priority.

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Add UDF for parsing mobile events

It looks like focus events aren't nested by process the way desktop events are in main ping. We should have a separate UDF for handling that case. Example:

CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  
  RETURNS ARRAY<STRUCT<
  event_timestamp INT64,
  event_category STRING,
  event_object STRING,
  event_method STRING,
  event_string_value STRING,
  event_map_values ARRAY<STRUCT<key STRING, value STRING>>
  >>
  LANGUAGE js AS """
    if (input == null) {
      return null;
    }
    var parsed = JSON.parse(input);
    var result = [];
    parsed.forEach(event => {
        var structured = {
          "event_timestamp": event[0],
          "event_category": event[1],
          "event_method": event[2],
          "event_object": event[3],
          "event_string_value": event[4],
          "event_map_values": []
        }
        for (var key in event[5]) {
          structured.event_map_values.push({"key": key, "value": event[5][key]})
        }
        result.push(structured)
    });
    return result;
""";


SELECT 
  client_id, event
FROM
  `moz-fx-data-shar-nonprod-efed.telemetry.focus_event`
CROSS JOIN
  UNNEST(udf_js_json_extract_events(JSON_EXTRACT(additional_properties, '$.events'))) AS event
WHERE
  DATE(submission_timestamp) = '2019-08-03'
LIMIT 100

See https://bugzilla.mozilla.org/show_bug.cgi?id=1570264

Make generate_incremental_table less platform-dependent

I attempted to use generate_incremental_table, but kept encountering errors with xargs and date options, since I don't have GNU coreutils installed on my Mac.

It may be worth rewriting this as a Python script that handles the generation of dates using Python's stdlib, and then eventually spawns off invocations of the existing entrypoint script.
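A minimal sketch of the date handling in Python; the entrypoint invocation and its flags are illustrative assumptions:

"""Sketch: generate the date range in Python instead of relying on GNU date/xargs."""
import subprocess
import sys
from datetime import date, timedelta

def date_range(start: date, end: date):
    """Yield each date from start to end inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

start, end = date.fromisoformat(sys.argv[1]), date.fromisoformat(sys.argv[2])
for day in date_range(start, end):
    # Hand each day off to the existing entrypoint script (flags are illustrative).
    subprocess.run(
        ["bin/entrypoint", "--parameter", f"submission_date:DATE:{day.isoformat()}"],
        check=True,
    )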

Add test for up to date udfs in queries

For each function in udf/, we should test that, if it is used in a query in sql/, the function is present at the top of the query and actually matches the code provided in udf/.
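A hedged sketch of such a test using pytest, assuming each file udf/udf_NAME.sql contains the full definition of udf_NAME:

"""Sketch: pytest check that UDF definitions embedded in queries match udf/."""
import glob
import os

import pytest

UDF_FILES = glob.glob("udf/udf_*.sql")
SQL_FILES = glob.glob("sql/**/*.sql", recursive=True)

@pytest.mark.parametrize("udf_path", UDF_FILES)
def test_udf_in_queries_is_current(udf_path):
    name = os.path.splitext(os.path.basename(udf_path))[0]
    with open(udf_path) as f:
        definition = f.read().strip()
    for sql_path in SQL_FILES:
        with open(sql_path) as f:
            sql = f.read()
        if name in sql:
            # Any query that uses the UDF must embed the exact current definition.
            assert definition in sql, f"{sql_path} has a stale copy of {name}"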

Filter on fxa_content_auth_events_v1 causes some events to be dropped

Here's an example of an event that makes it to amplitude but is filtered out by the current content server ETL job.

(querying for its flow_id in fxa_content_events_v1 shows it's missing, but it's there in the FxA logs)

I think this is because of this line in the etl job:
https://github.com/mozilla/bigquery-etl/blob/master/sql/fxa_content_events_v1.sql#L14

This causes many form views of relier-hosted forms to be filtered out, for example form views on about:welcome that are logged via a call to the FxA metrics endpoint. Without these view events it's not possible to construct an end-to-end funnel that begins on the relier's site.

Based on querying for null user ids in the FxA logs, I think removing this filter would add another 3M events per day to the derived datasets.

These are almost all this fxa_email_first - view event. Compare GCP with amplitude.

I do think we need these events for some analyses, but I'm guessing it would be costly to backfill them, so maybe we can start by just removing the filter going forward? Or would that break something else that I'm not aware of? I think we would at least have to make user_id nullable (if it's not already).

The filter can stay on the auth server ETL, there will never be null user_ids there.

[Proposal] Maintain arrays of active days in clients_last_seen

@relud just finished figuring out a refactor of date_last_seen to days_since_seen, but I'm going to propose another refactor to accommodate days-per-week calculations needed by @jmccrosky for the Smoot project.

For a user who was active in a given 7-day period (equivalent to WAU), we can measure intensity of their usage by counting the number of active days in that same period. In order to do so, though, we need to know not just the last day they were active, but rather maintain a history of which days were active in the previous 7.

I propose we generalize this concept such that instead of maintaining activity history as a single integer days_since_seen, we maintain a days_seen ARRAY<BOOL> where the first entry in the array represents whether that user was active in the day of measurement, the second entry indicates whether they were active on the previous day, etc.

We can recover both days_since_seen and days_seen_last7 from days_seen. Here's an example of those calculations on some dummy data:

WITH
  raw AS (
    SELECT
      [FALSE, FALSE, TRUE, FALSE, TRUE, TRUE,
       FALSE, FALSE, TRUE, FALSE, TRUE, FALSE] AS days_seen )
SELECT
  (SELECT FIRST_VALUE(OFFSET) OVER (ORDER BY OFFSET)
   FROM UNNEST(days_seen) AS seen WITH OFFSET
   WHERE seen
   LIMIT 1) AS days_since_seen,
  (SELECT COUNT(seen) OVER ()
   FROM UNNEST(days_seen) AS seen WITH OFFSET
   WHERE seen AND OFFSET < 7
   LIMIT 1) AS days_seen_last7
FROM
  raw

That yields the expected days_since_seen = 2 and days_seen_last7 = 3. We might want to factor the above into UDFs to ease extraction.
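The same two aggregates written out in plain Python, just to make the intent of the proposed UDFs concrete (a sketch, not the eventual SQL):

"""Sketch: the two aggregates over a days_seen array, in plain Python."""
days_seen = [False, False, True, False, True, True,
             False, False, True, False, True, False]

# Offset of the most recent active day (index 0 is the day of measurement).
days_since_seen = next(
    (offset for offset, seen in enumerate(days_seen) if seen), None)

# Number of active days in the trailing 7-day window.
days_seen_last7 = sum(days_seen[:7])

print(days_since_seen, days_seen_last7)  # 2 3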

We'd move to the array format for tracking all usage criteria (so we'd have days_seen, days_visited_5_uri, etc.) in clients_last_seen, and then we'd calculate things like visited_5_uri_wau and visited_5_uri_mean_weekly_days for the firefox_desktop_exact_mau28_by_dimensions table.

Pros:

  • More generic and flexible structure for tracking activity history

Cons:

  • Makes clients_last_seen more abstract and less ergonomic to use directly, erasing the usability gains of the recent switch to days_since_seen

Consider adding field to clients_daily/etc. to identify profiles created due to v67 profile-per-install change

@jklukas, as discussed, it is likely that we will often need to perform analysis excluding the profiles created due to the v67 profile-per-install change. For consistency, reduced likelihood of error, and cost, it may make sense for data engineering to own this and expose the needed tables/fields for analysis.

@SuYoungHong did some analysis of this and suggests the following query to identify the profiles:

SELECT
  DISTINCT client_id
FROM
  main_summary
WHERE
  submission_date_s3 >= '20190131'
  AND scalar_parent_startup_profile_selection_reason IN (
    'firstrun-skipped-default',
    'restart-skipped-default')

The date is the first date for which he observed the scalar_parent_startup_profile_selection_reason being populated.

@gkabbz has done some work here as well, specifically looking at new profiles creation. He suggests it may be necessary to look at first shutdown pings as well as main_summary to find all the relevant profiles.

Dump public JSON data to the correct bucket

This follows up on #294. The bigquery-etl runner will need to again read the metadata.yaml and determine if a query is a public-json query; if so, we will need to write it to the correct place. That location is being determined in Bug 1573826. That will entail the following:

  1. Create the table, as done previously (also includes the public-BQ tables)
  2. Parse metadata.yaml and determine if this is a public-json table
  3. If so, use bq extract to write the data as ndjson to GCS
  4. Once the data is there, we need to change from ndjson to a single JSON array. This could possibly be done with a cloud function(?) or locally
  5. Write the file out to gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/files/$date/$filename for an incremental table, or gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/files/$filename for a non-incremental table
  6. Update gcs://$bucket/api/v1/tables/$dataset/$tablename/v1/last_updated
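A hedged sketch of steps 4 and 5 done locally with the google.cloud.storage client; the bucket name and object paths are illustrative assumptions pending Bug 1573826:

"""Sketch: rewrite an ndjson extract as a single JSON array in the public bucket."""
import json

from google.cloud import storage

BUCKET = "public-data-bucket"  # assumption: the bucket chosen in Bug 1573826
client = storage.Client()
bucket = client.bucket(BUCKET)

def ndjson_to_json_array(src_blob_name, dest_blob_name):
    """Download an ndjson blob, wrap its rows in a JSON array, and re-upload."""
    ndjson = bucket.blob(src_blob_name).download_as_text()
    rows = [json.loads(line) for line in ndjson.splitlines() if line]
    bucket.blob(dest_blob_name).upload_from_string(
        json.dumps(rows), content_type="application/json")

# Paths are illustrative; incremental tables include a date component.
ndjson_to_json_array(
    "extract/telemetry_derived/example_table_v1/000000000000.json",
    "api/v1/tables/telemetry_derived/example_table/v1/files/2019-08-20/data.json",
)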

Auto-generate Views on Prod Tables

Currently we have a view defined for every table in prod. We're proposing auto-generating these views instead. The code to auto-generate the views should live alongside the table deploys, which happen daily (once the generated-schemas branch is pushed to MPS).

If a view exists here instead, then that view will override the default one; this will e.g. allow us to selectively update these views to handle new columns, data changes, or unions of versions.

New versions will be automatically pointed to by the view. If a union is needed with a previous version that will have to be done manually.

cc @jklukas

Create public tables in the correct project

This task follows up on Bug 1573822 and #294. Once we're able to label public BQ datasets, we will need to write those query results to the public BQ project (instead of the default project).

Once the data is there, we will also write an internal view on that data in the default project.

This will take the format of:

  • bin/entrypoint reading the metadata.yaml, and determining if this is a bq-public query
  • If it is not, continue as before
  • If it is, write to the public project
  • Write a view in the default project on that public table

Depending on the outcome in bug 1573822, we may also need to create the dataset in the public project (if it doesn't exist). We should coordinate with ops in that bug on how we want to tackle that.

Script for publishing persistent UDFs

We should have a script to run through the udf/ directory and publish all of them as persistent UDFs in the udf dataset.

As part of this, we may want to alter the file names and definition files to not include a udf_ prefix, and only insert that prefix when we're generating files in sql/ to provide temporary view definitions.

Or, we can leave the files as is and have the script's logic strip the prefix when publishing to the udf dataset.
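A minimal sketch of the second option, assuming one definition per file and that the script strips the udf_ prefix at publish time (the project and dataset names are assumptions):

"""Sketch: publish every definition under udf/ as a persistent UDF in the udf dataset."""
import glob
import os

from google.cloud import bigquery

PROJECT = "moz-fx-data-shared-prod"  # assumption
client = bigquery.Client(project=PROJECT)

for path in sorted(glob.glob("udf/udf_*.sql")):
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        sql = f.read()
    # Option discussed above: keep the udf_ prefix in the files, strip it when publishing.
    persistent_name = name[len("udf_"):]
    sql = sql.replace(
        f"CREATE TEMP FUNCTION {name}",
        f"CREATE OR REPLACE FUNCTION `{PROJECT}.udf.{persistent_name}`",
    )
    client.query(sql).result()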

Main ping copy_deduplicate query raises memory exception

tl;dr -

Resources exceeded during query execution: The query could not be executed in the allotted memory

Looks like a single day of main ping is too much for the copy_deduplicate query as currently expressed. The Airflow job failed last night. From the logs:

[2019-08-23 04:57:44,743] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,743] {pod_launcher.py:104} INFO -     raise exceptions.from_http_response(response)
[2019-08-23 04:57:44,747] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,747] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://www.googleapis.com/bigquery/v2/projects/moz-fx-data-derived-datasets/queries/c31b5f34-69c7-45d7-8b89-ec0d89153f70?maxResults=0&location=US: Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 101% of limit.
[2019-08-23 04:57:44,748] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,748] {pod_launcher.py:104} INFO - Top memory consumer(s):
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   query parsing and optimization: 5%
[2019-08-23 04:57:44,749] {logging_mixin.py:95} INFO - [2019-08-23 04:57:44,749] {pod_launcher.py:104} INFO -   other/unattributed: 95%

I will look this morning into whether it's possible to recast this query to be more efficient. It may be necessary to break this into two steps with a temp table in between.

cc @relud @whd

Consider alternatives to test_generated.py

The tests don't allow us to use pytest tests/$QUERY/$TEST because that directory only contains the resources for the test, and no Python files.

We might be able to:

  1. use a pytest hook to act like those directories are tests
  2. or require a test_run.py in each tests/$QUERY/$TEST directory with some boilerplate to run the test

I'm leaning towards option 2, but I'm not sure.

Write script to check state of requiring partition filters

  • should be scheduled in airflow
  • should be able to check all tables in the telemetry dataset at once
  • should return non-zero if any tables do not require a partition filter, unless:
    • the table is under 1TB in size
    • the table is in an exemption list

Drop generated_time columns

It is a duplicate of existing BigQuery table metadata (lastModifiedTime for each table/partition), so it's only useful if we aren't setting WRITE_TRUNCATE.

example query showing that we can see last modified time per-partition:

seq 20190323 20190324 |
    xargs -i~ bq show moz-fx-data-derived-datasets:telemetry.clients_daily_v6'$'~ |
    jq '.lastModifiedTime|fromjson|./1000' |
    TZ=UTC xargs -i~ date -d@~ +'%F %T %Z'

output

2019-03-24 18:38:25 UTC
2019-03-25 18:30:36 UTC

and for the whole table:

bq show moz-fx-data-derived-datasets:telemetry.clients_daily_v6 |
    jq '.lastModifiedTime|fromjson|./1000' |
    TZ=UTC xargs -i~ date -d@~ +'%F %T %Z'

output

2019-03-25 18:30:36 UTC

Investigate using Storage API to modify table schemas

Queries cost $5 per TiB scanned and must complete in 6 hours.

The BigQuery Storage API costs $1.10 per TiB scanned, can read a subset of columns, and has no time-limit on operations.

Maybe we can save on the cost and effort needed to change partitioning, change clustering, delete columns, or change columns to required, with an automated process that reads Avro records from the Storage API, writes them to GCS, loads them into a new table, and cleans up afterward.

Consider moving FxA amplitude event export to after midnight PDT

FxA Amplitude uses PDT as its timezone of reference. We are pulling those events into BigQuery once a day. I can't find exactly when, but it looks like before midnight PDT, and likely just after midnight UTC.

This means we have to wait an extra day for complete data in BQ if we want to write queries framed around PDT - I prefer to do this to keep results comparable to the actual Amplitude UI.

Would it cause any problems to move the ETL to sometime shortly after midnight PDT? @relud any thoughts? (I think Jeff is still on PTO)
