httparchive / bigquery

BigQuery import and processing pipelines

Ruby 0.12% Shell 1.15% Java 1.09% Jupyter Notebook 96.98% Python 0.63% JavaScript 0.04%

bigquery's Introduction

The HTTP Archive tracks how the Web is built

!! Important: This repository is deprecated. Please see HTTPArchive/httparchive.org for the latest development !!

This repo contains the source code powering the HTTP Archive data collection.

What is the HTTP Archive?

Successful societies and institutions recognize the need to record their history - this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web's digitized content.

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

bigquery's People

Contributors

dependabot[bot], igrigorik, malchata, rreverser, rviscomi, tomayac, tunetheweb


bigquery's Issues

Automate CDN HTTPS cert renewal

HTTPArchive/httparchive.org#14 documents the steps needed to renew the cert.

The certificate currently has a 3 month expiration. The next expiration date is April 9, 2018. Ensure that it is renewed before then and ideally automate the process.

I'm assigning this issue to the bigquery repo because I think it makes the most sense to have a cron job on the GCE instance doing the automation, as opposed to the GAE web server itself.

blink_features.usage has null rank column

Since we have this column, can we populate it with the new CrUX ranking? It's confusing not to have it in here, makes joins more difficult, and means you need an extra join to the summary_pages table to get the ranking.

@rviscomi / @pmeenan I'm not sure what populates this table, so where would this change need to be made?
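
In the meantime, the workaround looks something like the sketch below: join out to a summary_pages table to recover the rank. Table and column names on the blink_features side are assumptions on my part, not verified against the live schema, and the date is only an example.

#standardSQL
-- Hedged sketch of the extra join described above; blink_features column names
-- and the example date are assumptions (a yyyymmdd/client filter would also be
-- needed in practice).
SELECT
  f.feature,
  sp.rank,
  COUNT(DISTINCT f.url) AS num_urls
FROM
  `httparchive.blink_features.features` AS f
JOIN
  `httparchive.summary_pages.2022_01_01_mobile` AS sp
ON
  f.url = sp.url
GROUP BY
  f.feature,
  sp.rank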

Duplicates in BigQuery

Hi folks,

There are too many duplicates in the technology tables. For example, in many rows WordPress is stored as both CMS and Blog for the same URL. Some JavaScript libraries are likewise stored as both JavaScript Framework and JavaScript Library. In these duplicates, only the category name differs. I think these duplicates should be removed in the next releases (at least).
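
For reference, a hedged sketch of how to surface these duplicates (column names follow the public technologies tables; the date is just an example):

#standardSQL
-- List URL/app pairs reported under more than one category.
SELECT
  url,
  app,
  ARRAY_AGG(DISTINCT category ORDER BY category) AS categories
FROM
  `httparchive.technologies.2022_01_01_mobile`
GROUP BY
  url,
  app
HAVING
  COUNT(DISTINCT category) > 1
LIMIT
  10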

Generate new tables for JS library results

We had a scratch space table created for ad hoc analysis of the JS library results, but we need something more permanent.

Similar to the Lighthouse tables, generate a new table for JS libraries.

Make it easier to automatically generate CrUX reports for YYYYMM-1 dates

Example: in early June the May CrUX dataset is available under the YYYYMM format of 201805. When we generate the reports in June the YYYYMM value is 201806, whose corresponding dataset is not yet available. So we need a better way to generate CrUX reports for YYYYMM-1 automatically.

Maybe after the last sync_har/sync_csv job completes, we run generate_reports.sh -h YYYYMM as well as something like generate_report.sh -d YYYYMM-1/crux[fp,fcp,dcl,ol].json to generate each CrUX report individually for the previous month.
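
For what it's worth, the YYYYMM-1 arithmetic itself is trivial to express in SQL, e.g. as a sketch:

#standardSQL
-- Resolve the previous month's CrUX release name (e.g. returns 201805 when run in June 2018).
SELECT
  FORMAT_DATE('%Y%m', DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 1 MONTH)) AS crux_yyyymm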

Monthly batch jobs hosted under older user directory

The monthly batch jobs are still hosted in /home/igrigorik/code and the cron runs under that user (though connecting to a different BigQuery user).

Additionally, this is currently not authenticated to GCP:

Looks like it was using your account last, @rviscomi, but I obviously can't re-authenticate that:

igrigorik@worker:~/code$ bq show "httparchive:pages.2021_12_01_desktop"
ERROR: (bq) Your current active account [[email protected]] does not have any valid credentials

We need to fix the BigQuery authentication issue before the January run finishes next month, or it won't process the pipeline nor run the reports, as they are triggered by this cron.

Longer term we should probably also move these out of the /home/igrigorik/code directory and run the cron under a generic account on the server (create an httparchive user?), ideally with an equivalent BigQuery account it can use that won't expire.

Clean up errors

I hate regular errors in log files. They make it too easy to miss real errors and make it difficult for new people to support something, as they don't know whether these are expected errors or something has gone wrong.

Currently a number of SQL queries cannot run, including:

  • CrUX histograms are based on CrUX data rather than the HTTP Archive crawl, so they fail because the data is usually missing - see HTTPArchive/httparchive.org#306
  • Some of the SQL queries do not work for lenses - for example, the Blink Usage queries for the Capabilities report do not have URL-level data to apply lenses to.

There are a few things we could do here:

  • Move reports like CrUX into a separate folder so they can be run separately
  • Fix queries so they do run (e.g. CrUX could be changed to look at the previous month's data, and the Blink Usage reports could add a dummy URL column so at least the query doesn't fail, even if it doesn't return data)
  • Add exclusion functionality so certain reports do not run for certain dates or lenses

Any thoughts?

Deduplicate generate_report.sh and generate_reports.sh

The only difference between the scripts is that _reports iterates through the histograms/timeseries directories and queries each SQL file.

Rewrite _reports so it simply calls _report for each metric, passing through all of the flags. This way all of the query/lens/storage logic is in one script.

Investigate why EOM report generation runs multiple days in a row

Each big blue spike is the BigQuery analysis cost, which coincides with the end of the month when we generate reports for httparchive.org.

image

The most troubling thing to me is not so much the height of the bars (a lot of money) but that they seem to be repeating unnecessarily over consecutive days. Report generation should happen once after the data is ready and subsequent cron jobs should see that it's already been done and stop. That doesn't seem to be happening.

Reports have not generated for January 2022

So the January reports have not run. This happens every so often and I ran them manually, but it's bugged me, and I think I've finally figured it out.

We run the following in the cron:

$ crontab -l
0 15 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh `date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1
0  8 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_csv.sh mobile_`date +\%b_1_\%Y`'  >> /var/log/HAimport.log 2>&1
0 10 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh chrome' >> /var/log/HA-import-har-chrome.log 2>&1
0 11 * * * /bin/bash -l -c 'cd /home/igrigorik/code && ./sync_har.sh android' >> /var/log/HA-import-har-android.log 2>&1

The CSV jobs generate the summary tables and then attempt to run the reports if all the other data is there.
The HAR jobs generate the non-summary tables and then attempt to run the reports if all the other data is there.

So the last job to upload the data should run the reports, because at that point all 4 sets of tables are there.
The other 3 jobs only do the imports and fail on the report generation as not all the tables are there.

Running this shows the completion date of each upload:

	bq show "httparchive:summary_pages.${YYYY_MM_DD}_desktop" | head -5
	bq show "httparchive:summary_pages.${YYYY_MM_DD}_mobile" | head -5
	bq show "httparchive:pages.${YYYY_MM_DD}_desktop" | head -5
	bq show "httparchive:pages.${YYYY_MM_DD}_mobile" | head -5

Which is summarised below:

dataset                                        completed
httparchive:summary_pages.2022_01_01_desktop   19 Jan 01:04:59
httparchive:summary_pages.2022_01_01_mobile    25 Jan 22:16:00
httparchive:pages.2022_01_01_desktop           24 Jan 16:54:34
httparchive:pages.2022_01_01_mobile            25 Jan 07:16:24

The last job to complete was the mobile summary_pages import, so it should have kicked off the reports.

However the logs show this:

Attempting to generate reports...
The BigQuery tables for 2022_01_01_mobile are not available.

This is because the date passed to the sql/generate_reports.sh script is 2022_01_01_mobile instead of 2022_01_01. This is due to a bug in the sync_csv.sh script that sets this to the _date_client (for other reasons in the script).

The net effect is that if the mobile CSV/summary_pages import finishes last, the reports are not generated automatically; if any of the other tables finish last, they are.

Will submit a fix for this, and rerun the reports.

Hopefully this whole hacky script will be rewritten soon, but this is a simple fix for now.

Support manual backfilling with sync_har.sh of non-standard dates

The 12/1 test batch didn't actually run until 12/2, so things got a bit screwed up and 12/1 never appeared in BigQuery. Manually running the sync_har.sh script doesn't work because it only expects the standard [1, 15] dates.

  • support non-standard dates
  • support mapping the BigQuery table name to a standardized date

Context

Add field comparable to firstHtml to the har.request tables

The runs.request tables include a firstHtml field to indicate that the request is for the parent document.

Queries on the har.request tables must join on the corresponding runs table to get this info. There are tens of millions of requests in each table, so the join is expensive.

To simplify queries and make them less expensive, add a boolean field comparable to firstHtml to the har.request tables. It should share the same logic as the runs table: the first 200 response with an HTML mime type.
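
As a hedged sketch, the intended logic expressed as a query over the HAR payload (JSON paths follow the standard HAR entry format; the table name is only illustrative):

#standardSQL
-- Flag the first 200 response with an HTML mime type on each page.
SELECT
  page,
  url,
  ROW_NUMBER() OVER (
    PARTITION BY page
    ORDER BY JSON_EXTRACT_SCALAR(payload, '$.startedDateTime')
  ) = 1 AS firstHtml
FROM
  `httparchive.requests.2018_12_01_mobile`
WHERE
  SAFE_CAST(JSON_EXTRACT_SCALAR(payload, '$.response.status') AS INT64) = 200
  AND JSON_EXTRACT_SCALAR(payload, '$.response.content.mimeType') LIKE '%html%'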

response_bodies.2018_12_01_mobile missing data

The 12/1 mobile table is much smaller and missing a lot of data compared to the previous crawl and the current desktop crawl.

2018_11_15_mobile: 35,441,289 rows, 1.41 TB
2018_12_01_mobile: 18,084,199 rows, 152 GB

2018_11_15_desktop: 45,975,086 rows, 2.00 TB
2018_12_01_desktop: 46,284,186 rows, 2.01 TB

I just reran the 12/1 mobile HAR dataflow pipeline and it produced identical results.

cc @jeffposnick

Truncate request_bodies at 10 MB

The row limit seems to be 10 MB, not 2 MB per this error message: Row size is larger than: 10485760

If that's the case, we can raise the ceiling on request (response?) bodies rows.
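
A hedged way to sanity-check this before changing the pipeline is to look at how close the stored bodies currently come to the ceilings (column names assumed from the response_bodies tables; the date is just an example):

#standardSQL
-- Largest stored body and how many rows sit at or above the old 2 MB mark.
SELECT
  MAX(BYTE_LENGTH(body)) AS largest_body_bytes,
  COUNTIF(BYTE_LENGTH(body) >= 2 * 1024 * 1024) AS at_or_over_2mb
FROM
  `httparchive.response_bodies.2018_12_01_desktop`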

Mobile HAR pipeline failing

        [...]
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at com.httparchive.dataflow.BigQueryImport$DataExtractorFn.processElement(BigQueryImport.java:222)

https://github.com/HTTPArchive/bigquery/blob/master/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java#L222

This code processes LH results. I think in the latest LH release there were some changes to the JSON LHR, so I'll have to update this.

(PS: Yay for Stackdriver error notifications!)

`httparchive.urls.*` tables schema change

The schema of the httparchive.urls.* tables seems to have changed from…

I used to be able to quickly get historical ranks by querying httparchive.urls.* and extracting the date as the _TABLE_SUFFIX, but this is now no longer possible. Was this announced anywhere? If so, I missed it and also can't find it now.
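
For context, the kind of query that used to work looked roughly like the sketch below (the rank and url column names are assumed from the description above):

#standardSQL
-- Historical ranks for one URL, with the crawl date taken from the table suffix.
SELECT
  _TABLE_SUFFIX AS date,
  rank,
  url
FROM
  `httparchive.urls.*`
WHERE
  url = 'https://www.example.com/'
ORDER BY
  date DESC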

Pages table has two entries for each URL

The httparchive:runs.2016_09_15_pages table seems to have two entries for each page:

SELECT rank, url FROM [httparchive:runs.2016_09_15_pages] 
WHERE rank > 0
ORDER BY rank ASC
LIMIT 10
>>
1   1   http://www.google.com/   
2   1   http://www.google.com/   
3   2   http://www.youtube.com/      
4   2   http://www.youtube.com/      
5   3   http://www.facebook.com/     
6   3   http://www.facebook.com/     
7   4   http://www.baidu.com/    
8   4   http://www.baidu.com/    
9   5   http://www.yahoo.com/    
10  5   http://www.yahoo.com/

SELECT count(rank) FROM [httparchive:runs.2016_09_15_pages] 
>> 980874
SELECT count(DISTINCT rank) FROM [httparchive:runs.2016_09_15_pages] 
>> 471366

If this is intentional, it would be nice to note it in http://httparchive.org/about.php#testchanges
PS: Apologies if this is not the right place to file the issue.
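
Until the root cause is fixed, a hedged workaround for analysis is to keep one row per URL, e.g. (written in standard SQL rather than the legacy syntax above):

#standardSQL
-- Keep a single row per URL from the duplicated pages table.
SELECT
  rank,
  url
FROM (
  SELECT
    rank,
    url,
    ROW_NUMBER() OVER (PARTITION BY url ORDER BY rank) AS rn
  FROM
    `httparchive.runs.2016_09_15_pages`)
WHERE
  rn = 1
ORDER BY
  rank
LIMIT
  10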

BigQuery import is paused due to a DataFlow SDK issue

@igrigorik said:

@RByers there was a regression in latest DataFlow SDK so I paused the automated imports... I can work around it though if its high priority.

We're beginning to rely on this data more and more in blink API owner discussions (e.g. this particular case came up in the context of this intent to remove). So having the data flowing regularly is definitely valuable. In many cases the Feb data is good enough though, so I can't really say it's that urgent (yet).

Apr_15_2018 dir on GCS doesn't exist

The CSV pipeline failed because gs://httparchive/Apr_15_2018/ doesn't exist. This is created at the end of the crawl which should have happened by now since May 1 just started. Still investigating.

cc @pmeenan

Incomplete page resources in httparchive.response_bodies.* tables

The site Wired registers a service worker located at https://www.wired.com/sw.js and references a web app manifest located at https://www.wired.com/manifest.json. While the service worker is available in several runs, the manifest is not:

SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  url = "https://www.wired.com/manifest.json"
  OR url = "https://www.wired.com/sw.js"

Web app manifests are generally indexed, as can be seen in this quick test:

SELECT
  url,
  body
FROM
  `httparchive.response_bodies.*`
WHERE
  url LIKE "%/manifest.json"
  AND body LIKE "%\"short_name\":%"
LIMIT
  10

I am not sure what causes the manifest not to be included, but I think Wired's manifest should be indexed as well; the HTML referencing it definitely was indexed:

SELECT
  page,
  url,
  body
FROM
  `httparchive.response_bodies.*`
WHERE
  page = "http://www.wired.com/"
  AND (url = "http://www.wired.com/"
    OR url = "https://www.wired.com/")
  AND body LIKE "%manifest.json%"

(CC: @rviscomi and @jeffposnick)

Too few summary_requests.2019_07_01_desktop rows

In the most recent 2019_07_01 crawl, which we will use for the Almanac, I'm only seeing 240,411,901 rows in the desktop summary_requests table. For reference, the corresponding requests table has 420,510,876 rows. The corresponding mobile tables are much more aligned: 468,544,640 vs 463,862,666.

I'll try rerunning the CSV sync to ensure the Almanac queries are accurate.

image

Clean up processed data after each crawl

Downloading and processing CSV files eats up a lot of disk space and can cause the pipeline to stall if it runs out of space.

We need to clean up the CSVs and processed data when we're done generating the BigQuery tables.

CommandException: Invalid command "application/json".

While running generateReports.sh, I noticed the following error at the end of each metric being generated:

CommandException: Invalid command "application/json".

Ensure that the JSON files are being uploaded to GCS correctly.

Upgrade to the latest Apache Beam SDK version to prevent job disruption

Upgrade to the latest Apache Beam SDK version or add your project to an “allow” list to ensure continuity of current workflow.

Hello Rick,

This is a reminder that we will soon discontinue support for the JSON-RPC protocol and Global HTTP Batch and, as a result, will decommission the following SDK versions on March 31, 2020:

  • Apache Beam SDK for Java, versions 2.4.0 and below (inclusive)
  • Apache Beam SDK for Python, versions 2.4.0 and below (inclusive)
  • Cloud Dataflow SDK for Java, versions 2.4.0 and below (inclusive)
  • Cloud Dataflow SDK for Python, 2.4.0 and below (inclusive)

Timeline for decommissioning:

  1. January 31, 2020 - Deadline to add a project(s) to the “allow” list (see instructions below).
  2. February 2020 - Jobs using the SDKs listed above will start to fail unless added to the allow list. Jobs that have been upgraded to the latest Apache Beam SDK version will not be affected.
  3. March 31, 2020 - Any job still running on the SDKs listed above will fail, even if the project was added to the allow list.

What do I need to know?

Jobs will start failing in February 2020 as a way to notify all users of the requirement to upgrade/migrate affected pipelines to supported SDKs before March 31, 2020. Adding your project to the allow list lets us know that you got the message, and are working on migrating your projects before the March deadline. After March 31, 2020, any job still running on Apache Beam or Cloud Dataflow SDK versions 2.4.0 or earlier will fail.
Your projects listed below will be affected by this change:

  • HTTP Archive (httparchive)

What do I need to do?

  1. To exempt jobs running affected SDKs from failure between February and March 2020, request that the project ID(s) be added to the “allow” list. Requests must be submitted by January 31, 2020.
    If you have a technical account manager (TAM) or a strategic cloud engineer (SCE), contact them directly to have your project(s) added to the allow list. Include the project ID(s) for the job(s) to be exempted.
    If you do not have a TAM or SCE, reply to this email to request a project be added to the allow list. Include the project ID(s) for the job(s) to be exempted.

  2. Migrate your affected jobs to the latest Apache Beam SDK version by March 31, 2020.
    If you have any questions or require assistance, please reply to this email to contact Google Cloud Support.

Thanks for choosing Apache Beam and Cloud Dataflow.

—The Google Apache Beam and Cloud Dataflow Teams

Cookie values missing from recent runs

It seems that starting with the September 2019 run, cookies were stripped from the WebPageTest results due to a change in the required net-log command line arguments, resulting in a message such as "[x bytes were stripped]" being stored in place of ~85% of all cookie values.

The command line argument used in wptagent was updated by @pmeenan. I'm just opening this issue as a reminder to check the cookie data in next month's data.

Create and maintain a 10k-row subset table

Suggested in the HTTP Archive Slack channel:

Was wondering if it makes sense to add a "sample" dataset that contains data for the first ~1000 pages. This way you can easily test out a query on httparchive.latest.response_bodies_desktop using something smaller like httparchive.sample.response_bodies_desktop. I manually create sample datasets for the same reason when working with the larger tables.

having an official 10K subset would make this process cheaper for non-Google folks, and would make it feasible to create an occasional query without hitting the free plan limits

Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.
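
A hedged sketch of what materializing one of these could look like (the sample dataset name and the plain LIMIT are illustrative, not an agreed design):

#standardSQL
-- Build a small sample table from the corresponding "latest" table.
CREATE OR REPLACE TABLE `httparchive.sample.response_bodies_mobile` AS
SELECT
  *
FROM
  `httparchive.latest.response_bodies_mobile`
LIMIT
  10000

Note that a plain LIMIT still scans the full source table when the sample is built; sampling a fixed set of pages instead would keep the refresh cheap and make the subset consistent across tables.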

Ensure 100% table coverage in BigQuery

https://discuss.httparchive.org/t/missing-2016-02-15-chrome-requests/1310 is a bug report that some 2016_02_15 tables are missing.

We should take inventory of all tables across all dates and reprocess anything that's missing.

This can be a good first bug for first-time contributors. Overview of the expected workflow (a query-based variant of the inventory step is sketched after the list):

  • use the bq command line interface to list the contents of each dataset
  • export results to a spreadsheet
    • graph the results to make it obvious if there are any gaps
  • or write a script to check if any YYYY_MM_[01, 15] tables are missing
    • some early tables are not necessarily DD=[01, 15]
  • ignore tables that are expected to be missing, eg lighthouse.YYYY_MM_DD_desktop, or others missing as a result of known data loss bugs (citation needed)
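
As a hedged sketch of the query-based variant, the __TABLES_SUMMARY__ meta-table (the same one the scheduled "latest" queries rely on) can list what exists per dataset:

#standardSQL
-- List the crawl dates present in one dataset so gaps stand out; repeat per dataset.
SELECT
  SUBSTR(table_id, 0, 10) AS date,
  ARRAY_AGG(table_id ORDER BY table_id) AS tables
FROM
  `httparchive.requests.__TABLES_SUMMARY__`
GROUP BY
  date
ORDER BY
  date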

Reduce size of Lighthouse payload

The latest lighthouse.2018_10_15 table is 237 GB. Querying all lighthouse tables currently processes 4.15 TB and takes several minutes to run.

image

  1. identify parts of the JSON payload that are unnecessary or unlikely to have analytical value and are also significant contributors to the payload size (see the sketch after this list)
  2. modify the Dataflow pipeline to omit these parts of the payload
  3. profit
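
For step 1, a hedged starting point is to surface the heaviest reports and inspect their JSON by hand (this assumes the lighthouse tables expose the raw JSON in a report column):

#standardSQL
-- Find the largest Lighthouse reports in one table.
SELECT
  url,
  ROUND(BYTE_LENGTH(report) / POW(1024, 2), 1) AS report_mb
FROM
  `httparchive.lighthouse.2018_10_15_mobile`
ORDER BY
  report_mb DESC
LIMIT
  20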

Add HTTP Archive to the public datasets in BigQuery

The new experimental UI of BigQuery doesn't seem to allow adding external sources, unless they're part of either the organisation or the catalogue of public datasets.

While, for now, it's possible to switch to the old/stable UI and so use httparchive from BigQuery, we should look into adding HTTPArchive to public datasets to make it accessible in the future too.

Update the "latest" tables from Dataflow

Forked from #76

Currently we use scheduled queries to scan each dataset/client combo for the latest release and save that to its respective latest.<dataset>_<client> table.

For example, here's the scheduled query that generates the latest.response_bodies_mobile table:

#standardSQL
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  ENDS_WITH(_TABLE_SUFFIX, 'mobile') AND
  SUBSTR(_TABLE_SUFFIX, 0, 10) = (
  SELECT
    SUBSTR(table_id, 0, 10) AS date
  FROM
    `httparchive.response_bodies.__TABLES_SUMMARY__`
  ORDER BY
    table_id DESC
  LIMIT
    1)

BigQuery usually has some heuristics to help minimize the number of bytes processed by a query if the WHERE clause clearly limits the _TABLE_SUFFIX pseudocolumn to a particular table. But I'm not sure that's happening here, because the estimated cost of this query is over $1,000 (200 TB): "This query will process 202.9 TB when run."

Queries for each dataset/client combo are scheduled to run on the first couple of days of every month. They become more expensive over time as we add new tables to every dataset.

A much more efficient approach would be to overwrite the latest.* tables in the Dataflow pipeline when we create the tables for each release. Rather than updating the deprecated Java pipeline, add this as a feature to #79.
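
In the meantime, a hedged sketch of an interim mitigation is to resolve the latest release outside the query and pin _TABLE_SUFFIX to a literal, so only one table is scanned (the date below would be templated in by whatever schedules the query):

#standardSQL
-- Rebuild one "latest" table while scanning only the newest source table.
CREATE OR REPLACE TABLE `httparchive.latest.response_bodies_mobile` AS
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  _TABLE_SUFFIX = '2022_01_01_mobile'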

Investigate 3x increase in response body rows as of 2021_07_01

Now that we've got the first response_bodies data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike

image

Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.

A couple of theories to start on (a quick check for both is sketched after the list):

  • Bisecting the HARs results in some null rows
  • Bisecting the HARs results in some duplicate rows
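
A hedged quick check for both theories at once (column names assumed: page, url, body):

#standardSQL
-- Count surplus page/url rows (duplicates) and null/empty bodies in the new table.
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) - COUNT(DISTINCT CONCAT(page, '|', url)) AS surplus_page_url_rows,
  COUNTIF(body IS NULL OR body = '') AS empty_bodies
FROM
  `httparchive.response_bodies.2021_07_01_mobile`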

Making the HTTP2 query cheaper

We have an HTTP/2 requests graph which does a lookup on the $._protocol field in the requests.payload column. This currently scans 211 TB at an estimated cost of $1,058 (yes, one thousand bucks!) and counting, and is re-run every month, which is quite frankly ridiculous. It also takes forever to run and sometimes times out.

I wanted to add an HTTP/3 graph since it's getting out there, but can't justify doubling that cost! While our generous benefactor may be able to absorb that, others can't, and I think we should be setting a better example here.

If we use the summary_requests table and use the reqHttpVersion or respHttpVersion fields (or both!) then the cost plummets to 363 GB, or an estimated $1.77! And the data looks pretty similar (not exactly the same, as requests and summary_requests have slightly different row counts, but close enough).

However, there is an issue: these fields had bad data for a long time (see the relevant WPT issue) and this was only fixed from October 2020. I would prefer to track the growth for longer than that, ideally back to 2015 when HTTP/2 was launched.

So we've a few choices:

  1. Fix up the bad data. Ideally we'd join requests to summary_requests and update the bad reqHttpVersion or respHttpVersion values from the $._protocol field, but I can't figure out how to do that.
  2. Patch the bad data by saying ori:, us:, od:, me: or : / values are effectively HTTP/2. This isn't always the case and there are a small number of HTTP/1.1 connections which give those values, but it's close enough and a lot easier to run this clean-up than option 1 (unless there is a way to join these two tables I'm not seeing?).
  3. Use a hacky SQL query (see below) to patch it at query time instead. Seems a bit of a hack.
  4. Add the protocol column to the summary_requests table and backfill all the old values. Seems like quite an effort.
  5. Wait until we reorganise the tables like we've talked about.
  6. Leave as is and just implement the HTTP/3 query in a cheaper manner.

Thoughts?

#standardSQL
SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
  UNIX_DATE(CAST(REPLACE(SUBSTR(_TABLE_SUFFIX, 0, 10), '_', '-') AS DATE)) * 1000 * 60 * 60 * 24 AS timestamp,
  IF(ENDS_WITH(_TABLE_SUFFIX, 'desktop'), 'desktop', 'mobile') AS client,
  ROUND(SUM(IF(respHttpVersion = 'HTTP/2'
               OR respHttpVersion = 'ori' -- bad value that mostly means HTTP/2 (parsed incorrectly from :authority:)
               OR respHttpVersion = 'us:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :status:)
               OR respHttpVersion = 'od:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :method:)
               OR respHttpVersion = 'me:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :scheme:)
               OR respHttpVersion = ': /' -- bad value that mostly means HTTP/2 (parsed incorrectly from :path:)
               OR reqHttpVersion = 'HTTP/2'
               OR reqHttpVersion = 'ori' -- bad value that mostly means HTTP/2 (parsed incorrectly from :authority:)
               OR reqHttpVersion = 'us:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :status:)
               OR reqHttpVersion = 'od:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :method:)
               OR reqHttpVersion = 'me:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :scheme:)
               OR reqHttpVersion = ': /' -- bad value that mostly means HTTP/2 (parsed incorrectly from :path:)
             , 1, 0)) * 100 / COUNT(0), 2) AS percent
FROM
  `httparchive.summary_requests.*`
GROUP BY
  date,
  timestamp,
  client
ORDER BY
  date DESC,
  client

Here's the comparison of what that comes back with compared to the current production site:

date timestamp client percent curr_pct diff
2021_05_01 1.6198E+12 desktop 64.55 64.8 0.25
2021_05_01 1.6198E+12 mobile 64.96 65.3 0.34
2021_04_01 1.6172E+12 desktop 68.46 68.6 0.14
2021_04_01 1.6172E+12 mobile 67.47 67.6 0.13
2021_03_01 1.6146E+12 desktop 68.5 68.6 0.1
2021_03_01 1.6146E+12 mobile 68.15 68.3 0.15
2021_02_01 1.6121E+12 desktop 68.15 68.3 0.15
2021_02_01 1.6121E+12 mobile 68.05 68.2 0.15
2021_01_01 1.6095E+12 desktop 67.19 67.3 0.11
        67.5 67.5
2020_12_01 1.6068E+12 desktop 66.75 66.9 0.15
2020_12_01 1.6068E+12 mobile 67.11 67.3 0.19
2020_11_01 1.6042E+12 desktop 65.95 66.1 0.15
2020_11_01 1.6042E+12 mobile 66.24 66.4 0.16
2020_10_01 1.6015E+12 desktop 65.57 65.7 0.13
2020_10_01 1.6015E+12 mobile 65.46 65.6 0.14
2020_09_01 1.5989E+12 desktop 63.52 64.8 1.28
2020_09_01 1.5989E+12 mobile 65.61 64.9 -0.71
2020_08_01 1.5962E+12 desktop 62.53 63.7 1.17
2020_08_01 1.5962E+12 mobile 65.09 63.8 -1.29
2020_07_01 1.5936E+12 desktop 62.23 64.2 1.97
2020_07_01 1.5936E+12 mobile 64.43 64.2 -0.23
2020_06_01 1.591E+12 desktop 61.46 64.4 2.94
2020_06_01 1.591E+12 mobile 62.34 64.5 2.16
2020_05_01 1.5883E+12 desktop 60.63 63.4 2.77
2020_05_01 1.5883E+12 mobile 61.79 63.8 2.01
2020_04_01 1.5857E+12 desktop 59.6 62.2 2.6
2020_04_01 1.5857E+12 mobile 60.6 62.4 1.8
2020_03_01 1.583E+12 desktop 59.79 62.3 2.51
2020_03_01 1.583E+12 mobile 60.68 62.5 1.82
2020_02_01 1.5805E+12 desktop 60.32 63.5 3.18
2020_02_01 1.5805E+12 mobile 60.91 63.1 2.19
2020_01_01 1.5778E+12 desktop 55.1 59.2 4.1
2020_01_01 1.5778E+12 mobile 55.11 59.3 4.19
2019_12_01 1.5752E+12 desktop 54.37 58.9 4.53
2019_12_01 1.5752E+12 mobile 54.27 58.9 4.63
2019_11_01 1.5726E+12 desktop 47.22 58 10.78
2019_11_01 1.5726E+12 mobile 53.51 58.2 4.69
2019_10_01 1.5699E+12 desktop 52.55 57.1 4.55
2019_10_01 1.5699E+12 mobile 52.43 56.9 4.47
2019_09_01 1.5673E+12 desktop 51.8 56.2 4.4
2019_09_01 1.5673E+12 mobile 53.47 56 2.53
2019_08_01 1.5646E+12 desktop 51.4 55.7 4.3
2019_08_01 1.5646E+12 mobile 55.16 55.5 0.34
2019_07_01 1.5619E+12 desktop 51.81 54.9 3.09
2019_07_01 1.5619E+12 mobile 54.53 54.8 0.27
2019_06_01 1.5593E+12 desktop 50.83 53.8 2.97
2019_06_01 1.5593E+12 mobile 50.21 53.3 3.09
2019_05_01 1.5567E+12 desktop 48.16 53.1 4.94
2019_05_01 1.5567E+12 mobile 47.38 52.6 5.22
2019_04_01 1.5541E+12 desktop 45.57 52.3 6.73
2019_04_01 1.5541E+12 mobile 44.19 52 7.81
2019_03_01 1.5514E+12 desktop 48.49 50.6 2.11
2019_03_01 1.5514E+12 mobile 47.34 50.7 3.36
2019_02_01 1.549E+12 desktop 49.63 49.7 0.07
2019_02_01 1.549E+12 mobile 49.79 49.8 0.01
        48.3 48.3
        48.3 48.3
2018_12_15 1.5448E+12 desktop 32.8 47.8 15
2018_12_15 1.5448E+12 mobile 36.73 48.9 12.17
        49.1 49.1
        48.8 48.8
2018_11_15 1.5422E+12 desktop 46.92 48.4 1.48
2018_11_15 1.5422E+12 mobile 46.87 48.4 1.53
2018_11_01 1.541E+12 desktop 46.27 47.8 1.53
        47.5 47.5
2018_10_15 1.5396E+12 desktop 45.53 46.5 0.97
2018_10_15 1.5396E+12 mobile 45.13 46.2 1.07
2018_10_01 1.5384E+12 desktop 45.89 46 0.11
2018_10_01 1.5384E+12 mobile 45.53 45.5 -0.03
2018_09_15 1.537E+12 desktop 45.66 45.8 0.14
2018_09_15 1.537E+12 mobile 45.19 45.2 0.01
2018_09_01 1.5358E+12 desktop 44.83 45 0.17
2018_09_01 1.5358E+12 mobile 44.6 44.6 0
2018_08_15 1.5343E+12 desktop 44.65 44.8 0.15
        44.9 44.9
2018_08_01 1.5331E+12 desktop 44.26 44.4 0.14
2018_08_01 1.5331E+12 mobile 44.61 44.6 -0.01
2018_07_15 1.5316E+12 desktop 43.77 44 0.23
2018_07_15 1.5316E+12 mobile 44.3 44.3 0
2018_07_01 1.5304E+12 desktop 43.42 43.6 0.18
2018_07_01 1.5304E+12 mobile 41.37 41.6 0.23
2018_06_15 1.529E+12 desktop 38.59 38.8 0.21
2018_06_15 1.529E+12 mobile 40.36 40.6 0.24
2018_06_01 1.5278E+12 desktop 38.17 38.2 0.03
2018_06_01 1.5278E+12 mobile 39.9 40.1 0.2
2018_05_15 1.5263E+12 desktop 38.16 38.3 0.14
2018_05_15 1.5263E+12 mobile 39.56 39.7 0.14
2018_05_01 1.5251E+12 desktop 37.94 38 0.06
2018_05_01 1.5251E+12 mobile 39.21 39.4 0.19
2018_04_15 1.5238E+12 desktop 37.59 37.6 0.01
2018_04_15 1.5238E+12 mobile 39.16 39.4 0.24
        37.1 37.1
        38.7 38.7
2018_03_15 1.5211E+12 desktop 36.67 36.8 0.13
2018_03_15 1.5211E+12 mobile 37.82 38 0.18
2018_03_01 1.5199E+12 desktop 35.9 35.9 0
2018_03_01 1.5199E+12 mobile 37.1 37.3 0.2
2018_02_15 1.5187E+12 desktop 35.46 35.5 0.04
2018_02_15 1.5187E+12 mobile 36.39 36.5 0.11
2018_02_01 1.5174E+12 desktop 35.23 35.3 0.07
2018_02_01 1.5174E+12 mobile 35.98 36.1 0.12
2018_01_15 1.516E+12 desktop 33.9 34 0.1
2018_01_15 1.516E+12 mobile 34.69 34.8 0.11
2018_01_01 1.5148E+12 desktop 33.3 33.7 0.4
2018_01_01 1.5148E+12 mobile 34.3 34.7 0.4
2017_12_15 1.5133E+12 desktop 33 33.4 0.4
2017_12_15 1.5133E+12 mobile 34.03 34.4 0.37
2017_12_01 1.5121E+12 desktop 31.92 32.4 0.48
2017_12_01 1.5121E+12 mobile 32.58 33.1 0.52
2017_11_15 1.5107E+12 desktop 31.39 31.8 0.41
        32.6 32.6
2017_11_01 1.5095E+12 desktop 31.11 31.5 0.39
2017_11_01 1.5095E+12 mobile 31.76 32.4 0.64
2017_10_15 1.508E+12 desktop 30.19 30.6 0.41
2017_10_15 1.508E+12 mobile 31.06 31.5 0.44
2017_10_01 1.5068E+12 desktop 29.89 30.2 0.31
2017_10_01 1.5068E+12 mobile 30.54 31.1 0.56
2017_09_15 1.5054E+12 desktop 28.88 29.2 0.32
2017_09_15 1.5054E+12 mobile 29.43 30 0.57
2017_09_01 1.5042E+12 desktop 28.21 0 -28.21
2017_09_01 1.5042E+12 mobile 29 0.1 -28.9
2017_08_15 1.5028E+12 desktop 27.25 0 -27.25
2017_08_15 1.5028E+12 mobile 28.07 0 -28.07
2017_08_01 1.5015E+12 desktop 26.76 0 -26.76
2017_08_01 1.5015E+12 mobile 27.41 0 -27.41
2017_07_15 1.5001E+12 desktop 26.63 26.5 -0.13
2017_07_15 1.5001E+12 mobile 27.02 27.1 0.08
2017_07_01 1.4989E+12 desktop 26.14 26 -0.14
2017_07_01 1.4989E+12 mobile 26.44 26.5 0.06
2017_06_15 1.4975E+12 desktop 25.29 25.2 -0.09
2017_06_15 1.4975E+12 mobile 25.88 26 0.12
2017_06_01 1.4963E+12 desktop 25.05 25 -0.05
2017_06_01 1.4963E+12 mobile 25.47 25.7 0.23
2017_05_15 1.4948E+12 desktop 25.02 24.9 -0.12
2017_05_15 1.4948E+12 mobile 25.29 25.5 0.21
2017_05_01 1.4936E+12 desktop 24.87 23.9 -0.97
2017_05_01 1.4936E+12 mobile 24.49 23.8 -0.69
2017_04_15 1.4922E+12 desktop 25.12 24.9 -0.22
2017_04_15 1.4922E+12 mobile 25.41 25.2 -0.21
2017_04_01 1.491E+12 desktop 24.55 24.7 0.15
2017_04_01 1.491E+12 mobile 24.69 24.9 0.21
2017_03_15 1.4895E+12 desktop 23.78 24 0.22
2017_03_15 1.4895E+12 mobile 23.69 23.9 0.21
2017_03_01 1.4883E+12 desktop 23.4 23.4 0
2017_03_01 1.4883E+12 mobile 23.3 23.4 0.1
2017_02_15 1.4871E+12 desktop 23.07 23.1 0.03
2017_02_15 1.4871E+12 mobile 22.91 23.1 0.19
2017_02_01 1.4859E+12 desktop 22.74 22.8 0.06
2017_02_01 1.4859E+12 mobile 22.85 22.9 0.05
        22 22
2017_01_15 1.4844E+12 mobile 22 22 0
        21.3 21.3
2017_01_01 1.4832E+12 mobile 21.58 21.6 0.02
2016_12_15 1.4818E+12 desktop 19.68 20.9 1.22
        21.3 21.3
        20.7 20.7
        21.2 21.2
2016_11_15 1.4792E+12 desktop 20.54 20.3 -0.24
2016_11_15 1.4792E+12 mobile 20.55 20.6 0.05
2016_11_01 1.478E+12 desktop 20.25 20.3 0.05
2016_11_01 1.478E+12 mobile 19.91 20 0.09
2016_10_15 1.4765E+12 desktop 18.66 18.6 -0.06
2016_10_15 1.4765E+12 mobile 19.37 19.7 0.33
2016_10_01 1.4753E+12 desktop 18.5 18.7 0.2
2016_10_01 1.4753E+12 mobile 19.32 19.5 0.18
2016_09_15 1.4739E+12 desktop 17.11 17.4 0.29
2016_09_15 1.4739E+12 mobile 17.29 17.5 0.21
2016_09_01 1.4727E+12 desktop 16.45 16.5 0.05
2016_09_01 1.4727E+12 mobile 16.66 16.5 -0.16
2016_08_15 1.4712E+12 desktop 16.49 16.5 0.01
2016_08_15 1.4712E+12 mobile 16.4 16.4 0
2016_08_01 1.47E+12 desktop 16.36 16.4 0.04
        16.2 16.2
2016_07_15 1.4685E+12 desktop 15.9 0 -15.9
        0 0
2016_07_01 1.4673E+12 desktop 15.47 0 -15.47
        0 0
2016_06_15 1.4659E+12 desktop 15.16 0 -15.16
        0 0
2016_06_01 1.4647E+12 desktop 13.72 0 -13.72
        0 0
2016_05_15 1.4633E+12 desktop 13.15 0 -13.15
        0 0
2016_05_01 1.4621E+12 desktop 0 0 0
        0 0
2016_04_15 1.4607E+12 desktop 0 0 0
        0 0
2016_04_01 1.4595E+12 desktop 0 0 0
        0 0
2016_03_15 1.458E+12 desktop 0 0 0
        0 0
2016_03_01 1.4568E+12 desktop 0 0 0
2016_03_01 1.4568E+12 mobile 0 0 0
2016_02_15 1.4555E+12 desktop 0 0 0
2016_02_15 1.4555E+12 mobile 0 0 0
2016_02_01 1.4543E+12 desktop 0 0 0
2016_02_01 1.4543E+12 mobile 0 0 0
2016_01_15 1.4528E+12 desktop 0 0 0
2016_01_15 1.4528E+12 mobile 0 0 0
2016_01_01 1.4516E+12 desktop 0 0 0
2016_01_01 1.4516E+12 mobile 0 0 0
