httparchive / almanac.httparchive.org


HTTP Archive's annual "State of the Web" report made by the web community

Home Page: https://almanac.httparchive.org

License: Apache License 2.0

Python 2.72% CSS 2.91% HTML 87.45% Shell 1.03% JavaScript 5.78% Dockerfile 0.04% Batchfile 0.08%
web-almanac http-archive bigquery

almanac.httparchive.org's Introduction

The HTTP Archive tracks how the Web is built

!! Important: This repository is deprecated. Please see HTTPArchive/httparchive.org for the latest development !!

This repo contains the source code powering the HTTP Archive data collection.

What is the HTTP Archive?

Successful societies and institutions recognize the need to record their history; this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996, Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result, he founded the Internet Archive, which collects and permanently stores the Web's digitized content.

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

almanac.httparchive.org's People

Contributors

bazzadp, bkardell, borisschapira, c-torres, catalinred, chefleo, denar90, dependabot[bot], foxdavidj, github-actions[bot], hakacode, j9t, jmperez, kevinfarrugia, ksakae1216, lex111, max-ostapenko, mikegeyser, msakamaki, patrickhulce, paulcalvano, rviscomi, saptaks, shantsis, strangernr7, tiggerito, tomvangoethem, tunetheweb, victorlep, ymschaap

almanac.httparchive.org's Issues

Build the development environment to maintain the Almanac

AI (@HTTPArchive/developers): Design the tech stack for the Almanac.

The Almanac user experience will be entirely static and stateless, so a solution as simple as GitHub Pages could work. I think we want a bit more control over the backend (response headers, SSR templates) so I'm leaning towards a similar setup as https://github.com/HTTPArchive/httparchive.org in which we build on App Engine. Thoughts?

TO DO:

  • Create Python App Engine development environment with Flask
  • Create an initial base template, styles
  • Extend the base template to create a temporary splash page to be deployed on almanac.httparchive.org while the project is under construction
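
A minimal sketch of what the Flask entry point could look like for this development environment, assuming the conventional App Engine standard-environment layout (a main.py next to app.yaml); the template name and route are illustrative placeholders, not the repo's actual code.

# main.py -- minimal Flask app, deployable to the App Engine standard environment.
# Hypothetical sketch: template names and routes are placeholders.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def splash():
    # Temporary "coming soon" page while the Almanac is under construction.
    return render_template('index.html')

if __name__ == '__main__':
    # Local development server; App Engine provides its own entry point in production.
    app.run(host='127.0.0.1', port=8080, debug=True)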

Finalize assignments: Chapter 11. PWA

Section Chapter Authors Reviewers
II. User Experience 11. PWA @tomayac @jeffposnick @HyperPress @ahmadawais

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • % of pages with SW installs
  • Manifest
  • Stats on different service worker events
  • Stats on different web app manifest properties
  • Workbox adoption/usage
  • beforeinstallprompt usage

👉 AI (coauthors): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of PWAs" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the PWA landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Create an almanac_sample dataset

@HTTPArchive/data-analysts

It'd be helpful to have a sample dataset of ~1000 pages to try out queries and play with the data to see what types of metrics are possible.

The dataset should contain tables for each of the different data types:

httparchive.almanac_sample.

  • blink_features
  • lighthouse_mobile
  • requests_ [desktop, mobile]
  • response_bodies_ [desktop, mobile]
  • summary_pages_ [desktop, mobile]
  • summary_requests_ [desktop, mobile]
  • technologies_ [desktop, mobile]

A random sample of 1000 pages from the most recent 2019_05_01 crawl for both desktop and mobile should do.

@paulcalvano do you have time to set this up?
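
One possible way to build such a sample, sketched with the BigQuery Python client; the table naming follows the convention listed above, but the almanac_sample dataset is assumed to already exist, and the exact source schema and the RAND()-based sampling are assumptions rather than the team's actual procedure.

# Sketch: copy a random ~1000-page sample into the almanac_sample dataset.
# Assumes httparchive.summary_pages.2019_05_01_{desktop,mobile} exist and that a
# simple ORDER BY RAND() LIMIT is acceptable for a one-off sample table.
from google.cloud import bigquery

client = bigquery.Client(project='httparchive')

for client_type in ('desktop', 'mobile'):
    query = f"""
    CREATE OR REPLACE TABLE `httparchive.almanac_sample.summary_pages_{client_type}` AS
    SELECT *
    FROM `httparchive.summary_pages.2019_05_01_{client_type}`
    ORDER BY RAND()
    LIMIT 1000
    """
    client.query(query).result()  # wait for the table to be created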

Finalize assignments: Chapter 13. Ecommerce

Section Chapter Coauthors Reviewers
III. Content Publishing 13. Ecommerce @samdutton @alankent @voltek62 @wizardlyhel

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Top ecommerce platforms:

  • Marketplace: for example, eBay or Etsy.
  • Hosted shop: for example, Shopify.
  • Hosted platform: for example, Magento Commerce.
  • Self-hosted platform: for example, Magento Open Source.
  • Not on a platform or marketplace: sites that show payment activity but don't appear to be on a platform or marketplace.

Stats for sites that appear to be e-commerce sites (as above):

  • Images: quantity, format, byte size, pixel dimensions, etc.
  • Home page HTML size.
  • Performance stats.
  • Third-party content: total weight and number of requests (and performance impact if possible).
  • Analytics providers.
  • Ad providers.
  • Indexability.
  • Qualification (or not) as a PWA.

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of ecommerce" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the ecommerce landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 19. Resource Hints

Section Chapter Authors Reviewers
IV. Content Distribution 19. Resource Hints @khempenius @yoavweiss @andydavies @addyosmani

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

For each resource hint (preload, prefetch, preconnect, prerender):

  • % of sites using $HINT; how this has changed since a year ago.
  • For sites using $HINT, # of times it is used.
  • crossorigin attribute, as attribute, resource priority
  • Preload/Prefetch-only: the resource types that $HINT is used for (e.g. js, document, etc.)
  • Preload only: % of sites that are using as=font/as=fetch without a crossorigin attribute, or that are using any other as value with a crossorigin attribute.
  • Preload only: % of sites where a preload of low priority is done before a load of higher priority and a different as attribute value.

Priority Hints:

  • % of sites using this
    Note: depending on how small the sample size is, the following metrics may not be worth calculating
  • Usage breakdown by tag (i.e. iframe, img, link, or script)
  • Usage breakdown by Importance (i.e. low/high/auto)
  • (Optional) tag x importance (e.g. do scripts tend to be "high" importance? iframes "low" importance? etc.)

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of priority hints" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the priority hints landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 1. JavaScript

Section Chapter Coauthors Reviewers
I. Page Content 1. JavaScript @addyosmani @housseindjirdeh @mathiasbynens @rwaldron @RReverser

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Transfer size/count

    • Distribution of JS bytes
    • Distribution of first party JS bytes vs. third party
    • Number of JS requests
    • Number of first-party JS requests vs. third-party
    • % of gzip-compressed scripts
    • % of brotli-compressed scripts
  • Runtime cost

    • Breakdown of V8 CPU times (if feasible)
  • Library usage

    • Top N JS libraries
    • Notable changes in popularity since last year
    • Top N JS client-side frameworks (React, Vue, etc.)
    • Distribution of JS bytes on site per JavaScript framework
  • Feature adoption

    • % of pages that use <script type=module>
    • % of pages that use <script nomodule>
    • % of pages that use <link rel=preload> for JS resources
    • % of pages that use <link rel=modulepreload>
    • % of pages that use <link rel=prefetch> for JS resources
    • Use of navigator.connection.effectiveType property
    • Estimate adoption of specific JS language features (by looking for the following raw strings in JS response bodies; see the sketch after this list)
      • Atomics
      • Intl
      • Proxy
      • SharedArrayBuffer
      • WeakMap
      • WeakSet
      • dynamic import (by looking for "import(")
  • Other

    • % of sites that ship sourcemaps
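
A rough illustration of the raw-string approach mentioned above, assuming the response_bodies sample tables have (page, url, body) columns; the column names and the naive substring match are assumptions, and this would over-count strings that appear in comments or unrelated contexts.

# Sketch: estimate adoption of JS language features by searching raw response bodies.
# Assumes httparchive.almanac_sample.response_bodies_desktop with (page, url, body) columns.
from google.cloud import bigquery

FEATURES = ['Atomics', 'Intl', 'Proxy', 'SharedArrayBuffer', 'WeakMap', 'WeakSet', 'import(']

client = bigquery.Client(project='httparchive')

for feature in FEATURES:
    query = """
    #standardSQL
    SELECT COUNT(DISTINCT page) AS pages
    FROM `httparchive.almanac_sample.response_bodies_desktop`
    WHERE STRPOS(body, @feature) > 0
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('feature', 'STRING', feature)]
    )
    rows = list(client.query(query, job_config=job_config).result())
    print(feature, rows[0].pages)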

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of JS" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the JS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Add analytics script

This Google Analytics tracking script needs to be added to the document head of the base template for every page.

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-22381566-3"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-22381566-3');
</script>

This issue is blocked on creation of the base template in #25.

Triage all proposed metrics (396 of 396 done)

Assigned: @HTTPArchive/data-analysts team

Due date: No later than July 1

Any metrics that require augmenting the test infrastructure (eg custom metrics) must be ready to go when the July crawl starts. This ensures that when the crawl completes at the end of July, we can query the dataset and pass it off to authors for interpretation in August.

As of now there are 350+ metrics spread over 20 chapters.

Part Chapter Able To Query Not Feasible Grand Total
I 01. JavaScript 24 1 25
I 02. CSS 39 7 46
I 03. Markup 4 1 5
I 04. Media 20 5 25
I 05. Third Parties 13 0 13
I 06. Fonts 40 7 47
II 07. Performance 24 0 24
II 08. Security 36 5 41
II 09. Accessibility 32 6 38
II 10. SEO 15 0 15
II 11. PWA 6 0 6
II 12. Mobile web 19 2 21
III 13. Ecommerce 10 3 13
III 14. CMS 11 1 12
IV 15. Compression 3 1 4
IV 16. Caching 14 1 15
IV 17. CDN 13 3 16
IV 18. Page Weight 3 0 3
IV 19. Resource Hints 10 0 10
IV 20. HTTP/2 14 3 17
Grand Total 350 46 396

I've copied all of the metrics for each chapter to this sheet (named "Metrics Triage"). To edit the sheet please give me your email address to add to the editors list. What we need to do is go through the list of metrics for each chapter and assign a status from one of the following:

  • To Be Reviewed
  • Need More Info
  • Not Feasible
  • Able To Query
  • Custom Metric Required
  • Custom Metric Written
  • Query Written

The lifecycle is:

  • All metrics start as TBR
    • Move to NMI if the metric is vaguely worded or it is otherwise unclear what is being asked for. Get in touch with the chapter author(s) and straighten out what the expected data should look like.
    • Move to NF if the metric cannot be queried using the HTTP Archive dataset or other publicly available datasets on BigQuery (eg CrUX). This is the "done" state for metrics which cannot progress any further.
    • Move to ATQ if the metric is able to be queried from the dataset based on the latest schema
      • Move to QW if the metric has a corresponding query written. This is the ideal "done" state for all metrics.
    • Move to CMR if the metric can only be queried with the addition of a custom metric
      • Move to CMW if the metric has had a corresponding custom metric written. Metrics in this state must also have a corresponding query written and moved to QW when complete.

Custom metrics should only be added as a last resort and must adhere to strict performance requirements. We test on millions of pages so any complex/slow scripts would impede the crawl. Because we anticipate needing many custom metrics, we'll implement everything as individual functions within a single custom metric whose output is a JSON-encoded object with each result as its own sub-property. More on this when we get there.

Add your name in the Analyst column to take responsibility for moving it through the metric lifecycle.

Once we're ready to begin writing queries, we will create a thread on https://discuss.httparchive.org for each chapter, listing all queryable metrics. Hopefully we can crowdsource some of the querying by tapping into the power users on the forum.

Optimize the 2019_07_01 dataset for querying

At the end of July the 2019_07_01 dataset will be available. Here are some ideas to minimize the cost of queries:

  • implement the partitioning/clustering proposal with the 2019_07_01 dataset
  • add a column to response_bodies annotating their file types (eg js, css, html) or join the entire body column with the requests table (see the sketch below)
  • what else?

One principle should be that whatever queries we write for the Almanac should be reproducible against any other monthly dataset. So if we optimize the July dataset we should also apply the same optimizations to all others. This doesn't have to happen until launch.
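
As a rough example of the second idea, a sketch of how a type-annotated response_bodies table could be derived; the join keys and the existence of a type column on summary_requests are assumptions about the schema, not the agreed-upon design, and the partitioning/clustering choices from the linked proposal are omitted here.

# Sketch: derive a response_bodies table annotated with each request's file type.
# The join keys (page, url) and the summary_requests.type column are schema assumptions.
from google.cloud import bigquery

client = bigquery.Client(project='httparchive')

query = """
#standardSQL
CREATE OR REPLACE TABLE `httparchive.almanac.response_bodies_desktop_annotated` AS
SELECT bodies.page, bodies.url, requests.type, bodies.body
FROM `httparchive.response_bodies.2019_07_01_desktop` AS bodies
JOIN `httparchive.summary_requests.2019_07_01_desktop` AS requests
USING (page, url)
"""
client.query(query).result()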

Finalize assignments: Chapter 2. CSS

Section Chapter Coauthors Reviewers
I. Page Content 2. CSS @una @argyleink @meyerweb @huijing

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Usage of popular/new APIs:
  • Custom properties
  • @import and @supports
  • Filters
  • Blend modes
  • Logical properties

Usage of unit types:
  • hsl vs. hsla vs. rgb vs. rgba vs. hex
  • rem vs. em vs. px vs. ex vs. cm, etc.
  • Classes vs. ids

CSS tooling today:
  • Top CSS development tools

Usage of popular CSS libraries:
  • Top CSS libraries

Resets:
  • Top reset utilities

Layout:
  • RTL vs. LTR
  • Flexbox
  • Grid

Media queries:
  • Most popular snap points
  • max-width vs. min-width
  • Ems vs. rems vs. px in media queries
  • How many sites use print media queries

Size of style payload:
  • Number of stylesheets per page
  • Most popular names for stylesheets
  • Minified vs. unminified
  • Number of fonts downloaded
  • Types of fonts downloaded
  • Average size of CSS load per site
  • Average size of images loaded by stylesheets (inline and linked)

Individual files vs. bundled files:
  • Came with the HTML
  • Inserted post page load
  • Async vs. sync
  • Constructible stylesheets
  • Inline styles vs. one stylesheet link

Duplication / etc.:
  • Shorthand vs. longhand properties
  • Number of colors declared per site
  • Number of duplicate colors per those sites
  • Number of fonts declared per site
  • Number of duplicate font family declarations
  • Number of different font size values per site
  • Number of z-indices per site
  • Most popular z-index values (chart)
  • Number of different media query values per site
  • Number of different margins per site
  • Number of transitions used per site
  • Number of @keyframes declared per site
  • Number of [id="foo"]
  • Number of [class*='foo'], [class^='foo'], [class$='foo'], [class~='foo']
  • Number of classes per element
  • Average length of classes

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

The metrics should paint a holistic, data-driven picture of the CSS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Consider users' Accept-Language preference when selecting the default language

When users first visit, it would be a nice experience if they never had to think about language selection.

To do this, we need to check the Accept-Language header sent by the client and route the user to the appropriate language.
(We must be careful not to apply this redirect after a language has been explicitly selected via the switcher.)

This needs to be addressed after #50.
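
A minimal sketch of how this could look in the Flask app, assuming a small list of supported languages and the URL-prefix scheme from the language switcher issue; the names and routes are illustrative.

# Sketch: pick a default language from the Accept-Language header on first visit.
# SUPPORTED_LANGUAGES and the URL scheme are assumptions for illustration.
from flask import Flask, redirect, render_template, request

app = Flask(__name__)
SUPPORTED_LANGUAGES = ['en', 'ja']

@app.route('/')
def root():
    # Only negotiate when no language prefix was given; an explicit choice made
    # via the switcher lands on /<lang>/ directly and skips this redirect.
    best = request.accept_languages.best_match(SUPPORTED_LANGUAGES, default='en')
    if best != 'en':
        return redirect(f'/{best}/')
    return render_template('en/index.html')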

Decide on typography

Is there anyone with design experience, specifically in typography, who could offer some suggestions for Almanac fonts?

Build a language switcher

@HTTPArchive/developers

Somewhere on the page we should have a dropdown field for users to change the language. The expected UX of the field is:

  • the default/selected option is the current language
  • clicking the field will open a switcher with the other supported languages displayed
  • selecting any other language will reload the current page with the language specified in the URL

For example, if on / and the user selects Japanese, the new URL will be /ja/. Similarly, if on /2019/outline the new URL will be /ja/2019/outline. Switching to English will force the /en URL path prefix for simplicity.

In terms of implementation, it should be built into the templates/base.html template. We can look at the {{ lang }} property and set an option as selected if it matches. For accessibility, does this need to have anchor elements or are there equivalent ARIA attributes we can use?
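
One way to back the template: a small helper that rewrites the current path for each supported language, which templates/base.html could loop over to render the options; the language list and the trailing-slash handling are assumptions, not the final design.

# Sketch: build the language-switched URL for the current page.
# Assumes paths like '/', '/ja/', '/2019/outline', '/ja/2019/outline'.
SUPPORTED_LANGUAGES = ['en', 'ja']

def switch_language(path: str, lang: str) -> str:
    parts = [p for p in path.split('/') if p]
    if parts and parts[0] in SUPPORTED_LANGUAGES:
        parts = parts[1:]  # strip the existing language prefix
    return '/' + '/'.join([lang] + parts) + ('/' if not parts else '')

# e.g. switch_language('/', 'ja') -> '/ja/'
#      switch_language('/2019/outline', 'ja') -> '/ja/2019/outline'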

Design questions:

  • is there a common place for this UI to go? header/footer?
  • should we show flags as the options as a way to convey the language in a way not dependent on translation? (need a text-based fallback anyway for accessibility)
  • should the options be translated in the destination language?

Form a team of data analysts

Data analysts are responsible for working with authors to provide HTTP Archive data that match the metrics outlined in each chapter. Analysts should be familiar with the BigQuery dataset and comfortable with SQL. Mentoring is available for anyone who wants to learn!

Our current assignment is to triage all ~250 metrics. New analysts: please see #33 for more info on how to access the metrics sheet and what to do.

Join the team: @HTTPArchive/data-analysts

Finalize assignments: Chapter 12. Mobile web

Section Chapter Authors Reviewers
II. User Experience 12. Mobile web @slightlyoff @OBTo @HyperPress @AymenLoukil

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Tap targets
    • Let's tackle this through [1], with the frequency being how many offending elements are on each page
  • Legible font size. Analyzing this with what Lighthouse deems an acceptable % of legible text is fine
  • Proper font contrast
    • Please tackle this like "Tap targets" above
  • Mobile configuration split - separate mobile and desktop sites, responsive site, dynamically served content.
  • % sites prevent users from scaling the viewport
  • % site with a meta viewport at all
  • % sites containing any CSS breakpoints <= 600px
  • % sites locking display orientation
  • % of sites preventing pasting into password fields
  • % of sites making NO permission requests. Sites should only be making these upon a user interaction like a click.
  • For each of the following, what % of sites make this permission request while loading: Notifications, Geolocation, Camera, Microphone
  • # of links or buttons (ideally any element with a click listener attached) only containing an icon [1]
    • This can be tested by checking whether the button contains only an svg element or a single character (font icons)
  • How well are sites using native features on the web to simplify a user's job:
    • What is the penetration for each of the following input types [1]
      • color, date, datetime-local, email, month, number, range, reset, search, tel, time, url, week, datalist
      • % of sites using ANY of the above input types
    • Penetration for each of the following attributes [1]
      • autocomplete, min or max, pattern, placeholder, required, step
      • % of sites using ANY of the above attributes (besides placeholder and required)
  • For sites which have a document event listener triggering on a scroll (touchstart, wheel, etc), how many are using passive event listeners
  • % of sites that send more JS than the size of the viewport ("Web Bloat Score") per pageload
  • number/fraction of sites specifying a webapp manifest
  • number of sites registering a Service Worker
  • cumulative layout shift

[1] The best way to both analyze and display these pieces of data is through a frequency distribution graph. With this we can find out both how big of an issue this tends to be for the average site and what the global trends are.

👉 AI (@slightlyoff @OBTo): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (@slightlyoff @OBTo): Finalize which metrics you might like to include in an annual "state of mobile web" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the mobile web landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 14. CMS

Section Chapter Author Reviewers
III. Content Publishing 14. CMS @amedina @westonruter @mor10 @sirjonathan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Section | Metric description

  • What are the top CMSs
    -- There are studies and reports classifying CMSes according to market share
    -- The WordPress community commonly cites W3Techs
    -- It would be interesting to validate such claims with HTTPArchive/CrUX data
    -- That is: would the sample space represented by these datasets correlate to the reported market shares elsewhere?

  • AMP adoption: number of WordPress-powered pages using the AMP plugin for WordPress
    -- Version of the plugin
    -- Number of AMP pages using each of the different template modes (reader/classic, transitional/paired, native).
    -- Suggestion: WordPress.com enables AMP by default, so it would be interesting to see how many sites have disabled it in addition to how many self-hosted sites have enabled it. Not sure if this is possible, but it would also help
    -- The AMP plugin for WordPress generates the following meta tag:
    <meta name="generator" content="AMP Plugin v1.1.2; mode=native">

  • Coupled vs. Decoupled CMS use: Headless CMSes
    -- There is a "trend" of using some CMSes in headless mode; it would be interesting to capture the prevalence of such uses
    -- Measuring this is not easily doable, but we would like to keep this metric and analyze it in terms of the metrics obtained for regular (i.e. non-headless) CMS usage

  • Device Distribution
    -- With so much device fragmentation and the impact on performance of using low-end devices, it would be good to know where content powered by different CMSes is being accessed from.
    -- Comparison against non-CMS cases would also shed light on demographics, geography (together with device usage per region)

  • Connection distribution
    -- Connection types

  • HTTPArchive/CrUX Metrics: We should capture a view of the ecosystem in terms of usability metrics
    -- Is it happening? Has the navigation started successfully? Has the server started responding? Metrics: First Paint, TTFB (HTTPArchive/CrUX metrics)
    -- Is it useful? When you've painted text, an image, or content that allows the user to derive value from the experience and engage with it. Metrics: First Contentful Paint, First Meaningful Paint, Speed Index (HTTPArchive/CrUX metrics)
    -- Is it usable? When a user can start meaningfully interacting with the experience and have something happen (e.g. tapping a button). This can be critical, as users can get disappointed if they try using UI that looks ready but isn't. Metrics: Time to Interactive (lab), First CPU Idle, First Input Delay (field)
    -- Is it delightful? Delightfulness is about ensuring the performance of the user experience remains consistent after page load. Can you scroll smoothly without jank? Are animations smooth and running at 60fps? Do other long tasks block any of these from happening?

Check the brainstorming doc to explore ideas.

These metrics would paint a holistic, data-driven picture of the CMS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Generate the Contributors page

The Contributors page will be generated based on a JSON file with all of the contributor metadata. Required info for all contributors:

  • full name
  • list of teams contributed to

Not required but super nice to have info:

  • avatar URL (Gravatar?)
  • short personal tagline ("Engineer at Company", "Web dev thoughtleader", etc)
  • GitHub profile
  • Twitter profile

TODO: @HTTPArchive/developers

  • contributors have verified their information is correct in the Contributor sheet
  • create a src/config/contributors.json file to organize all contributor metadata
  • contributors have submitted a PR to update their metadata as needed
  • generate the contributors.html template based on the JSON metadata
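
A hedged sketch of the generation step, assuming a src/config/contributors.json shaped like the required/optional fields above and a Jinja template named contributors.html; the field names and paths are illustrative, not the final format.

# Sketch: render contributors.html from src/config/contributors.json.
# The JSON field names (name, teams, avatar_url, github, twitter) are assumptions.
import json
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/2019/contributors')
def contributors():
    with open('config/contributors.json') as f:
        people = json.load(f)
    # Sort alphabetically so the template can simply loop over the list.
    people.sort(key=lambda person: person['name'].lower())
    return render_template('2019/contributors.html', contributors=people)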

Finalize assignments: Chapter 4. Media

Section Chapter Coauthors Reviewers
I. Page Content 4. Media @dougsillars @colinbendell @Yonet @ahmadawais @kornelski

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Image formats
    • Lighthouse data on responsiveness, format, quality, lazy loading
    • adoption of newer image formats like WebP
    • SVG
      • Inline versus external sources (from css or otherwise)
      • comments volume v. total bytes
      • SVGO comparison
    • Microdata usage (og:image, twitter:image, etc)
    • Use of <source sizes>
    • Preloader effectiveness (initiator Source: javascript, css, vanilla-html)
    • Fallback image support for legacy devices that don't support <picture> or <srcset>
    • Accept-CH in <meta> vs http
    • Photographic v. illustration score per pixel
    • Bytes per pixel for photographic
    • Use of Vary (Either User-Agent or Accept)
    • A11y: Support for Alt tags
    • TCP/TLS connection time delay (use of preconnect for cross origin hosts)
    • inlined / base64 image content
  • Video formats
    • MP4 sizes, streaming info
    • how many pages are self-serving video (not YouTube)
    • JS player size,
    • container options (mp4;hevc vs. mp4;avc1 vs. webm;vp9 vs. webm;vp8)
    • Use of posters, autoplay, fallback image
    • A11y: Support for description or fig
  • Hero media
    • how many pages include a large "hero" graphic above the fold?
    • Same or different microdata hero images
    • Orientation and pixel volume of hero images
    • Hero video usage
  • Emerging media

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web media" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web image/video landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Generate an ebook

@HTTPArchive/developers I'm curious to hear thoughts from others about this; it might be crazy.

I'd like to see the entire contents of the Almanac on a single web page, formatted similarly to a book. It would also have a print stylesheet to handle things like page breaks and page numbers, so one could print to PDF and it'd just work™️ as a fully formed e-book. It'd also be a PWA that could be added to the home screen and read offline.

There are concerns like lazy loading, history state management, deep linking, etc but I think these are all solvable problems.

I'm excited about this idea because a report on the state of the web should ideally maximize the web's capabilities for a great UX.

WDYT?

Requirements (edit by @mikegeyser):

Structure:

  • Table of contents
  • Page numbers
  • Header/footer in the margins
  • Cover page
  • Methodology section
  • Contributors section

Rendering:

  • Set page metadata correctly (title etc.)
  • Mirror page margins.
  • Solve the problem of urls when printed.
  • Table wrappers aren't rendering properly
  • Problem with 'that z-index figure' in the CSS chapter
  • Weird black chart in the markup chapter
  • Prevent breaking a figure over multiple pages (separating the caption from the image)
  • Figure out what to do with tables that span multiple pages (particularly font chapter)
  • Figure out what to do with charts that are too big for a page
  • Handle internal links: in-chapter references that are repeated (e.g. #fig1, #conclusion).
  • Handle internal links: cross-chapter references.
  • Handle internal links: author links at the top.
  • Author Avatars missing in PDF.
  • Internationalise book name (e.g. ebook-en.pdf)
  • Internationalise CSS content (e.g. title, pg...etc.) - maybe with inline <style> tags in ebook template?
  • ToC page numbers are misaligned in the Japanese PDF.

Tooling:

  • Try to come up with a solution that doesn't need the HTML to be served from Flask
  • Integrate weasyprint into the generate script
  • Make the rendering config more dynamic
  • Currently we have to call weasyprint once per year and language; script this (see the sketch below).

Please feel free to add any more, and we can see if they're feasible. :)
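
For the tooling items above, a minimal sketch of scripting WeasyPrint once per year and language; the ebook route, local server address, and output naming are assumptions, and in practice the HTML could come from a pre-rendered static file rather than a running Flask server.

# Sketch: generate one PDF per year and language with WeasyPrint.
# The ebook URL and the output path/naming are assumptions for illustration.
from weasyprint import HTML

YEARS = ['2019']
LANGUAGES = ['en', 'ja']

for year in YEARS:
    for lang in LANGUAGES:
        # Could equally point at a pre-rendered HTML file instead of the dev server.
        HTML(f'http://127.0.0.1:8080/{lang}/{year}/ebook').write_pdf(
            f'static/pdfs/web_almanac_{year}_{lang}.pdf')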

Relative paths and the readme

Should the repro instructions from the src/readme be in the main readme? If not, would it make sense to change /src/README.md to include instructions relative to the project root?

pip install -r requirements.txt → pip install -r src/requirements.txt
python main.py → python src/main.py

Not sure if this is an issue for many people

Design the home page

  • list all of the information that will need to be included on the home page
  • design the UX of the home page, establishing the identity and style of the Almanac website

I'll get started on the first one to unblock the designer who will work on the second one.

Define and categorize metrics

Refer to the Content Brainstorm doc for the latest draft

  • define a list of high-level "sections" (eg content, experience, distribution, publishing, etc)
  • define a list of mid-level "chapters" (eg for content: JS, CSS, img, etc)
  • define a list of low-level metrics (eg for JS: bytes, bootup time, libraries, etc)

Finalize assignments: Chapter 18. Page weight

Section Chapter Author Reviewers
IV. Content Distribution 18. Page weight @khempenius @henrisgh @tammyeverts @paulcalvano @flowlabs

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Distribution of resource size: (p10, p25, p50, p75, p90) x (total, JS, CSS, HTML, Fonts). In addition, how this has changed over the past year.
  • Distribution of resource quantity: (p10, p25, p50, p75, p90) x (total, JS, CSS, HTML, Fonts). In addition, how this has changed over the past year and since the release of H2.
  • Very Optional: H2's impact on resource quantity.
    • Determine which sites serve the majority of their first-party content using H2. (This % alone would be interesting. I also wonder if there would be a significant difference between using .5 and .9 as the threshold.)
    • For those sites using H2, look at how CSS & JS resource quantities varied before and after H2 adoption.

👉 AI (@khempenius): Finalize which metrics you might like to include in an annual "state of page weights" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

👉 Optional AI (@khempenius): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

The metrics should paint a holistic, data-driven picture of the page weight landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 3. Markup

Section Chapter Author Reviewers
I. Page Content 3. Markup @bkardell @zcorpan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Deprecated elements
  • Popular elements
  • Custom elements ("slang")
  • Attribute usage (stretch goal)
  • count of shadowRoots

👉 AI (@bkardell): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (@bkardell): Finalize which metrics you might like to include in an annual "state of markup" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the markup landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Write queries and add to the repo

When the Analyst team generates queries for each metric, they should create a PR to merge them into the repo. This has two benefits: the PR process provides an opportunity for peer review, and the repo becomes the place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure so readers can see exactly how each result was calculated and fork it for their own analysis.

  • create a new directory system to organize queries (@KJLarson)
  • test queries (analysts)
  • file a PR to merge the queries into their respective directory (analysts)

For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.

Query guidelines:

  • must specify #standardSQL on the first line and use Standard SQL
  • must include a short description of the metric it's analyzing, eg:
# Percentage of requests that are third party requests
# broken down by third party category by resource type.
  • must query the 2019_07_01 dataset (unless otherwise needed)
  • must be reasonably optimized where possible
  • file must be named according to its metric ID, eg 05_03.sql
  • file must be placed in the directory according to its chapter, eg 05_ThirdParties/05_03.sql
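
As an illustration only, here is roughly what testing a guideline-style query from Python against the small almanac sample dataset could look like; the metric shown is a made-up example (not one of the chapter metrics), and the sample table name follows the convention from the sample-dataset issue above.

# Sketch: test a guideline-style query against the small almanac sample dataset
# before running it against the full 2019_07_01 tables. The metric is illustrative
# only; real queries live as .sql files named by metric ID (eg 05_03.sql).
from google.cloud import bigquery

QUERY = """
#standardSQL
# Percentage of requests served over HTTPS (illustrative example metric).
SELECT
  ROUND(COUNTIF(STARTS_WITH(url, 'https://')) * 100 / COUNT(0), 2) AS pct_https
FROM
  `httparchive.almanac_sample.summary_requests_desktop`
"""

client = bigquery.Client(project='httparchive')
for row in client.query(QUERY).result():
    print(row.pct_https)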

Finalize assignments: Chapter 7. Performance

Section Chapter Coauthors Reviewers
II. User Experience 7. Performance @rviscomi @zeman @JMPerez @OBTo @sergeychernyshev

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Field (Chrome UX Report)
    • global distribution of FCP fast/avg/slow
    • global distribution of FID fast/avg/slow
    • % of fast FCP websites
    • % of fast FID websites
    • % of fast FCP+FID websites, per PSI definition
    • % of websites with offline experiences
    • country/region comparison of any/all of the above
    • mobile vs desktop comparison of any/all of the above
    • ECT comparison of any/all of the above
  • Lab (HTTP Archive)
    • Hero times
      • first/last painted hero
      • H1 rendering time
      • Largest Image
      • Largest Background Image
    • Visually Complete
    • First CPU Idle
    • Time To Interactive
    • Blocking CSS requests
    • Blocking JS request
    • Time To First Byte (Backend)
    • Scripting CPU time
    • Layout CPU time
    • Paint CPU time
    • Loading CPU time
    • Lighthouse Performance Score

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web performance" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web perf landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Add ja-JP templates

We should test out the i18n routing by creating a Japanese translation of the "coming soon" home page and serving it from https://almanac.httparchive.org/ja-JP/.

  • translate the coming soon page to Japanese
  • route /ja-JP/ to the Japanese index template (#43)

@MSakamaki can you manually translate the English version and add it to a new ja-JP template directory?

When that's done I can help with the next two items.

cc @HTTPArchive/translators

Finalize assignments: Chapter 10. SEO

Section Chapter Authors Reviewers
II. User Experience 10. SEO @rachellcostello @ymschaap @AVGP @clarkeclark @andylimn @voltek62

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Structured data rich results eligibility (ratings, search, etc.)
  • Lang attribute usage and mistakes (lang='en')
  • <link> rel="amphtml" (AMP)
  • <link> hreflang="en-us" (localisation usage)
  • Breakdown of type of structured data served (ld+json, microformatting, schema.org + what @type)?
  • Indexability - looking at meta tags like <meta> noindex, <link> canonicals.
  • <meta> description + <title> (presence & length)
  • Status codes and whether pages are accessible - 200, 3xx, 4xx, 5xx.
  • Content - looking at word count, thin pages, header usage, alt attributes on images
  • Linking - extract <a href> count per page (internal + external)
  • Linking - fragment URLs (together with SPAs to navigate content)
  • robots.txt (It is mentioned in Lighthouse; can we parse the content or only confirm its existence? E.g. check if it has a sitemap reference - it seems it does list the potential issues)
  • If the desktop site is responsive/mobile-ready, or a specific mobile site (redirect, UA)? (Can we find if these are different sites?)
  • Descriptive link text usage (available in Lighthouse data)
  • speed metrics (FCP, server response time)

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of SEO" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the SEO landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 6. Fonts

Section Chapter Coauthors Reviewers
I. Page Content 6. Fonts @davelab6 @zachleat @HyperPress @AymenLoukil

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Local vs hosted
  • Popular hosts
  • Font formats
  • Font-Display usage
  • Variable fonts (see below)
    • Latency gains on existing families
    • New modes of typographic expression
    • New ways to make quality text typography
  • How many fonts are loaded, but also how many typefaces (families) are used
  • Related, group by weight/style: how many people use italics? Those are often left off
  • Font formats (how many people are still using the bulletproof @font-face syntax? WOFF2 use specifically)
  • Icon fonts (not sure how to measure this, might show up if we measure popular families?)
  • CSS Font Loading API use?
  • unicode-range use (and range size, perhaps to glean some info on subsetting)
  • uses preconnect for web font cdn? popular preconnect domains?
  • +1 to preload, as Paul said
  • Use of local() in src

Variable fonts:

  • Latency gains on existing families
  • how many pages in the HTTPArchive link to a variable font via @font-face?
    • what percent of total pages use VFs?
    • what is the % growth over some time period (3, 6, 12 months)?
  • of those pages linking to a VF, how many are using the 4 font selectors that select on a variable font family?
  • how many pages link to a VF, but never actually use it?
  • how many pages link to a VF, but never use it beyond old CSS3 values?
  • how many pages use new CSS4 values, like font-weight: 555 and not font-weight: 500?
  • how many pages use @supports to screen for variations capable browsers?
  • is font-stretch usage growing?
  • how often is font-size selecting within opsz axis ranges?
  • which axes are most commonly used today? "top 10 axes"?
  • which axes are used 6-20pt, and which are used 20pt+?
  • which axes are used in concert?

Others:

Top Fonts:

  • top fonts globally
  • top fonts per provider - Google Fonts, AdobeFonts/TypeKit, Cloud.Typography, FontStand, etc
  • top self-hosted fonts
  • what is the bar chart of the number of custom fonts per page?
  • which page uses the most fonts?

Formats:

  • is SVG going away?
  • is EOT going away?
  • is raw TTF going away?
  • how many pages do only WOFF and WOFF2?
  • how many pages do only WOFF?
  • how many pages do only WOFF2?
  • how many pages use color fonts?
  • how many pages use fonts with each of the 4 (SBIX, CBDT, CPAL, SVG) color font formats?

Optimizations:

  • how many pages use each of the font-display properties?
  • how many pages use each of the font preloading properties?
  • how many pages place a single Google Fonts <link> element within <head>?
  • how many pages place a single Google Fonts <link> element as the very first element within <head>?

👉 Optional AI (coauthors): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web fonts" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web fonts landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Review all chapter metrics by analysts

AI (@HTTPArchive/data-analysts): After June 3, when the metrics are finalized, they need to be reviewed to ensure that it's clear what the authors are looking for and that their questions are answerable using the HTTP Archive dataset.

Ensure all contributors have accepted team invitations

We have GitHub teams for each of the 4 roles: authors, reviewers, analysts, and developers. It's important that everyone who wants to contribute in these areas is on the team so they can be notified all at once using @[team] as needed.

All teams should be set up so that you can manually request to join. I should have also sent individual invitations to everyone who has expressed interest in a team. (Some people just got theirs as I was making this list!) Invites are sent to your GH email, but if you didn't get it you can accept by going to the organization page: https://github.com/HTTPArchive

If your name is below, please accept your invitation! As a bonus, all teams have write access for this repo so you can check your own name off when you've accepted!

Authors

Reviewers

Analysts

Developers

Coordinating a project with 64+ contributors will be challenging. Thank you for helping to make it easier!

Create templates for content pages

We don't have chapters written yet but now would be a good time to create the infrastructure for them so we can just drop in the content when they're ready.

I'm working with a designer to create the look and feel, which should be ready by July. Meanwhile we can get started on the backend routing and templating in preparation.

We should create the following templates:

  • rename the splash page at index.html to splash.html and update main.py
  • create a new index.html for the post-launch home page, stub things like navigation and introductory text
  • create /2019/outline.html where we show a table of contents for all chapters
  • create /2019/chapter.html as a template for each chapter
  • create /2019/methodology.html to provide technical info about our test methodology
  • create /2019/contributors.html to acknowledge everyone who contributed to the project

The Almanac is an annual report and I expect to do this again next year, so we should be proactive and separate the annual content from the static content. In this case, the home page is static to the project while things like table of contents, chapters, methodology, and contributors may change year to year. These should all be organized in a 2019 template directory.
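To make the routing side concrete, here's a rough sketch of what the year-scoped routes in main.py could look like (assuming a Flask app for main.py; the route and template names follow the list above, everything else is illustrative):

```python
# Sketch only: year-scoped routing so 2019-specific templates live under a
# templates/2019/ directory, while the home page stays static to the project.
# Route and template names mirror the list above; everything else is illustrative.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def home():
    # Post-launch home page (stubbed navigation and introductory text).
    return render_template('index.html')

@app.route('/<int:year>/')
def outline(year):
    # Table of contents for all chapters in a given edition.
    return render_template('%d/outline.html' % year, year=year)

@app.route('/<int:year>/methodology')
def methodology(year):
    return render_template('%d/methodology.html' % year, year=year)

@app.route('/<int:year>/contributors')
def contributors(year):
    return render_template('%d/contributors.html' % year, year=year)

@app.route('/<int:year>/<chapter>')
def chapter(year, chapter):
    # One shared chapter template; the chapter slug selects the content to drop in.
    return render_template('%d/chapter.html' % year, year=year, chapter=chapter)

if __name__ == '__main__':
    app.run(debug=True)
```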

The contributors page will be based on the data in this sheet. It might be a good idea to create a JSON metadata file of all contributors and generate this page dynamically. This way we can also dynamically add authors/reviewers on the bylines of their respective chapters.
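For the JSON idea, here's a rough sketch (the contributors.json name and its fields are placeholders, not a decided schema):

```python
# Sketch only: build the contributors page and chapter bylines from one JSON
# metadata file. The contributors.json name and its fields are placeholders.
import json

def load_contributors(path='config/contributors.json'):
    # e.g. {"rviscomi": {"name": "Rick Viscomi", "teams": ["authors", "developers"]}}
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def byline(contributors, chapter):
    """chapter: e.g. {"title": "Performance", "authors": ["rviscomi"], "reviewers": [...]}."""
    return {
        'authors': [contributors[github]['name'] for github in chapter['authors']],
        'reviewers': [contributors[github]['name'] for github in chapter['reviewers']],
    }
```

The same data could then feed both the contributors template and each chapter's byline.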

Any volunteers from @HTTPArchive/developers who can take part or all of this issue?

Form a team of translators

The Almanac will be published in English, but we want to make sure that everyone can access and understand it.

I'd like to put together a team of translators to help make the Almanac available in as many languages as possible, prioritizing those with the largest developer communities. If you're interested, can you reply to this issue with the languages you're available to contribute?

The entire Almanac website would need to be translated, including: home page, methodology page, contributors page, and all chapters. If there are multiple translators for a language we can split up the work.

According to the timeline, if all goes to schedule, we'll have about 8 weeks after the chapters are completed to get them translated before launch. I'm open to making some major languages "launch blockers" if there's a good case and available resources, otherwise we can commit to translating after launch. All content in all languages will be available on GitHub, so at any time anyone could submit a PR with entire translations or specific translation fixes.

Priority | Language | Primary Translator | Secondary
0 | Japanese | @MSakamaki | (none)
0 | Spanish | @c-torres | @taytus @JMPerez
0 | Russian | @Pavel-Evdokimov | (none)
1 | French | @AymenLoukil | (none)
1 | Portuguese | @ibrahimcesar | (none)
1 | German | @Awesomecloud | (none)
2 | Dutch | (none) | (none)
2 | Italian | @performize | @realjoker
2 | Polish | (none) | (none)

Reference sheet

@HTTPArchive/translators

Finalize assignments: Chapter 5. Third parties

Section | Chapter | Author | Reviewers
I. Page Content | 5. Third parties | @patrickhulce | @simonhearne @flowlabs @jasti @zeman

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Percentage of pages that include at least one third-party resource.
  • Percentage of pages that include at least one ad resource.
  • Percentage of requests that are third party requests broken down by third party category by resource type.
  • Percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Percentage of total script execution time that is from third party scripts broken down by third party category.
  • Median page-relative percentage of requests that are third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total script execution time that is from third party scripts broken down by third party category.
  • Top 100 third party domains by request volume
  • Top 100 third party domains by total byte weight
  • Top 100 third party domains by total script execution time
  • Top 100 third party requests by request volume
  • Top 100 third party requests by total script execution time

👉 AI (@patrickhulce): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 17. CDN

Section Chapter Authors Reviewers
IV. Content Distribution 17. CDN @andydavies @colinbendell @yoavweiss @paulcalvano @pmeenan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • What are the top CDNs (by number of sites using them, rather than by number of requests)?

  • What % of sites use a CDN

  • % of sites that use a CDN for primary domain i.e. www

  • % of sites that use a CDN for secondary domains, e.g. static., media.

  • Usage of 3rd-party public CDNs, e.g. jQuery, apis.google, etc.

  • CDN TTFB

  • HTTP things (not necessarily CDN-related): header volume, STS, Timing-Allow-Origin, Via, Keep-Alive, Server-Timing metrics/presence, Vary, Content-Disposition, etc.

  • TLS negotiation time

  • TLS Certificate size

  • OCSP stapling support

  • DNS vs. anycast IP use

  • CWND growth rate (not sure this will be measurable)

  • TLS connection coalescing with H2 connections

  • Number of CDNs used per page

  • H2 push?

  • Do HTTPS sites use HTTP/1.1 or HTTP/2?

  • Use of CDN header directives (s-maxage, stale-while-revalidate, nopush, stale-if-error, pre-check, and Surrogate-Control); see the sketch after this list

  • How have these patterns changed over the last year / two years
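For the header-directives item above, here's a rough sketch of how a response's headers could be scanned (the directive list is taken from that item; the parsing itself is illustrative, not the analysts' actual query):

```python
# Sketch only: flag CDN-relevant directives in a response's caching headers.
# The directive list comes from the metric above; the parsing is illustrative.
CDN_DIRECTIVES = {'s-maxage', 'stale-while-revalidate', 'nopush', 'stale-if-error', 'pre-check'}

def cdn_directives(headers):
    """headers: dict of lower-cased header name -> value."""
    found = set()
    cache_control = headers.get('cache-control', '')
    for token in cache_control.split(','):
        name = token.split('=', 1)[0].strip().lower()
        if name in CDN_DIRECTIVES:
            found.add(name)
    if 'surrogate-control' in headers:
        found.add('surrogate-control')
    return found

print(cdn_directives({
    'cache-control': 'public, max-age=600, s-maxage=3600, stale-while-revalidate=30',
}))
# -> {'s-maxage', 'stale-while-revalidate'} (order may vary)
```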

👉 AI (reviewers): Finalize which metrics you might like to include in an annual "state of CDNs" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to the subject matter experts to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the CDN landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 16. Caching

Section | Chapter | Author | Reviewers
IV. Content Distribution | 16. Caching | @paulcalvano | @yoavweiss @colinbendell

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • TTL by resource
  • Resources served without cache
  • Cache strategy?
  • Cache TTL vs Content Age
  • Availability of Last-Modified vs. ETag validators
  • Validity of Dates in Last-Modified and Date headers
  • Set-Cookie on cacheable responses?
  • Use of Cache-Control: max-age vs. Expires
  • Use of Vary (how many dimensions, what headers, etc.)
  • Use of other Cache-Control directives (e.g., public, private, immutable)
  • 1st Party vs 3rd Party Caching
  • Public vs Private
  • Use of must-revalidate
  • Service Worker caching
  • AppCache
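For the TTL and validator metrics above, here's a rough sketch of how a single response's headers might be interpreted (illustrative only; the actual analysis will be done by analysts over the HTTP Archive requests tables in BigQuery):

```python
# Sketch only: derive an effective TTL for a response from Cache-Control max-age,
# falling back to Expires minus Date, and note which validators are present.
from email.utils import parsedate_to_datetime

def effective_ttl(headers):
    """headers: dict of lower-cased header name -> value. Returns TTL in seconds or None."""
    cache_control = headers.get('cache-control', '')
    for token in cache_control.split(','):
        name, _, value = token.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return int(value)
    if 'expires' in headers and 'date' in headers:
        try:
            expires = parsedate_to_datetime(headers['expires'])
            date = parsedate_to_datetime(headers['date'])
            return int((expires - date).total_seconds())
        except (TypeError, ValueError):
            return None
    return None

def validators(headers):
    return {'last-modified': 'last-modified' in headers, 'etag': 'etag' in headers}

headers = {
    'cache-control': 'public, max-age=86400',
    'last-modified': 'Tue, 04 Jun 2019 10:00:00 GMT',
}
print(effective_ttl(headers), validators(headers))
# -> 86400 {'last-modified': True, 'etag': False}
```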

👉 Optional AI (@paulcalvano): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (@paulcalvano): Finalize which metrics you might like to include in an annual "state of caching" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the caching landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Form a team of web designers and developers

We're looking for web designers and developers to help build the UX of the Almanac itself. It will be a static website home to each annual report and read like an ebook to help users navigate the various sections and chapters.

Designers should have the bandwidth throughout July and August to conceptualize the UX of the website and the data visualizations used by the chapters. This is a big burden so we're open to contracting this out. See this Upwork post for the full details of the design work.

Update: We've hired contractors to do the illustration and design work. You can join the discussion in the #web-almanac-design Slack channel.

Developers will be responsible for implementing the designers' vision and merging it with the authors' written content, while following accessibility and SEO best practices.

Join the team: @HTTPArchive/developers
See open issues: Development label

Licensing and attribution

In the interest of openness and transparency, we want the Almanac to be free, shareable, and extensible. At the same time, authors deserve attribution for their content. And everyone else who has helped build this report (now over 50 people!) should get some kind of recognition.

It's my intent to give credit to everyone who participates in the project in any form on a "Contributors" page. Each chapter will also name their respective authors and reviewers.

We should also provide an easy way to grab quotes from anywhere in the Almanac and see exactly how we expect them to be attributed. For example, if I write something in the Performance chapter that gets quoted, maybe we should annotate it with "Rick Viscomi, 2019 Web Almanac (II.6)" or similar.

What would not be ok is if someone scrapes all of the content and sells it. Our license should have protections against that sort of thing. This repo is marked as Apache 2.0 which permits commercial use, but that should apply only to our source code not the authored content.

Does anyone have experience with this kind of thing? Any ideas for protecting our work while making it as open as possible?

Authoring in markdown

There was some earlier discussion about allowing for authoring the chapter content using markdown. Would this still be something worth investigating as a spike task?

The benefits are that it seems easier to write in markdown than html, and that gitlocalize seems to have better support for markdown.

The challenges are that we will likely want some rich data visualisation in the chapters, which may not be simple to achieve using markdown.
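As a starting point for the spike, here's a rough sketch of the Markdown-to-template step (assuming the python-markdown package and a Jinja2 chapter template; none of this is decided, and rich visualisations would still need embedded HTML or shortcodes inside the Markdown):

```python
# Sketch only: render a Markdown chapter into the chapter template.
# Assumes the python-markdown package and a Jinja2 template named chapter.html.
import markdown
from jinja2 import Environment, FileSystemLoader

def render_chapter(markdown_path, template_dir='templates/2019'):
    with open(markdown_path, encoding='utf-8') as f:
        body_html = markdown.markdown(f.read(), extensions=['tables', 'toc'])
    env = Environment(loader=FileSystemLoader(template_dir))
    return env.get_template('chapter.html').render(body=body_html)
```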

Finalize assignments: Chapter 15. Compression

Section | Chapter | Author | Reviewers
IV. Content Distribution | 15. Compression | @paulcalvano | @yoavweiss @colinbendell

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • What compression formats are being used (gzip, brotli, etc)
  • Is there anything we can tell from the level of compression
  • Are there missed opportunities for compressing resources
  • Compression by Content Type
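As a rough illustration of the format tally and the missed-opportunities metrics, here's a sketch; the compressible-type list and size threshold are assumptions, not agreed definitions:

```python
# Sketch only: tally compression formats from Content-Encoding and flag text
# responses served uncompressed as missed opportunities. Thresholds and the
# "compressible" MIME test are assumptions for illustration.
from collections import Counter

COMPRESSIBLE_PREFIXES = ('text/', 'application/javascript', 'application/json', 'image/svg+xml')

def summarize(responses):
    """responses: iterable of dicts with 'content_type', 'content_encoding', 'body_bytes'."""
    formats = Counter()
    missed = 0
    for r in responses:
        encoding = (r.get('content_encoding') or 'none').lower()
        formats[encoding] += 1
        compressible = (r.get('content_type') or '').startswith(COMPRESSIBLE_PREFIXES)
        if compressible and encoding == 'none' and r.get('body_bytes', 0) > 1400:
            missed += 1
    return formats, missed

responses = [
    {'content_type': 'text/html', 'content_encoding': 'gzip', 'body_bytes': 20000},
    {'content_type': 'application/javascript', 'content_encoding': None, 'body_bytes': 50000},
    {'content_type': 'image/png', 'content_encoding': None, 'body_bytes': 80000},
]
print(summarize(responses))  # -> (Counter({'none': 2, 'gzip': 1}), 1)
```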

👉 AI (@paulcalvano): Finalize which metrics you might like to include in an annual "state of compression" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the compression landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Assign subject matter experts and peer reviewers to each chapter

Part Chapter Authors Reviewers Tracking Issue
I. Page Content 1. JavaScript @addyosmani @housseindjirdeh @mathiasbynens @rwaldron @RReverser #3
I. Page Content 2. CSS @una @argyleink @meyerweb @huijing #4
I. Page Content 3. Markup @bkardell @zcorpan #5
I. Page Content 4. Media @dougsillars @colinbendell @Yonet @ahmadawais @kornelski #6
I. Page Content 5. Third Parties @patrickhulce @simonhearne @flowlabs @jasti @zeman #8
I. Page Content 6. Fonts @davelab6 @zachleat @HyperPress @AymenLoukil #7
II. User Experience 7. Performance @rviscomi @zeman @JMPerez @OBTo @sergeychernyshev #9
II. User Experience 8. Security @arturjanc @ScottHelme @paulcalvano @bazzadp @ghedo @ndrnmnn #10
II. User Experience 9. Accessibility Nektarios Paisios, @nadinarama, @OBTo @rachellcostello, @kleinab #11
II. User Experience 10. SEO @rachellcostello @ymschaap @AVGP @clarkeclark @andylimn @voltek62 #12
II. User Experience 11. PWA @tomayac @jeffposnick @HyperPress @ahmadawais #13
II. User Experience 12. Mobile web @slightlyoff @OBTo @HyperPress @AymenLoukil #14
III. Content Publishing 13. Ecommerce @samdutton @alankent @voltek62 @wizardlyhel #15
III. Content Publishing 14. CMS @amedina @westonruter @mor10 @sirjonathan #16
IV. Content Distribution 15. Compression @paulcalvano @yoavweiss @colinbendell #17
IV. Content Distribution 16. Caching @paulcalvano @yoavweiss @colinbendell #18
IV. Content Distribution 17. CDN @andydavies @colinbendell @yoavweiss @paulcalvano @pmeenan #19
IV. Content Distribution 18. Page Weight @khempenius @henrisGH @tammyeverts @paulcalvano @flowlabs #20
IV. Content Distribution 19. Resource Hints @khempenius @yoavweiss @andydavies @addyosmani #21
IV. Content Distribution 20. HTTP/2 @bazzadp @bagder @rmarx @dotjs #22

Reference sheet

For more context about the Almanac project and how you can help, see this post.

Finalize assignments: Chapter 8. Security

Section Chapter Authors Reviewers
II. User Experience 8. Security @arturjanc @ScottHelme @paulcalvano @bazzadp @ghedo @ndrnmnn

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

TLS 🔒

  • Protocol Usage
    • SSLv2 / SSLv3 / TLSv1.0 / TLSv1.1 / TLSv1.2 / TLSv1.3
  • Unique CA issuers
  • RSA certificates
  • ECDSA certificates
  • Certificate validation level (DV / OV / EV)
  • Cipher suite usage
    • Suites supporting Forward Secrecy (ECDHE / DHE)
    • Authenticated suites (GCM / CCM)
    • Modern suites (AES GCM, ChaCha20-Poly1305)
    • Legacy suites (AES CBC, 3DES, RC4)
  • OCSP Stapling
  • Session ID/Ticket assignment
  • Sites redirecting to HTTPS
  • Sites with degraded HTTPS UI (mixed-content)

Security Headers 📋

  • Content Security Policy
    • Policies with frame-ancestors
    • Policies with 'nonce-*'
    • Policies with 'hash-*'
    • Policies with 'unsafe-inline'
    • Policies with 'unsafe-eval'
    • Policies with 'strict-dynamic'
    • Policies with 'trusted-types'
    • Policies with 'upgrade-insecure-requests'
  • HTTP Strict Transport Security
    • Variance in max-age
    • Use of includeSubDomains
    • Use of preload token
  • Network Error Logging
  • Report To
  • Referrer Policy
  • Feature Policy
  • X-Content-Type-Options
  • X-Xss-Protection
  • X-Frame-Options
  • Cross-Origin-Resource-Policy
  • Cross-Origin-Opener-Policy
  • Vary (Sec-Fetch-* values)

Cookies 🍪

  • Use of HttpOnly
  • Use of Secure
  • Use of SameSite
  • Use of prefixes

Other ❓

  • Use of SRI on subresources
  • Vulnerable JS libraries (lighthouse?)
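To illustrate how the cookie metrics above could be derived from a Set-Cookie header, here's a rough sketch (simplified parsing, for illustration only):

```python
# Sketch only: check a Set-Cookie header for the attributes listed above
# (HttpOnly, Secure, SameSite) and for the __Secure-/__Host- name prefixes.
def cookie_flags(set_cookie_value):
    name = set_cookie_value.split('=', 1)[0].strip()
    attrs = [part.strip().lower() for part in set_cookie_value.split(';')[1:]]
    return {
        'httponly': 'httponly' in attrs,
        'secure': 'secure' in attrs,
        'samesite': any(a.startswith('samesite=') for a in attrs),
        'prefix': name.startswith(('__Secure-', '__Host-')),
    }

print(cookie_flags('__Host-session=abc123; Path=/; Secure; HttpOnly; SameSite=Lax'))
# -> {'httponly': True, 'secure': True, 'samesite': True, 'prefix': True}
```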

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web security" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web security landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 20. HTTP/2

Section Chapter Authors Reviewers
IV. Content Distribution 20. HTTP/2 @bazzadp @bagder @rmarx @dotjs

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Adoption rate of HTTP/2 by site (home page only) and by requests (all request on page) over the years. Trend graph over all available years.
  • Measure of HTTP version negotiated (0.9, 1.0, 1.1, 2, gQUIC) for main page of all sites, and for HTTPS sites. Table for last crawl. For example:
Version | All sites | HTTPS-only sites
HTTP/0.9 | 0% | 0%
HTTP/1.0 | 2% | 0%
HTTP/1.1 | 48% | 20%
HTTP/2 | 44% | 70%
gQUIC | 6% | 10%

For gQUIC it will be sites that return an Alt-Svc HTTP header whose value starts with quic (see the sketch after this list).

  • Average percentage of resources loaded over HTTP/2 (or gQUIC) versus HTTP/1.1 per site. Trend graph over all available years.
  • Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. If not possible then maybe just list sites by CDN and can then manually vlookup from table in Andy's github issue? Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers (e.g. Apache, Apache 2.4.28, and Apache 2.4.29 should all report as Apache, but Apache Tomcat should report as Tomcat; probably need to massage the results to achieve this). Once off stat for last crawl.
  • Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers. Once off stat for last crawl.
  • Count of HTTP/2 sites which use HTTP/2 Push. Trend graph over all available years.
  • Average number of HTTP/2 Pushed resources and average bytes. Once off stat for last crawl.
  • Count and number of bytes pushed by asset type (CSS, JS, Images...etc.). Once off stat for last crawl.
  • Count of preload HTTP Headers with nopush attribute set. Once off stat for last crawl.
  • Is it possible to see HTTP/2 Pushed resources which are not used on the page load?
  • Measure number of TCP connections per site. Is the average number of domains per site still going down year on year, as per the HTTP Archive State of the Web report? Trend graph over all available years.
  • Measure average number of TCP Connections per site for HTTP/1.1 sites versus HTTP/2 sites. Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by SETTINGS_MAX_CONCURRENT_STREAMS (including HTTP/2 sites which don't set this value). Note this was added recently as per #22 (comment). Once off stat for last crawl.
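To make two of the trickier items concrete, here's a rough sketch of the gQUIC detection heuristic and the server-name normalization described above (illustrative only; analysts will write the real queries):

```python
# Sketch only: two helpers for the metrics above. gQUIC detection follows the
# Alt-Svc heuristic described earlier; server-name normalization strips versions
# (Apache 2.4.28 -> Apache, Apache Tomcat -> Tomcat). Both are illustrative.
import re

def is_gquic(headers):
    """True if the response advertises gQUIC via an Alt-Svc header starting with quic."""
    return headers.get('alt-svc', '').lower().startswith('quic')

def normalize_server(server_value):
    """Collapse a Server header value to a product family without version numbers."""
    if not server_value:
        return '(none)'
    first = server_value.split()[0].split('/')[0]   # e.g. "Apache/2.4.29" -> "Apache"
    if 'tomcat' in server_value.lower():
        return 'Tomcat'
    return re.sub(r'[\d.]+$', '', first).strip('- ') or first

print(is_gquic({'alt-svc': 'quic=":443"; ma=2592000; v="46,43"'}))  # True
print(normalize_server('Apache/2.4.29 (Ubuntu)'))                    # Apache
print(normalize_server('Apache Tomcat'))                              # Tomcat
```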

👉 AI (@bazzadp): Finalize which metrics you might like to include in an annual "state of HTTP/2" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the HTTP/2 landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Website architecture design

I have some questions/remarks:

  • Is there a document somewhere describing the target architecture of the Almanac website (information architecture, URL structure)?
  • I understood from reading the issues that we aim to have a static page, with each year's version under a dedicated sub-folder (/2019, /2020, etc.). IMHO, it would be better to have this setup:

Each year we put the content at the root (the home page contains the static content plus dynamic insights of the current edition), /outline for the current outline, and so on; when we publish a new edition, we archive the previous ones in subfolders.

Example:
2019 content goes in the main generic 'website'
2018 content goes in the /2018 folder
And when we publish the 2020 version, the main generic 'website' gets updated with 2020 data
and we archive 2019 in the /2019 folder, etc.

The pros of this method:

  • People who search for Almanac insights find them at the root of the website, which is simpler; it's trivial for them to find and go to the current version
  • The generic website will gain value over the years, while still allowing the older versions to remain accessible if someone explicitly looks for them
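To sketch how that routing could look (assuming a Flask app; CURRENT_YEAR and the redirect behaviour are illustrative, not a decision):

```python
# Sketch only: the current edition lives at the root; older editions are archived
# under /<year>/. CURRENT_YEAR and the redirect behaviour are illustrative.
from flask import Flask, render_template, redirect

app = Flask(__name__)
CURRENT_YEAR = 2019

@app.route('/')
def current_edition():
    return render_template('%d/index.html' % CURRENT_YEAR)

@app.route('/<int:year>/')
def archived_edition(year):
    if year == CURRENT_YEAR:
        return redirect('/')   # keep one canonical URL for the current edition
    return render_template('%d/index.html' % year)
```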

What do you think, @rviscomi?

Develop a translation workflow

The web server should accommodate multiple languages. See #29 for context.

To summarize the thread:

  • content will initially be created in English within the src/templates/en directory
  • each chapter will be written in Markdown per #59
  • translations of non-chapter contents will be done manually in a PR and saved to their respective directories organized by language code
  • translations of chapter contents based in Markdown will be done using the gitlocalize tool

Original comment for reference:

I think we should create a src/en/ directory and put all of the English-specific templates there by default, including our current splash page.

So the directory structure might look like this:

  • src
    • en
      • templates
        • base.html (English base template)
    • ja
      • templates
        • base.html (Japanese base template)

In src/main.py we should use the en directory by default, unless one is specified in the URL and it's a supported language code.
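A rough sketch of that default-to-English lookup (assuming Flask and the directory layout above; SUPPORTED_LANGUAGES and the route shapes are placeholders):

```python
# Sketch only: serve templates from the language directory named in the URL,
# falling back to en when no (or an unsupported) language code is given.
# SUPPORTED_LANGUAGES and the route shapes are assumptions for illustration.
from flask import Flask, render_template, abort

app = Flask(__name__, template_folder='templates')
SUPPORTED_LANGUAGES = {'en', 'ja', 'es', 'ru', 'fr'}

def localized(template, lang=None):
    lang = lang if lang in SUPPORTED_LANGUAGES else 'en'
    return render_template('%s/%s' % (lang, template), lang=lang)

@app.route('/')
@app.route('/<lang>/')
def home(lang=None):
    if lang is not None and lang not in SUPPORTED_LANGUAGES:
        abort(404)
    return localized('base.html', lang)
```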

We also need to start thinking about (but not necessarily implement at this stage) a way to switch languages in the UI.

@HTTPArchive/developers any takers for this issue? Any other implementation ideas?
