httparchive / almanac.httparchive.org


HTTP Archive's annual "State of the Web" report made by the web community

Home Page: https://almanac.httparchive.org

License: Apache License 2.0

Python 2.72% CSS 2.91% HTML 87.45% Shell 1.03% JavaScript 5.78% Dockerfile 0.04% Batchfile 0.08%
web-almanac http-archive bigquery

almanac.httparchive.org's Introduction

The HTTP Archive tracks how the Web is built

!! Important: This repository is deprecated. Please see HTTPArchive/httparchive.org for the latest development !!

This repo contains the source code powering the HTTP Archive data collection.

What is the HTTP Archive?

Successful societies and institutions recognize the need to record their history; this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996, Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result, he founded the Internet Archive, which collects and permanently stores the Web's digitized content.

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

almanac.httparchive.org's People

Contributors

bazzadp, bkardell, borisschapira, c-torres, catalinred, chefleo, denar90, dependabot[bot], foxdavidj, github-actions[bot], hakacode, j9t, jmperez, kevinfarrugia, ksakae1216, lex111, max-ostapenko, mikegeyser, msakamaki, patrickhulce, paulcalvano, rviscomi, saptaks, shantsis, strangernr7, tiggerito, tomvangoethem, tunetheweb, victorlep, ymschaap

almanac.httparchive.org's Issues

Build the development environment to maintain the Almanac

AI (@HTTPArchive/developers): Design the tech stack for the Almanac.

The Almanac user experience will be entirely static and stateless, so a solution as simple as GitHub Pages could work. I think we want a bit more control over the backend (response headers, SSR templates) so I'm leaning towards a similar setup as https://github.com/HTTPArchive/httparchive.org in which we build on App Engine. Thoughts?

TO DO:

  • Create Python App Engine development environment with Flask
  • Create an initial base template, styles
  • Extend the base template to create a temporary splash page to be deployed on almanac.httparchive.org while the project is under construction
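
A minimal sketch of what the Flask entry point could look like for this development environment, assuming the conventional App Engine standard-environment layout (a main.py next to app.yaml); the template name and route are illustrative placeholders, not the repo's actual code.

# main.py -- minimal Flask app, deployable to the App Engine standard environment.
# Hypothetical sketch: template names and routes are placeholders.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def splash():
    # Temporary "coming soon" page while the Almanac is under construction.
    return render_template('index.html')

if __name__ == '__main__':
    # Local development server; App Engine provides its own entry point in production.
    app.run(host='127.0.0.1', port=8080, debug=True)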

Finalize assignments: Chapter 11. PWA

Section Chapter Authors Reviewers
II. User Experience 11. PWA @tomayac @jeffposnick @HyperPress @ahmadawais

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • % of pages with SW installs
  • Manifest
  • Stats on different service worker events
  • Stats on different web app manifest properties
  • Workbox adoption/usage
  • beforeinstallprompt usage

👉 AI (coauthors): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of PWAs" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the PWA landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Create an almanac_sample dataset

@HTTPArchive/data-analysts

It'd be helpful to have a sample dataset of ~1000 pages to try out queries and play with the data to see what types of metrics are possible.

The dataset should contain tables for each of the different data types:

httparchive.almanac_sample.

  • blink_features
  • lighthouse_mobile
  • requests_ [desktop, mobile]
  • response_bodies_ [desktop, mobile]
  • summary_pages_ [desktop, mobile]
  • summary_requests_ [desktop, mobile]
  • technologies_ [desktop, mobile]

A random sample of 1000 pages from the most recent 2019_05_01 crawl for both desktop and mobile should do.

@paulcalvano do you have time to set this up?
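
One possible way to build such a sample, sketched with the BigQuery Python client; the table naming follows the convention listed above, but the almanac_sample dataset is assumed to already exist, and the exact source schema and the RAND()-based sampling are assumptions rather than the team's actual procedure.

# Sketch: copy a random ~1000-page sample into the almanac_sample dataset.
# Assumes httparchive.summary_pages.2019_05_01_{desktop,mobile} exist and that a
# simple ORDER BY RAND() LIMIT is acceptable for a one-off sample table.
from google.cloud import bigquery

client = bigquery.Client(project='httparchive')

for client_type in ('desktop', 'mobile'):
    query = f"""
    CREATE OR REPLACE TABLE `httparchive.almanac_sample.summary_pages_{client_type}` AS
    SELECT *
    FROM `httparchive.summary_pages.2019_05_01_{client_type}`
    ORDER BY RAND()
    LIMIT 1000
    """
    client.query(query).result()  # wait for the table to be created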

Finalize assignments: Chapter 13. Ecommerce

Section Chapter Coauthors Reviewers
III. Content Publishing 13. Ecommerce @samdutton @alankent @voltek62 @wizardlyhel

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Top ecommerce platforms:

  • Marketplace: for example, eBay or Etsy.
  • Hosted shop: for example, Shopify.
  • Hosted platform: for example, Magento Commerce.
  • Self-hosted platform: for example, Magento Open Source.
  • Not on a platform or marketplace: sites that show payment activity but don't appear to be on a platform or marketplace.

Stats for sites that appear to be e-commerce sites (as above):

  • Images: quantity, format, byte size, pixel dimensions, etc.
  • Home page HTML size.
  • Performance stats.
  • Third-party content: total weight and number of requests (and performance impact if possible).
  • Analytics providers.
  • Ad providers.
  • Indexability.
  • Qualification (or not) as a PWA.

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of ecommerce" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the ecommerce landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 19. Resource Hints

Section Chapter Authors Reviewers
IV. Content Distribution 19. Resource Hints @khempenius @yoavweiss @andydavies @addyosmani

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

For each resource hint (preload, prefetch, preconnect, prerender):

  • % of sites using $HINT; how this has changed since a year ago.
  • For sites using $HINT, # of times it is used.
  • crossorigin attribute, as attribute, resource priority
  • Preload/Prefetch-only: the resource types that $HINT is used for (e.g. js, document, etc.)
  • Preload only: % of sites that are using as=font/as=fetch without a crossorigin attribute, or that are using any other as value with a crossorigin attribute.
  • Preload only: % of sites where a preload of low priority is done before a load of higher priority and a different as attribute value.

Priority Hints:

  • % of sites using this
    Note: depending on how small the sample size is, the following metrics may not be worth calculating
  • Usage breakdown by tag (i.e. iframe, img, link, or script)
  • Usage breakdown by Importance (i.e. low/high/auto)
  • (Optional) tag x importance (e.g. do scripts tend to be "high" importance? iframes "low" importance? etc.)

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of priority hints" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the priority hints landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 1. JavaScript

Section Chapter Coauthors Reviewers
I. Page Content 1. JavaScript @addyosmani @housseindjirdeh @mathiasbynens @rwaldron @RReverser

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Transfer size/count

    • Distribution of JS bytes
    • Distribution of first party JS bytes vs. third party
    • Number of JS requests
    • Number of first-party JS requests vs. third-party
    • % of gzip-compressed scripts
    • % of brotli-compressed scripts
  • Runtime cost

    • Breakdown of V8 CPU times (if feasible)
  • Library usage

    • Top N JS libraries
    • Notable changes in popularity since last year
    • Top N JS client-side frameworks (React, Vue, etc.)
    • Distribution of JS bytes on site per JavaScript framework
  • Feature adoption

    • % of pages that use <script type=module>
    • % of pages that use <script nomodule>
    • % of pages that use <link rel=preload> for JS resources
    • % of pages that use <link rel=modulepreload>
    • % of pages that use <link rel=prefetch> for JS resources
    • Use of navigator.connection.effectiveType property
    • Estimate adoption of specific JS language features (by looking for the following raw strings in JS response bodies; see the sketch after this list)
      • Atomics
      • Intl
      • Proxy
      • SharedArrayBuffer
      • WeakMap
      • WeakSet
      • dynamic import (by looking for "import(")
  • Other

    • % of sites that ship sourcemaps
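
A rough illustration of the raw-string approach mentioned above, assuming the response_bodies sample tables have (page, url, body) columns; the column names and the naive substring match are assumptions, and this would over-count strings that appear in comments or unrelated contexts.

# Sketch: estimate adoption of JS language features by searching raw response bodies.
# Assumes httparchive.almanac_sample.response_bodies_desktop with (page, url, body) columns.
from google.cloud import bigquery

FEATURES = ['Atomics', 'Intl', 'Proxy', 'SharedArrayBuffer', 'WeakMap', 'WeakSet', 'import(']

client = bigquery.Client(project='httparchive')

for feature in FEATURES:
    query = """
    #standardSQL
    SELECT COUNT(DISTINCT page) AS pages
    FROM `httparchive.almanac_sample.response_bodies_desktop`
    WHERE STRPOS(body, @feature) > 0
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('feature', 'STRING', feature)]
    )
    rows = list(client.query(query, job_config=job_config).result())
    print(feature, rows[0].pages)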

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of JS" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the JS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Add analytics script

This Google Analytics tracking script needs to be added to the document head of the base template for every page.

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-22381566-3"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-22381566-3');
</script>

This issue is blocked on creation of the base template in #25.

Triage all proposed metrics (396 of 396 done)

Assigned: @HTTPArchive/data-analysts team

Due date: No later than July 1

Any metrics that require augmenting the test infrastructure (eg custom metrics) must be ready to go when the July crawl starts. This ensures that when the crawl completes at the end of July, we can query the dataset and pass it off to authors for interpretation in August.

As of now there are 350+ metrics spread over 20 chapters.

Part Chapter Able To Query Not Feasible Grand Total
I 01. JavaScript 24 1 25
I 02. CSS 39 7 46
I 03. Markup 4 1 5
I 04. Media 20 5 25
I 05. Third Parties 13 0 13
I 06. Fonts 40 7 47
II 07. Performance 24 0 24
II 08. Security 36 5 41
II 09. Accessibility 32 6 38
II 10. SEO 15 0 15
II 11. PWA 6 0 6
II 12. Mobile web 19 2 21
III 13. Ecommerce 10 3 13
III 14. CMS 11 1 12
IV 15. Compression 3 1 4
IV 16. Caching 14 1 15
IV 17. CDN 13 3 16
IV 18. Page Weight 3 0 3
IV 19. Resource Hints 10 0 10
IV 20. HTTP/2 14 3 17
Grand Total 350 46 396

I've copied all of the metrics for each chapter to this sheet (named "Metrics Triage"). To edit the sheet please give me your email address to add to the editors list. What we need to do is go through the list of metrics for each chapter and assign a status from one of the following:

  • To Be Reviewed
  • Need More Info
  • Not Feasible
  • Able To Query
  • Custom Metric Required
  • Custom Metric Written
  • Query Written

The lifecycle is:

  • All metrics start as TBR
    • Move to NMI if the metric is vaguely worded or it is otherwise unclear what is being asked for. Get in touch with the chapter author(s) and straighten out what the expected data should look like.
    • Move to NF if the metric cannot be queried using the HTTP Archive dataset or other publicly available datasets on BigQuery (eg CrUX). This is the "done" state for metrics which cannot progress any further.
    • Move to ATQ if the metric is able to be queried from the dataset based on the latest schema
      • Move to QW if the metric has a corresponding query written. This is the ideal "done" state for all metrics.
    • Move to CMR if the metric can only be queried with the addition of a custom metric
      • Move to CMW if the metric has had a corresponding custom metric written. Metrics in this state must also have a corresponding query written and moved to QW when complete.

Custom metrics should only be added as a last resort and must adhere to strict performance requirements. We test on millions of pages so any complex/slow scripts would impede the crawl. Because we anticipate needing many custom metrics, we'll implement everything as individual functions within a single custom metric whose output is a JSON-encoded object with each result as its own sub-property. More on this when we get there.

Add your name in the Analyst column to take responsibility for moving it through the metric lifecycle.

Once we're ready to begin writing queries, we will create a thread on https://discuss.httparchive.org for each chapter, listing all queryable metrics. Hopefully we can crowdsource some of the querying by tapping into the power users on the forum.

Optimize the 2019_07_01 dataset for querying

At the end of July the 2019_07_01 dataset will be available. Here are some ideas to minimize the cost of queries:

  • implement the partitioning/clustering proposal with the 2019_07_01 dataset
  • add a column to response_bodies annotating their file types (eg js, css, html) or join the entire body column with the requests table (see the sketch below)
  • what else?

One principle should be that whatever queries we write for the Almanac should be reproducible against any other monthly dataset. So if we optimize the July dataset we should also apply the same optimizations to all others. This doesn't have to happen until launch.
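
As a rough example of the second idea, a sketch of how a type-annotated response_bodies table could be derived; the join keys and the existence of a type column on summary_requests are assumptions about the schema, not the agreed-upon design, and the partitioning/clustering choices from the linked proposal are omitted here.

# Sketch: derive a response_bodies table annotated with each request's file type.
# The join keys (page, url) and the summary_requests.type column are schema assumptions.
from google.cloud import bigquery

client = bigquery.Client(project='httparchive')

query = """
#standardSQL
CREATE OR REPLACE TABLE `httparchive.almanac.response_bodies_desktop_annotated` AS
SELECT bodies.page, bodies.url, requests.type, bodies.body
FROM `httparchive.response_bodies.2019_07_01_desktop` AS bodies
JOIN `httparchive.summary_requests.2019_07_01_desktop` AS requests
USING (page, url)
"""
client.query(query).result()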

Finalize assignments: Chapter 2. CSS

Section Chapter Coauthors Reviewers
I. Page Content 2. CSS @una @argyleink @meyerweb @huijing

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Usage of popular/new APIs:
  • Custom properties
  • @import and @supports
  • Filters
  • Blend modes
  • Logical properties

Usage of unit types:
  • hsl vs. hsla vs. rgb vs. rgba vs. hex
  • rem vs. em vs. px vs. ex vs. cm, etc.
  • Classes vs. ids

CSS tooling today:
  • Top CSS development tools

Usage of popular CSS libraries:
  • Top CSS libraries

Resets:
  • Top reset utilities

Layout:
  • RTL vs. LTR
  • Flexbox
  • Grid

Media queries:
  • Most popular snap points
  • max-width vs. min-width
  • Ems vs. rems vs. px in media queries
  • How many sites use print media queries

Size of style payload:
  • Number of stylesheets per page
  • Most popular names for stylesheets
  • Minified vs. unminified
  • Number of fonts downloaded
  • Types of fonts downloaded
  • Average size of CSS load per site
  • Average size of images loaded by stylesheets (inline and linked)

Individual files vs. bundled files:
  • Came with the HTML
  • Inserted post page load
  • Async vs. sync
  • Constructible stylesheets
  • Inline styles vs. one stylesheet link

Duplication / etc.:
  • Shorthand vs. longhand properties
  • Number of colors declared per site
  • Number of duplicate colors per those sites
  • Number of fonts declared per site
  • Number of duplicate font family declarations
  • Number of different font size values per site
  • Number of z-indices per site
  • Most popular z-index values (chart)
  • Number of different media query values per site
  • Number of different margins per site
  • Number of transitions used per site
  • Number of @keyframes declared per site
  • Number of [id="foo"]
  • Number of [class*='foo'], [class^='foo'], [class$='foo'], [class~='foo']
  • Number of classes per element
  • Average length of classes

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

The metrics should paint a holistic, data-driven picture of the CSS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Consider users' Accept-Language preference when selecting the default language

When users first visit, it would be a nice experience if they never had to think about language selection.

To do this, we need to check the Accept-Language header sent by the client and route the user to the appropriate language.
(We must be careful not to apply this redirect after a language has been explicitly selected via the switcher.)

This needs to be addressed after #50.
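
A minimal sketch of how this could look in the Flask app, assuming a small list of supported languages and the URL-prefix scheme from the language switcher issue; the names and routes are illustrative.

# Sketch: pick a default language from the Accept-Language header on first visit.
# SUPPORTED_LANGUAGES and the URL scheme are assumptions for illustration.
from flask import Flask, redirect, render_template, request

app = Flask(__name__)
SUPPORTED_LANGUAGES = ['en', 'ja']

@app.route('/')
def root():
    # Only negotiate when no language prefix was given; an explicit choice made
    # via the switcher lands on /<lang>/ directly and skips this redirect.
    best = request.accept_languages.best_match(SUPPORTED_LANGUAGES, default='en')
    if best != 'en':
        return redirect(f'/{best}/')
    return render_template('en/index.html')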

Decide on typography

Is there anyone with design experience, specifically in typography, who could offer some suggestions for Almanac fonts?

Build a language switcher

@HTTPArchive/developers

Somewhere on the page we should have a dropdown field for users to change the language. The expected UX of the field is:

  • the default/selected option is the current language
  • clicking the field will open a switcher with the other supported languages displayed
  • selecting any other language will reload the current page with the language specified in the URL

For example, if on / and the user selects Japanese, the new URL will be /ja/. Similarly, if on /2019/outline the new URL will be /ja/2019/outline. Switching to English will force the /en URL path prefix for simplicity.

In terms of implementation, it should be built into the templates/base.html template. We can look at the {{ lang }} property and set an option as selected if it matches. For accessibility, does this need to have anchor elements or are there equivalent ARIA attributes we can use?
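
One way to back the template: a small helper that rewrites the current path for each supported language, which templates/base.html could loop over to render the options; the language list and the trailing-slash handling are assumptions, not the final design.

# Sketch: build the language-switched URL for the current page.
# Assumes paths like '/', '/ja/', '/2019/outline', '/ja/2019/outline'.
SUPPORTED_LANGUAGES = ['en', 'ja']

def switch_language(path: str, lang: str) -> str:
    parts = [p for p in path.split('/') if p]
    if parts and parts[0] in SUPPORTED_LANGUAGES:
        parts = parts[1:]  # strip the existing language prefix
    return '/' + '/'.join([lang] + parts) + ('/' if not parts else '')

# e.g. switch_language('/', 'ja') -> '/ja/'
#      switch_language('/2019/outline', 'ja') -> '/ja/2019/outline'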

Design questions:

  • is there a common place for this UI to go? header/footer?
  • should we show flags as the options as a way to convey the language in a way not dependent on translation? (need a text-based fallback anyway for accessibility)
  • should the options be translated in the destination language?

Form a team of data analysts

Data analysts are responsible for working with authors to provide HTTP Archive data that match the metrics outlined in each chapter. Analysts should be familiar with the BigQuery dataset and comfortable with SQL. Mentoring is available for anyone who wants to learn!

Our current assignment is to triage all ~250 metrics. New analysts: please see #33 for more info on how to access the metrics sheet and what to do.

Join the team: @HTTPArchive/data-analysts

Finalize assignments: Chapter 12. Mobile web

Section Chapter Authors Reviewers
II. User Experience 12. Mobile web @slightlyoff @OBTo @HyperPress @AymenLoukil

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Tap targets
    • Let's tackle this through [1], with the frequency being how many offending elements are on each page
  • Legible font size. Analyzing this with what Lighthouse deems an acceptable % of legible text is fine
  • Proper font contrast
    • Please tackle this like "Tap targets" above
  • Mobile configuration split - separate mobile and desktop sites, responsive site, dynamically served content.
  • % sites prevent users from scaling the viewport
  • % site with a meta viewport at all
  • % sites containing any CSS breakpoints <= 600px
  • % sites locking display orientation
  • % of sites preventing pasting into password fields
  • % of sites making NO permission requests. Sites should only be making these upon a user interaction like a click.
  • For each of the following, what % of sites make this permission request while loading: Notifications, Geolocation, Camera, Microphone
  • # of links or buttons (ideally any element with a click listener attached) only containing an icon [1]
    • This can be tested by checking whether the button contains only an svg element or a single character (font icons)
  • How well are sites using native features on the web to simplify a user's job:
    • What is the penetration for each of the following input types [1]
      • color, date, datetime-local, email, month, number, range, reset, search, tel, time, url, week, datalist
      • % of sites using ANY of the above input types
    • Penetration for each of the following attributes [1]
      • autocomplete, min or max, pattern, placeholder, required, step
      • % of sites using ANY of the above attributes (besides placeholder and required)
  • For sites which have a document event listener triggering on a scroll (touchstart, wheel, etc), how many are using passive event listeners
  • % of sites that send more JS than the size of the viewport ("Web Bloat Score") per pageload
  • number/fraction of sites specifying a webapp manifest
  • number of sites registering a Service Worker
  • cumulative layout shift

[1] The best way to both analyze and display these pieces of data is through a frequency distribution graph. With this we can find out both how big of an issue this tends to be for the average site and what the global trends are.

👉 AI (@slightlyoff @OBTo): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (@slightlyoff @OBTo): Finalize which metrics you might like to include in an annual "state of mobile web" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the mobile web landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 14. CMS

Section Chapter Author Reviewers
III. Content Publishing 14. CMS @amedina @westonruter @mor10 @sirjonathan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

Section | Metric description

  • What are the top CMSs
    -- There are studies and reports classifying CMSes according to market share
    -- The WordPress community commonly cites W3Techs
    -- It would be interesting to validate such claims with HTTPArchive/CrUX data
    -- That is: would the sample space represented by these datasets correlate to the reported market shares elsewhere?

  • AMP adoption: number of WordPress-powered pages using the AMP plugin for WordPress
    -- Version of the plugin
    -- Number of AMP pages using each of the different template modes (reader/classic, transitional/paired, native).
    -- Suggestion: WordPress.com enables AMP by default, so it would be interesting to see how many sites have disabled it in addition to how many self-hosted sites have enabled it. Not sure if this is possible, but it would also help
    -- The AMP plugin for WordPress generates the following meta tag:
    <meta name="generator" content="AMP Plugin v1.1.2; mode=native">

  • Coupled vs. Decoupled CMS use: Headless CMSes
    -- There is a "trend" of using some CMSes in headless mode; it would be interesting to capture the prevalence of such uses
    -- Measuring this is not easily doable, but we would like to keep this metric and analyze it in terms of the metrics obtained for regular (i.e. non-headless) CMS usage

  • Device Distribution
    -- With so much device fragmentation and the impact on performance of using low-end devices, it would be good to know where content powered by different CMSes is being accessed from.
    -- Comparison against non-CMS cases would also shed light on demographics, geography (together with device usage per region)

  • Connection distribution
    -- Connection types

  • HTTPArchive/CrUX Metrics: We should capture a view of the ecosystem in terms of usability metrics
    -- Is it happening? Has the navigation started successfully? Has the server started responding? Metrics: First Paint, TTFB (HTTPArchive/CrUX metrics)
    -- Is it useful? When you've painted text, an image, or content that allows the user to derive value from the experience and engage with it. Metrics: First Contentful Paint, First Meaningful Paint, Speed Index (HTTPArchive/CrUX metrics)
    -- Is it usable? When a user can start meaningfully interacting with the experience and have something happen (e.g. tapping a button). This can be critical, as users can get disappointed if they try using UI that looks ready but isn't. Metrics: Time to Interactive (lab), First CPU Idle, First Input Delay (field)
    -- Is it delightful? Delightfulness is about ensuring the performance of the user experience remains consistent after page load. Can you scroll smoothly without jank? Are animations smooth and running at 60fps? Do other long tasks block any of these from happening?

Check the brainstorming doc to explore ideas.

These metrics would paint a holistic, data-driven picture of the CMS landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Generate the Contributors page

The Contributors page will be generated based on a JSON file with all of the contributor metadata. Required info for all contributors:

  • full name
  • list of teams contributed to

Not required but super nice to have info:

  • avatar URL (Gravatar?)
  • short personal tagline ("Engineer at Company", "Web dev thoughtleader", etc)
  • GitHub profile
  • Twitter profile

TODO: @HTTPArchive/developers

  • contributors have verified their information is correct in the Contributor sheet
  • create a src/config/contributors.json file to organize all contributor metadata
  • contributors have submitted a PR to update their metadata as needed
  • generate the contributors.html template based on the JSON metadata
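
A hedged sketch of the generation step, assuming a src/config/contributors.json shaped like the required/optional fields above and a Jinja template named contributors.html; the field names and paths are illustrative, not the final format.

# Sketch: render contributors.html from src/config/contributors.json.
# The JSON field names (name, teams, avatar_url, github, twitter) are assumptions.
import json
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/2019/contributors')
def contributors():
    with open('config/contributors.json') as f:
        people = json.load(f)
    # Sort alphabetically so the template can simply loop over the list.
    people.sort(key=lambda person: person['name'].lower())
    return render_template('2019/contributors.html', contributors=people)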

Finalize assignments: Chapter 4. Media

Section Chapter Coauthors Reviewers
I. Page Content 4. Media @dougsillars @colinbendell @Yonet @ahmadawais @kornelski

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Image formats
    • Lighthouse data on responsiveness, format, quality, lazy loading
    • adoption of newer image formats like WebP
    • SVG
      • Inline versus external sources (from css or otherwise)
      • comments volume v. total bytes
      • SVGO comparison
    • Microdata usage (og:image, twitter:image, etc)
    • Use of <source sizes>
    • Preloader effectiveness (initiator Source: javascript, css, vanilla-html)
    • Fallback image support for legacy devices that don't support <picture> or <srcset>
    • Accept-CH in <meta> vs http
    • Photographic v. illustration score per pixel
    • Bytes per pixel for photographic
    • Use of Vary (Either User-Agent or Accept)
    • A11y: Support for Alt tags
    • TCP/TLS connection time delay (use of preconnect for cross origin hosts)
    • inlined / base64 image content
  • Video formats
    • MP4 sizes, streaming info
    • how many pages are self-serving video (not YouTube)
    • JS player size,
    • container options (mp4;hevc vs. mp4;avc1 vs. webm;vp9 vs. webm;vp8)
    • Use of posters, autoplay, fallback image
    • A11y: Support for description or fig
  • Hero media
    • how many pages include a large "hero" graphic above the fold?
    • Same or different microdata hero images
    • Orientation and pixel volume of hero images
    • Hero video usage
  • Emerging media

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web media" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web image/video landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Generate an ebook

@HTTPArchive/developers I'm curious to hear thoughts from others about this; it might be crazy.

I'd like to see the entire contents of the Almanac on a single web page, formatted similarly to a book. It would also have a print stylesheet to handle things like page breaks and page numbers, so one could print to PDF and it'd just work™️ as a fully formed e-book. It'd also be a PWA that could be added to the home screen and read offline.

There are concerns like lazy loading, history state management, deep linking, etc but I think these are all solvable problems.

I'm excited about this idea because a report on the state of the web should ideally maximize the web's capabilities for a great UX.

WDYT?

Requirements (edit by @mikegeyser):

Structure:

  • Table of contents
  • Page numbers
  • Header/footer in the margins
  • Cover page
  • Methodology section
  • Contributors section

Rendering:

  • Set page metadata correctly (title etc.)
  • Mirror page margins.
  • Solve the problem of urls when printed.
  • Table wrappers aren't rendering properly
  • Problem with 'that z-index figure' in the CSS chapter
  • Weird black chart in the markup chapter
  • Prevent breaking a figure over multiple pages (separating the caption from the image)
  • Figure out what to do with tables that span multiple pages (particularly font chapter)
  • Figure out what to do with charts that are too big for a page
  • Handle internal links: in-chapter references that are repeated (e.g. #fig1, #conclusion).
  • Handle internal links: cross-chapter references.
  • Handle internal links: author links at the top.
  • Author Avatars missing in PDF.
  • Internationalise book name (e.g. ebook-en.pdf)
  • Internationalise CSS content (e.g. title, pg...etc.) - maybe with inline <style> tags in ebook template?
  • ToC page numbers are misaligned in the Japanese PDF.

Tooling:

  • Try to come up with a solution that doesn't need the HTML to be served from Flask
  • Integrate weasyprint into the generate script
  • Make the rendering config more dynamic
  • Currently we have to call weasyprint once per year and language; script this (see the sketch below).

Please feel free to add any more, and we can see if they're feasible. :)
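
For the tooling items above, a minimal sketch of scripting WeasyPrint once per year and language; the ebook route, local server address, and output naming are assumptions, and in practice the HTML could come from a pre-rendered static file rather than a running Flask server.

# Sketch: generate one PDF per year and language with WeasyPrint.
# The ebook URL and the output path/naming are assumptions for illustration.
from weasyprint import HTML

YEARS = ['2019']
LANGUAGES = ['en', 'ja']

for year in YEARS:
    for lang in LANGUAGES:
        # Could equally point at a pre-rendered HTML file instead of the dev server.
        HTML(f'http://127.0.0.1:8080/{lang}/{year}/ebook').write_pdf(
            f'static/pdfs/web_almanac_{year}_{lang}.pdf')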

Relative paths and the readme

Should the repro instructions from the src/readme be in the main readme? If not, would it make sense to change /src/README.md to include instructions relative to the project root?

pip install -r requirements.txt → pip install -r src/requirements.txt
python main.py → python src/main.py

Not sure if this is an issue for many people

Design the home page

  • list all of the information that will need to be included on the home page
  • design the UX of the home page, establishing the identity and style of the Almanac website

I'll get started on the first one to unblock the designer who will work on the second one.

Define and categorize metrics

Refer to the Content Brainstorm doc for the latest draft

  • define a list of high-level "sections" (eg content, experience, distribution, publishing, etc)
  • define a list of mid-level "chapters" (eg for content: JS, CSS, img, etc)
  • define a list of low-level metrics (eg for JS: bytes, bootup time, libraries, etc)

Finalize assignments: Chapter 18. Page weight

Section Chapter Author Reviewers
IV. Content Distribution 18. Page weight @khempenius @henrisgh @tammyeverts @paulcalvano @flowlabs

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Distribution of resource size: (p10, p25, p50, p75, p90) x (total, JS, CSS, HTML, Fonts). In addition, how this has changed over the past year.
  • Distribution of resource quantity: (p10, p25, p50, p75, p90) x (total, JS, CSS, HTML, Fonts). In addition, how this has changed over the past year and since the release of H2.
  • Very Optional: H2's impact on resource quantity.
    • Determine which sites serve the majority of their first-party content using H2. (This % alone would be interesting. I also wonder if there would be a significant difference between using .5 and .9 as the threshold.)
    • For those sites using H2, look at how CSS & JS resource quantities varied before and after H2 adoption.

👉 AI (@khempenius): Finalize which metrics you might like to include in an annual "state of page weights" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

👉 Optional AI (@khempenius): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

The metrics should paint a holistic, data-driven picture of the page weight landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 3. Markup

Section Chapter Author Reviewers
I. Page Content 3. Markup @bkardell @zcorpan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Deprecated elements
  • Popular elements
  • Custom elements ("slang")
  • Attribute usage (stretch goal)
  • count of shadowRoots

👉 AI (@bkardell): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (@bkardell): Finalize which metrics you might like to include in an annual "state of markup" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the markup landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Write queries and add to the repo

When the Analyst team generates queries for each metric, they should create a PR to merge them into the repo. This has two benefits: the PR process provides an opportunity for peer review, and the repo becomes the place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure so readers can see exactly how each result was calculated and fork it for their own analysis.

  • create a new directory system to organize queries (@KJLarson)
  • test queries (analysts)
  • file a PR to merge the queries into their respective directory (analysts)

For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.

Query guidelines:

  • must specify #standardSQL on the first line and use Standard SQL
  • must include a short description of the metric it's analyzing, eg:
# Percentage of requests that are third party requests
# broken down by third party category by resource type.
  • must query the 2019_07_01 dataset (unless otherwise needed)
  • must be reasonably optimized where possible
  • file must be named according to its metric ID, eg 05_03.sql
  • file must be placed in the directory according to its chapter, eg 05_ThirdParties/05_03.sql
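
As an illustration only, here is roughly what testing a guideline-style query from Python against the small almanac sample dataset could look like; the metric shown is a made-up example (not one of the chapter metrics), and the sample table name follows the convention from the sample-dataset issue above.

# Sketch: test a guideline-style query against the small almanac sample dataset
# before running it against the full 2019_07_01 tables. The metric is illustrative
# only; real queries live as .sql files named by metric ID (eg 05_03.sql).
from google.cloud import bigquery

QUERY = """
#standardSQL
# Percentage of requests served over HTTPS (illustrative example metric).
SELECT
  ROUND(COUNTIF(STARTS_WITH(url, 'https://')) * 100 / COUNT(0), 2) AS pct_https
FROM
  `httparchive.almanac_sample.summary_requests_desktop`
"""

client = bigquery.Client(project='httparchive')
for row in client.query(QUERY).result():
    print(row.pct_https)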

Finalize assignments: Chapter 7. Performance

Section Chapter Coauthors Reviewers
II. User Experience 7. Performance @rviscomi @zeman @JMPerez @OBTo @sergeychernyshev

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Field (Chrome UX Report)
    • global distribution of FCP fast/avg/slow
    • global distribution of FID fast/avg/slow
    • % of fast FCP websites
    • % of fast FID websites
    • % of fast FCP+FID websites, per PSI definition
    • % of websites with offline experiences
    • country/region comparison of any/all of the above
    • mobile vs desktop comparison of any/all of the above
    • ECT comparison of any/all of the above
  • Lab (HTTP Archive)
    • Hero times
      • first/last painted hero
      • H1 rendering time
      • Largest Image
      • Largest Background Image
    • Visually Complete
    • First CPU Idle
    • Time To Interactive
    • Blocking CSS requests
    • Blocking JS request
    • Time To First Byte (Backend)
    • Scripting CPU time
    • Layout CPU time
    • Paint CPU time
    • Loading CPU time
    • Lighthouse Performance Score

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web performance" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web perf landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Add ja-JP templates

We should test out the i18n routing by creating a Japanese translation of the "coming soon" home page and serving it from https://almanac.httparchive.org/ja-JP/.

  • translate the coming soon page to Japanese
  • route /ja-JP/ to the Japanese index template (#43)

@MSakamaki can you manually translate the English version and add it to a new ja-JP template directory?

When that's done I can help with the next two items.

cc @HTTPArchive/translators

Finalize assignments: Chapter 10. SEO

Section Chapter Authors Reviewers
II. User Experience 10. SEO @rachellcostello @ymschaap @AVGP @clarkeclark @andylimn @voltek62

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Structured data rich results eligibility (ratings, search, etc.)
  • Lang attribute usage and mistakes (lang='en')
  • <link> rel="amphtml" (AMP)
  • <link> hreflang="en-us" (localisation usage)
  • Breakdown of type of structured data served (ld+json, microformatting, schema.org + what @type)?
  • Indexability - looking at meta tags like <meta> noindex, <link> canonicals.
  • <meta> description + <title> (presence & length)
  • Status codes and whether pages are accessible - 200, 3xx, 4xx, 5xx.
  • Content - looking at word count, thin pages, header usage, alt attributes on images
  • Linking - extract <a href> count per page (internal + external)
  • Linking - fragment URLs (together with SPAs to navigate content)
  • robots.txt (It is mentioned in Lighthouse; can we parse the content or only confirm its existence? E.g. check if it has a sitemap reference - it seems it does list the potential issues)
  • If the desktop site is responsive/mobile-ready, or a specific mobile site (redirect, UA)? (Can we find if these are different sites?)
  • Descriptive link text usage (available in Lighthouse data)
  • speed metrics (FCP, server response time)

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of SEO" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the SEO landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 6. Fonts

Section Chapter Coauthors Reviewers
I. Page Content 6. Fonts @davelab6 @zachleat @HyperPress @AymenLoukil

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Local vs hosted
  • Popular hosts
  • Font formats
  • Font-Display usage
  • Variable fonts (see below)
    • Latency gains on existing families
    • New modes of typographic expression
    • New ways to make quality text typography
  • How many fonts are loaded, but also how many typefaces (families) are used
  • Related, group by weight/style: how many people use italics? Those are often left off
  • Font formats (how many people are still using the bulletproof @font-face syntax? WOFF2 use specifically)
  • Icon fonts (not sure how to measure this, might show up if we measure popular families?)
  • CSS Font Loading API use?
  • unicode-range use (and range size, perhaps to glean some info on subsetting)
  • uses preconnect for web font cdn? popular preconnect domains?
  • +1 to preload, as Paul said
  • Use of local() in src

Variable fonts:

  • Latency gains on existing families
  • how many pages in the HTTPArchive link to a variable font via @font-face?
    • what percent of total pages use VFs?
    • what is the % growth over some time period (3, 6, 12 months)?
  • of those pages linking to a VF, how many are using the 4 font selectors that select on a variable font family?
  • how many pages link to a VF, but never actually use it?
  • how many pages link to a VF, but never use it beyond old CSS3 values?
  • how many pages use new CSS4 values, like font-weight: 555 and not font-weight: 500?
  • how many pages use @supports to screen for variations capable browsers?
  • is font-stretch usage growing?
  • how often is font-size selecting within opsz axis ranges?
  • which axes are most commonly used today? "top 10 axes"?
  • which axes are used 6-20pt, and which are used 20pt+?
  • which axes are used in concert?

Others:

Top Fonts:

  • top fonts globally
  • top fonts per provider - Google Fonts, AdobeFonts/TypeKit, Cloud.Typography, FontStand, etc
  • top self-hosted fonts
  • what is the bar chart of the number of custom fonts per page?
  • which page uses the most fonts?

Formats:

  • is SVG going away?
  • is EOT going away?
  • is raw TTF going away?
  • how many pages do only WOFF and WOFF2?
  • how many pages do only WOFF?
  • how many pages do only WOFF2?
  • how many pages use color fonts?
  • how many pages use fonts with each of the 4 (SBIX, CBDT, CPAL, SVG) color font formats?

Optimizations:

  • how many pages use each of the font-display properties?
  • how many pages use each of the font preloading properties?
  • how many pages place a single Google Fonts <link> element within <head>?
  • how many pages place a single Google Fonts <link> element as the very first element within <head>?

👉 Optional AI (coauthors): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web fonts" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web fonts landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Review all chapter metrics by analysts

AI (@HTTPArchive/data-analysts): After June 3, when the metrics are finalized, they need to be reviewed to ensure that it's clear what the authors are looking for and that their questions are answerable using the HTTP Archive dataset.

Ensure all contributors have accepted team invitations

We have GitHub teams for each of the 4 roles: authors, reviewers, analysts, and developers. It's important that everyone who wants to contribute in these areas is on the team so they can be notified all at once using @[team] as needed.

All teams should be set up so that you can manually request to join. I should have also sent individual invitations to everyone who has expressed interest in a team. (Some people just got theirs as I was making this list!) Invites are sent to your GH email, but if you didn't get it you can accept by going to the organization page: https://github.com/HTTPArchive

If your name is below, please accept your invitation! As a bonus, all teams have write access for this repo so you can check your own name off when you've accepted!

Authors

Reviewers

Analysts

Developers

Coordinating a project with 64+ contributors will be challenging. Thank you for helping to make it easier!

Create templates for content pages

We don't have chapters written yet but now would be a good time to create the infrastructure for them so we can just drop in the content when they're ready.

I'm working with a designer to create the look and feel, which should be ready by July. Meanwhile we can get started on the backend routing and templating in preparation.

We should create the following templates:

  • rename the splash page at index.html to splash.html and update main.py
  • create a new index.html for the post-launch home page, stub things like navigation and introductory text
  • create /2019/outline.html where we show a table of contents for all chapters
  • create /2019/chapter.html as a template for each chapter
  • create /2019/methodology.html to provide technical info about our test methodology
  • create /2019/contributors.html to acknowledge everyone who contributed to the project

The Almanac is an annual report and I expect to do this again next year, so we should be proactive and separate the annual content from the static content. In this case, the home page is static to the project while things like table of contents, chapters, methodology, and contributors may change year to year. These should all be organized in a 2019 template directory.
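To make the routing side concrete, here's a rough sketch of what the year-scoped routes in main.py could look like (assuming a Flask app for main.py; the route and template names follow the list above, everything else is illustrative):

```python
# Sketch only: year-scoped routing so 2019-specific templates live under a
# templates/2019/ directory, while the home page stays static to the project.
# Route and template names mirror the list above; everything else is illustrative.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def home():
    # Post-launch home page (stubbed navigation and introductory text).
    return render_template('index.html')

@app.route('/<int:year>/')
def outline(year):
    # Table of contents for all chapters in a given edition.
    return render_template('%d/outline.html' % year, year=year)

@app.route('/<int:year>/methodology')
def methodology(year):
    return render_template('%d/methodology.html' % year, year=year)

@app.route('/<int:year>/contributors')
def contributors(year):
    return render_template('%d/contributors.html' % year, year=year)

@app.route('/<int:year>/<chapter>')
def chapter(year, chapter):
    # One shared chapter template; the chapter slug selects the content to drop in.
    return render_template('%d/chapter.html' % year, year=year, chapter=chapter)

if __name__ == '__main__':
    app.run(debug=True)
```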

The contributors page will be based on the data in this sheet. It might be a good idea to create a JSON metadata file of all contributors and generate this page dynamically. This way we can also dynamically add authors/reviewers on the bylines of their respective chapters.
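For the JSON idea, here's a rough sketch (the contributors.json name and its fields are placeholders, not a decided schema):

```python
# Sketch only: build the contributors page and chapter bylines from one JSON
# metadata file. The contributors.json name and its fields are placeholders.
import json

def load_contributors(path='config/contributors.json'):
    # e.g. {"rviscomi": {"name": "Rick Viscomi", "teams": ["authors", "developers"]}}
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def byline(contributors, chapter):
    """chapter: e.g. {"title": "Performance", "authors": ["rviscomi"], "reviewers": [...]}."""
    return {
        'authors': [contributors[github]['name'] for github in chapter['authors']],
        'reviewers': [contributors[github]['name'] for github in chapter['reviewers']],
    }
```

The same data could then feed both the contributors template and each chapter's byline.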

Any volunteers from @HTTPArchive/developers who can take part or all of this issue?

Form a team of translators

The Almanac will be published in English, but we want to make sure that everyone can access and understand it.

I'd like to put together a team of translators to help make the Almanac available in as many languages as possible, prioritizing those with the largest developer communities. If you're interested, can you reply to this issue with the languages you're available to contribute?

The entire Almanac website would need to be translated, including: home page, methodology page, contributors page, and all chapters. If there are multiple translators for a language we can split up the work.

According to the timeline, if all goes to schedule, we'll have about 8 weeks after the chapters are completed to get them translated before launch. I'm open to making some major languages "launch blockers" if there's a good case and available resources, otherwise we can commit to translating after launch. All content in all languages will be available on GitHub, so at any time anyone could submit a PR with entire translations or specific translation fixes.

Priority | Language | Primary Translator | Secondary
0 | Japanese | @MSakamaki | (none)
0 | Spanish | @c-torres | @taytus @JMPerez
0 | Russian | @Pavel-Evdokimov | (none)
1 | French | @AymenLoukil | (none)
1 | Portuguese | @ibrahimcesar | (none)
1 | German | @Awesomecloud | (none)
2 | Dutch | (none) | (none)
2 | Italian | @performize | @realjoker
2 | Polish | (none) | (none)

Reference sheet

@HTTPArchive/translators

Finalize assignments: Chapter 5. Third parties

Section | Chapter | Author | Reviewers
I. Page Content | 5. Third parties | @patrickhulce | @simonhearne @flowlabs @jasti @zeman

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • Percentage of pages that include at least one third-party resource.
  • Percentage of pages that include at least one ad resource.
  • Percentage of requests that are third party requests broken down by third party category by resource type.
  • Percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Percentage of total script execution time that is from third party scripts broken down by third party category.
  • Median page-relative percentage of requests that are third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total bytes that are from third party requests broken down by third party category by resource type.
  • Median page-relative percentage of total script execution time that is from third party scripts broken down by third party category.
  • Top 100 third party domains by request volume
  • Top 100 third party domains by total byte weight
  • Top 100 third party domains by total script execution time
  • Top 100 third party requests by request volume
  • Top 100 third party requests by total script execution time

👉 AI (@patrickhulce): Finalize which metrics you might like to include in an annual "state of third parties" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the third party landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 17. CDN

Section Chapter Authors Reviewers
IV. Content Distribution 17. CDN @andydavies @colinbendell @yoavweiss @paulcalvano @pmeenan

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • What are the top CDNs (by number of sites using them, rather than by number of requests)?

  • What % of sites use a CDN

  • % of sites that use a CDN for primary domain i.e. www

  • % of sites that use a CDN for secondary domains, e.g. static., media.

  • Usage of 3rd-party public CDNs, e.g. jQuery, apis.google, etc.

  • CDN TTFB

  • HTTP things (not necessarily CDN-related): header volume, STS, Timing-Allow-Origin, Via, Keep-Alive, Server-Timing metrics/presence, Vary, Content-Disposition, etc.

  • TLS negotiation time

  • TLS Certificate size

  • OCSP stapling support

  • DNS vs. anycast IP use

  • CWND growth rate (not sure this will be measurable)

  • TLS connection coalescing with H2 connections

  • Number of CDNs used per page

  • H2 push?

  • Do HTTPS sites use HTTP/1.1 or HTTP/2?

  • Use of CDN header directives (s-maxage, stale-while-revalidate, nopush, stale-if-error, pre-check, and Surrogate-Control); see the sketch after this list

  • How have these patterns changed over the last year / two years
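For the header-directives item above, here's a rough sketch of how a response's headers could be scanned (the directive list is taken from that item; the parsing itself is illustrative, not the analysts' actual query):

```python
# Sketch only: flag CDN-relevant directives in a response's caching headers.
# The directive list comes from the metric above; the parsing is illustrative.
CDN_DIRECTIVES = {'s-maxage', 'stale-while-revalidate', 'nopush', 'stale-if-error', 'pre-check'}

def cdn_directives(headers):
    """headers: dict of lower-cased header name -> value."""
    found = set()
    cache_control = headers.get('cache-control', '')
    for token in cache_control.split(','):
        name = token.split('=', 1)[0].strip().lower()
        if name in CDN_DIRECTIVES:
            found.add(name)
    if 'surrogate-control' in headers:
        found.add('surrogate-control')
    return found

print(cdn_directives({
    'cache-control': 'public, max-age=600, s-maxage=3600, stale-while-revalidate=30',
}))
# -> {'s-maxage', 'stale-while-revalidate'} (order may vary)
```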

👉 AI (reviewers): Finalize which metrics you might like to include in an annual "state of CDNs" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to the subject matter experts to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the CDN landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 16. Caching

Section | Chapter | Author | Reviewers
IV. Content Distribution | 16. Caching | @paulcalvano | @yoavweiss @colinbendell

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • TTL by resource
  • Resources served without cache
  • Cache strategy?
  • Cache TTL vs Content Age
  • Availability of Last-Modified vs. ETag validators
  • Validity of Dates in Last-Modified and Date headers
  • Set-Cookie on cacheable responses?
  • Use of Cache-Control: max-age vs. Expires
  • Use of Vary (how many dimensions, what headers, etc.)
  • Use of other Cache-Control directives (e.g., public, private, immutable)
  • 1st Party vs 3rd Party Caching
  • Public vs Private
  • Use of must-revalidate
  • Service Worker caching
  • AppCache
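For the TTL and validator metrics above, here's a rough sketch of how a single response's headers might be interpreted (illustrative only; the actual analysis will be done by analysts over the HTTP Archive requests tables in BigQuery):

```python
# Sketch only: derive an effective TTL for a response from Cache-Control max-age,
# falling back to Expires minus Date, and note which validators are present.
from email.utils import parsedate_to_datetime

def effective_ttl(headers):
    """headers: dict of lower-cased header name -> value. Returns TTL in seconds or None."""
    cache_control = headers.get('cache-control', '')
    for token in cache_control.split(','):
        name, _, value = token.strip().partition('=')
        if name.lower() == 'max-age' and value.isdigit():
            return int(value)
    if 'expires' in headers and 'date' in headers:
        try:
            expires = parsedate_to_datetime(headers['expires'])
            date = parsedate_to_datetime(headers['date'])
            return int((expires - date).total_seconds())
        except (TypeError, ValueError):
            return None
    return None

def validators(headers):
    return {'last-modified': 'last-modified' in headers, 'etag': 'etag' in headers}

headers = {
    'cache-control': 'public, max-age=86400',
    'last-modified': 'Tue, 04 Jun 2019 10:00:00 GMT',
}
print(effective_ttl(headers), validators(headers))
# -> 86400 {'last-modified': True, 'etag': False}
```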

👉 Optional AI (@paulcalvano): Peer reviewers are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have multiple reviewers who can promote a diversity of perspectives. You currently have 1 peer reviewer.

👉 AI (@paulcalvano): Finalize which metrics you might like to include in an annual "state of caching" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the caching landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Form a team of web designers and developers

We're looking for web designers and developers to help build the UX of the Almanac itself. It will be a static website home to each annual report and read like an ebook to help users navigate the various sections and chapters.

Designers should have the bandwidth throughout July and August to conceptualize the UX of the website and the data visualizations used by the chapters. This is a big burden so we're open to contracting this out. See this Upwork post for the full details of the design work.

Update: We've hired contractors to do the illustration and design work. You can join the discussion in the #web-almanac-design Slack channel.

Developers will be responsible for implementing the designers' vision and merging it with the authors' written content, while following accessibility and SEO best practices.

Join the team: @HTTPArchive/developers
See open issues: Development label

Licensing and attribution

In the interest of openness and transparency, we want the Almanac to be free, shareable, and extensible. At the same time, authors deserve attribution for their content. And everyone else who has helped build this report (now over 50 people!) should get some kind of recognition.

It's my intent to give credit to everyone who participates in the project in any form on a "Contributors" page. Each chapter will also name their respective authors and reviewers.

We should also provide an easy way to grab quotes from anywhere in the Almanac and see exactly how we expect them to be attributed. For example, if I write something in the Performance chapter that gets quoted, maybe we should annotate it with "Rick Viscomi, 2019 Web Almanac (II.6)" or similar.

What would not be ok is if someone scrapes all of the content and sells it. Our license should have protections against that sort of thing. This repo is marked as Apache 2.0 which permits commercial use, but that should apply only to our source code not the authored content.

Does anyone have experience with this kind of thing? Any ideas for protecting our work while making it as open as possible?

Authoring in markdown

There was some earlier discussion about allowing for authoring the chapter content using markdown. Would this still be something worth investigating as a spike task?

The benefits are that it seems easier to write in markdown than html, and that gitlocalize seems to have better support for markdown.

The challenges are that we will likely want some rich data visualisation in the chapters, which may not be simple to achieve using markdown.
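As a starting point for the spike, here's a rough sketch of the Markdown-to-template step (assuming the python-markdown package and a Jinja2 chapter template; none of this is decided, and rich visualisations would still need embedded HTML or shortcodes inside the Markdown):

```python
# Sketch only: render a Markdown chapter into the chapter template.
# Assumes the python-markdown package and a Jinja2 template named chapter.html.
import markdown
from jinja2 import Environment, FileSystemLoader

def render_chapter(markdown_path, template_dir='templates/2019'):
    with open(markdown_path, encoding='utf-8') as f:
        body_html = markdown.markdown(f.read(), extensions=['tables', 'toc'])
    env = Environment(loader=FileSystemLoader(template_dir))
    return env.get_template('chapter.html').render(body=body_html)
```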

Finalize assignments: Chapter 15. Compression

Section | Chapter | Author | Reviewers
IV. Content Distribution | 15. Compression | @paulcalvano | @yoavweiss @colinbendell

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Finalize peer reviewers
  • Finalize metrics

Current list of metrics:

  • What compression formats are being used (gzip, brotli, etc)
  • Is there anything we can tell from the level of compression
  • Are there missed opportunities for compressing resources
  • Compression by Content Type
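As a rough illustration of the format tally and the missed-opportunities metrics, here's a sketch; the compressible-type list and size threshold are assumptions, not agreed definitions:

```python
# Sketch only: tally compression formats from Content-Encoding and flag text
# responses served uncompressed as missed opportunities. Thresholds and the
# "compressible" MIME test are assumptions for illustration.
from collections import Counter

COMPRESSIBLE_PREFIXES = ('text/', 'application/javascript', 'application/json', 'image/svg+xml')

def summarize(responses):
    """responses: iterable of dicts with 'content_type', 'content_encoding', 'body_bytes'."""
    formats = Counter()
    missed = 0
    for r in responses:
        encoding = (r.get('content_encoding') or 'none').lower()
        formats[encoding] += 1
        compressible = (r.get('content_type') or '').startswith(COMPRESSIBLE_PREFIXES)
        if compressible and encoding == 'none' and r.get('body_bytes', 0) > 1400:
            missed += 1
    return formats, missed

responses = [
    {'content_type': 'text/html', 'content_encoding': 'gzip', 'body_bytes': 20000},
    {'content_type': 'application/javascript', 'content_encoding': None, 'body_bytes': 50000},
    {'content_type': 'image/png', 'content_encoding': None, 'body_bytes': 80000},
]
print(summarize(responses))  # -> (Counter({'none': 2, 'gzip': 1}), 1)
```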

👉 AI (@paulcalvano): Finalize which metrics you might like to include in an annual "state of compression" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the compression landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Assign subject matter experts and peer reviewers to each chapter

Part Chapter Authors Reviewers Tracking Issue
I. Page Content 1. JavaScript @addyosmani @housseindjirdeh @mathiasbynens @rwaldron @RReverser #3
I. Page Content 2. CSS @una @argyleink @meyerweb @huijing #4
I. Page Content 3. Markup @bkardell @zcorpan #5
I. Page Content 4. Media @dougsillars @colinbendell @Yonet @ahmadawais @kornelski #6
I. Page Content 5. Third Parties @patrickhulce @simonhearne @flowlabs @jasti @zeman #8
I. Page Content 6. Fonts @davelab6 @zachleat @HyperPress @AymenLoukil #7
II. User Experience 7. Performance @rviscomi @zeman @JMPerez @OBTo @sergeychernyshev #9
II. User Experience 8. Security @arturjanc @ScottHelme @paulcalvano @bazzadp @ghedo @ndrnmnn #10
II. User Experience 9. Accessibility Nektarios Paisios, @nadinarama, @OBTo @rachellcostello, @kleinab #11
II. User Experience 10. SEO @rachellcostello @ymschaap @AVGP @clarkeclark @andylimn @voltek62 #12
II. User Experience 11. PWA @tomayac @jeffposnick @HyperPress @ahmadawais #13
II. User Experience 12. Mobile web @slightlyoff @OBTo @HyperPress @AymenLoukil #14
III. Content Publishing 13. Ecommerce @samdutton @alankent @voltek62 @wizardlyhel #15
III. Content Publishing 14. CMS @amedina @westonruter @mor10 @sirjonathan #16
IV. Content Distribution 15. Compression @paulcalvano @yoavweiss @colinbendell #17
IV. Content Distribution 16. Caching @paulcalvano @yoavweiss @colinbendell #18
IV. Content Distribution 17. CDN @andydavies @colinbendell @yoavweiss @paulcalvano @pmeenan #19
IV. Content Distribution 18. Page Weight @khempenius @henrisGH @tammyeverts @paulcalvano @flowlabs #20
IV. Content Distribution 19. Resource Hints @khempenius @yoavweiss @andydavies @addyosmani #21
IV. Content Distribution 20. HTTP/2 @bazzadp @bagder @rmarx @dotjs #22

Reference sheet

For more context about the Almanac project and how you can help, see this post.

Finalize assignments: Chapter 8. Security

Section Chapter Authors Reviewers
II. User Experience 8. Security @arturjanc @ScottHelme @paulcalvano @bazzadp @ghedo @ndrnmnn

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter experts (coauthors)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

TLS 🔒

  • Protocol Usage
    • SSLv2 / SSLv3 / TLSv1.0 / TLSv1.1 / TLSv1.2 / TLSv1.3
  • Unique CA issuers
  • RSA certificates
  • ECDSA certificates
  • Certificate validation level (DV / OV / EV)
  • Cipher suite usage
    • Suites supporting Forward Secrecy (ECDHE / DHE)
    • Authenticated suites (GCM / CCM)
    • Modern suites (AES GCM, ChaCha20-Poly1305)
    • Legacy suites (AES CBC, 3DES, RC4)
  • OCSP Stapling
  • Session ID/Ticket assignment
  • Sites redirecting to HTTPS
  • Sites with degraded HTTPS UI (mixed-content)

Security Headers 📋

  • Content Security Policy
    • Policies with frame-ancestors
    • Policies with 'nonce-*'
    • Policies with 'hash-*'
    • Policies with 'unsafe-inline'
    • Policies with 'unsafe-eval'
    • Policies with 'strict-dynamic'
    • Policies with 'trusted-types'
    • Policies with 'upgrade-insecure-requests'
  • HTTP Strict Transport Security
    • Variance in max-age
    • Use of includeSubDomains
    • Use of preload token
  • Network Error Logging
  • Report To
  • Referrer Policy
  • Feature Policy
  • X-Content-Type-Options
  • X-Xss-Protection
  • X-Frame-Options
  • Cross-Origin-Resource-Policy
  • Cross-Origin-Opener-Policy
  • Vary (Sec-Fetch-* values)

Cookies 🍪

  • Use of HttpOnly
  • Use of Secure
  • Use of SameSite
  • Use of prefixes

Other ❓

  • Use of SRI on subresources
  • Vulnerable JS libraries (lighthouse?)
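To illustrate how the cookie metrics above could be derived from a Set-Cookie header, here's a rough sketch (simplified parsing, for illustration only):

```python
# Sketch only: check a Set-Cookie header for the attributes listed above
# (HttpOnly, Secure, SameSite) and for the __Secure-/__Host- name prefixes.
def cookie_flags(set_cookie_value):
    name = set_cookie_value.split('=', 1)[0].strip()
    attrs = [part.strip().lower() for part in set_cookie_value.split(';')[1:]]
    return {
        'httponly': 'httponly' in attrs,
        'secure': 'secure' in attrs,
        'samesite': any(a.startswith('samesite=') for a in attrs),
        'prefix': name.startswith(('__Secure-', '__Host-')),
    }

print(cookie_flags('__Host-session=abc123; Path=/; Secure; HttpOnly; SameSite=Lax'))
# -> {'httponly': True, 'secure': True, 'samesite': True, 'prefix': True}
```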

👉 AI (coauthors): Assign peer reviewers. These are trusted experts who can support you when brainstorming metrics, interpreting results, and writing the report. Ideally this chapter will have 2 or more reviewers who can promote a diversity of perspectives.

👉 AI (coauthors): Finalize which metrics you might like to include in an annual "state of web security" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the web security landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Finalize assignments: Chapter 20. HTTP/2

Section Chapter Authors Reviewers
IV. Content Distribution 20. HTTP/2 @bazzadp @bagder @rmarx @dotjs

Due date: To help us stay on schedule, please complete the action items in this issue by June 3.

To do:

  • Assign subject matter expert (author)
  • Assign peer reviewers
  • Finalize metrics

Current list of metrics:

  • Adoption rate of HTTP/2 by site (home page only) and by requests (all request on page) over the years. Trend graph over all available years.
  • Measure of HTTP version negotiated (0.9, 1.0, 1.1, 2, gQUIC) for main page of all sites, and for HTTPS sites. Table for last crawl. For example:
Version | All sites | HTTPS-only sites
HTTP/0.9 | 0% | 0%
HTTP/1.0 | 2% | 0%
HTTP/1.1 | 48% | 20%
HTTP/2 | 44% | 70%
gQUIC | 6% | 10%

For gQUIC it will be sites that return an Alt-Svc HTTP header whose value starts with quic (see the sketch after this list).

  • Average percentage of resources loaded over HTTP/2 (or gQUIC) versus HTTP/1.1 per site. Trend graph over all available years.
  • Number of HTTP (not HTTPS) sites which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • Number of HTTPS sites not using HTTP/2 which return upgrade HTTP header containing h2. Once off stat for last crawl.
  • % of sites affected by CDN prioritization issues (H2 and served by CDN) - https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. If not possible then maybe just list sites by CDN and can then manually vlookup from table in Andy's github issue? Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by server HTTP header value but strip version numbers (e.g. Apache, Apache 2.4.28, and Apache 2.4.29 should all report as Apache, but Apache Tomcat should report as Tomcat; probably need to massage the results to achieve this). Once off stat for last crawl.
  • Count of non-HTTP/2 sites grouped by server HTTP header value but strip version numbers. Once off stat for last crawl.
  • Count of HTTP/2 sites which use HTTP/2 Push. Trend graph over all available years.
  • Average number of HTTP/2 Pushed resources and average bytes. Once off stat for last crawl.
  • Count and number of bytes pushed by asset type (CSS, JS, Images...etc.). Once off stat for last crawl.
  • Count of preload HTTP Headers with nopush attribute set. Once off stat for last crawl.
  • Is it possible to see HTTP/2 Pushed resources which are not used on the page load?
  • Measure number of TCP connections per site. Is the average number of domains per site still going down year on year, as per the HTTP Archive State of the Web report? Trend graph over all available years.
  • Measure average number of TCP Connections per site for HTTP/1.1 sites versus HTTP/2 sites. Once off stat for last crawl.
  • Count of HTTP/2 sites grouped by SETTINGS_MAX_CONCURRENT_STREAMS (including HTTP/2 sites which don't set this value). Note this was added recently as per #22 (comment). Once off stat for last crawl.
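To make two of the trickier items concrete, here's a rough sketch of the gQUIC detection heuristic and the server-name normalization described above (illustrative only; analysts will write the real queries):

```python
# Sketch only: two helpers for the metrics above. gQUIC detection follows the
# Alt-Svc heuristic described earlier; server-name normalization strips versions
# (Apache 2.4.28 -> Apache, Apache Tomcat -> Tomcat). Both are illustrative.
import re

def is_gquic(headers):
    """True if the response advertises gQUIC via an Alt-Svc header starting with quic."""
    return headers.get('alt-svc', '').lower().startswith('quic')

def normalize_server(server_value):
    """Collapse a Server header value to a product family without version numbers."""
    if not server_value:
        return '(none)'
    first = server_value.split()[0].split('/')[0]   # e.g. "Apache/2.4.29" -> "Apache"
    if 'tomcat' in server_value.lower():
        return 'Tomcat'
    return re.sub(r'[\d.]+$', '', first).strip('- ') or first

print(is_gquic({'alt-svc': 'quic=":443"; ma=2592000; v="46,43"'}))  # True
print(normalize_server('Apache/2.4.29 (Ubuntu)'))                    # Apache
print(normalize_server('Apache Tomcat'))                              # Tomcat
```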

👉 AI (@bazzadp): Finalize which metrics you might like to include in an annual "state of HTTP/2" report powered by HTTP Archive. Community contributors have initially sketched out a few ideas to get the ball rolling, but it's up to you, the subject matter experts, to know exactly which metrics we should be looking at. You can use the brainstorming doc to explore ideas.

The metrics should paint a holistic, data-driven picture of the HTTP/2 landscape. The HTTP Archive does have its limitations and blind spots, so if there are metrics out of scope it's still good to identify them now during the brainstorming phase. We can make a note of them in the final report so readers understand why they're not discussed and the HTTP Archive team can make an effort to improve our telemetry for next year's Almanac.

Next steps: Over the next couple of months analysts will write the queries and generate the results, then hand everything off to you to write up your interpretation of the data.

Additional resources:

Website architecture design

I have some questions/remarks:

  • Is there a document somewhere describing the target architecture of the Almanac website (information architecture, URL structure)?
  • I understood from reading the issues that we aim to have a static page, with each year's version under a dedicated sub-folder (/2019, /2020, etc.). IMHO, it would be better to have this setup:

Each year we put the content at the root (the home page contains the static content plus dynamic insights of the current edition), /outline for the current outline, and so on; when we publish a new edition, we archive the previous ones in subfolders.

Example:
2019 content goes in the main generic 'website'
2018 content goes in the /2018 folder
And when we publish the 2020 version, the main generic 'website' gets updated with 2020 data
and we archive 2019 in the /2019 folder, etc.

The pros of this method:

  • People who search for Almanac insights find them at the root of the website, which is simpler; it's trivial for them to find and go to the current version
  • The generic website will gain value over the years, while still allowing the older versions to remain accessible if someone explicitly looks for them
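To sketch how that routing could look (assuming a Flask app; CURRENT_YEAR and the redirect behaviour are illustrative, not a decision):

```python
# Sketch only: the current edition lives at the root; older editions are archived
# under /<year>/. CURRENT_YEAR and the redirect behaviour are illustrative.
from flask import Flask, render_template, redirect

app = Flask(__name__)
CURRENT_YEAR = 2019

@app.route('/')
def current_edition():
    return render_template('%d/index.html' % CURRENT_YEAR)

@app.route('/<int:year>/')
def archived_edition(year):
    if year == CURRENT_YEAR:
        return redirect('/')   # keep one canonical URL for the current edition
    return render_template('%d/index.html' % year)
```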

What do you think, @rviscomi?

Develop a translation workflow

The web server should accommodate multiple languages. See #29 for context.

To summarize the thread:

  • content will initially be created in English within the src/templates/en directory
  • each chapter will be written in Markdown per #59
  • translations of non-chapter contents will be done manually in a PR and saved to their respective directories organized by language code
  • translations of chapter contents based in Markdown will be done using the gitlocalize tool

Original comment for reference:

I think we should create a src/en/ directory and put all of the English-specific templates there by default, including our current splash page.

So the directory structure might look like this:

  • src
    • en
      • templates
        • base.html (English base template)
    • ja
      • templates
        • base.html (Japanese base template)

In src/main.py we should use the en directory by default, unless one is specified in the URL and it's a supported language code.
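A rough sketch of that default-to-English lookup (assuming Flask and the directory layout above; SUPPORTED_LANGUAGES and the route shapes are placeholders):

```python
# Sketch only: serve templates from the language directory named in the URL,
# falling back to en when no (or an unsupported) language code is given.
# SUPPORTED_LANGUAGES and the route shapes are assumptions for illustration.
from flask import Flask, render_template, abort

app = Flask(__name__, template_folder='templates')
SUPPORTED_LANGUAGES = {'en', 'ja', 'es', 'ru', 'fr'}

def localized(template, lang=None):
    lang = lang if lang in SUPPORTED_LANGUAGES else 'en'
    return render_template('%s/%s' % (lang, template), lang=lang)

@app.route('/')
@app.route('/<lang>/')
def home(lang=None):
    if lang is not None and lang not in SUPPORTED_LANGUAGES:
        abort(404)
    return localized('base.html', lang)
```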

We also need to start thinking about (but not necessarily implement at this stage) a way to switch languages in the UI.

@HTTPArchive/developers any takers for this issue? Any other implementation ideas?
