httparchive / legacy.httparchive.org

<<THIS REPOSITORY IS DEPRECATED>> The HTTP Archive provides information about website performance such as # of HTTP requests, use of gzip, and amount of JavaScript. This information is recorded over time revealing trends in how the Internet is performing. Built using Open Source software, the code and data are available to everyone allowing researchers large and small to work from a common base.

Home Page: https://legacy.httparchive.org

License: Other

PHP 57.85% JavaScript 37.47% CSS 4.09% Makefile 0.59%

legacy.httparchive.org's Introduction

The HTTP Archive tracks how the Web is built

!! Important: This repository is deprecated. Please see HTTPArchive/httparchive.org for the latest development !!

This repo contains the source code powering the HTTP Archive data collection.

What is the HTTP Archive?

Successful societies and institutions recognize the need to record their history - this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web's digitized content.

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

legacy.httparchive.org's Issues

Parsing of doctype not perfect

I hit a problem with the CSV - the doctype was ending with . Looking in the dump from 2014-06-01 it looks like the escaping is also affecting the parsing of the doctype.

Sample for http://www.teletalk.com.bd/

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN\"http://www.w3.org/TR/html4/frameset.dtd">

Not sure if that's a valid DTD, but the backslash looks like an attempt to escape the frameset DTD, and the parser is cutting it off at the next quotation mark.

Crawler seems to be taking too long

The crawls for mid-February and the still-incomplete mid-March took much longer (9 days) than normal (5 days). In general, crawls are taking longer and longer: in September 2013 it was less than 4 days for approximately the same number of URLs (305,000).

This analysis is based on the following query:

select 
    label,
    numUrls,
    date_format(from_unixtime(startedDateTime),
            '%Y-%m-%d') as start,
    date_format(from_unixtime(finishedDateTime),
            '%Y-%m-%d') as stop,
    (finishedDateTime - startedDateTime) / 3600 / 24 as days_taken
from
    crawls
order by finishedDateTime desc

Interesting Stats Suggestion: Average Bytes per Page by User-Observed Category

Hello,

My suggestion for the "Interesting Stats" page is a chart that tracks "Average Bytes per Page by User-Observed Category."

This chart would be visually and structurally similar to the pie chart named "Average Bytes per Page by Content Type," but the wedge categories would be tied to high-level content categories that the average non-technical visitor would relate to. Example categories might include Page Text Content (articles, etc.), Page Style & Structure (site logo, page layout, etc.), Advertising, and Page Image Content (non-advertising, article-specific images).
Given the difficulty of reliably identifying advertising images vs. content images, this may only be reasonable to provide in the Top 100 URLs category (or it may be totally unreasonable to provide at all), but it would still be very interesting to see the results.

In particular, the ratio of data downloaded by a page's visitors that is valued or sought by the visitor (i.e. content, page style, etc.) versus data downloaded to create value for the page operator (i.e. advertising, etc.) may be of interest. This ratio may be increasingly important information as more and more US-based ISPs move toward data-cap and fee-based models and advertisers move toward data-heavy advertising like auto-playing audio and video.

Stats about other CDNs.

It would be nice to see how much other CDNs are used to get an idea of how likely a user is to have an asset cached.

Just a list would be cool but maybe you could automatically detect public CDNs by the number of sites that request assets from them.

correlation coefficient issues

Moved from Google Code:

https://code.google.com/p/httparchive/issues/detail?id=192:
Guypo suggests ways to improve the correlation coefficient to isolate other variables:

On 6/25/11 11:18 AM, Guy Podjarny wrote:

The correlation coeff is accurate in determining the variable that correlates the most to load time. The problem is with showing the numbers of the next items without neutralizing that top var. For those, it's simply inaccurate; I don't think it's like the median vs average conversation, where they're indeed all right.

I'm not sure there's an easy way to calculate it quickly. Maybe using a stats tool instead of a PHP calculation?
Maybe only calculate the top variable correlating to load time each run, and occasionally do the full analysis using SPSS? Note that SPSS calculates it quite quickly.

Guy Podjarny | CTO, Blaze | 613-800-0413 x202

On 2011-06-25, at 13:46, Steve Souders [email protected] wrote:

Then what's the purpose of the correlation coeff I use? It seems like you're saying the current formula should never be used (when there are multiple variables). This reminds me of the avg vs median vs 90% debate - they're all valid formulas, it just depends on what you're after and clearly stating what function you're using.

Because the next question is: how do you implement this? Right now the correlation coeff charts take 10x longer than any other chart, and my guess is it's based on the order of N, and N is about to increase several orders of magnitude. So there's the practical aspects, too.

-Steve

On 6/25/11 10:37 AM, Guy Podjarny wrote:

I thought of a simpler example:
Let's say you check how reading skills and age correlate to height.
If you look at the clean numbers, you'll see reading skills correlate quite well with height. But logically, that's clearly only because reading skill correlates with age, which correlates with height. That's why you have to use partial correlation to see correlations above and beyond the variables that may explain them.

Cheers,
Guypo

On Sat, Jun 25, 2011 at 12:58 PM, Guy Podjarny [email protected] wrote:

The main issue is that when you calculate the correlation between A & C and B & C, you don't take into account the correlation between A & B.
For example, if these are the correlations:
1. A & B = 0.8
2. A & C = 0.7
3. B & C = 0.9
Then correlation #2 can be explained by #1 and #3, and you can't really say that A & C correlate. 

I feel that's not a good enough explanation, so let me try a real world example.
Let's take these three variables:
W - Load Time
X - Total # Requests
Y - Total # JS Requests
Z - Total # CSS Requests

And that these are the correlations between them:
WX = 0.7
WY = 0.6
WZ = 0.5
XY = 0.8
XZ = 0.3

Logically, total # of requests is the top correlation to load time, followed by # of JS reqs and # CSS reqs.
Also, pages with many requests often have many JS requests too, but often don't have many CSS requests (this is a made up example, just to show stats).

On a first pass, you'd say that the # of JS requests is a more significant factor in the load time. However, when you calculate the correlation between # of CSS & JS reqs above and beyond the total # of requests, you get these values (this is called "partial correlation"; the formula is in the attached doc):
WY.X = 0.09 - Excel Formula: =(0.6 - 0.7*0.8)/(SQRT(1-0.7*0.7)*SQRT(1-0.8*0.8))
WZ.X = 0.42 - Excel Formula: =(0.5 - 0.7*0.3)/(SQRT(1-0.7*0.7)*SQRT(1-0.3*0.3))
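
For reference, the same two numbers can be reproduced with a small PHP sketch of the partial-correlation formula (an illustration only, not code from the site or the attached doc):

<?php
// Partial correlation of A and C, controlling for B:
// r_AC.B = (r_AC - r_AB * r_BC) / (sqrt(1 - r_AB^2) * sqrt(1 - r_BC^2))
function partial_correlation($rAC, $rAB, $rBC) {
    return ($rAC - $rAB * $rBC) / (sqrt(1 - $rAB * $rAB) * sqrt(1 - $rBC * $rBC));
}

echo partial_correlation(0.6, 0.7, 0.8), "\n"; // WY.X ≈ 0.09
echo partial_correlation(0.5, 0.7, 0.3), "\n"; // WZ.X ≈ 0.42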

So you see that the number of JS requests really doesn't matter much beyond the total number of requests, it only looked that way because pages that had many requests happened to have many JS requests too. On the other hand, the number of CSS requests shows to be quite significant to load time, much more than number of JS requests. 

So once you establish the single variable that correlates the most with the number of requests, you have to neutralize it before you say who is second, then neutralize both to say who is third... etc. 
The attached formula shows how to neutralize one variable, but to neutralize many I used SPSS, as it became too complicated. SPSS also gives you an indication of when the correlation becomes statistically significant, taking into account the total number of samples.

I hope this makes sense, I had to learn this to prepare my mobile study presentation, so I'm not well rehearsed in explaining it... but I'm confident it's accurate. 

Did that help clarify my point?

Cheers,
Guypo


On Sat, Jun 25, 2011 at 2:47 AM, Steve Souders <[email protected]> wrote:

    Hi, Guy.

    Sorry for the late reply - still unburying.

    Yes, I'd like to fix this if it's wrong. I guess I don't understand - there's a formula for calculating the correlation coeff for two variables. That calculation ignores all the other variables. It generates a number. So far it seems like it's correct to say:
        - the correlation coeff of A & D is 0.9
        - the correlation coeff of B & D is 0.8
        - the correlation coeff of C & D is 0.7

    Given that you would say A has the highest correlation, B is 2nd highest, and C is 3rd highest.

    Where does that analysis break?

    -Steve


    On 6/19/11 5:19 PM, Guy Podjarny wrote:
    Hey, 

    First off, congrats on wrapping up Velocity - I heard it was a huge hit!
    I downloaded many of the presentations, and am looking forward to watching some of the videos too. Hopefully you'll get more sleep now ;)

    As you may have seen, as a part of the mobile analysis we did, I reused your HTTP Archive schema (with some minor modifications), and did some statistical analysis. 
    Doing so made me think that the analysis you currently have around correlation to speed is wrong. If I'm not mistaken, you measure the correlation of each variable to the load time, but you don't do so while neutralizing the other variables. What you should be doing is correlating the effect of each variable above and beyond the others.

    Doing this for one variable (correlate A and B above and beyond C) is a simple formula. As you add more variables, it becomes more complicated. 
    I did it using SPSS, and I have the SPSS Syntax (sort of a script) that calculates it given a set of data extracted from the HTTP Archive Mobile DB Schema.

    So bottom line:
    1) I think the "correlation to speed" chart you have is misleading, since you can't say what the 2nd top correlation to speed is the way you did (only the top one)
    2) It might be interesting to calculate the correlation to speed of the specific variables above and beyond the others. 

    Let me know if you're interested in getting into the details of this or not; just figured I'll put the offer out there.

    Cheers,
    Guypo

https://code.google.com/p/httparchive/issues/detail?id=239
Guypo reports that the HTTP Archive Mobile correlation coefficient stats seem wrong since "Flash Reqs" has the highest correlation. I spent an hour debugging and the math seems right, so either this is accurate or the formula is flawed.

Here's some output from calculating CC for "Oct 1 2011", "All", "iphone" comparing reqFlash to reqImg for correlation to onLoad and renderStart:

=== onLoad
reqFlash: 0.99561730861636, n=3, sumX=4, sumXX=6, sumY=49375, sumYY=1408318281, sumXY=85674
0.99561730861636 = ((257022) - (197500)) / sqrt( ((18) - (16)) * ((4224954843) - (2437890625)) )
0.99561730861636 = (59522) / sqrt( 2 * 1787064218 )
0.99561730861636 = (59522) / sqrt( 3574128436 )
0.99561730861636 = (59522) / 59784.014886924

reqImg: 0.73076506791759, n=974, sumX=31093, sumXX=2392141, sumY=9773060, sumYY=189577950110, sumXY=573515096
0.73076506791759 = ((558603703504) - (303873754580)) / sqrt( ((2329945334) - (966774649)) * ((184648923407140) - (95512701763600)) )
0.73076506791759 = (254729948924) / sqrt( 1363170685 * 89136221643540 )
0.73076506791759 = (254729948924) / sqrt( 1.2150788431614E+23 )
0.73076506791759 = (254729948924) / 348579810540.05

=== renderStart
reqFlash: 0.97504497486968, n=3, sumX=4, sumXX=6, sumY=8087, sumYY=33501411, sumXY=13506
0.97504497486968 = ((40518) - (32348)) / sqrt( ((18) - (16)) * ((100504233) - (65399569)) )
0.97504497486968 = (8170) / sqrt( 2 * 35104664 )
0.97504497486968 = (8170) / sqrt( 70209328 )
0.97504497486968 = (8170) / 8379.1006677328

reqImg: 0.34036415420895, n=974, sumX=31093, sumXX=2392141, sumY=2677584, sumYY=11729831570, sumXY=112091735
0.34036415420895 = ((109177349890) - (83254119312)) / sqrt( ((2329945334) - (966774649)) * ((11424855949180) - (7169456077056)) )
0.34036415420895 = (25923230578) / sqrt( 1363170685 * 4255399872124 )
0.34036415420895 = (25923230578) / sqrt( 5.8008363586322E+21 )
0.34036415420895 = (25923230578) / 76163221824.134
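
The coefficients above are plain Pearson correlations computed from the running sums shown. A minimal PHP sketch (not the site's actual code) reproduces them:

<?php
// Pearson correlation from running sums:
// r = (n*sumXY - sumX*sumY) / sqrt((n*sumXX - sumX^2) * (n*sumYY - sumY^2))
function pearson_from_sums($n, $sumX, $sumXX, $sumY, $sumYY, $sumXY) {
    $num = $n * $sumXY - $sumX * $sumY;
    $den = sqrt(($n * $sumXX - $sumX * $sumX) * ($n * $sumYY - $sumY * $sumY));
    return $num / $den;
}

echo pearson_from_sums(3, 4, 6, 49375, 1408318281, 85674), "\n";                      // reqFlash vs onLoad ≈ 0.9956
echo pearson_from_sums(974, 31093, 2392141, 9773060, 189577950110, 573515096), "\n";  // reqImg vs onLoad ≈ 0.7308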

Contributing & Private Instances

Hi :)

I'm working on setting up a private instance and there are some pain points which I know the project is aware of. I wanted to open an issue to start making sure the improvements I want to make are OK, or if not, which route I should go.

Some Ideas:

  • Cleaning up the MySQL calls. Right now the code base seems to need an older version of PHP to run (it's running fine on 5.5.16 locally but throwing a lot of deprecation warnings).
  • Where should I contribute documentation? The wiki seems empty. I'm OK starting it, or do we want to start with a Private-Instance.md file and document it there? (I know there are various blog posts on setting this up; it just seems like there should be something here.)

Crazy Ideas:

  • Potentially migrating to some sort of micro-framework (I can do this in a separate repo if someone thinks it's a good idea; personally I was thinking Silex, keeping it PHP to potentially minimize the amount of work needed).

Thoughts? Also, if there's already an effort to do any of what I mention, I'm available to help.

In the meantime I'll continue getting my instance up and running.

diff capability for a website across two runs

moved from Google Code: https://code.google.com/p/httparchive/issues/detail?id=180

There are a lot of diff tools that would be nice - compare the stats across two runs, compare two different websites, etc. But the first one we should do is the simplest one and perhaps most valuable: compare a single website across two runs. For example, compare http://www.wikipedia.org/ on May 16 2011 with Apr 30 2011.

Each piece of information we have should be evaluated for its value in such a comparison:

  • load time, total size, total requests
  • waterfall chart - This could grow huge. Changes in http response headers would be good. I would avoid anything about timing - how long a response took, when it started relative to the beginning of the page - we don't have a large enough sample size to do detailed timing comparisons.
  • Page Speed - lots of cool diff opportunities here, perhaps the most important when it comes to performance.

It would be best if this could be built in a way that it could be reused in other forms - browser plugin, standalone web service, etc. Build it based on diffing HAR files as a start.

GTMetrix wrote a HAR Diff but I'm not sure it was ever merged into the HAR viewer - https://code.google.com/p/harviewer/issues/detail?id=52

Store response bodies

It'd be great if we could save the response bodies alongside the metadata, so that they'd be available for download, and possibly for querying later on. That would obsolete slightly overlapping efforts like webdevdata.org, and would enable that group's members (e.g. me :) ) to help out with the HTTP Archive.

We should probably keep the actual response bodies off the database, and just save them to disk, while only keeping the file name in the database.

Since we're talking about a lot of data, it probably makes sense to start out with text based resources only, and see about images later on.

Thoughts?

Add adult column to pages tables

The copy of the HTTP Archive data within BigQuery and the raw data dumps don't contain any indication as to whether a URL is an adult site.

Is it possible to add this column, powered either by the data from urls.inc or by the adult flag from WPT?

Invalid JSON escapes in some .har files

150101_S_FCFN.har.gz has one text/plain (actually binary) resource of size 72449 that contains the string \v, preceded and followed by non-ASCII stuff. One can pretend that it's latin-1 to decode it as a string and feed it to a JSON decoder, but a conforming decoder will choke on the invalid escape. A literal backslash should be \\.
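
For illustration (not taken from the original report), the behavior of a conforming decoder can be reproduced with PHP's json_decode:

<?php
// The JSON text "a\vb" contains the invalid escape \v, so a conforming decoder rejects it.
var_dump(json_decode('"a\vb"'));    // NULL
echo json_last_error_msg(), "\n";   // "Syntax error"
// With the backslash itself escaped, the JSON text "a\\vb" decodes to the literal string a\vb.
var_dump(json_decode('"a\\\\vb"')); // string(4) "a\vb"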

This is the full list of files with some similar problem in the 20150101 data set:
150101_M_CEM3.har.gz
150101_M_CP1B.har.gz
150101_M_CJH1.har.gz
150101_M_CNGD.har.gz
150101_M_CJAY.har.gz
150101_N_CXC4.har.gz
150101_N_D0X9.har.gz
150101_N_D92B.har.gz
150101_N_DDK3.har.gz
150101_N_D91W.har.gz
150101_N_D1E7.har.gz
150101_N_D265.har.gz
150101_P_DVYC.har.gz
150101_P_DE6N.har.gz
150101_P_DX2M.har.gz
150101_P_DT61.har.gz
150101_P_DH3X.har.gz
150101_P_DR1P.har.gz
150101_P_DNZS.har.gz
150101_P_DFQH.har.gz
150101_P_DJYP.har.gz
150101_P_DVCD.har.gz
150101_P_DQSA.har.gz
150101_P_DW0X.har.gz
150101_P_DH55.har.gz
150101_P_DFCY.har.gz
150101_Q_EAVE.har.gz
150101_Q_E6BK.har.gz
150101_Q_ED93.har.gz
150101_Q_EF7D.har.gz
150101_Q_EC2K.har.gz
150101_Q_E1FS.har.gz
150101_R_EYCP.har.gz
150101_R_EXFJ.har.gz
150101_R_ENYH.har.gz
150101_R_ES4B.har.gz
150101_R_F35E.har.gz
150101_R_EYX4.har.gz
150101_S_FAF0.har.gz
150101_S_F96V.har.gz
150101_S_FHTH.har.gz
150101_S_FCFN.har.gz

Stats for HTTP/2 vs HTTP/1.1 connections

Many servers and browsers are just beginning to implement support for HTTP/2. Now is the perfect time to begin tracking when web servers begin supporting HTTP/2, to get a good and complete idea of how quickly it's adopted.

add more charts to index.php

We have a lot more charts that could be shown on index.php.

Overall, it'd be nice if the way charts are shown on index.php, viewsite.php, and interesting.php (and trends.php) were simpler and not so hardwired. For example, if a new stat is added to interesting.php, it would be nice if it automatically showed up on index.php & viewsite.php and didn't have to be explicitly inserted in those places as well.

Stat Suggestion: Mobile Comparison

It would be great to see stats for the top 1000 for mobile versus desktop: via media queries (wide screen vs. narrow screen), user agent (mobile vs. desktop), and/or a mobile site via redirect (www. vs. m.).

URL

How do I include any URL and see the data?

Show fonts alongside HTML, Flash, Images, etc.

It would be great to see the stats on font files (woff, svg, eot, ttf), alongside the rest, as adoption grows. Maybe not as granular as each file; showing their stats as a group would be enough.

Research EXIF to inform whether browsers should honor it without opt-in

See https://www.w3.org/Bugs/Public/show_bug.cgi?id=25508#c14

Suggestion by roc in https://bugzilla.mozilla.org/show_bug.cgi?id=298619#c27

I think someone could get some useful data here. Assuming that most/all sensors are wider than they are tall (even though there's no "natural" orientation for a phone, my Android phone always produces wide images), one could sample photos on the Web and count the photos that 1) have an EXIF orientation annotation that is wrong because the image has been manually rotated, or 2) have an EXIF orientation annotation and have their size in the page set using , and hence would break if we automatically rotated the image.

Also interesting: how many images have orientation EXIF at all, and how many have duplicate orientation EXIF values.

Is this something that would be possible to check for?

identify adult sites better

WebPagetest has improved their code to identify adult sites. (See catchpoint/WebPageTest#233 )

We should evaluate this new code (based on _adult_site which we're now saving to the "pages" table). Main things to check:

  • Are there any obvious false positives? (Last time "yahoo.com" was flagged as an adult site because it contained the string "2257" in its HTML.)
  • How many of the sites that I've manually identified as adult content (see $ghAdultContent in utils.inc) are NOT flagged by _adult_site?

Add response size histograms

Reported by [email protected], May 8, 2012

Would be great to see response size histograms for each type of content: HTML, JS, CSS.

Currently we see the number of requests and the total size, which gives us the average. However, something tells me not all of the distributions will be "normal" and it would be interesting to see how that breaks down + drift over time.

May 10, 2012
Project Member #1 stevesoudersorg

I worry about screen real estate and complexity adding 5 more charts.

May 10, 2012
#2 [email protected]

That's true, perhaps we could either put those as a drilldown option on a separate page, or inline on the same page, but with show/hide semantics (default hide)?

Request header size distribution

Please track the size of GET requests.

When designing algorithms for optimizing MAC layer utilization, it is helpful to understand the size of HTTP GET requests (average is useful; CDF is much more useful).
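
As a rough sketch of the requested output (hypothetical input array; not tied to the actual HTTP Archive schema), the average and an empirical CDF could be computed from per-request sizes like this:

<?php
// Empirical CDF of GET request header sizes, given per-request sizes in bytes.
function request_size_cdf(array $sizes) {
    sort($sizes);
    $n = count($sizes);
    $cdf = array();
    foreach ($sizes as $i => $bytes) {
        $cdf[$bytes] = ($i + 1) / $n; // fraction of requests at or below this size
    }
    return $cdf;
}

$sizes = array(312, 498, 512, 730, 1024); // hypothetical sample
printf("average: %.1f bytes\n", array_sum($sizes) / count($sizes));
print_r(request_size_cdf($sizes));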

rgb image usage vs cmyk

Would love to know the usage rate of CMYK images compared to RGB images.
My company wants to do some custom work related to scraping CMYK images from the web... I have a feeling it's not worth the time but haven't found data to support my feelings.

pagespeed/tree data missing

Wrong response body in .har files

In http://www.archive.org/download/har_httparchive_2014_11_15_B/bodies.zip there's a 141115_B_6QAZ.har.gz in which the response body seems to have been associated with the wrong URL.

The JavaScript resource starting with "!function(e,t){" and containing the string "caretRangeFromPoint" claims to come from https://fonts.googleapis.com/css?family=Montserrat:400,700 but that isn't the case. Load https://app.careerplug.com/user/sign_in and note that the same resource is actually from https://cpats.s3.amazonaws.com/assets/application-1efdbd9dfc7dc160c2d6d998ea3113b8.js

This seems like a bug.

Show sites using mod pagespeed

It would be interesting to track the adoption rate of mod pagespeed as a tool for improving web page performance as a function of time. How many of the top 2000 currently use it? How many of the 2M?

New stats - file requests per page

The average number of files of each type per page (like the bytes-per-page chart) would be valuable information for building more realistic test cases for web server benchmarks.

Stat suggestion: HTTP header prevalence

I'd love to see how frequently certain HTTP headers are used by servers. This refers only to the header names, not their values.

Names should be made case-insensitive, so "Last-modified" and "Last-Modified" end up in the same bin.

Result display could be capped, if necessary, e.g. to the top 100.
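
A rough PHP sketch of that binning (hypothetical input; not the actual requests schema):

<?php
// Case-insensitive tally of response header names, counting each header once per response.
function header_prevalence(array $responses, $top = 100) {
    $counts = array();
    foreach ($responses as $headerNames) {
        foreach (array_unique(array_map('strtolower', $headerNames)) as $name) {
            $counts[$name] = isset($counts[$name]) ? $counts[$name] + 1 : 1;
        }
    }
    arsort($counts);
    return array_slice($counts, 0, $top, true); // cap the display, e.g. top 100
}

print_r(header_prevalence(array(
    array('Last-modified', 'Content-Type'),
    array('Last-Modified', 'content-type', 'ETag'),
)));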

All recent bodies at archive.org are 404

This is http://httparchive.webpagetest.org/habodies.php?run=20150915:

{
    "status": 200,
    "statusText": "Complete",
    "archives": {
        "0": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_0\/bodies.zip",
        "1": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1\/bodies.zip",
        "10": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_10\/bodies.zip",
        "11": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_11\/bodies.zip",
        "12": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_12\/bodies.zip",
        "13": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_13\/bodies.zip",
        "14": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_14\/bodies.zip",
        "15": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_15\/bodies.zip",
        "16": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_16\/bodies.zip",
        "17": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_17\/bodies.zip",
        "18": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_18\/bodies.zip",
        "19": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_19\/bodies.zip",
        "1A": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1A\/bodies.zip",
        "1B": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1B\/bodies.zip",
        "1C": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1C\/bodies.zip",
        "1D": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1D\/bodies.zip",
        "1E": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1E\/bodies.zip",
        "1F": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1F\/bodies.zip",
        "1G": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1G\/bodies.zip",
        "1H": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1H\/bodies.zip",
        "1J": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1J\/bodies.zip",
        "1K": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_1K\/bodies.zip",
        "2": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_2\/bodies.zip",
        "3": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_3\/bodies.zip",
        "4": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_4\/bodies.zip",
        "5": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_5\/bodies.zip",
        "6": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_6\/bodies.zip",
        "7": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_7\/bodies.zip",
        "8": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_8\/bodies.zip",
        "9": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_9\/bodies.zip",
        "A": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_A\/bodies.zip",
        "B": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_B\/bodies.zip",
        "C": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_C\/bodies.zip",
        "D": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_D\/bodies.zip",
        "E": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_E\/bodies.zip",
        "F": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_F\/bodies.zip",
        "G": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_G\/bodies.zip",
        "H": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_H\/bodies.zip",
        "J": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_J\/bodies.zip",
        "K": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_K\/bodies.zip",
        "M": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_M\/bodies.zip",
        "N": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_N\/bodies.zip",
        "P": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_P\/bodies.zip",
        "Q": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_Q\/bodies.zip",
        "R": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_R\/bodies.zip",
        "S": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_S\/bodies.zip",
        "T": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_T\/bodies.zip",
        "V": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_V\/bodies.zip",
        "W": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_W\/bodies.zip",
        "X": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_X\/bodies.zip",
        "Y": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_Y\/bodies.zip",
        "Z": "http:\/\/www.archive.org\/download\/har_httparchive_2015_09_15_Z\/bodies.zip"
    }
}

None of these bodies.zip files are actually there; try e.g. http://www.archive.org/download/har_httparchive_2015_09_15_A/bodies.zip

The error is "The item you have requested has a problem with one or more of the metadata files that describe it, which prevents us from displaying this page."

This has been happening for a long time; the most recent data I have is 20150101. I thought it might be a transient problem, but it seems not.

Data dump of HAR files

For my master's thesis, I've written a couple of scripts that run PhantomJS to download domain front pages and generate HAR data using a modified version of their netsniff.js. Afterwards, the HAR data is analyzed to answer some specific questions I've posed - mostly regarding internal/external resource/request counts and whether or not URLs match known trackers/ads/etcetera. So far I've downloaded 100k-120k domains in the .se and .dk zones.

Now that a lot of the scripting is done, I figured I could run some tests against other HAR data collections - but to my surprise it seems the HTTP Archive data dumps are in another format, very specific to the local site. I noticed that the JavaScript HAR viewer requests HAR data, but it's downloaded per request from a WebPageTest instance, not from the MySQL database.

Is there a way to dump HAR files used in HTTP Archive for external analysis?

change device "IE8" to "IE"

Currently I've hardwired two devices: "IE8" and "iphone" in utils.inc (and other places):

$gaDevices = array( "IE8", "iphone" );

But now we're using IE9, so this string needs to change to something more generic everywhere in the code as well as in the DB. I suggest "IE" or "IEdesktop".
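
For example, the hardwired list might become something like this (the names are only the suggestions above, nothing final):

$gaDevices = array( "IE", "iphone" );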

parse error when running bootstrap.inc

Tried following instructions here:
http://www.bbinto.me/performance/setup-your-own-http-archive-to-track-and-query-your-site-trends/

But when I got to the bootstrap step I got the following error:
$ php bootstrap.inc
PHP Notice: Undefined variable: gChromeDir in /var/www/httparchive/utils.inc on line 31
PHP Notice: Undefined variable: gAndroidDir in /var/www/httparchive/utils.inc on line 32
PHP Parse error: syntax error, unexpected '[' in /var/www/httparchive/crawls.inc on line 179

Collect quirks mode

It would be useful to store which rendering mode was used for the top-level document. There is a doctype field, but it is not obvious how to map that to quirkiness, and it will not be quite correct.

document.compatMode === 'BackCompat' // quirks mode
document.compatMode === 'CSS1Compat' // standards mode or almost standards mode
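
A minimal PHP sketch of mapping a recorded compatMode value to a quirks-mode flag (a hypothetical field, not part of the current schema):

<?php
// Map a recorded document.compatMode value to a quirks-mode flag.
function is_quirks_mode($compatMode) {
    if ($compatMode === 'BackCompat') return true;   // quirks mode
    if ($compatMode === 'CSS1Compat') return false;  // standards or almost standards mode
    return null;                                     // unknown / not recorded
}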

API for custom crawls (private instances)

Right now the code is very hardwired around doing a crawl twice each month. For example, the "label" is a date string.

There are some organizations that want to have a private instance of HTTP Archive. They might want to run multiple crawls in a single day. Given the current label naming scheme this won't work.

There are many other issues involved in this. It's a major change, esp. to UI. See https://code.google.com/p/httparchive/issues/detail?id=255

add histograms

Add histograms for:
load time
start render
transfer size & requests - total, html, js, css, image, flash
Speed Index
PageSpeed score
html doc size
# of DOM elements

monitor crawls

The June 15 2014 mobile crawl failed because the schema for the "requestsmobile" table was not updated. The result was that no requests were saved. In this specific situation there were 5000 pages created, but they had "0" for all the expected columns (reqTotal, etc.).

We should have general checks that monitor the progress of the crawl and make sure it's working.

UI fixes for harviewer

Brian Pane made significant improvements to harviewer in bug #194. But now he's starting at Facebook (!) and won't be able to work on these fixes for a while. So I'm creating a separate bug with the remaining issues:

Looks good. I pushed to production, for example:
http://httparchive.org/viewsite.php?pageid=370857#waterfall

Here are some issues. Items 5-10 could be broken out into (a) separate bug(s). Perhaps some of them could be tackled in an update to this bug, and the bigger, less important ones broken out into new bugs.

  1. The popup menu that appears when you click the down arrow at the far right of a horizontal bar is not positioned correctly. It shows up in the upper right of the browser's window, instead of under the down arrow.
  2. When you expand a request row and look at the "Headers" tab, the rows are not aligned correctly. They're too tall and vertical-align is different. It appears this comes from requestBody.css (e.g., .netInfoParamName { vertical-align: top }). (See attached screenshot.)
  3. The gray shading at the top of an expanded row is not aligned (see attached screenshot). If I expand a request row, there's a TD that has class="tabViewCol" with a background url "timeline-sprites.png". The tabViewCol TD is inside another TD with class=netInfoCol with the same bg sprite. The problem is the tabViewCol TD is wider than the netInfoCol TD, so the two gray sprite bgs are different widths and don't align.
  4. The colored bars in the legend of the connection state popup are outside of the popup box. (This is the popup you see when you hover over a horizontal time bar.)
  5. Let's get rid of the word "GET" in each row.
  6. As you mention, need to get rid of the STATUS column and use colors.
  7. Please add the Content-Type column. For now, just show one uppercase letter per this key: HTML (H), JavaScript (J), CSS (C), Images (I), Flash (F), all other text/* (XML, JSON, etc.) (T), other (O). I hope to have actual columns w/in a week that are basically these letters (perhaps slightly modified) with a colored background where the color is also keyed off the content-type.
  8. Do we intentionally not have a Response tab when you expand a request row?
  9. I'd like to show the images when you hover just like Net Panel. The image will be fetched when the user hovers. The issue is - the image may no longer exist (eg, we did the run 6 months ago) OR the image has been modified since we did the run so the image shown today is NOT what was fetched previously. Despite these issues, I'd still like to do this.
  10. I notice Firebug Net Panel has a Cache tab when you expand a request row. We could supply Date, Last Modified, Expires, Data Size, Cache-Control, and ETag. I could see this getting VERY tricky given the boundary between the HTTP Archive code and the HarViewer code, but if we could do it it would be nice. All of this info is in the table lower in the page.

First byte and start render time

I might be missing something here - does HTTP Archive keep track of first byte and start render time nowadays? I found it in the FAQ but not on any trend or stats or individual domain page (aside from the filmstrip and WebPageTest result itself).

It would be very interesting to have this stat, at least across the top 1000 URLs; it gives developers a sense of how fast the web is performing and a good goal to beat.

I also see that #18 was working on such a feature; was it added and later removed?
