Comments (20)

pmeenan commented on May 28, 2024

Timing is impeccable. The crawl that is currently running is capturing response bodies for text resources and storing them with the test results. You can turn it on for a private instance by adding bodies=1 to the WPT settings file (if running trunk), or for an older WPT instance by passing bodies=1 with the test requests from HA.
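
For anyone scripting that second option, a rough sketch (untested; the WPT host below is a placeholder, and your instance may expect different parameters) might look like:

    # Sketch: submit a test to a private WebPageTest instance with response-body
    # capture enabled by passing bodies=1, as described above. The host is a
    # placeholder; adjust the rest to whatever your instance expects.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "url": "https://example.com/",  # page to test
        "bodies": 1,                    # capture response bodies for text resources
        "f": "json",                    # return the submission result as JSON
    })

    with urllib.request.urlopen("http://wpt.example.com/runtest.php?" + params) as resp:
        print(json.dumps(json.load(resp), indent=2))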

The response bodies are included in the HAR files and available in the embedded HAR viewer. The HARs with included bodies will also be archived separately and made available for download.

Once the archive piece has been validated, I'll update this issue with info on how to download the archived HARs with bodies.

andydavies commented on May 28, 2024

What structure does webdevdata use to organise the files?

I wonder what the trade-off is between storing a bunch of files on disk vs. having another table with the resource ID and URL from resources.

pmeenan commented on May 28, 2024

OK, the interface to query the archived bodies is available. It's JSON-only but should work well enough:

http://httparchive.webpagetest.org/habodies.php?run=

e.g.:

http://httparchive.webpagetest.org/habodies.php?run=20140115

The status will be 200 when all of the HARs have finished uploading (there may be some lag before they can be downloaded from the Internet Archive storage). Each archive should have ~20,000 HARs in the zip, and each one is gzipped inside the archive. So far, for the 4 groups that have completed on the current run, it looks like they are around 4.5GB each.

Also, the bodies are only stored for the median run even though there are 3 runs for each test.
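
If you want to script the downloads, a rough sketch like this should do it (the JSON layout isn't documented here, so it just walks the response for anything that looks like a .zip link):

    # Sketch: fetch the habodies.php index for a crawl and download each archive it
    # lists. The exact JSON schema isn't spelled out in this thread, so this simply
    # collects string values ending in .zip; adjust to the real layout.
    import json
    import urllib.request

    INDEX_URL = "http://httparchive.webpagetest.org/habodies.php?run=20140115"

    def zip_urls(node):
        """Recursively collect string values that look like .zip download links."""
        if isinstance(node, str):
            return [node] if node.endswith(".zip") else []
        if isinstance(node, dict):
            return [u for value in node.values() for u in zip_urls(value)]
        if isinstance(node, list):
            return [u for value in node for u in zip_urls(value)]
        return []

    with urllib.request.urlopen(INDEX_URL) as resp:
        index = json.load(resp)

    for url in zip_urls(index):
        print("downloading", url)
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])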

souders commented on May 28, 2024

Could we instead use the "crawlid", since I'm trying to move away from "label"s? E.g.:
http://httparchive.webpagetest.org/habodies.php?crawlid=199

"Label"s are less preferred because they don't uniquely identify a crawl (e.g., there are crawls for mobile & desktop with the label "20140115").

-Steve

pmeenan commented on May 28, 2024

Not without plumbing a LOT more through, or moving the bodies processing into the HA side of things. WPT has no concept of crawl IDs.

Since HA is already pulling the HARs for its own processing, we can have it pull the versions with bodies and archive them off independently of WPT's processing (and by "we" I mean "not me" :-D).

igrigorik commented on May 28, 2024

high five ... awesome! Time to write some map-reduce jobs. :)

yoavweiss commented on May 28, 2024

Awesome indeed! :)

Regarding "labels vs. IDs": is it possible to add a mobile/desktop identifier to the label, so we could distinguish between their crawls without full-fledged IDs?

pmeenan commented on May 28, 2024

The individual HAR files should have the browser information as part of the page-level info, so you can filter while you're processing (on the current data, anyway). All of the data is munged together, though, until we move the logic to the HA layer, so you still need to process it all and pick out the ones you're interested in.
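
A rough sketch of that kind of filtering (assuming the browser name is in the standard HAR log.browser field; it may live in WPT's page-level fields instead):

    # Sketch: open a downloaded bodies zip (each member is a gzipped HAR, per the
    # earlier comment) and keep only the HARs whose browser name matches the one
    # you care about. log.browser is the standard HAR location; treat that lookup
    # as an assumption here.
    import gzip
    import json
    import zipfile

    WANTED_BROWSER = "chrome"  # example filter string

    matches = []
    with zipfile.ZipFile("bodies.zip") as archive:
        for name in archive.namelist():
            har = json.loads(gzip.decompress(archive.read(name)))
            browser = har.get("log", {}).get("browser", {}).get("name", "")
            if WANTED_BROWSER in browser.lower():
                matches.append(name)

    print(len(matches), "matching HARs")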

andydavies commented on May 28, 2024

@pmeenan is there a way of just downloading the HARs for the mobile crawl, or will I need to do both and extract?

pmeenan commented on May 28, 2024

There are no bodies in the mobile data, so the archived HARs would only have headers for mobile (and right now there is no way to download them separately). Maybe once we decide how to scale the mobile testing we'll be able to grab bodies, but the Mobitest agent doesn't support it.

foolip commented on May 28, 2024

I tried to get some data from http://httparchive.webpagetest.org/habodies.php?run=20141015 and followed the links in the JSON to some bodies.zip files, but all they contained were a lot of identical .har files with nothing of value in them. Is something broken?

foolip commented on May 28, 2024

It looks like the 20140815 run is the latest to have a big bodies.zip, 3.3G.

pmeenan commented on May 28, 2024

Should have been fixed for 11/1 or 11/15 (can't remember which).

h3nce commented on May 28, 2024

Hi @pmeenan,

.zip files are working up to February 2015 (http://httparchive.webpagetest.org/habodies.php?run=20150215). Any chance of a refresh for Nov-Dec 2015?

Thanks,

Tom

pmeenan commented on May 28, 2024

We are storing them in a public bucket in Google Cloud Storage (and will also be making them available in BigQuery). You'll need a developer account to rsync the folders, but it is free. I'll have some UI with the bucket paths shortly. The 11/15 crawl was the first one; all 4 crawls uploaded correctly, and every crawl going forward should as well.

h3nce commented on May 28, 2024

Super, thank you for the update. Yes, I was using the BigQuery UI, which currently has response bodies up to Aug 2014. I'll postpone my .zip download, which is still going as of the time of posting :)

pmeenan commented on May 28, 2024

Still working on the real documentation and the UI for showing all of the directory names, but if you want to grab the latest crawl you'll need to install gsutil (https://cloud.google.com/storage/docs/gsutil_install), set up credentials, and then run:

gsutil -m rsync gs://httparchive/desktop-Nov_15_2015 .

Each HAR is a separate gzip file (~490k of them). The -m is important as it will run multiple downloads in parallel; otherwise it will take eons. Hope you have a good internet connection (or are really patient), because it's probably > 100GB.

Let us know if there are issues with credentials. The buckets are all public, so you should be able to sync (and individual files are available over plain HTTPS), but I haven't tried with an account that doesn't already have write permissions to the bucket.
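
If you just want to spot-check a file or two over HTTPS first, something like this works (the object name is a placeholder; list the real names with gsutil ls gs://httparchive/desktop-Nov_15_2015/):

    # Sketch: grab one HAR over plain HTTPS instead of rsyncing the whole bucket.
    # Public GCS objects are reachable at https://storage.googleapis.com/<bucket>/<object>.
    # The object name below is a placeholder.
    import gzip
    import json
    import urllib.request

    OBJECT = "desktop-Nov_15_2015/SOME_TEST_ID.har.gz"  # placeholder object name
    url = "https://storage.googleapis.com/httparchive/" + OBJECT

    with urllib.request.urlopen(url) as resp:
        har = json.loads(gzip.decompress(resp.read()))

    print(len(har["log"]["entries"]), "requests in this HAR")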

Thanks,

-Pat

foolip commented on May 28, 2024

What is the difference between gs://httparchive/Dec_1_2015/, gs://httparchive/desktop-Dec_1_2015/, and gs://httparchive/mobile-Dec_1_2015/?

Is the first just the union of the desktop and mobile sets?

foolip commented on May 28, 2024

"I haven't tried with an account that doesn't already have write permissions to the bucket."

FWIW, I can confirm that I was able to sync the whole desktop-Dec_1_2015 dataset.

RByers commented on May 28, 2024

Looks like the simple folders like 'Jan_1_2017' just have summary info (pages.csv.gz). And instead of 'desktop' and 'mobile' the options are now 'chrome' and 'android'. Looks like the latest desktop data is in gs://httparchive/chrome-Jan_1_2017.
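
A quick, rough way to peek at one of those summary files (the object path is assumed from the folder naming above, and whether the CSV carries a header row isn't confirmed here):

    # Sketch: fetch the summary CSV from one of the simple folders and print the
    # first row. Whether that row is a header or the first record isn't confirmed
    # in this thread, so it's printed as-is.
    import csv
    import gzip
    import io
    import urllib.request

    url = "https://storage.googleapis.com/httparchive/Jan_1_2017/pages.csv.gz"

    with urllib.request.urlopen(url) as resp:
        text = gzip.decompress(resp.read()).decode("utf-8", errors="replace")

    rows = csv.reader(io.StringIO(text))
    print(next(rows))  # first row (header or first record)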
