Comments (20)

pmeenan commented on May 28, 2024

Timing is impeccable. The crawl that is currently running is capturing response bodies for text resources and storing them with the test results. You can turn it on for a private instance by adding bodies=1 to the WPT settings file (if running trunk), or for an older WPT instance by passing bodies=1 with the test requests from HA.
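
For anyone scripting that second option, a rough sketch (untested; the WPT host below is a placeholder, and your instance may expect different parameters) might look like:

    # Sketch: submit a test to a private WebPageTest instance with response-body
    # capture enabled by passing bodies=1, as described above. The host is a
    # placeholder; adjust the rest to whatever your instance expects.
    import json
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({
        "url": "https://example.com/",  # page to test
        "bodies": 1,                    # capture response bodies for text resources
        "f": "json",                    # return the submission result as JSON
    })

    with urllib.request.urlopen("http://wpt.example.com/runtest.php?" + params) as resp:
        print(json.dumps(json.load(resp), indent=2))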

The response bodies are included in the HAR files and available in the embedded HAR viewer. The HARs with included bodies will also be archived separately and made available for download.

Once the archive piece has been validated, I'll update this issue with info on how to download the archived HARs with bodies.

andydavies commented on May 28, 2024

What structure does webdevdata use to organise the files?

I wonder what the trade-off is between storing a bunch of files on disk vs. having another table with the resource ID and URL from resources.

pmeenan commented on May 28, 2024

OK, the interface to query the archived bodies is available. It's JSON-only but should work well enough:

http://httparchive.webpagetest.org/habodies.php?run=

e.g.:

http://httparchive.webpagetest.org/habodies.php?run=20140115

The status will be 200 when all of the HARs have finished uploading (there may be some lag before they can be downloaded from the Internet Archive storage). Each archive should have ~20,000 HARs in the zip, and each one is gzipped inside the archive. So far, for the 4 groups that have completed on the current run, it looks like they are around 4.5GB each.

Also, the bodies are only stored for the median run even though there are 3 runs for each test.
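
If you want to script the downloads, a rough sketch like this should do it (the JSON layout isn't documented here, so it just walks the response for anything that looks like a .zip link):

    # Sketch: fetch the habodies.php index for a crawl and download each archive it
    # lists. The exact JSON schema isn't spelled out in this thread, so this simply
    # collects string values ending in .zip; adjust to the real layout.
    import json
    import urllib.request

    INDEX_URL = "http://httparchive.webpagetest.org/habodies.php?run=20140115"

    def zip_urls(node):
        """Recursively collect string values that look like .zip download links."""
        if isinstance(node, str):
            return [node] if node.endswith(".zip") else []
        if isinstance(node, dict):
            return [u for value in node.values() for u in zip_urls(value)]
        if isinstance(node, list):
            return [u for value in node for u in zip_urls(value)]
        return []

    with urllib.request.urlopen(INDEX_URL) as resp:
        index = json.load(resp)

    for url in zip_urls(index):
        print("downloading", url)
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])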

souders commented on May 28, 2024

Could we instead use the "crawlid", since I'm trying to move away from "label"s? E.g.:
http://httparchive.webpagetest.org/habodies.php?crawlid=199

"Label"s are less preferred because they don't uniquely identify a crawl (e.g., there are crawls for mobile & desktop with the label "20140115").

-Steve

pmeenan commented on May 28, 2024

Not without plumbing a LOT more through, or moving the bodies processing into the HA side of things. WPT has no concept of crawl IDs.

Since HA is already pulling the HARs for its own processing, we can have it pull the versions with bodies and archive them off independently of WPT's processing (and by "we" I mean "not me" :-D).

igrigorik commented on May 28, 2024

high five ... awesome! Time to write some map-reduce jobs. :)

yoavweiss commented on May 28, 2024

Awesome indeed! :)

Regarding "labels vs. IDs": is it possible to add a mobile/desktop identifier to the label, so we could distinguish between their crawls without full-fledged IDs?

pmeenan commented on May 28, 2024

The individual HAR files should have the browser information as part of the page-level info, so you can filter while you're processing (on the current data, anyway). All of the data is munged together, though, until we move the logic to the HA layer, so you still need to process it all and pick out the ones you're interested in.
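
A rough sketch of that kind of filtering (assuming the browser name is in the standard HAR log.browser field; it may live in WPT's page-level fields instead):

    # Sketch: open a downloaded bodies zip (each member is a gzipped HAR, per the
    # earlier comment) and keep only the HARs whose browser name matches the one
    # you care about. log.browser is the standard HAR location; treat that lookup
    # as an assumption here.
    import gzip
    import json
    import zipfile

    WANTED_BROWSER = "chrome"  # example filter string

    matches = []
    with zipfile.ZipFile("bodies.zip") as archive:
        for name in archive.namelist():
            har = json.loads(gzip.decompress(archive.read(name)))
            browser = har.get("log", {}).get("browser", {}).get("name", "")
            if WANTED_BROWSER in browser.lower():
                matches.append(name)

    print(len(matches), "matching HARs")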

andydavies commented on May 28, 2024

@pmeenan is there a way of just downloading the HARs for the mobile crawl, or will I need to do both and extract?

pmeenan commented on May 28, 2024

There are no bodies in the mobile data, so the archived HARs would only have headers for mobile (and right now there is no way to download them separately). Maybe once we decide how to scale the mobile testing we'll be able to grab bodies, but the Mobitest agent doesn't support it.

foolip commented on May 28, 2024

I tried to get some data from http://httparchive.webpagetest.org/habodies.php?run=20141015 and followed the links in the JSON to some bodies.zip files, but all they contained were a lot of identical .har files with nothing of value in them. Is something broken?

foolip commented on May 28, 2024

It looks like the 20140815 run is the latest to have a big bodies.zip, 3.3G.

pmeenan commented on May 28, 2024

Should have been fixed for 11/1 or 11/15 (can't remember which).

h3nce commented on May 28, 2024

Hi @pmeenan,

.zip files are working up to February 2015 (http://httparchive.webpagetest.org/habodies.php?run=20150215). Any chance of a refresh for Nov-Dec 2015?

Thanks,

Tom

pmeenan commented on May 28, 2024

We are storing them in a public bucket in Google Cloud Storage (and will also be making them available in BigQuery). You'll need a developer account to rsync the folders, but it is free. I'll have some UI with the bucket paths shortly. The 11/15 crawl was the first one; all 4 crawls uploaded correctly, and every crawl going forward should as well.

h3nce commented on May 28, 2024

Super, thank you for the update. Yes, I was using the BigQuery UI, which currently has response bodies up to Aug 2014. I'll postpone my .zip download, which is still going as of the time of posting :)

pmeenan commented on May 28, 2024

Still working on the real documentation and the UI for showing all of the directory names, but if you want to grab the latest crawl you'll need to install gsutil (https://cloud.google.com/storage/docs/gsutil_install), set up credentials, and then run:

gsutil -m rsync gs://httparchive/desktop-Nov_15_2015 .

Each HAR is a separate gzip file (~490k of them). The -m is important as it will run multiple downloads in parallel; otherwise it will take eons. Hope you have a good internet connection (or are really patient), because it's probably > 100GB.

Let us know if there are issues with credentials. The buckets are all public, so you should be able to sync (and individual files are available over plain HTTPS), but I haven't tried with an account that doesn't already have write permissions to the bucket.
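
If you just want to spot-check a file or two over HTTPS first, something like this works (the object name is a placeholder; list the real names with gsutil ls gs://httparchive/desktop-Nov_15_2015/):

    # Sketch: grab one HAR over plain HTTPS instead of rsyncing the whole bucket.
    # Public GCS objects are reachable at https://storage.googleapis.com/<bucket>/<object>.
    # The object name below is a placeholder.
    import gzip
    import json
    import urllib.request

    OBJECT = "desktop-Nov_15_2015/SOME_TEST_ID.har.gz"  # placeholder object name
    url = "https://storage.googleapis.com/httparchive/" + OBJECT

    with urllib.request.urlopen(url) as resp:
        har = json.loads(gzip.decompress(resp.read()))

    print(len(har["log"]["entries"]), "requests in this HAR")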

Thanks,

-Pat

foolip commented on May 28, 2024

What is the difference between gs://httparchive/Dec_1_2015/, gs://httparchive/desktop-Dec_1_2015/, and gs://httparchive/mobile-Dec_1_2015/?

Is the first just the union of the desktop and mobile sets?

foolip commented on May 28, 2024

"I haven't tried with an account that doesn't already have write permissions to the bucket."

FWIW, I can confirm that I was able to sync the whole desktop-Dec_1_2015 dataset.

RByers commented on May 28, 2024

Looks like the simple folders like 'Jan_1_2017' just have summary info (pages.csv.gz). And instead of 'desktop' and 'mobile' the options are now 'chrome' and 'android'. Looks like the latest desktop data is in gs://httparchive/chrome-Jan_1_2017.
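
A quick, rough way to peek at one of those summary files (the object path is assumed from the folder naming above, and whether the CSV carries a header row isn't confirmed here):

    # Sketch: fetch the summary CSV from one of the simple folders and print the
    # first row. Whether that row is a header or the first record isn't confirmed
    # in this thread, so it's printed as-is.
    import csv
    import gzip
    import io
    import urllib.request

    url = "https://storage.googleapis.com/httparchive/Jan_1_2017/pages.csv.gz"

    with urllib.request.urlopen(url) as resp:
        text = gzip.decompress(resp.read()).decode("utf-8", errors="replace")

    rows = csv.reader(io.StringIO(text))
    print(next(rows))  # first row (header or first record)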
