Comments (20)
Timing is impeccable. The crawl that is currently running captures response bodies for text resources and stores them with the test results. You can enable it on a private instance by adding bodies=1 to the WPT settings file (using trunk), or on an older WPT instance by passing bodies=1 with the test requests from HA.
The response bodies are included in the HAR files and available in the embedded HAR viewer. The HARs with included bodies will also be archived separately and made available for download.
I'll update this issue with info on how to download the archived HARs with bodies once the archive piece has been validated.
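For the "passing bodies=1 with the test requests" route, a minimal sketch of building such a request against the classic WebPageTest HTTP API (runtest.php with f=json); the host name and page URL below are placeholders, and the API key parameter is only needed on instances that require one:

```python
# Sketch: submitting a WPT test with response-body capture enabled.
# wpt.example.com and the test URL are placeholders; runtest.php,
# f=json, and k= follow the classic WebPageTest HTTP API, and
# bodies=1 is the flag described above.
from urllib.parse import urlencode

def build_test_request(wpt_host, page_url, api_key=None):
    """Build a runtest.php URL that asks WPT to capture response bodies."""
    params = {"url": page_url, "f": "json", "bodies": 1}
    if api_key:
        params["k"] = api_key
    return f"http://{wpt_host}/runtest.php?{urlencode(params)}"

print(build_test_request("wpt.example.com", "https://www.example.com/"))
```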
from legacy.httparchive.org.
What structure does webdevdata use to organise the files?
I wonder what the trade-off is between storing a bunch of files on disk vs. having another table with the resource ID and URL from resources.
from legacy.httparchive.org.
OK, the interface to query the archived bodies is available. It's JSON-only but should work well enough:
http://httparchive.webpagetest.org/habodies.php?run=
i.e.:
http://httparchive.webpagetest.org/habodies.php?run=20140115
The status will be 200 when all of the HARs have finished uploading (there may be some lag before they can be downloaded from the Internet Archive storage). Each archive should have ~20,000 HARs in the zip, and each one is gzipped within the archive. So far, the 4 groups that have completed on the current run look to be around 4.5GB each.
Also, the bodies are only stored for the median run, even though there are 3 runs for each test.
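A small sketch of consuming that endpoint's JSON. Only the status-is-200-when-ready behavior is stated above; the exact field names ("statusCode", an "archives" list with "url" entries) are assumptions for illustration:

```python
# Sketch: interpreting the habodies.php JSON. The "statusCode" and
# "archives" field names are ASSUMED; the source only says the status
# is 200 once all HARs have finished uploading.
import json

def ready_archives(payload):
    """Return archive URLs once the crawl's HARs have finished uploading."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    if data.get("statusCode") != 200:
        return []  # still uploading, or lagging behind Internet Archive storage
    return [entry["url"] for entry in data.get("archives", [])]

sample = {"statusCode": 200,
          "archives": [{"url": "https://archive.example/bodies_1.zip"}]}
print(ready_archives(sample))
```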
from legacy.httparchive.org.
Could we instead use the "crawlid", since I'm trying to move away from "label"s? E.g.:
http://httparchive.webpagetest.org/habodies.php?crawlid=199
"Label"s are less preferred because they don't uniquely identify a crawl (e.g., there are crawls for both mobile and desktop with the label "20140115").
-Steve
from legacy.httparchive.org.
Not without plumbing a LOT more through, or moving the bodies processing into the HA side of things. WPT has no concept of crawl IDs.
Since HA is already pulling the HARs for doing its own processing, we can have it pull the versions with bodies and archive them off independently of WPT's processing (and by "we" I mean "not me" :-D).
from legacy.httparchive.org.
high five ... awesome! Time to write some map-reduce jobs. :)
from legacy.httparchive.org.
Awesome indeed! :)
Regarding "labels vs. IDs": is it possible to add a mobile/desktop identifier to the label, so we could distinguish between their crawls without full-fledged IDs?
from legacy.httparchive.org.
The individual HAR files should have the browser information as part of the page-level info, so you can filter while you're processing (on the current data, anyway). All of the data is munged together, though, until we move the logic to the HA layer, so you still need to process it all and pick the ones you're interested in.
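A sketch of that process-it-all-and-filter step. The log.browser object is part of the HAR 1.2 spec; whether WPT records the mobile/desktop distinction there rather than in some other page-level field is an assumption:

```python
# Sketch: filtering downloaded HARs by browser while processing.
# log.browser {name, version} is standard HAR 1.2; the *.har.gz naming
# and the exact field WPT uses for mobile vs. desktop are assumptions.
import gzip
import json
import pathlib

def har_browser(har):
    """Pull the browser name out of a parsed HAR, if present."""
    return har.get("log", {}).get("browser", {}).get("name", "unknown")

def filter_hars(directory, wanted="Chrome"):
    """Yield paths of gzipped HARs whose recorded browser matches `wanted`."""
    for path in pathlib.Path(directory).glob("*.har.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            har = json.load(f)
        if wanted.lower() in har_browser(har).lower():
            yield path
```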
from legacy.httparchive.org.
@pmeenan is there a way of just downloading the HARs for the mobile crawl, or will I need to do both and extract?
from legacy.httparchive.org.
No bodies in the mobile data, so the archived HARs would only have headers for mobile (and right now there is no way to download them separately). Maybe once we decide how to scale the mobile testing we'll be able to grab bodies, but the mobitest agent doesn't support it.
from legacy.httparchive.org.
I tried to get some data from http://httparchive.webpagetest.org/habodies.php?run=20141015 and followed the links in the JSON to get some bodies.zip, but all they contained were a lot of identical .har files with nothing of value in them. Is something broken?
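A quick sanity check along these lines, for telling apart a usable archive and the "nothing of value" case: count entries that actually carry a body. The entries[].response.content.text location is where HAR 1.2 stores response bodies:

```python
# Sketch: checking whether an extracted HAR actually contains response
# bodies. entries[].response.content.text is the HAR 1.2 field for the
# body; the broken files described above would report 0.
import json

def bodies_in_har(har):
    """Count entries that carry a non-empty response body."""
    entries = har.get("log", {}).get("entries", [])
    return sum(1 for e in entries
               if e.get("response", {}).get("content", {}).get("text"))

empty = {"log": {"entries": [{"response": {"content": {}}}]}}
print(bodies_in_har(empty))
```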
from legacy.httparchive.org.
It looks like the 20140815 run is the latest to have a big bodies.zip, 3.3G.
from legacy.httparchive.org.
Should have been fixed for 11/1 or 11/15 (can't remember which).
from legacy.httparchive.org.
Hi @pmeenan,
.zip files are working up to February 2015 (http://httparchive.webpagetest.org/habodies.php?run=20150215). Any chance of a refresh for Nov-Dec 2015?
Thanks,
Tom
from legacy.httparchive.org.
We are storing them in a public bucket in Google Cloud Storage (and will also be making them available in BigQuery). You'll need a developer account to rsync the folders, but it is free. I'll have some UI with the bucket paths shortly. The 11/15 crawl was the first one; all 4 crawls uploaded correctly, and every crawl going forward should as well.
from legacy.httparchive.org.
Super, thank you for the update. Yes, I was using the BigQuery UI; it currently has response bodies up to Aug 2014. I'll postpone my .zip download, which is still going from time of posting :)
from legacy.httparchive.org.
Still working on the real documentation and the UI for showing all of the directory names, but if you want to grab the latest crawl, you'll need to install gsutil (https://cloud.google.com/storage/docs/gsutil_install), set up credentials, and then run:
gsutil -m rsync gs://httparchive/desktop-Nov_15_2015 .
Each HAR is a separate gzip file (~490k of them). The -m is important, as it will run multiple downloads in parallel; otherwise it will take eons. Hope you have a good internet connection (or are really patient), though, because it's probably > 100GB.
Let us know if there are issues with credentials. The buckets are all public, so you should be able to sync (and individual files are available over plain HTTPS), but I haven't tried with an account that doesn't already have write permissions to the bucket.
Thanks,
-Pat
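Since individual files are available over plain HTTPS, a single HAR can be fetched without gsutil. A minimal sketch, where the object name is a hypothetical example; the https://storage.googleapis.com/&lt;bucket&gt;/&lt;object&gt; URL form is the standard public-access path for a GCS bucket:

```python
# Sketch: fetching one gzipped HAR from the public bucket over HTTPS,
# as an alternative to a full gsutil rsync. "example.har.gz" below is
# a hypothetical object name, not a real file in the bucket.
import gzip
import json
from urllib.request import urlopen

def object_url(crawl_folder, object_name):
    """Public HTTPS URL for one object in the httparchive bucket."""
    return f"https://storage.googleapis.com/httparchive/{crawl_folder}/{object_name}"

def fetch_har(crawl_folder, object_name):
    """Download and parse a single gzipped HAR from the public bucket."""
    with urlopen(object_url(crawl_folder, object_name)) as resp:
        return json.loads(gzip.decompress(resp.read()))

# Usage (network access required):
# har = fetch_har("desktop-Nov_15_2015", "example.har.gz")
```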
On Thu, Dec 3, 2015 at 6:10 PM, h3nce [email protected] wrote:
Super, thank you for the update. Yes, I was using the BigQuery UI, that
one has response bodies up to Aug 2014 currently. I'll postpone my .zip d/l
which is still going from time of posting :)—
Reply to this email directly or view it on GitHub
#6 (comment)
.
from legacy.httparchive.org.
What is the difference between gs://httparchive/Dec_1_2015/, gs://httparchive/desktop-Dec_1_2015/ and gs://httparchive/mobile-Dec_1_2015/ ?
Is the first just the union of the desktop and mobile sets?
from legacy.httparchive.org.
> I haven't tried with an account that doesn't already have write permissions to the bucket.
FWIW, I can confirm that I was able to sync the whole desktop-Dec_1_2015 dataset.
from legacy.httparchive.org.
Looks like the simple folders like 'Jan_1_2017' just have summary info (pages.csv.gz), and instead of 'desktop' and 'mobile' the options are now 'chrome' and 'android'. Looks like the latest desktop data is in gs://httparchive/chrome-Jan_1_2017.
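One way to discover the current folder names without guessing is the GCS JSON API's objects.list call, which groups object names into folder-like "prefixes" when given delimiter=/; no auth is needed for a public bucket. The chrome- prefix matches the naming observed above:

```python
# Sketch: listing crawl folders in the public bucket via the GCS JSON
# API (objects.list). delimiter=/ makes the response return "prefixes"
# instead of every individual object.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def list_url(bucket="httparchive", prefix="chrome-"):
    """Build an objects.list URL that returns folder-like prefixes."""
    qs = urlencode({"delimiter": "/", "prefix": prefix})
    return f"https://storage.googleapis.com/storage/v1/b/{bucket}/o?{qs}"

def crawl_folders(payload):
    """Extract folder names from an objects.list JSON response."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    return [p.rstrip("/") for p in data.get("prefixes", [])]

# Usage (network access required):
# print(crawl_folders(urlopen(list_url()).read()))
```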
from legacy.httparchive.org.