Comments (5)
@zcorpan - Not sure what you want the project to do here.
The HTTP Archive crawls and gathers raw data; it doesn't do any data analysis in particular. It stores response bodies, but only for HTML/text AFAIK, e.g. the image response bodies are not stored.
We can achieve what you're after by downloading the bulk of the request data, filtering it for image requests, downloading those images, and then running analysis on them.
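As a rough illustration of the "filter it for image requests" step, here is a minimal Python sketch. It assumes each request record has already been loaded as a dict with `url` and `mimeType` keys; those field names and the `image_urls` helper are hypothetical and not taken from the HTTP Archive's actual schema.

```python
from typing import Iterable, Iterator


def image_urls(records: Iterable[dict]) -> Iterator[str]:
    """Yield the URL of every request record whose MIME type is image/*.

    Assumption: records are dicts with "url" and "mimeType" keys.
    These names are illustrative only.
    """
    for rec in records:
        # Tolerate missing values, stray whitespace, and mixed case.
        mime = (rec.get("mimeType") or "").strip().lower()
        if mime.startswith("image/"):
            yield rec["url"]


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/a.jpg", "mimeType": "image/jpeg"},
        {"url": "https://example.com/", "mimeType": "text/html"},
    ]
    for url in image_urls(sample):
        print(url)
```

The resulting URLs would then still have to be fetched separately, since the image bodies themselves are not stored.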
from legacy.httparchive.org.
WebPagetest already does some image analysis; it might be possible to extend that to produce the relevant data.
On 8 Aug 2014 08:36, "Yoav Weiss" wrote:
@zcorpan https://github.com/zcorpan - Not sure what you want the project to do here.
The HTTP Archive crawls and gathers raw data; it doesn't do any data analysis in particular. It stores response bodies, but only for HTML/text AFAIK, e.g. the image response bodies are not stored.
We can achieve what you're after by downloading the bulk of the request data, filtering it for image requests, downloading those images, and then running analysis on them.
There are very few (no?) native photos on the websites that we test in the HTTP Archive. The pages are all landing pages for the Alexa top 300k sites, where serving full-resolution photos would be a bad idea. The vast majority of the images are expected to be post-processed, recompressed, etc. (and get flagged if they are not).
I'm not sure what the use case is for EXIF support in the browser, but odds are you're going to have to do a custom crawl or study of some kind to find the photos you are looking for. Actual photo sites like Flickr, Google+, Facebook, etc. all do a bunch of processing as well for presentation in the browser, though you can sometimes get through to the original raw image, and I expect those are the ones you are looking for.
It's true that EXIF images are likely to have a much larger presence on long-tail websites than on Alexa's landing pages.
Maybe it's best to try to get data on that using telemetry/use counters.
Regardless, how hard would it be to start storing image response bodies? With that we could get arbitrary stats like that (even if biased towards Alexa sites) by processing the image data itself.
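For the EXIF case specifically, "processing the image data itself" could start from a check like the one below. This is only a sketch under stated assumptions: it handles JPEG only, walks the header segments looking for an APP1 segment whose payload begins with the `Exif` identifier, and does not parse the metadata. The `jpeg_has_exif` name is made up for illustration, and other image formats are not handled.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Return True if a JPEG byte string has an APP1 segment tagged 'Exif'."""
    if data[:2] != b"\xff\xd8":        # missing SOI marker: not a JPEG
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:          # lost sync with the segment markers
            return False
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):     # start of scan / end of image
            return False
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:                 # malformed segment
            return False
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length              # skip marker plus segment body
    return False
```

Counting how many crawled JPEGs return True would give a rough EXIF prevalence figure, with the same bias towards Alexa sites mentioned above.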
We store text response bodies, but storing full images is not something we'd be able to do without some serious benefit and justification (it would add about 400 GB per crawl to the data we store).