Comments (5)

yoavweiss commented on May 27, 2024

@zcorpan - Not sure what you want the project to do here.
The HTTP Archive crawls and gathers raw data; it doesn't do any data analysis in particular. It stores response bodies, but only for HTML/text AFAIK, i.e. the image response bodies are not stored.

We can achieve what you're after by downloading the bulk request data, filtering it for image requests, downloading those images, and then running the analysis on them.
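A rough sketch of the filtering and analysis steps, assuming each request record is a dict with a `mimeType` field (a hypothetical field name, not confirmed from the HTTP Archive dump schema) and that the image bytes have already been downloaded separately. The EXIF check only walks the JPEG segment headers and looks for an APP1 segment starting with `Exif\0\0`; it does not parse the metadata itself.

```python
import struct


def is_image_request(record):
    """Return True if a request record looks like an image response.

    Assumes a hypothetical 'mimeType' key on each record.
    """
    mime = (record.get("mimeType") or "").lower()
    return mime.startswith("image/")


def has_exif(data):
    """Return True if JPEG bytes contain an APP1 segment with an Exif header."""
    if data[:2] != b"\xff\xd8":  # missing SOI marker, so not a JPEG
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:  # every segment must start with 0xFF
            return False
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            return False
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:  # length counts its own two bytes
            return False
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return True
        pos += 2 + length  # jump to the next segment
    return False
```

With the records loaded, something like `[r for r in records if is_image_request(r)]` would give the list of image requests to download, and `has_exif` could then be run on each downloaded body.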

from legacy.httparchive.org.

andydavies commented on May 27, 2024

WebPagetest already does some image analysis; it might be possible to extend that to produce the relevant data.

pmeenan commented on May 27, 2024

There are very few (no?) native photos on the websites that we test in the HTTP Archive. The pages are all landing pages for the Alexa top 300k sites, and having full-resolution photos on them would be a bad idea. The vast majority of the images are expected to be post-processed, recompressed, etc. (and get flagged if they are not).

I'm not sure what the use case is for EXIF support in the browser, but odds are you're going to have to do a custom crawl or study of some kind to find the photos you are looking for. Actual photo sites like Flickr, Google+, Facebook, etc. all do a bunch of processing as well for presentation in the browser, though you can sometimes get through to the original raw image, and I expect those are the ones you are looking for.

yoavweiss commented on May 27, 2024

It's true that images with EXIF data are likely to have a much larger presence on long-tail Web sites than on Alexa's landing pages.
Maybe it's best to try to get data on that using telemetry/use counters.

Regardless, how hard would it be to start storing image response bodies? With those we could get arbitrary stats of this kind (even if biased towards Alexa sites) by processing the image data itself.

pmeenan commented on May 27, 2024

We store text response bodies, but storing full images is not something we'd be able to do without some serious benefit and justification (it would add about 400GB per crawl to the data we store).
