Hacker News
Web image size prediction for efficient focused image crawling (commoncrawl.org)
16 points by boyter on Aug 24, 2015 | 5 comments


Can't you just look at the HTTP response to get the image size in KB and ignore images below a certain size? Alternatively, just grab the first few bytes of the image, enough to extract the format header (which I believe commonly carries dimension/size info).


I like the idea, especially as "content-length" and "content-type" should indicate the likely size and quality of the image. That said, even the simplest things become problematic when we're downloading millions or billions of objects.
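A minimal sketch of that header check, assuming the response headers are already in hand as a plain dict. The function name, the 50,000-byte threshold, and the list of accepted types are all made up for illustration; a real crawler would tune them.

```python
def worth_downloading(headers, min_bytes=50_000,
                      allowed_types=("image/jpeg", "image/png", "image/gif")):
    """Guess from HTTP response headers alone whether an image is worth fetching.

    `min_bytes` and `allowed_types` are illustrative defaults, not values
    taken from the article.
    """
    # Header names are case-insensitive, so normalise the keys first.
    normalised = {key.lower(): value for key, value in headers.items()}

    # Drop any parameters such as "; charset=..." before comparing.
    content_type = normalised.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in allowed_types:
        return False

    try:
        length = int(normalised.get("content-length", ""))
    except ValueError:
        # Missing or malformed Content-Length: the size cannot be judged.
        return False
    return length >= min_bytes
```

Servers do not always send Content-Length (chunked responses, for example), so this sketch treats a missing value as "skip" rather than guessing.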

If a round trip to get a response from the server took 100 ms and we used this technique for the 720 million potential images they found from processing Common Crawl, that's 833 days if done sequentially. Even at 100 requests per second, that's still about 83 days just spent retrieving the HTTP headers, let alone then downloading the promising images.
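The back-of-the-envelope numbers can be checked in a few lines. This is only the arithmetic, assuming a steady request rate with no retries, failures, or rate limiting; the function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
NUM_IMAGES = 720_000_000  # potential image URLs quoted above


def header_fetch_days(num_urls, requests_per_second):
    """Days needed to issue one header request per URL at a steady rate."""
    return num_urls / requests_per_second / SECONDS_PER_DAY


# One 100 ms round trip at a time is 10 requests per second.
sequential_days = header_fetch_days(NUM_IMAGES, 10)
days_at_100_rps = header_fetch_days(NUM_IMAGES, 100)
days_at_1000_rps = header_fetch_days(NUM_IMAGES, 1000)

print(f"sequential: {sequential_days:.0f} days")
print(f"100 req/s:  {days_at_100_rps:.1f} days")
print(f"1000 req/s: {days_at_1000_rps:.1f} days")
```

Sequentially that comes to roughly 833 days, and 100 requests per second gives roughly 83 days. Getting down to about 8.3 days would take on the order of 1,000 requests per second.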


Yep, that's a viable solution. GIF and PNG include the dimensions in the first few bytes. JPEG is a little more complex, but you don't need to download much of the file to get dimensions.
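As a rough illustration of how little data PNG and GIF need: PNG stores width and height in the IHDR chunk right after its 8-byte signature (24 bytes in total), and GIF stores them right after the 6-byte "GIF87a"/"GIF89a" tag (10 bytes in total). The helper name below is made up, and this is a sketch rather than how any particular library does it.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_or_gif_size(prefix):
    """Return (width, height) from the first bytes of a PNG or GIF, else None."""
    # PNG: signature (8) + chunk length (4) + b"IHDR" (4) + width (4) + height (4),
    # with both dimensions stored big-endian.
    if len(prefix) >= 24 and prefix[:8] == PNG_SIGNATURE and prefix[12:16] == b"IHDR":
        return struct.unpack(">II", prefix[16:24])

    # GIF: version tag (6) + logical screen width (2) + height (2), little-endian.
    if len(prefix) >= 10 and prefix[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", prefix[6:10])

    return None
```

For a GIF the value is the logical screen size, which is normally what you want for filtering by dimensions.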

I've used fastimage with success: https://github.com/sdsykes/fastimage


For Node.js there's https://github.com/netroy/image-size

From testing with 1800+ image URLs, you can almost certainly get the dimensions within the first 64 KB.
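JPEG is the case where a larger prefix matters, because the dimensions sit in a start-of-frame (SOF) segment that can come after metadata segments such as EXIF. Here is a minimal sketch of scanning an already-downloaded prefix for that segment. The function name is made up, this is not how image-size or fastimage is implemented, and fetching the prefix (for example with an HTTP `Range: bytes=0-65535` request, which servers are free to ignore) is left out.

```python
import struct

# Start-of-frame markers carry the dimensions. 0xC4 (DHT), 0xC8 (JPG) and
# 0xCC (DAC) fall in the same numeric range but are not frame headers.
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_size(prefix):
    """Return (width, height) from the leading bytes of a JPEG, else None."""
    if prefix[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(prefix):
        if prefix[pos] != 0xFF:
            return None  # not at a marker boundary, so give up
        marker = prefix[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before the real marker
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2  # standalone marker with no length field
            continue
        (segment_length,) = struct.unpack(">H", prefix[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(prefix):
                return None  # prefix ends inside the frame header
            # Layout after the length field: precision (1), height (2), width (2).
            height, width = struct.unpack(">HH", prefix[pos + 5:pos + 9])
            return width, height
        if marker == 0xDA:
            return None  # scan data reached without seeing a frame header
        pos += 2 + segment_length  # the length field counts its own two bytes
    return None
```

A `None` result on a truncated prefix is the signal to fetch more bytes or fall back to a full download.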


This is a fascinating idea. It'd be interesting to see a graph mapping the age of a page to the size of the linked images. I would bet that there's a very clear correlation - full-page background images as a design trend really blew up a couple of years ago, so prior to that it's much more likely that a high-resolution image is reached through an anchor wrapping another image ('click here to view' from a thumbnail) or just a 'click to view' link. These days you'd need to drill down into the CSS.



