icon-crawler

A simple on-demand domain icon crawler.

This app supports several types of icons. Sites usually publish different icons for different use cases, so it's useful to be able to request a particular type.

Currently supported icons:

  • favicon
  • apple-touch
  • svg
  • fluidapp
  • msapp

Live version

http://178.62.216.242 (currently off)

Requests examples

System Dependencies

  • node.js
  • redis
  • nginx (optional)
  • ImageMagick (optional)

How to install and run the app

This app works out of the box with only node and redis installed: clone the repo, install the dependencies and you are ready to go. This type of configuration won't scale well, though.

git clone https://github.com/ricardofbarros/icon-crawler.git
cd icon-crawler
npm i
npm start

Instead of an "out of the box" installation, I used nginx as a reverse proxy to serve the static files and to load balance the node apps.

A reverse proxy is fundamental to scaling the app; the reasons are explained later in this document.

If you want to know which nginx configuration I used, take a look at nginx/nginx.confg.

Strategies

To find the favicon

There are various fallback strategies to catch the icons. I will explain the logic flow used to catch them. Gotta catch 'em all!

favicon

Preference order:

  • .png
  • .gif
  • .ico

Logic flow:

  • Try to get all link[rel=icon] elements; this also matches the shortcut icon elements.
    • If found: Check the extension in the href property; it must be .png, .gif or .ico. Return the match that comes first in the preference order.
  • Fallback: Make the following requests in the order they are presented: http://example.com/favicon.ico and http://www.example.com/favicon.ico. If a valid asset is hit, return it.
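
The extension check in the first step can be sketched roughly as follows. This is only an illustration, not the project's code: pickFavicon and extensionOf are hypothetical names, and the input is assumed to be the list of href values already extracted from the page.

```javascript
// Extensions in the preference order listed above.
const FAVICON_PREFERENCE = ['.png', '.gif', '.ico'];

// Hypothetical helper: lowercased extension of a URL path,
// ignoring any query string or fragment.
function extensionOf(href) {
  const path = href.split(/[?#]/)[0];
  const dot = path.lastIndexOf('.');
  return dot === -1 ? '' : path.slice(dot).toLowerCase();
}

// Given the href values collected from link[rel=icon] elements, return the
// best match, or null when the /favicon.ico fallback requests are needed.
function pickFavicon(hrefs) {
  for (const ext of FAVICON_PREFERENCE) {
    const match = hrefs.find((href) => extensionOf(href) === ext);
    if (match) return match;
  }
  return null;
}
```

For example, pickFavicon(['/favicon.ico', '/icon.png']) returns '/icon.png', because .png comes before .ico in the preference order.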

apple-touch

Preference order:

  • Square icons, from the biggest dimension to the smallest (320x320, 160x160, 60x60).
  • Wide, rectangle-like icons (320x160, 120x60, etc.)

Logic flow:

  • Try to get link[rel=apple-touch-icon-precomposed].
    • If found: Return the href following the preference order.
  • 1st Fallback: Try to get link[rel=apple-touch-icon].
    • If found: Return the href following the preference order.
  • 2nd Fallback: Make the following requests in the order they are presented: http://example.com/apple-touch-icon.png and http://www.example.com/apple-touch-icon.png. If a valid asset is hit return it.
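
A minimal sketch of the preference order, under a few assumptions that the text above does not confirm: the dimensions are read from the sizes attribute of each link element, rectangular icons are ordered by area, and icons with no usable sizes value are tried last. pickAppleTouch and parseSizes are hypothetical names.

```javascript
// Parse a sizes attribute such as "152x152" into numbers.
// Returns null when the attribute is missing or malformed.
function parseSizes(sizes) {
  const match = /^(\d+)x(\d+)$/i.exec((sizes || '').trim());
  return match ? { width: Number(match[1]), height: Number(match[2]) } : null;
}

// Rank a candidate: group 0 = square, 1 = rectangular, 2 = unknown size.
function rank(icon) {
  const size = parseSizes(icon.sizes);
  if (!size) return { group: 2, area: 0 };
  const group = size.width === size.height ? 0 : 1;
  return { group, area: size.width * size.height };
}

// `icons` is an array of { href, sizes } objects taken from the link elements.
// Squares come first (largest to smallest), then rectangles (assumed: by area).
function pickAppleTouch(icons) {
  const sorted = icons
    .map((icon) => ({ icon, ...rank(icon) }))
    .sort((a, b) => a.group - b.group || b.area - a.area);
  return sorted.length > 0 ? sorted[0].icon.href : null;
}
```

The same function would serve both the precomposed lookup and the first fallback, since only the selector changes between them.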

svg

Logic Flow:

  • Try to get all link[rel=icon] elements; this also matches the shortcut icon elements.
    • If found: Filter for the .svg extension. Return it if any is found.

fluidapp

Logic Flow:

  • Try to get link[rel=fluid-icon].
    • If found: Return it.

msapp

NOTE: The logic flow for these icons is more complex than the rest.

Preference flow for items in browserconfig.xml:

  • square150x150logo
  • square70x70logo
  • TileImage

Logic Flow:

  • Try to get meta[name=msapplication-TileColor]
    • If found: Keep the color for the last stage of this logic flow, where the transparent areas of the .png are filled with it.
  • Try to get meta[name=msapplication-square150x150logo].
    • If found:
      • Is TileColor defined?
        • Yes - Pass the url of the image and the color to lib/workers/windowsTileFiller. When the image fill is finished this worker will respond to the request.
        • No - Just return the url of the image.
  • 1st fallback: Try to get meta[name=msapplication-square70x70logo].
    • If found:
      • Is TileColor defined?
        • Yes - Pass the url of the image and the color to lib/workers/windowsTileFiller. When the image fill is finished this worker will respond to the request.
        • No - Just return the url of the image.
  • 2nd fallback: Try to get meta[name=msapplication-TileImage].
    • If found:
      • Is TileColor defined?
        • Yes - Pass the url of the image and the color to lib/workers/windowsTileFiller. When the image fill is finished this worker will respond to the request.
        • No - Just return the url of the image.
  • 3rd fallback: Try to get meta[name=msapplication-config]. (browserconfig.xml)
    • If found: Get the browserconfig.xml and parse it. Look for square150x150logo, square70x70logo, TileImage and TileColor.
      • If any items are found in browserconfig.xml: Choose the icon according to the preference flow for items in browserconfig.xml. Then check:
        • Is TileColor defined?
          • Yes - Pass the url of the image and the color to lib/workers/windowsTileFiller. When the image fill is finished this worker will respond to the request.
          • No - Just return the url of the image.
  • 4th fallback: Make the following requests in the order they are presented: http://example.com/browserconfig.xml and http://www.example.com/browserconfig.xml.
    • If a valid asset is hit: Repeat the steps of the 3rd fallback.
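
The meta tag part of this flow (the main lookup and the first two fallbacks) repeats the same decision three times, so it can be sketched as one loop. This is an illustration only: resolveTile is a hypothetical name, `meta` is assumed to be a plain object mapping each meta name to its content, and the returned `action` field merely stands in for handing the job to lib/workers/windowsTileFiller.

```javascript
// Meta names checked in order: main lookup, 1st fallback, 2nd fallback.
const TILE_META_NAMES = [
  'msapplication-square150x150logo',
  'msapplication-square70x70logo',
  'msapplication-TileImage',
];

// Returns null when the page declares no tile image, which is the point
// where the browserconfig.xml fallbacks (3rd and 4th) would start.
function resolveTile(meta) {
  const name = TILE_META_NAMES.find((n) => meta[n]);
  if (!name) return null;
  const color = meta['msapplication-TileColor'] || null;
  return {
    url: meta[name],
    // With a TileColor the image needs a background fill first;
    // without one the URL can be returned as-is.
    action: color ? 'fill' : 'return',
    color,
  };
}
```

The items parsed from browserconfig.xml could go through the same function if they are first mapped onto the same keys, since their preference order matches.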

To scale

Serving the icons

The first request for a domain is used to cache the images on the file system and create a record in redis. Normally this first request would take longer to complete, because the app has to download the image, write it to the file system and create a key in redis before it can deliver the url to the user. But I don't want the first request for a specific domain to wait!

For instance, when you request a crawl of the domain github.com, the app parses the HTTP response body and finds the favicon https://assets-cdn.github.com/favicon.ico. Instead of waiting, it delivers that link through a local proxy and, behind the curtains, launches a worker that crawls the rest of the images, downloads them, stores them and creates the cache metadata in redis.

If you want to see the source code for this flow, take a look at the following files:

  • Local proxy request handler - app/proxyImage.js
  • Icon crawler worker - lib/workers/iconCrawler.js
  • Main request handler of the app - app/getImage.js

Cache files in the file system

The file system should be enough to cache files. Caching files in memory could be a better option given enough hardware, but for general purposes the file system will suffice.

There are some scaling concerns when you use the file system to cache files. If you keep a lot of files in one directory, you will start to cripple the system, so one workaround is to split the md5 filename into pieces and use them as subdirectories. This is explained in more detail in the Server Fault question Storing a million images in the file system.
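
As a rough sketch of the splitting idea: the function below turns a hex filename (for example an md5 digest) into a nested relative path. shardedPath is a hypothetical name, and the number of levels and characters per level are illustrative defaults, not the values this project uses.

```javascript
// Build a nested relative path from a hex filename.
// `levels` directories are created, each named after `width` characters.
function shardedPath(hexName, levels = 2, width = 2) {
  if (hexName.length <= levels * width) {
    throw new Error('filename too short to shard');
  }
  const parts = [];
  for (let i = 0; i < levels; i++) {
    parts.push(hexName.slice(i * width, (i + 1) * width));
  }
  // Whatever is left of the name becomes the file name itself.
  parts.push(hexName.slice(levels * width));
  return parts.join('/');
}
```

With the defaults, shardedPath('d41d8cd98f00b204e9800998ecf8427e') returns 'd4/1d/8cd98f00b204e9800998ecf8427e'. Two hex characters per level cap each directory at 256 subdirectories.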

Serve static files through nginx

Let's get real: node.js is nowhere near nginx when it comes to serving static files. I ran some benchmarks and node managed a poor 2-3k reqs/sec using res.sendFile, while nginx did 45-47k reqs/sec, so nginx was the clear winner for serving the static files cached on the file system.

The benchmarks were done using the wrk tool.

Using redis to manage warm/hot cache and cold cache

There is a great answer on the topic of warm cache and cold cache on Stack Exchange.

The implementation of this concept is pretty simple and straightforward. I used sorted sets (zsets) and hashes to accomplish this.

In the hashes I store where the images of a specific domain live in the file system. Take the domain github.com as an example of the data structure:

  • key: icon-crawler:github.com
    • field: 'favicon', value: '/some/where/in/the/fs/favicon.ico'
    • field: 'apple-touch', value: '/some/where/in/the/fs/apple-touch.png'
    • field: 'svg', value: '/some/where/in/the/fs/svg.svg'
    • field: 'fluidapp', value: '/some/where/in/the/fs/fluidapp.png'
    • field: 'msapp', value: '/some/where/in/the/fs/msapp.png'

That's how I store the information for each domain. When someone requests the icons of the domain github.com, I check whether the key icon-crawler:github.com exists; if it does, I transform those fs paths into urls that the reverse proxy "understands".
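
The path-to-url step might look something like the sketch below. Everything specific in it is an assumption made for illustration: CACHE_ROOT, PUBLIC_PREFIX and toPublicUrls are hypothetical names, and the real mapping depends on how nginx is configured.

```javascript
// Hypothetical values: the cache directory on disk and the url prefix
// that the reverse proxy maps onto that directory.
const CACHE_ROOT = '/var/cache/icon-crawler';
const PUBLIC_PREFIX = '/icons';

// `record` mirrors the hash shown above: icon type -> file system path.
// Returns icon type -> public url, skipping paths outside the cache root.
function toPublicUrls(record, host) {
  const urls = {};
  for (const [type, fsPath] of Object.entries(record)) {
    if (!fsPath.startsWith(CACHE_ROOT + '/')) continue;
    urls[type] = host + PUBLIC_PREFIX + fsPath.slice(CACHE_ROOT.length);
  }
  return urls;
}
```

Keeping fs paths in redis and deriving urls on the way out means the public url scheme can change without rewriting the cached metadata.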

Until here, this is basic caching of the "metadata".

Now for the implementation of the warm/hot cache and cold cache concept. For this I used a single zset.

I add a domain to this set if it isn't there yet and increase its score by 1 on each request to crawl that domain. This is "heating" the cache.

Then I have the worker lib/workers/zsetDecrementer.js running every x seconds (configurable through config.js). This worker is basically a loop that decrements every item in the zset by 1. This is "cooling" the cache.

Then I have the worker lib/workers/deleteCacheExpired running every x seconds as well (also configurable). This worker is in charge of disposing of cold cache. In technical terms, it removes all items with a score below 1.
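
To make the heating and cooling cycle concrete, here is a small in-memory simulation. It uses a Map as a stand-in for the redis zset, and the function names are hypothetical; against redis, the increment would presumably be a ZINCRBY and the disposal a ZREMRANGEBYSCORE, but the workers' actual code is not shown here.

```javascript
// In-memory stand-in for the redis zset: domain -> score.
const scores = new Map();

// "Heating": called on every crawl request for a domain.
// A domain that is not in the set yet starts from 0.
function heat(domain) {
  scores.set(domain, (scores.get(domain) || 0) + 1);
}

// "Cooling": what the decrementer worker does on each run.
function coolAll() {
  for (const [domain, score] of scores) {
    scores.set(domain, score - 1);
  }
}

// Disposal: remove every domain whose score is below 1 and report them,
// so the caller can also delete the matching hash and cached files.
function evictCold() {
  const evicted = [];
  for (const [domain, score] of scores) {
    if (score < 1) {
      scores.delete(domain);
      evicted.push(domain);
    }
  }
  return evicted;
}
```

A domain requested three times survives two cooling runs and is evicted after the third, while a domain requested once is evicted after the first. The interval between runs therefore decides how quickly rarely requested domains fall out of the cache.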

Cool stuff implemented

  • Windows tile background fill - This app calls ImageMagick to fill the background of the .png images with the color specified in the TileColor meta tag.
  • Request only a specific type or types by passing the query parameter type. Like this: single type or multiple types

To be implemented

Some stuff wasn't implemented because I didn't have time to do it, but for the record these are the missing features:

  • Icon refresher - This should be a simple worker that iterates over the cached files and checks whether they are up to date.
  • Delete unused cached files from the tmp directory.

Reading material

I didn't know everything about the web standards regarding the icons, so I had to do my research.
