A simple, on-demand domain icon crawler.
This app supports various icon types because those icons usually all differ and have different use cases, so it's cool being able to get just a particular type of icon.
Currently supported icons:
- favicon
- apple-touch
- svg
- fluidapp
- msapp
## Demo

http://178.62.216.242 (currently off)

- `http://178.62.216.242/get?domain=github.com` (crawl all icons)
- `http://178.62.216.242/get?domain=github.com&type=svg` (crawl just the svg icon)
- `http://178.62.216.242/get?domain=github.com&type[0]=svg&type[1]=msapp` (crawl the svg and msapp icons)
## Requirements

- node.js
- redis
- nginx (optional)
- ImageMagick (optional)
## Installation

This app is ready to work out of the box with only node and redis installed: just clone the repo, install the dependencies and you are ready to go. Keep in mind that this kind of setup won't scale the app very well.
```bash
git clone https://github.com/ricardofbarros/icon-crawler.git
cd icon-crawler
npm i
npm start
```
Instead of the "out of the box" installation, I used a reverse proxy to help serve the static files and to load balance the node apps; the reverse proxy in question is nginx. A reverse proxy is fundamental to scaling the app, and the reason why is explained later in this documentation. If you want to know what configuration I used for nginx, take a look at `nginx/nginx.conf`.
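For reference, here is a minimal sketch of what such a setup looks like. The ports and paths are hypothetical, not the repo's actual config (that lives in `nginx/nginx.conf`):

```nginx
# Hypothetical sketch: load balance two node processes and serve
# cached icons straight from disk.
upstream icon_crawler {
  server 127.0.0.1:3000;
  server 127.0.0.1:3001;
}

server {
  listen 80;

  # Cached icons are served by nginx directly, never touching node.
  location /cache/ {
    root /var/tmp/icon-crawler;
  }

  # Everything else is proxied to the node apps.
  location / {
    proxy_pass http://icon_crawler;
    proxy_set_header Host $host;
  }
}
```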
## Logic flows

There are various fallback strategies to catch the icons. Below I explain the logic flow used to catch each icon type. Gotta catch 'em all!
### Favicon

Preference order:

- `.png`
- `.gif`
- `.ico`
Logic flow:

- Try to get all `link[rel=icon]` elements; this also returns the `shortcut icon` elements.
  - If found: Check the extension in the `href` property; it must be a `.png`, a `.gif` or a `.ico`. Return following the preference order. (A sketch of this selection step follows this list.)
- Fallback: Make the following requests in the order they are presented: `http://example.com/favicon.ico` and `http://www.example.com/favicon.ico`. If a valid asset is hit, return it.
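As a rough illustration of that first step, here is a minimal sketch (not the app's actual code) that uses cheerio to pick a favicon from the markup following the preference order:

```js
var cheerio = require('cheerio')
var path = require('path')

// Preference order for the favicon extension.
var PREFERENCE = ['.png', '.gif', '.ico']

function findFavicon (html) {
  var $ = cheerio.load(html)

  // [rel~=icon] is a word match, so it also catches rel="shortcut icon".
  var hrefs = $('link[rel~="icon"]').map(function () {
    return $(this).attr('href')
  }).get()

  // Return the first href whose extension matches, walking the
  // preference order.
  for (var i = 0; i < PREFERENCE.length; i++) {
    for (var j = 0; j < hrefs.length; j++) {
      var ext = path.extname(hrefs[j].split('?')[0]).toLowerCase()
      if (ext === PREFERENCE[i]) return hrefs[j]
    }
  }

  return null // caller falls back to /favicon.ico
}
```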
### Apple-touch

Preference order:

- Square icons, from the biggest dimension to the smallest (320x320, 160x160, 60x60).
- Wide/rectangle-like icons (320x160, 120x60, etc.).
Logic flow:

- Try to get `link[rel=apple-touch-icon-precomposed]`.
  - If found: Return the `href` following the preference order.
- 1st fallback: Try to get `link[rel=apple-touch-icon]`.
  - If found: Return the `href` following the preference order.
- 2nd fallback: Make the following requests in the order they are presented: `http://example.com/apple-touch-icon.png` and `http://www.example.com/apple-touch-icon.png`. If a valid asset is hit, return it.
### SVG

Logic flow:

- Try to get all `link[rel=icon]` elements; this also returns the `shortcut icon` elements.
  - If found: Filter for the `.svg` extension and return it, if any is found.
### Fluidapp

Logic flow:

- Try to get `link[rel=fluid-icon]`.
  - If found: Return it.
### Msapp

NOTE: The logic flow for these icons is more complex than the rest.

Preference order for items in `browserconfig.xml`:

- `square150x150logo`
- `square70x70logo`
- `TileImage`
Logic flow:

- Try to get `meta[name=msapplication-TileColor]`.
  - If found: Keep it; in the last stage of this logic flow it will be used to fill the `.png`, switching the image's transparency to the color found.
- Try to get `meta[name=msapplication-square150x150logo]`.
  - If found: Is TileColor defined?
    - Yes: Pass the URL of the image and the color to `lib/workers/windowsTileFiller`. When the image fill is finished, this worker will respond to the request.
    - No: Just return the URL of the image.
- 1st fallback: Try to get `meta[name=msapplication-square70x70logo]`.
  - If found: Run the same TileColor check as above.
- 2nd fallback: Try to get `meta[name=msapplication-TileImage]`.
  - If found: Run the same TileColor check as above.
- 3rd fallback: Try to get `meta[name=msapplication-config]` (`browserconfig.xml`).
  - If found: Get the `browserconfig.xml` and parse it, looking for `square150x150logo`, `square70x70logo`, `TileImage` and `TileColor`. (An example file is shown after this list.)
    - If any items are found in `browserconfig.xml`: Choose the icon according to the preference order above, then run the same TileColor check as above.
- 4th fallback: Make the following requests in the order they are presented: `http://example.com/browserconfig.xml` and `http://www.example.com/browserconfig.xml`.
  - If a valid asset is hit: Repeat the steps of the 3rd fallback.
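For reference, a typical `browserconfig.xml` looks like this (the paths and color are made up):

```xml
<?xml version="1.0" encoding="utf-8"?>
<browserconfig>
  <msapplication>
    <tile>
      <square70x70logo src="/tile-70x70.png"/>
      <square150x150logo src="/tile-150x150.png"/>
      <TileImage src="/tile.png"/>
      <TileColor>#2b5797</TileColor>
    </tile>
  </msapplication>
</browserconfig>
```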
## Caching

The first request is what caches the image on the file system and creates a record in redis. Normally the first request takes longer to complete, because the app needs to download the image, write it to the file system and create a key in redis before it can deliver the URL to the user. But I don't want the first request to a specific domain to wait!
So, for instance, when you request to crawl the domain github.com, the app parses the HTTP response body and finds the favicon https://assets-cdn.github.com/favicon.ico. Instead of waiting, it delivers the link through a local proxy and, behind the curtains, launches a worker to crawl the rest of the images, download them, store them and create the cache metadata in redis.
If you want to see the source code of this flow, take a look at the following files:

- Local proxy request handler: `app/proxyImage.js`
- Icon crawler worker: `lib/workers/iconCrawler.js`
- Main request handler of the app: `app/getImage.js`
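In rough terms, the cold path looks something like the sketch below. The names here are hypothetical, just to show the shape of the flow; the real logic lives in the files above:

```js
// Hypothetical sketch of the "don't make the first request wait" flow.
function handleGet (req, res) {
  var domain = req.query.domain

  cache.lookup(domain, function (err, cached) {
    if (cached) return res.json(cached) // warm path: serve from cache

    // Cold path: parse the domain's HTML just enough to find the icon
    // URLs and answer immediately with local proxy URLs...
    findIconUrls(domain, function (err, icons) {
      res.json(toProxyUrls(icons))

      // ...while a background worker downloads the files, stores them
      // on the fs and writes the metadata to redis for the next request.
      iconCrawlerWorker.send({ domain: domain, icons: icons })
    })
  })
}
```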
### File system

The file system should be enough to cache files. Caching files in memory could be a better option if we had the hardware, but for general purposes the file system will suffice.

There are some scaling concerns when you use the file system to cache files: with a lot of files in one directory you will start to cripple the system. One workaround is to split the md5 filename into subdirectories, as sketched below. This is explained in more detail in the Server Fault question Storing a million images in the file system.
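A minimal sketch of that workaround (a hypothetical layout, not necessarily the exact one this app uses):

```js
var crypto = require('crypto')
var path = require('path')

// Nest cached files under two directory levels derived from the md5 of
// the domain, so no single directory accumulates millions of entries.
function cachePath (domain, iconName) {
  var hash = crypto.createHash('md5').update(domain).digest('hex')
  // e.g. ab/cd/abcdef123...-favicon.ico
  return path.join(hash.slice(0, 2), hash.slice(2, 4), hash + '-' + iconName)
}
```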
### Serving static files with nginx

Let's get real: node.js is nowhere near the performance of nginx in the department of serving static files. I ran some benchmarks and node managed a poor 2-3k reqs/sec using `res.sendFile`, while nginx was doing 45-47k reqs/sec, so nginx was the clear winner for serving the static files cached on the file system.

The benchmarks were done using the wrk tool.
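A wrk run along these lines reproduces that kind of benchmark (the flags and URL here are illustrative, not the exact ones I used):

```bash
wrk -t4 -c100 -d30s http://localhost/cache/github.com/favicon.ico
```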
### Redis

There is a great answer on the topic of warm cache and cold cache over on Stack Exchange.

The implementation of this concept is pretty simple and straightforward: I used zsets and hash sets to accomplish it.
In the hash sets I store where the images of a specific domain live on the file system. Take the domain github.com to exemplify the data structure:
- key: `icon-crawler:github.com`
  - field: `favicon`, value: `/some/where/in/the/fs/favicon.ico`
  - field: `apple-touch`, value: `/some/where/in/the/fs/apple-touch.png`
  - field: `svg`, value: `/some/where/in/the/fs/svg.svg`
  - field: `fluidapp`, value: `/some/where/in/the/fs/fluidapp.png`
  - field: `msapp`, value: `/some/where/in/the/fs/msapp.png`
So that's how I store a domain's information. When someone requests the icons of github.com, I check if the key `icon-crawler:github.com` exists; if it does, I transform those fs paths into URLs that the reverse proxy "understands".

Up to here, this is basic caching of the "metadata".
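A minimal sketch of that lookup with the node `redis` client (callback-style API; the key layout is the one shown above):

```js
var redis = require('redis')
var client = redis.createClient()

// The crawler worker stores where each icon lives on the fs.
client.hmset('icon-crawler:github.com', {
  favicon: '/some/where/in/the/fs/favicon.ico',
  'apple-touch': '/some/where/in/the/fs/apple-touch.png'
}, function (err) { /* ... */ })

// On a request, check the metadata cache first.
client.hgetall('icon-crawler:github.com', function (err, fields) {
  if (!fields) return // cold cache: go crawl the domain
  // fields.favicon -> '/some/where/in/the/fs/favicon.ico', which gets
  // rewritten into a URL the reverse proxy understands.
})
```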
Now for the implementation of the warm/hot cache and cold cache concept itself. For this I used a single zset: on each request to crawl a domain, I add that domain to the set (only if it doesn't exist yet) and increase its score by +1. This is "heating" the cache.
Then I have the worker `lib/workers/zsetDecrementer.js` running every x seconds (configurable through `config.js`). This worker is basically a cycle that decrements every item in the zset by 1. This is "cooling" the cache.
Finally, the worker `lib/workers/deleteCacheExpired` also runs every x seconds (also configurable). This worker is in charge of disposing of cold cache; in technical terms, it removes all items whose score is below 1. A sketch of this bookkeeping follows below.
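A minimal sketch of the heating, cooling and disposal steps (same node `redis` client; `icon-crawler:heat` is a hypothetical name for the zset):

```js
// Heating: +1 on every crawl request. ZINCRBY creates the member with
// score 1 if it doesn't exist yet, so adding and bumping is one command.
client.zincrby('icon-crawler:heat', 1, domain, function (err) { /* ... */ })

// Cooling (what lib/workers/zsetDecrementer.js does every x seconds):
// decrement every member by 1.
client.zrange('icon-crawler:heat', 0, -1, function (err, domains) {
  domains.forEach(function (d) {
    client.zincrby('icon-crawler:heat', -1, d)
  })
})

// Disposal (lib/workers/deleteCacheExpired): everything below score 1
// is cold cache.
client.zrangebyscore('icon-crawler:heat', '-inf', '(1', function (err, cold) {
  // delete the cold domains' cached files and their metadata keys
})
```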
## Features

- Windows tiles background fill: the app calls `ImageMagick` to fill the background of the `.png` images with the color specified in the `TileColor` meta tag (see the sketch after this list).
- Request only a specific type or types by passing the query parameter `type`, as shown in the demo URLs above (single type or multiple types).
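A minimal sketch of that fill (not the actual `lib/workers/windowsTileFiller` code), shelling out to ImageMagick's `convert`:

```js
var execFile = require('child_process').execFile

// Composite the transparent png over a solid TileColor background and
// drop the alpha channel.
function fillTileBackground (srcPath, destPath, tileColor, done) {
  var args = [srcPath, '-background', tileColor, '-alpha', 'remove', '-alpha', 'off', destPath]
  execFile('convert', args, function (err) {
    done(err, destPath)
  })
}
```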
## Missing features

Some stuff wasn't implemented because I didn't have the time to do it, but for the record these are the missing features:

- Icon refresher: a simple worker that iterates over the cached files and checks whether they are up to date.
- Delete unused cached files from the tmp directory.
I didn't know everything about the web standards regarding icons, so I had to do my research.