getFavicon

7th November 2009 18:00

I've been using Google's favicon fetching API to illustrate links on the play.tm wire and jasoncartwright.com for a while now, and it's been working fairly well. When given a domain the Google system returns a PNG of the domain's icon. The problem was that Google's system appears to rely on the favicon being at example.com/favicon.ico, which isn't always the case.

Some sites, for whatever reason, choose to place the favicon somewhere different and reference it from a <link> tag in the <head> of their pages. To find the URL of the favicon you therefore have to (expensively) download the entire HTML page, and parse it.

Frustrated by the limitations of the Google API I decided to solve the problem using Google App Engine, by creating getFavicon.

The first problem I came across was when just requesting the /favicon.ico file. Lots of people have poorly configured web servers that return a variety of exotic HTTP response codes, or even a 200 with a zero-byte body. It's quite annoying to deal with all these exceptions.
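As a rough illustration of the kind of check this needs, here is a minimal sketch in Python. The function name `is_usable_favicon` and its parameters are made up for this example and are not taken from getFavicon itself; it only covers the two failure modes mentioned above (unexpected status codes and empty bodies).

```python
def is_usable_favicon(status_code, body):
    """Return True if a /favicon.ico response looks like a real icon.

    Hypothetical helper for illustration; not getFavicon's actual code.
    """
    # Anything other than a plain 200 is treated as "no icon here".
    if status_code != 200:
        return False
    # Some servers answer 200 but send a zero-byte body.
    if not body:
        return False
    return True
```

A caller would fall back to parsing the page's HTML whenever this returns False.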

To retrieve a favicon specified in the site's HTML I've used the superb BeautifulSoup. This library hugely simplifies the parsing of HTML, but still doesn't completely solve finding the correct URL of the favicon. The <link rel=""> value varies greatly - "shortcut", "icon", "shortcut icon" and "favicon" are all values I've seen.
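The lookup can be sketched roughly as follows. To keep the example dependency-free it uses Python's standard-library `html.parser` rather than BeautifulSoup, so it is not the code getFavicon runs; the names `ICON_RELS`, `FaviconLinkParser` and `find_favicon_url` are invented for this sketch, and the exact-match comparison on rel is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The rel values listed above, compared case-insensitively.
ICON_RELS = {"shortcut", "icon", "shortcut icon", "favicon"}


class FaviconLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel looks like an icon."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first match is kept.
        if tag != "link" or self.href is not None:
            return
        attr_map = dict(attrs)
        rel = (attr_map.get("rel") or "").strip().lower()
        href = attr_map.get("href")
        if rel in ICON_RELS and href:
            self.href = href


def find_favicon_url(page_url, html):
    """Return the absolute favicon URL declared in html, or None."""
    parser = FaviconLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # The href may be relative, so resolve it against the page URL.
    return urljoin(page_url, parser.href)
```

Resolving the href against the page URL matters because many sites reference the icon with a relative path.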

Of course, all this page & icon downloading and parsing is pretty heavy, so a two-tier cache is used - memcache for a rapid, temporary store and AppEngine's datastore for a slower, persistent store. It works well, and keeps the app ticking over quickly. Well over 90% of the icons served come from a cache.
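The general shape of a two-tier lookup like this can be sketched as below. This is an assumption-laden illustration, not the App Engine code: plain dictionaries stand in for memcache and the datastore, and the `TwoTierCache` class and its `fetch` callback are hypothetical names.

```python
class TwoTierCache:
    """Check a fast store, then a slow store, then do the expensive fetch."""

    def __init__(self, fetch):
        self.fast = {}      # stand-in for memcache (rapid, temporary)
        self.slow = {}      # stand-in for the datastore (slower, persistent)
        self.fetch = fetch  # expensive function: domain -> icon bytes

    def get(self, domain):
        # Tier 1: the fast, temporary store.
        if domain in self.fast:
            return self.fast[domain]
        # Tier 2: the persistent store; repopulate the fast tier on a hit.
        if domain in self.slow:
            icon = self.slow[domain]
            self.fast[domain] = icon
            return icon
        # Miss on both tiers: download and parse, then fill both stores.
        icon = self.fetch(domain)
        self.slow[domain] = icon
        self.fast[domain] = icon
        return icon
```

Repopulating the fast tier on a slow-tier hit means an icon evicted from the temporary store only costs one slower read before it is quick to serve again.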

Jason Cartwright

About

Jason is a founder of Potato, a web application development agency in London.

I also co-own Ferrago Ltd, who publish videogames content to around 7m consumers monthly.
