Hi Everyone! Some of you will remember that I wrote an article here on State of Search about integrating online and offline marketing efforts and as a case study I used the Hoxton Hotel’s £1 sale from January of this year (2011). Although there were a lot of positive things to say about the efforts in general – I still think it is one of the single greatest “free PR” efforts in recent times – there were some technical concerns with the site and a number of people who complained about their experience.
After the most recent installment of the sale yesterday I had a great chat with the men behind the website to get a feel for what they learned from last time, what they improved and what still needs improvement. A very big thank you to Mark McDermott (Founder) and Aidan Kane (Lead Developer) from Codegent for taking the time to speak to me about everything going on “behind the code”.
Lessons from Last Time
The last Hoxton Hotel sale was a great success from an online marketing perspective. We featured the story as an impressive way for a brand that is not really an online brand to truly establish themselves online. We are talking about a brand that has regular press coverage for their sales (without seeding a press release), that has a self-fulfilling link building/link-bait plan every three months, and that, most importantly, has to invest in extra servers and support when they push a piece of link bait. Not bad!
However, last go around was certainly not all sunshine and unicorns and we talked extensively about the negative and unanticipated side effects. Some of these side effects are the result of unhappy folks who did not get rewarded with a £1 room, others did not believe that these efforts were real (“no one really gets a room”), and others still complained about how quickly the site fell over.
This year it would seem as though the Hoxton Hotel and their marketing department learned from a few of these errors (covered extensively in this post from this morning) but perhaps the most exciting lessons were those from the team at Codegent and how to cope with the massive traffic spike they saw.
Effectively, the portion of the site that Codegent handles is the normal site, the sale microsite (subdomain), the countdown to the official sale, and providing the links to send the user onto the booking engine/portal. It is safe to say that the gang took their task very seriously after a handful of complaints and a healthy serving of abuse last time around and definitely took care of their end of the bargain.
Codegent’s portion of the site effectively has the role of getting everyone to the starting line and ensuring that there is no false start. This may not sound difficult, but doing so in an environment that keeps users entertained and engaged, while also keeping everyone in order and happy, is not an easy task, as any starter from a large horse race or marathon field would attest. Let’s take a closer look at how they handled this task in the January sale and the extra measures taken this time around to ensure the site held up to the strain.
Failings in January
During the January sale, the site toppled quite rapidly once the rush of people started coming into the site. The holding page for the sale was set up in plenty of time, but there were a number of issues caused by circumstances slightly out of the developers’ hands (and, admittedly, by not reading a service agreement as closely as they could have).
After seeing the sorts of loads the servers would need to handle following just the first newsletter mailout about the sale, the gang decided to move away from a basic Apache server.
Codegent moved the site over from an Apache server solution to something slightly newer and more lightweight in time for the first sale. The result was a solution combining the smaller servers and virtual hosting of Nginx with Varnish (for reverse-proxy and caching) and was “as optimised a setup as [they] could get from a traditional webserver.”
Despite the planning and effort that went into preparing for the server-side load, it was not enough: with this number of concurrent requests almost all traditional servers would fail. Effectively, the type of load the site sees at one exact moment in time makes a Digg spike look like an absolute joke.
In an effort to tackle these server issues Mark and Aidan made use of Amazon Simple Storage Service (S3) with Amazon Cloudfront on the front-end for the content delivery. The gang also made use of Google App Engine (GAE) to help with the countdown clock.
The problem they overlooked, however, was that although GAE is built to handle API calls, it is not designed to handle more than 500 requests per second. The way in which people interact with the sale – with a countdown and a rush to be the first in the portal – means a massive number of people refreshing in multiple browsers/tabs/windows. GAE is designed to shut out users if it believes it is under attack: no matter how much money you charge up your account with, it will cut you off and cap you at the maximum free level of requests. There is no doubt that this many requests is basically a denial of service attack – by invitation – so it is not surprising that Google would protect themselves and their clients in this way.
Therefore, the wheels fell off.
Lesson 1: A single point of failure is a BAD idea.
In addition to the issues with GAE, unfortunately Cloudfront changes can also take as long as 20 minutes to be updated. Therefore, when the site came down the necessary changes to the site meant more waiting time for eager (and easily angered) customers.
Lesson 2: Don’t believe the hype! Just because you’re using a “cloud” solution doesn’t mean that Amazon or Google’s servers are any better prepared for this sort of abuse.
The team took a fair amount of grief for this: a number of people questioned whether Codegent had “ever heard of the cloud” before, and many were left thinking “how embarrassing”.
The point that many of these people missed, however, was that simply scaling up a Ruby on Rails solution would not have handled the number of requests in question. That is all well and good for a sustained burst of traffic over the course of a day (the type that might come from a successful television campaign, tweets from Stephen Fry, or the “Digg effect”), but this is not the same type of request.
Lesson 3: Do not plan for your most potent day; it is your most potent second that you need to prepare for.
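To put some rough numbers on that lesson, here is a small sketch. The 500 requests per second cap is the GAE figure mentioned above; the total of 600,000 requests and the two time windows are made-up numbers purely for illustration, not figures from the sale.

```python
# Illustrative only: the request total and the windows are assumed, not measured.
GAE_CAP_PER_SECOND = 500  # the cap described in the article


def average_rate(total_requests: int, window_seconds: int) -> float:
    """Average requests per second if the traffic is spread over the window."""
    return total_requests / window_seconds


def exceeds_cap(total_requests: int, window_seconds: int,
                cap: float = GAE_CAP_PER_SECOND) -> bool:
    """True when the average rate over the window is above the cap."""
    return average_rate(total_requests, window_seconds) > cap


TOTAL = 600_000  # hypothetical request count

over_a_day = average_rate(TOTAL, 24 * 60 * 60)  # spread across a whole day
over_a_minute = average_rate(TOTAL, 60)         # squeezed into one minute
```

The same volume of traffic is a trickle when spread across a day and far beyond the cap when squeezed into a minute, which is why the busiest second matters more than the busiest day.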
There are very few developers or webmasters who are likely to experience a rush through the gates like this. I suspect the likes of TicketMaster and other distributors may have seen something similar when tickets to a major gig are released, but outside of the ticketing industry there are probably very few sites with remote (or internal) hosting solutions designed to handle this.
Nuts and Bolts of the April Solution
When preparing for the next round of the sale the team at Codegent knew that they would be under a lot of pressure – though from the sound of things their competitive nature and desire to prove people wrong might have been enough. They knew they had some serious lessons that needed to be learned and problems that needed to be solved quickly.
How to Deal with the Server Issues?
Undoubtedly, the downtime was the biggest concern. The team decided quite quickly that Cloudfront was far too dangerous and took way too long to update if something went wrong. As a result, Cloudfront was the first thing ruled out of the new solution.
In order to test the S3 side of the equation, the guys read up on the terms of service and came to the conclusion that Amazon will basically let you hammer away until your requests start hurting someone else’s quality of service. As a result of the oversight on the January sale, they felt it was better to test out what sort of loads S3 could take – so they spent time testing with an array of servers to try and find a threshold. After reaching 40,000 requests per second without toppling the service they figured it would probably be up to the task – bearing in mind that they wanted a high number of requests whilst still switching the content beneath.
Getting Rid of a Single Fail-Point – and Not Believing the Hype
They used a single call and 3 different buckets within Amazon S3 (one in Ireland, one in California and one in Virginia) to handle the gradual increase of traffic.
As for the countdown clocks, rather than relying strictly on GAE, they ran one off Google App Engine and one off a Ruby solution (Heroku) as a back-up. Finally, they also had an Amazon loop on each of the three buckets in case the other two clocks died.
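The back-up idea can be sketched roughly as follows. This is not Codegent’s code: the function names and the stand-in clock sources are hypothetical, and in a real setup each source would be a network call to the GAE clock, the Heroku clock or one of the S3 buckets.

```python
from typing import Callable, Optional, Sequence


def first_available_time(sources: Sequence[Callable[[], float]]) -> float:
    """Try each clock source in order and return the first answer we get."""
    last_error: Optional[Exception] = None
    for fetch in sources:
        try:
            return fetch()
        except Exception as exc:  # any failure just means "try the next one"
            last_error = exc
    raise RuntimeError("all clock sources failed") from last_error


# Hypothetical stand-ins for the real clock endpoints.
def primary_clock() -> float:
    raise ConnectionError("primary clock is over quota")


def backup_clock() -> float:
    return 1_302_000_000.0  # an arbitrary Unix timestamp


current = first_available_time([primary_clock, backup_clock])
```

With the sources listed in order of preference, losing the first clock costs one failed call rather than the whole countdown.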
Spreading the Load
In addition to coping with the number of requests generated by the clocks (using fail-over redundancy), they also added another layer. This time around Mark and Aidan decided to run the clock from a session cookie that would also look at the clock on the user’s PC and see how far off it was from their central clock (the clock that determines when the sale goes live). The time on the countdown was therefore running on the cookie and this difference rather than on multiple requests.
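A minimal sketch of that offset trick, assuming the central time is fetched once and everything afterwards is worked out locally. The function names are mine, and the real version would have run in the browser with the offset stored in the session cookie.

```python
def clock_offset(central_now: float, local_now: float) -> float:
    """How far the user's clock is from the central clock, in seconds.

    Positive means the local clock is running behind.
    """
    return central_now - local_now


def seconds_until_sale(sale_start: float, local_now: float,
                       offset: float) -> float:
    """Countdown computed from the local clock plus the stored offset."""
    corrected_now = local_now + offset
    return max(0.0, sale_start - corrected_now)


# One request to learn the offset (the values here are made up)...
offset = clock_offset(central_now=1000.0, local_now=990.0)

# ...then every later tick of the countdown needs no request at all.
remaining = seconds_until_sale(sale_start=1060.0, local_now=995.0,
                               offset=offset)
```

Each visitor costs one request for the time instead of one request per tick, however long they sit on the countdown page.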
Opening the Floodgates
As we’ve discussed, the challenge for Codegent was corralling all of these users into the site, keeping them there, and keeping them entertained. Once the sale was on, their aim was to send the users through to the booking engine all within a second, yet to do so without (ideally) shutting down the booking engine’s servers. Once this handover was complete, the ball was in iHotelier’s court to see the users through the booking process.
In addition to keeping the session cookies unique and differing by milliseconds, a small element of these requests was staggered. However, and perhaps more importantly, rather than kicking a user waiting on the “countdown” page straight into the queue to buy, once the sale went live a page appeared that prompted the user to click “Book Now”. Requiring the user to click, and thus adding a human element, helped space out the flood of users over a few additional seconds rather than dropping them all into the booking engine at the identical millisecond.
The last step for the Codegent team was to provide the user with a link (served through a GAE call) to the booking engine and send them on their way. The first of these links sent users to the Flash version of the booking engine and, after 8 minutes, the links generated sent users to the HTML version of the booking engine to help share the loads for iHotelier a bit further.
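The staggering part could look something like the sketch below. The article does not say how the delays were actually chosen, so the uniform random jitter, the function name and the 2,000 millisecond ceiling are all assumptions for illustration.

```python
import random
from typing import List, Optional


def staggered_delays(n_users: int, max_jitter_ms: float,
                     seed: Optional[int] = None) -> List[float]:
    """Give each user a random delay (in ms) before they are released."""
    rng = random.Random(seed)
    return [rng.uniform(0, max_jitter_ms) for _ in range(n_users)]


# Hypothetical: spread five users over up to two seconds.
delays = staggered_delays(n_users=5, max_jitter_ms=2000, seed=1)
```

The “Book Now” click does the same job without any code at all: human reaction times vary far more than a few milliseconds.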
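As a rough illustration of that switch-over, here is a sketch of choosing the link by elapsed time. The 8 minute window is from the description above; the function name and the example.com URLs are placeholders, not the real booking engine addresses.

```python
FLASH_WINDOW_SECONDS = 8 * 60  # first 8 minutes go to the Flash version

# Placeholder URLs, not the real booking engine links.
FLASH_URL = "https://booking.example.com/flash"
HTML_URL = "https://booking.example.com/html"


def booking_link(seconds_since_sale_start: float) -> str:
    """Pick which version of the booking engine a new link should point at."""
    if seconds_since_sale_start < FLASH_WINDOW_SECONDS:
        return FLASH_URL
    return HTML_URL


early_link = booking_link(30)       # half a minute into the sale
late_link = booking_link(9 * 60)    # nine minutes into the sale
```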
The Results to this Point
Any of you who follow me on Twitter will know that I was a bit frustrated with my overall experience. However, as addressed in the post this morning, my grievances all occurred within the booking engine. There were a number of issues with the booking engine (perhaps most notably when a user got through to the final stage, entered their credit card details, and only then was told that there were no rooms available on that date for that price), but I will save those complaints for another day.
What I find most interesting is the drastic changes that Codegent was able to make in just three months. They dealt with a complex problem and executed their end of the bargain exceptionally (their side of the site never broke down and they effectively sent people the link required to give them access to the booking engine).
I want to share the lessons and missed opportunities from a marketing perspective below, but I also wanted to highlight the creative solution these guys came up with. I love hearing about this sort of thing and think that most people will have overlooked the fact that this is not a Digg spike! With this many requests there are bound to be problems, but there are creative solutions within these problems as well.
There is no doubt that there will be complaints about the overall event, and there probably should be, but at least part of the puzzle is coming together.
Read part 1 of Sam’s post series about this topic here: More Lessons from The Hoxton Hotel’s £1 Sale – April 2011