Like anyone who has worked in the IT industry for as long as I have, I’ve encountered my fair share of horror stories (or learning opportunities, as we later learned to call them), where technology lets you down or behaves in completely unforeseen ways, with highly stressful consequences.
One of my all-time favourite examples of this (although it didn’t feel like it at the time) involved Santa Claus, of all people, and having told the story in person so many times, I finally found the time to write it down (just in time for Christmas) in all its glorious detail.
Full credit goes to Evan Shortiss and Philip Hayes (who lived through this with us) for helping to tell the story in this format, with Evan devising the article title and Philip providing the technical accuracy to back it up. So here goes…
A Christmas Tale
It was 23 December 2011 and the hardworking staff at our startup company, which was punching above its weight in the mobile applications and cloud space at the time, were winding down to a well-deserved Christmas break after a very busy and successful year.
’Twas two nights before Christmas, when all through the house, not an engineer was stirring, not even their mouse…
As our Cloud Operations team had done their due diligence and introduced a platform change freeze the week before (notifying customers accordingly and finalising a support crew that could assist if any issues arose during the holidays, which rarely happened anyway), most of the staff had already started their holidays, so the office was pretty calm and quiet.
One of our higher-profile customers at the time ran the local national airport and were using a custom solution powered by our platform to provide flight information and other services to airport customers. Their solution comprised a publicly available, cross-platform mobile app with a cloud-based Node.js backend that integrated with some of their own backend systems, parsing data that was delivered via a message queue and storing it in a local cache.
Then, suddenly, the phone rang…
Panic Stations
The customer’s mobile app had just stopped working on the busiest day of the year, leaving thousands of passengers virtually stranded, with no up-to-date flight information. Nothing had changed, though; we were in a change freeze and this same application had been operating flawlessly for months, processing hundreds of thousands of messages without issue. This was quickly verified by the dutiful support crew who confirmed that no changes had been made to the front-end or back-end of the application, and yet the mobile app continued to fail.
In the hours that followed, and long into the evening, almost every engineer in the company was summoned back from holidays to help get to the root cause of this issue, with stress levels on the increase and patience declining fast. There was also no sign of anyone that looked even remotely like Hans Gruber to blame for the drama that ensued 🙂
Some time later, one of the engineers discovered (via some system logs) that some of the code responsible for parsing messages from the customer’s backend systems was throwing errors. Specifically, the code that mapped airline (ICAO) codes to human readable airline names had a hardcoded table with predefined entries, supporting only the entries in that table and generating errors for anything else it encountered.
We verified that there was, in fact, decent error handling in place for the code segment in question, but this still did not explain why the backend application was crashing continuously. A few more hours passed and attention turned to the actual data being sent by the customer’s messaging system. It was now the only remaining variable in the entire equation.
Learning Opportunities
And then, suddenly, there it was, staring us in our (now very red) faces. Santa was on his way!
It turns out that, through their inner child and the goodness of their hearts, and in keeping with what had become an annual tradition, the airport IT staff had injected a fake flight (by Santa Airlines) into the arrivals messaging queue. However, the associated airline code (SAN) was not in the table of predefined entries in the backend application, so it was generating an error.
However, cruelly, it then transpired that while the error was being caught and handled, there was a small bug in the error handling routine which was preventing the application from processing subsequent messages, ultimately causing it to crash, over and over again.
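To make the failure mode concrete, here is a minimal JavaScript sketch of a message loop that survives an unknown airline code. It is not the original code, and the exact bug is not reproduced here. The JSON message format, the `AIRLINE_NAMES` table and its entries, and the `airlineName` and `processMessages` functions are all assumptions made for illustration. The idea is simply that the lookup falls back to the raw code instead of throwing, and that the error handler does as little as possible so that one bad message cannot block the ones behind it.

```javascript
// Illustrative lookup table only; the real application's entries are not known.
const AIRLINE_NAMES = new Map([
  ['BAW', 'British Airways'],
  ['DLH', 'Lufthansa'],
  ['KLM', 'KLM'],
]);

// Fall back to the raw code rather than throwing on an unknown airline.
function airlineName(code) {
  return AIRLINE_NAMES.get(code) ?? code;
}

// Process each queued message independently and store the result in the cache.
// Assumes (hypothetically) that messages are JSON strings with `id` and `airlineCode`.
function processMessages(messages, cache, log = console.error) {
  for (const raw of messages) {
    try {
      const flight = JSON.parse(raw);
      cache.set(flight.id, { ...flight, airline: airlineName(flight.airlineCode) });
    } catch (err) {
      // Keep the handler trivial: record the problem and move on to the next message.
      log(`Skipping bad message: ${err.message}`);
    }
  }
  return cache;
}
```

With a loop along these lines, a flight carrying the code SAN would simply be cached with "SAN" as its airline name, and the flights queued behind it would still be processed.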
So, in the end, the addition of a fake flight to bring some festive cheer to airport customers and their families exposed a hidden bug in our application, which “sleighed” our backend!
Lessons Learned
No matter how reliable you believe your data is, or how long your application has been running without issue, edge cases will still happen. A simple unit test that fed an unknown airline code through the parser would have caught the bug described above and prevented this high-profile system from failing at one of the busiest times of the year, causing enormous amounts of stress to everyone involved.