Air Travellers Cop Bad Systems Architecture

After a major systems failure in 2010 of the Navitaire New Skies airline booking and check-in system used by Virgin Australia, another failure yesterday has left thousands of Australian air travellers inconvenienced and angry.

This time the problems were not limited to Virgin, with the crash also affecting Jetstar, Tiger, Rex, and other associated airlines.

It is a fact of life that things go wrong from time to time, but when these things happen repeatedly, one must start asking questions.

In the aftermath, Navitaire blamed “a blackout in a third party data centre in Sydney” for the outage.

That may well be the real reason for the outage (equally, it may not be; it is sometimes much easier to blame someone else if you can). Either way, as a systems engineer/developer with a background in networking, I can’t help but wonder how Navitaire handles system redundancy.

When designing any production system, particularly a mission-critical system such as an airline booking/check-in system, one must consider redundancy very carefully.

There are a number of different redundancy concepts, such as N+1 (one spare component beyond the N needed to carry the load) and High Availability (HA).

You want to architect your system so that if any single component fails, the system can continue to operate fully, or at least in a degraded state. It should never collapse completely.
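To put a rough number on why duplicating a component matters, here is a minimal sketch of the standard availability arithmetic. It assumes the copies fail independently of each other, which real data centres only approximate, and the 99% figure is purely illustrative, not anything Navitaire has published. The function names are my own.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies is up.

    `single` is the availability of one copy, as a fraction between 0 and 1.
    Assumes failures are independent, which is an idealisation.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The whole system is down only when every copy is down at once.
    return 1.0 - (1.0 - single) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per (non-leap) year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2):
        a = combined_availability(0.99, copies)  # 0.99 is a made-up figure
        print(f"{copies} site(s): {expected_downtime_hours(a):.2f} h/year down")
```

Under those assumptions, a single site that is up 99% of the time is down for about 87.6 hours a year, while two such sites are both down for less than one hour a year.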

The “blackout in a third party data centre in Sydney” should not have brought the system down, unless the system is badly architected, inadvertently or otherwise.

When developing systems, corners are sometimes cut to save money. If Navitaire skimped on systems failover to cut costs, then given the confidential financial settlement they reached with Virgin after the 2010 failure, they might not have saved as much as they hoped.

Whatever the status of the failover systems, it is reasonably clear that they didn’t work yesterday. There should be at least two data centres involved, so that when one goes down, as happened yesterday, the other takes over and normal business operations continue.
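The two-data-centre idea can be sketched in a few lines. This is only an illustration of the principle, not a description of how New Skies works: the site names, the `Site` class and the handler functions are all hypothetical, and a real deployment would also need data replication and health checks between the sites.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class AllSitesDown(Exception):
    """Raised only when every configured site has failed."""


@dataclass
class Site:
    name: str                      # hypothetical label such as "site-a"
    handler: Callable[[str], str]  # stands in for a call to that data centre


def handle_with_failover(request: str, sites: Sequence[Site]) -> str:
    """Try each site in order and return the first successful response."""
    errors: List[str] = []
    for site in sites:
        try:
            return site.handler(request)
        except ConnectionError as exc:
            # This site is unreachable; note why and move on to the next.
            errors.append(f"{site.name}: {exc}")
    raise AllSitesDown("; ".join(errors))


def blacked_out(request: str) -> str:
    """Simulates a data centre that has lost power."""
    raise ConnectionError("no power")


def healthy(request: str) -> str:
    """Simulates a data centre that is working normally."""
    return f"processed {request!r}"


sites = [Site("site-a", blacked_out), Site("site-b", healthy)]
print(handle_with_failover("check-in", sites))
```

Here the first site fails, the second answers, and the caller never sees the blackout. The system only gives up when both sites are down at once.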

Navitaire have some questions to answer.

This time round, they will have even more airlines asking the question.

Why?