How to Prevent vs. Experience IT Disasters

“Education is an ornament in prosperity, and a refuge in adversity.” Aristotle

On May 27th, British Airways suffered adversity in the form of a technology meltdown, reportedly due to a power supply failure. News reports showed throngs of anxious, cranky customers looking at gate agents holding up whiteboards at Heathrow.

A response to my tweet on the subject was “Where was the disaster recovery plan?” Where, indeed.

Technology has never been so durable, or so fragile. CIOs and their teams have a responsibility to be educated and transparent about the state of their technology disaster recovery and business continuity plans. When things are going well, this knowledge is a comfort. When the unexpected strikes, the knowledge is a bulwark against a storm of disruption, distraction, negative PR and customer angst.

As with cyber security, it’s important for boards and executive leadership to have an appropriate level of understanding about the resiliency of mission-critical systems. Here are some common misconceptions.

  1. It’s backed up, isn’t it? Sure. Most data and systems are backed up. (See also: “it’s in the cloud, isn’t it?”) Many people believe that when data gets lost, it can be fully restored within minutes. Well, maybe. Where the backup is kept, how often it is taken, and the steps needed to recover from it will all vary. Having a backup is like having a copy of a legal document in a safe somewhere. Lose the original, and you need to know who has the key, where the safe is located, and how to get the backup into the hands of the people and systems that need it. Further, if the backup is several versions old, it may be of limited use.
  2. The system is supposed to automatically fail over. Complete redundancy means a full duplicate system, configured and maintained in another location. If you have on-premises systems in your own data centers, this is expensive. Imagine having a fully redundant car. That doesn’t mean just buying a second vehicle and having it in the garage. It means a second identical vehicle that is where you are at all times, fueled up, maintained, insured and with a spare umbrella and ice scraper in the trunk. If you are using “the cloud,” via SaaS or other solutions, your resiliency needs to be explicitly and contractually laid out with the vendor(s), with service level agreements. (Ridesharing services are my personal form of vehicle-related disaster recovery.)
  3. Don’t we have redundant power? Perhaps that’s what British Airways thought. Redundant power is more than dual power feeds to a data center. It is a complex system of dual feeds, wiring, generators and uninterruptible power supplies (UPS) that requires specific engineering and maintenance – and regular testing. You wouldn’t leave an appliance under a dust cover in the garage for years and expect it to start up and work flawlessly.
  4. What about redundant network? Similar to power, redundant networks require sophisticated design, maintenance and testing.

Even if all of the above do exist (as one might assume they would at a multi-billion dollar airline), along with up-to-date (and also redundant) instructions, a disaster recovery plan needs to be tested. As with anything, testing needs to include process, people and technology.
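For the technology part of that testing, one small check that can be automated is whether each system’s most recent backup is fresh enough to be useful. The sketch below is illustrative only: the system names, timestamps and the 24-hour threshold are made-up assumptions, and a real check would pull backup times from whatever backup tooling is actually in place.

```python
from datetime import datetime, timedelta, timezone


def stale_backups(last_backup_times, max_age, now=None):
    """Return the names of systems whose latest backup is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, taken_at in last_backup_times.items()
        if now - taken_at > max_age
    )


# Hypothetical inventory: system name -> time of the last successful backup.
now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
backups = {
    "orders": now - timedelta(hours=2),   # backed up two hours ago
    "billing": now - timedelta(days=3),   # backed up three days ago
}

# Assumed tolerance: no system should be more than 24 hours behind.
print(stale_backups(backups, timedelta(hours=24), now=now))
```

A check like this only shows that a recent backup exists. It says nothing about whether the backup can actually be restored, which is why the people and process parts of the test still matter.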

Consider snow blowers, a staple of existence in the Northeast. In the early fall, long before any snow is in the forecast, the snow blower in my garage is dragged out and taken for maintenance at a local vendor. After it is returned, we start it up and run it a few times with the help of the original manual. We think it is worth the time and money to ensure we can get out of the driveway.

We’ve got options, just as CIOs and their companies do. We could spend more and have two snow blowers. We could outsource snow removal to a friendly local plow-person and trust him/her to have reliable equipment. We could go the manual route and have shovels. These are all valid options with varying risks and rewards. The key is: we’ve consciously decided upon and budgeted for the level of resiliency appropriate for us.

The (faint) silver lining to situations like British Airways’ is the tremendous opportunity to learn. What happened? Why? What about the response and incident management warrants improvement? What does it tell the organization about its ability to prevent similar occurrences? As the saying goes, don’t waste a crisis.

Don’t leave your customers and yourself to be buried under reputation-crushing outages due to inattention to the right level of DR planning: make conscious decisions about the level of resiliency and redundancy.

“Each disaster became a stepping stone for growth.” Erin Brockovich, environmental activist

Worth Considering ….

A Wall Street Journal story reported that the average female CEO salary has surpassed the average male CEO salary. Not. So. Fast. Success has not been achieved: while the average for 21 female CEOs does in fact exceed the average for 382 male CEOs, the small sample size, along with other factors, leads to a false conclusion.

Apparently the Windows Blue Screen of Death can be useful. According to researchers, many WannaCry attacks failed because Windows crashed and displayed the blue screen.

I feel like sending the nameless, stressed-out, exhausted IT people at British Airways a care package. They have my sympathy. Many of us in IT careers have been there, done that, gotten the cold pizza.
