It was a normal Monday batch process at a well-respected global bank – until, that is, a critical back-office system failed. At first, IT administrators took it in stride. This wasn’t the only time they’d had to recover lost data. But soon it became clear something more ominous was occurring: the bank’s multi-terabyte database had become corrupted.
The administrators tried to switch to the hot offsite backup. No luck: it had mirrored the corruption. In the IT world, the situation was beginning to spell ‘crisis’. Applications teams and anyone else who could help had to suspend all priorities to focus on the failure. Despite best efforts, the target recovery time – four hours – came and went without a clue as to the problem’s root cause or fix.
It began to look like an episode of ‘House’, with IT managers anxiously brainstorming for more than a day, trying to diagnose the mysterious disorder in their dying patient. They knew a premature move could make matters worse.
To the outside world, the bank showed no sign of its grave condition. Customers continued trading, unaware that this high-profile institution was on the verge of losing millions, being investigated by regulators, and spoiling its good reputation.
Out of view from customers, the IT teams struggled to keep the patient alive. They scrambled to find a clean backup. They found out the corruption had happened two days before the crash; it would take 36 hours to run a check on earlier copies of the data to see if it was clean. They worked on updating the production system, rerunning transaction log files to catch up to the crash point, and processing days of transactions that had since accumulated. Senior managers burned the midnight oil to decide which processes to give priority. By end of day Friday, the bank was uncertain it could open for business on Monday. It might be too risky to go more than five days without accurate settlement reconciliation. The bank alerted regulators. The team plugged away on catch-up processing over the weekend. Fortunately, they completed it in time. By Monday the patient was out of danger and the bank was able to open its doors.
This bank is not alone. Indeed, similar near misses are increasingly common. One global retailer had its point-of-sale transactions freeze for 18 hours during the holiday shopping season. The cause: a storage-network software bug that was never precisely identified. Despite the happy ending at the global bank, its senior managers and IT teams were left troubled. Losses had been modest, but had the failure struck at year-end instead – when trading was running at full tilt as investors tidied their portfolios – the outcome could have been disastrous.
It turns out that a tiny conflict between a packaged software bug and the server-management software – something nobody could have foreseen – had caused the potentially monumental disaster. This was a problem not addressed in any standard operations manual. For the bank’s leadership, the unsettling truth was this: Despite the bank’s full compliance with internal policies and external regulations, despite its readiness for loss of a site or failure of a major hardware component, it remained ill prepared for disaster recovery.
The bank is one of many enterprises and public institutions for which a combination of complacency, complexity and strained legacy systems is raising the risk of IT disasters to an alarming level. This is despite the fact that in recent years, disaster recovery and business continuity have gained visibility and significant funding. Improved as practices are, they are no longer enough.
Many large businesses are now so dependent on the flawless operation of their systems that they are dangerously vulnerable to substantial, even irreparable, business damage. The likelihood of disaster is becoming more a matter of when than if.
Too often, organizations let regulators and other stakeholders direct the thrust of business-continuity efforts toward failure of a single IT processing site or component. In the rush to comply, business leaders have lost sight of the other risks they face – the myriad smaller incidents like the one experienced by the global bank. These smaller problems can create big losses when servers shut down or decentralized software fails. A small leak will sink a great ship, to quote Benjamin Franklin.
As a CIO, you can see early warning signs that your organization is ill prepared for disaster recovery by observing any of these conditions:
• The IT organization is more focused on reacting to the most recent interruption than planning for rapid recovery from a major unexpected problem.
• The IT organization plans for and rehearses disaster recovery, but overemphasizes obvious catastrophes like the loss of a building. Most real-world problems are much less spectacular.
• Your key IT staff members are getting closer to retirement. Decades of business and application knowledge will walk out the door as baby boomers retire.
• You’re implementing service-oriented architecture (SOA), helping transform monolithic applications into layered composites built from various packages and custom applications, and increasing the likelihood of software bugs cropping up.
When disaster strikes, it doesn’t matter what caused the problem. What does matter is how quickly and reliably the problem can be resolved to minimize business damage.
Prevention efforts, including standard backup policies and redundant systems, continue to be important, but they’re not enough. To effectively minimize risk, IT leaders must turn their business-continuity efforts toward reliable recovery from the unforeseen.
What’s needed now is a strategic emphasis on rapid and well-rehearsed recovery. Fortunately, some companies are pioneering new approaches to smart IT disaster recovery. Working with these companies, Accenture has identified seven critical points common to the new strategies. Each point requires a shift in mindset for IT leaders, but not a major capital investment. Together, they comprise a useful starting point for more detailed business-continuity strategy and action.
1. Discuss business value and business risk. Just as people in general tend to avoid detailed discussions about death, IT people tend to shy away from asking business users what they would lose if specific IT processes became severely compromised. A smart recovery strategy must uncover such specifics, however, in order to adequately allocate resources.
2. Play more war games. Simulations of recovery scenarios (war games) are rarely pushed far enough or fast enough. At the global bank, technologists weren’t ready to manage the recovery because they hadn’t rehearsed that scenario. The goal of conducting war games, which are relatively low-cost, is rapid resumption of operations with the least impact on customers, revenue, cost and time.
3. Stay in constant “debrief” mode. IT groups should have designated leaders to capture knowledge from failures. They should integrate outside knowledge and third-party perspectives as well. Each time there is a near miss, they should give comprehensive debriefings, dissecting the problems to capture and catalogue key lessons learned.
4. Appoint an IT risk ombudsman. In addition to a chief risk officer, it may be helpful to appoint an IT risk ombudsman, a respected senior manager to whom IT staff can raise concerns without fear of personal exposure. The ombudsman should be a veteran technologist with a deep understanding of the whole IT architecture, and be able to spot problems without agendas or affiliations.
5. Rethink robustness. Robustness means more than the number of backups; it includes labour availability and partner capabilities. For example, a top credit-card processor’s call centres were shut down after a hurricane cut off the staff’s access to clean water, but the company was able to shift call volumes to outsourced centres.
6. “De-average” the data. Customers don’t care about averages when systems are compromised by downtime at the worst possible moment. By using averages as benchmarks, you’re effectively saying, “If it’s not likely to happen, it’s okay to be poorly prepared.”
7. Fix the whole thing, not just elements. Instead of rebuilding application by application, identify how each application maps to the technology platform and related (usually integrated) applications. Fix them in concert for faster results.
Smart disaster recovery strategies call for a permanent shift in mindset – away from compliance and complacency, and toward a heightened sense of readiness. As organizations, business processes, applications and infrastructure grow, new failure and recovery scenarios will continue to emerge. Unless an organization designates roles and trains staff for comprehensive recovery scenarios, downtime minutes will turn into hours, and hours into days. Customers are likely to feel the impact and take their business somewhere else. There can be no better reason for CIOs to act.
Craig Sands is an Executive Partner within Accenture’s Systems Integration & Technology practice in Canada. His focus is on helping clients become high performance businesses by aligning and executing IT-driven business strategies.
Andrew Truscott is Security Lead for Accenture in Canada. He is one of the company’s thought leaders on security issues, leading projects across Canada.
SIDEBAR Can you answer these 12 questions?
Disaster-recovery planning is on the boardroom agenda. But in order for CEOs to give directors conclusive answers, they first must talk at length with their CIOs. Here are 12 key questions you should be prepared to answer:
1. Tell me about our response simulation and rehearsal plans and activities. When was the last time we had a full-scale rehearsal of an IT disaster recovery?
2. What did we learn from it, and how do we learn from others’ business-continuity mistakes?
3. How will our recovery plan help the company financially?
4. Have our recovery planning activities made our company more resilient?
5. How can management know how quickly we’re responding in a real emergency?
6. What kind of event-monitoring system do we have to provide early warning so we don’t have to invoke our emergency plans?
7. Who’s accountable for IT disaster recovery?
8. How can we be sure our people are trained to respond effectively?
9. What other resources do we have for recovery other than our own staff?
10. We’re prepared for hardware failure, but what about a large-scale virus or malware attack?
11. What kinds of automated response capabilities do we have to rapidly communicate status and begin response implementation?
12. Do our recovery plans extend to business-support capabilities as well as technology capabilities?
Risk Tolerance and Recovery Speed
Recovery-plan development calls for an accurate accounting of risk types, as well as an understanding of their level of acceptance and potential impact on the business. Four practical factors deserve a mention.
• The speed with which business operations can be recovered, either in-house or with a third party, is directly related to the willingness to allocate resources to a specific recovery strategy.
• One should select the recovery strategy based on business needs, not solely on technical or equipment manufacturers’ capabilities or third-party hot-site vendors’ recommendations.
• When mapping business losses against recovery costs, the point at which the lines intersect may not necessarily represent the most prudent overall recovery strategy. (In other words, the mathematical result is not always the best answer.)
• Supporting the chosen recovery strategy must come with an understanding of which resource will be traded off: time or money.
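The third factor above can be illustrated with a small calculation. The Python sketch below uses entirely hypothetical figures: the `downtime_loss` and `recovery_cost` curves, the dollar amounts and the list of candidate recovery windows are assumptions made for illustration, not data from any organization. It compares the recovery window where the two curves cross with the window that gives the lowest combined cost.

```python
# Hypothetical figures only: both curves are assumptions chosen to show the
# shape of the trade-off, not measurements from any real business.

def downtime_loss(hours):
    """Assumed business loss ($ thousands) if recovery takes `hours`.
    Losses are assumed to accelerate the longer systems stay down."""
    return 10 * hours ** 2


def recovery_cost(hours):
    """Assumed cost ($ thousands) of a strategy that recovers within `hours`.
    Faster recovery is assumed to cost more."""
    return 2400 / hours


# Candidate recovery windows, in hours (also an assumption).
candidates = [1, 2, 4, 8, 12, 24, 48]

# The window closest to where the loss line and the cost line intersect.
crossover = min(candidates, key=lambda h: abs(downtime_loss(h) - recovery_cost(h)))

# The window with the lowest combined loss plus cost.
cheapest = min(candidates, key=lambda h: downtime_loss(h) + recovery_cost(h))

print("Closest to intersection:", crossover, "hours")
print("Lowest combined cost:", cheapest, "hours")
```

With these assumed curves, the lines cross nearest the 8-hour window, while the lowest combined cost falls at 4 hours, so the intersection is not the cheapest option. Neither figure reflects regulatory or reputational limits, which is one reason the final choice remains a business judgment rather than a purely mathematical one.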