If you want to really test your disaster recovery plan, you have to get out from behind your desk and step out into the real world. Because in the real world, the backup site lost your tapes, your emergency phone numbers are out of date, and you forgot to order Chinese food for the folks working around the clock at your off-site data center.
“Unless it’s tested, it’s just a document,” says Joyce Repsher, product manager for business continuity services at Electronic Data Systems Corp., an IT outsourcing and services provider in Plano, Texas.
How often should you test? Several experts suggest real-world testing of an organization’s most critical systems at least once a year. In the wake of September 11, and with new regulations holding executives responsible for keeping corporate data secure, organizations are doing more testing than they did 10 years ago, says Repsher. An online survey of 224 IT managers, conducted by Computerworld in the US, supports that assertion, indicating that 71 percent had tested their disaster recovery plans in the past year.
Desktop disaster recovery testing involves going through a checklist of who should do what in case of a disaster. Such walk-throughs are a necessary first step and can help you catch changes, such as a new version of an application that will trigger other changes in the plan. They can also identify the most important applications, says Repsher, “before moving to the expense of a more realistic recovery test.”
Companies do desktop tests at different intervals. Fluor Fernald Inc., which is handling the cleanup of a government nuclear site in Fernald, Ohio, does both desktop and physical tests of its disaster response plans every three years “or anytime there’s a significant change in our hardware configuration,” says Jan Arnett, manager of systems and administration at the division of engineering giant Fluor Corp.
Determining which systems need a live test is also critical. Fluor Fernald schedules live tests on only about 25 of its most critical applications and then tests only one server running a representative sample of these applications. “We feel if we can bring one server up, we can bring 10 servers up,” says Arnett, especially since the company uses standard Intel-based servers and networking equipment.
The most common form of live testing is parallel testing, says Todd Pekats, national director of storage alliances at IT services provider CompuCom Systems Inc. in Dallas. Parallel testing recovers a separate set of critical applications at a disaster recovery site without interrupting the flow of regular business. Costly and rarely done, the most realistic test is a full switch of critical systems during working hours to standby equipment, which Pekats says is appropriate only for the most critical applications.
Businesses that are growing or changing quickly should test their disaster recovery plans more often, says Al Decker, executive director of security and privacy services at EDS. He cites one firm that has grown eightfold since 1999, when its disaster plan called for the recovery of critical systems in 24 hours. Today, just mounting the tapes required for those systems would take four to 10 days, he says.
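The arithmetic behind Decker's example can be checked on the back of an envelope. The Python sketch below estimates how long it takes just to mount the tapes and compares that with the recovery objective. The tape counts, minutes per tape and number of drives are invented for illustration, not figures from the firm he describes.

```python
import math

def tape_mount_hours(tape_count, minutes_per_tape, drives):
    """Hours needed to mount and read every tape, with `drives` working in parallel."""
    if drives < 1:
        raise ValueError("need at least one drive")
    batches = math.ceil(tape_count / drives)   # rounds of tapes, one tape per drive
    return batches * minutes_per_tape / 60

def fits_objective(tape_count, minutes_per_tape, drives, objective_hours):
    """True if mounting alone fits within the recovery time objective."""
    return tape_mount_hours(tape_count, minutes_per_tape, drives) <= objective_hours

# Hypothetical numbers: 40 tapes when the plan was written, eight times as many today
for tapes in (40, 320):
    hours = tape_mount_hours(tapes, minutes_per_tape=30, drives=2)
    print(tapes, "tapes:", hours, "hours, fits 24-hour objective:",
          fits_objective(tapes, 30, 2, objective_hours=24))
```

With these assumed figures, the original workload fits comfortably inside 24 hours and the eightfold workload does not, before any time is spent restoring or verifying data. Rerunning the estimate whenever data volumes change is one way to notice that a plan needs retesting.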
How realistic a test?
Deciding how realistic to make the test “is a balance between the amount of protection you want” and the cost in money, staff time and disruption, says Repsher. As an organization’s disaster recovery program matures, the tests of its recovery plans should become more challenging, adds Dan Bailey, senior manager at risk consulting firm Protiviti Inc. in Dallas. While the more realistic exercises provide more lessons about what needs improvement, he says, an organization just starting out with a rudimentary plan probably can’t handle a very challenging drill.
Never assume that everything will go as planned. That includes anything from having enough food or desks at a recovery site to having up-to-date contact numbers. Communications problems are common, but they’re easily prevented by having every staff member place a test call to everyone on their contact list, says Kevin Chenoweth, a disaster recovery administrator at Vanderbilt University Medical Center in Nashville.
Also, never assume that the data on your backup tapes is current or that your recovery hardware can handle your production databases. Arnett found subtle differences in the drivers and network configuration cards on his replacement servers that forced him to load an older version of his Oracle database software to recover his data.
After each test, Chenoweth or his staffers review the results with the affected business units and develop specific plans, with timelines, for fixing problems.
Finally, Chenoweth says, thank everyone for their help, especially if the test kept them away from home. “If you’ve got a good relationship, they’re more likely to be responsive” to the firm’s disaster recovery needs, he says.
Testing tip: Ditch the script
A disaster drill isn’t much good if everyone knows what’s coming. But too many organizations script disaster tests weeks ahead of time, ship special backup files to an off-site recovery center, and even make hotel reservations for the recovery staff, says John Jackson, vice president of business resilience and continuity services at IBM Corp. in Chicago.
That eliminates messy but all-too-likely problems such as losing backup tapes in transit or discovering that a convention has booked all the hotel rooms in town. He advises telling the recovery staff, “We just had a disaster…. You can’t take anything out of the building…. You have to rely on the disaster recovery plan and what’s in the off-site recovery center.”
That makes the test more “exciting,” he acknowledges, but it also makes it a lot more useful.
Robert L. Scheier is a writer based in Boylston, Mass. Additional reporting by Mitch Betts.