Learning from disasters

Angry users, upset customers, lost productivity, lost revenues, and long, stressful hours spent trying to fix the problem — these are the horrors associated with a network failure. It’s never a pretty sight.

AT&T Canada Corp. knows this all too well. Last winter, the carrier not only had to deal with power problems surrounding the ice storm in the Quebec area, but also with an unrelated failure of its national telecommunications network in January that disrupted voice and 1-800 services to many customers across the country. The network failure actually had more impact than the ice storm.

“That was what nailed us — it was an SS7 (Signaling System 7) failure through Stentor,” said Renato Discenza, vice-president, network services for AT&T Canada in Toronto. “Calls weren’t completing and there were about 12 hours of pure agony to try to resolve the problem.”

Discenza said the root cause was a hardware design problem: interoperability issues with a mediation device in the Stentor CCS7 (Common Channel Signaling 7) network. Because the device was incompatible with the links, only the SS7 links between Stentor and AT&T Canada were affected.

AT&T Canada network staff have since held numerous discussions on what can be learned from the incident.

One realization that has emerged is that carriers can’t insulate themselves. “We’re all interdependent,” he said.

Even though they’re all competitors, the carriers launched the Canadian Carriers Service Forum several months after AT&T’s network failure. The forum includes organizations such as AT&T Canada, Stentor, BCTel, Telus and Metronet — “anybody and everybody who interconnects,” according to Discenza. The goal is to figure out how to work together and help each other out during crises.

“Our networks are all meshed together, so there’s no way to say, ‘My network’s okay, I don’t care what the other guy does,’” Discenza said. There’s an executive committee that has regular meetings, discussing interoperability, emergency restoration, mutual aid and the impact of new technologies on each other.

Interoperability testing is now stressed as a way of preventing future hardware problems. By and large, he said, all the reputable manufacturers — “the Nortels, the Alcatels, the Ciscos of the world” — will architecturally use high-quality components with high levels of redundancy.

“It’s not so much the box, it’s the way the box interoperates with other parts of the network (that is important),” he explained.

People used to stick with one vendor for their whole network, but now they’re using different vendors for different components.

“Now you have to connect with anybody, anywhere, any time,” Discenza said. “So we’re focusing on working with standards groups, really making sure standards are truly standard, that there’s real interoperability.”

In general, AT&T staff was able to learn a lot from this outage — not just in terms of the specific problem, but also what happens when an outage occurs.

“You can do a lot of planning, but there’s a lot of execution that has to happen when an outage occurs in terms of who you talk to, who has the responsibility to fix it, how do you communicate to customers and critical field personnel what has been resolved — there’s a lot of effort you have to put into that,” Discenza said.

Ben Knebel, vice-president of networking solutions for NCR Canada in Mississauga, Ont., agreed that positive changes can be made based on what was learned after a major failure. He described a time when his network was down for 24 hours after a construction crew dug up all of the main trunks and cables for the phone system.

NCR had some phones up and running on an emergency basis, and cellular phones were used so workers could be contacted in critical situations.

When the lines were back up, NCR dealt with the carrier, asking how to avoid a failure such as this in the future. The results of the discussions were positive.

“One of the things they did was put in another set of trunks far enough away from the original ones that, if there was a disastrous failure, they could re-route better than they were able to before,” Knebel said.

However, most analysts agree it is impossible to completely ward off network failures. There is always something that could unexpectedly disrupt the network, whether it’s a construction crew digging up cable accidentally, a faulty product or a natural disaster.

“Products do fail, no matter how much redundancy you put into the network,” said Mark Fabbi, an analyst at Gartner Group Canada Inc. in Mississauga, Ont.

“The truth is if you want to prevent absolute network outage, you have to have two networks, or three, or perhaps four,” added Thomas Nolle, president of consultancy CIMI Corp. in Voorhees, N.J. “What I’m saying is there is going to be some set of conditions that will destroy as many networks as you have.”
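Nolle’s point can be made quantitative with a toy model. The sketch below is not from the article, and the 1-per-cent failure probability is an invented example; it also assumes the parallel networks fail independently, which is precisely the assumption Nolle warns will break down when one event hits every network at once.

```python
# Toy redundancy model: with n independent parallel networks, each down
# with probability p over some period, all n are down with probability p**n.
# The value p = 0.01 is an arbitrary example, not a figure from the article.
def prob_total_outage(p: float, n: int) -> float:
    """Probability that all n independent networks fail together."""
    return p ** n

for n in (1, 2, 3, 4):
    print(f"{n} network(s): total-outage probability {prob_total_outage(0.01, n):.8f}")
```

Each extra network cuts the independent-failure risk a hundredfold here, but a shared cause (an ice storm, a cut conduit) makes the failures correlated and the formula optimistic.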

With that in mind, Michael Speyer, analyst with The Yankee Group in Boston, stressed the importance of risk assessment and prioritizing.

“First, make a very clear assessment of the level of risk you’re prepared to expose your business to for network failure, then identify the area of your network that would cause the most disruptions to the business should something at that point of the network fail,” he said. “Then define what level of uptime you want and engineer your network accordingly.”
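Speyer’s last step, defining a level of uptime, translates directly into a downtime budget. A minimal sketch, using illustrative availability targets rather than figures from the article:

```python
# Convert an availability (uptime) target into the yearly downtime it permits.
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_budget_hours(availability_pct: float) -> float:
    """Hours of downtime per year allowed by a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {downtime_budget_hours(target):.2f} hours/year of downtime")
```

At 99.9 per cent, for instance, the budget works out to under nine hours a year, less than the 12 hours of trouble AT&T Canada logged in the single SS7 incident.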

In the case of AT&T, the network itself is the revenue-bearing asset for the company, so it would have network uptime as a more critical priority, Speyer added. But other companies have to decide on the balance between how much money to spend and the importance of uptime. If a network failure can spell huge disaster for your company, you’ll want to spend more, but there is no point in spending more than necessary if the risk is not substantial.

CIMI Corp. advises everybody to do a risk assessment.

“You may not elect to address all of the risks because the financials may not be there, it may not be justified,” Nolle said. “But you really need to do a risk assessment to at least alert management what their exposure is on these things.”

When figuring out priorities and how much to spend on preventing network failures and risk assessment, downtime costs should be calculated.

Nolle gave the example of a retailer that only exists on the Internet, such as Amazon.com. If customers can’t get on to the Amazon site, they’ll just get the merchandise from a competing retailer.

“Those are dollars that are gone forever,” he said. “Losing an hour can be critical — from hundreds of dollars to hundreds of thousands of dollars.”

And it’s not only direct revenue losses that can cost companies money. According to a 1998 Gartner Group study, downtime creates costs in terms of lost productivity, damaged reputation, financial performance and other possible expenses such as litigation, temporary employees, equipment rental and overtime costs.
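The Gartner categories can be folded into a back-of-the-envelope downtime cost. Everything below is a placeholder for illustration; the function and its inputs are assumptions, not figures from the study.

```python
# Rough downtime-cost estimate combining direct revenue loss, idle-labour
# cost and one-off recovery expenses (overtime, rentals, temporary staff).
# All example figures are invented for illustration.
def downtime_cost(hours_down: float,
                  revenue_per_hour: float,
                  employees_idle: int,
                  loaded_wage_per_hour: float,
                  fixed_recovery_costs: float) -> float:
    """Total cost of an outage under this simple three-term model."""
    lost_revenue = hours_down * revenue_per_hour
    lost_productivity = hours_down * employees_idle * loaded_wage_per_hour
    return lost_revenue + lost_productivity + fixed_recovery_costs

# A 4-hour outage at $5,000/hour revenue with 50 idle staff at $40/hour
# and $12,000 in recovery expenses:
print(downtime_cost(4, 5000, 50, 40, 12000))  # → 40000
```

Reputation damage and litigation resist this kind of arithmetic, which is one reason the study lists them separately.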

Michael Thompson, director of computer and network services for the Maryland Institute, College of Art in Baltimore, Md., said the reputation factor definitely affects him. His college is a visual arts school that does a lot of computer graphics work, and the network is integral to many classes. If the network is always down, people complain, and complaints can reach potential students who may decide not to enrol if the college’s reputation is poor.

For those catastrophes that often can’t be foreseen or avoided, a disaster recovery plan is always considered a good idea.

Speyer again emphasized prioritizing, saying the complexity of your disaster recovery plan will have to depend on how critical the network is to your business.

For instance, companies for which network uptime is absolutely critical — such as financial institutions — will go as far as having a disaster recovery site to go to if they lose one site entirely. Buses will be arranged and everyone will go work at the alternate site, called a “hot site,” until the main network can be restored, according to NCR’s Knebel.

Last year, AT&T Canada demonstrated a disaster recovery exercise for Canadian customers to give them confidence in its ability to recover from disaster. The telco simulated a major test centre blowing up. The whole building’s network was restored within 48 hours.

“We have trailers with equipment inside, teams that are trained in how to roll it on to the site, put power in, put grounding in and hook it up so it literally will replace the catastrophic site, whether it’s been damaged by fire, flood or whatever,” Discenza said.

While these types of catastrophic network failures do happen occasionally, smaller LAN failures are more common and can still do a good deal of damage.

The Maryland Institute’s Thompson outlined the steps he takes when a network failure occurs.

“Whenever I have a failure, my first thought is to check the physical layer first, making sure the cabling’s all tight, that I’ve got link lights between network interface cards on workstations and the hubs or the switches,” he said.

Then he makes sure he has connectivity between those devices and the routers. After that he starts checking the logical layer, starting at the workstation and working his way up. He does the hardware layer first and then the software layer.

“I go from the workstation to the hub, from the hub to any other switch or hub it might be daisy-chained to, to the router, and eventually to our border router.”

Thompson also suggested regular inspections of the physical components of the network if you have the resources to do it.

“Check your plugs, check your wires, check your wiring panels — run diagnostics on everything now and then to make sure it’s responding correctly,” he said.

There is also a variety of products available to help with network uptime. Thompson said he thinks network sniffer devices, such as those made by Network Associates, can be very useful for diagnosing network problems to catch them before they become disastrous for the network.

“You can set alarm points on the sniffer so that if it sees excessive traffic or traffic that’s not supposed to be there, then it can alert you,” he said.
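The alarm idea reduces to a threshold over sampled traffic rates. A minimal sketch of that logic, where the threshold and the sample values are invented; a real sniffer applies this to live capture rather than a list.

```python
# Flag sampling intervals whose packet rate exceeds an alarm threshold.
# 10,000 packets/second is an arbitrary example threshold.
THRESHOLD_PPS = 10_000

def alarm_intervals(samples_pps):
    """Indices of samples that should raise an excessive-traffic alarm."""
    return [i for i, rate in enumerate(samples_pps) if rate > THRESHOLD_PPS]

print(alarm_intervals([800, 1200, 15000, 900]))  # → [2]
```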

So whether it’s a company’s LAN or a large carrier network such as AT&T’s, failures do happen. The focus, therefore, must be on minimizing failure as much as possible in a variety of ways. Today’s devices and products are inherently stable, but it’s network design and architecture that need the most attention.