Network availability comes down to people, roadmaps

As the manager of information systems at the McMichael Canadian Art Collection in Kleinburg, Ont., Alexander Meadu calls himself “the engineer among the artists.” But that’s not to say Meadu lacks creativity.

In fact, he can’t afford to be any less imaginative than the artists featured in the McMichael collection. It takes plenty of computing power to keep track of thousands of works of art, and the gallery relies on Meadu to make sure the requisite technology is always available.

“You have to register every piece of art,” he said, describing technology’s role in the collection. “You have to know exactly who has the copyright to it, the sizes and the description, the artist, when it was painted, where is it located in our building, or if it’s on a loan, where is it? Where is it going? What kind of insurance do we need for it?”

Although a relatively small shop of fewer than 100 employees, the collection operates much like a large company in one respect: when the network goes offline, tempers flare.

“If the network doesn’t work for even a few minutes, people start screaming,” Meadu said. “We cannot really afford to be down.”

Meadu attended the “5-9s Availability” conference held in Toronto on April 24 to learn about building a robust network. Produced by IT World Canada Inc. and sponsored by Compaq Computer Corp., Hewlett-Packard Co. and others, the conference explored ways to attain 99.999 per cent (“five nines,” or “5-9s”) network uptime, which leaves room for little more than five minutes of downtime over an entire year.
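The arithmetic behind an availability target is simple enough to check by hand or in a few lines of code. The following is a minimal sketch, not something presented at the conference: the function name `downtime_minutes_per_year` and the choice of a 365-day year are assumptions made for illustration.

```python
# Minutes in a non-leap year (assumption: 365 days; a leap year adds 1,440 more).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Compare three, four and five nines side by side.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

Under these assumptions, five nines works out to roughly 5.3 minutes of downtime a year, while three nines allows about 525.6 minutes, or a little under nine hours.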

Analysts, vendor representatives and conference-goers presented best – and worst – practices in network caretaking at the event, suggesting that the ultimate high-availability network depends less on whizbang technology and more on people.

“We can have great underlying technology, but if we don’t have the right management…it doesn’t matter,” said Mark Fabbi, a Toronto-based industry researcher with Gartner Inc. and a speaker at the conference.

During his presentation, Fabbi said companies should employ a chief policy officer (CPO), an executive who collects information from the company’s business divisions and advice from the networkers. The CPO would help decide which technology to buy to support the myriad business needs. This executive would also help pen the policies associated with maintaining that technology.

Planning is important, Fabbi said. If network changes are required, test them out in a quarantined environment first; have a “back-out” plan in case the changes don’t act as expected; and don’t be afraid to test that back-out plan.

He also suggested that companies should identify the mission-critical applications and assess the required quality of service (QoS) for each app individually. Every division has its own apps and is unique in its own right, he said. After all, QoS means something different to the accountants than it does to the corporate communications department, and it’s important to recognize that diversity.

Fabbi said it all comes back to the employees – what they do at the company and the way they work. “It’s not a technology issue. If you want to make it happen, it has to happen from a people perspective.”

During his own presentation, Michael Beck, CEO of Intria-HP Corp., an IT outsourcing firm in Toronto, said outages are caused by technology glitches just 20 per cent of the time; 40 per cent of the blame can be heaped on human error.

Wendy Bartlett, whose title is “indestructible scalable computing initiative leader” at Compaq, told the audience “you get what you pay for” when hiring. Invite smart technologists to join the networking fold and you’ll have fewer problems. “Your operations shouldn’t be [staffed by] your lowest-paid people who were flipping burgers a week before,” she said. If talent is difficult to find, consider outsourcing portions of IT before sacrificing network availability.

Write things down, Bartlett said. “People’s IQs go down when things go wrong,” but if they have a guide documenting troubleshooting solutions and step-by-step processes to keep the business running, the staff appear that much brighter.

As for “worst practices,” Fabbi said to beware of service level agreements (SLAs) that seem too good to be true. Some service providers promise uptime in the 5-9s but can’t deliver. Instead, they might offer a discount when the network fails – and in Fabbi’s opinion, it will fail. “Service providers are making more and more outrageous statements” to win contracts, he said.

Omry Farajun, channel manager with Storage Guardian Inc., a backup-and-restore firm in Toronto, said an SLA is little more than a “game plan” that is only as good as the people and the policies involved. “If [the customer is] not going to implement [policies], what’s the other side of the coin? We’re not an insurance policy.”

Fabbi also warned against throwing money at network problems. Increasing the amount of available bandwidth doesn’t make trouble go away; it simply hides it. As well, not every company should aim for 5-9s availability, because not everyone needs it.

“It’s attainable, but the question is, do you need it?” Fabbi said. “That’s the bigger issue. In parts of your infrastructure it can be attainable, but it may not be what makes your business run. You have to go back to the cost-benefit analysis. Do your risk assessment; calculate the cost of downtime. What’s the likelihood of downtime? Make your cost-benefit justification, which makes it easier to come to some kind of conclusion, what level of availability is required.”

That’s one lesson among many that Meadu from the McMichael Collection planned to take back to Kleinburg. By the end of the presentations he had decided that 5-9s aren’t all that important for his workplace, but planning certainly is.

“Internally, we’re probably not so pressed to get 5-9s. We have 99.9 – three nines – which is perfectly acceptable for us.”

Meadu added that there were some ideas he planned to implement, chiefly around management and planning.

“When you’re dealing with day-to-day operations, it’s hard to stop and think about the recovery plan. We have a recovery plan, but it’s not properly outlined. We need to practice and make a procedure for major failures.”

A little math makes a big difference

Gartner Inc. researcher Mark Fabbi told the audience at the “5-9s Availability” conference that enterprises should focus primarily on keeping mission-critical applications up and running. He provided a formula to calculate the “expected cost of downtime” and to learn just what mission-critical means in your shop.

“Expected cost is a function of the cost [of] failure, the cost to protect against the failure and the probability that a failure event will occur,” read Fabbi’s presentation.

Expected cost = (failure cost – protection cost) x probability of occurrence
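To see how the formula behaves, here is a short sketch that implements it exactly as printed above. The function name `expected_cost` and the dollar figures in the example are hypothetical, chosen only to make the arithmetic easy to follow; they do not come from Fabbi’s presentation.

```python
def expected_cost(failure_cost: float, protection_cost: float, probability: float) -> float:
    """Apply the formula as printed:
    expected cost = (failure cost - protection cost) x probability of occurrence.
    """
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    return (failure_cost - protection_cost) * probability


if __name__ == "__main__":
    # Hypothetical figures: an outage that would cost $50,000, protection
    # that costs $10,000, and a one-in-four chance of the outage in a year.
    result = expected_cost(failure_cost=50_000, protection_cost=10_000, probability=0.25)
    print(f"Expected cost of downtime: ${result:,.2f}")
```

With those invented numbers the result is $10,000. Running the same calculation for each application, with a shop’s own estimates, is one way to rank which systems are truly mission-critical and which can live with fewer nines.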