Network availability is about people, not metrics

As the manager of information systems at the McMichael Canadian Art Collection in Kleinburg, Ont., Alexander Meadu calls himself “the engineer among the artists.” But that’s not to say Meadu lacks creativity.

In fact, he can’t afford to be any less imaginative than the artists featured in the McMichael collection. It takes plenty of computing power to keep track of thousands of works of art, and the gallery relies on Meadu to make sure the requisite technology is always available.

“You have to register every piece of art,” he said, describing technology’s role in the collection. “You have to know exactly who has the copyright to it, the sizes and the description, the artist, when it was painted, where is it located in our building, or if it’s on a loan, where is it? Where is it going? What kind of insurance do we need for it?”

Although a relatively small shop of fewer than 100 employees, the collection operates much like a large company in one respect: when the network goes offline, tempers flare.

“If the network doesn’t work for even a few minutes, people start screaming,” Meadu said. “We cannot really afford to be down.”

Meadu attended the 5-9s Availability conference held in Toronto yesterday to learn about building a robust network. Produced by IT World Canada Inc. (which owns ITWorldCanada.com), Compaq Computer Corp., Hewlett-Packard Co. and others, the conference explored ways to attain 99.999 per cent (often dubbed “five nines” or 5-9s) network uptime, wherein the network goes down for less than 10 min. every 12 months.
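
For readers who want to check the arithmetic behind an uptime target, the short Python sketch below converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration and were not presented at the conference. By this calculation, five nines allows roughly 5.26 minutes of downtime a year.

```python
# Illustrative sketch only: the function name and the 365-day year are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the most downtime (in minutes per year) an uptime target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label} ({pct} per cent): {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowable downtime by a factor of 10, which is why the jump from three nines to five is so demanding.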

Analysts, vendor representatives and conference goers presented best – and worst – practices in network caretaking at the event, suggesting that the ultimate high-availability network depends less on whiz-bang technology and more on people.

“We can have great underlying technology, but if we don’t have the right management…it doesn’t matter,” said Mark Fabbi, a Toronto-based industry researcher with Gartner Inc.

Fabbi advocated employing a chief policy officer (CPO), an executive who collects information from the company’s business divisions and advice from the networkers. The CPO would help decide which technology to buy to support the myriad business needs. This executive would also help pen the policies associated with maintaining that technology.

Planning is important, Fabbi said. If network changes are required, test them out in a quarantined environment first; have a “back out” plan in case the changes don’t act as expected; even test that back out plan.

He also suggested companies should identify the mission critical applications and assess the required quality of service (QoS) for each app individually. Every division has its own apps and is unique in its own right, he said. After all, QoS means something different to the accountants than it does to the corporate communications department, and it’s important to recognize that diversity.

Fabbi said it all comes back to the employees – what they do at the company and the way they work. “It’s not a technology issue. If you want to make it happen, it has to happen from a people perspective.”

During his own presentation, Michael Beck, Toronto-based CEO of Intria-HP Corp., an IT outsourcing firm, cited a study that found outages are caused by technology glitches just 20 per cent of the time. The study blamed human error for 40 per cent of network messes.

Wendy Bartlett, whose title is “indestructible, scalable computing initiative leader” at Compaq, told the audience “you get what you pay for” when hiring. Invite smart technologists to join the networking fold and you’ll have fewer problems. “Your operations shouldn’t be your lowest-paid people who were flipping burgers a week before,” she said. If talent is difficult to find, consider outsourcing portions of IT before sacrificing network availability.

Write things down, Bartlett said. “People’s IQs go down when things go wrong,” but if they have a guide documenting troubleshooting solutions and step-by-step processes to keep the business running, the staff appear that much brighter.

As for “worst practices,” Fabbi said to beware of service level agreements (SLAs) that seem too good to be true. Some service providers promise uptime in the 5-9s, but can’t deliver. Instead, they might offer a discount when the network fails – and in Fabbi’s opinion, it will fail. “Service providers are making more and more outrageous statements” to win contracts, he said.

Hogwash, said one conference attendee. Omry Farajun, channel manager with Storage Guardian Inc., a backup and restore firm in Toronto, said an SLA is little more than a “game plan” that is only as good as the people and the policies involved. “If you’re not going to implement those, what’s the other side of the coin? We’re not an insurance policy.”

Fabbi also warned against throwing money at network problems. Increasing the amount of available bandwidth doesn’t make trouble go away, but simply hides it. As well, not every company should aim for 5-9 availability, because not everyone needs it.

“It’s attainable, but the question is, do you need it?” Fabbi said. “That’s the bigger issue. In parts of your infrastructure it can be attainable, but it may not be what makes your business run. You have to go back to the cost-benefit analysis. Do your risk assessment; calculate the cost of downtime. What’s the likelihood of downtime? Make your cost-benefit justification, which makes it easier to come to some kind of conclusion, what level of availability is required.”
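
The cost-benefit test Fabbi describes can be roughed out in a few lines of Python. This is a minimal sketch, not a method Fabbi presented: the function names, the hourly downtime cost and the yearly upgrade cost are all hypothetical figures chosen for illustration, and the model simply treats the availability percentage as the expected share of the year the network is up.

```python
# Illustrative sketch only: every name and dollar figure here is hypothetical.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def expected_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Estimate the yearly cost of downtime at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * cost_per_hour


def upgrade_is_justified(current_pct: float, target_pct: float,
                         cost_per_hour: float, annual_upgrade_cost: float) -> bool:
    """Return True if the downtime avoided is worth more than the upgrade costs."""
    savings = (expected_downtime_cost(current_pct, cost_per_hour)
               - expected_downtime_cost(target_pct, cost_per_hour))
    return savings > annual_upgrade_cost


if __name__ == "__main__":
    # Hypothetical small shop: an hour of downtime costs $2,000, and moving
    # from three nines to five nines would cost $50,000 a year.
    worth_it = upgrade_is_justified(99.9, 99.999, 2_000, 50_000)
    print("Upgrade justified" if worth_it else "Upgrade not justified")
```

With these made-up numbers the upgrade does not pay for itself, because the avoided downtime is worth less than the yearly spend. A business that loses far more per hour of outage would reach the opposite conclusion.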

That’s one lesson among many that Meadu from the McMichael Collection planned to take back to Kleinburg. By the end of the presentations he had decided that 5-9s aren’t all that important for his workplace, but planning certainly is.

“Internally, we’re probably not so pressed to get 5-9s. We have 99.9 – three nines – which is perfectly acceptable for us.

“There are some ideas I plan to go back and implement, mainly around management and planning,” Meadu continued. “When you’re dealing with day-to-day operations it’s hard to stop and think about the recovery plan. We have a recovery plan, but it’s not properly outlined. We need to practice and make a procedure for major failures.”