IT departments are awash with performance statistics, but numbers don’t always tell the truth about whether customers are getting good online service.
But a mathematical formula might.
At least that’s what Tony Davis, a vice-president and senior consulting fellow at CA Technologies believes.
Davis is one of the perks organizations get for spending millions of dollars on CA infrastructure software. As a free service, he and two other consultants advise them on the best ways to leverage their purchase.
On Tuesday he was in Toronto, which he hits about once a month, and spent some time talking with IT World Canada about what he’s learned in over 20 years in IT.
Based in New York City, Davis came to CA [Nasdaq: CA] two years ago after having held senior IT posts at FedEx, including leading software development and infrastructure teams.
It was at FedEx he came up with the idea of creating a mathematical formula to measure what he calls business service reliability.
A short way of thinking about it: the formula tries to account for the difference between the glorious uptime statistics in the network operating centre and the howls of performance complaints contact centre staff are fielding from customers.
The formula’s a bit complicated, but think about it this way: if a bank’s customer goes online to check their balance and then logs out, it’s a single transaction. But that transaction has multiple actions – log in, fetch the checking account, check that one entry is right, log out. Each action should have a performance standard to meet – in this case, measured in milliseconds.
The formula tests each link. If there’s a failure to meet the performance standard the entire transaction fails.
At the end of a set period of time – say, a day – compute how many transactions failed the standard and you have a good idea if you have reliable service.
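Davis hasn’t published the formula itself, but the all-or-nothing logic he describes can be sketched in a few lines. This is an illustrative assumption, not his actual implementation: each action in a transaction carries its own performance standard, a transaction passes only if every action meets its standard, and reliability for a period is the share of transactions that passed.

```python
# Sketch of the business service reliability idea described above.
# The structure (per-action thresholds, all-or-nothing transactions)
# follows the article; the specifics are assumed for illustration.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    duration_ms: float
    standard_ms: float  # performance standard for this action

def transaction_meets_standard(actions):
    """A transaction passes only if every action meets its standard."""
    return all(a.duration_ms <= a.standard_ms for a in actions)

def reliability(transactions):
    """Share of transactions over a period that met every standard."""
    if not transactions:
        return 1.0
    passed = sum(transaction_meets_standard(t) for t in transactions)
    return passed / len(transactions)

# The banking example: log in, fetch the account, verify an entry, log out.
txn_ok = [Action("log in", 120, 200), Action("fetch checking", 80, 150),
          Action("verify entry", 40, 100), Action("log out", 30, 100)]
txn_slow = [Action("log in", 120, 200), Action("fetch checking", 300, 150),
            Action("verify entry", 40, 100), Action("log out", 30, 100)]

day = [txn_ok, txn_slow, txn_ok, txn_ok]
print(reliability(day))  # 1 of 4 transactions failed -> 0.75
```

Note how this differs from an availability metric: every server could report 99 per cent uptime while the slow fetch in `txn_slow` still drags reliability down to 0.75.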
Why the disconnect? Because of those misleading infrastructure performance screens in the NOC, says Davis, and the wrong people watching them.
“What I see consistently is people making a big investment in tools and they put what I would call the standard computer operator in front of it, like in a NOC. And I get that because they’re trying to bring everything into one central place – and I don’t disagree with that – but I think it’s a problem because most of the people in the NOC look at Java alerts coming in because the SLO was broken and they’re like ‘What do we do with this?’
“I went to one customer and asked to see his NOC, because he was very proud … and when I got there the screens on the wall were filled with orange and yellow and red alerts. And I was sitting there and watching, thinking I was going to see a triage.” But nobody was doing anything.
“So I asked one of the lead operators ‘What are we going to do? Is this a big thing going on?’ and he goes, ‘We don’t know what those are, so we wait for 10 minutes, and then we clear them.’”
Or he recalls being at a company responsible for a sizable dot-com that was getting lots of complaints from customers despite its 24 high-availability clusters, which generated lots of 99 per cent uptime statistics.
If a couple of servers were giving trouble – say because poor code was running – the server manager would just pull the servers. That, of course, affected thousands of online customers. But the availability of the cluster didn’t change.
Davis says his formula approach – which he admits isn’t entirely scientific – looks at reliability, not availability.
Asked if complaints from customers to a call centre aren’t supposed to be the warning signal IT should heed, Davis says that to some degree that’s true. But it’s also reactive. Looking at service reliability is proactive, he argues.