Strategies for preventing lights out

Michael Smith spent 12 years in IT in the insurance industry before joining Ernst & Young LLP four years ago, where he is a principal in Security and Technology Solutions. His comments have been edited for brevity.

IT Focus: There seems to be a shift from business continuity to business resiliency. Just what is that?

Michael Smith: That’s the coming thing. Early last fall there was an [American] interagency white paper put out jointly by the SEC [Securities & Exchange Commission], the OCC [Office of the Comptroller of the Currency] and the Federal Reserve. It made some pretty far-reaching suggestions. They presented it as a draft for comment on what some very key and critical participants in the payments, clearing and settlement systems needed to do. It applied to about 15 banks and seven to 10 brokerage houses.

The one summary sentence that describes it all, and has really large ramifications, was along these lines: regardless of whatever happens, you as a major participant in the system must be able to close your books at the end of the business day.

When you stop to think about it, I have 1,000 people in lower Manhattan, let’s say, who are responsible for closing my bank’s books on this particular settlement system that we are participating in within the Federal Reserve System. If I have to be able to do that on the same day, that means that I have to be able to split that department of 1,000 people in two. I have to put them far enough apart – and the draft interagency white paper suggested 200 to 300 miles apart – that they’re not affected no matter what happens. And not only that, if I’m going to close the same day, I’m going to have to beef up this operation so that if one chunk of it can’t continue for whatever reason, the other chunk can continue and close my books. That was a very onerous requirement.

The comment period has closed now and they’ve had responses back from the participants who would be subject to this potential regulation. They are backing off from some of the premises.

Now that is happening in the U.S., and I would expect that OSFI (Office of the Superintendent of Financial Institutions) is watching very closely and carefully. Let’s not forget that although it talks about 15 banks and seven to 10 brokerage houses, some of our banks and our brokerage houses may well be affected by this – not because they are in that tier one level but because they are close to it. And they may choose to do it anyway, or they may say ‘look, the writing is on the wall, this is coming to the tier twos as well, so let’s just do it now.’

IT Focus: So if I were to ask each of our major banks ‘are you going to do this?’ what kind of a response do you think I’d get?

Smith: That’s a tough question. I don’t think any of our major banks are in a position right now to do that. That’s one of the reasons the three agencies sent it out as a draft paper for comment. There’s some sober reflection going on right now about the cost of doing this business resilience thing.

For companies with extremely critical processes, particularly where an entire industry depends on that process, I think those organizations will get to that level of resilience in the next five or 10 years. I just don’t see how they can avoid it. It will be a requirement and a cost of doing business in that kind of critical process.

IT Focus: What are the common sources of downtime?

Smith: From a computer perspective, the most common source is a power failure or power interruption of some sort. Upwards of 30 per cent of breakdowns or outages in computer systems are power related. If it is that big, why don’t organizations automatically make sure they have a UPS – uninterruptible power supply – in place? It is typically a battery or a big set of batteries that continues running your systems for a period of time – typically 30 minutes. That tends to take away 90-odd per cent of the major peaks, dips, surges and outages of power. And if an outage continues past 15 minutes, that still leaves another 15 minutes to gracefully bring down your computer systems.
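The arithmetic behind that 30-minute battery can be sketched in a few lines. This is a minimal illustration of the reasoning above, not anything from a real UPS vendor’s tooling; the function name and figures are assumptions drawn from the numbers Smith cites (a 30-minute battery and a 15-minute graceful shutdown).

```python
def shutdown_deadline(battery_minutes: int, shutdown_minutes: int) -> int:
    """Latest point into an outage at which a graceful shutdown must
    begin so that it finishes before the UPS batteries are exhausted."""
    return battery_minutes - shutdown_minutes

# With a 30-minute battery and a 15-minute shutdown procedure, an outage
# that runs past 15 minutes still leaves time to bring systems down cleanly.
print(shutdown_deadline(30, 15))  # 15
```

In other words, the battery is not meant to ride out a long outage – it buys a fixed window in which to either see power restored or shut down in an orderly way.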

If you’re a big organization, why don’t you have a diesel generator backup, at least to continue past the 30 minutes? I still don’t understand why more organizations don’t spend the money on that. The banks do and the large insurance companies do. But I know a couple of pension plan administrators that don’t, and some brokerage houses that don’t – and you’d have thought they particularly would, with their trading kind of activity.

IT Focus: What are some other sources of downtime?

Smith: One of the problems facing people is that as they develop these new applications and make them available across the Internet, for example, these are now becoming seven-day-a-week, 24-hour applications that need to be available all the time. That makes it a little difficult to do things other than just run the application. You have to maintain the application, do testing, upgrade hardware, change the network, run backups. Those are all things that we used to put into that overnight window when we brought the online system down at 9 o’clock at night. We’re losing that opportunity.

When we look at high availability, we talk about component failure, system failure and site failure. Site failure really is disaster recovery. Component failure means the failure of any single component – any link along the way. We’re not just talking about the server sitting in the computer room.

IT Focus: What key message do you have for companies to protect themselves from technology failure and natural disasters?

Smith: There is a process that you go through. Traditionally you break it down and do your analysis up front. We call it business impact analysis: which of your business processes are most critical? In large, complex businesses, that takes a lot of work to figure out. In a large insurance corporation it takes a while to distil down to what is truly critical. The other thing that operates here is: when is it critical? Some things may be critical right away and other things may be critical only 30 days down the road. What’s critical? When is it critical?

Then if you take your analysis one more step and say: if this process is critical on day three, what do I need to have in place to continue that process on day three?

If you do that analysis for the entire corporation – what’s critical, when and what do they need to do it – you’ve built a profile of requirements for continuing the business on a compromised basis. There are lots of things you’re not going to do, but over time you’re going to build it up again.

You add this all up and it says that on day three, of the 2,000 people, you need 300 working somewhere. Then you start strategizing how you’re going to do that. ‘A branch office 20 miles away can accommodate 500 people. I’ll take those 300 people and put them in there, kick out the non-critical ones and tell them to go home. Or, I’ll go to a commercial recovery services vendor, like a Fusepoint, say I’ll need spots for this number of people and I need these computer systems, and start negotiating a deal for recovery services that a Fusepoint will guarantee are available to its customer at the time of a disaster.’
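The requirements profile Smith describes – what’s critical, when, and what it needs – amounts to a simple aggregation. A minimal sketch of that idea, assuming each process is recorded with the day it becomes critical and the headcount it requires; the process names and numbers here are invented for illustration, not taken from the interview.

```python
# Each entry: (process name, day it becomes critical, people required).
processes = [
    ("settlements",   1, 120),
    ("claims intake", 3, 100),
    ("new business",  3,  80),
    ("marketing",    30,  50),
]

def staff_needed_by(day: int, procs) -> int:
    """Total headcount for every process critical on or before `day` –
    the 'continue the business on a compromised basis' requirement."""
    return sum(people for _, critical_day, people in procs if critical_day <= day)

print(staff_needed_by(3, processes))   # the day-three requirement
print(staff_needed_by(30, processes))  # non-urgent processes added back later
```

Running the same question for each horizon (day one, day three, day 30) yields the build-up profile the recovery strategy is then sized against.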

Then having sorted out your strategy, you start developing plans to make it happen. Then you exercise them to see if they will work and to train the people. They are more [like] training vehicles than tests of capability.

Another fundamental rule of disaster recovery is to make sure that when you take backups, they are actually stored safely off site, at a distance. They should be in another building far enough away that you’re comfortable.

A lot of organizations who did disaster recovery well for their mainframe don’t do it well for their distributed system servers because these are sort of like weeds that grew up in the patch beside the plant that you’re trying to grow, the mainframe.

IT Focus: What technology do you see as worthy of note?

Smith: SANs are interesting. Networked storage – EMC, Hitachi and IBM all have a variant on this – whereby you have the disk drives at a distance, and those disk drives are in effect being constantly updated, keeping a fresh, hot copy of your production data. That’s a very good technology. We like it a lot for those incredibly critical applications that just can’t go down. An emerging technology is electronic vaulting, where you do your backups off site electronically, maybe across the Internet or maybe across dedicated bandwidth. It’s definitely coming.