NT clustering still bedevils Microsoft, while users wait

When it comes to NT clustering, Microsoft is almost a decade late and many dollars short. Why? While several pundits have suggested technical reasons, it could be more of a cultural issue for the company: Microsoft doesn't think about servers the way you and I do.

A cluster is a collection of servers or individual hard disks that can act as a single entity to improve reliability and availability. When one component fails, applications continue working without interruption on the remaining servers or disks.

But while clustering may be a simple concept to explain, it is devilishly difficult to implement. You have to design your operating system to know how to take over for a failed machine while keeping user sessions active. You need to understand how applications such as databases, Web servers and other network services use your server's memory and disk resources, and how the operating system handles user logons and other network resources.
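The takeover described above can be sketched in a few lines. What follows is a hypothetical heartbeat-and-adopt design, invented purely for illustration (the class names and the five-second timeout are my own assumptions, not any vendor's actual implementation): each node announces it is alive, and when one goes silent, a survivor adopts its workload.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # hypothetical: seconds of silence before declaring a node dead

class Node:
    """One server in the cluster (illustrative sketch only)."""
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()
        self.services = []  # workloads this node currently runs

    def beat(self):
        """Called periodically by a healthy node to announce it is alive."""
        self.last_heartbeat = time.monotonic()

    def is_alive(self, now):
        return now - self.last_heartbeat < HEARTBEAT_TIMEOUT

def fail_over(nodes):
    """Move services off any node whose heartbeat has gone silent,
    so user sessions continue on a surviving machine."""
    now = time.monotonic()
    survivors = [n for n in nodes if n.is_alive(now)]
    for node in nodes:
        if node not in survivors and node.services:
            survivors[0].services.extend(node.services)
            node.services = []
    return survivors
```

Real cluster software must also replicate session state and disk data so the adopted workload resumes where it left off; this sketch shows only the detection-and-handoff skeleton.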

The ideal cluster has three or more linked servers. Odds are low that the critical components in all these servers would fail at the same moment.
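Those odds shrink geometrically with cluster size. As a back-of-the-envelope illustration, assume (hypothetically, and assuming failures are independent) that each server is down 1 percent of the time:

```python
def all_down_probability(per_node_down: float, nodes: int) -> float:
    """Probability that every server in the cluster is down at once,
    assuming each fails independently with the given probability."""
    return per_node_down ** nodes

# Hypothetical numbers: each server unavailable 1% of the time.
# One server: 1 in 100. Two: 1 in 10,000. Three: 1 in 1,000,000.
for n in (1, 2, 3):
    print(f"{n} node(s): about {all_down_probability(0.01, n):.0e}")
```

Independence is the optimistic assumption here; shared power, networks and software bugs can fail several nodes at once, which is why real clusters also isolate those components.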

Microsoft is still far behind the curve on what many Unix vendors, including Sun, Hewlett-Packard, Compaq/Tandem and NCR, have been delivering for close to a decade. Sun's Solaris Cluster, for example, can support up to 256 nodes.

Currently, NT can support only a meager two-node cluster, although Microsoft has unrealized plans to move beyond that. Two-node systems really don't do the job and aren't the true insurance policy that three-node or larger systems offer: when one node fails, no redundancy remains while it is repaired, and the survivor can't tell a dead partner from a broken network link.
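The two-node weakness can be made concrete with the majority-quorum rule most clustering designs use to avoid "split brain," where both halves of a severed cluster try to run the same application. A minimal sketch (the function is my own illustration, not any product's API):

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """A partition may keep running only if it holds a strict
    majority of the cluster's nodes."""
    return alive_nodes > total_nodes // 2

# Two-node cluster with a severed link: each side sees one of two
# nodes. Neither has a majority, so neither can safely continue.
print(has_quorum(1, 2))  # False

# Three-node cluster: the two nodes that can still see each other
# outvote the isolated one and keep the applications running.
print(has_quorum(2, 3))  # True
```

This is the arithmetic behind the "true insurance policy" of three or more nodes: an odd-sized cluster can always break the tie.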

Microsoft employs a large number of people to design and write operating systems. NT has been out for many years, and true clustering has been promised for almost as long. So why is it taking Microsoft so long to get NT clustering right? I think it comes down to corporate culture.

In its Redmond, Wash., headquarters, one of the nice things that Microsoft has done (taken from Apple, if you really want to give credit) is enable every desktop to become a peer server at a moment’s notice, with just a few mouse clicks. Given that servers come and go on the corporate network, it’s hard for anyone at Microsoft to take them seriously. That fact makes it culturally harder to develop solid clustering because anyone can mount a hard disk from across campus.

Most corporations aren’t eager to use peer servers to run their mission-critical applications. They want dependable, consistent machines that sit on raised floors, with back-up power and back-up tapes spinning nearby. Clustered servers are the next extension of this mainframe mentality — they make important applications, such as your corporate Web site and various databases, almost impervious to downtime.

There is hope for NT and, interestingly, it comes from IBM. IBM is working on not one, but two clustering technologies — one public and another that has been kept mostly under wraps in IBM labs.

IBM introduced the public technology, called Cornhusker, in May. It can handle a cluster of up to eight NT servers. Basically, IBM has written extensions to the operating system to handle the switching among failed systems.

In my role as a consultant, I got a chance last month to see another new IBM technology that turns the concept of clustered servers on its ear: a high-reliability, very high-performance, yet low-cost network for distributed PCs. Here the switching among computers is accomplished in silicon, in a series of chips mounted on a PCI card. The technology I saw was running on NT, although it could easily (at least according to my sources at IBM) be developed for other operating systems, including Linux. Too bad the technology is still behind the closed doors of IBM's labs, because this is exactly the kind of product corporations have been waiting for: inexpensive, highly reliable PCs.

It’s a shame that it has taken NT so long to catch up with the Unix world when it comes to clustering. But maybe IBM’s efforts can help Microsoft get over its cultural blind spot and turn NT into a true mission-critical operating system.