Sun’s memory problems persist

Some users of Sun Microsystems Inc.’s UltraSPARC servers continue to have problems with a defective memory component several months after a senior Sun executive said the company was close to declaring “complete victory” over the nagging issue. But Sun and analysts recently insisted that the company has made significant progress in addressing the problem.

The defect is in an external memory cache on Sun’s UltraSPARC II microprocessors. Under certain conditions, the problem has been triggering system failures and frequent reboots at dozens of customer locations worldwide for more than 18 months.

Sun has acknowledged that it has been grappling with the defect for some time. But in an interview with Computerworld (U.S.) in August, Sun Executive Vice President John Shoemaker said the company was close to fixing the problem with a “mirrored-cache” technology that was due in October.

Sun also said it had “cache-scrubber” patches and various environmental recommendations that should have alleviated the situation for users.

“The kernel scrubber software is shipping, the best practices are in place, and we’ve begun shipping mirrored [memory] where they are needed to achieve satisfactory uptime,” a Sun spokesperson said in an e-mail.

However, some users recently reported that their situation hadn’t changed at all, despite having tried some of Sun’s suggestions. In fact, a major utility in the U.S. is asking Sun to take back three of its midrange servers, collectively valued at more than US$500,000, because of Sun’s continuing inability to resolve the problem.

“The decision was made following the long history of problems, pseudo-fixes and evasions by the Sun representatives,” said a user at the utility who requested anonymity.

The utility company will continue to use Sun servers for Web-based applications, but it has moved the database application that was running on the Sun servers to a Compaq Computer Corp. Unix server.

Norman Morrison, an independent project consultant working at a service provider that hosts Web sites for companies that sell sporting goods, said he’s another unhappy customer. “To date, we have gotten no satisfaction on this problem,” despite continuing server crashes and attempts to fix them, he said.

Less than a month ago, the service provider bought several new Sun servers, one of which has already begun crashing because of memory-related issues, Morrison said. Because the service provider uses Sun servers for all its production and development applications, Sun is pretty much locked in as its vendor, he added.

Based on Sun’s information, at this point, “the mirrored cache appears to be the only way they have corrected the problem with 100 per cent certainty,” Morrison said. But he added that Sun told him the technology won’t be available until the end of the month and that companies must get on a list to receive the fix.

“What they are probably trying to do is to prioritize who gets it first,” said Bill Moran, an analyst at D.H. Brown Associates Inc. in Port Chester, N.Y. “It sounds to me [as if] Sun does have a fix.”

But beyond mirrored cache, Morrison said other options Sun has offered – such as operating system kernel patches and swapping existing processors for those containing a different vendor’s cache memory – don’t seem to work as well.

“Sun supplied an external cache-refresh kernel patch to reduce the likelihood of this recurring, but this adds [load] to our boxes – and our systems are still crashing regularly,” echoed a user at a large European bank who also requested anonymity.

Similarly, “Sun has recommended various cooling and environmental requirements, all of which we meet,” the user said. But there have been more than 50 memory-related server crashes in the bank’s London offices during the past few months alone, he added.

After a high-level meeting with bank representatives last week, Sun requested further environmental surveys, the user said. “They are giving strict airflow and temperature requirements that exceed those quoted in their product guides,” he added.

Not everyone has expressed dissatisfaction, however. One user at an on-line travel services firm claimed that his problems were resolved with an operating system upgrade to Solaris 2.6. “The recent upgrade has eliminated all issues we had with the servers frequently coming down – whether under load or not,” he said.