Sun admits to memory problem

Problems with a memory component that Sun Microsystems Inc. has been quietly trying to fix for the past several months are continuing to plague some large users of Sun’s Ultra Enterprise Unix servers. And Sun has gone to extraordinary lengths to keep its customers quiet about the issue.

The problem involves an external memory cache on Sun’s UltraSPARC II microprocessor module. Under certain conditions, it has been triggering system failures and frequent server reboots at dozens of customer locations.

Sun executive vice-president John Shoemaker has acknowledged that the company has been grappling with memory-related problems on “a few dozen” of its Ultra Enterprise servers for nearly a year.

Sun customers who have been affected by the problem are unwilling to speak openly about it because Sun has persuaded many of them to sign non-disclosure agreements, said Tom Henkel, an analyst at Gartner Group Inc. in Stamford, Conn.

The non-disclosure agreements were apparently offered with a claim that signing them would bolster Sun’s commitment to resolving the problem quickly, Henkel said. Sun customers began reporting the problem as long as 18 months ago, he said.

Shoemaker acknowledged that it may have been a bad idea for Sun to get its users to sign non-disclosure agreements. But he said the company took that measure only because Sun itself was struggling to pinpoint a reason for the system failures. He added that Sun has stopped requiring such agreements.

The long-standing nature of the problem and Sun’s handling of the issue raise troubling questions about the quality of Sun’s hardware and support, Henkel said.

One high-profile customer that has had very public problems with Sun hardware is eBay Inc. The on-line auctioneer has suffered a series of hardware-related outages over the past year, including one recently. It is unclear whether eBay’s problems are related to the memory issue, however.

Sun insisted that the problem hasn’t caused any data loss for customers. But the frequency of reboots disrupts availability and can cause data loss if applications don’t restart properly, users said.

In the past year, Henkel said, he has talked with at least 50 Sun customers who complained of hardware reliability issues caused by defective memory. The affected systems appear to be those based on 400MHz UltraSPARC II CPU modules with either a 4MB or an 8MB cache.

“There are a lot of very unhappy campers out there,” Henkel said. “Sun has been experimenting for too long now to find a solution to this problem.”

Meta Group Inc. in Stamford, Conn., also has clients that have experienced the problem. “There was a rash of reliability issues relating to this problem in the March-to-April time frame,” said Meta Group analyst Brian Richardson, though there have been none since then. Eight of Meta’s 20 large Sun accounts reported the problem, Richardson said.

According to Shoemaker, the issue has triggered a massive overhaul of Sun’s quality processes and has already directly resulted in about eight major hardware and software changes being incorporated into Sun’s Ultra Enterprise server line.

Sun has also put in place far more rigorous quality and availability testing of its products and is mandating more stringent audits of customer sites, environmental conditions and planned configurations before taking orders on its high-end servers, Shoemaker said.

By year’s end, Sun will release a mirrored memory module that should address this issue once and for all, Shoemaker added. In the past several months, Sun has also been in direct contact with the CIOs at several of the affected companies to explain Sun’s new quality initiative, he said.

“This has been a watershed event for Sun,” Shoemaker said, adding that the company has moved from the back of the class to class leader with respect to quality.

But according to an MIS manager in North Carolina who has experienced the memory problem and who spoke on condition of anonymity, Sun has offered no explanation for the problems. “Sun has not disclosed any information to me about their memory issues – not even a brief description,” the manager said.

In the past three months, all six of the manager’s Sun servers have crashed because of memory-related problems, he said. In each instance, Sun swapped out entire CPU modules but offered no explanation for doing so, he said.

A user at a Midwestern manufacturing company, who also spoke on condition of anonymity, had a similar experience.

“As soon as we reported the issue to Sun, the affected processors were replaced under service contract,” he said. The company was able to resolve the problem by rearranging “our data centre with the express purpose of lowering system temperatures,” he said. “The systems run 10 to 15 degrees Fahrenheit cooler than before, and we haven’t seen a problem since.”

According to Shoemaker, Sun hasn’t been able to narrow the problem to any one specific cause. Sun believes the problems may have been caused by a combination of factors, including defective components from one of Sun’s suppliers, poor packaging of the memory chips on the system boards and environmental factors.