IBM supercomputer puts end to underground nuclear tests

If all goes as planned, the U.S. Department of Energy (DOE) may no longer have to detonate nuclear weapons in underground tests, thanks to a new IBM supercomputer.

Until now, the only way to check whether the U.S. stock of nuclear weapons was in working order was to actually detonate one underground. But researchers hope to replace the tests with 3D simulations, which will be made possible by the US$110 million supercomputer dubbed the ASCI White.

“The speed of this is equivalent to every man, woman and child on the face of the earth adding 2,000 numbers in one second. And the memory on it is equivalent to 300 million books,” said David Cooper, the associate director of scientific computing and CIO at the DOE’s Lawrence Livermore National Laboratory in Livermore, Calif.

The supercomputer can run at 12.3 teraflops (trillions of floating point operations per second), is the size of two basketball courts and consists of 512 nodes linked by a new high-speed switch. The department will need 28 moving vans in total to ship the computer. About a quarter of it has already been installed.
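Cooper's comparison can be checked against the quoted speed with rough arithmetic. The sketch below assumes a world population of about 6 billion, a round figure that is not given in the article, and counts each addition as one operation.

```python
# Rough check of the "every man, woman and child adding 2,000 numbers" comparison.
# ASSUMPTION: a world population of roughly 6 billion; the article gives no figure.
WORLD_POPULATION = 6_000_000_000
ADDITIONS_PER_PERSON_PER_SECOND = 2_000
QUOTED_TERAFLOPS = 12.3  # speed quoted in the article

ops_per_second = WORLD_POPULATION * ADDITIONS_PER_PERSON_PER_SECOND
teraflops_equivalent = ops_per_second / 1e12  # one teraflop = 10**12 operations per second

print(f"Comparison implies about {teraflops_equivalent:.1f} trillion operations per second")
print(f"Quoted speed: {QUOTED_TERAFLOPS} teraflops")
```

Under that assumption the two figures land in the same range, which suggests the comparison is simply the teraflop rating restated in human terms.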

The DOE hopes to have both the computer installed and the 3D-simulation software written by this fall.

In the meantime, IBM has also released a commercial version of the ASCI White, a new RS/6000 SP.

“The primary applications are expected to be in the technical area (for the commercial supercomputer) – scientific, engineering. However, they do think there will be a demand for business intelligence and Web servers which require a lot of computations,” said Morton Ginsburg, a research associate at D.H. Brown Associates in Port Chester, N.Y.

“Actually a good percentage of installed SPs are used for business intelligence. About 25 per cent, surprisingly,” Ginsburg said.

The RS/6000 SP supercomputer is designed primarily to perform floating point calculations, and most business intelligence calculations are integer calculations, Ginsburg said.

Floating point calculations, used primarily in scientific and technical research, are operations on decimal numbers that can range up into the billions and trillions or down to billionths and trillionths. There is no fixed number of digits before or after the decimal point – the decimal point can float.
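A short sketch may make the idea of a floating point concrete. It uses Python's standard `math.frexp` to split a number into a fraction and an exponent; the sample values and the three-decimal fixed format are illustrative assumptions, not anything described by IBM or Ginsburg.

```python
import math

# Sample values spanning trillions down to trillionths (illustrative only).
values = [3.0e12, 1.5, 2.5e-9, 4.0e-12]

for v in values:
    # frexp returns a fraction m (0.5 <= |m| < 1) and an integer exponent e
    # such that v == m * 2**e. Changing e is what moves the point around.
    m, e = math.frexp(v)
    print(f"{v:.3e} = {m:.6f} * 2**{e}")

# For contrast, a fixed format with a set number of decimal places
# (three here, a hypothetical choice) loses the smallest values entirely.
FIXED_DECIMALS = 3
fixed = [round(v, FIXED_DECIMALS) for v in values]
print(fixed)
```

The same stored fraction can represent very large or very small quantities depending on its exponent, whereas the fixed format rounds the billionths and trillionths down to zero.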

But most enterprise calculations are integer calculations.

“If you get, for instance, an Intel micro with a high enough frequency, with a high enough megahertz, it will have very good integer performance. And so it can be competitive in the commercial area,” Ginsburg said.

But the parallel nature of the SP’s architecture gives it an advantage over other computers when performing business intelligence calculations, said Mike Kerr, the vice-president of products for Web servers at IBM.

“When you want to do complex queries against a very large database, it turns out that that work parallelizes very simply. In other words, you take the complex query that comes in, and you split it up into chunks, or subqueries, that get farmed out to multiple nodes to work on in parallel, and then they come back and get assembled back together in the total query and get responded to. It’s a design that gives you very fast response times to large complex queries.”
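The split-and-reassemble pattern Kerr describes can be sketched in a few lines. This is a toy illustration, not IBM's software: the data, the function names and the use of threads to stand in for SP nodes are all assumptions, and Python threads will not by themselves speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales records, already partitioned across four "nodes".
PARTITIONS = [
    [("east", 120), ("west", 80), ("east", 45)],
    [("west", 60), ("east", 30)],
    [("east", 10), ("east", 5), ("west", 200)],
    [("west", 7)],
]


def subquery(partition, region):
    """Work done by one node: total sales for `region` in its own partition."""
    return sum(amount for r, amount in partition if r == region)


def parallel_query(partitions, region):
    """Farm the subqueries out concurrently, then assemble the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: subquery(p, region), partitions))
    return sum(partials)


total_east = parallel_query(PARTITIONS, "east")
print("Total sales, east:", total_east)
```

The approach works because each partition can be scanned without reference to the others, so only the small partial totals need to be brought back together at the end.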

Back to basics

The new RS/6000 and ASCI White rely on the same basic technology as IBM’s Blue Gene, an experimental supercomputer which the company hopes will be able to perform one quadrillion operations per second (one petaflop).

To build the computer, IBM researchers decided they needed to forget everything they already knew about computers, said Monty Denneau, the lead hardware architect at IBM’s TJ Watson Centre in Yorktown Heights, N.Y.

“You simply couldn’t build this machine by using conventional processors. It couldn’t be done,” he said. “Part of the reasons you can’t – and I’m not disparaging our products and everyone else’s – is we make our chips very large and very complicated.

“So what we did for this machine is to go back to square one completely, and say, ‘Look, we have no history at all.’”

One of the problems with conventional processors is that they have too many instructions, which slows them down.

Denneau’s team started with a thread unit as their basic unit of computation, which Denneau describes as a computer so simple that a high school student could make it.

Eight thread units share a floating point unit between them, as well as storage.

“By having eight tiny little processors all working at the same time, we keep the floating point busy, and we keep our storage system busy,” he said.

The researchers also decided to do something no one else would – they deliberately slowed down the computer.

“Everyone else tries to get as many instructions per cycle as they can. The price that you pay for that is staggering complexity,” Denneau said.

In order to do away with latency problems during loading, the researchers embedded DRAM into the processor.

To connect the processors, IBM built a system of rings running at 500 MHz onto each chip, instead of using the traditional PC bus. Thirty-six chips are put together on a board, and the boards are stacked on top of one another.

The computer will require over 2 million watts of power, making traditional air-cooling methods infeasible. “It’s like a thousand ovens all running at the same time. You cannot air cool a machine like this. It would be just like a tornado running through the room,” Denneau said.

So IBM decided to water-cool the computer instead. Water drains through pipes in the machine by gravity.

The researchers also encountered another problem – cosmic rays, which are constantly bombarding Earth.

“Our machines have gotten so small that a cosmic ray will actually take a charge out of logic itself. And there’s no way to correct it,” Denneau said.

So IBM decided to replicate processors and have them work in pairs. Two chips perform the same calculation, and if they come up with different answers, it’s an indication that something has gone wrong and the job is rolled back 15 minutes.
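The pairing-and-rollback idea can be illustrated with a small software sketch. It is not IBM's design: the function names are hypothetical, a checkpoint every few steps stands in for the 15-minute interval, and a deliberately faulty function stands in for a cosmic-ray hit.

```python
import copy


def run_with_redundancy(state, steps, step_a, step_b, checkpoint_every=3):
    """Run each step on two replicas; on a mismatch, resume from the last checkpoint."""
    checkpoint = (0, copy.deepcopy(state))
    rollbacks = 0
    i = 0
    while i < steps:
        result_a = step_a(copy.deepcopy(state), i)
        result_b = step_b(copy.deepcopy(state), i)
        if result_a != result_b:
            # The replicas disagree, so one of them is wrong: discard recent work.
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            rollbacks += 1
            continue
        state = result_a
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


def good_step(total, i):
    """The intended calculation: add the step index to a running total."""
    return total + i


def make_flaky_step(fail_on_call):
    """Build a replica that returns one wrong answer, on its Nth call only."""
    calls = {"n": 0}

    def flaky(total, i):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            return total + i + 1  # simulated one-off corrupted result
        return total + i

    return flaky


final, rollbacks = run_with_redundancy(0, 10, good_step, make_flaky_step(5))
print("Final total:", final, "| rollbacks:", rollbacks)
```

Comparing two replicas detects an error without identifying which replica was wrong, which is why the sketch repeats the work from a known-good checkpoint rather than trying to repair the result.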

“So this machine is very resilient. Literally, you can walk to the middle of it and start hammering on things, cutting up wires, just really damaging it, and it’ll probably just keep on running. It’s the most amazing thing.”