HPC lab faces enterprise-style management issues

Canada’s national laboratory for particle and nuclear physics is running into a problem enterprise IT managers know all too well: how to manage a growing number of users who all want to use a system for their own purposes.

The Tri-University Meson Facility (TRIUMF), which is run by a consortium of universities and is based in Vancouver, recently set up an IBM System Cluster 1350 supercomputer that will be used in connection with an experiment running out of the Conseil Européen pour la Recherche Nucléaire (CERN) in Geneva. The experiment, called ATLAS, will simulate the way protons collide in order to learn more about matter and prove the existence of a particle, the Higgs boson, that scientists hope will explain how the universe was formed.

Mike Vetterli, ATLAS Canada’s computing coordinator, who holds a joint position with TRIUMF and Simon Fraser University, said the system has been running simulations and models but should be ready to start analyzing actual data by next summer. That means experiments that have been done until now by four or five aggregated groups of users could expand to hundreds of thousands of users.

“Everybody’s been working with grids, but it’s been compute grids,” he said. “The big challenge in the last year has been doing grids with data. With CPUs, someone comes in, they use your CPUs and then when they’re done the next person can use them. In this case, if they store data on a disk, you can’t just delete it without their permission.”

TRIUMF will be wrestling with issues such as how much disk space it can allocate to users, how long data must be kept and when it can be disposed of, Vetterli said. These are less technical problems than management quandaries, he said.

TRIUMF connects from CERN to GridX1, which consists of eight clusters operating out of a series of research institutions, including the University of Victoria, the Centre for Subatomic Research at the University of Alberta, the WestGrid cluster at the University of British Columbia, and the Research Computing Support Group at the National Research Council in Ottawa. These facilities are pooling their computer cycles to share data and applications through a technique known as grid computing. The project has been underway for several years.

“Now that we’re getting close to taking data, a lot more people want to do some analysis,” Vetterli said.

IBM played a role from the conceptual design right through physical installation and cooling, according to Chris Pratt, manager of eServer strategic initiatives at IBM Canada in Markham, Ont.

“They’re the racing cars of the computer industry – they’re being stressed to their limits,” he said of the infrastructure involved, adding that it parallels the kind of performance expectations emerging in verticals such as digital animation or financial modelling and fraud analysis. “The amount of horsepower we can put together means that businesses are looking more and more into this.”

Pratt noted that the TRIUMF installation was based on blade servers, which are designed to allow much greater efficiency. Vetterli noted that some Canadian supercomputer projects, including WestGrid, have been using blades for years.

“They’re great for dealing with very limited space. Right now we think that with the space that we have we will be okay until 2011, but then we’ll have to find new space somewhere, maybe even a new building,” he said. “The prices (for blades) are not that different. It’s still more, but if you look at the total cost of ownership it’s not as big a difference.”

Vetterli said TRIUMF is also making some use of virtual machines, but mostly for development and testing when software is upgraded.
