|The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing / October 6, 2006|
Tailored allocation management on NCSA's Tungsten cluster proves an important part of many researchers' workflows.
Once the allocations have been made and the highest-quality projects have been given set amounts of time, there are two straightforward ways of scheduling users on a supercomputer. One is egalitarian. A queuing system applies a set of rules -- based on the amount of time a particular job is going to take, how many processors are going to be used, and the like -- and puts people in line to wait their turn. The other is totalitarian. The decks are cleared for a big user, and he or she runs on a massive number of processors, perhaps the whole machine, for a long time.
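The "egalitarian" approach described above can be sketched as a rule-based priority ordering. This is a hypothetical illustration of the idea, not NCSA's actual scheduler; the rule, job fields, and weights are all invented for the example:

```python
def priority(job):
    # Hypothetical rule: smaller, shorter jobs sort ahead of big ones,
    # with credit given for time already spent waiting in the queue.
    return job["cpus"] * job["hours"] - job["wait_hours"]

def schedule(jobs):
    # Order jobs by the rule above; the lowest score runs first.
    return [j["name"] for j in sorted(jobs, key=priority)]

jobs = [
    {"name": "big-run",   "cpus": 512, "hours": 168, "wait_hours": 0},
    {"name": "small-job", "cpus": 16,  "hours": 4,   "wait_hours": 2},
]
print(schedule(jobs))  # the small job queues ahead of the big run
```

Under a rule like this, large jobs accumulate long waits unless the policy is tuned to favor them, which is exactly the tension the article describes.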
Neither approach is ideal, and neither addresses more nuanced or immediate needs.
Take the case of the MILC collaboration, which studies quantum chromodynamics. In 2004, they received an allocation of four million CPU hours on NCSA's Tungsten cluster. By any reckoning, even for a collaboration comprising researchers at nine institutions, that's a massive allocation of time.

To use those resources sensibly and efficiently requires human decisions and policies that are well tuned to the various ways that researchers use the center's systems.
"Sitting down, going to talk to the users, and figuring out what they want. It's the only way to do this" when you have a broad variety of user needs, according to John Towns, who leads NCSA's Persistent Infrastructure Directorate. "'This doesn't work for me' is the last thing you want to hear."
A powerful machine is still important, and Tungsten is certainly that. It has a peak capability of more than 15 trillion calculations per second, making it the largest computer supported by the National Science Foundation and available for open scientific research.
A popular machine is also important, and Tungsten is that, too. In September 2005, about 162 million normalized units of computing time were allocated on Tungsten -- about 20 percent of the total parceled out by NSF across the nation. Requests for time on Tungsten were almost double that amount, far exceeding what was available. That made Tungsten the most requested and most allocated system in NSF's arsenal that month.
"If you allocate this large and popular a resource in the traditional ways, somebody always suffers. People with large runs wait a long time in the queues or don't get to run at all because the queuing system is set up to handle a large number of smaller jobs. Or the smaller jobs get brushed aside in order to dedicate the machine to large jobs. It's a tough balance to strike," Towns says.
"Tungsten is a resource that satisfies specific needs of the user community. It's a critical part of their research workflow," NCSA Director Thom Dunning says. "That means we have to tailor allocations to suit them. We planned for this sort of approach when we installed Tungsten, and the popularity and productivity among users really showed us that it was the right way to go."
Tailored allocation management means that specific pieces of the machine are dedicated to particular users for given periods of time. These periods are planned in advance so that users know when they're going to get access, how long they're going to have it, and how much computing oomph they're going to have available. These dedicated runs give users the capability they need to complete crucial computations that must be done in a specific timeframe, that require an unusually large number of processors, or that otherwise give the queuing system fits. The dedicated runs still leave room for a traditional capacity-oriented system, with smaller jobs passing through the queue and running unimpeded.
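The mechanism described above, with dedicated windows planned in advance and the remainder left to the regular queue, can be sketched roughly as follows. The reservation table, node counts, and dates here are invented for illustration and do not reflect Tungsten's real configuration:

```python
from datetime import datetime

# Hypothetical reservation table: (user, start, end, nodes) windows
# planned in advance, as the article describes.
reservations = [
    ("milc", datetime(2006, 10, 9), datetime(2006, 10, 16), 512),
]

TOTAL_NODES = 1280  # illustrative cluster size, not Tungsten's actual count

def nodes_free(now):
    # Capacity left for the regular queue while dedicated runs proceed.
    reserved = sum(n for _, s, e, n in reservations if s <= now < e)
    return TOTAL_NODES - reserved

print(nodes_free(datetime(2006, 10, 10)))  # capacity during the dedicated run
print(nodes_free(datetime(2006, 11, 1)))   # full capacity once it ends
```

Because the windows are fixed in advance, both sides get predictability: the dedicated user knows exactly when and how much capacity is theirs, and the queue always knows what remains.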
About 40 percent of Tungsten is currently dedicated to tailored allocations.
"Nobody else is doing this this way. But it's the best way to provide access that balances individual productivity and servicing a large number of users. We want to be responsive to individual requests while still ensuring success for a broad range of people and disciplines," Towns says. "When we strike that balance, our users do special things."
What sort of users take advantage of tailored allocation management? And why? Here are some examples.
David Baker, University of Washington
David Baker and his team are in the business of refining protein structures. These structures are traditionally derived using limited experimental data or by starting from first principles and simulating the structure from scratch. This group's technique combines the two to produce much more accurate models.
A tailored account on NCSA's Tungsten cluster gave the team more than just the power they needed.
"We'd never computed on this system when we got our special allocation," he says, and they were still in search of the precise approach that they would use for their structure refinements. A tailored allocation is "very good for methods development. Having dedicated time over days allows you to make such rapid progress. You try different things quickly and get daily feedback. That's really, really helpful as you're trying to get on your feet."
Currently, the team has an in-house server system dedicated to conducting these sorts of protein structure refinements. It serves an entire community of researchers and is overtaxed. NCSA is configuring a portion of its Radium cluster to provide additional capacity to those researchers. It will expand their back-end capacity without any front-end change; researchers will continue to interact with the server as they always have.
The MILC collaboration
Members of the MILC collaboration are drawn to the strongest force in nature -- the force that binds together quarks into the protons and neutrons that form the nuclei of atoms. Their quantum chromodynamics calculations proceed in two steps. Ground state configurations are calculated through Monte Carlo simulations, then the group, along with many other physicists carrying out numerical studies of quantum chromodynamics, uses those configurations to simulate and explore a wide variety of other physical attributes of the subatomic world.
The bottleneck is the Monte Carlo calculations. "Each ground state configuration is generated from the preceding one, so we cannot run jobs in parallel or start one job before the previous one ends," explains the University of California at Santa Barbara's Robert Sugar, a member of the collaboration.
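Sugar's point is the defining property of a Markov chain: each state is a function of the one before it, so the chain can only advance one step at a time. A toy sketch of that dependency (the update rule here is a stand-in, not MILC's actual lattice QCD algorithm):

```python
import random

def next_configuration(prev, seed):
    # Stand-in for a Monte Carlo update step: each configuration is
    # derived from its predecessor, so step i+1 cannot begin until
    # step i has finished -- there is no parallelism across steps.
    rng = random.Random(seed)
    return [x + rng.gauss(0, 1) for x in prev]

def generate_chain(initial, steps):
    configs = [initial]
    for i in range(steps):
        configs.append(next_configuration(configs[-1], seed=i))
    return configs

# Three sequential updates of a 4-element toy "configuration".
chain = generate_chain([0.0] * 4, steps=3)
```

A user with one such chain can keep only a single job in the queue at a time, which is why a dedicated block of time helps far more than ordinary queue priority would.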
"As a result, we are in a poor position to compete for time with many of the users of normal queues who can have several jobs in the queues at once," Sugar says. Without a tailored allocation on Tungsten, there would be a ripple effect throughout the field. "The Department of Energy and the National Science Foundation spend approximately $750 million per year on their experimental programs in high-energy physics. A significant fraction of that is devoted to the study of weak decays of strongly interacting particles, a primary focus of our research."
"Our calculations are needed in order to fully capitalize on the investments being made in the experiments, and our results are needed in a timely fashion in order to keep pace with experiments," he says.
Joel Tohline, Louisiana State University
Every time a team of astrophysicists from Louisiana State University makes a run on Tungsten, a star is born -- a pair of them, in fact. Tidal interactions between these stars can cause material to transfer from one to the other and distort the stars' gravitational fields, densities, sizes, and distance from one another. In some cases, the stars even merge in a spectacular and violent cosmic event. Work by this team is altering scientists' thinking about the mass ratio at which binary stars return to stability instead of coming to a catastrophic end.
When they asked for one of three tailored allocations on Tungsten, they had just received a referee report on a submission to The Astrophysical Journal. The report suggested that the conclusions drawn in the paper would be significantly strengthened if the team could repeat one of its extended simulations using slightly different initial parameters. "We knew from experience that, running on 128 processors without interruptions -- which never happens -- this simulation would require about a month to complete," says Joel Tohline, the Louisiana State professor who led the team.
A week-long, 512-processor run on Tungsten was set up in short order, and the publication went to print. There are broader implications, though, to working closely with users to supply the sort of time and support they need.
"In any given year -- or decade, for that matter -- the most interesting problems in computational astrophysics -- substitute physics, biology, etc., as you like -- are often those that push the limits of available resources. To make meaningful progress…we design our simulations each year to take advantage of available computational resources at the national centers such as NCSA," Tohline says.
"Our peers and funding agencies expect to see measurable progress on challenging, timely, and relevant problems. If we invest our time performing a simulation that can be completed in a week's time on 32 processors, it is not likely to be addressing one of the most challenging problems that presently confront us. NCSA's commitment to dedicate major resources when they're needed to a single problem is in sync with this overall philosophy. It has contributed demonstrably to my group's ability to make significant contributions to our field."