SPRUCE: A system for supporting event-driven and urgent high-performance computing

Suman Nadella

Argonne National Laboratory


Abstract


High-performance modeling and simulation play a driving role in decision making and prediction. For time-critical emergency support applications such as severe weather prediction, flood modeling, and influenza modeling, late results can be useless. A specialized infrastructure is needed to provide computing resources quickly, automatically, and reliably. SPRUCE is a system that supports urgent or event-driven computing on both traditional supercomputers and distributed Grids. Scientists are provided with transferable "right-of-way" tokens with varying urgency levels. During an emergency, a token must be activated at the SPRUCE portal; jobs can then request urgent access. Local policies dictate the response, which may include providing "next-to-run" status or immediately preempting other jobs. Additional components under development include a mechanism that periodically tests applications in "warm-standby" mode to ensure readiness, and an automated "advisor" that helps find the best resource to submit to, based on deadline, queue status, site policy, and warm-standby history.
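
The token workflow described above can be illustrated with a minimal sketch. It is not SPRUCE's actual interface: the names Urgency, Token, respond, SITE_A_POLICY, and SITE_B_POLICY are hypothetical, the three urgency labels are assumed (the abstract says only that urgency levels vary), and the "normal-queue" fallback for an inactive token is likewise an assumption. The sketch shows only that a token must be activated before a job receives urgent treatment, and that the same token can draw different responses at sites with different local policies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Urgency(IntEnum):
    # Hypothetical labels; the abstract does not name the levels.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Token:
    # Hypothetical stand-in for a transferable "right-of-way" token.
    token_id: str
    urgency: Urgency
    active: bool = False

    def activate(self) -> None:
        # Stands in for activating the token at the portal.
        self.active = True


# Hypothetical local policies: each site maps an urgency level to a response.
SITE_A_POLICY = {
    Urgency.LOW: "next-to-run",
    Urgency.MEDIUM: "next-to-run",
    Urgency.HIGH: "next-to-run",
}
SITE_B_POLICY = {
    Urgency.LOW: "next-to-run",
    Urgency.MEDIUM: "next-to-run",
    Urgency.HIGH: "preempt",
}


def respond(token: Token, policy: dict) -> str:
    """Return the site's response to a job that presents this token."""
    if not token.active:
        # Assumed behavior: an inactive token grants no urgent access.
        return "normal-queue"
    return policy[token.urgency]


if __name__ == "__main__":
    token = Token("demo-token", Urgency.HIGH)
    print(respond(token, SITE_B_POLICY))  # before activation
    token.activate()
    print(respond(token, SITE_A_POLICY))  # same token, site A
    print(respond(token, SITE_B_POLICY))  # same token, site B
```

Keeping the policy as a per-site table mirrors the statement that local policies, not the token itself, dictate the response.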