From YouTube: Raythena: A Massively Parallel Data Processing Framework for ATLAS Geant4 Simulation, M. Muskinja

Description

Raythena uses Ray, a high-performance distributed execution framework, to distribute the computationally intensive ATLAS Geant4 simulation workflow across a few hundred HPC nodes. Geant4 simulation is the most computationally expensive step of the ATLAS Monte Carlo simulation chain and accounts for about 50% of the ATLAS computing budget. Conventionally, it runs on ‘grid’ sites, and each simulation campaign takes a few months to simulate the desired number of proton-proton collision events. Raythena is a solution for running ATLAS Geant4 simulation efficiently on HPCs, and it could significantly reduce the duration of future simulation campaigns. Its goal is to process as many events as possible, as fast as possible, within a given CPU-hour allocation on an HPC. An effective mode of operation at NERSC’s Cori was found to be running jobs of 100 to 200 Cori KNL nodes in the flex queue. Raythena is a central application that orchestrates workload management across all nodes using the Ray API. On Cori KNL nodes, 132 Geant4 processes were spawned on each compute node, amounting to more than 25,000 Geant4 processes running in parallel. Raythena also handles communication with the central ATLAS PanDA database, from which it retrieves the input events and feeds them to the Geant4 processes. The framework was found to scale very well up to 100 to 200 nodes on Cori KNL, with virtually no delay between consecutively processed events.
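
The dispatch pattern described above, a central driver feeding input events to many worker processes that ask for more as soon as they finish, can be illustrated with a minimal sketch. This is not Raythena's code and does not use the Ray API: Python standard-library threads stand in for the distributed workers, and every name here (`total_processes`, `make_event_ranges`, `worker`, `run`) is hypothetical. Only the figure of 132 processes per node comes from the description.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

PROCESSES_PER_NODE = 132  # figure quoted in the description for Cori KNL


def total_processes(nodes: int) -> int:
    """Number of simulation processes running in parallel on `nodes` nodes."""
    return nodes * PROCESSES_PER_NODE


def make_event_ranges(n_events: int, range_size: int) -> list[tuple[int, int]]:
    """Split [0, n_events) into half-open (start, stop) ranges."""
    return [(start, min(start + range_size, n_events))
            for start in range(0, n_events, range_size)]


def worker(worker_id: int, pending: Queue) -> list[int]:
    """Hypothetical stand-in for one simulation process.

    Pulls event ranges from the shared queue until none remain, so a
    worker never idles while work is still available.
    """
    processed = []
    while True:
        try:
            start, stop = pending.get_nowait()
        except Empty:
            return processed
        # Placeholder for the actual simulation of events start..stop-1.
        processed.extend(range(start, stop))


def run(n_events: int, range_size: int, n_workers: int) -> list[int]:
    """Central driver: fill the queue, start the workers, collect results."""
    pending: Queue = Queue()
    for event_range in make_event_ranges(n_events, range_size):
        pending.put(event_range)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, i, pending) for i in range(n_workers)]
        results = [future.result() for future in futures]
    return sorted(event for chunk in results for event in chunk)
```

For the process count, `total_processes(200)` gives 26,400, which is consistent with the "more than 25,000" figure at the upper end of the 100 to 200 node range. The pull-based queue is one simple way to keep the gap between consecutive events small; how Raythena itself schedules event ranges is not specified in the description.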