From YouTube: 2021-12-14 - Ahmad Maroof Karimi - Characterizing ML I/O Workloads on Leadership Scale HPC Systems

Description

NERSC Data Seminars Series: https://github.com/NERSC/data-seminars

Title:
Characterizing Machine Learning I/O Workloads on Leadership Scale HPC Systems

Speaker:
Ahmad Maroof Karimi
A.I. Methods at Scale Group
National Center for Computational Sciences (NCCS) Division
Oak Ridge National Laboratory

Abstract:
High performance computing (HPC) is no longer solely limited to traditional workloads such as simulation and modeling. With the increase in the popularity of machine learning (ML) and deep learning (DL) technologies, we are observing that an increasing number of HPC users are incorporating ML methods into their workflow and scientific discovery processes, across a wide spectrum of science domains such as biology, earth science, and physics. This gives rise to a more diverse set of I/O patterns than the traditional checkpoint/restart-based HPC I/O behavior. The details of the I/O characteristics of such ML I/O workloads have not been studied extensively for large-scale leadership HPC systems. This paper aims to fill that gap by providing an in-depth analysis of the I/O behavior of ML I/O workloads using Darshan, an I/O characterization tool designed for lightweight tracing and profiling. We study the Darshan logs of more than 23,000 HPC ML I/O jobs over a time period of one year running on Summit, the second-fastest supercomputer in the world. This paper provides a systematic I/O characterization of ML I/O jobs running on a leadership scale supercomputer to understand how the I/O behavior differs across science domains and the scale of workloads, and analyzes the usage of the parallel file system and burst buffer by ML I/O workloads.

Bio:
Ahmad Maroof Karimi works as an HPC Operational Data Scientist in the Analytics and A.I. Methods at Scale (AAIMS) Group in the National Center for Computational Sciences (NCCS) Division, Oak Ridge National Laboratory. His current research focuses on characterizing HPC I/O patterns and identifying evolving HPC workload trends. He is also analyzing HPC facility data to characterize HPC power consumption and building machine learning based, job-aware power prediction models. Before joining ORNL, Ahmad completed his Ph.D. in Computer Science at CWRU, Cleveland, Ohio, in October 2020. His Ph.D. dissertation, titled “Data science and machine learning to predict degradation and power of photovoltaic systems: convolutional and spatiotemporal graph neural networks”, focused on classifying degradation mechanisms and predicting the performance of a photovoltaic power plant. He received his M.S. degree from the University of Toledo, Ohio, and his B.S. degree from Aligarh Muslim University, India. Ahmad has also worked in the I.T. industry as a software programmer and database designer.

Hosts of Seminar:
Hai Ah Nam, Advanced Technologies Group
Wahid Bhimji, Acting Group Leader, Data & Analytics Group
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory