From YouTube: 2022-08-09 - Gene Cooperman - Transparent Checkpointing: a mature technology enabling MANA for MPI

Description

NERSC Data Seminars Series: https://github.com/NERSC/data-seminars

Title:
Transparent Checkpointing: a mature technology enabling MANA for MPI and beyond

Speaker:
Gene Cooperman, Khoury College of Computer Sciences, Northeastern University

Abstract:
Although transparent checkpointing grew up in the 1990s and 2000s as a technology for HPC, it has now grown into a tool that is useful for many newer domains. Today, this is no longer your grandfather's checkpointing software! In this talk, I will review some of the newer checkpointing technologies invented only in the last decade, and how they enable new capabilities that can be adapted to a variety of domains.
This talk includes a tour of the 15-year-old DMTCP project, with special emphasis on the latest achievement: MANA for MPI, a robust package for transparent checkpointing of MPI. But as a prerequisite, one must have an understanding of two advances that brought DMTCP to its present state: (i) a general framework for extensible checkpointing plugins; and (ii) split processes (which isolate the software application to be checkpointed from the underlying hardware).
In the remainder of the talk, these two principles are first showcased in MANA. This is followed by a selection of other domains where transparent checkpointing shows interesting potential. These include: deep learning (especially for general frameworks), edge computing, lambda functions (serverless computing), spot instances, containers for parallel and distributed computing (Apptainer and Singularity), process migration (migrate the process to the data in joint work with JPL), deep debugging for parallel and distributed computations, a model for checkpointing in Hadoop, and more.

Bio:
Professor Cooperman currently works in high-performance computing. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria in France. In 2014, he and his student, Xin Dong, used a novel idea to semi-automatically add multi-threading support to the million-line Geant4 code coordinated out of CERN. He is one of the more than 100 co-authors of the foundational Geant4 paper, whose current citation count is 34,000. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 150 refereed publications cite DMTCP as having contributed to their research project.

Host of Seminar:
Zhengji Zhao, User Engagement Group
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory