From YouTube: 2021-08-10 - Brandon Cook - LDMS data at NERSC: doing something useful with 8 PB of CSV files

Description

NERSC Data Seminars Series: https://github.com/NERSC/data-seminars

Speaker:
Brandon Cook
Application Performance Group
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory

Title:
LDMS data at NERSC: doing something useful with 8 petabytes of CSV files

Abstract:
Analysis of telemetry data from NERSC systems offers the potential for a deeper quantitative understanding of the NERSC workload, providing insights for future system design, operation and optimization of the current platform, feedback to developers about workflow performance, and diagnostics to uncover issues with workflows. NERSC uses the Lightweight Distributed Metric Service (LDMS) for lightweight collection of a large variety of metrics on NERSC systems, including memory usage, CPU HW counters, power consumption, and network and I/O traffic. Across the compute nodes of Cori, a total of ~400 MB/s of data is currently being collected and stored in CSV format. With current retention policies, there are approximately 8 petabytes of data in CSV format. The size and number of the CSV files, along with the desire to integrate with other sources such as Slurm accounting, pose several challenges for anyone who wants to work with this data. In this talk, I will walk through this data set: how it is collected, what is in it, and where it is located. Then I will discuss a post-processing pipeline that transforms and filters this data, joins it with Slurm accounting information, and finally stores it 20 to 200 times more efficiently than CSV. Finally, I will discuss how the results of the pipeline are used in Iris through the Superfacility API to provide plots directly to users for all non-shared jobs on Cori in O(seconds). Throughout the talk, I will highlight how these resources can be accessed and extended.
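
Illustrative sketch (not from the talk): the filter-and-join step described in the abstract can be pictured roughly as follows. This is a minimal toy example using only the Python standard library. The column names (timestamp, node, mem_used_kb), the job record layout, and both helper functions are hypothetical stand-ins, not the actual LDMS schema, Slurm accounting format, or NERSC pipeline.

```python
import csv
import io
from collections import defaultdict

# Hypothetical LDMS-style CSV; the real schema has many more columns.
LDMS_CSV = """timestamp,node,mem_used_kb
100,nid00001,1000
110,nid00001,3000
120,nid00001,5000
110,nid00002,2000
"""

# Hypothetical Slurm accounting records: (job_id, node, start, end).
JOBS = [
    ("job_a", "nid00001", 100, 115),
    ("job_b", "nid00002", 105, 130),
]


def join_samples_with_jobs(csv_text, jobs):
    """Attach each sample to the job running on that node at that time.

    Samples that fall outside every job interval are dropped (the filter step).
    """
    jobs_by_node = defaultdict(list)
    for job_id, node, start, end in jobs:
        jobs_by_node[node].append((start, end, job_id))

    samples_per_job = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = int(row["timestamp"])
        for start, end, job_id in jobs_by_node.get(row["node"], []):
            if start <= ts <= end:
                samples_per_job[job_id].append(int(row["mem_used_kb"]))
    return dict(samples_per_job)


def summarize(samples_per_job):
    """Reduce the joined samples to a small per-job summary."""
    return {
        job_id: {
            "n": len(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for job_id, values in samples_per_job.items()
    }


result = summarize(join_samples_with_jobs(LDMS_CSV, JOBS))
```

A linear scan over job intervals is fine for a toy input; at the data volumes quoted in the abstract, a real pipeline would need a columnar storage format and a distributed or indexed join, which this sketch does not attempt to show.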

Bio:
Brandon leads the simulations area of NERSC's application readiness program (NESAP) and works on understanding and analyzing performance at the system and application level, developing future benchmark suites, analyzing future architectures, developing tools to help NERSC users/staff be more productive, engaging users through consulting, acting as NERSC liaison for several NESAP teams, and exploring future programming models. Brandon received his Ph.D. in physics from Vanderbilt University in 2012, where he studied ab initio methods for quantum transport in nanomaterials. Before joining NERSC, he was a postdoc at Oak Ridge National Laboratory, where he developed and applied electronic structure methods to problems in materials science.

Host of Seminar:
Wahid Bhimji
Acting Group Leader, Data & Analytics Group
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Laboratory