Large-scale I/O Models for Traditional and Emerging HPC Workloads on Next-Generation HPC Storage Systems
Time: Fri 2022-04-29 15.00
Location: F3, Lindstedtsvägen 26 & 28, Stockholm
Subject area: Computer Science
Doctoral student: Wei Der Chien, Beräkningsvetenskap och beräkningsteknik (CST)
Opponent: Philip Carns, Argonne National Laboratory
Supervisor: Stefano Markidis, SeRC - Swedish e-Science Research Centre, Beräkningsvetenskap och beräkningsteknik (CST); Erwin Laure, SeRC - Swedish e-Science Research Centre, Parallelldatorcentrum, PDC, Beräkningsvetenskap och beräkningsteknik (CST); Artur Podobas, Beräkningsvetenskap och beräkningsteknik (CST)
The ability to create value from large-scale data is now an essential part of research and drives technological development everywhere, from everyday technology to life-saving medical applications. In almost all scientific fields that require handling large-scale data, such as weather forecasting, physics simulation, and computational biology, supercomputers (HPC systems) have emerged as an essential tool for implementing and solving problems. While the computational speed of supercomputers has grown rapidly, the methods for handling large-scale data I/O (reading and writing data) at high speed have not evolved as much. POSIX-based Parallel File Systems (PFS) and programming interfaces such as MPI-IO remain the norm for I/O workflows in HPC. At the same time, new applications, such as big data and Machine Learning (ML), have emerged as a new class of widely deployed HPC applications. While all these applications require the ingestion and output of large amounts of data, they have very different usage patterns, resulting in different sets of requirements. In addition, new I/O technologies for HPC, such as fast burst buffers and object stores, are increasingly available. Methods to fully exploit them in HPC applications are currently lacking.
In this thesis, we evaluate modern storage infrastructures and the I/O programming model landscape, and characterize how HPC applications can take advantage of these I/O models to tackle bottlenecks. In particular, we look into object storage, a promising technology that has the potential to replace existing I/O subsystems for large-scale data storage. Firstly, we mimic object storage semantics and create an emulator on top of existing parallel file systems to project the performance improvement that can be expected on a real object store for HPC applications. Secondly, we develop a programming model that supports numerical data storage for scientific applications. The set of interfaces captures the needs of parallel applications that use domain decomposition. Finally, we evaluate how the interfaces can be used by scientific applications. More specifically, we show, for the first time, how our programming interface can be used to leverage Seagate's Motr object store. We also showcase how this approach can enable the use of modern node-local hierarchical storage architectures.
Aside from advances in I/O infrastructure, the wide deployment of modern ML workloads introduces unique challenges to HPC and its I/O systems. We first study these challenges by focusing on a state-of-the-art Deep Learning (DL) framework called TensorFlow, which is widely used in cloud platforms. We evaluate how data ingestion in TensorFlow differs from that of traditional HPC applications. While TensorFlow focuses on DL applications, there are alternative learning methods that pose different sets of challenges. To complement our understanding, we also propose a framework called StreamBrain, which implements a brain-like learning algorithm called the Bayesian Confidence Propagation Neural Network (BCPNN). We find that these alternative methods can potentially pose an even bigger challenge than conventional learning methods (such as those present in TensorFlow). To explain the I/O behavior of DL training, we perform a series of measurements and profiling on TensorFlow using monitoring tools. However, we find that existing methods are insufficient to derive fine-grained I/O characteristics of these modern frameworks due to a lack of application-level coupling. To tackle this challenge, we propose a system called tf-Darshan that combines traditional HPC I/O monitoring with ML workload profiling to enable fine-grained I/O performance evaluation. Our findings show that the lack of co-design between modern frameworks and the HPC I/O subsystem leads to inefficient I/O (e.g., very small and random reads). These frameworks also fail to coordinate I/O requests efficiently in a parallel environment. With tf-Darshan, we showcase how knowledge derived from such measurements can be used to explain and improve I/O performance. Examples include selective data staging to fast storage and future auto-tuning of I/O parameters.
The methods proposed in this thesis are evaluated on a variety of HPC systems, workstations, and prototype systems with different I/O and compute architectures. Different HPC applications are used to validate the approaches. The experiments show that our approaches can enable a good characterization of I/O performance, and our proposed programming model illustrates how applications can use next-generation storage systems.