Performance Monitoring, Analysis, and Real-Time Introspection on Large-Scale Parallel Systems
Time: Thu 2020-01-09 10.00
Subject area: Computer Science
Doctoral student: Xavier Aguilar, Beräkningsvetenskap och beräkningsteknik (CST)
Opponent: Professor Jesus Labarta, Barcelona Supercomputing Center
Supervisor: Laure Erwin Professor, Skolan för elektroteknik och datavetenskap (EECS); Fürlinger Karl Doctor, Ludwig-Maximilians-Universität München; Lagergren Jens Professor, Skolan för elektroteknik och datavetenskap (EECS)
High-Performance Computing (HPC) has become an important scientific driver. A wide variety of research, ranging for example from drug design to climate modelling, is nowadays performed on HPC systems. Furthermore, the tremendous computing power of such systems allows scientists to simulate problems that were unimaginable a few years ago. However, the continuous increase in the size and complexity of HPC systems is turning the development of efficient parallel software into a difficult task. Therefore, performance monitoring and analysis are a must in order to unveil inefficiencies in parallel software. Nevertheless, performance tools also face challenges as a result of the size of HPC systems, for example, coping with the huge amounts of performance data generated.
In this thesis, we propose a new model for performance characterisation of MPI applications that tackles the challenge of big performance data sets. Our approach uses Event Flow Graphs to balance the scalability of profiling techniques (generating performance reports with aggregated metrics) with the richness of information of tracing methods (generating files with sequences of time-stamped events). In other words, graphs make it possible to encode ordered sequences of events without storing the whole sequence of such events; they therefore need much less memory and disk space than full traces, and are more scalable. We demonstrate in this thesis how our Event Flow Graph model can be used as a trace compression method. Furthermore, we propose a method to automatically detect the structure of MPI applications using our Event Flow Graphs. This knowledge can afterwards be used to collect performance data in a smarter way, for example by reducing the amount of redundant data collected. Finally, we demonstrate that our graphs can be used beyond trace compression and automatic analysis of performance data: we propose a new methodology for using Event Flow Graphs in the task of visual performance data exploration.
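The idea of replacing a full event sequence with a graph can be illustrated with a minimal sketch. The Python below is not the thesis's implementation: the function name `build_event_flow_graph`, the `START`/`END` marker nodes, and the synthetic trace are assumptions made purely for illustration. It keeps one node per distinct event (with aggregated statistics, as a profile would) and one edge per observed transition (with a count), so a loop of thousands of send/receive pairs collapses into a handful of nodes and edges. A real Event Flow Graph would need additional edge information to recover the exact event order in the general case.

```python
from collections import defaultdict


def build_event_flow_graph(events):
    """Build a simplified transition graph from one process's event sequence.

    `events` is an iterable of (event_name, duration) pairs (hypothetical format).
    Returns (nodes, edges): per-event aggregated stats and transition counts.
    """
    nodes = defaultdict(lambda: {"count": 0, "total_time": 0.0})
    edges = defaultdict(int)
    prev = "START"  # artificial entry node (assumption of this sketch)
    for name, duration in events:
        nodes[name]["count"] += 1
        nodes[name]["total_time"] += duration
        edges[(prev, name)] += 1  # record the transition prev -> name
        prev = name
    edges[(prev, "END")] += 1  # artificial exit node (assumption of this sketch)
    return dict(nodes), dict(edges)


# Synthetic trace: initialisation, 1000 send/receive iterations, finalisation.
trace = (
    [("MPI_Init", 0.5)]
    + [("MPI_Send", 0.1), ("MPI_Recv", 0.2)] * 1000
    + [("MPI_Finalize", 0.3)]
)
nodes, edges = build_event_flow_graph(trace)
print(len(trace), "events ->", len(nodes), "nodes,", len(edges), "edges")
```

In this toy case the graph size depends on the number of distinct events and transitions rather than on the number of loop iterations, which is the property that makes a graph representation attractive as a compact alternative to a full trace.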
In addition to the Event Flow Graph model, we also explore in this thesis the design and use of performance data introspection frameworks. Future HPC systems will be very dynamic environments providing extreme levels of parallelism, but with energy constraints, considerable resource sharing, and heterogeneous hardware. Thus, the use of real-time performance data to orchestrate program execution in such a complex and dynamic environment will be a necessity. This thesis presents two different performance data introspection frameworks that we have implemented. These introspection frameworks are easy to use, and provide performance data in real time with very low overhead. We demonstrate, among other things, how our approach can be used to reduce the energy consumed by the system in real time.
The approaches proposed in this thesis have been validated on different HPC systems using multiple scientific kernels as well as real scientific applications. The experiments show that our approaches to performance characterisation and performance data introspection are not intrusive at all, and can be a valuable contribution to the performance monitoring of future HPC systems.