
Performance Monitoring, Analysis, and Real-Time Introspection on Large-Scale Parallel Systems

Time: Thu 2019-12-12, 11.00 - 12.00

Location: Room 4423, Lindstedtsvägen 5, KTH, Stockholm

Participating: Xavi Aguilar, CST




High-Performance Computing (HPC) has become an important scientific driver. A wide variety of research, ranging for example from drug design to climate modelling, is nowadays performed on HPC systems. Furthermore, the tremendous computing power of such systems allows scientists to simulate problems that were unimaginable a few years ago. However, the continuous increase in the size and complexity of HPC systems is making the development of efficient parallel software a difficult task. Performance monitoring and analysis are therefore essential to unveil inefficiencies in parallel software. Nevertheless, performance tools also face challenges as a result of the size of HPC systems, for example coping with the huge amounts of performance data generated.

We propose a new model for performance characterisation of MPI applications that tackles the challenge of big performance data sets. Our approach uses Event Flow Graphs to balance the scalability of profiling techniques (generating performance reports with aggregated metrics) with the richness of information of tracing methods (generating files with sequences of time-stamped events). In other words, graphs make it possible to encode ordered sequences of events without storing the whole sequence, and therefore need much less memory and disk space and are more scalable. We demonstrate how our Event Flow Graph model can be used as a trace compression method. Furthermore, we propose a method to automatically detect the structure of MPI applications using our Event Flow Graphs. This knowledge can afterwards be used to collect performance data in a smarter way, for example by reducing the amount of redundant data collected. Finally, we demonstrate that our graphs can be used beyond trace compression and automatic analysis of performance data. We propose a new methodology for using Event Flow Graphs in visual performance data exploration.
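To give a rough feel for why a graph can be far smaller than the event sequence it summarises, the following is a minimal sketch and not the implementation presented in the seminar. It assumes a simplified model in which each node is an MPI call name and each directed edge counts how often one call directly follows another on a single process. The function name `build_event_flow_graph` and the example trace are hypothetical. Because it keeps only transition counts, this sketch cannot in general reproduce the exact original order of events.

```python
from collections import Counter


def build_event_flow_graph(events):
    """Summarise one process's ordered event list as a graph.

    Returns (nodes, edges): nodes maps each event name to how often it
    occurred, edges maps each (source, target) pair to how often target
    directly followed source. Simplified, illustrative model only.
    """
    nodes = Counter(events)
    edges = Counter(zip(events, events[1:]))
    return nodes, edges


# Hypothetical trace: an iterative solver running 1000 loop iterations.
trace = (
    ["MPI_Init"]
    + ["MPI_Send", "MPI_Recv", "MPI_Allreduce"] * 1000
    + ["MPI_Finalize"]
)

nodes, edges = build_event_flow_graph(trace)

print(f"events in trace: {len(trace)}")
print(f"graph nodes:     {len(nodes)}")
print(f"graph edges:     {len(edges)}")
for (src, dst), count in sorted(edges.items()):
    print(f"  {src} -> {dst}: {count}")
```

In this example the 3002 recorded events collapse into 5 nodes and 5 edges, and the graph stays the same size however many iterations the loop runs; only the edge counts grow.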

In addition to the Event Flow Graph model, we also explore the design and use of performance data introspection frameworks. Future HPC systems will be very dynamic environments providing extreme levels of parallelism, but with energy constraints, considerable resource sharing, and heterogeneous hardware. Thus, using real-time performance data to orchestrate program execution in such a complex and dynamic environment will be a necessity. We present two different performance data introspection frameworks that we have implemented. These introspection frameworks are easy to use and provide performance data in real time with very low overhead. We demonstrate, among other things, how our approach can be used to reduce, in real time, the energy consumed by the system.

The approaches proposed in this seminar have been validated on different HPC systems using multiple scientific kernels as well as real scientific applications. The experiments show that our approaches to performance characterisation and performance data introspection are not intrusive at all, and can be a valuable contribution to the performance monitoring of future HPC systems.