An Extensible Performance Analysis Tool for Dataflow Applications

The Master's degree project aims to develop an extensible performance analysis tool for dataflow applications developed in ArchiDeS - a component-based message-passing programming system for dataflow applications on many-core systems. There is an existing target application: an executable model of physical layer (L1) uplink data processing in LTE (Long-Term Evolution) base stations (eNodeB). L1 uplink processing is irregular due to constantly changing baseband resource utilization, it is pipelined, and the model exhibits a large amount of concurrency - which together make it difficult to explain and improve the attained performance levels when run on many-core systems such as the 64-core Tilera TILEPro.

The tool will visualize execution, make it possible to find and closely inspect problem spots, and present aggregated application execution statistics. Particular focus in the project will be placed on the extensibility of the tool, as well as on its implementation quality and documentation.

ArchiDeS is a message-passing programming framework for component-based applications. Components form a hierarchy, where leaf "primitive" components contain the code that handles messages. Components are interconnected by bindings. The architecture representation in ArchiDeS is first-class, and it will also be used by the tool developed in the project. ArchiDeS components interact by asynchronous message passing. Message handlers are executed to completion, with no preemption. Multiple messages can be processed in parallel by workers running on different cores/processors. Each worker has a local work queue and performs load balancing by means of work stealing whenever necessary. ArchiDeS enables "online" (i.e. dynamic) application-specific execution schedulers that can exploit application properties to, e.g., prefer components on the execution critical path.
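
As a rough illustration, a primitive component in an ArchiDeS-like framework might look as sketched below; all type and function names here (archides_msg, archides_send, fir_filter) are hypothetical, chosen for the example rather than taken from the actual ArchiDeS API:

    #include <stdio.h>
    #include <stddef.h>

    typedef struct archides_msg {      /* a message delivered over a binding */
        int    port;                   /* binding the message arrived on     */
        void  *payload;
        size_t size;
    } archides_msg;

    /* Hypothetical send primitive: in the real framework this would enqueue
       the message on a worker's local work queue; here it is only a stub. */
    static void archides_send(int port, void *payload, size_t size)
    {
        (void)payload;
        printf("send %zu bytes on binding %d\n", size, port);
    }

    typedef struct fir_filter {        /* state of one primitive (leaf) component */
        float coeffs[16];
        int   out_port;                /* outgoing binding to the next component  */
    } fir_filter;

    /* Message handler: runs to completion on one worker core, never preempted. */
    static void fir_filter_handle(fir_filter *self, const archides_msg *msg)
    {
        /* ... process msg->payload using self->coeffs ... */
        archides_send(self->out_port, msg->payload, msg->size);
    }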

Physical layer (L1) uplink data processing in LTE (Long-Term Evolution) eNodeBs involves more than a dozen different processing steps, some of which should be carried out on dedicated hardware accelerators - separate computing resources limited to a specific function (or a small set thereof). We view the sequence of those steps as a Directed Acyclic Graph (DAG). There are hard real-time constraints on DAG execution, because eNodeBs must send certain acknowledgments back to LTE user equipment (UE, e.g. mobile phones) on time. Furthermore, processing of individual elements of over-the-air transmission overlaps in time, i.e. at any given point in time several input data items are being processed, each at a different degree of completion. This pattern is similar to pipelined processing in modern CPUs. Finally, there are additional data dependencies between DAG components due to the processing of so-called reference symbols and their use for improving channel quality. At the same time, the allocation of air ("baseband") resources to users of the LTE infrastructure constantly changes. All these issues make parallel execution scheduling a non-trivial task, and, in particular, no static scheduling solution is known. In the absence of static scheduling, ArchiDeS makes it possible to define an "online" (dynamic) application-specific scheduler that can exploit knowledge of L1 uplink processing and outperform application-"ignorant" schedulers such as generic work stealing.
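
As a sketch of what such an application-specific policy could look like, the code below prefers messages whose destination component has the longest remaining critical path through the DAG, rather than plain FIFO or work-stealing order. The scheduler hook, the queued_msg type, and the remaining_cp table are assumptions made for illustration and are not part of ArchiDeS:

    #include <stddef.h>

    #define N_COMPONENTS 16

    /* Hypothetical table, derived offline from the DAG: estimated remaining
       critical-path length (in cycles) downstream of each component. */
    static const unsigned remaining_cp[N_COMPONENTS] = { 0 /* filled from profiling */ };

    typedef struct queued_msg {
        unsigned component_id;         /* destination component of the message */
        struct queued_msg *next;
    } queued_msg;

    /* Pick the queued message whose destination lies furthest from the end of
       the DAG, i.e. the one most likely to be on the critical path. */
    static queued_msg *pick_next(queued_msg *queue)
    {
        queued_msg *best = queue;
        for (queued_msg *m = queue; m != NULL; m = m->next)
            if (remaining_cp[m->component_id] > remaining_cp[best->component_id])
                best = m;
        return best;
    }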

There are three principal sources of performance problems in dataflow applications using ArchiDeS or similar programming systems. First, execution scheduling can happen to ignore the critical (execution) path in a DAG. Second, in an important special case of the first: with pipelined parallelism, execution scheduling can unbalance the execution of different DAGs in the pipeline, such that some finish unnecessarily early while others finish too late and thus violate the real-time constraints. The third type of performance problem can arise when computing resources are heterogeneous: an uneven distribution of load on a particular type of limited resource, such as a DFT accelerator, will increase the average processing time of DFT tasks, causing underutilization of other types of computing resources and/or failure to meet real-time requirements. The tool developed in the project attempts to address all three types of performance problems.

The ArchiDeS run-time system is instrumented to collect execution traces. Traces are collected per core and with minimal overhead in order to minimize the measurement impact. A separate process on a dedicated core decodes and merges the traces from individual cores and produces a sequential execution history of the application run.
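
A minimal sketch of such low-overhead per-core tracing is shown below; the event layout, event kinds, and buffer size are assumptions chosen for the example and need not match the actual ArchiDeS instrumentation:

    #include <stdint.h>

    enum { EV_ENQUEUE, EV_HANDLE_BEGIN, EV_HANDLE_END, EV_STEAL };

    typedef struct trace_event {       /* fixed-size record, written in place  */
        uint64_t timestamp;            /* e.g. cycle counter read on this core */
        uint32_t kind;                 /* one of the event kinds above         */
        uint32_t component_id;
        uint64_t message_id;
    } trace_event;

    #define TRACE_CAPACITY (1u << 20)

    typedef struct trace_buffer {      /* one instance per core, never shared  */
        uint32_t    count;
        trace_event events[TRACE_CAPACITY];
    } trace_buffer;

    /* Append one event; no locks are needed because each core owns its buffer.
       Events are dropped if the buffer fills up. */
    static inline void trace_record(trace_buffer *tb, uint64_t now, uint32_t kind,
                                    uint32_t component_id, uint64_t message_id)
    {
        if (tb->count < TRACE_CAPACITY) {
            trace_event *e = &tb->events[tb->count++];
            e->timestamp    = now;
            e->kind         = kind;
            e->component_id = component_id;
            e->message_id   = message_id;
        }
    }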

The execution history, together with the ArchiDeS application architecture description, is used by the performance analysis tool to compute execution statistics and to visualize resource utilization over time as well as the progress of dataflow processing for individual input data elements.

Macro-level execution statistics include, but are not limited to, the amount of potential parallelism (concurrency) and the minimal resource requirements of the application, the utilization of computing resources of all types, the processing time of individual input elements, the waiting time for message processing, and the performance of the work-stealing execution scheduler (stealing success rate and priority inversion in message processing).
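
As an illustration of how one such statistic could be derived from the merged execution history, the sketch below computes the average message waiting time between enqueue and handling; the history_event layout is an assumption matching the hypothetical trace format sketched earlier:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    typedef struct history_event {
        uint64_t timestamp;            /* from the merged execution history          */
        int      kind;                 /* 0 = message enqueued, 1 = handling begins  */
        uint64_t message_id;
    } history_event;

    /* Average time a message waits between being enqueued and being handled.
       For brevity, message ids are assumed dense and timestamps non-zero. */
    static double average_wait(const history_event *h, size_t n, size_t max_msgs)
    {
        uint64_t *enq = calloc(max_msgs, sizeof *enq);
        double total = 0.0;
        size_t waits = 0;
        if (!enq)
            return 0.0;
        for (size_t i = 0; i < n; i++) {
            if (h[i].kind == 0)
                enq[h[i].message_id] = h[i].timestamp;
            else if (h[i].kind == 1 && enq[h[i].message_id] != 0) {
                total += (double)(h[i].timestamp - enq[h[i].message_id]);
                waits++;
            }
        }
        free(enq);
        return waits ? total / (double)waits : 0.0;
    }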

The tool will produce several types of graphs, which we call "views". The first type visualizes resource utilization as a function of time. Tool users will be able to navigate in the time domain and change the time scale ("zoom in/out") around the currently selected point in time, and to choose from several types of representation of resource activity, such as visualizing the pipelined processing of different dataflow input elements with different colors, or visualizing the execution of different dataflow processing elements (components) with different colors. Synchronized with the time scale of the graph, the tool will be able to visualize system-wide aggregated execution statistics such as the global message queue length and resource utilization. Furthermore, tool users will be able to set focus on a particular dataflow DAG and proceed to inspect it using the tool's second type of output.
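
One possible way to prepare the data for this view is to bin the busy intervals of workers into fixed-width time buckets, as sketched below; the interval and bucket representations are assumptions made for the example:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct busy_interval {     /* one message-handler execution          */
        uint64_t start, end;           /* timestamps from the execution history  */
        unsigned worker;
    } busy_interval;

    /* Fill util[b] with the fraction of total worker capacity used during the
       time bucket [t0 + b*width, t0 + (b+1)*width). Intervals are assumed to
       start at or after t0. */
    static void bin_utilization(const busy_interval *iv, size_t n,
                                uint64_t t0, uint64_t width,
                                double *util, size_t n_buckets, unsigned n_workers)
    {
        for (size_t b = 0; b < n_buckets; b++)
            util[b] = 0.0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s = iv[i].start, e = iv[i].end;
            for (size_t b = (size_t)((s - t0) / width);
                 b < n_buckets && t0 + b * width < e; b++) {
                uint64_t lo = t0 + b * width, hi = lo + width;
                uint64_t ovl_lo = s > lo ? s : lo, ovl_hi = e < hi ? e : hi;
                util[b] += (double)(ovl_hi - ovl_lo) / (double)(width * n_workers);
            }
        }
    }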

The tool's second type of output will show the dataflow processing of individual input elements - i.e. the dataflow DAG annotated with the number of messages sent in the particular DAG instance. The critical path(s) can be visualized. The tool will allow users to hide or show the DAG's sub-components. For every message 'msg' sent in the DAG, users will be able to inspect its scheduling history - the sequence of messages that were executed by the worker before the message 'msg' was finally processed. Note that, given a message in a DAG, the tool will be able to locate and visualize the corresponding DAG. This provides an ultimate low-level execution tracing facility for ArchiDeS applications, but due to the large number of messages and DAG instances it can only be used as a last resort. Finally, this output format can also be used by tool users to tell the tool what constitutes a DAG, i.e. its input and/or output messages, as well as, possibly, to truncate unimportant dataflow (input) dependencies.
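
As an illustration, the critical path of a DAG instance could be computed by a longest-path pass over topologically ordered components, as sketched below; the adjacency-matrix representation and per-component cost model are assumptions made for the example:

    #include <stddef.h>

    #define MAX_NODES 64

    /* cost[v] is the measured execution time of component v; succ[u][v] is
       non-zero if there is a binding from u to v. Components are assumed to be
       numbered in topological order. Returns the length of the longest
       (critical) path and records in pred[] each node's predecessor on that
       path (-1 = none). */
    static unsigned critical_path(const unsigned cost[MAX_NODES],
                                  const unsigned char succ[MAX_NODES][MAX_NODES],
                                  size_t n, int pred[MAX_NODES])
    {
        unsigned longest[MAX_NODES];
        unsigned best = 0;
        for (size_t v = 0; v < n; v++) {
            longest[v] = cost[v];
            pred[v] = -1;
            for (size_t u = 0; u < v; u++)
                if (succ[u][v] && longest[u] + cost[v] > longest[v]) {
                    longest[v] = longest[u] + cost[v];
                    pred[v] = (int)u;
                }
            if (longest[v] > best)
                best = longest[v];
        }
        return best;
    }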

The tool's third type of output will show resource utilization and the amount of concurrency (i.e. the message queue length) for individual dataflow DAGs as a function of time. Navigation and zooming for this type of output are similar to those of type 1, and the outputs of these two types can be synchronized: navigating in one type of graph will change the view of the other. Furthermore, navigating in the graph will be reflected in the corresponding DAG view (output type 2): for a selected point in time, the DAG view will highlight the DAG components being executed at that time. Using output type 3, tool users will be able to analyze a DAG's execution in the system-wide context (using output type 1), and compare the execution of different dataflow DAGs. The latter is particularly useful for comparing the execution of dataflow DAGs that eventually violated real-time constraints with DAGs that completed unnecessarily early: this kind of comparison might suggest a way to improve execution scheduling so as to reduce the resource utilization imbalance between those two types of dataflow DAGs.
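
As a sketch, the concurrency curve for a single DAG instance could be derived from the sorted execution history by counting pending messages, under the assumed event layout used earlier:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    typedef struct dag_event {
        uint64_t timestamp;
        int      kind;                 /* 0 = message enqueued, 1 = handling begins */
    } dag_event;

    /* Print (time, pending messages) pairs for one DAG instance. Events are
       assumed to be sorted by timestamp, as produced by the trace-merging step. */
    static void concurrency_curve(const dag_event *ev, size_t n)
    {
        long pending = 0;
        for (size_t i = 0; i < n; i++) {
            pending += (ev[i].kind == 0) ? 1 : -1;
            printf("%llu %ld\n", (unsigned long long)ev[i].timestamp, pending);
        }
    }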

In order to select for closer inspection resources in output type 1, points in time in outputs of types 1 and 3, as well as messages in worker message queues, an extensible interface and a first version of a query language will be developed, starting with simple logical predicates in C syntax (which could, e.g., be compiled on-line by a C compiler and dynamically linked into the tool at run time).
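
A possible realization of this idea is sketched below: the user writes a predicate in C, the tool compiles it into a shared object and loads it with dlopen(); the predicate signature is an assumption made for the example:

    #include <dlfcn.h>
    #include <stdio.h>

    /* A user-written predicate, compiled by the tool into a shared object,
       might look like:
           int match(unsigned component_id, unsigned long long wait_cycles)
           { return component_id == 7 && wait_cycles > 100000; }             */
    typedef int (*predicate_fn)(unsigned component_id, unsigned long long wait_cycles);

    /* Load the compiled predicate at run time; the tool would then apply it to
       each candidate resource, point in time, or queued message. */
    static predicate_fn load_predicate(const char *so_path)
    {
        void *handle = dlopen(so_path, RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return NULL;
        }
        predicate_fn fn = (predicate_fn)dlsym(handle, "match");
        if (!fn)
            fprintf(stderr, "dlsym: %s\n", dlerror());
        return fn;
    }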

The performance visualization and analysis tool can eventually be extended with visualization and analysis of the hardware platform's performance counters, and with analysis of permissible exchanges of computation resources between branches in a DAG and/or a set of DAGs.

