We are happy to announce that Alireza Farshin has successfully defended his licentiate thesis (the licentiate is a KTH degree awarded roughly halfway to a PhD)! We are once again very grateful to Prof. Gerald Q. Maguire Jr. for a fantastic job as co-advisor. Prof. Babak Falsafi was a superb opponent at the licentiate seminar. Alireza’s thesis is available online:
On March 26, 2019, in Dresden, Alireza Farshin presented our EuroSys 2019 paper on unlocking a performance-enhancing feature that has existed in Intel processors for almost a decade. The video of the talk and the slides are now available.
CPUs typically have cache memory, which speeds up access to the most commonly used data and thus largely masks the long latency of main memory (DRAM). In recent years, we have witnessed ever-increasing “core” counts (cores are the basic building blocks of CPUs that can operate independently). For the last nine years, the largest, last-level cache (LLC) of Intel processors has been split into “slices,” and each core can access the slice attached to it faster than the slices attached to other cores. Our work first showed that reading data from the nearest slice is 20% faster on Intel’s Haswell CPUs and 40% faster on the newer Skylake architecture. We then showed that a large fraction of these potential gains can be realized when an application carefully places its working set (its most commonly used data) in the slices nearest to the cores that will process the data. To showcase the benefits of our approach on real applications, we built a transparent software layer called CacheDirector that reduced the tail latency (the processing time at the 99th percentile) by 21% for packets going through a service chain running at 100 Gbps. Handling traffic at such high speeds is vital for meeting increasing network demands.
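To make “placing the working set in the nearest slice” concrete, here is a minimal sketch under loudly stated assumptions. Real Intel processors map each physical address to an LLC slice with an undocumented, CPU-specific hash; the `TOY_MASKS` hash below is entirely hypothetical and only mimics the general idea (each slice-index bit is the parity of some address bits). The helper names `toy_slice_of` and `lines_in_slice` are ours, and the code works on plain integers: a real slice-aware allocator would need actual physical addresses and a measured core-to-slice proximity map.

```python
CACHE_LINE = 64  # bytes per cache line

# HYPOTHETICAL slice hash: each of the 3 slice-index bits is the parity (XOR)
# of a few physical-address bits, giving 8 toy slices. These masks are made up
# for illustration and do not match any real processor.
TOY_MASKS = (
    (1 << 6) | (1 << 9) | (1 << 12),
    (1 << 7) | (1 << 10) | (1 << 13),
    (1 << 8) | (1 << 11) | (1 << 14),
)


def toy_slice_of(paddr: int) -> int:
    """Return the (toy) LLC slice index for a physical address."""
    slice_id = 0
    for bit, mask in enumerate(TOY_MASKS):
        parity = bin(paddr & mask).count("1") & 1
        slice_id |= parity << bit
    return slice_id


def lines_in_slice(base: int, size: int, target_slice: int) -> list[int]:
    """Return the cache-line addresses in [base, base + size) that map to target_slice.

    A slice-aware allocator would hand these lines out to the data that a core
    sitting next to target_slice accesses most often.
    """
    if base % CACHE_LINE != 0:
        raise ValueError("base must be cache-line aligned")
    return [
        addr
        for addr in range(base, base + size, CACHE_LINE)
        if toy_slice_of(addr) == target_slice
    ]


if __name__ == "__main__":
    # Assumption: core 3 is closest to slice 3. On a real CPU this proximity
    # has to be measured, e.g., by timing accesses from each core.
    hot_lines = lines_in_slice(base=0x100000, size=64 * 1024, target_slice=3)
    print(len(hot_lines), "of", 64 * 1024 // CACHE_LINE, "lines map to slice 3")
```

Because the low 6 address bits never enter the hash, a whole cache line always maps to one slice, so the allocator can reason in units of cache lines. With this uniform toy hash, only one line in eight lands in any given slice, which is why a slice-aware allocator needs a larger pool of memory to collect enough “nearby” lines for a working set.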
Our work has a clear benefit: it increases performance “for free,” or reduces energy consumption while performing the same amount of work, even in finely tuned systems. Many applications can benefit from our contribution with relatively small changes. Important future societal applications will also benefit from more predictable response latencies.
Our work on unlocking the performance-enhancing last-level cache feature of recent Intel processors is starting to make the news! You can follow the news here: Ericsson, KTH Research, KTH Research (in Swedish), and you can also join the technical discussion on Dejan’s Facebook post. The full EuroSys 2019 paper is available here.
In our upcoming EuroSys 2019 paper, we exploit the characteristics of the non-uniform cache architecture (NUCA) in recent Intel processors to introduce a new memory management scheme, namely slice-aware memory management. We believe that we are the first to: (i) take a step toward using current hardware more efficiently in this manner, and (ii) advocate taking advantage of the NUCA characteristics of the LLC and allowing networking applications to benefit from them. In addition, we propose CacheDirector, a network I/O solution that extends Direct Data I/O (DDIO) and places each packet’s header in the LLC slice closest to the relevant processing core. Our results show that CacheDirector can reduce tail latencies in latency-critical Network Function Virtualization (NFV) service chains by 21.5%. Furthermore, our work demonstrates that optimizing computer systems to exploit nanosecond-scale improvements can have a notable impact on the performance of networking applications.
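With DDIO, incoming packets are written directly into the LLC, and the slice a byte lands in is determined by its physical address. One plausible way to steer a packet header into a chosen slice, sketched here as an assumption rather than as CacheDirector’s actual code, is to shift where the packet data starts inside its receive buffer (its headroom) until the header’s first cache line maps to the target slice. The slice hash `toy_slice_of` is hypothetical (a made-up parity scheme standing in for Intel’s undocumented one), and `headroom_for_slice` and its `max_headroom` parameter are illustrative names of our own.

```python
CACHE_LINE = 64  # bytes per cache line

# HYPOTHETICAL slice hash (a stand-in for the real, undocumented Intel hash):
# each of the 3 slice-index bits is the parity of a few address bits.
TOY_MASKS = (
    (1 << 6) | (1 << 9) | (1 << 12),
    (1 << 7) | (1 << 10) | (1 << 13),
    (1 << 8) | (1 << 11) | (1 << 14),
)


def toy_slice_of(paddr: int) -> int:
    """Return the (toy) LLC slice index for a physical address."""
    slice_id = 0
    for bit, mask in enumerate(TOY_MASKS):
        slice_id |= (bin(paddr & mask).count("1") & 1) << bit
    return slice_id


def headroom_for_slice(buf_paddr: int, target_slice: int, max_headroom: int):
    """Smallest cache-line-multiple headroom so the header's first line maps to target_slice.

    buf_paddr is the (cache-line aligned) physical start of the receive buffer.
    Returns None if no headroom up to max_headroom reaches the target slice.
    """
    for offset in range(0, max_headroom + 1, CACHE_LINE):
        if toy_slice_of(buf_paddr + offset) == target_slice:
            return offset
    return None


if __name__ == "__main__":
    # Assumption: the core that will process this packet sits next to slice 5.
    print(headroom_for_slice(0x100000, target_slice=5, max_headroom=512))  # prints 320
```

Only the header’s first cache line is steered here, which matches the idea of placing the packet header (the part a network function touches first) close to the processing core; the trade-off is that some buffer space is spent on the extra headroom.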