In our PAM 2021 paper, we study the performance of (smart) Network Interface Cards (NICs) for widely deployed packet classification operations, focusing on four 100-200 GbE NICs from one of the largest NIC vendors worldwide.
We show that the forwarding throughput of the tested NICs sharply degrades when i) the forwarding plane is updated and ii) packets match multiple forwarding tables in the NIC.
Moreover, we uncover that the standard DPDK rule update API realizes slow & non-atomic rule updates using a sequence of rule insertion and deletion operations.
We solve this problem by introducing a direct in-memory rule update mechanism that achieves 80% higher throughput than the standard DPDK rule update API.
This is joint work with Georgios P. Katsikas, Tom Barbette, Marco Chiesa, Dejan Kostić, and Gerald Q. Maguire Jr.
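To illustrate why an update built from separate insertion and deletion steps is non-atomic, here is a minimal toy sketch. It is not the DPDK API or the mechanism from the paper: the `RuleTable` class, its methods, and the insert-then-delete ordering are all hypothetical, chosen only to show the intermediate state that an in-place update avoids.

```python
class RuleTable:
    """Toy model of a NIC flow table (hypothetical, not a real DPDK structure)."""

    def __init__(self):
        self.rules = {}    # rule_id -> (match, action)
        self.next_id = 0

    def insert(self, match, action):
        rule_id = self.next_id
        self.next_id += 1
        self.rules[rule_id] = (match, action)
        return rule_id

    def delete(self, rule_id):
        del self.rules[rule_id]

    def lookup(self, match):
        # Return the actions of every rule that matches, in insertion order.
        return [action for m, action in self.rules.values() if m == match]


def update_by_insert_delete(table, old_id, match, new_action, observe):
    """Update as two separate steps (assumed order: insert new, delete old)."""
    new_id = table.insert(match, new_action)
    observe(table.lookup(match))   # intermediate state: old and new rule coexist
    table.delete(old_id)
    observe(table.lookup(match))   # final state: only the new rule remains
    return new_id


def update_in_place(table, rule_id, new_action, observe):
    """Update by overwriting the action of the existing entry in one step."""
    match, _ = table.rules[rule_id]
    table.rules[rule_id] = (match, new_action)
    observe(table.lookup(match))   # only one state is ever visible


table = RuleTable()
rid = table.insert("10.0.0.0/24", "to_port_1")

two_step_states = []
rid = update_by_insert_delete(table, rid, "10.0.0.0/24", "to_port_2",
                              two_step_states.append)

in_place_states = []
update_in_place(table, rid, "to_port_3", in_place_states.append)

print(two_step_states)
print(in_place_states)
```

In the two-step variant a packet arriving between the steps can see both the old and the new rule, whereas the in-place variant exposes a single consistent state. How a real NIC resolves such overlaps, and the cost of each step, depends on the hardware and driver.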
ASPLOS ’21 will feature Alireza’s presentation of our paper titled “PacketMill: Toward Per-Core 100-Gbps Networking”. This is joint work with Alireza Farshin, Tom Barbette, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.
The full abstract (with the video and more resources below):
We present PacketMill, a system for optimizing software packet processing, which (i) introduces a new model to efficiently manage packet metadata and (ii) employs code-optimization techniques to better utilize commodity hardware. PacketMill grinds the whole packet processing stack, from the high-level network function configuration file to the low-level userspace network (specifically DPDK) drivers, to mitigate inefficiencies and produce a customized binary for a given network function. Our evaluation results show that PacketMill increases throughput (up to 36.4Gbps – 70%) & reduces latency (up to 101µs – 28%) and enables nontrivial packet processing (e.g., router) at ≈100Gbps, when new packets arrive >10× faster than main memory access times, while using only one processing core.
PacketMill Webpage: https://packetmill.io/
PacketMill Paper: https://packetmill.io/docs/packetmill-asplos21.pdf
PacketMill source code: https://github.com/aliireza/packetmill
PacketMill Slides with English transcripts: https://people.kth.se/~farshin/documents/packetmill-asplos21-slides.pdf
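As a rough back-of-the-envelope check of the abstract's remark that packets arrive more than 10× faster than main memory access times, the sketch below computes the packet inter-arrival time on an Ethernet link. The 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) are standard Ethernet framing; the 64-byte frame size and the 70 ns DRAM access latency are assumptions made for illustration, not figures taken from the paper.

```python
def inter_arrival_ns(frame_bytes, link_gbps, overhead_bytes=20):
    """Time between back-to-back frames on an Ethernet link, in nanoseconds.

    overhead_bytes covers the preamble, start-of-frame delimiter, and
    inter-frame gap (8 + 12 bytes) that occupy the wire for every frame.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # bits divided by Gbit/s gives nanoseconds directly.
    return bits_on_wire / link_gbps


ASSUMED_DRAM_LATENCY_NS = 70.0   # assumption: a typical order of magnitude only

gap = inter_arrival_ns(frame_bytes=64, link_gbps=100)
print(f"64-byte frames at 100 Gbps arrive every {gap:.2f} ns")   # 6.72 ns
print(f"assumed DRAM access / inter-arrival = {ASSUMED_DRAM_LATENCY_NS / gap:.1f}x")
```

Under these assumptions a single memory access spans roughly ten packet arrivals, which is why keeping packet metadata in cache matters on a single core. Larger frames lengthen the inter-arrival time proportionally.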
Effective use of networked resources requires the ability to solve complex large-scale optimization problems fast while accounting for many input variables and performance requirements, such as end-to-end latency. Advancing beyond heuristic approaches, we begin by surveying the current state of applied machine learning for solving complex combinatorial optimization problems over networks. In our IEEE Access article titled “Learning Combinatorial Optimization on Graphs: A Survey with Applications to Networking”, we qualitatively analyse existing learning approaches and applications in the networking domain. The full abstract is as follows:
“Existing approaches to solving combinatorial optimization problems on graphs suffer from the need to engineer each problem algorithmically, with practical problems recurring in many instances. The practical side of theoretical computer science, such as computational complexity, then needs to be addressed. Relevant developments in machine learning research on graphs are surveyed for this purpose. We organize and compare the structures involved with learning to solve combinatorial optimization problems, with a special eye on the telecommunications domain and its continuous development of live and research networks.”
This work was done by Natalia Vesselinova (RISE), Rebecca Steinert (RISE), Daniel Felipe Perez-Ramirez (RISE) and Magnus Boman (KTH).
At USENIX ATC 2020, Alireza presented our paper titled “Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks”. Full materials (video, slides, PDF) are available at the USENIX site. The paper abstract is below. This is joint work with Alireza Farshin, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić.
Memory access is the major bottleneck in realizing multi-hundred-gigabit networks with commodity hardware, hence it is essential to make good use of cache memory that is a faster, but smaller memory closer to the processor. Our goal is to study the impact of cache management on the performance of I/O intensive applications. Specifically, this paper looks at one of the bottlenecks in packet processing, i.e., direct cache access (DCA). We systematically studied the current implementation of DCA in Intel® processors, particularly Data Direct I/O technology (DDIO), which directly transfers data between I/O devices and the processor’s cache. Our empirical study enables system designers/developers to optimize DDIO-enabled systems for I/O intensive applications. We demonstrate that optimizing DDIO could reduce the latency of I/O intensive network functions running at 100Gbps by up to ~30%. Moreover, we show that DDIO causes a 30% increase in tail latencies when processing packets at 200Gbps, hence it is crucial to selectively inject data into the cache or to explicitly bypass it.
Having published a NOMS 2018 paper on reliable distributed control planes, we continued working on this important problem and added the angle of guaranteed performance. Besides filing a patent application, the work culminated in an IEEE Access article titled “Fast Deployment of Reliable Distributed Control Planes with Performance Guarantees”. The full abstract is as follows:
Current trends strongly indicate a transition towards large-scale programmable networks with virtual network functions. In such a setting, deployment of distributed control planes will be vital for guaranteed service availability and performance. Moreover, deployment strategies need to be completed quickly in order to respond flexibly to varying network conditions. We propose an effective optimization approach that automatically decides on the needed number of controllers, their locations, control regions, and traffic routes into a plan which fulfills control flow reliability and routability requirements, including bandwidth and delay bounds. The approach is also fast: the algorithms for bandwidth and delay bounds can reduce the running time at the level of 50x and 500x, respectively, compared to state-of-the-art and direct solvers such as CPLEX. Altogether, our results indicate that computing a deployment plan adhering to predetermined performance requirements over network topologies of various sizes can be produced in seconds and minutes, rather than hours and days. Such fast allocation of resources that guarantees reliable connectivity and service quality is fundamental for elastic and efficient use of network resources.
The work was done at RISE by Shaoteng Liu, Rebecca Steinert, Natalia Vesselinova, and Dejan Kostić.