Increasing the Spatial Correlation of Logical Units of Data to Enable an Ultra-Low Latency Internet (ULTRA)
Three-part public summary (current as of January 2021):
Summary of the context and overall objectives of the project
A number of time-critical societal applications deployed in the cloud infrastructure have to provide high reliability and throughput, along with guaranteed low latency for delivering data. This low-latency guarantee is sorely lacking today: the so-called tail latency of the slowest responses in popular cloud services is several orders of magnitude longer than the median response time. This issue renders the existing cloud infrastructures unsuitable for time-critical societal applications.
The overarching goal of this project is to enforce a large degree of correlation in the requests for logical units of data, both temporally (across time) and spatially (across server work units, so that servers achieve high cache hit rates). As a result, the logical units of data will be processed at close to the maximum processing speed of the cloud servers. By doing so, we will achieve an ultra-low latency Internet. This project will produce the knowledge needed to build the tools that will be key to dramatically reducing the latency of important societal services; these include cloud services used by a large number of users on a daily basis.
Main results achieved so far
In our CacheDirector paper at EuroSys 2019, we show how we unlocked a performance-enhancing feature that has existed in Intel® processors for almost a decade. Our results show that placing incoming packets on the so-called slice of the fast cache memory that is closest to the CPU core that will process the packet can reduce tail latencies in latency-critical Network Function Virtualization (NFV) service chains by 21.5%. In our USENIX ATC 2020 paper, we demonstrate that optimizing DDIO (a form of direct cache access) can reduce the latency of I/O-intensive network functions running at 100 Gbps by up to ~30%.
In our CoNEXT 2019 paper, we demonstrate the importance of load balancing within a single machine (potentially with hundreds of CPU cores). Our technique, RSS++, achieves up to 14x lower 95th-percentile tail latency and orders of magnitude fewer packet drops than standard RSS under high CPU utilization. Next, we examine the problem faced by large service providers that use load balancers to dispatch millions of incoming connections per second towards thousands of servers. There are two basic yet critical requirements for a load balancer: uniform load distribution of the incoming connections across the servers and per-connection consistency (PCC), i.e., the ability to map packets belonging to the same connection to the same server even in the presence of changes in the number of active servers and load balancers. Yet, meeting both of these requirements at the same time has been an elusive goal. In our NSDI 2020 paper, we present Cheetah, a load balancer that supports uniform load distribution and PCC while being scalable, memory efficient, resilient to clogging attacks, and fast at processing packets.
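To see why PCC is hard to combine with uniform load distribution, consider the simplest dispatching scheme, hashing a connection identifier modulo the number of servers. This toy sketch (not any specific load balancer; the hash value is an arbitrary illustrative constant) shows how a single change in the server pool remaps an existing connection:

```python
# Toy illustration of why naive hash-based dispatching breaks
# per-connection consistency (PCC) when the server pool changes.
def naive_server(conn_hash: int, n_servers: int) -> int:
    """Map a connection to a server by simple modulo hashing."""
    return conn_hash % n_servers

conn_hash = 0xBEEF          # stand-in for a hash of the 5-tuple
before = naive_server(conn_hash, 100)  # pool of 100 servers
after = naive_server(conn_hash, 101)   # one server is added

# The connection is now steered to a different server, so its
# in-flight state (e.g., the TCP connection) is broken.
assert before != after
```

Schemes like consistent hashing reduce, but do not eliminate, such remappings, which is why meeting both requirements simultaneously has been elusive.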
In our work on Geo-distributed load balancing, we observe that the increasing density of globally distributed datacenters reduces the network latency between neighboring datacenters and allows replicated services deployed across neighboring locations to share workload when necessary, without violating strict Service Level Objectives (SLOs). In our SoCC 2018 paper, we present Kurma, our fast and accurate load balancer for geo-distributed storage systems and demonstrate Kurma’s ability to effectively share load among datacenters while reducing SLO violations by up to a factor of 3 in high load settings or reducing the cost of running the service by up to 17%.
In our work on the third task of the project, we have so far countered a commonly held belief that traffic engineering and routing changes are infrequent. Based on our measurements over a number of years of traffic between data centers in the network of one of the largest cloud providers, we found that it is common for flows to change paths at ten-second intervals or even faster. These frequent path and, consequently, latency variations can negatively impact the performance of cloud applications, specifically latency-sensitive and geo-distributed applications.
We have also performed additional work on low-latency distributed file systems for the next generation of durable storage. In joint work with researchers across the globe, led by Waleed Reda, we built the Assise distributed file system, which colocates computation with persistent memory module (PMM) storage. Our evaluation shows that Assise improves write latency by up to 22x, throughput by up to 56x, and fail-over time by up to 103x, and scales up to 6x better than its counterparts.
Progress beyond the state of the art
In our EuroSys 2019 paper, we designed and implemented CacheDirector, a network I/O solution that implements slice-aware memory management by carefully mapping the first 64 bytes of a packet (containing the packet's header) to the cache slice that is closest to the associated processing core. In follow-on work published in our USENIX ATC 2020 paper, we extensively studied the characteristics of DDIO in different scenarios, identified its shortcomings, and demonstrated the necessity and benefits of bypassing the cache while receiving packets at 200 Gbps.
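The core idea of slice-aware memory management can be sketched as follows. This is a hedged illustration, not CacheDirector's implementation: the real address-to-slice mapping on Intel CPUs is an undocumented hash over physical address bits that must be reverse-engineered per processor model, so `slice_of` below is an illustrative stand-in:

```python
# Sketch of slice-aware buffer placement: choose a packet-header
# buffer whose first cache line maps to the LLC slice closest to
# the core that will process the packet.

N_SLICES = 8  # assumed number of last-level cache slices

def slice_of(addr: int) -> int:
    """Illustrative placeholder for the CPU's address-to-slice hash
    (the real hash XORs selected physical address bits)."""
    return (addr >> 6) % N_SLICES

def pick_header_buffer(free_buffers, target_slice):
    """Return a 64-byte-aligned buffer that lands in target_slice,
    falling back to any free buffer if none matches."""
    for addr in free_buffers:
        if slice_of(addr) == target_slice:
            return addr
    return free_buffers[0]

# Example: core 3 wants packet headers placed in its local slice.
buffers = [0x1000 + i * 64 for i in range(32)]
buf = pick_header_buffer(buffers, target_slice=3)
assert slice_of(buf) == 3
```

Because only the first 64 bytes (the header) need this careful placement, the bookkeeping cost stays small relative to the latency gained from local-slice hits.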
In our work on intra-server load balancing, RSS++ [CoNEXT 2019], we proposed a new load-balancing technique that dynamically (several times per second) modifies the receive-side scaling (RSS) indirection table to spread the load across the CPU cores in a more optimal way. In our follow-on work on Cheetah [NSDI 2020], we designed a data center load balancer (LB) that guarantees per-connection consistency for any realizable load-balancing mechanism. Cheetah carefully encodes the connection-to-server mappings into the packet headers so as to guarantee levels of resilience that are no worse (and in some cases even stronger) than those of existing stateless and stateful LBs.
In the context of our geo-distributed load-balancing system Kurma [SoCC 2018], we decouple request completion time into (i) service time, (ii) base network propagation delay, and (iii) residual WAN latency caused by network congestion. At run-time, Kurma tracks changes in each component independently and thus accurately estimates the rate of SLO violations among geo-distributed datacenters from the rate of incoming requests.
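The decomposition itself is simple arithmetic; the following sketch (with made-up numbers, not measurements from the paper) shows the three components and the residual term that Kurma tracks separately:

```python
# Illustrative decomposition of request completion time into the
# three Kurma components: service time, base propagation delay,
# and residual WAN latency caused by congestion.
def residual_wan_latency(completion_ms: float,
                         service_ms: float,
                         base_propagation_ms: float) -> float:
    """Residual WAN latency = completion time minus service time
    minus speed-of-light propagation between the datacenters."""
    return completion_ms - service_ms - base_propagation_ms

# A 48 ms request with 5 ms of service time over a path with a
# 30 ms propagation delay leaves 13 ms attributable to congestion.
assert residual_wan_latency(48.0, 5.0, 30.0) == 13.0
```

Tracking each term independently is what lets the system tell a slow server apart from a congested WAN path, and react differently to each.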
Our recent CCR 2020 journal article reports on our multi-year measurements and the analysis focused on observing path changes and latency variations between different Amazon AWS regions. In the process of collecting these measurements, we devised a path change detector that we validated using both ad hoc experiments and feedback from cloud networking experts. In addition, we developed a technique for decoupling propagation delays from congestion.
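One simple way to separate propagation delay from congestion in a stream of RTT samples is to treat the minimum RTT over a window as the propagation baseline and attribute the excess to queueing. The sketch below illustrates this idea only; it is a simplified stand-in for, not a reproduction of, the technique in our CCR 2020 article:

```python
# Sketch: decouple propagation delay from congestion in RTT samples
# by taking the windowed minimum as the propagation estimate; the
# excess over that minimum is congestion-induced queueing delay.
def split_rtts(rtt_samples_ms):
    """Return (propagation_estimate, per-sample queueing delays)."""
    base = min(rtt_samples_ms)
    queueing = [r - base for r in rtt_samples_ms]
    return base, queueing

# Hypothetical RTT samples (ms) between two cloud regions; the
# spike at index 3 is attributed to congestion, not a path change.
rtts = [31.2, 30.9, 30.1, 42.7, 30.1, 35.0]
base, queueing = split_rtts(rtts)
assert base == 30.1
```

A sustained shift in the windowed minimum, as opposed to transient excess over it, is the kind of signal a path-change detector can key on.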
Our work on Assise [OSDI 2020] is a blueprint for distributed file systems built on the next generation of durable storage. We designed CC-NVM, the first persistent and available distributed cache-coherence layer. CC-NVM provides locality for data and metadata access, replicates for availability, and provides linearizability and prefix crash consistency for all file system input/output.
The cloud computing infrastructure that logically centralizes data storage and computation for many different actors is a prime example of a key societal system. A number of time-critical applications deployed in the cloud infrastructure have to provide high reliability and throughput, along with guaranteed low latency for delivering data. This low-latency guarantee is sorely lacking today, with the so-called tail latency of the slowest responses in popular cloud services being several orders of magnitude longer than the median response times. Unfortunately, simply using a network with ample bandwidth does not guarantee low latency because of problems with congestion at the intra- and inter-data center levels and server overloads. All of these problems currently render the existing cloud infrastructures unsuitable for time-critical societal applications. The reasons for unpredictable delays across the Internet and within the cloud infrastructure are numerous, but two of the key culprits are: 1) slow memory subsystems that limit server effectiveness, and 2) excess buffering in the Internet that further limits correlation of data requests. The aim of this project is to dramatically change the way data flows across the Internet, such that it is more highly correlated when it is to be processed at the servers. The result is that the logical units of data will be processed at close to the maximum processing speed of the cloud servers. By doing so, we will achieve an ultra-low latency Internet. This project will produce the tools and knowledge that will be key to dramatically reducing the latency of key societal services; these include cloud services used by a large number of users on a daily basis.