Load Balancing in Stream Processing Systems
Speaker: Muhammad Anis Uddin Nasir
Anis () is a third PhD student at KTH Royal Institute of Technology, working under the Marie Curie Initial Training Network Project called iSocial. During the PhD, he finished two internships at: Yahoo Labs Barcelona and Aalto University (supervisor: Aristides Gionis). His research interests are related to designing algorithms for streaming processing, graph processing and large scale distributed systems.
Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: having more workers implies fewer keys per worker, so it becomes harder to “average out” the cost of hot keys with cold keys.
To achieve load balancing for distributed stream processing systems, we propose two different techniques. First, we introduce PARTIAL KEY GROUPING (PKG), a new stream partitioning scheme that adapts the classical “power of two choices” to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. Second, we extend PKG, for the highly skewed input streams, and propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d ≥ 2 choices to ensure a balanced load, where d is tuned automatically to minimize the memory and computation cost of operator replication. Both the proposed techniques operate in online fashion and provide significant improvements compared to the state of the art solutions.