Towards Efficient and Robust Decentralized Learning
Time: Thu 2026-03-05 15.00
Location: D3, Lindstedtsvägen 5, Stockholm
Language: English
Subject area: Electrical Engineering
Doctoral student: Zesen Wang, Reglerteknik
Opponent: Research Scientist Hao-Jun Michael Shi, Meta Platforms, Menlo Park, CA, USA
Supervisor: Professor Mikael Johansson, Reglerteknik
QC 20260205
Abstract
The widening gap between GPU compute capability and inter-node network bandwidth presents a fundamental challenge for distributed deep learning. While traditional "All-Reduce" methods require every GPU to synchronize globally, slowing the entire system to the speed of the slowest worker, decentralized training allows GPUs to communicate only with a few neighbors. Despite its potential, decentralized training is rarely adopted in practice because its performance gains are hard to predict, its impact on model accuracy is poorly understood, and it is complex to implement.
This thesis investigates decentralized training as a robust and efficient alternative to global synchronization. By restricting communication to a sparse graph of neighbors, decentralized algorithms reduce bandwidth usage and alleviate the straggler bottleneck inherent in global collective communication. Despite these theoretical advantages, adoption has been hindered by three key challenges: ambiguity regarding efficiency gains, uncertainty about generalization performance, and implementation barriers.
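The neighbor-only communication pattern described above can be sketched as a gossip averaging step. The ring topology, uniform mixing weights, and function name below are illustrative assumptions for this sketch, not the specific algorithms studied in the thesis:

```python
# Sketch of one synchronous gossip (averaging) round on a ring topology.
# Each worker mixes its parameter with its two ring neighbors using equal
# weights; repeated rounds drive all workers toward the global average
# without any all-to-all communication.

def ring_gossip_step(params):
    """One gossip round; params[i] is worker i's (scalar) parameter."""
    n = len(params)
    new = []
    for i in range(n):
        left, right = params[(i - 1) % n], params[(i + 1) % n]
        # Equal weight on self and both neighbors (doubly stochastic mixing,
        # so the global average is preserved exactly at every round).
        new.append((params[i] + left + right) / 3.0)
    return new

params = [0.0, 4.0, 8.0, 12.0]      # initial disagreement across 4 workers
target = sum(params) / len(params)  # global average, preserved by mixing
for _ in range(50):
    params = ring_gossip_step(params)
print(max(abs(p - target) for p in params))  # consensus error shrinks toward 0
```

Because the mixing matrix is doubly stochastic, every round preserves the global average while contracting the disagreement between workers, which is the basic mechanism that lets sparse neighbor exchanges replace a global All-Reduce.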
To address the efficiency ambiguity, we propose a comprehensive runtime model for decentralized training algorithms. We derive an analytical bound that identifies the hardware–model regimes in which decentralized training outperforms the All-Reduce method by a significant margin, and we validate this model on GPU clusters. The analysis highlights the relevance of decentralized schemes as the "outer loop" synchronization mechanism in bandwidth-constrained environments.
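To illustrate the kind of runtime reasoning involved (the thesis's actual model and bound are not reproduced here), a toy cost model can contrast ring All-Reduce with neighbor-only exchange; every formula and constant below is an assumption chosen for illustration:

```python
# Toy per-round communication cost model. These formulas and numbers are
# illustrative assumptions, not the runtime model derived in the thesis.

def allreduce_time(model_bytes, n_workers, bandwidth, latency):
    """Ring All-Reduce: 2(n-1) sequential steps, each moving model_bytes/n per link."""
    steps = 2 * (n_workers - 1)
    return steps * (latency + (model_bytes / n_workers) / bandwidth)

def gossip_time(model_bytes, n_neighbors, bandwidth, latency):
    """Neighbor exchange: send the full model to each of d neighbors,
    at a cost that is independent of the total worker count n."""
    return n_neighbors * (latency + model_bytes / bandwidth)

# Assumed setting: 1 GB of parameters, 64 workers, 10 Gb/s links, 1 ms latency.
M, n, B, L = 1e9, 64, 1.25e9, 1e-3
print(f"ring all-reduce:     {allreduce_time(M, n, B, L):.2f} s")  # grows with n
print(f"one-neighbor gossip: {gossip_time(M, 1, B, L):.2f} s")     # independent of n
```

The qualitative takeaway matches the abstract: the All-Reduce term accumulates latency across O(n) sequential steps and is gated by the slowest link, whereas the gossip term depends only on the neighbor count, so the advantage of decentralization widens as clusters grow and links become the bottleneck.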
Second, we tackle the generalization uncertainty by analyzing the role of consensus error. We first propose AccumAdam, an engineering stabilization mechanism designed to mitigate the momentum drift caused by decentralization and stabilize convergence. We then adopt a novel perspective with DSGD-AC (Adaptive Consensus), demonstrating that consensus error, often viewed as harmful noise, can act as an implicit regularization mechanism related to curvature. We show that by controlling rather than eliminating this error, decentralized training can favor smooth minima and improve generalization compared to centralized baselines.
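The idea of controlling rather than eliminating consensus error can be sketched with a decentralized SGD step that exposes a consensus gain. The abstract does not give DSGD-AC's adaptive rule, so the fixed-gain update, ring topology, and toy quadratic objectives below are stand-in assumptions, not the thesis algorithm:

```python
# Illustrative decentralized SGD step with a consensus gain gamma in (0, 1].
# gamma = 1 mixes aggressively toward consensus; gamma < 1 deliberately retains
# some disagreement (consensus error) between workers.

def dsgd_step(xs, cs, lr, gamma):
    """One step on a ring; worker i minimizes (x - cs[i])**2 / 2 from state xs[i]."""
    n = len(xs)
    new = []
    for i in range(n):
        left, right = xs[(i - 1) % n], xs[(i + 1) % n]
        mix = gamma * ((left - xs[i]) + (right - xs[i])) / 3.0  # damped averaging
        grad = xs[i] - cs[i]                                    # local gradient
        new.append(xs[i] + mix - lr * grad)
    return new

def consensus_error(xs):
    """Maximum deviation of any worker from the network average."""
    mean = sum(xs) / len(xs)
    return max(abs(x - mean) for x in xs)

cs = [-1.0, 0.0, 1.0, 2.0]  # heterogeneous local objectives keep workers apart
for gamma in (1.0, 0.3):
    xs = [0.0] * 4
    for _ in range(200):
        xs = dsgd_step(xs, cs, lr=0.1, gamma=gamma)
    print(gamma, consensus_error(xs))  # smaller gamma -> larger residual error
```

Shrinking gamma enlarges the steady-state consensus error without destroying convergence, which is exactly the kind of tunable perturbation the abstract argues can serve as implicit, curvature-related regularization.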
Finally, to lower the implementation barrier, we present Decent-DP, a lightweight, modular software library that integrates seamlessly with existing PyTorch workflows. Decent-DP enables transparent experimentation with various topologies and the AWC communication-computation pattern. Collectively, this work bridges the gap between systems-level optimization and learning-theoretic robustness, establishing decentralized learning as a potential component for resilient distributed training systems.