Machine Learning with Decentralized Data and Differential Privacy
New Methods for Training, Inference and Sampling
Time: Wed 2025-06-11 10.00
Location: D3, Lindstedtsvägen 5, Stockholm
Video link: https://kth-se.zoom.us/j/69506042503
Language: English
Subject area: Computer Science
Doctoral student: Dominik Fay, Reglerteknik
Opponent: Senior researcher Aurélien Bellet, Inria, Montpellier, France
Supervisor: Professor Mikael Johansson, Reglerteknik; Professor Tobias J. Oechtering, Teknisk informationsvetenskap; Assistant Professor Jens Sjölund, Uppsala University
Abstract
Scale has been an essential driver of progress in recent machine learning research. Data sets and computing resources have grown rapidly, complemented by models and algorithms capable of leveraging these resources. In many important applications, however, two obstacles limit this approach. First, data is often locked in silos and cannot be shared; this is common in the medical domain, where patient data is controlled by different clinics. Second, machine learning models are prone to memorizing their training data. When dealing with sensitive data, it is therefore desirable to have formal privacy guarantees ensuring that no sensitive information can be reconstructed from the trained model.
The topic of this thesis is the design of machine learning algorithms that adhere to these two restrictions: they operate on decentralized data and satisfy formal privacy guarantees. We study two broad categories of machine learning algorithms for decentralized data: federated learning and ensembling of local models. Federated learning is a form of machine learning in which multiple clients collaborate during training via the coordination of a central server. In ensembling of local models, each client first trains a local model on its own data and then collaborates with other clients during inference. As a formal privacy guarantee, we consider differential privacy, which bounds the influence of any single data point on the output of an algorithm and is typically achieved by injecting calibrated artificial noise. Differential privacy is typically applied to federated learning by adding noise to the model updates sent to the server, and to ensembling of local models by adding noise to the predictions of the local models.
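As a point of reference, the standard (ε, δ)-formulation of differential privacy can be stated as follows, where two data sets D and D′ are adjacent if they differ in a single record and M denotes the randomized training or inference algorithm:

    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
    \quad \text{for all adjacent } D, D' \text{ and all measurable output sets } S.

Smaller ε and δ correspond to stronger privacy; the noise mentioned above is calibrated so that this inequality holds for the mechanism as a whole.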
Our research addresses the following core areas in the context of privacy-preserving machine learning with decentralized data. First, we examine the implications of data dimensionality on privacy for ensembling of medical image segmentation models. We extend the classification algorithm Private Aggregation of Teacher Ensembles (PATE) to high-dimensional labels and demonstrate that dimensionality reduction can improve the privacy-utility trade-off (a sketch of the underlying aggregation step is given below). Second, we consider the impact of hyperparameter selection on privacy. Here, we propose a novel adaptive technique for hyperparameter selection in differentially private gradient descent, as well as an adaptive technique for federated learning with non-smooth loss functions. Third, we investigate sampling-based solutions to scale differentially private machine learning to data sets with a large number of data points. We study the privacy-enhancing properties of importance sampling and find that it can outperform uniform sub-sampling not only in terms of sample efficiency but also in terms of privacy. Fourth, we study the problem of systematic label shift in ensembling of local models. We propose a novel method based on label clustering to enable flexible collaboration at inference time.
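To make the classical starting point concrete, the following minimal sketch shows the standard PATE aggregation step for a single classification query. The function name and parameters are illustrative rather than taken from the thesis, which extends this mechanism to high-dimensional labels such as segmentation masks:

    import numpy as np

    def noisy_aggregate(teacher_votes, num_classes, noise_scale, rng=None):
        # teacher_votes: per-teacher predicted class labels for one query.
        # Each teacher is assumed to be trained on a disjoint partition
        # of the private data.
        rng = np.random.default_rng() if rng is None else rng
        counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
        # Laplace noise on the vote histogram provides differential privacy;
        # a larger noise_scale gives stronger privacy at the cost of accuracy.
        counts += rng.laplace(loc=0.0, scale=noise_scale, size=num_classes)
        return int(np.argmax(counts))

    # Example: 50 teachers vote on a 10-class query.
    votes = np.random.default_rng(0).integers(0, 10, size=50)
    label = noisy_aggregate(votes, num_classes=10, noise_scale=2.0)

Because only the noisy argmax is released, the privacy cost of each query can be bounded. For a high-dimensional label such as a segmentation mask, repeating a vote of this kind per pixel becomes prohibitively noisy, which is what motivates the dimensionality reduction studied in the thesis.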
The techniques developed in this thesis improve the scalability and locality of machine learning while ensuring robust privacy protection. Together, they constitute progress toward the safe application of machine learning to large and diverse data sets in medical image analysis and similar domains.