
Machine Learning with Decentralized Data and Differential Privacy

New Methods for Training, Inference and Sampling

Time: Wed 2025-06-11 10.00

Location: D3, Lindstedtsvägen 5, Stockholm

Video link: https://kth-se.zoom.us/j/69506042503

Language: English

Subject area: Computer Science

Doctoral student: Dominik Fay, Automatic Control

Opponent: Senior researcher Aurélien Bellet, Inria, Montpellier, France

Supervisor: Professor Mikael Johansson, Automatic Control; Professor Tobias J. Oechtering, Information Science and Engineering; Assistant Professor Jens Sjölund, Uppsala University



Abstract

Scale has been an essential driver of progress in recent machine learning research. Data sets and computing resources have grown rapidly, complemented by models and algorithms capable of leveraging these resources. However, in many important applications, two obstacles limit such data collection. First, data is often locked in silos and cannot be shared. This is common in the medical domain, where patient data is controlled by different clinics. Second, machine learning models are prone to memorization. When dealing with sensitive data, it is therefore often desirable to have formal privacy guarantees ensuring that no sensitive information can be reconstructed from the trained model.

The topic of this thesis is the design of machine learning algorithms that adhere to these two restrictions: to operate on decentralized data and to satisfy formal privacy guarantees. We study two broad categories of machine learning algorithms for decentralized data: federated learning and ensembling of local models. Federated learning is a form of machine learning in which multiple clients collaborate during training via the coordination of a central server. In ensembling of local models, each client first trains a local model on its own data, and then collaborates with other clients during inference. As a formal privacy guarantee, we consider differential privacy, which is based on introducing artificial noise to ensure membership privacy. Differential privacy is typically applied to federated learning by adding noise to the model updates sent to the server, and to ensembling of local models by adding noise to the predictions of the local models.
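To make these noise-addition mechanisms concrete, the following minimal Python sketch shows the federated-learning variant: each client clips its model update to bound its influence and adds Gaussian noise before the server aggregates. The function and parameter names (privatize_update, clip_norm, noise_std) and their default values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=1.0, rng=None):
    """Clip a client's model update and add Gaussian noise before it is
    sent to the server. All names and default values are illustrative
    assumptions, not parameters from the thesis."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))  # bound each client's influence
    return clipped + rng.normal(0.0, noise_std * clip_norm, size=update.shape)

# Toy round: the server averages the privatized updates of five clients.
rng = np.random.default_rng(0)
client_updates = [rng.normal(size=10) for _ in range(5)]
aggregate = np.mean([privatize_update(u, rng=rng) for u in client_updates], axis=0)
```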

Our research addresses the following core areas in the context of privacy-preserving machine learning with decentralized data. First, we examine the implications of data dimensionality on privacy for ensembling of medical image segmentation models. We extend the classification algorithm Private Aggregation of Teacher Ensembles (PATE) to high-dimensional labels and demonstrate that dimensionality reduction can improve the privacy-utility trade-off. Second, we consider the impact of hyperparameter selection on privacy. Here, we propose a novel adaptive technique for hyperparameter selection in differentially private gradient descent, as well as an adaptive technique for federated learning with non-smooth loss functions. Third, we investigate sampling-based solutions to scale differentially private machine learning to datasets with a large number of data points. We study the privacy-enhancing properties of importance sampling and find that it can outperform uniform sub-sampling not only in terms of sample efficiency but also in terms of privacy. Fourth, we study the problem of systematic label shift in ensembling of local models, and propose a novel method based on label clustering to enable flexible collaboration at inference time.
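As an illustration of the PATE mechanism that the first contribution builds on, here is a minimal Python sketch of noisy vote aggregation over local "teacher" models. The function name and the Laplace noise with scale 1/epsilon follow the common textbook formulation of PATE and are assumptions rather than the thesis's exact method, which extends PATE to high-dimensional labels.

```python
import numpy as np

def noisy_vote(teacher_labels, num_classes, epsilon=1.0, rng=None):
    """PATE-style noisy aggregation: tally the local models' votes for one
    query point, perturb the counts with Laplace noise, and release only
    the noisy winner. Names and the scale 1/epsilon are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.bincount(teacher_labels, minlength=num_classes).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon, size=num_classes)  # privacy noise
    return int(np.argmax(counts))

# Toy example: five local models classify one sample; only the noisy
# majority label is revealed, never the individual predictions.
print(noisy_vote(np.array([2, 2, 1, 2, 0]), num_classes=3))
```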

The techniques developed in this thesis improve the scalability and locality of machine learning while ensuring robust privacy protection. This constitutes progress toward the safe application of machine learning to large and diverse data sets in medical image analysis and similar domains.

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-363514