An Information-Theoretic Approach to Bandits and Reinforcement Learning
Time: Mon 2025-12-15 13.15
Location: F3, Flodis, Lindstedtvägen 26
Video link: https://kth-se.zoom.us/j/61173029059
Language: English
Doctoral student: Amaury Gouverneur, Teknisk informationsvetenskap
Opponent: Research Assistant Professor Gergely Neu, Universitat Pompeu Fabra, Barcelona, Spain
Supervisor: Professor Mikael Skoglund, Teknisk informationsvetenskap; Professor Tobias J. Oechtering, Teknisk informationsvetenskap
Abstract
In this thesis, we develop and extend information-theoretic methods for analyzing reinforcement learning, focusing in particular on bandit problems. Our aim is to understand how learning algorithms manage the fundamental trade-off between exploration and exploitation, and how information-theoretic quantities such as entropy and mutual information can be used to characterize and bound their performance. Building on the framework of Russo and Van Roy, which relates the Bayesian regret of an algorithm to its learning efficiency, as measured by the information ratio, we derive new results that broaden the scope of this approach to more complex and structured learning settings.
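As a point of reference, in one common formulation (the notation here is illustrative, not necessarily that of the thesis) the information ratio at round t compares the squared expected per-round regret to the information gained about the optimal action, and a uniform bound on it translates directly into a regret bound:

\[
\Gamma_t \;=\; \frac{\bigl(\mathbb{E}_t\bigl[R(A^*) - R(A_t)\bigr]\bigr)^2}{I_t\bigl(A^*;\,(A_t, Y_t)\bigr)},
\qquad
\Gamma_t \le \overline{\Gamma} \ \text{for all } t
\;\;\Longrightarrow\;\;
\mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\overline{\Gamma}\, H(A^*)\, T},
\]

where \(H(A^*)\) denotes the entropy of the optimal action and \(I_t\) the mutual information conditioned on the history up to round t.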
The first part of the thesis revisits the analysis of linear bandits, a model in which rewards depend linearly on the selected action via an unknown parameter. While existing information-theoretic analyses provide general regret bounds, they all assume finite action sets. We close this gap by introducing refined analyses and showing that near-optimal regret bounds can be obtained even for infinite or continuous action spaces.
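Concretely, in a standard formulation of the linear bandit (again illustrative notation), the reward is linear in an unknown parameter up to zero-mean noise,

\[
R_t \;=\; \langle \theta, A_t \rangle + \varepsilon_t, \qquad \theta \in \mathbb{R}^d \ \text{unknown}, \quad \mathbb{E}[\varepsilon_t \mid A_t] = 0,
\]

and the near-optimal regret rate in this setting is of order \(d\sqrt{T}\) up to logarithmic factors, independent of the cardinality of the action set.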
The second part of the thesis extends the information-theoretic analysis framework to more complex bandit models. In the contextual bandit setting, we build on the lifted information ratio and extend its application to more general reward classes. Additionally, we provide alternative proofs based on classical information-theoretic tools. This new derivation simplifies the analysis and enables sharper guarantees for structured problems. We then study the Thompson Sampling algorithm for logistic bandits, a widely used model for binary rewards. We derive the first regret bounds that avoid exponential dependence on the logistic slope parameter and are independent of the number of actions, resolving a long-standing open question in the field.
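For the logistic bandit, a standard formulation (illustrative notation; the precise setup is given in the thesis) models binary rewards through the sigmoid link, and Thompson Sampling acts greedily with respect to a posterior sample:

\[
\mathbb{P}\bigl(R_t = 1 \mid A_t = a, \theta\bigr) \;=\; \sigma\bigl(\langle \theta, a\rangle\bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\]
\[
\theta_t \sim p(\theta \mid \mathcal{H}_{t-1}), \qquad A_t \in \arg\max_{a}\; \sigma\bigl(\langle \theta_t, a\rangle\bigr).
\]

Classical analyses of this model scale with a slope-dependent constant, commonly written \(\kappa = \sup_{a,\theta} 1/\dot{\sigma}(\langle \theta, a\rangle)\), which can grow exponentially with the norm of the parameter; the bounds in this part avoid that exponential dependence.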
The third part of the thesis takes a step beyond bandit models and investigates how information-theoretic principles can be extended to more general reinforcement learning problems. We study the theoretical performance limits and learning guarantees of model-based reinforcement learning, establishing a framework for analyzing its Bayesian regret. In parallel, we derive PAC-Bayesian bounds for offline decision-making, including improved guarantees for offline bandits that are parameter-free and achieve near-optimal rates.
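For context, a classical PAC-Bayesian bound (a McAllester-style bound for losses in \([0,1]\)) is representative of this family of guarantees; the specific parameter-free, near-optimal offline-bandit bounds of the thesis are not reproduced here. With probability at least \(1-\delta\) over an i.i.d. sample of size \(n\), simultaneously for all posteriors \(\rho\) and a fixed data-independent prior \(\pi\),

\[
\mathbb{E}_{h \sim \rho}\bigl[L(h)\bigr] \;\le\; \mathbb{E}_{h \sim \rho}\bigl[\widehat{L}_n(h)\bigr] + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
\]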
Together, the contributions of this thesis provide a coherent extension of information-theoretic analysis to a wide range of learning settings: from linear to nonlinear, from discrete to continuous, and from bandits to reinforcement learning.