# Reinforcement Learning and Optimal Adaptive Control for Structured Dynamical Systems

**Tid: **
Må 2023-10-23 kl 14.00

**Plats: **
Kollegiesalen, Brinellvägen 8, Stockholm

**Språk: **
Engelska

**Ämnesområde: **
Elektro- och systemteknik

**Respondent: **
Damianos Tranos
, Reglerteknik, Statistical Learning & Control Group

**Opponent: **
Associate Professor Pontus Giselsson, Lund University

**Handledare: **
Professor Alexandre Proutiere, Reglerteknik

QC 20231003

## Abstract

In this thesis, we study the related problems of reinforcement learning and optimal adaptive control, specialized to specific classes of stochastic and structured dynamical systems. By stochastic, we mean systems that are unknown to the decision maker and evolve according to some probabilistic law. By structured, we mean that they are restricted in some known way, e.g., they belong to a specific model class or must obey a set of known constraints. The objective in both problems is the design of an optimal algorithm, i.e., one that maximizes a certain performance metric. Because of the stochasticity, the algorithm faces an exploration-exploitation dilemma, where it must balance collecting information from the system and leveraging existing information to choose the best action or input. This trade-off is best captured by the notion of regret, defined as the difference between the performance of the algorithm and an oracle which has full knowledge of the system. In the first part of the thesis, we investigate systems that can be modeled as Markov Decision Processes (MDPs) and derive general asymptotic and problem-specific regret lower bounds for ergodic and deterministic MDPs. We make these bounds explicit for MDPs that: i) are ergodic and unstructured, ii) have Lipschitz transitions and rewards, and iii) are deterministic and satisfy a decoupling property. Furthermore, we propose Directed Exploration Leaning (DEL), an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure. In the second part, we study systems with uncertain linear dynamics and which are subject to additive disturbances as well as state and input constraints. We develop Self-Tuning Tube-based Model Predictive Control (STTMPC), an adaptive and robust model predictive control algorithm which leverages the least-squares estimator as well as polytopic tubes to guarantee robust constraint satisfaction, along with recursive feasibility, and input-to-state stability. The algorithm also ensures the persistence of excitation without compromising the system's asymptotic performance and with no increase in computational complexity. We also provide guarantees on the expected regret of STT-MPC, in the form of an upper bound whose rate explicitly depends on the chosen rate of excitation. The performance of the algorithm is also demonstrated via a numerical example.