Gain Estimation using Multi-armed bandit policies

Examiner: Cristian Rojas

Time: Tue 2021-04-27 15.00 - 16.00

Location: Zoom https://kth-se.zoom.us/j/63527165624

Respondent: Chia-Hsuan Chou, Division of Decision and Control Systems

Opponent: Tianze Li

Supervisor: Matias Müller

This thesis investigates a new method to estimate the system norm using reinforcement
learning. Given an unknown system, we aim to estimate its H∞-norm with a model-free
approach, which involves solving a sequential input design problem. This problem is modeled
as a multi-armed bandit, which provides a way to study optimal decision making under
uncertainty.
In the multi-armed bandit framework, there are two different types of policies: index and
Bayesian policies. The main goal of this thesis is to compare the performance of these two
classes of policies. We take Thompson Sampling as the representative of Bayesian policies and
five different UCB-type algorithms from the class of index policies. We compare these
algorithms in two different setups, depending on the class of input signals allowed to be
applied to the system, denoted as single-frequency and power-spreading strategies. The input
design method provides an asymptotically optimal way to collect input-output measurements of
the system without having to rely on a model, providing asymptotically the best possible
information to estimate the H∞-norm of the system.
Simulation results show that Bayesian policies are able to estimate the H∞-norm accurately
with both single-frequency and power-spreading strategies, while index policies compare very
well in the power-spreading case but are outperformed for the other class of input signals
(single-frequency).
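
To illustrate the single-frequency setup described above, the following is a minimal sketch and not the thesis implementation. It assumes a hypothetical first-order system G(z) = 1 / (1 - 0.8 z^-1), a grid of eight frequencies acting as bandit arms, rewards equal to the squared gain plus Gaussian noise, and a generic UCB-style index with a hand-picked exploration constant `c`. All names and parameter values are illustrative assumptions, and the index is not claimed to match any of the five UCB-type algorithms compared in the thesis.

```python
import cmath
import math
import random


def gain(omega, a=0.8):
    """Gain of the assumed system G(z) = 1 / (1 - a z^-1) at frequency omega."""
    return abs(1.0 / (1.0 - a * cmath.exp(-1j * omega)))


def ucb_gain_estimate(freqs, rounds=2000, noise_std=0.5, c=2.0, seed=0):
    """Estimate the peak gain over `freqs` with a generic UCB-style index policy.

    Each arm is one frequency; the reward is a noisy measurement of the
    squared gain at that frequency (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    counts = [0] * len(freqs)   # number of times each frequency was applied
    means = [0.0] * len(freqs)  # running mean of the squared-gain measurements

    for t in range(1, rounds + 1):
        if t <= len(freqs):
            k = t - 1  # initialisation: apply every frequency once
        else:
            # pick the arm with the largest optimistic index
            k = max(
                range(len(freqs)),
                key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]),
            )
        reward = gain(freqs[k]) ** 2 + rng.gauss(0.0, noise_std)
        counts[k] += 1
        means[k] += (reward - means[k]) / counts[k]  # incremental mean update

    best = max(range(len(freqs)), key=lambda i: means[i])
    # the H-infinity norm estimate is the square root of the best mean reward
    return math.sqrt(max(means[best], 0.0)), counts


# eight equally spaced frequencies on [0, pi]
freqs = [math.pi * k / 7 for k in range(8)]
estimate, counts = ucb_gain_estimate(freqs)
print(f"estimated peak gain: {estimate:.3f}")
print(f"pulls per frequency: {counts}")
```

For this assumed system the gain peaks at frequency 0 with value 1 / (1 - 0.8) = 5, so the estimate can be compared against that known value. A Thompson Sampling variant would replace the index computation with a draw from a posterior over each arm's mean reward.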
