EL2805 Reinforcement Learning 7.5 credits

Förstärkande inlärning

Reinforcement Learning (RL) addresses the problem of controlling a dynamical system so as to maximize a notion of reward accumulated over time. At each time step (or round), the agent selects an action and, as a result, the system state evolves. The agent observes the new state and collects a reward associated with the state transition before deciding on the next action. Unlike classical control tasks, where the system dynamics are typically completely predictable, RL is concerned with systems whose dynamics have to be learnt, or with systems interacting with an uncertain environment. As time evolves, the agent gathers more data and may improve its knowledge of the system dynamics to make better-informed decisions. RL has found numerous applications, ranging from robotics and control to online services and game playing, and has received increasing attention. Very recently, RL has solved problems in situations approaching real-world complexity, e.g., learning human-level control for playing video and board games. These situations are, however, rather specific, and we are still far from systems able to learn in a wide variety of scenarios the way humans do.
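
The interaction described above can be summarised as a simple loop. Below is a minimal sketch, assuming a hypothetical environment object env with Gym-style reset/step methods and an actions(state) helper; all names and the random policy are purely illustrative.

    # Minimal sketch of the agent-environment loop: observe state, act,
    # observe next state and reward, repeat. The `env` interface
    # (reset/step/actions) is an illustrative assumption.
    import random

    def run_episode(env, horizon=100):
        state = env.reset()                              # observe the initial state
        total_reward = 0.0
        for t in range(horizon):
            action = random.choice(env.actions(state))   # agent selects an action
            next_state, reward, done = env.step(action)  # system evolves, reward collected
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward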

The course provides an in-depth treatment of the modern theoretical tools used to devise and analyse RL algorithms. It includes an introduction to RL and to its classical algorithms, such as Q-learning and SARSA, and further presents the rationale behind the design of more recent algorithms, such as those striking an optimal trade-off between exploration and exploitation. The course also covers algorithms used in recent RL success stories, namely deep RL algorithms.
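
As an illustration of the classical algorithms mentioned above, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The env interface (reset, and step returning a (next_state, reward, done) triple) and all hyper-parameter values are illustrative assumptions, not prescribed by the course.

    # Sketch of tabular Q-learning with epsilon-greedy exploration.
    # The environment interface and hyper-parameters are illustrative.
    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)                    # Q[(state, action)], zero-initialised
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # epsilon-greedy: explore with probability eps, otherwise exploit
                if random.random() < eps:
                    action = random.randrange(n_actions)
                else:
                    action = max(range(n_actions), key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # off-policy temporal-difference update towards the greedy target
                best_next = max(Q[(next_state, a)] for a in range(n_actions))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q

Replacing the greedy target max_a Q(s', a) with the Q-value of the action actually taken in the next state yields SARSA, the on-policy counterpart of this off-policy update.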

  • Education cycle

    Second cycle
  • Main field of study

    Electrical Engineering
  • Grading scale

    A, B, C, D, E, FX, F

Course offerings

Autumn 19 for programme students

Autumn 18 (doctoral students) for single course students

  • Periods

    Autumn 18 P2 (7.5 credits)

  • Application code

    10165

  • Start date

    27/08/2018

  • End date

    14/01/2019

  • Language of instruction

    English

  • Campus

    KTH Campus

  • Tutoring time

    Daytime

  • Form of study

    Normal

  • Number of places *

    Max. 1

*) If there are more applicants than the number of places, a selection will be made.

  • Course responsible

    Alexandre Proutiere <alepro@kth.se>

  • Teacher

    Alexandre Proutiere <alepro@kth.se>

  • Target group

    For doctoral students at KTH

Intended learning outcomes

After the course, you should be able to:

  • Model the problem of controlling discrete-time stochastic systems with known dynamics, and formulate this problem as a Markov Decision Process (MDP)
  • Solve MDPs using the Bellman optimality principle and Dynamic Programming
  • Derive solutions of MDPs using the value and policy iteration methods (a value-iteration sketch follows this list)
  • Control stochastic systems with unknown dynamics using Q-learning or SARSA algorithms
  • Distinguish between on-policy and off-policy RL problems
  • Develop and implement RL algorithms with function approximation (e.g. deep RL algorithms – in which the Q function is approximated by the output of a neural network)
  • Understand methods for solving bandit optimization problems
  • Design RL algorithms striking a better exploration-exploitation trade-off than Q-learning based algorithms
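
As a concrete illustration of the value-iteration outcome above, here is a minimal sketch of value iteration on a finite MDP. The data layout (P[s][a] as a list of (probability, next_state, reward) triples) and the parameter values are illustrative assumptions.

    # Sketch of value iteration for a finite MDP. P[s][a] is assumed to be
    # a list of (probability, next_state, reward) triples; gamma < 1.
    def value_iteration(P, n_states, n_actions, gamma=0.95, tol=1e-6):
        V = [0.0] * n_states
        while True:
            delta = 0.0
            for s in range(n_states):
                # Bellman optimality update: maximise the expected one-step return
                best = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(n_actions)
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # greedy policy with respect to the (near-)optimal value function
        policy = [
            max(range(n_actions),
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            for s in range(n_states)
        ]
        return V, policy

Policy iteration alternates a full policy-evaluation step with the same greedy-improvement step used at the end of this sketch.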

Course main content

Markov chains, Markov Decision Processes (MDPs), Dynamic Programming and value/policy iteration methods, Multi-Armed Bandit problems, RL algorithms (Q-learning, Q-learning with function approximation, UCRL).
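
To give a flavour of the multi-armed bandit material listed above, the following is a minimal sketch of the UCB1 index policy; the same optimism-in-the-face-of-uncertainty principle underlies UCRL. The pull(a) reward oracle (returning rewards in [0, 1]) and the horizon are illustrative assumptions.

    # Sketch of the UCB1 algorithm for a stochastic multi-armed bandit.
    # `pull(a)` is assumed to return a reward in [0, 1] for arm a.
    import math

    def ucb1(pull, n_arms, horizon=1000):
        counts = [0] * n_arms        # number of times each arm has been played
        means = [0.0] * n_arms       # empirical mean reward of each arm
        for t in range(1, horizon + 1):
            if t <= n_arms:
                a = t - 1            # play every arm once to initialise
            else:
                # optimism in the face of uncertainty: empirical mean + confidence bonus
                a = max(range(n_arms),
                        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        return means, counts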

Disposition

Lectures, exercises, computer laboratory, homework

Eligibility

For single course students: 120 credits and documented proficiency in English B or equivalent.

Literature

Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley.  

Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Athena Scientific.

Bubeck and Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends in Machine Learning, Now Publishers, 2012

Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1st edition, 1998

Szepesvari, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2010

Examination

  • HEM1 - Homework 1, 1.0, grading scale: P, F
  • HEM2 - Homework 2, 1.0, grading scale: P, F
  • LAB1 - Lab 1, 1.0, grading scale: P, F
  • LAB2 - Lab 2, 1.0, grading scale: P, F
  • TEN1 - Exam, 3.5, grading scale: P, F

HEM1 - Homework 1, 1.0, grade scale: P, F

LAB1 - Lab 1, 1.5, grade scale: P, F

LAB2 - Lab 2, 1.5, grade scale: P, F

TEN1 - Exam, 3.5, grade scale: A, B, C, D, E, FX, F

Requirements for final grade

HEM1: Homework, 1, grade scale: P/F

LAB1: Computer lab 1, 1, grade scale: P/F

LAB2: Computer lab 2, 1, grade scale: P/F

TEN1: Written exam, grade scale: A, B, C, D, E, FX, F

Offered by

EECS/Intelligent Systems

Contact

Alexandre Proutiere (alepro@kth.se)

Examiner

Alexandre Proutiere <alepro@kth.se>

Supplementary information

https://www.kth.se/student/kurser/kurs/EL2805?l=en

Version

Course syllabus valid from: Spring 2019.
Examination information valid from: Spring 2019.