EL2805 Reinforcement Learning 7.5 credits

Förstärkande inlärning

Reinforcement Learning (RL) addresses the problem of controlling a dynamical system so as to maximize a notion of reward accumulated over time. At each time step (or round), the agent selects an action, and as a result, the system state evolves. The agent observes the new state and collects a reward associated with the state transition before deciding on the next action. Unlike classical control tasks, where the system dynamics are typically completely predictable, RL is concerned with systems whose dynamics have to be learnt or with systems interacting with an uncertain environment. As time evolves, the agent gathers more data and may improve its knowledge about the system dynamics to make better-informed decisions. RL has found numerous applications, ranging from robotics and control to online services and game playing, and has received increasing attention. Very recently, RL has solved problems in situations approaching real-world complexity, e.g., in learning human-level control for playing video and board games. These situations are, however, rather specific, and we are still far from systems able to learn in a wide variety of scenarios as humans do.

The course provides an in-depth treatment of the modern theoretical tools used to devise and analyse RL algorithms. It includes an introduction to RL and to its classical algorithms, such as Q-learning and SARSA, and further presents the rationale behind the design of more recent algorithms, such as those striking an optimal trade-off between exploration and exploitation. The course also covers algorithms used in recent RL success stories, namely deep RL algorithms.

Course information

Content and learning outcomes

Course contents

Markov chains, Markov Decision Process (MDP), Dynamic Programming and value / policy iteration methods, Multi-Armed Bandit problems, RL algorithms (Q-learning, Q-learning with function approximation, UCRL).
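
To make the dynamic-programming part of the contents concrete, here is a minimal sketch of value iteration on a hypothetical two-state, two-action MDP (the transition probabilities, rewards, and discount factor below are illustrative only and are not taken from the course material):

    import numpy as np

    # Hypothetical 2-state, 2-action MDP (numbers chosen for illustration only).
    # P[a, s, s2] = probability of moving from state s to s2 under action a.
    P = np.array([[[0.9, 0.1],
                   [0.2, 0.8]],
                  [[0.5, 0.5],
                   [0.0, 1.0]]])
    # R[s, a] = expected immediate reward for taking action a in state s.
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95  # discount factor

    V = np.zeros(2)
    for _ in range(1000):
        # Bellman optimality update: Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the value function has converged
            V = V_new
            break
        V = V_new

    policy = Q.argmax(axis=1)  # greedy policy with respect to the (near-)optimal values
    print("V*:", V, "greedy policy:", policy)

Policy iteration, also listed above, instead alternates policy evaluation with greedy policy improvement rather than repeatedly applying the max-update; both methods are treated in the course.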

Intended learning outcomes

After the course, you should be able to:

  • Model the problem of controlling discrete-time stochastic systems with known dynamics, and formulate this problem as a Markov Decision Process (MDP)
  • Solve MDPs using the Bellman optimality principle and dynamic programming
  • Derive solutions of MDPs using the value and policy iteration methods
  • Control stochastic systems with unknown dynamics using the Q-learning or SARSA algorithms (a minimal sketch follows this list)
  • Distinguish between on-policy and off-policy RL problems
  • Develop and implement RL algorithms with function approximation (e.g., deep RL algorithms, in which the Q-function is approximated by the output of a neural network)
  • Understand algorithms for solving bandit optimization problems
  • Design RL algorithms that strike a better exploration-exploitation trade-off than Q-learning-based algorithms
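
As a rough illustration of the tabular Q-learning mentioned above, the sketch below assumes a Gymnasium-style discrete environment; the function name, interface details, and hyperparameters are illustrative and not part of the syllabus:

    import numpy as np

    def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        """Tabular Q-learning with an epsilon-greedy behaviour policy.

        `env` is assumed to expose discrete observation/action spaces and the
        Gymnasium-style reset()/step() interface.
        """
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        for _ in range(num_episodes):
            s, _ = env.reset()
            done = False
            while not done:
                # Epsilon-greedy exploration: random action with probability eps.
                if np.random.rand() < eps:
                    a = env.action_space.sample()
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                # Off-policy update towards the greedy target max_a' Q(s', a').
                target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q

On, for example, Gymnasium's FrozenLake-v1 this could be run as Q = q_learning(gym.make("FrozenLake-v1")), after which Q.argmax(axis=1) gives the learned greedy policy. SARSA would instead bootstrap from the action actually selected in the next state, which is the on-policy/off-policy distinction highlighted above.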

Course Disposition

Lectures, exercises, computer laboratory, homework

Literature and preparations

Specific prerequisites

For single course students: 120 credits and documented proficiency in English B or equivalent.

Literature

Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley.  

Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Athena Scientific.

Bubeck and Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends in Machine Learning, Now Publishers, 2012

Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1st edition, 1998

Szepesvari, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2010

Examination and completion

Grading scale

A, B, C, D, E, FX, F

Examination

  • HEM1 - Homework 1, 1.0 credits, Grading scale: P, F
  • HEM2 - Homework 2, 1.0 credits, Grading scale: P, F
  • LAB1 - Lab 1, 1.0 credits, Grading scale: P, F
  • LAB2 - Lab 2, 1.0 credits, Grading scale: P, F
  • TEN1 - Exam, 3.5 credits, Grading scale: P, F

Based on recommendation from KTH’s coordinator for disabilities, the examiner will decide how to adapt an examination for students with documented disability.

The examiner may apply another examination format when re-examining individual students.

Other requirements for final grade

  • H1: Homework, 1 credit, grading scale: P/F
  • LAB1: Computer lab 1, 1 credit, grading scale: P/F
  • LAB2: Computer lab 2, 1 credit, grading scale: P/F
  • TEN1: Written exam, grading scale: A, B, C, D, E, FX, F

Examiner

Alexandre Proutiere

Further information

Course web

Further information about the course can be found on the Course web at the link below. Information on the Course web will later be moved to this site.

Course web EL2805

Offered by

EECS/Intelligent Systems

Main field of study

Electrical Engineering

Education cycle

Second cycle

Contact

Alexandre Proutiere (alepro@kth.se)

Ethical approach

  • All members of a group are responsible for the group's work.
  • In any assessment, every student shall honestly disclose any help received and sources used.
  • In an oral assessment, every student shall be able to present and answer questions about the entire assignment and solution.

Supplementary information

https://www.kth.se/student/kurser/kurs/EL2805?l=en.

In this course, the EECS code of honor applies, see:
http://www.kth.se/en/eecs/utbildning/hederskodex.