Reinforcement Learning (RL) addresses the problem of controlling a dynamical system so as to maximize a notion of reward accumulated over time. At each time (or round), the agent selects an action, and as a result, the system state evolves. The agent observes the new state and collects a reward associated with the state transition, before deciding on the next action. Unlike classical control tasks, where the system dynamics are typically completely predictable, RL is concerned with systems whose dynamics have to be learnt, or with systems interacting with an uncertain environment. As time evolves, the agent gathers more data and may improve her knowledge about the system dynamics to make better informed decisions. RL has found numerous applications, ranging from robotics and control to online services and game playing, and has received increasing attention. Very recently, RL has solved problems in situations approaching real-world complexity, e.g., in learning human-level control for playing video and board games. These situations are however rather specific, and we are still far from systems able to learn in a wide variety of scenarios like humans do.
The course provides an in-depth treatment of the modern theoretical tools used to devise and analyse RL algorithms. It includes an introduction to RL and to its classical algorithms, such as Q-learning and SARSA, but further presents the rationale behind the design of more recent algorithms, such as those striking an optimal trade-off between exploration and exploitation. The course also covers algorithms used in recent RL success stories, e.g., deep RL algorithms.
Content and learning outcomes
Markov chains, Markov decision processes (MDPs), dynamic programming with value and policy iteration, design of approximate controllers for MDPs, stochastic linear quadratic control, the multi-armed bandit problem, RL algorithms (Q-learning, Q-learning with function approximation).
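The dynamic programming material listed above can be given a concrete flavour with a minimal value iteration loop on a toy two-state, two-action MDP; all numbers here (transition probabilities, rewards, discount factor) are illustrative assumptions, not course material.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions; all numbers are illustrative assumptions.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected immediate reward
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(1000):                     # apply the Bellman optimality operator
    Q = R + gamma * P @ V                 # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V_new = Q.max(axis=1)                 # greedy improvement over actions
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop near the fixed point
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)                 # greedy policy w.r.t. the converged values
```

Policy iteration differs only in that it alternates full policy evaluation with a greedy improvement step; for discounted MDPs both converge to the same optimal value function.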
Intended learning outcomes
After passing the course, the student should be able to
- carefully formulate stochastic control problems as Markov decision process (MDP) problems, classify equivalent problems, and evaluate their tractability
- state the principle of optimality for finite and infinite time horizon MDPs, and solve MDPs by means of dynamic programming
- derive solutions to MDPs using value and policy iteration
- solve control problems for systems whose dynamics must be learnt, using the Q-learning and SARSA algorithms
- explain the difference between on-policy and off-policy algorithms
- develop and implement RL algorithms with function approximation (for example, deep RL algorithms, where the Q-function is approximated by the output of a neural network)
- solve bandit optimisation problems.
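As a taste of the Q-learning outcome above, here is a minimal tabular sketch on a hypothetical chain environment; the environment, its reward, and all hyperparameters (learning rate, exploration rate, optimistic initialization) are illustrative choices, not the course's reference implementation.

```python
import numpy as np

# Tabular Q-learning on a hypothetical 5-state chain; reward 1 only for
# reaching the rightmost state. All details here are illustrative.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
gamma, alpha, eps = 0.95, 0.1, 0.1

def chain_step(s, a):
    """Deterministic chain dynamics with a single rewarded goal state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

Q = np.ones((n_states, n_actions))         # optimistic init encourages exploration
s = 0
for _ in range(20000):
    # epsilon-greedy behaviour policy: mostly exploit, sometimes explore
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = chain_step(s, a)
    # off-policy TD update toward the greedy one-step target (Q-learning)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next  # restart the episode at the goal

greedy_policy = Q.argmax(axis=1)           # learns to move right along the chain
```

Replacing the greedy target `Q[s_next].max()` with the value of the action actually taken in the next state yields SARSA, the on-policy counterpart mentioned in the outcomes above.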
Literature and preparations
For non-program students: 120 higher education credits and documented knowledge in English B or an equivalent discipline.
Examination and completion
If the course is discontinued, students may request to be examined during the following two academic years.
- HEM1 - Homework 1, 1.0 credits, grading scale: P, F
- HEM2 - Homework 2, 1.0 credits, grading scale: P, F
- LAB1 - Lab 1, 1.0 credits, grading scale: P, F
- LAB2 - Lab 2, 1.0 credits, grading scale: P, F
- TENA - Written exam, 3.5 credits, grading scale: A, B, C, D, E, FX, F
Based on recommendation from KTH’s coordinator for disabilities, the examiner will decide how to adapt an examination for students with documented disability.
The examiner may apply another examination format when re-examining individual students.
Opportunity to complete the requirements via supplementary examination
Opportunity to raise an approved grade via renewed examination
- All members of a group are responsible for the group's work.
- In any assessment, every student shall honestly disclose any help received and sources used.
- In an oral assessment, every student shall be able to present and answer questions about the entire assignment and solution.
Further information about the course can be found on the EL2805 Course web at the link below. Information on the Course web will later be moved to this site.
Main field of study
In this course, the EECS code of honor applies, see: