Student Seminar in Mathematics: Reinforcement Learning Autumn 2021

Martin Schweizer
Zhouyi Tan
Time and Location
Thu. 14-16, HG E 21
First meeting date


The aim of this seminar is to give an introduction to some of the mathematical ideas behind reinforcement learning. This includes stochastic optimisation and convergence analysis. The emphasis is on mathematical theory, not on developing and testing algorithms.


The underlying textbook mostly works with stochastic control problems for discrete-time Markov chains with a finite state space. For a proper understanding, however, students should be familiar with measure-theoretic probability theory and with stochastic processes in discrete time, in particular with the construction of Markov chains on the canonical path space via the Ionescu-Tulcea theorem.

Rules and conditions

To obtain the credit points, everyone must give a 90-minute talk and attend the seminar regularly. During the first meeting, the talks will be assigned and the list of participants will be finalised.

The coordinator will hold a weekly office hour to answer questions from the students.

Office hour

Tu., 11-12, HG G 47.1


We mainly discuss material from the book Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Second Edition, by Abhijit Gosavi. Unless noted otherwise, all section and page references below are to that book.

The last two talks will be based on a journal article.

List of Talks

| # | Title | Date | Speaker | Material |
|---|---|---|---|---|
| 1 | Dynamic programming for finite-state Markov chains | Oct. 7 | Viola Bosselmann | Sect. 11.2-11.3.1, pp. 351-367 |
| 2 | Discounted reward: setup, policy iteration and value iteration | Oct. 14 | Charles Käslin | Sect. 11.3.2-11.3.3 + 6.5.4-6.5.5, pp. 367-371 + 159-166 |
| 3 | Average reward: setup, policy iteration | Oct. 21 | Yan-Xing Lan | Sect. 11.4.1-11.4.2 + 6.4.1-6.4.2, pp. 371-380 + 150-153 |
| 4 | Average reward: value iteration | Oct. 28 | Kevin Zhang | Sect. 11.4.3 + 6.4.3, pp. 380-389 + 154-159 |
| 5 | Modified policy iteration, basics of semi-Markov setup | Nov. 4 | Songyan Hou | Sect. 6.8-6.10 + 6.7, pp. 184-192 + 169-184 |
| 6 | Basic ideas for reinforcement learning, Q-factors, etc. | Nov. 11 | Emmanuel Bauer | Sect. 6.3-, pp. 203-221 |
| 7 | Policy iteration for Q-factors, SARSA, CAP-I | Nov. 18 | Tim Gyger | Sect., pp. 221-234 |
| 8 | Asynchronous stochastic optimization, convergence results | Nov. 25 | Adrien Perroud | Sect. 11.6.1 + 9.11.2-9.11.3, pp. 390-396 + 310-318 |
| 9 | Convergence results for reinforcement learning | Dec. 2 | Siqi He | Sect. 11.8.1-, pp. 400-411 |
| 10 | More convergence results for reinforcement learning | Dec. 9 | Markus Krimmel | Sect., pp. 411-424 |
| 11 | In-depth results in Talk #8 | Dec. 16 | Jérémy Weymann | Borkar/Meyn 2000 |
| 12 | In-depth results in Talk #8 | Dec. 23 | Krunoslav Lehman Pavasovic | Borkar/Meyn 2000 |