The aim of this seminar is to give an introduction to some of the mathematical ideas behind reinforcement learning, including stochastic optimisation and convergence analysis. The emphasis is on mathematical theory, not on developing and testing algorithms.
The underlying textbook mostly works with stochastic control problems for discrete-time Markov chains on a finite state space. For a proper understanding, however, students should be familiar with measure-theoretic probability theory and with stochastic processes in discrete time, in particular with the construction of Markov chains on the canonical path space via the Ionescu-Tulcea theorem.
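For orientation, here is a minimal sketch of that construction in generic notation (the symbols below are chosen for illustration and are not taken from the book): for a finite state space, the Ionescu-Tulcea theorem yields a unique probability measure on the path space under which the coordinate process is a Markov chain with the prescribed initial distribution and transition matrix.

```latex
% Minimal sketch (generic notation, not the book's): finite state space $S$,
% initial distribution $\mu$, transition matrix $P$. The Ionescu-Tulcea theorem
% gives a unique probability measure $\mathbb{P}_\mu$ on the canonical path space
% $\Omega = S^{\mathbb{N}_0}$ under which the coordinate process
% $X_n(\omega) = \omega_n$ is a Markov chain with data $(\mu, P)$, i.e.
\[
  \mathbb{P}_\mu\bigl(X_0 = x_0,\, X_1 = x_1,\, \dots,\, X_n = x_n\bigr)
  \;=\; \mu(x_0)\, P(x_0, x_1)\cdots P(x_{n-1}, x_n),
  \qquad x_0, \dots, x_n \in S.
\]
```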
To obtain the credit points, everyone must give a 90-minute talk and attend the seminar regularly. During the first meeting, the talks will be assigned and the list of participants will be finalised.
The coordinator will hold a weekly office hour to answer questions from the students.
We mainly discuss material from the book *Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning* (Second Edition) by Abhijit Gosavi. All section and page references below are to that book.
The last two talks will be based on a journal article by Borkar and Meyn (2000).
| # | Title | Date | Speaker | Material |
|---|---|---|---|---|
| 1 | Dynamic programming for finite-state Markov chains | Oct. 7 | Viola Bosselmann | Sect. 11.2-11.3.1, pp. 351-367. |
| 2 | Discounted reward: setup, policy iteration and value iteration | Oct. 14 | Charles Käslin | Sect. 11.3.2-11.3.3 + 6.5.4-6.5.5, pp. 367-371 + 159-166. |
| 3 | Average reward: setup, policy iteration | Oct. 21 | Yan-Xing Lan | Sect. 11.4.1-11.4.2 + 6.4.1-6.4.2, pp. 371-380 + 150-153. |
| 4 | Average reward: value iteration | Oct. 28 | Kevin Zhang | Sect. 11.4.3 + 6.4.3, pp. 380-389 + 154-159. |
| 5 | Modified policy iteration, basics of semi-Markov setup | Nov. 4 | Songyan Hou | Sect. 6.8-6.10 + 6.7, pp. 184-192 + 169-184. |
| 6 | Basic ideas for reinforcement learning, Q-factors, etc. | Nov. 11 | Emmanuel Bauer | Sect. 6.3-6.4.1.1, pp. 203-221. |
| 7 | Policy iteration for Q-factors, SARSA, CAP-I | Nov. 18 | Tim Gyger | Sect. 6.4.1.2-6.4.3, pp. 221-234. |
| 8 | Asynchronous stochastic optimization, convergence results | Nov. 25 | Adrien Perroud | Sect. 11.6.1 + 9.11.2-9.11.3, pp. 390-396 + 310-318. |
| 9 | Convergence results for reinforcement learning | Dec. 2 | Siqi He | Sect. 11.8.1-11.8.1.1, pp. 400-411. |
| 10 | More convergence results for reinforcement learning | Dec. 9 | Markus Krimmel | Sect. 11.8.1.2-11.8.4, pp. 411-424. |
| 11 | In-depth treatment of the results from Talk #8 | Dec. 16 | Jérémy Weymann | Borkar/Meyn (2000) |
| 12 | In-depth treatment of the results from Talk #8 | Dec. 23 | Krunoslav Lehman Pavasovic | Borkar/Meyn (2000) |