The aim of this seminar is to give an introduction to some of the mathematical ideas behind reinforcement learning, including stochastic optimisation and convergence analysis. The emphasis is on mathematical theory, not on developing and testing algorithms.
The underlying textbook mostly works with stochastic control problems for discrete-time Markov chains on a finite state space. For a proper understanding, however, students should be familiar with measure-theoretic probability theory and with stochastic processes in discrete time, in particular the construction of Markov chains on the canonical path space via the Ionescu-Tulcea theorem.
To obtain the credit points, each participant must give a 90-minute talk and attend the seminar regularly. During the first meeting, the talks will be assigned and the list of participants will be finalised.
The coordinator will hold a weekly office hour to answer questions from the students.
We mainly discuss material from the book *Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning*, Second Edition, by Abhijit Gosavi. All references below are to that book.
The last two talks are based on a journal article by Borkar and Meyn (2000).
| # | Topic | Date | Speaker | Reference |
|---|-------|------|---------|-----------|
| 1 | Dynamic programming for finite-state Markov chains | Oct. 7 | Viola Bosselmann | Sect. 11.2-11.3.1, pp. 351-367. |
| 2 | Discounted reward: setup, policy iteration and value iteration | Oct. 14 | Charles Käslin | Sect. 11.3.2-11.3.3 + 6.5.4-6.5.5, pp. 367-371 + 159-166. |
| 3 | Average reward: setup, policy iteration | Oct. 21 | Yan-Xing Lan | Sect. 11.4.1-11.4.2 + 6.4.1-6.4.2, pp. 371-380 + 150-153. |
| 4 | Average reward: value iteration | Oct. 28 | Kevin Zhang | Sect. 11.4.3 + 6.4.3, pp. 380-389 + 154-159. |
| 5 | Modified policy iteration, basics of the semi-Markov setup | Nov. 4 | Songyan Hou | Sect. 6.8-6.10 + 6.7, pp. 184-192 + 169-184. |
| 6 | Basic ideas of reinforcement learning, Q-factors, etc. | Nov. 11 | Emmanuel Bauer | Sect. 6.3-126.96.36.199, pp. 203-221. |
| 7 | Policy iteration for Q-factors, SARSA, CAP-I | Nov. 18 | Tim Gyger | Sect. 188.8.131.52-6.4.3, pp. 221-234. |
| 8 | Asynchronous stochastic optimization, convergence results | Nov. 25 | Adrien Perroud | Sect. 11.6.1 + 9.11.2-9.11.3, pp. 390-396 + 310-318. |
| 9 | Convergence results for reinforcement learning | Dec. 2 | Siqi He | Sect. 11.8.1-11.8.2, pp. 400-411. |
| 10 | More convergence results for reinforcement learning | Dec. 9 | Markus Krimmel | Sect. 11.8.3-11.8.4, pp. 411-424. |
| 11 | In-depth results from Talk 8 | Dec. 16 | Jérémy Weymann | Borkar/Meyn (2000) |
| 12 | In-depth results from Talk 8 | Dec. 23 | Krunoslav Lehman Pavasovic | Borkar/Meyn (2000) |
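To give a flavour of the algorithms treated in the first talks, here is a minimal sketch of value iteration for a discounted-reward control problem on a finite state space. The two-state, two-action example and all names in it are hypothetical illustrations, not taken from the book.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-10):
    """Value iteration for a finite discounted-reward MDP.

    P[a] is the transition matrix under action a (P[a][s, s'] = prob. of
    s -> s'), r[a][s] is the expected one-step reward in state s under a.
    Returns an approximate optimal value function and a greedy policy.
    """
    n = P[0].shape[0]
    v = np.zeros(n)
    while True:
        # Bellman optimality operator: (Tv)(s) = max_a [ r_a(s) + gamma * (P_a v)(s) ]
        q = np.array([r[a] + gamma * P[a] @ v for a in range(len(P))])
        v_new = q.max(axis=0)
        # T is a gamma-contraction in the sup norm, so the iteration converges
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)
        v = v_new

# Hypothetical two-state, two-action example
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),
     np.array([[0.9, 0.1], [0.2, 0.8]])]
r = [np.array([6.0, -3.0]), np.array([10.0, -17.0])]
v, policy = value_iteration(P, r)
```

The returned `v` approximately satisfies the Bellman optimality equation, and `policy` is the corresponding greedy (stationary, deterministic) policy; policy iteration and the average-reward variants of the later talks modify this basic scheme.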