- Lecturer: Martin Schweizer
- Coordinator: Zhouyi Tan
- Time and location: Thu. 14-16, HG E 21
- First meeting: 23.09.2021

The aim of this seminar is to give an introduction to some of the mathematical ideas behind reinforcement learning. This includes stochastic optimisation and convergence analysis. The emphasis is on mathematical theory, not on developing and testing algorithms.

The underlying textbook mostly works with stochastic control problems for discrete-time Markov chains with a finite state space. But for a proper understanding, students should be familiar with measure-theoretic probability theory as well as stochastic processes in discrete time, and in particular with the construction of Markov chains on the canonical path space via the Ionescu-Tulcea theorem.
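To fix ideas, value iteration for a discounted finite-state Markov decision process (the subject of the first two talks) can be written in a few lines. This is only a toy sketch: the two-state chain, rewards and discount factor below are made up and do not come from the book.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-10):
    """Iterate the Bellman optimality operator V <- max_a (r_a + gamma * P_a V)."""
    V = np.zeros(P[0].shape[0])
    while True:
        # one Bellman update over all actions; Q[a, s] is the action value
        Q = np.array([r[a] + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)  # optimal values and a greedy policy
        V = V_new

# a made-up two-state, two-action chain (each row of P[a] sums to 1)
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),  # transitions under action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]  # transitions under action 1
r = [np.array([1.0, 0.0]),                # rewards under action 0
     np.array([0.0, 2.0])]                # rewards under action 1
V, policy = value_iteration(P, r)
```

Because the Bellman operator is a contraction with modulus `gamma`, the loop terminates and the returned `V` is (up to `tol`) the unique fixed point.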

To obtain the credit points, everyone must give a 90-minute talk and attend the seminar regularly. Talks will be assigned, and the list of participants finalised, during the first meeting.

The coordinator will hold a weekly office hour to answer questions from the students.

We mainly discuss material from the book *Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Second Edition* by Abhijit Gosavi. All references below are to that book.

The last two talks will be based on a journal article.
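That article, the Borkar-Meyn paper on the ODE method, analyses stochastic approximation schemes. The classical Robbins-Monro recursion underlying them can be sketched as follows; the target function and the noise are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stochastic approximation for the root of h(x) = 2 - x, observed with noise.
x = 0.0
for n in range(1, 100_001):
    noisy_h = (2.0 - x) + rng.normal()  # h(x) plus zero-mean noise
    # Robbins-Monro stepsizes a_n = 1/n: sum a_n = inf, sum a_n^2 < inf
    x += noisy_h / n
```

The iterates track the ODE dx/dt = h(x), whose stable equilibrium x = 2 is the root being sought; this ODE viewpoint is exactly what the article makes rigorous.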

| # | Title | Date | Speaker | Material |
|---|---|---|---|---|
| 1 | Dynamic programming for finite-state Markov chains | Oct. 7 | Viola Bosselmann | Sect. 11.2-11.3.1, pp. 351-367 |
| 2 | Discounted reward: setup, policy iteration and value iteration | Oct. 14 | Charles Käslin | Sect. 11.3.2-11.3.3 + 6.5.4-6.5.5, pp. 367-371 + 159-166 |
| 3 | Average reward: setup, policy iteration | Oct. 21 | Yan-Xing Lan | Sect. 11.4.1-11.4.2 + 6.4.1-6.4.2, pp. 371-380 + 150-153 |
| 4 | Average reward: value iteration | Oct. 28 | Kevin Zhang | Sect. 11.4.3 + 6.4.3, pp. 380-389 + 154-159 |
| 5 | Modified policy iteration, basics of semi-Markov setup | Nov. 4 | Songyan Hou | Sect. 6.8-6.10 + 6.7, pp. 184-192 + 169-184 |
| 6 | Basic ideas for reinforcement learning, Q-factors, etc. | Nov. 11 | Emmanuel Bauer | Sect. 6.3-6.4.1.1, pp. 203-221 |
| 7 | Policy iteration for Q-factors, SARSA, CAP-I | Nov. 18 | Tim Gyger | Sect. 6.4.1.2-6.4.3, pp. 221-234 |
| 8 | Asynchronous stochastic optimization, convergence results | Nov. 25 | Adrien Perroud | Sect. 11.6.1 + 9.11.2-9.11.3, pp. 390-396 + 310-318 |
| 9 | Convergence results for reinforcement learning | Dec. 2 | Siqi He | Sect. 11.8.1-11.8.1.1, pp. 400-411 |
| 10 | More convergence results for reinforcement learning | Dec. 9 | Markus Krimmel | Sect. 11.8.1.2-11.8.4, pp. 411-424 |
| 11 | In-depth results for Talk #8 | Dec. 16 | Jérémy Weymann | Borkar/Meyn 2000 |
| 12 | In-depth results for Talk #8 | Dec. 23 | Krunoslav Lehman Pavasovic | Borkar/Meyn 2000 |
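Talks 6 and 7 centre on Q-factors, i.e. action values learned from simulated transitions. A minimal tabular Q-learning sketch conveys the idea; the chain, rewards, discount factor and learning parameters below are made up and not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up two-state, two-action chain (none of these numbers are from the book)
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),  # transitions under action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]  # transitions under action 1
r = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
gamma, alpha, eps = 0.5, 0.05, 0.2

Q = np.zeros((2, 2))  # Q[s, a], the Q-factors
s = 0
for _ in range(100_000):
    # epsilon-greedy action choice along a single simulated trajectory
    a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[a][s])
    # Q-learning update towards the sampled Bellman target
    Q[s, a] += alpha * (r[a][s] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

Note that only sampled transitions are used, never the matrices `P` themselves; that is the "simulation-based" aspect, and the asynchronous, noisy nature of the update is what the convergence results in Talks 8-12 address.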

- A. Gosavi, *Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning*, Second Edition, Springer, 2015.
- V.S. Borkar, S.P. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim., 2000.
- M.F. Dixon, I. Halperin, P. Bilokon, *Machine Learning in Finance*, Springer, 2020. (This book is listed only for extra background information and will not be used in the seminar.)