RL agents essentially do backward induction
This post assumes the reader has some familiarity with Markov decision processes and game theory.
Policy iteration and value iteration are two of the most basic algorithms in tabular reinforcement learning. Modern game-playing AI agents like AlphaStar and AlphaGo are built on approximations to these algorithms. The purpose of this post is to give an intuition for what these algorithms are doing. I will restrict attention to deterministic MDPs for simplicity.
In this post, I'll explain how we can think of value iteration as doing a simpler, more familiar thing, namely backward induction. In a follow-up post I'll do the same for policy iteration.
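To fix ideas, here is a minimal sketch of tabular value iteration on a deterministic MDP. The lookup tables `T` (transitions), `R` (rewards) and `actions` are hypothetical placeholders for this sketch, not notation defined in the post:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration for a deterministic MDP.

    T[(s, a)] is the unique next state, R[(s, a)] the immediate reward,
    actions[s] the set of actions available in s (empty for terminal states).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:                  # terminal state: value stays 0
                continue
            # Deterministic Bellman backup: no expectation over next states needed.
            best = max(R[(s, a)] + gamma * V[T[(s, a)]] for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```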
Background on backward induction: single-player extensive form games as MDPs
Backward induction is the basic solution algorithm for extensive form games. If you want to find a strategy (also called a policy) in an arbitrary extensive form game that is optimal in every subgame, assuming that your opponent also plays such a strategy, then backward induction is the algorithm to run.
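As a concrete single-player illustration of what backward induction computes, here is a minimal sketch. The tree encoding (nested dicts of actions, numbers at the leaves) is my own invention for this example, not notation from the post:

```python
def solve(node):
    """Backward induction on a single-player game tree.

    A node is either a terminal utility (a number) or a dict mapping actions
    to child subgames.  Returns (optimal value, best action at this node).
    """
    if not isinstance(node, dict):      # leaf: the utility is simply read off
        return node, None
    # Solve every child subgame first, then pick the action leading to the best one.
    values = {a: solve(child)[0] for a, child in node.items()}
    best_action = max(values, key=values.get)
    return values[best_action], best_action

# Tiny invented example: two decisions on the left branch, one terminal payoff on the right.
game = {"L": {"L": 3.0, "R": 1.0}, "R": 2.0}
print(solve(game))   # (3.0, 'L')
```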
In this post, however, I'll restrict to single-player games, so we can forget about the opponent. Once we do this, an extensive form game becomes a special case of an MDP. Here is an example:
Each node is a decision to be made, either L or R, and at the final node a utility is obtained. We can also view such an extensive form game (EF game) as an MDP (S, A, T, R):