MDP Planning ALgorithms
We would like to solve the MDP Solving problem efficiently. Before stating the algorithm, we shall first go over a bunch of theoretical results.
Banach Space
A complete, normed vector space is called as a Banach Space. That is, both the vector space and norm are well defined. Additionally, every Cauchy sequence must have a limit in that vector space.
A Cauchy sequence is a convergent sequence. Example, rationals are not Banach because an irrational number can be written as a sequence of rational numbers ($\sqrt{2}$)
Contraction Mapping
A mapping $T:X\to X$ is called a contraction mapping with contraction factor $L$ if $\forall u,v\in X$,
\[\vert\vert Tv-Tu\vert\vert \leq L\vert\vert v-u\vert\vert\]$x^*$ is a fixed point of $T$ if $Tx^*=x^*$.
Banachs Fixed-Point Theorem
This theorem states that for a given contraction mapping and a contraction factor $L\in[0,1)$:
- There exists a unique fixed point
- For $x\in X, m\geq 0; \vert\vert T^mx-x^*\vert\vert \leq L^m\vert\vert x-x^*\vert\vert$
Bellman Optimality Operator
$B^* : \mathcal{R}^n\to\mathcal{R}^n$ for an MDP $(S,A,T,R,\gamma)$ is given as follows;
That is, we have used notation to redefine the domain of value function to be $\mathcal{R}^n$, and $B^*$ maps value functions.
Theorem. $(\mathcal{R}^n, \vert\vert\cdot\vert\vert_{\infty})$ is a Banach space, and $B^*$ is a contraction mapping in this space with contraction factor $\gamma$.
Bellmans Optimality Equations
The fixed point can be obtained by solving the equation:
There are $n$ variables and $n$ unknowns, but the equations are non-linear in nature. These equations are called as the Bellman’s Optimality Equations for the given MDP. Algorithms for solving these equations will be discussed below.
We shall also prove later that $V^*$ is an optimal value function for a given MDP.
Value Iteration
This is a very simple algorithm. We apply the optimality operator iteratively until the difference is negligible. We know this works from the Banach’s Fixed point operator.
Also, notice that $V^*, Q^*, \pi^*$ have a cyclic relationship.
- $V^*\to Q^*$: definition of $Q^*$
- $Q^*\to \pi^*$: $\text{arg}\max_aQ^*(s,a)$
- $\pi^*\to V^*$: Solve bellman equations for $\pi^*$
Linear Programming
This is a technique where a Linear objective function of variables under given linear constraints is maximized. We can frame the Bellman’s optimality equations as a linear programming problem.
From the Bellman optimality equations, we can say the following. Notice that there are $nk$ linear constraints and $V^*$ staisfies all these constraints.
From this, we can say that $V \succeq V^*$, and can thus linearise the result to be $\sum_s V(s)\geq \sum_s V^*(s)$.
Therefore, the objective function to be maximized is given by
This LP problem has $n$ variables and $nk$ constraints. Do note that a dual is possible which solves for $\pi^*$ with $nk$ variables and $n$ constraints.