# MDP Planning ALgorithms

We would like to solve the MDP Solving problem efficiently. Before stating the algorithm, we shall first go over a bunch of theoretical results.

## Banach Space

A *complete, normed vector space* is called as a Banach Space. That is, both the vector space and norm are well defined. Additionally, every Cauchy sequence must have a limit in that vector space.

A *Cauchy sequence* is a convergent sequence. Example, rationals are not Banach because an irrational number can be written as a sequence of rational numbers ($\sqrt{2}$)

### Contraction Mapping

A mapping $T:X\to X$ is called a contraction mapping with *contraction factor* $L$ if $\forall u,v\in X$,

$x^*$ is a **fixed point** of $T$ if $Tx^*=x^*$.

### Banachs Fixed-Point Theorem

This theorem states that for a given contraction mapping and a contraction factor $L\in[0,1)$:

- There exists a
*unique*fixed point - For $x\in X, m\geq 0; \vert\vert T^mx-x^*\vert\vert \leq L^m\vert\vert x-x^*\vert\vert$

## Bellman Optimality Operator

$B^* : \mathcal{R}^n\to\mathcal{R}^n$ for an MDP $(S,A,T,R,\gamma)$ is given as follows;

That is, we have used notation to redefine the domain of value function to be $\mathcal{R}^n$, and $B^*$ maps value functions.

**Theorem.** $(\mathcal{R}^n, \vert\vert\cdot\vert\vert_{\infty})$ is a Banach space, and $B^*$ is a contraction mapping in this space with contraction factor $\gamma$.

## Bellmans Optimality Equations

The fixed point can be obtained by solving the equation:

There are $n$ variables and $n$ unknowns, but the equations are non-linear in nature. These equations are called as the Bellman’s Optimality Equations for the given MDP. Algorithms for solving these equations will be discussed below.

We shall also prove later that $V^*$ is an optimal value function for a given MDP.

# Value Iteration

This is a very simple algorithm. We apply the optimality operator iteratively until the difference is negligible. We know this works from the Banach’s Fixed point operator.

Also, notice that $V^*, Q^*, \pi^*$ have a cyclic relationship.

- $V^*\to Q^*$: definition of $Q^*$
- $Q^*\to \pi^*$: $\text{arg}\max_aQ^*(s,a)$
- $\pi^*\to V^*$: Solve bellman equations for $\pi^*$

# Linear Programming

This is a technique where a *Linear objective function* of variables under *given linear constraints* is maximized. We can frame the Bellman’s optimality equations as a linear programming problem.

From the Bellman optimality equations, we can say the following. Notice that there are $nk$ linear constraints and $V^*$ staisfies all these constraints.

From this, we can say that $V \succeq V^*$, and can thus linearise the result to be $\sum_s V(s)\geq \sum_s V^*(s)$.

Therefore, the *objective function* to be maximized is given by

This LP problem has $n$ variables and $nk$ constraints. Do note that a dual is possible which solves for $\pi^*$ with $nk$ variables and $n$ constraints.