MDP Planning ALgorithms

We would like to solve the MDP Solving problem efficiently. Before stating the algorithm, we shall first go over a bunch of theoretical results.

 

Banach SpacePermalink

A complete, normed vector space is called as a Banach Space. That is, both the vector space and norm are well defined. Additionally, every Cauchy sequence must have a limit in that vector space.

A Cauchy sequence is a convergent sequence. Example, rationals are not Banach because an irrational number can be written as a sequence of rational numbers (2)

Contraction MappingPermalink

A mapping T:XX is called a contraction mapping with contraction factor L if u,vX,

||TvTu||L||vu||

x is a fixed point of T if Tx=x.

Banachs Fixed-Point TheoremPermalink

This theorem states that for a given contraction mapping and a contraction factor L[0,1):

  1. There exists a unique fixed point
  2. For xX,m0;||Tmxx||Lm||xx||

 

Bellman Optimality OperatorPermalink

B:RnRn for an MDP (S,A,T,R,γ) is given as follows;

(B(F))(s)=maxasT(s,a,s)[R(s,a,s)+γF(s)]

That is, we have used notation to redefine the domain of value function to be Rn, and B maps value functions.

Theorem. (Rn,||||) is a Banach space, and B is a contraction mapping in this space with contraction factor γ.

||F||=max(|f1|,|fn|)

Bellmans Optimality EquationsPermalink

The fixed point can be obtained by solving the equation:

V(s)=maxasT(s,a,s)[R(s,a,s)+γV(s)]

There are n variables and n unknowns, but the equations are non-linear in nature. These equations are called as the Bellman’s Optimality Equations for the given MDP. Algorithms for solving these equations will be discussed below.

We shall also prove later that V is an optimal value function for a given MDP.

 

Value IterationPermalink

This is a very simple algorithm. We apply the optimality operator iteratively until the difference is negligible. We know this works from the Banach’s Fixed point operator.

Also, notice that V,Q,π have a cyclic relationship.

  1. VQ: definition of Q
  2. Qπ: argmaxaQ(s,a)
  3. πV: Solve bellman equations for π

 

Linear ProgrammingPermalink

This is a technique where a Linear objective function of variables under given linear constraints is maximized. We can frame the Bellman’s optimality equations as a linear programming problem.

From the Bellman optimality equations, we can say the following. Notice that there are nk linear constraints and V staisfies all these constraints.

V(s)maxasT(s,a,s)[R(s,a,s)+γV(s)]
B preserves

From this, we can say that VV, and can thus linearise the result to be sV(s)sV(s).

Therefore, the objective function to be maximized is given by

(sV(s))

This LP problem has n variables and nk constraints. Do note that a dual is possible which solves for π with nk variables and n constraints.