Prediction Problem

This problem is usually the first step in solving the Control problem. Let the learning algorithm be $L$. At each step $t$, let $\hat{V}^t$ be the agent's estimate of the value function. Given a policy $\pi$ that the agent follows, with true value function $V^\pi$, we would like $L$ to ensure that:

\[\lim_{t\to\infty} \hat{V}^t = V^\pi\]

A few assumptions have to be made before we can begin to tackle this problem.

Irreducibility

Given an MDP $M$ and a policy $\pi$: if for every pair of states $s$ and $s'$ there is a directed path from $s$ to $s'$ with non-zero probability under $\pi$, then the MDP is said to be irreducible under that policy.

If $M$ is irreducible under every policy $\pi\in\Pi$, then $M$ itself is said to be irreducible.

Intuition: we do not want the agent, while following its policy, to get stuck in some subset of the states.
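
To make the definition concrete, here is a minimal sketch, assuming the transition kernel is stored as a NumPy array `T[s, a, s']` and the policy as `pi[s, a]` (both names are illustrative). It builds the chain induced by $\pi$ and checks that every state can reach every other state with non-zero probability:

```python
import numpy as np

def is_irreducible(T, pi):
    """Check irreducibility of the Markov chain induced by policy `pi`.

    T  : array of shape (n_states, n_actions, n_states), T[s, a, s'] = P(s' | s, a)
    pi : array of shape (n_states, n_actions),            pi[s, a]   = pi(a | s)
    """
    # Induced chain: P[s, s'] = sum_a pi(a | s) * T(s, a, s')
    P = np.einsum("sa,sat->st", pi, T)
    n = P.shape[0]
    for s in range(n):
        # Explore all states reachable from s along edges with non-zero probability
        seen = {s}
        frontier = [s]
        while frontier:
            u = frontier.pop()
            for v in np.nonzero(P[u] > 0)[0]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        if len(seen) != n:
            return False   # some state is unreachable from s
    return True
```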

Aperiodicity

Given an MDP $M$ and a policy $\pi$, the period of a state $s$ is the greatest common divisor of the lengths of all paths that start and end at $s$ and have non-zero probability under $\pi$. The MDP is said to be aperiodic under $\pi$ if every state has period 1.

Intuition: we do not want returns to a state to be possible only at multiples of some fixed interval, since then the state distribution keeps oscillating instead of settling down.
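
One standard way to compute the period of an irreducible chain uses BFS levels: the period equals the gcd of $\text{level}(u) + 1 - \text{level}(v)$ over all edges $u \to v$ with positive probability. A minimal sketch, assuming the policy-induced transition matrix `P[s, s']` is already available as a NumPy array (illustrative representation):

```python
from math import gcd
from collections import deque
import numpy as np

def period(P):
    """Period of an irreducible Markov chain with transition matrix P.

    The chain is aperiodic iff the returned value is 1.
    Assumes P is irreducible (every state reachable from state 0).
    """
    n = P.shape[0]
    level = [None] * n
    level[0] = 0
    queue = deque([0])
    while queue:                      # BFS from state 0
        u = queue.popleft()
        for v in np.nonzero(P[u] > 0)[0]:
            if level[v] is None:
                level[v] = level[u] + 1
                queue.append(v)
    # Period = gcd of level(u) + 1 - level(v) over all edges u -> v
    d = 0
    for u in range(n):
        for v in np.nonzero(P[u] > 0)[0]:
            d = gcd(d, level[u] + 1 - level[v])
    return d
```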

 

Ergodicity

An MDP which is both irreducible and aperiodic is said to be ergodic. An ergodic MDP induces a steady-state distribution $\mu^\pi: S\to(0,1)$ for every policy $\pi$.

That is, let $p(s,t)$ be the probability that the current state is $s$ after $t$ timesteps. In general, this value depends on the initial state. However, ergodicity guarantees that as $t\to\infty$, the distribution tends to $\mu^\pi$, which is independent of the initial state.
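
A minimal sketch of this fact, assuming the policy-induced transition matrix `P` of an ergodic MDP is available as a NumPy array (illustrative setup): repeatedly applying `P` to any starting distribution converges to the same $\mu^\pi$.

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iters=100_000):
    """Steady-state distribution mu of an ergodic chain, satisfying mu = mu @ P.

    Power iteration: starting from any initial distribution, repeated
    application of P converges to the same mu -- this is exactly the
    independence from the initial state described above.
    """
    n = P.shape[0]
    mu = np.full(n, 1.0 / n)          # arbitrary initial distribution
    for _ in range(max_iters):
        nxt = mu @ P
        if np.max(np.abs(nxt - mu)) < tol:
            return nxt
        mu = nxt
    return mu
```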

 

Model-Based Approach

A model is the agent's estimate of the MDP. Let the estimated MDP be $(S, A, \tilde{T}, \tilde{R}, \gamma)$; we would like this estimate to converge to the true MDP.

As with bandits, we would like every state-action pair to be visited infinitely many times.

The algorithm we shall propose is the following:

  • Take uniform random actions until every state-action pair has been taken at least once; the model is then said to be valid
  • Once the model is valid, take a random action with probability $\epsilon$ and the action that is optimal under the estimated model with probability $1-\epsilon$
  • After every step, update $\tilde{T}$ and $\tilde{R}$ to their empirical means (a sketch in code follows this list)
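
Below is one possible rendering of these three steps in code. The class name, the tabular arrays, and the use of value iteration on the learned model to "calculate the optimal" action are all illustrative assumptions; this is a sketch, not a prescribed implementation.

```python
import numpy as np

class ModelBasedAgent:
    """Keeps empirical estimates of T and R; acts uniformly at random until
    every (s, a) pair has been tried at least once (the model is "valid"),
    then acts epsilon-greedily w.r.t. the learned model's optimal values."""

    def __init__(self, n_states, n_actions, gamma=0.9, epsilon=0.1):
        self.n, self.k = n_states, n_actions
        self.gamma, self.epsilon = gamma, epsilon
        self.counts = np.zeros((n_states, n_actions, n_states))  # N(s, a, s')
        self.r_sum = np.zeros((n_states, n_actions))              # sum of observed rewards
        self.Q = np.zeros((n_states, n_actions))

    def valid(self):
        # Model is valid once every state-action pair has been taken at least once
        return np.all(self.counts.sum(axis=2) > 0)

    def act(self, s, rng=np.random):
        if not self.valid() or rng.random() < self.epsilon:
            return rng.randint(self.k)           # uniform random action
        return int(np.argmax(self.Q[s]))         # greedy w.r.t. the learned model

    def update(self, s, a, r, s_next, n_vi_iters=100):
        # Empirical means for T~ and R~ (recomputed from scratch; kept simple, not efficient)
        self.counts[s, a, s_next] += 1
        self.r_sum[s, a] += r
        n_sa = self.counts.sum(axis=2)
        visited = n_sa > 0
        # Unvisited (s, a) pairs get a uniform placeholder transition and zero reward
        T = np.where(visited[:, :, None],
                     self.counts / np.maximum(n_sa, 1)[:, :, None],
                     1.0 / self.n)
        R = np.where(visited, self.r_sum / np.maximum(n_sa, 1), 0.0)
        # Value iteration on the learned model to get the model-optimal Q
        for _ in range(n_vi_iters):
            V = self.Q.max(axis=1)
            self.Q = R + self.gamma * (T @ V)
```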

We shall state without proof that, for convergence to optimal behaviour, this algorithm only requires irreducibility.

An algorithm is said to be “model-based” if it uses $\Theta(n^2k)$ memory, where $n = |S|$ and $k = |A|$. Similarly, an algorithm is said to be “model-free” if it uses only $\Theta(nk)$ memory.

Monte-Carlo Methods

Given an episodic task, we observe the agent taking uniform random actions for $e$ episodes. This method estimates the value of each state from the recorded history of each episode.

We define the following:

  • $G(s,i,j)$ : the discounted return obtained from the $j$’th visit to $s$ in episode $i$
  • $1(s,i,j)$ : 1 if $s$ occurs at least $j$ times in episode $i$, and 0 otherwise

First Visit Monte-Carlo

\[\hat{V}(s) = \frac{\sum_i G(s,i,1)}{\sum_i 1(s,i,1)}\]

Every Visit Monte-Carlo

\[\hat{V}(s) = \frac{\sum_i\sum_j G(s,i,j)}{\sum_i\sum_j 1(s,i,j)}\]

Second Visit Monte-Carlo

\[\hat{V}(s) = \frac{\sum_i G(s,i,2)}{\sum_i 1(s,i,2)}\]

Last Visit Monte-Carlo

\[\hat{V}(s) = \frac{\sum_i G(s,i,\text{times}(s,i))}{\sum_i 1(s,i,\text{times}(s,i))}\]

where $\text{times}(s,i)$ denotes the total number of visits to $s$ in episode $i$.

 

For the first three estimators above, $\hat{V}(s)$ converges to the true value $V^\pi(s)$, but the last-visit estimator does not!
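
For completeness, here is a minimal sketch of the first-visit and every-visit estimators, assuming each episode is logged as a list of `(state, reward)` pairs, where the reward at index $t$ is the one received on the step taken from the state at index $t$ (an illustrative convention):

```python
from collections import defaultdict

def mc_estimates(episodes, gamma=0.9):
    """First-visit and every-visit Monte-Carlo estimates of V(s).

    `episodes` is a list of trajectories, each a list of (state, reward) pairs.
    """
    g_first = defaultdict(float); n_first = defaultdict(int)
    g_every = defaultdict(float); n_every = defaultdict(int)

    for episode in episodes:
        # returns[t] = discounted return from time t to the end of the episode
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G

        seen = set()
        for t, (s, _) in enumerate(episode):
            g_every[s] += returns[t]; n_every[s] += 1    # every visit of s counts
            if s not in seen:                            # only the first visit counts
                seen.add(s)
                g_first[s] += returns[t]; n_first[s] += 1

    v_first = {s: g_first[s] / n_first[s] for s in n_first}
    v_every = {s: g_every[s] / n_every[s] for s in n_every}
    return v_first, v_every
```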