Reinforcement Learning

From the agent's point of view, the MDP is not fully known: the agent is not guaranteed to know the transition probabilities or the rewards associated with its actions.

That is, in the planning problem the entire MDP $(S, A, T, R, \gamma)$ is provided to the agent. In the learning setting, the agent knows only $(S, A, \gamma)$, and sometimes $R$; it must infer $T$ from experience.
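To make the distinction concrete, here is a minimal Python sketch of a toy MDP. The state space, action space, and the dictionary encodings of $T$ and $R$ are hypothetical choices made purely for illustration; the point is that a planner may read `T` and `R` directly, while a learner is restricted to sampling `env_step`.

```python
import random

# Hypothetical toy MDP, used only for illustration.
S = [0, 1]     # states
A = [0, 1]     # actions
gamma = 0.9

# Known to a *planner*, hidden from a *learner*:
T = {(s, a): {(s + a) % 2: 1.0} for s in S for a in A}        # transition probs
R = {(s, a): float(s == 1 and a == 1) for s in S for a in A}  # rewards

def env_step(s, a):
    """The learner's only interface: sample (s', r) without seeing T or R."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]
```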

Let a $t$-length history be defined as follows:

\[h^t = (s^0, a^0, r^0, s^1, a^1, r^1, \ldots, s^t)\]
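Continuing the toy environment sketched above, a $t$-length history can be collected by rolling out any behaviour policy. The `rollout` helper below is a hypothetical name introduced here, not part of any standard API.

```python
def rollout(policy, s0, t):
    """Collect h^t = (s^0, a^0, r^0, s^1, a^1, r^1, ..., s^t)."""
    h, s = [], s0
    for _ in range(t):
        a = policy(s)
        s_next, r = env_step(s, a)   # env_step from the sketch above
        h.extend([s, a, r])
        s = s_next
    h.append(s)                      # the history ends at state s^t
    return tuple(h)

# Example: a 3-step history under a uniformly random policy.
h3 = rollout(lambda s: random.choice(A), s0=0, t=3)
```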

Learning Algorithm

A Learning Algorithm $L$ is a mapping from the set of all histories to the set of all probability distributions over actions. We would like to construct $L$ such that the long-run fraction of time steps at which the sampled action is optimal converges to 1:

\[\lim_{T\to\infty}\frac{1}{T} \sum_{t=0}^{T-1} \mathcal{P}\left[ a^t \sim L(h^t) \text{ is optimal for } s^t \right] = 1\]

The above problem is also known as the Control Problem.
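As a sanity check on this objective, the time average inside the limit can be estimated by Monte-Carlo simulation. The sketch below continues the toy example; `uniform_L`, `fraction_optimal`, and `optimal_action` are hypothetical names, and the optimal action must be supplied externally (e.g. from a planning solution), so this illustrates the quantity being maximised rather than a learning algorithm that achieves it.

```python
def uniform_L(h):
    """A trivial learning algorithm: ignore the history, act uniformly."""
    return [1.0 / len(A)] * len(A)

def fraction_optimal(L, optimal_action, horizon=10_000):
    """Monte-Carlo estimate of the time-averaged probability that the
    action sampled from L(h^t) is optimal for s^t."""
    s, h, hits = 0, [], 0
    for _ in range(horizon):
        dist = L(tuple(h))
        a = random.choices(A, weights=dist)[0]
        hits += int(a == optimal_action(s))
        s_next, r = env_step(s, a)
        h.extend([s, a, r])
        s = s_next
    return hits / horizon

# In the deterministic toy MDP above, a = 1 happens to be optimal in every
# state: it is the only way to reach and exploit the rewarding pair (1, 1).
print(fraction_optimal(uniform_L, optimal_action=lambda s: 1))
```

A uniform policy picks the optimal action about half the time here, so the printed estimate is near 0.5; a good learning algorithm would drive this quantity towards 1 as the horizon grows.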