Markov Decision Problem
A Markov Decision problem $M$ is given by the tuple $(S, A, T, R, \gamma)$. Each of these elements are defined below.
 $S$ is the set of states. $\vert S\vert = n$

$A$ is the set of actions. $\vert A\vert = k$

$T$ is the Transition function. $T(s, a, s’)$ gives the probability of reaching $s’$ from $s$ by taking the action $a$. Note that $T(s,a,\cdot)$ is a probability distribution over $S$.

$R$ is the Reward function. $R(s,a,s’)$ gives the reward for reaching $s’$ from $s$ via action $a$. We assume $R(s,a,s’)\in [ R_{max}, R_{max} ]$.
 $\gamma$ is the Discount factor.
Let $s^t$ denote the state at the $t^{th}$ timestep. Similarly, $a^t$ is the action taken at this timestep and $r^t$ is the reward for this action.
Trajectory is the history of the agent.
A policy $\pi$ is a function which the agent follows to give an action at any state. For simplicity, let the agent look only at the current state for its policy. That is, $\pi: S\to A$. This policy is:
 Markovian  Looks only at the current state and not at history
 Deterministic  A single action is given, not probability distribution
 Stationary  The output doesn’t change with time
$\Pi$ is the set of all policies. In this case, $\vert \Pi\vert = k^n$. We would like a policy which maximizes the overall reward obtained.
Value of a State
The value of a state under a policy $\pi$, $V^{\pi}(s)$ tries to tell us how “good” the state is for accumulating a large reward.
Large $\gamma$ means a large “lookahead”. The agent looks further ahead into the future to decide upon the value of the state. Do note that $\gamma\in [0,1)$
MDP Planning Problem
It can be mathematically proven that every such MDP has an optimal Markovian, deterministic, stationary policy $\pi^*$ such that
That is, the value of every state is larger in the optimal policy. This policy is NOT unique! The MDP Planning Problem is centered around finding this optimal policy $\pi^*$.
Alternative Formulations
Reward Function
The reward function can be defined differently in literature. For example, it could be stochastic instead of deterministic. It might be just dependant on $(s,a)$ instead of $(s,a,s’)$.
Some authors minimize cost instead of maximizing reward. The core idealogy remains the same, however.
A multiarmed bandit is an MDP with a single state and each of the actions being associated with a uniform distribution as a reward!
Episodic Tasks
Our discussion used continuing tasks, where trajectories are infinitely long. Episodic tasks have a terminal/sink state from which outgoing transitions are absent.
There should be nonzero probability of reaching the sink state from every nonterminal state, meaning that trajectories would surely be finite.
Value function
The definition used by us is called as Infinite Discounted Reward. There are other choices as well:
 Total Reward  Can only be used on episodic tasks (Obv)
 Finite Horizon Reward  Set a horizon $T$ and sum up rewards till here. Optimal policies for this need not be stationary.
 Average Reward  Exactly what the name suggests
Policy Evaluation
The process of calculating the value of each state given the MDP is called as policy evaluation. Bellman’s Equations are used for this procedure.
These equations are used to find the values of every state for a given MDP. For every state $s\in S$;
There are $n$ such linear equations, and $n$ unknowns.
 unique solution guaranteed if $\gamma < 1$
 for episodic, if $\gamma=1$, solution guaranteed if value of terminal state fixed as $0$.
Action Value Functions
The naive method of finding $\pi^*$ for a given MDP would be to use Bellman’s equations for each policy and get the minima. This would take $\text{poly}(n,k)\cdot k^n$ time.
We define the Action Value function $Q^{\pi}(s,a)$ as the first step for obtaining a more efficient solution.
$Q^{\pi}(s,a)$ is the expected long term reward for starting at state $s$, taking action $a$ at $t=0$ and then following $\pi$ for $t>0$.
Its value is given by the equation:
We know that all optimal policies have same optimal value function, and by extension, all optimal policies will have the same action value function as well.