Multi-Armed Bandits
Explore-Exploit Dilemma
This dilemma can be stated as the question:
Do I use the experience gained so far to pick the best move, or do I keep exploring to obtain more data?
Note that the move the agent makes (based on limited data) is not guaranteed to be globally optimal. This dilemma arises, for example, in clinical trials, online advertising, and packet routing in communication networks.
Formal Definitions and Notations
Stochastic Multi-Armed Bandit
Has $n$ arms, each of which gives rewards drawn from a Bernoulli distribution. The $i^{th}$ arm has mean reward $p_i$, and the largest mean is denoted $p_M$.
It is called stochastic because each pull yields a random reward drawn from that arm's fixed distribution; the means $p_i$ themselves are not known to the algorithm.
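As a concrete illustration, here is a minimal Python sketch of such a bandit. The class name `BernoulliBandit` and its interface are choices made for this sketch, not something fixed by the definition above.

```python
import random

class BernoulliBandit:
    """A stochastic multi-armed bandit whose arms give Bernoulli rewards."""

    def __init__(self, means):
        self.means = list(means)      # p_1, ..., p_n (unknown to the algorithm)
        self.n = len(self.means)      # number of arms
        self.p_max = max(self.means)  # p_M, the largest mean reward

    def pull(self, arm):
        """Pull the given arm: reward 1 with probability p_arm, else 0."""
        return 1 if random.random() < self.means[arm] else 0
```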
Algorithm
An algorithm operates on the multi-armed bandit by deciding which arm to pull, and receives a reward for that action.
That is, at any time $t$, given the history $h^t$, it chooses an arm to sample, $a^t$, and obtains a reward $r^t$.
The maximum number of pulls allowed is called the total sampling budget, or horizon ($T$).
A deterministic algorithm maps the set of all histories to the set of all arms. Similarly, a randomized algorithm maps the set of all histories to the set of all probability distributions over the arms.
Since each element of a history is an (arm, reward) pair, with $n$ possible arms and $2$ possible Bernoulli rewards, there are $(2n)^T$ possible histories of length $T$; a deterministic algorithm acting on a stochastic multi-armed bandit with horizon $T$ must therefore specify an arm for each of these.
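The interaction between an algorithm and a bandit can then be written as a simple loop. This sketch reuses the `BernoulliBandit` class above; the function name `run` and the representation of the history as a list of (arm, reward) pairs are assumptions of the sketch.

```python
def run(algorithm, bandit, horizon):
    """Interact with the bandit for `horizon` (T) steps.

    `algorithm` is any callable mapping a history to an arm index, i.e. a
    deterministic algorithm; a randomized algorithm would instead sample the
    arm from the probability distribution it assigns to the history.
    """
    history = []                      # h^t: list of (arm, reward) pairs
    for _ in range(horizon):
        arm = algorithm(history)      # a^t, chosen from the history h^t
        reward = bandit.pull(arm)     # r^t
        history.append((arm, reward))
    return history
```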
Epsilon Greedy Algorithms
$\epsilon$ is a parameter $\in [0,1]$ that controls the amount of exploration. It is used in different ways, some of which are described below; a code sketch of the three variants follows the list.
- $\epsilon$G1
- For $t \leq \epsilon T$, sample an arm uniformly at random
- At $t = \lfloor \epsilon T \rfloor$, determine the arm with the highest empirical mean, and pull that arm for every remaining time step
- $\epsilon$G2
- For $t \leq \epsilon T$, sample an arm uniformly at random
- For $t > \epsilon T$, sample the arm with the highest empirical mean
- That is, the empirical means continue to be updated after the first $\epsilon T$ steps, so the exploited arm can change over time, whereas in $\epsilon$G1 the choice is frozen at $t = \lfloor \epsilon T \rfloor$
- $\epsilon$G3
- At every time step, with probability $\epsilon$ sample an arm uniformly at random, and with probability $1-\epsilon$ sample the arm with the highest empirical mean
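A minimal Python sketch of the three variants, written against the interaction loop above. The factory names `make_eg1`/`make_eg2`/`make_eg3`, the tie-breaking rule, and the treatment of arms that have never been pulled are choices of this sketch.

```python
import random

def empirical_means(history, n):
    """Empirical mean reward of each of the n arms, given the history so far
    (arms never pulled get a default mean of 0)."""
    totals, counts = [0.0] * n, [0] * n
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    return [totals[i] / counts[i] if counts[i] else 0.0 for i in range(n)]

def make_eg1(n, epsilon, horizon):
    """epsilon-G1: explore uniformly for the first epsilon*T pulls, then commit
    forever to the arm with the highest empirical mean at that point."""
    state = {"committed": None}
    def algorithm(history):
        if len(history) < epsilon * horizon:
            return random.randrange(n)
        if state["committed"] is None:                 # freeze the choice once
            means = empirical_means(history, n)
            state["committed"] = max(range(n), key=lambda i: means[i])
        return state["committed"]
    return algorithm

def make_eg2(n, epsilon, horizon):
    """epsilon-G2: same exploration phase, but the empirical means keep being
    updated afterwards, so the exploited arm can change over time."""
    def algorithm(history):
        if len(history) < epsilon * horizon:
            return random.randrange(n)
        means = empirical_means(history, n)
        return max(range(n), key=lambda i: means[i])
    return algorithm

def make_eg3(n, epsilon):
    """epsilon-G3: at every step, explore with probability epsilon, otherwise
    pull the arm with the highest empirical mean."""
    def algorithm(history):
        if random.random() < epsilon:
            return random.randrange(n)
        means = empirical_means(history, n)
        return max(range(n), key=lambda i: means[i])
    return algorithm
```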
Evaluating Algorithms
This is visualized by plotting the expected reward against the time step. To gauge the performance, three horizontal lines are plotted: $p_M$, $p_{min}$, and $p_{avg}$ (the largest, smallest, and average arm means). We expect the algorithm to start out near the average and approach the maximum as the number of time steps grows.
The expected cumulative regret of the algorithm is defined as

$$R_T = T \cdot p_M - \mathbb{E}\left[\sum_{t=1}^{T} r^t\right]$$

That is, the difference between the maximum possible expected cumulative reward and the expected cumulative reward the algorithm actually obtains. Ideally, we would like the value of $R_T/T$ to tend to $0$ as $T$ tends to $\infty$.
That is, we would like the algorithm’s regret to be sublinear.
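Reusing the pieces sketched above, the regret can be estimated empirically. Approximating the expectation by averaging the cumulative reward over independent runs (here 100 by default) is a choice of this sketch.

```python
def expected_cumulative_regret(bandit, make_algorithm, horizon, runs=100):
    """Monte-Carlo estimate of R_T = T * p_M - E[sum_t r^t]."""
    total_reward = 0.0
    for _ in range(runs):
        history = run(make_algorithm(), bandit, horizon)
        total_reward += sum(reward for _, reward in history)
    return horizon * bandit.p_max - total_reward / runs

# Example: epsilon-G3 with epsilon = 0.1 on a 3-armed Bernoulli bandit.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
T = 1000
R_T = expected_cumulative_regret(bandit, lambda: make_eg3(bandit.n, 0.1), T)
print(R_T, R_T / T)   # sublinear regret means R_T / T -> 0 as T grows
```

With a fixed $\epsilon$, all three variants spend a constant fraction of their pulls exploring, so their regret grows linearly in $T$ rather than sublinearly.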