Regret Optimization

Let’s analyze the performance of $\epsilon$G1 and $\epsilon$G2. The regret calculations are shown below:

$$\begin{eqnarray} R_T &=& Tp_M - \sum^{T-1}_{t=0}\mathbb{E}(r^t) \\ &=& Tp_M - \sum^{\epsilon T-1}_{t=0}\mathbb{E}(r^t) - \sum^{T-1}_{t=\epsilon T}\mathbb{E}(r^t) \\ &\geq& Tp_M - (\epsilon T)p_{avg} - T(1-\epsilon)p_M \\ &\geq& \epsilon T(p_M - p_{avg}) \\ &\in& \Omega (T) \\ \end{eqnarray}$$

That is, the regret grows linearly in $T$, even in the best-case scenario where the algorithm pulls the optimal arm on every exploitation step.
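
To make the linear growth concrete, here is a minimal simulation sketch (not from the notes), assuming hypothetical Bernoulli arm means: explore uniformly for $\epsilon T$ pulls, then commit to the empirically best arm. The measured regret tracks $\epsilon T(p_M - p_{avg})$.

```python
# Minimal sketch: explore-then-commit with a fixed eps*T exploration phase on
# hypothetical Bernoulli arms; regret grows roughly as eps*T*(p_M - p_avg).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.9, 0.6, 0.3])          # hypothetical true arm means (assumption)
p_M, p_avg = p.max(), p.mean()
eps = 0.1

def explore_then_commit_regret(T):
    n_explore = int(eps * T)
    pulls = rng.integers(0, len(p), size=n_explore)        # uniform exploration
    rewards = (rng.random(n_explore) < p[pulls]).astype(float)
    emp = np.array([rewards[pulls == a].mean() if (pulls == a).any() else 0.0
                    for a in range(len(p))])               # empirical means
    best = int(emp.argmax())
    # best case: no further mistakes, so exploitation is counted in expectation
    total = rewards.sum() + (T - n_explore) * p[best]
    return T * p_M - total

for T in (1_000, 10_000, 100_000):
    print(T, round(explore_then_commit_regret(T), 1),
          "vs eps*T*(p_M - p_avg) =", round(eps * T * (p_M - p_avg), 1))
```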

Now let’s do the same regret analysis of $\epsilon$G3.

$$\begin{eqnarray} R_T &=& Tp_M - \sum^{T-1}_{t=0}\mathbb{E}(r^t) \\ &\geq& Tp_M - \sum^{T-1}_{t=0}(\epsilon p_{avg} + (1-\epsilon)p_M)\\ &\geq& \epsilon T(p_M - p_{avg}) \\ &\in& \Omega(T) \\ \end{eqnarray}$$

That is, none of the three epsilon-greedy algorithms discussed earlier achieves sub-linear regret.

Achieving Sublinear Regret

There are two general conditions that an algorithm should satisfy to achieve sub-linear regret.

  1. Every arm in the multi-armed bandit must be pulled an infinite number of times as $T\rightarrow \infty$. (Infinite Exploration)

  2. Let $exploit(T)$ be the number of pulls that are exploitative in nature. Then, for sublinear regret we need the following:

$$\lim_{T\to\infty}\frac{\mathbb{E}(exploit(T))}{T} = 1$$

That is, asymptotically, nearly all of the pulls must be exploitative. (Greedy Limit)

Now, let $\bar{\mathcal{I}}$ be the set of all bandit instances whose reward means are strictly less than 1. Then:

An algorithm $L$ achieves sub-linear regret on all instances $I \in \bar{\mathcal{I}}$ iff it satisfies both of the above conditions.

These conditions are referred to as GLIE for short, which stands for “Greedy in the Limit with Infinite Exploration”.

Modifying Epsilon-Greedy Strategies

$\epsilon$G1/2 can be modified slightly to make them “GLIE-compliant”: instead of exploring for $\epsilon T$ pulls, we explore for $\sqrt{T}$ pulls (a sketch of this variant follows the list below).

  • C1 is satisfied, since each arm is pulled $\sqrt{T}/n$ times on average, which tends to infinity as $T\rightarrow\infty$
  • C2 is satisfied, since $exploit(T) = T-\sqrt{T}$, so $\mathbb{E}(exploit(T))/T \rightarrow 1$
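
Below is a minimal sketch of this GLIE-compliant variant, assuming Bernoulli arms with hypothetical means; the function name and round-robin exploration order are illustrative choices.

```python
# Modified eG1/2 sketch: explore uniformly (round-robin) for sqrt(T) pulls,
# then exploit the empirically best arm for the remaining T - sqrt(T) pulls.
import math
import random

def sqrt_explore_then_commit(T, p):
    n = len(p)
    n_explore = int(math.sqrt(T))
    counts, sums, total = [0] * n, [0.0] * n, 0.0
    for t in range(n_explore):                   # ~sqrt(T)/n pulls per arm (C1)
        a = t % n
        r = 1.0 if random.random() < p[a] else 0.0
        counts[a] += 1; sums[a] += r; total += r
    emp = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n)]
    best = max(range(n), key=lambda a: emp[a])
    for _ in range(T - n_explore):               # exploit(T) = T - sqrt(T) pulls (C2)
        total += 1.0 if random.random() < p[best] else 0.0
    return total
```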

Similarly, $\epsilon$G3 can be fixed by making epsilon a function of $t$, namely $\epsilon_t = 1/(t+1)$. The conditions can then be verified using the equation below.

$$\sum^{T-1}_{t=0}\frac{1}{n(t+1)} = \Theta\left(\frac{\log T}{n}\right)$$

The left-hand side is the expected number of exploratory pulls of any fixed arm up to time $T$: it diverges as $T\rightarrow\infty$ (satisfying C1), while the total number of exploratory pulls is only $\Theta(\log T)$, which is sublinear in $T$ (satisfying C2).
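
A corresponding sketch of $\epsilon$G3 with the decaying exploration rate, again assuming hypothetical Bernoulli arms:

```python
# eG3 sketch with eps_t = 1/(t+1): explore with a probability that decays over
# time, otherwise pull the arm with the highest empirical mean.
import random

def decaying_epsilon_greedy(T, p):
    n = len(p)
    counts, sums, total = [0] * n, [0.0] * n, 0.0
    for t in range(T):
        if random.random() < 1.0 / (t + 1):       # explore with probability 1/(t+1)
            a = random.randrange(n)
        else:                                      # exploit the empirical best
            emp = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n)]
            a = max(range(n), key=lambda i: emp[i])
        r = 1.0 if random.random() < p[a] else 0.0
        counts[a] += 1; sums[a] += r; total += r
    return total
```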

Lai and Robbins Lower Bound

This result establishes that any algorithm achieving sub-polynomial regret on every instance must incur regret that is at least logarithmic in $T$.

It has been stated more formally below; note the little-o notation.

If $L$ is an algorithm such that for every bandit instance $I\in\bar{\mathcal{I}}$ and for every $\alpha>0$, as $T\rightarrow\infty$,

$$R_T(L,I) = o(T^\alpha)$$

Then, for every bandit instance $I\in\bar{\mathcal{I}}$ as $T\rightarrow\infty$

$$ \frac{R_T(L,I)}{\ln T} \geq \sum_{a:p_a(I)\neq p_M(I)}\frac{p_M(I) - p_a(I)}{KL(p_a(I), p_M(I))} $$

where $KL(x,y) = x\ln\frac{x}{y}+(1-x)\ln\frac{1-x}{1-y}$ is the KL divergence between Bernoulli distributions with means $x$ and $y$.

(Notice that the right-hand side of the second equation is a constant for a given bandit instance.)
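
To make the constant concrete, here is a small sketch that evaluates the right-hand side for a hypothetical Bernoulli instance; the arm means are illustrative and assumed to lie strictly inside $(0,1)$.

```python
# Evaluate the Lai and Robbins constant: sum over suboptimal arms of
# (p_M - p_a) / KL(p_a, p_M), with KL the Bernoulli KL divergence defined above.
import math

def kl_bernoulli(x, y):
    # assumes 0 < x, y < 1, as in the instance class considered here
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lai_robbins_constant(p):
    p_M = max(p)
    return sum((p_M - pa) / kl_bernoulli(pa, p_M) for pa in p if pa != p_M)

# Any sub-polynomial algorithm incurs regret of at least roughly this constant
# times ln(T) on this instance, for large T.
print(lai_robbins_constant([0.9, 0.6, 0.3]))
```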

Sublinear Algorithms

UCB

At every time $t$ and arm $a$, define $\text{ucb}^t_a$ as follows:

$$\text{ucb}^t_a = \hat{p}^t_a + \sqrt{\frac{2\ln t}{u^t_a}}$$

where $\hat{p}^t_a$ is the empirical mean of arm $a$ and $u^t_a$ is the number of times that arm has been pulled. (Pull all the arms once before calculating.)

At each step, the algorithm pulls the arm with the highest UCB. This achieves a regret of $O(\log T)$, the optimal dependence on $T$.
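
Here is a minimal sketch of the resulting algorithm, assuming Bernoulli arms with hypothetical means:

```python
# UCB sketch: pull every arm once, then always pull the arm maximising
# p_hat + sqrt(2 ln t / u), where u is that arm's pull count.
import math
import random

def ucb(T, p):
    n = len(p)
    counts, sums = [0] * n, [0.0] * n
    for t in range(T):
        if t < n:                                 # initial round: pull each arm once
            a = t
        else:
            a = max(range(n), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if random.random() < p[a] else 0.0
        counts[a] += 1; sums[a] += r
    return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n)]
```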

KL-UCB

Although UCB is order-optimal, its leading constant does not match the Lai and Robbins lower bound. KL-UCB fixes this by changing the definition of the upper confidence bound slightly.

$$\text{kl-ucb}^t_a = \max\{ q\in[\hat{p}^t_a,1] : u^t_a \, KL(\hat{p}^t_a,q)\leq \ln t + c\ln(\ln t) \}$$
$$\text{where } c\geq 3$$

Notice that $KL(\hat{p}^t_a,q)$ increases monotonically in $q$ for $q\geq\hat{p}^t_a$, so the maximal $q$ can be found easily by binary search. This algorithm asymptotically matches the Lai and Robbins lower bound as well.
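
A sketch of the index computation by binary search, using the Bernoulli KL defined earlier; the numerical clipping and fixed iteration count are implementation assumptions, not part of the notes.

```python
# KL-UCB index sketch: largest q in [p_hat, 1] with u * KL(p_hat, q) <= ln t + c ln ln t.
import math

def kl_bernoulli(x, y):
    # clip to avoid log(0); an implementation convenience, not part of the definition
    x = min(max(x, 1e-12), 1 - 1e-12)
    y = min(max(y, 1e-12), 1 - 1e-12)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def kl_ucb_index(p_hat, u, t, c=3.0, iters=30):
    budget = math.log(t) + c * math.log(max(math.log(t), 1e-12))
    lo, hi = p_hat, 1.0
    for _ in range(iters):            # KL(p_hat, q) is increasing in q on [p_hat, 1]
        mid = (lo + hi) / 2.0
        if u * kl_bernoulli(p_hat, mid) <= budget:
            lo = mid                  # mid is feasible: search higher
        else:
            hi = mid                  # mid violates the bound: search lower
    return lo
```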

Thompson Sampling

This algorithm uses the Beta distribution, whose parameters are given below. Note that this distribution always takes values between 0 and 1.

$$Beta(\alpha, \beta) \rightarrow \mu = \frac{\alpha}{\alpha+\beta}, \quad \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

At time $t$, let arm $a$ have $s^t_a$ successes and $f^t_a$ failures. Then, $Beta(s^t_a+1, f^t_a+1)$ represents a belief about the true mean of that arm.

For every arm $a$, draw a sample $x^t_a \sim Beta(s^t_a+1, f^t_a+1)$, choose the arm with the maximal $x^t_a$, and update that arm's distribution after observing the reward.
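
A minimal sketch of this loop, assuming Bernoulli arms with hypothetical means:

```python
# Thompson sampling sketch: keep Beta(s+1, f+1) beliefs per arm, sample one
# value from each belief, pull the argmax, and update that arm's counts.
import random

def thompson_sampling(T, p):
    n = len(p)
    successes, failures = [0] * n, [0] * n
    total = 0
    for _ in range(T):
        samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n)]
        a = max(range(n), key=lambda i: samples[i])
        r = 1 if random.random() < p[a] else 0
        successes[a] += r
        failures[a] += 1 - r
        total += r
    return total
```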

This achieves optimal regret and is excellent in practice; it usually performs slightly better than KL-UCB as well.

All these algorithms are examples of the optimism in the face of uncertainty principle.