Here is the question from Ross’ book that I posted last week:

Question 1 We have two coins, a red one and a green one. When flipped, one lands heads with probability {P_1} and the other with probability {P_2}. Assume that {P_1>P_2}. We do not know which coin is the {P_1} coin. We initially attach probability {p} to the red coin being the {P_1} coin. We receive one dollar each time the tossed coin lands heads, and our objective is to maximize the total expected discounted return with discount factor {\beta}. Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is the {P_1} coin. Each period we choose a coin to toss, get a reward, and update our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.
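To make the state update concrete, here is a minimal Python sketch of the Bayes update (the function name and argument conventions are mine, chosen just for illustration):

```python
def update_belief(p, P1, P2, coin, heads):
    """Posterior probability that the red coin is the P1 coin.

    p     : current belief that the red coin is the P1 coin
    coin  : 'red' or 'green' -- the coin we tossed
    heads : True if the toss came up heads
    """
    if coin == 'red':
        # Likelihoods of the outcome under "red is P1" vs. "red is P2".
        like_p1 = P1 if heads else 1 - P1
        like_p2 = P2 if heads else 1 - P2
    else:
        # If we tossed the green coin: "red is P1" means the green coin is P2.
        like_p1 = P2 if heads else 1 - P2
        like_p2 = P1 if heads else 1 - P1
    return p * like_p1 / (p * like_p1 + (1 - p) * like_p2)
```

For instance, update_belief(0.5, 0.8, 0.3, 'red', True) returns about 0.73: a heads on the red coin makes it more likely that the red coin is the {P_1} coin (and hence less likely that the green coin is).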

In the classical bandit problem there are {n} arms and each arm {i} provides a reward from an unknown distribution {\theta_i\in\Delta([0,1])}. Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The {\theta_i} are randomized independently according to distributions {\mu_i\in \Delta(\Delta([0,1]))}, and our objective is to maximize the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number {G(\mu)} (the index) to every belief {\mu} about an arm. What is important is that the index of arm {i} depends only on {\mu_i} — our current belief about the distribution of that arm — not on our beliefs about the distributions of the other arms.
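For the curious, here is a rough numerical sketch of one standard way to compute the index, via the retirement-option characterization: the index of a belief equals {(1-\beta)M}, where {M} is the retirement reward at which we are indifferent between retiring and pulling the arm once more. The sketch (all names and default values are mine) specializes to the kind of arm that appears below, a coin that is either a {P_1} coin or a {P_2} coin:

```python
import numpy as np

def gittins_index(q, P1, P2, beta=0.9, grid_size=1001, sweeps=300):
    """Gittins index of a coin that is the P1 coin with probability q and the
    P2 coin otherwise (assumes 0 < P2 < P1 < 1), via the retirement-option
    formulation: the index equals (1 - beta) * M, where M is the retirement
    reward at which retiring and continuing are equally good."""
    grid = np.linspace(0.0, 1.0, grid_size)   # grid of beliefs that the coin is P1
    ph = grid * P1 + (1 - grid) * P2          # heads probability = expected reward
    post_h = grid * P1 / ph                   # posterior belief after heads
    post_t = grid * (1 - P1) / (1 - ph)       # posterior belief after tails

    def continuation_minus_retirement(M):
        # Value iteration for the problem "pull the arm or retire for M".
        V = np.full(grid_size, M)
        for _ in range(sweeps):
            cont = ph + beta * (ph * np.interp(post_h, grid, V)
                                + (1 - ph) * np.interp(post_t, grid, V))
            V = np.maximum(M, cont)
        # Value of pulling once more at the belief q, minus retiring now.
        ph_q = q * P1 + (1 - q) * P2
        cont_q = ph_q + beta * (ph_q * np.interp(q * P1 / ph_q, grid, V)
                                + (1 - ph_q) * np.interp(q * (1 - P1) / (1 - ph_q), grid, V))
        return cont_q - M

    lo, hi = P2 / (1 - beta), P1 / (1 - beta)   # the indifference point lies in between
    for _ in range(40):                          # bisection on the retirement reward M
        M = 0.5 * (lo + hi)
        if continuation_minus_retirement(M) > 0:
            lo = M
        else:
            hi = M
    return (1 - beta) * 0.5 * (lo + hi)
```

One can check numerically that the index computed this way is increasing in the belief {q} that the arm is the {P_1} coin, which is the monotonicity used below.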

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red-coin/green-coin problem: if we toss the red coin and get heads then the probability that the green coin is {P_1} decreases. Googling `multi-armed bandit’ with `dependent arms’, I found some papers which I haven’t looked at carefully, but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let {r=p/(p+\sqrt{p(1-p)})}, so that {r^2/(1-r)^2=p/(1-p)}. In the easy problem there are again two coins, but this time the red coin is {P_1} with probability {r} and {P_2} with probability {1-r} and, independently, the green coin is {P_1} with probability {1-r} and {P_2} with probability {r}. The easy problem is easy because it is a bandit problem. We have to keep track of beliefs {p_r} and {p_g} about the red coin and the green coin ({p_r} is the probability that the red coin is {P_1}), starting with {p_r=r} and {p_g=1-r}, and when we toss the red coin we update {p_r} but keep {p_g} fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is {P_1}, so the optimal strategy is to play red when {p_r\ge p_g} and green when {p_g\ge p_r}. In particular, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}.
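A quick numerical sanity check of the choice of {r} (the value of {p} here is arbitrary): conditionally on the two coins being different, the probability that the red coin is {P_1} comes out to exactly {p}.

```python
import math

def r_of_p(p):
    """The prior r of the easy problem: r / (1 - r) = sqrt(p / (1 - p))."""
    return p / (p + math.sqrt(p * (1 - p)))

p = 0.7
r = r_of_p(p)
# P(coins are different) = r^2 + (1-r)^2, and conditionally on that event
# P(red coin is the P1 coin) = r^2 / (r^2 + (1-r)^2), which should equal p.
assert abs(r**2 / (r**2 + (1 - r)**2) - p) < 1e-9
```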

Now here comes the trick. Consider a general strategy {g} that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by {V_d(g)} and {V_e(g)} the expected discounted rewards that {g} gives in the difficult and easy problems respectively. I claim that

\displaystyle V_e(g)=r(1-r)\cdot P_1/(1-\beta)+ r(1-r)\cdot P_2/(1-\beta) + \left(r^2+(1-r)^2\right) V_d(g).

Why is that? In the easy problem there is a probability {r(1-r)} that both coins are {P_1}. If this happens then every {g} gives expected payoff {P_1/(1-\beta)}. There is a probability {r(1-r)} that both coins are {P_2}. If this happens then every {g} gives expected payoff {P_2/(1-\beta)}. And there is a probability {r^2+(1-r)^2} that the coins are different, and, because of the choice of {r}, conditionally on this event the probability that the red coin is {P_1} is {p}. Therefore, conditionally on this event, {g} gives whatever {g} gives in the difficult problem.
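If you want to see the identity in action, here is a rough Monte Carlo sketch (the parameter values, names, and the particular strategy are all made up for illustration) that estimates {V_d(g)} and {V_e(g)} for an arbitrary strategy {g} and compares {V_e(g)} with the right-hand side:

```python
import random

def discounted_value(assign_coins, strategy, beta, horizon=150, n_sims=50_000, seed=0):
    """Monte Carlo estimate of the expected discounted reward of `strategy`
    when the heads probabilities of the red and green coins are drawn by
    `assign_coins`.  The infinite horizon is truncated at `horizon`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        probs = assign_coins(rng)          # {'red': heads prob., 'green': heads prob.}
        history = []                       # list of (coin tossed, came up heads)
        disc, payoff = 1.0, 0.0
        for _ in range(horizon):
            coin = strategy(history)
            heads = rng.random() < probs[coin]
            payoff += disc * heads
            disc *= beta
            history.append((coin, heads))
        total += payoff
    return total / n_sims

P1, P2, beta, p = 0.8, 0.3, 0.9, 0.7
r = p / (p + (p * (1 - p)) ** 0.5)

def assign_difficult(rng):
    # Exactly one coin is the P1 coin; the red one with probability p.
    return {'red': P1, 'green': P2} if rng.random() < p else {'red': P2, 'green': P1}

def assign_easy(rng):
    # The two coins are drawn independently.
    red = P1 if rng.random() < r else P2
    green = P1 if rng.random() < 1 - r else P2
    return {'red': red, 'green': green}

def some_strategy(history):
    # An arbitrary strategy: start with red, repeat a coin after heads, switch
    # after tails.  The identity should hold for *any* strategy.
    if not history:
        return 'red'
    coin, heads = history[-1]
    return coin if heads else ('green' if coin == 'red' else 'red')

V_d = discounted_value(assign_difficult, some_strategy, beta, seed=1)
V_e = discounted_value(assign_easy, some_strategy, beta, seed=2)
prediction = r * (1 - r) * (P1 + P2) / (1 - beta) + (r**2 + (1 - r)**2) * V_d
print(V_e, prediction)   # should agree up to Monte Carlo noise
```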

So, the payoff in the easy problem is a linear (increasing) function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every {p}, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}. Going back to the dynamic programming formulation, it follows from standard arguments that the optimal strategy is to do this at every period, i.e., always to toss the coin that is more likely to be the {P_1} coin given the current information.
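And here is a rough value-iteration sketch on a discretized belief grid (again, the parameter values are arbitrary) that checks this conclusion for the difficult problem: at every belief, tossing the coin that is more likely to be the {P_1} coin maximizes the one-step lookahead value.

```python
import numpy as np

P1, P2, beta = 0.8, 0.3, 0.9
grid = np.linspace(0.0, 1.0, 2001)     # grid of beliefs p = P(red coin is the P1 coin)

def q_values(V):
    """One-step lookahead values of tossing red / green at every grid belief."""
    # Toss red: heads probability and posterior beliefs.
    ph_r = grid * P1 + (1 - grid) * P2
    post_rh = grid * P1 / ph_r
    post_rt = grid * (1 - P1) / (1 - ph_r)
    q_red = ph_r + beta * (ph_r * np.interp(post_rh, grid, V)
                           + (1 - ph_r) * np.interp(post_rt, grid, V))
    # Toss green: if the red coin is the P1 coin, the green coin is the P2 coin.
    ph_g = grid * P2 + (1 - grid) * P1
    post_gh = grid * P2 / ph_g
    post_gt = grid * (1 - P2) / (1 - ph_g)
    q_green = ph_g + beta * (ph_g * np.interp(post_gh, grid, V)
                             + (1 - ph_g) * np.interp(post_gt, grid, V))
    return q_red, q_green

V = np.zeros_like(grid)
for _ in range(500):                   # value iteration on the belief state
    q_red, q_green = q_values(V)
    V = np.maximum(q_red, q_green)

q_red, q_green = q_values(V)           # (approximately) converged lookahead values
print(np.all(q_red[grid > 0.5] >= q_green[grid > 0.5] - 1e-6),
      np.all(q_green[grid < 0.5] >= q_red[grid < 0.5] - 1e-6))   # True True
```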

See why I said my solution is tricky and specific? It relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2 Let {0 \le P_1 < P_2 < \dots < P_n \le 1}. We are given {n} coins, one of each parameter, with all {n!} assignments of parameters to coins equally likely. Each period we have to toss a coin and we get payoff {1} for heads. What is the optimal strategy?