Here is the bonus question from the final exam in my dynamic optimization class last semester. It is based on Problem 8 of Chapter II in Ross's book *Introduction to Stochastic Dynamic Programming*, where it appears as "guess the optimal policy" without asking for a proof. The question seems very natural, but I couldn't find any information about it (nor, apparently, could the students). I have a solution, but it is tricky and too specific to this problem. I will describe my solution next week, but perhaps somebody can show me a better solution or a reference?

We have two coins, a red one and a green one. When flipped, one lands heads with probability *P1* and the other with probability *P2*. We do not know which coin is the *P1* coin. We initially attach probability *p* to the red coin being the *P1* coin. At each period we choose one coin to flip; we receive one dollar for each heads, and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.
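The belief *p* evolves by Bayes' rule after each flip. As an illustrative sketch (not part of the solution, and making no claim about the optimal policy), here is the posterior update in Python; the function name and signature are my own choices:

```python
def posterior_red_is_p1(p, p1, p2, flipped_red, heads):
    """Bayes update of the belief that the red coin is the P1 coin,
    after flipping one of the coins and observing the outcome.

    p           -- prior probability that red is the P1 coin
    p1, p2      -- the two possible heads probabilities
    flipped_red -- True if the red coin was flipped, False for green
    heads       -- True if the flip came up heads
    """
    # Heads probability of the flipped coin under each hypothesis:
    # "red is P1" vs. "red is P2" (equivalently, green is P1).
    if flipped_red:
        q_if_red_p1, q_if_red_p2 = p1, p2
    else:
        q_if_red_p1, q_if_red_p2 = p2, p1
    like_red_p1 = q_if_red_p1 if heads else 1 - q_if_red_p1
    like_red_p2 = q_if_red_p2 if heads else 1 - q_if_red_p2
    # Bayes' rule.
    return p * like_red_p1 / (p * like_red_p1 + (1 - p) * like_red_p2)
```

For example, with *P1* = 0.9, *P2* = 0.1 and a uniform prior *p* = 0.5, observing heads on the red coin moves the belief to 0.9, while observing heads on the green coin moves it to 0.1. Note that flipping either coin is informative about both, which is the source of the dependence between the "arms" debated in the comments below.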

## 6 comments

October 28, 2014 at 11:20 pm

Jyotirmoy Bhattacharya: Isn't this a multi-armed bandit problem to which we can apply Gittins's Index Theorem? [Gittins et al., *Multi-Armed Bandit Allocation Indices*, Wiley, 2011]

October 29, 2014 at 1:37 pm

Dan: This is not a multi-armed bandit problem, because the two arms share the same state.

October 30, 2014 at 1:01 am

Anonymous: I do not see it that way. With an arm for each coin, the state of the arm is the outcome of the next toss of that coin. If q_i is the probability of heads for coin i, then because of independence, from any starting state the next state is heads with probability q_i and tails with probability 1 - q_i.

October 30, 2014 at 1:02 am

Jyotirmoy: The comment at 1:01 a.m. is from me. Couldn't log in.

October 31, 2014 at 8:02 am

Anonymous: No it's not, it's from me.

November 8, 2014 at 10:33 am

Red coin, green coin | The Leisure of the Theory Class: […] Here is the question I posted last week […]