You are currently browsing the tag archive for the ‘multiarmed bandit’ tag.

This post is only gonna make sense if you read my last-week post on Gittins Index

So, why does Gittins’ theorem hold for arms and not for projects ?

First, an example. Consider the following two projects A and B. The first period you work on A you have to choose an action Up or Down. Future rewards from working on project A depend on your first action. The (deterministic) rewards sequence is

A:  8, 0, 0, 0, … if you played Up; and     5, 5, 5, 5, … if you played Down.

Project B gives the deterministic rewards sequence

B:  7, 0, 0,…

If your discount factor is high (meaning you are sufficiently patient) then your best strategy is to start with project B, get 7 and then switch to project A, play Down and get 5 every period.

Suppose now that there is an additional project, project C which gives the deterministic payoff sequence

C:  6,6,6,….

Your optimal strategy now is to start with A, play Up, and then switch to C.

This is a violation of Independence of Irrelevant Alternatives: when only A and B are available you first choose B. When C becomes available you suddenly choose  A. The example shows that Gittins’ index theorem does not hold for projects.

What goes wrong in the proof ? We can still define the “fair charges” and the “prevailing charges” as in the multi-armed proof. For example, the fair charge for project A is 8: If the casino charges a smaller amount you will pay once, play Up, and go home, and the casino will lose money.

The point where the proof breaks down is step 5. It is not true that Gittins’ strategy maximizes the payments to the casino. The key difference is that in the multi-project case the sequence of prevailing charges of each bandit depends on your strategy. In contrast, in the multi-arm case, your strategy only determines which charge you will play when, but the charges themselves are independent of your strategy. Since the discounts sequence 1,\beta,\beta^2,\dots is decreasing, the total discounted payment in the multi-armed problem is maximized if you pay the charges in decreasing order, as you do under the Gittins Strategy.

After presenting Richard Weber’s remarkable proof of Gittins’ index theorem in my dynamic optimization class, I claimed that the best way to make sure that you understand a proof is to identify where the assumptions of the theorem are used. Here is the proof again, slightly modified from Weber’s paper, followed by the question I gave in class.

First, an arm or a bandit process is given by a countable state space S, a transition function q(s'|s) and a payoff function r:S\rightarrow [0,1]. The interpretation is that at every period, when the arm is at state s, playing it gives a reward r(s) and the arm’s state changes according to q(\cdot|s).

In the multi-armed bandit problem, at every period you choose an arm to play. The states of the arms you didn’t choose remain fixed. Your goal is to maximize expected total discounted rewards. Gittins’ theorem says that for each arm there exists a function \gamma:S\rightarrow [0,1] called the Gittins Index (GI from now on) such that, in a multi armed problem, the optimal strategy is to play at each period the arm whose current state has the largest GI. In fancy words, the theorem establishes that the choice which arm to play at each period satisfies Independent of Irrelevance Alternatives: Suppose there are three arms A,B,C whose current states are a,b,c. If you were going to start by playing A if only A and B were available, then you should not start with B when A,B,C are available.

The proof proceeds in several steps:

  1. Define the Gittins Index at state s to be the amount \gamma such that, if the casino charges \gamma every time you play the arm, then both playing and not playing are optimal actions at the state s. We need to prove that there exists a unique such \gamma. This is not completely obvious, but can be shown by appealing to standard dynamic programming arguments.
  2. Assume that you enter a casino with a single arm at some state s with GI \gamma. Assume also that the casino charges \gamma every time you play the arm. At every period, you can play, or quit playing, or take a break. From step 1, it follows that regardless of your strategy, the casino will always get a nonnegative net expected net payoff, and if you play optimally then the net expected payoff to the casino (and therefore also to you) is zero. For this reason this \gamma (the GI of the initial state) is called the fair charge. Here, playing optimally means that you either not play at all or start playing and continue to play every period until the arm reaches a state with GI strictly smaller then \gamma, in which case you must quit. It is important that as long as the arm is at a state with GI strictly greater than \gamma you continue playing. If you need to take a restroom break you must wait until the arm reaches a state with GI \le \gamma.
  3. Continuing with a single arm, assume now that the casino announces a new policy that at every period, if the arm reaches a state with GI that is strictly smaller than the GI of all previous states, then the charge for playing the arm drops to the new GI. We call these new (random) charges the prevailing charges. Again, the casino will always get a nonnegative net expected payoff, and if you play optimally then the net expected payoff is zero. Here, playing optimally means that you either not play at all or start playing and continue to play foreover. You can quit or take a bathroom break only at periods in which the prevailing charge equals the GI of the current state.
  4. Consider now the multi-arms problem, and assume again that in order to play an arm you have to pay its current prevailing charge as defined in step 3. Then again, regardless of how you play, the Casino will get a nonnegative net payoff (since by step 3 this is the case for every arm separately), and you can still get an expected net payoff 0 if you play optimally. Playing optimally means that you either not play or start playing. If you start playing you can quit, take a break, or switch to another arm only in periods in which the prevailing charge of the arm you are currently playing equals the GI of its current state.
  5. Forget for a moment about your profits and assume that what you care about is maximizing payments to the casino (I don’t mean net payoff, I mean just the charges that the casino receives from your playing). Since the sequence of prevailing charges of every arm is decreasing, and since the discount factor makes the casino like higher payments early, the Gittins strategy — the one in which you play at each period the arm with highest current GI, which by definition of the prevailing charge is also the arm with highest current prevailing charge — is the one that maximizes the Casino’s payments. In fact, this would be the case even if you knew the realization of the charges sequence in advance.
  6. The Gittins strategy is one of the optimal strategies from step 4. Therefore, its net expected payoff is 0.
  7. Therefore, for every strategy \sigma,
    Rewards from \sigma<= Charges from \sigma<= Charges from Gittins strategy
    (First inequality is step 4 and second is step 5)
    And Charges from Gittins strategy = Rewards from Gittins Strategy
    (step 6)
  8. Therefore, Gittins strategy gives the optimal possible total rewards.

That’s it.

Now, here is the question. Suppose that instead of arms we would have dynamic optimization problems, each given by a state space, an action space, a transition function, and a payoff function. Let’s call them projects. The difference between a project and an arm is that when you decide to work on a project you also decide which action to take, and the current reward and next state depend on the current state and on your action. Now read again the proof with projects in mind. Every time I said “play arm i”, what I meant is work on project i and choose the optimal action. We can still define an “index”, as in the first step: the unique charge \gamma such that, if you need to pay \gamma every period you work on the project (using one of the actions) then both not working and working with some action are optimal. The conclusion is not true for the projects problem though. At which step does the argument break down ?

Here is the question from Ross’ book that I posted last week

Question 1 We have two coins, a red one and a green one. When flipped, one lands heads with probability {P_1} and the other with probability {P_2}. Assume that {P_1>P_2}. We do not know which coin is the {P_1} coin. We initially attach probability {p} to the red coin being the {P_1} coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor {\beta}. Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is {P_1}. Every period we choose a coin to toss, get a reward and updated our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.

In the classical bandit problem there are {n} arms and each arm {i} provides a reward from an unknown distribution {\theta_i\in\Delta([0,1])}. Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The {\theta_i} are randomized independently according to distributions {\mu_i\in \Delta(\Delta([0,1]))}, and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number {G(\mu)} (the index) to every belief {\mu} about an arm. What is important is that the index of an arm {i} depends only on {\mu_i} — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is {P_1} decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let {r=p/(p+\sqrt{p(1-p)}} so that {r^2/(1-r)^2=p/1-p}. In the easy problem there are again two coins but this time the red coin is {P_1} with probability {r} and {P_2} with probability {1-r} and, independently, the green coin is {P_1} with probability {(1-r)} and {P_2} with probability {r}. The easy problem is easy because it is a bandit problem. We have to keep track of beliefs {p_r} and {p_g} about the red coin and the green coin ({p_r} is the probability that the red coin is {P_1}), starting with {p_r={r}} and {p_g=(1-r)}, and when we toss the red coin we update {p_r} but keep {p_g} fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is {P_1} so the optimal strategy is to play red when {p_r\ge p_g} and green when {p_g\ge p_r}. In particular, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}.

Now here comes the trick. Consider a general strategy {g} that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by {V_d(g)} and {V_e(g)} the rewards that {g} gives in the difficult and easy problems respectively. I claim that

\displaystyle \begin{array}{rcl} &V_e(g)=r(1-r) \cdot P_1/(1-\beta)+ \\ &r(1-r) \cdot P_2/(1-\beta) + (r^2+(1-r)^2) V_d(g).\end{array}

Why is that ? in the easy problem there is a probability {r(1-r)} that both coins are {P_1}. If this happens then every {g} gives payoff {P_1/(1-\beta)}. There is a probability {r(1-r)} that both coins are {P_2}. If this happens then every {g} gives payoff {P_2/(1-\beta)}. And there is a probability {r^2+(1-r)^2} that the coins are different, and, because of the choice of {r}, conditionally on this event the probability of {G} being {P_1} is {p}. Therefore, in this case {g} gives whatever {g} gives in the difficult problem.

So, the payoff in the easy problem is a linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every {p}, the optimal action in the first period is red when {p\ge 1/2} and green with {p\le 1/2}. Now back to the dynamic programming formulation, from standard arguments it follows that the optimal strategy is to keep doing it forever, i.e., at every period to toss the coin that is more likely to be the {P_1} coin given the current information.

See why I said my solution is tricky and specific ? it relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2 Let {0 \le P_1 < P_2 < ... < P_n \le 1}. We are given {n} coins, one of each parameter, all {n!} possibilities equally likely. Each period we have to toss a coin and we get payoff {1} for Heads. What is the optimal strategy ?

Kellogg faculty blogroll