You are currently browsing Eran’s articles.

Time for some game theory about the recent development in Israeli politics. (You can read more about it here)

In the aftermath of the last election, no side could form a governing coalition. So the options were either a unity government, composed of the two major parties (Bibi Netanyahu’s Likud and Benny Gantz’ Kahol-Lavan) or yet another, fourth in a row, election. But with a year of election campaigns sowing hatred throughout the country, the ongoing coronavirus lockdown and an impending recession, nobody wants another election. So the two factions have to bargain on the terms of the unity government, with the possibility of another election being what game theorists call a disagreement point: this is what happens if negotiations break down.

Game theory teaches us that the disagreement point has a significant impact on the outcome, even when the parties reach an agreement, and that the better your position in the disagreement point, the better the bargaining outcome will be for you.

The disagreement point in the negotiation following the last election is impacted by two factors that did not exist in the previous rounds:

- The coronavirus came as a gift from the gods for Bibi. Frightened citizens usually stick with incumbent leaders in times of crisis, and in Bibi’s case, this tendency is magnified by the fact that he has been around for so many years. Even before the virus popped up, the greatest difficulty of the opposition was to convince the public that he is not the only person who is qualified for the prime-minister job. So an election will be advantageous to Bibi. Moreover, since the election will be postponed until after the lockdown, Bibi will likely stay in power for a long period even under the disagreement point.
- While he could not form a government, this time Gantz did manage to form a functioning majority coalition in the parliament. He was on the verge of electing a parliamentary speaker from his party, which would allow him to implement a law — the so-called “Bibi law” that forbids Bibi from being appointed for a prime-minister due to the fact that he was indicted for bribery.

So, in terms of the disagreement point, Bibi has something good going for him and something bad going for him. So he made a threat: If you continue with your current plan to take over the parliament, I will abandon the negotiations and we will jump straight ahead to the disagreement point. Gantz, believing the threat, gave up on the plan. Instead, he promised to give the speakership to somebody from Bibi’s party and to join a government led by Bibi and, and he has split his own party for that purpose, bringing with him only half of it to the new coalition under Bibi.

So now Gantz made the disagreement point much better for Bibi and much worse from himself:

— If there is an election again, Gantz will be viewed as a fool who tore down his party for nothing, and he will likely not be elected.

— Gantz has made public that he too thinks Bibi should be a prime minister. This statement substantiates Bibi’s image as the only qualified person for the job. It also diminishes the moral argument against choosing an indicted person to prime-minister, which was the justification for Bibi’s law and a central theme in the campaign against him.

— Gantz currently took the speakership to himself. Joining Bibi’s government will force him to resign from the speakership, and he has already agreed to give it to Bibi’s party. After Bibi has the speakership, even before the government is formed, it is not possible to pass the Bibi law.

The alternative was to take over the parliament, uncover information about the government’s lack of preparation and incompetent handling of the corona crisis, and meanwhile negotiate secretly over the unity government. But without publicly supporting Bibi as the prime-minster until the deal is reached, all the while keeping the option of passing Bibi’s law if the negotiation fails. It’s hard to predict a bargaining outcome, but I think it is possible that under this plan, even if Bibi would have continued to be the prime minister for some time, we would see a meaningful division of power and a date for the termination of Bibi’s reign. Now there is no such hope.

Why did Gantz give up all his cards? I think the reason is that, because he believed in Bibi’s threat to abandon the negotiations, Gantz thought he had to choose between another election and a unity government. Indeed, Gantz conducted an election poll before he made this move, which suggests he was thinking in these terms.

Bibi’s threat, however, was empty, or, to use another game-theoretic jargon, not credible. The fact that the disagreement point would have become worse for Bibi precisely implies that, whatever he says now, he would have been willing to negotiate in order to escape it. The only difference is that the negotiation would be conducted in an environment that is less convenient to Bibi.

It is also not at all clear that the probability of reaching a deal has increased following Ganz’s move. In fact, an exultant Bibi might insist on terms that even Gantz will not be able to accept.

Many of Gantz’ voters are furious because they preferred another election over a Bibi government. But Gant’s voters who were willing to stomach a Bibi government should also be furious because Gantz created a bargaining environment in which he is desperate to achieve it.

Israel is holding an election Tuesday. It’s a multi-party system in Israel, based on proportional representation, and the main battle is between two blocs of parties. We call them the left-wing bloc and the right-wing bloc, and civics textbooks will tell you that this division is about policy: the Arab-Israeli conflict, economics and the role of religion. But let’s face it, nobody cares about public policy anymore. It’s all about Bibi. Parties that intend to form a coalition under Bibi belong to the right-wing bloc and parties that intend to rid the country of him belong to the left wing bloc.

In addition to the battle between the blocs, there is also a contest between the parties in each bloc, especially between one big party and several satellites. Bibi, in particular, is expected to use the final days of the campaign to siphon votes from satellites to his own Likud party, as he did in previous elections. Indeed, in recent days, Bibi and his social media fans said that the number of seats of the Likud party compared to the major opponent (called BlueWhite), and not the size of the blocs, will determine whether Bibi will form the government again. BlueWhite makes the same argument to left-wing voters.

But if Bibi is serious about cannibalizing his bloc, he could use a more fatal argument by appealing to the electoral threshold, that is the minimum share of total votes that a party has to achieve to be entitled to seats in the parliament. The current threshold 3.25%, which translates to roughly five seats in the 120-seat parliament. This sounds like a low threshold, but there are eleven parties in the current parliament, some of them have split into two, there are new parties that are serious contenders, and Likud and BlueWhite together are expected to win about half of the total votes. So many small parties are close to the threshold in the opinion polls. Since the two blocs are almost tied (with a small advantage to the right-wing bloc), the electoral threshold might end up playing a significant role in determining the outcome if many votes are cast to a party that doesn’t cross the threshold.

Enter game theory

It is difficult to do a game-theoretic analysis of anything election because we don’t have a good explanation for what drives voters to stand in line in the voting booth rather than go to the beach. Let me bypass this issue and just assume that you gain some utility from voting to your favorite party. So assume for simplicity that right-wing voters have two options, Likud or New Right (NR), which is one of the new parties in the right-wing bloc, whose leaders Bibi detests. According to the polls, 4% of the voters are New Right supporters. These guys will get utility 2 if they vote New Right and utility 1 if they vote Likud. And here is the twist: you only get this utility if the party you vote to passes the electoral threshold. If it doesn’t, you feel like your vote has been wasted and you get utility 0.

Whether or not an NR supporter will vote her preference depends on what she thinks the other supporter will do. In fact, NR supporters are involved in a population stag hunt game (sometimes called investment game). There are two Nash equilibria in the game: either everyone votes NR or everyone votes Likud. The first equilibrium requires players to take the risky action of voting NR, and if they all go along, they all get high utility 2. I have played variants of the investment game in class several times. Typically, players go for the safe choice, which in this case is Likud. If you do this, you guarantee yourself a utility 1 regardless of what the other players do.

How could the potential voters of NR coordinate on the risky equilibrium which gives a higher utility? I think the public opinion polls have a big role here. In the recent polls, NR has been consistently hovering above the threshold. The polls make it commonly known that there are sufficiently many potential NR voters, and also that they plan to go along with voting NR.

Suppose however that on Monday Bibi hints that his internal polls show that NR does not cross the threshold. What would our NR supporter do? She will likely know that there are no such polls since most Bibi supporters know that he is a lier. But perhaps other players will believe Bibi and switch to voting Likud, which will drive NR below the threshold. Or perhaps other players will worry that some other players will believe him. You see where this is going: to play the risky equilibrium, you need confidence that the other players are playing it too. Bibi can shake this confidence even if he can’t change the fact that NR has enough supporters to cross the threshold if they all collaborate.

By the way, I believe that established parties that are already represented in the parliament are less vulnerable to such an attack, since the confidence of their supporters in the risky equilibrium is stronger, as they played it once already.

Go ahead Bibi, make my day.

This post is only gonna make sense if you read my last-week post on Gittins Index

So, why does Gittins’ theorem hold for arms and not for projects ?

First, an example. Consider the following two projects *A* and* B*. The first period you work on *A* you have to choose an action Up or Down. Future rewards from working on project *A* depend on your first action. The (deterministic) rewards sequence is

*A*: 8, 0, 0, 0, … if you played Up; and 5, 5, 5, 5, … if you played Down.

Project *B* gives the deterministic rewards sequence

*B*: 7, 0, 0,…

If your discount factor is high (meaning you are sufficiently patient) then your best strategy is to start with project *B*, get 7 and then switch to project *A*, play Down and get 5 every period.

Suppose now that there is an additional project, project *C* which gives the deterministic payoff sequence

*C*: 6,6,6,….

Your optimal strategy now is to start with *A*, play Up, and then switch to *C*.

This is a violation of Independence of Irrelevant Alternatives: when only *A* and *B* are available you first choose B. When *C* becomes available you suddenly choose A. The example shows that Gittins’ index theorem does not hold for projects.

What goes wrong in the proof ? We can still define the “fair charges” and the “prevailing charges” as in the multi-armed proof. For example, the fair charge for project *A* is 8: If the casino charges a smaller amount you will pay once, play Up, and go home, and the casino will lose money.

The point where the proof breaks down is step 5. It is not true that Gittins’ strategy maximizes the payments to the casino. The key difference is that in the multi-project case the sequence of prevailing charges of each bandit depends on your strategy. In contrast, in the multi-arm case, your strategy only determines which charge you will play when, but the charges themselves are independent of your strategy. Since the discounts sequence is decreasing, the total discounted payment in the multi-armed problem is maximized if you pay the charges in decreasing order, as you do under the Gittins Strategy.

After presenting Richard Weber’s remarkable proof of Gittins’ index theorem in my dynamic optimization class, I claimed that the best way to make sure that you understand a proof is to identify where the assumptions of the theorem are used. Here is the proof again, slightly modified from Weber’s paper, followed by the question I gave in class.

First, an *arm* or a *bandit process* is given by a countable state space , a transition function and a payoff function . The interpretation is that at every period, when the arm is at state , playing it gives a reward and the arm’s state changes according to .

In the *multi-armed bandit problem*, at every period you choose an arm to play. The states of the arms you didn’t choose remain fixed. Your goal is to maximize expected total discounted rewards. Gittins’ theorem says that for each arm there exists a function called the *Gittins Index* (GI from now on) such that, in a multi armed problem, the optimal strategy is to play at each period the arm whose current state has the largest GI. In fancy words, the theorem establishes that the choice which arm to play at each period satisfies *Independent of Irrelevance Alternatives*: Suppose there are three arms whose current states are . If you were going to start by playing if only and were available, then you should not start with when are available.

The proof proceeds in several steps:

- Define the Gittins Index at state to be the amount such that, if the casino charges every time you play the arm, then both playing and not playing are optimal actions at the state . We need to prove that there exists a unique such . This is not completely obvious, but can be shown by appealing to standard dynamic programming arguments.
- Assume that you enter a casino with a single arm at some state with GI . Assume also that the casino charges every time you play the arm. At every period, you can play, or quit playing, or take a break. From step 1, it follows that regardless of your strategy, the casino will always get a nonnegative net expected net payoff, and if you play optimally then the net expected payoff to the casino (and therefore also to you) is zero. For this reason, this (the GI of the initial state) is called the
*fair charge*. Here, playing optimally means that you either not play at all or start playing and continue to play every period until the arm reaches a state with GI strictly smaller than , in which case you must quit. It is important that as long as the arm is at a state with GI strictly greater than you continue playing. If you need to take a restroom break you must wait until the arm reaches a state with GI . - Continuing with a single arm, assume now that the casino announces a new policy that at every period, if the arm reaches a state with GI that is strictly smaller than the GI of all previous states, then the charge for playing the arm drops to the new GI. We call these new (random) charges the
*prevailing charges*. Again, the casino will always get a nonnegative net expected payoff, and if you play optimally then the net expected payoff is zero. Here, playing optimally means that you either not play at all or start playing and continue to play forever. You can quit or take a bathroom break only at periods in which the prevailing charge equals the GI of the current state. - Consider now the multi-arms problem, and assume again that in order to play an arm you have to pay its current prevailing charge as defined in step 3. Then again, regardless of how you play, the Casino will get a nonnegative net payoff (since by step 3 this is the case for every arm separately), and you can still get an expected net payoff if you play optimally. Playing optimally means that you either not play or start playing. If you start playing you can quit, take a break, or switch to another arm only in periods in which the prevailing charge of the arm you are currently playing equals the GI of its current state.
- Forget for a moment about your profits and assume that what you care about is maximizing payments to the casino (I don’t mean net payoff, I mean just the charges that the casino receives from your playing). Since the sequence of prevailing charges of every arm is decreasing, and since the discount factor makes the casino like higher payments early, the Gittins strategy — the one in which you play at each period the arm with highest current GI, which by definition of the prevailing charge is also the arm with highest current prevailing charge — is the one that maximizes the Casino’s payments. In fact, this would be the case even if you knew the realization of the charges sequence in advance.
- The Gittins strategy is one of the optimal strategies from step 4. Therefore, its net expected payoff is .
- Therefore, for every strategy ,

Rewards from Charges from Charges from Gittins strategy

(First inequality is step 4 and second is step 5)

And Charges from Gittins strategy = Rewards from Gittins Strategy

(step 6) - Therefore, the Gittins strategy gives the optimal possible total rewards.

That’s it.

Now, here is the question. Suppose that instead of arms we would have dynamic optimization problems, each given by a state space, an action space, a transition function, and a payoff function. Let’s call them projects. The difference between a project and an arm is that when you decide to work on a project you also decide which action to take, and the current reward and next state depend on the current state and on your action. Now read again the proof with projects in mind. Every time I said “play arm ”, what I meant is work on project and choose the optimal action. We can still define an “index”, as in the first step: the unique charge such that, if you need to pay every period you work on the project (using one of the actions) then both not working and working with some action is optimal. The conclusion is not true for the projects problem though. At which step does the argument break down?

It is not often that Terry Tao gets into politics in his blog, but, as political observers like to say, normal rules don’t apply this year. Tao writes that many of Trump’s supporters secretly believe that he is not even remotedly qualified for the presidency, but they continue to entertain this possibility because their fellow citizens and the media and politicians seem to be doing so. He suggests that more people should come out and reveal their secret beliefs.

I generally agree with Tao’ sentiment and argument, but I have a quibble. Tao describe the current situation as mutual knowledge without common knowledge. This, I think, is wrong. To get politics out of the way, let me explain my position using a similar situation which Tao also mentions: The Emperor’s new clothes. I have already come across people casting the Emperor’s story in terms of mutual knowledge without common knowledge, and I think it is also wrong. The way I understand the story, before the kid shouts, each of the Emperor’s subjects sees that the Emperor is naked, but after observing everybody else’s reaction, each subject updates her own initial belief and deduces that she was probably wrong. The subjects now don’t think that the Emperor is naked. Rather, each subjects thinks that her own eyes deceived her.

But when game theorists and logicians say that an assertion is mutual knowledge (or mutual belief) we mean that each of us, after taking into account our own information including what we deduce about other people’s information, think the assertion is true. In my reading of the Emperor’s new cloths story this is not the case.

For an assertion to be common knowledge, we need in addition that everybody knows that everybody knows that the assertion is true, and that everybody knows that everybody knows that everybody knows that the assertion is true, and onwards to infinity. A good example of a situation with mutual knowledge and no common knowledge is the blue-eyed islanders puzzle (using the story as it appears Terrence’ blog and a big spoiler ahead if you are not familiar with the puzzle): Before the foreigner makes an announcement, it is mutual knowledge that there are at least 99 blue-eyed islanders, but this fact is not common knowledge: If Alice and Bob are both blue-eyed then Alice, not knowing the color of her own eyes, thinks that Bob might observe only 98 blue-eyed islanders. In fact it is not even common knowledge that there are at least 98 blue-eyed Islanders, because Alice thinks that Bob might think that Craig might only observe 97 blue-eyed Islanders. By similar reasoning, before the foreigner’s announcement, it is not even common knowledge that there is at least one blue-eyed islander. Once the foreigner announces it, this fact becomes common knowledge.

No mutual knowledge and no common knowledge are two situations that can have different behavioral implications. Suppose that we offer each of the subjects the following private voting game: Is the emperor wearing clothes ? You have to answer yes or no. If you answer correctly you get a free ice cream sandwich, otherwise you get nothing. According to my reading of the story they will all give the wrong answer, and get nothing. On the other hand, suppose you offer a similar game to the islanders — even before the foreigner arrives — Do you think that there is at least one blue-eyed islander ? they will answer correctly.

There is an alternative reading of the Emperor’s story, according to which it is indeed a story about mutual knowledge without common knowledge: Even after observing the crowd’s reaction, each subject still knows that the Emperor is naked, but she keeps her mouth shut because she suspects that her fellow subjects don’t realize it and she doesn’t want to make a fool of herself. This reading strikes me as less psychologically interesting, but, more importantly, if that’s how you understand the story then there is nothing to worry about. All the subjects will vote correctly anyway and get the ice cream even without the little kid making it a common knowledge. And Trump will not be elected president even if people continue to keep their mouth shut.

When explaining the meaning of a confidence interval don’t say “the probability that the parameter is in the interval is 0.95” because probability is a precious concept and this statement does not match the meaning of this term. Instead, say “We are 95% confident that the parameter is in the interval”. Admittedly, I don’t know what people will make of the word “confident”. But I also don’t know what they will make of the word “probability”

If you live under the impression that in order to publish an empirical paper you must include the sentence “this holds with p-value *x”* for some number *x<0.05* in your paper, here is a surprising bit of news for you: The editors of Basic and Applied Social Psychology have banned p-value from their journal, along with confidence intervals. In fact, according to the editorial, the state of the art of statistics “remains uncertain” so statistical inference is no longer welcome in their journal.

When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know about statistics as much as I know about social psychology. But I haven’t heard about this journal until yesterday, and if I did I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.

Here are somebodies who have the right to complain: The American Statistical Association. Concerned with the misuse, mistrust and misunderstanding of the p-value, ASA has recently issued a policy statement on p- values and statistical significance, intended for researchers who are not statisticians.

How do you explain p-value to practitioners who don’t care about things like Neyman-Pearson Lemma, independence and UMP tests ? First, you use language that obscures conceptual difficulties: “the probability that a statistical summary of the data would be equal to or more extreme than its observed value’’ — without saying what “more extreme’’ means. Second, you use warnings and slogans about what p-value doesn’t mean or can’t do, like “p-value does not measure the size of an effect or the importance of a result.’’

Among these slogans my favorite is

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

What’s cute about this statement is that it assumes that everybody understands what “there is 5% chance that the studied hypothesis is true” and that the notion of P-value is the one that is difficult to understand. In fact, the opposite is true.

Probability is conceptually tricky. It’s meaning is somewhat clear in a situation of a repeated experiment: I more or less understand what it means that a coin has 50% chance to land on Heads. (Yes. Only more or less). But without going full subjective I have no idea what is the meaning of the probability that a given hypothesis (boys who eat pickles in kindergarten have higher SAT score than girls who play firefighters) is true. On the other hand, The meaning of the corresponding P-value relies only on the conceptually simpler notion of probabilities in a repeated experiment.

Why therefore do the committee members (rightly !) assume that people are comfortable with the difficult concept of probability that an hypothesis is true and are uncomfortable with the easy concept of p-value ? I think the reason is that unlike the word “p-value”, the word “probability” is a word that we use in everyday life, so most people feel they know what it means. Since they have never thought about it formally, they are not aware that they actually don’t.

So here is a modest proposal for preventing the misuse and misunderstanding of statistical inference: Instead of saying “this hypothesis holds with p-value 0.03” say “We are 97% confident that this hypothesis holds”. We all know what “confident” means right ?

I am not the right person to write about Lloyd Shapley. I think I only saw him once, in the first stony brook conference I attended. He reminded me of Doc Brown from Back to The Future, but I am not really sure why. Here are links to posts in The Economist and NYT following his death.

Shapley got the Nobel in 2012 and according to Robert Aumann deserved to get it right with Nash. Shapley himself however was not completely on board: “I consider myself a mathematician and the award is for economics. I never, never in my life took a course in economics.” If you are wondering what he means by “a mathematician” read the following quote, from the last paragraph of his stable matching paper with David Gale

The argument is carried out not in mathematical symbols but in ordinary English; there are no obscure or technical terms. Knowledge of calculus is not presupposed. In fact, one hardly needs to know how to count. Yet any mathematician will immediately recognize the argument as mathematical…

What, then, to raise the old question once more, is mathematics? The answer, it appears, is that any argument which is carried out with sufficient precision is mathematical

In the paper Gale and Shapley considered a problem of matching (or assignment as they called it) of applicants to colleges, where each applicant has his own preference over colleges and each college has its preference over applicants. Moreover, each college has a quota. Here is the definition of stability, taken from the original paper

Definition: An assignment of applicants to colleges will be called unstable if there are two applicants and who are assigned to colleges and , respectively, although prefers to and prefers to .

According to the Gale-Shapley algorithm, applicants apply to colleges sequentially following their preferences. A college with quota maintains a `waiting list’ of size with the top applicants that has applied to it so far, and rejects all other applicants. When an applicant is rejected from a college he applies to his next favorite college. Gale and Shapley proved that the algorithm terminates with a stable assignment.

One reason that the paper was so successful is that the Gale Shapley method is actually used in practice. (A famous example is the national resident program that assigns budding physicians to hospitals). From theoretical perspective my favorite follow-up is a paper of Dubins and Freedman “Machiavelli and the Gale-Shapley Algorithm” (1981): Suppose that some applicant, Machiavelli, decides to `cheat’ and apply to colleges in different order than his true ranking. Can Machiavelli improves his position in the assignment produced by the algorithm ? Dubins and Freedman prove that the answer to this question is no.

Shapley’s contribution to game theory is too vast to mention in a single post. Since I mainly want to say something about his mathematics let me mention Shapley-Folkman-Starr Lemma, a kind of discrete analogue of Lyapunov’s theorem on the range of non-atomic vector measures, and KKMS Lemma which I still don’t understand its meaning but it has something to do with fixed points and Yaron and I have used it in our paper about rental harmony.

I am going to talk in more details about stochasic games, introduced by Shapley in 1953, since this area has been flourishing recently with some really big developments. A (two-player, zero-sum) stochastic game is given by a finite set of states, finite set of actions for the players, a period payoff function , a distribution over for every state and actions , and a discount factor . At every period the system is at some state , players choose actions simultaneously and independently. Then the column player pays to the row player. The game then moves to a new state in the next period, randomized according to . Players evaluate their infinite stream of payoofs via the discount factor . The model is a generalization of the single player dynamic programming model which was studied by Blackwell and Bellman. Shapley proved that every zero-sum stochastic game admits a value, by imitating the familiar single player argument, which have been the joy and pride of macroeconomists ever since Lucas asset pricing model (think Bellman Equation and the contraction operators). Fink later proved using similar ideas that non-zero sum discounted stochastic games admit perfect markov equilibria.

A major question, following a similar question in the single player setup, is the limit behavior of the value and the optimal strategies when players become more patient (i.e., goes to ). Mertens and Neyman have proved that the limit exists, and moreover that for every there strategies which are -optimal for sufficiently large discount factor. Whether a similar result holds for Nash equilibrium in -player stochastic games is probably the most important open question in game theory. Another important question is whether the limit of the value exists for zero-sum games in which the state is not observed by both players. Bruno Zilloto has recently answered this question by providing a counter-example. I should probably warn that you need to know how to count and also some calculus to follow up this literature. Bruno Zilloto will give the Shapley Lecture in Games2016 in Maastricht. Congrats, Bruno ! and thanks to Shapley for leaving us with some much stuff to play with !

One of the highlight of last year’s Stony Brook conference was John Milnor’s talk about John Nash. The video is now available online. John Milnor, also an Abel Prize laureate, is familiar to many economists from his book “Topology from the differential approach”. He has known John Nash from the beginning of their careers in the common room in the math school at Princeton University. Watch him talk with clarity and humor about young Nash’s ambitions, creativity, courage, and contributions.

Here is a the handwritten letter which Nash wrote to the NSA in 1955 (pdf), fifteen years before Cook formalized the P/NP problem. In the letter Nash conjectures that for most encryption mechanisms, recovering the key from the cipher requires exponential amount of time. And here is what Nash had to say about proving this conjecture:

Here is the question from Ross’ book that I posted last week

Question 1We have two coins, a red one and a green one. When flipped, one lands heads with probability and the other with probability . Assume that . We do not know which coin is the coin. We initially attach probability to the red coin being the coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor . Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is . Every period we choose a coin to toss, get a reward and updated our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.

In the classical bandit problem there are arms and each arm provides a reward from an unknown distribution . Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The are randomized independently according to distributions , and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number (the index) to every belief about an arm. What is important is that the index of an arm depends only on — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let so that . In the easy problem there are again two coins but this time the red coin is with probability and with probability and, *independently*, the green coin is with probability and with probability . The easy problem is easy because it is a bandit problem. We have to keep track of beliefs and about the red coin and the green coin ( is the probability that the red coin is ), starting with and , and when we toss the red coin we update but keep fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is so the optimal strategy is to play red when and green when . In particular, the optimal action in the first period is red when and green when .

Now here comes the trick. Consider a general strategy that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by and the rewards that gives in the difficult and easy problems respectively. I claim that

Why is that ? in the easy problem there is a probability that both coins are . If this happens then every gives payoff . There is a probability that both coins are . If this happens then every gives payoff . And there is a probability that the coins are different, and, because of the choice of , conditionally on this event the probability of being is . Therefore, in this case gives whatever gives in the difficult problem.

So, the payoff in the easy problem is a linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every , the optimal action in the first period is red when and green with . Now back to the dynamic programming formulation, from standard arguments it follows that the optimal strategy is to keep doing it forever, i.e., at every period to toss the coin that is more likely to be the coin given the current information.

See why I said my solution is tricky and specific ? it relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2Let . We are given coins, one of each parameter, all possibilities equally likely. Each period we have to toss a coin and we get payoff for Heads. What is the optimal strategy ?

## Recent Comments