You are currently browsing Eran’s articles.

This post is only gonna make sense if you read my last-week post on Gittins Index

So, why does Gittins’ theorem hold for arms and not for projects ?

First, an example. Consider the following two projects *A* and* B*. The first period you work on *A* you have to choose an action Up or Down. Future rewards from working on project *A* depend on your first action. The (deterministic) rewards sequence is

*A*: 8, 0, 0, 0, … if you played Up; and 5, 5, 5, 5, … if you played Down.

Project *B* gives the deterministic rewards sequence

*B*: 7, 0, 0,…

If your discount factor is high (meaning you are sufficiently patient) then your best strategy is to start with project *B*, get 7 and then switch to project *A*, play Down and get 5 every period.

Suppose now that there is an additional project, project *C* which gives the deterministic payoff sequence

*C*: 6,6,6,….

Your optimal strategy now is to start with *A*, play Up, and then switch to *C*.

This is a violation of Independence of Irrelevant Alternatives: when only *A* and *B* are available you first choose B. When *C* becomes available you suddenly choose A. The example shows that Gittins’ index theorem does not hold for projects.

What goes wrong in the proof ? We can still define the “fair charges” and the “prevailing charges” as in the multi-armed proof. For example, the fair charge for project *A* is 8: If the casino charges a smaller amount you will pay once, play Up, and go home, and the casino will lose money.

The point where the proof breaks down is step 5. It is not true that Gittins’ strategy maximizes the payments to the casino. The key difference is that in the multi-project case the sequence of prevailing charges of each bandit depends on your strategy. In contrast, in the multi-arm case, your strategy only determines which charge you will play when, but the charges themselves are independent of your strategy. Since the discounts sequence is decreasing, the total discounted payment in the multi-armed problem is maximized if you pay the charges in decreasing order, as you do under the Gittins Strategy.

After presenting Richard Weber’s remarkable proof of Gittins’ index theorem in my dynamic optimization class, I claimed that the best way to make sure that you understand a proof is to identify where the assumptions of the theorem are used. Here is the proof again, slightly modified from Weber’s paper, followed by the question I gave in class.

First, an *arm* or a *bandit process* is given by a countable state space , a transition function and a payoff function . The interpretation is that at every period, when the arm is at state , playing it gives a reward and the arm’s state changes according to .

In the *multi-armed bandit problem*, at every period you choose an arm to play. The states of the arms you didn’t choose remain fixed. Your goal is to maximize expected total discounted rewards. Gittins’ theorem says that for each arm there exists a function called the *Gittins Index* (GI from now on) such that, in a multi armed problem, the optimal strategy is to play at each period the arm whose current state has the largest GI. In fancy words, the theorem establishes that the choice which arm to play at each period satisfies *Independent of Irrelevance Alternatives*: Suppose there are three arms whose current states are . If you were going to start by playing if only and were available, then you should not start with when are available.

The proof proceeds in several steps:

- Define the Gittins Index at state to be the amount such that, if the casino charges every time you play the arm, then both playing and not playing are optimal actions at the state . We need to prove that there exists a unique such . This is not completely obvious, but can be shown by appealing to standard dynamic programming arguments.
- Assume that you enter a casino with a single arm at some state with GI . Assume also that the casino charges every time you play the arm. At every period, you can play, or quit playing, or take a break. From step 1, it follows that regardless of your strategy, the casino will always get a nonnegative net expected net payoff, and if you play optimally then the net expected payoff to the casino (and therefore also to you) is zero. For this reason this (the GI of the initial state) is called the
*fair charge*. Here, playing optimally means that you either not play at all or start playing and continue to play every period until the arm reaches a state with GI strictly smaller then , in which case you must quit. It is important that as long as the arm is at a state with GI strictly greater than you continue playing. If you need to take a restroom break you must wait until the arm reaches a state with GI . - Continuing with a single arm, assume now that the casino announces a new policy that at every period, if the arm reaches a state with GI that is strictly smaller than the GI of all previous states, then the charge for playing the arm drops to the new GI. We call these new (random) charges the
*prevailing charges*. Again, the casino will always get a nonnegative net expected payoff, and if you play optimally then the net expected payoff is zero. Here, playing optimally means that you either not play at all or start playing and continue to play foreover. You can quit or take a bathroom break only at periods in which the prevailing charge equals the GI of the current state. - Consider now the multi-arms problem, and assume again that in order to play an arm you have to pay its current prevailing charge as defined in step 3. Then again, regardless of how you play, the Casino will get a nonnegative net payoff (since by step 3 this is the case for every arm separately), and you can still get an expected net payoff if you play optimally. Playing optimally means that you either not play or start playing. If you start playing you can quit, take a break, or switch to another arm only in periods in which the prevailing charge of the arm you are currently playing equals the GI of its current state.
- Forget for a moment about your profits and assume that what you care about is maximizing payments to the casino (I don’t mean net payoff, I mean just the charges that the casino receives from your playing). Since the sequence of prevailing charges of every arm is decreasing, and since the discount factor makes the casino like higher payments early, the Gittins strategy — the one in which you play at each period the arm with highest current GI, which by definition of the prevailing charge is also the arm with highest current prevailing charge — is the one that maximizes the Casino’s payments. In fact, this would be the case even if you knew the realization of the charges sequence in advance.
- The Gittins strategy is one of the optimal strategies from step 4. Therefore, its net expected payoff is .
- Therefore, for every strategy ,

Rewards from Charges from Charges from Gittins strategy

(First inequality is step 4 and second is step 5)

And Charges from Gittins strategy = Rewards from Gittins Strategy

(step 6) - Therefore, Gittins strategy gives the optimal possible total rewards.

That’s it.

Now, here is the question. Suppose that instead of arms we would have dynamic optimization problems, each given by a state space, an action space, a transition function, and a payoff function. Let’s call them projects. The difference between a project and an arm is that when you decide to work on a project you also decide which action to take, and the current reward and next state depend on the current state and on your action. Now read again the proof with projects in mind. Every time I said “play arm ”, what I meant is work on project and choose the optimal action. We can still define an “index”, as in the first step: the unique charge such that, if you need to pay every period you work on the project (using one of the actions) then both not working and working with some action are optimal. The conclusion is not true for the projects problem though. At which step does the argument break down ?

It is not often that Terry Tao gets into politics in his blog, but, as political observers like to say, normal rules don’t apply this year. Tao writes that many of Trump’s supporters secretly believe that he is not even remotedly qualified for the presidency, but they continue to entertain this possibility because their fellow citizens and the media and politicians seem to be doing so. He suggests that more people should come out and reveal their secret beliefs.

I generally agree with Tao’ sentiment and argument, but I have a quibble. Tao describe the current situation as mutual knowledge without common knowledge. This, I think, is wrong. To get politics out of the way, let me explain my position using a similar situation which Tao also mentions: The Emperor’s new clothes. I have already come across people casting the Emperor’s story in terms of mutual knowledge without common knowledge, and I think it is also wrong. The way I understand the story, before the kid shouts, each of the Emperor’s subjects sees that the Emperor is naked, but after observing everybody else’s reaction, each subject updates her own initial belief and deduces that she was probably wrong. The subjects now don’t think that the Emperor is naked. Rather, each subjects thinks that her own eyes deceived her.

But when game theorists and logicians say that an assertion is mutual knowledge (or mutual belief) we mean that each of us, after taking into account our own information including what we deduce about other people’s information, think the assertion is true. In my reading of the Emperor’s new cloths story this is not the case.

For an assertion to be common knowledge, we need in addition that everybody knows that everybody knows that the assertion is true, and that everybody knows that everybody knows that everybody knows that the assertion is true, and onwards to infinity. A good example of a situation with mutual knowledge and no common knowledge is the blue-eyed islanders puzzle (using the story as it appears Terrence’ blog and a big spoiler ahead if you are not familiar with the puzzle): Before the foreigner makes an announcement, it is mutual knowledge that there are at least 99 blue-eyed islanders, but this fact is not common knowledge: If Alice and Bob are both blue-eyed then Alice, not knowing the color of her own eyes, thinks that Bob might observe only 98 blue-eyed islanders. In fact it is not even common knowledge that there are at least 98 blue-eyed Islanders, because Alice thinks that Bob might think that Craig might only observe 97 blue-eyed Islanders. By similar reasoning, before the foreigner’s announcement, it is not even common knowledge that there is at least one blue-eyed islander. Once the foreigner announces it, this fact becomes common knowledge.

No mutual knowledge and no common knowledge are two situations that can have different behavioral implications. Suppose that we offer each of the subjects the following private voting game: Is the emperor wearing clothes ? You have to answer yes or no. If you answer correctly you get a free ice cream sandwich, otherwise you get nothing. According to my reading of the story they will all give the wrong answer, and get nothing. On the other hand, suppose you offer a similar game to the islanders — even before the foreigner arrives — Do you think that there is at least one blue-eyed islander ? they will answer correctly.

There is an alternative reading of the Emperor’s story, according to which it is indeed a story about mutual knowledge without common knowledge: Even after observing the crowd’s reaction, each subject still knows that the Emperor is naked, but she keeps her mouth shut because she suspects that her fellow subjects don’t realize it and she doesn’t want to make a fool of herself. This reading strikes me as less psychologically interesting, but, more importantly, if that’s how you understand the story then there is nothing to worry about. All the subjects will vote correctly anyway and get the ice cream even without the little kid making it a common knowledge. And Trump will not be elected president even if people continue to keep their mouth shut.

When explaining the meaning of a confidence interval don’t say “the probability that the parameter is in the interval is 0.95” because probability is a precious concept and this statement does not match the meaning of this term. Instead, say “We are 95% confident that the parameter is in the interval”. Admittedly, I don’t know what people will make of the word “confident”. But I also don’t know what they will make of the word “probability”

If you live under the impression that in order to publish an empirical paper you must include the sentence “this holds with p-value *x”* for some number *x<0.05* in your paper, here is a surprising bit of news for you: The editors of Basic and Applied Social Psychology have banned p-value from their journal, along with confidence intervals. In fact, according to the editorial, the state of the art of statistics “remains uncertain” so statistical inference is no longer welcome in their journal.

When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know about statistics as much as I know about social psychology. But I haven’t heard about this journal until yesterday, and if I did I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.

Here are somebodies who have the right to complain: The American Statistical Association. Concerned with the misuse, mistrust and misunderstanding of the p-value, ASA has recently issued a policy statement on p- values and statistical significance, intended for researchers who are not statisticians.

How do you explain p-value to practitioners who don’t care about things like Neyman-Pearson Lemma, independence and UMP tests ? First, you use language that obscures conceptual difficulties: “the probability that a statistical summary of the data would be equal to or more extreme than its observed value’’ — without saying what “more extreme’’ means. Second, you use warnings and slogans about what p-value doesn’t mean or can’t do, like “p-value does not measure the size of an effect or the importance of a result.’’

Among these slogans my favorite is

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone

What’s cute about this statement is that it assumes that everybody understands what “there is 5% chance that the studied hypothesis is true” and that the notion of P-value is the one that is difficult to understand. In fact, the opposite is true.

Probability is conceptually tricky. It’s meaning is somewhat clear in a situation of a repeated experiment: I more or less understand what it means that a coin has 50% chance to land on Heads. (Yes. Only more or less). But without going full subjective I have no idea what is the meaning of the probability that a given hypothesis (boys who eat pickles in kindergarten have higher SAT score than girls who play firefighters) is true. On the other hand, The meaning of the corresponding P-value relies only on the conceptually simpler notion of probabilities in a repeated experiment.

Why therefore do the committee members (rightly !) assume that people are comfortable with the difficult concept of probability that an hypothesis is true and are uncomfortable with the easy concept of p-value ? I think the reason is that unlike the word “p-value”, the word “probability” is a word that we use in everyday life, so most people feel they know what it means. Since they have never thought about it formally, they are not aware that they actually don’t.

So here is a modest proposal for preventing the misuse and misunderstanding of statistical inference: Instead of saying “this hypothesis holds with p-value 0.03” say “We are 97% confident that this hypothesis holds”. We all know what “confident” means right ?

I am not the right person to write about Lloyd Shapley. I think I only saw him once, in the first stony brook conference I attended. He reminded me of Doc Brown from Back to The Future, but I am not really sure why. Here are links to posts in The Economist and NYT following his death.

Shapley got the Nobel in 2012 and according to Robert Aumann deserved to get it right with Nash. Shapley himself however was not completely on board: “I consider myself a mathematician and the award is for economics. I never, never in my life took a course in economics.” If you are wondering what he means by “a mathematician” read the following quote, from the last paragraph of his stable matching paper with David Gale

The argument is carried out not in mathematical symbols but in ordinary English; there are no obscure or technical terms. Knowledge of calculus is not presupposed. In fact, one hardly needs to know how to count. Yet any mathematician will immediately recognize the argument as mathematical…

What, then, to raise the old question once more, is mathematics? The answer, it appears, is that any argument which is carried out with sufficient precision is mathematical

In the paper Gale and Shapley considered a problem of matching (or assignment as they called it) of applicants to colleges, where each applicant has his own preference over colleges and each college has its preference over applicants. Moreover, each college has a quota. Here is the definition of stability, taken from the original paper

Definition: An assignment of applicants to colleges will be called unstable if there are two applicants and who are assigned to colleges and , respectively, although prefers to and prefers to .

According to the Gale-Shapley algorithm, applicants apply to colleges sequentially following their preferences. A college with quota maintains a `waiting list’ of size with the top applicants that has applied to it so far, and rejects all other applicants. When an applicant is rejected from a college he applies to his next favorite college. Gale and Shapley proved that the algorithm terminates with a stable assignment.

One reason that the paper was so successful is that the Gale Shapley method is actually used in practice. (A famous example is the national resident program that assigns budding physicians to hospitals). From theoretical perspective my favorite follow-up is a paper of Dubins and Freedman “Machiavelli and the Gale-Shapley Algorithm” (1981): Suppose that some applicant, Machiavelli, decides to `cheat’ and apply to colleges in different order than his true ranking. Can Machiavelli improves his position in the assignment produced by the algorithm ? Dubins and Freedman prove that the answer to this question is no.

Shapley’s contribution to game theory is too vast to mention in a single post. Since I mainly want to say something about his mathematics let me mention Shapley-Folkman-Starr Lemma, a kind of discrete analogue of Lyapunov’s theorem on the range of non-atomic vector measures, and KKMS Lemma which I still don’t understand its meaning but it has something to do with fixed points and Yaron and I have used it in our paper about rental harmony.

I am going to talk in more details about stochasic games, introduced by Shapley in 1953, since this area has been flourishing recently with some really big developments. A (two-player, zero-sum) stochastic game is given by a finite set of states, finite set of actions for the players, a period payoff function , a distribution over for every state and actions , and a discount factor . At every period the system is at some state , players choose actions simultaneously and independently. Then the column player pays to the row player. The game then moves to a new state in the next period, randomized according to . Players evaluate their infinite stream of payoofs via the discount factor . The model is a generalization of the single player dynamic programming model which was studied by Blackwell and Bellman. Shapley proved that every zero-sum stochastic game admits a value, by imitating the familiar single player argument, which have been the joy and pride of macroeconomists ever since Lucas asset pricing model (think Bellman Equation and the contraction operators). Fink later proved using similar ideas that non-zero sum discounted stochastic games admit perfect markov equilibria.

A major question, following a similar question in the single player setup, is the limit behavior of the value and the optimal strategies when players become more patient (i.e., goes to ). Mertens and Neyman have proved that the limit exists, and moreover that for every there strategies which are -optimal for sufficiently large discount factor. Whether a similar result holds for Nash equilibrium in -player stochastic games is probably the most important open question in game theory. Another important question is whether the limit of the value exists for zero-sum games in which the state is not observed by both players. Bruno Zilloto has recently answered this question by providing a counter-example. I should probably warn that you need to know how to count and also some calculus to follow up this literature. Bruno Zilloto will give the Shapley Lecture in Games2016 in Maastricht. Congrats, Bruno ! and thanks to Shapley for leaving us with some much stuff to play with !

One of the highlight of last year’s Stony Brook conference was John Milnor’s talk about John Nash. The video is now available online. John Milnor, also an Abel Prize laureate, is familiar to many economists from his book “Topology from the differential approach”. He has known John Nash from the beginning of their careers in the common room in the math school at Princeton University. Watch him talk with clarity and humor about young Nash’s ambitions, creativity, courage, and contributions.

Here is a the handwritten letter which Nash wrote to the NSA in 1955 (pdf), fifteen years before Cook formalized the P/NP problem. In the letter Nash conjectures that for most encryption mechanisms, recovering the key from the cipher requires exponential amount of time. And here is what Nash had to say about proving this conjecture:

Here is the question from Ross’ book that I posted last week

Question 1We have two coins, a red one and a green one. When flipped, one lands heads with probability and the other with probability . Assume that . We do not know which coin is the coin. We initially attach probability to the red coin being the coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor . Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is . Every period we choose a coin to toss, get a reward and updated our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.

In the classical bandit problem there are arms and each arm provides a reward from an unknown distribution . Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The are randomized independently according to distributions , and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number (the index) to every belief about an arm. What is important is that the index of an arm depends only on — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let so that . In the easy problem there are again two coins but this time the red coin is with probability and with probability and, *independently*, the green coin is with probability and with probability . The easy problem is easy because it is a bandit problem. We have to keep track of beliefs and about the red coin and the green coin ( is the probability that the red coin is ), starting with and , and when we toss the red coin we update but keep fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is so the optimal strategy is to play red when and green when . In particular, the optimal action in the first period is red when and green when .

Now here comes the trick. Consider a general strategy that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by and the rewards that gives in the difficult and easy problems respectively. I claim that

Why is that ? in the easy problem there is a probability that both coins are . If this happens then every gives payoff . There is a probability that both coins are . If this happens then every gives payoff . And there is a probability that the coins are different, and, because of the choice of , conditionally on this event the probability of being is . Therefore, in this case gives whatever gives in the difficult problem.

So, the payoff in the easy problem is a linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every , the optimal action in the first period is red when and green with . Now back to the dynamic programming formulation, from standard arguments it follows that the optimal strategy is to keep doing it forever, i.e., at every period to toss the coin that is more likely to be the coin given the current information.

See why I said my solution is tricky and specific ? it relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2Let . We are given coins, one of each parameter, all possibilities equally likely. Each period we have to toss a coin and we get payoff for Heads. What is the optimal strategy ?

Here is the bonus question from the final exam in my dynamic optimization class of last semester. It is based on problem 8 Chapter II in Ross’ book *Introduction to stochastic dynamic programming*. It appears there as `guess the optimal policy’ without asking for proof. The question seems very natural, but I couldn’t find any information about it (nor apparently the students). I have a solution but it is tricky and too specific to this problem. I will describe my solution next week but perhaps somebody can show me a better solution or a reference ?

We have two coins, a red one and a green one. When flipped, one lands heads with probability *P1* and the other with probability *P2*. We do not know which coin is the *P1* coin. We initially attach probability *p* to the red coin being the *P1* coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.

Nicolas Copernicus’s de Revolutionibus, in which he advocated his theory that Earth revolved ar ound the sun, was first printed just before Copernicus’ death in 1543. It therefore fell on one Andreas Osiander to write the introduction. Here is a passage from the introduction:

[An astronomer] will adopt whatever suppositions enable [celestial] motions to be computed correctly from the principles of geometry for the future as well as for the past…. These hypotheses need not be true nor even probable. On the contrary, if they provide a calculus consistent with the observations, that alone is enough.

In other words, the purpose of the astronomer’s study is to capture the observed phenomena — to provide an analytic framework by which we can explain and predict what we see when we look at the sky. It turns out that it is more convenient to capture the phenomena by assuming that Earth revolved around the sun than by assuming, as the Greek astronomers did, a geo-centric epicyclical planet motion. Therefore let’s calculate the right time for Easter by making this assumption. As astronomers, we shouldn’t care whether this is actually true.

Whether or not Copernicus would have endorsed this approach is disputable. What is certain is that his book was at least initially accepted by the Catholic Church whose astronomers have used Copernicus’ model to develop the Gregorian Calendar. (Notice I said the word model btw, which is probably anachronistic but, I think, appropriately captures Osiander’s view). The person who caused the scandal was Galileo Galilei, who famously declared that if earth behaves as if it moves around the sun then, well, it moves around the sun. Yet it moves. It’s not a model, it’s reality. Physicists’ subject matter is the nature, not models about nature.

What about economists ? Econ theorists at least don’t usually claim that the components of their modeling of economic agents (think utilities, beliefs, discount factors, ambiguity aversions) correspond to any real elements of the physical world or of the cognitive process that the agent performs. When we say that Adam’s utility from apple is log(c) we don’t mean that Adam knows anything about logs. We mean — wait for it — that he behaves *as if* this is his utility, or, as Osiander would have put it, this utility provides a calculus consistent with the observations, and that alone is enough.

The contrast between theoretical economists’ `as if’ approach and physicists’ `and yet it moves’ approach is not as sharp as I would like it to be. First, from the physics side, modern interpretations of quantum physics view it, and by extension the entire physics enterprise, as nothing more than a computational tool to produce predictions. On the other hand, from the economics side, while I think it is still customary to pay lip service to the `as if’ orthodoxy at least in decision theory classes, I don’t often hear it in seminars. And when neuro-economists claim to localize the decision making process in the brain they seem to view the components of the model as more than just mathematical constructions.

Yep, I am advertising another paper. Stay tuned :)

## Recent Comments