You are currently browsing Eran’s articles.
It is not often that Terry Tao gets into politics in his blog, but, as political observers like to say, normal rules don’t apply this year. Tao writes that many of Trump’s supporters secretly believe that he is not even remotedly qualified for the presidency, but they continue to entertain this possibility because their fellow citizens and the media and politicians seem to be doing so. He suggests that more people should come out and reveal their secret beliefs.
I generally agree with Tao’ sentiment and argument, but I have a quibble. Tao describe the current situation as mutual knowledge without common knowledge. This, I think, is wrong. To get politics out of the way, let me explain my position using a similar situation which Tao also mentions: The Emperor’s new clothes. I have already come across people casting the Emperor’s story in terms of mutual knowledge without common knowledge, and I think it is also wrong. The way I understand the story, before the kid shouts, each of the Emperor’s subjects sees that the Emperor is naked, but after observing everybody else’s reaction, each subject updates her own initial belief and deduces that she was probably wrong. The subjects now don’t think that the Emperor is naked. Rather, each subjects thinks that her own eyes deceived her.
But when game theorists and logicians say that an assertion is mutual knowledge (or mutual belief) we mean that each of us, after taking into account our own information including what we deduce about other people’s information, think the assertion is true. In my reading of the Emperor’s new cloths story this is not the case.
For an assertion to be common knowledge, we need in addition that everybody knows that everybody knows that the assertion is true, and that everybody knows that everybody knows that everybody knows that the assertion is true, and onwards to infinity. A good example of a situation with mutual knowledge and no common knowledge is the blue-eyed islanders puzzle (using the story as it appears Terrence’ blog and a big spoiler ahead if you are not familiar with the puzzle): Before the foreigner makes an announcement, it is mutual knowledge that there are at least 99 blue-eyed islanders, but this fact is not common knowledge: If Alice and Bob are both blue-eyed then Alice, not knowing the color of her own eyes, thinks that Bob might observe only 98 blue-eyed islanders. In fact it is not even common knowledge that there are at least 98 blue-eyed Islanders, because Alice thinks that Bob might think that Craig might only observe 97 blue-eyed Islanders. By similar reasoning, before the foreigner’s announcement, it is not even common knowledge that there is at least one blue-eyed islander. Once the foreigner announces it, this fact becomes common knowledge.
No mutual knowledge and no common knowledge are two situations that can have different behavioral implications. Suppose that we offer each of the subjects the following private voting game: Is the emperor wearing clothes ? You have to answer yes or no. If you answer correctly you get a free ice cream sandwich, otherwise you get nothing. According to my reading of the story they will all give the wrong answer, and get nothing. On the other hand, suppose you offer a similar game to the islanders — even before the foreigner arrives — Do you think that there is at least one blue-eyed islander ? they will answer correctly.
There is an alternative reading of the Emperor’s story, according to which it is indeed a story about mutual knowledge without common knowledge: Even after observing the crowd’s reaction, each subject still knows that the Emperor is naked, but she keeps her mouth shut because she suspects that her fellow subjects don’t realize it and she doesn’t want to make a fool of herself. This reading strikes me as less psychologically interesting, but, more importantly, if that’s how you understand the story then there is nothing to worry about. All the subjects will vote correctly anyway and get the ice cream even without the little kid making it a common knowledge. And Trump will not be elected president even if people continue to keep their mouth shut.
When explaining the meaning of a confidence interval don’t say “the probability that the parameter is in the interval is 0.95” because probability is a precious concept and this statement does not match the meaning of this term. Instead, say “We are 95% confident that the parameter is in the interval”. Admittedly, I don’t know what people will make of the word “confident”. But I also don’t know what they will make of the word “probability”
If you live under the impression that in order to publish an empirical paper you must include the sentence “this holds with p-value x” for some number x<0.05 in your paper, here is a surprising bit of news for you: The editors of Basic and Applied Social Psychology have banned p-value from their journal, along with confidence intervals. In fact, according to the editorial, the state of the art of statistics “remains uncertain” so statistical inference is no longer welcome in their journal.
When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know about statistics as much as I know about social psychology. But I haven’t heard about this journal until yesterday, and if I did I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.
Here are somebodies who have the right to complain: The American Statistical Association. Concerned with the misuse, mistrust and misunderstanding of the p-value, ASA has recently issued a policy statement on p- values and statistical significance, intended for researchers who are not statisticians.
How do you explain p-value to practitioners who don’t care about things like Neyman-Pearson Lemma, independence and UMP tests ? First, you use language that obscures conceptual difficulties: “the probability that a statistical summary of the data would be equal to or more extreme than its observed value’’ — without saying what “more extreme’’ means. Second, you use warnings and slogans about what p-value doesn’t mean or can’t do, like “p-value does not measure the size of an effect or the importance of a result.’’
Among these slogans my favorite is
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
What’s cute about this statement is that it assumes that everybody understands what “there is 5% chance that the studied hypothesis is true” and that the notion of P-value is the one that is difficult to understand. In fact, the opposite is true.
Probability is conceptually tricky. It’s meaning is somewhat clear in a situation of a repeated experiment: I more or less understand what it means that a coin has 50% chance to land on Heads. (Yes. Only more or less). But without going full subjective I have no idea what is the meaning of the probability that a given hypothesis (boys who eat pickles in kindergarten have higher SAT score than girls who play firefighters) is true. On the other hand, The meaning of the corresponding P-value relies only on the conceptually simpler notion of probabilities in a repeated experiment.
Why therefore do the committee members (rightly !) assume that people are comfortable with the difficult concept of probability that an hypothesis is true and are uncomfortable with the easy concept of p-value ? I think the reason is that unlike the word “p-value”, the word “probability” is a word that we use in everyday life, so most people feel they know what it means. Since they have never thought about it formally, they are not aware that they actually don’t.
So here is a modest proposal for preventing the misuse and misunderstanding of statistical inference: Instead of saying “this hypothesis holds with p-value 0.03” say “We are 97% confident that this hypothesis holds”. We all know what “confident” means right ?
I am not the right person to write about Lloyd Shapley. I think I only saw him once, in the first stony brook conference I attended. He reminded me of Doc Brown from Back to The Future, but I am not really sure why. Here are links to posts in The Economist and NYT following his death.
Shapley got the Nobel in 2012 and according to Robert Aumann deserved to get it right with Nash. Shapley himself however was not completely on board: “I consider myself a mathematician and the award is for economics. I never, never in my life took a course in economics.” If you are wondering what he means by “a mathematician” read the following quote, from the last paragraph of his stable matching paper with David Gale
The argument is carried out not in mathematical symbols but in ordinary English; there are no obscure or technical terms. Knowledge of calculus is not presupposed. In fact, one hardly needs to know how to count. Yet any mathematician will immediately recognize the argument as mathematical…
What, then, to raise the old question once more, is mathematics? The answer, it appears, is that any argument which is carried out with sufficient precision is mathematical
In the paper Gale and Shapley considered a problem of matching (or assignment as they called it) of applicants to colleges, where each applicant has his own preference over colleges and each college has its preference over applicants. Moreover, each college has a quota. Here is the definition of stability, taken from the original paper
Definition: An assignment of applicants to colleges will be called unstable if there are two applicants and who are assigned to colleges and , respectively, although prefers to and prefers to .
According to the Gale-Shapley algorithm, applicants apply to colleges sequentially following their preferences. A college with quota maintains a `waiting list’ of size with the top applicants that has applied to it so far, and rejects all other applicants. When an applicant is rejected from a college he applies to his next favorite college. Gale and Shapley proved that the algorithm terminates with a stable assignment.
One reason that the paper was so successful is that the Gale Shapley method is actually used in practice. (A famous example is the national resident program that assigns budding physicians to hospitals). From theoretical perspective my favorite follow-up is a paper of Dubins and Freedman “Machiavelli and the Gale-Shapley Algorithm” (1981): Suppose that some applicant, Machiavelli, decides to `cheat’ and apply to colleges in different order than his true ranking. Can Machiavelli improves his position in the assignment produced by the algorithm ? Dubins and Freedman prove that the answer to this question is no.
Shapley’s contribution to game theory is too vast to mention in a single post. Since I mainly want to say something about his mathematics let me mention Shapley-Folkman-Starr Lemma, a kind of discrete analogue of Lyapunov’s theorem on the range of non-atomic vector measures, and KKMS Lemma which I still don’t understand its meaning but it has something to do with fixed points and Yaron and I have used it in our paper about rental harmony.
I am going to talk in more details about stochasic games, introduced by Shapley in 1953, since this area has been flourishing recently with some really big developments. A (two-player, zero-sum) stochastic game is given by a finite set of states, finite set of actions for the players, a period payoff function , a distribution over for every state and actions , and a discount factor . At every period the system is at some state , players choose actions simultaneously and independently. Then the column player pays to the row player. The game then moves to a new state in the next period, randomized according to . Players evaluate their infinite stream of payoofs via the discount factor . The model is a generalization of the single player dynamic programming model which was studied by Blackwell and Bellman. Shapley proved that every zero-sum stochastic game admits a value, by imitating the familiar single player argument, which have been the joy and pride of macroeconomists ever since Lucas asset pricing model (think Bellman Equation and the contraction operators). Fink later proved using similar ideas that non-zero sum discounted stochastic games admit perfect markov equilibria.
A major question, following a similar question in the single player setup, is the limit behavior of the value and the optimal strategies when players become more patient (i.e., goes to ). Mertens and Neyman have proved that the limit exists, and moreover that for every there strategies which are -optimal for sufficiently large discount factor. Whether a similar result holds for Nash equilibrium in -player stochastic games is probably the most important open question in game theory. Another important question is whether the limit of the value exists for zero-sum games in which the state is not observed by both players. Bruno Zilloto has recently answered this question by providing a counter-example. I should probably warn that you need to know how to count and also some calculus to follow up this literature. Bruno Zilloto will give the Shapley Lecture in Games2016 in Maastricht. Congrats, Bruno ! and thanks to Shapley for leaving us with some much stuff to play with !
One of the highlight of last year’s Stony Brook conference was John Milnor’s talk about John Nash. The video is now available online. John Milnor, also an Abel Prize laureate, is familiar to many economists from his book “Topology from the differential approach”. He has known John Nash from the beginning of their careers in the common room in the math school at Princeton University. Watch him talk with clarity and humor about young Nash’s ambitions, creativity, courage, and contributions.
Here is a the handwritten letter which Nash wrote to the NSA in 1955 (pdf), fifteen years before Cook formalized the P/NP problem. In the letter Nash conjectures that for most encryption mechanisms, recovering the key from the cipher requires exponential amount of time. And here is what Nash had to say about proving this conjecture:
Here is the question from Ross’ book that I posted last week
Question 1 We have two coins, a red one and a green one. When flipped, one lands heads with probability and the other with probability . Assume that . We do not know which coin is the coin. We initially attach probability to the red coin being the coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor . Find the optimal policy.
This is a dynamic programming problem where the state is the belief that the red coin is . Every period we choose a coin to toss, get a reward and updated our state given the outcome. Before I give my solution let me explain why we can’t immediately invoke uncle Gittins.
In the classical bandit problem there are arms and each arm provides a reward from an unknown distribution . Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The are randomized independently according to distributions , and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number (the index) to every belief about an arm. What is important is that the index of an arm depends only on — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.
The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.
Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let so that . In the easy problem there are again two coins but this time the red coin is with probability and with probability and, independently, the green coin is with probability and with probability . The easy problem is easy because it is a bandit problem. We have to keep track of beliefs and about the red coin and the green coin ( is the probability that the red coin is ), starting with and , and when we toss the red coin we update but keep fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is so the optimal strategy is to play red when and green when . In particular, the optimal action in the first period is red when and green when .
Now here comes the trick. Consider a general strategy that assigns to every finite sequence of past actions and outcomes an action (red or green). Denote by and the rewards that gives in the difficult and easy problems respectively. I claim that
Why is that ? in the easy problem there is a probability that both coins are . If this happens then every gives payoff . There is a probability that both coins are . If this happens then every gives payoff . And there is a probability that the coins are different, and, because of the choice of , conditionally on this event the probability of being is . Therefore, in this case gives whatever gives in the difficult problem.
So, the payoff in the easy problem is a linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every , the optimal action in the first period is red when and green with . Now back to the dynamic programming formulation, from standard arguments it follows that the optimal strategy is to keep doing it forever, i.e., at every period to toss the coin that is more likely to be the coin given the current information.
See why I said my solution is tricky and specific ? it relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:
Question 2 Let . We are given coins, one of each parameter, all possibilities equally likely. Each period we have to toss a coin and we get payoff for Heads. What is the optimal strategy ?
Here is the bonus question from the final exam in my dynamic optimization class of last semester. It is based on problem 8 Chapter II in Ross’ book Introduction to stochastic dynamic programming. It appears there as `guess the optimal policy’ without asking for proof. The question seems very natural, but I couldn’t find any information about it (nor apparently the students). I have a solution but it is tricky and too specific to this problem. I will describe my solution next week but perhaps somebody can show me a better solution or a reference ?
We have two coins, a red one and a green one. When flipped, one lands heads with probability P1 and the other with probability P2. We do not know which coin is the P1 coin. We initially attach probability p to the red coin being the P1 coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.
Nicolas Copernicus’s de Revolutionibus, in which he advocated his theory that Earth revolved ar ound the sun, was first printed just before Copernicus’ death in 1543. It therefore fell on one Andreas Osiander to write the introduction. Here is a passage from the introduction:
[An astronomer] will adopt whatever suppositions enable [celestial] motions to be computed correctly from the principles of geometry for the future as well as for the past…. These hypotheses need not be true nor even probable. On the contrary, if they provide a calculus consistent with the observations, that alone is enough.
In other words, the purpose of the astronomer’s study is to capture the observed phenomena — to provide an analytic framework by which we can explain and predict what we see when we look at the sky. It turns out that it is more convenient to capture the phenomena by assuming that Earth revolved around the sun than by assuming, as the Greek astronomers did, a geo-centric epicyclical planet motion. Therefore let’s calculate the right time for Easter by making this assumption. As astronomers, we shouldn’t care whether this is actually true.
Whether or not Copernicus would have endorsed this approach is disputable. What is certain is that his book was at least initially accepted by the Catholic Church whose astronomers have used Copernicus’ model to develop the Gregorian Calendar. (Notice I said the word model btw, which is probably anachronistic but, I think, appropriately captures Osiander’s view). The person who caused the scandal was Galileo Galilei, who famously declared that if earth behaves as if it moves around the sun then, well, it moves around the sun. Yet it moves. It’s not a model, it’s reality. Physicists’ subject matter is the nature, not models about nature.
What about economists ? Econ theorists at least don’t usually claim that the components of their modeling of economic agents (think utilities, beliefs, discount factors, ambiguity aversions) correspond to any real elements of the physical world or of the cognitive process that the agent performs. When we say that Adam’s utility from apple is log(c) we don’t mean that Adam knows anything about logs. We mean — wait for it — that he behaves as if this is his utility, or, as Osiander would have put it, this utility provides a calculus consistent with the observations, and that alone is enough.
The contrast between theoretical economists’ `as if’ approach and physicists’ `and yet it moves’ approach is not as sharp as I would like it to be. First, from the physics side, modern interpretations of quantum physics view it, and by extension the entire physics enterprise, as nothing more than a computational tool to produce predictions. On the other hand, from the economics side, while I think it is still customary to pay lip service to the `as if’ orthodoxy at least in decision theory classes, I don’t often hear it in seminars. And when neuro-economists claim to localize the decision making process in the brain they seem to view the components of the model as more than just mathematical constructions.
Yep, I am advertising another paper. Stay tuned :)
This post describes the main theorem in my new paper with Nabil. Scroll down for open questions following this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that the as time goes by structural uncertainty dissipates. The standard example is when the sequence of outcomes is i.i.d. with an unknown parameter. As times goes by the agent learns the parameter.
The formulation of `learning to make predictions’ goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubin’s definition in this post but a weaker definition, suggested by Kalai and Lehrer.
A Bayesian agent observes an infinite sequence of outcomes from a finite set . Let represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by the element in which represents the agent’s prediction about the outcome of day just after he observed the outcomes of previous days. In the following definition it is instructive to think about as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.
Definition 1 (Kalai and Lehrer) Let . Then merges with if for -almost every realization it holds that
Assume now that the agent’s belief is stationary, and let be its ergodic decomposition. Recall that in this decomposition ranges over ergodic beliefs and represents structural uncertainty. Does the agent learn to make predictions ? Using the definition of merging we can ask, does merges with ? The answer, perhaps surprisingly, is no. I gave an example in my previous post.
Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.
Definition 2 Let . Then weakly merges with if -almost every realization it holds that
for a set of periods of density .
The definition of weak merging is natural: patient agents whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of mergings and also their relationship with Dawid’s idea of calibration.
I am now in a position to state the theorem I have been talking about for two months:
Theorem 3 Let be stationary, and let be its ergodic decomposition. Then weakly merges with for -almost every .
In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.
Finally, here are the promised open questions. They deal with the two qualification in the theorem. The first question is about the “-almost every ” in the theorem. As Larry Wasserman mentioned this is unsatisfactory in some senses. So,
Question 1 Does there exists a stationary (equivalently a belief over ergodic beliefs) such that weakly merges with for every ergodic distribution ?
The second question is about strengthening weak merging to merging. We already know that this cannot be done for arbitrary belief over ergodic processes, but what if is concentrated on some natural family of processes, for example hidden markov processes with a bounded number of hidden states ? Here is the simplest setup for which I don’t know the answer.
Question 2 The outcome of the stock market at every day is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy which can be either G or B (good or bad): When the hidden state is B the outcome is U with probability (and D with probability ), and when the state is G the outcome is U with probability . The hidden state changes according to a markov process with transition probability , . The parameter is and the agent has some prior over the parameter. Does the agent’s belief about outcomes merge with the truth for -almost every ?.
One more word about organ selling before I return to my comfort zone and talk about Brownian motion in Lie groups. Selling living human organs is repugnant, in part because the sellers cause damage to their bodies out of desperation. But what about allowing your relatives to sell what’s left of you when you’re gone ? I think this should be uncontroversial. And there are side advantages too, in addition to increasing the number of transplantations. For example, it will encourage you to quit smoking.
Over to you, Walter.