
When explaining the meaning of a confidence interval, don’t say “the probability that the parameter is in the interval is 0.95”, because probability is a precious concept and this statement does not match the meaning of the term. Instead, say “We are 95% confident that the parameter is in the interval”. Admittedly, I don’t know what people will make of the word “confident”. But I also don’t know what they will make of the word “probability”.

If you live under the impression that in order to publish an empirical paper you must include the sentence “this holds with p-value x” for some number x<0.05, here is a surprising bit of news for you: the editors of Basic and Applied Social Psychology have banned p-values from their journal, along with confidence intervals. In fact, according to the editorial, the state of the art of statistics “remains uncertain”, so statistical inference is no longer welcome in their journal.

When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know about as much about statistics as I know about social psychology. But I hadn’t heard of this journal until yesterday, and if I had, I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.

Here is somebody who does have the right to complain: the American Statistical Association. Concerned with the misuse, mistrust and misunderstanding of the p-value, the ASA has recently issued a policy statement on p-values and statistical significance, intended for researchers who are not statisticians.

How do you explain p-values to practitioners who don’t care about things like the Neyman-Pearson lemma, independence, and UMP tests? First, you use language that obscures conceptual difficulties: “the probability that a statistical summary of the data would be equal to or more extreme than its observed value’’ — without saying what “more extreme’’ means. Second, you use warnings and slogans about what a p-value doesn’t mean or can’t do, like “p-value does not measure the size of an effect or the importance of a result.’’
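(Before getting to the slogans, a toy computation to make the quoted definition concrete. This is a sketch assuming scipy is available, and the numbers are made up: we observe 62 heads in 100 tosses and ask for the probability, under the hypothesis of a fair coin, of a result equal to or more extreme than the one observed. Here “more extreme” means at least as many heads.)

```python
from scipy.stats import binomtest

# Made-up data: 62 heads in 100 tosses. Null hypothesis: the coin is fair.
# The p-value is the probability, under the null, of a statistic
# "equal to or more extreme than its observed value" -- here, >= 62 heads.
result = binomtest(k=62, n=100, p=0.5, alternative="greater")
print(result.pvalue)  # roughly 0.01
```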

Among these slogans my favorite is

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

What’s cute about this statement is that it assumes that everybody understands what “there is 5% chance that the studied hypothesis is true” means, and that the notion of p-value is the one that is difficult to understand. In fact, the opposite is true.

Probability is conceptually tricky. Its meaning is somewhat clear in the situation of a repeated experiment: I more or less understand what it means that a coin has a 50% chance to land on Heads. (Yes, only more or less.) But without going full subjective I have no idea what is the meaning of the probability that a given hypothesis (boys who eat pickles in kindergarten have higher SAT scores than girls who play firefighters) is true. On the other hand, the meaning of the corresponding p-value relies only on the conceptually simpler notion of probabilities in a repeated experiment.

Why, therefore, do the committee members (rightly!) assume that people are comfortable with the difficult concept of the probability that a hypothesis is true and are uncomfortable with the easy concept of p-value? I think the reason is that, unlike the word “p-value”, the word “probability” is a word we use in everyday life, so most people feel they know what it means. Since they have never thought about it formally, they are not aware that they actually don’t.

So here is a modest proposal for preventing the misuse and misunderstanding of statistical inference: instead of saying “this hypothesis holds with p-value 0.03”, say “We are 97% confident that this hypothesis holds”. We all know what “confident” means, right?

I am not the right person to write about Lloyd Shapley. I think I only saw him once, at the first Stony Brook conference I attended. He reminded me of Doc Brown from Back to the Future, but I am not really sure why. Here are links to posts in The Economist and the NYT following his death.

Shapley got the Nobel in 2012 and, according to Robert Aumann, deserved to get it right alongside Nash. Shapley himself, however, was not completely on board: “I consider myself a mathematician and the award is for economics. I never, never in my life took a course in economics.” If you are wondering what he means by “a mathematician”, read the following quote from the last paragraph of his stable matching paper with David Gale.

The argument is carried out not in mathematical symbols but in ordinary English; there are no obscure or technical terms. Knowledge of calculus is not presupposed. In fact, one hardly needs to know how to count. Yet any mathematician will immediately recognize the argument as mathematical…

What, then, to raise the old question once more, is mathematics? The answer, it appears, is that any argument which is carried out with sufficient precision is mathematical.

In the paper, Gale and Shapley considered a problem of matching (or assignment, as they called it) of applicants to colleges, where each applicant has his own preference over colleges and each college has its preference over applicants. Moreover, each college has a quota. Here is the definition of stability, taken from the original paper:

Definition: An assignment of applicants to colleges will be called unstable if there are two applicants {\alpha} and {\beta} who are assigned to colleges {A} and {B}, respectively, although {\beta} prefers {A} to {B} and {A} prefers {\beta} to {\alpha}.
According to the Gale-Shapley algorithm, applicants apply to colleges sequentially, following their preferences. A college with quota {q} maintains a `waiting list’ of size {q} with the top {q} applicants that have applied to it so far, and rejects all other applicants. When an applicant is rejected from a college he applies to his next favorite college. Gale and Shapley proved that the algorithm terminates with a stable assignment.
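The algorithm is short enough to state in code. Here is a minimal Python sketch of the applicant-proposing version (the function and variable names are mine; it assumes every applicant ranks every college and that total quota is large enough that nobody exhausts his list):

```python
from collections import defaultdict

def gale_shapley(applicant_prefs, college_prefs, quotas):
    """Applicant-proposing deferred acceptance with college quotas.

    applicant_prefs: dict applicant -> list of colleges, most preferred first
    college_prefs:   dict college -> list of applicants, most preferred first
    quotas:          dict college -> quota q
    """
    # rank[c][a]: position of applicant a in college c's preference list
    rank = {c: {a: i for i, a in enumerate(prefs)}
            for c, prefs in college_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}  # next college index to try
    waiting = defaultdict(list)                    # college -> waiting list
    unassigned = list(applicant_prefs)

    while unassigned:
        a = unassigned.pop()
        c = applicant_prefs[a][next_choice[a]]     # apply to next favorite
        next_choice[a] += 1
        waiting[c].append(a)
        waiting[c].sort(key=lambda x: rank[c][x])  # keep the top q applicants
        if len(waiting[c]) > quotas[c]:
            unassigned.append(waiting[c].pop())    # reject the worst one

    return {a: c for c, lst in waiting.items() for a in lst}

# Toy instance: two colleges, three applicants, quotas 1 and 2.
applicants = {"alpha": ["A", "B"], "beta": ["A", "B"], "gamma": ["B", "A"]}
colleges = {"A": ["beta", "alpha", "gamma"], "B": ["alpha", "gamma", "beta"]}
print(gale_shapley(applicants, colleges, {"A": 1, "B": 2}))
# -> {'alpha': 'B', 'gamma': 'B', 'beta': 'A'}; no applicant and college
#    prefer each other to their assignments, so the assignment is stable.
```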

One reason the paper was so successful is that the Gale-Shapley method is actually used in practice. (A famous example is the national resident matching program that assigns budding physicians to hospitals.) From a theoretical perspective my favorite follow-up is a paper of Dubins and Freedman, “Machiavelli and the Gale-Shapley Algorithm” (1981): suppose that some applicant, Machiavelli, decides to `cheat’ and apply to colleges in a different order than his true ranking. Can Machiavelli improve his position in the assignment produced by the algorithm? Dubins and Freedman prove that the answer is no.

Shapley’s contribution to game theory is too vast to cover in a single post. Since I mainly want to say something about his mathematics, let me mention the Shapley-Folkman-Starr lemma, a kind of discrete analogue of Lyapunov’s theorem on the range of non-atomic vector measures, and the KKMS lemma, whose meaning I still don’t understand, but it has something to do with fixed points, and Yaron and I used it in our paper about rental harmony.

I am going to talk in more detail about stochastic games, introduced by Shapley in 1953, since this area has been flourishing recently with some really big developments. A (two-player, zero-sum) stochastic game is given by a finite set {Z} of states, finite sets of actions {A,B} for the players, a period payoff function {r:Z\times A\times B\rightarrow [0,1]}, a distribution {q(\cdot|z,a,b)} over {Z} for every state {z} and actions {a,b}, and a discount factor {0<\beta<1}. At every period the system is at some state {z\in Z}, and players choose actions {a,b} simultaneously and independently. Then the column player pays {r(z,a,b)} to the row player, and the game moves to a new state in the next period, randomized according to {q(\cdot|z,a,b)}. Players evaluate their infinite stream of payoffs via the discount factor {\beta}. The model is a generalization of the single-player dynamic programming model studied by Blackwell and Bellman. Shapley proved that every zero-sum stochastic game admits a value, by imitating the familiar single-player argument, which has been the joy and pride of macroeconomists ever since Lucas’ asset pricing model (think Bellman equation and contraction operators). Fink later proved, using similar ideas, that non-zero-sum discounted stochastic games admit perfect Markov equilibria.
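Shapley’s proof is constructive enough to run. Here is a rough Python sketch of the resulting value iteration (the notation and array layout are mine, not Shapley’s, and it assumes scipy’s linear-programming routine for solving the auxiliary zero-sum matrix game at each state):

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes)."""
    shift = M.min() - 1.0            # make all payoffs positive
    G = M - shift
    m, n = G.shape
    # Standard LP trick: minimize sum(x) s.t. G'x >= 1, x >= 0;
    # the value of the shifted game is then 1/sum(x).
    res = linprog(c=np.ones(m), A_ub=-G.T, b_ub=-np.ones(n),
                  bounds=[(0, None)] * m, method="highs")
    return 1.0 / res.fun + shift

def stochastic_game_value(r, q, beta, tol=1e-8):
    """Iterate the Shapley operator to its fixed point.

    r: shape (Z, A, B), stage payoffs r(z, a, b)
    q: shape (Z, A, B, Z), transition probabilities q(z' | z, a, b)
    """
    v = np.zeros(r.shape[0])
    while True:
        # At each state, solve the one-shot game r(z,.,.) + beta * E[v]
        v_new = np.array([matrix_game_value(r[z] + beta * q[z] @ v)
                          for z in range(r.shape[0])])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

Since the operator is a {\beta}-contraction in the sup norm, the loop converges geometrically, which is exactly Shapley’s fixed-point argument run on a computer.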

A major question, following a similar question in the single-player setup, is the limit behavior of the value and the optimal strategies as players become more patient (i.e., as {\beta} goes to {1}). Mertens and Neyman proved that the limit exists, and moreover that for every {\epsilon>0} there are strategies which are {\epsilon}-optimal for all sufficiently large discount factors. Whether a similar result holds for Nash equilibrium in {N}-player stochastic games is probably the most important open question in game theory. Another important question is whether the limit of the value exists for zero-sum games in which the state is not observed by both players. Bruno Ziliotto has recently answered this question by providing a counter-example. I should probably warn that you need to know how to count, and also some calculus, to follow this literature. Bruno Ziliotto will give the Shapley Lecture at GAMES 2016 in Maastricht. Congrats, Bruno! And thanks to Shapley for leaving us with so much stuff to play with!

One of the highlights of last year’s Stony Brook conference was John Milnor’s talk about John Nash. The video is now available online. John Milnor, also an Abel Prize laureate, is familiar to many economists from his book “Topology from the Differentiable Viewpoint”. He knew John Nash from the beginning of their careers in the common room of the math department at Princeton University. Watch him talk with clarity and humor about young Nash’s ambitions, creativity, courage, and contributions.
Here is the handwritten letter which Nash wrote to the NSA in 1955 (pdf), fifteen years before Cook formalized the P/NP problem. In the letter Nash conjectures that for most encryption mechanisms, recovering the key from the cipher requires an exponential amount of time. And here is what Nash had to say about proving this conjecture:

[Scanned excerpt from Nash’s handwritten letter.]

Here is the question from Ross’ book that I posted last week:

Question 1 We have two coins, a red one and a green one. When flipped, one lands heads with probability {P_1} and the other with probability {P_2}. Assume that {P_1>P_2}. We do not know which coin is the {P_1} coin. We initially attach probability {p} to the red coin being the {P_1} coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor {\beta}. Find the optimal policy.

This is a dynamic programming problem where the state is the belief that the red coin is the {P_1} coin. Every period we choose a coin to toss, get a reward, and update our state given the outcome. Before I give my solution, let me explain why we can’t immediately invoke uncle Gittins.

In the classical bandit problem there are {n} arms and each arm {i} provides a reward from an unknown distribution {\theta_i\in\Delta([0,1])}. Bandit problems are used to model tradeoffs between exploitation and exploration: Every period we either exploit an arm about whose distribution we already have a good idea or explore another arm. The {\theta_i} are randomized independently according to distributions {\mu_i\in \Delta(\Delta([0,1]))}, and what we are interested in is the expected discounted reward. The optimization problem has a remarkable solution: choose in every period the arm with the largest Gittins index. Then update your belief about that arm using Bayes’ rule. The Gittins index is a function which attaches a number {G(\mu)} (the index) to every belief {\mu} about an arm. What is important is that the index of an arm {i} depends only on {\mu_i} — our current belief about the distribution of the arm — not on our beliefs about the distribution of the other arms.

The independence assumption means that we only learn about the distribution of the arm we are using. This assumption is not satisfied in the red coin green coin problem: If we toss the red coin and get heads then the probability that the green coin is {P_1} decreases. Googling `multi-armed bandit’ with `dependent arms’ I got some papers which I haven’t looked at carefully but my superficial impression is that they would not help here.

Here is my solution. Call the problem I started with `the difficult problem’ and consider a variant which I call `the easy problem’. Let {r=p/(p+\sqrt{p(1-p)})}, so that {r^2/(1-r)^2=p/(1-p)}. In the easy problem there are again two coins, but this time the red coin is {P_1} with probability {r} and {P_2} with probability {1-r} and, independently, the green coin is {P_1} with probability {1-r} and {P_2} with probability {r}. The easy problem is easy because it is a bandit problem. We have to keep track of beliefs {p_r} and {p_g} about the red coin and the green coin ({p_r} is the probability that the red coin is {P_1}), starting with {p_r=r} and {p_g=1-r}, and when we toss the red coin we update {p_r} but keep {p_g} fixed. It is easy to see that the Gittins index of an arm is a monotone function of the belief that the arm is {P_1}, so the optimal strategy is to play red when {p_r\ge p_g} and green when {p_g\ge p_r}. In particular, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}.

Now here comes the trick. Consider a general strategy {g} that assigns an action (red or green) to every finite sequence of past actions and outcomes. Denote by {V_d(g)} and {V_e(g)} the expected payoffs that {g} gives in the difficult and easy problems respectively. I claim that

\displaystyle V_e(g)=r(1-r)\cdot P_1/(1-\beta)+r(1-r)\cdot P_2/(1-\beta)+\left(r^2+(1-r)^2\right)\cdot V_d(g).

Why is that? In the easy problem there is a probability {r(1-r)} that both coins are {P_1}. If this happens then every {g} gives payoff {P_1/(1-\beta)}. There is a probability {r(1-r)} that both coins are {P_2}. If this happens then every {g} gives payoff {P_2/(1-\beta)}. And there is a probability {r^2+(1-r)^2} that the coins are different, and, because of the choice of {r}, conditional on this event the probability of the red coin being {P_1} is {p}. Therefore, in this case {g} gives whatever {g} gives in the difficult problem.

So the payoff in the easy problem is an increasing linear function of the payoff in the difficult problem. Therefore the optimal strategy in the difficult problem is the same as the optimal strategy in the easy problem. In particular, we just proved that, for every {p}, the optimal action in the first period is red when {p\ge 1/2} and green when {p\le 1/2}. Going back to the dynamic programming formulation, it follows from standard arguments that the optimal strategy is to keep doing this forever, i.e., at every period to toss the coin that is more likely to be the {P_1} coin given the current information.
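None of this is needed for the proof, but the conclusion is easy to check numerically. Here is a crude sketch (made-up parameters, a grid of beliefs with interpolation) that runs value iteration directly on the difficult problem and recovers the `toss the currently more likely coin’ policy:

```python
import numpy as np

P1, P2, BETA = 0.7, 0.3, 0.9       # made-up parameters, P1 > P2
GRID = np.linspace(0.0, 1.0, 501)  # grid over the belief p = P(red is P1)

def head_prob(p, red):
    return p * P1 + (1 - p) * P2 if red else p * P2 + (1 - p) * P1

def update(p, red, heads):
    """Bayes update of P(red is P1) after one toss."""
    l1, l2 = (P1, P2) if red else (P2, P1)  # likelihoods if red is P1 / P2
    if not heads:
        l1, l2 = 1 - l1, 1 - l2
    return p * l1 / (p * l1 + (1 - p) * l2)

V = np.zeros_like(GRID)
for _ in range(300):               # value iteration (a contraction, so it converges)
    Q = []
    for red in (True, False):
        h = np.array([head_prob(p, red) for p in GRID])
        up = np.array([update(p, red, True) for p in GRID])
        down = np.array([update(p, red, False) for p in GRID])
        Q.append(h * (1 + BETA * np.interp(up, GRID, V))
                 + (1 - h) * BETA * np.interp(down, GRID, V))
    V = np.maximum(Q[0], Q[1])

# Beliefs at which tossing red is optimal: everything from ~1/2 upward
print(GRID[Q[0] >= Q[1]].min())    # ~0.5
```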

See why I said my solution is tricky and specific? It relies on the fact that there are only two arms (the fact that the arms are coins is not important). Here is a problem whose solution I don’t know:

Question 2 Let {0 \le P_1 < P_2 < ... < P_n \le 1}. We are given {n} coins, one for each parameter, with all {n!} assignments of parameters to coins equally likely. Each period we have to toss a coin, and we get payoff {1} for Heads. What is the optimal strategy?

Here is the bonus question from the final exam in my dynamic optimization class of last semester. It is based on Problem 8, Chapter II, of Ross’ book Introduction to Stochastic Dynamic Programming, where it appears as `guess the optimal policy’, without asking for a proof. The question seems very natural, but I couldn’t find any information about it (nor, apparently, could the students). I have a solution, but it is tricky and too specific to this problem. I will describe my solution next week, but perhaps somebody can show me a better solution or a reference?

We have two coins, a red one and a green one. When flipped, one lands heads with probability P1 and the other with probability P2. We do not know which coin is the P1 coin. We initially attach probability p to the red coin being the P1 coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.

Nicolaus Copernicus’ De Revolutionibus, in which he advocated his theory that Earth revolves around the sun, was first printed just before Copernicus’ death in 1543. It therefore fell to one Andreas Osiander to write the introduction. Here is a passage from the introduction:

[An astronomer] will adopt whatever suppositions enable [celestial] motions to be computed correctly from the principles of geometry for the future as well as for the past…. These hypotheses need not be true nor even probable. On the contrary, if they provide a calculus consistent with the observations, that alone is enough.

In other words, the purpose of the astronomer’s study is to capture the observed phenomena — to provide an analytic framework by which we can explain and predict what we see when we look at the sky. It turns out that it is more convenient to capture the phenomena by assuming that Earth revolves around the sun than by assuming, as the Greek astronomers did, geocentric epicyclic planetary motion. Therefore let’s calculate the right time for Easter by making this assumption. As astronomers, we shouldn’t care whether it is actually true.

Whether or not Copernicus would have endorsed this approach is disputable. What is certain is that his book was, at least initially, accepted by the Catholic Church, whose astronomers used Copernicus’ model to develop the Gregorian calendar. (Notice I said the word model, btw, which is probably anachronistic but, I think, appropriately captures Osiander’s view.) The person who caused the scandal was Galileo Galilei, who famously declared that if Earth behaves as if it moves around the sun then, well, it moves around the sun. Yet it moves. It’s not a model, it’s reality. Physicists’ subject matter is nature, not models of nature.

What about economists? Econ theorists, at least, don’t usually claim that the components of their modeling of economic agents (think utilities, beliefs, discount factors, ambiguity aversions) correspond to any real elements of the physical world or of the cognitive process that the agent performs. When we say that Adam’s utility from apples is log(c), we don’t mean that Adam knows anything about logs. We mean — wait for it — that he behaves as if this is his utility, or, as Osiander would have put it, this utility provides a calculus consistent with the observations, and that alone is enough.

The contrast between theoretical economists’ `as if’ approach and physicists’ `and yet it moves’ approach is not as sharp as I would like it to be. First, from the physics side, modern interpretations of quantum physics view it, and by extension the entire physics enterprise, as nothing more than a computational tool to produce predictions. On the other hand, from the economics side, while I think it is still customary to pay lip service to the `as if’ orthodoxy, at least in decision theory classes, I don’t often hear it in seminars. And when neuro-economists claim to localize the decision-making process in the brain, they seem to view the components of the model as more than just mathematical constructions.

Yep, I am advertising another paper. Stay tuned :)

This post describes the main theorem in my new paper with Nabil. Scroll down for open questions following this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that as time goes by the structural uncertainty dissipates. The standard example is when the sequence of outcomes is i.i.d. with an unknown parameter. As time goes by, the agent learns the parameter.

The formulation of `learning to make predictions’ goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubins’ definition in this post, but a weaker one, suggested by Kalai and Lehrer.

A Bayesian agent observes an infinite sequence of outcomes from a finite set {A}. Let {\mu\in\Delta(A^\mathbb{N})} represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by {\mu(\cdot|a_0,\dots,a_{n-1})} the element in {\Delta(A)} which represents the agent’s prediction about the outcome of day {n} just after he observed the outcomes {a_0,\dots,a_{n-1}} of previous days. In the following definition it is instructive to think about {\tilde\mu} as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.

Definition 1 (Kalai and Lehrer) Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0.
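To see the definition at work in the standard i.i.d. example, here is a quick simulation sketch (the parameter value is made up): under a uniform prior over an unknown Bernoulli parameter, the agent’s one-step predictions follow Laplace’s rule of succession, and along {\tilde\mu}-almost every realization of the true process they approach the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                      # the true i.i.d. parameter (the data generating process)
x = rng.random(10_000) < theta   # one realization of the outcome sequence

heads = np.cumsum(x)
n = np.arange(1, x.size + 1)
pred = (heads + 1) / (n + 2)     # next-day prediction under a uniform prior (Laplace's rule)

# The prediction error shrinks along the realization
print(np.abs(pred - theta)[[99, 999, 9_999]])
```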

Assume now that the agent’s belief {\mu} is stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Recall that in this decomposition {\theta} ranges over ergodic beliefs and {\lambda} represents structural uncertainty. Does the agent learn to make predictions? Using the definition of merging we can ask: does {\mu} merge with {\theta}? The answer, perhaps surprisingly, is no. I gave an example in my previous post.

Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.

Definition 2 Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} weakly merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty,n\in T}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0

for a set {T\subseteq \mathbb{N}} of periods of density {1}.

The definition of weak merging is natural: a patient agent whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of merging and also their relationship with Dawid’s idea of calibration.

I am now in a position to state the theorem I have been talking about for two months:

Theorem 3 Let {\mu\in\Delta(A^\mathbb{N})} be stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Then {\mu} weakly merges with {\theta} for {\lambda}-almost every {\theta}.

In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.

Finally, here are the promised open questions. They deal with the two qualifications in the theorem. The first question is about the “{\lambda}-almost every {\theta}” in the theorem. As Larry Wasserman mentioned, this is unsatisfactory in some senses. So,

Question 1 Does there exist a stationary {\mu} (equivalently, a belief {\lambda} over ergodic beliefs) such that {\mu} weakly merges with {\theta} for every ergodic distribution {\theta}?

The second question is about strengthening weak merging to merging. We already know that this cannot be done for an arbitrary belief {\lambda} over ergodic processes, but what if {\lambda} is concentrated on some natural family of processes, for example hidden Markov processes with a bounded number of hidden states? Here is the simplest setup for which I don’t know the answer.

Question 2 The outcome of the stock market on every day is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy, which can be either G or B (good or bad): when the hidden state is B the outcome is U with probability {q_B} (and D with probability {1-q_B}), and when the state is G the outcome is U with probability {q_G}. The hidden state changes according to a Markov process with transition probabilities {\rho(B|B)=1-\rho(G|B)=p_B} and {\rho(B|G)=1-\rho(G|G)=p_G}. The parameter is {(p_B,p_G,q_B,q_G)} and the agent has some prior {\lambda} over the parameter. Does the agent’s belief about outcomes merge with the truth for {\lambda}-almost every {(p_B,p_G,q_B,q_G)}?
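For concreteness, here is a sketch of the one-step predictions made by an agent who already knows the parameter, which is the benchmark the question compares against; the open question concerns the Bayesian agent who must also learn {(p_B,p_G,q_B,q_G)} from the prior {\lambda}. The timing convention below (update on today’s outcome, then transition) is one choice among several, and the code is mine, not from the paper.

```python
def one_step_predictions(outcomes, pB, pG, qB, qG, prior_B=0.5):
    """P(U) predicted each day under known parameters (pB, pG, qB, qG)."""
    b = prior_B                                  # current P(hidden state = B)
    preds = []
    for y in outcomes:                           # outcomes: iterable of 'U'/'D'
        preds.append(b * qB + (1 - b) * qG)      # today's predicted P(U)
        like_B = qB if y == 'U' else 1 - qB      # Bayes update on today's outcome
        like_G = qG if y == 'U' else 1 - qG
        b = b * like_B / (b * like_B + (1 - b) * like_G)
        b = b * pB + (1 - b) * pG                # transition to tomorrow's state
    return preds

print(one_step_predictions("UUDU", pB=0.9, pG=0.2, qB=0.6, qG=0.3))
```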

One more word about organ selling before I return to my comfort zone and talk about Brownian motion in Lie groups. Selling living human organs is repugnant, in part because the sellers cause damage to their bodies out of desperation. But what about allowing your relatives to sell what’s left of you when you’re gone? I think this should be uncontroversial. And there are side advantages too, in addition to increasing the number of transplantations. For example, it will encourage you to quit smoking.

Over to you, Walter.

Something funny happened when I started watching Al Roth’s lecture and looked at the paper: I realized that what I always assumed is the meaning of `repugnant transactions’ is not exactly the phenomenon that Roth talks about. What I thought `repugnant transaction’ means is a situation of `two rights make a wrong’: it’s totally awesome that Xanders is willing to donate his extra kidney to Zordiac, and it’s really nice of Zordiac to donate money to Xanders, but these two noble acts done together, in exchange for each other, are immoral and should be outlawed. Roth, however, defines `repugnant transaction’ more broadly, as any transaction that some people want to engage in and others don’t think they should.

Consider the opening example of his paper: laws against selling horse meat in restaurants. Here what is repugnant is not the exchange but the good itself. It’s not two rights making a wrong. It’s just wrong. We outlaw the exchange simply for constitutional reasons, or because it’s impossible to enforce a ban on eating — people would simply order takeaway and commit the crime of eating in their homes.

So let me offer a distinction between `repugnant exchanges’, where the good itself is fine but buying and selling it is repugnant, and `repugnant goods/services’, where the good or service itself is what is repugnant, even if for whatever reason what we actually outlaw is the transaction. Most of the examples that Roth gives fall into the `repugnant good/service’ category rather than the `repugnant exchange’ category. Such is the case of buying and selling recreational drugs, endangered species, and imported cultural property.

Are there any examples of `repugnant exchanges’ in addition to selling human organs? Well, there is `renting’ organs, as in surrogate motherhood. Anything else? An interesting example is lending money with interest, which used to be repugnant in the West (we got over it already): the very idea of lending money was never considered repugnant. What was repugnant was doing it for payment in the form of interest.

Finally, there is prostitution, which is illegal in the US. Repugnant service or repugnant exchange? It depends on your reasoning. Anti-prostitution laws have an unlikely coalition of supporters. There are the religious moralists, for whom the service (extra-marital sexual intercourse) is what makes the transaction repugnant. They go after prostitution just because that’s what they can outlaw in the US. (They go further in Iran.) But there are also feminists and liberals who view prostitution as exploitation, as I view selling human organs. They find the exchange repugnant even if they have no problem with the service itself.

Note that the example of prostitution shows the difficulty with the distinction I make between `repugnant good/service’ and `repugnant exchange’: it relies on unobservable reasoning. Just by knowing the laws and customs of a society, you don’t know to which category a forbidden transaction belongs. Moreover, since different people may have different reasonings, the category is sometimes not uniquely defined. But I still think it’s a useful distinction.
