You are currently browsing the category archive for the ‘Probability’ category.

When discussing the allocation of indivisible objects, I point to randomization as a method. To emphasize it is not merely a theoretical nicety, but is used to allocate objects that are of high value I give the example of suicide lotteries. I first came across them in Marcus Clarke’s `For The Term of His Natural Life’. It counts as the first Australian novel. Its hero, an Englishman, Rufus Dawes is transported to Australia for a crime he did not commit. In the course of his tribulations, Dawes is shipped to a penal settlement on Norfolk Island, 800 miles east of Australia; even a prison needs a further prison. Robert Hughes, in his heart rending and eloquent account of the founding of Australia, called the `Fatal Shore’, describes Norfolk island in these words:

`….a place of breathtaking barbarity……. On Norfolk Island an Irishman named William Riley received 100 lashes for ”Singing a Song” (no doubt a rebel one) and 50 for asking a warder for a chew of tobacco. Deranged by cruelty and misery, some men would opt for a lifetime at the bottom of the carceral heap by blinding themselves; thus, they reasoned, they would be left alone.’

It is in this portion of his book, that Hughes recalls an eyewitness account of a suicide lottery of the type mentioned in Clarke’s novel. Here is Clarke’s succinct description of it:

The scheme of escape hit upon by the convict intellect was simply this. Three men being together, lots were drawn to determine whom should be murdered. The drawer of the longest straw was the “lucky” man. He was killed. The drawer of the next longest straw was the murderer. He was hanged. The unlucky one was the witness. He had, of course, an excellent chance of being hung also, but his doom was not so certain, and he therefore looked upon himself as unfortunate.

Clarke and Hughes deviate slightly upon the precise incentives that would drive participation in the scheme. As many of the convicts on Norfolk island were Irish, the scheme was concocted as a way to to circumvent the Catholic prohibition on suicide. Hughes suggests that, after the murder, witness and culprit would be shipped back to the mainland for trial. Conditions there were better, so for both there was brief respite and a greater opportunity for escape.

Its an arresting story, that one is loath to give up. But, one is compelled to ask, is it true? If yes, was it common? Tim Causer of King’s College London went back to look at the records and says the answers are `maybe’ and `no’. Here is his summing up:

`Capital offences committed with apparent suicidal intent are an important part of Norfolk Island’s history, but they need to be understood more fully. It should be recognised just how rare they were, that ‘suicide lotteries’ are embellishments upon actual cases of state-assisted suicide and repeating the myth only reinforces the sensationalised interpretation of Norfolk Island’s history.’

You can find the full account here.

Two articles in the March 3rd, 2015 issue of the NY Times. One by the columnist Nocera marvels at Buffet’s success and concludes that it must be due to Buffet’s genius. The second, in the business section of the same, summarizes Buffet’s annual letter that attempts to explain his success. As usual, neither considers the possibility that luck may have a role in Buffet’s success. Buffet may indeed be a Vulcan, but based on the data alone one cannot reject the possibility that luck may explain Buffet’s record. I won’t repeat the argument but will point to this paper by my colleagues (Foster & Stine) that does so.

Finally, if too late. It was announced October 3rd. For more on Blackwell see this post or the special GEB issue.




This post describes the main theorem in my new paper with Nabil. Scroll down for open questions following this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that the as time goes by structural uncertainty dissipates. The standard example is when the sequence of outcomes is i.i.d. with an unknown parameter. As times goes by the agent learns the parameter.

The formulation of `learning to make predictions’ goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubin’s definition in this post but a weaker definition, suggested by Kalai and Lehrer.

A Bayesian agent observes an infinite sequence of outcomes from a finite set {A}. Let {\mu\in\Delta(A^\mathbb{N})} represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by {\mu(\cdot|a_0,\dots,a_{n-1})} the element in {\Delta(A)} which represents the agent’s prediction about the outcome of day {n} just after he observed the outcomes {a_0,\dots,a_{n-1}} of previous days. In the following definition it is instructive to think about {\tilde\mu} as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.

Definition 1 (Kalai and Lehrer) Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0.

Assume now that the agent’s belief {\mu} is stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Recall that in this decomposition {\theta} ranges over ergodic beliefs and {\lambda} represents structural uncertainty. Does the agent learn to make predictions ? Using the definition of merging we can ask, does {\mu} merges with {\theta} ? The answer, perhaps surprisingly, is no. I gave an example in my previous post.

Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.

Definition 2 Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} weakly merges with {\tilde\mu} if {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty,n\in T}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0

for a set {T\subseteq \mathbb{N}} of periods of density {1}.

The definition of weak merging is natural: patient agents whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of mergings and also their relationship with Dawid’s idea of calibration.

I am now in a position to state the theorem I have been talking about for two months:

Theorem 3 Let {\mu\in\Delta(A^\mathbb{N})} be stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Then {\mu} weakly merges with {\theta} for {\lambda}-almost every {\theta}.

In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.

Finally, here are the promised open questions. They deal with the two qualification in the theorem. The first question is about the “{\lambda}-almost every {\theta}” in the theorem. As Larry Wasserman mentioned this is unsatisfactory in some senses. So,

Question 1 Does there exists a stationary {\mu} (equivalently a belief {\lambda} over ergodic beliefs) such that {\mu} weakly merges with {\theta} for every ergodic distribution {\theta} ?

The second question is about strengthening weak merging to merging. We already know that this cannot be done for arbitrary belief {\lambda} over ergodic processes, but what if {\lambda} is concentrated on some natural family of processes, for example hidden markov processes with a bounded number of hidden states ? Here is the simplest setup for which I don’t know the answer.

Question 2 The outcome of the stock market at every day is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy which can be either G or B (good or bad): When the hidden state is B the outcome is U with probability {q_B} (and D with probability {1-q_B}), and when the state is G the outcome is U with probability {q_G}. The hidden state changes according to a markov process with transition probability {\rho(B|B)=1-\rho(G|B)=p_B}, {\rho(B|G)=1-\rho(G|G)=p_G}. The parameter is {(p_B,p_G,q_B,q_G)} and the agent has some prior {\lambda} over the parameter. Does the agent’s belief about outcomes merge with the truth for {\lambda}-almost every {(p_B,p_G,q_B,q_G)} ?.

A Bayesian agent is observing a sequence of outcomes in {\{S,F\}}. The agent does not know the outcomes in advance, so he forms some belief {\mu} over sequences of outcomes. Suppose that the agent believes that the number {d} of successes in {k} consecutive outcomes is distributed uniformly in {\{0,1,\dots k\}} and that all configuration with {d} successes are equally likely:

\displaystyle \mu\left(a_0,a_1,\dots,a_{k-1} \right)=\frac{1}{(k+1)\cdot {\binom{k}{d}}}

for every {a_0,a_1,\dots,a_{k-1}\in \{S,F\}} where {d=\#\{0\le i<k|a_i=S\}}.

You have seen this belief {\mu} already though maybe not in this form. It is a belief of an agent who tosses an i.i.d. coin and has some uncertainty over the parameter of the coin, given by a uniform distribution over {[0,1]}.

In this post I am gonna make a fuss about the fact that as time goes by the agent learns the parameter of the coin. The word `learning’ has several legitimate formalizations and today I am talking about the oldest and probably the most important one — consistency of posterior beliefs. My focus is somewhat different from that of textbooks because 1) As in the first paragraph, my starting point is the belief {\mu} about outcome sequences, before there are any parameters and 2) I emphasize some aspects of consistency which are unsatisfactory in the sense that they don’t really capture our intuition about learning. Of course this is all part of the grand marketing campaign for my paper with Nabil, which uses a different notion of learning, so this discussion of consistency is a bit of a sidetrack. But I have already came across some VIP who i suspect was unaware of the distinction between different formulations of learning, and it wasn’t easy to treat his cocky blabbering in a respectful way. So it’s better to start with the basics.

Let {A} be a finite set of outcomes. Let {\mu\in\Delta(A^\mathbb{N})} be a belief over the set {A^\mathbb{N}} of infinite sequences of outcomes, also called realizations. A decomposition of {\mu} is given by a set {\Theta} of parameters, a belief {\lambda} over {\Theta}, and, for every {\theta\in\Theta} a belief {\mu_\theta} over {A^\mathbb{N}} such that {\mu=\int \mu_\theta~\lambda(\mathrm{d}\theta)}. The integral in the definition means that the agent can think about the process as a two stages randomization: First a parameter {\theta} is drawn according to {\lambda} and then a realization {\omega} is drawn according to {\mu_\theta}. Thus, a decomposition captures a certain way in which a Bayesian agent arranges his belief. Of course every belief {\mu} admits many decompositions. The extreme decompositions are:

  • The Trivial Decomposition. Take {\Theta=\{\bar\theta\}} and {\mu_{\bar\theta}=\mu}.
  • Dirac’s Decomposition. Take {\Theta=A^\mathbb{N}} and {\lambda=\mu}. A “parameter” in this case is a measure {\delta_\omega} that assigns probability 1 to the realization {\omega}.

Not all decompositions are equally exciting. We are looking for decompositions in which the parameter {\theta} captures some `fundamental property’ of the process. The two extreme cases mentioned above are usually unsatisfactory in this sense. In Dirac’s decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter and thus cannot discriminate between different interesting properties. For stationary process, there is a natural decomposition in which the parameters distinguish between fundamental properties of the process. This is the ergodic decomposition, according to which the parameters are the ergodic beliefs. Recall that in this decomposition, a parameter captures the empirical distribution of blocks of outcomes in the infinite realization.

So what about learning ? While observing a process, a Bayesian agent will update his belief about the parameter. We denote by {\lambda_n\left(a_0,\dots,a_{n-1}\right)\in\Delta(\Theta)} the posterior belief about the parameter {\theta} at the beginning of period {n} after observing the outcome sequence {a_0,\dots,a_{n-1}}. The notion of learning I want to talk about in this post is that this belief converges to a belief that is concentrated on the true parameter {\theta}. The example you should have in mind is the coin toss example I started with: while observing the outcomes of the coin the agent becomes more and more certain about the true parameter of the coin, which means his posterior belief becomes concentrated around a belief that gives probability {1} to the true parameter.

Definition 1 A decomposition of {\mu} is consistent if for {\lambda}-almost every {\theta} it holds that

\displaystyle \lambda_n\left(a_0,\dots,a_{n-1}\right)\xrightarrow[n\rightarrow\infty]{w}\delta_\theta

for {\mu_\theta}-almost every realization {\omega=(a_0,a_1,\dots)}.

In this definition, {\delta_\theta} is Dirac atomic measure on {\theta} and the convergence is weak convergence of probability measures. No big deal if you don’t know what it means since it is exactly what you expect.

So, we have a notion of learning, and a seminal theorem of L.J. Doob (more on that below) implies that the ergodic decomposition of every stationary process is consistent. While this is not something that you will read in most textbooks (more on that below too), it is still well known. Why do Nabil and I dig further into the issue of learnability of the ergodic decomposition ? Two reasons. First, one has to write papers. Second, there is something unsatisfactory about the concept of consistency as a formalization of learning. To see why, consider the belief {\mu} that outcomes are i.i.d. with probability {1/2} for success. This belief is ergodic, so from the perspective of the ergodic decomposition the agent `knows the process’ and there is nothing else to learn. But let’s look at Dirac’s decomposition instead of the ergodic decomposition. Then the parameter space equals the space of all realizations. Suppose the true parameter (=realization) is {\omega^\ast=(a_0,a_1,\dots)}, then after observing the first {n} outcomes of the process the agent’s posterior belief about the parameter is concentrated on all {\omega} that agrees with {\omega^\ast} on the first {n} coordinates. These posterior beliefs converge to {\delta_{\omega^\ast}}, so that Dirac decomposition is also consistent ! We may say that we learn the parameter, but “learning the parameter” in this environment is just recording the past. The agent does not gain any new insight about the future of the process from learning the parameter.

In my next post I will talk about other notions of learning, originating in a seminal paper of Blackwell and Dubins, which capture the idea that an agent who learns a parameter can make predictions as if he new the parameter. Let me also say here that this post and the following ones are much influenced by a paper of Jackson, Kalai, Smorodinsky. I will talk more on that paper in another post.

For the rest of this post I am going to make some comments about Bayesian consistency which, though again standard, I don’t usually see them in textbooks. Especially I don’t know of a reference for the version of Doob’s Theorem which I give below, so if any reader can give me such a reference it will be helpful.

First, you may wonder whether every decomposition is consistent. The answer is no. For a trivial example, take a situation where {\mu_\theta} are the same for every {\theta}. More generally troubles arrise when the realization does not pin down the parameter. Formally, let us say that function {f:\Omega\rightarrow \Theta} pins down or identifies the parameter if

\displaystyle \mu_\theta\left(\{\omega: f(\omega)=\theta\}\right)=1

for {\lambda}-almost every {\theta}. If such {f} exists then the decomposition is identifiable.

We have the following

Theorem 2 (Doob’s Theorem) A decomposition is identifiable if and only if it is consistent.

The `if’ part follows immediately from the definitions. The `only if’ part is deep, but not difficult: it follows immediately from the martingale convergence theorem. Indeed, Doob’s Theorem is usually cited as the first application of martingales theory.

Statisticians rarely work with this abstract formulation of decomposition which I use. For this reason, the theorem is usually formulated only for the case that {\Theta=\Delta(A)} and {\mu_\theta} is i.i.d. {\theta}. In this case the fact that the decomposition is identifiable follows from the strong law of large numbers. Doob’s Theorem then implies the standard consistency of Bayesian estimator of the parameter {\theta}.

Four agents are observing infinite streams of outcomes in {\{S,F\}}. None of them knows the future outcomes and as good Bayesianists they represent their beliefs about unknowns as probability distributions:

  • Agent 1 believes that outcomes are i.i.d. with probability {1/2} of success.
  • Agent 2 believes that outcomes are i.i.d. with probability {\theta} of success. She does not know {\theta}; She believes that {\theta} is either {2/3} or {1/3}, and attaches probability {1/2} to each possibility.
  • Agent 3 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability {3/4}.
  • Agent 4 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability {\theta}. She does not know {\theta}; Her belief about {\theta} is the uniform distribution over {[0,1]}.

I denote by {\mu_1,\dots,\mu_4\in\Delta\left(\{S,F\}^\mathbb{N}\right)} the agents’ beliefs about future outcomes.

We have an intuition that Agents 2 and 4 are in a different situations from Agents 1 and 3, in the sense that are uncertain about some fundamental properties of the stochastic process they are facing. I will say that they have `structural uncertainty’. The purpose of this post is to formalize this intuition. More explicitly, I am looking for a property of a belief {\mu} over {\Omega} that will distinguish between beliefs that reflect some structural uncertainty and beliefs that don’t. This property is ergodicity.



Definition 1 Let {\zeta_0,\zeta_1,\dots} be a stationary process with values in some finite set {A} of outcomes. The process is ergodic if for every block {\bar a=(a_0,\dots,a_{k})} of outcomes it holds that

\displaystyle \begin{array}{rcl} \lim_{n\rightarrow\infty}&\frac{\#\left\{0\le t < n|\zeta_t=a_0,\dots,\zeta_{t+k}=a_{k}\right\}}{n}=\\&\mathop{\mathbb P}(\zeta_0=a_0,\dots,\zeta_{k}=a_{k})~\text{a.s}.\end{array}

A belief {\mu\in \Delta(A^\mathbb{N})} is ergodic if it is the distribution of an ergodic process

Before I explain the definition let me write the ergodicity condition for the special case of the block {\bar a=(a)} for some {a\in A} (this is a block of size 1):

\displaystyle \lim_{n\rightarrow\infty}\frac{\#\left\{0\le t < n|\zeta_t=a\right\}}{n}=\mathop{\mathbb P}(\zeta_0=a)~\text{a.s}.\ \ \ \ \ (1)


In the right side of (1) we have the (subjective) probability that on day {0} we will see the outcome {a}. Because of stationarity this is also the belief that we will see the outcome {a} on every other day. In the left side of (1) we have no probabilities at all. What is written there is the frequency of appearances of the outcome {a} in the realized sequence. This frequency is objective and has nothing to do with our beliefs. Therefore, the probabilities that a Bayesian agent with ergodic belief attaches to observing some outcome is a number that can be measured from the process: just observe it long enough and check the frequency in which this outcome appears. In a way, for ergodic processes the frequentist and subjective interpretations of probability coincide, but there are legitimate caveats to this statement, which I am not gonna delve into because my subject matter is not the meaning of probability. For my purpose it’s enough that ergodicity captures the intuition we have about the four agents I started with: Agents 1 and 3 both give probability {1/2} to success in each day. This means that if they are sold a lottery ticket that gives a prize if there is a success at day, say, 172, they both price this lottery ticket the same way. However, Agent 1 is certain that in the long run the frequency of success will be {1/2}. Agent 2 is certain that it will be either {2/3} or {1/3}. In fancy words, {\mu_1} is ergodic and {\mu_2} is not.

So, ergodic processes capture our intuition of `processes without structural uncertainty’. What about situations with uncertainty ? What mathematical creature captures this uncertainty ? Agent 2’s uncertainty seems to be captured by some probability distribution over two ergodic processes — the process “i.i.d. {2/3}” and the process “i.i.d. {1/3}”. Agent 2 is uncertain which of these processes he is facing. Agent 4’s uncertainty is captured by some probability distribution over a continuum of markov (ergodic) processes. This is a general phenomena:

Theorem 2 (The ergodic decomposition theorem) Let {\mathcal{E}} be the set of ergodic distributions over {A^\mathbb{N}}. Then for every stationary belief {\mu\in\Delta(A^\mathbb{N})} there exists a unique distribution {\lambda} over {\mathcal{E}} such that {\mu=\int \theta~\lambda(\text{d}\theta)}.

The probability distribution {\lambda} captures uncertainty about the structure of the process. In the case that {\mu} is an ergodic processes {\lambda} is degenerated and there is no structural uncertainty.

Two words of caution: First, my definition of ergodic processes is not the one you will see in textbooks. The equivalence to the textbook definition is an immediate consequence of the so called ergodic theorem, which is a generalization of the law of large numbers for ergodic processes. Second, my use of the word `uncertainty’ is not universally accepted. The term traces back at least to Frank Knight, who made the distinction between risk or “measurable uncertainty” and what is now called “Knightian uncertainty” which cannot be measured. Since Knight wrote in English and not in Mathematish I don’t know what he meant, but modern decision theorists, mesmerized by the Ellsberg Paradox, usually interpret risk as a Bayesian situation and Knightian uncertainty, or “ambiguity”, as a situation which falls outside the Bayesian paradigm. So if I understand correctly they will view the situations of these four agents mentioned above as situations of risk only without uncertainty. The way in which I use “structural uncertainty” was used in several theory papers. See this paper of Jonathan and Nabil. And this and the paper which I am advertising in these posts, about disappearance of uncertainty over time. (I am sure there are more.)

To be continued…

The abstract of a 2005 paper by Itti and Baldi begins with these words:

The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian definition of surprise that is the only consistent formulation under minimal axiomatic assumptions.

They propose that surprise be measured by the Kullback-Liebler divergence between the prior and the posterior. As with many good ideas, Itti and Baldi are not the first to propose this. C. L. Martin and G. Meeden did so in 1984 in an unpublished paper entitled: `The distance between the prior and the posterior distributions as a measure of surprise.’ Itti and Baldi go further and provide experimental support that this notion of surprise comports with human notions of surprise. Recently, Ely, Frankel and Kamenica in Economics, have also considered the issue of surprise, focusing instead on how best to release information so as to maximize interest.

Surprise now being defined, one might go on to define novelty, interestingness, beauty and humor. Indeed, Jurgen Schmidhuber has done just that (and more). A paper on the optimal design of jokes cannot be far behind. Odd as this may seem, it is a part of a venerable tradition. Kant defined humor as the sudden transformation of a strained expectation into nothing. Birkhoff himself wrote an entire treatise on Aesthetic Measure (see the review by Garabedian). But, I digress.

Returning to the subject of surprise, the Kulback-Liebler divergence is not the first measure of surprise or even the most wide spread. I think that prize goes to the venerable p-value. Orthodox Bayesians, those who tremble in the sight of measure zero events, look in horror upon the p-value because it does not require one to articulate a model of the alternative. Even they would own, I think, to the convenience of having to avoid listing all alternative models and carefully evaluating them. Indeed I. J. Good  writing in 1981 notes the following:

The evolutionary value of surprise is that it causes us to check our assumptions. Hence if an experiment gives rise to a surprising result given some null hypothesis H it might cause us to wonder whether H is true even in the absence of a vague alternative to H.

Good, by the way, described himself as a cross between Bayesian and Frequentist, called a Doogian. One can tell from this label that he had an irrepressible sense of humor. Born Isadore Guldak Joseph of a Polish family in London, he changed his name to Ian Jack Good, close enough one supposes. At Bletchley park he and Turing came up with the scheme that eventually broke the German Navy’s enigma code. This led to the Good-Turing estimator. Imagine a sequence so symbols chosen from a finite alphabet. How would one estimate the probability of observing a letter from the alphabet that has not yet appeared in the sequence thus far? But, I digress.

Warren Weaver was, I think, the first to propose a measure of surpirse. Weaver is most well known as a popularizer of Science. Some may recall him as the Weaver on the slim volume by Shannon and Weaver on the Mathematical Theory of Communication. Well before that, Weaver played an important role at the Rockefeller foundation, where he used their resources to provide fellowships to many promising scholars and jump start molecular biology. The following is from page 238 of my edition Jonas’ book `The Circuit Riders’:

Given the unreliability of such sources, the conscientious philanthropoid has no choice but to become a circuit rider. To do it right, a circuit rider must be more than a scientifically literate ‘tape recorder on legs.’ In order to win the confidence of their informants, circuit riders for Weaver’s Division of Natural Sciences were called upon the offer a high level of ‘intellectual companionship – without becoming ‘too chummy’ with people whose work they had, ultimately, to judge.

But, I digress.

To define Weaver’s notion, suppose a discrete random variable X that takes values in the set \{1, 2, \ldots, m\}. Let p_i be the probability that X = i. The surprise index of outcome k is \frac{\sum_i i p^2_i}{p_k}. Good himself jumped into the fray with some generalizations of Weaver’s index. Here is one \frac{[\sum_iip_i^t]^{1/t}}{p_k}. Others involve the use of logs, leading to measures that are related to notions of entropy as well probability scoring rules. Good also proposed axioms that a good measure to satisfy, but I cannot recall if anyone followed up to derive axiomatic characterizations.

G. L. S. Shackle, who would count as one of the earliest decision theorists, also got into the act. Shackle departed from subjective probability and proposed to order degrees of beliefs by their potential degrees of surprise. Shackle also proposed, I think, that an action be judged interesting by its best possible payoff and its potential for surprise. Shackle, has already passed beyond the ken of men. One can get a sense of his style and vigor from the following response to an invitation to write a piece on Rational Expectations:

Rational expectations’ remains for me a sort of monster living in a cave. I have never ventured into the cave to see what he is like, but I am always uneasily aware that he may come out and eat me. If you will allow me to stir the cauldron of mixed metaphors with a real flourish, I shall suggest that ‘rational expectations’ is neo-classical theory clutching at the last straw. Observable circumstances offer us suggestions as to what may be the sequel of this act or that one. How can we know what invisible circumstances may take effect in time-to come, of which no hint can now be gained? I take it that ‘rational expectations’ assumes that we can work out what will happen as a consequence of this or that course of action. I should rather say that at most we can hope to set bounds to what can happen, at best and at worst, within a stated length of time from ‘the present’, and can invent an endless diversity of possibilities lying between them. I fear that for your purpose I am a broken reed.

There is a debate underway about how effective Israel’s iron dome system is in protecting populated areas from missile attacks. On the pro side it is argued that somewhere between 85% to 90% of incoming missiles are destroyed. The con side argues that the proportion is much smaller, 40% or less. A large part of the difference comes from how one defines `destroy’. Perhaps a better term would be intercept. It is possible that about 90% of incoming missiles are intercepted. However, a missile once intercepted may not have its warhead disabled, making at least one of the fragments that falls to ground (in a populated area) dangerous.

While nailing down the actual numbers may be interesting, it strikes me as irrelevant. Suppose that any incoming missile has a 90% chance of being intercepted and destroyed (which is the claim of the builder of the iron dome technology). If the attacker launches N missiles and iron dome is deployed, the probability (assuming independence) not a single one making it through is (0.9)^N. Thus, the probability of at least one missile making it through the `dome’ is 1 – (0.9)^N. If N is large, this is large. For example, for N = 10, the probability that at least one missile makes its way through is at least 60% (thanks to anonymous below for correction). Thus, as long as the attacker has access to large quantities of missiles, it can be sure to get missiles through the dome.

In Harsanyi games with incomplete information, also known as Bayesian games, each player has a type. The type of the player describes all that he knows and believes about the situation he faces: who are the players, what are his and their available actions, what are his and their utility functions, and what are the beliefs of the other players about the situation.

Since the player’s type describes his knowledge and beliefs, a player always knows his own type. But a player need not know the other players’ types. Indeed, a chess player knows his own abilities, but he may not the level of his opponent: he may ascribe probability 1/3 to the event that his opponent is familiar with the Benko gambit, and probability 2/3 to the event that the opponent is not familiar with this opening.

In a Bayesian game,  a chance move selects a vector of types, one type for each player, according to a known probability distribution p at the outset of the game. Each player learns his own type, but he does not know the types chosen for the other players. He does have a belief about the other players’ types, which is the conditional distribution of p given his own type.

The Bayesian game is an auxiliary construction. In reality there is no chance move that selects the player’s types: the knowledge and beliefs each player is equipped with determine his or her type. Bayesian games are  merely a way to model the incomplete information each player has on the other players’ types. Thus, the true situation the players face is the situation after the vector of types was selected, which is called the interim stage. The situation before the vector of types is chosen, which is called ex ante stage, is the mathematical way that Harsanyi found to model the game.

Consider now the following Bayesian game, that depends on a real number a (which is in the unit interval; below, all additions and subtractions are modulo 1). There are two players; the type space of each player is the unit interval [0,1]. The types of the players are correlated: if  player 1 has type x, then he believes that player 2’s type is either x of x+a (each with probability 1/2); if  player 2 has type x, then he believes that player 1’s type is either x of x-a (each with probability 1/2). This belief structure can be described by a common prior distribution: the types of the two players are chosen according to the uniform distribution over the following set T (this is a variation of an example of Ehud Lehrer and Dov Samet):

The type space of the Bayesian game

If player 1’s type is x, then he believes that player 2 may be of type x or x+a. It follows that player 2 may believes that player 1’s type is x-a, x, or x+a. So player 2 may believe that player 1 believes that player 2’s type is x-a, x, x+a or x+2a. When the situation is a game, to decide how to play, player 1 needs to take all types of player 2 (and of himself) of the form {x+na, n is an integer}. This set of finite if a is a rational number, and countable if a is an irrational number. Denote by Zx the set of all pairs of type {(x+na,x+na), n is an integer} union with {(x+na,x+(n+1)a), n is an integer}. The set Zx is called the minimal belief subspace of player 1. In the interim stage, after his type was selected and told to him, player 1 knows that the type vector is in Zx, that only type vectors in Zx appear in the belief hierarchy of player 2, and therefore he can think about the situation as if the Bayesian game is restricted to Zx: a type vector in Zx was chosen according to the conditional distribution over Zx. To determine how to play, player 1 should find an equilibrium in the game restricted to Z.

The uniform distribution of the set T that appears in the figure above induces a probability distribution over Zx. When Zx is finite (= a is a rational number), this is the uniform distribution over a finite set. Alas, when Zx is countable (a is irrational) there is no uniform distribution over Zx. In particular, the interim stage is not well defined! Thus, even though the interim stage is the actual situation the players face, and even though they can describe their beliefs using a Harsanyi game with a larger type space, the situation they face cannot be described as a Harsanyi game if they take into account only the types that are possible according to their information.

It is interesting to note that one can find a Bayesian equilibrium in the game restricted to Zx, for every x. However, when one tries to “glue” these equilibria together, one might find out that the resulting pair of strategies over [0,1] is not measurable, and in particular an equilibrium in the Harsanyi game (over T) need not exist. This finding was first noted by Bob Simon.

Since indeed the true situation the players face is the interim stage, and the ex ante stage is merely an auxiliary construction, how come the ex ante stage does not define a proper game in the interim stage? If this is the case, is the auxiliary construction of Harsanyi game over T the correct one?

Among game theoretic concepts, mixed strategy is arguably the most difficult to digest. We don’t see people tossing coins all the time, and it’s difficult to justify rational decision as based on Lady Fortuna’s unpredictable caprices. The case of Nash Equilibrium is especially disturbing — if you are indifferent between a couple of strategies then why bother randomizing between them according to the mixture prescribed by the equilibrium. Just pick one of these strategies arbitrary and get it over with.

I know of two types of answers that game theory gives for this conundrum. One, which may be called `interpretation of mixed strategies’ is arguing that the mixed strategies in Nash equilibrium do not reflect an actual randomization performed by the players: Epistemic game theory interprets mixed strategies as opponent’s beliefs about a player’s (non-randomized) strategy; Harsanyi’s Purification approach embeds a normal form game in a larger game with incomplete information and pure equilibrium. The other type of answers is identifying classes of games which admit pure equilibrium, such as games with strategic complementarity and potential games.

In my new paper with Yaron(pdf) we suggest another class of games which admit pure {\epsilon}-equilibrium, which means that no player gains more than {\epsilon} from deviating. These are games in which a player’s payoff does not change much if one of her opponents changes his strategy:

Math and open problems below the fold…

Read the rest of this entry »