
This post describes the main theorem in my new paper with Nabil. Scroll down for open questions following this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that as time goes by structural uncertainty dissipates. The standard example is when the sequence of outcomes is i.i.d. with an unknown parameter. As time goes by the agent learns the parameter.

The formulation of `learning to make predictions’ goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubins’ definition in this post but a weaker definition, suggested by Kalai and Lehrer.

A Bayesian agent observes an infinite sequence of outcomes from a finite set {A}. Let {\mu\in\Delta(A^\mathbb{N})} represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by {\mu(\cdot|a_0,\dots,a_{n-1})} the element in {\Delta(A)} which represents the agent’s prediction about the outcome of day {n} just after he observed the outcomes {a_0,\dots,a_{n-1}} of previous days. In the following definition it is instructive to think about {\tilde\mu} as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.

Definition 1 (Kalai and Lehrer) Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0.

Assume now that the agent’s belief {\mu} is stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Recall that in this decomposition {\theta} ranges over ergodic beliefs and {\lambda} represents structural uncertainty. Does the agent learn to make predictions? Using the definition of merging we can ask, does {\mu} merge with {\theta}? The answer, perhaps surprisingly, is no. I gave an example in my previous post.

Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.

Definition 2 Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} weakly merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty,n\in T}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0

for a set {T\subseteq \mathbb{N}} of periods of density {1}.

The definition of weak merging is natural: patient agents whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of merging and also their relationship with Dawid’s idea of calibration.

I am now in a position to state the theorem I have been talking about for two months:

Theorem 3 Let {\mu\in\Delta(A^\mathbb{N})} be stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Then {\mu} weakly merges with {\theta} for {\lambda}-almost every {\theta}.

In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.
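The i.i.d. example can be made concrete with a small simulation. This is a minimal sketch and not taken from the paper: the two-point prior on the success probability, the seed and the function names are all illustrative assumptions. It tracks the gap between the agent's one-day-ahead prediction and the prediction of someone who knows the true parameter; for a binary outcome this gap is, up to a constant factor, the norm appearing in the definitions above.

```python
import random

def predictive_success(prior, successes, failures):
    """One-day-ahead probability of success under a finite prior over i.i.d. coins.

    prior maps each candidate parameter theta to its prior weight.
    """
    # Posterior weight of theta is proportional to prior weight times likelihood.
    weights = {
        theta: w * theta**successes * (1 - theta)**failures
        for theta, w in prior.items()
    }
    total = sum(weights.values())
    return sum(theta * w for theta, w in weights.items()) / total

def prediction_gaps(true_theta, prior, n_days, seed=0):
    """Gap between the agent's prediction and the true one, day by day."""
    rng = random.Random(seed)
    successes = failures = 0
    gaps = []
    for _ in range(n_days):
        gaps.append(abs(predictive_success(prior, successes, failures) - true_theta))
        # Nature draws today's outcome from the true process.
        if rng.random() < true_theta:
            successes += 1
        else:
            failures += 1
    return gaps

prior = {1/3: 0.5, 2/3: 0.5}   # hypothetical two-point prior
gaps = prediction_gaps(true_theta=2/3, prior=prior, n_days=500)
print(gaps[0], gaps[-1])
```

In this toy case the gap vanishes along the whole sequence, so the belief merges and not only weakly merges; the theorem is about stationary beliefs where only the weaker conclusion is available.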

Finally, here are the promised open questions. They deal with the two qualifications in the theorem. The first question is about the “{\lambda}-almost every {\theta}” in the theorem. As Larry Wasserman mentioned, this is unsatisfactory in some senses. So,

Question 1 Does there exist a stationary {\mu} (equivalently a belief {\lambda} over ergodic beliefs) such that {\mu} weakly merges with {\theta} for every ergodic distribution {\theta}?

The second question is about strengthening weak merging to merging. We already know that this cannot be done for an arbitrary belief {\lambda} over ergodic processes, but what if {\lambda} is concentrated on some natural family of processes, for example hidden Markov processes with a bounded number of hidden states? Here is the simplest setup for which I don’t know the answer.

Question 2 The outcome of the stock market on every day is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy which can be either G or B (good or bad): When the hidden state is B the outcome is U with probability {q_B} (and D with probability {1-q_B}), and when the state is G the outcome is U with probability {q_G}. The hidden state changes according to a Markov process with transition probability {\rho(B|B)=1-\rho(G|B)=p_B}, {\rho(B|G)=1-\rho(G|G)=p_G}. The parameter is {(p_B,p_G,q_B,q_G)} and the agent has some prior {\lambda} over the parameter. Does the agent’s belief about outcomes merge with the truth for {\lambda}-almost every {(p_B,p_G,q_B,q_G)}?
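For a fixed parameter, the predictions {\theta(\cdot|a_0,\dots,a_{n-1})} in Question 2 can be computed with the usual forward filter. The sketch below is only an illustration under assumptions that are not in the question: the hidden state is started from the stationary distribution of the chain, the parameters are interior (strictly between 0 and 1), and the function name is made up. The agent's own prediction would be the posterior-weighted average of these over the prior {\lambda}, which is not computed here.

```python
def hmm_predictions(outcomes, p_B, p_G, q_B, q_G):
    """Probability of U on each day, given earlier outcomes, for a known parameter.

    p_B = rho(B|B), p_G = rho(B|G); q_B and q_G are the chances of U in states B and G.
    Assumption: the hidden state starts from the stationary distribution.
    """
    b = p_G / (1 - p_B + p_G)   # stationary probability that the state is B
    predictions = []
    for outcome in outcomes:
        # Prediction for today, before today's outcome is observed.
        predictions.append(b * q_B + (1 - b) * q_G)
        # Bayes update of the belief about today's state.
        like_B = q_B if outcome == "U" else 1 - q_B
        like_G = q_G if outcome == "U" else 1 - q_G
        b_post = b * like_B / (b * like_B + (1 - b) * like_G)
        # Push the belief one day forward through the hidden Markov chain.
        b = b_post * p_B + (1 - b_post) * p_G
    return predictions

preds = hmm_predictions("UUDUD", p_B=0.7, p_G=0.2, q_B=0.3, q_G=0.8)
print(preds)
```

Merging in the sense of Definition 1 would compare these numbers, day by day, with the corresponding predictions of the mixture over parameters.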

Four agents are observing infinite streams of outcomes in {\{S,F\}}. None of them knows the future outcomes and as good Bayesianists they represent their beliefs about unknowns as probability distributions:

  • Agent 1 believes that outcomes are i.i.d. with probability {1/2} of success.
  • Agent 2 believes that outcomes are i.i.d. with probability {\theta} of success. She does not know {\theta}; she believes that {\theta} is either {2/3} or {1/3}, and attaches probability {1/2} to each possibility.
  • Agent 3 believes that outcomes follow a Markov process: every day’s outcome equals yesterday’s outcome with probability {3/4}.
  • Agent 4 believes that outcomes follow a Markov process: every day’s outcome equals yesterday’s outcome with probability {\theta}. She does not know {\theta}; her belief about {\theta} is the uniform distribution over {[0,1]}.

I denote by {\mu_1,\dots,\mu_4\in\Delta\left(\{S,F\}^\mathbb{N}\right)} the agents’ beliefs about future outcomes.

We have an intuition that Agents 2 and 4 are in a different situation from Agents 1 and 3, in the sense that they are uncertain about some fundamental properties of the stochastic process they are facing. I will say that they have `structural uncertainty’. The purpose of this post is to formalize this intuition. More explicitly, I am looking for a property of a belief {\mu} over {\Omega} that will distinguish between beliefs that reflect some structural uncertainty and beliefs that don’t. This property is ergodicity.

 

 

Definition 1 Let {\zeta_0,\zeta_1,\dots} be a stationary process with values in some finite set {A} of outcomes. The process is ergodic if for every block {\bar a=(a_0,\dots,a_{k})} of outcomes it holds that

\displaystyle \begin{array}{rcl} \lim_{n\rightarrow\infty}&\frac{\#\left\{0\le t < n|\zeta_t=a_0,\dots,\zeta_{t+k}=a_{k}\right\}}{n}=\\&\mathop{\mathbb P}(\zeta_0=a_0,\dots,\zeta_{k}=a_{k})~\text{a.s}.\end{array}

A belief {\mu\in \Delta(A^\mathbb{N})} is ergodic if it is the distribution of an ergodic process.

Before I explain the definition let me write the ergodicity condition for the special case of the block {\bar a=(a)} for some {a\in A} (this is a block of size 1):

\displaystyle \lim_{n\rightarrow\infty}\frac{\#\left\{0\le t < n|\zeta_t=a\right\}}{n}=\mathop{\mathbb P}(\zeta_0=a)~\text{a.s}.\ \ \ \ \ (1)

 

On the right side of (1) we have the (subjective) probability that on day {0} we will see the outcome {a}. Because of stationarity this is also the belief that we will see the outcome {a} on every other day. On the left side of (1) we have no probabilities at all. What is written there is the frequency of appearances of the outcome {a} in the realized sequence. This frequency is objective and has nothing to do with our beliefs. Therefore, the probability that a Bayesian agent with an ergodic belief attaches to observing some outcome is a number that can be measured from the process: just observe it long enough and check the frequency with which this outcome appears. In a way, for ergodic processes the frequentist and subjective interpretations of probability coincide, but there are legitimate caveats to this statement, which I am not gonna delve into because my subject matter is not the meaning of probability. For my purpose it’s enough that ergodicity captures the intuition we have about the four agents I started with: Agents 1 and 2 both give probability {1/2} to success in each day. This means that if they are sold a lottery ticket that gives a prize if there is a success at day, say, 172, they both price this lottery ticket the same way. However, Agent 1 is certain that in the long run the frequency of success will be {1/2}. Agent 2 is certain that it will be either {2/3} or {1/3}. In fancy words, {\mu_1} is ergodic and {\mu_2} is not.
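The contrast between the first two agents is easy to see in a simulation. The following is a rough sketch with made-up function names, an arbitrary seed and an arbitrary horizon: it draws a few long realizations from each belief and reports the realized frequency of success, which is the left side of (1) for a large but finite {n}.

```python
import random

def success_frequency(theta, n_days, rng):
    """Realized frequency of success in n_days of i.i.d. outcomes with success chance theta."""
    return sum(rng.random() < theta for _ in range(n_days)) / n_days

def agent1_frequency(n_days, rng):
    # Agent 1's belief: i.i.d. with success probability 1/2.
    return success_frequency(0.5, n_days, rng)

def agent2_frequency(n_days, rng):
    # Agent 2's belief: theta is drawn once from the prior, then outcomes are i.i.d. theta.
    theta = rng.choice([1/3, 2/3])
    return success_frequency(theta, n_days, rng)

rng = random.Random(1)
freqs1 = [agent1_frequency(20_000, rng) for _ in range(6)]
freqs2 = [agent2_frequency(20_000, rng) for _ in range(6)]
print([round(f, 2) for f in freqs1])   # all close to the day-0 probability 1/2
print([round(f, 2) for f in freqs2])   # each close to 1/3 or 2/3, never to 1/2
```

Under the first belief every realization reproduces the day-0 probability, while under the second belief no realization does, even though both beliefs assign probability {1/2} to success on any given day.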

So, ergodic processes capture our intuition of `processes without structural uncertainty’. What about situations with uncertainty? What mathematical creature captures this uncertainty? Agent 2’s uncertainty seems to be captured by some probability distribution over two ergodic processes: the process “i.i.d. {2/3}” and the process “i.i.d. {1/3}”. Agent 2 is uncertain which of these processes she is facing. Agent 4’s uncertainty is captured by some probability distribution over a continuum of Markov (ergodic) processes. This is a general phenomenon:

Theorem 2 (The ergodic decomposition theorem) Let {\mathcal{E}} be the set of ergodic distributions over {A^\mathbb{N}}. Then for every stationary belief {\mu\in\Delta(A^\mathbb{N})} there exists a unique distribution {\lambda} over {\mathcal{E}} such that {\mu=\int \theta~\lambda(\text{d}\theta)}.

The probability distribution {\lambda} captures uncertainty about the structure of the process. When {\mu} is itself ergodic, {\lambda} is degenerate and there is no structural uncertainty.

Two words of caution: First, my definition of ergodic processes is not the one you will see in textbooks. The equivalence to the textbook definition is an immediate consequence of the so-called ergodic theorem, which is a generalization of the law of large numbers for ergodic processes. Second, my use of the word `uncertainty’ is not universally accepted. The term traces back at least to Frank Knight, who made the distinction between risk or “measurable uncertainty” and what is now called “Knightian uncertainty” which cannot be measured. Since Knight wrote in English and not in Mathematish I don’t know what he meant, but modern decision theorists, mesmerized by the Ellsberg Paradox, usually interpret risk as a Bayesian situation and Knightian uncertainty, or “ambiguity”, as a situation which falls outside the Bayesian paradigm. So if I understand correctly they will view the situations of the four agents mentioned above as situations of risk only, without uncertainty. The way in which I use “structural uncertainty” was used in several theory papers. See this paper of Jonathan and Nabil. And this and the paper which I am advertising in these posts, about disappearance of uncertainty over time. (I am sure there are more.)

To be continued…

The abstract of a 2005 paper by Itti and Baldi begins with these words:

The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian definition of surprise that is the only consistent formulation under minimal axiomatic assumptions.

They propose that surprise be measured by the Kullback-Leibler divergence between the prior and the posterior. As with many good ideas, Itti and Baldi are not the first to propose this. C. L. Martin and G. Meeden did so in 1984 in an unpublished paper entitled: `The distance between the prior and the posterior distributions as a measure of surprise.’ Itti and Baldi go further and provide experimental support that this notion of surprise comports with human notions of surprise. Recently, Ely, Frankel and Kamenica, in Economics, have also considered the issue of surprise, focusing instead on how best to release information so as to maximize interest.
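For a finite set of hypotheses the computation is a few lines. The sketch below is illustrative only: the function names are invented, and it takes the divergence of the posterior relative to the prior, which is my reading of the proposal; swapping the two arguments gives the other direction.

```python
import math

def posterior(prior, likelihoods):
    """Bayes rule over a finite list of hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def bayesian_surprise(prior, likelihoods):
    """Kullback-Leibler divergence (in nats) of the posterior relative to the prior."""
    post = posterior(prior, likelihoods)
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0)

# Two hypotheses about a coin, equally likely a priori; we observe one success.
prior = [0.5, 0.5]
likelihood_of_success = [1/3, 2/3]
print(bayesian_surprise(prior, likelihood_of_success))
```

An observation that is equally likely under every hypothesis leaves the posterior equal to the prior and therefore carries zero surprise in this sense, however improbable the observation itself was.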

Surprise now being defined, one might go on to define novelty, interestingness, beauty and humor. Indeed, Jurgen Schmidhuber has done just that (and more). A paper on the optimal design of jokes cannot be far behind. Odd as this may seem, it is a part of a venerable tradition. Kant defined humor as the sudden transformation of a strained expectation into nothing. Birkhoff himself wrote an entire treatise on Aesthetic Measure (see the review by Garabedian). But, I digress.

Returning to the subject of surprise, the Kullback-Leibler divergence is not the first measure of surprise or even the most widespread. I think that prize goes to the venerable p-value. Orthodox Bayesians, those who tremble at the sight of measure zero events, look in horror upon the p-value because it does not require one to articulate a model of the alternative. Even they would own, I think, to the convenience of not having to list all alternative models and carefully evaluate them. Indeed I. J. Good, writing in 1981, notes the following:

The evolutionary value of surprise is that it causes us to check our assumptions. Hence if an experiment gives rise to a surprising result given some null hypothesis H it might cause us to wonder whether H is true even in the absence of a vague alternative to H.

Good, by the way, described himself as a cross between Bayesian and Frequentist, called a Doogian. One can tell from this label that he had an irrepressible sense of humor. Born Isadore Jacob Gudak to a Polish family in London, he changed his name to Ian Jack Good, close enough one supposes. At Bletchley Park he and Turing came up with the scheme that eventually broke the German Navy’s Enigma code. This led to the Good-Turing estimator. Imagine a sequence of symbols chosen from a finite alphabet. How would one estimate the probability of observing a letter from the alphabet that has not yet appeared in the sequence thus far? But, I digress.
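For the record, the Good-Turing answer to that question is short: estimate the total probability of the letters not yet seen by the fraction of the sample made up of letters seen exactly once. A minimal sketch (the function name is made up, and the sequence is assumed non-empty):

```python
from collections import Counter

def good_turing_unseen_mass(sequence):
    """Good-Turing estimate of the total probability of symbols not yet seen: N1 / N.

    N1 is the number of distinct symbols that appear exactly once,
    N is the length of the sequence (assumed non-empty).
    """
    counts = Counter(sequence)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sequence)

print(good_turing_unseen_mass("abracadabra"))   # c and d appear once, so 2/11
```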

Warren Weaver was, I think, the first to propose a measure of surprise. Weaver is most well known as a popularizer of Science. Some may recall him as the Weaver on the slim volume by Shannon and Weaver on the Mathematical Theory of Communication. Well before that, Weaver played an important role at the Rockefeller Foundation, where he used their resources to provide fellowships to many promising scholars and jump start molecular biology. The following is from page 238 of my edition of Jonas’ book `The Circuit Riders’:

Given the unreliability of such sources, the conscientious philanthropoid has no choice but to become a circuit rider. To do it right, a circuit rider must be more than a scientifically literate ‘tape recorder on legs.’ In order to win the confidence of their informants, circuit riders for Weaver’s Division of Natural Sciences were called upon to offer a high level of ‘intellectual companionship’ – without becoming ‘too chummy’ with people whose work they had, ultimately, to judge.

But, I digress.

To define Weaver’s notion, suppose a discrete random variable X that takes values in the set \{1, 2, \ldots, m\}. Let p_i be the probability that X = i. The surprise index of outcome k is \frac{\sum_i p^2_i}{p_k}, that is, the expected probability of the realized outcome divided by the probability of the outcome that actually occurred. Good himself jumped into the fray with some generalizations of Weaver’s index. Here is one: \frac{[\sum_i p_i^{t+1}]^{1/t}}{p_k}, which reduces to Weaver’s index at t = 1. Others involve the use of logs, leading to measures that are related to notions of entropy as well as probability scoring rules. Good also proposed axioms that a good measure ought to satisfy, but I cannot recall if anyone followed up to derive axiomatic characterizations.
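A small numerical illustration may help. The sketch below takes Weaver's index to be the expected probability of the realized outcome, \sum_i p_i^2, divided by p_k; the function name is invented, and outcomes are indexed from 0 in the code rather than from 1.

```python
def weaver_surprise_index(probs, k):
    """Expected probability of the realized outcome divided by the probability of outcome k.

    probs is the list (p_1, ..., p_m); k is a zero-based position in that list.
    """
    expected_prob = sum(p * p for p in probs)
    return expected_prob / probs[k]

probs = [0.5, 0.25, 0.25]
print([weaver_surprise_index(probs, k) for k in range(len(probs))])   # [0.75, 1.5, 1.5]
```

An index above 1 means the outcome was less probable than a typical outcome. Under a uniform distribution every outcome has index exactly 1, so nothing is surprising, no matter how small each probability is.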

G. L. S. Shackle, who would count as one of the earliest decision theorists, also got into the act. Shackle departed from subjective probability and proposed to order degrees of belief by their potential degrees of surprise. Shackle also proposed, I think, that an action be judged interesting by its best possible payoff and its potential for surprise. Shackle has already passed beyond the ken of men. One can get a sense of his style and vigor from the following response to an invitation to write a piece on Rational Expectations:

‘Rational expectations’ remains for me a sort of monster living in a cave. I have never ventured into the cave to see what he is like, but I am always uneasily aware that he may come out and eat me. If you will allow me to stir the cauldron of mixed metaphors with a real flourish, I shall suggest that ‘rational expectations’ is neo-classical theory clutching at the last straw. Observable circumstances offer us suggestions as to what may be the sequel of this act or that one. How can we know what invisible circumstances may take effect in time-to-come, of which no hint can now be gained? I take it that ‘rational expectations’ assumes that we can work out what will happen as a consequence of this or that course of action. I should rather say that at most we can hope to set bounds to what can happen, at best and at worst, within a stated length of time from ‘the present’, and can invent an endless diversity of possibilities lying between them. I fear that for your purpose I am a broken reed.

What does it mean to describe a probability distribution over, say, {\{0,1\}^\mathbb{N}}? I am interested in this question because of the expert testing literature (pdf), in which an expert is supposed to provide a client with the distribution of some stochastic process, but the question is also relevant for Bayesians, at least those of us who think that the Bayesian paradigm captures (or should capture) the reasoning process of rational agents, in which case it makes little sense to reason about something you cannot describe.

So, what does it mean to describe your belief/theory about a stochastic process {X_1,X_2,\dots} with values in {\{0,1\}}?  Here are some examples of descriptions:

{X_n} are I.I.D with {P(X_n=0)=3/4}.

{X_0=X_1=1} and {\mathop{\mathbb P}(X_n=0|X_0,\dots,X_{n-1})=\frac{1}{e}X_{n-1}+0.4X_{n-2}}.

{X_n=Y_{2n}\cdot Y_{2n-1}} where {Y_0,Y_1,\dots} are I.I.D {(1/2,1/2)}.

Everyone agrees that I have just described processes, and that the first and last are different descriptions of the same process. Also everyone probably agrees that there are only countably many processes I can describe, since every description is a sentence in English/Math and there are only countably many such sentences.
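The claim that the first and last descriptions agree can be checked by brute force on any finite block of coordinates. The sketch below (function names are made up, and three coordinates is an arbitrary choice) enumerates all values of the fair bits {Y} with exact rational arithmetic and compares the resulting joint law of {(X_1,X_2,X_3)} with the i.i.d. law that puts probability {3/4} on {0}.

```python
from fractions import Fraction
from itertools import product

def joint_law_of_products(n_vars):
    """Exact joint law of n_vars consecutive X's, each the product of its own pair of fair bits."""
    law = {}
    weight = Fraction(1, 2 ** (2 * n_vars))   # every string of Y's is equally likely
    for ys in product((0, 1), repeat=2 * n_vars):
        # Consecutive X's use disjoint pairs of Y's.
        xs = tuple(ys[2 * k] * ys[2 * k + 1] for k in range(n_vars))
        law[xs] = law.get(xs, 0) + weight
    return law

def iid_law(n_vars, p_zero=Fraction(3, 4)):
    """Joint law of n_vars i.i.d. bits with P(X = 0) = p_zero."""
    law = {}
    for xs in product((0, 1), repeat=n_vars):
        prob = Fraction(1)
        for x in xs:
            prob *= p_zero if x == 0 else 1 - p_zero
        law[xs] = prob
    return law

print(joint_law_of_products(3) == iid_law(3))   # True
```

The agreement holds for every block length because consecutive {X}'s are built from disjoint pairs of independent {Y}'s; the code only confirms it for the length it is given.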

