You are currently browsing the tag archive for the ‘uncertainty’ tag.

In the lasts posts I talked about a Bayesian agent in a stationary environment. The flagship example was tossing a coin with uncertainty about the parameter. As time goes by, he learns the parameter. I hinted about the distinction between `learning the parameter’, and `learning to make predictions about the future as if you knew the parameter’. The former seems to imply the latter almost by definition, but this is not so.

Because of its simplicity, the i.i.d. example is in fact somewhat misleading for my purposes in this post. If you toss a coin then your belief about the parameter of the coin determines your belief about the outcome tomorrow: if at some point your belief about the parameter is given by some {\mu\in [0,1]} then your prediction about the outcome tomorrow will be the expectation of {\mu}. But in a more general stationary environment, your prediction about the outcome tomorrow depends on your current belief about the parameter and also on what you have seen in the past. For example, if the process is Markov with an unknown transition matrix then to make a probabilistic prediction about the outcome tomorrow you first form a belief about the transition matrix and then uses it to predict the outcome tomorrow given the outcome today. The hidden markov case is even more complicated, and it gives rise to the distinction between the two notions of learning.

The formulation of the idea of `learning to make predictions’ goes through merging. The definition traces back at least to Blackwell and Dubins. It was popularized in game theory by the Ehuds, who used Blackwell and Dubins’ theorem to prove that rational players will end up playing approximate Nash Equilibrium. In this post I will not explicitly define merging. My goal is to give an example for the `weird’ things that can happen when one moves from the i.i.d. case to an arbitrary stationary environment. Even if you didn’t follow my previous posts, I hope the following example will be intriguing for its own sake.

Every day there is a probability {1/2} for eruption of war (W). If no war erupts then the outcome is either bad economy (B) or good economy (G) and is a function of the number of peaceful days since the last war. The function from the number of peaceful days to outcome is an unknown parameter of the process. Thus, a parameter is a function {\theta:\{1,2,\dots\}\rightarrow\{\text{B},\text{G}\}}. I am going to compare the predictions about the future made by two agents: Roxana, who knows {\theta} and Ursula, who faces some uncertainty about {\theta} represented by a uniform belief over the set of all parameters. Both Roxana and Ursula don’t know the future outcomes and since both of them are rational decision makeres, they both use Bayes’ rule to form beliefs about the unknown future given what they have seen in the past.

Consider first Roxana. In the terminology I introduced in previous posts, she faces no structural uncertainty. After a period of {k} consecutive peaceful days Roxana believes that with probability {1/2} the outcome tomorrow will be W and with probability {1/2} the outcome tomorrow will be {\theta(k)}.

Now consider Ursula. While she does not initially know {\theta}, as times goes by she learns it. What do I mean here by learning ? Well, suppose Ursula starts observing the outcomes and she sees G,B,W,B,G,…. From this information Ursula she deduces that {\theta(1)=\text{B}}, so that if a peaceful day follows a war then it has a bad economy. Next time a war pops up Ursula will know to make a prediction about the outcome tomorrow which is as accurate as Roxana’s prediction. Similarly Ursula can deduce that {\theta(2)=\text{G}}. This way Ursula gradually deduces the values of {\theta(k)} while she observes the process. However, and this is the punch line, for every {k\in\{1,2,3,\dots\}} there will be a time when Ursula observes {k} consecutive peaceful day for the first time and at this day her prediction about the next outcome will be {1/2} for war, {1/4} for good economy and {1/4} for bad economy. Thus there will always be infinitely many occasions in which Ursula’s prediction differ from Roxana.

So, Ursula does learn the parameter in the sense that she gradually deduce more and more values of {\theta(k)}. However, because at every point in time she may require a different value of {\theta(k)} — This is the difference between the stationary environment and the i.i.d. environment ! — there may happen infinitely many times in which she has not yet been able to deduce the value of the parameter which she needs in order to make a prediction about the outcome tomorrow.

You may notice that Ursula does succeed in making predictions most of the times. In fact, the situations when she fails become more and more rare, after observing longer and longer blocks of peaceful days. Indeed, Nabil and I formalize this idea and show that this is the case in every stationary environment with structural uncertainty: the observer makes predictions approximately as if he knew the parameter in almost every day. For that, we use a weak notion of merging which was suggested by Lehrer and Smorodinsky. If you are interested then this is a good time to look at our paper.

Finally, the example given above is our adaptation to an example that appeared first in a paper by Boris Yakovlevich Ryabko. Ryabko’s paper is part of a relatively large literature about non-Bayesian predictions in stationary environment. I will explain the relationship between that literature and our paper in another post.

A Bayesian agent is observing a sequence of outcomes in {\{S,F\}}. The agent does not know the outcomes in advance, so he forms some belief {\mu} over sequences of outcomes. Suppose that the agent believes that the number {d} of successes in {k} consecutive outcomes is distributed uniformly in {\{0,1,\dots k\}} and that all configuration with {d} successes are equally likely:

\displaystyle \mu\left(a_0,a_1,\dots,a_{k-1} \right)=\frac{1}{(k+1)\cdot {\binom{k}{d}}}

for every {a_0,a_1,\dots,a_{k-1}\in \{S,F\}} where {d=\#\{0\le i<k|a_i=S\}}.

You have seen this belief {\mu} already though maybe not in this form. It is a belief of an agent who tosses an i.i.d. coin and has some uncertainty over the parameter of the coin, given by a uniform distribution over {[0,1]}.

In this post I am gonna make a fuss about the fact that as time goes by the agent learns the parameter of the coin. The word `learning’ has several legitimate formalizations and today I am talking about the oldest and probably the most important one — consistency of posterior beliefs. My focus is somewhat different from that of textbooks because 1) As in the first paragraph, my starting point is the belief {\mu} about outcome sequences, before there are any parameters and 2) I emphasize some aspects of consistency which are unsatisfactory in the sense that they don’t really capture our intuition about learning. Of course this is all part of the grand marketing campaign for my paper with Nabil, which uses a different notion of learning, so this discussion of consistency is a bit of a sidetrack. But I have already came across some VIP who i suspect was unaware of the distinction between different formulations of learning, and it wasn’t easy to treat his cocky blabbering in a respectful way. So it’s better to start with the basics.

Let {A} be a finite set of outcomes. Let {\mu\in\Delta(A^\mathbb{N})} be a belief over the set {A^\mathbb{N}} of infinite sequences of outcomes, also called realizations. A decomposition of {\mu} is given by a set {\Theta} of parameters, a belief {\lambda} over {\Theta}, and, for every {\theta\in\Theta} a belief {\mu_\theta} over {A^\mathbb{N}} such that {\mu=\int \mu_\theta~\lambda(\mathrm{d}\theta)}. The integral in the definition means that the agent can think about the process as a two stages randomization: First a parameter {\theta} is drawn according to {\lambda} and then a realization {\omega} is drawn according to {\mu_\theta}. Thus, a decomposition captures a certain way in which a Bayesian agent arranges his belief. Of course every belief {\mu} admits many decompositions. The extreme decompositions are:

  • The Trivial Decomposition. Take {\Theta=\{\bar\theta\}} and {\mu_{\bar\theta}=\mu}.
  • Dirac’s Decomposition. Take {\Theta=A^\mathbb{N}} and {\lambda=\mu}. A “parameter” in this case is a measure {\delta_\omega} that assigns probability 1 to the realization {\omega}.

Not all decompositions are equally exciting. We are looking for decompositions in which the parameter {\theta} captures some `fundamental property’ of the process. The two extreme cases mentioned above are usually unsatisfactory in this sense. In Dirac’s decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter and thus cannot discriminate between different interesting properties. For stationary process, there is a natural decomposition in which the parameters distinguish between fundamental properties of the process. This is the ergodic decomposition, according to which the parameters are the ergodic beliefs. Recall that in this decomposition, a parameter captures the empirical distribution of blocks of outcomes in the infinite realization.

So what about learning ? While observing a process, a Bayesian agent will update his belief about the parameter. We denote by {\lambda_n\left(a_0,\dots,a_{n-1}\right)\in\Delta(\Theta)} the posterior belief about the parameter {\theta} at the beginning of period {n} after observing the outcome sequence {a_0,\dots,a_{n-1}}. The notion of learning I want to talk about in this post is that this belief converges to a belief that is concentrated on the true parameter {\theta}. The example you should have in mind is the coin toss example I started with: while observing the outcomes of the coin the agent becomes more and more certain about the true parameter of the coin, which means his posterior belief becomes concentrated around a belief that gives probability {1} to the true parameter.

Definition 1 A decomposition of {\mu} is consistent if for {\lambda}-almost every {\theta} it holds that

\displaystyle \lambda_n\left(a_0,\dots,a_{n-1}\right)\xrightarrow[n\rightarrow\infty]{w}\delta_\theta

for {\mu_\theta}-almost every realization {\omega=(a_0,a_1,\dots)}.

In this definition, {\delta_\theta} is Dirac atomic measure on {\theta} and the convergence is weak convergence of probability measures. No big deal if you don’t know what it means since it is exactly what you expect.

So, we have a notion of learning, and a seminal theorem of L.J. Doob (more on that below) implies that the ergodic decomposition of every stationary process is consistent. While this is not something that you will read in most textbooks (more on that below too), it is still well known. Why do Nabil and I dig further into the issue of learnability of the ergodic decomposition ? Two reasons. First, one has to write papers. Second, there is something unsatisfactory about the concept of consistency as a formalization of learning. To see why, consider the belief {\mu} that outcomes are i.i.d. with probability {1/2} for success. This belief is ergodic, so from the perspective of the ergodic decomposition the agent `knows the process’ and there is nothing else to learn. But let’s look at Dirac’s decomposition instead of the ergodic decomposition. Then the parameter space equals the space of all realizations. Suppose the true parameter (=realization) is {\omega^\ast=(a_0,a_1,\dots)}, then after observing the first {n} outcomes of the process the agent’s posterior belief about the parameter is concentrated on all {\omega} that agrees with {\omega^\ast} on the first {n} coordinates. These posterior beliefs converge to {\delta_{\omega^\ast}}, so that Dirac decomposition is also consistent ! We may say that we learn the parameter, but “learning the parameter” in this environment is just recording the past. The agent does not gain any new insight about the future of the process from learning the parameter.

In my next post I will talk about other notions of learning, originating in a seminal paper of Blackwell and Dubins, which capture the idea that an agent who learns a parameter can make predictions as if he new the parameter. Let me also say here that this post and the following ones are much influenced by a paper of Jackson, Kalai, Smorodinsky. I will talk more on that paper in another post.

For the rest of this post I am going to make some comments about Bayesian consistency which, though again standard, I don’t usually see them in textbooks. Especially I don’t know of a reference for the version of Doob’s Theorem which I give below, so if any reader can give me such a reference it will be helpful.

First, you may wonder whether every decomposition is consistent. The answer is no. For a trivial example, take a situation where {\mu_\theta} are the same for every {\theta}. More generally troubles arrise when the realization does not pin down the parameter. Formally, let us say that function {f:\Omega\rightarrow \Theta} pins down or identifies the parameter if

\displaystyle \mu_\theta\left(\{\omega: f(\omega)=\theta\}\right)=1

for {\lambda}-almost every {\theta}. If such {f} exists then the decomposition is identifiable.

We have the following

Theorem 2 (Doob’s Theorem) A decomposition is identifiable if and only if it is consistent.

The `if’ part follows immediately from the definitions. The `only if’ part is deep, but not difficult: it follows immediately from the martingale convergence theorem. Indeed, Doob’s Theorem is usually cited as the first application of martingales theory.

Statisticians rarely work with this abstract formulation of decomposition which I use. For this reason, the theorem is usually formulated only for the case that {\Theta=\Delta(A)} and {\mu_\theta} is i.i.d. {\theta}. In this case the fact that the decomposition is identifiable follows from the strong law of large numbers. Doob’s Theorem then implies the standard consistency of Bayesian estimator of the parameter {\theta}.

Four agents are observing infinite streams of outcomes in {\{S,F\}}. None of them knows the future outcomes and as good Bayesianists they represent their beliefs about unknowns as probability distributions:

  • Agent 1 believes that outcomes are i.i.d. with probability {1/2} of success.
  • Agent 2 believes that outcomes are i.i.d. with probability {\theta} of success. She does not know {\theta}; She believes that {\theta} is either {2/3} or {1/3}, and attaches probability {1/2} to each possibility.
  • Agent 3 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability {3/4}.
  • Agent 4 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability {\theta}. She does not know {\theta}; Her belief about {\theta} is the uniform distribution over {[0,1]}.

I denote by {\mu_1,\dots,\mu_4\in\Delta\left(\{S,F\}^\mathbb{N}\right)} the agents’ beliefs about future outcomes.

We have an intuition that Agents 2 and 4 are in a different situations from Agents 1 and 3, in the sense that are uncertain about some fundamental properties of the stochastic process they are facing. I will say that they have `structural uncertainty’. The purpose of this post is to formalize this intuition. More explicitly, I am looking for a property of a belief {\mu} over {\Omega} that will distinguish between beliefs that reflect some structural uncertainty and beliefs that don’t. This property is ergodicity.

 

 

Definition 1 Let {\zeta_0,\zeta_1,\dots} be a stationary process with values in some finite set {A} of outcomes. The process is ergodic if for every block {\bar a=(a_0,\dots,a_{k})} of outcomes it holds that

\displaystyle \begin{array}{rcl} \lim_{n\rightarrow\infty}&\frac{\#\left\{0\le t < n|\zeta_t=a_0,\dots,\zeta_{t+k}=a_{k}\right\}}{n}=\\&\mathop{\mathbb P}(\zeta_0=a_0,\dots,\zeta_{k}=a_{k})~\text{a.s}.\end{array}

A belief {\mu\in \Delta(A^\mathbb{N})} is ergodic if it is the distribution of an ergodic process

Before I explain the definition let me write the ergodicity condition for the special case of the block {\bar a=(a)} for some {a\in A} (this is a block of size 1):

\displaystyle \lim_{n\rightarrow\infty}\frac{\#\left\{0\le t < n|\zeta_t=a\right\}}{n}=\mathop{\mathbb P}(\zeta_0=a)~\text{a.s}.\ \ \ \ \ (1)

 

In the right side of (1) we have the (subjective) probability that on day {0} we will see the outcome {a}. Because of stationarity this is also the belief that we will see the outcome {a} on every other day. In the left side of (1) we have no probabilities at all. What is written there is the frequency of appearances of the outcome {a} in the realized sequence. This frequency is objective and has nothing to do with our beliefs. Therefore, the probabilities that a Bayesian agent with ergodic belief attaches to observing some outcome is a number that can be measured from the process: just observe it long enough and check the frequency in which this outcome appears. In a way, for ergodic processes the frequentist and subjective interpretations of probability coincide, but there are legitimate caveats to this statement, which I am not gonna delve into because my subject matter is not the meaning of probability. For my purpose it’s enough that ergodicity captures the intuition we have about the four agents I started with: Agents 1 and 3 both give probability {1/2} to success in each day. This means that if they are sold a lottery ticket that gives a prize if there is a success at day, say, 172, they both price this lottery ticket the same way. However, Agent 1 is certain that in the long run the frequency of success will be {1/2}. Agent 2 is certain that it will be either {2/3} or {1/3}. In fancy words, {\mu_1} is ergodic and {\mu_2} is not.

So, ergodic processes capture our intuition of `processes without structural uncertainty’. What about situations with uncertainty ? What mathematical creature captures this uncertainty ? Agent 2’s uncertainty seems to be captured by some probability distribution over two ergodic processes — the process “i.i.d. {2/3}” and the process “i.i.d. {1/3}”. Agent 2 is uncertain which of these processes he is facing. Agent 4’s uncertainty is captured by some probability distribution over a continuum of markov (ergodic) processes. This is a general phenomena:

Theorem 2 (The ergodic decomposition theorem) Let {\mathcal{E}} be the set of ergodic distributions over {A^\mathbb{N}}. Then for every stationary belief {\mu\in\Delta(A^\mathbb{N})} there exists a unique distribution {\lambda} over {\mathcal{E}} such that {\mu=\int \theta~\lambda(\text{d}\theta)}.

The probability distribution {\lambda} captures uncertainty about the structure of the process. In the case that {\mu} is an ergodic processes {\lambda} is degenerated and there is no structural uncertainty.

Two words of caution: First, my definition of ergodic processes is not the one you will see in textbooks. The equivalence to the textbook definition is an immediate consequence of the so called ergodic theorem, which is a generalization of the law of large numbers for ergodic processes. Second, my use of the word `uncertainty’ is not universally accepted. The term traces back at least to Frank Knight, who made the distinction between risk or “measurable uncertainty” and what is now called “Knightian uncertainty” which cannot be measured. Since Knight wrote in English and not in Mathematish I don’t know what he meant, but modern decision theorists, mesmerized by the Ellsberg Paradox, usually interpret risk as a Bayesian situation and Knightian uncertainty, or “ambiguity”, as a situation which falls outside the Bayesian paradigm. So if I understand correctly they will view the situations of these four agents mentioned above as situations of risk only without uncertainty. The way in which I use “structural uncertainty” was used in several theory papers. See this paper of Jonathan and Nabil. And this and the paper which I am advertising in these posts, about disappearance of uncertainty over time. (I am sure there are more.)

To be continued…

When I give a presentation about expert testing there is often a moment in which it dawns for the first time on somebody in the audience that I am not assuming that the processes are stationary or i.i.d. This is understandable. In most modeling sciences and in statistics stationarity is a natural assumption about a stochastic process and is often made without stating. In fact most processes one comes around are stationary or some derivation of a stationary process (think the white noise, or i.i.d. sampling, or markov chains in their steady state). On the other hand, most game theorists and micro-economists who work with uncertainty don’t know what is a stationary process even if they have heard the word (This is a time for you to pause and ask yourself if you know what’s stationary process). So a couple of introductory words about stationary processes is a good starting point to promote my paper with Nabil

First, a definition: A stationary process is a sequence {\zeta_0,\zeta_1,\dots} of random variables such that the joint distribution of {(\zeta_n,\zeta_{n+1},\dots)} is the same for all {n}-s. More explicitly, suppose that the variables assume values in some finite set {A} of outcomes. Stationarity means that for every {a_0,\dots,a_k\in A}, the probability {\mathop{\mathbb P}(\zeta_n=a_0,\dots,\zeta_{n+k}=a_{n+k})} is independent in {n}. As usual, one can talk in the language of random variables or in the language of distributions, which we Bayesianists also call beliefs. A belief {\mu\in\Delta(A^\mathbb{N})} about the infinite future is stationary if it is the distribution of a stationary process.

Stationarity means that Bob, who starts observing the process at day {n=0}, does not view this specific day as having any cosmic significance. When Alice arrives two weeks later at day {n=14} and starts observing the process she has the same belief about her future as Bob had when he first arrives (Note that Bob’s view at day {n=14} about what comes ahead might be different from Alice’s since he has learned something meanwhile, more on that later). In other words, each agent can denote by {0} the first day in which they start observing the process, but there is nothing in the process itself that day {0} corresponds to. In fact, when talking about stationary processes it will clear our thinking if we think of them as having infinite past and infinite future {\dots,\zeta_{-2},\zeta_{-1},\zeta_0,\zeta_1,\zeta_2,\dots}. We just happen to pop up at day {0}.

 

The first example of a stationary process is an i.i.d. process, such as the outcomes of repeated tossing of a coin with hsome probability {\theta} of success. If the probability of success is unknown then a Bayesian agent must have some prior {\lambda\in \Delta([0,1])} about {\theta}: The agent believes that {\theta} is randomized according to {\lambda} and then the outcomes are i.i.d. conditioned on {\theta}. A famous theorem of De-Finetti (wikipedia) characterizes all beliefs that are `mixtures of i.i.d.’ in this sense. All these beliefs are stationary.

Another example of stationary processes is Markov processes in their steady state. Again, we can generalize to situations in which the transition matrix is not known and one has some belief about it. Such situations are rather natural, but I don’t think there is a nice characterization of the processes that are mixtures of markov processes in this sense (that is, I don’t know of a De-Finetti Theorem for markov processes.) Still more general example is Markov process of some finite memory, for example when the outcome today depends on the history only through the outcomes of the last two days.

As an example of a stationary process which is not a Markov process of any finite memory consider a Hidden Markov model, according to which the outcome at every day is a function of an underlying, unobserved Markov process. If the hidden process is stationary then so is the observed process. This is an important property of stationary processes, which is obvious from the definition:

Theorem 1 Let {\dots,\zeta_{-2},\zeta_{-1},\zeta_0,\zeta_1,\dots} be a stationary process with values in some finite set {H}. Then the process {\dots,f(\zeta_{-2}),f(\zeta_{-1}),f(\zeta_0),f(\zeta_1),\dots} is stationary for every function {f:H\rightarrow A}.

As can be seen in all these examples, when one lives in a stationary environment then one has some (possibly degenerated) uncertainty about the parameters of the process. For example we have some uncertainty about the parameter of the coin or the markov chain or the hidden markov process. I still haven’t defined however what I mean by parameters of the process; What lurks behind is the ergodic decomposition theorem, which is an analogue of De-Finetti’s Theorem for stationary processes. I will talk about it in my next post. For now, let me say a word about the implications of uncertainty about parameters in economic modeling, which may account in part for the relative rareness of stationary processes in microeconomics (I will give another reason for that misfortune later):

Let Craig be a rational agent (=Bayesian expected utility maximizer) who lives in a stationary environment in which a coin is tossed every day. Craig has some uncertainty over the parameter of the coin, represented by a belief {\lambda\in\Delta([0,1])}. At every day, before observing the outcome of the coin, Craig takes an action. Craig’s payoff at every day depends on the action he took, the outcome of the coin, and possibly some other random objects which follow a stationary process observed by Craig. We observe the sequence of Craig’s actions. This process is not generally a stationary process. The reason is that Craig’s actions are functions of his posterior beliefs about the parameter of the coin, and this posterior belief does not follow a stationary process: as time goes by, Craig learns the parameter of the coin. His behavior in day {0}, when he doesn’t know the parameter is typically different from his behavior at day {14} when he already has a good idea about the parameter.

I said earlier that in stationarity environment, the point in time which we denote by {0} does not correspond to anything about the process itself but only reflect the point in time in which we start observing the process. In this example this is indeed the case with Craig, who starts observing the coin process at time {0}. It is not true for us. Our subject matter is not the coin, but Craig. And time {0} has a special meaning for Craig. Bottom line: Rational agents in a stationary environment will typically not behave in a stationary way.

To be continued…

Follow

Get every new post delivered to your Inbox.

Join 133 other followers