A Bayesian agent is observing a sequence of outcomes in {\{S,F\}}. The agent does not know the outcomes in advance, so he forms some belief {\mu} over sequences of outcomes. Suppose that the agent believes that the number {d} of successes in {k} consecutive outcomes is distributed uniformly in {\{0,1,\dots,k\}} and that all configurations with {d} successes are equally likely:

\displaystyle \mu\left(a_0,a_1,\dots,a_{k-1} \right)=\frac{1}{(k+1)\cdot {\binom{k}{d}}}

for every {a_0,a_1,\dots,a_{k-1}\in \{S,F\}} where {d=\#\{0\le i<k|a_i=S\}}.

You have seen this belief {\mu} already, though maybe not in this form. It is the belief of an agent who tosses an i.i.d. coin and has some uncertainty over the parameter of the coin, given by a uniform distribution over {[0,1]}.
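For concreteness, here is the standard computation behind this claim: conditional on a coin with parameter {p}, any particular sequence with {d} successes has probability {p^d(1-p)^{k-d}}, and averaging over the uniform prior gives the Beta integral

\displaystyle \int_0^1 p^d(1-p)^{k-d}\,\mathrm{d}p=\frac{d!\,(k-d)!}{(k+1)!}=\frac{1}{(k+1)\cdot\binom{k}{d}},

which is exactly the formula above.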

In this post I am gonna make a fuss about the fact that as time goes by the agent learns the parameter of the coin. The word `learning’ has several legitimate formalizations and today I am talking about the oldest and probably the most important one — consistency of posterior beliefs. My focus is somewhat different from that of textbooks because 1) As in the first paragraph, my starting point is the belief {\mu} about outcome sequences, before there are any parameters, and 2) I emphasize some aspects of consistency which are unsatisfactory in the sense that they don’t really capture our intuition about learning. Of course this is all part of the grand marketing campaign for my paper with Nabil, which uses a different notion of learning, so this discussion of consistency is a bit of a sidetrack. But I have already come across some VIP who I suspect was unaware of the distinction between different formulations of learning, and it wasn’t easy to treat his cocky blabbering in a respectful way. So it’s better to start with the basics.

Let {A} be a finite set of outcomes. Let {\mu\in\Delta(A^\mathbb{N})} be a belief over the set {A^\mathbb{N}} of infinite sequences of outcomes, also called realizations. A decomposition of {\mu} is given by a set {\Theta} of parameters, a belief {\lambda} over {\Theta}, and, for every {\theta\in\Theta}, a belief {\mu_\theta} over {A^\mathbb{N}} such that {\mu=\int \mu_\theta~\lambda(\mathrm{d}\theta)}. The integral in the definition means that the agent can think about the process as a two-stage randomization: first a parameter {\theta} is drawn according to {\lambda}, and then a realization {\omega} is drawn according to {\mu_\theta}. Thus, a decomposition captures a certain way in which a Bayesian agent arranges his belief. Of course every belief {\mu} admits many decompositions. The extreme decompositions are:

  • The Trivial Decomposition. Take {\Theta=\{\bar\theta\}} and {\mu_{\bar\theta}=\mu}.
  • Dirac’s Decomposition. Take {\Theta=A^\mathbb{N}} and {\lambda=\mu}. The belief attached to a “parameter” {\omega} in this case is the measure {\delta_\omega} that assigns probability 1 to the realization {\omega}.

Not all decompositions are equally exciting. We are looking for decompositions in which the parameter {\theta} captures some `fundamental property’ of the process. The two extreme cases mentioned above are usually unsatisfactory in this sense. In Dirac’s decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter, which therefore cannot discriminate between different interesting properties. For stationary processes, there is a natural decomposition in which the parameters distinguish between fundamental properties of the process. This is the ergodic decomposition, according to which the parameters are the ergodic beliefs. Recall that in this decomposition, a parameter captures the empirical distribution of blocks of outcomes in the infinite realization.
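To make this concrete, here is a toy simulation (in Python; the example and the helper names are my own, chosen only for illustration). Mix two ergodic beliefs, the i.i.d. fair coin and the process that outputs {S} forever: the mixture is stationary but not ergodic, and the empirical frequency of a block such as {SF} in the realization reveals which ergodic component was drawn.

    # Toy illustration: empirical block frequencies pin down the ergodic component.
    import random

    def block_frequency(seq, block):
        # Empirical frequency of a finite block inside a long realization.
        m = len(block)
        windows = len(seq) - m + 1
        return sum(seq[i:i + m] == block for i in range(windows)) / windows

    rng = random.Random(1)
    component = rng.choice(["iid fair coin", "all S"])   # the drawn "parameter"
    if component == "iid fair coin":
        realization = "".join(rng.choice("SF") for _ in range(100_000))
    else:
        realization = "S" * 100_000

    # Frequency of the block "SF": about 0.25 under the fair coin, 0 under "all S".
    print(component, block_frequency(realization, "SF"))

In this example the realization eventually reveals, almost surely, which ergodic belief governs the future, which is the sense in which the ergodic parameter is a `fundamental property’ of the process.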

So what about learning? While observing a process, a Bayesian agent will update his belief about the parameter. We denote by {\lambda_n\left(a_0,\dots,a_{n-1}\right)\in\Delta(\Theta)} the posterior belief about the parameter {\theta} at the beginning of period {n}, after observing the outcome sequence {a_0,\dots,a_{n-1}}. The notion of learning I want to talk about in this post is that this belief converges to a belief that is concentrated on the true parameter {\theta}. The example you should have in mind is the coin toss example I started with: while observing the outcomes of the coin the agent becomes more and more certain about the true parameter of the coin, which means that his posterior belief converges to the belief that gives probability {1} to the true parameter.
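In the coin example this convergence can be seen very explicitly. With a uniform prior on the bias and {s} successes in the first {n} tosses, the posterior over the bias is the Beta{(s+1,n-s+1)} distribution, whose variance is of order {1/n}. Here is a small simulation (Python again; the specific numbers and names are mine) that prints the posterior mean and variance along a realization of a coin with true bias {0.7}:

    # Posterior concentration in the uniform-prior coin example (illustration only).
    # With s successes in n tosses the posterior over the bias is Beta(s+1, n-s+1).
    import random

    def posterior_mean_var(s, n):
        a, b = s + 1, n - s + 1                  # Beta(a, b) posterior parameters
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    rng = random.Random(0)
    true_theta = 0.7                             # the true parameter of the coin
    s = 0
    for n in range(1, 10_001):
        s += rng.random() < true_theta           # observe one more toss
        if n in (10, 100, 1_000, 10_000):
            mean, var = posterior_mean_var(s, n)
            print(n, round(mean, 4), round(var, 6))   # mean -> 0.7, variance -> 0

The posterior mean approaches the true bias and the posterior variance vanishes, which is exactly the convergence to {\delta_\theta} required in the definition below.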

Definition 1 A decomposition of {\mu} is consistent if for {\lambda}-almost every {\theta} it holds that

\displaystyle \lambda_n\left(a_0,\dots,a_{n-1}\right)\xrightarrow[n\rightarrow\infty]{w}\delta_\theta

for {\mu_\theta}-almost every realization {\omega=(a_0,a_1,\dots)}.

In this definition, {\delta_\theta} is the Dirac atomic measure on {\theta} and the convergence is weak convergence of probability measures. No big deal if you don’t know what it means since it is exactly what you expect.

So, we have a notion of learning, and a seminal theorem of L.J. Doob (more on that below) implies that the ergodic decomposition of every stationary process is consistent. While this is not something that you will read in most textbooks (more on that below too), it is still well known. Why do Nabil and I dig further into the issue of learnability of the ergodic decomposition? Two reasons. First, one has to write papers. Second, there is something unsatisfactory about the concept of consistency as a formalization of learning. To see why, consider the belief {\mu} that outcomes are i.i.d. with probability {1/2} of success. This belief is ergodic, so from the perspective of the ergodic decomposition the agent `knows the process’ and there is nothing else to learn. But let’s look at Dirac’s decomposition instead of the ergodic decomposition. Then the parameter space equals the space of all realizations. Suppose the true parameter (=realization) is {\omega^\ast=(a_0,a_1,\dots)}. Then after observing the first {n} outcomes of the process, the agent’s posterior belief about the parameter is concentrated on all {\omega} that agree with {\omega^\ast} on the first {n} coordinates. These posterior beliefs converge to {\delta_{\omega^\ast}}, so Dirac’s decomposition is also consistent! We may say that we learn the parameter, but “learning the parameter” in this environment is just recording the past. The agent does not gain any new insight about the future of the process from learning the parameter.

In my next post I will talk about other notions of learning, originating in a seminal paper of Blackwell and Dubins, which capture the idea that an agent who learns a parameter can make predictions as if he knew the parameter. Let me also say here that this post and the following ones are much influenced by a paper of Jackson, Kalai, and Smorodinsky. I will say more about that paper in another post.

For the rest of this post I am going to make some comments about Bayesian consistency which, though again standard, I don’t usually see in textbooks. In particular, I don’t know of a reference for the version of Doob’s Theorem which I give below, so if any reader can give me such a reference it would be helpful.

First, you may wonder whether every decomposition is consistent. The answer is no. For a trivial example, take a situation where the beliefs {\mu_\theta} are the same for every {\theta}. More generally, trouble arises when the realization does not pin down the parameter. Formally, let us say that a function {f:A^\mathbb{N}\rightarrow \Theta} pins down or identifies the parameter if

\displaystyle \mu_\theta\left(\{\omega: f(\omega)=\theta\}\right)=1

for {\lambda}-almost every {\theta}. If such an {f} exists, then the decomposition is called identifiable.

We have the following

Theorem 2 (Doob’s Theorem) A decomposition is identifiable if and only if it is consistent.

The `if’ part follows immediately from the definitions. The `only if’ part is deep, but not difficult: it follows directly from the martingale convergence theorem. Indeed, Doob’s Theorem is usually cited as the first application of martingale theory.
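To sketch the martingale step (in the standard way, glossing over measurability issues): for a measurable set {B\subseteq\Theta}, the posterior probabilities {\lambda_n(B)} form a bounded martingale with respect to the filtration generated by the outcomes, so they converge almost surely to the conditional probability of {\{\theta\in B\}} given the whole realization. Identifiability says that {\theta=f(\omega)} is measurable with respect to the realization, so this limit is just the indicator of {\{\theta\in B\}}, which is consistency.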

Statisticians rarely work with the abstract formulation of decomposition that I use here. For this reason, the theorem is usually formulated only for the case in which {\Theta=\Delta(A)} and {\mu_\theta} is the i.i.d. process with marginal {\theta}. In this case the fact that the decomposition is identifiable follows from the strong law of large numbers. Doob’s Theorem then implies the standard consistency of the Bayesian estimator of the parameter {\theta}.
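Explicitly, in that case one can take the identification map to be the vector of limiting empirical frequencies,

\displaystyle f(\omega)(a)=\lim_{n\rightarrow\infty}\frac{1}{n}\#\{0\le i<n|a_i=a\}

for every {a\in A} (defined arbitrarily when the limit does not exist), and the strong law of large numbers says that {f(\omega)=\theta} for {\mu_\theta}-almost every realization {\omega}.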