
This post describes the main theorem in my new paper with Nabil. Scroll down for open questions that follow from this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that as time goes by structural uncertainty dissipates. The standard example is a sequence of i.i.d. outcomes with an unknown parameter: as time goes by the agent learns the parameter.

The formulation of `learning to make predictions' goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubins' definition in this post but a weaker one, suggested by Kalai and Lehrer.

A Bayesian agent observes an infinite sequence of outcomes from a finite set {A}. Let {\mu\in\Delta(A^\mathbb{N})} represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by {\mu(\cdot|a_0,\dots,a_{n-1})} the element in {\Delta(A)} which represents the agent’s prediction about the outcome of day {n} just after he observed the outcomes {a_0,\dots,a_{n-1}} of previous days. In the following definition it is instructive to think about {\tilde\mu} as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.

Definition 1 (Kalai and Lehrer) Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0.
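To see this definition in action in the simplest i.i.d. case, here is a minimal Python sketch (my own illustration, not from the paper; the uniform prior and the resulting Laplace rule are assumptions made for concreteness). Under a uniform prior on the bias of a coin, the agent's prediction after observing the history is {(\text{number of heads}+1)/(n+2)}, and its distance to the true one-step conditional probability shrinks along the realization:

```python
import random

def laplace_prediction(history):
    # Posterior predictive P(next outcome = 1) under a uniform prior on the bias
    return (sum(history) + 1) / (len(history) + 2)

random.seed(0)
p = 0.7                       # true parameter, unknown to the agent
history = []
for n in range(5000):
    history.append(1 if random.random() < p else 0)

# Distance between the agent's one-step prediction and the true conditional
tv = abs(laplace_prediction(history) - p)
print(tv)                     # small: the predictions merge with the truth
```

In the i.i.d. case this convergence holds along the whole sequence of periods; the point of the post is that for general stationary beliefs it need not.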

Assume now that the agent's belief {\mu} is stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Recall that in this decomposition {\theta} ranges over ergodic beliefs and {\lambda} represents structural uncertainty. Does the agent learn to make predictions? Using the definition of merging we can ask: does {\mu} merge with {\theta}? The answer, perhaps surprisingly, is no. I gave an example in my previous post.

Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.

Definition 2 Let {\mu,\tilde\mu\in\Delta(A^\mathbb{N})}. Then {\mu} weakly merges with {\tilde\mu} if for {\tilde\mu}-almost every realization {(a_0,\dots,a_{n-1},\dots)} it holds that

\displaystyle \lim_{n\rightarrow\infty,n\in T}\|\mu(\cdot|a_0,\dots,a_{n-1})-\tilde\mu(\cdot|a_0,\dots,a_{n-1})\|=0

for a set {T\subseteq \mathbb{N}} of periods of density {1}.
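The "density {1}" requirement can be made concrete with a toy computation (purely illustrative, not from the paper): a sequence of prediction errors may fail to converge to zero outright and yet converge to zero along a set of periods of density {1}, for example if the errors are large only on the perfect squares:

```python
# Errors equal 1 on the perfect squares and 0 elsewhere: no convergence to 0
# along all periods, but convergence along the complement of the squares,
# which is a set of density 1.
squares = {k * k for k in range(1, 200)}
errors = [1.0 if n in squares else 0.0 for n in range(1, 10001)]

# Fraction of the first 10000 periods with a small error
good_density = sum(1 for e in errors if e < 0.5) / len(errors)
print(good_density)  # 0.99: only 100 of the first 10000 periods are squares
```

Weak merging allows exactly this kind of sparse, never-ending set of exceptional periods.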

The definition of weak merging is natural: patient agents whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of merging and also their relationship with Dawid's idea of calibration.

I am now in a position to state the theorem I have been talking about for two months:

Theorem 3 Let {\mu\in\Delta(A^\mathbb{N})} be stationary, and let {\mu=\int \theta~\lambda(\mathrm{d}\theta)} be its ergodic decomposition. Then {\mu} weakly merges with {\theta} for {\lambda}-almost every {\theta}.

In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.

Finally, here are the promised open questions. They deal with the two qualifications in the theorem. The first question concerns the "{\lambda}-almost every {\theta}" clause. As Larry Wasserman mentioned, this is unsatisfactory in some sense. So,

Question 1 Does there exist a stationary {\mu} (equivalently, a belief {\lambda} over ergodic beliefs) such that {\mu} weakly merges with {\theta} for every ergodic distribution {\theta}?

The second question is about strengthening weak merging to merging. We already know that this cannot be done for an arbitrary belief {\lambda} over ergodic processes, but what if {\lambda} is concentrated on some natural family of processes, for example hidden Markov processes with a bounded number of hidden states? Here is the simplest setup for which I don't know the answer.

Question 2 Every day the outcome of the stock market is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy, which can be either G or B (good or bad): when the hidden state is B the outcome is U with probability {q_B} (and D with probability {1-q_B}), and when the state is G the outcome is U with probability {q_G}. The hidden state changes according to a Markov process with transition probabilities {\rho(B|B)=1-\rho(G|B)=p_B}, {\rho(B|G)=1-\rho(G|G)=p_G}. The parameter is {(p_B,p_G,q_B,q_G)} and the agent has some prior {\lambda} over the parameter. Does the agent's belief about outcomes merge with the truth for {\lambda}-almost every {(p_B,p_G,q_B,q_G)}?
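For concreteness, here is how an agent who knows the parameter {(p_B,p_G,q_B,q_G)} would form one-step predictions: a standard forward filter over the hidden state. This is a minimal Python sketch of that known-parameter benchmark (the uniform initial belief over the hidden state is my assumption); the question asks whether a Bayesian agent with a prior {\lambda} over the parameter merges with these predictions.

```python
def hmm_predictions(params, outcomes):
    # Forward filter for the two-state hidden Markov model in Question 2.
    # params = (pB, pG, qB, qG); outcomes use 1 for U and 0 for D.
    # belief_B tracks P(today's hidden state is B | past outcomes).
    pB, pG, qB, qG = params
    belief_B = 0.5            # assumed uniform prior over the initial hidden state
    preds = []
    for y in outcomes:
        # Prediction for today's outcome before observing it: P(outcome = U)
        preds.append(belief_B * qB + (1 - belief_B) * qG)
        # Bayes update on the observed outcome
        like_B = qB if y == 1 else 1 - qB
        like_G = qG if y == 1 else 1 - qG
        post_B = belief_B * like_B / (belief_B * like_B + (1 - belief_B) * like_G)
        # Push the belief through the Markov transition to tomorrow's state
        belief_B = post_B * pB + (1 - post_B) * pG
    return preds

preds = hmm_predictions((0.9, 0.2, 0.8, 0.3), [1, 1, 0, 1])
print(preds)
```

With these toy parameters the state B emits U more often, so observing U raises the filtered probability of B and hence the next prediction of U.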

In the last posts I talked about a Bayesian agent in a stationary environment. The flagship example was tossing a coin with uncertainty about the parameter; as time goes by, the agent learns the parameter. I hinted at the distinction between `learning the parameter' and `learning to make predictions about the future as if you knew the parameter'. The former seems to imply the latter almost by definition, but this is not so.

Because of its simplicity, the i.i.d. example is in fact somewhat misleading for my purposes in this post. If you toss a coin then your belief about the parameter of the coin determines your belief about the outcome tomorrow: if at some point your belief about the parameter is given by some {\mu\in [0,1]} then your prediction about the outcome tomorrow will be the expectation of {\mu}. But in a more general stationary environment, your prediction about the outcome tomorrow depends on your current belief about the parameter and also on what you have seen in the past. For example, if the process is Markov with an unknown transition matrix then to make a probabilistic prediction about the outcome tomorrow you first form a belief about the transition matrix and then use it to predict the outcome tomorrow given the outcome today. The hidden Markov case is even more complicated, and it gives rise to the distinction between the two notions of learning.
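The Markov case can be sketched in a few lines (my own toy illustration, assuming independent uniform priors on each row of the transition matrix, so the posterior predictive is a Laplace-smoothed transition count). Note how the prediction depends both on the estimated matrix and on today's outcome:

```python
def markov_prediction(history):
    # Count the observed transitions between the two outcomes 0 and 1
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for prev, nxt in zip(history, history[1:]):
        counts[(prev, nxt)] += 1
    # The prediction conditions on today's outcome: only the relevant
    # row of the (estimated) transition matrix matters.
    today = history[-1]
    row_total = counts[(today, 0)] + counts[(today, 1)]
    # Laplace-smoothed posterior predictive P(tomorrow = 1 | today, history)
    return (counts[(today, 1)] + 1) / (row_total + 2)

print(markov_prediction([0, 1, 1, 0, 1, 1, 1, 0]))  # 0.75
```

Two histories with the same transition counts but different last outcomes yield different predictions, which is exactly the dependence on the past that the i.i.d. case hides.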

The formulation of the idea of `learning to make predictions' goes through merging. The definition traces back at least to Blackwell and Dubins. It was popularized in game theory by the Ehuds, who used Blackwell and Dubins' theorem to prove that rational players will end up playing approximate Nash equilibrium. In this post I will not explicitly define merging. My goal is to give an example of the `weird' things that can happen when one moves from the i.i.d. case to an arbitrary stationary environment. Even if you didn't follow my previous posts, I hope the following example is intriguing in its own right.

Every day there is a probability {1/2} of eruption of war (W). If no war erupts then the outcome is either bad economy (B) or good economy (G), and it is a function of the number of peaceful days since the last war. The function from the number of peaceful days to the outcome is an unknown parameter of the process. Thus, a parameter is a function {\theta:\{1,2,\dots\}\rightarrow\{\text{B},\text{G}\}}. I am going to compare the predictions about the future made by two agents: Roxana, who knows {\theta}, and Ursula, who faces some uncertainty about {\theta}, represented by a uniform belief over the set of all parameters. Neither Roxana nor Ursula knows the future outcomes, and since both are rational decision makers, they use Bayes' rule to form beliefs about the unknown future given what they have seen in the past.

Consider first Roxana. In the terminology I introduced in previous posts, she faces no structural uncertainty. After a period of {k} consecutive peaceful days Roxana believes that with probability {1/2} the outcome tomorrow will be W and with probability {1/2} the outcome tomorrow will be {\theta(k)}.

Now consider Ursula. While she does not initially know {\theta}, as time goes by she learns it. What do I mean here by learning? Well, suppose Ursula starts observing the outcomes and sees G,B,W,B,G,…. From this information she deduces that {\theta(1)=\text{B}}: a peaceful day that follows a war has a bad economy. The next time a war pops up, Ursula will make a prediction about the outcome tomorrow that is as accurate as Roxana's. Similarly, Ursula can deduce that {\theta(2)=\text{G}}. In this way Ursula gradually deduces the values of {\theta(k)} as she observes the process. However, and this is the punch line, for every {k\in\{1,2,3,\dots\}} there will be a time when Ursula observes {k} consecutive peaceful days for the first time, and on that day her prediction about the next outcome will be {1/2} for war, {1/4} for good economy and {1/4} for bad economy. Thus there will always be infinitely many occasions on which Ursula's prediction differs from Roxana's.

So, Ursula does learn the parameter in the sense that she gradually deduces more and more values of {\theta(k)}. However, because at every point in time she may require a different value of {\theta(k)} (this is the difference between the stationary environment and the i.i.d. environment!), there may be infinitely many times at which she has not yet been able to deduce the value of the parameter that she needs in order to make a prediction about the outcome tomorrow.
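Here is a toy Monte Carlo sketch of the example (my own illustration, under the convention that the {k}-th consecutive peaceful day has economy {\theta(k)}). Ursula's prediction differs from Roxana's exactly on the days when the current peaceful streak is about to exceed every streak observed so far, since only then does she not yet know the relevant value of {\theta}. The fraction of such days comes out tiny, even though their number keeps growing:

```python
import random

random.seed(1)
days = 20000
streak = 0          # current run of consecutive peaceful days
max_seen = 0        # longest run observed so far: Ursula knows theta(k) for k <= max_seen
disagree = 0        # days on which Ursula's prediction differs from Roxana's

for _ in range(days):
    # Before today's outcome, the next peaceful day would be the (streak+1)-th.
    # Ursula knows theta(streak + 1) iff streak + 1 <= max_seen; otherwise she
    # hedges 1/4-1/4 between G and B while Roxana puts 1/2 on theta(streak + 1).
    if streak + 1 > max_seen:
        disagree += 1
    if random.random() < 0.5:   # war erupts with probability 1/2
        streak = 0
    else:
        streak += 1
        max_seen = max(max_seen, streak)

print(disagree, disagree / days)
```

Disagreement days occur only while a new record streak is being attempted, which happens ever more rarely as the observed record grows; this is weak merging at work.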

You may notice that Ursula does succeed in making predictions most of the time. In fact, the situations in which she fails become rarer and rarer, occurring only after she observes longer and longer blocks of peaceful days. Indeed, Nabil and I formalize this idea and show that this is the case in every stationary environment with structural uncertainty: the observer makes predictions approximately as if he knew the parameter on almost every day. For that, we use a weak notion of merging which was suggested by Lehrer and Smorodinsky. If you are interested, this is a good time to look at our paper.

Finally, the example above is our adaptation of an example that first appeared in a paper by Boris Yakovlevich Ryabko. Ryabko's paper is part of a relatively large literature on non-Bayesian predictions in stationary environments. I will explain the relationship between that literature and our paper in another post.
