You are currently browsing Eran’s articles.
Here is the bonus question from the final exam in my dynamic optimization class of last semester. It is based on problem 8 Chapter II in Ross’ book Introduction to stochastic dynamic programming. It appears there as `guess the optimal policy’ without asking for proof. The question seems very natural, but I couldn’t find any information about it (nor apparently the students). I have a solution but it is tricky and too specific to this problem. I will describe my solution next week but perhaps somebody can show me a better solution or a reference ?
We have two coins, a red one and a green one. When flipped, one lands heads with probability P1 and the other with probability P2. We do not know which coin is the P1 coin. We initially attach probability p to the red coin being the P1 coin. We receive one dollar for each heads and our objective is to maximize the total expected discounted return with discount factor β. Describe the optimal policy, including a proof of optimality.
Nicolas Copernicus’s de Revolutionibus, in which he advocated his theory that Earth revolved ar ound the sun, was first printed just before Copernicus’ death in 1543. It therefore fell on one Andreas Osiander to write the introduction. Here is a passage from the introduction:
[An astronomer] will adopt whatever suppositions enable [celestial] motions to be computed correctly from the principles of geometry for the future as well as for the past…. These hypotheses need not be true nor even probable. On the contrary, if they provide a calculus consistent with the observations, that alone is enough.
In other words, the purpose of the astronomer’s study is to capture the observed phenomena — to provide an analytic framework by which we can explain and predict what we see when we look at the sky. It turns out that it is more convenient to capture the phenomena by assuming that Earth revolved around the sun than by assuming, as the Greek astronomers did, a geo-centric epicyclical planet motion. Therefore let’s calculate the right time for Easter by making this assumption. As astronomers, we shouldn’t care whether this is actually true.
Whether or not Copernicus would have endorsed this approach is disputable. What is certain is that his book was at least initially accepted by the Catholic Church whose astronomers have used Copernicus’ model to develop the Gregorian Calendar. (Notice I said the word model btw, which is probably anachronistic but, I think, appropriately captures Osiander’s view). The person who caused the scandal was Galileo Galilei, who famously declared that if earth behaves as if it moves around the sun then, well, it moves around the sun. Yet it moves. It’s not a model, it’s reality. Physicists’ subject matter is the nature, not models about nature.
What about economists ? Econ theorists at least don’t usually claim that the components of their modeling of economic agents (think utilities, beliefs, discount factors, ambiguity aversions) correspond to any real elements of the physical world or of the cognitive process that the agent performs. When we say that Adam’s utility from apple is log(c) we don’t mean that Adam knows anything about logs. We mean — wait for it — that he behaves as if this is his utility, or, as Osiander would have put it, this utility provides a calculus consistent with the observations, and that alone is enough.
The contrast between theoretical economists’ `as if’ approach and physicists’ `and yet it moves’ approach is not as sharp as I would like it to be. First, from the physics side, modern interpretations of quantum physics view it, and by extension the entire physics enterprise, as nothing more than a computational tool to produce predictions. On the other hand, from the economics side, while I think it is still customary to pay lip service to the `as if’ orthodoxy at least in decision theory classes, I don’t often hear it in seminars. And when neuro-economists claim to localize the decision making process in the brain they seem to view the components of the model as more than just mathematical constructions.
Yep, I am advertising another paper. Stay tuned :)
This post describes the main theorem in my new paper with Nabil. Scroll down for open questions following this theorem. The theorem asserts that a Bayesian agent in a stationary environment will learn to make predictions as if he knew the data generating process, so that the as time goes by structural uncertainty dissipates. The standard example is when the sequence of outcomes is i.i.d. with an unknown parameter. As times goes by the agent learns the parameter.
The formulation of `learning to make predictions’ goes through merging, which traces back to Blackwell and Dubins. I will not give Blackwell and Dubin’s definition in this post but a weaker definition, suggested by Kalai and Lehrer.
A Bayesian agent observes an infinite sequence of outcomes from a finite set . Let represent the agent’s belief about the future outcomes. Suppose that before observing every day’s outcome the agent makes a probabilistic prediction about it. I denote by the element in which represents the agent’s prediction about the outcome of day just after he observed the outcomes of previous days. In the following definition it is instructive to think about as the true data generating process, i.e., the process that generates the sequence of outcomes, which may be different from the agent’s belief.
Definition 1 (Kalai and Lehrer) Let . Then merges with if for -almost every realization it holds that
Assume now that the agent’s belief is stationary, and let be its ergodic decomposition. Recall that in this decomposition ranges over ergodic beliefs and represents structural uncertainty. Does the agent learn to make predictions ? Using the definition of merging we can ask, does merges with ? The answer, perhaps surprisingly, is no. I gave an example in my previous post.
Let me now move to a weaker definition of merging, that was first suggested by Lehrer and Smorodinsky. This definition requires the agent to make correct predictions in almost every period.
Definition 2 Let . Then weakly merges with if -almost every realization it holds that
for a set of periods of density .
The definition of weak merging is natural: patient agents whose belief weakly merges with the true data generating process will make almost optimal decisions. Kalai, Lehrer and Smorodinsky discuss these notions of mergings and also their relationship with Dawid’s idea of calibration.
I am now in a position to state the theorem I have been talking about for two months:
Theorem 3 Let be stationary, and let be its ergodic decomposition. Then weakly merges with for -almost every .
In words: An agent who has some structural uncertainty about the data generating process will learn to make predictions in most periods as if he knew the data generating process.
Finally, here are the promised open questions. They deal with the two qualification in the theorem. The first question is about the “-almost every ” in the theorem. As Larry Wasserman mentioned this is unsatisfactory in some senses. So,
Question 1 Does there exists a stationary (equivalently a belief over ergodic beliefs) such that weakly merges with for every ergodic distribution ?
The second question is about strengthening weak merging to merging. We already know that this cannot be done for arbitrary belief over ergodic processes, but what if is concentrated on some natural family of processes, for example hidden markov processes with a bounded number of hidden states ? Here is the simplest setup for which I don’t know the answer.
Question 2 The outcome of the stock market at every day is either U or D (up or down). An agent believes that this outcome is a stochastic function of an unobserved (hidden) state of the economy which can be either G or B (good or bad): When the hidden state is B the outcome is U with probability (and D with probability ), and when the state is G the outcome is U with probability . The hidden state changes according to a markov process with transition probability , . The parameter is and the agent has some prior over the parameter. Does the agent’s belief about outcomes merge with the truth for -almost every ?.
One more word about organ selling before I return to my comfort zone and talk about Brownian motion in Lie groups. Selling living human organs is repugnant, in part because the sellers cause damage to their bodies out of desperation. But what about allowing your relatives to sell what’s left of you when you’re gone ? I think this should be uncontroversial. And there are side advantages too, in addition to increasing the number of transplantations. For example, it will encourage you to quit smoking.
Over to you, Walter.
Something funny happened when I started watching Al Roth’s lecture and looked at the paper: I realized that what I always assumed is the meaning of `repugnant transactions’ is not exactly the phenomena that Roth talks about. What I thought `repugnant transaction’ means is a situation of `two rights makes a wrong': it’s totally awesome that Xanders is willing to donate his extra kidney to Zordiac, and it’s really nice of Zordiac to donate money to Xanders, but these two nobles acts done together in exchange for each other is imoral and should be outlawed. Roth however defines `repugnant transaction’ more broadly as any transaction that some people want to engage in and others don’t think they should. Consider the opening example of his paper: laws against selling horse meat in restaurants. Here what is repugnant is not the exchange but the good itself. It’s not two rights makes wrong. It’s just wrong. We outlaw the exchange simply because of constitutional reasons or because it’s impossible to enforce a ban on eating — people will simply order take away and perform the crime of eating at their homes.
So let me offer a distinction between `repugnant exchanges’, where the good itself is fine but buying and selling it is repugnant and `repugnant good/services’ where the good or service are what is repugnant, even if for whatever reason what we actually outlaw is the transaction. Most of the examples that Roth gives fall into the `repugnant good/service’ category rather than `repugnant exchange’. Such is the case of buying and selling recreational drugs, endangered species, imported cultural property. Are there any examples for `repugnant exchanges’ in addition to selling human organs ? Well, there is `renting’ organs, as in surrogate motherhood. Anything else ? An interesting example is lending money with interest, which used to be repugnant in the West (we got over it already): The very idea of lending money was never considered repugnant. What was repugnant is doing it for payment in terms of interest. Finally, there is prostitution, which is illegal in the US. Repugnant service or repugnant exchange ? depends on your reasoning. Anti-prostitution laws have an unlikely coalition of supporters. There are the religious moralists, for whom the service (extra-marriage sexual intercourse) is what makes the transaction repugnant. They go for prostitution just because that’s what they can outlaw in the US. (They go further in Iran.) But there are also feminists and liberals who view prostitution as exploitation, as I view selling human organs. They find the exchange repugnant even if they have no problem with the service itself. Note that the example of prostitution shows the difficulty in the distinction I make between `repugnant good/service’ and `repugnant exchange': It relies on unobservable reasoning. Just by knowing the laws and customs of a society you don’t know to which category a forbidden transaction belongs. Moreover, since different people may have different reasoning, the category is sometimes not uniquely defined. But I still think it’s a useful distinction.
Here is Al Roth’s talk in the Lindau Meeting on Economic Sciences about repugnant transactions, which I guess is the technical term for the discomfort I feel at the idea of people donating their extra kidney to those who need it in return to, you know, money.
Before he was a Nobel Laureate Roth was a Nancy L. Schwartz Memorial Lecturer. His talk was about kidney exchanges — these are exchanges between several pairs of donor+recipient involving no money but only kidneys — and he started with a survey of the audience: who is in favor of allowing selling and buying of kidneys in the free market ? (I am glad I didn’t raise my hand. The next question was about selling and buying of living hearts.) I remember noticing that there was a correlation between raised hands and seniority: For whatever reason, seniors were more likely to be in favor of the free market than juniors.
In the dinner after the talk I ended up in a table of juniors & spouses and we got to discuss our objection to the idea of letting Bob sell his Kidney to Alice, so that Bob can afford to send his daughter to college, and in doing so save Alice’s small child from orphanhood. Turned out we agreed on the policy but for different reasons. I don’t remember which was my reason. I still find both of them convincing, though less so simultaneously.
Reason I: The market price would be too low. Hungry people will compete selling their organs for a bowl of red pottage out of desperation. The slippery slope leads to poor people being harvested for their body parts.
Reason II: The market price would be too high. Only the 0.01 % will be able to afford it. The slippery slope leads to a small aristocracy who live forever by regenerating their bodies.
As I said, both (somewhat) convincing. And please don’t ask me what would be the fair price, that is neither too low nor too high.
In the lasts posts I talked about a Bayesian agent in a stationary environment. The flagship example was tossing a coin with uncertainty about the parameter. As time goes by, he learns the parameter. I hinted about the distinction between `learning the parameter’, and `learning to make predictions about the future as if you knew the parameter’. The former seems to imply the latter almost by definition, but this is not so.
Because of its simplicity, the i.i.d. example is in fact somewhat misleading for my purposes in this post. If you toss a coin then your belief about the parameter of the coin determines your belief about the outcome tomorrow: if at some point your belief about the parameter is given by some then your prediction about the outcome tomorrow will be the expectation of . But in a more general stationary environment, your prediction about the outcome tomorrow depends on your current belief about the parameter and also on what you have seen in the past. For example, if the process is Markov with an unknown transition matrix then to make a probabilistic prediction about the outcome tomorrow you first form a belief about the transition matrix and then uses it to predict the outcome tomorrow given the outcome today. The hidden markov case is even more complicated, and it gives rise to the distinction between the two notions of learning.
The formulation of the idea of `learning to make predictions’ goes through merging. The definition traces back at least to Blackwell and Dubins. It was popularized in game theory by the Ehuds, who used Blackwell and Dubins’ theorem to prove that rational players will end up playing approximate Nash Equilibrium. In this post I will not explicitly define merging. My goal is to give an example for the `weird’ things that can happen when one moves from the i.i.d. case to an arbitrary stationary environment. Even if you didn’t follow my previous posts, I hope the following example will be intriguing for its own sake.
Every day there is a probability for eruption of war (W). If no war erupts then the outcome is either bad economy (B) or good economy (G) and is a function of the number of peaceful days since the last war. The function from the number of peaceful days to outcome is an unknown parameter of the process. Thus, a parameter is a function . I am going to compare the predictions about the future made by two agents: Roxana, who knows and Ursula, who faces some uncertainty about represented by a uniform belief over the set of all parameters. Both Roxana and Ursula don’t know the future outcomes and since both of them are rational decision makeres, they both use Bayes’ rule to form beliefs about the unknown future given what they have seen in the past.
Consider first Roxana. In the terminology I introduced in previous posts, she faces no structural uncertainty. After a period of consecutive peaceful days Roxana believes that with probability the outcome tomorrow will be W and with probability the outcome tomorrow will be .
Now consider Ursula. While she does not initially know , as times goes by she learns it. What do I mean here by learning ? Well, suppose Ursula starts observing the outcomes and she sees G,B,W,B,G,…. From this information Ursula she deduces that , so that if a peaceful day follows a war then it has a bad economy. Next time a war pops up Ursula will know to make a prediction about the outcome tomorrow which is as accurate as Roxana’s prediction. Similarly Ursula can deduce that . This way Ursula gradually deduces the values of while she observes the process. However, and this is the punch line, for every there will be a time when Ursula observes consecutive peaceful day for the first time and at this day her prediction about the next outcome will be for war, for good economy and for bad economy. Thus there will always be infinitely many occasions in which Ursula’s prediction differ from Roxana.
So, Ursula does learn the parameter in the sense that she gradually deduce more and more values of . However, because at every point in time she may require a different value of — This is the difference between the stationary environment and the i.i.d. environment ! — there may happen infinitely many times in which she has not yet been able to deduce the value of the parameter which she needs in order to make a prediction about the outcome tomorrow.
You may notice that Ursula does succeed in making predictions most of the times. In fact, the situations when she fails become more and more rare, after observing longer and longer blocks of peaceful days. Indeed, Nabil and I formalize this idea and show that this is the case in every stationary environment with structural uncertainty: the observer makes predictions approximately as if he knew the parameter in almost every day. For that, we use a weak notion of merging which was suggested by Lehrer and Smorodinsky. If you are interested then this is a good time to look at our paper.
Finally, the example given above is our adaptation to an example that appeared first in a paper by Boris Yakovlevich Ryabko. Ryabko’s paper is part of a relatively large literature about non-Bayesian predictions in stationary environment. I will explain the relationship between that literature and our paper in another post.
A Bayesian agent is observing a sequence of outcomes in . The agent does not know the outcomes in advance, so he forms some belief over sequences of outcomes. Suppose that the agent believes that the number of successes in consecutive outcomes is distributed uniformly in and that all configuration with successes are equally likely:
for every where .
You have seen this belief already though maybe not in this form. It is a belief of an agent who tosses an i.i.d. coin and has some uncertainty over the parameter of the coin, given by a uniform distribution over .
In this post I am gonna make a fuss about the fact that as time goes by the agent learns the parameter of the coin. The word `learning’ has several legitimate formalizations and today I am talking about the oldest and probably the most important one — consistency of posterior beliefs. My focus is somewhat different from that of textbooks because 1) As in the first paragraph, my starting point is the belief about outcome sequences, before there are any parameters and 2) I emphasize some aspects of consistency which are unsatisfactory in the sense that they don’t really capture our intuition about learning. Of course this is all part of the grand marketing campaign for my paper with Nabil, which uses a different notion of learning, so this discussion of consistency is a bit of a sidetrack. But I have already came across some VIP who i suspect was unaware of the distinction between different formulations of learning, and it wasn’t easy to treat his cocky blabbering in a respectful way. So it’s better to start with the basics.
Let be a finite set of outcomes. Let be a belief over the set of infinite sequences of outcomes, also called realizations. A decomposition of is given by a set of parameters, a belief over , and, for every a belief over such that . The integral in the definition means that the agent can think about the process as a two stages randomization: First a parameter is drawn according to and then a realization is drawn according to . Thus, a decomposition captures a certain way in which a Bayesian agent arranges his belief. Of course every belief admits many decompositions. The extreme decompositions are:
- The Trivial Decomposition. Take and .
- Dirac’s Decomposition. Take and . A “parameter” in this case is a measure that assigns probability 1 to the realization .
Not all decompositions are equally exciting. We are looking for decompositions in which the parameter captures some `fundamental property’ of the process. The two extreme cases mentioned above are usually unsatisfactory in this sense. In Dirac’s decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter and thus cannot discriminate between different interesting properties. For stationary process, there is a natural decomposition in which the parameters distinguish between fundamental properties of the process. This is the ergodic decomposition, according to which the parameters are the ergodic beliefs. Recall that in this decomposition, a parameter captures the empirical distribution of blocks of outcomes in the infinite realization.
So what about learning ? While observing a process, a Bayesian agent will update his belief about the parameter. We denote by the posterior belief about the parameter at the beginning of period after observing the outcome sequence . The notion of learning I want to talk about in this post is that this belief converges to a belief that is concentrated on the true parameter . The example you should have in mind is the coin toss example I started with: while observing the outcomes of the coin the agent becomes more and more certain about the true parameter of the coin, which means his posterior belief becomes concentrated around a belief that gives probability to the true parameter.
for -almost every realization .
In this definition, is Dirac atomic measure on and the convergence is weak convergence of probability measures. No big deal if you don’t know what it means since it is exactly what you expect.
So, we have a notion of learning, and a seminal theorem of L.J. Doob (more on that below) implies that the ergodic decomposition of every stationary process is consistent. While this is not something that you will read in most textbooks (more on that below too), it is still well known. Why do Nabil and I dig further into the issue of learnability of the ergodic decomposition ? Two reasons. First, one has to write papers. Second, there is something unsatisfactory about the concept of consistency as a formalization of learning. To see why, consider the belief that outcomes are i.i.d. with probability for success. This belief is ergodic, so from the perspective of the ergodic decomposition the agent `knows the process’ and there is nothing else to learn. But let’s look at Dirac’s decomposition instead of the ergodic decomposition. Then the parameter space equals the space of all realizations. Suppose the true parameter (=realization) is , then after observing the first outcomes of the process the agent’s posterior belief about the parameter is concentrated on all that agrees with on the first coordinates. These posterior beliefs converge to , so that Dirac decomposition is also consistent ! We may say that we learn the parameter, but “learning the parameter” in this environment is just recording the past. The agent does not gain any new insight about the future of the process from learning the parameter.
In my next post I will talk about other notions of learning, originating in a seminal paper of Blackwell and Dubins, which capture the idea that an agent who learns a parameter can make predictions as if he new the parameter. Let me also say here that this post and the following ones are much influenced by a paper of Jackson, Kalai, Smorodinsky. I will talk more on that paper in another post.
For the rest of this post I am going to make some comments about Bayesian consistency which, though again standard, I don’t usually see them in textbooks. Especially I don’t know of a reference for the version of Doob’s Theorem which I give below, so if any reader can give me such a reference it will be helpful.
First, you may wonder whether every decomposition is consistent. The answer is no. For a trivial example, take a situation where are the same for every . More generally troubles arrise when the realization does not pin down the parameter. Formally, let us say that function pins down or identifies the parameter if
We have the following
Theorem 2 (Doob’s Theorem) A decomposition is identifiable if and only if it is consistent.
The `if’ part follows immediately from the definitions. The `only if’ part is deep, but not difficult: it follows immediately from the martingale convergence theorem. Indeed, Doob’s Theorem is usually cited as the first application of martingales theory.
Statisticians rarely work with this abstract formulation of decomposition which I use. For this reason, the theorem is usually formulated only for the case that and is i.i.d. . In this case the fact that the decomposition is identifiable follows from the strong law of large numbers. Doob’s Theorem then implies the standard consistency of Bayesian estimator of the parameter .
Four agents are observing infinite streams of outcomes in . None of them knows the future outcomes and as good Bayesianists they represent their beliefs about unknowns as probability distributions:
- Agent 1 believes that outcomes are i.i.d. with probability of success.
- Agent 2 believes that outcomes are i.i.d. with probability of success. She does not know ; She believes that is either or , and attaches probability to each possibility.
- Agent 3 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability .
- Agent 4 believes that outcomes follow a markov process: every day’s outcome equals yesterday’s outcome with probability . She does not know ; Her belief about is the uniform distribution over .
I denote by the agents’ beliefs about future outcomes.
We have an intuition that Agents 2 and 4 are in a different situations from Agents 1 and 3, in the sense that are uncertain about some fundamental properties of the stochastic process they are facing. I will say that they have `structural uncertainty’. The purpose of this post is to formalize this intuition. More explicitly, I am looking for a property of a belief over that will distinguish between beliefs that reflect some structural uncertainty and beliefs that don’t. This property is ergodicity.
You may have heard about ResearchGate, the so called facebook of scientists. Yes, another social network. Its structure is actually more similar to twitter: each user is a node and you can create directed edges from yourself to other users. Since I finally got rid of my facebook account (I am a Bellwether. In five years all the cool guys will not be on facebook), I decided to try ResearchGate. I wanted a stable platform to upload my preferable versions of my papers so that they will be the first to pop up on google. Also, I figured if I am returning to blogging then I need stuff to bitch about. ResearchGate only partially fulfill the first goal, but it does pretty well with the second.