
200 students for a 9 am class in spite of a midterm on day 3; perhaps they’ve not read the syllabus.

Began with the ultimatum game, framed in terms of a seller making a take-it-or-leave-it offer to the buyer. The game allows one to make two points at the very beginning of class.

1) The price the seller chooses depends on their model of how the buyer will behave. One can draw this point out by asking sellers to explain how they came by their offers. The best offers to discuss are the really low ones (i.e., those that give most of the surplus to the buyer) and the offers that split the difference.

2) Under the assumption that `more money is better than less', point out that the seller captures most of the gains from trade. Why? The ability to make a credible take-it-or-leave-it offer.

This makes for a smooth transition into the model of quasi-linear preferences. Some toy examples of how buyers make choices based on surplus. Emphasize that it captures the idea that buyers make trade-offs (pay more if you get more; if it's priced low enough, it's good enough). Someone will ask about budget constraints. A good question; ignore budgets for now and come back to them later in the semester.

Next, point out that buyers do not share the same reservation price (RP) for a good or service. Introduce the demand curve as a vehicle for summarizing the variation in RPs. Emphasize that the demand curve tells you how demand changes as you change your price, holding other prices fixed.
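
A minimal sketch in Python of the toy construction (all numbers hypothetical): each buyer purchases one unit exactly when her surplus RP - p is non-negative, and summing over buyers at each price traces out the demand curve.

# Hypothetical reservation prices for ten buyers.
reservation_prices = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

def demand(p):
    """Buyers who purchase at price p: those with non-negative surplus RP - p."""
    return sum(1 for rp in reservation_prices if rp - p >= 0)

for p in [2, 4, 6, 8]:
    print(f"price {p}: quantity demanded {demand(p)}")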

On to monopoly with constant unit costs, limited to a uniform price. Emphasize that monopoly in our context does not mean the absence of competition, only that competition keeps other prices fixed as we change ours. The reason for such an assumption is to first understand how buyers respond to one seller's price changes.

How does the monopolist choose the profit-maximizing price? Trade-off between margin and volume. Simple monopoly pricing exercise. The answer by itself is uninteresting; we want to know what the profit-maximizing price depends upon.

Introduce the elasticity of demand, its meaning and derivation. Then, a table of how profit and elasticity vary with price in the toy example introduced earlier. Point out how elasticity rises as price rises: demand starts to drop off faster than the margin rises. Explain why we don't stop where elasticity is 1: at that point a small price increase is revenue neutral but total costs fall, so profit still rises. So, the uniform price is doing two things: determining how much is captured from buyers and controlling total production costs. The table also illustrates that the elasticity of demand matters for choosing the price.
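
A sketch of such a table, assuming a hypothetical linear demand q = 100 - p with unit cost 20 (for this demand the point elasticity at price p is p/(100 - p)):

# Hypothetical linear demand q = 100 - p with constant unit cost c = 20.
c = 20

def q(p):
    return 100 - p

print(" p    q   elasticity  profit")
for p in range(30, 75, 5):
    elasticity = p / q(p)      # (-dq/dp) * (p/q) with dq/dp = -1
    profit = (p - c) * q(p)    # margin times volume
    print(f"{p:3d}  {q(p):3d}  {elasticity:9.2f}  {profit:6d}")

Revenue alone peaks where elasticity is 1 (p = 50 here), but profit keeps rising until p = 60, where elasticity is 1.5: past p = 50 a small price increase is roughly revenue neutral while total production costs keep falling.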

Segue into the markup formula. Explain why we should expect some kind of inverse relationship between markup and elasticity. Then do the derivation of the markup formula with constant unit costs.
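
For the record, the derivation in the constant-unit-cost case. Profit is {(p-c)D(p)}, so the first-order condition is

\displaystyle D(p)+(p-c)D'(p)=0,

which rearranges into the markup formula

\displaystyle \frac{p-c}{p}=-\frac{D(p)}{pD'(p)}=\frac{1}{\epsilon(p)},

where {\epsilon(p)=-pD'(p)/D(p)} is the elasticity of demand: the more elastic demand is at the optimum, the smaller the markup.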

Now to something interesting, to make the point that what has come before is very useful: author vs. publisher, who would prefer a higher price for the book? You'll get all possible answers, which is perfect. Start with how revenue differs from profit (authors are paid a percentage of revenue). This difference means their interests are not aligned, so they should pick different prices. But which will be larger? Enter the markup formula. The author wants the price where elasticity is 1. The publisher wants the price where elasticity exceeds 1, and since elasticity rises with price, that is a higher price. Wait, what about e-books? Then author and publisher want the same price, because unit costs are zero.
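
A sketch with the same hypothetical linear demand as above: the author maximizes revenue {pD(p)}, the publisher maximizes profit {(p-c)D(p)}, and with {c>0} the publisher's optimum sits where elasticity exceeds 1, i.e., at a higher price; at {c=0} the two problems coincide.

# Hypothetical linear demand q = 100 - p; unit cost 20 for the print book.
def q(p):
    return 100 - p

def best_price(objective):
    """Grid search over integer prices for the maximizer."""
    return max(range(1, 100), key=objective)

author    = best_price(lambda p: p * q(p))         # revenue: elasticity = 1
publisher = best_price(lambda p: (p - 20) * q(p))  # profit: elasticity > 1
ebook     = best_price(lambda p: (p - 0) * q(p))   # zero unit cost
print(author, publisher, ebook)  # 50, 60, 50: the publisher wants the higher price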

This is the perfect opportunity to introduce the Amazon letter to authors telling them that the elasticity of demand for e-books at the current $14.99 price is about 2.4, well above 1. Clearly, all parties should agree to lower the price of e-books. But what about traditional books? Surely a lower e-book price will cause some readers to switch from the traditional book to the e-book; shouldn't we count the loss in profit from that as well? A capital point, but make life simple: suppose we have only e-books. Notice that under the agency model, where Amazon gets a percentage of revenue, everyone's incentives appear to be aligned.
Is Amazon correct in its argument that dropping the e-book price will benefit me, the author? As expressed in their letter, no. To say that the elasticity of demand for my book at the current price is 2.4 means that if I drop my price 1%, demand will rise 2.4% HOLDING OTHER PRICES FIXED, so my revenue would rise by roughly 1.4%. However, Amazon is not talking about dropping the price of my book alone. They are urging a drop in the price of ALL books. It may well be that a drop in the price of all e-books will result in an increase in total revenues for the e-book category. This is good for Amazon. However, it is not at all clear that it is good for me. The rustling of papers and creaking of seats is a sign that time is up.

In the last posts I talked about a Bayesian agent in a stationary environment. The flagship example was tossing a coin with uncertainty about the parameter. As time goes by, he learns the parameter. I hinted at the distinction between `learning the parameter' and `learning to make predictions about the future as if you knew the parameter'. The former seems to imply the latter almost by definition, but this is not so.

Because of its simplicity, the i.i.d. example is in fact somewhat misleading for my purposes in this post. If you toss a coin, then your belief about the parameter of the coin determines your belief about the outcome tomorrow: if at some point your belief about the parameter is given by some {\mu\in\Delta([0,1])}, then your prediction about the outcome tomorrow will be the expectation of {\mu}. But in a more general stationary environment, your prediction about the outcome tomorrow depends on your current belief about the parameter and also on what you have seen in the past. For example, if the process is Markov with an unknown transition matrix, then to make a probabilistic prediction about the outcome tomorrow you first form a belief about the transition matrix and then use it to predict the outcome tomorrow given the outcome today. The hidden Markov case is even more complicated, and it gives rise to the distinction between the two notions of learning.
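
To make the Markov case concrete, here is a minimal sketch (my own illustration, with a hypothetical two-outcome history and a uniform Dirichlet prior on each row of the transition matrix): the prediction for tomorrow combines the posterior over the matrix with the outcome observed today.

from collections import defaultdict

history = "GGBGBBGGGB"  # hypothetical observed outcomes

# Count the observed transitions.
counts = defaultdict(lambda: defaultdict(int))
for today, tomorrow in zip(history, history[1:]):
    counts[today][tomorrow] += 1

def predict(today, states="GB"):
    """Posterior-predictive distribution of tomorrow's outcome given today's,
    under an independent uniform Dirichlet prior on each row of the matrix."""
    row = counts[today]
    total = sum(row.values()) + len(states)  # Dirichlet(1,...,1) = add-one smoothing
    return {s: (row[s] + 1) / total for s in states}

# Unlike the i.i.d. case, the prediction depends on today's outcome too.
print(predict("G"), predict("B"))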

The formulation of the idea of `learning to make predictions' goes through merging. The definition traces back at least to Blackwell and Dubins. It was popularized in game theory by the Ehuds, who used Blackwell and Dubins' theorem to prove that rational players will end up playing an approximate Nash equilibrium. In this post I will not explicitly define merging. My goal is to give an example of the `weird' things that can happen when one moves from the i.i.d. case to an arbitrary stationary environment. Even if you didn't follow my previous posts, I hope the following example will be intriguing for its own sake.

Every day there is probability {1/2} of an eruption of war (W). If no war erupts, then the outcome is either bad economy (B) or good economy (G), and it is a function of the number of peaceful days since the last war. The function from the number of peaceful days to the outcome is an unknown parameter of the process. Thus, a parameter is a function {\theta:\{1,2,\dots\}\rightarrow\{\text{B},\text{G}\}}. I am going to compare the predictions about the future made by two agents: Roxana, who knows {\theta}, and Ursula, who faces some uncertainty about {\theta}, represented by a uniform belief over the set of all parameters. Neither Roxana nor Ursula knows the future outcomes, and since both are rational decision makers, both use Bayes' rule to form beliefs about the unknown future given what they have seen in the past.

Consider first Roxana. In the terminology I introduced in previous posts, she faces no structural uncertainty. After a period of {k} consecutive peaceful days, Roxana believes that with probability {1/2} the outcome tomorrow will be W, and with probability {1/2} it will be {\theta(k+1)}.

Now consider Ursula. While she does not initially know {\theta}, as time goes by she learns it. What do I mean here by learning? Well, suppose Ursula starts observing the outcomes and she sees G,B,W,B,G,…. From this information Ursula deduces that {\theta(1)=\text{B}}: if a peaceful day follows a war, then it has a bad economy. The next time a war pops up, Ursula will make a prediction about the outcome tomorrow that is as accurate as Roxana's. Similarly, Ursula can deduce that {\theta(2)=\text{G}}. In this way Ursula gradually deduces the values of {\theta(k)} as she observes the process. However, and this is the punch line, for every {k\in\{1,2,3,\dots\}} there will be a time when Ursula observes {k} consecutive peaceful days for the first time, and on that day her prediction about the next outcome will be {1/2} for war, {1/4} for good economy and {1/4} for bad economy. Thus there will always be infinitely many occasions on which Ursula's predictions differ from Roxana's.
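
A simulation sketch of the example (my own illustration; the parameter is drawn at random): Ursula's prediction disagrees with Roxana's exactly on the days when the current run of peaceful days is longer than any run she has seen before.

import random

random.seed(1)
theta = {}     # the unknown parameter theta(k), generated lazily
known = set()  # indices k for which Ursula has already deduced theta(k)
run = 0        # current number of consecutive peaceful days
disagree = 0
DAYS = 100_000

for day in range(DAYS):
    # Tomorrow is W with prob. 1/2, otherwise theta(run + 1). Roxana knows
    # theta(run + 1); Ursula's prediction differs iff she hasn't deduced it yet.
    if (run + 1) not in known:
        disagree += 1
    if random.random() < 0.5:
        run = 0                                     # war resets the count
    else:
        run += 1
        theta.setdefault(run, random.choice("BG"))  # peaceful day shows theta(run)
        known.add(run)                              # ...and Ursula deduces it

print(f"{disagree} disagreements over {DAYS} days")
# Rare (one per record-breaking run of peaceful days), but over an infinite
# horizon they never stop arriving.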

So, Ursula does learn the parameter in the sense that she gradually deduces more and more values of {\theta(k)}. However, because at every point in time she may require a different value of {\theta(k)} (this is the difference between the stationary environment and the i.i.d. environment!), there may be infinitely many times at which she has not yet been able to deduce the value of the parameter she needs in order to make a prediction about the outcome tomorrow.

You may notice that Ursula does succeed in making predictions most of the time. In fact, the situations in which she fails become rarer and rarer, since they require observing longer and longer blocks of peaceful days. Indeed, Nabil and I formalize this idea and show that this is the case in every stationary environment with structural uncertainty: the observer makes predictions approximately as if he knew the parameter on almost every day. For that, we use a weak notion of merging which was suggested by Lehrer and Smorodinsky. If you are interested, then this is a good time to look at our paper.

Finally, the example given above is our adaptation of an example that first appeared in a paper by Boris Yakovlevich Ryabko. Ryabko's paper is part of a relatively large literature about non-Bayesian prediction in stationary environments. I will explain the relationship between that literature and our paper in another post.

About a year ago, I chanced to remark upon the state of Intermediate Micro within the hearing of my colleagues. It was remarkable, I said, that the nature of the course had not changed in half a century. What is more, the order in which topics were presented was mistaken and the exercises on a par with Vogon poetry, which I reproduce below for comparison:

“Oh freddled gruntbuggly,
Thy micturations are to me
As plurdled gabbleblotchits on a lurgid bee.
Groop, I implore thee, my foonting turlingdromes,
And hooptiously drangle me with crinkly bindlewurdles,
Or I will rend thee in the gobberwarts
With my blurglecruncheon, see if I don’t!”

The mistake was not to think these things, or even say them. It was to utter them within earshot of one’s colleagues. For this carelessness, my chair very kindly gave me the chance to put the world to rights. Thus trapped, I obliged. I begin next week. By the way, according to Alvin Roth, when an ancient like myself chooses to teach intermediate micro-economics it is a sure sign of senility.

What do I intend to do differently? First, reorder the sequence of topics: begin with monopoly, followed by imperfect competition, consumer theory, perfect competition, externalities, and close with Coase.

Why monopoly first? Two reasons. First, it involves single-variable calculus rather than multivariable calculus and the Lagrangean. Second, students enter the class thinking that firms `do things' like set prices. The traditional sequence begins with a world where no one does anything. Undergraduates are not yet like the White Queen, willing to believe six impossible things before breakfast.

But doesn't one need preferences to do monopoly? Yes, but quasi-linear preferences will suffice. Easy to communicate and easy to accept, up to a point. Someone will ask about budget constraints, and one may remark that this is an excellent question whose answer will be discussed later in the course, when we come to consumer theory. In this way consumer theory is set up as the answer to a challenge the students themselves have identified.

What about producer theory? Covered under monopoly, avoiding needless duplication.

Orwell's review of Penguin Books is in the news today courtesy of Amazon vs. Hachette. You can read about that here. I wish, however, to draw your attention to an example that Orwell gives in his review:

It is, of course, a great mistake to imagine that cheap books are good for the book trade. Actually it is just the other way around. If you have, for instance, five shillings to spend and the normal price of a book is half-a-crown, you are quite likely to spend your whole five shillings on two books. But if books are sixpence each you are not going to buy ten of them, because you don’t want as many as ten; your saturation-point will have been reached long before that. Probably you will buy three sixpenny books and spend the rest of your five shillings on seats at the ‘movies’. Hence the cheaper the books become, the less money is spent on books.

Milton Friedman, in his textbook Price Theory, asks readers as an exercise to analyze the passage. He does not explicitly say what he is looking for, but I would guess this: what must preferences look like for such a statement to be true? It's a delightful question. A budget line is given and a point that maximizes utility on the budget line is identified. Now the price of one of the goods falls, and another utility-maximizing point is identified. What kind of utility function would exhibit such behavior?
By the way, there are 12 pence to a shilling, and half a crown is two shillings and sixpence.
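
Orwell's numbers, checked in pence (a sketch): spending on books falls from 60d to 18d when the price falls, so over this range the implied demand for books is inelastic, and any utility function rationalizing the passage must deliver demand roughly this inelastic.

budget = 60              # five shillings = 60 pence
p1, q1 = 30, 2           # half-a-crown books: two books exhaust the budget
p2, q2 = 6, 3            # sixpenny books: three books, 42d left for the movies
print(p1 * q1, p2 * q2)  # 60 -> 18: book spending falls as the price falls

# Arc elasticity of demand for books over this price change:
arc = ((q2 - q1) / ((q1 + q2) / 2)) / ((p2 - p1) / ((p1 + p2) / 2))
print(round(arc, 2))     # -0.3: far below 1 in absolute value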

The news of Stanley Reiter's passing arrived over the weekend. Born in a turbulent age long since passed, he lived a life few of us could replicate. He saw service in WW2 (having lied about his age) and survived the Battle of the Bulge. On the wings of the GI Bill he went through City College, which, in those days, was the gate through which many outsiders passed on their way to the intellectual aristocracy.

But in the importance and noise of to-morrow
When the brokers are roaring like beasts on the floor of the Bourse

Perhaps a minute to recall what Stan left behind.

Stan is well known for his important contributions to mechanism design in collaboration with Hurwicz and Mount. The best-known example is the notion of the size of the message space of a mechanism. Nisan and Segal pointed out the connection between this and the notion of communication complexity. Stan would have been delighted to learn about the connection between this and extension complexity.

Stan was in fact half a century ahead of the curve in his interest in the intersection of algorithms and economics. He was one of the first scholars to tackle the job shop problem. He proposed a simple index policy that was subsequently implemented and reported on in Business Week: “Computer Planning Unsnarls the Job Shop,” April 2, 1966, pp. 60-61.

In 1965, with G. Sherman, he proposed a local-search algorithm for the TSP (“Discrete optimizing,” SIAM Journal on Applied Mathematics 13, 864-889, 1965). Their algorithm produced tours at least as good as those reported in earlier papers. The ideas were extended, with Don Rice, to a local-search heuristic for non-concave mixed integer programs, along with a computational study of its performance.

Stan was also remarkable as a builder. At Purdue, he developed a lively school of economic theory, attracting the likes of Afriat, Kamien, Sonnenschein, Ledyard and Vernon Smith. He convinced them all to come by telling them Purdue was just like New York! Then it was on to Northwestern to build two groups: one in the Economics department and another (in collaboration with Mort Kamien) in the business school.

The Fields Medals will be awarded this week in Seoul. What does the future hold for the winners? According to Borjas and Doran, declining productivity caused by a surfeit of dilettantism. The data point to a decline in productivity. By itself this is uninteresting; perhaps everyone on the cusp of 40 sees a decline in productivity. What Borjas and Doran rely on is a degree of randomness in who gets a medal. First, there is variation in the tastes of the selection committee (Harish-Chandra, for example, was eliminated on the grounds that one Bourbaki camp follower sufficed). Second, there is the arbitrary age cutoff (the case of the late Oded Schramm is an example). Finally, what is the underlying population? Borjas and Doran argue that by using a collection of lesser prizes and honors one can accurately identify the subset of mathematicians who can be considered potential medalists. These are the many who are called, of whom only a few will be chosen. The winners are compared to the remaining members of this group. Here is the conclusion (from the abstract):

We compare the productivity of Fields medalists (winners of the top mathematics prize) to that of similarly brilliant contenders. The two groups have similar publication rates until the award year, after which the winners’ productivity declines. The medalists begin to `play the field,’ studying unfamiliar topics at the expense of writing papers.

The prize, Borjas and Doran suggest, like added wealth, allows the winners to consume more leisure in the form of riskier projects. However, the behavior of the near-winners is a puzzle. After 40, the greatest prize is beyond their grasp, and one's reputation has already been established. Why don't they `play the field' as well?

A Bayesian agent is observing a sequence of outcomes in {\{S,F\}}. The agent does not know the outcomes in advance, so he forms some belief {\mu} over sequences of outcomes. Suppose that the agent believes that the number {d} of successes in {k} consecutive outcomes is distributed uniformly in {\{0,1,\dots,k\}} and that all configurations with {d} successes are equally likely:

\displaystyle \mu\left(a_0,a_1,\dots,a_{k-1} \right)=\frac{1}{(k+1)\cdot {\binom{k}{d}}}

for every {a_0,a_1,\dots,a_{k-1}\in \{S,F\}} where {d=\#\{0\le i<k|a_i=S\}}.

You have seen this belief {\mu} already though maybe not in this form. It is a belief of an agent who tosses an i.i.d. coin and has some uncertainty over the parameter of the coin, given by a uniform distribution over {[0,1]}.
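
A sketch of the check, using the Beta-integral identity {\int_0^1 p^d(1-p)^{k-d}\mathrm{d}p=1/\left((k+1)\binom{k}{d}\right)}: averaging the i.i.d. likelihood over a uniform prior on the parameter reproduces the formula above.

from math import comb

def mu(seq):
    """The belief from the formula: 1 / ((k+1) * C(k,d))."""
    k, d = len(seq), sum(seq)
    return 1 / ((k + 1) * comb(k, d))

def mixture(seq, grid=100_000):
    """Numerically average the i.i.d. likelihood p^d (1-p)^(k-d) over a uniform prior on p."""
    k, d = len(seq), sum(seq)
    return sum(((i + 0.5) / grid) ** d * (1 - (i + 0.5) / grid) ** (k - d)
               for i in range(grid)) / grid

seq = (1, 0, 1, 1, 0)         # S = 1, F = 0, so k = 5 and d = 3
print(mu(seq), mixture(seq))  # both are 1/60, up to discretization error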

In this post I am gonna make a fuss about the fact that as time goes by the agent learns the parameter of the coin. The word `learning' has several legitimate formalizations, and today I am talking about the oldest and probably the most important one: consistency of posterior beliefs. My focus is somewhat different from that of textbooks because 1) as in the first paragraph, my starting point is the belief {\mu} about outcome sequences, before there are any parameters, and 2) I emphasize some aspects of consistency which are unsatisfactory in the sense that they don't really capture our intuition about learning. Of course this is all part of the grand marketing campaign for my paper with Nabil, which uses a different notion of learning, so this discussion of consistency is a bit of a sidetrack. But I have already come across some VIP who I suspect was unaware of the distinction between different formulations of learning, and it wasn't easy to treat his cocky blabbering in a respectful way. So it's better to start with the basics.

Let {A} be a finite set of outcomes. Let {\mu\in\Delta(A^\mathbb{N})} be a belief over the set {A^\mathbb{N}} of infinite sequences of outcomes, also called realizations. A decomposition of {\mu} is given by a set {\Theta} of parameters, a belief {\lambda} over {\Theta}, and, for every {\theta\in\Theta}, a belief {\mu_\theta} over {A^\mathbb{N}} such that {\mu=\int \mu_\theta~\lambda(\mathrm{d}\theta)}. The integral in the definition means that the agent can think about the process as a two-stage randomization: first a parameter {\theta} is drawn according to {\lambda}, and then a realization {\omega} is drawn according to {\mu_\theta}. Thus, a decomposition captures a certain way in which a Bayesian agent arranges his belief. Of course every belief {\mu} admits many decompositions. The extreme decompositions are:

  • The Trivial Decomposition. Take {\Theta=\{\bar\theta\}} and {\mu_{\bar\theta}=\mu}.
  • Dirac’s Decomposition. Take {\Theta=A^\mathbb{N}} and {\lambda=\mu}. A “parameter” in this case is a measure {\delta_\omega} that assigns probability 1 to the realization {\omega}.

Not all decompositions are equally exciting. We are looking for decompositions in which the parameter {\theta} captures some `fundamental property' of the process. The two extreme cases mentioned above are usually unsatisfactory in this sense. In Dirac's decomposition there are as many parameters as there are realizations; the parameters simply copy the realizations. In the trivial decomposition there is a single parameter, which therefore cannot discriminate between different interesting properties. For stationary processes, there is a natural decomposition in which the parameters distinguish between fundamental properties of the process. This is the ergodic decomposition, according to which the parameters are the ergodic beliefs. Recall that in this decomposition, a parameter captures the empirical distribution of blocks of outcomes in the infinite realization.

So what about learning? While observing the process, a Bayesian agent updates his belief about the parameter. We denote by {\lambda_n\left(a_0,\dots,a_{n-1}\right)\in\Delta(\Theta)} the posterior belief about the parameter {\theta} at the beginning of period {n}, after observing the outcome sequence {a_0,\dots,a_{n-1}}. The notion of learning I want to talk about in this post is that this belief converges to a belief concentrated on the true parameter {\theta}. The example you should have in mind is the coin-toss example I started with: while observing the outcomes of the coin, the agent becomes more and more certain about the true parameter, which means his posterior belief becomes concentrated around the belief that gives probability {1} to the true parameter.
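
A sketch of this convergence in the coin-toss example, with the parameter space discretized to a grid: the posterior mass near the true parameter grows toward 1 as observations accumulate.

import random

random.seed(0)
true_p = 0.7
grid = [i / 100 for i in range(101)]     # discretized parameter space
posterior = [1 / len(grid)] * len(grid)  # uniform prior

for n in range(1, 1001):
    success = random.random() < true_p
    posterior = [w * (p if success else 1 - p) for w, p in zip(posterior, grid)]
    z = sum(posterior)
    posterior = [w / z for w in posterior]
    if n in (10, 100, 1000):
        near = sum(w for p, w in zip(grid, posterior) if abs(p - true_p) <= 0.05)
        print(n, round(near, 3))         # mass within 0.05 of the truth rises toward 1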

Definition 1 A decomposition of {\mu} is consistent if for {\lambda}-almost every {\theta} it holds that

\displaystyle \lambda_n\left(a_0,\dots,a_{n-1}\right)\xrightarrow[n\rightarrow\infty]{w}\delta_\theta

for {\mu_\theta}-almost every realization {\omega=(a_0,a_1,\dots)}.

In this definition, {\delta_\theta} is the Dirac atomic measure on {\theta} and the convergence is weak convergence of probability measures. No big deal if you don't know what that means, since it is exactly what you expect.

So, we have a notion of learning, and a seminal theorem of J.L. Doob (more on that below) implies that the ergodic decomposition of every stationary process is consistent. While this is not something that you will read in most textbooks (more on that below too), it is still well known. Why do Nabil and I dig further into the issue of learnability of the ergodic decomposition? Two reasons. First, one has to write papers. Second, there is something unsatisfactory about the concept of consistency as a formalization of learning. To see why, consider the belief {\mu} that outcomes are i.i.d. with probability {1/2} for success. This belief is ergodic, so from the perspective of the ergodic decomposition the agent `knows the process' and there is nothing else to learn. But let's look at Dirac's decomposition instead of the ergodic decomposition. Then the parameter space equals the space of all realizations. Suppose the true parameter (= realization) is {\omega^\ast=(a_0,a_1,\dots)}; then after observing the first {n} outcomes of the process, the agent's posterior belief about the parameter is concentrated on all {\omega} that agree with {\omega^\ast} on the first {n} coordinates. These posterior beliefs converge to {\delta_{\omega^\ast}}, so Dirac's decomposition is also consistent! We may say that we learn the parameter, but `learning the parameter' in this environment is just recording the past. The agent does not gain any new insight about the future of the process from learning the parameter.

In my next post I will talk about other notions of learning, originating in a seminal paper of Blackwell and Dubins, which capture the idea that an agent who learns a parameter can make predictions as if he knew the parameter. Let me also say here that this post and the following ones are much influenced by a paper of Jackson, Kalai, and Smorodinsky. I will say more about that paper in another post.

For the rest of this post I am going to make some comments about Bayesian consistency which, though again standard, I don't usually see in textbooks. In particular, I don't know of a reference for the version of Doob's Theorem which I give below, so if any reader can give me such a reference it will be helpful.

First, you may wonder whether every decomposition is consistent. The answer is no. For a trivial example, take a situation where the beliefs {\mu_\theta} are the same for every {\theta}. More generally, trouble arises when the realization does not pin down the parameter. Formally, let us say that a function {f:\Omega\rightarrow \Theta} pins down or identifies the parameter if

\displaystyle \mu_\theta\left(\{\omega: f(\omega)=\theta\}\right)=1

for {\lambda}-almost every {\theta}. If such an {f} exists, then the decomposition is called identifiable.

We have the following

Theorem 2 (Doob’s Theorem) A decomposition is identifiable if and only if it is consistent.

The `if' part follows immediately from the definitions. The `only if' part is deep, but not difficult: it follows immediately from the martingale convergence theorem. Indeed, Doob's Theorem is usually cited as the first application of martingale theory.

Statisticians rarely work with the abstract formulation of decomposition that I use here. For this reason, the theorem is usually formulated only for the case that {\Theta=\Delta(A)} and {\mu_\theta} is i.i.d. {\theta}. In this case the fact that the decomposition is identifiable follows from the strong law of large numbers. Doob's Theorem then implies the standard consistency of the Bayesian estimator of the parameter {\theta}.

The August 3rd NY Times has an article about the advertising of fish oil on Facebook. As sometimes happens with a NYT article, the interesting issues are buried beneath moderately interesting anecdotes that may be traded with others at the dinner table in what passes for serious discussion.

The story is about a company called MegaRed that peddles fish oil. It wants to target consumers who are receptive to the idea of fish oil because they believe it confers health benefits. The goal is to get them to try it out and perhaps switch to MegaRed.

Facebook proposes a campaign which raises the eyebrows of the marketing director, J. Rodrigo:

“I can go to television at a quarter the price.”

Brett Prescott of Facebook agrees that, yes, Facebook is more expensive than TV, but he offers an analogy between advertising on Facebook and firing a shotgun:

“And you are firing that buckshot knowing where every splinter of that bullet is landing.”

If biology is the study of bios, life, and geology is the study of geos, the earth, what does that make analogy?

Some arithmetic to clarify matters. Suppose 1 in 100 people would be receptive to MegaRed's message, and each receptive person is worth $1 on average to MegaRed. To reach the one receptive person via TV you must pay for all 100 viewers, so MegaRed should pay no more than 1 cent per viewer, $1 in total.

Enter, stage left, Facebook. It claims that it can target its ads so that they go just to the right person. How much is that worth? $1. In this example, Facebook is no better or worse than TV.

If Facebook has any added value compared to TV, it does not come from better targeting alone, because one can always compensate for coarse targeting by paying TV less per eyeball and reaching more of them. It must come from access to eyeballs unreachable via TV, from identifying receptive eyeballs that MegaRed would not have found on its own, or from the medium itself being more persuasive than TV. Is any of this true of Facebook? If not, MegaRed is better off with TV.
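
The arithmetic as a sketch (all numbers hypothetical, from the example above): perfect targeting changes the price per impression but not the total worth paying.

value_per_receptive = 1.00  # dollars
receptive_rate = 1 / 100

# TV: pay for 100 untargeted impressions to reach the one receptive viewer.
tv_impressions = 100
tv_worth = tv_impressions * receptive_rate * value_per_receptive
print(tv_worth, tv_worth / tv_impressions)  # $1.00 total, i.e. 1 cent per impression

# Facebook: a single perfectly targeted impression.
fb_worth = 1 * value_per_receptive
print(fb_worth)                             # also $1.00 total

# Equal totals: targeting per se adds nothing unless Facebook reaches eyeballs TV
# cannot, identifies receptive buyers MegaRed would miss, or persuades better.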
