It is well-known that a non-linear transformation of a random variable does not transform the mean in an entirely straightforward way. That is, for a random variable X and function f, we can easily have E(f(X)) \neq f(E(X)). In our intro decision science courses, we call this the “flaw of averages,” a term coined by Sam Savage. See his book of that title for many examples of how one can, often inadvertently, fall into the false assumption that it suffices to replace a random variable by its average.

What if instead of the average, we talk about the mode, or most likely outcome? Denote this M(X). Surely, if f is a one-to-one function, the most likely value of f(X) must be f(M(X))? Amazingly, this can be false as well! It is much “closer to true,” in that it holds for every discrete distribution: a one-to-one f merely relabels the possible outcomes without changing their probabilities, so the most likely outcome maps to the most likely outcome. But we run into trouble with continuous distributions. Here is an example:

Let X follow a standard normal distribution, and let Y=e^X. The distribution of Y is called “lognormal.” Here is a graph of its density:
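(The graph is easy to reproduce yourself. Here is a minimal Python sketch, assuming numpy, scipy, and matplotlib are available; in scipy’s parameterization, the shape parameter s of lognorm is the standard deviation of the underlying normal.)

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm

# Y = e^X with X ~ N(0,1) is lognormal; in scipy's parameterization, s = 1.
y = np.linspace(0.01, 5, 500)
density = lognorm.pdf(y, s=1)

plt.plot(y, density)
plt.axvline(1.0, linestyle="--", color="gray")  # e^0 = 1, where one might expect the peak
plt.xlabel("y")
plt.ylabel("density of Y = e^X")
plt.show()
```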

Notice that while the standard normal distribution is peaked at 0, this distribution is not peaked at e^0=1! What is going on?

We can work this out with algebra and calculus, but here is a conceptual way of looking at it. The key is that probability densities differ fundamentally from probabilities, and the precise definition of the mode is different for continuous distributions than for discrete ones. Saying that 0 is the “most likely” value for a standard normal variable X isn’t quite right. Any particular exact value, of course, has probability zero. What we really mean is that X is more likely to be near 0 than near any other value. Fine, so then why isn’t Y=e^X more likely to be near 1=exp(0) than near any other value? Because the exponential transformation does funny things to “near.” Nearby values get stretched out more for larger X. So, while more realizations of X are near 0 than near -1, they are spread out more thinly when we exponentiate, so that the maximum density of Y occurs not at exp(0) but at exp(-1). It takes some algebra to find the exact value, but this argument makes it fairly clear it should be less than exp(0).
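For the record, here is one way to carry out the algebra just alluded to. By the change-of-variables formula, the density of Y=e^X is

f_Y(y) = f_X(\ln y) \cdot \frac{1}{y} = \frac{1}{y\sqrt{2\pi}} e^{-(\ln y)^2/2}

so \ln f_Y(y) = -(\ln y)^2/2 - \ln y + \text{const}. Its derivative with respect to y is -\frac{1+\ln y}{y}, which vanishes at \ln y = -1. Hence the mode of Y is e^{-1} \approx 0.37, not e^0 = 1.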

Why is this important? One reason is that in regression analysis we often use a model that predicts ln y rather than y. Then we need to convert the predicted value for ln y to a predicted value for y. It is well-known (and has been taught for years in our intro stats course) that if we are predicting the average y, it does not suffice to exponentiate; a correction must be made. But for predicting an individual observation of y, we all teach that no correction is necessary. It only occurred to me last week, after teaching the course for 6 years, that this is problematic if we seek the “most likely” value of y.
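To see the competing back-transformed predictions side by side, here is a small simulation sketch. The values of k and sigma are made up for illustration, and I assume normally distributed errors with known standard deviation, so that y is exactly lognormal given the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 2.0      # hypothetical point prediction for ln y
sigma = 0.8  # hypothetical standard deviation of the errors around ln y

# Simulate the implied distribution of y
y = np.exp(k + sigma * rng.normal(size=1_000_000))

print("naive exp(k):              ", np.exp(k))               # the median of y
print("simulated median of y:     ", np.median(y))
print("corrected mean e^(k+s^2/2):", np.exp(k + sigma**2 / 2))
print("simulated mean of y:       ", np.mean(y))
print("mode of y, e^(k-s^2):      ", np.exp(k - sigma**2))    # peak of the density
```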

What is the resolution? First of all, there is no distortion for the median. If k is the point prediction for ln y, then we can conclude that y has a 50% chance of being above exp(k). So the method we have been teaching is a fine way to estimate the median value of y. Our lectures and textbook haven’t really said in precise language whether we are predicting a median or modal value, so I am glad to report we haven’t been teaching anything unambiguously wrong. Secondly, the problem goes away if we work with prediction intervals rather than single-value predictions, as we often encourage students to do. If we are 95% confident that ln y is in [a,b], we can certainly conclude that we are 95% confident y is in [e^a,e^b].
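Both points are easy to check numerically, since quantiles (and hence medians and prediction intervals) commute with any increasing transformation. Here is a quick sketch, with the same made-up parameters as above:

```python
import numpy as np

rng = np.random.default_rng(1)
ln_y = 2.0 + 0.8 * rng.normal(size=1_000_000)  # simulated values of ln y

a, b = np.quantile(ln_y, [0.025, 0.975])            # 95% interval for ln y
lo, hi = np.quantile(np.exp(ln_y), [0.025, 0.975])  # 95% interval for y

print(np.exp(a), np.exp(b))  # exponentiated endpoints...
print(lo, hi)                # ...match the interval computed directly for y
```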

Most importantly, we should reinforce the lesson from the “flaw of averages” that any single number – mean, median or mode – is a poor summary of our knowledge of a random variable. This is especially true when the variable is lognormal (or follows any asymmetric distribution) rather than normal, in which case all three values are usually different.

Postscript: To learn more about “the muddle of the mode,” here are two basic PhD-level exercises:

1. Show “Reverse Jensen’s inequality for modes”: If f is an increasing convex function, X is a continuous random variable, and both X and f(X) have a unique mode (point of maximum density), then mode(f(X)) \leq f(mode(X)). If f is strictly convex and X has a continuously differentiable density, the inequality is strict.

2. (Based on a suggestion from Peter Klibanoff.) Let X have a continuously differentiable density and a unique mode, and let Y=exp(X). Define the density of Y “on a multiplicative scale” by

g(y) = \lim_{\epsilon \rightarrow 0} P(Y \in [y,y(1+\epsilon)])/\epsilon

Show that g is maximized at exp(mode(X)). Note that the above formula is similar to the standard density, but with y(1+\epsilon) having replaced y + \epsilon. That is, if we consider Y to be measured on a multiplicative scale, with multiplicative errors, there is no distortion in the mode.
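For readers who want a numerical hint, here is a sketch that tabulates g on a grid for the standard normal case, using the fact that for small \epsilon the interval [y, y(1+\epsilon)] has width y\epsilon:

```python
import numpy as np
from scipy.stats import norm

# For Y = exp(X) with X ~ N(0,1): f_Y(y) = f_X(ln y)/y by change of variables,
# and g(y) = lim P(Y in [y, y(1+eps)])/eps = y * f_Y(y), since the interval has width y*eps.
y = np.linspace(0.05, 6, 100_000)
f_Y = norm.pdf(np.log(y)) / y
g = y * f_Y

print("argmax of f_Y:", y[np.argmax(f_Y)])  # about exp(-1) = 0.368
print("argmax of g:  ", y[np.argmax(g)])    # about exp(0) = 1, i.e., exp(mode(X))
```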