One reason I like thinking about probability and statistics is that my raw intuition does not fit well with the theory, so again and again I find myself falling into the same pits and enjoying the same revelations. As an impetus to start blogging again, I thought I should share some of these pits and revelations. So, for your amusement and instruction, here are three statistics questions with wrong answers that I came across during the last quarter. The mistakes are well known, and in fact I am sure I was at some level aware of them, but I still managed to believe the wrong answers for time spans ranging from a couple of seconds to a couple of weeks, and I still get confused when I try to explain what’s wrong. I will give it a shot in some other post.

— Evaluating independent evidence —

The graduate students in a fictional department of economics are thrown to the sharks if it can be proved at significance level ${\alpha=0.001}$ that they are guilty of spending fewer than eighty minutes a day reading von Mises’ `Behavioral econometrics’. Before the penalty is delivered, every student is evaluated by three judges, each of whom monitors the student on a random sample of days and then conducts a statistical hypothesis test about the true mean number of minutes per day that the student spends on von Mises:

$\displaystyle \begin{array}{rcl} &H_0: \mu \ge 80\\&H_A: \mu<80\end{array}$

The three samples are independent. In the case of Adam Smith, a promising grad student, the three judges came up with p-values ${0.09, 0.1, 0.08}$. Does the department chair have sufficient evidence against Smith?

Wrong answer: Yup. The p-value in every test is the probability of failing the test under the null. These are independent samples, so the probability of ending up with such p-values in all three tests is ${0.09\cdot 0.1\cdot 0.08<0.001}$. Therefore, the chair can dispose of the student. Of course it is possible that the student is actually not guilty and was just extremely unlucky to get monitored exactly on the days on which he slacked, but hey, that’s life, or more accurately that’s statistics, and the chair can rest assured that by following this procedure he only loses a fraction of ${0.001}$ of the innocent students.
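
One way to probe this reasoning is to ask how often an innocent student would produce three p-values whose product is as small as Smith's. The sketch below is not part of the judges' procedure. It assumes that under the null each p-value is uniformly distributed on ${[0,1]}$ (which holds for a continuous test statistic at the boundary ${\mu=80}$), and the function names are made up for illustration. It compares a simulation against the closed-form tail probability of a product of three independent uniforms, which is the same quantity that Fisher's method for combining p-values computes.

```python
import math
import random


def product_tail_exact(t):
    """P(U1 * U2 * U3 <= t) for independent Uniform(0,1) variables, 0 < t <= 1."""
    log_t = math.log(t)
    return t * (1 - log_t + log_t ** 2 / 2)


def product_tail_simulated(t, n_trials=200_000, seed=0):
    """Monte Carlo estimate of the same probability (hypothetical helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Three independent p-values drawn under the (assumed) uniform null.
        if rng.random() * rng.random() * rng.random() <= t:
            hits += 1
    return hits / n_trials


observed_product = 0.09 * 0.1 * 0.08  # Smith's three p-values multiplied
exact = product_tail_exact(observed_product)
simulated = product_tail_simulated(observed_product)
print(f"product of p-values:          {observed_product:.5f}")
print(f"P(product this small | null): exact {exact:.4f}, simulated {simulated:.4f}")
```

Under these assumptions the exact probability is roughly ${0.025}$, far larger than the product ${0.00072}$ itself, so the product of p-values cannot be read as a p-value.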

— The X vs. the Y —

Suppose that in a linear regression of ${Y}$ over ${X}$ we get that

$\displaystyle Y=4 + X + \epsilon$

where ${\epsilon}$ is the idiosyncratic error. What would be the slope in a regression of ${X}$ over ${Y}$?

Wrong answer: If ${Y= 4 + X + \epsilon}$ then ${X = -4 + Y + \epsilon'}$, where ${\epsilon'=-\epsilon}$. Therefore the slope will be ${1}$ with ${\epsilon'}$ being the new idiosyncratic error.
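
This claim is easy to check on simulated data. The sketch below assumes details the question does not give: ${X}$ and ${\epsilon}$ are independent standard normal variables, and `ols_slope` is a hypothetical helper that computes the least-squares slope as sample covariance divided by sample variance of the regressor. With other variances the numbers change, but the comparison is the same.

```python
import random


def ols_slope(regressor, response):
    """Least-squares slope of `response` on `regressor` (with an intercept)."""
    n = len(regressor)
    mean_r = sum(regressor) / n
    mean_y = sum(response) / n
    cov = sum((r - mean_r) * (y - mean_y) for r, y in zip(regressor, response))
    var = sum((r - mean_r) ** 2 for r in regressor)
    return cov / var


rng = random.Random(0)
n = 50_000
# Assumed data-generating process: X ~ N(0,1), eps ~ N(0,1), Y = 4 + X + eps.
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [4 + x + rng.gauss(0, 1) for x in xs]

slope_y_on_x = ols_slope(xs, ys)  # regression of Y over X
slope_x_on_y = ols_slope(ys, xs)  # regression of X over Y
print(f"slope of Y over X: {slope_y_on_x:.3f}")
print(f"slope of X over Y: {slope_x_on_y:.3f}")
```

In this setup the population slope of ${X}$ over ${Y}$ is ${\text{Var}(X)/(\text{Var}(X)+\text{Var}(\epsilon))=1/2}$, not ${1}$: the error ${\epsilon'=-\epsilon}$ is correlated with the new regressor ${Y}$, so it is not an idiosyncratic error for the reversed regression.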

— Omitted variable bias in probit regression —

Consider a probit regression of a binary response variable ${Y}$ over two explanatory variables ${X_1,X_2}$:

$\displaystyle \text{Pr}(Y=1)=\Phi\left(\beta_0 + \beta_1X_1 + \beta_2X_2\right)$

where ${\Phi}$ is the cumulative distribution function of a standard normal variable. Suppose that ${\beta_2>0}$ and that ${X_1}$ and ${X_2}$ are positively correlated, i.e. ${\rho(X_1,X_2)>0}$. What can one say about the coefficient ${\beta_1'}$ of ${X_1}$ in a probit regression

$\displaystyle \text{Pr}(Y=1)=\Phi\left(\beta_0'+ \beta_1'X_1\right)$

of ${Y}$ over ${X_1}$?

Wrong answer: This is a well-known issue of omitted variable bias. ${\beta_1'}$ will be larger than ${\beta_1}$. One way to understand this is to consider the different meaning of the coefficients: ${\beta_1}$ reflects the impact on ${Y}$ when ${X_1}$ increases and ${X_2}$ stays fixed, while ${\beta_1'}$ reflects the impact on ${Y}$ when ${X_1}$ increases without controlling for ${X_2}$. Since ${X_1}$ and ${X_2}$ are positively correlated, and since ${X_2}$ has positive impact on ${Y}$ (as ${\beta_2>0}$), it follows that ${\beta_1'>\beta_1}$.
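
One way to probe this argument is to work out ${\beta_1'}$ in a special case. The derivation below is only a sketch under assumptions that are not in the question: the probit model is written in latent form, ${Y=1}$ exactly when ${\beta_0+\beta_1X_1+\beta_2X_2+e>0}$ with ${e}$ standard normal, and ${X_2=\delta+\gamma X_1+u}$ with ${\gamma>0}$, where ${u}$ is normal with mean zero and variance ${\sigma_u^2}$, independent of ${X_1}$ and ${e}$. The symbols ${e,u,\delta,\gamma,\sigma_u}$ are introduced here for illustration. Since ${\beta_2u+e}$ is normal with mean zero and variance ${1+\beta_2^2\sigma_u^2}$,

$\displaystyle \begin{array}{rcl} \text{Pr}(Y=1\mid X_1) &=& \text{Pr}\left(\beta_0+\beta_1X_1+\beta_2(\delta+\gamma X_1+u)+e>0\right)\\ &=& \Phi\left(\frac{\beta_0+\beta_2\delta+(\beta_1+\beta_2\gamma)X_1}{\sqrt{1+\beta_2^2\sigma_u^2}}\right)\end{array}$

so under these assumptions ${\beta_1'=(\beta_1+\beta_2\gamma)/\sqrt{1+\beta_2^2\sigma_u^2}}$. The numerator is larger than ${\beta_1}$, as the omitted variable argument suggests, but the denominator is larger than ${1}$, so the comparison can go either way. For example, ${\beta_1=\beta_2=1}$, ${\gamma=0.1}$ and ${\sigma_u=1}$ give ${\beta_1'=1.1/\sqrt{2}\approx 0.78<\beta_1}$.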