If you live under the impression that in order to publish an empirical paper you must include the sentence “this holds with p-value x” for some number x < 0.05, here is a surprising bit of news for you: the editors of Basic and Applied Social Psychology have banned p-values from their journal, along with confidence intervals. In fact, according to the editorial, the state of the art of statistics “remains uncertain,” so statistical inference is no longer welcome in their journal.
When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know as much about statistics as I know about social psychology. But I hadn’t heard of this journal until yesterday, and if I had, I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.
Here is somebody who does have the right to complain: the American Statistical Association. Concerned with the misuse, mistrust, and misunderstanding of the p-value, the ASA has recently issued a policy statement on p-values and statistical significance, intended for researchers who are not statisticians.
How do you explain the p-value to practitioners who don’t care about things like the Neyman–Pearson lemma, independence, and UMP tests? First, you use language that obscures conceptual difficulties: “the probability that a statistical summary of the data would be equal to or more extreme than its observed value,” without saying what “more extreme” means. Second, you use warnings and slogans about what the p-value doesn’t mean or can’t do, like “p-values do not measure the size of an effect or the importance of a result.”
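To make that first definition concrete, here is a minimal simulation sketch. The experiment (61 heads in 100 coin flips) and all its numbers are hypothetical, and “more extreme” is spelled out as a two-sided deviation from the null expectation:

```python
import random

random.seed(0)

# Hypothetical experiment: we observed 61 heads in 100 coin flips and
# want the p-value under the null hypothesis that the coin is fair.
n, observed_heads = 100, 61
observed_dev = abs(observed_heads - n / 2)

# The p-value is the probability, computed under the null, of a result
# at least as extreme as the one observed. Here "more extreme" means a
# deviation from n/2 at least as large, in either direction (two-sided).
reps = 20_000
extreme = 0
for _ in range(reps):
    heads = sum(random.random() < 0.5 for _ in range(n))  # null: fair coin
    if abs(heads - n / 2) >= observed_dev:
        extreme += 1

p_value = extreme / reps
print(p_value)  # a Monte Carlo estimate; near 0.035 for these numbers
```

The estimate varies with the seed, but note what the definition invokes: nothing beyond probabilities in a repeated experiment run under the null.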
Among these slogans, my favorite is:
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
What’s cute about this statement is that it assumes everybody understands what “there is a 5% chance that the studied hypothesis is true” means, and that it is the notion of the p-value that is difficult to understand. In fact, the opposite is true.
Probability is conceptually tricky. Its meaning is somewhat clear in the situation of a repeated experiment: I more or less understand what it means that a coin has a 50% chance of landing on Heads. (Yes, only more or less.) But without going fully subjective, I have no idea what is meant by the probability that a given hypothesis (boys who eat pickles in kindergarten have higher SAT scores than girls who play firefighters) is true. On the other hand, the meaning of the corresponding p-value relies only on the conceptually simpler notion of probabilities in a repeated experiment.
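As a toy illustration of the repeated-experiment reading (the sample sizes are arbitrary): the observed frequency of Heads in independent flips of a simulated fair coin settles near 0.5 as the number of flips grows, and this long-run-frequency notion is the only probability concept the p-value relies on.

```python
import random

random.seed(1)

# Long-run frequency: flip a simulated fair coin n times and watch the
# observed fraction of Heads approach 0.5 as n grows.
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```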
Why, therefore, do the committee members (rightly!) assume that people are comfortable with the difficult concept of the probability that a hypothesis is true, and uncomfortable with the easy concept of the p-value? I think the reason is that, unlike “p-value,” the word “probability” is one we use in everyday life, so most people feel they know what it means. Since they have never thought about it formally, they are not aware that they actually don’t.
So here is a modest proposal for preventing the misuse and misunderstanding of statistical inference: instead of saying “this hypothesis holds with p-value 0.03,” say “we are 97% confident that this hypothesis holds.” We all know what “confident” means, right?
9 comments
April 7, 2016 at 1:57 am
Dominik
The last paragraph confuses me: The whole point of disliking the use of p-values is that “this hypothesis holds with p-value 0.03” and “we are 97% confident that this hypothesis holds” are not equivalent. Yes?
April 7, 2016 at 2:06 am
Links for 04-07-16 – Finance
[…] The war on p-value – The Leisure of the Theory Class […]
April 7, 2016 at 6:37 am
Axel
A hammer is a very useful tool. However, if you ever tried to drive a screw with a hammer, you may draw the conclusion that hammers are useless, even dangerous. Now, people who have tried to drive screws with a hammer unite, declare war on the use of hammers, and ban them from being used.
April 7, 2016 at 7:32 am
Benoit Essiambre
I think that would be a step in the wrong direction. If you insist on using p-values, more honest language would be
“We are 97% confident that our data is not pure noise.”
And to be really honest, you would add: “but any detectable signal could be due to systematic measurement biases.”
You would also change the term “statistically significant,” which shares little meaning with the normal definition of the word, to the term “statistically detectable.”
The t-test does not tell you anything about the size of the effect. If you have enough data, even a rare p < 0.0001 could correspond to a minuscule effect size. Statistical significance tests will pass with an arbitrarily low p merely because of tiny biases in the way the experiment was performed, which leave tiny detectable skews in the data.
All experiments have at least small systematic biases, which means all of them will pass statistical significance tests due to these biases, given enough data.
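This can be sketched numerically. The 0.1-percentage-point skew and the sample sizes below are hypothetical, and a normal-approximation z-test stands in for the t-test:

```python
import math

def z_test_p(heads, n):
    """Two-sided normal-approximation p-value for H0: the coin is fair."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# A minuscule systematic skew (hypothetical): 50.1% heads instead of 50%,
# e.g. from a tiny measurement bias. We plug in the expected counts
# directly, with no sampling noise, to isolate the effect of n.
bias = 0.501
for n in (10_000, 1_000_000, 100_000_000):
    heads = int(bias * n)
    print(n, z_test_p(heads, n))
```

The effect size never changes, yet the p-value falls from far above 0.05 at n = 10,000 to effectively zero at n = 100,000,000: with enough data the test “detects” the bias, which is exactly the point above.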
Researchers should report confidence intervals and, before they run the experiment, register which variable they will be looking at, along with an estimate of the size of the potential systematic measurement biases they can expect, based on the type of experiment they will be running and the type of tools they will be using. If the lower bound on the effect size found is 2, and the instrument calibration or testing method can systematically be off by 3, you still haven’t found anything!
Human-administered tests have relatively high levels of systematic bias. I’ve seen research groups doing longitudinal studies where completely different sets of grad students performed the experiments at different time intervals. Different people perform experiments slightly differently, which can easily result in statistically detectable differences in results. These should be interpreted as meaningless even if they come out as “significant.”
April 7, 2016 at 7:46 am
maxrottersman
I have to point out this is very harsh: “When I came across this editorial I was dumbfounded by the arrogance of the editors, who seem to know about statistics as much as I know about social psychology. But I haven’t heard about this journal until yesterday, and if I did I am pretty sure I wouldn’t believe anything they publish, p-value or no p-value. So I don’t have the right to complain here.”
If you want to defend the use of statistical analysis, disparaging other people’s work will, with “probability” (>.99999), strengthen the case for banning p-values in the first place, which in your case means dismissing the work of any journal that doesn’t use them. When an author uses a lot of statistical analysis at the expense of in-depth thought about the problem at hand, then eliminating statistics may strengthen the journal’s communication of ideas. It is “social” psychology!
I’m horrible at statistics. However, I’ve read many books on the subject, hoping to one day get smart about it. What I’ve learned is that it’s a very nuanced subject and even the most basic ideas seem to have been lost in the never-ending publications of text-books which teach the how, not the why.
For example, Type I and Type II errors. Could the statistical community not find some better names? Anyway, there’s a big difference between what you can prove exists (spaceships landing on your lawn) and what you can prove doesn’t exist (your lawn doesn’t have a spacecraft on it). Because no one can completely prove the former, they start from the latter. Many authors forget that reality.
I suggest the same for you. The error you make is assuming a journal must use statistics. That can’t be proved. Can you prove a journal without statistics (the spaceship on the lawn) is any worse than the journal with them?
April 7, 2016 at 9:17 am
Haynes Goddard
“For example, Type I and Type II errors. Could the statistical community not find some better names?”
In the medical literature, “false positives” and “false negatives” are used instead. They are clearer and more intuitive.
April 7, 2016 at 11:11 am
Soho
Your modest proposal misrepresents what’s going on. P values don’t represent our confidence that a theory is true. They represent our lack of evidence that a theory is false. This distinction becomes important if, for example, both your hypothesis and the null hypothesis have p < 0.05. In that case we can’t reject either hypothesis and all that we’ve learned is that the experiment failed to gather enough data to tell us anything useful.
April 9, 2016 at 3:35 pm
Lektüre-Tipps fürs Wochenende | thiemoblog
[…] The war on p-value | The Leisure of the Theory Class […]
April 10, 2016 at 4:29 pm
Links do dia (10/4) – Procrastinando Junto
[…] the statistical meaning of p-values in a form aimed at non-statisticians (though there are already criticisms of the language used). Kocherlakota, for his part, warns of the importance of not clinging to […]