That is to say, your argument depends on the claim that the true parameter is either in the interval or not. Thus, there can be no chance that it is in the interval (other than 1 or 0), because either p = 1 or p = 0 is the true probability.

However, if we allow probability to represent our uncertainty, a probabilistic statement is perfectly valid. I don’t know whether the parameter is in the interval. Based on the math I’ve done, intervals constructed this way will contain it 95% of the time. Thus, I can say there is a 95% chance, because I’m uncertain.
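The 95% coverage claim is easy to check by simulation. A minimal sketch (all numbers hypothetical: a normal mean with known sigma, z-based intervals) counts how often repeated 95% intervals contain the true value:

```python
import random

random.seed(42)
TRUE_MU, SIGMA, N, REPS = 5.0, 2.0, 30, 10_000
Z = 1.96  # two-sided 95% critical value for the standard normal

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    half_width = Z * SIGMA / N ** 0.5  # known-sigma interval half-width
    if mean - half_width <= TRUE_MU <= mean + half_width:
        covered += 1

coverage = covered / REPS
print(f"Empirical coverage: {coverage:.3f}")  # expected near 0.95
```

Whether you then call that 95% a "chance" for any single computed interval is exactly the philosophical question at issue; the simulation only establishes the long-run frequency.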

Now, the question is “should we let probability represent uncertainty?” To which I would answer yes, unequivocally. You may disagree, but then you need a different system for representing uncertainty. Probability, being the system used in forecasting as well as in gambling, is natural in this context. If you disagree, I would ask what it means when Nate Silver says the chance of someone winning an election is 60%. There is a response involving repeating the election over and over, but it isn’t as sensible as saying that the probability represents his uncertainty.

The situation I had in mind is after we have already computed the interval, say [3.7, 5.4]. So we should not say there is a 95% chance that the parameter is in the interval [3.7, 5.4].

In the medical literature, “false positives” and “false negatives” are used instead. They are clearer and more intuitive.

If you want to defend the use of statistical analysis, disparaging other people’s work with high “probability” (>.99999) only strengthens the reason they banned p-values in the first place, which in your case is dismissing the work of any journal that doesn’t use them. When an author uses a lot of statistical analysis at the expense of in-depth thought about the problem at hand, the solution of eliminating statistics may strengthen the journal’s communication of ideas. It is “social” psychology!

I’m horrible at statistics. However, I’ve read many books on the subject, hoping one day to get smart about it. What I’ve learned is that it’s a very nuanced subject, and even the most basic ideas seem to have been lost in the never-ending stream of textbooks that teach the how, not the why.

For example, Type I and Type II errors: could the statistical community not have found better names? Anyway, there’s a big difference between what you can prove exists (a spaceship landing on your lawn) and what you can prove doesn’t exist (that your lawn doesn’t have a spacecraft on it). Because no one can completely prove the former, they start from the latter. Many authors forget that reality.

I suggest the same for you. The error you make is assuming a journal must use statistics. That can’t be proved. Can you prove that a journal without statistics (the spaceship on the lawn) is any worse than a journal with them?

“We are 97% confident that our data is not pure noise.”

And to be really honest, you would add: “but any detectable signal could be due to systematic measurement biases.”

You would also replace the term “statistically significant”, which shares little with the everyday meaning of the word, with the term “statistically detectable”.

The t-test does not tell you anything about the size of the effect. If you have enough data, even a rare p < 0.0001 could correspond to a minuscule effect size. A statistical significance test will pass with an arbitrarily low p merely because of tiny biases in how the experiment was performed, biases that leave tiny but detectable skews in the data.

All experiments have at least small systematic biases, which means all of them will pass statistical significance tests, given enough data, because of those biases alone.
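The arithmetic behind this is simple: for a fixed bias, the z-statistic grows like the square root of the sample size, so the p-value can be driven arbitrarily low just by collecting more data. A sketch with an illustrative bias of 0.01 standard deviations:

```python
import math

BIAS = 0.01   # hypothetical tiny systematic skew in the measured mean
SIGMA = 1.0   # measurement standard deviation

def two_sided_p(n):
    """Two-sided z-test p-value for a sample mean offset by BIAS."""
    z = BIAS * math.sqrt(n) / SIGMA
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(n))
```

At n = 100 the bias is invisible (p ≈ 0.92); at n = 1,000,000 the same 0.01 bias yields an astronomically small p-value, with no real effect present at all.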

Researchers should report confidence intervals and, before they run the experiment, register which variable they will be looking at, along with an estimate of the size of the systematic measurement biases to expect given the type of experiment and the tools they will be using. If the lower bound on the effect size found is 2 but the instrument calibration or testing method can systematically be off by 3, you still haven’t found anything!
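The proposed check reduces to one comparison. A toy sketch (all numbers hypothetical, matching the 2-vs-3 example above):

```python
# Effect-size confidence interval from the experiment (hypothetical).
ci_lower, ci_upper = 2.0, 6.5
# Pre-registered bound on systematic measurement bias for this setup.
max_systematic_bias = 3.0

# The result is only conclusive if even the CI's lower bound exceeds
# what systematic bias alone could produce.
conclusive = ci_lower > max_systematic_bias
if conclusive:
    print("Effect exceeds the plausible systematic bias.")
else:
    print("Inconclusive: the effect could be measurement bias.")
```

With these numbers the verdict is inconclusive, which is precisely the point: a “significant” interval entirely explainable by known bias is not a finding.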

Human-administered tests have relatively high levels of systematic bias. I’ve seen research groups running longitudinal studies in which completely different sets of grad students performed the experiments at different time intervals. Different people perform experiments slightly differently, which can easily produce statistically detectable differences in results. These should be treated as meaningless even if they come out as “significant”.
