The myth of significance testing

When I decided to leave work and go to University to study psychology, I did so because of a genuine fascination with the study of human behaviour, thought, and emotion. Like many, I was drawn to the discipline not by the allure of science but by the writings of Freud, Jung, Maslow, and Fromm. I believed at the time that the discipline was as much philosophy as it was science, and I had the romantic notion of sitting in the quad talking theory with my classmates.

Unfortunately, from day one I was introduced not to the theory of psychology but to the maths of psychology. This, I was told, was the heart of the discipline, and supporting evidence came not from the strength of the theory but from the numbers. It did not matter that, as an 18-year-old male, I was supremely conscious of the power of libidos. Unless it could be demonstrated on a Likert scale, it did not exist. The gold standard of supporting evidence was significance testing.

I always struggled with the notion that the significance test (ST) was as significant as my professors would have me believe. However, it was not until I completed my postgraduate diploma in applied statistics that the folly of ST truly came home to me. Here, for the first time, I was introduced to the concept of fishing for results and to techniques such as the Bonferroni correction. Moreover, I truly understood how paltry the findings in psychology were, and that establishing the robustness of such findings through a significance test was somewhat oxymoronic.

In 2012 a seminal paper on this topic came out and I would encourage everyone who works in our field to be aware of it. This is indeed the myth for this month: the myth of significance testing:

Lambdin, C. (2012). Significance tests as sorcery: Science is empirical – significance tests are not. Theory & Psychology, 22(1), 67–90.


Since the 1930s, many of our top methodologists have argued that significance tests are not conducive to science. Bakan (1966) believed that “everyone knows this” and that we slavishly lean on the crutch of significance testing because, if we didn’t, much of psychology would simply fall apart. If he was right, then significance testing is tantamount to psychology’s “dirty little secret.” This paper will revisit and summarize the arguments of those who have been trying to tell us— for more than 70 years—that p values are not empirical. If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.

The paper is a relatively easy read and the arguments are simple to understand:

“… Lykken (1968), who argues that many correlations in psychology have effect sizes so small that it is questionable whether they constitute actual relationships above the “ambient correlation noise” that is always present in the real world. Blinkhorn and Johnson (1990) persuasively argue, for instance, that a shift away from “culling tabular asterisks” in psychology would likely cause the entire field of personality testing to disappear altogether. Looking at a table of results and highlighting which ones are significant is, after all, akin to throwing quarters in the air and noting which ones land heads.” (à la fishing for results)
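The quarter-tossing point can be sketched in a few lines of Python. This is my own rough illustration, not something taken from Lambdin's paper: it correlates pairs of pure random noise many times and counts how many come out "significant". The function names, sample sizes, and seed are all made up for the example, and the p-value uses the Fisher z normal approximation rather than an exact t-test, so treat the numbers as indicative only.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (an assumption
    made to keep this sketch in the standard library; not an exact test)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_significant_noise(n_tests=100, n_people=50, alpha=0.05, seed=1):
    """Correlate n_tests pairs of unrelated random variables and count
    how many fall below alpha purely by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        xs = [rng.gauss(0, 1) for _ in range(n_people)]
        ys = [rng.gauss(0, 1) for _ in range(n_people)]
        if approx_p_value(pearson_r(xs, ys), n_people) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_significant_noise()
    print(f"{hits} of 100 pure-noise correlations were 'significant' at p < .05")
```

With an alpha of .05, roughly one in twenty of these meaningless correlations should earn an asterisk, which is the sense in which highlighting the significant cells of a large table resembles noting which coins landed heads.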

The impact of this paper for so much of the discipline cannot be overstated. In an attempt to claim a level of credibility beyond its station, the psychological literature has bordered on the downright fraudulent in making sweeping claims from weak but significant results. The consequence is that our discipline will become the laughing stock of future generations, who will see through the emperor's new clothes that are currently parading as science.

“ … The most unfortunate consequence of psychology’s obsession with NHST is nothing less than the sad state of our entire body of literature. Our morbid overreliance on significance testing has left in its wake a body of literature so rife with contradictions that peer-reviewed “findings” can quite easily be culled to back almost any position, no matter how absurd or fantastic. Such positions, which all taken together are contradictory, typically yield embarrassingly little predictive power, and fail to gel into any sort of cohesive picture of reality, are nevertheless separately propped up by their own individual lists of supportive references. All this is foolhardily done while blissfully ignoring the fact that the tallying of supportive references—a practice which Taleb (2007) calls “naïve empiricism”—is not actually scientific. It is the quality of the evidence and the validity and soundness of the arguments that matters, not how many authors are in agreement. Science is not a democracy.

It would be difficult to overstress this point. Card sharps can stack decks so that arranged sequences of cards appear randomly shuffled. Researchers can stack data so that random numbers seem to be convincing patterns of evidence, and often end up doing just that wholly without intention. The bitter irony of it all is that our peer-reviewed journals, our hallmark of what counts as scientific writing, are partly to blame. They do, after all, help keep the tyranny of NHST alive, and “[t]he end result is that our literature is comprised mainly of uncorroborated, one-shot studies whose value is questionable for academics and practitioners alike” (Hubbard & Armstrong, 2006, p. 115).” (p. 82)

Is there a solution to this madness? Using the psychometric testing industry as a case in point, I believe the solution is multi-pronged. STs will continue to be part of our supporting literature, as the marketplace requires them and without them test publishers will not be viewed as credible. However, through education such as training for test users, this can be balanced so that the reality of STs is better understood. This will include understanding the true variance that is accounted for in tests of correlation, so that the true significance of the significance test is understood! This will need to be matched with an understanding of the importance of theory building when testing a hypothesis, and of required adjustments such as the Bonferroni correction when conducting multiple tests with one set of data. Finally, in keeping with the theme of this series of blogs, the key is to treat the discipline as a craft, not a science. Building theory, applying results in unique and meaningful ways, and being focussed on practical outcomes is more important, and more reflective of sound practice, than militant adherence to a significance test.
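Two of the ideas above, variance accounted for and the Bonferroni correction, are simple enough to show as arithmetic. The following is a minimal Python sketch of my own. The function names and the p-values in the usage example are hypothetical and purely illustrative, not results from any real test or study.

```python
def variance_explained(r):
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive once alpha is divided by the number of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # A correlation of 0.2 can easily be "significant" in a large sample,
    # yet it accounts for only about 4% of the variance.
    print(f"r = 0.20 explains {variance_explained(0.20):.0%} of the variance")

    # Hypothetical p-values from five tests run on one data set.
    p_values = [0.003, 0.020, 0.040, 0.300, 0.700]
    print(f"Per-test threshold: {bonferroni_threshold(0.05, len(p_values)):.3f}")
    print(significant_after_bonferroni(p_values))
```

In this made-up example, three of the five p-values fall below .05 before correction, but only the first falls below the corrected threshold of .01. Bonferroni is a deliberately conservative adjustment, and other corrections exist, but it makes the cost of running many tests on one data set easy to see.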

P.S. For those interested in understanding how to use statistics as a craft to formulate applied solutions I strongly recommend this book

P.P.S. This article is just out. It seems that there may be hope for the discipline yet.

About Dr Paul Englert

2 Responses to The myth of significance testing

  1. Tane O'Rorke says:

    Great article Paul. Do you think the prospect of big data and emerging technologies being harnessed by commercial organisations solving real world problems rather than academics tinkering with their calculators provides some hope for the profession?

    • Dr Paul says:

      Hi Tane

      Thank you for the positive response. Big data is certainly the flavor of the month and does offer promise for our discipline, but do I think it will have the pick-up that you suggest? Not in the short term, and certainly not in NZ.

      Big data is not, in-and-of-itself, an answer. You still require research design, data points, integration of HR strategy, etc. to bring out the value in big data. With this in mind I don’t see organisations harnessing the value of big data for multiple reasons:

      1. Cost: This requires time and expertise which in turn come at a cost. Organisations invariably cut cost at the evaluation stage (hence why so much training, with no demonstrable value, is implemented for example).

      2. Understanding: You need the person procuring the services (such as HR) to have an understanding of the big-data picture. Do most HR professionals have the research design expertise, stats, eval background, etc. to make informed choices in this space? Again, I will leave that open for you to decide. What I will say is that people do not promote or buy what they do not understand.

      3. Actual data: This is the issue for NZ. Our firms are small, so gathering the data required for big data is going to take time. Given the pace of technological advancement, the relevance of the data becomes questionable over time.

      The upshot is that big data is no panacea. The same problems of system thinking, research design and integrated HR strategy remain. Solving these problems costs big bucks and, in my experience, organisations simply do not have the appetite for it, yet. This is why one-off interventions, whether they be training, selection or performance management, are far more likely to be introduced than a big data strategy.
