
Null Hypothesis Significance Testing May Be an Inaccurate Tool

New study suggests looking beyond NHST and P values when triaging statistical data and findings

American Marketing Association
Published: Dec 13, 2023

Researchers from Northwestern University, the University of Pennsylvania, and the University of Colorado published a new study in the Journal of Marketing that proposes abandoning null hypothesis significance testing (NHST) as the default approach to statistical analysis and reporting. 

What is a null hypothesis and NHST?

Null hypothesis significance testing (NHST) is the default approach to statistical analysis and reporting in marketing and, more broadly, in the biomedical and social sciences. As practiced, NHST involves:

  1. assuming that the intervention under investigation has no effect, along with other assumptions, 

  2. computing a statistical measure known as a P value based on these assumptions, and

  3. comparing the computed P value to the arbitrary threshold value of 0.05.

If the P value is less than 0.05, the effect is declared “statistically significant,” the assumption of no effect is rejected, and it is concluded that the intervention has an effect in the real world. If the P value is above 0.05, the effect is declared “statistically nonsignificant,” the assumption of no effect is not rejected, and it is concluded that the intervention has no effect in the real world.
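
To make this procedure concrete, here is a minimal sketch in Python. The two-group data, the choice of a t-test, and the 0.05 cutoff are illustrative assumptions for the example, not details from the study:

```python
# NHST as commonly practiced: assume no effect, compute a P value,
# and compare it to the 0.05 threshold. All data here are made up.
from scipy import stats

control = [4.1, 5.0, 4.6, 5.2, 4.8, 4.4, 5.1, 4.9]
treated = [5.3, 5.9, 4.7, 5.6, 6.0, 5.2, 5.8, 5.5]

# Steps 1-2: under the assumption of no effect (plus the t-test's
# other assumptions), compute the P value from the data.
t_stat, p_value = stats.ttest_ind(treated, control)

# Step 3: the dichotomization the study's authors argue against.
if p_value < 0.05:
    print(f"p = {p_value:.3f}: declared 'statistically significant'")
else:
    print(f"p = {p_value:.3f}: declared 'statistically nonsignificant'")
```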

Criticisms of NHST

Despite its default role, NHST has long been criticized by both statisticians and applied researchers. The most prominent criticisms relate to the dichotomization of results into “statistically significant” and “statistically nonsignificant.”

For example, authors, editors, and reviewers use statistical (non)significance as a filter to select which results to publish. Robert J. Meyer, PhD, Frederick H. Ecker/MetLife Insurance Professor of Marketing at The Wharton School, University of Pennsylvania, and co-author of the study, says, “This creates a distorted literature because the effects of published interventions are biased upward in magnitude. It also encourages harmful research practices that yield results that attain so-called statistical significance.”

Perhaps the most widespread abuse of statistics is to ascertain where some statistical measure, such as a P value, stands relative to 0.05 and to take that as a basis for declaring statistical (non)significance and for drawing general and certain conclusions from a single study. 

“Single studies are never definitive, and thus, can never demonstrate an effect or no effect. The aim of studies should be to report results in an unfiltered manner so that they can later be used to make more general conclusions based on cumulative evidence from multiple studies. NHST leads researchers to wrongly make general and certain conclusions and to wrongly filter results,” says Eric T. Bradlow, PhD, professor of marketing at Wharton.

Recommended Changes to Statistical Analysis

The authors propose a major transition in statistical analysis and reporting. Specifically, they propose abandoning NHST—and the P value thresholds intrinsic to it—as the default approach to statistical analysis and reporting. Their recommendations are as follows:

  • Statistical (non)significance should never be used as a basis to make general and certain conclusions. Statistical (non)significance should also never be used as a filter to select which results to publish. Instead, all studies should be published in some form or another.

  • Reporting should focus on quantifying study results via point and interval estimates (see the sketch after this list). All of the values inside conventional interval estimates are at least reasonably compatible with the data, given all of the assumptions used to compute them; therefore, it makes no sense to single out a specific value such as the null value.

  • General conclusions should be made based on the cumulative evidence from multiple studies.

  • Studies need to treat P values continuously and as just one factor among many—including prior evidence, the plausibility of mechanism, study design, data quality, and others that vary by research domain—that require joint consideration and holistic integration. Researchers must also respect the fact that such conclusions are necessarily tentative and subject to revision as new studies are conducted.
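
As one illustration of the estimate-focused reporting recommended above, the sketch below reports a difference in means with a 95 percent interval estimate instead of a significance verdict. The data and the pooled-variance t-interval are assumptions made for this example:

```python
# Report a point estimate and a 95% interval estimate rather than
# a significant/nonsignificant verdict. Data are hypothetical.
import numpy as np
from scipy import stats

control = np.array([4.1, 5.0, 4.6, 5.2, 4.8, 4.4, 5.1, 4.9])
treated = np.array([5.3, 5.9, 4.7, 5.6, 6.0, 5.2, 5.8, 5.5])

# Point estimate: the observed difference in means.
diff = treated.mean() - control.mean()

# Standard error of the difference (pooled-variance form).
n1, n2 = len(treated), len(control)
sp2 = ((n1 - 1) * treated.var(ddof=1)
       + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

# 95% interval estimate: every value inside it is at least reasonably
# compatible with the data under the model's assumptions.
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"difference in means: {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```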

Decisions are seldom necessary in scientific reporting; when they are, they are best left to end users such as managers and clinicians. In such cases, decisions should be made using a decision analysis that integrates the costs, benefits, and probabilities of all possible consequences via a loss function (which typically varies dramatically across stakeholders), not via arbitrary thresholds applied to statistical summaries such as P values, which, outside of certain specialized applications such as industrial quality control, are insufficient for this purpose.
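
By way of illustration only, a decision analysis of this kind might compare the expected loss of each available action. The probability and cost figures below are invented for the sketch; in practice they would come from cumulative evidence and stakeholder-specific loss functions:

```python
# Choose the action with the lowest expected loss, integrating
# probabilities and consequences rather than thresholding a P value.
# All numbers here are hypothetical.

# Probability (from cumulative evidence) that the intervention works.
p_effective = 0.70

# Hypothetical losses for each (action, state-of-world) pair, e.g. in
# dollars; these would differ dramatically across stakeholders.
loss = {
    ("adopt", "effective"): 0,       # right call, no loss
    ("adopt", "ineffective"): 500,   # cost of a useless rollout
    ("reject", "effective"): 300,    # forgone benefit
    ("reject", "ineffective"): 0,    # right call, no loss
}

def expected_loss(action: str) -> float:
    return (p_effective * loss[(action, "effective")]
            + (1 - p_effective) * loss[(action, "ineffective")])

best = min(["adopt", "reject"], key=expected_loss)
print(f"adopt: {expected_loss('adopt'):.0f}, "
      f"reject: {expected_loss('reject'):.0f} -> choose {best}")
```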

- This press release was originally published on the American Marketing Association website