Statistical power and underpowered statistics — Statistics Done Wrong (2024)

We’ve seen that it’s possible to miss a real effect simply by not taking enough data. In most cases, this is a problem: we might miss a viable medicine or fail to notice an important side effect. How do we know how much data to collect?

Statisticians provide the answer in the form of “statistical power.” The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely. Let’s try a simple example.

Suppose a gambler is convinced that an opponent has an unfair coin: rather than getting heads half the time and tails half the time, the proportion is different, and the opponent is using this to cheat at incredibly boring coin-flipping games. How to prove it?

You can’t just flip the coin a hundred times and count the heads. Even with a perfectly fair coin, you don’t always get fifty heads:
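
As a rough illustration of that point, the sketch below computes the binomial probabilities for a hundred flips of a fair coin. The function name `heads_probability` and the 40-to-60 range are choices made for this example, not something taken from the original text.

```python
from math import comb

def heads_probability(k: int, flips: int = 100, p_heads: float = 0.5) -> float:
    """Binomial probability of exactly k heads in `flips` tosses."""
    return comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)

# Chance of landing on exactly fifty heads with a fair coin.
exactly_fifty = heads_probability(50)

# Chance of landing anywhere from 40 to 60 heads (range chosen for illustration).
within_ten = sum(heads_probability(k) for k in range(40, 61))

print(f"P(exactly 50 heads) = {exactly_fifty:.3f}")
print(f"P(40 to 60 heads)   = {within_ten:.3f}")
```

A fair coin gives exactly fifty heads only about 8% of the time, while counts anywhere between 40 and 60 are all unremarkable, so a single run of a hundred flips says little about a slightly biased coin.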

The power of being underpowered

After hearing all this, you might think calculations of statistical power are essential to medical trials. A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the assumed size.
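
To give a sense of what such a calculation looks like, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. Everything specific is an assumption made for illustration: the function name `patients_per_group`, a baseline survival rate of 50%, an improvement to 60% (reading "10%" as ten percentage points), a two-sided significance level of 0.05, and equal group sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

def patients_per_group(p_control: float, p_treatment: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control)
                         + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)

# Hypothetical survival rates: 50% on the old treatment, 60% on the new one.
n_needed = patients_per_group(0.50, 0.60)
print(f"Patients needed per group: {n_needed}")

# Halving the difference to detect requires far more patients.
n_subtle = patients_per_group(0.50, 0.55)
print(f"Patients needed per group for a 5-point difference: {n_subtle}")
```

Under these assumptions the answer runs to several hundred patients per group, and the requirement grows quickly as the difference to be detected shrinks.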

However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests.

Consider a trial testing two different treatments for the same condition. You might want to know which medicine is safer, but unfortunately, side effects are rare. You can test each medicine on a hundred patients, but only a few in each group suffer serious side effects.

Obviously, you won’t have terribly much data to compare side effect rates. If four people have serious side effects in one group, and three in the other, you can’t tell if that’s the medication’s fault.
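
The four-versus-three comparison can be checked with Fisher's exact test, sketched here with only the standard library. The test choice and the helper name `fisher_exact_two_sided` are assumptions for this example; the text does not say which test such trials used.

```python
from math import comb

def fisher_exact_two_sided(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for event counts in two groups."""
    total_events = events_a + events_b
    denominator = comb(n_a + n_b, total_events)

    def prob(k: int) -> float:
        # Hypergeometric chance that group A holds k of the events.
        return comb(n_a, k) * comb(n_b, total_events - k) / denominator

    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    observed = prob(events_a)
    # Sum every split that is no more likely than the one observed.
    p_value = sum(prob(k) for k in range(lowest, highest + 1)
                  if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Four serious side effects among 100 patients versus three among 100.
p_small_gap = fisher_exact_two_sided(4, 100, 3, 100)
print(f"p-value for 4 vs 3 events: {p_small_gap:.3f}")

# Hypothetical contrast: a gap this size would be needed to stand out.
p_large_gap = fisher_exact_two_sided(20, 100, 3, 100)
print(f"p-value for 20 vs 3 events: {p_large_gap:.5f}")
```

A four-to-three split is the most likely outcome when the two rates are identical, so the p-value is essentially 1. The same p-value would also appear quite often if one medication were noticeably more dangerous, which is why the result says so little.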

Unfortunately, many trials conclude with “There was no statistically significant difference in adverse effects between groups” without noting that there was insufficient data to detect any but the largest differences.⁵⁷ And so doctors erroneously think the medications are equally safe, when one could well be much more dangerous than the other.

You might think this is only a problem when the medication has a weak effect. But no: in one sample of studies published between 1975 and 1990 in prestigious medical journals, 27% of randomized controlled trials gave negative results, but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups. Fifty percent! Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective. And 84% of the negative trials didn’t have the power to detect a 25% difference.¹⁷,⁴,¹¹,¹⁶

In neuroscience the problem is even worse. Suppose we aggregate the data collected by numerous neuroscience papers investigating one particular effect and arrive at a strong estimate of the effect’s size. The median study has only a 20% chance of being able to detect that effect. Only after many studies were aggregated could the effect be discerned. Similar problems arise in neuroscience studies using animal models – which raises a significant ethical concern. If each individual study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time.¹²

That’s not to say scientists are lying when they state they detected no significant difference between groups. You’re just misleading yourself when you assume this means there is no real difference. There may be a difference, but the study was too small to notice it.

Let’s consider an example we see every day.

The wrong turn on red

In the 1970s, many parts of the United States began to allow drivers to turn right at a red light. For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths. But the 1973 oil crisis and its fallout spurred politicians to consider allowing right turn on red to save fuel wasted by commuters waiting at red lights.

Several studies were conducted to consider the safety impact of the change. For example, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of twenty intersections which began to allow right turns on red. Before the change there were 308 accidents at the intersections; after, there were 337 in a similar length of time. However, this difference was not statistically significant, and so the consultant concluded there was no safety impact.
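
The sketch below shows one way to examine those two counts; it is not the consultant's actual analysis, which the text does not describe. It assumes the before and after periods had equal exposure, so that with no safety impact each accident is equally likely to fall in either period, and it runs an exact binomial test on that basis. It then asks how often a study of this size would detect a hypothetical true increase of 10%, a figure chosen purely for illustration.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k: int, n: int) -> float:
    """Exact two-sided binomial test of p = 0.5 (symmetric null, so double one tail)."""
    extreme = max(k, n - k)
    tail = sum(binom_pmf(i, n, 0.5) for i in range(extreme, n + 1))
    return min(1.0, 2 * tail)

def power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Chance the test above rejects p = 0.5 when the true share is p_true."""
    rejecting = (k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha)
    return sum(binom_pmf(k, n, p_true) for k in rejecting)

before, after = 308, 337
total = before + after

p_value = two_sided_p_value(after, total)
print(f"p-value for 308 vs 337 accidents: {p_value:.2f}")

# Hypothetical true effect: 10% more crashes after the change, so the
# "after" period's expected share of all crashes is 1.1 / (1 + 1.1).
assumed_share = 1.1 / 2.1
detect_chance = power(total, assumed_share)
print(f"Power to detect a 10% increase: {detect_chance:.2f}")
```

Under these assumptions the difference is indeed not significant, but the power comes out well below the conventional 0.8. A real 10% rise in crashes would usually have gone undetected in a study this size.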

Several subsequent studies had similar findings: small increases in the number of crashes, but not enough data to conclude these increases were significant. As one report concluded,

There is no reason to suspect that pedestrian accidents involving RT operations (right turns) have increased after the adoption of [right turn on red]…

Based on this data, more cities and states began to allow right turns at red lights. The problem, of course, is that these studies were underpowered. More pedestrians were being run over and more cars were involved in collisions, but nobody collected enough data to show this conclusively until several years later, when studies arrived clearly showing the results: significant increases in collisions and pedestrian accidents (sometimes up to 100% increases).²⁷,⁴⁸ The misinterpretation of underpowered studies cost lives.

FAQs

What does it mean to be underpowered in statistics?

An underpowered study does not have a sufficiently large sample size to answer the research question of interest. An overpowered study has too large a sample size and wastes resources.

What are the issues with low statistical power?

Low power means that your test only has a small chance of detecting a true effect or that the results are likely to be distorted by random and systematic error. Power is mainly influenced by sample size, effect size, and significance level.

How do you know if you have enough statistical power?

Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the assumed size.

What are the problems with underpowered studies?

Underpowered studies are problematic because they lead to biased conclusions (Maxwell, 2004; Christley, 2010; Turner et al., 2013; Kühberger et al., 2014). The reason behind these biased conclusions is that underpowered studies yield excessively wide sampling distributions for the sample estimates.

What is the statistical power in statistics?

The power is the long-run probability that a series of identical studies detects a statistically significant effect (e.g., p < 0.05) if one exists. The probability of a type II error in a series of identical studies is one minus the power (β, often 20%).
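
The "series of identical studies" reading can be checked by simulation. The sketch below repeats a hypothetical two-group study many times and counts how often it reaches significance. The group size of 20, the true difference of half a standard deviation, the known-variance z-test, and the function name `simulate_power` are all assumptions made for this example.

```python
import random
from statistics import NormalDist

def simulate_power(n_per_group: int, true_difference: float, sd: float = 1.0,
                   alpha: float = 0.05, n_studies: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated studies whose two-sided z-test is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * (2 / n_per_group) ** 0.5  # known-sd two-sample z-test
    detections = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_difference, sd) for _ in range(n_per_group)]
        difference = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(difference / standard_error) > z_crit:
            detections += 1
    return detections / n_studies

# Real effect present: the detection rate estimates the power.
estimated_power = simulate_power(n_per_group=20, true_difference=0.5)
print(f"Estimated power: {estimated_power:.2f}")

# No effect at all: the detection rate should sit near alpha instead.
false_positive_rate = simulate_power(n_per_group=20, true_difference=0.0)
print(f"False positive rate: {false_positive_rate:.2f}")
```

The share of simulated studies that miss the real effect, one minus the estimated power, is the type II error rate for this design.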

Can you have too much statistical power?

A small, unimportant effect may be demonstrated with a high degree of statistical significance if the sample size is large enough. Because of this, too much power can almost be a bad thing, at least so long as many people continue to misunderstand the meaning of statistical significance.

Is 80% statistical power enough?

Ideally, a study should have a minimum power of 80%. Hence, the sample size calculation is critical and fundamental for designing a study protocol. Even after a study is completed, a retrospective power analysis can be useful, especially when the results are not statistically significant.

What types of error does low statistical power increase?

When conducting a study with a small sample size, and ultimately low power, researchers should be aware of the likelihood of a type II error. The greater the N within a study, the more likely it is that a researcher will reject a false null hypothesis.

What decreases statistical power?

Both small sample sizes and small effect sizes reduce the power of a study. Power, which is the probability of rejecting a false null hypothesis, is calculated as 1 - β (also expressed as “1 - Type II error probability”).

How to improve statistical power?

Increase the sample size; increase the significance level (alpha); reduce measurement error by increasing the precision and accuracy of your measurement devices and procedures; or use a one-tailed test instead of a two-tailed test for t tests and z tests.
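
Each of these levers can be seen in a small calculation. The sketch below uses the textbook power formula for a one-sample z-test with known standard deviation. The baseline numbers (an effect of 0.3, a standard deviation of 1.0, 30 observations) and the function name `z_test_power` are hypothetical values picked for illustration.

```python
from statistics import NormalDist

def z_test_power(effect: float, sd: float, n: int,
                 alpha: float = 0.05, two_sided: bool = True) -> float:
    """Power of a one-sample z-test to detect a true mean shift of `effect`."""
    z = NormalDist()
    shift = effect / (sd / n ** 0.5)  # true effect measured in standard errors
    if two_sided:
        crit = z.inv_cdf(1 - alpha / 2)
        return z.cdf(shift - crit) + z.cdf(-shift - crit)
    # The one-sided test assumes the effect lies in the predicted (positive) direction.
    crit = z.inv_cdf(1 - alpha)
    return z.cdf(shift - crit)

baseline = z_test_power(effect=0.3, sd=1.0, n=30)

scenarios = {
    "double the sample size": z_test_power(effect=0.3, sd=1.0, n=60),
    "raise alpha to 0.10": z_test_power(effect=0.3, sd=1.0, n=30, alpha=0.10),
    "cut measurement noise": z_test_power(effect=0.3, sd=0.7, n=30),
    "use a one-tailed test": z_test_power(effect=0.3, sd=1.0, n=30, two_sided=False),
}

print(f"baseline power: {baseline:.2f}")
for change, value in scenarios.items():
    print(f"{change}: {value:.2f}")
```

Every change raises power above the baseline, but not at equal cost: a larger alpha accepts more false positives, and a one-tailed test cannot detect an effect in the unexpected direction.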

How do you know if a statistic is sufficient?

A statistic T(X) is sufficient for θ if the conditional distribution of X given T(X) = T(x) does not depend on θ. The sufficiency depends on the parameter of interest. If X is discrete, then so is T(X) and sufficiency means that P(X = x|T(X) = T(x)) is known, i.e., it does not depend on any unknown quantity.

Why is it good to have high statistical power?

The statistical power of a hypothesis test has an impact on the interpretation of its results. Not finding a result with a more powerful study is stronger evidence against the effect existing than the same finding with a less powerful study.

Can an underpowered study be statistically significant?

It is entirely possible that the power of any specific test might be low and yet the probability of obtaining a statistically significant result somewhere in the study could be substantial.

What does underpowered mean in statistics?

An underpowered A/B or MVT test is a test that had a relatively low probability of detecting a specified effect size of interest (MEI). Getting a non-significant outcome from such a test is poor evidence for the null hypothesis versus the specified alternative hypothesis for which statistical power is calculated.

Why is low statistical power bad?

Studies with low statistical power increase the likelihood that a statistically significant finding represents a false positive result.

What does low power mean in statistics?

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing just for being able to identify these effects. Low power reduces your chances of discovering real findings.

What does power level mean in statistics?

The power level specifies the chance of not making a Type II error. Researchers usually set the power level at 0.80, which gives an 80% chance of not making a Type II error when the effect is real.

What does powerful mean in statistics?

In frequentist statistics, power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present.

How do you increase power level in statistics?

The power of a test can be increased in a number of ways, for example increasing the sample size, decreasing the standard error, increasing the difference between the sample statistic and the hypothesized parameter, or increasing the alpha level.
