Statistical power and underpowered statistics — Statistics Done Wrong (2024)

We’ve seen that it’s possible to miss a real effect simply by not taking enough data. In most cases, this is a problem: we might miss a viable medicine or fail to notice an important side effect. How do we know how much data to collect?

Statisticians provide the answer in the form of “statistical power.” The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely. Let’s try a simple example.

Suppose a gambler is convinced that an opponent has an unfair coin: rather than getting heads half the time and tails half the time, the proportion is different, and the opponent is using this to cheat at incredibly boring coin-flipping games. How to prove it?

You can’t just flip the coin a hundred times and count the heads. Even with a perfectly fair coin, you don’t always get fifty heads:

[Figure: distribution of head counts in 100 flips of a fair coin]

You can see that 50 heads is the most likely option, but it’s also reasonably likely to get 45 or 57. So if you get 57 heads, the coin might be rigged, but you might just be lucky.

Let’s work out the math. Let’s say we look for a p value of 0.05 or less, as scientists typically do. That is, if I count up the number of heads after 10 or 100 trials and find a deviation from what I’d expect – half heads, half tails – I call the coin unfair if there’s only a 5% chance of getting a deviation that size or larger with a fair coin. Otherwise, I can conclude nothing: the coin may be fair, or it may be only a little unfair. I can’t tell.
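This criterion is easy to check directly. As a quick sketch (my own, not code from the book), the exact two-sided p value for a fair coin is the probability of a deviation from half heads at least as large as the one observed:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    # probability of exactly k heads in n flips of a coin with P(heads) = p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p(heads, n):
    # chance, for a fair coin, of a deviation from n/2 at least as large
    # as the one observed
    dev = abs(heads - n / 2)
    return sum(binom_pmf(k, n) for k in range(n + 1) if abs(k - n / 2) >= dev)

print(two_sided_p(57, 100))  # roughly 0.19 -- not below 0.05, so 57 heads proves nothing
```

So even the 57-heads outcome from the figure above falls short of the 0.05 threshold.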

So, what happens if I flip a coin ten times and apply these criteria?

[Figure: power curve for ten coin flips]

This is called a power curve. Along the horizontal axis, we have the different possibilities for the coin’s true probability of getting heads, corresponding to different levels of unfairness. On the vertical axis is the probability that I will conclude the coin is rigged after ten tosses, based on the p value of the result.

You can see that if the coin is rigged to give heads 60% of the time, and I flip the coin 10 times, I only have a 20% chance of concluding that it’s rigged. There’s just too little data to separate rigging from random variation. The coin would have to be incredibly biased for me to always notice.
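To see where a point on such a curve comes from, here is a sketch (my own, not the book’s code) that enumerates the outcomes an exact two-sided binomial test would call significant, then asks how likely a biased coin is to land on one of them. The exact test is discrete and conservative at small n, so the numbers need not match the chart precisely:

```python
from math import comb

def pmf(k, n, p):
    # binomial probability of exactly k heads in n flips
    return comb(n, k) * p**k * (1 - p)**(n - k)

def power(n, true_p, alpha=0.05):
    # outcomes that a two-sided exact test of "the coin is fair" rejects
    null = [pmf(k, n, 0.5) for k in range(n + 1)]
    def p_value(k):
        dev = abs(k - n / 2)
        return sum(q for j, q in enumerate(null) if abs(j - n / 2) >= dev)
    rejection = [k for k in range(n + 1) if p_value(k) <= alpha]
    # power: the chance a coin with P(heads) = true_p lands in that region
    return sum(pmf(k, n, true_p) for k in rejection)

print(power(10, 0.6))    # small: 10 flips rarely expose a 60% coin
print(power(1000, 0.6))  # near 1: 1,000 flips almost always do
```

Sweeping `true_p` from 0 to 1 at fixed n traces out one power curve.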

But what if I flip the coin 100 times?

[Figure: power curve for one hundred coin flips]

Or 1,000 times?

[Figure: power curve for one thousand coin flips]

With one thousand flips, I can easily tell if the coin is rigged to give heads 60% of the time. It’s just overwhelmingly unlikely that I could flip a fair coin 1,000 times and get more than 600 heads.
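That claim can be verified by brute force. A quick check of my own, using exact integer arithmetic:

```python
from math import comb

n = 1000
# exact probability that a fair coin gives more than 600 heads in 1,000 flips
tail = sum(comb(n, k) for k in range(601, n + 1)) / 2**n
print(tail)  # vanishingly small (on the order of 1e-10)
```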

The power of being underpowered

After hearing all this, you might think calculations of statistical power are essential to medical trials. A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of concluding there’s a real effect.

However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests.

Consider a trial testing two different treatments for the same condition. You might want to know which medicine is safer, but unfortunately, side effects are rare. You can test each medicine on a hundred patients, but only a few in each group suffer serious side effects.

Obviously, you won’t have terribly much data to compare side effect rates. If four people have serious side effects in one group, and three in the other, you can’t tell if that’s the medication’s fault.
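Numbers like these can be put through Fisher’s exact test, the usual tool for small 2×2 tables. A self-contained sketch (my own implementation of the standard two-sided test, not code from any trial):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    # 2x2 table [[a, b], [c, d]]: exact test of association, summing the
    # probabilities of all tables no more probable than the observed one
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def table_prob(x):
        # hypergeometric probability of x events in the first group
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    observed = table_prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(table_prob(x) for x in range(lo, hi + 1)
               if table_prob(x) <= observed + 1e-12)

# 4 of 100 patients vs. 3 of 100 with serious side effects
print(fisher_exact_two_sided(4, 96, 3, 97))  # close to 1: no evidence either way
```

A p value near 1 does not mean the drugs are equally safe; it means the data say almost nothing either way.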

Unfortunately, many trials conclude with “There was no statistically significant difference in adverse effects between groups” without noting that there was insufficient data to detect any but the largest differences.[57] And so doctors erroneously think the medications are equally safe, when one could well be much more dangerous than the other.

You might think this is only a problem when the medication only has a weak effect. But no: in one sample of studies published between 1975 and 1990 in prestigious medical journals, 27% of randomized controlled trials gave negative results, but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups. Fifty percent! Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective. And 84% of the negative trials didn’t have the power to detect a 25% difference.[17, 4, 11, 16]

In neuroscience the problem is even worse. Suppose we aggregate the data collected by numerous neuroscience papers investigating one particular effect and arrive at a strong estimate of the effect’s size. The median study has only a 20% chance of being able to detect that effect. Only after many studies were aggregated could the effect be discerned. Similar problems arise in neuroscience studies using animal models – which raises a significant ethical concern. If each individual study is underpowered, the true effect will likely only be discovered after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time.[12]

That’s not to say scientists are lying when they state they detected no significant difference between groups. You’re just misleading yourself when you assume this means there is no real difference. There may be a difference, but the study was too small to notice it.

Let’s consider an example we see every day.

The wrong turn on red

In the 1970s, many parts of the United States began to allow drivers to turn right at a red light. For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths. But the 1973 oil crisis and its fallout spurred politicians to consider allowing right turn on red to save fuel wasted by commuters waiting at red lights.

Several studies were conducted to consider the safety impact of the change. For example, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of twenty intersections which began to allow right turns on red. Before the change there were 308 accidents at the intersections; after, there were 337 in a similar length of time. However, this difference was not statistically significant, and so the consultant concluded there was no safety impact.
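The consultant’s non-significant result is easy to reproduce. Conditional on the 645 total crashes, if the rule change made no difference, each crash is equally likely to fall in either period, so the “after” count behaves like a fair-coin head count. A quick sketch of my own, using that standard conditional argument:

```python
from math import comb

before, after = 308, 337
n = before + after
# under "no change", after ~ Binomial(645, 0.5); two-sided exact p value
dev = abs(after - n / 2)
p = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev) / 2**n
print(p)  # well above 0.05: "not significant" -- but the test is badly underpowered
```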

Several subsequent studies had similar findings: small increases in the number of crashes, but not enough data to conclude these increases were significant. As one report concluded,

There is no reason to suspect that pedestrian accidents involving RT operations (right turns) have increased after the adoption of [right turn on red]…

Based on this data, more cities and states began to allow right turns at red lights. The problem, of course, is that these studies were underpowered. More pedestrians were being run over and more cars were involved in collisions, but nobody collected enough data to show this conclusively until several years later, when studies arrived clearly showing the results: significant increases in collisions and pedestrian accidents (sometimes up to 100% increases).[27, 48] The misinterpretation of underpowered studies cost lives.


FAQs

What does it mean to be underpowered in statistics?

An underpowered study does not have a large enough sample size to answer the research question of interest. An overpowered study has too large a sample size and wastes resources.

What are the issues with low statistical power?

Low power means that your test has only a small chance of detecting a true effect, and that the results are likely to be distorted by random and systematic error. Power is mainly influenced by sample size, effect size, and significance level.

How do you know if you have enough statistical power?

Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the assumed size.

What are the problems with underpowered studies?

Underpowered studies are problematic because they lead to biased conclusions (Maxwell, 2004; Christley, 2010; Turner et al., 2013; Kühberger et al., 2014). The reason is that underpowered studies yield excessively wide sampling distributions for the sample estimates.

What is statistical power?

Power is the long-run probability that a series of identical studies will detect a statistically significant effect (e.g., p < 0.05) if there is one. The probability of a type II error in such a series is one minus the power (1 − β, often 20%).

Can you have too much statistical power?

A small, unimportant effect may be demonstrated with a high degree of statistical significance if the sample size is large enough. Because of this, too much power can almost be a bad thing, at least as long as many people continue to misunderstand the meaning of statistical significance.

Is 80% statistical power enough?

Ideally, the minimum power required of a study is 80%, which makes the sample-size calculation a critical and fundamental part of designing a study protocol. Even after a study is completed, a retrospective power analysis can be useful, especially when statistically non-significant results are obtained.

What type of error does low statistical power increase?

When conducting a study with a small sample size, and ultimately low power, researchers should be aware of the increased likelihood of a type II error (failing to detect a real effect). The greater the N in a study, the more likely it is that a researcher will correctly reject a false null hypothesis.

What decreases statistical power?

Both small sample sizes and small effect sizes reduce the power of a study. Power, the probability of rejecting a false null hypothesis, is calculated as 1 − β (one minus the type II error probability).

How do you improve statistical power?

Increase the sample size; increase the significance level (alpha); reduce measurement error by improving the precision and accuracy of your measurement devices and procedures; or, for t tests and z tests, use a one-tailed test instead of a two-tailed test.
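The first lever, sample size, is the most reliable. A short normal-approximation sketch (my own, reusing the chapter’s 60%-heads coin as the hypothetical effect) shows power climbing with n:

```python
from math import sqrt, erf

def norm_cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

def coin_power(n, true_p, alpha_z=1.96):
    # approximate power of a two-sided test of "P(heads) = 0.5" with n flips
    se_null = sqrt(0.25 / n)
    se_alt = sqrt(true_p * (1 - true_p) / n)
    d = abs(true_p - 0.5)
    return (norm_cdf((d - alpha_z * se_null) / se_alt)
            + norm_cdf((-d - alpha_z * se_null) / se_alt))

for n in (10, 100, 1000):
    print(n, coin_power(n, 0.6))  # power rises steadily with sample size
```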

How do you know if a statistic is sufficient?

A statistic T(X) is sufficient for θ if the conditional distribution of X given T(X) = T(x) does not depend on θ. Sufficiency depends on the parameter of interest. If X is discrete, then so is T(X), and sufficiency means that P(X = x | T(X) = T(x)) is known, i.e., it does not depend on any unknown quantity.

Why is it good to have high statistical power?

The statistical power of a hypothesis test affects the interpretation of its results. Not finding an effect with a more powerful study is stronger evidence that the effect does not exist than the same null result from a less powerful study.

Can an underpowered study be statistically significant?

Yes. Even if the power of any specific test is low, the probability of obtaining a statistically significant result somewhere in the study can be substantial.

What does underpowered mean in statistics?

An underpowered A/B or multivariate test is one that had a poor probability of detecting a specified minimum effect of interest (MEI). A non-significant outcome from such a test is weak evidence for the null hypothesis versus the specified alternative for which the power was calculated.

Why is low statistical power bad?

Studies with low statistical power increase the likelihood that a statistically significant finding represents a false positive result, because fewer of the significant results they produce reflect true effects.

What does low power mean in statistics?

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. A high-powered study is well placed to identify such effects; low power reduces your chances of discovering real findings.

What does power level mean in statistics?

The power level specifies the chance of not making a type II error. Researchers usually take the power level to be 0.80; in other words, an 80% chance of not making a type II error.

What does powerful mean in statistics?

In frequentist statistics, power is a measure of the ability of an experimental design and hypothesis-testing setup to detect a particular effect if it is truly present.

How do you increase the power level in statistics?

The power of a test can be increased in a number of ways: increasing the sample size, decreasing the standard error, increasing the difference between the sample statistic and the hypothesized parameter (the effect size), or increasing the alpha level.
