We’ve seen that it’s possible to miss a real effect simply by not taking enough data. In most cases, this is a problem: we might miss a viable medicine or fail to notice an important side-effect. How do we know how much data to collect?
Statisticians provide the answer in the form of “statistical power.” The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely. Let’s try a simple example.
Suppose a gambler is convinced that an opponent has an unfair coin: rather than getting heads half the time and tails half the time, the proportion is different, and the opponent is using this to cheat at incredibly boring coin-flipping games. How to prove it?
You can’t just flip the coin a hundred times and count the heads. Even with a perfectly fair coin, you don’t always get fifty heads:
This shows the likelihood of getting different numbers of heads, if you flip a coin a hundred times.
You can see that 50 heads is the most likely option, but it’s also reasonably likely to get 45 or 57. So if you get 57 heads, the coin might be rigged, but you might just be lucky.
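The likelihoods described above follow the binomial distribution, so they can be reproduced with a few lines of Python. This is only a sketch: the function name `heads_probability` is made up for illustration, and it assumes independent flips with a fixed probability of heads.

```python
from math import comb


def heads_probability(k: int, n: int = 100, p: float = 0.5) -> float:
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Compare the head counts mentioned in the text for a fair coin.
for k in (45, 50, 57):
    print(f"P({k} heads in 100 flips) = {heads_probability(k):.4f}")
```

Fifty heads comes out as the single most likely count, yet it still happens less than one time in ten, which is why a count of 45 or 57 is unremarkable on its own.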
Let’s work out the math. Let’s say we look for a p value of 0.05 or less, as scientists typically do. That is, if I count up the number of heads after 10 or 100 trials and find a deviation from what I’d expect – half heads, half tails – I call the coin unfair if there’s only a 5% chance of getting a deviation that size or larger with a fair coin. Otherwise, I can conclude nothing: the coin may be fair, or it may be only a little unfair. I can’t tell.
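One way to make this rule concrete is an exact two-sided binomial test: add up the fair-coin probability of every outcome that deviates from half heads by at least as much as the observed count. The sketch below assumes that reading of "a deviation that size or larger"; the function name `two_sided_p_value` is hypothetical.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Chance that a fair coin deviates from flips/2 by at least the observed amount."""
    observed_deviation = abs(heads - flips / 2)
    # Count the equally likely flip sequences whose head count is at least this extreme.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_deviation
    )
    return extreme_sequences / 2**flips


for heads in (57, 61):
    p = two_sided_p_value(heads, 100)
    verdict = "call the coin unfair" if p <= 0.05 else "conclude nothing"
    print(f"{heads} heads in 100 flips: p = {p:.3f} -> {verdict}")
```

Under this rule, 57 heads in 100 flips is not enough to call the coin unfair, while 61 heads is.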
So, what happens if I flip a coin ten times and apply these criteria?
This is called a power curve. Along the horizontal axis, we have the different possibilities for the coin’s true probability of getting heads, corresponding to different levels of unfairness. On the vertical axis is the probability that I will conclude the coin is rigged after ten tosses, based on the p value of the result.
You can see that if the coin is rigged to give heads 60% of the time, and I flip the coin 10 times, I only have a 20% chance of concluding that it’s rigged. There’s just too little data to separate rigging from random variation. The coin would have to be incredibly biased for me to always notice.
But what if I flip the coin 100 times?
Or 1,000 times?
With one thousand flips, I can easily tell if the coin is rigged to give heads 60% of the time. It’s just overwhelmingly unlikely that I could flip a fair coin 1,000 times and get more than 600 heads.
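A point on a power curve can be computed directly rather than read off a plot. The sketch below assumes an exact two-sided binomial test at the 0.05 level: it first finds the head counts that would lead to calling the coin unfair, then asks how often a coin with a given true bias lands in that region. The names `binom_pmf` and `power` are made up for illustration, and the exact figures depend on which test is used, so they need not match values read from any particular plotted curve.

```python
from math import comb, exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k heads in n flips) for 0 < p < 1, computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def power(n: int, true_p: float, alpha: float = 0.05) -> float:
    """Chance of declaring the coin unfair after n flips when P(heads) = true_p."""
    weights = [comb(n, k) for k in range(n + 1)]  # fair-coin sequence counts
    total = 2**n
    lo, hi, tail = 0, n, 0
    rejected = []
    # Grow a symmetric rejection region inward from the extremes for as long
    # as a fair coin would land in it with probability at most alpha.
    while lo < hi and (tail + weights[lo] + weights[hi]) / total <= alpha:
        tail += weights[lo] + weights[hi]
        rejected += [lo, hi]
        lo, hi = lo + 1, hi - 1
    return sum(binom_pmf(k, n, true_p) for k in rejected)


for flips in (100, 1000):
    print(f"{flips} flips, coin gives 60% heads: power = {power(flips, 0.6):.3f}")
```

With a thousand flips, the computed power for a 60% coin is essentially 1, in line with the claim that such a coin is easy to catch.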
The power of being underpowered
After hearing all this, you might think calculations of statistical power are essential to medical trials. A scientist might want to know how many patients are needed to test if a new medication improves survival by more than 10%, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of concluding there’s a real effect.
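As a rough illustration of such a calculation, the sketch below uses the standard normal-approximation formula for comparing two proportions. The survival rates (50% for the control group, 60% with the new medication) are hypothetical numbers chosen for the example, as is the function name `patients_per_group`; a real trial would need a formula matched to its design.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(p_control: float, p_treatment: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate patients needed per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # extra margin for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2)


# Hypothetical trial: survival improves from 50% to 60%.
print(patients_per_group(0.50, 0.60))
```

Under these assumptions the answer is a few hundred patients per group, and the requirement grows quickly as the difference to be detected shrinks.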
However, few scientists ever perform this calculation, and few journal articles ever mention the statistical power of their tests.
Consider a trial testing two different treatments for the same condition. You might want to know which medicine is safer, but unfortunately, side effects are rare. You can test each medicine on a hundred patients, but only a few in each group suffer serious side effects.
Obviously, you won’t have much data with which to compare side effect rates. If four people have serious side effects in one group, and three in the other, you can’t tell if that’s the medication’s fault.
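To see how little four versus three says, one option is Fisher’s exact test, which asks how often a split at least this lopsided would arise if both medications had the same side effect rate. The sketch below is a plain-Python version of that test; the function name `fisher_exact_two_sided` is made up for illustration, and the group sizes of 100 come from the example above.

```python
from math import comb


def fisher_exact_two_sided(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p value for event counts in two groups."""
    total_events = events_a + events_b

    def ways(k: int) -> int:
        # Ways for group A to contain k of the events, with group B holding the rest.
        return comb(n_a, k) * comb(n_b, total_events - k)

    lowest = max(0, total_events - n_b)
    highest = min(n_a, total_events)
    observed = ways(events_a)
    # Sum every split that is no more likely than the one actually observed.
    extreme = sum(ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed)
    return extreme / comb(n_a + n_b, total_events)


print(fisher_exact_two_sided(4, 100, 3, 100))
```

Four versus three is the most evenly balanced split that seven events allow, so the test finds no evidence of a difference at all. That says nothing about whether one medication is in fact riskier.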
Unfortunately, many trials conclude with “There was no statistically significant difference in adverse effects between groups” without noting that there was insufficient data to detect any but the largest differences [57]. And so doctors erroneously think the medications are equally safe, when one could well be much more dangerous than the other.
You might think this is only a problem when the medication only has a weak effect. But no: in one sample of studies published between 1975 and 1990 in prestigious medical journals, 27% of randomized controlled trials gave negative results, but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups. Fifty percent! Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective. And 84% of the negative trials didn’t have the power to detect a 25% difference [17, 4, 11, 16].
In neuroscience the problem is even worse. Suppose we aggregate the data collected by numerous neuroscience papers investigating one particular effect and arrive at a strong estimate of the effect’s size. The median study has only a 20% chance of being able to detect that effect. Only after many studies were aggregated could the effect be discerned. Similar problems arise in neuroscience studies using animal models – which raises a significant ethical concern. If each individual study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly the first time [12].
That’s not to say scientists are lying when they state they detected no significant difference between groups. You’re just misleading yourself when you assume this means there is no real difference. There may be a difference, but the study was too small to notice it.
Let’s consider an example we see every day.
The wrong turn on red
In the 1970s, many parts of the United States began to allow drivers to turn right at a red light. For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths. But the 1973 oil crisis and its fallout spurred politicians to consider allowing right turn on red to save fuel wasted by commuters waiting at red lights.
Several studies were conducted to consider the safety impact of the change. For example, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of twenty intersections which began to allow right turns on red. Before the change there were 308 accidents at the intersections; after, there were 337 in a similar length of time. However, this difference was not statistically significant, and so the consultant concluded there was no safety impact.
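The lack of significance can be checked with a simple approximation. If the accident rate had not changed and the two periods were equally long, each of the 645 recorded accidents would be equally likely to fall before or after the change, so the split can be tested like a fair coin. This is a sketch under those assumptions, not necessarily the analysis the consultant performed; the function name `equal_rate_p_value` is hypothetical.

```python
from math import comb


def equal_rate_p_value(before: int, after: int) -> float:
    """Two-sided p value for an even split of accidents between two equal periods."""
    total = before + after
    observed_deviation = abs(before - total / 2)
    # Count the before/after splits at least as uneven as the observed one.
    extreme = sum(
        comb(total, k)
        for k in range(total + 1)
        if abs(k - total / 2) >= observed_deviation
    )
    return extreme / 2**total


print(f"p = {equal_rate_p_value(308, 337):.2f}")
```

The p value lands well above 0.05, so an increase from 308 to 337 accidents is within the range chance alone could produce. That result cannot distinguish "no effect" from "an effect too small for twenty intersections to reveal."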
Several subsequent studies had similar findings: small increases in the number of crashes, but not enough data to conclude these increases were significant. As one report concluded,
There is no reason to suspect that pedestrian accidents involving RT operations (right turns) have increased after the adoption of [right turn on red]…
Based on this data, more cities and states began to allow right turns at red lights. The problem, of course, is that these studies were underpowered. More pedestrians were being run over and more cars were involved in collisions, but nobody collected enough data to show this conclusively until several years later, when studies arrived clearly showing the results: significant increases in collisions and pedestrian accidents (sometimes up to 100% increases) [27, 48]. The misinterpretation of underpowered studies cost lives.