
How strongly I recommend this book: 8 / 10
Date read: January 15, 2026
Get this book on Amazon. Here are my other book notes on biographies, psychology, and more.
This book is a survival guide for the modern era that teaches you how to see through the “data-driven” smoke and mirrors often used to mislead us. My biggest takeaway was that numbers aren’t inherently objective; they are frequently manipulated—sometimes accidentally and sometimes on purpose—to create a version of reality that isn’t actually there. It’s an eye-opening look at how easily we can be fooled by fancy models and “significant” findings that are actually just statistical noise.
I went through my notes and captured key quotes from all chapters below.
P.S. – Highly recommend Readwise if you want to get the most out of your reading.
For those who perform statistical analyses for their day jobs, there are Tips at the end of most chapters to explain what statistical techniques you might use to avoid common pitfalls. But this is not a textbook, so I will not teach you how to use these techniques in any technical detail. I hope only to make you aware of the most common problems so you are able to pick the statistical technique best suited to your question.
It would be stupid, though, to treat a little knowledge of statistics as an excuse to reject all of modern science. A research paper’s statistical methods can be judged only in detail and in context with the rest of its methods: study design, measurement techniques, cost constraints, and goals. Use your statistical knowledge to better understand the strengths, limitations, and potential biases of research, not to shoot down any paper that seems to misuse a p value or contradict your personal beliefs. Also, remember that a conclusion supported by poor statistics can still be correct: statistical and logical errors do not make a conclusion wrong, merely unsupported.
We will always observe some difference due to luck and random variation, so statisticians talk about statistically significant differences when the difference is larger than could easily be produced by luck.
If you test your medication on only one person, it’s not too surprising if her cold ends up being a little shorter than usual. Most colds aren’t perfectly average. But if you test the medication on 10 million patients, it’s pretty unlikely that all those patients will just happen to get shorter colds. More likely, your medication actually works. Scientists quantify this intuition with a concept called the p value. The p value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what you actually observed.
Remember, a p value is not a measure of how right you are or how important a difference is. Instead, think of it as a measure of surprise. If you assume your medication is ineffective and there is no reason other than luck for the two groups to differ, then the smaller the p value, the more surprising and lucky your results are— or your assumption is wrong, and the medication truly works.
How do you translate a p value into an answer to this question: “Is there really a difference between these groups?” A common rule of thumb is to say that any difference where p < 0.05 is statistically significant. There’s no special logical or statistical reason for choosing 0.05; it has simply become scientific convention through decades of common use.
Notice that the p value works by assuming there is no difference between your experimental groups. This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing the data is inconsistent with the drug not working. Because of this, p values can be extended to any situation where you can mathematically express a hypothesis you want to knock down.
But p values have their limitations. Remember, p is a measure of surprise, with a smaller value suggesting that you should be more surprised. It’s not a measure of the size of the effect. You can get a tiny p value by measuring a huge effect—“This medicine makes people live four times longer”—or by measuring a tiny effect with great certainty. And because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences.
In short, statistical significance does not mean your result has any practical significance. As for statistical insignificance, it doesn’t tell you much. A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data.
Recall that a p value is calculated under the assumption that luck (not your medication or intervention) is the only factor in your experiment, and that p is defined as the probability of obtaining a result equal to or more extreme than the one observed. This means p values force you to reason about results that never actually occurred—that is, results more extreme than yours. The probability of obtaining such results depends on your experimental design, which makes p values “psychic”: two experiments with different designs can produce identical data but different p values because the unobserved data is different.
Suppose I ask you a series of 12 true-or-false questions about statistical inference, and you correctly answer 9 of them. I want to test the hypothesis that you answered the questions by guessing randomly. To do this, I need to compute the chances of you getting at least 9 answers right by simply picking true or false randomly for each question. Assuming you pick true and false with equal probability, I compute p = 0.073. And since p > 0.05, it’s plausible that you guessed randomly. If you did, you’d get 9 or more questions correct 7.3% of the time.

But perhaps it was not my original plan to ask you only 12 questions. Maybe I had a computer that generated a limitless supply of questions and simply asked questions until you got 3 wrong. Now I have to compute the probability of you getting 3 questions wrong after being asked 15 or 20 or 47 of them. I even have to include the remote possibility that you made it to 175,231 questions before getting 3 questions wrong. Doing the math, I find that p = 0.033. Since p < 0.05, I conclude that random guessing would be unlikely to produce this result. This is troubling: two experiments can collect identical data but result in different conclusions. Somehow, the p value can read your intentions.
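To make those two calculations concrete, here is a minimal sketch in Python using scipy’s binomial distribution; the numbers reproduce the 0.073 and 0.033 quoted above.

```python
from scipy.stats import binom

# Fixed design: 12 true-or-false questions, 9 answered correctly.
# p value = P(9 or more correct out of 12) under random guessing (p = 0.5).
p_fixed = binom.sf(8, n=12, p=0.5)           # survival function: P(X > 8) = P(X >= 9)
print(f"Fixed 12-question design: p = {p_fixed:.3f}")        # ~0.073

# Sequential design: keep asking until 3 answers are wrong.
# A result as or more extreme means the 3rd wrong answer arrives on question 12
# or later, i.e. the first 11 answers contain at most 2 wrong ones.
p_sequential = binom.cdf(2, n=11, p=0.5)
print(f"Stop-after-3-wrong design: p = {p_sequential:.3f}")  # ~0.033
```

Same answers, different stopping rule, different p value, which is exactly the “psychic” behavior described above.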
If you want to test whether an effect is significantly different from zero, you can construct a 95% confidence interval and check whether the interval includes zero. In the process, you get the added bonus of learning how precise your estimate is. If the confidence interval is too wide, you may need to collect more data.
If you can write a result as a confidence interval instead of as a p value, you should.
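As a minimal sketch of what that looks like in practice, here is a 95% confidence interval for a difference in mean cold duration between two groups. The data are simulated purely for illustration, and the simple pooled degrees of freedom are an assumption of the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical cold durations in days (simulated for illustration only).
treated = rng.normal(loc=6.5, scale=2.0, size=40)
control = rng.normal(loc=7.5, scale=2.0, size=40)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
dof = len(treated) + len(control) - 2            # simple approximation
t_crit = stats.t.ppf(0.975, dof)

print(f"Difference in means: {diff:.2f} days")
print(f"95% CI: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}) days")
# An interval that excludes zero corresponds to p < 0.05, but unlike a bare
# p value it also shows how large, and how uncertain, the effect is.
```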
You might think calculations of statistical power are essential for medical trials; a scientist might want to know how many patients are needed to test a new medication, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is actually larger, the study will have greater power.) However, few scientists ever perform this calculation, and few journal articles even mention statistical power. In the prestigious journals Science and Nature, fewer than 3% of articles calculate statistical power before starting their study.
Indeed, many trials conclude that “there was no statistically significant difference in adverse effects between groups,” without noting that there was insufficient data to detect any but the largest differences. If one of these trials was comparing side effects in two drugs, a doctor might erroneously think the medications are equally safe, when one could very well be much more dangerous than the other.
Maybe this is a problem only for rare side effects or only when a medication has a weak effect? Nope. In one sample of studies published in prestigious medical journals between 1975 and 1990, more than four-fifths of randomized controlled trials that reported negative results didn’t collect enough data to detect a 25% difference in primary outcome between treatment groups. That is, even if one medication reduced symptoms by 25% more than another, there was insufficient data to make that conclusion. And nearly two-thirds of the negative trials didn’t have the power to detect a 50% difference.
In neuroscience, the problem is even worse. Each individual neuroscience study collects so little data that the median study has only a 20% chance of being able to detect the effect it’s looking for.
In 1960 Jacob Cohen investigated the statistical power of studies published in the Journal of Abnormal and Social Psychology and discovered that the average study had only a power of 0.48 for detecting medium-sized effects.
So why are power calculations often forgotten? One reason is the discrepancy between our intuitive feeling about sample sizes and the results of power calculations. It’s easy to think, “Surely these are enough test subjects,” even when the study has abysmal power. For example, suppose you’re testing a new heart attack treatment protocol and hope to cut the risk of death in half, from 20% to 10%. You might be inclined to think, “If I don’t see a difference when I try this procedure on 50 patients, clearly the benefit is too small to be useful.” But to have 80% power to detect the effect, you’d actually need 400 patients—200 in each of the control and treatment groups. Perhaps clinicians just don’t realize that their adequate-seeming sample sizes are in fact far too small.
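That 400-patient figure is easy to sanity-check with a rough Monte Carlo simulation (my own sketch, not the book’s calculation): simulate a two-proportion z-test comparing a 20% death rate with a 10% death rate at 25 versus 200 patients per group, and count how often the test reaches p < 0.05.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulated_power(n_per_group, p_control=0.20, p_treatment=0.10, n_sims=20_000):
    """Estimate the power of a two-sided two-proportion z-test by simulation."""
    hits = 0
    for _ in range(n_sims):
        deaths_c = rng.binomial(n_per_group, p_control)
        deaths_t = rng.binomial(n_per_group, p_treatment)
        pooled = (deaths_c + deaths_t) / (2 * n_per_group)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        diff = abs(deaths_c - deaths_t) / n_per_group
        if se > 0 and diff / se > 1.96:          # significant at alpha = 0.05
            hits += 1
    return hits / n_sims

print("Power with  25 per group:", simulated_power(25))    # roughly 0.15-0.20
print("Power with 200 per group:", simulated_power(200))   # roughly 0.8
```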
The perils of insufficient power do not mean that scientists are lying when they state they detected no significant difference between groups. But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was so small it’d be lucky to notice it.
A 2002 study, for example, considered the impact of paved shoulders on the accident rates of traffic on rural roads. Unsurprisingly, a paved shoulder reduced the risk of accident—but there was insufficient data to declare this reduction statistically significant, so the authors stated that the cost of paved shoulders was not justified. They performed no cost-benefit analysis because they treated the insignificant difference as meaning there was no difference at all, despite the fact that they had collected data suggesting that paved shoulders improved safety! The evidence was not strong enough to meet their desired p value threshold. A better analysis would have admitted that while it is plausible that shoulders have no benefit at all, the data is also consistent with them having substantial benefits. That means looking at confidence intervals.
Even if the confidence interval includes zero, its width tells you a lot: a narrow interval covering zero tells you that the effect is most likely small (which may be all you need to know, if a small effect is not practically useful), while a wide interval clearly shows that the measurement was not precise enough to draw conclusions.
Truth inflation arises because small, underpowered studies have widely varying results. Occasionally you’re bound to get lucky and have a statistically significant but wildly overestimated result.
A popular strategy to fight this problem is called shrinkage. For counties with few residents, you can “shrink” the cancer rate estimates toward the national average by taking a weighted average of the county cancer rate with the national average rate. When the county has few residents, you weight the national average strongly; when the county is large, you weight the county strongly.
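Here is a toy sketch of that kind of shrinkage. The counts and the pseudo-count weighting rule are my own illustrative choices, not the book’s exact formula; the point is only that small counties get pulled strongly toward the national rate while large ones barely move.

```python
import numpy as np

# Hypothetical county populations and observed cancer cases (illustrative only).
population = np.array([1_200, 8_000, 45_000, 600_000])
cases = np.array([3, 10, 70, 900])

raw_rate = cases / population
national_rate = cases.sum() / population.sum()

# Weighted average of each county's rate and the national rate. The pseudo-count
# m sets how much weight the national average gets: counties much smaller than m
# end up close to the national rate.
m = 20_000
weight = population / (population + m)
shrunk_rate = weight * raw_rate + (1 - weight) * national_rate

for pop, raw, shrunk in zip(population, raw_rate, shrunk_rate):
    print(f"population {pop:>7,}: raw rate {raw:.5f} -> shrunk rate {shrunk:.5f}")
```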
When you need to measure an effect with precision, rather than simply testing for significance, use assurance instead of power: design your experiment to measure the hypothesized effect to your desired level of precision.
Remember that “statistically insignificant” does not mean “zero.” Even if your result is insignificant, it represents the best available estimate given the data you have collected. “Not significant” does not mean “nonexistent.”
Careful experimental design can break the dependence between measurements. An agricultural field experiment might compare growth rates of different strains of a crop in each field. But if soil or irrigation quality varies from field to field, you won’t be able to separate variations due to crop variety from variations in soil conditions, no matter how many plants you measure in each field. A better design would be to divide each field into small blocks and randomly assign a crop variety to each block. With a large enough selection of blocks, soil variations can’t systematically benefit one crop more than the others.
Here are some options:
- Average the dependent data points. For example, average all the blood pressure measurements taken from a single person and treat the average as a single data point. This isn’t perfect: if you measured some patients more frequently than others, this fact won’t be reflected in the averaged number. To make your results reflect the level of certainty in your measurements, which increases as you take more, you’d perform a weighted analysis, weighting the better-measured patients more strongly.
- Analyze each dependent data point separately. Instead of combining all the patient’s blood pressure measurements, analyze every patient’s blood pressure from, say, just day five, ignoring all other data points. But be careful: if you repeat this for each day of measurements, you’ll have problems with multiple comparisons, which I will discuss in the next chapter.
- Correct for the dependence by adjusting your p values and confidence intervals. Many procedures exist to estimate the size of the dependence between data points and account for it, including clustered standard errors, repeated measures tests, and hierarchical models.
Suppose I’m testing 100 potential cancer medications. Only 10 of these drugs actually work, but I don’t know which; I must perform experiments to find them. In these experiments, I’ll look for p < 0.05 gains over a placebo, demonstrating that the drug has a significant benefit. (The book’s Figure 4-1 illustrates the situation as a grid of squares, one per candidate drug, with the 10 drugs that actually work in the top row.) Because most trials can’t perfectly detect every good medication, I’ll assume my tests have a statistical power of 0.8, though you know that most studies have much lower power. So of the 10 good drugs, I’ll correctly detect around 8 of them. Because my p value threshold is 0.05, I have a 5% chance of falsely concluding that an ineffective drug works. Since 90 of my tested drugs are ineffective, this means I’ll conclude that about 5 of them have significant effects. I perform my experiments and conclude there are 13 “working” drugs: 8 good drugs and 5 false positives. The chance of any given “working” drug being truly effective is therefore 8 in 13—just 62%! In statistical terms, my false discovery rate—the fraction of statistically significant results that are really false positives—is 38%.
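The arithmetic is easy to reproduce; a minimal sketch using the chapter’s assumed numbers (100 drugs, 10 that truly work, power 0.8, a 0.05 significance threshold):

```python
n_drugs = 100
n_effective = 10       # drugs that truly work
power = 0.8            # chance of detecting a truly effective drug
alpha = 0.05           # chance of a false positive for an ineffective drug

true_positives = power * n_effective                 # about 8 real discoveries
false_positives = alpha * (n_drugs - n_effective)    # about 4.5 flukes (the book rounds to 5)

fdr = false_positives / (true_positives + false_positives)
print(f"Expected 'working' drugs: {true_positives + false_positives:.1f}")
print(f"False discovery rate: {fdr:.0%}")            # about 36%, or 38% with the rounded counts
```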
You often see news articles quoting low p values as a sign that error is unlikely: “There’s only a 1 in 10,000 chance this result arose as a statistical fluke, because p = 0.0001.” No! This can’t be true. In the cancer medication example, a p < 0.05 threshold resulted in a 38% chance that any given statistically significant result was a fluke. This misinterpretation is called the base rate fallacy.
So when someone cites a low p value to say their study is probably right, remember that the probability of error is actually almost certainly higher. In areas where most tested hypotheses are false, such as early drug trials (most early drugs don’t make it through trials), it’s likely that most statistically significant results with p < 0.05 are actually flukes.
For the first half of this quote, I wanted to cheer Huff on: yes, statistically significant doesn’t mean that we know the precise figure to two decimal places. (A confidence interval would have been a much more appropriate way to express this figure.) But then Huff claims that the significance level gives 19-to-1 odds that the death rate really is different. That is, he interprets the p value as the probability that the results are a fluke. Not even Huff is safe from the base rate fallacy! We don’t know the odds that “the second group truly does have a higher death rate than the first.” All we know is that if the true mortality ratio were 1, we would rarely see a difference this large by chance alone.
If you send out a 10-page survey asking about nuclear power plant proximity, milk consumption, age, number of male cousins, favorite pizza topping, current sock color, and a few dozen other factors for good measure, you’ll probably find that at least one of those things is correlated with cancer.
If we want to make many comparisons at once but control the overall false positive rate, the p value should be calculated under the assumption that none of the differences is real. If we test 20 different jelly beans, we would not be surprised if one out of the 20 “causes” acne. But when we calculate the p value for a specific flavor, as though each comparison stands on its own, we are calculating the probability that this specific group would be lucky—an unlikely event—not any 1 out of the 20. And so the anomalies we detect appear much more significant than they are.
There are techniques to correct for multiple comparisons. For example, the Bonferroni correction method allows you to calculate p values as you normally would but says that if you make n comparisons in the trial, your criterion for significance should be p < 0.05/n. This lowers the chances of a false positive to what you’d see from making only one comparison at p < 0.05. However, as you can imagine, this reduces statistical power, since you’re demanding much stronger correlations before you conclude they’re statistically significant. In some fields, power has decreased systematically in recent decades because of increased awareness of the multiple comparisons problem.
Tips:
- Remember, p < 0.05 isn’t the same as a 5% chance your result is false.
- If you are testing multiple hypotheses or looking for correlations between many variables, use a procedure such as Bonferroni or Benjamini–Hochberg (or one of their various derivatives and adaptations) to control for the excess of false positives.
- If your field routinely performs multiple tests, such as in neuroimaging, learn the best practices and techniques specifically developed to handle your data.
- Learn to use prior estimates of the base rate to calculate the probability that a given result is a false positive (as in the mammogram example).
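As a sketch of how the Bonferroni and Benjamini–Hochberg corrections mentioned in those tips look in practice, statsmodels exposes both through a single function; the p values below are made up purely for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p values from testing 20 jelly bean flavors (illustrative only).
rng = np.random.default_rng(7)
p_values = rng.uniform(0, 1, size=19).tolist() + [0.021]   # one nominally "significant" flavor

for method in ("bonferroni", "fdr_bh"):   # family-wise vs. false-discovery-rate control
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} of {len(p_values)} comparisons remain significant")
```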
However, a difference in significance does not always make a significant difference. One reason is the arbitrary nature of the p < 0.05 cutoff. We could get two very similar results, with p = 0.04 and p = 0.06, and mistakenly say they’re clearly different from each other simply because they fall on opposite sides of the cutoff. The second reason is that p values are not measures of effect size, so similar p values do not always mean similar effects. Two results with identical statistical significance can nonetheless contradict each other. Instead, think about statistical power. If we compare our new experimental drugs Fixitol and Solvix to a placebo but we don’t have enough test subjects to give us good statistical power, then we may fail to notice their benefits. If they have identical effects but we have only 50% power, then there’s a good chance we’ll say Fixitol has significant benefits and Solvix does not. Run the trial again, and it’s just as likely that Solvix will appear beneficial and Fixitol will not.
This misuse of statistics is not limited to corporate marketing departments, unfortunately. Neuroscientists, for instance, use the incorrect method for comparing groups about half the time. You might also remember news about a 2006 study suggesting that men with multiple older brothers are more likely to be homosexual. How did they reach this conclusion? The authors explained their results by noting that when they ran an analysis of the effect of various factors on homosexuality, only the number of older brothers had a statistically significant effect. The number of older sisters or of nonbiological older brothers (that is, adopted brothers or stepbrothers) had no statistically significant effect. But as we’ve seen, this doesn’t guarantee there’s a significant difference between these different effect groups. In fact, a closer look at the data suggests there was no statistically significant difference between the effect of having older brothers versus older sisters. Unfortunately, not enough data was published in the paper to allow calculation of a p value for the comparison.
But scientists frequently simplify their data to avoid the need for regression analysis. The statement “Overweight people are 50% more likely to have heart disease” has far more obvious clinical implications than “Each additional unit of Metropolitan Relative Weight increases the log-odds of heart disease by 0.009.” Even if it’s possible to build a statistical model that captures every detail of the data, a statistician might choose a simpler analysis over a technically superior one for purely practical reasons.
A major objection to dichotomization is that it throws away information. Instead of using a precise number for every patient or observation, you split observations into groups and throw away the numbers. This reduces the statistical power of your study— a major problem when so many studies are already underpowered. You’ll get less precise estimates of the correlations you’re trying to measure and will often underestimate effect sizes. In general, this loss of power and precision is the same you’d get by throwing away a third of your data.
Consider an example. Say you’re measuring the effect of a number of variables on the quality of health care a person receives. Health-care quality (perhaps measured using a survey) is the outcome variable. For predictor variables, you use two measurements: the subject’s personal net worth in dollars and the length of the subject’s personal yacht. You would expect a good statistical procedure to deduce that wealth impacts quality of health care but yacht size does not. Even though yacht size and wealth tend to increase together, it’s not your yacht that gets you better health care. With enough data, you would notice that people of the same wealth can have differently sized yachts—or no yachts at all—but still get a similar quality of care. This indicates that wealth is the primary factor, not yacht length.

But by dichotomizing the variables, you’ve effectively cut the data down to four points. Each predictor can be only “above the median” or “below the median,” and no further information is recorded. You no longer have the data needed to realize that yacht length has nothing to do with health care. As a result, the ANOVA procedure falsely claims that yachts and health care are related. Worse, this false correlation isn’t statistically significant only 5% of the time—from the ANOVA’s perspective, it’s a true correlation, and it is detected as often as the statistical power of the test allows.
Don’t arbitrarily split continuous variables into discrete groups unless you have good reason. Use a statistical procedure that can take full advantage of the continuous variables.
If you do need to split continuous variables into groups for some reason, don’t choose the groups to maximize your statistical significance. Define the split in advance, use the same split as in previous similar research, or use outside standards (such as a medical definition of obesity or high blood pressure) instead.
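A quick simulation (my own illustration, not from the book) shows the cost of a median split: the same simulated data analyzed with a correlation test on the continuous predictor versus a t-test after splitting at the median.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_trial(n=60, slope=0.3):
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)         # modest true relationship plus noise
    p_continuous = stats.pearsonr(x, y)[1]     # uses the full continuous predictor
    above = y[x > np.median(x)]                # median split throws away information
    below = y[x <= np.median(x)]
    p_split = stats.ttest_ind(above, below)[1]
    return p_continuous < 0.05, p_split < 0.05

results = np.array([one_trial() for _ in range(2000)])
print("Power with continuous predictor:  ", results[:, 0].mean())
print("Power with median-split predictor:", results[:, 1].mean())
```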
How could the regression equation’s accuracy be so high? If you had the panelists rerate the melons, they probably wouldn’t agree with their own ratings with 99.9% accuracy. Subjective ratings aren’t that consistent. No procedure, no matter how sophisticated, could predict them with such accuracy. Something is wrong.
The authors of the study attempted to sidestep this problem by using stepwise regression, a common procedure for selecting which variables are the most important in a regression. In its simplest form, it goes like this: start by using none of the 1,600 frequency measurements. Perform 1,600 hypothesis tests to determine which of the frequencies has the most statistically significant relationship with the outcome. Add that frequency and then repeat with the remaining 1,599. Continue the procedure until there are no statistically significant frequencies.

Stepwise regression is common in many scientific fields, but it’s usually a bad idea. You probably already noticed one problem: multiple comparisons. Hypothetically, by adding only statistically significant variables, you avoid overfitting, but running so many significance tests is bound to produce false positives, so some of the variables you select will be bogus. Stepwise regression procedures provide no guarantees about the overall false positive rate, nor are they guaranteed to select the “best” combination of variables, however you define “best.”
Truth inflation is a more insidious problem. Remember, “statistically insignificant” does not mean “has no effect whatsoever.” If your study is underpowered—you have too many variables to choose from and too little data—then you may not have enough data to reliably distinguish each variable’s effect from zero. You’ll include variables only if you are unlucky enough to overestimate their effect on the outcome. Your model will be heavily biased. (Even when not using a formal stepwise regression procedure, it’s common practice to throw out “insignificant” variables to simplify a model, leading to the same problem.)
There are several variations of stepwise regression. The version I just described is called forward selection since it starts from scratch and starts including variables. The alternative, backward elimination, starts by including all 1,600 variables and excludes those that are statistically insignificant, one at a time. (This would fail, in this case: with 1,600 variables but only 43 melons, there isn’t enough data to uniquely determine the effects of all 1,600 variables. You would get stuck on the first step.) It’s also possible to change the criteria used to include new variables; instead of statistical significance, more modern procedures use metrics like the Akaike information criterion and the Bayesian information criterion, which reduce overfitting by penalizing models with more variables. Other variations add and remove variables at each step according to various criteria. None of these variations is guaranteed to arrive at the same answer, so two analyses of the same data could arrive at very different results.
How can a regression model be fairly evaluated, avoiding these problems? One option is cross-validation: fit the model using only a portion of the melons and then test its effectiveness at predicting the ripeness of the other melons. If the model overfits, it will perform poorly during cross-validation. One common method is leave-one-out cross-validation, where the model is fit using all but one data point and then evaluated on its ability to predict that point; the procedure is repeated with each data point left out in turn. The watermelon study claims to have performed leave-one-out cross-validation but obtained similarly implausible results. Without access to the data, I’m not sure whether the method genuinely works.
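Here is a small sketch of why cross-validation catches overfitting. It is not the watermelon data, just simulated noise: with more candidate predictors than observations, an ordinary least-squares fit looks perfect in-sample, while leave-one-out predictions are poor.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 40, 100             # more predictors than data points
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)              # pure noise: there is nothing to predict

def fit_predict(X_train, y_train, X_test):
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

# In-sample error: essentially zero, because the model can memorize the noise.
in_sample_mse = np.mean((y - fit_predict(X, y, X)) ** 2)

# Leave-one-out cross-validation: fit on all but one point, predict that point.
loo_errors = []
for i in range(n_samples):
    mask = np.arange(n_samples) != i
    prediction = fit_predict(X[mask], y[mask], X[i:i + 1])
    loo_errors.append((y[i] - prediction[0]) ** 2)

print("In-sample mean squared error:     ", round(in_sample_mse, 4))
print("Leave-one-out mean squared error: ", round(float(np.mean(loo_errors)), 4))
```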
One example of this problem occurred in a 2010 trial testing whether omega-3 fatty acids, found in fish oil and commonly sold as a health supplement, can reduce the risk of heart attacks. The claim that omega-3 fatty acids reduce heart attack risk was supported by several observational studies, along with some experimental data. Fatty acids have anti-inflammatory properties and can reduce the level of triglycerides in the bloodstream—two qualities known to correlate with reduced heart attack risk. So it was reasoned that omega-3 fatty acids should reduce heart attack risk. But the evidence was observational. Patients with low triglyceride levels had fewer heart problems, and fish oils reduce triglyceride levels, so it was spuriously concluded that fish oil should protect against heart problems. Only in 2013 was a large randomized controlled trial published, in which patients were given either fish oil or a placebo (olive oil) and monitored for five years. There was no evidence of a beneficial effect of fish oil.
Another problem arises when you control for multiple confounding factors. It’s common to interpret the results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by…” Perhaps that is true, but it may not be possible to hold all other variables constant in practice. You can always quote the numbers from the regression equation, but in reality the act of gaining a pound of weight also involves other changes. Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.
Simpson’s paradox arises whenever an apparent trend in data, caused by a confounding variable, can be eliminated or reversed by splitting the data into natural groups.
For a nonmedical example, if you compare flight delays between United Airlines and Continental Airlines, you’ll find United has more flights delayed on average. But at each individual airport in the comparison, Continental’s flights are more likely to be delayed. It turns out United operates more flights out of cities with poor weather. Its average is dragged down by the airports with the most delays. But you can’t randomly assign airline flights to United or Continental. You can’t always eliminate every confounding factor. You can only measure them and hope you’ve measured them all.
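A toy version of the airline example makes the reversal visible. The counts below are invented for illustration, not the real delay data: the airline that is worse at every airport still looks better overall because it flies mostly out of the good-weather city.

```python
# Hypothetical (made-up) delay counts at two airports with very different weather:
# each entry is (delayed flights, total flights).
flights = {
    "United":      {"Phoenix": (30, 600),  "Seattle": (350, 1000)},
    "Continental": {"Phoenix": (80, 1400), "Seattle": (70, 180)},
}

for airline, airports in flights.items():
    for airport, (delayed, total) in airports.items():
        print(f"{airline:11s} {airport}: {delayed / total:.1%} delayed")
    delayed_all = sum(d for d, _ in airports.values())
    total_all = sum(t for _, t in airports.values())
    print(f"{airline:11s} overall: {delayed_all / total_all:.1%} delayed\n")
```

With these made-up numbers, Continental has the higher delay rate at each airport, yet United’s overall rate is worse because most of its flights leave from the bad-weather city.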
Remember that a statistically insignificant variable does not necessarily have zero effect; you may not have the power needed to detect its effect.
Avoid stepwise regression when possible. Sometimes it’s useful, but the final model is biased and difficult to interpret. Other selection techniques, such as the lasso, may be more appropriate. Or there may be no need to do variable selection at all.
To test how well your model fits the data, use a separate dataset or a procedure such as cross- validation.
Watch out for confounding variables that could cause misleading or reversed results, as in Simpson’s paradox, and use random assignment to eliminate them whenever possible.
Even reasonable practices, such as remeasuring patients with strange laboratory test results or removing clearly abnormal patients, can bring a statistically insignificant result to significance. Apparently, being free to analyze how you want gives you enormous control over your results!
A group of researchers demonstrated this phenomenon with a simple experiment. Twenty undergraduates were randomly assigned to listen to either “When I’m Sixty-Four” by the Beatles or “Kalimba,” a song that comes with the Windows 7 operating system. Afterward, they were asked their age and their father’s age. The two groups were compared, and it was found that “When I’m Sixty-Four” listeners were a year and a half younger on average, controlling for their father’s age, with p < 0.05. Since the groups were randomly assigned, the only plausible source of the difference was the music.

Rather than publishing The Musical Guide to Staying Young, the researchers explained the tricks they used to obtain this result. They didn’t decide in advance how much data to collect; instead, they recruited students and ran statistical tests periodically to see whether a significant result had been achieved. (You saw earlier that such stopping rules can inflate false-positive rates significantly.) They also didn’t decide in advance to control for the age of the subjects’ fathers, instead asking how old they felt, how much they would enjoy eating at a diner, the square root of 100, their mother’s age, their agreement with “computers are complicated machines,” whether they would take advantage of an early-bird special, their political orientation, which of four Canadian quarterbacks they believed won an award, how often they refer to the past as “the good old days,” and their gender. Only after looking at the data did the researchers decide on which outcome variable to use and which variables to control for. (Had the results been different, they might have reported that “When I’m Sixty-Four” causes students to, say, be less able to calculate the square root of 100, controlling for their knowledge of Canadian football.)

Naturally, this freedom allowed the researchers to make multiple comparisons and inflated their false-positive rate. In a published paper, they wouldn’t need to mention the other insignificant variables; they’d be free to discuss the apparent antiaging benefit of the Beatles. The fallacy would not be visible to the reader.
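The stopping-rule trick in particular is easy to demonstrate with a simulation (my own sketch): compare two groups with no true difference at all, peek at the p value after every batch of subjects, and stop as soon as it dips below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

def peeking_experiment(max_n=100, batch=10):
    """Add subjects in batches and stop early if p < 0.05. No true effect exists."""
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(size=batch))
        b.extend(rng.normal(size=batch))
        if stats.ttest_ind(a, b).pvalue < 0.05:
            return True                      # "significant" result found by peeking
    return False

false_positive_rate = np.mean([peeking_experiment() for _ in range(2000)])
print(f"False positive rate with peeking: {false_positive_rate:.1%}")   # well above 5%
```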
The result is an explosion of diversity in study design, with every new paper using a different combination of methods. When there is intense pressure to produce novel results, as there usually is in the United States, researchers in these fields tend to produce biased and extreme results more frequently because of their freedom in experimental design and data analysis.
For some fields, prepublication replication can solve this problem: collect a new, independent dataset and analyze it using exactly the same methods. If the effect remains, you can be confident in your results. (Be sure your new sample has adequate statistical power.) But for economists studying a market crash, it’s not possible (or at least not ethical) to arrange for another one. For a doctor studying a cancer treatment, patients may not be able to wait for replication.
With preregistered analyses, blinding, and further research into experimental methods, we can start to treat our data more humanely.
Before collecting data, plan your data analysis, accounting for multiple comparisons and including any effects you’d like to look for. Register your clinical trial protocol if applicable.
If you deviate from your planned protocol, note this in your paper and provide an explanation. Don’t just torture the data until it confesses. Have a specific statistical hypothesis in mind before you begin your analysis.
Even the prestigious journal Nature isn’t perfect, with roughly 38% of papers making typos and calculation errors in their p values.
Does the trend hold true for less speculative kinds of medical research? Apparently so. Of the top-cited research articles in medicine, a quarter have gone untested after their publication, and a third have been found to be exaggerated or wrong by later research. That’s not as extreme as the Amgen result, but it makes you wonder what major errors still lurk unnoticed in important research. Replication is not as prevalent as we would like it to be, and the results are not always favorable.
In 2005, Jelte Wicherts and colleagues at the University of Amsterdam decided to analyze every recent article in several prominent journals of the American Psychological Association (APA) to learn about their statistical methods. They chose the APA partly because it requires authors to agree to share their data with other psychologists seeking to verify their claims. But six months later, they had received data for only 64 of the 249 studies they sought it for. Almost three-quarters of authors never sent their data.
In 2007, researchers from the Nordic Cochrane Center sought data from the EMA about two weight- loss drugs. They were conducting a systematic review of the effectiveness of the drugs and knew that the EMA, as the authority in charge of allowing drugs onto the European market, would have trial data submitted by the manufacturers that was perhaps not yet published publicly. But the EMA refused to disclose the data on the grounds that it might “unreasonably undermine or prejudice the commercial interests of individuals or companies” by revealing their trial design methods and commercial plans. They rejected the claim that withholding the data could harm patients. After three and a half years of bureaucratic wrangling and after reviewing each study report and finding no secret commercial information, the European Ombudsman finally ordered the EMA to release the documents. In the meantime, one of the drugs had been taken off the market because of side effects including serious psychiatric problems.
Once a dataset is no longer in use by its creators, they have no incentive to maintain a carefully organized personal archive of datasets, particularly when data has to be reconstructed from floppy disks and filing cabinets. One study of 516 articles published between 1991 and 2011 found that the probability of data being available decayed over time. For papers more than 20 years old, fewer than half of datasets were available.
Of course, underreporting is not unique to medicine. Two-thirds of academic psychologists admit to sometimes omitting some outcome variables in their papers, creating outcome reporting bias.
In biological and biomedical research, the problem often isn’t reporting of patient enrollment or power calculations. The problem is the many chemicals, genetically modified organisms, specially bred cell lines, and antibodies used in experiments. Results can be strongly dependent on these factors, but many journals do not have reporting guidelines for these factors, and the majority of chemicals and cells referred to in biomedical papers are not uniquely identifiable, even in journals with strict reporting requirements.
A similar study looked at reboxetine, an antidepressant sold by Pfizer. Several published studies suggested it was effective compared to a placebo, leading several European countries to approve it for prescription to depressed patients. The German Institute for Quality and Efficiency in Health Care, responsible for assessing medical treatments, managed to get unpublished trial data from Pfizer— three times more data than had ever been published— and carefully analyzed it. The result: reboxetine is not effective. Pfizer had convinced the public that it was effective only by neglecting to mention the studies showing it wasn’t.
A similar review of 12 other antidepressants found that of studies submitted to the United States Food and Drug Administration during the approval process, the vast majority of negative results were never published or, less frequently, were published to emphasize secondary outcomes. (For example, if a study measured both depression symptoms and side effects, the insignificant effect on depression might be downplayed in favor of significantly reduced side effects.) While the negative results are available to the FDA to make safety and efficacy determinations, they are not available to clinicians and academics trying to decide how to treat their patients. This problem is commonly known as publication bias, or the file drawer problem. Many studies sit in a file drawer for years, never published, despite the valuable data they could contribute. Or, in many cases, studies are published but omit the boring results. If they measured multiple outcomes, such as side effects, they might simply say an effect was “insignificant” without giving any numbers, omit mention of the effect entirely, or quote effect sizes but no error bars, giving no information about the strength of the evidence.
It is possible to test for publication and outcome reporting bias. If a series of studies have been conducted on a subject and a systematic review has estimated an effect size from the published data, you can easily calculate the power of each individual study in the review. Suppose, for example, that the effect size is 0.8 (on some arbitrary scale), but the review was composed of many small studies that each had a power of 0.2. You would expect only 20% of the studies to be able to detect the effect—but you may find that 90% or more of the published studies found it because the rest were tossed in the bin.
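A sketch of that check with hypothetical numbers: if each of 20 small studies has only 20% power, seeing 18 of them report a significant result would be wildly improbable, which points toward selective publication or reporting.

```python
from scipy.stats import binom

n_studies = 20
power_per_study = 0.2       # power each small study has to detect the pooled effect
observed_significant = 18   # hypothetical count reported in the literature

expected = n_studies * power_per_study
p_excess = binom.sf(observed_significant - 1, n_studies, power_per_study)

print(f"Expected number of significant studies: {expected:.0f}")
print(f"P(at least {observed_significant} significant | power {power_per_study}): {p_excess:.2g}")
```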
Physicists, however, have done a great deal of research on a similar problem: teaching introductory physics students the basic concepts of force, energy, and kinematics. An instructive example is a large- scale survey of 14 physics courses, including 2,084 students, using the Force Concept Inventory to measure student understanding of basic physics concepts before and after taking the courses. The students began the courses with holes in their knowledge; at the end of the semester, they had filled only 23% of those holes, despite the Force Concept Inventory being regarded as too easy by their instructors.
Premier journals need to lead the charge. Nature has begun to do so, announcing a new checklist that authors are required to complete before articles can be published. The checklist requires reporting of sample sizes, statistical power calculations, clinical trial registration numbers, a completed CONSORT checklist, adjustment for multiple comparisons, and sharing of data and source code. The guidelines address most issues covered in this book, except for stopping rules, preferential use of confidence intervals over p values, and discussion of reasons for departing from the trial’s registered protocol.
Perhaps Schekman, shielded by his Nobel, can make the point the rest of us are afraid to make: the frenzied quest for more and more publications, with clear statistical significance and broad applications, harms science. We fixate on statistical significance and do anything to achieve it, even when we don’t understand the statistics. We push out numerous small and underpowered studies, padding our résumés, instead of taking the time and money to conduct larger, more definitive ones.
One proposed alternative to the tyranny of prestigious journals is the use of article-level metrics. Instead of judging an article on the prestige of the journal it’s published in, judge it on rough measures of its own impact. Online-only journals can easily measure the number of views of an article, the number of citations it has received in other articles, and even how often it is discussed on Twitter or Facebook. This is an improvement over using impact factors, which are a journal-wide average number of citations received by all research articles published in a given year—a self-reinforcing metric since articles from prestigious journals are cited more frequently simply because of their prestige and visibility.
Look for important details in a statistical analysis, such as the following:
- The statistical power of the study, or any other means by which the appropriate sample size was determined
- How variables were selected or discarded for analysis
- Whether the statistical results presented support the paper’s conclusions
- Effect-size estimates and confidence intervals accompanying significance tests, showing whether the results have practical importance
- Whether appropriate statistical tests were used and, if necessary, how they were corrected for multiple comparisons
- Details of any stopping rules
If you work in a field for which a set of reporting guidelines has been developed (such as the CONSORT checklist for medical trials), familiarize yourself with it and read papers with it in mind. If a paper omits some of the required items, ask yourself what impact that has on its conclusions and whether you can be sure of its results without knowing the missing details.