As we said in the last article, medical science has made great strides, and when we look at problems we should not lose sight of that fact. I for one have no wish to return to the days when the leading cause of death for women was childbirth, and a man was considered old if he reached the age of 40. However, it is worth taking the time to develop a better idea of what constitutes quality in research and how results should be interpreted.
No one cares about negative results
Well, they should care. To modify our previous example a little, let’s consider whether it is *hot* breakfast that improves student performance and raises test scores. Our control group would then be students who only get a cold breakfast (e.g. juice and cereal with milk). We do a study, and it turns out there is no statistically significant evidence that hot breakfasts produce better performance than cold breakfasts. We could then better target our assistance to children, and maybe save some money. But we run into the problem of publication bias, which is that journals tend to prefer to publish studies that have positive results.
This is very pronounced in studies of drug efficacy. Pharmaceutical companies in the United States have to do careful studies for any new drug to establish a) the safety of the drug; and b) the efficacy of the drug, and other countries have similar rules. So what happens if the company does extensive testing and cannot find proof that the drug does anything? The study goes onto a dusty shelf somewhere and is never heard from again. This matters for a couple of reasons. First, data is data, and even negative results have utility. Even more important, studies that by themselves do not show positive results can sometimes be combined with other studies to produce a positive result. Combining studies gives you a larger sample size, which improves the statistical power of the combined analysis.
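To see why pooling helps, here is a minimal Monte Carlo sketch (the effect size, sample sizes, and function name are illustrative assumptions, not taken from any real study): two groups differ by a small true effect, and we estimate how often a simple z-test detects it at p < 0.05 for one small study versus the pooled sample of four such studies.

```python
# Illustrative sketch: pooling studies raises statistical power.
# We simulate many trials of a small true effect (0.2 standard deviations)
# and count how often a two-sample z-test reaches p < 0.05.
import random
from statistics import NormalDist

def power(n_per_group, effect=0.2, trials=1000, seed=1):
    rng = random.Random(seed)
    nd = NormalDist()
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1) for _ in range(n_per_group)]
        # z-test for the difference in means, assuming known unit variance
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        z = diff / (2 / n_per_group) ** 0.5
        p = 2 * (1 - nd.cdf(abs(z)))
        hits += p < 0.05
    return hits / trials

print(power(50))    # one small study: detects the effect well under half the time
print(power(200))   # four such studies pooled: power is much higher
```

A single study of 50 per group misses this effect most of the time; pooling to 200 per group roughly triples the power in this simulation, which is why combining individually negative studies in a meta-analysis can still yield a positive result.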
No one cares about replication
Remember in the last article when we said that about one out of twenty studies will reach the wrong conclusion? The best defense against that is to have other scientists replicate the study and see if they get the same results. And this is a huge problem because quality journals generally are not interested in publishing replication studies, and in addition tenure committees at universities do not consider replication studies as valuable research. Everyone wants something original. But many original studies have problems that could be brought out by replication studies.
And where replication has been attempted, we have found that a large number of studies cannot be replicated, a problem that is particularly acute in psychology. This has become known as the Replication Crisis. As an example, a 2018 paper in Nature Human Behaviour looked at 21 social science studies and found that only 13 could be successfully replicated.
Professor John Ioannidis of Stanford University has been a leading voice on the replication problem. He was an adviser to the Reproducibility Project: Cancer Biology, which looked at five influential studies in cancer research and found that two were confirmed, two were uncertain, and one was disproved. In the vast majority of cases this does not derive from fraud. It can derive from small differences in the research approach that bring out hitherto unknown factors of importance. And it matters because an influential study drives further research, and if it is wrong you end up wasting time and resources. Pharmaceutical companies have found that they cannot replicate the basic research underlying some of their proposed drugs, which causes waste and diverts resources that could be used more productively.
There is a lot of controversy over replication, because of the implication that if a scientist’s results cannot be replicated, the scientist was, at worst, fraudulent, or at best, sloppy and lacking in care. While there are examples of both (see Andrew Wakefield or Hwang Woo-suk), the best evidence is that this is rare, and that most replication problems fundamentally come down to this work being really hard to do in the first place. A researcher whose study cannot be replicated may have done nothing particularly wrong, but can see their career damaged all the same (see Yoshiki Sasai).
Again, look at this in the context of what is, overall, a remarkable record of success in medical research. Replication is a problem, but one we can deal with if we make some effort. For example, we could devote a percentage of our research budget to replication studies, and reward people who do good work in this area.
As we discussed in the previous article, statistical significance of a result is based on the p-value, which is the probability of getting a result at least as extreme as the one observed if only random chance were at work, i.e. if there were no real relationship. This is important because people are not in general good at dealing with probability and statistics, and a formal test can provide a valuable check on confirmation bias (which is, basically, seeing what you want to see). And we noted how this should happen. A good researcher should come up with an hypothesis, then collect the data, and then run a test to see if there is a statistically significant result. That is the ideal. But as we saw above, if you find nothing, you have a problem. Studies that fail to find anything do not generally get published. Researchers who do not get published do not get tenure, and will probably have trouble getting grants to do any more research. So there is tremendous pressure to get a result that is significant and can be published. Your whole career can rest on it.
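The "one in twenty" figure from the last article follows directly from the p < 0.05 threshold, and a short simulation makes it concrete (a hypothetical sketch; the sample size and number of studies are arbitrary choices): we test 1,000 "studies" in which there is, by construction, no real effect at all, and about 5% of them still come out "significant".

```python
# Illustrative sketch: with no real effect anywhere, a p < 0.05 threshold
# still flags roughly 1 study in 20 as "significant".
import random
from statistics import NormalDist

rng = random.Random(42)
nd = NormalDist()
n, studies, false_positives = 30, 1000, 0
for _ in range(studies):
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]   # same distribution: no effect
    z = (sum(b) / n - sum(a) / n) / (2 / n) ** 0.5
    if 2 * (1 - nd.cdf(abs(z))) < 0.05:
        false_positives += 1
print(false_positives / studies)   # hovers around 0.05
```

That 5% is the price of admission for any single study, which is exactly why one unreplicated result should never be treated as definitive.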
One result of that pressure is a practice known as p-hacking, which in general means looking for ways to get a significant result that is publishable by any means necessary.
Colloquially it has been referred to as “Torturing the data until it confesses.” One variation is to skip the initial hypothesis and just go looking for any relationship you can find in the data. You might think that no harm is done, since if the researcher finds a relationship, that is a good thing. But remember, the idea is to find true relationships. If you grab enough variables and run them against each other you will definitely find correlations, but they will generally be nothing more than coincidence. The whole point of stating an hypothesis in advance of collecting the data is that your hypothesis should be grounded in a legitimate theory of what is going on. For example, someone once did a study showing that you could predict the winner of the US Presidential election by which team won the previous Super Bowl. There is no way that this correlation means anything other than random chance (helped by a small sample size).
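The fishing expedition is easy to demonstrate. In this hypothetical sketch (every variable is pure random noise; the sample size and cutoff are illustrative), we generate 20 unrelated variables with 20 observations each and count how many of the 190 possible pairs correlate "significantly" by coincidence alone; at n = 20, a correlation of roughly |r| > 0.44 corresponds to p < 0.05.

```python
# Illustrative sketch of "fishing" in pure noise: with enough unrelated
# variables, some pairs will correlate strongly by sheer coincidence.
import random
from itertools import combinations

def pearson(x, y):
    # plain Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n_obs = 20   # a small sample, like the Super Bowl example
variables = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(20)]
strong = [(i, j) for (i, x), (j, y) in combinations(enumerate(variables), 2)
          if abs(pearson(x, y)) > 0.45]   # |r| > 0.45 is roughly p < 0.05 at n = 20
print(len(strong), "of", 20 * 19 // 2, "pairs look 'significant' by chance")
```

Several pairs typically clear the bar, and none of them mean anything; this is exactly the Super Bowl situation, manufactured on demand.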
Another approach is to drop some parameters of the original study to get a better p-value. Or you might find an excuse to eliminate some of the data until you get a “result”. But if the p-value is the gate you have to get through to get published and have a career, some people will look for a reason to go through that gate.
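Here is a hypothetical sketch of the data-dropping version (the test, the seed, and the "justification" are invented for illustration): we start with 40 observations of pure noise, which will usually show no effect, then keep discarding the least favorable point until the p-value slips under 0.05.

```python
# Illustrative sketch: "torturing the data" by dropping inconvenient points.
import random
from statistics import NormalDist, mean

rng = random.Random(7)
nd = NormalDist()

def p_value(xs):
    # one-sided z-test of "mean > 0", assuming known unit variance
    z = mean(xs) * len(xs) ** 0.5
    return 1 - nd.cdf(z)

data = [rng.gauss(0, 1) for _ in range(40)]   # pure noise: no real effect
dropped = 0
while len(data) > 1 and p_value(data) >= 0.05:
    data.remove(min(data))   # invent a "reason" to exclude the lowest point
    dropped += 1
print(f"significant after dropping {dropped} of 40 points")
```

Each individual exclusion can be made to look defensible ("equipment error", "obvious outlier"), but the procedure as a whole all but guarantees a "significant" result from nothing.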
So the point of this analysis is that you have to be careful about uncritically accepting any result you come across. Reports on television news or in newspapers deserve skepticism. And never accept any one study as definitive. You need to see repeated studies that validate the result before you can start to accept it. Of course most “civilians” are going to have trouble doing all of that, which sets us up for the next article where we look at how health care providers can help with this.
And above all else, if Gwyneth Paltrow suggests something, do the opposite. She may be the first recorded case of an IQ measured in negative numbers.
Listen to the audio version of this post on Hacker Public Radio!