The Hierarchy of Evidence

As we saw previously, the idea of Evidence-Based Medicine is that we formulate medical treatments on the basis of the best evidence from quality research studies. Breathless posts on social media about “The Miracle Breakthrough Your Doctor Doesn’t Want You To Know About” are of course absolutely worthless, as are pretty much any of the blatherings of celebrities like Gwyneth Paltrow. And we said it was hard to beat double-blind Randomized Controlled Trials. But this is slightly more complicated when we consider how feasible it is to run these trials. As in our example of breakfasts and test scores, no decent person is going to deprive a group of children of breakfast in a randomized controlled trial, even if that would get us high-quality data. Some things are just not done.

There is an approach favored by advocates of Evidence-Based Medicine called the Hierarchy of Evidence that ranks the quality of data by how the evidence was obtained. The idea is that you would rely on such evidence only as much as the data deserve based on how the study was done. If the data is low quality, you place a low value on it. It still may be something, but you would of course reject it if a better study came along with a different conclusion. So, how does this hierarchy rank the kinds of studies?

Now, we need to acknowledge here that this approach is not 100% supported by everyone in medicine, but we should also understand that it is vastly superior to doing “research” on Facebook. In a world where people are not vaccinating their children, promoting fad diets, and shoving odd things into various orifices, this approach to validating evidence is a huge improvement.

Systematic reviews and meta-analyses of “RCTs with definitive results”

Systematic reviews gather the whole body of literature on a topic to see where there is agreement. By definition, you have to have multiple studies done before you can even think of a systematic review, but when you have them and they all point in the same direction, that is very powerful. And if these studies are themselves Randomized Controlled Trials, the evidence becomes extremely persuasive.

When you have multiple studies pointing in the same direction, that is powerful because it addresses one of the biggest problems, that of replicating results. Now, it also matters that the individual studies being combined into the meta-analysis are themselves of high quality, and if they are based on randomized controlled trials, that creates a strong presumption of quality.
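To make the idea of combining studies a little more concrete, here is a minimal sketch in Python of one common way to pool results, inverse-variance (fixed-effect) weighting. This is an illustration only, not the method any particular review uses. The function name `pool_fixed_effect` and all of the numbers are hypothetical, and the sketch assumes every study measures the same underlying effect.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Pool study estimates using inverse-variance (fixed-effect) weights.

    Each study is weighted by 1 / standard_error**2, so more precise
    studies count for more. Returns (pooled_estimate, pooled_std_error).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical effect sizes from three made-up trials (illustration only).
effects = [-4.0, -6.0, -5.0]
std_errs = [2.0, 2.0, 2.0]

pooled, pooled_se = pool_fixed_effect(effects, std_errs)
print(f"pooled effect: {pooled:.2f}, standard error: {pooled_se:.2f}")
```

The point to notice is that the pooled standard error comes out smaller than that of any single study, which is the arithmetic behind the claim that several studies in agreement are more persuasive than one.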

Randomized Controlled Trials

Here instead of a meta-analysis of multiple studies, we are looking directly at individual studies. A single result is never as persuasive as a combination of multiple results that are in agreement, but it can be a good indication if done properly. But we should also understand that not all Randomized Controlled Trials are equal in weight and reliability. Here are some questions we can ask when assessing RCTs:

Questions to consider when assessing an RCT

  • Did the study ask a clearly focused question? A good study will be designed to address a specific question, and focused on that question. If you did a study of heart disease, and along the way noticed a result affecting the kidneys, that is not focused. It may suggest something worth looking at, but the appropriate response would then be to design a study to look at the kidney problems.
  • Was the study an RCT and was it appropriately so? While RCTs are the gold standard, they are not always the most appropriate way to study something, as mentioned above.
  • Were participants appropriately allocated to intervention and control groups? This is a question of randomization. The mathematics of probability require that every study member has an equal probability of being assigned to the control or the study group. But there can sometimes be reasons to use things like stratification. This helps when you need to ensure, for example, that both men and women are properly represented in a study that is meant to apply to both sexes.
  • Were participants, staff, and study personnel blind to participants’ study groups? This is the double-blind requirement we have discussed previously, and it is important to ensure that no one has a biased view of the outcomes. The participants do not know whether they are in the study group or the control group, and neither do the people doing the study.
  • Were all the participants who entered the trial accounted for at its conclusion? One thing you need to guard against is dropping inconvenient data. If you started with 100 people in your study, but only report results for 90, what happened to the other 10 people? There can be legitimate reasons that people drop out, or are dropped, but you need to account for it so we know that you are not trying to bias the results by getting rid of data points that might contradict your conclusions.
  • Were participants in all groups followed up and data collected in the same way?
  • Did the study have enough participants to minimise the play of chance? Sample size matters in statistics! To state the obvious, a study of two people is nothing more than an anecdote. It may be right, or it may be wrong, but you should never rely on it. On the other hand, a study with 1,000 people has a much higher probability of being right.
  • How are the results presented and what are the main results?
  • How precise are the results? How big is the treatment effect, and how does that compare to the margin of error? If your study showed a decline of 3 points in cholesterol, with a margin of error of plus or minus 20 points, it is not very precise. There may be an effect, but you wouldn’t place a lot of trust in it.
  • Were all important outcomes considered and can the results be applied to your local population? If you are a pediatrician, and the study was entirely made up of adults, the results might be valid, but do they really apply to your population? And if you are looking at the study as applying to you, a similar question comes up. Was the study entirely of men, and you are a woman? Was it all people in a different racial group? (And yes, that can matter in some cases.)
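The questions about sample size and precision in the list above can be illustrated with a little arithmetic. What follows is a rough Python sketch, assuming a simple normal approximation for a 95% margin of error on a single group’s average. A real trial compares two groups and would use a more careful calculation. The function names and the standard deviation of 30 points are hypothetical, chosen only for illustration.

```python
import math

Z_95 = 1.96  # multiplier for a 95% interval under a normal approximation

def margin_of_error(std_dev, n):
    """Approximate 95% margin of error for the mean of n measurements."""
    return Z_95 * std_dev / math.sqrt(n)

def effect_is_distinguishable(effect, margin):
    """True when the interval (effect - margin, effect + margin) excludes zero."""
    return abs(effect) > margin

# The cholesterol example from the list: a 3 point decline, plus or minus 20.
print(effect_is_distinguishable(-3, 20))

# How the margin shrinks as the study grows, assuming a hypothetical
# standard deviation of 30 points in the measurement.
for n in (2, 100, 1000):
    print(n, round(margin_of_error(30, n), 2))
```

Under these assumptions, a 3 point decline is lost in the noise when the margin is 20 points, while the margin for 1,000 participants is small enough that the same decline would stand out. That is the sense in which a larger study gives chance less room to play.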

Cohort Studies

These are studies that follow a group of similar people (a “cohort”) over time, and can be useful in epidemiology studies. Because they are observational, there is no randomly assigned “control group” involved, which is why these rank below RCTs in the hierarchy. One of the classic cohort studies is the Framingham Heart Study. It studies the residents of Framingham, Massachusetts, USA, and since its beginning in 1948 it has now moved to the third generation of participants. Much of our current knowledge about hypertension and heart disease comes out of this massive cohort study. But it also has been criticized for over-estimating some of the risks, and there are questions about how well its results apply to other populations.

Case-control studies

These studies attempt to match people with a particular condition with other similar people who do not have that condition. These may appear to be superficially similar to RCTs, but are different in very important ways. These are observational studies and the people doing the study are not in any way blind, nor are the participants. And there is no scope for randomization because each participant was deliberately selected for the study.

Cross sectional surveys

These are also observational, but in this case they look at a population of some kind at a specific instant of time. So if Case-Control studies are the less useful version of RCTs, you could similarly say that Cross sectional surveys are the lesser version of the Cohort Studies. Generally, these studies are done using general data that is routinely collected, and because that data is routinely collected they are inexpensive to do. But this also means that the data was not collected to answer the specific question you may have.

Case reports

These are reports about specific individual cases. They may provide a clue, but you don’t have a sample, a control, etc. Good examples are the cases that Sigmund Freud reported. And when you understand how little validity Freud’s results enjoy today, you see the weakness in this approach. There is a reason it is at the bottom.


So the hierarchy, from Best to Worst, looks like this:

  1. Systematic reviews and meta-analyses of “RCTs with definitive results”.
  2. RCTs
  3. Cohort studies
  4. Case-control studies
  5. Cross sectional surveys
  6. Case reports

So you should place the most trust in Systematic Reviews and Meta-analyses, and the least trust in Case reports.

Listen to the audio version of this post on Hacker Public Radio!
