A Discussion of Statistical Power and Sample Size
Plaquenil (hydroxychloroquine) has become quite popular over the past few months. An anti-malarial drug that physicians now use for a bevy of rheumatologic diseases sure has caused quite a stir. Believe it or not, if you surveyed a hundred physicians, you’d be lucky to find one who actually understands how it treats auto-immune disease. If you surveyed 1000 physicians here in the US, not one of them could tell you how it disposes of malaria. Unfortunately, these days it does not take much for large groups of people to develop strong convictions about something they have little understanding of. Poor Plaquenil is just caught in the middle of this. Let’s all be thankful it’s not a colon-cleanse.
Azithromycin is in a similar situation. It’s an antibiotic, which means it disrupts and kills bacteria. Not viruses. To be clear, COVID-19 is caused by a virus. Bacteria are actually living organisms. Viruses are not. Viruses are genetic material wrapped inside a fancy capsule. Bacteria have to eat, defend themselves against their surroundings, replicate, excrete waste, and produce their own machinery to do all of this. This gives microbiologists and pharmacologists plenty of targets to attack with medications. Viruses hack your own cells to do the dirty work a bacterium does on its own, leaving few targets to attack. So, how does azithromycin work? Since I took pharmacology 10 years ago, I had to Wikipedia this. It binds to the 50S ribosomal subunit, which inhibits protein synthesis within a bacterium. That was a mouthful. You may be asking, does the COVID-19 virus have a 50S ribosomal subunit to make proteins? That answer would be no. So, how does it kill COVID-19? My guess is with the phrase…Bippity boppity boo!
(There is so much to say about azithromycin and its use as an effective placebo here in the US. Maybe I’ll write a book someday. Spoiler: In the end you recovered from your stuffy nose on your own.)
So, where did all of this hoopla with Plaquenil start? It actually started back in 2005, after the first SARS outbreak. During that time China was desperately looking for a treatment for SARS. Fortunately, globalization was less developed, and the outbreak originated in a smaller Chinese city. Lockdown measures and contact tracing helped extinguish the outbreak. But fear remained that it would return. After the threat was extinguished, a study was published that showed chloroquine (similar to hydroxychloroquine) counteracted the virus in a test-tube.(1) And that was the end of it for roughly the next 15 years. As COVID-19 spread, a physician in southern France picked up the trail and began treating patients with Plaquenil. He treated 34 patients with success.(2) Wow! 34! He reported his results and they took off like a rocket ship before the study was even published. That’s it. Since then, tens of thousands to hundreds of thousands of patients have been treated with Plaquenil and the social/political circus is stirring stronger than ever.
One silver lining is that it has produced a couple of randomized controlled trials for us to learn from. I will be appraising the recent trial out of Brazil published on July 28th, 2020 about the use of Plaquenil with or without azithromycin in Mild-to-Moderate COVID-19. The primary endpoint in this study was change in a 7-point ordinal disease severity scale at day 15 after enrollment. Secondary endpoints were need for mechanical ventilation or death from COVID.
There are 6 key components that determine if the results of a randomized controlled trial are valid. They are: 1) randomization, 2) concealment of randomization, 3) blinding of researchers and participants, 4) completion of follow-up (attrition), 5) completion of the trial as per the protocol, and 6) the intention to treat principle. I’ll run through each of these briefly for this study.
In this trial the groups were randomized. What that means is participants in the study were placed in groups based on a random order. The randomization strategy was 1:1:1, meaning equal numbers were assigned to each of the three groups. The sequence was determined by a software program run by an independent agency. Since randomization was performed independently, it was concealed. Concealment is essentially hiding the sequence of randomization from the researchers so it cannot be manipulated. In order to ensure balance between the groups, the randomization was also stratified for the requirement of oxygen. Stratification means participants that meet certain criteria are randomized in a separate sequence to ensure the higher risk cases do not end up in one group by random chance. You can check “yes” next to components 1 and 2.
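To make stratified 1:1:1 randomization concrete, here is a minimal Python sketch. It assumes permuted blocks of three within each oxygen stratum. The block size, the arm labels, the stratum names, and the seed are all illustrative assumptions, not details taken from the trial.

```python
import random

# Hypothetical arm labels for illustration only.
ARMS = ["Plaquenil", "Plaquenil + azithromycin", "standard care"]

def make_sequences(n_per_stratum, strata, seed=0):
    """Build one allocation sequence per stratum using permuted blocks of 3."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sequences = {}
    for stratum in strata:
        sequence = []
        while len(sequence) < n_per_stratum:
            block = ARMS[:]     # one slot per arm keeps the ratio at 1:1:1
            rng.shuffle(block)  # random order within the block
            sequence.extend(block)
        sequences[stratum] = sequence[:n_per_stratum]
    return sequences

sequences = make_sequences(12, ["on oxygen", "no oxygen"], seed=42)
for stratum, sequence in sequences.items():
    counts = {arm: sequence.count(arm) for arm in ARMS}
    print(stratum, counts)
```

Because each stratum has its own sequence, patients who need oxygen are spread evenly across the three arms instead of piling up in one arm by chance. In a real trial the sequence would be held by the independent agency, which is what keeps it concealed.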
The study was not blinded. This was due to the principle of equipoise. Blinding means keeping the composition of the groups hidden during the trial. Generally, doctors, patients, data collectors, adjudicators of data, and statisticians are all blinded to prevent bias. This is why placebos are used, and the placebo medicine looks just like the experimental medicine. The principle of equipoise is a research ethics principle that basically states you cannot force participants in a study to remain in one group if the other is getting significant benefit. Since previous reports showed benefit from Plaquenil, everyone needed to know which group they were in to prevent one group from being unnecessarily harmed.
Follow-up was complete in this study.
The trial was not stopped early for benefit.
The data was analyzed with the intention to treat principle. Intention to treat is an analysis principle that requires participants to be analyzed in the group to which they were randomized, regardless of what they do during the trial. So, if they switch groups, or stop the trial early, or whatever happens, their data is counted in the original group. This is to preserve randomization, which preserves prognosis throughout the study. The groups, collectively, need to be comparable. Once you lose that comparability the analysis falls apart.
Sans the blinding, the methods in this study were valid. However, there was one big issue with their sample size calculation that I will discuss later. Onto the results.
Here are the highlights. There were 667 patients enrolled with 504 confirmed COVID cases. 217 received Plaquenil. 221 received Plaquenil and Azithromycin. 229 were treated with standard of care. Mean age was 50. Oxygen was required in 42% at the time of enrollment. Randomization was successful in the confirmed COVID cases since there were no differences seen between the groups.
They found no difference in disease severity scale at day 15 between the three groups. The odds ratio for Plaquenil alone was 1.21 with a confidence interval of 0.69-2.11. The confidence interval crosses one, which means the difference was not statistically significant. The comparison for Plaquenil plus azithromycin was similar.
There were no differences in the secondary endpoints of need for mechanical ventilation and death. A total of 43 participants required mechanical ventilation and 18 died. More adverse events occurred in the Plaquenil and the Plaquenil + azithromycin groups.
In this randomized controlled trial, the authors did not find any difference in severity of illness between the control and experimental groups. Randomized controlled trials are our basis for understanding the effects of treatments. So many variables are controlled throughout the experiment that the intervention (in this case Plaquenil with or without azithromycin) stands on its own. It either helped or it didn’t. It’s difficult to find reasons to refute the result of an absence of benefit. However, the trials must be designed and executed correctly.
In this paper the authors bring up several shortcomings that I would like to point out. First, their confidence intervals were quite wide. This suggests there was an issue with their sample size calculation. Remember, the true value lies within the confidence interval with 95% confidence. In this study, the plausible true effect ranges from substantial benefit to significant harm. Second, there was no blinding. The investigators and the patients knew which group they were in. This could introduce bias. Finally, median time from symptom onset to first dose was 7 days. Theoretically the medicine should have a greater impact if started earlier in the disease course.
I want to spend a while on statistical power because this study is a great example of its importance. Honestly, this was a small study. I know they did a power analysis/sample size calculation based on their first 120 patients, but there is a high probability they messed this up. A sample size calculation tells the investigators how many participants they will need to have a good chance of reaching statistical significance if the treatment really works. It is calculated prior to the start of the study. The usual standard is an 80% probability (the power) of reaching statistical significance at 95% confidence. Even with this standard, 20% of the time you will miss an effect that is really there. If a study is adequately powered, it is unlikely that a real effect was missed by chance. There are several clues that demonstrate shortcomings in their calculation. This is important because they found no difference between the groups, but they may not have enrolled enough people to detect a difference.
First is the wide confidence interval. Theoretically, let’s repeat this trial exactly as is 100 times. Based on the established statistical probabilities you would have 5 results outside of the confidence interval. We’ll throw those away for now. The remaining 95 would fall somewhere within the confidence interval. The 0.69 says the Plaquenil takers were about 30% less likely to get worse. The 2.11 says the Plaquenil takers were about twice as likely to worsen. Now, I’m going to talk a little about standard deviations. Take a deep breath, it won’t be that bad. Standard deviations provide a sense of how close the numbers are to the mean. The more variability you have (wide confidence interval), the larger the standard deviation. Standard deviations allow you to predict the probability of future results should you repeat the same measurement. Large standard deviations tell you that your measurements are not precise and there’s something messed up with your experiment, or you’re testing something with great variability. The confidence interval comes from the standard deviation, so we know in this case it is large.
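The link between the confidence interval and the standard deviation can be sketched in a few lines of Python. This assumes the reported interval is a standard 95% interval that is symmetric on the log odds ratio scale, which is my assumption and not something the paper states. The function names are mine, and the "four times as many patients" scenario is purely hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def log_or_standard_error(lower, upper, z=Z_95):
    """Back out the standard error of log(odds ratio) from a reported CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def ci_from_se(odds_ratio, se, z=Z_95):
    """Rebuild a CI around an odds ratio from the standard error of log(OR)."""
    centre = math.log(odds_ratio)
    return math.exp(centre - z * se), math.exp(centre + z * se)

# Reported interval for Plaquenil alone: 0.69 to 2.11 around an odds ratio of 1.21.
se = log_or_standard_error(0.69, 2.11)
print(f"standard error of log(OR): {se:.3f}")

# Hypothetical: the standard error shrinks with the square root of the sample
# size, so four times as many patients would roughly halve it.
low, high = ci_from_se(1.21, se / 2)
print(f"interval with ~4x the patients: {low:.2f} to {high:.2f}")
```

The point of the exercise is that the width of the interval is driven by the standard error, and the standard error is driven by how many patients (and events) you have.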
For this illustration I’ll assume the individual repeat experiments were normally distributed across the confidence interval. Distribution curves visually capture all of the potential outcomes of a test (Figure 1 below). One standard deviation from the mean will capture 34% of the results on either side of the curve, leaving 16% of the results in each tail. What that means is in our theoretical 100 repeats, 16 of them would show substantial benefit. Another 16 of them would show substantial harm. The remaining 68 trials would likely fail to reject the null hypothesis (no change). So where is the true value?
Figure 1: Typical Gaussian Distribution Curve
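The 34% and 16% figures quoted above are properties of the normal curve, and they can be checked with Python's standard library. This is only a sanity check of those percentages; the variable names are mine.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of results within one standard deviation of the mean (both sides).
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Share of results more than one standard deviation out, in a single tail.
each_tail = 1 - standard_normal.cdf(1)

print(f"within 1 SD of the mean: {within_one_sd:.1%}")
print(f"beyond 1 SD, each tail:  {each_tail:.1%}")
```

That works out to about 68% in the middle and about 16% in each tail, which is where the 68/16/16 split in the 100 theoretical repeats comes from.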
This is why a narrow confidence interval is more valuable than the actual odds ratio number reported. The odds ratio in this study is subject to random chance. For example, think of what a confidence interval of 0.90-1.10 tells you. The true value rests between those two numbers, with 1.00 telling you the groups were equal. Even if the true value is 0.93 there is minimal effect from your treatment, so it’s much easier to accept the absence of effect when making a real-life decision. Odds ratios of 0.69 and 2.11 are close to a doubling of the effect on either side, which could make a big difference in the real world. The confidence interval says how confident you are, hence the name. The wider it is, the less reliable it is.
Second is the small number of individuals in the trial. Honestly, this is COVID. Every day we are hearing about new cases, hospitalizations, and a rising death toll. The cases are in the millions worldwide, and probably much more than reported. The study took place in Brazil, one of the hardest hit countries.(3) This isn’t Mongolia. They only enrolled about 200 patients in each arm. Only 504 were confirmed COVID positive. They used the severity of the first 120 patients as the basis for their power calculation instead of historical data. Based on initial data from China, about 11% of hospitalized patients deteriorated from minor to severe. If you predicted Plaquenil reduced your deterioration rate by 1/3 (that would be awesome!), roughly 7.5% of patients would deteriorate in the experimental arm versus 11% in the control. Based on observational data it was suspected that Plaquenil would pass this threshold.(4) A simple sample size calculation for this event rate gives 1075 patients in each arm in order to detect a difference. How they got to 504 as their sample for all three groups combined baffles me. Even if I thought Plaquenil was manna from heaven, I would still provide some margin of safety and expect to enroll more than the 3225 (1075+1075+1075) it would take to get the events needed.
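For anyone who wants to check that arithmetic, here is a minimal sketch of a simple sample size calculation for comparing two proportions. It assumes a two-sided 5% significance level, 80% power, and the common normal-approximation formula with a pooled variance under the null. That choice of formula is my assumption, and other variants give slightly different answers. The function name is mine.

```python
import math
from statistics import NormalDist

def patients_per_arm(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate patients needed in each arm to compare two proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = norm.inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p_control + p_treat) / 2      # pooled event rate under the null
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    numerator = (z_alpha * sd_null + z_beta * sd_alt) ** 2
    return math.ceil(numerator / (p_control - p_treat) ** 2)

# 11% deterioration in the control arm versus 7.5% with treatment.
n = patients_per_arm(0.11, 0.075)
print(f"patients per arm: {n}")
print(f"patients across three arms: {3 * n}")
```

Under these assumptions the result lands at roughly 1075 per arm. Notice how sensitive the number is to the gap between the two rates: the difference is squared in the denominator, so a smaller expected benefit blows up the required sample very quickly.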
Finally, there were practically no events. Only 54 patients reached level 4 or higher on the severity scale (see figure below). That’s across all three groups. About ten percent of the patients in the study reached a clinically relevant endpoint. The remaining ninety percent were uneventful. I often talk about the rule of 50 as a quick rule to determine if there were enough events to be able to detect a difference. The rule of 50 states “it is very difficult to detect a difference between two groups if you have fewer than 50 events in each arm”. Your sample size has to be very large to detect a difference with such a small event rate.
Figure 2: Highlighted Events from the Study Based on 7 Point Scale
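To put a rough number on how underpowered a comparison of this size is, the sketch below estimates power for a simple two-arm comparison of proportions. Several things here are assumptions on my part: it treats the outcome as yes/no deterioration at the 11% versus 7.5% rates used above (the trial's real primary endpoint was an ordinal scale), it splits the 504 confirmed cases evenly into about 168 per arm, and it uses a normal approximation. The function name is mine.

```python
import math
from statistics import NormalDist

def approx_power(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    effect = abs(p_control - p_treat) * math.sqrt(n_per_arm)
    # Probability of landing past the significance cutoff when the effect is real.
    return norm.cdf((effect - z_alpha * sd_null) / sd_alt)

confirmed_per_arm = 504 // 3  # assumed even split of confirmed cases
power_actual = approx_power(0.11, 0.075, confirmed_per_arm)
power_planned = approx_power(0.11, 0.075, 1075)

print(f"power at {confirmed_per_arm} per arm: {power_actual:.0%}")
print(f"power at 1075 per arm: {power_planned:.0%}")
```

Under these assumptions the power at the actual size comes out far below the usual 80% standard, which is another way of saying that a true one-third reduction would be missed most of the time.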
This study demonstrates how important preparation is in medical research. All of the planning. All of the work recruiting, following, and analyzing data. Then writing the paper. Finally, the submission and review for publication. All those hours of hard work, and one little part overlooked, or one mistake in preparation, and the study falls apart. Discovering truth is oh so very difficult. That’s probably why we have so few things we know for certain.
I don’t know if Plaquenil is helpful in the treatment of COVID-19. The theory is a stretch. Repurposing an old medicine for its “antiviral” or anti-inflammatory properties without a great understanding of the mechanism is suspect. It’s the equivalent of finding a flower deep in the Amazon that cures cancer. It seems serendipitous to me, so I don’t trust it. Besides the available data showing no benefit, the fact that the theory does not translate well from bench to patient is a serious issue. It certainly should not be considered for widespread use at this time, but the next published study may refute the findings published here.
1. Vincent MJ, Bergeron E, Benjannet S, et al. Chloroquine is a potent inhibitor of SARS coronavirus infection and spread. Virol J. 2005;2:69. doi:10.1186/1743-422X-2-69
2. Gautret P, Lagier J-C, Parola P, et al. Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open-label non-randomized clinical trial. Int J Antimicrob Agents. 2020;56(1):105949. doi:10.1016/j.ijantimicag.2020.105949
3. Brazil Coronavirus: 3,035,582 Cases and 101,136 Deaths - Worldometer. Accessed August 10, 2020. https://www.worldometers.info/coronavirus/country/brazil/
4. Arshad S, Kilgore P, Chaudhry ZS, et al. Treatment with hydroxychloroquine, azithromycin, and combination in patients hospitalized with COVID-19. Int J Infect Dis. 2020;97:396-403. doi:10.1016/j.ijid.2020.06.099