Statistical analysis methods applied to early outpatient COVID-19 treatment case series data

: When confronted with a public health emergency, signiﬁcant innovative treatment protocols can sometimes be discovered by medical doctors at the front lines based on repurposed medications. We propose a statistical framework for analyzing the case series of patients treated with such new protocols, that enables a comparison with our prior knowledge of expected outcomes, in the absence of treatment. The goal of the proposed methodology is not to provide a precise measurement of treatment efﬁcacy, but to establish the existence of treatment efﬁcacy, in order to facilitate the binary decision of whether the treatment protocol should be adopted on an emergency basis. The methodology consists of a frequentist component that compares a treatment group against the probability of an adverse outcome in the absence of treatment, and calculates an efﬁcacy threshold that has to be exceeded by this probability, in order to control the corresponding p -value, and reject the null hypothesis. The efﬁcacy threshold is further adjusted with a Bayesian technique, in order to also control the false positive rate. A random selection bias threshold is then calculated from the efﬁcacy threshold to control for random selection bias. Exceeding the efﬁcacy threshold establishes the existence of treatment efﬁcacy by the preponderance of evidence, and exceeding the more demanding random selection bias threshold establishes the existence of treatment efﬁcacy by the clear and convincing evidentiary standard. The combined techniques are applied to case series of high-risk COVID-19 outpatients, that were treated using the early Zelenko protocol and the more enhanced McCullough protocol.


Introduction
In medical research, the efficacy of new drugs or treatment protocols is established by controlled studies in which a treatment group is compared against a control group.A case series is one half of a controlled study consisting only of the treatment group.At the beginning of the COVID-19 pandemic, practicing medical doctors were confronted with having no treatment to offer to their patients that can prevent or minimize hospitalization and/or death.In response, some doctors were compelled to innovate and discover, on their own, treatment protocols using repurposed off-label medications.Most notable examples, amongst several others, include Didier Raoult [1] in the IHU Méditerranée Infection hospital in Marseilles France, Vladimir Zelenko [2] in upstate New York, George Fareed and Brian Tyson [3] in California, Shankara Chetty [4] in South Africa, Jackie Stone [5] in Zimbabwe, and Paul Marik's group [6,7], which was in the beginning based at the Eastern Virginia Medical School.Their efforts to treat patients generated case series of successfully treated patients that constitute real-world evidence [8].
The goal of this paper is to present a statistical framework for rapidly analyzing systematic case series data of early treatment protocols with binary endpoints (e.g.hospitalization or death), and comparing them against our prior knowledge of the likelihood of adverse outcomes in the absence of treatment.Although, the development of the proposed statistical technique was originally motivated by the need to assess available case series [2,[9][10][11][12][13] of multi-drug treatment protocols [2,[14][15][16] for COVID-19, it can also play a very important role in the public health response to future pandemics or epidemics with no established treatment protocols.Furthermore, the potential scope of our methodology is very broad and it can be used to compare any treatment group case series, with binary endpoints, against our prior knowledge of the probability of adverse outcomes based on population-level historical controls.A limitation of the methodology is that it should be used only for treatment protocols that are based on repurposed medications [17] with known acceptable safety.The main advantage of the technique is that it can be very good at rapidly validating and enabling the deployment of treatment protocols, based on repurposed medications, when there is a sufficiently strong signal of efficacy.When confronted with a mass casualty event, it is critically important to be able to rapidly leverage the direct clinical experience of medical doctors, towards formulating an evidence-based standard of care, while also being able to rigorously quantify the quality of the available evidence.
The closest concept to our approach is the idea of using a virtual control group [18], in which the outcomes observed in a treatment group case series are compared against the predicted outcomes for the same patient cohort without treatment, using a trained statistical model, based on data accumulated before the discovery of the treatment in question.The virtual control group method aims to not only establish the existence of efficacy, but to also measure the corresponding treatment efficacy.Our idea is to abandon any attempt to obtain an unbiased measure of the treatment efficacy, and to focus on establishing, with sufficient confidence, the existence of some positive treatment efficacy.We do this by comparing the treatment group case series with a probability lower-bound for the expected negative outcomes without treatment.Such lower bounds can be easily extracted from available data, and can be facilitated by applying risk-stratification on the treatment group case series, when necessary.Thus, our aim is to establish, with sufficient confidence, a positive lower bound for treatment efficacy, quickly and without expending substantial resources, using real-world evidence that has been accumulated from the efforts of practicing physicians.In turn, this can be sufficient for a positive recommendation to adopt the corresponding treatment protocol.
Because case-series are susceptible to selection bias, we define two cross-over thresholds for making a positive recommendation: an efficacy threshold, corresponding to a preponderance of evidence standard, where we assume there is no selection bias, and a random selection bias threshold, corresponding to the clear and convincing evidentiary standard, which controls for random selection bias in the case series.Following the recommendation of the American Statistical Association statement on statistical significance and p-values [19], the proposed approach combines use of the p-value, which enables one to reject the null hypothesis, with a Bayesian factor analysis framework [20][21][22][23][24] for controlling the false positive rate [25] in the calculation of the efficacy threshold.Empirically, we have found that the frequentist p-value framework has done a pretty good job on its own, at least for the analysis of the case series data considered in this paper.However, complementing it with Bayesian factor analysis is a reasonable precaution, because it can help raise the red flag when dealing with small sample sizes and/or weak signals.
We apply the proposed framework to the processing of available case series data [2,[9][10][11][12][13] that support proposed early outpatient treatment protocols for COVID-19 patients, such as the original Zelenko triple-drug protocol [2] and the more advanced McCullough protocol [14][15][16].The original Zelenko protocol was first announced on March 23, 2020 [26].The proposed approach was to risk-stratify patients into two groups (low-risk vs high-risk), provide supportive care to the low-risk group, and treat the high-risk group with a triple-drug protocol (hydroxychloroquine, azithromycin, zinc sulfate).Results were reported in an April 28, 2020 letter [9] and a June 14, 2020 letter [10], and the lab-confirmed subset of the April data was published in a formal case-control study [2].Zelenko's letters have been attached to our supplementary material document [27].
The rationale for the triple-drug therapy was based on the following mechanisms of action: Hydroxychloroquine prevents the virus from binding with the cells, and also acts as a zinc ionophore that brings the zinc ions inside the cells, which in turn inhibit the RDRP (RNA Dependent RNA Polymerase) enzyme used by the virus to replicate [28,29].Azithromycin's role is to guard against a secondary infection, but we have since learned that it also has its own anti-viral properties [30][31][32], and a signal of the efficacy of adding azithromycin on top of hydroxychloroquine can be clearly discerned in a study of nursing home patients in Andorra, Spain [33].
It is interesting that chloroquine was shown in vitro to have antiviral properties against the previous SARS-CoV-1 virus [34], and that there is an anecdotal report from 1918 [35] about the successful use of quinine dihydrochloride injections as an early treatment of the Spanish flu.In hindsight, it is now known that influenza viruses also use the RDRP protein to replicate [36], which can be inhibited with intracellular zinc ions [28,29].Consequently, there is a mechanism of action that can explain why we should anticipate the combination of zinc with a zinc ionophore (i.e.hydroxychloroquine, or quercetin [37], or EGCG [38]) to inhibit the replication of the influenza viruses.Other RNA viruses, including the respiratory syncytial virus (RSV) [39] and the highly pathogenic Marburg and Ebola viruses [40,41], are also using the RDRP protein to replicate, raising the question of whether the zinc/zinc ionophore concept could also play a useful role against them.
Zelenko's protocol was soon extended into a sequenced multi-drug approach, known as the McCullough protocol [14][15][16], which is based on the insight that COVID-19 is a tri-phasic illness that manifests in three phases: (1) an initial viral replication phase, in which the virus infects cells and uses them to replicate and make new viral particles, during which patients present with flu-like symptoms; (2) an inflammatory hyper-dysregulated immune-modulatory florid pneumonia, that presents with a cytokine storm, coughing, and shortness of breath, triggered by the toxicity of the spike protein [42], when it is released, as viral particles are destroyed by the immune system, triggering release of interleukin-6 and a wave of cytokines; (3) a thromboembolic phase, during which microscopic blood clots develop in the lungs and the vascular system, causing oxygen desaturation, and very damaging complications that can include embolic stroke, deep vein thrombosis, pulmonary embolism, myocardial injury, heart attacks, and damage to other organs.
The rationale of the original Zelenko protocol was that early intervention to stop the initial viral replication phase could prevent the disease from progressing to the second and third phase, and, in doing so, prevent hospitalizations or death.The McCullough protocol [14][15][16] extends the Zelenko protocol by using multiple drugs in combination sequentially to mitigate each of the three phases of the illness, depending on how they present for each individual patient.McCullough's therapeutic recommendations for handling the cytokine injury phase and the thrombosis phase of the COVID-19 illness are, for the most part, standard on-label treatments for treating hyper-inflammation and preventing blood clots.The most noteworthy innovations to the antiviral part of the protocol are the addition of ivermectin [43][44][45][46][47][48], which has 20 known mechanisms of action against COVID-19 [49], to be used as an alternative or in conjunction with hydroxychloroquine, the addition of a nutraceutical bundle [50][51][52] combined with a zinc ionophore (quercetin [37] or EGCG [38]) for both low-risk and high-risk patients, and lowering the age threshold for high-risk patients to 50 years.The MATH+ protocol [6,7], developed for hospitalized patients by Marik's group, follows the same principles of a sequenced multi-drug treatment.A similar treatment protocol, based on similar insights, was independently discovered and published on May 2020 by Chetty [4] in South Africa.
McCullough's protocol [14][15][16] was adopted by some treatment centers throughout the United States and overseas, but has not been endorsed by the United States public health agencies, ostensibly due to lack of support of the entire sequenced treatment algorithm by an RCT.In spite of the urgent need for safe and effective early outpatient treatment protocols for COVID-19, there has been no attempt to conduct any such trials of any comprehensive multi-drug outpatient treatment protocols throughout the pandemic.Instead, the prevailing approach has been to try to build treatment protocols, one drug at a time, after validating each drug with an RCT.Because COVID-19 is a multifaceted tri-phasic illness, there is no a priori reason to expect that a single drug alone will work for all 3 phases of the disease.Consequently, the first priority should be to validate the efficacy of treatment protocols that use multiple drugs in combination, since this is what is actually going to be used in practice to treat patients.To that end we have analyzed the case series by Zelenko [2,9,10], Procter [11,12], and Raoult [13], where such multidrug outpatient treatment protocols have been used by practicing physicians.
The broader context in which the proposed statistical methodology is situated is as follows.Shortly before COVID-19 was declared a pandemic by the World Health Organization, an article [53] was published on February 23, 2020 in the New England Journal of Medicine arguing that "the replacement of randomized trials with non-randomized observational status is a false solution to the serious problem of ensuring that patients receive treatments that are both safe and effective".The opposing viewpoint was published earlier in 2017 by Frieden [54], highlighting the limitations of randomized controlled trials (hereafter RCT) and the need to leverage and overcome the limitations of all available sources of evidence, including real world evidence [8], in order to make lifesaving public health decisions.In particular, Frieden [54] stressed that the very high cost of RCTs and the long timelines needed for planning, recruiting patients, conducting the study, and publishing it, are limitations that "affect the use of randomized controlled trials for urgent health issues, such as infectious disease outbreaks for which public health decisions must be made quickly on the basis of limited and imperfect data." Deaton and Cartwright [55] presented the conceptual framework that underlies RCTs and highlighted several limitations.Among them, they have stressed that randomization requires very large samples on both arms of the trial, otherwise, an RCT should not be presumed to be methodologically superior to a corresponding observational study.For example, the randomized controlled trial study of hydroxychloroquine by Dubee et al [56], was administratively stopped after recruiting 250 patients, with 124 in the treatment group and 123 in the control group.Although a two-fold mortality rate reduction was observed by day 28, the study failed to reach statistical significance, due to the small sample size.Even if statistical significance had been achieved via a stronger mortality rate reduction signal, the small sample size would have still prevented sufficient randomization.Consequently, although the study has gone through the motions of an RCT, it is not methodologically superior to a retrospective observational study.There are several other RCT studies of hydroxychloroquine with similar shortcomings [57].
Furthermore, although a properly conducted RCT has internal validity, in that the inferences are applicable to the specific group of patients that participated in the trial, the external validity of the RCT outcomes needs to be justified conceptually on the basis of prior knowledge, which is either observational, or based on a deeper understanding of the underlying mechanisms of action.Because COVID-19 mortality risk in the absence of early treatment can span three orders of magnitude (from 0.01% to more than 10%) [58][59][60][61][62][63][64], depending on age and comorbidities, trials using low-risk patient cohorts are not informative about expected outcomes on the high-risk patient cohorts and vice versa.Likewise, the timing of treatment and the medication dosage/duration of treatment will confound the results of an RCT.In general, better results are expected when treatment is initiated earlier rather than later, and negative results can be caused by inappropriate medication dosage (i.e., too much or too little).These are all relevant considerations for establishing the external validity of an RCT.
As was noted by Risch [65], when interpreting evidence from RCTs, and more broadly from any study, we should bear in mind that results of efficacy or toxicity of a treatment regimen on hospitalized patients cannot be extrapolated to outpatients and vice versa.Likewise, Risch [65] noted that evidence of efficacy or lack of efficacy of a single drug do not necessarily extrapolate to using several drugs in combination.This latter point is further amplified when there is an algorithmic overlay governing which drugs should be used and when, based on the individual patient's medical history and ongoing response to treatment.Consequently, RCTs that compare a single drug monotherapy against supportive care are not always informative about whether the drug should be included in a multi-drug protocol.
In addition to all that, we are also confronted with an ethical concern.If the available observational evidence are sufficiently convincing, then there is a crossover point where it is no longer ethical to justify randomly refusing treatment to a large number of patients, in order to have a sufficiently large control group.The corresponding mathematical challenge is being able to quantify the quality of our observational evidence in order to determine whether or not we are already situated beyond this ethical crossover point.
Just as the quality of evidence provided by randomized controlled trials is fluid, with respect to successful randomization and external validity, the same is true about the quality of real world evidence [8] that will inevitably become available from the initial response to an emerging new pandemic.We envision that a successful pandemic response, in the area of early outpatient treatment, will proceed as follows: The first element of pandemic response is to assess and monitor the situation by prospectively collecting data, needed to construct predictive models of the probability of hospitalization and death, in the absence of treatments that have yet to be discovered, as a function of the patient's medical profile/history.These models do not necessarily need to be sophisticated at the early stages of pandemic response.It could be sufficient to be able to predict good lower bounds for the hospitalization or mortality probabilities, as opposed to more precise estimates.This early data can be used to identify the predictive factors for hospitalization or death and risk stratify the patients into low risk and high risk categories.They can also be used as a historical control group that establishes our prior knowledge of expected outcomes, in the absence of treatment, that has yet to be discovered.
In parallel with gathering and analyzing data, which is the primary duty and responsibility of our public health and academic institutions, medical doctors have an ethical responsibility to use the emerging scientific understanding of the new disease and it's mechanisms of actions to try to save the lives of as many patients as possible.Under article 37 of the 2013 Helsinki declaration [66], it is ethically appropriate for physicians to "use an unproven intervention, if in the physician's judgment it offers hope of saving life, re-establishing health or alleviating suffering", provided, there is informed consent from the patient, and "where proven interventions do not exist, or other known interventions have been ineffective".
When this effort leads to the discovery of a treatment protocol, with an empirical signal of benefit and acceptable safety, and using the treatment protocol results in a case series of treated patients, then the confluence of the following conditions makes it possible to statistically establish the existence of treatment efficacy: First, the proposed treatment protocols should use repurposed drugs [17] with known acceptable safety.When testing new drugs, we have no prior knowledge of the risks involved and a rigorous controlled study is required to determine the balance of risks and benefits.Second, we need data that give us prior knowledge of the probability risk of the relevant binary endpoints (i.e.hospitalization and/or death) in the absence of treatment, as a function of the relevant stratification parameters.Third, and most importantly, the case series corresponding to treated patients should exhibit a very strong signal of benefit, relative to our prior experience with untreated patients, prior to the discovery of the respective treatment protocol.
Under these conditions, the idea that is proposed in this paper works as follows.Our input is the number N of high-risk patients treated, the number of patients a with an adverse outcome (i.e.hospitalization or death) and selection criteria for extracting the highrisk cohort under consideration, from which we can deduce, based on prior knowledge, that the unknown probability x of an adverse outcome without treatment is bounded by p 1 ≤ x ≤ p 2 .We also choose the desired level of p-value upper bound p 0 , which is typically p 0 = 0.05 (95% confidence), although we shall also consider p 0 = 0.01 and p 0 = 0.001.The output is an efficacy threshold x 0 (N, a, p 0 ) that gives us the following rigorous mathematical statement: if x 0 (N, a, p 0 ) < x, then we have more than 1 − p 0 confidence that the treatment is effective relative to the standard of care.This statement has to be paired with the subjective assessment of our prior knowledge, based on which we need to show that x 0 (N, a, p 0 ) < p 1 .The upper bound p 2 is used by the Bayesian factor technique as part of finalizing the calculation of the efficacy threshold x 0 (N, a, p 0 ).Implicit in this argument is the assumption of no selection bias, allowing us to apply the probability x at the population level to our particular case series.From the sample size N and the finalized efficacy threshold x 0 , we also calculate a random selection bias threshold x 1 (N, x 0 , p 0 ), higher than x 0 , that quantifies how large the gap between p 1 and x 0 needs to be, in order to mitigate with 1 − p 0 confidence, any possible random selection bias in the case-series sample (N, a).
As a result, we can assert the existence of treatment efficacy using two distinct standards of evidence.If we can establish x 0 < p 1 , then the preponderance of evidence is in favor of the existence of treatment efficacy, and this can justify its provisional adoption on an emergency basis, in order to gather more evidence.If we can establish that x 1 < p 1 , then the evidence becomes clear and convincing, and if these results are replicated by multiple treatment centers, then it becomes ethically questionable to deny patients access to the treatment protocol, for the purpose of conducting an RCT, or simply due to therapeutic nihilism by public health authorities.In Fig. 1, we show how the proposed statistical methodology can be integrated into an epidemic or pandemic response that leverages and deploys the direct experience of frontline medical doctors, resulting from their efforts to treat their patients.We stress again that this approach is appropriate only for treatment protocols using repurposed medications with known acceptable safety.When new medications, as opposed to repurposed drugs, are introduced into a pre-existing treatment protocol, then they should be rigorously tested both for safety and efficacy with prospective RCTs.
The paper is organized as follows.In Section 2 we present the technique for calculating the efficacy threshold and the random selection bias threshold.We also explain the relationship of the proposed technique with the exact Fisher test and with the binomial proportion confidence interval problem.In Section 3, we present a Bayesian technique for adjusting the efficacy thresholds in order to also control the corresponding false positive rate.In Section 4, we illustrate an application of both techniques to the Zelenko case series [2,9,10] as well as the Procter [11,12] and Raoult [13] case series.Discussion and conclusions are given in Section 5.With the exception of Section 3, which is mainly relevant to a more careful analysis by biostatisticians, we have strived to make Section 2 and Section 4 of the paper relevant and accessible to both clinicians and biostatisticians, by minimizing the mathematical details.Material that is relevant only to biostatisticians is relegated to the appendices.The computer code and the corresponding calculations are included in the supplementary data document [27].

Methods. Part I: Frequentist methods for case series analysis
In this section, we present the technique for comparing a treatment group case series of high-risk patients against the expected probability x of an adverse outcome without treatment, based on prior knowledge.Since our prior knowledge bounds the probability x inside an interval p 1 < x < p 2 but the precise value of x is unknown, we calculate the minimum value (efficacy threshold) that this probability has to exceed in order to be able to reject the null hypothesis, that the treatment has no efficacy.The proposed technique is equivalent to an exact Fisher test where we take the limit of an infinitely large control group with probability of an adverse outcome set equal to x.We also explain the relationship of the proposed approach with the binomial proportion confidence interval problem, and provide evidence that the corresponding coverage probability is conservative.
The assumption of conservative coverage is used, in turn, to derive a random selection bias threshold that x should exceed in order reject the possibility of a false positive result due to random selection bias.

Comparing treatment group against expected adverse event rate without treatment.
Suppose that we have a treatment group of high-risk patients in which N patients have received treatment, and a patients have had an adverse outcome.Let us also assume that all N patients in the case series satisfy precise selection criteria, used to classify them as high-risk patients, from which we can infer, from our prior knowledge, that in the absence of treatment, the probability x of an adverse outcome for a similar population is bounded in the interval p 1 ≤ x ≤ p 2 .To establish the existence of treatment efficacy, we assume the null hypothesis, that the treatment has no effect and that consequently, the probability of an adverse outcome in the treatment group is also equal to x.Under this null hypothesis the probability of observing a patients with an adverse outcome out of a total of N patients is given by pr which corresponds to a binomial distribution.The first factor gives the number of combinations for choosing the a patients that have an adverse outcome out of all N patients.The second factor x a is the probability that the chosen a patients have an adverse outcome, under the assumption of the null hypothesis.The third factor (1 − x) N−a is likewise the probability that the remaining N − a patients will not have an adverse outcome.Consequently, the product of the three factors is the probability of seeing the event (N, a) under the null hypothesis.The corresponding p-value is calculated by adding to the probability of the event (N, a), the probability of all other events with smaller or equal probability, and it reads where H is the modified Heaviside function given by The Heaviside function factor in Eq. ( 2) selects the events (N, n) that are less probable than the observed event (N, a) for inclusion in the probability sum, as per the formal definition of the p-value.
In order to reject the null hypothesis, we need to construct a convincing argument that establishes that p(N, a, x) < p 0 , with p 0 = 0.05 in order to achieve 95% confidence.Such an argument, in effect, is a hypothesis test that compares the treatment group (N, a) outcome against a fixed probability x for an adverse outcome in the absence of treatment.Our proposal for doing so is conceptually very simple.First we calculate an efficacy threshold x 0 (N, a, p 0 ) such that In doing so, we are seeking the smallest possible value of x 0 that satisfies Eq. ( 4).If our prior belief about x is that it satisfies p 1 ≤ x ≤ p 2 , then it follows that if we show that our prior belief about the lower bound p 1 of the probability x of an adverse outcome without treatment exceeds the efficacy threshold x 0 , then we have a statistically significant signal of benefit in favor of the proposed treatment protocol.This is, in turn, sufficient to recommend to other physicians to consider using the treatment protocol, on an emergency basis, in order to save as many patients as possible, as soon as possible.
We stress again that implicit in this reasoning is the assumption that all observed adverse events in the treatment group case series have been caused by the disease and not by the treatment.For this reason, this methodology has to be limited only to the evaluation of treatment protocols using repurposed medications [17] with previously known acceptable safety.Furthermore, in order to have a prior belief constraining the probability x of an adverse outcome for high-risk patients, in the absence of treatment, it is necessary for public health agencies and academic institutions to prospectively collect data on the predictive factors for hospitalizations or death, as soon as possible, at the beginning of an emerging new disease.These data can then be used both to define the selection criteria for identifying patients as high-risk and to constrain the corresponding probability x within the interval p 1 ≤ x ≤ p 2 .Finally, the above argument is predicated on the assumption of no selection bias in the case series (N, a), in order for the inequality p 1 ≤ x ≤ p 2 to be applicable to the case series.We extend the argument to account for selection bias in Section 2.3.

Comments on the proposed hypothesis testing technique
We now make the following comments about the above hypothesis testing technique.First, we note that the p-value p(N, a, x), corresponding to a comparison of a case series (N, a) of a treatment group against the probability x of an adverse outcome without treatment, as given by Eq. ( 2), can be also obtained by running an exact Fisher test with an artificial control group (M, b) of M patients with b adverse outcomes, with x = b/M, in the Figure 2. We plot the p-value calculated from an exact Fisher test that compares the treatment group from the DSZ study [2] (141 high-risk patients treated with 1 death) against an artificial control group with 3.8% mortality rate.Note that the exact p-value in the infinite control group limit should be 0.047, which is approached to three decimals when we get to control group size between 160,000 and 180,000 limit where the size of this artificial control group goes to infinity.In appendix A, we give a mathematical proof of this claim and also explain the mathematically precise formulation of the statement.This convergence property is in fact, a consequence of a known relationship [67][68][69] between the hypergeometric distribution, used in the calculation of the exact Fisher test p-value, and the binomial distribution used in the calculation of p(N, a, x).
Paradoxically, as shown from the example in Fig. 2, the convergence of the p-value is not monotonic with respect to the control group size.Intuitively, increasing the size of the control group should increase confidence in rejecting the null hypothesis, which should result in a monotonically decreasing p-value.Instead, we see that the p-value increases, as the control group size is increased, with intermittent downward jumps driving the convergence to p(N, a, x).We also see that the convergence is slower than we might expect.Nevertheless, the result of Appendix A assures us that, in the limit of an infinite control group, the p-value eventually does converge to p(N, a, x).
Second, the proposed hypothesis technique is also mathematically related with the well-researched binomial proportion confidence interval problem [70].Given the case series (N, a) of a treatment group of N patients, with a patients having an adverse outcome, the challenge of the binomial proportion confidence interval problem is to identify a probability interval (q 1 , q 2 ), such that we can assert, with 1 − p 0 confidence, that the probability of an adverse event with treatment is inside the interval (q 1 , q 2 ).
If the null hypothesis is satisfied, then the probability of an adverse outcome with treatment is equal to the probability of an adverse event without treatment, and it follows that the intervals (p 1 , p 2 ) and (q 1 , q 2 ) have to intersect.The contrapositive of this deduction is that if the intervals (p 1 , p 2 ) and (q 1 , q 2 ) do not intersect, then the null hypothesis is false.This argument shows that the upper endpoint q 2 is the efficacy threshold x 0 (N, a, p 0 ) that has to be exceeded by all probabilities in the interval (p 1 , p 2 ) in order to reject the null hypothesis and claim a signal of benefit.More specifically, the method proposed in the preceding section for calculating the efficacy threshold x 0 (N, a, p 0 ) is equivalent to calculating the upper endpoint of the Sterne interval [71] for the corresponding binomial proportion confidence interval problem.
It is worth noting that although several alternative techniques have been proposed for solving the binomial proportion confidence interval problem, none of them has coverage consistent with the desired statistical confidence and most of them do not have conservative coverage [72].This means that, given a case series (N, a) for a treatment group, the obtained 95% confidence interval (q 1 , q 2 ) for the probability of an adverse event, with treatment, could be wider or narrower than it should be, depending on the sample size N and the unknown true value of that probability.Furthermore, it has already been proven that no solution to the binomial proportion confidence interval problem exists with perfect coverage [73].For our purposes, a solution technique with conservative coverage that always overestimates the efficacy threshold x 0 (N, a, p 0 ) is acceptable, and to be preferred over techniques that will sometimes overestimate and sometimes underestimate the efficacy threshold.
The coverage of a specific solution technique is quantified via the coverage probability c(N, p 0 |x), which is defined as the conditional probability of observing a case series (N, a), given a fixed sample size N, for which our solution technique will yield a confidence interval that includes the true probability of an adverse event, under the condition that this true probability is equal to x.Here, 1 − p 0 is the desired level of confidence.For very large sample sizes, c(N, p 0 |x) is calculated using computer simulations, however for smaller samples, it can be calculated analytically [74] from the equation, where I(N, n, x, p 0 ) is an indicator function, such that I(N, n, x, p 0 ) = 1 if and only if x is in the confidence interval obtained by the proposed solution technique for a given case series (N, n) corresponding to 1 − p 0 confidence.Otherwise we set I(N, n, x, p 0 ) = 0. Conservative coverage requires that our solution technique satisfy c(N, p 0 |x) ≥ 1 − p 0 for all values 0 ≤ x ≤ 1 of the probability x.The Clopper-Pearson interval [75] is a very well-known solution technique to the binomial proportion confidence interval problem that is known to have conservative coverage.However, an efficacy threshold, defined as the upper limit of the Clopper-Pearson interval, is not equal to what we would have obtained from an exact Fisher test in the limit of an infinite control group, unlike with the Sterne interval [71].In Fig. 3, we show the coverage probability for the Sterne interval for sample sizes N = 20 and N = 100 and note that it also has conservative coverage, which is very desirable in the context of hypothesis testing.In Fig. 4, we compare the coverage probability of the Clopper-Pearson interval against the coverage probability of the Sterne interval and note that although they are both conservative, the Sterne interval has less conservative coverage probability than the Clopper-Pearson interval, over the same sample size.
Our third comment concerns the numerical calculation of the efficacy thresholds x 0 (N, a, p 0 ) from the function p(N, a, x).To illustrate this calculation with an example, on Fig. 5, we plot the p-value p(N, a, x) against the expected mortality rate x without early outpatient treatment of COVID-19, based on Procter's combined case series [12] of 869 high risk patients that received an early treatment protocol with 2 reported deaths.The figure has vertical lines marking the crossover to 95%, 99% and 99.9% confidence.The corresponding efficacy thresholds are located at the points where the zigzag graph of the function p(N, a, x) intersects with the vertical lines.Finding the intersection points numerically with an efficient algorithm is challenging, due to the zigzag shape of the graph.An efficient such algorithm was discovered very recently [76], although we did not use it in our calculations [27].
The discontinuous behavior of p(N, a, x) may seem paradoxical, since we would have expected it to be monotonically decreasing with respect to x, but we have found that it is caused by some of the right-tail contributions to the p-value sum given by Eq. ( 2).If we make an unwarranted approximation, replacing the right-tail sum with the left-tail sum, we obtain the smooth curve shown in Fig. 5.The intersection points of the smooth curve with the vertical lines, give the upper endpoint of the Clopper-Pearson interval [75].Similar graphs for all case series considered in the study have been included in our supplementary data document [27].We have found empirically, at least for the case series being studied here, that both the correct zigzag curve and the approximate smooth curve give almost the same values for all relevant efficacy thresholds.

Selection bias mitigation and selection bias thresholds
The idea of hypothesis testing, in which we compare a case series of treated patients against the historical population level (or possibly a more limited) control group, is vulnerable to the criticism of possible selection bias in the treatment case series in favor of Relationship between p-value and expected mortality rate for high risk patients without early treatment, based on the case series data from Procter's dataset of 869 high-risk patients [12].The zigzag curve follows p(N, a, x) given by Eq. ( 2), whereas the smooth curve approximates the right tail terms in the p-value sum by replacing them with the left-tail terms on the horizontal axis.establishing treatment efficacy.Some of the selection bias could be systemic (i.e., there may be a tendency towards selecting healthier high-risk patients), but even in the absence of any systemic bias, some selection bias will inevitably occur randomly, as a consequence of using a small sample of patients for the treatment case series, randomly chosen out of the general population.We propose the following idea for mitigating random selection bias, and then we discuss the problem of selection bias more broadly.
Suppose that we have a case series (N, a), of N treated patients with a adverse outcomes, and suppose that we have calculated the efficacy threshold x 0 for this case series.We can choose x 0 to be either set equal to x 0 (N, a, p 0 ), or we can choose to have it further increased, if necessary, using the Bayesian technique of Section 3. In either case, if we have a prior belief that the probability of an adverse event, without treatment, in the high-risk part of the general population, under the same high-risk patient selection criteria used to form the case series, is equal to x, then, when selecting a random sample of N high-risk patients out of the general population, we can have 1 − p 0 confidence that the true rate x of adverse events, without treatment, for that particular sample, will range according to a discretized confidence interval m 1 (N, x, p 0 )/N ≤ x ≤ m 2 (N, x, p 0 )/N.Here, m 1 (N, x, p 0 ) is the minimum number of adverse events and m 2 (N, x, p 0 ) is the maximum number of adverse events that we expect to see in any one particular sample of N high-risk patients, in the absence of treatment, with confidence 1 − p 0 .The possible criticism of our approach is that, perhaps, for the specific sample of patients in our case series, the true adverse event rate x , without treatment, could happen to be below the efficacy threshold x 0 , in spite of the corresponding population level adverse event rate x exceeding the efficacy threshold x 0 .This raises the question of how big does the gap between x and x 0 need to be, to ensure that the entire confidence interval of x lies above the efficacy threshold x 0 ?Since the lower endpoint of this confidence interval is m 1 (N, x, p 0 )/N, the answer to this question defines a new higher threshold x 1 (N, x 0 , p 0 ), which we shall call the random selection bias threshold, if set is equal to the minimum value of x 1 that satisfies In Appendix B, we prove that this random selection bias threshold can be calculated by choosing the smallest possible value of x 1 that satisfies the implication Here, the notation x 0 N represents rounding the number x 0 N upwards towards the nearest integer.We note that the calculation of the confidence interval for x , as given by Appendix B, uses the assumption that the Sterne interval solution [71] of the binomial proportion confidence interval problem has conservative coverage.Given the random selection bias threshold x 1 , and the prior knowledge that in a similarly high-risk cohort, at the population level, the probability x of an adverse outcome, in the absence of treatment, ranges between p 1 ≤ x ≤ p 2 , establishing that x 1 < p 1 with some gap between x 1 and p 1 can be used to rule out random selection bias, with 1 − p 0 confidence, as the sole cause of a signal of efficacy.Generally speaking, it is more likely than not that a strong efficacy signal cannot be caused in a particular observation, solely as a result of random selection bias, as long as x, which is near the center of the confidence interval for x , exceeds the efficacy threshold x 0 .Even if part of the x confidence interval is below x 0 , more than half of the interval will be above x 0 .As a result, the efficacy threshold x 0 and the random selection bias threshold x 1 quantify two levels of evidence.Showing x 0 < p 1 establishes the existence of treatment efficacy by the preponderance of evidence.Meeting this evidentiary standard should be sufficient for communicating the proposed treatment protocol to other physicians for emergency adoption, with a caveat that it is still investigational, and that more data is needed before making a definitive claim.Showing x 1 < p 1 establishes the existence of treatment efficacy by the clear and convincing evidentiary standard.Our view is that exceeding the random selection bias threshold x 1 , for a treatment protocol with acceptable safety, is an objective milestone beyond which therapeutic nihilism, and even the denial of treatment for research purposes, becomes unethical.
With regards to the broader problem of systemic bias, there are multiple possibilities to consider: There may be some population-level geographic bias in the patients that live in the geographic area served by a particular treatment center; there may be reporting bias, in that we hear about case series because of the good outcomes, without these outcomes being representative of the actual outcomes at the national or international level; there may be bias in the patient demographics (ratio of low vs high risk patients), and with respect to the timing of treatment (early vs late treatment).The latter concern can be addressed by stratifying the case series with respect to risk and/or timing of treatment.Geographic bias can be addressed by investigating case series across multiple geographic locations and/or by using localized population statistics for the historical control.Outcome reporting bias can be minimized, if we have consecutive case series from the same treatment center, where the initially reported outcomes are replicated by subsequent results.
Further mitigation of systemic selection bias is possible by establishing a large gap between the random selection bias threshold x 1 and the lower bound p 1 for adverse outcomes in the historical control statistics.To quantify the magnitude of systemic selection bias, consider the likelihood ratio L = x/(1 − x) of selecting unhealthy vs healthy patients, if the selection is truly random, i.e. without any systemic bias.Here, we define unhealthy patients to be the high-risk patients that will have an adverse outcome without early treatment, if symptomatically infected, and we define healthy patients to be the patients that are not unhealthy patients.If there is systemic bias in favor of selecting healthy patients, then that could account for a false positive signal of efficacy.It would also reduce the corresponding likelihood ratio to L/F, with F ≥ 1 a numerical factor measuring how much more likely it is to choose healthy patients due to systemic selection bias.In Appendix B, we have also shown that the systemic selection bias threshold x 1 (F|N, x 0 , p 0 ), that x has to overcome in order to mitigate systemic selection bias with magnitude F, is related with the random selection bias threshold x 1 (N, x 0 , p 0 ) via the equation: Given our prior belief that, at the population level, the probability x of an adverse outcome in high-risk patients without treatment satisfies p 1 ≤ x ≤ p 2 , we can find the maximum amount F max of selection bias that can be tolerated, before the evidence quality falls back to the preponderance of evidence evidentiary standard, by solving the equation x 1 (F|N, x 0 , p 0 ) = p 1 with respect to F. The corresponding solution is given by This means that if the systemic bias tends to select healthy high-risk patients F times more likely than the likelihood corresponding to their proportion in the general population of high-risk patients, then 1 ≤ F < F max implies that we can have at least 1 − p 0 confidence that the observed positive signal of efficacy cannot be explained solely as a consequence of systemic selection bias.
Last, but not least, statistical quantitative evidence can be corroborated and amplified with more qualitative evidence based on the Bradford Hill criteria [77] for establishing a causal relationship between treatment and positive patient outcomes.Particularly relevant are the criteria of: (1) plausibility, i.e. the existence of a known biological mechanism of action that explains why the treatment protocol is expected to work; (2) consistency, i.e. observing the same effect in different treatment centers in different locations; (3) biological gradient, i.e. observing improved outcomes with increased medication dosage or length of treatment, additional medications, or by initiating treatment earlier, rather than later.(4) temporality, i.e. immediate improvement in symptoms, following the administration of the treatment protocol.The statistical evidence alone, speak in support of only the strength of association criterion, but that is only one of the several criteria proposed by Bradford Hill [77].If we can establish that these additional criteria are satisfied, then that constitutes additional evidence on top of the statistical evidence, that treatment efficacy exists, and that the signal of benefit cannot be explained away, as a result solely caused by selection bias.

Methods. Part II: Bayesian factor analysis of efficacy thresholds
The methodology that we proposed in Section 2 is also vulnerable to the criticism that rejecting the null hypothesis, solely on the basis that the p-value satisfies p < 0.05, is not sufficient for asserting that treatment efficacy is statistically significant.This is indeed the position of the recent American Statistical Association statement on statistical significance and p-values [19].The problem is that p-values only measure how incompatible the data are with the null hypothesis.Consequently, a concern has been expressed that it is not self-evident that the p-value will always do a good job at controlling the probability of a false positive result [78].To estimate the latter probability, we would have to formulate the appropriate alternate hypothesis and consider how much the data is compatible or incompatible with that alternate hypothesis.This has prompted recommendations to lower the p-value threshold down to 0.01 or 0.001 [78,79].However, this is only a stopgap measure that does not fundamentally address the problem.
In this section, we supplement the p-value based analysis of Section 2 with a proposal for a Bayesian factor analysis [20][21][22][23][24].The Bayesian factor compares the alternate hypothesis (treatment efficacy) against the null hypothesis, and can be used to calculate the probability of a false positive result [25].We do not mean to suggest that the Bayesian factor should replace the p-value in hypothesis testing.Our view is that we need to use both.That is, use the p-value to reject the null hypothesis, and then use the Bayesian factor to assess the strength of the evidence in favor of the alternate hypothesis.This viewpoint is similar to earlier proposals for conditional frequentist testing [20].
In the following, we will briefly review the Bayesian factor framework, and outline our specific proposal for validating and adjusting, as needed, the efficacy threshold x 0 (N, a, p 0 ).We note that the calculation of the random selection bias threshold x 1 is independent of the technique used to calculate the efficacy threshold x 0 .In terms of procedure, one could initially calculate the efficacy threshold x 0 using only the technique of Section 2, and use that to calculate the corresponding random selection bias threshold x 1 .Alternatively, a more detailed analysis would involve: (a) calculating the efficacy threshold using the technique of Section 2; (b) adjusting the efficacy threshold x 0 using the technique presented in this section; (c) using the adjusted efficacy threshold x 0 to calculate the corresponding random selection bias threshold x 1 .

Bayesian factor and the false positive rate
Let A, B, be two arbitrary events in some probability space.From the definition of conditional probability, we obtain the Bayes rule, given by Let D represent our data, H 0 represent the null hypothesis, and H 1 represent the alternate hypothesis.In the Bayesian statistics framework, we assign probabilities p(H 0 ), p(H 1 ) to the hypotheses H 0 , H 1 representing our prior belief about how likely each hypothesis is, and then calculate the updated probabilities p(H 0 |D) and p(H 1 |D) on the condition of observing the data D. In this way, Bayesian statistics is distinct from frequentist statistics where probabilities are not assigned to the hypotheses themselves.
From the Bayes rule we have, and dividing the two equations gives The Bayes factor B(D|H 1 , H 0 ) is defined to read and it is the numerical factor that amplifies our prior belief about the odds ratio b(H 1 , H 0 ) = and it follows that the probability of a false positive result is given by We see that the false positive probability p(H 0 |D) approximately scales as the inverse of the Bayes factor B(D|H 1 , H 0 ).On the other hand, the dependence of p(H 0 |D) on the prior likelihood ratio b(H 1 , H 0 ), which measures our subjective belief about the odds ratio between H 1 and H 0 , before seeing the data D, is uncomfortable.There are three ways to cope with that: First, one can simply join the frequentist camp, consider probabilities based on beliefs as meaningless, and forget about the whole thing.Second, one can use an uninformed prior, meaning that we assume that both hypotheses H 0 and H 1 are equally probable, not having any prior knowledge that favors one over the other, and choose p(H 0 ) = p(H 1 ) = 1/2, which corresponds to b(H 1 , H 0 ) = 1.An interesting third way is to use the reverse Bayesian analysis technique proposed by Colquhoun [25], which is based on the equivalence which relates an upper bound p 0 on the probability p(H 0 |D) with a corresponding lower bound b min (p 0 , B) on the prior likelihood ratio b(H 1 , H 0 ), which is given by with B being the value of the corresponding Bayesian factor.The meaning of Eq. ( 21) is that, given a desired lower bound p 0 for the false positive rate and a threshold B for the Bayesian coefficient, b min (p 0 , B) is the minimum prior likelihood ratio p(H 1 )/p(H 0 ) for our prior knowledge of the extent to which the alternate hypothesis H 1 is favored over the null hypothesis H 0 , for which the Bayesian threshold B can control the false positive rate and keep it below p 0 .As such, given our subjective choice for b min , one can calculate the threshold B for the Bayesian factor corresponding to the minimum tolerated false positive rate p 0 .Since we wish to constrain the false positive rates to less than 0.05, in order to claim 95% statistical significance, we choose p 0 = 0.05.Kass and Raftery [24] and Jeffries [80] both recommend that the threshold B > 100 be used for a decisive acceptance of the alternate hypothesis H 1 over the null hypothesis H 0 .Using B = 100, we find that b min (0.05, 100) = 0.19.This means that if we associate the decisive threshold B > 100 with 95% confidence, doing so is equivalent to a prior belief that the null hypothesis is 5 times more likely than the alternate hypothesis.In turn, this prior belief can be used to deduce Bayesian factor thresholds for higher levels of confidence, consistently with our choice to associate B > 100 with 95% confidence.This choice can be interpreted as defining the word "decisive" to mean 95% confidence, in the context of stating that B > 100 is "decisive".It could be critiqued as being an arbitrary choice, but the same can be said for the p 0 = 0.05 p-value threshold and the Bayesian factor B > 100 threshold.Our particular approach has the advantage of being more transparent, in terms of an intuitive interpretation, than an arbitrary choice made in terms of the prior probabilities for H 0 and H 1 .

Application to hypothesis testing for case series
Now, let us consider how Bayesian factor analysis can be applied to a case series with a treatment group of N patients, where a patients have an adverse outcome.Let x 0 be the corresponding efficacy threshold, determined via the techniques of Section 2, and let x be the probability of an adverse outcome with treatment.We define a null hypothesis H 0 and an alternate hypothesis H 1 about the value of x such that We use for x 0 the upper endpoint of the binomial proportion confidence interval corresponding to the observed data (N, a).Consequently, the null hypothesis H 0 has been defined to place x outside and above that interval, and the alternative hypothesis H 1 considers the remaining possible values for x.
Because both H 0 and H 1 are composite hypotheses, it is necessary to introduce prior probabilities pr(x|H 0 ) and pr(x|H 1 ), corresponding to H 0 and H 1 .It may seem tempting to just use uninformed priors, both for H 0 and H 1 , however, doing so would certainly not be appropriate for the null hypothesis H 0 in almost all situations, since with many illnesses, we can rule out the probability of an adverse outcome exceeding some upper bound p 2 .Instead, we can thus use an uninformed prior on the interval [x 0 , p 2 ], given by pr(x|H 0 (x 0 , and perform an appropriate sensitivity analysis on the parameter p 2 .In general, increasing p 2 will tend to increase the Bayes factor, since doing so will tend to increase the contrast between the null and alternate hypotheses.So, we can explore how much p 2 can be decreased and still maintain a decisive Bayes factor.Likewise, for the alternate hypothesis H 1 , we will use an uninformed prior on the interval [0, t] with t ≤ x 0 given by The reason for this choice is that we have found empirically that, in some cases, the Bayes factor may actually increase if, instead of an uninformed prior on [0, x 0 ], we use an uninformed prior on the shorter interval [0, t].From an intuitive standpoint, we surmise that if the data has a very strong efficacy signal, then the contrast between the null and alternate hypotheses is increased when one eliminates the relatively unlikely values of x between t and x 0 .For this reason, we shall use the maximum value of the Bayes factor taken over all values t ∈ (0, x 0 ), on a decimal logarithmic scale which is given by b In Appendix C we prove that the function b 0 (x 0 , p 2 , t) is initially increasing and then decreasing, with respect to t, with a maximum in the interval [a/N, 1].If this maximum is located in the narrower interval [a/N, x 0 ], then the optimal Bayes factor is indeed obtained when we use a choice t ∈ (0, p 0 ) for the prior distribution of the alternate hypothesis H 1 .
If the maximum is formally located at t > x 0 , then the optimal Bayes factor is obtained at t = x 0 .The resulting metric b(x 0 , p 2 ) is still dependent on the parameter p 2 of the prior distribution of the null hypothesis H 0 .
To complete the metric definition by Eq. ( 26) and Eq. ( 27), we now show the calculation of the Bayes factor B(N, a|H 1 (x 0 , t), H 0 (x 0 , p 2 )) between H 1 and H 0 as a function of x 0 , p 2 , t and the data N, a.We note that the probabilities for seeing the data (N, a), under the hypotheses H 1 and H 0 , are given by: pr(N, a|H 0 (x 0 , p 2 )) = and pr(N, a|H 1 (x 0 , p 2 )) = x 0 0 dx pr(N, a|x) pr(x|H 1 (x 0 , t)) consequently, the corresponding Bayes factor is given by The integrals can be calculated using exact algebra or numerically with the open source computer algebra software Maxima [81].The exact algebra calculation takes longer to carry out, but we have confirmed that the numerical calculation using the function quad_qagr is just as accurate.
In order to control for the false positive rate, we propose that the efficacy thresholds x 0 (N, a, p 0 ) with p 0 = 0.05 should be increased, if necessary, by requiring that they also satisfy b(x 0 , p 2 ) ≥ 2. Since the threshold used for a decisive Bayes factor with p 0 = 0.05 corresponds approximately to b min (p 0 , B) = 1/5, it is reasonable to use the empirical formula to adjust the efficacy thresholds x 0 (N, a, p 0 ) for an arbitrary value of demanded confidence p 0 .For p 0 = 0.01, Eq. ( 36) gives b(x 0 , p 2 ) ≥ 2.7.For p 0 = 0.001, we find b(x 0 , p 2 ) ≥ 3.7.
Both Bayes factor thresholds also correspond to a prior likelihood ratio p(H 1 )/p(H 0 ) = 1/5.Consequently, they are the thresholds that we recommend imposing on the Bayes factors for the purpose of adjusting the corresponding efficacy thresholds x 0 (N, a, p 0 ), for the choices p 0 = 0.01 and p 0 = 0.001.

Results
We shall now apply the proposed framework to the processing of available high-risk COVID-19 patient case series by Zelenko [2,9,10], Procter [11,12], and Raoult [13] that provide evidence for the original Zelenko triple-drug protocol [2] and the more advanced McCullough protocol [14][15][16], which are both based on safe repurposed medications.Section 4.1 reviews the case series under consideration.Section 4.2 summarizes the data and the calculation of the corresponding efficacy and random selection bias thresholds.These are used in Section 4.3 and Section 4.4 to assess the evidence in support of mortality rate reduction and hospitalization rate reduction correspondingly.Section 4.5 shows that the Bayesian factor analysis of the efficacy threshold has negligible impact for the specific case series under consideration.

Review of the Zelenko, Procter and Raoult case series
In the Zelenko April 2020 letter [9], Zelenko reported on his outcomes based on a total of 1,450 patients that he treated for COVID-19 until April 28, 2020 in an Orthodox Jewish community in upstate New York.From this cohort, 405 patients were classified as high risk and treated with his triple-drug therapy (hydroxychloroquine, azithromycin, zinc sulfate).The reported outcomes were 6 hospitalizations and 2 deaths.From amongst the patients classified as low risk, who were given only supportive care, there were no hospitalizations or deaths.Zelenko's criteria for risk stratification define three categories of high risk patients: (1) every patient older than 60; (2) every patient younger than 60 but with comorbidities or obesity (BMI ≥ 30kg/m 2 ); (3) patients younger than 60 and without comorbidities that presented with shortness of breath.
A subset of the April 28, 2020 case series was published in a case controlled study [2] that included only the patients seen with COVID-19 infection that was confirmed by a PCR test or an antibody IgG test.The remaining patients were clinically diagnosed from symptomatic presentation and via ruling out a bacterial or influenza infection.This Derwand-Scholz-Zelenko study (hereafter DSZ study) [2] included 335 patients of which, 141 patients were classified as high-risk patients and treated with the triple drug protocol, with 4 hospitalizations and 1 death.Detailed demographic data is given for the high-risk patient treatment group, including a detailed breakdown in the three high-risk categories.The study also included a control group of 377 patients, who were seen by other treatment centers in the same community, that were only offered supportive care and no early outpatient treatment.From this untreated group, 13 patients died and 58 patients were hospitalized.The untreated group includes both low-risk and high-risk patients, so we expect that it underestimates both the hospitalization and mortality risk for high-risk patients.Unfortunately, demographic data was not available for the untreated group, so from a strictly methodological point of view, one cannot entirely rule out the theoretical possibility that the untreated group might have consisted of patients that are at higher risk, on average, than those of the high-risk treatment group.On the other hand, using a case series of untreated patients from Israel [61], with demographic data indicating a combination of low and high-risk patients, with 143 deaths reported out of 4,179 untreated patients, gives the same 3.4% mortality rate as in the DSZ control group, suggesting that the DSZ control group also consists of a mixed demographic of low and high risk patients.
The June 2020 Zelenko case series [10] is reported in a letter that Zelenko sent to the Israeli Health Minister at the time, Dr. Moshe Bar Siman-Tov, on June 14, 2020, which was later made publicly available.In the letter, Zelenko reported that a total of approximately 2,200 patients were seen as of June 14, 2020, with 800 patients deemed high-risk, under the same criteria who were treated with the triple-drug therapy, since the beginning of the pandemic.The reported cumulative outcomes are: 12 hospitalizations, 2 deaths, no serious side effects, and no cardiac arrhythmias.
During the April 2020-June 2020 interval, at the beginning of May 2020, Zelenko enhanced his triple drug therapy protocol with oral dexamethasone and budesonide nebulizer.He introduced the blood thinner Eliquis towards the end of May 2020 and the beginning of June 2020.Ivermectin was not used by Zelenko until October 2020.Consequently, the DSZ study [2] and the Zelenko April 2020 case series [9] reflect the outcomes of the triple drug therapy, when used by itself as an early outpatient treatment.The Zelenko June 2020 case series [10] includes the use of steroid medications and a blood thinner, so the underlying treatment protocol is closer to the McCullough protocol [14][15][16].
It is worth noting that both letters [9,10] were originally posted on Google Drive by Zelenko, and were censored by Google during 2021.The April 2020 letter [9] was cited by Risch [65], whose paper has also preserved the corresponding case series data.The June 2020 letter case series data [10] was independently reported in a subsequent publication by Risch [82], which included only the number of reported deaths, and not the number of hospitalizations.The authors have attached copies of all three Zelenko letters [9,10,26] to our supplementary material document [27].
The Procter case series were reported consecutively in two publications [11,12].The first paper [11] reports on 922 patients that were seen between April 2020 and September 2020, of which 320 were risk stratified as high-risk patients and treated with the McCullough protocol [14][15][16].The outcome was, 6 hospitalizations and 1 death.The second paper [12] reports on an additional patient cohort seen between September 2020 and December 2020.Out of the total number of patients during that time period, 549 were risk stratified as high-risk and treated with an outcome of 14 hospitalizations and one death.For both case series, the risk stratification criteria were similar to those used by Zelenko.However, the age threshold used to risk stratify patients as high-risk was lowered to 50 years.The medications used were customized for each patient in accordance with the McCullough protocol [14][15][16] and included hydroxychloroquine, ivermectin, zinc, azithromycin, doxycycline, budesonide, foliate, thiamin, IV fluids, and for more severe cases, dexamethasone and ceftriaxone were also added.Demographic details for the cohorts were reported in the respective publications [11,12].
The final high-risk patient case series is extracted from a recent cohort study [13] of 10,429 patients that were seen between March 2020 and December 2020 by Raoult's IHU Méditerranée Infection hospital in Marseille, France.From the entire cohort, 8,315 patients were treated with hydroxychloroquine, azithromycin and zinc.Of those patients, those older than 70 or with comorbidities were also treated with enoxaparin.Low-dose dexamethasone was given on a case by case basis to patients that presented with inflammatory pneumonopathy, high viral loads, or on a case by case basis.This treatment protocol is consistent, to some extent, with the principles that underlie the McCullough protocol [14][15][16].The remaining 2,114 patients did not receive hydroxychloroquine or azithromycin or both due to contraindications or because the patients did not consent to using one or two of these medications.This cohort was used in the Raoult study [13] as a control group.The study risk-stratified the patients by age (see Table 1 of Ref. [13]), making it possible to extract a case series of high-risk patients under the restriction age ≥ 60.In the treatment group, this results in 1,495 high-risk patients with 5 deaths and 106 hospitalizations.In the control group, under the age ≥ 60 constraint, there are 520 high-risk patients with 38 hospitalizations and 11 deaths.The authors note that no serious adverse events to the medications were reported, and that the reported deaths were not related to side effects of hydroxychloroquine or azithromycin.Furthermore, no deaths were reported for age < 60 cohort in both the treatment group and control group.

Tabular summaries of the Zelenko, Procter and Raoult case series
Table 1 summarizes the aforementioned case series, including the treatment groups from the DSZ study [2], the Zelenko [2,9,10] and Procter [11,12] case series and the age ≥ 60 treatment group from the Raoult study [13].Note that the Zelenko June 2020 case series and the Procter II case series as reported on Table 1, combine the two respective consecutive case series.We also report on Table 1 the DSZ study's control group [2], the alternative Israeli control group [61], and the age ≥ 60 part of the Raoult control group [13].We emphasize that all reported treatment group case series consist of high-risk patients.
From a cursory examination of Table 1, we see that the mortality rate is consistent across all treatment groups, which speaks to the consistency Bradford Hill criterion [77].Hospitalization rates are also consistent between the Zelenko [2,9,10] and Procter case series [11,12], but there is a clear discrepancy with the hospitalization rates reported in the Raoult treatment case series [13].We believe that the reason for the discrepancy is that both Zelenko and Procter explicitly aimed to prevent hospitalizations due to the poor outcomes of the inpatient treatment protocols used in the United States.In Marseille, France, Raoult had the option of using his IHU Méditerranée Infection hospital for short hospitalizations, in order to closely monitor his more concerning cases.
In Table 2, we show the results of comparing the Zelenko April 2020 [9] and Zelenko June 2020 [10] case series against both the original DSZ control group [2] as well as the alternative control group from Israel [61].The confidence intervals were calculated using  [11,12], and Raoult's high risk (older than 60) treatment group [13].The table also lists the same data for the control group in the DSZ study [2], the untreated group in the Israeli study [61], and the control group in the Raoult study [13].[9], and Zelenko's complete June 2020 data set [10] against the low risk and high risk patient control groups in the DSZ study [2] and the Israeli study [61].The p-values where there is a failure to establish 95% confidence are highlighted.[11,12], and Raoult's high risk (older than 60) treatment group [13].In parenthesis, we also display the corresponding higher random selection bias thresholds.

Study
Woolf's formula [83,84].Although in the original DSZ study [2] mortality rate reduction was not statistically significant, we have found that comparing either the Zelenko April 2020 case series [9] or the June 2020 case series [10] against either control group, gives more than 90% mortality rate reduction, which is also statistically significant in terms of both p-value and confidence interval.Likewise, we see at least 90% hospitalization rate reduction when the Zelenko April 2020 case series or Zelenko June 2020 case series is compared against the DSZ control group, which is statistically significant as well.Because the control groups consist of a combination of both low-risk and high-risk patients, whereas the treatment groups consist of only high-risk patients, the resulting comparisons are biased towards the null, and are thus underestimating the actual efficacy of the respective treatment protocols.This comparison is compelling, due to the consistency between the two control groups, as evidence in favor of showing the existence of treatment efficacy.However, from a methodological standpoint, it may not be convincing enough by itself, in terms of measuring the extent of treatment efficacy.
We have also calculated the efficacy threshold for mortality rate reduction and hospitalization rate reduction corresponding to the case series by Zelenko [2,9,10], Procter [11,12], and Raoult [13].The calculations are shown in the supplementary material document [27].The results are tabulated in Table 3.We display the efficacy thresholds for 95%, 99% and 99.9% confidence, which are calculated as the upper end points of the corresponding Sterne interval [71] and, in parentheses, we display the corresponding random selection bias thresholds.We use precision of 0.1% for most case series, except for the two largest ones, Procter II [12] and Raoult [13], where we use 0.01% precision.
Each threshold corresponds to a mathematically rigorous conditional statement about rejecting the null hypothesis that the corresponding early outpatient treatment protocol is ineffective.For example, the 1.8% efficacy threshold corresponding to 95% confidence for rejecting the null hypothesis in the Zelenko April 2020 case series [9] corresponds to the following statement: if the expected mortality rate for an equivalent cohort without early outpatient treatment exceeds 1.8%, then the null hypothesis can be rejected with at least 95% confidence.Similar statements can be formulated for each efficacy threshold metric on Table 3.Likewise, the 4.0% random selection bias threshold for the Zelenko April 2020 case series [9] corresponds to the following statement: If the observed mortality rate, at the population level, for high risk patients, classified as such using the same selection criteria as in the treatment case series, exceeds 4.0%, then we can be 95% confident that the observed signal of efficacy  cannot be attributed solely to random selection bias, and we can also reject the null hypothesis with at least 95% confidence.Similar statements are implied from all of the other random selection bias thresholds reported on Table 3.These statements are mathematical facts.However, to complete the inference argument, they need to be paired with an inevitably subjective statement that provides an estimate, or at least a lower bound, on the expected mortality or hospitalization rates of similar cohorts without early outpatient treatment.Secondarily, we need an inference about the intervals of mortality or hospitalization rates, in the absence of early outpatient treatment, in order to do the Bayesian adjustment of the efficacy thresholds.

Analysis of mortality rate reduction efficacy
To establish that early treatment protocols result in mortality rate reduction, when administered to high-risk patients, we recall that patients have been classified as high-risk based on the following three categories: (1) old age; (2) comorbidities or obesity (with BMI ≥ 30kg/m 2 ); (3) shortness of breath upon presentation.The age threshold for high risk classification is age ≥ 60 for the Zelenko [2,9,10] and Raoult [13] case series, and age ≥ 50 for the Procter [11,12] case series.The high-risk treatment groups for the Zelenko [2,9,10] and Procter [11,12] case series include the demographic distribution of all three categories of high-risk patients, whereas in the Raoult [13] case series we have included only age ≥ 60 patients.Our approach, in the following, is to lower bound the mortality rate, in the absence of early outpatient treatment, separately for each of the three high-risk patient categories.Then, the common lower bound becomes applicable to any demographic distribution of the three categories.To establish the existence of treatment efficacy, it is sufficient for this lower bound to exceed the corresponding thresholds of Table 3.In the following, we shall now consider the mortality rate for each of the three high-risk patient categories separately.
With regards to the first category of patients classified as high-risk due to old age, the earliest data from China [60], as of February 11, 2020, estimated a minimum of 3.6% mortality rate for patients older than 60 and a minimum of 1.3% mortality rate for patients older than 50 (see Table 4).These numbers are consistent with numbers from China [58] and Italy [59] as of March 17, 2020 (see Table 5).We can also estimate the mortality risk of the first category of high risk patients (age ≥ 60 or age ≥ 50) using adjusted estimates by the CDC [62-64] of COVID-19 deaths per symptomatic cases.The CDC report attempts to adjust for the differences in underreporting of symptomatic illness, hospitalizations, and deaths, and it is based on reports ranging from February 2020 to September 2021.The raw data and a copy of the CDC report website are given in our supplementary material document [27].From that, we calculate for the age ≥ 50 group a mortality rate of 2.26% (95% CI:   5.52%).We observe that the stratification of mortality risk with respect to age is consistent between three distinct geographical regions.
The second category of high risk patients are patients with comorbidities regardless of age.In Table 6, we show case fatality rates with respect to comorbidities (i.e.cardiovascular disease, diabetes, respiratory disease, hypertension, cancer), based on data from China [58] in the period up to February 11, 2020, and additional data from Israel [61], with patients diagnosed in the period up to April 16, 2020 and deaths recorded up to July 16, 2020.There is variability in mortality rates from 5% to 15%.The Israeli data appear to show higher mortality rates than the data from China, and the reason for that could be that the Israeli study [61] accounted for the time lag between patient diagnosis and death.Nevertheless, with respect to using 5% as a lower bound mortality rate for high-risk patients with comorbidities, the available data from both locations is consistent.
These studies do not account for the mortality risk from obesity, and do not account for the mortality risk corresponding to the third category of high-risk patients that present with shortness of breath.A collaborative study by Risch and a research group in Brazil [85] found, using multivariate regression analysis, that both obesity and dyspnea pose a higher mortality risk than heart disease (see Table 2 of Ref. [85]), therefore, we expect that they both lie in the same 5% to 15% interval as patients with other comorbidities.
For the case of obesity, as a mortality risk factor, this conclusion is also supported by more recent meta-analysis [86], showing that obesity is a greater mortality risk factor than diabetes and hypertension, and one that increases with increasing BMI.A study of 148,494 patients across 238 hospitals by the CDC [87] also confirms that obesity is an increasing mortality risk factor with increasing BMI.It is known that obesity is associated with increased levels of the inflammatory cytokines TNF-α (tumor necrosis factor alpha), IL-1β (interleukin-1-beta), and IL-6 (interleukin 6), produced by macrophages in the adipose tissue [88].A study of 9390 hospitalized patients in Abu Dhabi, United Arab Emirates, has found that patients with severe COVID-19 symptoms, requiring intensive care, had significantly elevated IL-6 biomarker relative to patients that presented with mild or moderate symptoms [89].An earlier meta-analysis [90] has also confirmed that the IL-6 biomarker is associated with severe progression of the COVID-19 disease.Consequently, there is a very compelling biological mechanism that explains why obesity is a severe risk factor for progression of the disease to the COVID-19 pneumonia phase, requiring a high risk classification and immediate early outpatient treatment.
For the case of patients presenting with shortness of breath, it is important to appreciate the fact that, without an early outpatient treatment intervention, such presentation implies that the disease is progressing beyond the viral replication phase, into the COVID-19 pneumonia phase, soon to be followed with the thromboembolic stage, oxygen desaturation, and hospitalization.It is thus self-evident that these patients should be classified as highrisk and treated immediately.Assuming that most of such patients will be hospitalized without outpatient treatment, we can also estimate the corresponding mortality risk, in the absence of outpatient treatment, by looking at the conditional probability of death, assuming hospitalization has already taken place.A study by the Houston Methodist Hospital [91] has shown an average mortality rate of 5.8% for hospitalized patients between March 2020 and July 2020, in spite of the use of hydroxychloroquine and anticoagulants.Furthermore, the study reports 12.1% mortality rate, for hospitalized patients between March 13th 2020 and May 15th 2020, and 3.5% mortality rate between May 16th 2020 and July 7, 2020, corresponding to two consecutive surges, noting that the second surge targeted younger patients than the first surge.A prospective multicenter study [92] from Italy of 1050 patients in the Coracle registry, between February 22, 2020 and April 1, 2020, showed an overall 13% average mortality rate, with 7.4% mortality rate for hospitalized patients that do not require supplemental oxygen or invasive ventilation, 12.8% mortality rate for hospitalized patients that require supplemental oxygen, and 22.9% mortality rate for hospitalized patients that are invasively ventilated.
Based on the above arguments, we can lower bound the untreated mortality risk by 3.5% for the age ≥ 60 demographic and by 5.0% for the high risk patients with comorbidities, obesity, or shortness of breath presentation.For the age ≥ 50 demographics, we have an expected 2.26% mortality rate for the United States demographic distribution, as estimated by the CDC.The common lower bound for high risk patients in all three categories of the Zelenko case series is thus estimated as 3.5%, it exceeds the efficacy thresholds for both the Zelenko April 2020 [9] and the Zelenko June 2020 [10] case series, and it also exceeds the random selection bias threshold for the Zelenko June 2020 case series [10] with F max ≥ 1.77.The same untreated mortality rate lower bound of 3.5% applies to the Raoult case series [13], which exceeds the efficacy threshold 0.79% and the random selection bias threshold 1.40% by a wide margin with F max ≥ 2.55.Finally, using the CDC mortality rate of 2.26%, which includes a minority of treated patients and a majority of untreated patients for the age ≥ 50 demographic in the United States, as a conservative untreated mortality rate lower bound for the Procter case series, we find that it exceeds the efficacy threshold for both Procter I [11] and Procter II [12] case series, and also exceeds the random selection bias threshold for the Procter II case series [12] with F max ≥ 1.24.We stress that the estimates for F max are lower bounds, and note that the results from the Raoult case series [13] are particularly robust against systemic selection bias.
A completely different approach is to compare the efficacy and random selection bias thresholds against the CFR for the entire population [93].The CFR for the United States and France is displayed on Fig. 6 for the time period between April 2020 and October 2021.During 2020, the CFR ranged from 2% to 6% in the United States and from 2% to 16% in France.In both countries, the CFR converged to 1.7% during 2021 and remained roughly constant, with very small oscillations throughout 2021.The minimum value of 1.7% exceeds the mortality rate reduction efficacy thresholds for the Zelenko June 2020 [10], Procter II [12], and Raoult case series [13].It also exceeds the random selection bias threshold for the Raoult case series [13].Using 2.0% as the minimum CFR during 2020, we note that it exceeds the random selection bias threshold for the Procter II case series [12] and it equals the random selection bias threshold for the Zelenko June 2020 case series [10].Taking the CFR at face value, this is a very strong signal of efficacy, because the CFR includes asymptomatic, low-risk, and high-risk patients, regardless of whether they received early treatment, against solely high-risk patients in the treatment groups of the respective case series.This comparison strongly biases against being able to reject the null hypothesis, but we are still able to do so.
In particular, we note that in the United States, the CFR ranged from 2% to 6% during 2020, which lies above the 1.8% mortality rate reduction efficacy threshold for Zelenko April 2020 case series [9].This is an indicator that the preponderance of evidence was in favor of adopting Zelenko's triple-drug protocol at that time, on an emergency basis, but was nonetheless not officially adopted in the United States for outpatients [94].By June 2020, the respective efficacy threshold decreased to 1.0%, and the random selection bias threshold decreased to 2.0%, while the CFR was still in the neighborhood of 3.0%.Thus, the evidence in favor of adopting the Zelenko triple drug therapy had just crossed over to the clear and convincing evidentiary standard by the summer of 2020.Raoult's data [13] was available by December 2020, and strongly corroborate Zelenko's results [9,10,28].In particular, the F max ≥ 2.55 lower bound, obtained for the Raoult case series [13], means that even if there is systemic selection bias in favor of selecting healthy high-risk patients by a factor of 2.55, we can be 95% confident that the observed signal of benefit cannot be explained by systemic selection bias alone.

Analysis of hospitalization rate reduction efficacy
In Table 3, we see that the 95% efficacy thresholds for hospitalization rate reduction range from 2.7% to 4.1% for all case series, with the exception of the DSZ case series, where it is at 7.0% due to the smaller sample size.Likewise, the random selection bias thresholds for hospitalization rate reduction with 95% confidence range from 4.2% to 7.3% for all case series, except for the DSZ case series [2].
These thresholds can be compared against the following empirical data.At the beginning of the pandemic, on data from China until February 11, 2020, there was an initial estimate [60] that the probability of hospitalization for a high-risk age ≥ 60 cohort would range from 10% to 18%.The control group from Zelenko's study [2], consisting of both low and high-risk patients, again at the beginning of the pandemic here in the United States, reported 377 patients with 58 hospitalizations, corresponding to 15% hospitalization rate.In the Cleveland study [95], which was used to train a predictive model for the risk of hospitalization and death based on patient medical history, the entire dataset consisted of a total of 4,536 patients between March 8, 2020 and June 5, 2020.There were 582 hospitalizations corresponding to 21% hospitalization rate.In the Mass General Brigham hospital study [96], from a cohort of 12,347 patients that tested positive, there were 3,401 hospitalizations between March 4, 2020 and July 14, 2020, corresponding to a 27% hospitalization rate.This was also a cohort that included both low-risk and high-risk patients.The CDC adjusted data [62][63][64] between February 2020 and September 2021, estimate 13.79% (95% CI: 17.09% to 28.52%) hospitalization probability for the age ≥ 50 group, given a symptomatic infection.For the age ≥ 65 cohort, this estimate increases to 22.09% (95% CI: 17.09% to 28.52%) Overall, our observation is that we tend to see numbers ranging from 10% to 28% with substantial variability between various cohorts, all of which were not given early outpatient treatment.On the other hand, we see that the case series of high risk patients shown in Table 3, have efficacy thresholds for hospitalization rate reduction ranging from 2.7% to 4.1%, which have a substantial separation from the 10% to 28% interval.Most remarkably, the hospitalization rate reduction random selection bias thresholds also have a substantial separation from the 10% to 28% interval.We interpret this big gap between the two intervals as strong evidence of the existence of hospitalization rate reduction efficacy as a result of the respective early outpatient treatment protocols in the Zelenko April 2020 [9], Zelenko June 2020 [10], Procter I [11], Procter II case series [12]

Bayesian analysis of efficacy thresholds
We shall now assess whether the efficacy thresholds need to be increased, using the Bayesian technique described in Section 3, in order to control the false positive rate.In Table 7, we have calculated the logarithmic Bayesian metric b(x 0 , p 2 ), given by Eq. ( 26), for the mortality and hospitalization rate reduction efficacy thresholds corresponding to 95% confidence, using a range of values of p 2 for the purpose of sensitivity analysis.The calculation details are available in our supplementary material document [27].Recall from Section 3 that p 2 corresponds to our sense of the worst possible probability of the respective adverse outcome (hospitalization or death) in high-risk patients, in the absence of early outpatient treatment.As such, 5% to 10% is a typical range for mortality rates in untreated high-risk patients, making p 2 = 5% a highly conservative choice.We did not consider values higher than 10%, even though worse probabilities are possible, because for p 2 > 10%, we see that all logarithmic Bayesian factors already satisfy b(x 0 , p 2 ) ≥ 2. We have also looked at p 2 = 2%, which is obviously entirely unrealistic, because it corresponds to the mortality rate of the Raoult control group [13], where some partial treatment was given.Likewise, for the hospitalization rate reduction efficacy thresholds, we have used the values p 2 = 10%, 15%, 20% based on our expectation of a typical 10% to 28% range for the probability of hospitalization, in the absence of early outpatient treatment.We did not consider p 2 > 20% since almost all of the logarithmic Bayesian factors satisfy b(x 0 , p 2 ) ≥ 2 at p 2 = 20%.[11,12], and Raoult's high risk (older than 60) treatment group [13].We highlight the Bayesian thresholds that exceed the frequentist thresholds.[11,12], and Raoult's high risk (older than 60) treatment group [13].We highlight the Bayes factors that violate the condition b(x 0 , p 2 ) ≥ 2.7.
In Table 8, we compare the efficacy thresholds for rejecting the null hypothesis with the corresponding 95% confidence Bayesian thresholds, obtained by the inequality b(x 0 , p 2 ) ≥ 2 for accepting the alternate hypothesis.For the DSZ study [2], we see that the corresponding Bayesian thresholds for hospitalization rate reduction range from 7.2% to 9.5%, which lie above the 7.0% threshold obtained via the p-value.So, the most cautious course of action is to opt for the 9.5% threshold, which is still below most of our estimates for hospitalization probability of untreated patients.For the DSZ study [2], for both p 2 = 2% and p 2 = 5%, the logarithmic Bayesian factor for mortality rate reduction does not go above the decisive threshold for any value of x with a/N ≤ x ≤ p 2 , consequently the corresponding Bayesian thresholds are undefined, and for p 2 = 10% we find a Bayesian mortality rate reduction threshold of 3.9% which is slightly larger than the p-value threshold of 3.8%.For the Procter I case series [11], there is a weak indication that the 4.1% efficacy threshold for hospitalization rate reduction might have to be increased to 4.3%, and the mortality rate reduction threshold increased from 1.7% to 1.9%.Likewise for the Procter II case series [12], an increase of the hospitalization rate reduction efficacy threshold from 3.6% to 3.7% is weakly indicated.Both adjustments are negligible and inconsequential.For the Zelenko April 2020 [9] and Zelenko June 2020 [10] case series, where the sample sizes are much larger, we see that the overall trend is for the Bayesian thresholds to be far more lenient than the ones obtained via the p-value.This is possibly attributed to a very strong signal of efficacy in the data.
It is interesting to repeat the Bayesian analysis on the efficacy thresholds for mortality rate reduction and hospitalization rate reduction corresponding to 99% confidence and 99.9% confidence.We have seen that the Bayesian adjustments to the 95% confidence efficacy thresholds, when they are needed, are very small, so the relevant question is whether this pattern continues when the demanded confidence increases to 99% or 99.9%.Table 9 and Table 10 show the values of the logarithmic Bayesian factor b 2 (x 0 , p 2 ) at the mortality and hospitalization efficacy thresholds for 99% and 99.9% confidence, as determined solely from the p-value, and for various values of p 2 , as previously discussed.Note that for Table 9, the decisive Bayesian factor threshold corresponding to 99% confidence is b 2 (x 0 , p 2 ) ≥ 2.7.Likewise, in   [11,12], and Raoult's high risk (older than 60) treatment group [13].We highlight the Bayes factors that violate the condition b(x 0 , p 2 ) ≥ 3.7.
to 99.9% confidence is b 2 (x 0 , p 2 ) ≥ 3.7.We see that the logarithmic Bayesian factors are either above or near their respective thresholds.Likewise, in Table 11 and Table 12, we are comparing the mortality and hospitalization rate reduction efficacy thresholds determined via the p-value against the corresponding efficacy thresholds determined using the logarithmic Bayesian factor b 2 (x 0 , p 2 ) for 99% and 99.9% confidence correspondingly.We see that the Bayesian perturbations to the efficacy thresholds are mostly negligible for both 99% and 99.9% confidence, continuing the similar pattern that we have observed for the 95% confidence efficacy thresholds.
Based on these results, we conclude that for the case series under consideration, the Bayesian adjustments to the efficacy thresholds for mortality and hospitalization rate reduction are negligible, and they do not impact the analysis of the preceding sections.

Discussion and Conclusions
Our findings fully support risk stratification in the management of acute COVID-19, with the intent of reducing the intensity and duration of symptoms and by that mechanism, lower the risk of hospitalization and death.Although COVID-19 is generally known as a respiratory disease, there is an accumulation of evidence [42,97,98] that it is also, if not primarily, a vascular disease, with endothelial injury having a major role in sustained permanent injuries, hospitalizations, and death.The spike protein has been shown to damage the vascular endothelial cells [42] by downregulating ACE2, thereby inhibiting mitochondrial function, and by impairing the bioavailability of nitric oxide to endothelial cells.The spike protein also triggers immune dysregulation, triggering endothelial cells to transition to an activated immune response state, which causes both macrovascular and diffuse microvascular thrombosis, leading to myocardial injury and other organ damage [97,98].Early outpatient treatment, using multiple drugs in combination, prevents these adverse outcomes by stopping viral replication at the first phase of the illness, and mitigating the injuries caused by the hyper inflammatory COVID-19 pneumonia phase and the subsequent thromboembolic phase.
One of the lessons learned during the COVID-19 pandemic is that some of the key discoveries for the successful treatment of a novel disease emerge from the experience of    [11,12], and Raoult's high risk (older than 60) treatment group [13].We highlight the Bayesian thresholds that exceed the frequentist thresholds.
the frontline doctors that are directly confronted with the need to find a way to help their patients.A conceptual understanding of the biological mechanisms via which a disease agent infects and harms patients can be used to rapidly identify therapeutic strategies, based on repurposed drugs, that may counter the disease and its sequelae.In the absence of proven and effective treatment protocols, physicians have a duty to treat, with informed consent from their patients, requiring an effort to innovate and/or adopt such novel therapeutic strategies, in order to immediately reduce hospitalizations and deaths and to alleviate suffering [66].Although the orthodox approach is to consider possible treatments as unproven until they are validated with an RCT, in real life, it is possible to be confronted with a situation where the real-world observational data that result from clinical practice are sufficiently strong to justify the immediate adoption of a newly-discovered treatment protocol, and to raise the ethical concern of whether it is appropriate to even conduct the RCT, and deny treatment to a very large cohort of patients, in order to form a control group [18].Consequently, there is a need to be able to analyze the quality of observational data in a statistically rigorous way.
We have provided a hybrid statistical framework for assessing observational evidence that combines both frequentist and Bayesian methods; the frequentist methods aim to control the p-value for rejecting the null hypothesis, whereas the Bayesian methods aim to control the false positive rate.The two methods are complementary and not mutually exclusive.We have also proposed a formalism for assessing the signal of efficacy with respect to both random and systemic selection bias, and explain how it can be integrated with the proposed hybrid frequentist-Bayesian method.We stress that the method aims to answer only the question of whether we are confident that the proposed treatment protocol works, in order to facilitate the binary choice of whether or not it should be adopted.An exact measurement of the efficacy is not our primary concern; we only need to establish positive as opposed to null or negative efficacy.
The main weakness of the proposed statistical methodology is that it has to be limited only to the assessment of treatments that are based on repurposed medications [17] with known acceptable safety.It would be highly inappropriate to use this approach on new medications, or other countermeasures, where the balance of risks and benefits is yet to be determined.Furthermore, the analysis of the treatment group case series needs to be compared with a model that can, at minimum, lower-bound the probability of adverse outcomes without treatment, based on our prior knowledge.On the other hand, the development of this model can be done independently from the analysis of the treatment group case series.
One way in which our approach deviates from the usual way of doing things is that we are using the proposed statistical methodology to assess the efficacy of the entire treatment algorithm against supportive care.Both of the original Zelenko protocol [2] and the more enhanced McCullough protocol [14][15][16] are examples of sequenced multidrug treatment protocols.Furthermore, both protocols are algorithmic, in the sense that treatment is customized to the individual patient based on the patient's medical history and the response to treatment.For the case of the Zelenko protocol [2] this is done via the risk stratification of patients to low-risk and high-risk patients.For the case of the McCullough protocol [14][15][16], this is done both by risk stratification and also by accounting for the progression of the illness through the three distinct stages and response to treatment.Consequently, the immediate goal is not to establish that any one particular drug is effective.The goal is to establish that the treatment algorithm itself is effective, so that it can be deployed rapidly on an emergency basis and be subsequently improved over time with further research.
A possible theoretical criticism is that the particular case series that we have analyzed may have selection bias.This is mitigated, to some extent, by having reported case series from three different treatment centers, two in the United States and one in France, with consistent mortality rates; this consistency is compelling statistical evidence against geographic selection bias.More importantly, for both of the Zelenko [2,9,10] and Procter [11,12] case series, we have two consecutive reports over two consecutive time intervals replicating the hospitalization and mortality rate reduction outcomes, and these replications are additional statistical evidence against reporting selection bias.Furthermore, the treatment protocols have known biological mechanisms of action that have been reviewed in Section 1. Finally, we have introduced the idea of random selection bias thresholds that can be used to account for random selection bias.For the Zelenko June 2020 [10], Procter II [12] and Raoult [13] case series, we can have 95% confidence that random selection bias cannot be entirely responsible for the positive signal of benefit in mortality and hospitalization rate reduction.Furthermore, for the Raoult case series [13], systemic selection bias that favors the selection of high-risk healthy patients by a factor of up to 2.55 (a conservative estimate) is not sufficient to overturn the positive signal of efficacy.
The case series that we have analyzed in this paper add up to a total of 3164 high-risk patients.It is currently estimated that the total number of high-risk patients that have been treated with early outpatient treatment protocols throughout the United States may exceed this number by one or two orders of magnitude [99].Unfortunately, no resources have been allocated to study this data by our public health agencies, but we can make some suggestions about how such an analysis could be carried out.One idea for quickly analyzing a very large dataset is to extract the age > 50 and/or age > 65 part of the database, calculate the corresponding efficacy thresholds for hospitalization rate reduction and mortality rate reduction, and compare them with the CDC estimates [62][63][64] for number of hospitalizations and deaths for these age groups over the total number of cases with symptomatic illness.Given a large enough data set, it would also be interesting to further risk-stratify the age > 50 and/or age > 65 cohorts with respect to number of days between initial symptoms and initiation of treatment, and calculate the efficacy thresholds as a function of the delay in initiating treatment.This analysis would inadvertently not include younger patients that are high risk due to comorbidities or shortness of breath presentation, however, it has the advantage that it can be carried out quickly with limited resources.
Furthermore, it would be useful to break down the case series data in sequential time intervals corresponding to different waves and different variants of the SARS-CoV-2 virus.The case-series considered in this paper are limited to 2020, before the vaccine roll out, during which natural immunity held up at preventing reinfection [100] up until the emergence of the omicron variants near the end of 2021, which broke through natural immunity from previous variants, but also provided back immunity to the delta variant [101].Nevertheless, in terms of general methodology, it would also be useful to subject any results to sensitivity analysis with respect to host immunity (i.e.history of previous infection and or vaccination status), as needed.Analyzing the data from several more treatment centers, that have adopted early outpatient treatment protocols for high-risk patients, would further mitigate the potential for selection bias.
With substantial resources, a more detailed analysis, based on the virtual control group methodology [18], is possible that can consider the entire dataset and actually estimate the treatment efficacy.Given a case series of N patients, one can input the medical history of each patient to the Cleveland Clinic calculator [95] and use their mathematical model to predict the probability of hospitalization and death for each patient individually.Knowing the corresponding sequence of probabilities q = (p 1 , p 2 , . . ., p N ) for an adverse outcome (hospitalization or death) for all patients, the probability pr(N, a|q) of seeing a adverse outcomes follows a Poisson binomial distribution [102], and it can be substituted to Eq. ( 2) in order to calculate the p-value for rejecting the null hypothesis of no treatment efficacy.Because the probability of an adverse outcome is known for each patient, note that there is no need to worry about selection bias or calculating any efficacy thresholds, and it is possible instead to directly calculate the p-value for rejecting the null hypothesis.. Furthermore, since the mean of the Poisson binomial distribution is the average q = (1/N)(p 1 + p 2 + . . .+ p n ) of the individual probabilities, one can calculate the risk ratio via the equation RR = a/(qN).To conduct the corresponding Bayesian analysis, we can assume that the effect of the early outpatient treatment is to reduce the probabilities of adverse outcome by a numerical factor x to xq = (xp 1 , xp 2 , . . ., xp N ) with 0 ≤ x ≤ 1 and use the Poisson binomial distribution pr(N, a|xq) in Eq. ( 29) and Eq. ( 32) to calculate the corresponding integrals needed for the Bayesian factor.All other aspects of the Bayesian analysis would remain the same, except that the hypothesis being validated would not concern any efficacy thresholds, but would instead concern hypotheses about the actual efficacy x of the early outpatient treatment protocol.
That said, we do not mean to imply that such a detailed analysis is necessary in order to greenlight the use of the investigated early outpatient treatment protocols for COVID-19.However, we wish to highlight that such a detailed analysis is indeed possible to carry out, using existing data and prior mathematical modeling, in order to validate the McCullough protocol.A limitation of the Cleveland Clinic calculator is that it should ideally be used in conjunction with case series over time intervals that are aligned with the data set used to train the calculator's mathematical predictive model.Because the Cleveland Clinic calculator used data collected between March 4th 2020 and July 14th 2020 it can certainly be applied to case series up until July 2020.However, we believe that it can also be extended up until and including the Delta variant, that became dominant towards the end of 2021, since all of these subsequent variants were just as hard to treat or harder than the initial waves in 2020.
Notwithstanding the hesitancy confronting the adoption of early treatment protocols for COVID-19 [94,103], everything that we have been through during the last two years vindicates the position of Frieden [54] that there is an urgent need to leverage and overcome the limitations of real-world evidence data, in order to deploy a timely life-saving response to urgent health issues.Although case series real-world data is viewed as imperfect from an epidemiological viewpoint, this viewpoint is predicated on the goal being the unbiased measurement of treatment efficacy.We have explained how case series data of high-risk patients for a treatment protocol based on repurposed medications, combined with our prior knowledge of population-level probabilities for adverse binary outcomes, can be used to answer the simpler question of whether or not the treatment protocol actually works (i.e.showing only the existence of efficacy), in order to make the up or down decision about whether or not to adopt it.The proposed statistical framework also provides a rigorous technique for quantifying the quality of this data.This can help to make objective policies on the appropriate thresholds for adopting such treatments as a standard of care.There is still an opportunity to learn much by analyzing data from various treatment centers here in the United States that treated COVID-19 with early outpatient treatment protocols, as well as treatment centers from all around the world.It is also necessary to reflect on and develop policies and procedures for leveraging the direct experience of frontline doctors treating patients, towards an agile and effective response to future epidemics and pandemics.papers by Frieden [54] and Deaton and Cartwright [55] and for the suggestion that we look into the coverage probability, at an early stage of our work.We also thank Marc Rendell for suggesting that we introduce a quantitative technique to mitigate selection bias, and an anonymous reviewer for very insightful comments on our manuscript throughout it's development.Finally, we wish to acknowledge Lawrence Huntoon for encouraging us to look into a Bayesian approach, at a very early stage of this research project.
The first inequality step follows from S 0 (N, x, p 0 ) ⊆ S(N, x, p 0 ).The next step follows from Eq. (A17), then we apply the definition of the coverage probability, and the last inequality is based on the assumption that the Sterne interval has conservative coverage.We conclude, therefore, that there is at least 1 − p 0 probability that the number n of adverse events without treatment would have been in the interval m 1 (N, x, p 0 ) ≤ n ≤ m 2 (N, x, p 0 ) for an equivalent cohort of N patients, chosen randomly out of the general population.Now let us consider the random selection bias threshold, which is calculated as the minimum number x 1 that satisfies the implication x > x 1 (N, x 0 , p 0 ) =⇒ p(N, x 0 N , x) < p 0 , (A22) with x 0 N the natural number obtained if we round up the number x 0 N. To show that this threshold works, we need to show that x > x 1 (N, x 0 , p 0 ) implies that x 0 < m 1 (N, x, p 0 )/N, which is easily done with the following argument: First, we see that Eq. (A22) implies that the number x 0 N is outside of the set S(N, x, p 0 ), which in turn means that either x 0 N < m 1 (N, x, p 0 ) or x 0 N > m 2 (N, x, p 0 ).To rule out the second possibility, we observe that if x exceeds the random selection bias threshold x 1 (N, x 0 , p 0 ), then it also exceeds the efficacy threshold x 0 and thus x > x 0 .We also observe that the population level probability x of an adverse outcome for untreated high risk patients has to be inside its own confidence interval, i.e. m 1 (N, x, p 0 )/N ≤ x ≤ m 2 (N, x, p 0 )/N.Using these inequalities, it follows that and therefore we can rule out x 0 N > m 2 (N, x, p 0 ).We conclude that x 0 N < m 1 (N, x, p 0 ), and therefore x 0 = x 0 N/N ≤ x 0 N /N < m 1 (N, x, p 0 )/N, (A25) which gives us x 0 < m 1 (N, x, p 0 )/N.Since the true rate x of an adverse event for an equivalent cohort of N patients can be bound between m 1 (N, x, p 0 )/N ≤ x ≤ m 2 (N, x, p 0 )/N with 1 − p 0 confidence, we can be assured that, in spite of any random selection bias, x exceeds the efficacy threshold x 0 with 1 − p 0 confidence.This concludes the argument that justifies the calculation of the random selection bias threshold given by Eq. (7).
To calculate the selection bias threshold x 1 (F|N, x 0 , p 0 ) corresponding to systemic selection bias with factor F, we can simply recycle the preceding argument by considering a hypothetical population that has the systemic bias effect build into the proportions of healthy versus unhealthy patients, and then calculating the random selection bias threshold for this hypothetical population.Let L = x/(1 − x) be the likelihood ratio of randomly selecting unhealthy vs healthy patients, if there is no systemic selection bias.If there is some systemic selection bias, this ratio is modified into L/F.Consider a hypothetical population where the probability of an adverse outcome for high-risk patients without treatment is x such that L/F = x/(1 − x).With basic algebra, we see that Selecting randomly from this hypothetical population is statistically equivalent to selecting with systemic bias F from the actual population, in the sense that in both cases we obtain the same confidence interval for the probability x .This equivalence implies that x 1 (F|N, x 0 , p 0 ) ≤ x ≤ 1 =⇒ x 0 < m 1 (N, x, p 0 )/N.(A27) Furthermore, from the definition of the random selection bias threshold, we have x 1 (N, x 0 , p 0 ) ≤ x ≤ 1 =⇒ x 0 < m 1 (N, x, p 0 )/N.(A28)

Figure 1 .
Figure1.This flowchart shows the suggested interactions between medical doctors, public health agencies, and the proposed statistical methodology that are needed, in order to implement an emergency epidemic or pandemic response that leverages the direct experience of frontline medical doctors treating their patients.

Figure 3 .
Figure 3. Coverage probability for the Sterne interval [71] with sample sizes N = 20 and N = 100.The black curve corresponds to N = 20 and the blue curve, which is situated below the black curve, corresponds to N = 100.The coverage probabilities were calculated using 0.01 increments on the horizontal axis.

Figure 4 .
Figure 4. Comparison of the coverage probability for the Clopper-Pearson interval [75] versus the Sterne interval [71] with sample size N = 100.The black curve shows the coverage probability for the Clopper Pearson interval, and the blue curve, which is situated below the black curve, shows the coverage probability for the Sterne interval.The coverage probabilities were calculated using 0.01 increments on the horizontal axis.

Figure 5 .
Figure 5.Relationship between p-value and expected mortality rate for high risk patients without early treatment, based on the case series data from Procter's dataset of 869 high-risk patients[12].The zigzag curve follows p(N, a, x) given by Eq. (2), whereas the smooth curve approximates the right tail terms in the p-value sum by replacing them with the left-tail terms on the horizontal axis.

Figure 6 .
Figure 6.Cumulative case fatality rate in the United States and France between April 2020 and November 2021.
(11)1 )/p(H 0 ) after seeing the data D. Here, p(D|H 1 ) is the probability of seeing the data D if H 1 is true and p(D|H 0 ) is likewise the probability of seeing the data D if H 0 is true.To interpret the meaning of the Bayesian factor, the following argument is used to calculate the posterior probabilities p(H 1 |D) and p(H 0 |D) in terms of B(D|H 1 , H 0 ) and b(H 1 , H 0 ) = p(H 1 )/p(H 0 ).We assume that H 0 , H 1 satisfy p(H 0 ) + p(H 1 ) = 1 and p(H 0 |D) + p(H 1 |D) = 1.Combining the second equation with Eq.(11)and Eq. (12) gives the Bayes theorem p(D) = p(D|H 0 )p(H 0 ) + p(D|H 1 )p(H 1 ),

Table 1 .
[9]e series list: The table lists the total number of patients, the subset of high risk patients that were treated with a sequenced multidrug regimen, number of patients that were hospitalized, and number of deaths, for the following case series: Derwand-Scholtz-Zelenko study treatment group [2], Zelenko's complete April 2020 data set[9], Zelenko's complete June 2020 data set [10], Procter's observational studies

Table 2 .
[2]ct Fisher test comparing the mortality rate reduction and hospitalization rate reduction between the high risk patient treated group the DSZ study[2], Zelenko's complete April 2020 data set

Table 4 .
[60]e Case Fatality Rate data, in the absence of early outpatient treatment, based on early data from China as of February 11, 2020, and published on March 30, 2020.[60].We highlight the high-risk age brackets with CFR ≥ 1.0%.
1.94% -2.61%).We cannot deduce an age ≥ 60 mortality rate from the CDC report, but note that the age ≥ 65 mortality rate, according to the CDC, is 4.79% (95% CI: 4.11% to

Table 5 .
[58,59]ase Fatality Rate data, in the absence of early outpatient treatment, based on early data from China and Italy as of March 17, 2020 and published on March 23, 2020[58,59].We highlight the high-risk age brackets with CFR ≥ 1.0%.

Table 6 .
[61] fatality rate based on early-stage analysis of COVID-19 outbreak in China in the period up to February 11, 2020[58]vs similar statistics from Israel published on September 7, 2020[61].

Table 10 ,
the decisive Bayesian factor threshold corresponding Bayes factors at the mortality rate efficacy thresholds Study 99.9% threshold p 2 = 0.02 p 2 = 0.05 p 2 = 0.1