Estimation of Population Prevalence of COVID-19 Using Imperfect Tests

Abstract: I formulate three basic biomedical/statistical assumptions that should ideally guide well-designed population prevalence studies of a present or past disease, including COVID-19. On the basis of these assumptions alone, I compute several probability distributions required for statistical analysis of testing data collected from a sample of individuals drawn from a heterogeneous population. I also construct a consistent asymptotically unbiased estimator of the population prevalence of the disease or infection from the collected data and derive a simple upper bound for its variance. All the results are rigorously proved and valid for any test for COVID-19 or other disease provided that the sum of the test's sensitivity and specificity is larger than 1. A few recommendations for the design of COVID-19 prevalence studies informed by the results of this work are formulated. The methodology developed in this article may prove applicable to diseases and conditions other than COVID-19 as well as in some non-epidemiological settings.


Introduction
Uncovering the true scale of the COVID-19 pandemic is critical for shaping public health policy, designing effective medical interventions, and planning economic, social, and educational activities. This requires conducting testing studies aimed at the assessment of the unknown prevalence of the ongoing or past infection in a given population. Knowledge of the prevalence parameter is also indispensable for estimation of the all-important infection mortality rate of COVID-19.
The prevalence of a current or past disease or infection in a population at a given time is defined as the fraction of the population that at the time of interest has or has had the disease or infection. The prevalence is time-dependent. Due to the fact that COVID-19 is highly contagious and has a relatively short incubation period, the prevalence of COVID-19 in a population may change on the time scale of weeks. Thus, to provide a meaningful estimate of the population prevalence of COVID-19, testing studies should be conducted during a short period of time.
Selection of an appropriate target population for a given prevalence study is essential for the study's scientific value. Such selection may be guided by the current knowledge of the epidemiology of the disease. Testing the entire population for COVID-19 or other diseases is expensive and usually impractical. This is why it is typically performed on a relatively small sample of individuals deemed to be representative of the population in terms of the likelihood of a present or past infection. The sample can be drawn randomly or recruited by other means and controlled for individual characteristics known to be associated with the rate of occurrence of the infection or severity of the disease in order to approximate the distribution of these characteristics in the target population.
The diagnostic quality of a test for a certain infection (or disease) is determined by its sensitivity α (i.e., the probability of obtaining a positive test result for an infected individual) and specificity β (i.e., the probability of a negative test result for an uninfected individual). Great effort is expended by the manufacturers of test kits to ensure that the sensitivity and specificity of their test remain stable across a wide range of testing conditions, which involves running the test on a large number of known true positive and true negative samples. This is why in this article I assume α and β to be fixed. In spite of the effort, however, the sensitivity and specificity of a test, especially a newly designed one, may display certain random or systematic variation depending on the testing site, tested individual, and other conditions. A Bayesian framework for studying the effects of such variation on population prevalence of COVID-19 was presented, in the context of critical analysis of the Santa Clara county study [2], in [6]. For a review of Bayesian modeling approaches to the assessment of accuracy of diagnostic tests, see [9].
Below, the reader will encounter references to various tests for COVID-19 infection including molecular amplification tests such as RT-PCR for SARS-CoV-2 and serological tests for its antibodies, such as IgA, IgM, and IgG that serve as biomarkers of the present or past infection. For a general reference on this subject, see [10].
In this work, I compute in closed form, starting with a homogeneous population, (1) the distribution of the number, n, of positive test outcomes that result from testing N individuals using a test with given sensitivity α and specificity β; (2) the distribution of the number of true and false positive test results conditional on the event that n positive test results were observed; and (3) the conditional expected value and variance of the true number of infected individuals among those tested given n positive test results, see Sections 3 and 4. In Section 5, an Equivalence Principle that enables an extension of all these results to heterogeneous populations is established. In Section 6, I bring the above results to bear on the design of a consistent estimator of the population prevalence of the disease (or infection) from the test results collected from a sample of individuals. Finally, in Section 6, I also provide a detailed justification of the mathematical form of the prevalence estimator and study its properties. This estimator was employed in [2] and used for the analysis of uncertainty propagation in [6].
The results of this work can be used for the assessment of plausibility of prevalence estimates reported by various population studies. Specifically, the prevalence of the current or past infection can be checked for self-consistency by comparing the observed number of positive cases with its theoretical prediction and by computing the expected number of false positive and false negative test results and associated probabilities. Additionally, the assumptions formulated in Section 2 may help optimize the design of future prevalence studies for COVID-19 and other diseases.
The Santa Clara county study [2] was criticized for using suboptimal tests for the presence of IgM and IgG antibodies that serve as respective biomarkers of current and past SARS-CoV-2 infection, with combined sensitivity α = 0.828 and specificity β = 0.995 on the grounds that these tests could have produced a large number of false positive and false negative results. I argue that this criticism is largely unfounded and show that the estimator of the true population prevalence, p, derived in the present work and also utilized in [2], is consistent and has small variance for a sufficiently large sample size for any α and β such that α + β > 1 and 1 − β < p < α. For example, if at the time of the study [2], the true COVID-19 seroprevalence in Santa Clara county was, say, between 1% and 10%, then the tests employed (and even those with the same specificity and a dismal sensitivity of 15%) would still be appropriate for a sufficiently large number of tested individuals, should the three assumptions formulated in Section 2 be met.
Because of the current acute interest in SARS-CoV-2 and COVID-19, I will be employing below the terminology and biological considerations pertaining to COVID-19 testing. However, the results of this work may prove applicable to testing for other diseases or conditions as well as in some non-epidemiological settings.
This article can be viewed as a tutorial on some basic mathematical and statistical aspects of testing and prevalence estimation. It represents an expanded and refined version of the author's preprint [11].

Basic Assumptions
I start by spelling out three assumptions, termed A, B, and C, that should ideally be met by a well-designed population prevalence study. Their discussion is tailored to testing for COVID-19 and uses studies [2][3][4][5] as an illustration. Like many general assumptions in probability and statistics, Assumptions A-C are largely empirically untestable. However, various kinds of empirical evidence as well as careful study design can strengthen the case for their validity. In the terminology of Henri Poincaré [12], these assumptions serve as neutral hypotheses that are not too restrictive yet enable building a rigorous quantitative framework for population prevalence estimation. Conversely, specific aspects of study design, discussed below, likely contravene these assumptions. Finally, the assumptions do not necessarily have to be met by the entire sample of subjects recruited for testing or those actually tested; rather, they may serve as a guide for selection of a subset of study participants whose valid test results will be used for population prevalence estimation.
A. Independence. The events of current or past infection in all tested individuals are stochastically independent.
Although this assumption is indispensable for rigorous statistical analysis of testing data, its violations in prevalence studies are fairly common. One particular reason is excessive inclusion of multiple members of the same household in a set of testing data used for prevalence estimation. The effects of such oversampling were clearly demonstrated by the study [3] conducted on 31 March-6 April 2020 in Gangelt, a community of around 12,500 people in the state of North Rhine-Westphalia, Germany. One of the aims of the study was to estimate the excess risk of contracting COVID-19 for someone who lives in the same household with an infected individual. It was found that, interestingly, the risk of such secondary infection increases by 28.1%, 20.2%, and 2.8% for households with two, three, and four people, respectively, relative to the 15.5% risk of the primary infection [3]. Given that 919 participants of the Gangelt study for whom valid RT-PCR SARS-CoV-2 test results and/or the titers of IgA or IgG antibodies were obtained belonged to just 405 households, the validity of Assumption A for this study is questionable. The same likely applies to the Santa Clara study [2] conducted on 3-4 April 2020 in Santa Clara county, CA with the population of around 2 million people. In that study, among 3390 participants whose blood specimens were analyzed for the presence of IgM or IgG antibodies there were 2747 adults and 643 children, no more than one per household, living with some of these adults [2]. By contrast to the Gangelt and Santa Clara studies, the Los Angeles study of IgM/IgG seroprevalence [4] that was conducted on 10-11 April 2020 in the Los Angeles county, CA limited participation to one individual per household.
The validity of Assumption A may also prove problematic if a disproportionately large number of study participants attended a known COVID-19 super-spreading event or may be suspected of belonging to known clusters of COVID-19 cases. For example, investigation of the effects of one such event, a carnival festivity held around 15 February 2020 in Gangelt, revealed that the infection rate among participants of the event was 2.6 times higher than among non-participants; in addition, the course of the disease was much more severe in the former than in the latter [3].
B. Test Uniformity. The sensitivity and specificity of a test are identical for all tested individuals.
Although this assumption is tacitly adopted almost universally, its validity may prove in certain cases questionable. For example, in asymptomatic or pre-symptomatic carriers of COVID-19, the number of viral particles on a nasopharyngeal swab may be too low to be detectable by the RT-PCR test. Additionally, in some asymptomatic or even symptomatic individuals, the titer of IgA or IgM antibodies indicative of an ongoing disease may not exceed the detection threshold of a serological test. As yet another example, in convalescent COVID-19 patients, the presence of viral particles may already be undetectable while the titer of IgG antibody, an indicator of a past disease, may not be detectable yet. In all these cases, the sensitivity of the test will be reduced. Likewise, cross-reactivity with viral fragments of, or antibodies to, another virus may increase the likelihood of false positive responses in those individuals who at the time of testing are, or have recently been, infected with a similar pathogen, e.g., a coronavirus causing the common cold. Such cross-reactivity will result in a reduction in the test's specificity.
Another source of systematic non-uniformity of test performance is the use of composite tests. For example, IgM and IgG antibody titers in the Santa Clara county study [2] were measured concurrently, and so were IgA and IgG antibody titers in the Gangelt study. Suppose two tests with respective sensitivities $\alpha_1$, $\alpha_2$ and specificities $\beta_1$, $\beta_2$ were given to the same group of subjects. Whereas the specificity of the composite test for any uninfected individual can be assumed to be $\beta_1 \beta_2$, the sensitivity of the composite test varies. Consider, for example, an infected individual who only has the first antibody. If a positive result is defined as detection of at least one antibody, then, under the independence assumption, the sensitivity of the composite test for such an individual would be $1 - (1 - \alpha_1)\beta_2$. Similarly, for an infected individual who has only the second antibody, it is $1 - \beta_1 (1 - \alpha_2)$. However, for those subjects who have both antibodies, the composite test is negative only if both component tests miss, so its sensitivity is $1 - (1 - \alpha_1)(1 - \alpha_2)$ (for evidence that the fraction of such subjects is not negligible, see, e.g., [13]). Thus, the sensitivities of the composite test for these three categories of individuals are, generally speaking, all distinct.
Finally, test results of the Vo' study [1] have not been adjusted for the sensitivity and specificity of the RT-PCR test, which amounts to assuming that α = β = 1.
C. The Matching Principle. Tested individuals are selected independently of each other, and the prevalence structure of the sample of these individuals matches that of the target population.
This assumption suggests a particular way in which the sample of tested individuals is representative of the target population. One sampling method that satisfies the Matching Principle is Simple Random Sampling (SRS), see, e.g., [14], whose defining property is that all samples of a given size are equally likely to be selected. To see that SRS satisfies Assumption C, consider drawing a sample of a given size N from a population of P individuals. For a fixed subpopulation S, homogeneous or otherwise, consisting of Q individuals, denote by η the random number of individuals from a sample that belong to S. It follows from the definition of SRS that random variable η has hypergeometric distribution H(P, Q, N). Therefore, for its expected value, µ, we have µ = NQ/P, so that µ/N = Q/P = w, where w is the fractional size, or weight, of the subpopulation S. Thus, under SRS, every subpopulation is represented in the sample, on average, in accordance with its weight.
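The identity µ = NQ/P can be checked exactly for small populations. The following is a minimal sketch, not part of the cited sampling literature: the function names and the example sizes are hypothetical, and exact rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(P, Q, N, k):
    """P(eta = k) for a simple random sample of size N drawn from P
    individuals, Q of whom belong to the subpopulation S."""
    return Fraction(comb(Q, k) * comb(P - Q, N - k), comb(P, N))


def expected_count(P, Q, N):
    """Exact expected number of sampled individuals that belong to S."""
    lo, hi = max(0, N - (P - Q)), min(N, Q)
    return sum(k * hypergeom_pmf(P, Q, N, k) for k in range(lo, hi + 1))


# Hypothetical sizes: population of 50, subpopulation of 20, sample of 10.
mu = expected_count(50, 20, 10)
# The sample fraction mu / N should equal the weight w = Q / P.
print(mu, mu / 10 == Fraction(20, 50))
```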
SRS can be combined with stratification of the population into several subpopulations determined by observable individual characteristics associated with the likelihood of the disease or infection. For example, the total sample size $N$ can be first partitioned into $r$ subsample sizes, $N = N_1 + N_2 + \cdots + N_r$, proportional to the demographic weights of the identified subpopulations, and then a random subsample of size $N_i$ can be generated from the $i$-th subpopulation by means of SRS for each $i = 1, 2, \ldots, r$.
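Proportional allocation rarely yields whole numbers, so some rounding rule is needed in practice. The sketch below uses largest-remainder rounding, which is an assumption made here for illustration and is not prescribed by the text; the function name and the example weights are likewise hypothetical.

```python
def proportional_allocation(N, weights):
    """Split a total sample size N into integer stratum sizes that are
    approximately proportional to the given demographic weights.

    Rounding uses the largest-remainder rule (an illustrative choice):
    each stratum gets the integer part of its quota, and the leftover
    units go to the strata with the largest fractional parts.
    """
    total = sum(weights)
    quotas = [N * w / total for w in weights]
    sizes = [int(q) for q in quotas]
    leftover = N - sum(sizes)
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: quotas[i] - sizes[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Hypothetical strata with weights 5 : 3 : 2 and a total sample of 100.
print(proportional_allocation(100, [5, 3, 2]))
```

Each stratum size returned here would then be filled by SRS within the corresponding subpopulation.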
One source of potential violation of Assumption C is oversampling from the same household, discussed above, allowed by the design of the studies [2,3]. The validity of Assumption C may also prove uncertain if recruitment for testing involves a significant opportunity for self-selection, which makes it likely that people who surmised that they have, or have had, the disease volunteered for the study. Such selection bias was manifestly present in the SARS-CoV-2 population prevalence study [5] conducted in Iceland between 13 March and 1 April 2020 where about half of the 10,797 tested participants who volunteered for the study had mild respiratory symptoms. A possibility for self-selection also existed in the Santa Clara county study whose recruited volunteers responded to an advertisement posted on Facebook [2]. The same problem was potentially present in the Los Angeles study where only 865 among 1952 randomly selected adults (with some restrictions aimed at matching the county demographics) were actually tested [4]. Finally, the initial recruitment effort of the Gangelt study consisted of generating a random sample of 600 community members with distinct last names and inviting them to participate in the study. However, the 407 study participants who responded to the invitation were allowed to bring in other household members for testing. As a result, 1007 individuals from 405 households were tested [3].
The overall logic and flow of exposition in the rest of the article are as follows. All mathematical results, derived under Assumptions A and B for a homogeneous population, where the probability of having a current or past infection can be assumed the same for all individuals, are formulated in Sections 3 and 4. Next, Section 5 introduces, based on Assumption C, the Equivalence Principle that enables a natural extension of all the results obtained in Sections 3 and 4 to a heterogeneous population consisting of any number of homogeneous subpopulations. Section 6 is dedicated to construction of a prevalence estimator and studying its properties. Finally, in Section 7, I summarize the findings and formulate recommendations for the design of prevalence studies informed by the analysis of this article.

Distribution of the Number of True and False Positive Test Results
Consider a test with a binary outcome (positive/negative) administered to N individuals selected from a homogeneous population with infection prevalence p, 0 < p < 1. Let α be the sensitivity and β be the specificity of the test, 0 < α, β < 1. Suppose the test resulted in n, 0 ≤ n ≤ N, positive outcomes. Denote by X, Y the respective unobservable numbers of true positive and false positive test results, and let Z = X + Y be the observable total number of positive outcomes. Denote by M the unknown true number of presently or previously infected individuals (depending on the nature of the test) among the N tested individuals. Below, we seek to compute, under Assumptions A and B, the distribution of random variables X, Y, Z and the conditional distributions of X and Y given Z = n. The conditional expectation and variance of random variable M given Z = n will be computed in closed form in the next section.
Assumption A implies that the distribution of random variable $M$ is binomial $B(N, p)$:
$$P(M = m) = \binom{N}{m} p^m (1 - p)^{N - m}, \quad 0 \le m \le N. \tag{1}$$
For this and other basic concepts and results from probability, the reader is referred to [15]. If $M = m$ is fixed, then the testing of each infected individual produces a positive test result, independently of other individuals (Assumption A), with the same probability $\alpha$, the sensitivity of the test (Assumption B). Then, for the number, $X$, of true positive test results, we have
$$P(X = x \mid M = m) = \binom{m}{x} \alpha^x (1 - \alpha)^{m - x}, \quad 0 \le x \le m. \tag{2}$$
Similarly, it follows from Assumption B that every uninfected individual receives a false positive test result with the same probability $1 - \beta$. Thus, the distribution of the number, $Y$, of false positives is given by
$$P(Y = y \mid M = m) = \binom{N - m}{y} (1 - \beta)^y \beta^{N - m - y}, \quad 0 \le y \le N - m. \tag{3}$$
Importantly, it follows from Assumption A that, for every $m$, random variables $X$ and $Y$ are conditionally independent given $M = m$.
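Due to Assumptions A and B, random variable $X$ is a thinning of the binomial random variable $M$ with probability $\alpha$. In general, thinning of a sequence of random events is their independent marking, or filtration, with the same probability. Accordingly, the random variable that counts the number of marked events is called a thinning of the random variable counting the occurrence of the original events; for more on thinning, see [16]. By compounding distributions (1) and (2), we find that random variable $X$ has binomial distribution $B(N, \alpha p)$. In fact, for $0 \le x \le N$, we have, using the formula of total probability, setting $j = m - x$, and finally employing Newton's binomial formula:
$$\begin{aligned} P(X = x) &= \sum_{m=x}^{N} \binom{N}{m} p^m (1 - p)^{N - m} \binom{m}{x} \alpha^x (1 - \alpha)^{m - x} \\ &= \binom{N}{x} (\alpha p)^x \sum_{j=0}^{N - x} \binom{N - x}{j} [(1 - \alpha) p]^j (1 - p)^{N - x - j} \\ &= \binom{N}{x} (\alpha p)^x (1 - \alpha p)^{N - x}. \end{aligned}$$
Likewise,
$$P(Y = y) = \binom{N}{y} [(1 - \beta)(1 - p)]^y [1 - (1 - \beta)(1 - p)]^{N - y}, \quad 0 \le y \le N.$$
To compute the joint distribution of random variables $X$ and $Y$, notice that, if $X = x$ and $Y = y$, then every admissible value, $m$, of random variable $M$ satisfies the inequalities $x \le m \le N - y$.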
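Using the formula of total probability, invoking Equations (1)-(3), rearranging the factors, making a change of variable $j = m - x$, and finally employing Newton's binomial formula, we obtain for all $x, y \ge 0$ such that $x + y \le N$:
$$\begin{aligned} P(X = x, Y = y) &= \sum_{m=x}^{N-y} P(M = m)\, P(X = x \mid M = m)\, P(Y = y \mid M = m) \\ &= \frac{N!}{x!\, y!\, (N - x - y)!} (\alpha p)^x [(1 - \beta)(1 - p)]^y \sum_{j=0}^{N - x - y} \binom{N - x - y}{j} [(1 - \alpha) p]^j [\beta (1 - p)]^{N - x - y - j} \\ &= \frac{N!}{x!\, y!\, (N - x - y)!} (\alpha p)^x [(1 - \beta)(1 - p)]^y [1 - \alpha p - (1 - \beta)(1 - p)]^{N - x - y}. \end{aligned} \tag{4}$$
Finally, lumping together true and false positive test results and combining their probabilities lead to the conclusion that the distribution of the total number, $Z = X + Y$, of positive test results is binomial $B(N, \lambda)$:
$$P(Z = n) = \binom{N}{n} \lambda^n (1 - \lambda)^{N - n}, \quad 0 \le n \le N, \tag{5}$$
where
$$\lambda = \alpha p + (1 - \beta)(1 - p). \tag{6}$$
According to the formula of total probability, $\lambda$ represents the probability of obtaining a positive test result for a randomly selected individual from the given homogeneous population.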
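The claim that Z is binomial B(N, λ) with λ = αp + (1 − β)(1 − p) can be verified numerically for a small N by summing over all values of M and X. This is only an illustrative check under Assumptions A and B; the function names are hypothetical, and the prevalence used in the example is an arbitrary value, not an estimate from any study.

```python
from math import comb


def binom_pmf(n, k, q):
    """Probability of k successes in n independent trials with success rate q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def positive_rate(p, alpha, beta):
    """lambda: probability that a randomly selected individual tests positive."""
    return alpha * p + (1 - beta) * (1 - p)


def pmf_Z_by_compounding(N, n, p, alpha, beta):
    """P(Z = n) obtained by summing over M = m infected individuals and
    X = x true positives, with Y = n - x false positives among the rest."""
    total = 0.0
    for m in range(N + 1):
        for x in range(max(0, n - (N - m)), min(m, n) + 1):
            total += (binom_pmf(N, m, p)
                      * binom_pmf(m, x, alpha)
                      * binom_pmf(N - m, n - x, 1 - beta))
    return total


# Sensitivity and specificity quoted in the text; p = 0.1 is hypothetical.
N, p, alpha, beta = 12, 0.1, 0.828, 0.995
lam = positive_rate(p, alpha, beta)
max_gap = max(abs(pmf_Z_by_compounding(N, n, p, alpha, beta) - binom_pmf(N, n, lam))
              for n in range(N + 1))
print(max_gap)  # should be at the level of floating-point rounding
```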
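Formulas (4) and (5) produce the following distributions of the number of true and false positive test results conditional on the observed total number of positive outcomes:
$$P(X = x \mid Z = n) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \quad 0 \le x \le n, \tag{7}$$
and
$$P(Y = y \mid Z = n) = \binom{n}{y} (1 - \theta)^y \theta^{n - y}, \quad 0 \le y \le n, \tag{8}$$
where
$$\theta = \frac{\alpha p}{\lambda} = \frac{\alpha p}{\alpha p + (1 - \beta)(1 - p)}. \tag{9}$$
Thus, the distribution of the number of true and false positives given the observed number, $n$, of positive test results is $B(n, \theta)$ and $B(n, 1 - \theta)$, respectively. Distributions (7) and (8) have the following three notable features: (a) They are independent of the total number, $N$, of tested individuals; (b) Parameter $\theta$ specified in (9) is the positive predictive value of the test, and $1 - \theta$ is its complement, i.e., the probability that a positive result is false; both can be obtained by applying Bayes' theorem to prior probabilities $p$ and $1 - p$, see, e.g., [17]; (c) Distributions (7) and (8) depend on a single parameter that combines the basic parameters $p$, $\alpha$, $\beta$.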
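Because the conditional distributions are binomial with parameter θ = αp/λ, the expected split of n observed positives into true and false positives is nθ and n(1 − θ). The sketch below illustrates this; the function names, the assumed prevalence, and the number of positives are hypothetical, while α = 0.828 and β = 0.995 are the values quoted in the Introduction.

```python
def positive_predictive_value(p, alpha, beta):
    """theta = alpha*p / lambda, with lambda = alpha*p + (1 - beta)*(1 - p)."""
    lam = alpha * p + (1 - beta) * (1 - p)
    return alpha * p / lam


def expected_true_false_positives(n, p, alpha, beta):
    """Expected numbers of true and false positives among n observed positives,
    i.e. the means of B(n, theta) and B(n, 1 - theta)."""
    theta = positive_predictive_value(p, alpha, beta)
    return n * theta, n * (1 - theta)


# Hypothetical scenario: assumed prevalence of 3% and 50 observed positives.
tp, fp = expected_true_false_positives(50, 0.03, 0.828, 0.995)
print(round(tp, 2), round(fp, 2))
```

The split does not depend on the number N of tested individuals, in line with feature (a).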
The extraordinary simplicity of Formulas (5), (7), and (8) should not becloud the fact that their validity depends critically on Assumptions A and B.

Conditional Expected Number of Infected Individuals for a Given Number of Positive Test Results
A natural estimator, $\hat{p}$, of the prevalence of an infection in a population can be defined as the expected fraction of infected individuals among those tested given the observed number, $n$, of positive test results:
$$\hat{p} = \frac{E(M \mid Z = n)}{N}. \tag{10}$$
The main goal of this section is to compute $E(M \mid Z = n)$ in the case of a homogeneous population. For the distribution of random variable $M$ conditional on $Z = n$, we have
$$P(M = m \mid Z = n) = \frac{1}{P(Z = n)} \sum_{x} P(M = m)\, P(X = x \mid M = m)\, P(Y = n - x \mid M = m), \tag{11}$$
where $P(Z = n)$ is given in (5)-(6) and $x$ satisfies the inequalities $0 \le x \le m$, $x \le n$ and $n - x \le N - m$ or equivalently
$$a(m, n) := \max\{n + m - N, 0\} \le x \le \min\{m, n\} =: b(m, n). \tag{12}$$
Although (1)-(3) and (5) combined with (11) lead to a formula for the conditional probability $P(M = m \mid Z = n)$, this formula does not seem to be reducible to a simple expression. However, the corresponding conditional expectation and variance can be computed in closed form, as I show below.
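Formula (11) suggests that, in order to find the conditional expectation $E(M \mid Z = n)$, one has to compute the following quantity:
$$A(n) := \sum_{m=0}^{N} m \sum_{x=a(m,n)}^{b(m,n)} P(M = m)\, P(X = x \mid M = m)\, P(Y = n - x \mid M = m),$$
where the bounds for variable $x$ are given in (12). Notice that the range of pairs $(x, m)$ has a simpler representation than for pairs $(m, x)$, namely $0 \le x \le n$ and $x \le m \le N - n + x$. Therefore, switching the order of summation, changing the variable in the internal sum to $j = m - x$, and using Formulas (1)-(3) yield
$$A(n) = \binom{N}{n} \sum_{x=0}^{n} \binom{n}{x} (\alpha p)^x [(1 - \beta)(1 - p)]^{n - x} \sum_{j=0}^{N - n} (x + j) \binom{N - n}{j} [(1 - \alpha) p]^j [\beta (1 - p)]^{N - n - j}. \tag{13}$$
Using (6) we represent the internal sum in (13) as
$$(1 - \lambda)^{N - n} \sum_{j=0}^{N - n} (x + j) \binom{N - n}{j} \delta^j (1 - \delta)^{N - n - j} = (1 - \lambda)^{N - n} [x + (N - n)\delta],$$
where we set $\delta = (1 - \alpha)p/(1 - \lambda)$ and used the formula for the expected value of the binomial distribution $B(N - n, \delta)$. Now, the above derivation of the formula for $A(n)$ can be continued:
$$A(n) = \binom{N}{n} \lambda^n (1 - \lambda)^{N - n} \sum_{x=0}^{n} \binom{n}{x} \theta^x (1 - \theta)^{n - x} [x + (N - n)\delta] = P(Z = n)\, [n\theta + (N - n)\delta],$$
where we employed (5) along with the formula for the expected value of the binomial distribution $B(n, \theta)$, where $\theta = \alpha p/\lambda$ is the same as in Formula (9). Thus, in view of (11),
$$E(M \mid Z = n) = n\theta + (N - n)\delta. \tag{14}$$
A very similar argument leads to the following formula for the conditional second moment of $M$ given $Z = n$:
$$E(M^2 \mid Z = n) = n\theta(1 - \theta) + (N - n)\delta(1 - \delta) + [n\theta + (N - n)\delta]^2.$$
Therefore, due to (14) and (6),
$$\mathrm{Var}(M \mid Z = n) = n\theta(1 - \theta) + (N - n)\delta(1 - \delta). \tag{15}$$
Inspection of Formulas (14) and (15) reveals that the conditional expectation and variance of random variable $M$ given $Z = n$ depend on the following two combinations of parameters $p$, $\alpha$, $\beta$ alone:
$$\theta = \frac{\alpha p}{\lambda} \quad \text{and} \quad \delta = \frac{(1 - \alpha) p}{1 - \lambda}.$$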
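The closed forms for the conditional mean and variance can be cross-checked against a direct enumeration of the conditional distribution of M given Z = n. The sketch below assumes the closed forms are E(M | Z = n) = nθ + (N − n)δ and Var(M | Z = n) = nθ(1 − θ) + (N − n)δ(1 − δ), with θ = αp/λ and δ = (1 − α)p/(1 − λ); the function names and parameter values are hypothetical.

```python
from math import comb


def binom_pmf(n, k, q):
    """Probability of k successes in n independent trials with success rate q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def cond_moments_closed_form(N, n, p, alpha, beta):
    """Conditional mean and variance of M given Z = n (assumed closed forms)."""
    lam = alpha * p + (1 - beta) * (1 - p)
    theta = alpha * p / lam                # P(infected | positive test)
    delta = (1 - alpha) * p / (1 - lam)    # P(infected | negative test)
    mean = n * theta + (N - n) * delta
    var = n * theta * (1 - theta) + (N - n) * delta * (1 - delta)
    return mean, var


def cond_moments_brute_force(N, n, p, alpha, beta):
    """Same moments from the joint weights P(M = m, Z = n), summed over x."""
    weights = []
    for m in range(N + 1):
        w = 0.0
        for x in range(max(n + m - N, 0), min(m, n) + 1):
            w += (binom_pmf(N, m, p) * binom_pmf(m, x, alpha)
                  * binom_pmf(N - m, n - x, 1 - beta))
        weights.append(w)
    total = sum(weights)                   # equals P(Z = n)
    mean = sum(m * w for m, w in enumerate(weights)) / total
    second = sum(m * m * w for m, w in enumerate(weights)) / total
    return mean, second - mean**2


# Hypothetical small example.
print(cond_moments_closed_form(15, 4, 0.2, 0.8, 0.95))
print(cond_moments_brute_force(15, 4, 0.2, 0.8, 0.95))
```

The two printed pairs should agree up to floating-point rounding.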

The Equivalence Principle for Heterogeneous Populations
Sections 3 and 4 dealt with a population that was assumed homogeneous in the sense that all its individuals had the same probability, $p$, to have a current or past infection. The aim of this section is to extend the results of Sections 3 and 4 to a more realistic case of a heterogeneous population consisting of $r$ homogeneous subpopulations. Let $\mathbf{w} = (w_1, w_2, \ldots, w_r)$, where $\sum_{i=1}^{r} w_i = 1$, be the vector of relative sizes (weights) of these subpopulations and $\mathbf{p} = (p_1, p_2, \ldots, p_r)$ be the vector of their disease prevalences.
I start by introducing the following convenient notation. For a non-negative integer vector $\mathbf{x}$ with $r$ components, set $|\mathbf{x}| = \sum_{i=1}^{r} x_i$ and $\mathbf{x}! = \prod_{i=1}^{r} x_i!$. In addition, for two such vectors $\mathbf{x}$ and $\mathbf{y}$, we denote $\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{r} x_i y_i$, $\mathbf{x}\mathbf{y} = (x_1 y_1, x_2 y_2, \ldots, x_r y_r)$ and $\mathbf{x}^{\mathbf{y}} = \prod_{i=1}^{r} x_i^{y_i}$. Finally, $\mathbf{y} \le \mathbf{x}$ means that $y_i \le x_i$ for $i = 1, 2, \ldots, r$.
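Let $N_i$, $1 \le i \le r$, be the number of individuals from the $i$-th homogeneous subpopulation among $N$ tested individuals. The Matching Principle (Assumption C) implies that random vector $\mathbf{N} = (N_1, N_2, \ldots, N_r)$ with $|\mathbf{N}| = N$ has multinomial distribution $\mathrm{Mult}(N, \mathbf{w})$:
$$P(\mathbf{N} = \mathbf{x}) = \frac{N!}{\mathbf{x}!}\, \mathbf{w}^{\mathbf{x}}, \quad |\mathbf{x}| = N. \tag{16}$$
Let $M_i$ be the number of infected individuals among $N_i$ tested individuals, $1 \le i \le r$. Random vector $\mathbf{M} = (M_1, M_2, \ldots, M_r)$ represents a component-wise thinning of random vector $\mathbf{N}$ with thinning probabilities forming the vector $\mathbf{p}$. What is the distribution of random vector $\mathbf{M}$? A computation below shows that, in contrast to the binomial case, it is not multinomial! It follows from Assumption A and subpopulation homogeneity that, for any $i$, $1 \le i \le r$, the conditional distribution of random variable $M_i$ given $N_i = x_i$ is binomial $B(x_i, p_i)$. In addition, Assumption A implies that, for every vector $\mathbf{x}$ such that $|\mathbf{x}| = N$, components of random vector $\mathbf{M}$ are conditionally independent given $\mathbf{N} = \mathbf{x}$. Therefore, for any vector $\mathbf{y}$ with $|\mathbf{y}| \le N$, we have, using (16), employing the formula of total probability, making a change of variable $\mathbf{z} = \mathbf{x} - \mathbf{y}$, and, finally, using the multinomial formula,
$$\begin{aligned} P(\mathbf{M} = \mathbf{y}) &= \sum_{\mathbf{x} \ge \mathbf{y},\ |\mathbf{x}| = N} \frac{N!}{\mathbf{x}!}\, \mathbf{w}^{\mathbf{x}} \prod_{i=1}^{r} \binom{x_i}{y_i} p_i^{y_i} (1 - p_i)^{x_i - y_i} \\ &= \frac{N!}{\mathbf{y}!}\, (\mathbf{w}\mathbf{p})^{\mathbf{y}} \sum_{|\mathbf{z}| = N - |\mathbf{y}|} \frac{1}{\mathbf{z}!}\, [\mathbf{w}(\mathbf{1} - \mathbf{p})]^{\mathbf{z}} \\ &= \frac{N!}{\mathbf{y}!\, (N - |\mathbf{y}|)!}\, (\mathbf{w}\mathbf{p})^{\mathbf{y}} (1 - \mathbf{w} \cdot \mathbf{p})^{N - |\mathbf{y}|}, \end{aligned} \tag{17}$$
where $\mathbf{1} = (1, 1, \ldots, 1)$. Due to Assumption B, all the computations in Sections 3 and 4 involve only the total number, $m = |\mathbf{M}|$, of infected individuals among those tested. The distribution of random variable $|\mathbf{M}|$ can now be derived using the multinomial formula and (17):
$$P(|\mathbf{M}| = m) = \sum_{|\mathbf{y}| = m} P(\mathbf{M} = \mathbf{y}) = \binom{N}{m} (1 - \mathbf{w} \cdot \mathbf{p})^{N - m} \sum_{|\mathbf{y}| = m} \frac{m!}{\mathbf{y}!}\, (\mathbf{w}\mathbf{p})^{\mathbf{y}} = \binom{N}{m} (\mathbf{w} \cdot \mathbf{p})^m (1 - \mathbf{w} \cdot \mathbf{p})^{N - m}. \tag{18}$$
Thus, the total number of infected individuals among $N$ subjects tested follows the binomial distribution $B(N, \mathbf{w} \cdot \mathbf{p})$.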
Comparison between Formulas (18) and (1) leads to the following conclusion that can be termed the Equivalence Principle: Under Assumption C, the distribution of the total number of infected individuals among $N$ tested individuals selected from a heterogeneous population consisting of $r$ homogeneous subpopulations with weights $w_1, w_2, \ldots, w_r$ and infection prevalences $p_1, p_2, \ldots, p_r$ is the same as for a homogeneous population with infection prevalence
$$p = \mathbf{w} \cdot \mathbf{p} = \sum_{i=1}^{r} w_i p_i. \tag{19}$$
The Equivalence Principle is also true when the $r$ subpopulations comprising the population of interest are heterogeneous. In fact, partitioning them into homogeneous subsubpopulations, applying Formula (19), regrouping and rescaling the terms pertaining to the same subpopulation, and applying the Equivalence Principle again leads to Formula (19) in which $w_i$ are the weights (relative sizes) and $p_i$ are the prevalences of the $r$ heterogeneous subpopulations.
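For two subpopulations, the Equivalence Principle can be checked numerically by enumerating the subsample size and the infected counts in each group. The following sketch is illustrative only: it assumes the effective prevalence is the weighted sum of the subpopulation prevalences, and the function names, weights, and prevalences are hypothetical.

```python
from math import comb


def binom_pmf(n, k, q):
    """Probability of k successes in n independent trials with success rate q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def effective_prevalence(weights, prevalences):
    """Weighted sum w . p of subpopulation prevalences."""
    return sum(w * p for w, p in zip(weights, prevalences))


def pmf_total_infected_two_groups(N, m, w1, p1, p2):
    """P(|M| = m) for two subpopulations: N1 ~ B(N, w1) individuals come from
    group 1, M1 ~ B(N1, p1) of them are infected, and M2 ~ B(N - N1, p2)."""
    total = 0.0
    for n1 in range(N + 1):
        for m1 in range(max(0, m - (N - n1)), min(n1, m) + 1):
            total += (binom_pmf(N, n1, w1)
                      * binom_pmf(n1, m1, p1)
                      * binom_pmf(N - n1, m - m1, p2))
    return total


# Hypothetical mixture: 30% of the population at prevalence 0.20, 70% at 0.02.
N, w1, p1, p2 = 10, 0.3, 0.20, 0.02
p_eff = effective_prevalence([w1, 1 - w1], [p1, p2])
gap = max(abs(pmf_total_infected_two_groups(N, m, w1, p1, p2) - binom_pmf(N, m, p_eff))
          for m in range(N + 1))
print(p_eff, gap)
```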
In summary, all the results in Sections 3 and 4 that were derived for a homogeneous population with infection prevalence p are also valid for a heterogeneous population if one selects p in accordance with Formula (19). This equivalence property is, of course, quite natural; however, it depends on Assumptions A-C in very essential ways.

Prevalence Estimation
If $N$ individuals drawn from a population of interest were tested and $n$ positive test results were observed, then a "naïve" estimate of the prevalence of the current or past infection in the population would be $p_0 = n/N$. The testing process can be viewed as the following mental experiment: for an infected individual, a coin is flipped that lands "heads" with probability $\alpha$ and "tails" with probability $1 - \alpha$ while, for an uninfected individual, another coin is flipped that lands "heads" with probability $1 - \beta$ and "tails" with probability $\beta$. In these terms, $p_0$ is the fraction of "heads" (positive test results) recorded for $N$ independent replications of this random experiment. Clearly, $p_0$ depends on the sensitivity and specificity of the test and the prevalence of the disease. Therefore, one needs to untangle them and construct a consistent estimator of the prevalence alone.
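The two-coin experiment is easy to simulate. The sketch below is a hypothetical illustration under Assumptions A and B with a homogeneous population; the function name, the seed, and the parameter values are arbitrary choices. It shows that the raw positivity rate settles near αp + (1 − β)(1 − p) rather than near p.

```python
import random


def simulate_positive_count(N, p, alpha, beta, seed=0):
    """Simulate testing N independent individuals and return the number of
    positive results. Each individual is infected with probability p; an
    infected one tests positive with probability alpha, an uninfected one
    with probability 1 - beta."""
    rng = random.Random(seed)
    positives = 0
    for _ in range(N):
        infected = rng.random() < p
        prob_positive = alpha if infected else 1 - beta
        if rng.random() < prob_positive:
            positives += 1
    return positives


# Hypothetical parameters: true prevalence 10%, sensitivity 0.8, specificity 0.95.
N, p, alpha, beta = 100_000, 0.10, 0.80, 0.95
p0 = simulate_positive_count(N, p, alpha, beta) / N
lam = alpha * p + (1 - beta) * (1 - p)
print(p0, lam)  # the raw rate tracks lam = 0.125, not the prevalence 0.10
```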
In the rest of this section, it will be assumed that
$$\alpha + \beta > 1. \tag{20}$$
This condition is always met in practice; otherwise, either the sensitivity or the specificity of a test would not exceed 0.5, thus making the test equivalent or inferior, for either infected or uninfected individuals, to flipping a fair coin.
Recall that the number, $n$, of positive test results has binomial distribution $B(N, \lambda)$, see Section 3, where $\lambda = \alpha p + (1 - \beta)(1 - p)$. Notice that, under condition (20),
$$1 - \beta < \lambda < \alpha. \tag{21}$$
I first define the desired estimate, $\hat{p}$, of the population prevalence $p$ heuristically. Because $p_0$ is a consistent unbiased estimator of $\lambda$, the following "plug-in" equation can be set up for $\hat{p}$:
$$p_0 = \alpha \hat{p} + (1 - \beta)(1 - \hat{p}). \tag{22}$$
This defines $\hat{p}$ as the population prevalence that would produce, on average, the same fraction of positive test results when $N$ individuals are tested as the one actually observed. Solving Equation (22) for $\hat{p}$ yields
$$\hat{p} = \frac{p_0 + \beta - 1}{\alpha + \beta - 1}. \tag{23}$$
Note that this estimator was employed in the Santa Clara county study [2], see also [6]. Observe that $0 < \hat{p} < 1$ if and only if $1 - \beta < p_0 < \alpha$; compare with (21). Thus, the complete definition of estimator $\hat{p}$ is
$$\hat{p} = \begin{cases} 0, & p_0 \le 1 - \beta, \\ \dfrac{p_0 + \beta - 1}{\alpha + \beta - 1}, & 1 - \beta < p_0 < \alpha, \\ 1, & p_0 \ge \alpha. \end{cases}$$
This formula implies that meaningful prevalence estimation in a population with low prevalence of a disease requires a test with high specificity (namely, with $\beta > 1 - p_0$, where $p_0$ is the raw positivity rate for a representative sample). Likewise, estimation of large prevalence requires a test with sufficiently high sensitivity (specifically, with $\alpha > p_0$).
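A minimal sketch of the truncated estimator follows. It assumes the estimator solves the plug-in equation p0 = α·p̂ + (1 − β)(1 − p̂), that is, p̂ = (p0 + β − 1)/(α + β − 1), clipped to the interval [0, 1]. The function name and the example counts are hypothetical; only α = 0.828 and β = 0.995 are taken from the text.

```python
def prevalence_estimate(n, N, alpha, beta):
    """Prevalence estimate from n positives among N tested individuals,
    corrected for sensitivity alpha and specificity beta and clipped to [0, 1]."""
    if alpha + beta <= 1:
        raise ValueError("the correction requires alpha + beta > 1")
    p0 = n / N                                   # raw positivity rate
    corrected = (p0 + beta - 1) / (alpha + beta - 1)
    return min(1.0, max(0.0, corrected))


# Hypothetical counts: 60 positives among 3000 tested individuals.
print(prevalence_estimate(60, 3000, 0.828, 0.995))

# A raw rate at or below the false positive rate 1 - beta yields an estimate of 0.
print(prevalence_estimate(10, 3000, 0.828, 0.995))
```

The second call illustrates why low-prevalence settings call for high specificity: once the raw positivity rate falls to 1 − β or below, the data carry no usable signal about prevalence.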
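Since $p_0 \to \lambda$ almost surely as $N \to \infty$, Equations (23), (21), and (6) imply that
$$\hat{p} \to \frac{\lambda + \beta - 1}{\alpha + \beta - 1} = p$$
almost surely as $N \to \infty$. Therefore, $\hat{p}$ is a consistent estimator of $p$. Because estimator $\hat{p}$ is uniformly bounded, we also have $E\hat{p} \to p$ as $N \to \infty$, which means that estimator $\hat{p}$ is asymptotically unbiased.
The heuristic formula (23) can be derived on more "theoretical" grounds. Recall that $\hat{p}$ is defined as the expected fraction of infected individuals among those tested conditional on the observed number of positive test results, see Equation (10). Then, in view of (14),
$$\hat{p} = \frac{E(M \mid Z = n)}{N} = \theta p_0 + \delta (1 - p_0) =: f(p), \quad \text{where } \theta = \frac{\alpha p}{\lambda}, \ \delta = \frac{(1 - \alpha) p}{1 - \lambda}.$$
The principal difficulty with this definition of the prevalence estimator $\hat{p}$ is that it depends on the unknown true prevalence parameter $p$ that $\hat{p}$ seeks to estimate. A natural idea, then, would be to determine the value of $p$ for which $f(p) = p$ and take it as the desired prevalence estimator. Using expression (6) for $\lambda$, one finds after some algebra
$$p = \frac{p_0 + \beta - 1}{\alpha + \beta - 1}.$$
Upon comparison with (23), this leads to the conclusion that, under assumption (20), the required fixed point of function $f$ is exactly the above heuristic estimator $\hat{p}$!
My next goal is to estimate the variance of $\hat{p}$. Recall that a function $T : \mathbb{R} \to \mathbb{R}$ is called a contraction if $|T(x) - T(y)| \le |x - y|$ for all $x, y \in \mathbb{R}$. Setting here $y = 0$ shows that for every contraction $T$
$$|T(x)| \le |T(0)| + |x|. \tag{24}$$
An important family of contractions consists of functions $T_{a,b}$ defined for $a < b$ by
$$T_{a,b}(x) = \begin{cases} a, & x \le a, \\ x, & a < x < b, \\ b, & x \ge b. \end{cases}$$
If $U$ is a random variable with finite second moment defined on a sample space $S$ with probability measure $P$ and $V = T(U)$, where $T$ is a contraction, then it follows from (24) that the second moment of random variable $V$ is also finite. Moreover, $\mathrm{Var}\,V \le \mathrm{Var}\,U$. In fact,
$$\mathrm{Var}\,V = \frac{1}{2} \int_S \int_S [V(s) - V(t)]^2 \, dP(s)\, dP(t) \le \frac{1}{2} \int_S \int_S [U(s) - U(t)]^2 \, dP(s)\, dP(t) = \mathrm{Var}\,U.$$
In particular, setting
$$U = \frac{p_0 + \beta - 1}{\alpha + \beta - 1},$$
we conclude from the above full definition of estimator $\hat{p}$ that $\hat{p} = T_{0,1}(U)$. This leads to the following upper bound for the variance of $\hat{p}$:
$$\mathrm{Var}\,\hat{p} \le \mathrm{Var}\,U = \frac{\lambda (1 - \lambda)}{N (\alpha + \beta - 1)^2} \le \frac{1}{4 N (\alpha + \beta - 1)^2}.$$
Interestingly, this upper bound depends only on the quantity $\sqrt{N}(\alpha + \beta - 1)$.
The estimator $\hat{p}$ has a remarkable feature that can be termed the "mixture-invariance" property.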
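Consider a population that consists of $r$ subpopulations. Let $N_i \ge 1$ be the number of tested individuals from the $i$-th subpopulation and $n_i$ be the number of positive outcomes, based on the same test. Denote by $\hat{p}_i$ the above-designed prevalence estimate for the $i$-th subpopulation and set $\hat{w}_i = N_i/N$, where $N = N_1 + N_2 + \cdots + N_r$. Finally, assume that $1 - \beta < n_i/N_i < \alpha$ for all $i$. Then, the prevalence estimate, $\hat{p}$, for the entire population becomes

$$\hat{p} = \sum_{i=1}^{r} \hat{w}_i \hat{p}_i; \tag{26}$$

compare with (19). In fact, it follows from Formula (23) and $\sum_{i=1}^{r} \hat{w}_i = 1$ that

$$\sum_{i=1}^{r} \hat{w}_i \hat{p}_i = \sum_{i=1}^{r} \hat{w}_i\,\frac{n_i/N_i + \beta - 1}{\alpha + \beta - 1} = \frac{\sum_{i=1}^{r} \hat{w}_i\, n_i/N_i + \beta - 1}{\alpha + \beta - 1}. \tag{27}$$

Observe that

$$\sum_{i=1}^{r} \hat{w}_i\,\frac{n_i}{N_i} = \frac{1}{N}\sum_{i=1}^{r} n_i = \frac{n}{N} = p_0.$$

Thus, $p_0$ is a weighted average of the ratios $n_i/N_i$, which implies, due to the above-assumed bounds for these ratios, that $1 - \beta < p_0 < \alpha$. We now continue (27) to finally get

$$\sum_{i=1}^{r} \hat{w}_i \hat{p}_i = \frac{p_0 + \beta - 1}{\alpha + \beta - 1} = \hat{p}.$$

The mixture-invariance property (26) enables one to combine prevalence estimates for several subpopulations of a given population obtained within the same study, or different studies utilizing the same test, into a prevalence estimate for the entire population.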
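The mixture-invariance property is easy to check numerically. The sketch below compares the estimate computed from pooled counts with the weighted average of subpopulation estimates; the helper `plug_in`, the test characteristics, and the counts are all hypothetical values chosen so that 1 − β < n_i/N_i < α holds in every subpopulation.

```python
from math import isclose


def plug_in(n, N, alpha, beta):
    """Formula (23); no truncation is needed when 1 - beta < n/N < alpha."""
    return (n / N + beta - 1) / (alpha + beta - 1)


alpha, beta = 0.85, 0.97  # hypothetical sensitivity and specificity
samples = [(30, 400), (45, 300), (12, 300)]  # hypothetical (n_i, N_i) pairs

N = sum(N_i for _, N_i in samples)  # total number tested
n = sum(n_i for n_i, _ in samples)  # total number of positives

# Estimate from the pooled data.
pooled = plug_in(n, N, alpha, beta)

# Weighted average of subpopulation estimates with weights N_i / N.
mixture = sum((N_i / N) * plug_in(n_i, N_i, alpha, beta) for n_i, N_i in samples)

print(pooled, mixture, isclose(pooled, mixture, rel_tol=1e-9))
```

The two numbers agree up to floating-point rounding, as the derivation above predicts. The agreement relies on every subpopulation ratio lying strictly between 1 − β and α; if some estimates were truncated, the identity could fail.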

Discussion and Recommendations
In this article, I computed the distribution of the number of positive outcomes resulting from administration of a test with known sensitivity and specificity to $N$ individuals selected from a given population. I also found the conditional distribution of the unobservable number of true and false positive test results given the observed number, $n$, of positive outcomes. These formulas lead to a closed-form expression for the expected value of the unknown true number of infected individuals among those tested conditional on $n$. In Sections 3 and 4, these results were obtained for a homogeneous population, while the Equivalence Principle derived in Section 5 extended them to a heterogeneous population. This theory culminated in the construction, in Section 6, of a consistent estimator, $\hat{p}$, of the prevalence of infected individuals in a population and in an upper bound for the variance of $\hat{p}$.
Importantly, in Section 2, I formulated three basic assumptions required for the validity of the above results and identified, using the well-known early COVID-19 prevalence studies [2][3][4][5] as examples, several sources of their violation. Because it is uncommon for epidemiological studies including [2][3][4][5] to disclose all the details of their statistical analyses, it is hard to say with certainty if some of the formulas obtained in this work were employed in the published prevalence studies (as mentioned earlier, the Santa Clara county study [2] did use the prevalence estimator (23)). However, the design of these studies does not seem to fully meet the assumptions upon which these formulas vitally depend.
Note that, apart from the consistency and asymptotic unbiasedness of the estimator $\hat{p}$, the results of this work do not rely on asymptotic arguments, i.e., those applicable only to large values of $N$. Therefore, the distributional formulas and the variance bound (25) can be used for a small sample size $N$ as long as the sample of tested individuals is sufficiently representative of the prevalence structure of the target population.
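To illustrate the finite-sample character of the variance bound (25), the following sketch computes the exact mean and variance of the truncated estimator for a small sample by summing over the binomial distribution of n. The function names and all parameter values are hypothetical, and the calculation assumes the homogeneous-population setting in which n has the distribution B(N, λ).

```python
from math import comb


def exact_moments(N, p, alpha, beta):
    """Exact mean and variance of the truncated estimator when n ~ B(N, lambda)."""
    lam = alpha * p + (1 - beta) * (1 - p)  # probability of a positive test
    mean = second = 0.0
    for n in range(N + 1):
        prob = comb(N, n) * lam**n * (1 - lam) ** (N - n)  # binomial pmf
        u = (n / N + beta - 1) / (alpha + beta - 1)  # untruncated value U
        p_hat = min(max(u, 0.0), 1.0)  # truncation to [0, 1]
        mean += prob * p_hat
        second += prob * p_hat**2
    return mean, second - mean**2


def variance_bound(N, alpha, beta):
    """Right-hand side of inequality (25)."""
    return 1 / (4 * N * (alpha + beta - 1) ** 2)


# Hypothetical small study: N = 50, true prevalence 0.10,
# sensitivity 0.80 and specificity 0.98.
mean, var = exact_moments(N=50, p=0.10, alpha=0.80, beta=0.98)
bound = variance_bound(50, 0.80, 0.98)
print(mean, var, bound)
```

For any admissible inputs the computed variance should not exceed the bound. The mean need not equal the true prevalence for small N, because truncation at 0 and 1 introduces some bias.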
One possible line of extension of this work would be to incorporate the compliance rate of individuals recruited for a prevalence study and the rate at which a testing system produces invalid results into the prevalence estimator. Another direction of further work would be to develop a probabilistic framework for the propagation of uncertainty in the test's sensitivity and specificity into the population prevalence estimator. As mentioned in the Introduction, a Bayesian approach to quantifying such propagation was developed in [6].
I close with a list of specific conclusions and recommendations regarding the design of population prevalence studies and relevant statistical methodology.
1. The estimator, $\hat{p}$, of population prevalence $p$ introduced in Section 6 is consistent for any test whose sensitivity, $\alpha$, and specificity, $\beta$, satisfy the conditions $\alpha + \beta > 1$ and $1 - \beta < p < \alpha$. The quality of this estimator, as measured by the upper bound (25) for its variance, depends on the quantity $\sqrt{N}(\alpha + \beta - 1)$ alone, which can be used for deciding on the number of individuals to be tested. While the accuracy of the estimator $\hat{p}$ improves with the increase in the sensitivity and specificity of the test, the same improvement can be achieved by increasing the sample size. Thus, contrary to the common belief, high sensitivity and specificity of a test are not of primary importance for population prevalence estimation (although the accuracy of individual test results depends critically on how close the test's sensitivity and specificity are to 100%).
2. The "naïve" prevalence estimator $p_0 = n/N$ depends on the true population prevalence and the test's sensitivity and specificity. It may deviate considerably from $\hat{p}$, the correct "disentangled" population prevalence estimator. For example, for a perfectly specific test ($\beta = 1$), one has $\hat{p} = \min\{p_0/\alpha, 1\}$. Therefore, the use of $p_0$ as a prevalence estimator should be discouraged, unless a test with sensitivity and specificity close to 100% is employed.
3. Accurate prevalence estimation for a population with a high prevalence of a disease or infection requires a very sensitive test while, for a population with low prevalence, a very specific test should be used.
4. Prevalence estimates for subpopulations with known weights, obtained on the same testing platform, lead automatically, through the "mixture-invariance" property (see Section 6), to a prevalence estimate for a heterogeneous population comprised of these subpopulations without the need for a de novo study.
5. The validity of an estimate of population prevalence of the current or past infection depends critically on study design. Here are a few recommendations for selecting study participants and for choosing among them those individuals whose valid test results can be used for prevalence estimation:
(a) Excessive inclusion of testing data for more than one household member in the same analysis of the prevalence of COVID-19 should be avoided. The same applies to individuals who are known to have been in close and/or protracted contact without PPE.
(b) Prevalence studies in populations where known COVID-19 super-spreading events have occurred should eschew oversampling from the subpopulation of their participants. Likewise, study participants could be screened for association with known infection clusters.
(c) Compliance with the Matching Principle (Assumption C) can be achieved through simple random sampling from the target population. This approach can be combined with stratification based on individual characteristics relevant to the prevalence of the disease or infection. Selected individuals can be additionally screened based on the above criteria (a) and (b) as well as for membership in high-risk professional groups and for residence in locations with high or low prevalence of registered cases.
(d) Self-selection of study subjects can make the resulting prevalence estimate unreliable. One way to prevent this selection bias is to improve post-selection compliance by providing financial incentives to study participants.
(e) Using composite testing data (e.g., combining molecular RT-PCR or IgA/IgM serological test for current infection with serological IgG test for past infection) within the same prevalence analysis violates the testing uniformity assumption and should be avoided.
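Recommendation 1 above notes that inequality (25) can be used for deciding on the number of individuals to be tested. As a rough sketch, the bound guarantees a standard deviation of the estimator of at most σ whenever N ≥ 1/(4σ²(α + β − 1)²). The function name `min_sample_size` and the test characteristics below are hypothetical; because (25) replaces λ(1 − λ) by its maximum value 1/4, the resulting sample sizes are conservative.

```python
from math import ceil


def min_sample_size(alpha, beta, max_sd):
    """Smallest N for which bound (25) guarantees sd(p_hat) <= max_sd."""
    j = alpha + beta - 1
    if j <= 0:
        raise ValueError("requires alpha + beta > 1")
    if max_sd <= 0:
        raise ValueError("max_sd must be positive")
    return ceil(1 / (4 * max_sd**2 * j**2))


# Two hypothetical tests, same target standard deviation of 0.02.
print(min_sample_size(0.80, 0.95, 0.02))  # less accurate test
print(min_sample_size(0.95, 0.99, 0.02))  # more accurate test
```

In this illustration the less accurate test requires a larger sample to meet the same target, which is the trade-off between test quality and sample size described in recommendation 1.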
Author Contributions: L.H. is responsible for all components of the manuscript and all aspects of its preparation. The author has read and agreed to the published version of the manuscript.

Funding: The author confirms that no funding associated with this manuscript was provided.