4.1. Bayesian Posterior Reliability and the Base-Rate Trap
The preceding analysis focused on frequentist system reliability: the probability that at least one innocent individual is flagged. However, practitioners ultimately need the posterior probability that a flagged person is truly a target. This Bayesian perspective reveals an even more stringent constraint: in the sparse-target regime (where the expected number of true targets r is small relative to the population size), posterior reliability degrades once the expected number of false positives $nq$ becomes comparable to $r$ and collapses when $nq \gg r$. In this regime, flags become epistemically meaningless well before the frequentist transition at $nq \approx 1$.
4.1.1. The Bayesian Framework and Positive Predictive Value
We introduce standard notation from diagnostic testing and forensic statistics [2]: $s = \Pr(\text{flag} \mid \text{target})$ denotes sensitivity (the true positive rate), $q = \Pr(\text{flag} \mid \text{innocent})$ the per-individual false positive rate, $\pi = r/n$ the base rate of true targets, and $\mathrm{PPV} = \Pr(\text{target} \mid \text{flag})$ the positive predictive value.
By Bayes’ rule, the positive predictive value (PPV) is
$$\mathrm{PPV} = \Pr(\text{target} \mid \text{flag}) = \frac{s\pi}{s\pi + q(1-\pi)}. \quad (16)$$
In a screened population of size n with r true targets ($\pi = r/n$), the expected number of flagged individuals is
$$\mathbb{E}[\text{flags}] = s\,r + q\,(n - r).$$
Thus, the expected false discovery rate (FDR) is
$$\mathrm{FDR} = \frac{q(n-r)}{s\,r + q(n-r)},$$
matching (15) exactly.
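The short Python sketch below evaluates (16) and the expected FDR. It is illustrative only; the sensitivity, false positive rate, and target count are hypothetical values, not estimates from the paper. It shows how PPV collapses as the screened population grows while r stays fixed.

```python
# Illustrative sketch: PPV (Eq. 16) and expected FDR under hypothetical parameters.
def ppv(s, q, r, n):
    """Positive predictive value with base rate pi = r/n."""
    pi = r / n
    return s * pi / (s * pi + q * (1 - pi))

def expected_fdr(s, q, r, n):
    """Expected false discovery rate: q(n - r) / (s r + q(n - r))."""
    return q * (n - r) / (s * r + q * (n - r))

# Hypothetical numbers: r = 10 true targets, q = 1e-4, s = 0.9.
for n in (1e4, 1e6, 1e8):
    print(f"n={n:.0e}  PPV={ppv(0.9, 1e-4, 10, n):.3f}  FDR={expected_fdr(0.9, 1e-4, 10, n):.3f}")
```

With these placeholder values, PPV falls from roughly 0.9 at $n = 10^4$ to below $10^{-3}$ at $n = 10^8$, even though q never changes.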
Remark 9 (Sensitivity Dependence on Threshold). The analysis above treats sensitivity s as constant, but in practice s typically decreases as the threshold m increases: stricter criteria miss a larger fraction of true targets. A simple parametric model capturing this tradeoff is
$$s(m) = s_0 \, e^{-\beta (m - \mu_T)_+}, \quad (17)$$
where $s_0$ is the sensitivity at low thresholds, $\mu_T$ is the expected number of matching attributes for a true target, $\beta > 0$ controls the decay rate, and $(x)_+ = \max(x, 0)$. The exponential form reflects the common observation that match-score distributions exhibit approximately exponential tails, though any monotone decreasing form would yield the same qualitative conclusions. Substituting (17) into the PPV formula (16) reveals competing effects as m increases:
- False positives decrease: for innocent individuals, whose match counts concentrate around a mean well below $\mu_T$, Lemma 1 shows that q decays very rapidly once m exceeds that mean (indeed, faster than any fixed-rate exponential in m).
- True positives decrease: $s(m)$ remains near $s_0$ for $m \le \mu_T$ but then decays exponentially for $m > \mu_T$.
These effects create an intermediate regime in which PPV may plateau or improve more slowly than the constant-s analysis suggests, particularly once m exceeds $\mu_T$ and losses in sensitivity offset some of the gains from reduced false positives. Under the model (17), however, Lemma 1 implies that q eventually decays much faster than $s(m)$, so $s(m)/q \to \infty$ and therefore $\mathrm{PPV}(m) \to 1$ as $m \to \infty$. Any plateau or dip in PPV can therefore occur only over this intermediate range of thresholds, not asymptotically. Because sensitivity is bounded above by $s_0$, the constant-s model used earlier provides an upper bound on achievable PPV for any given threshold: PPV is increasing in s, and $s(m) \le s_0$ for all m. Optimal threshold selection ultimately requires specifying both the innocent and target match distributions, i.e., full ROC curve analysis [23]. Since the signal distribution is application-dependent, we do not pursue this direction here. Of course, driving m to extremely large values also drives the overall flag rate toward zero, so the limit $\mathrm{PPV}(m) \to 1$ is primarily of theoretical interest; in practice, the operationally relevant regime is the intermediate range of thresholds where nontrivial detection rates are maintained.
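The numeric sketch below illustrates Remark 9. It assumes innocent match counts are Poisson with mean well below $\mu_T$ and uses the exponential sensitivity form (17); every parameter value ($s_0$, $\mu_T$, $\beta$, the Poisson mean, and the base rate) is a placeholder chosen for illustration, not a value estimated in the paper.

```python
# Minimal numeric sketch of Remark 9; all parameters are illustrative assumptions.
import math
from scipy.stats import poisson

def sensitivity(m, s0=0.95, mu_t=12.0, beta=1.5):
    """s(m) = s0 * exp(-beta * (m - mu_T)_+): constant up to mu_T, then exponential decay."""
    return s0 * math.exp(-beta * max(m - mu_t, 0.0))

def ppv(m, lam=5.0, pi=1e-4):
    """PPV at threshold m when innocent match counts are Poisson(lam) and the base rate is pi."""
    q = poisson.sf(m - 1, lam)        # false positive rate P(X >= m) for innocents
    s = sensitivity(m)
    return s * pi / (s * pi + q * (1 - pi))

for m in range(8, 45, 4):
    print(f"m={m:2d}  PPV={ppv(m):.2e}")
# With these parameters, PPV peaks near mu_T, dips over an intermediate range where
# sensitivity losses outpace false-positive gains, and only then climbs toward 1.
```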
4.1.2. Posterior Reliability and Bayesian Critical Scales
In the large-deviation setting of Theorem 1, with threshold $m = c\lambda$ and $c > 1$, the false positive rate satisfies $q \asymp \exp\{-\lambda (c \ln c - c + 1)\}$ up to subexponential prefactors. Combining this with (16) yields an explicit condition for maintaining actionable posterior probabilities.
Remark 10 (Sparse-Target Regime). The following proposition applies to the “needle in a haystack” setting, where r (the expected number of true targets) is fixed or grows slowly while the population n increases. If instead $r = \pi n$ for a fixed prevalence $\pi$, then the ratio $r/n$ is constant, and the condition for actionable PPV reduces to a bound on $q$ relative to $s\pi$, independent of n. The sparse-target regime is the most challenging case for screening systems and is therefore the primary focus of this analysis.
Proposition 4 (Bayesian Critical Population for Actionable PPV). Fix a desired posterior level $\alpha \in (0,1)$ (e.g., $\alpha = 0.5$) and sensitivity $s \in (0,1]$. Let r denote the expected number of true targets in the population (so the base rate is $\pi = r/n$). In the sparse-target regime where r is fixed or grows sublinearly with n, the condition $\mathrm{PPV} \ge \alpha$ is satisfied whenever
$$n \le r \left( 1 + \frac{(1-\alpha)\, s}{\alpha\, q} \right). \quad (18)$$
Under the large-deviation scaling $q \asymp \exp\{-\lambda(c \ln c - c + 1)\}$, the Bayesian critical population size $n_B^{\ast}(\alpha)$ satisfies
$$n_B^{\ast}(\alpha) \asymp \frac{(1-\alpha)\, s\, r}{\alpha}\, \exp\{\lambda(c \ln c - c + 1)\}, \quad (19)$$
where ≍ hides subexponential factors. Proof. From (16), $\mathrm{PPV} \ge \alpha$ iff $s\pi \ge \alpha\,[s\pi + q(1-\pi)]$. Rearranging gives $q(1-\pi) \le \frac{1-\alpha}{\alpha}\, s\pi$, equivalently $q(n - r) \le \frac{1-\alpha}{\alpha}\, s r$, which yields (18). Substituting the large-deviation scaling for q proves (19). □
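A minimal sketch of the bound (18) as reconstructed above: it returns the largest population for which PPV stays at or above $\alpha$, given sensitivity s, per-individual false positive rate q, and expected target count r. The input values are hypothetical.

```python
# Sketch of Proposition 4 / bound (18); the numbers below are hypothetical.
def bayesian_critical_population(alpha, s, q, r):
    """Largest n with PPV >= alpha: n <= r * (1 + (1 - alpha) * s / (alpha * q))."""
    return r * (1.0 + (1.0 - alpha) * s / (alpha * q))

print(bayesian_critical_population(alpha=0.5, s=0.9, q=1e-6, r=10))  # ~9.0e6
print(bayesian_critical_population(alpha=0.9, s=0.9, q=1e-6, r=10))  # ~1.0e6
```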
When sensitivity varies with the threshold, as in Remark 9, the bound (19) should be interpreted as an upper bound on achievable Bayesian reliability, since $s(m) \le s_0$ for all m.
Remark 11 (The Bayesian Trap). Frequentist reliability deteriorates once $nq \gtrsim 1$, when the system is likely to produce at least one false alert. Bayesian actionability demands the stronger condition $nq \ll r$. When $r \gg 1$, there is an intermediate regime $1 \lesssim nq \lesssim r$ where false alerts occur frequently but individual flags retain some evidential value. When $r \ll 1$ (extremely sparse targets), posterior reliability collapses before the system becomes statistically unreliable. In all cases, if $nq/r \to \infty$, posterior probabilities decay toward zero even when individual false positives remain rare.
4.1.3. Likelihood Ratios and Classical Fallacies
Analysts often cite tiny false positive rates q or large likelihood ratios $\mathrm{LR} = s/q$ and mistakenly infer that $\Pr(\text{target} \mid \text{flag})$ is therefore large. This is the classical prosecutor’s fallacy [2]. Bayes’ rule shows that
$$\frac{\Pr(\text{target} \mid \text{flag})}{\Pr(\text{innocent} \mid \text{flag})} = \frac{s}{q} \cdot \frac{\pi}{1 - \pi}. \quad (20)$$
When the base rate $\pi = r/n$ is small, the prior odds can overwhelm any fixed likelihood ratio. Even extremely rare false positives (arbitrarily small q) do not guarantee high PPV. When $s/q \ll n/r$, most flagged individuals remain innocent despite individually low false positive rates.
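The snippet below works through (20): the same likelihood ratio yields very different posteriors under different priors, which is the prosecutor's fallacy in numerical form. The numbers are illustrative placeholders.

```python
# Worked illustration of Eq. (20): posterior odds = likelihood ratio x prior odds.
def posterior_prob(s, q, prior):
    """P(target | flag) from likelihood ratio s/q and prior probability of being a target."""
    lr = s / q
    prior_odds = prior / (1.0 - prior)
    post_odds = lr * prior_odds
    return post_odds / (1.0 + post_odds)

# Tiny false positive rate, but an even tinier base rate:
print(posterior_prob(s=0.99, q=1e-6, prior=1e-9))  # ~0.001: a flag is almost surely a false positive
print(posterior_prob(s=0.99, q=1e-6, prior=1e-4))  # ~0.99: same test, credible prior, high PPV
```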
4.1.4. Resolving the DNA Database Controversy
Forensic statisticians have long debated the evidential value of DNA “cold hits.” The authors of [1] argued that searching databases weakens evidence by inflating coincidental match probabilities; refs. [2,34] countered that likelihood ratios preserve evidential weight. Our framework resolves this apparent contradiction by identifying the relevant asymptotic regime.
Using (20), the posterior odds after a database match are
$$\frac{\Pr(\text{target} \mid \text{match})}{\Pr(\text{innocent} \mid \text{match})} = \frac{s}{q} \cdot \frac{r}{n - r}.$$
The DNA regime. Standard STR genotype profiling yields match probabilities on the order of $10^{-9}$ or smaller. Even with databases of size $n \sim 10^{7}$, the product $nq$ remains far below the critical scale $nq \approx 1$ at which coincidental matches become probable.
In this extreme regime, likelihood ratios dominate the posterior odds. The prior odds may be small (perhaps $10^{-6}$ if we have one suspect among a million), but the likelihood ratio of order $10^{9}$ or more is so enormous that posterior probabilities remain overwhelmingly high. This validates Balding’s argument: database size does not meaningfully dilute evidential weight when $nq \ll 1$.
The surveillance regime. Multi-attribute surveillance systems operate in a fundamentally different regime. With exposure rate $\lambda$ and threshold multiplier c (i.e., $m = c\lambda$), the per-individual false positive rate q takes the large-deviation form derived in Section 3.1. For populations of realistic size, the product $nq$ places the system far above the critical scale $nq \approx 1$.
In this regime, prior odds collapse faster than likelihood ratios can compensate. Even if the likelihood ratio is substantial, the prior odds are so unfavorable that posterior probabilities remain low. Stockmarr’s caution applies: match evidence loses evidential weight as search populations grow.
Resolution. The critical scale $nq \approx 1$ separates these regimes. Stockmarr and Balding are both correct in their respective contexts: DNA forensics operates where $nq \ll 1$ (likelihood-driven), while multi-attribute surveillance operates where $nq \gg 1$ (base-rate-dominated). The apparent contradiction dissolves once we recognize this regime distinction.
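A two-line comparison of the regimes via the quantity $nq$, the expected number of coincidental matches; the match probabilities and population sizes below are illustrative placeholders, not the paper's case studies.

```python
# Regime comparison sketch: the critical scale is n*q ~ 1 (all numbers hypothetical).
def expected_coincidental_matches(n, q):
    """E[number of innocent individuals who match] = n * q."""
    return n * q

print(expected_coincidental_matches(n=1e7, q=1e-9))  # 0.01 -> DNA-like: likelihood-driven
print(expected_coincidental_matches(n=1e8, q=1e-4))  # 1e4  -> surveillance-like: base-rate-dominated
```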
4.1.5. Key Takeaways
Bayesian scaling mirrors frequentist scaling: Bayesian actionability inherits the exponential factor $\exp\{\lambda(c \ln c - c + 1)\}$ but includes the additional multiplicative factor $(1-\alpha)\,s\,r/\alpha$; see (19).
Posterior collapse can precede frequentist failure: In the sparse-target regime, posterior reliability collapses once $nq$ approaches $r$, often well before the frequentist transition at $nq \approx 1$.
Exponential data growth overwhelms adaptation: Reducing q by any fixed factor requires increasing k by only a modest amount, but real-world data growth is exponential in time (Section 3.2). Thus posterior collapse is temporally inevitable.
Epistemic saturation: As n grows, base rates shrink. Even rare false positives become dominated by prior odds, causing PPV to decay toward zero.
Resolution of the DNA debate: Stockmarr and Balding are correct in different regimes: Balding for $nq \ll 1$ (DNA) and Stockmarr for $nq \gg 1$ (large-scale attribute screening).
Remark 12 (Connections to Classical Statistical Fallacies). This section unifies the base-rate fallacy, the prosecutor’s fallacy, false discovery rate control [15], and the PPV problem in medical screening [35]. Conceptually, these phenomena are identical: all reflect Bayes’ rule under low prevalence and imperfect specificity. The widely cited argument that “most published research findings are false” [36] is the same FDR/PPV problem in another domain.
4.2. Fairness Implications
The Group Dominance Effect (Theorem 3) has important implications for fairness in surveillance systems. When different population groups experience differential surveillance exposure, small differences in exposure rates create exponential disparities in outcomes. Proposition 3 shows that exposure ratios of 2–4 times generate false alert disparities exceeding 20 times near critical thresholds. This exponential amplification, which arises from Poisson tail behavior, means that even modest differences in surveillance intensity produce severe outcome inequalities.
Crucially, these disparities cannot be eliminated through threshold adjustment. Group-specific thresholds merely encode the underlying exposure inequality in a different form. Equalizing outcomes requires equalizing data collection intensity at the source, not algorithmic tuning. Moreover, since the high-exposure group drives system-level false alerts, aggregate reliability metrics obscure concentrated burdens on specific subpopulations, making demographic disaggregation essential for understanding actual system performance.
Remark 13 (Structural vs. Algorithmic Bias). Proposition 3 demonstrates that disparate outcomes arise from the probabilistic structure of screening systems, independent of algorithmic design choices. When different groups experience differential surveillance exposure rates ($\lambda_1 \neq \lambda_2$), this mathematically guarantees unequal false positive rates ($q_1 \neq q_2$) under any common threshold m, creating disproportionate false alert burdens through Poisson tail behavior.
The exponential amplification in part (1) is particularly striking: when both groups are screened using the same attribute set and threshold, small differences in exposure translate to exponential differences in false alert rates (Figure 1d). The effect manifests temporally as different groups reach critical false alert rates at different times: Group 2, with exposure rate $\lambda_2 = 4\lambda_1$, fails at roughly one quarter of the attribute count at which Group 1, with exposure rate $\lambda_1$, remains reliable. While this fourfold difference in system lifetime simply reflects the fourfold difference in exposure rates (a linear relationship), the amplification becomes exponential when comparing simultaneous false alert rates: at any intermediate k, Group 2 experiences exponentially more false alerts than Group 1.
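The sketch below reproduces this amplification mechanism with Poisson upper tails at a common threshold; the exposure rates ($\lambda_2 = 4\lambda_1$) and thresholds are hypothetical values chosen only to show the effect.

```python
# Exponential amplification of false alert rates under a common threshold m
# when exposure rates differ by a factor of 4 (illustrative parameters).
from scipy.stats import poisson

lam1, lam2 = 2.0, 8.0  # Group 1 vs Group 2 exposure (expected coincidental matches)
for m in (10, 15, 20):
    q1 = poisson.sf(m - 1, lam1)  # P(X >= m) for Group 1
    q2 = poisson.sf(m - 1, lam2)  # P(X >= m) for Group 2
    print(m, f"q2/q1 = {q2 / q1:.1e}")
# The ratio q2/q1 grows by orders of magnitude as m increases.
```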
Connection to algorithmic fairness. These findings relate directly to classical impossibility theorems in the algorithmic fairness literature. Kleinberg et al. [37] and Chouldechova [19] show that equalizing false positive rates, false negative rates, and calibration is impossible when base rates differ. Our analysis identifies a complementary, and more fundamental, mechanism: surveillance exposure itself creates different effective base rates across groups, guaranteeing unequal false alert burdens even before any classifier is applied. Standard fairness interventions operate at the classifier level: equalized odds [18], demographic parity, and calibration attempt to constrain predictions. None can correct the structural disparity we identify, because it arises from data collection intensity ($\lambda_1 \neq \lambda_2$), not from how a classifier processes the collected data. Group-specific thresholds might equalize $q_1$ and $q_2$, but only by encoding exposure inequality directly. Achieving parity requires assigning a higher threshold to the more surveilled group. This perspective connects to “fairness through awareness” [38] and formalizes “structural bias” arguments from the critical algorithm studies literature [22,33,39]. Disparities can be intrinsic to systems built on heterogeneous data collection, rather than artifacts of biased algorithms or training data.
Remark 14 (Policy Implications). These results suggest that surveillance system audits should perform the following:
- 1.
Measure exposure rates ($\lambda_g$) across demographic and geographic groups, not just aggregate false alert rates.
- 2.
Recognize that system reliability is bounded by the worst-performing group (Theorem 3), making demographic disaggregation essential.
- 3.
Account for temporal dynamics: groups with higher exposure fail first as data accumulates, creating windows of maximum disparity.
- 4.
Acknowledge that threshold adjustments cannot eliminate disparities arising from differential exposure; only equalizing exposure rates across groups can achieve fairness.
4.3. Limitations and Future Directions
This analysis operates under several simplifying assumptions that define its scope and suggest natural directions for future work.
Independence across individuals. Our core results (Theorems 1 and 2) assume statistical independence of match counts across individuals (Remark 2). Common-mode events (mass gatherings, natural disasters, viral social media content, and coordinated activities) introduce positive dependence that would increase false alert rates beyond our bounds. Positive dependence inflates upper-tail probabilities and therefore worsens system reliability relative to the independent case. Extensions via positively associated random variables or Chen–Stein methods [26] could quantify these effects, but our independence-based analysis provides a lower bound on false alert rates.
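A toy Monte Carlo along these lines (the common-shock mechanism and all parameter values are our own illustrative assumptions, not the paper's model) shows how a shared event that boosts everyone's match counts inflates the system-level false alert probability relative to the independent baseline.

```python
# Toy Monte Carlo: common-mode shocks inflate the probability of at least one false alert.
import numpy as np

rng = np.random.default_rng(0)
n, lam, m, trials = 10_000, 3.0, 14, 2_000  # population, innocent match rate, threshold, replicates

def any_false_alert(shock_prob=0.0, shock_boost=4.0):
    hits = 0
    for _ in range(trials):
        counts = rng.poisson(lam, size=n)
        if rng.random() < shock_prob:                  # common-mode event (e.g., mass gathering)
            counts = counts + rng.poisson(shock_boost, size=n)
        hits += (counts >= m).any()                    # did anyone cross the threshold?
    return hits / trials

print("independent:", any_false_alert(shock_prob=0.0))
print("common-mode:", any_false_alert(shock_prob=0.1))
```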
Binary attributes. We model attributes as binary indicators (match/no-match). Continuous features, count data, or multi-level categorical attributes would require different distributional assumptions and tail bounds. The qualitative insights about combinatorial explosion in high-dimensional spaces should persist, but quantitative thresholds would differ. Extensions to Gaussian or sub-Gaussian attributes would preserve the essential exponential tail behavior underlying our critical-scale results.
Fixed thresholds. Our analysis assumes detection thresholds m are fixed at deployment time. Adaptive systems that adjust thresholds based on observed alert rates or estimated base rates could potentially extend operational lifetimes. However, Theorem 2 suggests fundamental limits: under exponential data growth, even optimally adaptive thresholds would need to grow exponentially to maintain reliability, eventually exceeding meaningful detection capabilities.
Constant sensitivity. The Bayesian analysis (Section 4.1) initially treats sensitivity s as constant. Remark 9 introduces a simple threshold-dependent model, but a complete analysis would require specifying the full match distribution of true targets and performing ROC optimization [23]. This is inherently application-dependent and beyond our current scope.
Heuristic correlation treatment. The effective degrees of freedom approach (Section 3.4; Appendix B) captures variance inflation but does not constitute a rigorous large-deviation analysis. Formal treatment would require specifying mixing conditions or dependency graph structures [26,29]. Our heuristic provides qualitative guidance rather than formal guarantees.
Lack of empirical validation. We have not validated our predictions against operational surveillance data, which is typically proprietary, classified, or subject to confidentiality restrictions. Instead, we use proxy datasets for illustration rather than validation (Appendix C); these provide qualitative checks but do not constitute validation in the intended deployment environment.
Static population structure. We assume a fixed population composition with stable group sizes $n_g$ and exposure rates $\lambda_g$. Dynamic populations with entry, exit, demographic shifts, and changing surveillance intensity would require stochastic population models. The temporal analysis (Section 3.2) addresses data growth but not population dynamics.
Formal versus heuristic results. Theorems 1, 2, and 3, along with Propositions 1 and 3, are formal results with complete proofs. The effective dimensionality analysis (Appendix B) and the sensitivity threshold model (Remark 9) are heuristic approximations.
Broader applicability. While surveillance systems motivated this analysis, the mathematical framework applies to any domain where threshold rules screen large collections across many low-probability binary indicators. The critical population bounds (Theorem 1), temporal saturation dynamics (Theorem 2), and group-level disparity amplification (Theorem 3) characterize generic properties of high-dimensional threshold detection, independent of the specific application. Natural extensions include network intrusion detection, manufacturing quality control, financial fraud screening, medical diagnostic panels, and environmental monitoring systems. The binary indicator assumption could be relaxed to accommodate hybrid frameworks combining discrete and continuous variables [40], though the essential combinatorial explosion in threshold-based screening would persist. We developed the theory through the surveillance lens because it offered the clearest exposition of the societal stakes, but the probabilistic limits derived here constrain any system that aggregates rare coincidences across high-dimensional attribute spaces.
Despite these limitations, the core mathematical structure (exponential scaling of critical populations, finite system lifetimes under data growth, and structural amplification of exposure disparities) should prove robust across modeling variations. Modeling refinements would shift numerical thresholds but not the qualitative scaling laws, which arise from intrinsic high-dimensional coincidence phenomena.