1. Introduction
Single cells constitute the fundamental operational units of biological systems, exhibiting distinct developmental trajectories and molecular states. Over the past decade, single-cell RNA sequencing has elucidated pseudo-temporal trajectories by capturing static snapshots of transient transcriptomic states. More recently, the emergence of single-cell genomics has provided a longitudinal and evolutionary perspective [1,2,3], enabling the dissection of genetic heterogeneity in developmental biology [4], tumor evolution [5,6,7], organ homeostasis [8], and aging [9,10,11]. However, the extreme rarity and stochastic distribution of spontaneous mutations necessitate high-fidelity whole-genome amplification (WGA) from the picogram-scale DNA template of a single cell, which is the primary challenge in single-cell genomic research.
To overcome the inherent constraints of single-cell templates, Degenerate Oligonucleotide-Primed PCR (DOP-PCR) and Multiple Displacement Amplification (MDA) utilize degenerate or random primers to achieve non-specific amplification [12,13]. These exponential amplification-based methods, however, inevitably introduce substantial amplification bias, allelic imbalances, and the propagation of errors from the early cycles. To address these limitations, next-generation strategies, like Multiple Annealing and Looping-Based Amplification Cycles (MALBAC) and Linear Amplification via Transposon Insertion (LIANTI), are engineered to preferentially amplify primary templates over daughter strands, leading to a quasi-linear amplification process [14,15]. More recently, Primary Template-directed Amplification (PTA) has refined this approach by utilizing termination bases to effectively suppress the further amplification of daughter strands [16]. These methods aim to ensure that the majority of amplicons derive independently from primary templates, thereby reducing amplicon redundancy and enabling higher-fidelity single-cell genomics analysis. The differences in these amplification strategies are summarized in Table S1, which further incorporates PicoFlex [17], TruePrime [18] and Single-Stranded Sequencing using Microfluidic Reactors (SISSOR) [19].
Beyond technical advancements in amplification strategies, accurately assessing the fidelity of variant calling in single-cell genomic data remains challenging. Conventional quality control frameworks primarily rely on uniformity indicators, like genomic coverage, Gini index, and Lorenz curves, to assess overall amplification performance [20,21,22]. While these indicators are indispensable for copy number variation (CNV) analysis, the fidelity of SNV calling is governed by different characteristics, like allelic imbalance or allele dropout [22]. Relying on uniformity-based indicators for SNV quality assessment creates a methodological mismatch, potentially leading to the overestimation of quality in libraries with high redundancy but low informational complexity. Furthermore, while the Variant Allele Frequency (VAF) density distribution is commonly used to qualitatively reflect allelic imbalance, it remains largely descriptive. Currently, there is a lack of quantitative indicators to decode the underlying amplification mechanisms or to assess the direct impact of such imbalances on SNV calling accuracy for a given dataset. In addition, while benchmarking strategies often employ single-cell clonal expansion to validate the mutation fidelity [4,23,24], additional experimental validation is costly and largely unscalable for most tissues or for routine quality assessment of individual datasets. Consequently, there is an urgent need for a unified, experiment-free quantitative framework to assess the fidelity of variant calling in a single-cell WGA library.
To meet this need, we propose that the depth of independent amplicons (DIA) provides a fundamentally more rigorous foundation for mutation calling by offering an interpretable and quantitative assessment of allelic imbalance, beyond a descriptive distribution of allele frequency. Theoretically, the reliability of a genomic variant is not merely a function of its allelic fraction among total reads, but of its redundancy-free molecular evidence. Specifically, true variants residing on the primary template will be recurrently captured across multiple independent amplification events. In contrast, stochastic artifacts, such as amplification errors, are typically restricted to a single daughter strand and its progeny. Therefore, a higher DIA value reinforces the statistical power of variant calling by ensuring convergent evidence from multiple primary templates, while a lower DIA indicates greater stochasticity in the library, where early amplification errors propagate more readily and exacerbate allelic imbalances.
In this study, we introduce the DIAG, a unified statistical framework designed to quantify the independent amplification events for a single-cell genomic library. Beyond in silico simulations, we validated the reliability of the DIAG using real-world biological data from single cells and pairwise organoid samples. By formalizing the relationship between amplicon independence and variant reliability, the DIAG establishes a robust, rigorous standard for quantifying the fidelity of mutation calling without the need for extra experiments.
3. Results
3.1. Theoretical Framework of the DIAG
Single-cell WGA (scWGA) is essentially a process of allelic information transfer from minimal primary templates to an NGS library. Within this workflow, the obtained reads are derived from two distinct populations of amplicons (Figure 1A). The initial products directly and independently derived from the primary templates are termed independent amplicons, whereas the remaining amplicons, produced through the further amplification of existing products, are redundant. The DIA represents the effective number of these independent amplicons, indicating the molecular complexity of the library.
For single-cell genomic analysis, the enzymatic error rate and the DIA together dictate the reliability of variant detection. The former determines the frequency of stochastic errors, a well-recognized intrinsic property of specific WGA strategies. The latter, however, has been frequently overlooked in previous studies. A lower DIA leads to an overrepresentation of redundant amplicons, allowing polymerase-introduced errors to propagate and become indistinguishable from true variants (Figure 1B). In contrast, a higher DIA dilutes the impact of such stochastic errors by ensuring they occur in only a small fraction of independent amplicons, while true mutations remain consistently represented. This allows artificial errors, which are unlikely to recur at the same locus across independent templates, to be effectively filtered using a consensus strategy (Figure 1B). Therefore, given a comparable enzymatic error rate, DIA provides a direct and standardized assessment of amplification performance for any scWGA library.
To derive DIA, we modeled the scWGA and NGS workflows as a two-stage hierarchical stochastic process. Utilizing variance decomposition and the method of moments (MoM), we derived a formal relationship to estimate DIA based on the VAF distribution across heterozygous loci (Methods). To ensure the robustness of our framework across the heterogeneous genomic landscape, the DIAG supports partitioned analysis based on genomic features, such as GC content and CNV, allowing for the quantification of DIA within specific genomic bins (10 Mb). Furthermore, we implemented bootstrap resampling of loci to quantify statistical uncertainty and generate confidence intervals. This results in a comprehensive, user-friendly quality assessment report for each single-cell library (Figure 1B).
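The Methods are not reproduced in this section, so the following is only a sketch of how a method-of-moments estimator can follow from the variance decomposition, assuming both stages are binomial (as the nested-binomial description in the Discussion suggests). The notation is introduced here for illustration and is not taken from the original: $N$ is the DIA at a locus, $K$ the number of independent amplicons carrying the alternate allele, $n$ the read depth, $X$ the number of alternate-allele reads, $V$ the observed VAF, and $\hat{\sigma}^2$ the empirical variance of $V$ across heterozygous loci; $p$ is the expected allele frequency (0.5 at a diploid heterozygous site).

```latex
\begin{align}
K &\sim \mathrm{Binomial}(N, p), \qquad
X \mid K \sim \mathrm{Binomial}\!\left(n, \tfrac{K}{N}\right), \qquad
V = \tfrac{X}{n} \\
\mathrm{Var}(V)
  &= \mathbb{E}\big[\mathrm{Var}(V \mid K)\big]
   + \mathrm{Var}\big(\mathbb{E}[V \mid K]\big) \\
  &= \frac{p(1-p)}{n}\left(1 - \frac{1}{N}\right) + \frac{p(1-p)}{N} \\
\hat{N} &= \frac{n - 1}{\dfrac{n\,\hat{\sigma}^2}{p(1-p)} - 1}
\end{align}
```

Under these assumptions, the variance approaches the pure sequencing term $p(1-p)/n$ as $N$ grows, so the estimator attributes any excess VAF variance to the amplification stage.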
3.2. Accurate Recapitulation of VAF Distributions by the DIAG in In Silico Simulation
To validate the performance of the DIAG framework, we performed in silico simulations mimicking the two-stage scWGA-NGS process. For each locus, we first sampled the alleles of independent amplicons from the genomic template, followed by sampling sequenced reads from this independent amplicon pool (Methods). Then, we tested the sensitivity of the VAF distribution to variations in DIA (Figure 2A). As expected, lower DIA values introduced significant sampling variation at the amplification stage, manifesting as highly dispersed VAF distributions and prominent pseudo-homozygous peaks. Conversely, as DIA increased, the distributions became increasingly concentrated around the expected heterozygote frequency of 0.5. These results indicate that the DIA value intuitively captures the shape of the allele frequency distribution at heterozygous loci.
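The two-stage sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that both stages are plain binomial draws; the function name `simulate_vaf` and its parameters are hypothetical and do not come from the DIAG software.

```python
import numpy as np


def simulate_vaf(dia, depth=30, n_loci=100_000, p=0.5, seed=0):
    """Simulate VAFs at heterozygous loci with two nested binomial draws.

    Hypothetical helper: `dia`, `depth`, and `p` are illustrative names.
    """
    rng = np.random.default_rng(seed)
    # Stage 1 (amplification): number of the `dia` independent amplicons
    # at each locus that carry the alternate allele.
    alt_amplicons = rng.binomial(dia, p, size=n_loci)
    # Stage 2 (sequencing): reads are drawn from that amplicon pool.
    alt_reads = rng.binomial(depth, alt_amplicons / dia)
    return alt_reads / depth


for dia in (2, 10, 50):
    vaf = simulate_vaf(dia)
    # Loci whose reads all support one allele look homozygous.
    pseudo_hom = np.mean((vaf == 0.0) | (vaf == 1.0))
    print(f"DIA={dia:>3}  VAF sd={vaf.std():.3f}  "
          f"pseudo-homozygous fraction={pseudo_hom:.3f}")
```

In this toy setting, the VAF spread and the pseudo-homozygous fraction both shrink as the preset DIA grows, which is the qualitative behavior described for Figure 2A.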
The simulated sequencing data were evaluated using the mathematical DIAG framework (Figure 2B). In a baseline scenario at 30× sequencing depth across 100,000 loci, the estimated DIA values exhibited a nearly perfect correlation with the predefined ground truths (Figure 2B; r = 0.9997, n = 100, p < 1 × 10^−38). Across multiple replicates, the estimation accuracy consistently remained between 94% and 100% (Figure S2). Therefore, the DIAG accurately recovers the preset DIA from simulated VAF distributions, providing a solid analytical framework.
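The recovery of a preset DIA from simulated reads, together with the locus bootstrap mentioned in Section 3.1, can be sketched as follows. The closed-form estimator assumes two nested binomial stages and a fixed read depth; it is an illustration of the method-of-moments idea, not the DIAG implementation, and `estimate_dia` and `bootstrap_ci` are hypothetical names.

```python
import numpy as np


def estimate_dia(vaf, depth, p=0.5):
    """Method-of-moments DIA estimate from VAFs at a fixed read depth.

    Assumes Var(VAF) = p(1-p) * [(1 - 1/N)/depth + 1/N], i.e. two nested
    binomial stages. This closed form is an illustrative assumption.
    """
    scaled_var = np.var(vaf) / (p * (1.0 - p))
    excess = depth * scaled_var - 1.0
    if excess <= 0:
        return float("inf")  # no detectable amplification-stage variance
    return (depth - 1.0) / excess


def bootstrap_ci(vaf, depth, n_boot=200, alpha=0.05, seed=1):
    """Percentile confidence interval from resampling loci with replacement."""
    rng = np.random.default_rng(seed)
    estimates = [
        estimate_dia(rng.choice(vaf, size=vaf.size, replace=True), depth)
        for _ in range(n_boot)
    ]
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])


# Simulate 100,000 heterozygous loci at 30x with a preset DIA of 10.
rng = np.random.default_rng(0)
true_dia, depth, n_loci = 10, 30, 100_000
alt_amplicons = rng.binomial(true_dia, 0.5, size=n_loci)
vaf = rng.binomial(depth, alt_amplicons / true_dia) / depth

dia_hat = estimate_dia(vaf, depth)
ci_low, ci_high = bootstrap_ci(vaf, depth)
print(f"true DIA={true_dia}  estimate={dia_hat:.2f}  "
      f"95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Real libraries have variable depth per locus, so a production estimator would need to handle that; the fixed-depth form is kept here only to make the algebra transparent.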
3.3. Robustness of the DIAG Framework Across Technical and Biological Variables
We then systematically evaluated the robustness of the DIAG framework under various technical and biological perturbations, including non-uniform DIA distributions, varying sequencing depths, genomic data scales, intrinsic enzymatic error rates, and pervasive aneuploidy.
First, the DIAG demonstrates high tolerance to amplification preferences across the genome. By modeling DIA as a distribution rather than a fixed constant, we observed that DIA estimates remained stable under various distributional assumptions. Under normal distributions, the accuracy of estimation is relatively stable across different standard deviations (Figure 3A and Figure S3A). Similar results were obtained using a uniform distribution (Figure 3B and Figure S3B), further indicating that the DIAG can robustly recover DIA signals even in genomic regions prone to severe amplification bias, such as GC-enriched or repetitive sequences.
Second, we assessed the impact of sequencing depth and the number of informative loci on the estimation accuracy. The DIAG shows rapid saturation with respect to sequencing depth and maintains high fidelity even under low-coverage conditions (Figure 3C and Figure S3C), achieving an average accuracy of 91.8% at 5× coverage. Furthermore, the DIAG exhibits remarkable insensitivity to the number of informative loci (Figure 3D and Figure S3D), suggesting its potential for application in high-throughput but low-pass sequencing libraries or region-specific genomic assessments. Collectively, these results suggest that a minimum empirical requirement for reliable single-cell quality control is 5× coverage with at least 1000 informative sites.
In addition, we confirmed that DIA inference is decoupled from the enzymatic error rate. Conceptually, these two factors are independent because the enzymatic error rate determines the frequency of primary amplification errors, whereas the DIA dictates how these errors are represented in the final sequencing library. Our results confirm this by showing that the DIAG is highly insensitive to the underlying enzymatic error rate (Figure 3E and Figure S3E). Specifically, DIA inference maintains a stable performance even at an error rate as high as 0.1, achieving an average accuracy of 95.5%. This underscores that DIA captures the physical independence of amplicons rather than the biochemical fidelity of the polymerase.
Furthermore, the DIAG remains robust within complex biological contexts characterized by pervasive aneuploidy. To account for the CNV typical of tumor cells, we adjusted the preset allele frequency p to reflect local ploidy. Notably, the DIAG accurately recovered the number of independent amplicons across diverse copy number states (Figure 3F and Figure S3F), demonstrating its reliability for assessing atypical single-cell datasets with an unstable genome. Collectively, these results establish the DIAG as a robust and reliable tool, unconfounded by common technical artifacts or complex biological contexts.
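To illustrate why the allele frequency p must follow local ploidy, the sketch below simulates loci with one alternate copy out of three and compares a ploidy-aware estimate with one that keeps p = 0.5. It assumes the same nested-binomial variance form used for the diploid case and holds the preset DIA fixed purely for illustration; `expected_vaf` and `estimate_dia` are hypothetical helpers, not DIAG functions.

```python
import numpy as np


def expected_vaf(alt_copies, total_copies):
    """Expected alternate-allele fraction given local copy number."""
    return alt_copies / total_copies


def estimate_dia(vaf, depth, p):
    """Method-of-moments estimate under two nested binomial stages
    (assumed form, fixed read depth)."""
    excess = depth * np.var(vaf) / (p * (1.0 - p)) - 1.0
    return (depth - 1.0) / excess if excess > 0 else float("inf")


# Single-copy gain of the reference allele: 1 alternate copy out of 3.
p_local = expected_vaf(alt_copies=1, total_copies=3)

rng = np.random.default_rng(0)
true_dia, depth, n_loci = 10, 30, 100_000
alt_amplicons = rng.binomial(true_dia, p_local, size=n_loci)
vaf = rng.binomial(depth, alt_amplicons / true_dia) / depth

dia_adjusted = estimate_dia(vaf, depth, p=p_local)  # ploidy-aware
dia_naive = estimate_dia(vaf, depth, p=0.5)         # ignores the gain
print(f"adjusted={dia_adjusted:.2f}  naive={dia_naive:.2f}")
```

In this toy example, keeping p = 0.5 in a gained region inflates the estimate, because the binomial variance p(1 − p) is smaller away from 0.5.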
3.4. DIA as a Comprehensive Indicator of SNV Calling Fidelity
The ultimate utility of a quality control metric lies in its capacity to predict downstream analytical performance. Therefore, we systematically evaluated the relationship between DIA and downstream mutation calling performance. Using an in silico dataset with a known ground truth, we benchmarked DIA against widely used performance indicators, including true-positive (TP) count, allele dropout (ADO), false-negative (FN) count, false-positive (FP) count, precision, and sensitivity, under varying enzymatic error conditions. This benchmarking effort aims to bridge the gap between abstract library complexity and the concrete reliability of somatic mutation discovery, serving as a quantitative basis for setting quality thresholds in single-cell genomics analysis.
First, DIA is the primary determinant of recovered TP variants and ADO mitigation. As DIA increases, the recovery of true variants improves rapidly and reaches saturation (Figure 4A). Conversely, ADO is drastically mitigated at higher DIA levels, as the probability of failing to capture a specific allele decreases exponentially with the accumulation of independent amplicons (Figure 4B). Notably, as DIA dictates the probability that a variant is physically captured and detected, we found that the enzymatic error rate is decoupled from these two indicators.
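The exponential decay of ADO can be made concrete with a small calculation. Assuming each independent amplicon at a heterozygous locus samples one of the two alleles independently with probability p (an assumption of this sketch, consistent with a binomial amplification stage), dropout occurs when all amplicons come from the same allele. The function name `ado_probability` is hypothetical.

```python
def ado_probability(dia, p=0.5):
    """Probability that all `dia` independent amplicons at a heterozygous
    locus derive from the same allele, so the other allele drops out."""
    return p ** dia + (1.0 - p) ** dia


for dia in (1, 2, 5, 10, 20):
    print(f"DIA={dia:>2}  P(ADO)={ado_probability(dia):.2e}")
```

For p = 0.5 this reduces to 2 × 0.5^DIA, so each additional independent amplicon halves the dropout probability under this model.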
Second, high DIA suppresses false variants through a molecular consensus mechanism. We found that both FN and FP counts are negatively correlated with DIA (Figure 4C,D). At low DIA levels, stochastic errors introduced during early amplification cycles can become overrepresented, making them indistinguishable from true biological variants and inflating the FP count. However, as DIA increases, these random errors are effectively suppressed through the consensus of multiple independent amplicons derived from the same template. While higher enzymatic error rates increase the baseline of errors, a high DIA enables a robust filter to suppress these stochastic artifacts, significantly enhancing precision and sensitivity (Figure 4E,F).
In addition, analyzing heterozygous and homozygous loci separately revealed two distinct benchmarking profiles (Figure S4). Specifically, ADO was identified as the primary source of error at heterozygous sites, representing a fundamental sampling failure. In contrast, polymerase-induced amplification errors dominated at homozygous sites, where the absence of a second biological allele renders the system more vulnerable to propagated artifacts. Consequently, while metrics such as TP and ADO are governed almost exclusively by DIA, the FP rate is co-modulated by both DIA and intrinsic enzymatic fidelity (Figure S5). Collectively, our results demonstrate that DIA serves as a robust and comprehensive predictor of scWGA fidelity, offering a unified metric to evaluate the quality of single-cell libraries.
3.5. Validation of the DIAG Framework in Real Biological Scenarios
To validate the practicality of the DIAG framework in real biological scenarios, we employed a human organoid culture system as a controlled ground-truth model (Figure 5A). Our experimental design was guided by three key considerations. First, we designed two different biological scales, including single-cell and pairwise organoid samples. Compared to single-cell samples, multicell organoids provide a significantly larger pool of initial genomic templates and thus possess theoretically higher DIA values. Second, by performing pairwise comparisons between single cells and their corresponding organoids, we established a biological gold standard to accurately distinguish true mutations from technical noise, enabling a rigorous assessment of sensitivity and precision (Figure 5B). Third, we evaluated two distinct WGA methodologies, including MALBAC and PTA (Figure 5C). The former is a classic method designed for uniform amplification and the latter is a leading-edge technology that effectively suppresses the re-amplification of daughter strands via termination bases. This multiscale approach allowed us to comprehensively evaluate the DIAG across diverse biological scenarios.
First, we examined the VAF density distributions at heterozygous germline sites. We found that the Allele Frequency Spectrum (AFS) in organoid samples was more tightly centered around the theoretical heterozygous frequency of 0.5 than the distributions observed in single cells (Figure 5D and Figure S6). Germline SNVs were counted using the definitions shown in Figure 5B (Figure S7). Quantitatively, the DIAG successfully captured the intrinsic differences in initial template abundance, with single cells exhibiting an average DIA of 10 (ranging up to 17.41), while organoid samples achieved an average DIA of 80 (ranging up to 86.75) (Figure 5E,F). Because the limited sample size of our PTA cohort (n = 2) may constrain the overall statistical power, we incorporated additional independent datasets, including five PTA-amplified and three MALBAC-amplified single cells. Within the evaluated datasets, PTA tended to exhibit a higher DIA than MALBAC (Figure S8). This demonstrates that the DIA metric successfully quantifies the allelic imbalance that is otherwise only qualitatively visible in AFS plots.
We further assessed the capability of the DIAG to predict library fidelity by performing correlation analyses between DIA and standard performance indicators. Our results revealed that DIA values are positively correlated with both precision (Figure 5G, ρ = 0.94, p = 2.67 × 10^−12) and sensitivity (Figure 5H, ρ = 0.98, p = 9.42 × 10^−19), demonstrating the power of DIA as a quality indicator for scWGA libraries.
Beyond germline variants, the pairwise organoid controls provided an exploratory opportunity to evaluate the relevance of DIA to somatic mutation discovery. The somatic mutations in this study are defined as the variants that were not detected in the bulk sample but were detected in both the single-cell sample and the organoid sample (Figure 5B). The count of somatic mutations detected in 13 single cells is shown in Figure S7. Because the absolute number of detected somatic mutations is primarily determined by the inherent biological characteristics of the sample, it is essential to establish a metric that gauges the discovery potential across different libraries. Consistent with this requirement, we observed a positive trend between the SNR and DIA (ρ = 0.64, p = 4.42 × 10^−4; Figure 5I). We should note that the organoid samples in this study were also influenced by the amplification process, leading to a more moderate correlation compared to that observed for germline variants.
Nevertheless, our findings imply that DIA serves as a useful reference for assessing the potential for somatic mutation detection in the WGA library. Taken together, our results demonstrate the robust capability of the DIAG framework in real biological datasets. Specifically, DIAG-based inferences are consistent with experimental variables, accurately reflecting both template abundance and the intrinsic performance of distinct WGA methods. Crucially, the DIAG enables these quality assessments without the need for costly and unscalable validation experiments, ensuring its broad applicability to any given single-cell genomic dataset.
3.6. Cross-Method Benchmarking of scWGA Fidelity via DIAG
While traditional uniformity-based indicators are primarily valuable for CNV analysis, we utilized the DIAG framework to conduct a comprehensive benchmarking of several widely used scWGA methods in public datasets [15,16]. Our analysis revealed several key findings regarding technological performance.
First, PTA consistently outperformed all other methods in both VAF density distributions and DIA estimations (Figure 6A,B and Figure S9). Early-stage methods, such as DOP-PCR, tended to exhibit low DIA values (averaging 1.12, up to 1.15), reflecting minimal molecular independence and high amplicon redundancy. In contrast, PTA achieved an average DIA of 10.44 (up to 17.41), demonstrating its superior capacity to maintain a diversity of primary genomic templates during amplification.
To further characterize how these DIA variations impact downstream SNV calling fidelity, we integrated DIA estimations with the intrinsic enzymatic fidelity of each method (Figure 6C,D). This further assessment revealed that PTA achieved the highest precision (99.8% on average, up to 99.97%), followed by LIANTI at 95.4% (up to 96.7%). Furthermore, the SNV calling sensitivity of PTA reached an average of 97.3% (up to 99.8%). As summarized in Table 1, PTA consistently outperforms alternative methods in both precision and sensitivity, ensuring reliable single-cell variant discovery.
From another perspective, the DIAG was specifically designed to account for genomic heterogeneity, and the framework provides subregion-level DIA estimations in the output report, enabling researchers to dissect variance across distinct genomic features, such as GC content and aneuploidy. To validate this, we examined the relationship between DIA and GC content across 36 single-cell datasets. Our results show that DIA remains largely independent of GC content for the majority of cells (27/36, Figure S10), showing that the DIA effectively captures the complexity of the WGA library despite local genomic heterogeneity. Notably, we observed that certain methods, such as LIANTI and PTA, exhibit higher sensitivity to genomic features, with subtle correlations with GC content in specific cells. For instance, a decrease in DIA was observed in high-GC regions for two LIANTI-amplified cells (ρ = −0.64, p = 2.11 × 10^−27 and ρ = −0.61, p = 2.09 × 10^−24, Figure S10). This result suggests that the degree of GC sensitivity varies significantly between amplification chemistries, and the DIAG is capable of pinpointing such localized technical biases. Furthermore, the integration of CNV-aware partitions within DIAG ensures that these principles remain applicable to samples with complex karyotypes. Ultimately, the DIAG enables the localized quantification of WGA quality by automatically partitioning the genome into 10 Mb bins. This granular approach identifies technical variance across heterogeneous genomic regions, ensuring reliable quality assessment even in samples with complex biological contexts.
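The 10 Mb partitioning can be sketched as a per-bin application of a DIA estimator. The code below is a minimal illustration for a single toy chromosome, assuming the nested-binomial method-of-moments form at fixed depth and skipping bins with fewer than the 1000 informative sites suggested in Section 3.3; the names `dia_per_bin`, `BIN_SIZE`, and `MIN_LOCI` are hypothetical and the simulated data are not from the study.

```python
import numpy as np

BIN_SIZE = 10_000_000  # 10 Mb bins, as described in the text
MIN_LOCI = 1000        # minimum informative sites suggested in Section 3.3


def estimate_dia(vaf, depth, p=0.5):
    """Method-of-moments estimate under two nested binomial stages
    (assumed form, fixed read depth)."""
    excess = depth * np.var(vaf) / (p * (1.0 - p)) - 1.0
    return (depth - 1.0) / excess if excess > 0 else float("inf")


def dia_per_bin(positions, vaf, depth, bin_size=BIN_SIZE, min_loci=MIN_LOCI):
    """Return {bin_start: DIA estimate}; sparse bins map to None."""
    positions = np.asarray(positions)
    vaf = np.asarray(vaf)
    bin_index = positions // bin_size
    result = {}
    for b in np.unique(bin_index):
        in_bin = bin_index == b
        start = int(b) * bin_size
        if in_bin.sum() < min_loci:
            result[start] = None  # too few heterozygous sites to estimate
        else:
            result[start] = estimate_dia(vaf[in_bin], depth)
    return result


# Toy 30 Mb chromosome whose last bin is amplified less independently.
rng = np.random.default_rng(0)
depth, n_loci = 30, 9000
positions = np.sort(rng.integers(0, 30_000_000, size=n_loci))
dia_true = np.where(positions < 20_000_000, 12, 4)
alt_amplicons = rng.binomial(dia_true, 0.5)
vaf = rng.binomial(depth, alt_amplicons / dia_true) / depth

bins = dia_per_bin(positions, vaf, depth)
for start, dia in bins.items():
    print(f"{start // 1_000_000:>3} Mb  DIA={dia}")
```

Returning None for sparse bins, rather than a noisy number, reflects the trade-off noted in the Discussion between bin size and the number of informative sites.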
3.7. Decoupling of Traditional Metrics from SNV Calling Fidelity
We further evaluated the relationship between DIA and traditional metrics for the scWGA library. The analysis revealed that DIA remained independent of sequencing depth across five of the six WGA methods (Figure S11). The only exception was DOP-PCR, which lacked sufficient statistical power due to its limited sample size (n = 3). Furthermore, while a negative trend was found between DIA and conventional uniformity metrics, such as the Gini index (Figure S12), we inferred that the relationship is phenomenological rather than functional. Because the DIA captures the depth of independent amplicons, which inherently determines the ultimate genomic uniformity, it is conceptually distinct from coverage-based metrics. Unlike traditional uniformity measures that only describe the final sequencing distribution, DIA serves as an intrinsic indicator of the initial amplification complexity, capturing how WGA chemistry interacts with genomic heterogeneity and offering a more fundamental assessment of library fidelity. To validate this, we conducted a down-sampling analysis within the same library to compare the behavior of DIA and the traditional uniformity metrics. Crucially, in the down-sampling analysis of PTA datasets (Figure 7A,B), uniformity-based metrics such as the Gini index and KL divergence exhibited significant shifts as sequencing depth was artificially reduced (Figure 7C, p < 0.05 and p < 0.01, respectively). In contrast, both precision and sensitivity showed no significant changes under these conditions, confirming that DIA is decoupled from sampling-induced uniformity fluctuations.
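Why a uniformity metric can shift under down-sampling even when the underlying library is unchanged can be seen in a toy example. The sketch below computes the Gini index of per-bin read counts before and after binomial thinning; the Poisson counts, the 5% retention rate, and the function name `gini` are assumptions made for illustration and are unrelated to the PTA datasets analyzed here.

```python
import numpy as np


def gini(values):
    """Gini index of non-negative values (0 means perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x)))


rng = np.random.default_rng(0)
# Per-bin read counts for an evenly amplified toy library (~100 reads/bin).
full = rng.poisson(100, size=5000)
# Down-sample by keeping each read with probability 0.05.
thinned = rng.binomial(full, 0.05)

print(f"Gini at full depth:       {gini(full):.3f}")
print(f"Gini after down-sampling: {gini(thinned):.3f}")
```

The increase in the Gini index here comes purely from sampling noise at lower depth, which is the kind of depth dependence that a metric tied to amplification-stage complexity is meant to avoid.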
Taken together, these results suggest that DIA is a more robust and effective indicator for WGA quality assessment than traditional metrics.
4. Discussion
In this study, we introduced the DIAG, a unified statistical framework that redefines quality control for single-cell genomics by redirecting the analytical focus from global uniformity to locus-specific molecular independence. Unlike conventional uniformity-based indices, the DIAG estimates the number of independent amplicons derived directly from primary templates, which is the fundamental determinant of single-cell mutation fidelity. Specifically, the DIAG conceptualizes WGA and NGS as a two-stage hierarchical stochastic sampling process, with each stage characterized by nested binomial distributions. This enables the construction of a comprehensive probabilistic model to rigorously estimate the DIA. Based on the analyses of in silico datasets under diverse perturbations, our results demonstrate that the DIA serves as a robust indicator of the precision and sensitivity of variant calling, even in complex biological scenarios.
We further demonstrated the practicability of the DIAG using real biological datasets. Using organoid-derived clones as a positive control, we showed that the precision and sensitivity of mutation calling are highly correlated with the estimated DIA. Notably, a particular strength of the DIAG is its reference-free nature, which enables intrinsic quality assessment without the need for costly, unscalable experimental controls like single-cell clonal expansion. While previous benchmarking relied on these ground-truth positives to validate specific WGA strategies, such methods are impractical for high-throughput research. The DIAG bypasses this bottleneck by directly quantifying the complexity of each library.
Furthermore, the DIAG effectively captures amplification efficiency across diverse genomic regions. It is well established that GC content dictates the melting temperature and secondary structure stability of DNA templates. High-GC regions increase the energy required for strand separation, potentially leading to stochastic polymerase stalling and reduced processivity during the denaturation and annealing phases of PCR. Conversely, extremely low-GC regions are prone to non-specific priming. While global estimation remains a robust and necessary standard for general analysis, localized genomic features including GC content and CNVs can drive imbalances in amplification across the genome. In scenarios where samples exhibit high regional heterogeneity, the partition-based framework of DIAG provides a valuable alternative by directly reflecting regional amplification efficiency. Rather than viewing these genomic features as confounding variables to be modeled out, this approach treats them as mechanistic drivers, utilizing the DIAG to provide a more precise lens into regional processivity. In addition, the DIAG accounts for amplification imbalance across diverse genomic regions, maintaining a consistent performance even in regions of genomic gain or loss. This characteristic enables the transition from genome-wide descriptive assessment to locus-specific probabilistic modeling, allowing for the high-resolution filtration of stochastic artifacts and ensuring reliable single-cell genomic analysis. With the growing availability of single-cell WGA datasets, we anticipate that the DIAG will become an essential foundation for resolving the plasticity of human development, the clonal origins of tumorigenesis, and the intricate population dynamics of cellular aging.
Data integration from multiple laboratories introduces confounding variables, such as heterogeneous cell types, diverse donor backgrounds, and different sequencing platforms. To enhance the performance of the DIAG in future releases, several technical avenues remain to be explored. First, while the robustness of DIA estimation has been well demonstrated in standard cell-by-cell WGA datasets, it requires approximately 1000 informative heterozygous sites to ensure the output reliability, limiting its utility in ultra-low-coverage samples or regions with extensive loss of heterozygosity. Future optimizations could incorporate disequilibrium information to aggregate sparse local molecular signals for low-quality samples or high-throughput, low-coverage libraries (e.g., DEFND-seq [37]). Second, while the DIAG is able to effectively capture regional amplification variation, the requirement of 1000 informative heterozygous sites for high-confidence estimation limits further genomic subdivision. Consequently, in regions with low SNV density, the method is unable to split the genome into smaller segments, limiting the achievable resolution. Specifically, we have provided an optional analysis in the DIAG framework to control for ploidy-driven biases. In future versions, we anticipate refining this into a joint probabilistic model that simultaneously accounts for multiple genomic covariates, enabling a more robust and fully automated analysis. Third, by leveraging locus-specific DIA values, a likelihood-based scoring system could be implemented to statistically distinguish high-confidence mutations from stochastic background noise.