Article

BayesCNV: A Bayesian Hierarchical Model for Sensitive and Specific Copy Number Estimation in Cell Free DNA

Pillar Biosciences Inc., Natick, MA 01760, USA
*
Authors to whom correspondence should be addressed.
Diagnostics 2026, 16(2), 280; https://doi.org/10.3390/diagnostics16020280
Submission received: 6 December 2025 / Revised: 4 January 2026 / Accepted: 12 January 2026 / Published: 16 January 2026
(This article belongs to the Section Pathology and Molecular Diagnostics)

Abstract

Background/Objectives: Detecting copy number variations (CNVs) from next-generation sequencing (NGS) data is challenging in targeted sequencing panels, and especially so for cell-free DNA (cfDNA), where the signal is weak and noise is high. Methods: We present BayesCNV, a Bayesian hierarchical model for gene-level copy ratio estimation from targeted amplicon read depths compared to a CNV-neutral reference sample. The model provides posterior uncertainty for each gene and supports interpretable calling based on effect size and posterior confidence. The model also provides a principled quality-control strategy based on the marginal log likelihood of each sample, with low values indicating low confidence in the calls. BayesCNV estimates this quantity reliably using thermodynamic integration. We benchmark our method against two publicly available CNV callers, IonCopy and DeviCNV, using Seracare® reference samples with known CNVs on the OncoReveal® Core Lbx panel. Results: Our method achieves a sensitivity of 0.87 and a specificity of 0.996, compared with 0.93 and 0.81 for IonCopy and 0 and 0.96 for DeviCNV. In a separate FFPE dataset using the OncoReveal® Essential Lbx panel, we show that the marginal log likelihood cleanly separates degraded from high-quality samples, even when conventional sequencing QC metrics do not. Conclusions: BayesCNV provides accurate and interpretable gene-level CNV estimates and uncertainty quantification, along with an evidence-based quality control metric that improves robustness in targeted cfDNA workflows.

1. Introduction

Copy number variations (CNVs) are genomic aberrations caused by an excess or deficit in the number of copies of a genomic region or gene [1] relative to the typical diploid state. These aberrations are involved in the onset and progression of various diseases, including cancer [2], heart disease [3], and thalassemia [4]. Accurate identification of CNVs from next-generation sequencing (NGS) data is vital for understanding the genetic basis of diseases and for targeted therapies [5]. In the realm of liquid biopsies utilizing cell-free DNA (cfDNA), detecting somatic copy number amplifications is increasingly recognized as an invaluable tool for early diagnosis [6], disease monitoring [7], and therapy selection, particularly for cancer patients [8].
Despite its clinical utility, CNV calling in cfDNA is difficult for reasons that are less acute in tumor tissue. cfDNA is highly fragmented [9] and exists in low concentrations [10], thereby necessitating specialized pre-processing steps [11] and higher sequencing depth to detect the weak signal. Existing CNV callers designed for bulk tumor tissue may not perform well under these conditions due to their assumptions about tumor purity and clonal heterogeneity.
A further complication is that many clinical workflows rely on targeted panels rather than whole-genome or whole-exome sequencing. These panels are cost-effective and focus on clinically relevant genes, but their narrow genomic scope can reduce the availability of control regions and make common model assumptions inappropriate [12]. Many CNV callers were developed for WGS/WES and are not optimized for targeted panel data, as we describe in more depth in Section 2 [13]. As a result, most methods fail to run or to provide accurate calls in this setting. Furthermore, many callers do not function with the small batch sizes commonly encountered in clinical settings. Given these constraints and the limitations of existing CNV calling methods, there is a pressing need for a somatic CNV caller specifically designed for detecting copy number amplifications in cfDNA at low tumor content from targeted panels.
In this paper, we introduce BayesCNV, a Bayesian somatic CNV caller tailored to these challenges. BayesCNV models the copy ratio relative to a normal sample hierarchically, with gene-level means and variances, to estimate the copy number. We derive this caller from basic assumptions and develop an inference scheme for estimating the parameters, making gene-specific calls, and calculating variant call confidence. We also develop a method, based on thermodynamic integration [14], for estimating the Bayesian evidence of the sample. This enables quality control and automatic detection of low-quality or mismatched samples. We then demonstrate the performance of this caller on data from Seracare® samples (Milford, MA, USA) using two OncoReveal® Liquid Biopsy Panels (Natick, MA, USA). These results are compared to two competitor methods, IonCopy [15] and DeviCNV [16], with BayesCNV showing the best balance of sensitivity and specificity.
The contents of this paper are as follows: in Section 2, we review current CNV callers and describe why most of them cannot be used in our application. In Section 3, we describe our development of the mathematical model with all steps required in inference, computation, and calling. In Section 4, we present the performance of our model on synthetic and real data. Finally, in Section 5, we provide some brief remarks on future directions for this work. An open-source implementation of the mathematical model is available at https://github.com/Pillar-Biosciences-Inc/BayesCNV (accessed on 1 December 2025).

2. Related Work

There is a large body of work on CNV detection from NGS data. However, among the methods we evaluated, only DeviCNV [16] and IonCopy met the practical requirements of targeted liquid biopsy panels in our setting. In particular, many publicly available CNV callers are not well-suited to clinical cfDNA workflows because they (1) rely on CNVs spanning long contiguous genomic regions, (2) require large cohorts for normalization or denoising, (3) do not explicitly estimate copy number magnitude, and/or (4) do not support calling both gains and losses. Table 1 summarizes the tools considered in our review.
Many CNV callers were developed for WES/WGS, such as cn.MOPS [17], CNVkit [18], and EXCAVATOR2 [19]. These methods are optimized for extremely low per-amplicon read depths, often below 100 reads. To mitigate high variance at individual targets, these methods often aggregate signal across longer contiguous regions, implicitly assuming that CNVs span large genomic intervals (typically >20 kb). While this improves stability by averaging out target-level noise, it can reduce sensitivity for short, focal CNVs that are common in clinical applications.
Another class of callers, such as Canoes, CoNIFER, and XHMM [20,21,22], relies on large cohorts of normal or pseudo-normal samples for denoising. For example, cn.MOPS [17] leverages reference normal samples to build an accurate baseline read-depth profile, while XHMM and CoNIFER apply a singular value decomposition to remove systematic variation across the samples. These requirements are often incompatible with clinical liquid biopsy workflows, which typically process small batches of samples in order to minimize turnaround time. Additional limitations across many CNV callers include an inability to detect both gains and losses, or to estimate the magnitude of the CNV (e.g., distinguishing between three and four copies).
The only three tools we identified that are explicitly designed for targeted panels are DeviCNV [16], IonCopy [15], and StateCNV [23]. StateCNV was designed for targeted panels with strong spatial correlation among amplicons and is therefore not well matched to liquid biopsy panels whose targets are distributed across the genome, as considered here. However, the other two are suitable competitors. DeviCNV employs a regression-based framework: it computes amplicon-level depth, normalizes it using sample-wide depth to account for GC content and PCR bias, and compares it against a null distribution under a diploid assumption. IonCopy uses a two-step normalization procedure, first across amplicons and then across samples, and constructs a robust null distribution to assign per-amplicon significance. Both methods were validated on targeted sequencing data and support CNV calling at high resolution, making them relevant baselines for our work.
Table 1. A Summary of Common Publicly Available CNV Callers.
| Tool | PMID Reference | Year | High Spatial Res | Small Sample Support | Quantify CNV | Gain and Losses |
|---|---|---|---|---|---|---|
| Canoes [20] | 24771342 | 2014 | N | N | Y | Y |
| CLAMMS [24] | 26382196 | 2015 | Y | Y | N | Y |
| Cn.MOPS [17] | 22302147 | 2012 | N | N | Y | Y |
| CNVkit [18] | 27100738 | 2016 | N | Y | Y | Y |
| CODEX [25] | 25618849 | 2015 | N | Y | Y | Y |
| CoNIFER [21] | 22585873 | 2012 | N | N | Y | Y |
| DeviCNV [16] | 30326846 | 2018 | Y | Y | Y | Y |
| EXCAVATOR2 [19] | 27507884 | 2016 | N | Y | Y | Y |
| ExomeDepth [26] | 22942019 | 2012 | N | Y | Y | Y |
| ExonDel [27] | 25322818 | 2014 | Y | Y | Y | N |
| FishingCNV [28] | 23539306 | 2013 | N | N | Y | Y |
| HMZDelFinder [29] | 27980096 | 2017 | N | Y | Y | N |
| IonCopy [15] | 26910888 | 2016 | Y | Y | Y | Y |
| XHMM [22] | 23040492 | 2012 | N | N | Y | Y |

3. Materials and Methods

3.1. Overall Processing Steps

The PiVAT analysis pipeline begins with FASTQ files, which are aligned to the GRCh37/hg19 reference genome using the Burrows-Wheeler Aligner (BWA) [30] version bwa-0.7.16a. The resulting BAM files are processed with SAMtools [31] version 1.10 to compute alignment and coverage metrics, including per-target read counts and summary coverage statistics. Filtered and refined alignments are then used to derive amplicon-specific read counts, which serve as inputs for downstream CNV analysis. In subsequent steps, read counts are normalized and transformed to produce gene-level CNV estimates. A schematic overview of the pipeline is shown in Figure 1.

3.2. Biological Assumptions

Our CNV caller is based on a small set of biologically motivated assumptions. Specifically, we make the following assumptions:
  • Linearity of read counts. Expected read counts scale approximately linearly with copy number. For example, a locus present at four copies yields, on average, twice as many reads as a locus with two copies.
  • Sample-composition invariance. Observed read depth reflects the total DNA mixture in the sample, and tumor-derived DNA is processed similarly to background DNA within the assay. Consequently, the copy number scales with tumor fraction.
  • Stable amplicon-specific effects. While amplification efficiency varies across amplicons, these effects are assumed to be consistent across samples processed under comparable conditions. Therefore, the case and normal samples should be process-matched.
  • Sparsity of CNVs. The majority of amplicon-targeted loci are assumed to be copy-neutral, with CNVs affecting only a minority of targets.
These assumptions motivate the data transformations described as follows. Assumptions 1 and 2 relate the read depths to the copy number to be inferred. Namely, the relationship between copy number and read depth is linear (Assumption 1) and modeling of the tumor content is unnecessary (Assumption 2). We consider read counts from $J$ targeted amplicons, obtained from a case sample and process-matched normal using the PiVAT pipeline. Let $\{s_j\}_{j=1}^{J}$ denote the read counts for the case sample and $\{n_j\}_{j=1}^{J}$ the corresponding counts from the process-matched normal (after filtering any amplicons that failed to amplify in either the sample or normal). To assess copy number variations, we compute the log copy ratio for each amplicon:
$$\tilde{x}_j = \log s_j - \log n_j = \log \frac{s_j}{n_j}$$
The normal sample accounts for amplicon-specific effects in read counts (Assumption 3). This reference may be explicitly specified by the user, or for normal-free calling, approximated by the mean profile across samples within a batch. To account for sample-specific effects such as variation in DNA input or sequencing depth, we median-center the log ratios:
$$x_j = \tilde{x}_j - \operatorname{median}\left(\{\tilde{x}_j\}_{j=1}^{J}\right)$$
assuming that most amplicons are copy-number neutral (Assumption 4). The resulting values $\{x_j\}_{j=1}^{J}$ are referred to as the log copy number ratios (lCNRs), which ideally reflect the log relative copy number between the case and control. Our objective is to estimate the copy numbers of $G$ genes (e.g., the ten genes targeted in the OncoReveal® Core Lbx panel).
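The two transformations above can be sketched in a few lines of Python. This is a minimal illustration rather than the PiVAT or BayesCNV implementation; the function name `log_copy_number_ratios` and the read counts are hypothetical.

```python
import math
from statistics import median


def log_copy_number_ratios(case_counts, normal_counts):
    """Median-centered log copy ratios for paired amplicon read counts."""
    # Drop amplicons that failed to amplify in either sample (zero reads).
    pairs = [(s, n) for s, n in zip(case_counts, normal_counts) if s > 0 and n > 0]
    # Raw log ratio per amplicon: log(s_j) - log(n_j).
    raw = [math.log(s) - math.log(n) for s, n in pairs]
    # Median-centering removes sample-wide effects such as sequencing depth.
    center = median(raw)
    return [r - center for r in raw]


# Hypothetical counts: the case was sequenced twice as deeply as the normal,
# and the last two amplicons carry an additional 1.5x copy ratio.
normal = [100, 200, 150, 120, 180, 160, 140]
case = [200, 400, 300, 240, 360, 480, 420]

lcnr = log_copy_number_ratios(case, normal)
copy_ratios = [math.exp(v) for v in lcnr]
```

Because most amplicons in this toy example are neutral, the median absorbs the twofold depth difference, and only the last two amplicons retain a copy ratio of 1.5.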

3.3. Mathematical Modeling

Each amplicon targeting a gene $g$ can be treated as a noisy observation of the underlying gene-level lCNR. Let $x_{ig}$ denote the median-centered lCNR for amplicon $i \in \{1, \ldots, I_g\}$ targeting gene $g \in \{1, \ldots, G\}$. We model $x_{ig}$ as arising from a shared latent signal $\mu_g$ representing gene-level lCNR, plus biological and technical variability.
We estimate these quantities using a Bayesian hierarchical model. Each gene-level mean is drawn from a global distribution centered at $\mu_0$, allowing information to be shared across genes while permitting gene-specific deviations. Each gene is also assigned a noise scale $\tau_g^2$ to reflect differing levels of measurement uncertainty. The full generative model is:
$$
\begin{aligned}
\mu_0 &\sim \mathcal{N}(0, \nu), \\
\sigma^2 &\sim \mathrm{InvGamma}(\alpha_\sigma, \beta_\sigma), \\
\tau_0^2 &\sim \mathrm{InvGamma}(\alpha_{\tau_0}, \beta_{\tau_0}), \\
\mu_g \mid \mu_0, \sigma^2 &\sim \mathcal{N}(\mu_0, \sigma^2), \\
z_g &\sim \mathrm{InvGamma}(\alpha_\tau, \beta_\tau), \\
x_{ig} \mid \mu_g, \tau_0^2, z_g &\sim \mathrm{SoftLaplace}(\mu_g, \tau_g^2 = \tau_0^2 z_g^2)
\end{aligned}
$$
The SoftLaplace distribution serves as a heavy-tailed alternative to the Gaussian, providing robustness to amplification outliers while remaining differentiable at the origin, which improves the numerical stability for gradient-based inference. Furthermore, the incorporation of a gene-specific noise term allows us to detect genes that have erratic empirical lCNRs as a quality control mechanism. A diagram of this model is shown on the left of Figure 2.
Intuitively, this hierarchical structure reflects our belief that most genes cluster around a common lCNR (e.g., diploid baseline) but may deviate due to biological variation or technical artifacts. By pooling information across amplicons within a gene, the model reduces noise and yields more stable estimates of $\mu_g$, especially when individual amplicons are noisy or unreliable [32]. At the same time, the per-gene variance $\tau_g^2$ accommodates genes with inherently noisier measurements. The SoftLaplace distribution provides robustness to outliers that are frequently observed in amplicon-based assays, such as extreme GC-content effects, while retaining the smoothness necessary for gradient-based optimization.
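To make the generative model concrete, the sketch below draws one synthetic panel from it using only the Python standard library. Several choices are assumptions that the text does not specify: the SoftLaplace density is taken to be the hyperbolic-secant form $\frac{2}{\pi s}\,(e^{z} + e^{-z})^{-1}$ with $z = (x - \mu)/s$ (the definition used by NumPyro's `SoftLaplace`), its scale is taken to be $s = \tau_g$, $\nu$ is treated as a variance, and all hyperparameter values are hypothetical.

```python
import math
import random


def soft_laplace_logpdf(x, loc, scale):
    """Log of (2 / pi) / (scale * (exp(z) + exp(-z))) with z = (x - loc) / scale."""
    z = abs(x - loc) / scale
    # log(exp(z) + exp(-z)) evaluated stably as z + log(1 + exp(-2z)).
    return math.log(2.0 / math.pi) - math.log(scale) - z - math.log1p(math.exp(-2.0 * z))


def gaussian_logpdf(x, loc, scale):
    z = (x - loc) / scale
    return -0.5 * math.log(2.0 * math.pi) - math.log(scale) - 0.5 * z * z


def sample_soft_laplace(rng, loc, scale):
    # Inverse-CDF draw; the CDF of this density is (2 / pi) * atan(exp(z)).
    u = rng.uniform(1e-12, 1.0 - 1e-12)
    return loc + scale * math.log(math.tan(0.5 * math.pi * u))


def sample_inv_gamma(rng, shape, scale):
    # If G ~ Gamma(shape, rate=scale), then 1 / G ~ InvGamma(shape, scale).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def simulate_panel(rng, amplicons_per_gene):
    """Draw one synthetic set of lCNRs, one list of amplicon values per gene."""
    mu0 = rng.gauss(0.0, math.sqrt(0.01))        # nu = 0.01 (hypothetical)
    sigma2 = sample_inv_gamma(rng, 3.0, 0.02)    # hypothetical hyperparameters
    tau0_sq = sample_inv_gamma(rng, 3.0, 0.02)   # hypothetical hyperparameters
    data = []
    for n_amplicons in amplicons_per_gene:
        mu_g = rng.gauss(mu0, math.sqrt(sigma2))
        z_g = sample_inv_gamma(rng, 3.0, 2.0)    # hypothetical hyperparameters
        tau_g = math.sqrt(tau0_sq) * z_g         # tau_g^2 = tau_0^2 * z_g^2
        data.append([sample_soft_laplace(rng, mu_g, tau_g) for _ in range(n_amplicons)])
    return data


panel = simulate_panel(random.Random(0), amplicons_per_gene=[5, 8, 3])
```

Under this density, an observation six scale units from the center has a log density of roughly -6.5, versus roughly -18.9 under a unit Gaussian, which is why a single outlying amplicon pulls the gene-level estimate far less than it would under a Gaussian likelihood.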

3.4. Calling CNVs via Posterior Distribution

For convenience, we denote the set of model parameters by $\theta = \{\mu_0, \sigma^2, \{\mu_g\}_{g=1}^{G}, \tau_0^2, \{z_g\}_{g=1}^{G}\}$. Given observed amplicon-level lCNRs $x = \{x_j\}_{j=1}^{J}$, the posterior of our model is
$$p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int p(\theta)\, p(x \mid \theta)\, d\theta}$$
where the joint prior and likelihood factorize according to the generative model:
$$
\begin{aligned}
p(\theta) &= p(\mu_0)\, p(\sigma^2)\, p(\tau_0^2) \prod_{g=1}^{G} p(\mu_g \mid \mu_0, \sigma^2)\, p(z_g), \\
p(x \mid \theta) &= \prod_{g=1}^{G} \prod_{i=1}^{I_g} p(x_{ig} \mid \mu_g, \tau_g^2), \qquad \tau_g^2 = \tau_0^2 z_g^2
\end{aligned}
$$
As stated previously, this model is designed to be easily interpretable: $\mu_g$ represents the gene-level lCNR for gene $g$. We therefore use the marginal posterior $p(\mu_g \mid x)$ to make CNV calls using two criteria:
  • Effect size threshold. The posterior mean should exceed a minimum magnitude, that is, $\mathbb{E}[\mu_g \mid x] > \log T$, where $T$ is the copy ratio we wish to be able to detect.
  • Confidence threshold. The posterior probability of a value close to 0 (CNV neutral) should be small: $\Pr(|\mu_g| < \delta \mid x) < \epsilon$, for some values of $\delta, \epsilon$. Here, $\delta$ indicates the size of variation we expect neutral genes to have and $\epsilon$ is the false-positive probability we are willing to accept.
Together, these criteria ensure that only sufficiently large and well-supported deviations from neutrality result in CNV calls.
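Given posterior draws of the gene-level lCNR for one gene, both criteria reduce to simple summaries of the samples. The sketch below is illustrative only: the function name `call_gain` is hypothetical, the values of `delta` and `eps` are assumed, and it covers gains only (a loss rule would mirror the effect-size test).

```python
import math


def call_gain(mu_samples, T=1.5, delta=0.1, eps=0.05):
    """Apply the effect-size and confidence criteria to posterior draws of mu_g."""
    n = len(mu_samples)
    # Effect size: the posterior mean must exceed log(T).
    posterior_mean = sum(mu_samples) / n
    # Confidence: the posterior mass within delta of neutrality must be below eps.
    prob_neutral = sum(1 for m in mu_samples if abs(m) < delta) / n
    return posterior_mean > math.log(T) and prob_neutral < eps


# Hypothetical posterior draws for three genes.
amplified = [math.log(2.0) + 0.01 * k for k in range(-5, 6)]   # tight around log 2
neutral = [0.01 * k for k in range(-5, 6)]                     # tight around 0
ambiguous = [0.0] * 50 + [1.0] * 50                            # mean 0.5, half the mass at 0

calls = [call_gain(s) for s in (amplified, neutral, ambiguous)]
```

The third gene shows why both criteria are needed: its posterior mean exceeds $\log 1.5 \approx 0.405$, but half of the posterior mass sits at neutrality, so no call is made.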

3.5. Markov Chain Monte Carlo Inference

To perform inference over the hierarchical model, we use Markov chain Monte Carlo (MCMC) to draw samples from the posterior distribution [33]. In particular, we employ Hamiltonian Monte Carlo (HMC), a gradient-based MCMC algorithm that improves efficiency over traditional random-walk Metropolis approaches [34]. By introducing auxiliary momentum variables and simulating Hamiltonian dynamics, HMC can propose larger moves in the parameter space while often maintaining high acceptance rates. This results in faster mixing and more efficient exploration of complex, high-dimensional posteriors. Furthermore, we improved the geometry of the parameter space via reparameterization, a common trick in HMC to reduce curvature and improve mixing [34]. We found that MCMC on our dataset with 10 genes and 441 amplicons was very quick, roughly 10 s for a single clinical sample.
To eliminate the need to hand-tune trajectory lengths, we use the No-U-Turn Sampler (NUTS), a self-tuning extension of HMC [35]. NUTS adaptively builds a set of candidate proposals by simulating forward and backward along the Hamiltonian trajectory and terminates when a “U-turn” is detected, that is, when the trajectory would begin to return to previously explored space. This dynamic trajectory length allows NUTS to balance computational cost with effective exploration automatically. A visualization of posterior samples is provided in the middle of Figure 2. We evaluated convergence using standard diagnostics for HMC: an acceptance probability in a reasonable range (0.6–0.9), the absence of divergent transitions (large differences between the starting and ending Hamiltonian), and an adequate effective sample size (a measure roughly quantifying the number of “independent” samples). We used 10,000 iterations, which is generous for a relatively simple Bayesian model [32].
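The Hamiltonian dynamics underlying HMC and NUTS are simulated with a leapfrog integrator, and the change in the Hamiltonian along a trajectory determines both the acceptance probability and whether a transition counts as divergent. The sketch below illustrates this on a one-dimensional standard normal target; it is a toy example with hypothetical step size and trajectory length, not the NumPyro sampler used in this work.

```python
import math


def leapfrog(q, p, grad_u, step, n_steps):
    """Simulate Hamiltonian dynamics for potential U with gradient grad_u."""
    p = p - 0.5 * step * grad_u(q)          # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                    # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(q)        # full step for momentum
    p = p - 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p


def hamiltonian(q, p):
    # Potential U(q) = q^2 / 2 (negative log density of N(0, 1)) plus kinetic energy.
    return 0.5 * q * q + 0.5 * p * p


q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, grad_u=lambda q: q, step=0.1, n_steps=20)

energy_error = hamiltonian(q1, p1) - hamiltonian(q0, p0)
accept_prob = min(1.0, math.exp(-energy_error))
```

The proposal moves far from its starting point while the energy error stays small, so the acceptance probability remains close to 1. A large energy error would instead signal a divergent transition.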

3.6. Quality Control via Likelihood Evaluation

Previous work has shown that the Bayesian model evidence (marginal likelihood) can serve as a useful diagnostic for detecting model-data mismatch and other pathologies [23]. The evidence is defined as
$$Z = p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$$
Direct evaluation of $Z$ is generally intractable for high-dimensional models. A common but unreliable approach is the harmonic mean estimator, which uses posterior samples $\theta^{(i)} \sim p(\theta \mid x)$ to estimate $Z$ via
$$\hat{Z}_{\mathrm{HME}} = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{p(x \mid \theta^{(i)})} \right)^{-1}$$
The HME is well known to be unstable and can have infinite variance because it is dominated by rare samples in the low-likelihood tails. As such, it should be avoided in spite of its continued use [36].
More reliable approaches transform evidence computation into better-behaved expectations. Several alternatives exist, including bridge sampling [14] and annealed importance sampling [37]. In this work, we use thermodynamic integration (TI) [14]. TI constructs a continuous path between the prior and posterior using a family of tempered (power posterior) distributions indexed by $\beta$:
$$p_\beta(\theta \mid x) = \frac{p(x \mid \theta)^{\beta}\, p(\theta)}{Z(\beta)}$$
which interpolates between the prior ($\beta = 0$) and posterior ($\beta = 1$). TI exploits the identity
$$\log Z = \int_0^1 \mathbb{E}_{\theta \sim p_\beta(\theta \mid x)}\left[\log p(x \mid \theta)\right] d\beta$$
expressing the log evidence as an integral over expectations under these tempered distributions. In practice, we approximate this integral by evaluating the expectation at a discrete set of temperatures $\{\beta_k\}_{k=1}^{K}$ and applying the trapezoidal rule
$$\log \hat{Z} = \sum_{k=1}^{K-1} \frac{\beta_{k+1} - \beta_k}{2} \left( \hat{\mu}_{\beta_k} + \hat{\mu}_{\beta_{k+1}} \right)$$
where
$$\hat{\mu}_{\beta_k} = \frac{1}{N} \sum_{i=1}^{N} \log p\left(x \mid \theta_{\beta_k}^{(i)}\right)$$
is the Monte Carlo estimate of the expectation of $\log p(x \mid \theta)$ under $p_{\beta_k}(\theta \mid x)$ using $N$ MCMC samples $\{\theta_{\beta_k}^{(i)}\}_{i=1}^{N}$. This procedure is visualized on the right of Figure 2. This was more computationally intensive than the MCMC of Section 3.5, as we set $K$ to 20, resulting in a runtime of roughly 2 min per clinical sample. We used the same diagnostics to ensure a convergent chain.
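The TI estimator can be checked on a toy model whose evidence is known in closed form. The sketch below is an illustration under stated assumptions, not the BayesCNV implementation: it uses a conjugate model with prior $\theta \sim \mathcal{N}(0, 1)$ and a single observation $x \mid \theta \sim \mathcal{N}(\theta, 1)$, for which the power posterior is $\mathcal{N}\left(\beta x / (1 + \beta),\ 1 / (1 + \beta)\right)$ and can be sampled exactly instead of by MCMC. The evenly spaced temperature ladder is also an assumption; the text specifies only $K = 20$.

```python
import math
import random


def ti_log_evidence(x, K=20, N=5000, seed=0):
    """Thermodynamic-integration estimate of log Z for the toy conjugate model."""
    rng = random.Random(seed)
    betas = [k / (K - 1) for k in range(K)]          # evenly spaced in [0, 1]
    mean_loglik = []
    for beta in betas:
        # Exact power posterior for this model: N(beta*x/(1+beta), 1/(1+beta)).
        loc = beta * x / (1.0 + beta)
        sd = math.sqrt(1.0 / (1.0 + beta))
        total = 0.0
        for _ in range(N):
            theta = rng.gauss(loc, sd)
            # log p(x | theta) for a unit-variance Gaussian likelihood.
            total += -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2
        mean_loglik.append(total / N)
    # Trapezoidal rule over the temperature ladder.
    return sum(
        0.5 * (betas[k + 1] - betas[k]) * (mean_loglik[k] + mean_loglik[k + 1])
        for k in range(K - 1)
    )


x_obs = 1.5
estimate = ti_log_evidence(x_obs)
# Closed form: marginally x ~ N(0, 2), so log Z = -0.5*log(4*pi) - x^2/4.
exact = -0.5 * math.log(4.0 * math.pi) - x_obs ** 2 / 4.0
```

In this setting the estimate agrees with the closed-form value up to Monte Carlo and discretization error. In the real model, each temperature requires its own MCMC run, which is why TI costs roughly $K$ times as much as a single posterior fit.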

3.7. Code Availability

The model used in this paper is implemented in NumPyro version 0.11.0. This package allows for models to be implemented with a small amount of code while maintaining computational efficiency. This implementation is publicly available at https://github.com/Pillar-Biosciences-Inc/BayesCNV (accessed on 1 December 2025), along with demonstrations of how to use it on synthetic datasets.

4. Results

We evaluated the performance of BayesCNV using Seraseq® ctDNA Complete Mutation Mix contrived samples that contain known ERBB2 and MET copy number amplifications. Samples were tested using the OncoReveal® Core Lbx panel (441 amplicons targeting a variety of cancer-related mutations, including CNVs in 10 genes). We used a minimum call threshold of 1.5 on the copy ratio, corresponding to three copies relative to a diploid baseline. We compared our method to IonCopy and DeviCNV. For IonCopy, we used gene-level calling, Bonferroni correction and global p-value adjustment. We left the p-value threshold at its default of 0.05 to compensate for the anticipated overly conservative nature of a Bonferroni correction. For DeviCNV, we set the user-controlled parameters to use the PCR data type, deduplicate reads, and apply a quality score threshold of 0. We also ensured the threshold matched our target of a 1.5 copy ratio.

4.1. Comparing BayesCNV to IonCopy and DeviCNV

We first compared BayesCNV with IonCopy and DeviCNV using 84 samples, a mixture of internal clinical normals and eight Seraseq Complete Mutation Mix CNV-positive samples at 5%, 2.5% and 1.25% VAF. Expected calls and copy numbers were determined via droplet digital PCR. One sample had sufficiently low concentration that only the ERBB2 CNV was expected to be detected. IonCopy achieved the highest sensitivity at 0.93, with BayesCNV closely following at 0.87. In contrast, DeviCNV failed to identify any true positives, resulting in a sensitivity of 0. While IonCopy demonstrated marginally higher sensitivity, it did so at the cost of markedly reduced specificity. IonCopy produced 159 false positives, compared to 32 for DeviCNV and only 3 for our method, corresponding to specificities of 0.81, 0.96, and 0.996, respectively. Notably, by applying a conservative marginal log-likelihood threshold of −300, corresponding to the bottom 5% of samples, we were able to eliminate all false positives from our method entirely. A summary of these results is presented in Table 2.
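The reported metrics can be reproduced with simple arithmetic under one counting convention, which is an inference rather than something stated explicitly in the text: each of the 10 genes in each of the 84 samples is one call (840 in total), and the eight positive samples contribute 8 × 2 − 1 = 15 expected positives (ERBB2 and MET, minus the one MET call not expected in the low-concentration sample). The true-positive counts of 13 and 14 below are likewise inferred from the reported sensitivities.

```python
# Assumed counting convention (not stated in the text): one call per gene per sample.
n_samples, n_genes = 84, 10
total_calls = n_samples * n_genes            # 840
expected_positives = 8 * 2 - 1               # 15: ERBB2 + MET in 8 samples, minus 1
expected_negatives = total_calls - expected_positives   # 825


def specificity(false_positives):
    return (expected_negatives - false_positives) / expected_negatives


def sensitivity(true_positives):
    return true_positives / expected_positives


# False-positive counts are reported; true-positive counts are inferred.
spec = {
    "BayesCNV": specificity(3),
    "DeviCNV": specificity(32),
    "IonCopy": specificity(159),
}
sens = {
    "BayesCNV": sensitivity(13),
    "IonCopy": sensitivity(14),
    "DeviCNV": sensitivity(0),
}
```

Under this convention the rounded values match the reported specificities (0.996, 0.96, 0.81) and sensitivities (0.87, 0.93, 0), which suggests the metrics were computed per gene rather than per sample.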

4.2. Limit of Detection with Synthetic Data

We next quantified the limit of detection (LOD) using semi-synthetic dilution experiments. Starting from a Seraseq-positive sample (Section 4.1) with known ERBB2 and MET amplifications, we generated a series of in silico mixtures by scaling the ERBB2/MET amplicon counts to progressively weaker copy-number signals, down to an approximately diploid baseline. We then added noise calibrated to the empirical variance observed in CNV-neutral genes to mimic realistic assay variability.
We then analyzed each simulated dataset with BayesCNV and summarized the inferred copy ratio as a function of the ground-truth copy number. As shown in Figure 3 (left, 90% credible intervals), reliable CNV detection emerges at an approximate copy ratio of 1.3 (absolute copy number approximately 2.6). Using the same simulations, we estimated the false-positive probability as a function of the calling threshold on the remaining CNV-neutral genes. The false-positive rate drops rapidly with increasing threshold and is approximately 2% at a copy-ratio cutoff of 1.3.
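The dilution procedure can be sketched as follows. This is an illustration under assumptions, not the code used for the experiments: the helper names are hypothetical, the noise is taken to be log-normal (multiplicative), and the mixture formula follows from the linearity and sample-composition assumptions of Section 3.2.

```python
import math
import random


def expected_copy_ratio(tumor_fraction, tumor_copies, normal_copies=2):
    """Copy ratio of a mixture in which a fraction of the DNA carries tumor_copies."""
    mixture_copies = (1.0 - tumor_fraction) * normal_copies + tumor_fraction * tumor_copies
    return mixture_copies / normal_copies


def dilute_counts(counts, observed_ratio, target_ratio, noise_sd, rng):
    """Rescale amplified-gene counts to a weaker signal and add log-normal noise."""
    scale = target_ratio / observed_ratio
    return [c * scale * math.exp(rng.gauss(0.0, noise_sd)) for c in counts]


# A copy ratio of 1.3 corresponds to an average of 2 * 1.3 = 2.6 copies,
# e.g., a four-copy amplification at a 30% tumor fraction.
ratio = expected_copy_ratio(tumor_fraction=0.3, tumor_copies=4)
average_copies = 2 * ratio

# Hypothetical counts observed at a copy ratio of 3.0, diluted to 1.5 without noise.
rng = random.Random(0)
noiseless = dilute_counts([300, 600], observed_ratio=3.0, target_ratio=1.5, noise_sd=0.0, rng=rng)
noisy = dilute_counts([300, 600], observed_ratio=3.0, target_ratio=1.5, noise_sd=0.1, rng=rng)
```

In practice, `noise_sd` would be set from the spread of lCNRs observed in CNV-neutral genes, as described above.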

4.3. Likelihood-Based Sample Filtering

As a final illustration of BayesCNV’s utility, we demonstrate how the TI estimate of the marginal log likelihood can serve as a discriminator of sample quality. Whereas Section 4.1 suggested that evidence-based quality control can reduce false positives, here we directly evaluate whether the log evidence separates known poor-quality samples from high-quality samples. For this experiment, we used a second dataset consisting of formalin-fixed paraffin-embedded (FFPE) samples processed with the OncoReveal® Essential Lbx panel. Of the 32 samples, 10 were derived from highly degraded FFPE tissue. The DNA templates extracted from these samples amplify poorly in amplicon- or PCR-based assays due to damage sustained during the fixation process. The remaining 22 samples were high quality FFPE samples that were acquired from BioIVT.
For each sample, we fit the hierarchical model using the cohort batch mean as the reference normal and computed the marginal log likelihood. Figure 4 (left and middle panels) plots the empirical copy ratio profiles for a high-quality and a degraded sample, respectively. Notably, traditional sequencing QC metrics such as read depth and on-target rate did not flag these samples as problematic.
Degraded samples exhibited markedly lower marginal log likelihoods, with an average of −232 ± 35. In contrast, the average for high-quality samples was substantially higher at −15 ± 17. The lowest marginal log likelihood among the high-quality samples was −87, while the highest among the degraded samples was only −137. As such, any threshold between these two values perfectly separates the two groups in this dataset.
In practice, a more conservative (lower) threshold may be preferable to account for limited training size and potential clinical variability. However, these data suggest that samples exhibiting dramatically lower likelihoods in the Essential panel (e.g., below −200) can be flagged for possible sample/normal mismatch or assay artifacts, and downstream CNV calls should be reviewed. Overall, these findings suggest that the marginal log-likelihood offers a principled, data-driven quality control metric in the BayesCNV workflow.
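The filtering rule described above amounts to a single comparison per sample. In the sketch below, the function name and sample identifiers are hypothetical; the four values are the summary statistics reported in this section (group means and group extremes), not individual measurements from the dataset.

```python
def flag_low_evidence(log_evidence_by_sample, threshold):
    """Return the names of samples whose marginal log likelihood is below threshold."""
    return sorted(
        name for name, value in log_evidence_by_sample.items() if value < threshold
    )


WORST_HIGH_QUALITY = -87.0    # lowest value among high-quality samples (reported)
BEST_DEGRADED = -137.0        # highest value among degraded samples (reported)

# Any threshold strictly between the two extremes separates the groups;
# the midpoint is one arbitrary choice.
threshold = 0.5 * (WORST_HIGH_QUALITY + BEST_DEGRADED)

# Hypothetical sample names paired with the reported summary values.
samples = {
    "hq_mean": -15.0,
    "hq_worst": WORST_HIGH_QUALITY,
    "degraded_best": BEST_DEGRADED,
    "degraded_mean": -232.0,
}
flagged = flag_low_evidence(samples, threshold)
```

A stricter cutoff such as −200 would flag only the clearly degraded samples, trading sensitivity to degradation for fewer unnecessary reviews.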

5. Discussion

In this study, we present a novel Bayesian framework for detecting and quantifying CNVs in targeted amplicon liquid biopsy data. The approach is grounded in fundamental biological and technically motivated assumptions about amplification behavior, resulting in a model that pools information across amplicons within each gene. Posterior inference is performed via Markov chain Monte Carlo (MCMC), yielding gene-level copy number estimates with uncertainty, an important feature in clinical applications. We benchmarked our caller against two publicly available CNV detection methods. Our method achieved the highest specificity (0.996) while retaining a sensitivity (0.87) close to that of the most sensitive competitor, emphasizing its utility in clinical settings. Beyond CNV detection, we also introduce an evidence-based quality-control method based on the marginal log likelihood of the data. This quantity was estimated via thermodynamic integration, a technique from statistical physics. On a second dataset consisting of FFPE-derived samples of varying quality, we showed that TI-derived marginal log likelihoods clearly separated high-quality samples from degraded ones, even when traditional metrics did not distinguish them. This suggests that this quantity can serve as a powerful, model-based quality control metric in clinical applications.
There are several directions for improving BayesCNV. From a mathematical standpoint, integrating SNP-based tumor fraction estimation as a preprocessing step could help determine whether CNV calling is unlikely to be reliable, particularly in low-input settings. Eliminating samples where tumor content is too low to make viable calls could further reduce the false positive rate. Furthermore, while TI provides an accurate estimate of the marginal likelihood, it is computationally intensive. Alternative estimators, such as annealed importance sampling, Rao-Blackwellized tempered sampling, and bridge sampling, may offer more efficient approximations of the marginal likelihood, potentially reducing computation without compromising accuracy.
From a scientific standpoint, a natural direction is extension to single-cell applications. Recent studies have demonstrated that integrating bulk and single-cell omics data can substantially improve prediction of immunotherapy response by resolving tumor heterogeneity at the cellular level [38,39]. Several adaptations of our framework could enable such integration. First, the hierarchical structure of BayesCNV, which pools information across amplicons within genes, could be extended to pool information across cells within clonal populations, similar to how BayesPrism leverages single-cell references for Bayesian deconvolution of bulk RNA-seq data [40]. The SoftLaplace likelihood employed in BayesCNV is particularly well-suited to single-cell data, where amplification artifacts and dropout events produce heavy-tailed noise distributions [41]. Second, our thermodynamic integration-based quality control metric could be adapted to identify low-quality cells or detect outlier populations, addressing a persistent challenge in single-cell CNV analysis [42]. Third, single-cell CNV profiles could serve as reference signatures for deconvolving clonal substructure from bulk cfDNA, potentially enabling inference of clone-specific copy number states and their relative abundances within a liquid biopsy sample. This would be particularly valuable for tracking clonal evolution during therapy and identifying resistant subclones. Finally, following the paradigm established by the Scissor algorithm [43], single-cell-derived CNV signatures could be linked to clinical phenotypes such as immunotherapy response, enabling identification of CNV-defined cell populations that drive treatment outcomes. Such extensions would require addressing technical challenges inherent to single-cell CNV detection, including substantially lower per-cell coverage and the need to distinguish true biological heterogeneity from technical noise.

Author Contributions

Conceptualization, A.T., L.R. and A.K.; Data curation, A.C., N.Y. and M.L.; Formal analysis, A.T. and A.K.; Investigation, N.Y. and M.L.; Methodology, A.T. and A.K.; Project administration, A.K. and Z.W.; Resources, Z.W.; Software, A.T., A.K., M.Z. and S.P.; Supervision, L.R. and A.K.; Visualization, L.R. and A.K.; Writing, A.T., L.R. and A.K.; Review, A.T., L.R., Y.K., Z.W. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study received funding from Pillar Biosciences. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT-5.1 for the purposes of copy editing and grammar. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors were employed by Pillar Biosciences Inc. The authors declare that this study received funding from Pillar Biosciences. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| CNV | Copy Number Variation |
| FFPE | Formalin-Fixed Paraffin-Embedded |
| HMC | Hamiltonian Monte Carlo |
| lCNR | Log Copy Number Ratio |
| LOD | Limit of Detection |
| MCMC | Markov chain Monte Carlo |
| NGS | Next-Generation Sequencing |
| NUTS | No-U-Turn Sampler |
| QC | Quality Control |
| TI | Thermodynamic Integration |
| VAF | Variant Allele Frequency |
| WES | Whole-Exome Sequencing |
| WGS | Whole-Genome Sequencing |

References

  1. Pös, O.; Radvanszky, J.; Buglyó, G.; Pös, Z.; Rusnakova, D.; Nagy, B.; Szemes, T. DNA copy number variation: Main characteristics, evolutionary significance, and pathological aspects. Biomed. J. 2021, 44, 548–559.
  2. Shlien, A.; Malkin, D. Copy number variations and cancer. Genome Med. 2009, 1, 62.
  3. Vijay, A.; Garg, I.; Ashraf, M.Z. Perspective: DNA Copy Number Variations in Cardiovascular Diseases. Epigenet. Insights 2018, 11, 2516865718818839.
  4. Fan, D.; Yang, X.; Huang, L.; Ouyang, G.; Yang, X.; Li, M. Simultaneous detection of target CNVs and SNVs of thalassemia by multiplex PCR and next-generation sequencing. Mol. Med. Rep. 2019, 19, 2837–2848.
  5. Masood, D.; Ren, L.; Nguyen, C.; Brundu, F.G.; Zheng, L.; Zhao, Y.; Jaeger, E.; Li, Y.; Cha, S.W.; Halpern, A.; et al. Evaluation of somatic copy number variation detection by NGS technologies and bioinformatics tools on a hyper-diploid cancer genome. Genome Biol. 2024, 25, 163.
  6. Yazaki, S.; Tokura, M.; Aiba, H.; Kojima, Y.; Shiraishi, K. Clinical applications of cell-free DNA-based liquid biopsy analysis. Transl. Oncol. 2025, 61, 102519.
  7. Ma, L.; Guo, H.; Zhao, Y.; Liu, Z.; Wang, C.; Bu, J.; Sun, T.; Wei, J. Liquid biopsy in cancer: Current status, challenges and future prospects. Signal Transduct. Target. Ther. 2024, 9, 336.
  8. Di Sario, G.; Rossella, V.; Famulari, E.S.; Maurizio, A.; Lazarevic, D.; Giannese, F.; Felici, C. Enhancing clinical potential of liquid biopsy through a multi-omic approach: A systematic review. Front. Genet. 2023, 14, 1152470.
  9. Martignano, F.; Munagala, U.; Crucitta, S.; Mingrino, A.; Semeraro, R.; Del Re, M.; Petrini, I.; Magi, A.; Conticello, S.G. Nanopore sequencing from liquid biopsy: Analysis of copy number variations from cell-free DNA of lung cancer patients. Mol. Cancer 2021, 20, 32.
  10. Hallermayr, A.; Wohlfrom, T.; Steinke-Lange, V.; Benet-Pagès, A.; Scharf, F.; Heitzer, E.; Mansmann, U.; Haberl, C.; De Wit, M.; Vogelsang, H.; et al. Somatic copy number alteration and fragmentation analysis in circulating tumor DNA for cancer screening and treatment monitoring in colorectal cancer patients. J. Hematol. Oncol. 2022, 15, 125.
  11. Antonello, A.; Bergamin, R.; Calonaci, N.; Househam, J.; Milite, S.; Williams, M.J.; Anselmi, F.; d’Onofrio, A.; Sundaram, V.; Sosinsky, A.; et al. Computational validation of clonal and subclonal copy number alterations from bulk tumor sequencing using CNAqc. Genome Biol. 2024, 25, 38.
  12. Singh, A.K.; Olsen, M.F.; Lavik, L.A.S.; Vold, T.; Drabløs, F.; Sjursen, W. Detecting copy number variation in next generation sequencing data from diagnostic gene panels. BMC Med. Genom. 2021, 14, 214.
  13. Moreno-Cabrera, J.M.; Del Valle, J.; Castellanos, E.; Feliubadaló, L.; Pineda, M.; Brunet, J.; Serra, E.; Capellà, G.; Lázaro, C.; Gel, B. Evaluation of CNV detection tools for NGS panel data in genetic diagnostics. Eur. J. Hum. Genet. 2020, 28, 1645–1655.
  14. Gelman, A.; Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat. Sci. 1998, 13, 163–185.
  15. Budczies, J.; Pfarr, N.; Stenzinger, A.; Treue, D.; Endris, V.; Ismaeel, F.; Bangemann, N.; Blohmer, J.-U.; Dietel, M.; Loibl, S.; et al. Ioncopy: A novel method for calling copy number alterations in amplicon sequencing data including significance assessment. Oncotarget 2016, 7, 13236–13247.
  16. Kang, Y.; Nam, S.-H.; Park, K.S.; Kim, Y.; Kim, J.-W.; Lee, E.; Ko, J.M.; Lee, K.-A.; Park, I. DeviCNV: Detection and visualization of exon-level copy number variants in targeted next-generation sequencing data. BMC Bioinform. 2018, 19, 381.
  17. Klambauer, G.; Schwarzbauer, K.; Mayr, A.; Clevert, D.-A.; Mitterecker, A.; Bodenhofer, U.; Hochreiter, S. cn.MOPS: Mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012, 40, e69.
  18. Talevich, E.; Shain, A.H.; Botton, T.; Bastian, B.C. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. PLoS Comput. Biol. 2016, 12, e1004873.
  19. D’Aurizio, R.; Pippucci, T.; Tattini, L.; Giusti, B.; Pellegrini, M.; Magi, A. Enhanced copy number variants detection from whole-exome sequencing data using EXCAVATOR2. Nucleic Acids Res. 2016, 44, e154.
  20. Backenroth, D.; Homsy, J.; Murillo, L.R.; Glessner, J.; Lin, E.; Brueckner, M.; Lifton, R.; Goldmuntz, E.; Chung, W.K.; Shen, Y. CANOES: Detecting rare copy number variants from whole exome sequencing data. Nucleic Acids Res. 2014, 42, e97.
  21. Krumm, N.; Sudmant, P.H.; Ko, A.; O’Roak, B.J.; Malig, M.; Coe, B.P.; NHLBI Exome Sequencing Project; Quinlan, A.R.; Nickerson, D.A.; Eichler, E.E. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012, 22, 1525–1532.
  22. Fromer, M.; Moran, J.L.; Chambert, K.; Banks, E.; Bergen, S.E.; Ruderfer, D.M.; Handsaker, R.E.; McCarroll, S.A.; O’Donovan, M.C.; Owen, M.J.; et al. Discovery and Statistical Genotyping of Copy-Number Variation from Whole-Exome Sequencing Depth. Am. J. Hum. Genet. 2012, 91, 597–607.
  23. Talbot, A.; Kotlar, A.; Rishishwar, L.; Ke, Y. Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia. In Proceedings of the Machine Learning for Healthcare 2025, Rochester, MN, USA, 15 August 2025.
  24. Packer, J.S.; Maxwell, E.K.; O’Dushlaine, C.; Lopez, A.E.; Dewey, F.E.; Chernomorsky, R.; Baras, A.; Overton, J.D.; Habegger, L.; Reid, J.G. CLAMMS: A scalable algorithm for calling common and rare copy number variants from exome sequencing data. Bioinformatics 2016, 32, 133–135.
  25. Jiang, Y.; Oldridge, D.A.; Diskin, S.J.; Zhang, N.R. CODEX: A normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res. 2015, 43, e39.
  26. Plagnol, V.; Curtis, J.; Epstein, M.; Mok, K.Y.; Stebbings, E.; Grigoriadou, S.; Wood, N.W.; Hambleton, S.; Burns, S.O.; Thrasher, A.J.; et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics 2012, 28, 2747–2754.
  27. Guo, Y.; Zhao, S.; Lehmann, B.D.; Sheng, Q.; Shaver, T.M.; Stricker, T.P.; Pietenpol, J.A.; Shyr, Y. Detection of internal exon deletion with exon Del. BMC Bioinform. 2014, 15, 332.
  28. Shi, Y.; Majewski, J. FishingCNV: A graphical software package for detecting rare copy number variations in exome-sequencing data. Bioinformatics 2013, 29, 1461–1462.
  29. Gambin, T.; Akdemir, Z.C.; Yuan, B.; Gu, S.; Chiang, T.; Carvalho, C.M.B.; Shaw, C.; Jhangiani, S.; Boone, P.M.; Eldomery, M.K.; et al. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort. Nucleic Acids Res. 2017, 45, 1633–1648.
  30. Li, H.; Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 2010, 26, 589–595.
  31. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079.
  32. Gelman, A.; Carlin, J.; Stern, H.; Rubin, D. Bayesian Data Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 1995.
  33. Robert, C.; Casella, G. Monte Carlo Statistical Methods; Springer: Berlin/Heidelberg, Germany, 1999.
  34. Betancourt, M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv 2018, arXiv:1701.02434.
  35. Hoffman, M.D.; Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014, 15, 1593–1623.
  36. Raftery, A.E.; Newton, M.A.; Satagopan, J.M.; Krivitsky, P.N. Estimating the Integrated Likelihood via Posterior Simulation Using the Harmonic Mean Identity. Bayesian Stat. 2007, 8, 371–416.
  37. Neal, R.M. Annealed Importance Sampling. Stat. Comput. 2001, 11, 125–139.
  38. Lai, G.; Xie, B.; Zhang, C.; Zhong, X.; Deng, J.; Li, K.; Liu, H.; Zhang, Y.; Liu, A.; Liu, Y.; et al. Comprehensive analysis of immune subtype characterization on identification of potential cells and drugs to predict response to immune checkpoint inhibitors for hepatocellular carcinoma. Genes Dis. 2025, 12, 101471.
  39. Zhang, Y.; Zhang, C.; He, J.; Lai, G.; Li, W.; Zeng, H.; Zhong, X.; Xie, B. Comprehensive analysis of single cell and bulk RNA sequencing reveals the heterogeneity of melanoma tumor microenvironment and predicts the response of immunotherapy. Inflamm. Res. 2024, 73, 1393–1409.
  40. Chu, T.; Wang, Z.; Pe’er, D.; Danko, C.G. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat. Cancer 2022, 3, 505–517.
  41. Jia, C. Kinetic foundation of the zero-inflated negative binomial model for single-cell RNA sequencing data. arXiv 2019, arXiv:1911.00356.
  42. Chen, X.; Fang, L.T.; Chen, Z.; Chen, W.; Wu, H.; Zhu, B.; Moos, M.; Farmer, A.; Zhang, X.; Xiong, W.; et al. A benchmarking study of copy number variation inference methods using single-cell RNA-sequencing data. Precis. Clin. Med. 2025, 8, pbaf011.
  43. Choi, H.Y.; Jo, H.; Zhao, X.; Hoadley, K.A.; Newman, S.; Holt, J.; Hayward, M.C.; Love, M.I.; Marron, J.S.; Hayes, D.N. SCISSOR: A framework for identifying structural changes in RNA transcripts. Nat. Commun. 2021, 12, 286.
Figure 1. (A) The overall CNV pipeline. The reads are first mapped to the HG19 genome. This is followed by proprietary denoising steps and paired-end assembly. At this point, coverage statistics (read depth) are computed and passed to the caller described in this work. (B) An illustration of the primary assumption of the caller: the slope of the sample read depth against the normal read depth depends on the number of copies. After a log transform, this difference in slope becomes a difference in intercept. (C) The ratios of the individual amplicons can be combined to generate gene-level calls.
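The slope-to-intercept statement in panel (B) can be written out in one line. The notation below is assumed for illustration and does not appear in the original: $d^{s}_{a}$ and $d^{n}_{a}$ are the read depths of amplicon $a$ in the sample and in the CNV-neutral normal, and $r_{g}$ is the copy ratio of the gene $g$ containing that amplicon. Any library-size normalization constant is ignored in this sketch.

```latex
% Linear scale: the copy ratio acts as a slope.
d^{s}_{a} \approx r_{g}\, d^{n}_{a}

% Log scale: the same ratio becomes an additive intercept.
\log d^{s}_{a} \approx \log d^{n}_{a} + \log r_{g}

% Equivalently, every amplicon in gene g carries a noisy
% observation of the same gene-level log ratio.
\log\frac{d^{s}_{a}}{d^{n}_{a}} \approx \log r_{g}
```

Under this reading, panel (C) amounts to combining the per-amplicon log ratios that share the same $\log r_{g}$ into a single gene-level estimate.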
Figure 2. (a) The statistical model. The amplicon LCR is observed, colored in gray. Its distribution is parameterized by its mean, the global variance, and local variance. (b) A visualization of posterior samples for two gene LCRs derived from HMCMC. (c) Computation of the partition function via thermodynamic integration. The temperature is gradually decreased to the posterior, and the area under this curve corresponds to the log evidence of the statistical model.
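The idea in panel (c) can be illustrated on a toy problem. The sketch below is not the BayesCNV model: it assumes a conjugate Gaussian model (a Normal prior on a single mean and Normal observations with known variance), chosen because the expected log likelihood at each inverse temperature is available in closed form and the exact log evidence is known. In practice those expectations would be estimated from MCMC samples at each temperature. All function names, the cubic temperature schedule, and the simulated data are illustrative assumptions.

```python
import math
import random


def expected_loglik(beta, y, sigma, tau):
    """E[log p(y | theta)] under the power posterior p(y|theta)^beta * p(theta).

    Toy model (an assumption): theta ~ N(0, tau^2), y_i ~ N(theta, sigma^2).
    The power posterior is Gaussian, so the expectation is analytic.
    """
    n = len(y)
    var = 1.0 / (1.0 / tau**2 + beta * n / sigma**2)
    mean = beta * sum(y) / sigma**2 * var
    expected_sq = sum((yi - mean) ** 2 for yi in y) + n * var
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - 0.5 * expected_sq / sigma**2


def log_evidence_ti(y, sigma=1.0, tau=1.0, n_temps=2001):
    """Thermodynamic integration: area under E_beta[log-likelihood] for beta in [0, 1]."""
    # Cubic schedule places more grid points near beta = 0 (the prior),
    # where the integrand changes fastest.
    betas = [(k / (n_temps - 1)) ** 3 for k in range(n_temps)]
    g = [expected_loglik(b, y, sigma, tau) for b in betas]
    # Trapezoidal rule over the temperature ladder.
    return sum(
        0.5 * (g[k] + g[k + 1]) * (betas[k + 1] - betas[k])
        for k in range(n_temps - 1)
    )


def log_evidence_exact(y, sigma=1.0, tau=1.0):
    """Closed-form log marginal likelihood of the same toy model, for comparison."""
    n = len(y)
    s2, t2 = sigma**2, tau**2
    logdet = (n - 1) * math.log(s2) + math.log(s2 + n * t2)
    quad = (sum(yi**2 for yi in y) - t2 * sum(y) ** 2 / (s2 + n * t2)) / s2
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)


rng = random.Random(0)
y = [rng.gauss(0.3, 1.0) for _ in range(20)]  # simulated data, not from the study
print(log_evidence_ti(y), log_evidence_exact(y))
```

The two printed values should agree closely, which is the property that makes the area under the curve usable as an estimate of the log evidence.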
Figure 3. Limit of detection and false positive rate for CNV detection. Semi-synthetic datasets were generated by progressively down-scaling read counts from Seracare positive samples for the MET and ERBB2 loci and adding noise consistent with CNV-neutral variance. (a) The estimated CNV ratios from our caller as a function of the true copy number, with shaded regions indicating 90% confidence intervals. (b) The corresponding false positive rate as a function of the detection threshold. Reliable detection is achieved at a copy ratio of approximately 1.3, corresponding to 2.6 copies, with an associated false positive rate of about 2%.
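The conversion quoted in the caption (a copy ratio of about 1.3 corresponding to 2.6 copies) is a multiplication by the reference copy number. The short sketch below makes that arithmetic explicit; it assumes a diploid reference of two copies, and the function names are illustrative rather than part of BayesCNV.

```python
import math


def copies_from_ratio(copy_ratio: float, reference_copies: float = 2.0) -> float:
    """Absolute copy number implied by a copy ratio, assuming a diploid reference."""
    return copy_ratio * reference_copies


def log2_copy_ratio(copy_ratio: float) -> float:
    """The same ratio on the log2 scale often used when plotting CNV calls."""
    return math.log2(copy_ratio)


print(copies_from_ratio(1.3))
print(round(log2_copy_ratio(1.3), 3))
```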
Figure 4. Likelihood-based quality control. (a) The empirical copy ratios of the high-quality sample. (b) The empirical copy ratios of a degraded sample. Notice the wider spread and erratic ratios. (c) The distributions of the likelihoods of the low- and high-quality data. A threshold of −100 would perfectly distinguish the two groups.
Table 2. Performance on the OncoReveal® Core Lbx Panel.
Method | TP | TN | FP | FN | Sens | Spec
IonCopy | 14 | 686 | 159 | 1 | 0.93 | 0.81
DeviCNV | 0 | 813 | 32 | 15 | 0 | 0.96
BayesCNV | 13 | 842 | 3 | 2 | 0.87 | 0.996
BayesCNV + QC | 13 | 825 | 0 | 2 | 0.87 | 1
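The sensitivity and specificity columns follow directly from the four counts in each row. As a minimal check, the sketch below recomputes them using the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The class and variable names are illustrative; only the counts come from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def sensitivity(self) -> float:
        # Fraction of true CNVs that were called.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of CNV-neutral genes correctly left uncalled.
        return self.tn / (self.tn + self.fp)

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn


# Counts copied from Table 2.
TABLE_2 = {
    "IonCopy": Confusion(tp=14, tn=686, fp=159, fn=1),
    "DeviCNV": Confusion(tp=0, tn=813, fp=32, fn=15),
    "BayesCNV": Confusion(tp=13, tn=842, fp=3, fn=2),
    "BayesCNV + QC": Confusion(tp=13, tn=825, fp=0, fn=2),
}

for name, c in TABLE_2.items():
    print(f"{name:14s} n={c.total} sens={c.sensitivity:.3f} spec={c.specificity:.3f}")
```

The first three rows each sum to 860 gene-level calls, while the BayesCNV + QC row sums to 840, which is consistent with the QC step excluding some calls before evaluation.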
Share and Cite

MDPI and ACS Style

Talbot, A.; Kotlar, A.; Rishishwar, L.; Conley, A.; Zhao, M.; Yang, N.; Liu, M.; Wang, Z.; Polvino, S.; Ke, Y. BayesCNV: A Bayesian Hierarchical Model for Sensitive and Specific Copy Number Estimation in Cell Free DNA. Diagnostics 2026, 16, 280. https://doi.org/10.3390/diagnostics16020280