Identifying Differential Methylation in Cancer Epigenetics via a Bayesian Functional Regression Model

DNA methylation plays an essential role in regulating gene activity, modulating disease risk, and determining treatment response. We can obtain insight into methylation patterns at a single-nucleotide level via next-generation sequencing technologies. However, complex features inherent in the data obtained via these technologies pose challenges beyond the typical big data problems. Identifying differentially methylated cytosines (dmc) or regions is one such challenge. We have developed DMCFB, an efficient dmc identification method based on Bayesian functional regression, to tackle these challenges. Using simulations, we establish that DMCFB outperforms current methods and results in better smoothing and efficient imputation. We analyzed a dataset of patients with acute promyelocytic leukemia and control samples. With DMCFB, we discovered many new dmcs and, more importantly, exhibited enhanced consistency of differential methylation within islands and their adjacent shores. Additionally, we detected differential methylation at more of the binding sites of the fused gene involved in this cancer.


Introduction
DNA methylation at the fifth position in cytosine (5 mC), an epigenetic mark found in many living organisms, has a variety of regulatory roles in disease and normal biology.Specifically, it regulates or causes transcriptional activity during embryonic development [1], tissue development [2], cell differentiation [3], genomic imprinting [4], and X-chromosome inactivation [5], among other things.
We consider, in particular, the role of differential methylation of processes leading to the development of cancer (e.g., [6]).There is increasing evidence that epigenetic variation considerably influences the regulatory processes that modulate cancer growth.The principal focus of our analysis is acute promyelocytic leukemia (APL) [7,8].APL is an aggressive subtype of acute myeloid leukemia (AML) that accounts for 10% of its cases.AML is a group of hematopoietic neoplasms involving myeloid lineage cells, with APL as a distinct variant.A single mutation is known to cause APL, specifically a translocation involving two genes: PML on Chromosome (Chr) 15 and RARA on Chr 17 [9], denoted t(15; 17)(q24.1;q21.2);PML::RARA.APL leads to high early mortality, often from hemorrhage caused by coagulopathy.Immediate treatment with all-trans retinoic acid is crucial once APL is suspected.Approximately 92% of APL patients have the balanced translocation t(15; 17)(q24.1;q21.1), while 5% have the PML/RARA fusion gene from other chromosomal rearrangements or insertions.Although the genetic cause is clear, how epigenetic changes lead to dysregulation is the subject of ongoing research [7].For example, Schoofs et al. [8] observed broad DNA hypermethylation in APL cells compared with control samples and argued that this is a consequence of a loss of transcription factor binding.
Epigenetic variation is also strongly believed to influence the choice of therapeutic intervention: The success of all-trans retinoic acid treatments as an alternative to chemotherapy has been marked [6], but the treatments vary in efficacy, with resistance to treatment being observed, due to mechanisms related to methylation.Of specific interest is the PML-RARα fusion protein which, when over-expressed due to translocation, has been determined to have a leukemia-generating action [10,11].Thus, understanding the regulation modulation of the expression of this protein is crucial.Understanding methylation patterns and variations associated with APL can help one to identify potential biomarkers for early diagnosis, prognosis, and treatment targets.
In light of the growing understanding of the role of epigenetic variation in processes influencing tumorigenesis, there is a demand for both fundamental and clinical research on DNA methylation and also a demand for analytical and statistical tools.Several assays to collect methylation data have been developed [12].Bisulfite sequencing (BS-Seq) [13] of DNA is a popular technique that provides positive identification of 5 mC residues in genomic DNA, particularly at CpG sites (where a cytosine nucleotide is followed by a guanine nucleotide in the 5 ′ -3 ′ direction).BS-Seq benefits from the fact that bisulfite treatment will not affect 5 mC but converts unmethylated cytosines to uracils.After polymerase chain reaction and sequencing, the combined experiment leads to single-base resolution information on methylation status by counting the number of times a sequencing read at a single genomic position appears as methylated vs. unmethylated.Coupled with next-generation sequencing (NGS) [14], BS-Seq has become an effective tool for obtaining single-nucleotide resolution data from the whole genome.The rapid decline in NGS costs made whole-genome bisulfite sequencing (WGBS) [15] more accessible for research.Other direct assessment techniques include BC-Seq [16], BSPP [17], and RRBS [3].RRBS provides single-nucleotide resolution estimates of methylation enriched in genomic regions with high CpG content.BSPP is an alternative method for targeting CpG-enriched regions in mammalian genomes [18].Although BS-Seq is one of the most accurate techniques for retrieving methylation data, it has several pitfalls, including false-positive methylation calls due to incomplete conversion, DNA degradation during bisulfite treatment, methylation in pseudogenes, and the inability to discriminate between different methylated states such as 5 mC and 5 hmC [19].
Several data features, arising from assays like BS-Seq and RRBS, lead to analytic and computational challenges.To illustrate these features, we explore a dataset containing samples from patients with APL and with several different cell types for comparison (Section 2.2).
In these data, as in many other sequencing-derived methylation datasets, read depths-the numbers of individual methylated and unmethylated counts recorded via a sequencing platform-vary appreciably and often unsystematically across cytosine positions.CpGs are unevenly distributed [39]; both methylation autocorrelation between samples [40] and between cytosines [41] within a profile change irregularly across chromosomes.High heterogeneity has been observed in DNA methylation in cells from APL patients [8].Also, a large percentage of positions with missing values exist, and the data magnitude leads to computational challenges for any kind of analysis.
In this work, we propose a new method, called DMCFB, built on a Bayesian functional regression model for the analysis of sequencing-based measures of DNA methylation.Despite a handful of existing methods in DNA methylation data analysis from sequencing experiments, DMCFB is more inclusive and addresses more of the data and computational challenges than any other method, as laid out below.

1.
Missing values and imputation: Sequencing data often contain many missing values, e.g., in the APL data, 63% of the CpGs have missing values across samples.Almost all methods (except DMCHMM) remove all or most of the CpGs with missing values and then impute the rest.In DMCFB, we set the methylation read and read depth to zero (i.e., y = 0, n = 0) for missing values in binomial distribution, and we impute methylation level β using the information from p neighboring points via a functional regression model.This approach gives a more efficient imputation than other methods.

2.
Read depth: Many measurements in sequencing data are based on only one or two reads in contrast to a few that have unrealistically high read depths.Several methods (e.g., BiSeq) filter CpGs with low and high read depths, whereas DMCFB keeps all available data and uses the read depth information to adjust their contribution in the model.Since there is a systematic relationship between read depths and methylation levels, i.e., CpGs with high read depth tend to be more hypermethylated (Supplementary Figure S3), DMCFB adds the extra covariate log(n + 1) in the model to account for such a relationship.

3.
Raw methylation level vs. counts: Existing methods model either methylation counts (y, n) or raw methylation levels (or fractions) β raw = y/n.Modeling raw data may not fully capture available information (of read depth).In DMCFB, we model (y, n) through a logit(β) link with a Binomial(y; n, β) distribution to account for read depth information.

4.
Transformation: Those methods that model β raw often tend to use a transformation (e.g., log(β raw ) or logit(β raw )).DMCFB does not require a transformation of the raw data since it is built on a binomial emission model for the underlying proportion rather than the observed proportion.

5.
Functional pattern: Methylation proportions are highly correlated across nearby positions and samples.The most efficient methods tend to use smoothing techniques to capture these correlations.These techniques range from weighted local likelihood (e.g., BiSeq) to HMMs (e.g., DMCHMM) to functional regressions (e.g., wavelet-based functional linear mixed model in WFMM).Here, DMCFB also builds on functional regression concepts.

6.
Distance between CpGs: CpGs are distributed unevenly across the genome, and correlations between methylation levels decrease quickly with distance.DMCHMM and WFMM assume that all positions are equally spaced, which may under/overestimate autocorrelation; however, DMCFB incorporates the distance explicitly.

7.
Sample characteristics: In addition to these characteristics of the methylation data, existing methods take different approaches to test the association between methylation levels and covariates.Some use a two-stage approach: smoothing each sample and testing the smoothed data (e.g., BiSeq).Others perform two-group comparisons, i.e., testing for differential methylation between cases and controls (e.g., HMMFisher).DMCFB, on the other hand, conducts smoothing and DMC calling in one run.
Multiple Covariates: Often, additional features (e.g., clinical information) are collected about subjects.Most widely used methods are not capable of including these covariates in their models.Most of the ones that do are incapable of accounting for multiple covariate types (i.e., categorical, continuous, and combination).
In Section 5, we show that our proposed solutions for these issues collectively create an efficient method for analyzing DNA methylation data which outperforms (in terms of detection capability) other competing methods.We have addressed several computational challenges, including memory management and parallel computation.
We organize the paper as follows.In Section 2, we give brief descriptions of two motivating datasets.In Section 3, we develop a Bayesian functional regression method to identify DMCs.Section 5 covers multiple simulation studies.We analyze both datasets in Sections 6 and 7. Finally, Section 8 contains discussions.

Data
We use two different publicly available datasets.One, which we refer to as the WGBS dataset, is a proof-of-principle dataset containing clear signals and is used to develop our method and validate performance through simulations.The second dataset, APL, contains patients with leukemia and will be extensively analyzed using DMCFB, BiSeq, and DMCHMM.

BLK: WGBS Data on Separated Whole Blood
To develop our method, we used a small part of a publicly available dataset derived from peripheral blood samples in healthy individuals.Cell types were separated, and methylation profiles were estimated with WGBS in CD4 + Tcells (T helper cells), CD14 + monocytes, and CD19 + B cells.We extracted data for a small region near the BLK gene on human Chr 8, known to be hypomethylated in B-cells [33,42].The selected region contains 30,440 CpGs spanning 2 MB (10,352,422,082).The methylation status of 23.39% of CpGs in these data is missing.We have previously studied this dataset extensively-see [33] and Supplement S1.1.Our choice of a small portion of the BLK dataset allowed us to focus on a region known to contain established DMRs of different sizes and magnitudes, providing a thorough benchmark for our analysis.The presence of a reasonable amount of missing data and highly variable read depths, among other challenges, made the dataset particularly suited to testing the robustness of our method under challenging conditions.

APL: RRBS Data on Acute Promyelocytic Leukemia
The second dataset that we analyzed was collected from patients with acute promyelocytic leukemia (APL) [8]) and is publicly available on the Gene Expression Omnibus (accession no.GSE42119).These RRBS data contain information on 18 APL patients and 16 control samples, with eight bone marrow samples from patients in remission (RBM), four profiles of healthy CD34 + cells (CD34), and four profiles from promyelocytes (PMC).Questions of interest include identifying regions where methylation is different among patients versus controls and also looking at different cell types and/or disease status ("active" or "in remission").
The dataset is challenging to analyze.It includes 9,335,693 CpGs for each of the 34 samples.Nearly 63% of CpGs have missing values across samples.Almost all existing methods will remove either all 63% of the CpGs or may impute a proportion of them.This approach leads to unreliable results due to the loss of much useful information as a result of CpG removal from the analysis.Three out of 16  A specific focus in our analysis will be the locations of binding sites of the protein PML-RARα, for which Ref. [8] lists some 225 locations on Chr 17. Epigenetic variation in the vicinity of these sites might indicate modulation of the expression of this protein, and hence, detection of differential methylation in these sites may lead to key insights.

Model
We propose a Bayesian functional regression profiling method to identify DMCs.We describe the main aspects of our method including the introduction of a functional representation of the methylation profiles incorporated into a generalized linear model (GLM), the possible statistical inference approaches, several numerical challenges in model fitting, obtained parameters, and the procedure by which a CpG is classified as a DMC.
We regard the methylation counts as realizations of a binomial(n(t), β(t)) process observed on the interval [0, T], which represents some large genomic region, where the read depth {n(t), t ∈ [0, T]} may vary as the result of the sequencing process or the underlying propensity for methylation and may be platform dependent.In this paper, we condition on the read depth data, thereby ignoring their stochastic properties.We specify that where B(t) = Y(t)/n(t), and we essentially consider a (functional) model for {B(t)}.We have access to replicate data {Y i (t), t ∈ [0, T]}, i = 1, . . ., m, and potentially fixed covariates.We construct a functional representation of β(t) using a natural cubic spline basis in t; other spline bases may also be used.Our representation allows for further decomposition into group-specific effects when a grouping structure is present.

A Spline Model for Methylation Data
We respect the discrete nature of nucleotide positions and regard the observation locations as positive integers t = 1, . . ., T. We denote the methylation data of a given sample profile by {(y t , n t ), t = 1, . . ., T}, where y t and n t , respectively, represent the methylation read count and read depth at the tth genomic position.A reasonable, if imperfect, model assumes that y t |n t , β t ∼ Binomial(n t , β t ), for t = 1, . . ., T, where β t is the propensity for each site t to be methylated at each read instance.In this model, each read is assumed to yield a binary outcome that is conditionally independent of other reads.The methylation level β t changes due to different sources of variation.In the BLK data, there are three cell types (B cells, T cells, and monocytes).In the APL data, there are four types of samples (CD34, PMC, RBM and APL), as well as the individual's age and sex.Thus, there is individual-level variation, group variation, and possible variation influenced by additional covariates, and these can be incorporated into a binary regression model where G is the number of groups in the variable of interest, m g is the number of samples (individuals) in the gth group, and T is the number of cytosines (CpGs) in a sample.
Due to the DNA methylation autocorrelation structure and the unknown functional form of methylation patterns in different regions, we propose a spline-based functional (logistic) regression model.Assume the vector x is formed from a p-dimensional (natural cubic) spline basis derived from methylation positions.We then decompose β t,g,i in (1) as where γ 0 represents the baseline group-level parameter, δ 2 , . . ., δ G are the group-specific contrasts by setting δ 1 ≡ 0, and υ g,i are the individual-level variations.The γ 0 , δ g , and υ g,i are all p × 1 vectors.The design matrix for any individual is denoted as X g,i which is a T × 3p block matrix.The entire design matrix X that includes all groups (G) and all individuals (M = ∑ G g=1 m g ) is of dimension MT × (G + M)p.The dimension of the spline basis, p, is termed the resolution and is equivalent to a bandwidth parameter.Note that other spline bases will produce fairly similar results.Finally, the model can be augmented to for some non-positional and non-spline-related fixed effects, where η is a q × 1 vector.

Modelling the Impact of Read Depth
We emphasize the importance of adding an extra covariate based on read depth because read depth varies systematically with methylation levels and because higher read depths tend to result in larger methylation levels (Supplementary Figure S3).For instance, a plausible model for the BLK data is given as which allows for the effect of read depth in an additive fashion on the linear predictor scale.We treat read depth as a deterministic quantity and perform a conditional analysis, even though the stochastic properties of read depth might also be of interest in other settings.

Bayesian Inference
We first consider maximum likelihood (ML) estimation and then a Bayesian approach.ML estimates are useful in full Bayesian analysis as they allow for the Bayesian computational strategy to be implemented more efficiently by providing initial estimates for the Bayesian procedure.The log-likelihood function is where Ψ = (γ 0 , δ 2 , . . ., δ G , β 1,1 , . . ., β G,m G , η) and β t,g,i / 1 − β t,g,i = e x(γ 0 +δ g +υ g,i )+z g,i η .
One can derive a numerical solution Ψ = arg max Ψ ℓ(Ψ) for the ML estimates; however, the process involves solving a system of nonlinear equations and is computationally burdensome.As is usual in GLMs, one can adopt an iteratively re-weighted least squares (IRLS).Due to high-dimensionality, there are typically memory capacity issues; however, one can exploit the block-structure of the design matrix to speed up the process, or one can resort to minorization-maximization (MM) or quasi-IRLS, which simplifies the iterative estimation procedure.These procedures are appreciably faster to compute but can be slower to converge to MLE.We prefer a fully Bayesian approach, as it yields a more complete representation of associated uncertainties in light of the data and model specification.

Bayesian Model
Denote the estimated logistic scale methylation levels by β * = [β t,g,i ].We propose the full Bayesian model characterized by the factorization We allow for a deterministic relationship between (x, z, γ 0 , δ g , υ g,i , η) and β * , specifically that at each site, logit(β t,g,i ) follows the linear model in (2).By choosing suitable priors, inference for the collection of unobserved quantities (γ 0 , δ g , υ g,i , η), and β * is attained using computational methods based on Markov chain Monte Carlo (MCMC).For full Bayesian inference, we perform block updates of the effect-specific parameters.We devise a Gibbs sampler algorithm by introducing independent Gaussian priors N(0, τ 2 θ ) on the model parameters, although this can be easily relaxed [43].The hyper-parameter τ θ can be specified subjectively, but it is also common to adopt an empirical Bayes approach and choose τ θ after inspection of the data.In Section 5.1.5,we study the effect of priors having common τ θ versus InverseGamma(a, b) on τ θ j , where (a, b) are estimated empirically.
The Gibbs sampler updates are structured as block updates for the term-specific parameters, γ 0 , δ g , υ g,i , η, conditional on the other parameters.In each update, full conditional posterior distribution corresponds to a posterior from a Bayesian binary regression model.There are several possible strategies for a full update for one parameter block: 1.
Multiple 'inner' accept-reject Metropolis-Hastings steps to sample the full conditional posterior giving an exact block update leading to faster convergence in the 'outer' chain; 2.
Single 'inner' accept-reject Metropolis-Hastings steps to sample the full conditional posterior, giving an exact block update but possibly leading to slower convergence in the 'outer' chain; 3.
Gaussian data augmentation samplers [44]; these methods can be useful for logistic and probit models and for models based on scale mixtures of Gaussians, but for large data sets, they involve the introduction of a considerable number of augmenting variables that require storage and sampling; 4.
An exact update using the approximate full conditional distribution given by the Normal approximation to the non-Normal exact version, i.e., from the standard approximation N( θ, (X ⊤ W( θ)X) −1 ) inspired by the quadratic approximation of log-likelihood.
The fourth approach appeals to a Gaussian approximation of the full posterior distribution, and we have found that the Gaussian approximation works well in most cases.The Gibbs sampler approach-rather than a joint update of all parameters simultaneously-is usually more feasible in large-scale problems.Finally, it may not feasible to build a complete functional regression genome-wide in our method.Instead, we use a partitioning approach that fits the model in smaller genomic segments and which is tailored to the computational resources available.Alternative approaches for genome-wide analysis without partitioning are variational Bayes, which constructs an approximate Bayes solution based on a tractable approximating model, or Hamiltonian MCMC, which introduces further augmenting variables and exploits the energy configuration (of the log joint posterior) in the extended space.Our Gibbs sampler algorithm is given in Algorithm 1.

Identifying DMCs
Having MCMC samples for parameters at CpG t, we compute the 100(1 − α)% credible interval for each (multiple) comparison(s) in the categorical variable of interest.If at least one of the intervals does not contain zero at CpG t, the position is classified as a DMC.The process is undertaken for all CpGs.In addition to (overall) DMC, the pairwise DMCs are also reported when there are more than two categories in the variable of interest.An alternative approach to detect DMC is to use global Bayesian p-values [45] or Bayesian FDRs [46].See Section 8 for a discussion and comparison.

Simulation Study
We have assessed the performance of DMCFB in simulations.We carry out four simulation studies.The first uses the scenarios from [33] which are inspired by the cell-separated data described in Section 2.1, and the remaining three include, in Section 5.2, a study adapted from [47], in Section 5.3, a simulation inspired by data generated from HMM models, and in Section 5.4, a study adapted from [27].In simulation studies, we set α = 0.05 and collect a set of 10,000 MCMC samples thinning by taking one in every 10 samples after a burn-in of 5000 iterations.

Simulation Study Example 1 5.1.1. Scenarios and Settings in Example 1
We use identical scenarios and settings as the simulation proposed in [33].The methylation information available in the BC and monocyte samples were chosen to generate simulated data as follows: 1.
First, the read depths the methylation counts of all BC samples were aggregated.

2.
Next, the missing information was imputed using nearby count information.

3.
Using 'lowess' in R, we fitted a smooth curve (span = 0.05) denoted G1; this curve was used to specify a baseline profile.

4.
Using Supplementary Table S2, we generated the second group curve called G2 by adding effect sizes to the chosen regions of G1. 5.
By fitting a Normal distribution to the methylation difference between G1 and all BC samples, we obtained µ ≈ 0 and σ ≈ 0.18.We used these estimates to add extra variation to G1 and G2 for all samples.6.
Read depths of all BC and monocyte samples were chosen to generate R = 500 simulated datasets; within each dataset, we generated 8 samples from G1 and 13 samples from G2 by completing the following steps: (a) Additional variation at each site was generated using N(0, 0.18) random errors, which were added to the methylation levels in G1 and G2.(b) All values are truncated to fall in [0, 1].(c) Methylation counts were generated by multiplying the generated methylation levels in the previous step.The integer parts of these data were chosen as the final methylation counts.
Supplementary Figure S4 presents the graph of the DMRs, a simulated dataset that compares it to the methylation in the real dataset.

Simulation Results in Example 1
Quantification of the performance of each method is based on the following definitions: true positive (TP, CpG is correctly identified as DMC); true negative (TN, CpG is correctly identified as non-DMC (NDMC)); false positive (FP, CpG is incorrectly identified as DMC); false negative (FN, CpG is incorrectly identified as NDMC).The 'sensitivity' (SE) of DMRs and 'specificity' (SP) of NDMRs for the rth simulated dataset are defined as SE j (r) = #TP in the jth DMR #CpGs in the jth DMR , and SP k (r) = #TN in the kth NDMR #CpGs in the kth NDMR , j = 1, . . ., 10, k = 1, . . ., 11, and r = 1, . . ., R = 500.The average SE j of the jth DMR and the average 'specificity' of the kth NDMR are respectively calculated as SE j = ∑ R r=1 SE j (r)/R and SP k = ∑ R r=1 SP k (r)/R.The accuracy is calculated as ACC = {#TP + #TN} #CpGs for each method.We also report the modified accuracy (MACC) by assuming that DMR1 through DMR8 are the only real DMRs as given in [33].
We set the nominal false discovery (FDR) threshold at 5% for any applicable method.We modified the default parameters in each tool to match simulation settings and to increase their efficiency.WFMM does not impute missing data; instead, we used a naive approach based on neighboring information to impute them in the simulated datasets so as to obtain higher performance.In DMCFB, we choose normal priors, p = 30, and α = 0.05.
Figure 1a depicts the ACC and MACC results for different methods.DMCFB performs better than the existing methods.Figure 1b,c illustrates the average SE j and the average 1 − SP k , separated for each DMR and NDMR.DMCFB is superior in terms of SE j .The results for 1 − SP k show that DMCFB is either better than or comparable to other methods.One advantage of DMCHMM is the ability to detect small differences in methylation profiles (DMR1, DMR8) and large differences with high precision (DMR2-7) [33].Our proposed method outperforms DMCHMM in this regard.Supplementary Figure S10 presents Cohen's Kappa.Our method outperforms the competing methods.Supplementary Figure S11 shows the proportions of the times that the start and end of a DMR are identified as DMCs.DMCFB's results are higher than those of other methods.The empirical FDR (eFDR) in Supplementary Figure S12 shows that DMCFB attains the nominal FDR.

Sensitivity Analysis in Example 1
Since our proposed Bayesian approach depends on the choices of bandwidth and prior, we performed sensitivity analyses concerning both specifications.For bandwidth, we have chosen a wide range of values in {20, 25, . . ., 60} to study its impact.Figure 2a,b compares SE j and 1 − SP k of DMCFB using different bandwidth lengths.From this figure and Supplementary Figures S23-S26, we conclude that the results are fairly robust for the bandwidth selection, specifically when the difference in methylation levels is large.Next, concerning prior precision, we examine Gaussian priors with different precisions, τ −1 θ ∈ {30, 20, 10, 1/0.18, 1, 0.3, 0.2, 0.1}, for sensitivity analysis.It transpires that, overall, better results are observed with respect to eFDR if the precision is estimated from the data via empirical Bayes.In this case, the average empirical site-specific standard deviation proves to provide a reasonable scaling, leading to the specification τ θ = 0.18. Figure 2c,d and Supplementary Figures S32-S35 show the robustness of the results.An alternative approach is to use prior p(τ θ ) and estimate its parameters empirically (Section 5.1.5).
A summary of the conclusions drawn from several simulation studies is as follows: 1.
DMCFB outperforms all considered methods in almost all criteria including accuracy, Cohen's Kappa, SE j , and the percentage of detecting start and end of DMRs.

2.
In terms of 1 − SP k , the DMCFB method is either comparable or better than that of other methods.

4.
Similar to DMCHMM, DMCFB's performance is excellent in DMRs with large effect sizes (say 0.2) and does not depend on any other features.5.
For small effect sizes (say 0.1), our method's performance is superior to other methods'.

6.
DMCFB does a better job in smoothing the data, followed by DMCHMM and WFMM.Specifically, DMCFB is robust concerning the length of DMRs and the size differences; see DMRs 2-7.

7.
Similar results are observed using larger variability for noise (σ = 0.24) (see Supplement S4.3).The decline in performance using our method is negligible.This result shows that DMCFB is robust with regard to variation in the data, while other methods have noticeable performance loss.8.
Using a different simulation setting with DMRs of different lengths, we observed trends similar to those described above (see Supplement S4.4).9.
Sensitivity analysis shows that DMCFB is robust for bandwidth selection.10.A sensitivity analysis is conducted by re-assembling regions to see how a regulatory landscape is altered by the grouping factor.The results are observed to be robust (see Supplement S5.4).11.Sensitivity analysis concerning the precision of prior distribution shows the robustness of the results.More specifically, better eFDR is observed when the prior precision is estimated from data using an empirical Bayes approach.

Missing Value Imputation
Missing value imputation in this type of methylation data demands extra caution due to high missingness (e.g., 63% in the APL data) and low read depth.A common Bayesian imputation technique is to use the full conditional posterior p(y mis , n mis |y −mis , n −mis , β * ) (or for non-stochastic read depths p(y mis |y −mis , n, β * )).However, here, the missingness is of a different nature to conventional missing outcome data; in almost all cases, the missingness is due to an absence of any reads at a given location.To this end, we propose a simple yet efficient method to impute missing values by setting (y = 0, n = 0) for missing values and predicting β * .Such a method does not require read depth imputation, and the log-likelihood of missing data becomes zero, playing no role in total log-likelihood.A simulation (Figure 3, for Example 1) shows that our method (blue color) has substantially better performance over Bayesian imputation (khaki color).In addition, our method is faster.

Choice of Priors
The choice of priors is of prime importance.In general, it is straightforward using simulation to study the impact of different prior specifications; elicitation can be readily carried out by simulation from the prior and then reconstructing the prior for the profiles by computing the β values in (3).To expedite the computations, we use conjugate priors and estimate the common uncertainty empirically.This assumption can be easily relaxed by using non-informative priors [43], although we do not recommend the use of improper priors.In a simulation study (Figure 3), we used inverse-gamma priors p(τ θ j ; a, b) and estimated the hyperparameters (a, b) empirically.We observe that a conjugate prior with common uncertainty (blue color) gives a similar performance compared with inverse gamma (red color) but is slightly faster.

Simulation Study Example 2
In the second simulation study, different from Example 1, we adopted the simulation setup in [47].To this end, we used 5000 genomic positions and created two groups each of 10 subjects.We then generated 50 DMRs for each of the R = 100 simulated datasets.As stated in [47], a DMR is constructed by sampling a cluster of neighboring CpGs and simulating the number of methylated reads, conditional on observed coverage, for the samples from one population from a binomial distribution.The binomial probabilities are equal to the observed methylation proportions plus or minus an effect size that is randomly sampled from a distribution that represents small to moderate effect sizes (ranging from approximately 0.1 to 0.5).
Since the lengths and locations of the DMRs were different from each other in each simulated dataset, we only calculated the overall accuracy, sensitivity, specificity, and CKappa.Due to crashing, we were not able to run the simulations for BiSeq and DMCHMM.We further added dmrseq to the methods for comparisons.We depicted the results in Figure 4a.From this figure, we observe that our proposed method outperforms all the rival methods.DMCFB was the only method with both sensitivity and specificity above 95%.Note that we run some other simulation settings using a similar approach, and in some of them, the performance of all methods was slightly lower, but in all of them, our method outperformed all other methods.Note that we changed the default options in bumphunter to increase its efficiency.

Simulation Study Example 3
In the third simulation study, different from Examples 1 and 2, we simulate 5000 positions from an HMM model with eight states spread over [0, 1] and binomial emissions, where the read depths are generated from uniform [15,45] (or Poisson (λ = 30)).This pattern represents Group 1 or baseline.We generate 10 random samples (subjects).We do the same for Group 2 except that we randomly select four regions of random sizes between [100, 300] and add methylation differences {0.2, −0.2, 0.3, −0.3}.These regions represent DMRs.In total, we simulate R = 100 datasets and run the analysis.We present the results in Figure 4b.We can observe that DMCFB is again the dominant method.

Simulation Study Example 4
We follow the settings in Ref. [27].Note that the settings are designed for microarray data and only simulate the logit of β values.Thus, we modified the code to generate single nucleotide data and used the read depths from the BLK data.We set the parameters of the simulation as follows: the number of bumps as 10, the group size as 200, the number of probes in each sample as 5000, the parameter of bumpmaker as 0.3, the correlation structure as AR(1; ϕ = 0.21, σ = 0.2), and the outlier distribution as t 5 .To this end, we simulate eight samples (representing B cells) as the baseline group and 13 samples (representing monocytes) as Group 2 having bumps with random sizes and effects and having random locations.In total, we generated R = 100 datasets and presented the results in Figure 4c.Again, DMCFB outperforms other methods.It has the perfect specificity with the highest sensitivity, accuracy, and CKappa values compared to the other methods.Note that we set the smoothing to FALSE in DSS to increase its efficiency in both Examples 3 and 4.

Proof of Principle Analysis of Sorted Cells
The BLK data resembles a dataset in which several challenges such as multiple groups, variable read depths, and missing values exist jointly.Almost none of the existing methods can efficiently handle such complexity by addressing all the known challenges.To this end, we have compared the methylation profiles between cell types near BLK described in Section 2.1 via DMCFB.The BLK data have 8 BC, 13 monocyte, and 19 TC samples in which methylation counts and read depths for 30,440 CpGs are available.We use normal priors in which τ θ ≈ 0.18 is computed after inspecting the data.The bandwidth is set to 30.The fitted model is identical to (3), where BC is the baseline and n t,g,i is the read depth at position t of sample i in group g where t = 1, . . ., T = 30, 440, g = 1, 2, 3 and i = 1, . . ., m g with m 1 = 8, m 2 = 13 and m 3 = 19.
To achieve genome-wide significance in genetics, the thresholds of α = 10 −5 and α = 10 −8 are commonly used [50].Appropriate thresholds for epigenetic data such as methylation will depend on the platform, the coverage of the genome, and the correlations between signals across the genome.Although an appropriate threshold for WGBS data is not known, for real data analyses in this paper, we set α = 0.05, 10 −5 , and 10 −8 and collect a set of 50,000 MCMC samples, thinning the output by taking one in every 10 samples, after a burn-in of 10,000 iterations.
We depict the trace plots of one parameter in the model in Supplementary Figure S44 for five selected CpGs, which shows that the convergence occurred.Similarly, all other parameters of all positions converged quickly.We constructed the credible intervals for each α (Supplementary Table S5).A CpG site is classified as a DMC if at least one of the (1 − α)% credible intervals for group comparisons does not include zero.
For α = 10 −8 , we observed, in total, 28.6% of CpGs were identified as DMC (Supplementary Table S5).Among these, approximately 13.6%, 11.4%, and 18.5% of CpGs were DMCs, respectively, for pairwise comparisons BC vs. monocyte, BC vs. TC, and monocyte vs. TC.Figure 5 depicts the analysis results for the region near the BLK promoter, where we can observe that the three cell types are differentially methylated specifically at the BLK promoter.The results are in concordance with the literature [42].It is worth noting that most analytical tools are incapable of imputing missing values efficiently and analyzing all three groups in the BLK data at once, i.e., performing three simultaneous comparisons.

Analysis of Patients with Acute Promyelocytic Leukemia vs. Control Cell Types
Ref. [8] described differences in methylation between patients with APL and three types of normal cells: mononuclear cells from remission bone marrow, CD34 + cells from healthy donors, and promyelocytes derived in vitro from CD34 + cells.APL is known to be caused by a Chr 15 and Chr 17 translocation affecting promyelocytic leukemia-retinoic acid receptor α, and these authors explored, in RRBS data, how methylation patterns differed across the genome.Using RRBS profiling, Ref. [8] showed widespread hypermethylation in APL samples, not only restricted to the location of the translocation.
We use DMCFB to reanalyze the data on Chr 15 and 17 to see if additional sensitivity can be achieved.From the point of view of building an analytic approach for these data, the APL data present several significant challenges: 1.
A large proportion of the data is missing.Most methods cannot handle such complexity and will remove either all or big chunks of positions.

2.
Additional covariates such as age and sex are of interest, and this may force users to choose among a few methods that can handle multiple covariates.

3.
Read depths vary dramatically in different regions; few methods account for such information.

4.
It can be seen from Supplementary Figure S2 that the methylation pattern changes dramatically in different regions; simple smoothing techniques cannot efficiently capture such complexity.
We can conclude that the approaches that are taken in [8] (i.e., removal of most positions with missing values and low read depth and not including additional sources of variation in age and sex) have probably led to a set of weak or unreliable results.Hence, we reanalyzed the APL dataset using DMCFB and compared the results with those of BiSeq.
We conduct two reanalyses of the data.First, we perform a simple case-control comparison without considering any additional covariates and grouping the three types of controls (CD34, PMC, and RBM) together.This allows for a direct comparison of the results from DMCFB with the results from BiSeq presented by [8] (Section 7.1).The second analysis compares the four groups of samples and includes covariates (Section 7.2).

Reanalysis of the APL Data by Comparing DMCFB and BiSeq
The model logit(β t,g,i ) = x(γ 0 + δ g + υ g,i ) + η 1 log(n t,g,i + 1), is used in DMCFB, where γ 0 is the baseline group level (control); δ 2 (APL) is the group-specific contrast by setting; δ 1 ≡ 0, v g,i are the individual-level variations; x is the (natural cubic spline transformation) design matrix; n t,g,i is the read depth at position t of individual i in the group g for t = 1, . . ., T (T = 285, 437 and 510, 386 for Chr 15 and 17, respectively); g = 1, 2; and i = 1, . . ., m g with m 1 = 16 control samples and m 2 = 18 APL patients.
We used normal priors in which τ θ ≈ 0.12 after inspecting the data.We depict the trace plots of one parameter in the model in Supplementary Figure S45 for five selected CpGs in Chr 17, which shows that the convergence occurred.Similarly, all other parameters of all positions converged quickly.Here, we choose a threshold of α = 10 −8 and note that very similar numbers of DMCs were identified if we used the more liberal threshold.For BiSeq, we first set α = 10 −8 , but it led to only a very small number of DMCs.Therefore, in BiSeq, we set α = 0.1 and 0.05 for testing and trimming clusters, respectively.
Table 1 (A, B) shows how and when DMCFB and BiSeq agree or disagree for Chrs 15 and 17, respectively.The tables show the numbers of DMCs called for the two methods along with the percentages conditional on an annotation (islands, shores, and deserts).First, in Table 1 (A), we observe that a much smaller number of sites are called DMC by BiSeq relative to DMCFB.The significance thresholds used for BiSeq will affect this proportion, but the level we have chosen is already quite liberal (α = 0.05).Second, we can see that if BiSeq calls a site a DMC, then DMCFB will call the same site a DMC 76.1% of the time.This proportion is fairly consistent across the three site annotations, although it is a little lower for desert regions than for islands and shores.In contrast, we can see substantial differences in performance among the sites that BiSeq does not call DMCs.Overall, the DMCFB method calls of these sites DMCs.However, this percentage ranges from 25.6% for CpG deserts to 44.5% for the CpG islands.In Table 1 (B), similar findings are obtained.The agreement with BiSeq calls is a little higher-80.89%instead of 76.1%-and a similar trend is seen among sites not called by BiSeq, such that DMCFB is calling more DMCs in islands and shores than in desert regions.Supplementary Figures S50 and S51 look at coherence within islands.If an island is identified as differentially methylated by DMCFB, then it is more likely that most of the CpGs inside the island are identified as DMC compared to BiSeq, and this finding indicates better smoothing by DMCFB, as we saw in simulation studies.We now address the detection of differential methylation in the vicinity of PML-RARα binding sites.Supplementary Table S8 presents which of the known binding sites listed in Ref. [51] are DMC, when comparing APL and controls, by either method.We observe that DMCFB identified more binding sites as DMC than did BiSeq.On Chr 15, DMCFB identified 12 sites, while BiSeq identified only 2 sites as DMC; on Chr 17, DMCFB identified 33 sites, while BiSeq identified only 2 sites.We also display the read depth of the known binding sites that are not DMC by either method or identified by only DMCFB in Supplementary Figures S46A,B on Chr 15 and S47A,B on Chr 17.From these figures, we observe that when neither method detects a DMC (at known binding sites), often the read depth is very low, and so the capability to detect DMCs is compromised.However, when the read depth is moderate or high, DMCFB identifies appreciably more binding sites as DMCs.In addition, DMCFB exhibits more sensitivity for capturing the documented widespread dysregulation.
Additional results are found in Supplement S8.1.Some highlights are: method's efficient use of information.When both methods call a DMC, read depth tends to be very high in the APL samples, i.e., a strong clear signal for differential methylation.• Supplementary Figures S52-S55 illustrate the difference between the raw and smoothed data and show how APL's methylation levels are often much higher than those among controls.The differences between patients and controls become far more apparent after smoothing, and overall, DMCFB captures far more sites as differentially methylated than BiSeq.
We present a comparison between DMCFB and DMCHMM in Supplement S8.3.We set α = 0.05, 10 −5 and 10 −8 in constructing the credible intervals.Table 2 shows the results for Chr 15 and 17.The patterns in both Chrs and all three α levels are almost similar, specifically for α = 10 −5 and α = 10 −8 .This analysis leads to six age-and sex-adjusted pairwise comparisons across the four sample types.The greatest number of DMCs are found for the contrast APL vs. CD34.Overall, about 5% of CpGs were identified as DMC using α = 0.05.It is worth noting that almost all of the analytical tools are incapable of analyzing the APL data by comparing all six contrasts in one run and considering all additional information on subjects while keeping all the CpGs in the analysis and imputing the missing values.

Discussion
We have proposed an efficient analysis method, DMCFB, for the identification of epigenetic features including DMC identification based on a functional data model and a Bayesian estimation and inference procedure.We demonstrated its superiority over existing methods via exhaustive simulation studies and its robustness via sensitivity analyses concerning both bandwidth and prior selection.DMCFB is flexible in terms of adding any source of variation and is robust with respect to the true underlying methylation pattern.Missing values are automatically imputed.The method is capable of adding discrete or continuous covariates or combinations.Most existing methods ignore read depth information in the analysis; by adding the read depth as an additional source of variation, our proposed method adjusts the methylation levels for better estimation.Similar to DMCHMM, our proposed method shows consistent behavior; the results depend on the difference in methylation between groups and not other aspects of DMRs such as length, location, autocorrelation pattern, read depth, etc.One drawback of DMCHMM is that the efficiency deteriorates if the distances between CpGs are explicitly included in HMM.Our proposed method is not restricted in this sense.In DMCFB, we reduced the need for the imputation of read depths and methylation counts by setting them to zero and letting the functional pattern estimates the missing methylation levels.This approach has reduced the computation time and increased the performance.
Among others, Ref. [52] has shown that using centered parameterization is a better approach for MCMC efficiency and speed when there are parameters unidentifiable from observed data, such as missing values.They proposed the ancillarity-sufficiency interweaving strategy (ASIS).As a future work, we intend to examine the utility of, and implement, ASIS in DMCFB whenever missingness is a significant factor.
There are multiple approaches for controlling family-wise errors.One may use SimBaS and global Bayesian p-values [45] or Bayesian FDRs [46].Our simulation study showed that our proposed method works better than SimBaS (Supplementary S6).We intend to add SimBaS as an alternative approach in DMCFB.Note that Ref. [53] observed inflation in FDR in their simulations.In our settings, we observed better FDRs for SimBas but better accuracy, sensitivity, and CKappa with DMCFB.
While reanalyzing the APL data, we noted that BiSeq's default significance threshold maintains a very stringent control over the false discovery rate, and only a small number of DMCs are identified; in fact, DMCFB identifies overall 6 (Chr 15) to 10 (Chr 17) times more DMCs than BiSeq.However, it is possible to adjust the BiSeq package options slightly, and we ran another series of analyses where the BiSeq's call rate was 7.5% (Supplementary Table S11 (A)) instead of 6.5% (Table 1 (A)) on Chr 15.However, the results of this sensitivity analysis showed less agreement with DMCFB, i.e., among DMCs called by BiSeq, the percentage also called by DMCFB dropped from 76.12% (Table 1 (A)) to 62.87% (Supplementary Table S11 (A)) on Chr 15.In contrast, among CpGs not called by BiSeq, the percentage of DMCs called by our method remained essentially unchanged from Table 1.This suggests that when BiSeq is less confident in a call, the two methods are using nearby reads and methylation patterns differently.Furthermore, the ability to capture more DMCs in CpG islands reflects DMCFB's ability to use information from adjacent CpGs via our functional smoothing.Given the density of CpGs in the islands, we obtain much more sensitivity in these regions.DMCFB, due to the functional regression and efficient imputation, seems to better capture a signal that persists across several CpGs.DMCFB often identifies all CpGs in an island as differentially methylated, whereas BiSeq tends to capture less than half.Similarly, we have shown that if an island displays differential methylation, that DMCFB more often also finds differential methylation in the adjacent shores.Similarly, the additional sensitivity of DMCFB is also visible when examining the known binding sites of PML-RARα; the performance of our approach was particularly notable when the read depths were moderate.The two methods are both smooth, but they use the information differently.
Let us elaborate on several computing challenges and the techniques implemented in DMCFB.(1) Partitioning: We partition data for two reasons: one, to avoid using a big dimensional design matrix which results in very high-dimensional computation and memory

Figure 2 .
Figure 2. (a) Average proportion of correctly identified DMCs ('sensitivity') and (b) incorrectly identified DMCs ('1-specificity'), and (c) average proportion of correctly identified DMCs and (d) incorrectly identified DMCs for each choice of bandwidth.Results are separated by DMRs and NDMRs in simulated data for Example 1 Scenario 1 using DMCFB.Errors are generated from N(0, 0.18).(SD error bars are added).

Figure 3 .
Figure 3. Comparing DMCFB with three settings: Naive imputation with a common hyperparameter, Bayesian imputation with a common hyperparameter, and Naive imputation with inverse-gamma priors on hyperparameters.

Figure 4 .
Figure 4.The performance of different methods in simulation study Example 2 (a), Example 3 (b), and Example 4 (c).

Figure 5 .
Figure 5. BLK data.Identified DMCs for three pairwise comparisons of cell-type-specific methylation data near the BLK gene promoter.Short black vertical lines indicate CpGs where one cell type was significantly different from the other at credible interval level α = 10 −8 .The average methylation level for each cell type is plotted.Long vertical lines are the start and end of the BLK gene.
samples in the control group and 12 out of 18 samples in the APL group are females.Age ranges from 20 to 83 years.On Chr 15, there are 285,437 CpGs available, and 64.27% of CpGs across the samples have at least one missing value.Respectively, 6.19% and 32.31% of CpGs in the control and APL groups have missing values in all samples.On Chr 17, there are 510,386 CpGs available, and among these, 59.81% of the CpGs have at least one missing value, and all samples are missing in 5.55% of CpGs in the control group and 28.91% of CpGs in the APL group-see Supplement S1.2.

Table 1 .
Comparing DMCFB and BiSeq in identifying DMCs in the APL data.
7.2.Reanalysis of APL Data by Accounting for All the Information Using DMCFBIn this section, we reanalyze the APL data using DMCFB but account for the additional information of sex, age, and the four groups (CD34, PMC, RBM, and APL).The full model is set to logit(β t,g,i ) = x(γ 0 + δ g + υ g,i ) + η 1 log(n t,g,i + 1) + η 2 z g,i + η 3 w g,i ,where γ 0 is the baseline group level (CD34); δ 1 ≡ 0, δ 2 (PMC), δ 3 (RBM), and δ 4 (APL) are the group-specific contrasts; v g,i are the individual-level variations; x is the (natural cubic spline transformation) design matrix; n t,g,i is the read depth at position t of individual i in group g with its effect η 1 ; z g,i is the sex of individual i in group g with its effect η 2 ; and w g,i is the age of individual i in group g with its effect η 3 .Recall t = 1, . . ., T (T = 285,437 and 510,386 for Chr 15 and 17, respectively), g = 1, . . ., 4 and i = 1, . . ., m g with m 1 = m 2 = 4, m 3 = 8, and m 4 = 18 for the four groups CD34, PMC, RBM, and APL, respectively.

Table 2 .
Reanalysis of the APL data using DMCFB by accounting for all available information.