Detection of Differentially Methylated Regions Using Bayes Factor for Ordinal Group Responses

Researchers in genomics are increasingly interested in epigenetic factors such as DNA methylation, because they play an important role in regulating gene expression without changes in the DNA sequence. There have been significant advances in developing statistical methods to detect differentially methylated regions (DMRs) associated with binary disease status. Most of these methods are being developed for detecting differential methylation rates between cases and controls. We consider multiple severity levels of disease, and develop a Bayesian statistical method to detect the region with increasing (or decreasing) methylation rates as the disease severity increases. Patients are classified into more than two groups, based on the disease severity (e.g., stages of cancer), and DMRs are detected by using moving windows along the genome. Within each window, the Bayes factor is calculated to test the hypothesis of monotonic increase in methylation rates corresponding to severity of the disease versus no difference. A mixed-effect model is used to incorporate the correlation of methylation rates of nearby CpG sites in the region. Results from extensive simulation indicate that our proposed method is statistically valid and reasonably powerful. We demonstrate our approach on a bisulfite sequencing dataset from a chronic lymphocytic leukemia (CLL) study.


Introduction
It is now widely accepted that cancer develops through a series of stages [1]. It starts from a very limited area, not invasive and metastatic at the early stage, then spreads to distant sites in the body, and becomes highly invasive and metastatic at the late stage. In addition, patient survival times are significantly reduced at the late stages. For example, the 5-year relative survival rate for lung cancer is 54% at a localized stage, and is reduced to 4% at the distant stage [2]. More than half of lung cancers are diagnosed at a distant stage, which indicates that early diagnosis of cancer is the main factor to enhance patient survival. Therefore, markers for early detection and proper classification of the tumor are extremely critical to improve life expectancy. Furthermore, identifying high-risk cancer patients at an early stage, would allow them to receive standard chemotherapy in advance.
DNA methylation has been found to be a marker for disease diagnosis, such as in cancer [3]. Significant progress has been made using DNA methylation differences to capture substantial information about the molecular and gene-regulatory states among biology subtypes, such as tumor and normal tissues [4].
In addition, DNA methylation can be used as a marker to differentiate disease severity, such as early and late stages in breast cancer [5], ovarian cancer [6] and prostate cancer [7]. Most of them have potential functions in inducing and suppressing cancer metastasis. Moreover, DNA methylation is associated with tumor size in colorectal cancer [8].Patients with higher methylation showed more frequent recurrence as compared with the low-methylation group, and shortened cancer-related survival and recurrence-free survival [8].
These findings show the critical importance of a better understanding of cancer progression and metastasis, which could help make better prediction of the clinical aggressiveness of cancer. Since DNA methylation is associated with disease severity, detecting differentially methylated regions (DMRs) can help understand cancer progression.
Most analyses are conducted by creating dichotomies based on biological subtypes, such as early and late cancer stages, and then detect DMRs by comparing the differences of DNA methylation rates between two groups [5][6][7]. However, when there are actually more than two groups, such approaches may lose information regarding multiple disease status, due to collapsing or ignoring clinically relevant subtypes, resulting in suboptimal clinical conclusions and decisions.
To use multiple disease status, it is possible to run multiple testing for the association between DNA methylation and multiple group responses, using the methods for two groups. Although we can simply run analysis for all pair-wise comparisons and combine the results, it is not trivial when considering the regional correlation of DMRs, and would increase the multiple testing burden.
Another possible method is the generalized linear model that includes indicator variables for different levels of disease status. This method has the advantage that it can adjust for covariates. However analysts are often faced with noisy estimates of category-specific regression coefficients, which can lead to unreasonable patterns in the regression coefficients corresponding to different levels of disease status, and it can reduce the power [9].
To improve the efficacy of an overall test, one can take advantage of the fact that cancer develops through a series of stages, or different levels of disease severity in general, and develop statistical methods that can incorporate the ordering of disease status. However, the widely used trend test is not an ideal method, because it requires scores or weights for different levels of disease status, which are generally unknown.
Here we propose a Bayesian approach and use the Bayes factor to test the association between methylation rates and disease severity. The proposed Bayes Factor Method (BFM) can incorporate monotonicity constraints, and find DMRs in which methylation rates increase (or decrease) as the diseases become more severe. Patients are classified into groups based on the disease severity (e.g., stages of cancer), and DMRs are detected by using moving windows along the genome. Within each window, the Bayes factor is calculated and is used to test the hypothesis of constant versus monotonic increase in methylation rates corresponding to the severity of the disease.
In addition, since DNA methylation rates have been shown to be correlated at nearby CpG sites with complicated correlation structure [10], a linear mixed-effect model is used to incorporate the correlation of methylation rates between and within CpG sites in the region.

Methods
Classical statistical inference under constrained parametric spaces has been addressed by many studies. Among them, Bartholomew [11] presented one of the first tests for K multinomial proportions with inequality constraints. He proposed a test of H 0 : p 1 = p 2 = . . . = p K against the simple ordered H 1 : p 1 ≤ p 2 ≤ . . . ≤ p K with at least one strict inequality, where p k (k = 1, 2, . . . , K) represents the proportion the k th group. Under H 0 , the maximum likelihood estimator of p k is the overall sample proportion π k . If the sample multinomial proportions satisfy π 1 ≤ π 2 ≤ . . . ≤ π K , then the order-restricted ML estimator isp k = π k . However, sometimes the sample proportions may not satisfy the ordering π 1 ≤ π 2 ≤ . . . ≤ π K ; in that case, calculation of the restricted maximum likelihood estimator (RMLE) is subject to arbitrary orderings of the parameters, and it requires specialized algorithms that are not easily generalizable [9].
Robertson and Wegman [12] proposed a likelihood ratio statistic for the inequality-constrained binomial problem, which compares parameters for independent samples from a single-parameter exponential family distribution. Before calculating the test statistic, they used the pool-adjacent-violators algorithm [13] to pool "out-of-order" categories for which π k > π k+1 until the resulting sample proportions are monotone increasing. The order-restricted ML estimatorsp k become the adjusted sample proportions.
The idea of applying an isotonic transformation to the unconstrained parameter estimates motivated Dunson and Neelon [9] to create a Bayesian alternative approach for this problem, which has been adapted here. They proposed to use Bayes factors for assessing ordered trends, which are calculated based on the output from Gibbs sampling. The samples from the order-constrained model are derived by transforming samples draws from an unconstrained posterior density using an isotonic regression transformation. Next, we explain our proposed Bayes factor method (BFM).
Suppose m kij is the count of methylated molecules at CpG site j of individual i in group k.
Within each moving window along the genome, a mixed-effect model is considered to allow the correlation of methylation rates between and within CpG sites. The logit link function for the methylation rate p kij is expressed by where ν 0ki and ν 1kij are the random effects. The random effect ν 0ki ∼ N(0, σ 2 ν 0 ) is used to model the interindividual correlation of methylation rates within each CpG site, while the random effect ν 1ki = (ν 1ki1 , ν 1ki2 , . . . , ν 1kim ) T ∼ N(µ 0 , Σ), with µ 0 = (0, 0 . . . 0) T is used to model the correlation of methylation rates between CpG sites.
Here µ k in (1) is the fixed effect for each group, representing the association between methylation rates and group responses. The strength and direction of the association is modeled by prior distribution N(µ µ , σ 2 µ ), which means the parameters of µ µ and σ 2 µ control the distribution of µ k , and implies that all of the methylation rates are drawn from a common distribution. This brings the advantage of allowing for heterogeneity of effects across CpG sites, instead of just pooling information across CpG sites in a region. Pooling assumes that each CpG site in the region has same methylation rates, while BFM considers the methylation rates of each CpG sites to be a random quantity governed by a prior distribution.
To calculate the Bayes factor, first we drew samples µ 1 , µ 2 , . . . , µ K from the posterior distribution by using Gibbs sampling. After that, an isotonic transformation is used to transform µ 1 , µ 2 . . . µ K into µ 1 , µ 2 , . . . , µ K , with µ 1 ≤ µ 2 ≤ . . . ≤ µ K [8] by using the min-max formula for the isotonic transformation, given by, where V=diagV 1 ,..., V K denotes the posterior covariance matrix and the diagonal submatrix V i , i = 1, . . . , k, is the covariance matrix of the i th ordered group. It is estimated from the samples of the posterior density of µ. U k and L k denote subsets of {1, . . . , K} such that the ordering µ j ≤ µ j for all j ∈ L k and the ordering µ j ≥ µ j for all j ∈ U k . Also samples µ 0 1 , µ 0 2 , . . . , µ 0 K are drawn from the prior density and transformed into µ 0 1 , µ 0 2 , . . . , µ 0 K , with µ 0 1 ≤ µ 0 2 ≤ . . . ≤ µ 0 K , by using the isotonic transformation in (2). The Bayes factor for each window (with moving windows along the genome) is given by, Please note that the isotonic transformation in (2) changes our hypotheses slightly, making the resulting Bayes Factor an approximation rather than exact [14]. The windows with highest value of the Bayes factor among all windows are used for evaluating DMRs.
Thus, the Bayes factor is the ratio of the marginal densities of the data under the two hypotheses, and it can be used to weigh evidence in favor of a hypothesis, by utilizing all the information contained in the full likelihood. Our proposed BFM can detect DMRs associated with disease severity, especially detecting DMRs with monotonically increasing or decreasing methylation rates, as the disease severity increase. It uses a mixed-effect model to not only adjust for correlation of methylation rates between CpG sites within each moving window but also correlations within CpG sites.
In addition, by adding covariates x ki in the model (1), we can account for the effects of covariates that are associated with methylation rates, such as age [15] and gender [16].
To aid in the interpretation of the Bayes factor, Jeffreys [17] proposed the following rule of thumb: "When 3 < BF ≤ 10 the evidence is positive, when 10 < BF ≤ 100 the evidence is strong, and when BF > 100, the evidence is decisive". As Kass and Raftery [18] pointed out, these categories are not precise calibration, but rather a descriptive statement about the standards of evidence in scientific investigations.

Simulation Study of the Properties of BFM
Extensive simulation was conducted to study the statistical validity and power of BFM to detect DMRs. For simplicity, for each individual, we simulated one CpG island (genomic region with CpG sites) consisting of m equally spaced CpG sites, with only one DMR of length r(<m) in the middle of the island. Further, we used equal sample size, N, for each of the K groups, and, we did not include any covariates.
Simulation Setup: The goal here is to simulate methylation rate at each CpG site for each individual. This is achieved in two steps. In step 1, methylation data in the form of NGS short reads sequences were simulated for each CpG site, with correlated methylation status between CpG sites. We also assumed that methylation status at CpG sites among different sequences were independent, as expected in NGS data. In step 2, the individual methylation rates were calculated by summarizing the methylation status at each CpG site from the short read sequences.
The simulation details are described below: First, we generated 100 NGS short reads using 100 pairs of random numbers {a, c} where a is the start point and c is the length of each short read sequence.
Because of the correlation of methylation rates between CpG sites, we treated methylation status Y kis,j at each CpG site j on short read sequence s as a branching process, taking advantage of the property of multivariate Bernoulli distribution [19]. We assumed that, for CpG site j, branching probabilities were the same for each short read sequence of all individuals in group k. Thus, we defined the branching probability p kj = P Y kis,j = 1 |Y kis,j−1 = 1 as the probability of methylated sequence read at CpG site j, conditional on the methylated sequence read at CpG site j − 1 on the same short read sequence of the same individual. Similarly, we defined the branching probability q kj = P Y kis,j = 1 |Y kis,j−1 = 0 as the same probability, conditional on unmethylated sequence read at CpG site j − 1.
The methylation status (Y kis,a , Y kis,a+1 , . . . , Y kis,a+c−1 ) were generated as follows: For the first CpG site of the sequence, the methylation status y kisa was generated from Bernoulli distribution Bern(m a ), with m a = p ka + q ka /2.
After generating all the sequences at every CpG site for each individual, we calculated the total numbers of methylated and unmethylated short read sequences at CpG site j for individual i in group k, s y kis,j = 1 and s y kis,j = 0 ,. Then the methylation count and the sequencing coverage are given by m kij = s y kis,j = 1 and c kij = s y kis,j = 1 + s y kis,j = 0 , respectively.
We generated one CpG region with 24 CpG sites for each individual, 6 of which (from site 10 to 15) constituting the DMR. We simulated four groups of severity levels, with sample size 50 in each group, and repeated it with sample size 100. The branching probabilities, p kj , were pre-determined. Also, we chose q kj = p kj − 0.2. We also simulated two different scenarios of DMR patterns.
Under Scenario 1, we chose the probabilities, p kj, to be symmetric around the middle of the DMR (CpG sites 12 and 13). The predetermined probabilities p kj and their symmetric pattern under Scenario 1 are presented in Table 1. Under Scenario 2, we randomly chose the CpG sites with the peak values of p kj within the simulated DMR (between sites 10 and 15), varying it for different individuals. Specifically, for each individual in each group, we first generated a random number r (between 10 and 15) for the location of the CpG site with the highest methylation, and then chose the branching probabilities p kj to increase from 10 to r and then decrease from r to 15. The p kj for the non-DMCs remained the same as in Scenario 1. The second scenario is a more realistic depiction of the real world. However, the results and conclusions should be the same under both situations.

Simulation Results
A total of 1000 replicates were simulated. For each replicate, the Bayes factor was calculated for each moving window with window size of 6. Calculations were based on 3000 Gibbs samplers, with 1000 Gibbs samplers for the burn-in period. The results of simulation for both the scenarios are presented in Table 2. As expected, the results are very similar for both scenarios. The results of Scenario 1 are plotted in Figures 1 and 2. As evident from Table 2, following Jeffreys' rule, when the moving windows contain at least three of the six CpG sites, we have strong evidence of differential methylation when sample size of 50 in each group and decisive evidence when sample size is 100. All results show that the Bayes factors reach their maximum in the simulated DMR (CpG sites 10-15). However, the Bayes factors are not symmetric, the windows on the right side of the peak have larger values compared to those on the left side. This is attributed to the fact that the methylation status at a given site was generated conditional on that at the previous site of the same sequence. As expected, when the sample size is doubled the Bayes factors and the evidence in support of methylation increases significantly, as seen in Table 2 and Figures 1 and 2.  All results show that the Bayes factors reach their maximum in the simulated DMR (CpG sites [10][11][12][13][14][15]. However, the Bayes factors are not symmetric, the windows on the right side of the peak have larger values compared to those on the left side. This is attributed to the fact that the methylation status at a given site was generated conditional on that at the previous site of the same sequence. As expected, when the sample size is doubled the Bayes factors and the evidence in support of methylation increases significantly, as seen in Table 2 and Figures 1 and 2.
In order to illustrate that our proposed method is statistically valid and to ensure that the BF in our method is a meaningful measure for comparison with frequentist approaches, we computed Bayes factors exclusively for all moving windows that do not include the differentially methylated sites 10-11. Among these Bayes factors, 95% were less than 1.34 and 99% were less than 1.50, both consistent with Jeffreys' rule. These values can be thought of as the cut-offs corresponding to 5% and 1% empirical type I error rates. We calculated the proportions of times the Bayes factors fall above these cut-offs, for all possible numbers of DMCs in the moving window. These results are given in Table 3. For the simulated data they are comparable to the conclusions based on frequentist interpretations of type I error and power. For the real data analysis, one could employ a permutation test to derive the cutoff values under the null hypothesis. However, since the frequentist interpretation is not necessarily consistent with the Bayesian conclusions, using Jeffrey's rule for decision making may be more desirable when analyzing real data. In order to illustrate that our proposed method is statistically valid and to ensure that the BF in our method is a meaningful measure for comparison with frequentist approaches, we computed Bayes factors exclusively for all moving windows that do not include the differentially methylated sites 10-11. Among these Bayes factors, 95% were less than 1.34 and 99% were less than 1.50, both consistent with Jeffreys' rule. These values can be thought of as the cut-offs corresponding to 5% and 1% empirical type I error rates. We calculated the proportions of times the Bayes factors fall above these cut-offs, for all possible numbers of DMCs in the moving window. These results are given in Table 3. For the simulated data they are comparable to the conclusions based on frequentist interpretations of type I error and power. For the real data analysis, one could employ a permutation test to derive the cutoff values under the null hypothesis. However, since the frequentist interpretation is not necessarily consistent with the Bayesian conclusions, using Jeffrey's rule for decision making may be more desirable when analyzing real data. Table 3. Proportions of Bayes factors that fell above the cut-off.

Data analysis
We used our proposed BFM to analyze methylation data from a genome-wide association study of chronic lymphocytic leukemia (CLL), which manifests as a result of clonal expansion of malignant B cells. B-cell lymphoma, mostly prevalent among adults, is a heterogeneous disease [20,21]. It is clinically important to find heterogeneity of patients at the molecular level, which can help design specific interventions for patients at different severity levels.
Over the last decade, research in CLL has resulted in significant advances such as identification of several molecular alternations with prognostic values. These include specific cytogenetic patterns [22], mutational status of the immunoglobulin heavy chain variable gene (IgVH) [23] and expression of CD38 [24]. It has been found that patients lacking the mutation have a poorer prognosis. Patients with lower levels of CD38 have slower disease progression [23,25].

Data analysis
We used our proposed BFM to analyze methylation data from a genome-wide association study of chronic lymphocytic leukemia (CLL), which manifests as a result of clonal expansion of malignant B cells. B-cell lymphoma, mostly prevalent among adults, is a heterogeneous disease [20,21]. It is clinically important to find heterogeneity of patients at the molecular level, which can help design specific interventions for patients at different severity levels.
Over the last decade, research in CLL has resulted in significant advances such as identification of several molecular alternations with prognostic values. These include specific cytogenetic patterns [22], mutational status of the immunoglobulin heavy chain variable gene (IgVH) [23] and expression of CD38 [24]. It has been found that patients lacking the mutation have a poorer prognosis. Patients with lower levels of CD38 have slower disease progression [23,25].
Several research groups have demonstrated that DNA methylation of multiple promoter-associated CpG islands is common in CLL [15,26,27]. Detection of aberrant DNA methylation in CLL could result in the development of an epigenetic classification of the disease with prognostic and therapeutic potential. CD19+ B cells from peripheral blood were collected from CLL samples and normal control subjects. All CLL samples were obtained from patients at the Ellis Fischel Cancer Center (EFCC), the Georgia Cancer Center of Augusta University and the North Shore-LIJ Health System in compliance with the local Institutional Review Boards [28].
Illumina sequencing reads were generated for each sample by using RRBS [29]. In total, 20-30 million reads were sequenced for each sample, and 63%-75% were successfully mapped to either strand of the human genome (hg18) [28]. The average sequencing depth per CpG was between 32x and 43x. Eventually RRBS provided counts of DNA molecules that were methylated or unmethylated at each CpG site, and overall methylation status of approximately 1.8-2.3 million CpG sites were determined consistently for each sample in the study [28].
Tong et al. [30] pointed out that aberrant DNA methylation associated with CLL were located more frequently on chromosome 19. Hence, we analyzed genome-wide methylation data on 17,917 CpG sites on Chromosome 19 of 40 patients.

Comparison of Bayesian Method with Scan Statistic Method for Two Groups
First, we tested for differential methylation under binary response, by dividing the samples into two groups based on CD38 level of 20 as the cut-off. We had 23 subjects with CD38 ≤ 20 and 17 subjects with CD38 > 20. BFM and Scan statistic method (SSM) [31] were compared, using moving windows with 10 CpG sites in each window.

Bayesian Method for Ordinal Group Responses
To test whether the methylation rates increase as the CD38 levels increase, the samples were classified into four risk groups based on CD38 level, with 5 non-leukemia subjects in group 1, 23 patients in group 2 with CD38 ≤ 20, 9 patients in group 3 with 20 < CD38 ≤ 50, and 8 patients in group 4 with CD38 > 50. Though there are advantages of modeling CD38 as a continuous variable, but on the other hand, modeling as an ordinal variable is more robust to distributional assumptions. Again, moving windows of size of 10 were used for analysis. In fact, in clinical studies it is a common practice to put patients into discrete disease risk groups based on continuous measures.
Because of multiple testing issues associated with the comparison of four groups, we used a more stringent criterion of BF > 19 to evaluate the strength of evidence of differential methylation [8]. A total of 789 windows showed strong evidence of differential methylation using this criterion. The start and end positions in base pairs for each detected DMR were used in the UCSC genome browser to find the genes in the regions, and eventually 125 genes were found in these regions. Among them, 35 were associated with leukemia on PubMed literature. Some of these were not detected when only two groups were considered even with a less stringent criterion. They are BRD4, ELL, ERCC1, ERCC2, GDF15, JUND, POLD1, PRDX2, RANBP3, SPIB and TSPAN16 [52][53][54][55][56][57][58][59][60][61][62].

Discussion
Results from our simulation study indicate that BFM is a valid approach to detect DMRs when considering ordinal group responses, since the calculated Bayes factors were very large for simulated DMRs, and close to 1 for non-DMRs. The real data analysis based on the CLL data also demonstrated that BFM is a valid method that is able to detect DMRs with methylation rates increasing (or decreasing) as disease severity increases.
In addition to being able to account for ordering of group responses, BFM also has an advantage of allowing for heterogeneity of methylation effects across CpG sites by modeling the methylation rates with a prior. Methods such as the SSM pools information across variants in a region, assuming that each CpG sites in the region have the same methylation rates.
BFM with mixed-effect regression, not only can allow for covariates, but also the correlation between CpG sites. It takes advantage of the flexibility of the Bayesian framework, including the use of prior information when available as well as computational convenience, and uses distributions such as the multivariate normal to incorporate the correlation structure with inverse Wishart distribution as the prior for the correlation matrix.
One disadvantage of the BFM is that it assumes that methylation rates of CpG sites within each moving window are independent of those outside of the window, while the SSM accounts for the correlation along the whole genome.
BFM used a moving window to help decide the location and length of DMRs. But practically, it is very difficult to know the exact length of DMRs. This limitation is very common in statistical genetics, not only for detecting DMRs, but also for detecting rare variants [63]. Cross validation or bootstrap approaches might help determine the window sizes. It could be possible to develop other methods, for example, using genes and promoters instead of moving windows, along with the BFM to detect DMRs.
As described by George and Laud [64], the Bayes factor used in the context of testing hypotheses is a meaningful measure of evidence because it is a reasonably approximate factor by which the odds are increased by the data. With default priors that are essentially flat over a wide range of the relevant parameter space, the approach is similar to the likelihood-based inference. However, direct comparison between methods such as BFM based on the Bayes factor with frequentist approaches should be done with caution, as the Bayes factor classification for decision process is not a precise calibration, but rather a descriptive statement about the standards of evidence. Our proposed method is rather exploratory in nature, leading to a ranked list of sites for follow up for formal confirmation.
We developed the BFM, focusing only on DNA methylation data. However, large-scale cancer genomics projects such as TCGA (The Cancer Genome Atlas Research Network) are currently generating multiple layers of genomics data for early tumor, including DNA copy number, methylation, and mRNA expression. Similar statistical methods for integrated analysis and systematic modeling of these genomics data deserve further attention.