Computational Methods for Detection of Differentially Methylated Regions Using Kernel Distance and Scan Statistics

Motivation: Researchers in genomics are increasingly interested in epigenetic factors such as DNA methylation because they play an important role in regulating gene expression without changes in the sequence of DNA. Abnormal DNA methylation is associated with many human diseases. Results: We propose two different approaches to test for differentially methylated regions (DMRs) associated with complex traits, while accounting for correlations among CpG sites in the DMRs. The first approach is a nonparametric method using a kernel distance statistic and the second one is a likelihood-based method using a binomial spatial scan statistic. The kernel distance method uses the kernel function, while the binomial scan statistic approach uses a mixed-effects model to incorporate correlations among CpG sites. Extensive simulations show that both approaches have excellent control of type I error, and both have reasonable statistical power. The binomial scan statistic approach appears to have higher power, while the kernel distance method is computationally faster. The proposed methods are demonstrated using data from a chronic lymphocytic leukemia (CLL) study.


Introduction
Genetic variations from genome-wide association studies can explain only a small proportion of the phenotypic variation for most diseases [1]. It has been established that most diseases are caused by both genetic factors and non-genetic factors such as environmental factors, contributing to epigenetic changes, especially changes in DNA methylation at CpG sites. For example, research has found that aberrant DNA methylation of multiple promoter-associated CpG islands can suppress gene expression by inactivating the function of tumor suppressor genes, eventually causing cancer [2].
Methylation data from next-generation sequencing (NGS) such as Methyl-seq have been used to detect aberrant DNA methylation [3]. NGS coupled with bisulphite treatment of DNA converts unmethylated cytosines to uracils and leaves methylated cytosines intact. This results in counts of uracils (unmethylated) and cytosines (methylated) at each CpG site for every sample. The total counts of uracils and cytosines are the sequencing coverage at each CpG site, which could be different for each sample. Samples with large sequencing coverage could have undue influence in statistical analysis. In τ with similar scaled distance. SSM was first introduced by [16] to detect clusters in a point process in the one-dimensional setting. With moving windows, the maximum number of points in the windows is recorded and compared to its distribution under the null hypothesis of a purely random Poisson process. A reasonable method that takes into account accurate underlying distributions of methylation counts needs to be developed. Kulldorff et al. [17] proposed a likelihood-based SSM, which was extended to detect genetic variants by [18] by considering the Bernoulli distribution of variants at each position for every individual. The scan statistic is calculated from the likelihood ratio of the frequencies of variants carried among cases and controls within a window versus outside the window, with moving windows along the whole genome. The maximum of the scan statistics over the windows of all possible sizes is defined as the global statistic. However, the approach considered by [18] may not be appropriate for methylation data, since methylated counts at each CpG site for every individual, conditional on the sequencing coverage, follow a binomial distribution instead.
Here, we propose a binomial SSM, which assumes a binomial distribution for the methylation data. Similarly, we also propose a KDM based on [15]. In both approaches, we use logistic regression on methylation rates to adjust for covariates, including sample-specific covariates such as batch effect, in addition to other confounding variables and predictors.
The details for these statistical methods are presented in Materials and Methods, and the results of our simulation studies are presented in Simulation Results. The methods are applied to a bisulfitesequenced data from a chronic lymphocytic leukemia (CLL) study [19], with the results presented in Analysis of CLL Data, followed by conclusions and discussions presented in Discussion.

Kernel Distance Method
We modified the KDM, proposed by [15], to model methylation rates, using a tri-weight kernel function to measure the correlation of the methylation rates at different CpG sites as a function of the distance between the sites. This is necessary, since the correlation of methylation rates between CpG sites decreases as the distance between the sites increases.
To facilitate the discussion of the kernel distance method, let m kij be the count of the methylated molecules at CpG site j of individual i in group k, where k = A for cases and U for controls. We assume that m kij ∼ Binom c kij , p kij , where Binom() stands for binomial distribution, c kij is a positive integer denoting the coverage, and p kij is the methylation rate at CpG site j for individual i in group k, k = A, U; i = 1, 2, . . . , n k ; j = 1, 2, . . . , s.
To adjust for confounding factors and linear predictors such as age and gender, we first use logistic regression to fit all data from both groups, using the model, where β 0 and β 1 are regression coefficients and x ki represents the vector of covariates of individual i in group k. The fitted odds are calculated for methylation at CpG site j for individual i in group k, to get the corresponding expected methylation rates, Genes 2019, 10, 298 4 of 16 The difference between the observed and expected methylated counts at CpG site j for individual i in group k is calculated as "the adjusted methylation count", Define r Aj = n A i=1 r Aij and r U j = n U i=1 r Uij , then the group effects for cases and controls are quantified asβ Aj = r Aj C Aj The difference between the two groups, δ j =β Aj −β U j , is calculated at each CpG site, and used in the quadratic statistic Q = δ Aδ, where A is a pre-defined matrix of the correlation of methylation rates among CpG sites.
Generally, the correlation of methylation rates decreases as the distance between the two CpG sites increases. Therefore, the kernel matrix A should be based on a function that determines how rapid the correlation decreases to 0 as the distance increases. We use the tri-weight function 2 3 , if d jl ≤ 1 and 0 otherwise [15], where d jl (τ) = d jl /τ is a scaled distance based on the unknown scaling factor τ, and d jl measures the distance between CpG sites j and l.
Since the lengths and number of DMRs are unknown and difficult to predict, and the lengths of DMRs vary across the genome, it is difficult to determine the scaling factor that represents the cluster size. When an appropriate size of clusters cannot be predicted and many clusters are expected, it is common to repeat the procedure using different values of τ. Tango [20] suggests allow τ to vary continuously from a small value near zero upwards until τ reaches about half the size of the region of interest. In this manuscript, as proposed by Schaid et al. [15], we consider 10 values of τ, evaluate kernel distance statistic at each value, and select the one that maximizes the statistic; that is, When a single test statistic is computed based on one scaling factor, the distribution of the kernel distance statistic can be approximated by a scaled chi-square distribution [20]. However, because of multiple scaling factors in our case, scaled chi-square may not be a very good approximation for the distribution of the statistic, and hence we use the permutation method, instead.
When the null hypothesis is rejected, the scaling factor, τ * , that corresponding to the maximum Q value is accepted as the length of DMR, and the corresponding kernel distance statistic is calculated as, where m is the number of CpG sites in a genomic region. The percent contribution to Q(τ * ) at each CpG site is calculated as The distribution of methylation rates can now be plotted based on the percent contribution U j (τ * )/Q(τ * ) versus CpG site j, which gives a graphical view of potential DMRs.

Binomial Scan Statistic Method
Scan statistic method can be used as an alternative to KDM for detecting DMRs associated with the disease status. SSM is a likelihood-based approach that uses the likelihood ratio to test whether the methylation rates are different between groups. We use moving windows along the genome, with multiple window sizes, allowing more accurate evaluation of the location and sizes of DMRs.
Since the methylation rate at each CpG site is correlated with those at the adjacent CpG sites, these correlations are first adjusted by using mixed-effect logistic regression model (see Appendix A for details). Then the "adjusted methylation count" r kij for group k is calculated, using Equations (2) and (3). The mixed-effect logistic regression model also allows us to account for relevant covariates.
We also incorporate an approach proposed by [21] in our proposed SSM to adjust for the clustering structure within each CpG site. By treating the cluster size as random, we can account for the unequal sequencing coverage for individuals at each CpG site. Using the method proposed by [22], the design effect due to clustering is calculated for each CpG site, and used to calculate the adjusted methylation counts r k j and sequencing coverage C k j (See Appendix A for details).
We assume that r Aj ∼ Binom ( C Aj , p A ) and r U j ∼ Binom ( C U j , p U ), where p A and p U are the methylation rates in cases and controls, respectively.
Let η k = log p k 1−p k be the logit transformation of methylation rates of group k within the specific region. In order to test the hypotheses H 0 : η A = η U versus H 1 : η A η U , we propose a test statistic that uses the log of the ratio of the likelihood under H 1 versus H 0 , which is referred to as the scan statistic. It is given by (see Appendix A for details) Genes 2019, 10, x FOR PEER REVIEW 5 of 15 One of the advantages of SSM is that the method can easily be extended to more than two groups, if the groups are classified based on nominal responses. Under the multinomial set up, SSM can be used to test the overall hypotheses of no difference in methylation rates among the groups (See Appendix A).
The scan statistic is calculated for each window using moving windows with variable window (VW) size approach across the whole genome. DMR is defined as the window with the highest value of the scan statistic. Thus, for each window W of size w, the binomial scan statistic is be calculated, and the one with highest value denoted by LRw. Then the maximum of LRw over all values of w is used as the global test statistic.
The LR calculation is unstable if the frequency of methylated counts within a given window is 0 for either cases or controls. To overcome this issue, a pseudo-count of 1 is added to the adjusted methylated and unmethylated counts at each CpG site, these additions implicitly assume that the null hypothesis of no differential methylation is true at all sites. Since the distribution of scan statistic is unknown, an approximate p-value for the window with the largest LRw is calculated using permutation method.
For case-control studies, SSM is expected to have higher power than the KDM, since SSM using moving window with variable window sizes overcomes the difficult problem of determining the value of scaling factor in the KDM. The use of moving windows can also result in more accurate regions of DMRs.

Simulation
We conducted extensive simulation studies to evaluate the performances of both SSM and KDM. They were compared with respect to the empirical type I error, empirical power and computational efficiency.
Since we used logistic regression for both methods to adjust for covariates, for simplicity, we did not include any covariates in the simulation. Although there are many DMRs along the genome, for the power comparisons for various alternate hypotheses at various significant levels, we assumed that there was only one DMR, so that we only simulated a small genome region around the DMR. We simulated two different scenarios with respect to number of CpG sites in the region, 24 and 30, and all CpG sites within the region were equally spaced.

Simulation Parameters
We considered cases and controls and assumed every individual had equally spaced CpG sites in the simulated region, of which consecutive CpG sites in the middle were in the DMR.
One of the advantages of SSM is that the method can easily be extended to more than two groups, if the groups are classified based on nominal responses. Under the multinomial set up, SSM can be used to test the overall hypotheses of no difference in methylation rates among the groups (See Appendix A).
The scan statistic is calculated for each window using moving windows with variable window (VW) size approach across the whole genome. DMR is defined as the window with the highest value of the scan statistic. Thus, for each window W of size w, the binomial scan statistic is be calculated, and the one with highest value denoted by LR w . Then the maximum of LR w over all values of w is used as the global test statistic.
The LR calculation is unstable if the frequency of methylated counts within a given window is 0 for either cases or controls. To overcome this issue, a pseudo-count of 1 is added to the adjusted methylated and unmethylated counts at each CpG site, these additions implicitly assume that the null hypothesis of no differential methylation is true at all sites. Since the distribution of scan statistic is unknown, an approximate p-value for the window with the largest LR w is calculated using permutation method.
For case-control studies, SSM is expected to have higher power than the KDM, since SSM using moving window with variable window sizes overcomes the difficult problem of determining the value of scaling factor τ in the KDM. The use of moving windows can also result in more accurate regions of DMRs.

Simulation
We conducted extensive simulation studies to evaluate the performances of both SSM and KDM. They were compared with respect to the empirical type I error, empirical power and computational efficiency.
Since we used logistic regression for both methods to adjust for covariates, for simplicity, we did not include any covariates in the simulation. Although there are many DMRs along the genome, for the power comparisons for various alternate hypotheses at various significant levels, we assumed that there was only one DMR, so that we only simulated a small genome region around the DMR. We simulated two different scenarios with respect to number of CpG sites in the region, 24 and 30, and all CpG sites within the region were equally spaced.

Simulation Parameters
We considered N 1 cases and N 2 controls and assumed every individual had equally spaced m CpG sites in the simulated region, of which r consecutive CpG sites in the middle were in the DMR.
Methylation counts at each CpG site for every individual were assumed to be distributed as B c kij , p kij , k = A, U, i = 1, 2, . . . , N, j = 1, 2, . . . , m. The sequencing coverage c kij were allowed to vary by sampling the value of it from N(30, 13) and then rounding it to the nearest integer, with a minimum of 5 based on the real data analysis by [20]. The correlated methylation rates p kij were simulated using a two-step procedure proposed by [23] in order to model the spatial dependence of the methylation rates among nearby CpG sites.
First, independent random samples X kij were generated from Beta-distribution for CpG site j of individual i in group k. Under the null hypothesis, X kij were generated as X kij ∼ Beta(α U , β U ), k = A, U. Under the alternative hypothesis, for CpG sites outside the DMR, X kij were generated under the same distribution. Within the DMR under the alternate hypothesis, X kij were generated as X Aij ∼ Beta(α A , β A ), where α A α U or β A β U for all CpG sites within the DMR, so that the methylation rates were different between cases and controls within DMR. Based on the property of the Beta distribution, with fixed α U , β A and β U , only the values of α A were changed, with effect size defined as For each individual in each group, the vector of independent random variables X ki was transformed into a vector of correlated random variables with correlated methylation rates denoted the cumulative distribution function of the standard normal distribution function with Cholesky decomposition C of the correlation matrix Σ = CC . All diagonal elements of the correlation matrix Σ were 1, and the (i, j)th off-diagonal element was defined as the correlation coefficient ρ divided by the distance between CpG sites i and j, in order to account for the fact that the correlation of methylation rates for two CpG sites decreases as the distance increases.

Simulation Results
Simulations were conducted at significance levels of 0.05 and 0.01, total sample sizes of 48 and 60 with equal sample sizes in each group, and regions of 24 and 30 CpG sites with 6 sites in the middle constituting the DMR. We assumed correlation coefficients of ρ = 0.7 and ρ = 0.5 for methylation rates between adjacent CpG sites, and those among non-adjacent sites were scaled down by dividing ρ by the distances between sites. We set α U = 0.1, β A = β U = 0.9, and used different values of α A to get different effect sizes. Since we simulated DMRs with length of 6 CpG sites, we used τ = 6 in KDM, and moving window of size 6 in SSM.
First of all, we generated 10,000 simulated samples using α A = 0.1 and computed the p-values and the empirical type I errors at significant levels 0.05 and 0.01, in order to evaluate the statistical validity of the two approaches. The results are presented in Table 1, and the histogram plots of p-values for SSM and KDM in Figure 1a,b, respectively. For a statistical test to be valid, the p-values must be uniformly distributed between 0 and 1 under the null hypothesis. As evident from Figure 1, the p-value distributions are very close to uniform in both the cases, thus asserting the statistical validity of both our proposed methods. Also, the empirical type I errors are very close to the significant levels, confirming that both methods have excellent control of type I errors.   Figures 2 and 3, corresponding to the 24-site and 30-site regions, respectively. The plots show that values of the power for both SSM and KDM increase as the effect sizes increase, and as well as the sample sizes increase. It is also evident from the plots that SSM has uniformly higher power than KDM.
The conclusions on power at 1% significant level are very similar to and consistent with that at 5% significance level, showing consistently higher power for SSM compared to KDM. The plots show that values of the power for both SSM and KDM increase as the effect sizes increase, and as well as the sample sizes increase. It is also evident from the plots that SSM has uniformly higher power than KDM.
The conclusions on power at 1% significant level are very similar to and consistent with that at 5% significance level, showing consistently higher power for SSM compared to KDM.

Analysis of Chronic Lymphocytic Leukemia Data
We applied our proposed methods to the methylation data from a genome-wide study of chronic lymphocytic leukemia (CLL), which manifests as a result of clonal expansion of malignant B cells. Research in CLL has identified several molecular alternations that are associated with prognostic values. These include specific cytogenetic patterns [24], mutational status of the immunoglobulin heavy chain variable gene (IgVH) [25] and expression of CD38 [26]. It has been observed that patients with lower levels of CD38 have slower disease progression [25,27]. CD19+ B cells from peripheral blood were collected from 40 subjects [19]. Based on CD38 levels, the samples were categorized as low-vs. high-risk, with 23 samples having CD38 levels ≤ 20 (low risk) and 17 samples having CD38 levels > 20 (high risk).
Illumina reduced representation bisulfate sequencing [28] was used to generate sequencing reads for each sample, with average sequencing depth per CpG between 32x and 43x, which provided counts of DNA molecules that were methylated and unmethylated at each CpG site [19]. Tango [20] pointed out that aberrant DNA methylation associated with CLL were located more frequently on chromosome 19. So, we analyzed genome-wide methylation data on 17, 917 CpG sites on Chromosome 19 using both SSM and KDM to identify DMRs between high-risk and low-risk CLL subjects.
Because of the massive computing time needed for simulations under the alternate hypotheses, only 1000 simulations were conducted to evaluate the power of SSM and KDM under various alternate scenarios. The plots of power versus different values of effect sizes and correlations at 5% significance level are presented in Figures 2 and 3, corresponding to the 24-site and 30-site regions, respectively. The plots show that values of the power for both SSM and KDM increase as the effect sizes increase, and as well as the sample sizes increase. It is also evident from the plots that SSM has uniformly higher power than KDM.
The conclusions on power at 1% significant level are very similar to and consistent with that at 5% significance level, showing consistently higher power for SSM compared to KDM.

Analysis of Chronic Lymphocytic Leukemia Data
We applied our proposed methods to the methylation data from a genome-wide study of chronic lymphocytic leukemia (CLL), which manifests as a result of clonal expansion of malignant B cells. Research in CLL has identified several molecular alternations that are associated with prognostic values. These include specific cytogenetic patterns [24], mutational status of the immunoglobulin  The percentage contribution of each CpG site to the kernel distance statistic is plotted at the top of Figure 4. The middle and bottom parts of Figure 4 give the plots of the absolute differences of methylation rates at each CpG site versus the percentage contribution of each CpG site to the kernel distance statistic, based on the CLL data and the simulation data. The absolute value of differences in methylation rates between cases and controls were calculated based on the ratio of adjusted methylation counts and sequencing coverage based on [20] at each CpG site for cases and controls. The wedge shapes in both middle and bottom of Figure 4 show that, a large number of CpG sites with small differences in methylation rates have very small contributions to the kernel distance statistic and are possibly not differentially methylated, while the CpG sites with large contributions to the kernel distance statistic show evidence of differential methylation. This indicates the ability of KDM in detecting DMRs, especially using the tri-weight kernel function to incorporate the correlation structure of methylation rates between CpG sites. The SSM approach detected a total of 66 DMRs with varying window sizes, that containing different number of CpG sites, with a total of 1355 CpG sites (about 7.5% of all CpG sites in Chromosome 19). The top 20 DMRs with highest scan statistic are presented in Table 2, which matches well with the peaks in Figure 4, indicating consistency between SSM and KDM.  The wedge shapes in both middle and bottom of Figure 4 show that, a large number of CpG sites with small differences in methylation rates have very small contributions to the kernel distance statistic and are possibly not differentially methylated, while the CpG sites with large contributions to the kernel distance statistic show evidence of differential methylation. This indicates the ability of KDM in detecting DMRs, especially using the tri-weight kernel function to incorporate the correlation structure of methylation rates between CpG sites.
The SSM approach detected a total of 66 DMRs with varying window sizes, that containing different number of CpG sites, with a total of 1355 CpG sites (about 7.5% of all CpG sites in Chromosome 19). The top 20 DMRs with highest scan statistic are presented in Table 2, which matches well with the peaks in Figure 4, indicating consistency between SSM and KDM. The start and end positions of base pairs for each detected DMR were used in the UCSC genome browser (http://genome.ucsc.edu/) to find the genes in the regions. Some of the genes detected in our study include the apolipoprotein gene cluster (APOC1, APOC2, APOE), which are shown to have tight linkage with a chronic lymphocytic leukemia-associated translocation breakpoint [29]. We also detected the genes CATSPERD, PRR22, RFX2, and MILT1, which have been shown to be associated with leukemia [30]. For example, translocation and fusion of MILT1 with myeloid lymphoid leukemia could result in potent oncogenic activity [31,32].
Several studies have suggested that the transcription factor CREB (cyclic AMP response element binding protein) may have a role in the pathogenesis of human acute myeloid leukemia (AML) and other cancers [33,34]. In our data, replication factor C3 is detected whose expression has been reported to have a direct correlation with CREB in AML cell lines, as well as in the AML cells from the patients [35]. It is suggested that C3 may have a role in neoplastic myelopoiesis by promoting the G1/S progression. Another detected gene, LAIR1, also has been found to have a correlation with CREB [36]. A pathway starts with LAIR1, activates downstream CREB in AML cells, and sustains the survival and self-renewal of AML stem cells. As a result, inhibition of expression of the immunoreceptor tyrosine-based inhibition motif (ITIM)-containing receptor LAIR1 does not affect normal hematopoiesis but abolishes leukemia development [36].

Discussion
Results from our simulation studies and the analysis of CLL data indicate that both methods, SSM and KDM, are valid approaches to detect DMRs. Both methods detect DMRs, while allowing for covariates as well as correlation between CpG sites.
The tri-weight function used in KDM allows for a correlation structure in which the correlation decreases as the distance between CpG sites increases, while SSM use a mixed effects model to incorporate the correlation structure. Although compound symmetry assumption used in SSM may not truly represent actual correlation structure, the sandwich estimates of the fixed effects are appropriate even when the correlation structure is mis-specified, with some trade off of the flexibility for robustness of inference. Our simulation results also show that the mixed effects model is able to adjust for correlation when the simulated correlations decrease as the distances between CpG sites increase. Since the correlation structure can be complicated for methylation data, it may not be easy to find a statistical model that incorporates the correlation structure in its fully complex form.
Both SSM and KDM have reasonable power and good controls of type I error in detecting DMRs between two groups, though SSM performs better in this respect compared to KDM. One reason might be that SSM is a likelihood-based method while KDM is a non-parametric method. Another reason for increased power for SSM could be the use of moving windows with multiple window sizes, which eliminates the difficulty of determining the value of τ in KDM. However, the use of moving windows with a mixed effects model for adjusting the correlation of methylation rates results in substantially longer computation time for SSM.
In addition, SSM accounts for within cluster correlation by incorporating the method proposed by Xu et al. [20]. SSM also has the advantage that it can be used for comparing methylation rates in more than two groups, while KDM can only be used for comparing two groups. But SSM still has a limitation that it cannot consider the ordering of the group responses because the maximum likelihood estimates are very difficult to obtain when constrained space based on ordering is required.
The uncertainty of τ not only leads to disadvantages in terms of power for KDM, but also it causes KDM to detect only DMRs of approximate lengths, since the kernel distance statistic is calculated using only one value of τ. In reality, the lengths of DMRs range from hundreds of base pair as in small CpG islands, to millions of base pairs in cancer aberrations. It is very difficult to know the exact length of DMRs, a limitation very common in statistical genomics, not only for detecting DMRs but also for detecting rare variants [15]. Use of cross-validation or bootstrapping might help improve the estimation of the window sizes.
Another reason for the lower power for KDM compared to SSM may be that KDM is not able to adjust for unequal sequencing coverage for all individuals at each CpG site, while SSM incorporates the method proposed by Xu et al. [20] to adjust sequencing coverage and methylation counts. One possible solution is to use a mixed-effect logistic model with random intercept to adjust for the within cluster correlation, treating methylation data at each CpG site as a cluster.
We have only focused on DNA methylation data in developing both our methods. However, large-scale cancer genomics projects such as TCGA (The Cancer Genome Atlas Research Network) are currently generating multiple layers of genomics data for early tumor, including DNA copy number, methylation, and mRNA expression. Statistical methods for integrated analyses and systematic modeling of such genomics data deserve more attention.
Author Contributions: F.D. and H.X. performed method development and data application; D.R., S.G., and V.G. participated in the method development; H.S. provide the CLL data and participated in data application; all authors participated in manuscript preparation.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Appendix A.1. Adjusting for Correlation Between CpG sites with Mixed-Effects Model
A random slope and random intercept logistic regression model is considered for modeling methylation counts at each CpG site for every individual, which is given by where x ki is the covariate, and s j represents the distance of CpG site j from the starting point.

Appendix A.2. Adjusting for Clustering Structure Within Each CpG Site
Within each CpG site, the design effect due to clustering is calculated using the method proposed by Rao and Scott (1986). To calculate the design effect, we first calculate the adjusted methylation counts r Aj and r U j at CpG site j in groups A (cases) and U (controls), respectively, ignoring the clustering within individuals. That is, r Aj = n A i=1 r Aij and r U j = n U i=1 r Uij . Then the group effects are estimated as,β Aj = r Aj C Aj The variances of the group effects are given bŷ Without clustering, the variances of the group effects under the binomial distribution arê Then the design effects for the two groups are defined as, The design effects are then used to calculate the adjusted methylation counts and sequencing coverage at each CpG site in cases and controls as

. Binomial Scan Statistic for Case-Control Studies
Considering r k j ∼ Binom C k j , p k , where p k is the methylation rate in group k, then the likelihood of r k j , k = A, U, is given by, Since ( r k1 , r k2 , . . . , r ks ) for the s consecutive CpG sites are assumed to be independent, the joint likelihood of adjusted methylation counts over s consecutive CpG sites in the defined region for group k is the product of the likelihoods of the s CpG sites, which can be expressed as, f ( r k1 , r k2 , . . . , r ks ) = s j=1 C k j r k j exp r k j log From this likelihood, we can see the distribution of adjusted methylated counts follow a one-parameter exponential family y = ( r k1 , r k2 , . . . , r ks ) ∼ EXP(η, φ, T, B e , a) with T( r k1 , r k2 , . . . , r ks ) = s j=1 r k j s j=1 C k j and the log-likelihood l(η; y) = (ηT(y) − B e (η))/φ after ignoring an additive constant that does not depend on η. Based on this likelihood function, we can find the maximum likelihood estimator (MLE) of parameter η in EXP(η, φ i , T, B e , a) asη = g e (T(y)), where g e = (B e ) −1 = log(T) − log(1 − T) [37]. Then the scan statistic as the ratio of the likelihood under H 1 versus H 0 , given by where κ(x, y) = (xg e (x) − B e (g e (x)))/y, 1 Here we have, Φ A =

SSM for Multinomial Responses
Before testing the differences among groups, the methylation counts and sequencing coverage need to be adjusted. First, we use the mixed-effect logistic regression model (A1) to adjust for

. SSM for Multinomial Responses
Before testing the differences among groups, the methylation counts and sequencing coverage need to be adjusted. First, we use the mixed-effect logistic regression model (A1) to adjust for covariates and the correlation of methylation rates between CpG sites. Then design effects in (A2) are calculated based on Xu et al. [20], and used to adjust the residual r k j and the sequencing coverage C k j for group k at CpG site j, for all k and j, as in (A3) and (A4).
Assume that all CpG sites in a DMR for group k have same methylation rate p k with adjusted methylation count r k j ∼ B C k j , p k . Let η k = log p k 1−p k be the logit transformation of methylation rates of group k. The hypothesis of interest is, H 0 : η 1 = η 2 = . . . = η K = η vs. H 1 : η 1 , η 2 , . . . , η K not all are equal.
Here the groups are assumed to be independent, and the log of the ratio of the likelihood under H 1 versus H 0 is used as the test statistic as before, given by, where κ(x, y) = (xg e (x) − B e (g e (x)))/y and T k = . Then the scan statistic for more than two groups is given by, where b k = s j=1 C k j