## 1. Introduction

Cytosine DNA methylation (5-methylcytosine; 5mC, CDM) is one of the most well-studied epigenomic marks and mechanistically understood epigenetic modifications to date [1,2]. It plays important roles in various biological processes, including X-chromosome inactivation, genomic imprinting, transposon suppression and transcriptional regulation [3]. Technological improvements and dramatic cost reductions for whole-genome sodium bisulfite sequencing (WGBS) of DNA have opened the door to the quantitative measurement of DNA methylation at a single base resolution, with datasets now available from numerous species.

In humans, significant methylation differences are observed in the white blood cells of matched monozygotic and dizygotic twins [4], and significant intra- and inter-population differential methylation has been identified in a study of three human populations (Caucasian-American, African-American and Chinese-American) [5]. Likewise, DNA methylation levels vary considerably in samples experiencing changes during development [6,7] or in response to environmental change [5,8].

Translation of the advances in DNA methylation to clinical and personalized medical contexts has been proposed [9,10,11]. Evidence of epigenetic alterations induced by disease emphasizes the significant potential value of high-resolution methylation pattern analysis. However, proper translation of this knowledge for diagnostics depends upon the development of genome-wide techniques for the rapid and robust identification of specific epigenetic alterations associated with the disease [9,12].

Currently, there are numerous available bioinformatics tools to estimate the methylation status of nearly every cytosine position within a whole-genome bisulfite sequencing dataset. These tools are generally supported by an application of robust statistical approaches; DSS [13], BiSeq [14], and methylKit [15], among others, apply generalized linear regression (beta-regression) in the estimation of differentially methylated positions (DMPs). In addition, methylKit provides the option to apply Fisher’s exact test. Alternatively, methylpy [16] bases its estimation on the implementation of the root-mean-square test (RMST) [17]. Most of these approaches do not incorporate the influence of natural stochasticity (randomness) into their models, limiting their resolution predominantly to genomic regions with the highest probability of having undergone a methylation change [1,18,19,20]. As a consequence, these approaches do not have the ability to distinguish the DMP signal associated with a specific treatment (or disease) from DMPs deriving from natural variation within control (or healthy) individuals, and are thus not suitable for clinical diagnostics. We suggest that methylation variation in diagnostics is essentially a signal detection problem, equivalent to a binary classification problem for discriminating healthy versus diseased individuals. Signal detection theory provides the methodological framework to address this type of detection problem, evidenced in its common application to clinical diagnostic tests, machine-learning (ML) approaches, and human communication technologies [21,22,23,24].

Extreme methylation changes can occur spontaneously in a control group of samples through the stochastic fluctuations inherent to biochemical processes [25,26,27] and DNA maintenance [18], so this background variation must be discriminated from a biological treatment signal. Regardless of environmental constancy, statistically significant methylation changes are found in control populations with probability greater than zero [25,28,29]. These system fluctuations, inherent in biological stochastic processes [30,31,32], comprise natural background variation detected in human methylomes [5,33]. By simulation, it is feasible to demonstrate that DMPs spontaneously arise in control populations, and that regulatory methylation signals also occur naturally in the control group [33].

Stochastic fluctuation of the methylation process is expected, since the methylation regulatory machinery participates not only in organismal adaptation to micro- and macro-environmental fluctuations [26,32], but also in transitions across different stages of organismal ontogenetic development. This variation must be factored into the construction of a methylation pipeline to be used in clinical diagnostics.

The need for the application of signal detection-based approaches in diagnostics was pointed out decades ago [34], and is standard practice in current implementations of clinical diagnostic tests [21,35,36]. Detection approaches are, by default, included in machine/statistical learning implementations for classification tasks [37], since the evaluation of classifier performance is basically a detection problem. Thus, the determination of the optimal cutoff (threshold) value at which signal can be discriminated from noise at an acceptable signal-to-noise ratio (i.e., maximum accuracy and sensitivity, lowest false discovery rate (FDR)) is equivalent to a classification problem, a direct corollary to diagnostic detection.

The natural spontaneous variation of DNA methylation in human populations [33] complicates the discrimination of the methylation signal from background variation. We address this issue by determining whether a given DMP detected in the treatment/patient population occurs within the control population as well, and the probability of observing it. Therefore, the focus is not on the identification of DMPs, but on whether these statistically significant changes occur with high probability (under the fixed experimental conditions) in only the treatment group. Here we address this problem in the context of signal detection theory and ML frameworks [21,22].

To illustrate the feasibility of our proposed approach in clinical diagnostics, results from simulation studies and tests of two available methylome datasets from leukemia [38] and autism [39] patients are presented and discussed. This approach is implemented in the R package Methyl-IT, available at GitHub: https://github.com/genomaths/MethylIT. The R scripts to reproduce these analyses are available at the PSU GitLab: https://git.psu.edu/genomath/MethylIT_examples.

## 3. Discussion

In this work we emphasize the need for the application of detection theory and machine learning to the discrimination of the DNA methylation signal from background population variation for clinical diagnostic purposes. As a consequence of natural background variation, DMPs are detected not only in the patient population, but also in any set of control individuals [33]. As a result, the diagnosis problem is essentially a classification problem.

The methylation signal is often altered in patients suffering from disease, and Methyl-IT can be effective for the diagnosis of patients based upon signal detection. As highlighted in earlier reports [21,22], signal detection theory provides the methodological framework to effectively confront a detection problem.

Hence, regardless of the statistical test applied to identify this methylation signal, the application of detection theory and machine learning is valid to discriminate endogenous background signal (DMPs) from that induced by the treatment or disease-state in patients. A proper diagnostic requires evaluation with suitable classification performance measurements: high accuracy, sensitivity and specificity values, low FDR and other performance indicators commonly reported (Table 1 and Table 2, Figure 3 and Figure 4).

Proper application of signal detection requires knowledge of the probability distribution of the background noise in a system [21,22]. The probability distribution of the signal can be inferred from the experimental datasets, control and treatment [40]. This information provides a strong predictive value, and one can infer the probability of signal values in the control and treatment (patient population) that are not observed in the available datasets. Consequently, this signal probability distribution allows an estimation of the optimal cutoff value to discriminate signal induced by treatment or disease state from background.

Current methylation analysis methods that employ FT, RMST and DSS are limited to direct multiple comparisons of control versus treatment to search for significant statistical differences at each cytosine site in methylome datasets. This approach does not allow for predictive modeling, since the statistical tests are only designed to evaluate differences, not to serve as model classifiers. Moreover, these statistical tests do not directly evaluate background variation.

Proper measurement of the methylation signal requires a reference sample from which an information divergence of methylation levels can be measured for control and treatment samples. In this way, signals derived from background variation and that induced by the treatment are measured with the same origin of coordinates.

Simulation studies showed that, depending on the statistical approach (FT or DSS) and TV average, ignoring natural background variation can lead to a misestimation of the methylation signal (Figure 2). In all scenarios, DMPs detected by FT and DSS approaches were valid in statistical terms. However, the signal-to-noise issue comprises a post-DMP detection problem.

As shown in Figures S1 and S2, the classification performance obtained for the FT and RMST approaches improves notably after the ML classifier is fed with information derived from the methylation signal probability distribution and the detection step (the optimal cutoff HD value). Therefore, invoking the parsimony principle, we assume that signal detection and machine learning classifiers are sufficient [58].

The combination of signal detection and machine learning appears to be adequately robust to perform diagnostics on experimental/clinical datasets displaying either a low or high average of absolute methylation level differences (TV, Figure 3). To test empirical examples of these natural scenarios, two patient datasets were considered, pediatric acute lymphoblastic leukemia (PALL) and placental tissue from autistic children. Both datasets displayed a relatively high natural background of TV average in the control population, and a weaker methylation signal in placental tissue from autistic children than in the PALL patient dataset. Results were consistent with those obtained in the simulation study (Figure 4 and Table 1). The PALL dataset demonstrated that regardless of any statistical test applied, signal detection was required to reach the high classification performance required for clinical diagnostics (Figure 4 and Figure S2). Pronounced signal differentiating control and disease state was observed in association with loci known to be altered during cancer development.

Encouraging results were also obtained with placental tissue from autistic children (Table 2). This dataset was selected to reflect common sources of variation inherent to clinical studies, including diagnoses from different doctors, tissue samples reflecting collection feasibility rather than site of abnormality, and modest bisulfite sequencing depth per patient sample. In spite of the high natural background variation detected in placental samples, model classifiers built on training sets from one group of patients, independently analyzed with respect to control samples, could be applied to predict the entire set of individual DMPs (control and patient) from the other group (cases “G2 pred. G1” and “G1 pred. G2”).

It is important to emphasize the value of the classification performance evaluation, which is built into the Methyl-IT package as a validation procedure. It would not be advisable for users to continue an analysis if the classification performance is poor, even when optimal parameters are used. In the case presented, the robustness of the classification model built on Group G1, previously evaluated by cross-validation, was corroborated by the high classification performance reached on predicting the whole G2 (external data). Admittedly, further studies are needed to properly establish and validate a clinical diagnostic test for autism based on methylome data from placental tissue, but results suggest a potential avenue to address this seemingly intractable challenge.

Epigenetic variation can influence biologically relevant networks that are specific to each cell type, often occurring near genes that have functional relevance to the cell type [33]. As shown in Figure 5, we were able to identify relevant genes displaying differential methylation signals distinguishable from the natural background variation and putatively associated with disease, several proposed as drug targets for patient treatment or reported as biomarker candidates.

These observations alone are not sufficient to conclude a direct disease relationship, but the reproducibility of these data, combined with machine learning-based validation, provides a compelling argument for their further study.

## 4. Materials and Methods

#### 4.1. Divergences of Methylation Levels

Information divergences of methylation levels, total variation distance $\widehat{T}{V}_{d}\left({\widehat{p}}_{c},{\widehat{p}}_{t}\right)$ and Hellinger Divergence $\widehat{H}\left({\widehat{p}}_{c},{\widehat{p}}_{t}\right)$, were estimated for control and treatment (disease stage) relative to a reference virtual individual. The reference sample was built from a subset of individuals from the control population that were not included as our control.

In a Bayesian framework assuming uniform priors, the methylation level ${\widehat{p}}_{i}$ can be defined as ${\widehat{p}}_{i}=\left({n}_{i}^{mC}+1\right)/\left({n}_{i}^{mC}+{n}_{i}^{C}+2\right)$, where ${n}_{i}^{mC}$ and ${n}_{i}^{C}$ represent the numbers of methylated and non-methylated read counts observed at the genomic coordinate $i$, respectively. We estimate the shape parameters $\alpha $ and $\beta $ from the beta distribution minimizing the difference between the empirical and theoretical cumulative distribution functions (ECDF and CDF, respectively):

$F\left(p|\alpha ,\beta \right)=\frac{1}{B\left(\alpha ,\beta \right)}{\int }_{0}^{p}{t}^{\alpha -1}{\left(1-t\right)}^{\beta -1}dt$ (1)

where $B\left(\alpha ,\beta \right)$ is the beta function with shape parameters $\alpha $ and $\beta $. Since the beta distribution is a conjugate prior of the binomial distribution, we consider the parameter $p$ (methylation level) in the binomial distribution as randomly drawn from a beta distribution. The hyper-parameters $\alpha $ and $\beta $ are interpreted as pseudo counts. Then the mean $E\left[{p}_{i}|D\right]={\widehat{p}}_{i}$ of the methylation levels ${p}_{i}$, given the data $D$, is expressed by:

${\widehat{p}}_{i}=E\left[{p}_{i}|D\right]=\frac{{n}_{i}^{mC}+\alpha }{{n}_{i}^{mC}+{n}_{i}^{C}+\alpha +\beta }$ (2)

The methylation levels at the cytosine with genomic coordinate $i$ are then estimated according to this equation.
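The estimate above can be illustrated with a minimal Python sketch. It assumes the standard beta-binomial posterior mean, in which the pseudo counts $\alpha $ and $\beta $ are added to the methylated and non-methylated read counts; with $\alpha =\beta =1$ it reduces to the uniform-prior expression given in the text. The function name `methylation_level` is hypothetical and is not part of Methyl-IT.

```python
def methylation_level(n_mc, n_c, alpha=1.0, beta=1.0):
    """Posterior mean of the methylation level at one cytosine site.

    n_mc, n_c   : methylated and non-methylated read counts
    alpha, beta : beta-prior pseudo counts (assumed form; 1, 1 = uniform prior)
    """
    if n_mc < 0 or n_c < 0:
        raise ValueError("read counts must be non-negative")
    if alpha <= 0 or beta <= 0:
        raise ValueError("pseudo counts must be positive")
    return (n_mc + alpha) / (n_mc + n_c + alpha + beta)


# Uniform prior: 8 methylated and 2 non-methylated reads -> (8 + 1) / (10 + 2)
p_uniform = methylation_level(8, 2)

# A site without coverage falls back to the prior mean alpha / (alpha + beta)
p_no_reads = methylation_level(0, 0)
```

One consequence of the pseudo counts is that the estimate never reaches exactly 0 or 1, which keeps downstream square-root and logarithmic transformations well defined.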

As shown in Figure S4, the total variation distance $T{V}_{d}\left({p}_{c},{p}_{t}\right)$ sets the natural metric in the probabilistic space $\left(p,1-p\right)$, and it is defined as the absolute value of the methylation level difference:

$T{V}_{d}\left({p}_{c},{p}_{t}\right)=\frac{1}{2}\left(\left|{p}_{c}-{p}_{t}\right|+\left|\left(1-{p}_{c}\right)-\left(1-{p}_{t}\right)\right|\right)=\left|{p}_{c}-{p}_{t}\right|$ (3)

Notice that $\widehat{T}{V}_{d}\left({p}_{c},{p}_{t}\right)$ is, up to the factor 1/2, the Manhattan distance in the space $\left(p,1-p\right)$. Biostatisticians and biologists in general are familiar with the square-root transformation of the original variables, $\sqrt{x}$. The square-root transformation maps the space $\left(p,1-p\right)$ into the new space $\left(\sqrt{p},\sqrt{1-p}\right)$. The Euclidean distance ${d}_{E}\left(\sqrt{{p}_{c}},\sqrt{{p}_{t}}\right)$ is a ‘natural’ metric to introduce into the space $\left(\sqrt{p},\sqrt{1-p}\right)$, which turns out to be the Hellinger Distance of the original variables (Figure S4). The square of the Euclidean distance ${d}_{E}{\left(\sqrt{{p}_{c}},\sqrt{{p}_{t}}\right)}^{2}$ in the space $\left(\sqrt{p},\sqrt{1-p}\right)$ corresponds to the Hellinger Divergence $\widehat{H}\left({\widehat{p}}_{c},{\widehat{p}}_{t}\right)={\left(\sqrt{{\widehat{p}}_{t}}-\sqrt{{\widehat{p}}_{c}}\right)}^{2}+{\left(\sqrt{1-{\widehat{p}}_{t}}-\sqrt{1-{\widehat{p}}_{c}}\right)}^{2}$ in the space $\left(p,1-p\right)$.

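Here, however, the Hellinger Divergence will be used as given in reference [59], which is defined based on the estimated methylation levels ${\widehat{p}}_{i}$ at a given cytosine site $i$ as:

$\widehat{H}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)={w}_{i}\left[{\left(\sqrt{{\widehat{p}}_{i}^{t}}-\sqrt{{\widehat{p}}_{i}^{c}}\right)}^{2}+{\left(\sqrt{1-{\widehat{p}}_{i}^{t}}-\sqrt{1-{\widehat{p}}_{i}^{c}}\right)}^{2}\right]$ (4)

where ${w}_{i}=2\frac{{m}_{i}^{c}{m}_{i}^{t}}{{m}_{i}^{c}+{m}_{i}^{t}}$, ${m}_{i}^{c}={n}_{i}^{m{C}_{c}}+{n}_{i}^{{C}_{c}}+1$, and ${m}_{i}^{t}={n}_{i}^{m{C}_{t}}+{n}_{i}^{{C}_{t}}+1$.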

According to Equation (4), not only the methylation levels are considered in the estimation of $H$, but also the control and treatment coverage at each given cytosine site. Under the null hypothesis of no difference between the distributions ${\widehat{p}}_{i}^{c}$ and ${\widehat{p}}_{i}^{t}$, Equation (4) asymptotically has a chi-square distribution with one degree of freedom, which sets the basis for a Hellinger chi-square test (HCT) [59].

Distance $\widehat{T}{V}_{d}\left({\widehat{p}}_{c},{\widehat{p}}_{t}\right)$ and Hellinger Divergence (as given in Equation (4)) hold the inequality: $\widehat{T}{V}_{d}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)\le \sqrt{2}{\widehat{H}}_{d}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)$, where ${\widehat{H}}_{d}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)=\sqrt{\widehat{H}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)/{w}_{i}}$ is the Hellinger Distance, a direct consequence of the Cauchy-Schwarz Inequality.

Only cytosine sites with methylation level differences ($\widehat{T}{V}_{d}$) greater than a cut-off value were included in the analysis.
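As a hedged illustration of the two divergences, the sketch below computes the total variation distance and the coverage-weighted Hellinger Divergence for one cytosine site. It assumes that the weighted divergence is ${w}_{i}$ times the unweighted expression given above, which is what the relation ${\widehat{H}}_{d}=\sqrt{\widehat{H}/{w}_{i}}$ implies, and that methylation levels are estimated with uniform priors. All function names are hypothetical and are not the Methyl-IT API.

```python
import math


def level(n_mc, n_c):
    """Methylation level under a uniform prior: (n_mc + 1) / (n_mc + n_c + 2)."""
    return (n_mc + 1) / (n_mc + n_c + 2)


def tv_distance(p_c, p_t):
    """Total variation distance: absolute methylation level difference."""
    return abs(p_c - p_t)


def hellinger_divergence(n_mc_c, n_c_c, n_mc_t, n_c_t):
    """Return (H, w) for one site from control and treatment read counts.

    Assumed form: H = w * [(sqrt(p_t) - sqrt(p_c))^2
                           + (sqrt(1 - p_t) - sqrt(1 - p_c))^2]
    """
    p_c, p_t = level(n_mc_c, n_c_c), level(n_mc_t, n_c_t)
    m_c = n_mc_c + n_c_c + 1          # control coverage term
    m_t = n_mc_t + n_c_t + 1          # treatment coverage term
    w = 2.0 * m_c * m_t / (m_c + m_t)
    unweighted = ((math.sqrt(p_t) - math.sqrt(p_c)) ** 2
                  + (math.sqrt(1.0 - p_t) - math.sqrt(1.0 - p_c)) ** 2)
    return w * unweighted, w


# Example site: control 2 mC / 18 C reads, treatment 15 mC / 5 C reads
H, w = hellinger_divergence(2, 18, 15, 5)
tv = tv_distance(level(2, 18), level(15, 5))
hellinger_distance = math.sqrt(H / w)
inequality_holds = tv <= math.sqrt(2.0) * hellinger_distance
```

Because the weight grows with coverage, the same methylation level difference yields a larger divergence at deeply sequenced sites than at poorly covered ones.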

#### 4.2. Non-Linear Fit of Distribution Functions

The cumulative distribution functions (CDF) for ${H}_{k}\left({p}_{k}^{c},{p}_{k}^{t}\right)$ can be approximated by a Weibull distribution model:

$F\left({H}_{k}|\alpha ,\beta ,\mu \right)=1-\mathrm{exp}\left[-{\left(\frac{{H}_{k}-\mu }{\beta }\right)}^{\alpha }\right]$ (5)

where $\alpha $, $\beta $, and $\mu $ are the shape, scaling, and location parameters, respectively, or by the gamma distribution:

$F\left({H}_{k}|\alpha ,\beta ,\mu \right)=\frac{\gamma \left(\alpha ,\beta \left({H}_{k}-\mu \right)\right)}{\mathsf{\Gamma}\left(\alpha \right)}$ (6)

where $\mathsf{\Gamma}\left(\alpha \right)$ is the gamma function and $\gamma \left(\alpha ,\beta \left({H}_{k}-\mu \right)\right)$ is the lower incomplete gamma function with parameters $\alpha $ and $\beta $, and location parameter $\mu $. Model parameters are estimated by non-linear regression analysis of the ECDF ${\widehat{F}}_{n}\left({\widehat{H}}_{k}\le {H}^{0}\right)$ versus ${\widehat{H}}_{k}\left({\widehat{p}}_{i}^{c},{\widehat{p}}_{i}^{t}\right)$. The ECDF of the variable ${\widehat{H}}_{k}$ is defined as:

${\widehat{F}}_{n}\left({\widehat{H}}_{k}\le {H}^{0}\right)=\frac{1}{n}\sum _{k=1}^{n}{1}_{{\widehat{H}}_{k}\le {H}^{0}}$ (7)

where ${1}_{{\widehat{H}}_{k}\le {H}^{0}}=\begin{cases}1 & \text{if } {\widehat{H}}_{k}\le {H}^{0}\\ 0 & \text{if } {\widehat{H}}_{k}>{H}^{0}\end{cases}$ is the indicator function. The function ${\widehat{F}}_{n}\left({\widehat{H}}_{k}\le {H}^{0}\right)$ is easily computed (for example, by using the function “ecdf” of the statistical computing program R).
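The following Python sketch shows the ECDF and a simplified Weibull fit. It is an illustration under stated assumptions, not the non-linear regression described above: the Weibull CDF is taken in its standard three-parameter form $1-\mathrm{exp}\left[-{\left(\left(H-\mu \right)/\beta \right)}^{\alpha }\right]$, the location $\mu $ is held fixed, and the shape and scale are obtained by ordinary least squares on the linearized relation $\mathrm{ln}\left(-\mathrm{ln}\left(1-F\right)\right)=\alpha \mathrm{ln}\left(H-\mu \right)-\alpha \mathrm{ln}\beta $. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Return F_n, where F_n(h0) is the fraction of sample values <= h0."""
    xs = sorted(sample)
    n = len(xs)

    def f_n(h0):
        return bisect_right(xs, h0) / n

    return f_n


def weibull_cdf(h, shape, scale, loc=0.0):
    """Three-parameter Weibull CDF (assumed standard form)."""
    if h <= loc:
        return 0.0
    return 1.0 - math.exp(-(((h - loc) / scale) ** shape))


def fit_weibull_linearized(sample, loc=0.0):
    """Least-squares fit of shape and scale on the linearized Weibull CDF.

    A simplification of non-linear regression: points where the ECDF
    equals 1 (or h <= loc) cannot be log-transformed and are skipped.
    """
    f_n = ecdf(sample)
    pts = []
    for h in sorted(set(sample)):
        f = f_n(h)
        if h > loc and 0.0 < f < 1.0:
            pts.append((math.log(h - loc), math.log(-math.log(1.0 - f))))
    if len(pts) < 2:
        raise ValueError("need at least two usable points")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    shape = sxy / sxx                          # slope = alpha
    scale = math.exp(mean_x - mean_y / shape)  # intercept = -alpha * ln(beta)
    return shape, scale


# Hypothetical divergence values for one sample
h_values = [0.2, 0.5, 0.9, 1.4, 2.1, 3.0, 4.6]
shape_hat, scale_hat = fit_weibull_linearized(h_values)
```

The linearized estimates can serve as starting values for a proper non-linear least-squares fit, which weights the points differently and can also estimate the location parameter.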

#### 4.3. Detection of the Methylation Signal

As for any signal, whether natural or treatment-induced, suitable detection of the methylation signal is based on knowledge of its probability distribution. The basic idea behind the application of signal detection is illustrated in Figure 1. Critical values ${H}_{\alpha =0.05}$ are estimated from the best fitted model (Equation (5) or (6)) for each individual sample from the control and treatment groups. Depending on the average methylation in the populations under screening, the true signal would be found to the right of the highest observed critical value ${H}_{\alpha =0.05}$. A cytosine position with a Hellinger Divergence value $H$ greater than the critical value ${H}_{\alpha =0.05}$ is considered for further downstream analyses. A further step to estimate an optimal cutoff value of $H$ is required. Although $H$ is used here, other information divergences can be used as well.
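For a fitted Weibull model, the critical value can be written in closed form. The sketch below assumes the standard three-parameter Weibull CDF, so that $P\left(H>{H}_{\alpha }\right)=\alpha $ gives ${H}_{\alpha }=\mu +\beta {\left(-\mathrm{ln}\alpha \right)}^{1/\text{shape}}$; the significance level is named `sig_level` in the code to avoid a clash with the shape parameter. The function names and the parameter values in the example are hypothetical.

```python
import math


def weibull_critical_value(shape, scale, loc=0.0, sig_level=0.05):
    """Value h such that P(H > h) = sig_level under the fitted Weibull model."""
    if not 0.0 < sig_level < 1.0:
        raise ValueError("sig_level must be in (0, 1)")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return loc + scale * (-math.log(sig_level)) ** (1.0 / shape)


def potential_signal(h_values, critical_value):
    """Keep the divergence values that exceed the critical value."""
    return [h for h in h_values if h > critical_value]


# Hypothetical fitted parameters for one sample
h_crit = weibull_critical_value(shape=0.8, scale=1.2)
candidates = potential_signal([0.3, 2.5, 7.9, 12.4], h_crit)
```

Positions retained at this step are only candidates; the optimal cutoff described next decides which of them are finally called DMPs.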

To estimate an optimal cutoff value of $H$, three approaches were taken: (1) based on the estimation of the Youden Index [34]; (2) based on the posterior classification probabilities of the potential signal (potential DMPs) into two classes (given by a model classifier), from control and from treatment; and (3) based on the posterior classification probabilities derived from a gamma mixture model. Next, cytosine positions with $H$ values greater than the cutoff value are considered DMPs, regardless of which group they belong to, control or treatment.
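The first approach can be sketched as follows. The Youden Index is $J=\text{sensitivity}+\text{specificity}-1$, and the cutoff is the candidate value that maximizes it. This minimal example assumes that every observed divergence value is a candidate cutoff, that a position is called signal when $H$ exceeds the cutoff, and that ties are resolved in favor of the smallest cutoff; the function name is hypothetical and the scan is not the Methyl-IT implementation.

```python
def youden_cutoff(h_control, h_treatment):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    h_control   : divergence values of potential DMPs from control samples
    h_treatment : divergence values of potential DMPs from treatment samples
    """
    if not h_control or not h_treatment:
        raise ValueError("both groups must be non-empty")
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(h_control) | set(h_treatment)):
        # Treatment values above the cutoff are true positives
        sensitivity = sum(h > c for h in h_treatment) / len(h_treatment)
        # Control values at or below the cutoff are true negatives
        specificity = sum(h <= c for h in h_control) / len(h_control)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


# Hypothetical, partially overlapping divergence values
cutoff, j_index = youden_cutoff([1, 2, 3, 6], [4, 5, 7, 8])
```

With overlapping groups, as in this example, no cutoff reaches $J=1$, which is the situation in which the classifier-based approaches (2) and (3) become useful.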

For the analysis with the DSS R package, methylation count (COV) files were read into R and prepared with the makeBSseqData function (DSS). DMPs were then computed using the DMLtest function (DSS) without smoothing, at p-value < 0.05.

#### 4.4. DMP Prediction Based on Machine Learning Model Classifiers

The following model classifiers were tested for DMP predictions: PCA + LDA, PCA + QDA, PCA + logistic, and logistic models. That is, a principal component analysis (PCA) is applied to the original raw data matrix, and the derived principal components are then used in a further linear/quadratic discriminant analysis (LDA/QDA). A scaling step is applied to the raw data matrix before the application of the mentioned procedure, which is not applied for the logistic model. Here, PCA yields new orthogonal (non-correlated) variables, the principal components, which prevent any potential bias originating from correlation or association among the original variables.

Four predictor variables were considered: $T{V}_{d}$, H, relative position of the cytosine site in the chromosome, and the logarithm base two of the probability to observe a Hellinger Divergence value H greater than the critical value ${H}_{\alpha =0.05}$: $lo{g}_{2}P\left(H>{H}_{\alpha =0.05}\right)$.

All data analysis was performed with the R package MethylIT version 0.3.2 available at GitHub (https://github.com/genomaths/MethylIT), where several user guide examples illustrate the application of MethylIT to downstream methylation analysis.

#### 4.5. Simulations

Twelve simulated datasets of methylated cytosines were generated based on three different averages of absolute methylation level differences: mild (0.0356), medium (0.133) and large (0.184), and with different sample sizes (Table 1). Simulated data were generated using the function simulateCounts from the R package MethylIT.utils (https://github.com/genomaths/MethylIT.utils). Methylation coverages (minimum 10) were generated from a negative binomial distribution with the function rnegbin from the R package MASS. This function uses the representation of the negative binomial distribution as a continuous mixture of Poisson distributions with Gamma distributed means. Prior methylation levels are randomly generated with the beta distribution using the Beta function from the R package “stats”, and posterior methylation levels are generated according to Bayes’ Theorem.
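The generative steps described above can be sketched in Python using only the standard library. This is an illustration under explicit assumptions and does not reproduce simulateCounts: the coverage mean and dispersion, the beta shape parameters, the resampling used to enforce the minimum coverage, and the binomial draw of methylated reads are all hypothetical choices. Only the Gamma-Poisson representation of the negative binomial follows the description in the text.

```python
import math
import random


def rpois(rng, lam):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rnegbin(rng, mu, theta):
    """Negative binomial draw as a Poisson with a Gamma-distributed mean.

    The Gamma mean has shape theta and scale mu / theta, so its mean is mu.
    """
    return rpois(rng, rng.gammavariate(theta, mu / theta))


def simulate_site(rng, mu=30.0, theta=5.0, a=0.5, b=0.5, min_cov=10):
    """Simulate one cytosine site; all default values are assumptions."""
    coverage = rnegbin(rng, mu, theta)
    while coverage < min_cov:              # enforce the minimum coverage
        coverage = rnegbin(rng, mu, theta)
    prior_p = rng.betavariate(a, b)        # prior methylation level
    n_mc = sum(rng.random() < prior_p for _ in range(coverage))
    # Posterior mean of the methylation level given the simulated counts
    posterior_p = (n_mc + a) / (coverage + a + b)
    return coverage, n_mc, prior_p, posterior_p


rng = random.Random(2019)                  # fixed seed for reproducibility
sites = [simulate_site(rng) for _ in range(1000)]
```

Control and treatment populations would differ in the beta shape parameters, which shifts the average methylation level difference between groups while leaving site-to-site stochastic variation in place.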

The fact that each dataset of read counts was sampled from the same populations, control or treatment, does not mean that the individual samples will have the same probability distribution for the Hellinger Divergences of the methylation levels. Simulation was performed under the standard clinical assumption that each individual sample of read counts belongs to one of the two possible populations: control/healthy or treatment/patient. Therefore, although each dataset is sampled from these populations, they are independent up to the limit for the algorithms of pseudorandom number generation (which is a standard simulation assumption). Since we are simulating a stochastic process, the Hellinger Divergence from each sample follows a different probability distribution, as indicated in Figure 1.

#### 4.6. Experimental Methylation Datasets

The datasets of genome-wide methylated and unmethylated read counts (for each cytosine site) from normal CD19+ blood cell donors (NB) and from patients with pediatric acute lymphoblastic leukemia (PALL) were downloaded from the Gene Expression Omnibus (GEO) database [38]. DMPs were estimated for control (NB, GEO accession: GSM1978783 to GSM1978786) and for patients (ALL cells, GEO accession number GSM1978759 to GSM1978761) relative to a reference group of four independent normal CD19+ blood cell donors (GEO accession: GSM1978787 to GSM1978790). For the purposes of the analysis presented here, we only focused on the analysis of chromosome 9.

For the autism analysis [39], raw sequencing reads were downloaded from NCBI (GEO: GSE67615). The following methylome datasets from autistic children were retrieved from the GEO database:

**Group 1**: GSM1655495, GSM1655490, GSM1655488, GSM1652180, GSM1652179, GSM1652173, GSM1652172, GSM1652171, GSM1652157.

**Group 2**: GSM1655498, GSM1655497, GSM1655492, GSM1652167, GSM1652160, GSM1652156, GSM1652155, GSM1652154, GSM1652152, GSM1652149, GSM1652148.

Reads were quality-controlled with FastQC (version 0.11.5), trimmed with the TrimGalore! program (version 0.4.1) and Cutadapt (version 1.15), and then aligned to the Homo sapiens reference genome (Homo_sapiens.GRCh37.dna.toplevel.fa) using Bismark (version 0.19.0) with bowtie2 (version 2.3.3.1). The Bismark methylation extractor with default parameters was used to obtain the methylation count files (COV files).