Enhancing resolution of natural methylome reprogramming behavior in plants

Background Natural methylome reprogramming within chromatin involves changes in local energy landscapes that are subject to thermodynamic principles. Signal detection permits the discrimination of methylation signal from dynamic background noise that is induced by thermal fluctuation. Current genome-wide methylation analysis methods do not incorporate biophysical properties of DNA, and focus largely on DNA methylation density changes, which limits resolution of natural, more subtle methylome behavior in relation to gene activity. Results We present here a novel methylome analysis procedure, Methyl-IT, based on information thermodynamics and signal detection. Methylation analysis involves a signal detection step, and the method was designed to discriminate methylation regulatory signal from background variation. Comparisons with commonly used programs and two publicly available methylome datasets, involving stages of seed development and drought stress effects, were implemented. Information divergence between methylation levels from different groups, measured in terms of Hellinger divergence, provides discrimination power between control and treatment samples. Differentially informative methylation positions (DIMPs) achieved higher sensitivity and accuracy than standard differentially methylated positions (DMPs) identified by other methods. Differentially methylated genes (DMG) that are based on DIMPs were significantly enriched in biologically meaningful networks. Conclusions Methyl-IT analysis enhanced resolution of natural methylome reprogramming behavior to reveal network-associated responses, offering resolution of gene pathway influences not attainable with previous methods.


The information thermodynamics model and Methyl-IT workflow 108
Methylation level is generally the ratio of methylated cytosine read counts divided by the sum of 109 methylated and unmethylated cytosine read counts for a given cytosine site. This is a descriptive variable 110 that reflects uncertainty of methylation level at a given cytosine site. Most methylation analyses test 111 whether or not the difference between control (CT) and treatment (TT) methylation levels (the uncertainty 112 variation) is statistically significant. The approach measures the absolute value of the difference between 113 methylation levels " ## − " &# from control ( " &# ) and treatment ( " ## ) at each cytosine site. The 114 magnitude of " ## − " &# is known as total variation distance (TVD). 115 116 To improve resolution of methylation signal, we applied Hellinger divergence (HD), ([25], detailed 117 description included in Methods section). Both TVD and HD are information divergences that follow 118 asymptotic chi-square distribution [25]. However, HD converges faster and carries more information than 119 TVD and, consequently, has higher discrimination power [26]. The improvement in discrimination power 120 is visible in Fig. 1 By way of illustration, we used the drought stress data, where CTR designated 121 unstressed control group and STR designated stressed group. Fig. 1a shows that treatment methylation 122 signal on chromosome 1, expressed in terms of methylation level, was indistinguishable from control. 123 Higher resolutions are reached with TVD and HD, with HD providing highest discrimination power. 124 125 Ganguly et al. reported individual variation and pre-existing methylation differences in the drought stress 126 materials [24], which is reflected by HD in Fig. 1c. The improvement in resolution attributed to HD 127 derives from the fact that TVD takes into account only one dimension of the methylation change, while 128 HD is estimated in bi-dimensional space ( " , 1 − " ), where the goodness-of-fit test to detect differences 129 is performed. 130 131 Genome-wide Hellinger divergence for background methylation variation can be modeled by a Weibull 132 distribution [12]. On the other hand, biologically meaningful methylation changes result in an increment 133 of Hellinger divergence distinguishable in the signal detection step (Fig. 2). For a given level of 134 significance α (Type I error  Estimation of optimal cutoff from AUC is an additional step to remove any remaining potential 146 methylation background noise that still remains with probability α = 0.05 > 0. We define as methylation 147 signal (DIMP) each cytosine site with Hellinger divergence values above the cutpoint (shown in Fig. 2 as  148 ). Each DIMP-associated signal may or may not be represented within a DMP derived by Fisher's 149 exact test (or other current tests, Fig. 2 (Table 1). Such methylation changes represent potential 164 DIMPs. The conclusions will remain the same even for a generalized situation with n i mCt running between 165 80 and 350 ( ). Considering that even a small genome like Arabidopsis contains millions 166 of cytosine sites, the situation presented in Table 1 is not rare, and the difference caused by statistical tests 167 listed in Table 1 would be significant. A flow chart integrating the main procedures of Methyl-IT and  168 optional downstream analysis is shown in Fig. 3. 169 170

Methyl-IT sensitivity and genomic regions targeted by DIMPs 171
To investigate the sensitivity of Methyl-IT, we applied DIMP detection to the drought stress dataset and 172 compared with the outputs from other methods. Fig. 4  increasing from COT to MG to PMG to dry seed, and reaching its peak in leaf tissue. CHG and CHH 197 changes were associated predominantly with non-genic and TE regions, and CG DIMPs showed higher 198 density within gene regions, which agreed with the DMP distribution pattern reported in the original 199 study [21]. A surprising CHG peak was observed in leaf tissues relative to seed, which we did not pursue 200 in detail, but may reflect a pronounced tissue-specific transition. Similar DIMP patterns were observed in 201 the drought stress dataset relative to cytosine context, although with higher signal levels in each context.  Table 2. Simulation experiments suggested that classification accuracy mainly depended on 222 the distance separating Weibull distributions (noise plus signal) for control and treatment. Weibull model 223 parameter values (alpha.1 and scale.1) from the first simulation for control samples (S11 to S13) were 224 close to those estimated in the treatment group (S21 to S23), suggesting that corresponding distribution 225 functions were close as well. Although the classifier performance to predict DIMPs could be considered 226 acceptable (about 80% accuracy), discriminatory power to predict DIMPs from an external sample (not 227 included to build the model) was relatively low. If probabilistic models were sufficiently distant, even a 228 classifier trained with samples having an overall mean TVD (absolute values of methylation differences) 229 equal to 0.13 could achieve good discrimination of DIMPs from an external sample. Importantly, a given 230 DIMP with the same HD value in control and treatment groups could be discriminated from control group 231 if the Weibull probability distributions from control and treatment were different. 232 233 Classification of DIMPs was accomplished for the seed development dataset as well. Since each seed 234 development stage comprised only one sample, groups were formed according to the hierarchical cluster 235 presented in Fig. 6. The best classification accuracies were obtained for CG and CHH methylation 236 contexts ( Table 2). These were binary classifications, where control samples were the reference class. 237 Thus, probability P(x) that a new DIMP x could be observed in the control class determined its 238 classification, and the probability that a given DIMP did not classify within the control class was 1 -P(x). 239 A classifier model built on the groups CT: COT and MG, and non-CT: PMG and DRY (Table 2)  DMGs, we conducted a network enrichment analysis test (NEAT). A statistically significant network 260 enrichment of links between genes from the set of seed development DMGs and the set of GO-biological 261 process associated with seed functions was observed (Table 3). The list of 16 networks identified includes 262 positive and negative regulation of GA-mediated signaling, positive and negative regulation of seed 263 germination, regulation of seed dormancy, and raffinose family oligosaccharide biosynthesis, all well-264 established seed processes (full gene list in Additional file 2: Table S2). GeneMANIA [31] identified 265 interaction networks within the data, indicating that many DMGs in the seed development dataset 266 function together (Additional file 3: Figure S1). To test the impact of different minimum cytosine 267 coverage on Methyl-IT output, the pipeline was run without minimum coverage limit (Table 3) and with a 268 minimum coverage of 10 reads (Additional file 4:   Methyl-IT permits methylation analysis as a signal detection problem. The model predicts that most 311 methylation changes detected, at least in Arabidopsis, represent methylation "background noise" with 312 respect to methylation regulatory signal, explainable within a statistical physical probability distribution. 313 Implicit in our approach is that DIMPs can be detected in the control sample as well. These DIMPs are 314 located within the region of false alarm in Fig. 1, and correspond to natural methylation signal not 315 induced by treatment. Thus, using the Methyl-IT procedure, methylation signal is not only distinguished 316 from background noise, but can be used to discern natural signal from that induced by treatment. 317 318 Whereas methods underlying RMST (methylpy approach) and DSS provide essential information about 319 methylation density, context and positional changes on a genome-wide scale, Methyl-IT provides 320 resolution of subtle methylation repatterning signals distinct from background fluctuation. Data derived 321 from analysis with FET, RMST, HCT or DSS alone could lead to an assumption that gene body 322 methylation plays little or no role in gene expression, or that transposable elements are the primary target 323 of methylation repatterning. Yet ample data suggest that this picture is incomplete [38]. Methyl-IT results 324 show that these conclusions more likely reflect inadequate resolution of the methylome system. GLM 325 analysis applied to the identification of DMR-associated genes by methylpy [21] and DSS indicates that 326 DMRs (or DMR associated genes) do not provide sufficient resolution to link them with gene expression.

Methylation level estimation 361
In Methyl-IT pipeline, it is up to the user whether to estimate methylation levels at each cytosine position 362 following a Bayesian approach or not. In a Bayesian framework assuming uniform priors, the methylation 363 level can be defined as: (1), where and represent the numbers 364 of methylated and non-methylated read counts observed at the genomic coordinate , respectively. We 365 estimate the shape parameters and from the beta distribution (2) 366 minimizing the difference between the empirical and theoretical cumulative distribution functions (ECDF 367 and CDF, respectively), where is the beta function with shape parameters and . Since the 368 beta distribution is a prior conjugate of binomial distribution, we consider the p parameter (methylation 369 level ) in the binomial distribution as randomly drawn from a beta distribution.
The total variation of the methylation levels and , Eq. 4 asymptotically has a chi-square distribution with one degree of freedom, 403 which set the basis for a Hellinger chi-square test (HCT). The term introduces a useful correction for 404 the Hellinger divergence, since the estimation of and are based on counts (see Table 1). 405 In Methyl-IT pipeline, the statistics mean, median, or sum of the read counts at each cytosine site of some 406 control samples can be used to create a virtual reference sample. It is up to the user whether to apply the 407 'row sum', 'row mean' or 'row median' of methylated and unmethylated read counts at each cytosine site 408 if, and only if, the probability to observe a 433 methylation change with Hellinger divergence higher than is lesser than . 434 The PMSs reflect cytosine methylation positions that undergo changes without discerning whether they 435 represent biological signal created by the methylation regulatory machinery. The application of signal 436 detection theory is required for robust discrimination of biological signal from physical noise-induced 437 thermal fluctuations, permitting a high signal-to-noise ratio [18]. 438

Robust detection of differentially informative methylated positions (DIMPs) 439
Application of signal detection theory is required to reach a high signal-to-noise ratio [43,44]

Estimation of differentially methylated genes (DMGs) using Methyl-IT 490
Our degree of confidence in whether DIMP counts in both control and treatment represent true biological 491 signal was set out in the signal detection step. To estimate DMGs, we followed similar steps to those 492 proposed in Bioconductor R package DESeq2 To test difference between group counts we applied the fitting algorithmic approaches: PR and PQR if 507 ( ), NBR and NBR with 'prior weights'. Next, best model based on Akaike 508 information criteria (AIC). The Wald test for significance of the independent variable coefficient indicates 509 whether or not the treatment effect is significant, while the coefficient sign (log2FC) will indicate the 510 direction of such an effect. 511

Bootstrap goodness-of-fit test for 2x2 contingency tables 512
The goodness-of-fit RMST 2x2 contingency tables as implemented in methylpy [20] Table 1. Relative sensitivity differences between several statistical tests applied to identify differentially 659 methylated cytosines. P-values for the 2x2 contingency table with read