# Identification of Differentially Methylated Sites with Weak Methylation Effects


## Abstract


## 1. Introduction

## 2. Methods

#### 2.1. Wavelet-Based Functional Mixed Models

The functional mixed model for the methylation profile of the $i$-th individual is

$$Y_i(t) = \sum_{j=1}^{J} X_{ij} B_j(t) + \sum_{m=1}^{M} Z_{im} U_m(t) + E_i(t), \qquad (1)$$

where $Y_i(t)$ represents the logit transformation of the methylation level at genomic location $t \in \{t_l;\ l = 1, \dots, T\}$ for the $i$-th individual, $i = 1, \dots, N$. $X_{ij} = 1$ if individual $i$ belongs to treatment $j$ and 0 otherwise, for $1 \le j \le J$. The function $B_j(t)$ represents the fixed effect corresponding to treatment $j$ (or to another covariate of interest). $Z_{im}$ is a random covariate that accounts for variation in $Y_i(t)$ caused by potential multilevel structures in the measurements (e.g., when multiple subjects from the same family are measured, each family introduces its own random effect; $Z_{im} = 1$ if individual $i$ is from family $m$, and $U_m(t)$ is the random effect of family $m$). $E_i(t)$ is a residual error function. Using a vectorized formulation, we may write model (1) as:

$$\mathbf{Y}(t) = \mathbf{X}\mathbf{B}(t) + \mathbf{Z}\mathbf{U}(t) + \mathbf{E}(t), \qquad (2)$$

where $\mathbf{Y}(t) = [Y_1(t), \dots, Y_N(t)]^T$, $\mathbf{B}(t) = [B_1(t), \dots, B_J(t)]^T$, $\mathbf{U}(t) = [U_1(t), \dots, U_M(t)]^T$, and $\mathbf{E}(t) = [E_1(t), \dots, E_N(t)]^T$. Here, $\mathbf{Y}$ is an $N \times T$ matrix of observations across all $T$ genomic locations for all $N$ individuals. $\mathbf{X}$ is an $N \times J$ design matrix that indicates which treatment group each of the $N$ individuals belongs to, or encodes other covariates of interest (e.g., a phenotype), and the $J \times T$ matrix $\mathbf{B}$ contains the fixed effects of the covariates. The $t$-th column of $\mathbf{B}$, denoted by $\mathbf{b}_t$, is a $J$-dimensional vector describing the effects of the $J$ covariates on $\mathbf{Y}$ at genomic location $t$. For example, if the $i$-th row of $\mathbf{X}$ is a 1/0 vector indicating which glyphosate dosage group the $i$-th plant was assigned to, $i = 1, \dots, N$, then $\mathbf{b}_t$ corresponds to the effect of the dose levels on $\mathbf{Y}$ at genomic location $t$. In Equation (2), $\mathbf{Z}$ is a design matrix for random effects that accounts for variation in $\mathbf{Y}$ caused by potential multilevel structures in the measurements; $\mathbf{U}$ contains the corresponding random effects; and $\mathbf{E}$ is an $N \times T$ matrix of residual errors. We assume that $\mathbf{E}$ is multivariate normal with mean 0 and variance-covariance matrix $S$. For example, in our A. thaliana experiment, there are four plants in each of the 0%, 5%, and 10% glyphosate-treated groups. Therefore, the design matrix $\mathbf{X}$ is $12 \times 3$ and $\mathbf{B}$ is a $3 \times T$ matrix, where $T$ is the number of cytosine locations. Since the A. thaliana data do not involve multilevel structures, the random effect term in Equation (2) is omitted. The resulting functional model can be rewritten as

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad (3)$$

where $\mathbf{b}_t$ is a column vector of $p = 3$ elements (one per group) giving the mean methylation profile of each group at a given genomic location $t$.
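The design-matrix layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the plant ordering (four 0% plants, then four 5%, then four 10%), the helper names `design_matrix` and `matmul`, and the toy values in `B` are all hypothetical and are not taken from the study.

```python
# Assumed ordering: 4 plants per dose group, groups ordered 0%, 5%, 10%.
DOSES = ["0%", "5%", "10%"]
PLANTS_PER_GROUP = 4


def design_matrix(n_groups, per_group):
    """Return an (n_groups * per_group) x n_groups 1/0 indicator matrix."""
    rows = []
    for g in range(n_groups):
        for _ in range(per_group):
            rows.append([1 if j == g else 0 for j in range(n_groups)])
    return rows


def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


X = design_matrix(len(DOSES), PLANTS_PER_GROUP)  # 12 x 3

# Toy fixed-effect matrix B (3 x T with T = 2 cytosines); values are made up.
B = [[0.10, 0.50],
     [0.12, 0.50],
     [0.20, 0.45]]

# Noise-free part of Y = XB + E: every plant receives its group's row of B.
mean_profiles = matmul(X, B)  # 12 x 2
```

Because each row of `X` contains a single 1, multiplying by `B` simply copies the group-level profile to every plant in that group, which is why column `t` of `B` can be read as the group means at location `t`.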

A discrete wavelet transform (DWT) is applied to $\mathbf{Y}$ to obtain an $N \times T^*$ matrix of wavelet coefficients $\mathbf{D}$. The corresponding wavelet space model can be obtained by post-multiplying both sides of Equation (3) by the wavelet transformation operator $\mathbf{\Phi}'$:

$$\mathbf{D} = \mathbf{X}\mathbf{B}^* + \mathbf{E}^*, \qquad (5)$$

where $\mathbf{\Phi}'$ is a $T \times T^*$ wavelet transformation operator, $\mathbf{D} = \mathbf{Y}\mathbf{\Phi}'$, $\mathbf{B}^* = \mathbf{B}\mathbf{\Phi}'$, and $\mathbf{E}^* = \mathbf{E}\mathbf{\Phi}'$. Equation (5) is a wavelet space model with $\mathbf{D}$, $\mathbf{B}^*$, and $\mathbf{E}^*$ representing the wavelet coefficients of $\mathbf{Y}$, $\mathbf{B}$, and $\mathbf{E}$, respectively. We adopt a Bayesian approach to fit Equation (5) following Morris and Carroll [10]. The posterior samples of the parameters in Equation (5) are obtained with a Markov chain Monte Carlo (MCMC) algorithm. Finally, the inverse DWT is applied to the posterior samples of $\mathbf{B}^*$ to obtain posterior samples of $\mathbf{B}$ in the data domain, which are subsequently used to identify differentially methylated cytosines (DMCs) following a Bayesian false discovery rate approach.
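To make the post-multiplication step concrete, the following minimal sketch uses an orthonormal Haar transform for profiles of length $T = 4$. The choice of the Haar basis, the toy matrices, and the helper names are assumptions made for illustration; the wavelet family actually used for the analysis is not stated here. The sketch only checks the linear algebra: transforming $\mathbf{Y}$ gives the same coefficients as transforming $\mathbf{B}$ and then applying $\mathbf{X}$, and the inverse transform recovers $\mathbf{B}$.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


s = 1.0 / math.sqrt(2.0)
# Orthonormal Haar analysis matrix W for signals of length 4
# (each row is one basis vector, so W times its transpose is the identity).
W = [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5, -0.5],
     [s, -s, 0.0, 0.0],
     [0.0, 0.0, s, -s]]
PHI_T = transpose(W)  # plays the role of the T x T* operator, here T* = T = 4

# Toy example: 4 individuals in 2 groups, made-up group profiles.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
B = [[0.1, 0.1, 0.3, 0.3],
     [0.1, 0.1, 0.5, 0.5]]

Y = matmul(X, B)             # noise-free data-domain profiles (E omitted)
D = matmul(Y, PHI_T)         # wavelet coefficients of Y
B_star = matmul(B, PHI_T)    # wavelet coefficients of B
XB_star = matmul(X, B_star)  # right-hand side of the wavelet space model
B_back = matmul(B_star, W)   # inverse transform: back to the data domain
```

In a full analysis the inverse step would be applied to every posterior draw of $\mathbf{B}^*$ rather than to a single matrix.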

#### 2.2. Bayesian False Discovery Rate

Given the posterior samples of $\mathbf{B}$, we can identify significant regions either on $\mathbf{B}$ or on contrast effects that contain the differences between covariate effects in $\mathbf{B}$. For example, in the A. thaliana data, since we are interested in identifying DMCs with different dosage effects, we calculate the contrast effects by pre-multiplying $\mathbf{B}$ with the contrast operator $\left(\begin{array}{ccc}-1& 1& 0\\ 0& -1& 1\\ -1& 0& 1\end{array}\right)$, which transforms the effect of each dosage level into the contrast effects of Level 2 vs. Level 1, Level 3 vs. Level 2, and Level 3 vs. Level 1, respectively. We apply this operator to all posterior samples of $\mathbf{B}$ to obtain posterior samples of the contrast effects. Denote by $C_a(t)$, $t \in \{t_l;\ l = 1, \dots, T\}$, the $a$-th contrast effect. Identifying significant DMCs on $C_a(t)$ amounts to identifying locations at which $C_a(t)$ is large in magnitude. We achieve this by performing Bayesian multiple testing that controls the overall false discovery rate, following Morris and Carroll [10], Zhu et al. [14], and Lee and Morris [9].
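Applying the contrast operator to posterior draws is a row-wise linear combination of the dose-level effects. The sketch below assumes each posterior draw of $\mathbf{B}$ is stored as a $3 \times T$ list of rows; the two draws and the helper name `apply_contrast` are invented for illustration.

```python
# Rows: Level 2 vs. Level 1, Level 3 vs. Level 2, Level 3 vs. Level 1.
CONTRAST = [[-1, 1, 0],
            [0, -1, 1],
            [-1, 0, 1]]


def apply_contrast(b_sample):
    """Pre-multiply one posterior draw of B (J x T) by the contrast operator."""
    n_levels = len(b_sample)
    n_locations = len(b_sample[0])
    return [[sum(c[j] * b_sample[j][t] for j in range(n_levels))
             for t in range(n_locations)]
            for c in CONTRAST]


# Two made-up posterior draws of B (3 dose levels x 2 genomic locations).
posterior_B = [
    [[0.0, 1.0], [0.5, 1.0], [2.0, 1.0]],
    [[0.25, 1.0], [0.5, 1.5], [1.0, 1.0]],
]

# One 3 x T matrix of contrast effects per posterior draw.
posterior_C = [apply_contrast(b) for b in posterior_B]
```

Note that the third contrast is the sum of the first two, so its posterior draws carry no independent information beyond them; it is still convenient to keep it because it is tested as its own effect.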

Specifically, we flag the locations $t \in \{t_l;\ l = 1, \dots, T\}$ at which $C_a(t)$ exceeds some threshold $\delta$ in absolute value, based on $G$ posterior samples of $C_a(t)$, for all contrast effects. We first calculate the pointwise posterior probability of at least a $\delta$ difference at $t_l$ as

$$\widehat{p}_a(t_l) = \Pr\left\{\left|C_a(t_l)\right| > \delta \mid \mathbf{Y}\right\} \approx \frac{1}{G}\sum_{g=1}^{G} I\left\{\left|C_a(t_l)^{(g)}\right| > \delta\right\},$$

where $C_a(t_l)^{(g)}$ denotes the $g$-th posterior sample of $C_a(t)$ at $t_l$. Then, we find a cut-point $\phi_\alpha$ for $\widehat{p}_a(t_l)$ so that the expected global Bayesian FDR is less than or equal to a pre-specified level $\alpha$. We declare all $t_l$ with $\widehat{p}_a(t_l) > \phi_\alpha$ to be genomic locations at which $\left|C_a(t_l)\right|$ is greater than $\delta$.
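The two steps above can be sketched as follows. The pointwise probabilities follow the formula in the text directly. For the cut-point, the sketch assumes one common construction of the global Bayesian FDR rule: sort the probabilities in decreasing order and keep the largest set whose average of $1 - \widehat{p}_a(t_l)$ does not exceed $\alpha$. That construction, the function names, the toy draws, and the use of `>=` when flagging (so that the location defining the cut-point is itself included) are assumptions, not details confirmed by this article.

```python
def pointwise_prob(samples, delta):
    """samples: G posterior draws, each a list of T contrast values.
    Returns the fraction of draws with |C| > delta at each location."""
    n_draws = len(samples)
    n_locations = len(samples[0])
    return [sum(1 for g in range(n_draws) if abs(samples[g][t]) > delta) / n_draws
            for t in range(n_locations)]


def fdr_cutoff(p_hat, alpha):
    """Return the cut-point phi for the largest flagged set whose average
    (1 - p_hat) is <= alpha, or None if no location can be flagged."""
    ordered = sorted(p_hat, reverse=True)
    running = 0.0
    phi = None
    for k, p in enumerate(ordered, start=1):
        running += 1.0 - p
        if running / k <= alpha:
            phi = p  # the running average is non-decreasing, so the last hit wins
    return phi


def flag_locations(p_hat, alpha):
    """Indices of locations whose posterior probability reaches the cut-point."""
    phi = fdr_cutoff(p_hat, alpha)
    if phi is None:
        return []
    return [t for t, p in enumerate(p_hat) if p >= phi]


# Toy posterior draws of one contrast effect at T = 3 locations (G = 4 draws).
draws = [[0.5, 0.01, 0.2],
         [0.6, -0.02, 0.0],
         [0.4, 0.0, 0.3],
         [0.7, 0.03, -0.2]]
p_hat = pointwise_prob(draws, delta=0.1)
flagged = flag_locations(p_hat, alpha=0.2)
```

A smaller `alpha` shrinks the flagged set, while a larger `delta` lowers the pointwise probabilities themselves, so the two parameters control different aspects of the call.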

## 3. Data and Simulation

#### 3.1. Arabidopsis thaliana Treated with Herbicide Glyphosate

… $^{-2}$ s$^{-1}$ and allowed to grow for approximately 2 weeks for the 0% and 5% glyphosate-treated plants and 8 weeks for the 10% glyphosate-treated plants, until fully developed siliques were formed. Following 0, 5, and 10% glyphosate exposure on four-week-old rosettes of the twelve A. thaliana individuals, genomic DNA was isolated from cauline leaves of the newly matured siliques using the Biosprint-15 plant DNA extraction kit (Qiagen, Hilden, Germany). The tissue samples from these 12 plants were sent to the Genomics Research Laboratory at the Biocomplexity Institute of Virginia Tech for sequencing. One hundred nanograms of each DNA sample was bisulfite converted using the EZ DNA Methylation-Gold Kit (#D5005, Zymo Research, Irvine, CA, USA). Illumina DNA libraries were prepared from the purified bisulfite-converted DNA samples using the EpiGnome Methyl-Seq kit (Epicentre, Illumina Inc., Madison, WI, USA). In the end, each of six samples was barcoded, quantified by qPCR, and pooled for sequencing on an Illumina HiSeq Rapid Run flow cell (Illumina, San Diego, CA, USA). The bisulfite short-read dataset can be downloaded from the NCBI Sequence Read Archive (SRA), BioProject ID: PRJNA322493. In total, there were 872,608,912 bisulfite paired-end short reads with a length of 100 bp for each end. The coverage depth ranged from 48.6× to 76.3× across all samples.

First, the quality of the sequenced reads was checked using FastQC [15], and adapter sequences and barcodes were removed using Trimmomatic [16] and FASTX-Toolkit [17]. Low-quality reads (quality score Q < 30) were excluded. After all quality checks, bisulfite short sequences were aligned to the A. thaliana reference genome from The Arabidopsis Information Resource version 10 (TAIR10) using the Bismark aligner (v 0.14.5) with default parameters (n = 1 and l = 50) [18]. Cytosine methylation levels were extracted from the aligned reads using the Bismark methylation extractor. A total of 3,348,756 cytosines passed the preprocessing steps and thus serve as the basis on which we detect cytosines that are significantly differentially methylated across glyphosate dosage groups.

#### 3.2. Methylation Level Simulation

## 4. Results

#### 4.1. Simulation Results

#### 4.1.1. Effect of the Degree of Methylation Difference

#### 4.1.2. Effect of Sample Size

#### 4.2. Real Data from Herbicide Glyphosate Treatment of Arabidopsis thaliana

#### 4.3. Real Data from Monozygotic Twin Data with Different Pain Sensitivity Scores

… × $10^{-5}$) (Figure S3). Therefore, we adjusted the parameter settings in both WFMM ($\delta = 3.44 \times 10^{-5}$) and methylKit (difference = $4.34 \times 10^{-5}$; q value = 1.00). These parameter settings for both methods were determined by an empirical function applied to the real twin data, which is described further in Section 5. For the 769 significant DMRs detected by WFMM with $\delta = 3.44 \times 10^{-5}$, 236 genes were recognized by the gene function enrichment program DAVID (Table 3). These genes were clustered into five groups by DAVID (Figure 8; top panel). For the 2023 significant DMRs from methylKit (difference = $4.34 \times 10^{-5}$; q value = 1.00), 892 genes were recognized by DAVID (Table 3) and were clustered into 32 clusters (Figure 8; bottom panel).

## 5. Discussion

… × $10^{-5}$ (i.e., the 99.7th quantile of the absolute pairwise methylation level differences across the whole human genome). In methylKit, the q value parameter should also be adjusted accordingly: if diff is very small (<0.1), set q value = 1.00 to collect all significant DMRs. Similarly, WFMM can be empirically tailored to different methylation profiles through the $\delta$ parameter, by setting $\delta$ to the difference between the 100(1 − E)th quantile of the absolute pairwise methylation differences between two phenotype groups across the whole genome and the standard deviation of the methylation differences. For example, in our A. thaliana dataset, the 90th quantile of the absolute pairwise methylation level differences between dosage categories is 0.04 and the standard deviation of the pairwise methylation level differences between phenotype categories is 0.03; therefore, $\delta = 0.04 - 0.03 = 0.01$. In the twin dataset, the corresponding 99.7th quantile and standard deviation are $4.34 \times 10^{-5}$ and $9.2 \times 10^{-6}$, respectively; therefore, we use $\delta = 4.34 \times 10^{-5} - 9.2 \times 10^{-6} = 3.44 \times 10^{-5}$. In this way, DMC detection can be tuned to the scale of the methylation differences in each dataset.
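The empirical rule for $\delta$ (a quantile of the absolute pairwise differences minus the standard deviation of the differences) can be sketched as below. The nearest-rank quantile definition, the use of the sample standard deviation, the function names, and the ten toy differences are assumptions for illustration; the article does not specify which quantile or standard deviation estimator was used.

```python
import math
import statistics


def nearest_rank_quantile(values, q_percent):
    """q_percent-th quantile by the nearest-rank rule (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q_percent * len(ordered) / 100))
    return ordered[rank - 1]


def empirical_delta(diffs, q_percent):
    """delta = quantile of |diffs| minus the sample standard deviation of diffs."""
    quantile = nearest_rank_quantile([abs(d) for d in diffs], q_percent)
    return quantile - statistics.stdev(diffs)


# Made-up pairwise methylation differences between two groups.
diffs = [-0.04, -0.02, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.03, 0.05]
delta = empirical_delta(diffs, 90)

# Arithmetic from the A. thaliana example in the text: 0.04 - 0.03.
delta_thaliana = 0.04 - 0.03
```

Because the rule subtracts a spread estimate from a tail quantile, it can return a value near zero or even negative for heavy-tailed differences, so the result is worth inspecting before it is passed on as a threshold.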

## Supplementary Materials

… × $10^{-5}$ and q value = 1.01, difference = 0.07, on 25 monozygotic twin pairs with different pain sensitivity temperatures for each chromosome.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. DePristo, M.A.; Banks, E.; Poplin, R.; Garimella, K.V.; Maguire, J.R.; Hartl, C.; Philippakis, A.A.; Del Angel, G.; Rivas, M.A.; Hanna, M. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. **2011**, 43, 491–498.
2. Guo, J.U.; Su, Y.; Shin, J.H.; Shin, J.; Li, H.; Xie, B.; Zhong, C.; Hu, S.; Le, T.; Fan, G. Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat. Neurosci. **2014**, 17, 215–222.
3. Robinson, M.D.; Kahraman, A.; Law, C.W.; Lindsay, H.; Nowicka, M.; Weber, L.M.; Zhou, X. Statistical methods for detecting differentially methylated loci and regions. Front. Genet. **2014**, 5, 324.
4. Leek, J.T.; Scharpf, R.B.; Bravo, H.C.; Simcha, D.; Langmead, B.; Johnson, W.E.; Geman, D.; Baggerly, K.; Irizarry, R.A. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. **2010**, 11, 733–739.
5. Zilberman, D.; Gehring, M.; Tran, R.K.; Ballinger, T.; Henikoff, S. Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat. Genet. **2007**, 39, 61–69.
6. Martienssen, R.A.; Colot, V. DNA methylation and epigenetic inheritance in plants and filamentous fungi. Science **2001**, 293, 1070–1074.
7. Chodavarapu, R.K.; Feng, S.; Bernatavichute, Y.V.; Chen, P.-Y.; Stroud, H.; Yu, Y.; Hetzel, J.A.; Kuo, F.; Kim, J.; Cokus, S.J. Relationship between nucleosome positioning and DNA methylation. Nature **2010**, 466, 388–392.
8. Sainani, K. The Importance of Accounting for Correlated Observations; Elsevier: Amsterdam, The Netherlands, 2010.
9. Lee, W.; Morris, J.S. Identification of differentially methylated loci using wavelet-based functional mixed models. Bioinformatics **2016**, 32, 664–672.
10. Morris, J.S.; Carroll, R.J. Wavelet-based functional mixed models. J. R. Stat. Soc. Ser. B Stat. Methodol. **2006**, 68, 179–199.
11. Kim, G.; Clarke, C.R.; Larose, H.; Tran, H.T.; Haak, D.C.; Zhang, L.; Askew, S.; Barney, J.; Westwood, J.H. Herbicide injury induces DNA methylome alterations in Arabidopsis. PeerJ **2017**, 5, e3560.
12. Bell, J.; Loomis, A.; Butcher, L.; Gao, F.; Zhang, B.; Hyde, C.; Sun, J.; Wu, H.; Ward, K.; Harris, J. Differential methylation of the TRPA1 promoter in pain sensitivity. Nat. Commun. **2014**, 5, 2978.
13. Akalin, A.; Kormaksson, M.; Li, S.; Garrett-Bakelman, F.E.; Figueroa, M.E.; Melnick, A.; Mason, C.E. methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. **2012**, 13, R87.
14. Zhu, H.; Brown, P.J.; Morris, J.S. Robust, adaptive functional regression in functional mixed model framework. J. Am. Stat. Assoc. **2011**, 106, 1167–1179.
15. Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. Available online: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed on 5 February 2018).
16. Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics **2014**, 30, 2114–2120.
17. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. **2011**, 17, 10–12.
18. Krueger, F.; Andrews, S.R. Bismark: A flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics **2011**, 27, 1571–1572.
19. Huang, D.W.; Sherman, B.T.; Tan, Q.; Kir, J.; Liu, D.; Bryant, D.; Guo, Y.; Stephens, R.; Baseler, M.W.; Lane, H.C. DAVID bioinformatics resources: Expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. **2007**, 35, W169–W175.
20. Das, M.; Reichman, J.R.; Haberer, G.; Welzl, G.; Aceituno, F.F.; Mader, M.T.; Watrud, L.S.; Pfleeger, T.G.; Gutiérrez, R.A.; Schäffner, A.R. A composite transcriptional signature differentiates responses towards closely related herbicides in Arabidopsis thaliana and Brassica napus. Plant Mol. Biol. **2010**, 72, 545–556.
21. Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. **2009**, 10, R25.
22. Huang, D.W.; Sherman, B.T.; Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. **2009**, 4, 44–57.

**Figure 1.** Correlation of methylation levels of neighboring cytosine regions in the monozygotic (MZ) twin dataset and of neighboring cytosines in the Arabidopsis thaliana dataset. Details of the calculation are in Section 1 of the Supplementary Materials.

**Figure 2.** Receiver operating characteristic (ROC) curve comparison between the wavelet-based functional mixed model (WFMM) (blue curve) and methylKit (red curve) when the differentially methylated cutoff is 0.04 in correlated cytosines (**a**) and uncorrelated cytosines (**b**), and when the differentially methylated cutoff is 0.08 in correlated cytosines (**c**) and uncorrelated cytosines (**d**). The gray line represents points where sensitivity equals specificity.

**Figure 3.** ROC curves of WFMM (blue curve) and methylKit (red curve) as the differentially methylated cutoff increases from 0.1 to 0.25 (diff = 0.1, 0.12, 0.15, 0.2, 0.25).

**Figure 4.** Effect of different sample sizes on WFMM with $\delta$ = 0.01 and methylKit with adjusted settings (q value = 1.00; difference = 4) using correlated simulated data when the differentially methylated cutoff is 0.04.

**Figure 5.** Percentages of overlapping differentially methylated cytosines (DMCs) from methylKit with adjusted settings (difference = 4; q value = 1.00) and WFMM with δ = 0.01 in correlated simulated data when the differentially methylated cutoff is 0.04 (**a**) and in the real data (**b**).

**Figure 6.** Gene ontology of molecular function for significant differentially methylated TAIR genes detected by WFMM with δ = 0.01 (**a**) and methylKit with default settings (difference = 25; q value = 0.01) (**b**).

**Figure 7.** Gene clusters based on the gene ontology of molecular function for the top 3000 most significant genes from WFMM with δ = 0.01 (**a**), methylKit with default settings (difference = 25; q value = 0.01) (**b**), and methylKit with adjusted settings (difference = 4; q value = 1.00) (**c**).

**Figure 8.** Gene clusters based on the gene ontology of molecular function for significant genes detected by WFMM with $\delta = 3.44 \times 10^{-5}$ (**a**) and methylKit (difference = $4.34 \times 10^{-5}$; q value = 1.00) (**b**).

**Table 1.** The number of significant differentially methylated cytosines (DMCs) and of genes recognized by the Database for Annotation, Visualization and Integrated Discovery (DAVID), obtained by applying the wavelet-based functional mixed model (WFMM) with δ = 0.01, methylKit with default settings (difference = 25; q value = 0.01), and methylKit with adjusted settings (difference = 4; q value = 1.00) to a real A. thaliana dataset.

Chromosome | DMCs: WFMM (δ = 0.01) | DMCs: methylKit default (q value = 0.01; difference = 25) | DMCs: methylKit adjusted (q value = 1.00; difference = 4) | Significant genes: WFMM (δ = 0.01) | Significant genes: methylKit default (q value = 0.01; difference = 25) | Significant genes: methylKit adjusted (q value = 1.00; difference = 4) |
---|---|---|---|---|---|---|
Chr1 | 133,512 | 12,048 | 294,153 | 4041 | 3098 | 7760 |
Chr2 | 87,488 | 7627 | 244,683 | 2417 | 1887 | 5129 |
Chr3 | 113,229 | 9863 | 274,382 | 3180 | 2459 | 6254 |
Chr4 | 91,327 | 7708 | 227,539 | 2563 | 1943 | 4815 |
Chr5 | 123,027 | 10,776 | 290,090 | 3622 | 2779 | 6989 |
ChrC * | 9081 | 19 | 7306 | 0 | 0 | 0 |
ChrM * | 0 | 0 | 66 | 0 | 0 | 0 |
Total | 557,664 | 48,041 | 1,338,219 | 15,823 | 12,166 | 30,947 |

**Table 2.** Number of intersecting genes between the 484 genes identified by Das et al. [20] as related to herbicide glyphosate stress and the significant genes identified by WFMM and methylKit.

Methods | Number of Significant Genes | Number of Shared Genes in All Significant Genes | Number of Shared Genes in Top 3000 Most Significant Genes |
---|---|---|---|
WFMM δ = 0.01 | 15,823 | 238 | 51 |
methylKit default; q value = 0.01; difference = 25 | 12,166 | 181 | 39 |
methylKit adjusted; q value = 1.00; difference = 4 | 30,947 | 466 | 44 |

**Table 3.** Number of significant DMRs and of genes recognized by DAVID, obtained by applying WFMM with $\delta = 3.44 \times 10^{-5}$ and methylKit with difference = $4.34 \times 10^{-5}$ and q value = 1.00 to 25 monozygotic (MZ) twin pairs with different pain sensitivity temperatures.

Methods | Number of Significant DMRs | Number of Significant Genes Using DAVID |
---|---|---|
WFMM $\delta = 3.44 \times 10^{-5}$ | 769 | 236 |
methylKit adjusted; q value = 1.00; difference = $4.34 \times 10^{-5}$ | 2023 | 892 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tran, H.; Zhu, H.; Wu, X.; Kim, G.; Clarke, C.R.; Larose, H.; Haak, D.C.; Askew, S.D.; Barney, J.N.; Westwood, J.H.;
et al. Identification of Differentially Methylated Sites with Weak Methylation Effects. *Genes* **2018**, *9*, 75.
https://doi.org/10.3390/genes9020075
