Saliva as a Blood Alternative for Genome-Wide DNA Methylation Profiling by Methylated DNA Immunoprecipitation ( MeDIP ) Sequencing

Background: Interrogation of DNA methylation profiles hold promise for improved diagnostics, as well as the delineation of the aetiology for common human diseases. However, as the primary tissue of the disease is often inaccessible without complicated and inconvenient interventions, there is an increasing interest in peripheral surrogate tissues. Whereas most work has been conducted on blood, saliva is now becoming recognized as an interesting alternative due to the simple and non-invasive manner of collection allowing for self-sampling. Results: In this study we have evaluated if saliva samples are suitable for DNA methylation studies using methylated DNA immunoprecipitation coupled to next-generation sequencing (MeDIP-seq). This was done by comparing the DNA methylation profile in saliva against the benchmark profile of peripheral blood from three individuals. We show that the output, quality, and depth of paired-end 50 bp sequencing reads are comparable between saliva and peripheral blood and, moreover, that the distribution of reads along genomic regions are similar and follow canonical methylation patterns. Conclusion: In summary, we show that high-quality MeDIP-seq data can be generated using saliva, thus supporting the future use of saliva in the generation of DNA methylation information at annotated genes, non-RefSeq genes, and repetitive elements relevant to human disease.


Introduction
DNA methylation plays a pivotal role in gene regulation and has been implicated in several human conditions including cancer, cardiovascular disease, and neuropsychiatric conditions [1][2][3][4].In addition, numerous epidemiological epigenetic studies investigating associations between environmental exposures and DNA methylation have found biologically sensible correlations.An example of such findings is a well-studied change in DNA methylation levels at the AHRR and CYP1A1 loci in the blood of children in response to maternal smoking [5,6], suggesting an epigenetic role in diseases as a response to environmental stimuli.The interest in easily-accessible surrogate tissue for epigenetic epidemiology has resulted in an increased focus on DNA methylation profiling in saliva, buccal, and blood-derived cell material.Especially in epigenetic studies of mental disorders, where brain tissue is mostly inaccessible, the use of surrogate tissue is of major interest.Biologically-relevant findings Phred scores above 30 along the entire read, and the anticipated GC content skewedness of methylated DNA enriched samples (Supplementary Figure S1).Clean reads were aligned to the human genome (hg38) using the Burrows-Wheeler aligner (BWA) algorithm and filtered using quality-associated attributes.The filtering resulted in a final dataset of approximately 64 million reads per library.
As chromosomal abnormalities may lead to incorrect interpretation of the data, an analysis of copy number variations in the six samples was conducted using only non-CpG fragments, as these are not enriched.No copy number variations (CNVs) > 1 Mb were found in any of the samples (Supplementary Figure S2).Pre-analytic quality checks of the mapped libraries (excluding sex chromosomes, as study subjects have different sexes) displayed a high similarity of fragment size distribution, complexity, and depth, as well as genomic coverage among samples (Supplementary Figures S3-S6).This clearly indicated that all libraries had sufficient diversity and depth of reads to saturate the profile of the reference genome in a reproducible manner.
CpG-site enrichment was visualized by the number of reads interrogated as a function of 500 bp genomic bins ranked by the number of reads contained within the bin.Furthermore, the enrichment was measured as the observed/expected ratio of CpGs within the reference genome (Supplementary Figure S7).This verified that all sample libraries were indeed enriched for CpG sites (score > 1) but also that, in saliva samples, CpG-site enrichment was more pronounced than in blood samples (mean scores of 1.36 and 1.16, respectively).The enrichment score is somewhat lower than reported elsewhere for human embryonic stem cells (1.64), human dried blood spots (1.63), and rodent hippocampus (1.4) [43][44][45].
A pair-wise correlation using all mapped reads in 500 bp windows displayed Pearson correlation coefficients (PCCs) between 0.896 and 0.984 (Figure 1A).
Epigenomes 2017, 1, 14 3 of 16 methylated DNA enriched samples (Supplementary Figure S1).Clean reads were aligned to the human genome (hg38) using the Burrows-Wheeler aligner (BWA) algorithm and filtered using quality-associated attributes.The filtering resulted in a final dataset of approximately 64 million reads per library.
As chromosomal abnormalities may lead to incorrect interpretation of the data, an analysis of copy number variations in the six samples was conducted using only non-CpG fragments, as these are not enriched.No copy number variations (CNVs) > 1 Mb were found in any of the samples (Supplementary Figure S2).Pre-analytic quality checks of the mapped libraries (excluding sex chromosomes, as study subjects have different sexes) displayed a high similarity of fragment size distribution, complexity, and depth, as well as genomic coverage among samples (Supplementary Figures S3-S6).This clearly indicated that all libraries had sufficient diversity and depth of reads to saturate the profile of the reference genome in a reproducible manner.
CpG-site enrichment was visualized by the number of reads interrogated as a function of 500 bp genomic bins ranked by the number of reads contained within the bin.Furthermore, the enrichment was measured as the observed/expected ratio of CpGs within the reference genome (Supplementary Figure S7).This verified that all sample libraries were indeed enriched for CpG sites (score > 1) but also that, in saliva samples, CpG-site enrichment was more pronounced than in blood samples (mean scores of 1.36 and 1.16, respectively).The enrichment score is somewhat lower than reported elsewhere for human embryonic stem cells (1.64), human dried blood spots (1.63), and rodent hippocampus (1.4) [43][44][45].
A pair-wise correlation using all mapped reads in 500 bp windows displayed Pearson correlation coefficients (PCCs) between 0.896 and 0.984 (Figure 1A).The intra-individual correlation rate was lower than expected but, interestingly, highest amongst saliva samples.This may partly be explained by the lower enrichment observed in the blood samples or differences in cell-type composition, which is known to vary substantially in buffy coat as a function of season, age, and sex [46][47][48].However, a high correlation, per se, is still expected, even without correction for cell composition and sex, as methylation status at most CpGs is conserved across tissues [49][50][51][52][53].This is also illustrated in a methylation array based pilot study where, at 70%  The intra-individual correlation rate was lower than expected but, interestingly, highest amongst saliva samples.This may partly be explained by the lower enrichment observed in the blood samples or differences in cell-type composition, which is known to vary substantially in buffy coat as a function of season, age, and sex [46][47][48].However, a high correlation, per se, is still expected, even without correction for cell composition and sex, as methylation status at most CpGs is conserved across tissues [49][50][51][52][53].This is also illustrated in a methylation array based pilot study where, at 70% of CpGs, methylation positively correlated between saliva and blood even when cell composition was uncorrected for [54].Moreover, only a minority of CpGs were statistically differently methylated between the two tissues and the number did not change upon correction for cell heterogeneity.
Notably, in the between-tissue comparison, a higher-end correlation was always found inter-individually, suggesting that genotype-associated methylation (methylation quantitative trait loci, mQTLs, or CpG-SNPs) or early life environmental exposures explain the intra-individual difference.Unsupervised hierarchical clustering based on all uniquely-mapped reads indicated that the saliva or peripheral blood origin of the DNA is the main factor in sample differentiation (Figure 1B).
Characterizing the samples in greater detail by the spread and distribution of reads using genomic bins of 300 bp with 150 bp overlap showed that both the median and the interquartile ranges (IQRs) between the samples were comparable (Figure 2A).It is anticipated that even with significant differently methylated regions (DMRs) present, the global distribution of reads should be comparable between samples.Therefore, the sum of reads from the lowest covered percentile of the genome to the highest should followed similar trajectories in samples to be compared.This is indeed the case in the current sample set (Figure 2B).Performing a principle component analysis showed that the first component explained 24% of the common variation, dividing the samples by tissue type.The second component stratifies the samples by individual (Figure 2C). of CpGs, methylation positively correlated between saliva and blood even when cell composition was uncorrected for [54].Moreover, only a minority of CpGs were statistically differently methylated between the two tissues and the number did not change upon correction for cell heterogeneity.Notably, in the between-tissue comparison, a higher-end correlation was always found interindividually, suggesting that genotype-associated methylation (methylation quantitative trait loci, mQTLs, or CpG-SNPs) or early life environmental exposures explain the intra-individual difference.Unsupervised hierarchical clustering based on all uniquely-mapped reads indicated that the saliva or peripheral blood origin of the DNA is the main factor in sample differentiation (Figure 1B).
Characterizing the samples in greater detail by the spread and distribution of reads using genomic bins of 300 bp with 150 bp overlap showed that both the median and the interquartile ranges (IQRs) between the samples were comparable (Figure 2A).It is anticipated that even with significant differently methylated regions (DMRs) present, the global distribution of reads should be comparable between samples.Therefore, the sum of reads from the lowest covered percentile of the genome to the highest should followed similar trajectories in samples to be compared.This is indeed the case in the current sample set (Figure 2B).Performing a principle component analysis showed that the first component explained 24% of the common variation, dividing the samples by tissue type.The second component stratifies the samples by individual (Figure 2C).
Taken together, this indicates that no major technical differences exist between the blood and saliva MeDIP-seq datasets.BoxWhisker plots of filtered sequencing reads in 300 bp genomic bins displaying the median, 25th, and 75th percentile, and whiskers (mediumplus/minus interquartile range ×2); (B) cumulative distribution plot showing the accumulation of sequencing reads in 300 bp genomic bins from the lowest covered percentile of the genome to the highest; and (C) Principal Component Analysis (PCA) plot of PC1 and PC2 using all quality filtered and mapped sequencing reads.

Few Differently-Methylated Windows are Detected between Saliva and Blood
To further characterize the MeDIP-seq dataset, we computed library size normalized 100 bp differently-methylated windows (DMWs) between the pairwise matched blood and saliva samples, excluding sex chromosomes.Setting a threshold of significance at p < 0.05 (Bonferroni corrected) the intra-individual comparison returned 322, 608, and 349 DMWs for individuals 2022, 2023, and 2025, respectively (Supplementary Figure S8).Only a minimal overlap of DMWs existed between the three Taken together, this indicates that no major technical differences exist between the blood and saliva MeDIP-seq datasets.

Few Differently-Methylated Windows are Detected between Saliva and Blood
To further characterize the MeDIP-seq dataset, we computed library size normalized 100 bp differently-methylated windows (DMWs) between the pairwise matched blood and saliva samples, excluding sex chromosomes.Setting a threshold of significance at p < 0.05 (Bonferroni corrected) the intra-individual comparison returned 322, 608, and 349 DMWs for individuals 2022, 2023, and 2025, respectively (Supplementary Figure S8).Only a minimal overlap of DMWs existed between the three datasets, again indicating that no consistent technical variation was introduced during processing of the saliva samples.Importantly, this does not imply that no biological difference exists.
We next computed the DMWs between the concatenated saliva (S2022, S2023, and S2025) and blood (B2022, B2023, and B2025) dataset excluding sex chromosomes.Depicted in an MA plot of the methylation difference as log ratio ("Minus") against the mean methylation ("Average"), the analyses showed a general tendency of higher methylation in saliva and that all significant DMWs were, indeed, hypermethylated in saliva (Supplementary Figure S9A).This appears in contrast to previous findings [19], but may be due to differences in cell composition and technical variation.Most of the DMWs were located in intergenic regions, and fewer in intragenic regions and exons (Supplementary Figure S9B).

Comparable DNA Methylation Patterns between Blood and Saliva at Defined Genomic Regions
With the intent to inspect the methylation profile at genomic regions with a known canonical profile, mean read counts in the saliva and blood samples were calculated for CGIs (±4 kb), the first exon (±2 kb), thereby including transcription start sites (TSS), and Alu repeats (±2 kb) (Figure 3A-C).The expected low level of DNA methylation at CGIs with a step increase in the flanking shore regions was evident.Moreover, the DNA methylation profiles of saliva and blood were almost indistinguishable.Likewise, the methylation trajectories overlapping the TSS and the first exon were in concordance with previous observed profiles, with a sharp drop in methylation at the TSS and a lower methylation level in the first exon compared to downstream parts of the gene body [55][56][57][58][59][60].Notably, the mean read-count level appeared slightly lower in saliva, which is in agreement with previous findings [19,54,61,62].CpG sites within repetitive elements such as L1 and Alu are normally hypermethylated to prevent their expression.Especially, Alu repeats tend to accumulate in gene-rich regions and span roughly one quarter of all CpG dinucleotides in the human genome [63][64][65].
Epigenomes 2017, 1, 14 5 of 16 datasets, again indicating that no consistent technical variation was introduced during processing of the saliva samples.Importantly, this does not imply that no biological difference exists.We next computed the DMWs between the concatenated saliva (S2022, S2023, and S2025) and blood (B2022, B2023, and B2025) dataset excluding sex chromosomes.Depicted in an MA plot of the methylation difference as log ratio ("Minus") against the mean methylation ("Average"), the analyses showed a general tendency of higher methylation in saliva and that all significant DMWs were, indeed, hypermethylated in saliva (Supplementary Figure S9A).This appears in contrast to previous findings [19], but may be due to differences in cell composition and technical variation.Most of the DMWs were located in intergenic regions, and fewer in intragenic regions and exons (Supplementary Figure S9B).

Comparable DNA Methylation Patterns between Blood and Saliva at Defined Genomic Regions
With the intent to inspect the methylation profile at genomic regions with a known canonical profile, mean read counts in the saliva and blood samples were calculated for CGIs (±4 kb), the first exon (±2 kb), thereby including transcription start sites (TSS), and Alu repeats (±2 kb) (Figure 3A-C).The expected low level of DNA methylation at CGIs with a step increase in the flanking shore regions was evident.Moreover, the DNA methylation profiles of saliva and blood were almost indistinguishable.Likewise, the methylation trajectories overlapping the TSS and the first exon were in concordance with previous observed profiles, with a sharp drop in methylation at the TSS and a lower methylation level in the first exon compared to downstream parts of the gene body [55][56][57][58][59][60].Notably, the mean read-count level appeared slightly lower in saliva, which is in agreement with previous findings [19,54,61,62].CpG sites within repetitive elements such as L1 and Alu are normally hypermethylated to prevent their expression.Especially, Alu repeats tend to accumulate in gene-rich regions and span roughly one quarter of all CpG dinucleotides in the human genome [63][64][65].In line with this, plotting the read count over Alu elements showed a considerable increase compared to the flanking regions.Interestingly, within Alu elements the methylation profile of saliva and blood, although following the same trajectory, differs, with a read count higher in saliva (p < 0.0001, Welch t-test).Depending on the subtype of the Alu element analysed, both saliva hypo-and hypermethylation has been reported [62,66].Importantly, however, a consistent low correlation of Alu methylation levels between saliva and blood was observed, which is indicative of high intra-individual variation in methylation across cell types and tissues.
Next we examined genome-wide variation between the saliva and blood sample sets in genomic regions annotated to the following features: proximal promoters (2 kb upstream of RefSeq genes), RefSeq genes, coding sequences (CDS), LINEs (L1 and L2), SINEs (Alu), previously reported mental disorder-associated genes found by DNA methylation studies using peripheral tissue (BDNF, SLC6A4, CACNA1C, HTR1A, COMT, ST6GALNAC1, DRD2, and GAD1) [8,[67][68][69][70][71][72][73][74][75], and the above described DMWs.A box whisker plot of read counts in the grouped datasets at each genomic feature and a matching line graph for the individual datasets (normalized and summarized) showed a very similar profile with little variability and inter-individual difference, with DMWs as the natural exception (Figure 4A,B).In line with this, plotting the read count over Alu elements showed a considerable increase compared to the flanking regions.Interestingly, within Alu elements the methylation profile of saliva and blood, although following the same trajectory, differs, with a read count higher in saliva (p < 0.0001, Welch t-test).Depending on the subtype of the Alu element analysed, both saliva hypo-and hypermethylation has been reported [62,66].Importantly, however, a consistent low correlation of Alu methylation levels between saliva and blood was observed, which is indicative of high intraindividual variation in methylation across cell types and tissues.
Next we examined genome-wide variation between the saliva and blood sample sets in genomic regions annotated to the following features: proximal promoters (2 kb upstream of RefSeq genes), RefSeq genes, coding sequences (CDS), LINEs (L1 and L2), SINEs (Alu), previously reported mental disorder-associated genes found by DNA methylation studies using peripheral tissue (BDNF, SLC6A4, CACNA1C, HTR1A, COMT, ST6GALNAC1, DRD2, and GAD1) [8,[67][68][69][70][71][72][73][74][75], and the above described DMWs.A box whisker plot of read counts in the grouped datasets at each genomic feature and a matching line graph for the individual datasets (normalized and summarized) showed a very similar profile with little variability and inter-individual difference, with DMWs as the natural exception (Figure 4A,B).We next examined intragenic and intergenic CGIs, as well as the associated upstream and downstream shores (±2 kb of CGI) (Figure 5A,B).Notably, saliva displayed a slightly larger variability, as judged by the IQR, which is in concordance with previous findings [18].Taken together, the overall similar profiles between peripheral blood and saliva emphasize the validity of MeDIP-seq results based on saliva.We next examined intragenic and intergenic CGIs, as well as the associated upstream and downstream shores (±2 kb of CGI) (Figure 5A,B).Notably, saliva displayed a slightly larger variability, as judged by the IQR, which is in concordance with previous findings [18].Taken together, the overall similar profiles between peripheral blood and saliva emphasize the validity of MeDIP-seq results based on saliva.Box whisker plots of filtered reads overlapping intergenic or intragenic CGIs or shores (±2 kb from CGI) in 300 bp genomic bins.Displayed are the median, 25th, and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized (to the genomic window with highest magnitude of difference) and summarized (mean value with 95% confidence interval) line graphs of filtered reads for all saliva and blood samples overlapping intergenic or intragenic CGIs or shores.(A) Box whisker plots of filtered reads overlapping intergenic or intragenic CGIs or shores (±2 kb from CGI) in 300 bp genomic bins.Displayed are the median, 25th, and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized (to the genomic window with highest magnitude of difference) and summarized (mean value with 95% confidence interval) line graphs of filtered reads for all saliva and blood samples overlapping intergenic or intragenic CGIs or shores.
Tissue-specific DNA methylation is known to be enriched in gene bodies, and especially at intragenic CGI shores [31,55,76].Focusing on such intragenic CGI shores we found a clear tissue-specific distinction, as judged by hierarchical and principal component clustering (Supplementary Figure S10A,B).From 300 bp genomic windows overlapping downstream intragenic CGI shores (Figure 6), overlapping genes restricted to a within-tissue PCC above 0.95 and a same directional deviation from the mean were selected, generating a list of 577 genes (Figure 6).Tissue-specific DNA methylation is known to be enriched in gene bodies, and especially at intragenic CGI shores [31,55,76].Focusing on such intragenic CGI shores we found a clear tissuespecific distinction, as judged by hierarchical and principal component clustering (Supplementary Figure S10A,B).From 300 bp genomic windows overlapping downstream intragenic CGI shores (Figure 6A), overlapping genes restricted to a within-tissue PCC above 0.95 and a same directional deviation from the mean were selected, generating a list of 577 genes (Figure 6B). Figure 6.Tissue stratification at downstream intragenic CGI shores.(A) MA plot depicting the average read count (x-axis) in the complete dataset restricted to downstream intragenic CGI shores against fold difference (y-axis) between the two tissues; (B) MA plot from (A) overlayed with hypermethylated regions in blood (red dots) and saliva (green dots) defined as 300 bp regions displaying within-tissue PCCs > 0.95 and above mean coverage; (C) MA plot from (A) overlayed with liver (green dots), squamous epithelial (red dots), neutrophil (purple dots), and monocyte (orange dots)-expressed genes.
Gene ontology (GO) analyses using the 577 genes returned two pathways significant after Bonferroni correction (p < 0.05), being B-cell and T-cell activation (Table 1).
Table 1.GO enrichment analysis.PANTHER overrepresentation test using PANTHER pathways and a significance threshold of p < 0.05 (Bonferroni corrected).Of 577 genes featuring differently-covered downstream intragenic CGI shores, 549 genes were uniquely mapped to the database.The pathways of B-cell and T-cell activation were enriched above four-fold and passed correction for multiple testing.We also performed a DAVID functional annotation using either default settings or restricted to tissue expression.Five functional terms were significant after Benjamini-Hochberg (BH) correction (p < 0.05) (Supplementary Table S1) using default settings.We acknowledge that the small sample set calls for extreme caution in the interpretation of the functional enrichment analysis results.Nevertheless, the enriched terms related to cellular processes involved in alternative splicing and post-transcriptional processing.Utilizing only databases on tissue of expression in the functional annotation of the 577 genes, the top three significant tissues were brain, peripheral lymphocytes, and epithelium (Supplementary Table S2).Whereas only brain tissue survived correction for multiple testing, this finding still indicates that the methylation differences observed are a function of the tissue of origin and, thus, associated with phenotypic differences.Figure 6.Tissue stratification at downstream intragenic CGI shores.(A) MA plot depicting the average read count (x-axis) in the complete dataset restricted to downstream intragenic CGI shores against fold difference (y-axis) between the two tissues; (B) MA plot from (A) overlayed with hypermethylated regions in blood (red dots) and saliva (green dots) defined as 300 bp regions displaying within-tissue PCCs > 0.95 and above mean coverage; (C) MA plot from (A) overlayed with liver (green dots), squamous epithelial (red dots), neutrophil (purple dots), and monocyte (orange dots)-expressed genes.

Bonferroni
Gene ontology (GO) analyses using the 577 genes returned two pathways significant after Bonferroni correction (p < 0.05), being B-cell and T-cell activation (Table 1).
Table 1.GO enrichment analysis.PANTHER overrepresentation test using PANTHER pathways and a significance threshold of p < 0.05 (Bonferroni corrected).Of 577 genes featuring differently-covered downstream intragenic CGI shores, 549 genes were uniquely mapped to the database.The pathways of B-cell and T-cell activation were enriched above four-fold and passed correction for multiple testing.We also performed a DAVID functional annotation using either default settings or restricted to tissue expression.Five functional terms were significant after Benjamini-Hochberg (BH) correction (p < 0.05) (Supplementary Table S1) using default settings.We acknowledge that the small sample set calls for extreme caution in the interpretation of the functional enrichment analysis results.Nevertheless, the enriched terms related to cellular processes involved in alternative splicing and post-transcriptional processing.Utilizing only databases on tissue of expression in the functional annotation of the 577 genes, the top three significant tissues were brain, peripheral lymphocytes, and epithelium (Supplementary Table S2).Whereas only brain tissue survived correction for multiple testing, this finding still indicates that the methylation differences observed are a function of the tissue of origin and, thus, associated with phenotypic differences.

Bonferroni
In a reverse approach we looked for specific patterns in the downstream intragenic CGI shores based on cell-type-specific expressed genes.For this we extracted a gene list from the GeneCardsSuite specific for neutrophils, monocytes, hepatocytes, and oral cavity squamous epithelial cells and overlapped it with the downstream intragenic CGI shores (Figure 6C).No specific patterns were observed, however, it should be noted that, in contrast to promoter CpG methylation, no unidirectional relationship between gene body methylation and expression exists [77][78][79].
With the intent to compare to other MeDIP-seq datasets we included GEO-available datasets on foetal brain, peripheral blood mononuclear cells (PBMCs), and breast epithelial cells.Restricting the analysis to previously identified blood-brain tissue-specific DMRs (TS-DMRs) (n = 50) within CGIs or regions with high inter-individual difference (n = 43) [20] the samples clustered according to tissue or individual of origin, respectively (Figure 7).In the former comparison the linear correlation between foetal brain and saliva (R = 0.173) was, furthermore, larger than between foetal brain and blood (R = −0.06)whereas PBMCs and blood was more correlated (R = 0.639) than PBMCs and saliva (R = 0.558) (Supplementary Figure S11).Notably, in the latter comparison the matched blood-saliva samples clustered together, whereas only the related brain samples clustered closely together.
Epigenomes 2017, 2, x FOR PEER REVIEW 10 of 17 With the intent to compare to other MeDIP-seq datasets we included GEO-available datasets on foetal brain, peripheral blood mononuclear cells (PBMCs), and breast epithelial cells.Restricting the analysis to previously identified blood-brain tissue-specific DMRs (TS-DMRs) (n = 50) within CGIs or regions with high inter-individual difference (n = 43) [20] the samples clustered according to tissue or individual of origin, respectively (Figure 7).In the former comparison the linear correlation between foetal brain and saliva (R = 0.173) was, furthermore, larger than between foetal brain and blood (R = −0.06)whereas PBMCs and blood was more correlated (R = 0.639) than PBMCs and saliva (R = 0.558) (Supplementary Figure S11).Notably, in the latter comparison the matched blood-saliva samples clustered together, whereas only the related brain samples clustered closely together.Taken together, the presented data suggests that even in a suboptimal matched and small sized sample set where up to 74% of the DNA may originate from the same tissue, white blood cells [29], samples can be separated by tissue using MeDIP-seq.A previous study has measured the inter-individual saliva and blood methylation profile to be at a positive correlation for 88.5% of CpGs [18].Limitations to this study include a small sample size and the inability to correct for cell heterogeneity, making a replication study warranted.However, although tissue, age, and sex specific methylation are well described-on a global level, inter-individual correlation remains high [80][81][82][83][84][85], justifying the comparison described herein.A further limitation is the presence of bacterial DNA in saliva samples.However, with the collection kit used in the current study the content has been shown to be low (median 11.8%) compared to mouthwashes and buccal swabs (median 60-90%) [86].

Conclusions
We compared MeDIP-seq profiles of saliva and blood from three individuals.The methylation profile of blood has previously been analysed by MeDIP-seq and follows a pattern also observed in other cell types [20,36,56].On a global level the data quality and methylation profiles described herein are highly similar between saliva and peripheral blood, suggesting that the use of MeDIP-seq on DNA extracted from saliva is a feasible and valid approach for methylome studies.The use of saliva-derived DNA could alleviate logistic challenges, thereby increasing the number of participants in future epigenetic epidemiology studies.Taken together, the presented data suggests that even in a suboptimal matched and small sized sample set where up to 74% of the DNA may originate from the same tissue, white blood cells [29], samples can be separated by tissue using MeDIP-seq.A previous study has measured the inter-individual saliva and blood methylation profile to be at a positive correlation for 88.5% of CpGs [18].Limitations to this study include a small sample size and the inability to correct for cell heterogeneity, making a replication study warranted.However, although tissue, age, and sex specific methylation are well described-on a global level, inter-individual correlation remains high [80][81][82][83][84][85], justifying the comparison described herein.A further limitation is the presence of bacterial DNA in saliva samples.However, with the collection kit used in the current study the content has been shown to be low (median 11.8%) compared to mouthwashes and buccal swabs (median 60-90%) [86].

Biological Samples
As part of the CHANGE project [87], peripheral blood (10 mL) and saliva (2 mL) samples were collected in EDTA tubes (BD, Franklin Lakes, NJ, USA) and Oragene saliva collection tubes for DNA extraction (OG-500, DNA Genotek, Ottawa, ON, Canada), respectively, from adult individuals.Blood samples were stored <6 months at −20 • C and saliva samples 6-12 months at ambient temperatures before extraction.Genomic DNA (gDNA) from peripheral blood was extracted using Maxwell (Promega, WI, USA) and from saliva with the prepIT-L2P kit (DNA Genotek) and stored at −80 • C. Matched saliva and blood samples from three individuals were used in the study; individual 2022 (male, age 54), 2023 (female, age 45), and 2025 (female, age 33).Concentration of the extracted genomic DNA (gDNA) was measured using the Nanodrop 1000 instrument (Thermo Scientific, Waltham, MA, USA).Additionally, a MeDIP-seq input control (blood) from an unrelated study was included in the enrichment analysis.

MeDIP-seq
Extracted gDNA was sonicated in a Pico Bioruptor (Diagenode, Seraing, Belgium) at 10 ng/µL for 13 cycles of 30 s on 30 s off to a mean fragment size of about 180 bp.Fragment length distribution was assessed by microelectrophoresis using the Qiaxcel instrument and a high-resolution gel cartridge (Qiagen, Hilden, Germany).Five-hundred nanograms of sonicated gDNA was subsequently used for end-repair and adaptor ligation employing the NEBNext Ultra DNA Library Prep Kit for Illumina (New England Biolabs, MA, USA).The reaction mix was purified using Ampure XP beads (Beckman-Coulter, CA, USA) after which MeDIP was set up on the SX-8G-IP-Star Compact robot (Diagenode) according to manufacturer's instructions applying the Auto MeDIP kit (Diagenode) and including unmethylated and methylated spike-in controls.Effectively 415 ng of adaptor ligated DNA was used in the MeDIP process.Antibody incubation was performed at 4 • C for 15 h.Immunoprecipitated samples were magnetically purified on the robot using the Auto iPure v2 kit (Diagenode) according to the manufacturer's instructions.Recovery and enrichment was evaluated by qPCR using primer sets specific for the spike-in controls.Minimum criteria were set to 10% recovery and 25-fold enrichment [88].Based on the recovery rate samples were PCR-amplified at 13 cycles using multiplex oligos (New England Biolabs).Samples were size-selected with Ampure XP beads on the SX-8G-IP-Star Compact robot (Diagenode) using a total of 90 µL beads per sample and eluted in 25 µL DNase-free H 2 O. Purity and fragment length distribution was evaluated on the Qiaxcel instrument.Post-amplification enrichment was verified by qPCR using primer pairs targeting the endogenous hypermethylated promoter region of testis specific histone 2B (TSH2B) (Diagenode, cat.no.C17011041) or the hypomethylated transcription start site (TSS) of glyceraldehyde 3-phosphate dehydrogenase (GAPDH) (Diagenode, cat.no.C17011047).

Sequencing and Bioinformatics
Samples were PE50 sequenced in two lanes on a HiSeq2000 instrument (Illumina, San Diego, CA, USA), generating around 68 million reads per sample.The reads were cleaned using Soapnuke 1.5.0 (BGI, Shenzhen, China) applying the following filters; remove reads with adaptors, remove reads with >10% unknown reads, and remove reads with >50% low quality bases (Q score < 5).Clean reads only qualify if Q 20% > 85%.Quality control of the cleaned fastq files was performed using FastQC.Utilizing the Galaxy platform [89], the pair-end reads were aligned to the human genome (hg38) using BWA [90].SAM files were converted to BAM files and filtered to include only de-duplicated, uniquely-mapped reads resulting in 58-67 million clean reads per sample.BAM files were imported to R and SeqMonk v0.32.1 (Barbraham Institute, Cambridge, UK).In SeqMonk segments were generated by quantitation of read counts in genetic windows of 300 bp (150 bp overlap) corrected for total read counts (reads per million).Duplicates were ignored.The R package MEDIPS was used for quality control, genomic coverage estimation, and differential coverage analysis in bin sizes of 100 or 500 bp [91].Differential coverage was calculated with an exact test using trimmed mean of M-values (TMM) corrected libraries with a variance estimated by the quantile-adjusted conditional maximum likelihood (qCML) method.The p-values were corrected for multiple testing using the Bonferroni procedure.CNV analysis was performed using the QSEA and HMMcopy bioconductor packages, using a genomic window size of 1 × 10 6 bp.Annotation was performed using the DAVID bioinformatics and The Gene Ontology Consortium database [92][93][94][95].Publicly-available GEO MeDIP-seq datasets on breast luminal epithelium cells (GSM493615 and GSM613843), PBMCs (GSM543023), and foetal brain (GSM669615, GSM707019, and GSM669614) (methylated fragments enriched using antibody from Diagenode) were downloaded and lifted over to the hg38 build.Blood and brain tissue and inter-individual specific DMRs were obtained from [20].

Figure 1 .
Figure 1.Pair-wise Pearson correlation analyses of saliva and blood samples MeDIP-seq methylomes.(A)The correlation value (r) for each pair-wise correlation is shown.Coefficients typed in green (for blood) and blue (for saliva) mark within tissue comparisons, whereas coefficients in orange denote between-tissue comparisons; and (B) unsupervised hierarchical clustering of the analysed six blood and saliva samples based on correlation coefficients.

Figure 1 .
Figure 1.Pair-wise Pearson correlation analyses of saliva and blood samples MeDIP-seq methylomes.(A)The correlation value (r) for each pair-wise correlation is shown.Coefficients typed in green (for blood) and blue (for saliva) mark within tissue comparisons, whereas coefficients in orange denote between-tissue comparisons; and (B) unsupervised hierarchical clustering of the analysed six blood and saliva samples based on correlation coefficients.

Figure 2 .
Figure 2.Comparable variability and distribution of reads in blood and saliva samples.(A) BoxWhisker plots of filtered sequencing reads in 300 bp genomic bins displaying the median, 25th, and 75th percentile, and whiskers (mediumplus/minus interquartile range ×2); (B) cumulative distribution plot showing the accumulation of sequencing reads in 300 bp genomic bins from the lowest covered percentile of the genome to the highest; and (C) Principal Component Analysis (PCA) plot of PC1 and PC2 using all quality filtered and mapped sequencing reads.

Figure 2 .
Figure 2.Comparable variability and distribution of reads in blood and saliva samples.(A) BoxWhisker plots of filtered sequencing reads in 300 bp genomic bins displaying the median, 25th, and 75th percentile, and whiskers (mediumplus/minus interquartile range ×2); (B) cumulative distribution plot showing the accumulation of sequencing reads in 300 bp genomic bins from the lowest covered percentile of the genome to the highest; and (C) Principal Component Analysis (PCA) plot of PC1 and PC2 using all quality filtered and mapped sequencing reads.

Figure 3 .
Figure 3. Canonical methylation patterns at genomic features.Read count in 300 bp sliding genomic windows with 150 bp overlap across genetic regions corresponding to (A) CGI (CpG Island) ±4 kb flanking regions; (B) the first exon of RefSeq genes ±2 kb flanking regions; and (C) Alu repeat elements ±2 kb flanking regions.TSS: transcription start sites.

Figure 4 .
Figure 4. Equal coverage and read spread at specific genomic features for saliva and blood MeDIP-seq results.(A) Box whisker plots of filtered reads overlapping six different genomic features in 300 bp genomic bins.Displayed are the median, 25th and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized and summarized line graphs of filtered reads matching the six specified genomic features for all saliva and blood samples.

Figure 4 .
Figure 4. Equal coverage and read spread at specific genomic features for saliva and blood MeDIPseq results.(A) Box whisker plots of filtered reads overlapping six different genomic features in 300 bp genomic bins.Displayed are the median, 25th and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized and summarized line graphs of filtered reads matching the six specified genomic features for all saliva and blood samples.

Figure 5 .
Figure 5. Equal coverage and spread at CGIs and shores for saliva and blood MeDIP-seq results.(A)Box whisker plots of filtered reads overlapping intergenic or intragenic CGIs or shores (±2 kb from CGI) in 300 bp genomic bins.Displayed are the median, 25th, and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized (to the genomic window with highest magnitude of difference) and summarized (mean value with 95% confidence interval) line graphs of filtered reads for all saliva and blood samples overlapping intergenic or intragenic CGIs or shores.

Figure 5 .
Figure 5. Equal coverage and spread at CGIs and shores for saliva and blood MeDIP-seq results.(A)Box whisker plots of filtered reads overlapping intergenic or intragenic CGIs or shores (±2 kb from CGI) in 300 bp genomic bins.Displayed are the median, 25th, and 75th percentile, whiskers (median plus/minus interquartile range ×2); and (B) normalized (to the genomic window with highest magnitude of difference) and summarized (mean value with 95% confidence interval) line graphs of filtered reads for all saliva and blood samples overlapping intergenic or intragenic CGIs or shores.

Figure 7 .
Figure 7. Hierarchical clustering based on 50 TS-DMRs or 43 individual specific DMRs.GEO available datasets as well as the saliva and blood datasets were hierarchical clustered based correlation coefficients restricted to (A) 50 TS-DMRs; and (B) 43 individual differentiating DMRs.

Figure 7 .
Figure 7. Hierarchical clustering based on 50 TS-DMRs or 43 individual specific DMRs.GEO available datasets as well as the saliva and blood datasets were hierarchical clustered based correlation coefficients restricted to (A) 50 TS-DMRs; and (B) 43 individual differentiating DMRs.
Canonical methylation patterns at genomic features.Read count in 300 bp sliding genomic windows with 150 bp overlap across genetic regions corresponding to (A) CGI (CpG Island) ±4 kb flanking regions; (B) the first exon of RefSeq genes ±2 kb flanking regions; and (C) Alu repeat elements ±2 kb flanking regions.TSS: transcription start sites.