Aberrant AHRR, ADAMTS2 and FAM184 DNA Methylation: Candidate Biomarkers in the Oral Rinse of Heavy Smokers

Objective. To identify DNA methylation patterns of heavy smokers in oral rinse samples. Methods. Genome-wide DNA methylation data was imported from Gene Expression Omnibus GSE70977 using the GEOquery package. Two independent sets were analyzed: (a) 71 epigenomes of cancer-free subjects (heavy smokers n = 37 vs. non-smokers n = 31); for concordance assessment (b) 139 oral-cancer patients’ epigenomes (heavy smokers n = 92 vs. non-smokers n = 47). Differential DNA methylation for CpG positions and at the regional level was determined using Limma and DMRcate Bioconductor packages. The linear model included sex, age, and alcohol consumption. The statistical threshold was set to p < 0.05. Functional gene prioritization analysis was performed for gene-targeted analysis. Results. In individuals without cancer and heavy smokers, the FAM184B gene was found with two CpG positions differentially hypermethylated (p = 0.012 after FDR adjustment), in a region of 48 bp with an absolute methylation difference >10% between groups (p = 1.76 × 10−8). In the analysis corresponding to oral-cancer patients, we found AHRR differentially hypomethylated cancer patients, but also in subjects without oral cancer in the targeted analyses. Remarkably, ADAMTS2 was found differentially hypermethylated in heavy smokers without a diagnosis of cancer in two consecutive probes cg05575921 (p = 3.13 × 10−7) and cg10208897 (p = 1.36 × 10−5). Conclusions: Differentially methylated AHRR, ADAMTS2, and FAM184B genes are biomarker candidates in oral rinse samples.


Introduction
Tobacco smoke is one of the biggest risk factors for the disease of the oral mucosa caused by different mechanisms, including the induction of an increased number of Langerhans cells and greater induction of aldo-keto reductases, enzymes linked to genotoxicity [1]. In vitro studies report some regulatory dose-dependent changes in genes as CYP1A1, CYP1B1, and AHRR at the transcriptional level [2].
The pack-year index measures smoking intensity defined as the number of packs of cigarettes smoked per day multiplied by the number of years the person has smoked [5]. In heavy smokers, a high pack-year index increases cancer risk and associated mortality compared to non-smokers and heavy non-smokers [6,7]. Additionally, heavy smokers have a worse response to oral cancer treatment [8]. Smoking is known to affect the oral microbiome, with high levels of Gram-negative organism colonization in the tongue, a site with high rates of malignant transformation. Smoking is also the strongest factor in increasing microbial acetaldehyde production and reducing total salivary antioxidant capacity [9,10].
In this context, some risk factors can favor tumor promoter gene expression or inhibit the antitumor immune response, modifying DNA methylation at specific CpG sites in determined loci. If these specific loci are identified in risk factors, such as smoking, screening strategies with specific biomarkers could be established. The ultimate goal of discovering specific biomarkers is to facilitate early diagnosis, risk identification, targeted prevention, the follow-up of proposed treatments, and prognosis to reduce morbidity and mortality [11][12][13][14]. While some studies have been conducted to elucidate the carcinogenic effects of tobacco smoke on the bronchial epithelium and other diseases, the epigenetic profile of buccal cells [15] or DNA methylation differences in cancer [16], the epigenetic effect underlying environmental risk factors remains poorly understood. Accessible biomarkers of heavy smoking in oral tissue demand more research. Interestingly, a study in oral and pharyngeal cancer made epigenome data from 223 oral rinse samples available with the code GSE70977 in the Gene Expression Omnibus repository [16]. These data have been exclusively used to study patients with oral cancer, without determining the differential methylation patterns in oral rinse samples from cancer-free or cancer-diagnosed smokers compared to nonsmokers or studying differentially methylated genomic patterns at the regional level. The use of data from public repositories enables analyses that answer relevant questions while rationalizing resources [17] and contribute to understanding harmful phenotypic effects and to the progress of personalized medicine as a future guideline in the staging, management, prevention, and subclassification of the different neoplasms of the stomatognathic system [18]. Therefore, secondary bioinformatic analyses using available datasets, performing a functional genetic prioritization analysis guided by the results of positional and regional differential methylation analyses, could help to elucidate the clinical differences in oral mucosa diseases related to smoking.
The main objective of this work was to evaluate the buccal rinse cell genome-wide DNA methylation signature in heavy smokers using available Illumina DNA methylation 450 K BeadChip data from oral rinse-derived DNA. A secondary objective was to direct the evaluation of DNA methylation in heavy smokers, restricting the analysis to oral cancerrelated genes selected by a gene prioritization bioinformatic approach. Therefore, this work is proposed as a secondary bioinformatic analysis, relevant to oral epigenetics and dentistry, to study the epigenome of smoking habits on the Illumina HumanMethylation450 BeadChip platform by identifying differentially methylated genomic sites and regions and addressing the knowledge gap about epigenetic patterns in cells collected from oral rinse in patients with a smoking habit, which are relevant in multiple oral diseases [19,20].

Available Data (Population)
We evaluated available data from 223 DNA methylation epigenomes assessed by the Infinium Illumina 450 K platform corresponding to buccal rinse cells primarily used in a cancer study with different aims and analyses [21]. Data were imported from the public dataset GSE70977, freely available in the Gene Expression Omnibus for secondary analyses [22].
The accessible population corresponded to 223 epigenomic profiles from buccal rinses collected at different hospitals in Boston, MA, USA [21]. Seventy-four patients were nonsmokers, and 149 patients were heavy smokers. The present study aimed to compare the DNA methylation profiles of heavy smokers in separate subsets (with or without cancer). However, a supplementary analysis including both subsets may be provided on reasonable request.
For each subset of the genome datasets, with and without a cancer diagnosis, we performed a separate analysis to compare heavy smokers with the respective control group [23]. For each subset, the corresponding heavy smoker group included individuals with a pack-year smoking history index >10, e.g., one daily cigarette pack for 10 years (1 × 10) or half daily cigarette pack for 20 years (0.5 × 20) [5]; for each control group, the epigenomic profiles from subjects with a pack-year smoking index less than 1 were selected. Considering the objective of the present study, all analyses were performed in a stratified form for each subset, avoiding mixing patients of different cancer statuses. The data subset of noncancer subjects had 71 epigenomes. The data subset of cancer subjects diagnosed with head and neck cancer had 152 epigenomes. We evaluated for inclusion all the data available in dataset GSE70977. Therefore, no sample size calculation was performed; however, adequate power was found in the Results section.

Data Importation, Cleansing, and Quality Control
DNA methylation, clinical and demographic data were imported from the GSE70977 dataset using the GEOquery R Bioconductor package [24] for the corresponding statistical and bioinformatics analyses. The dataset content consistency was manually visualized and cross-checked through verification data importation/verification R scripts for both the phenotypic and epigenetic data. Quality control processes were performed, including the corresponding fluorescence detection validation. Only data with a detection p value of less than 0.0000001 compared to the background signal passed this QC step. As a result, both samples and probes with more than 2% were excluded. In addition, QC included bimodal pattern density plot verification of beta values through the minfi package. For this purpose, we used the beta values normalized via the preprocess Funnorm of the minfi package available in the GEO dataset [25].

Statistical and Bioinformatic Analyses
This study used R Bioconductor, a statistical and bioinformatics platform to favor the reproducibility of bioinformatic analyses, based on an open-source policy, which allowed the design of scripts for all relevant analyses [26]. The statistical analysis was performed using the Limma package to compare differential methylation positions (DMPs) and differentially methylated regions (DMRs) in heavy smokers and nonsmokers as described below.

Genome-Wide Differential Methylation Positions Analysis for Heavy Smokers
Genome-wide DNA methylation was analyzed at the CpG position level to detect differentially methylated positions (DMPs) by modeling the study variable (heavy smokers or controls) and covariables, i.e., age, sex, and alcohol consumption, using robust methods. The delta of the beta filter for DMP was set at 0.09. The statistical significance threshold after FDR correction applied to genome-wide analysis was set at p < 0.05. Subsequently, gene assignment to identified DMPs was performed, considering a 2 kb distance from each probe to the nearby transcription start site (TSS associated with the gene), as previously used elsewhere [27,28].

Genome-Wide Differentially Methylated Regions Analysis for Heavy Smokers
In addition, we performed a differentially methylated region (DMR) analysis using the DMRcate package based on the Limma package [29] with code built in-house. We previously used to add a 'delta of beta' filter set to >6% [28].

Gene Prioritization for Analysis Focused on Specific Genes
Endeavour Bioinformatics tools facilitate the identification of promising candidate genes related to a condition or disease, based on training genes previously identified as associated with a condition or disease and curated annotations bioinformatic datasets (e.g., 'Gene Ontology', 'InterPro', 'Biomolecular pathways' such as 'Reactome' gene interaction and protein interaction networks, e.g., 'BioGrid', and 'IntAct', between others. The list of bets candidates is determined by their connections with the training genes [28,30]. This strategy allows complementation of the initial result using a list of genes in a directed analysis, as has been performed in previous studies [31]. Here, we used a combination of a results-driven approach, and a direct evaluation of candidate genes was performed using a prioritization analysis through the Endeavour Bioinformatic tool [28,30].

Endeavour Parameters Were Set As Follows:
(a) Input 1 (list of training): genes identified in the nondirected analysis (AHRR, ADAMTS2, FAM184B) and with an OMIM assignation for "Orolaryngeal cancer" (CDKN2A). (b) Input 2 (list of candidates): genes evaluated/annotated by Infinium Illumina DNA Methylation 450 K. (c) For the gene prioritization analysis, the statistical threshold for identification was p < 0.01; otherwise, the settings in Endeavour were set to default [32].

Analyses of Differentially Methylated Positions (DMPs) and Regions (DMRs) of Specific Loci
The DMP and DMR analyses were performed only for the genes identified in the gene prioritization analysis using the same procedure as the untargeted analysis. The prerequisites for DMPs in this targeted analysis were a limit of significance p value < 0.005 and a minimum DNA methylation difference between groups of 6%.

Effect of Cell Composition Sensitivity Analysis
To address a potential confounding factor due to cell composition, we used a recently available data [33], considering cell composition heterogeneity as a confounding variable in a sensitivity analysis using the estimateLC function from ewastools r-package in conjunction with BeadSorted.Saliva.EPIC Bioconductor package [34]. The results of these cell estimations of leucocytes vs. epithelial cells were included in the final model using Limma Bioconductor Package with the same parameters described above.

Results
Of the 71 available noncancer methylomes, 68 were included for analysis to identify DNA methylation differences associated with heavy smokers (HSs). Three methylomes were excluded by applying the selection criteria to the set of subjects without cancer diagnosis (two samples) and excluding such with poor quality findings in the quality control (QC) step (one sample). Of the 68 noncancer methylomes analyzed, 37 belonged to heavy smokers, and 31 belonged to controls. No significant differences in age or gender were found between heavy smokers and nonsmokers in the subset of patients without cancer ( Figure 1). However, differences were found between smokers and nonsmokers regarding alcohol consumption; therefore, this variable was included in the model to eliminate its effect on differential DNA methylation results. Demographic data are shown by subset for each group of interest (heavy smokers vs. nonsmokers) ( Table 1).
Of the initial subset of 152 available methylomes corresponding to subjects with cancer, 139 were finally included for the analyses to identify DNA methylation differences associated with heavy smokers (HSs). Twelve were excluded by applying the selection criteria, and one more was ruled out due to poor-quality findings in the quality control (QC) step. Of the 139 subjects in this subset, 92 were heavy smokers and 47 were nonsmokers ( Figure 1). No differences were found regarding age and gender. However, regarding alcohol consumption, significant differences were found between smokers and nonsmokers in the cancer patient subset (Table 1); therefore, for this analysis, the alcohol variable was also included in the model to eliminate its effect on differential methylation results.
In the noncancer subset, the genome-wide hypothesis-free analysis showed differentially hypermethylated positions in the ADAMTS2 and FAM184B genes in the initial model (Table 1) and the final model including alcohol consumption (Table 2). Two positions were found in the final model that considered alcohol consumption (Table S1).  Of the initial subset of 152 available methylomes corresponding to subjects with cancer, 139 were finally included for the analyses to identify DNA methylation differences associated with heavy smokers (HSs). Twelve were excluded by applying the selection criteria, and one more was ruled out due to poor-quality findings in the quality control (QC) step. Of the 139 subjects in this subset, 92 were heavy smokers and 47 were nonsmokers ( Figure 1). No differences were found regarding age and gender. However, regarding alcohol consumption, significant differences were found between smokers and nonsmokers in the cancer patient subset (Table 1); therefore, for this analysis, the alcohol  In the cancer-diagnosed subset, the genome-wide hypothesis-free analysis showed four differentially hypomethylated positions, one in the AHRR gene and three in intergenic regions ( Table 3). Two of these positions were also found after applying the final model including alcohol consumption (Table S2).
In the noncancer subset, DMR analysis found 18 regions supported by more than one differential CpG for smokers compared to nonsmokers in the initial model. Among them is the gene FAM184B (Table S3). In the final model, DMR analysis showed 12 regions supported by more than one differential CpG, including the FAM184B and AHRR genes ( Table 4).   In the cancer subset, DMR analysis found one region supported by more than one differential CpG for smokers in contrast to nonsmokers in the initial model corresponding to the AHRR gene (Table S4). In the final model, DMR analysis showed only two DMR regions, only the one corresponding AHRR supported by more than one differentially methylated CpG (Table 5).

Gene Prioritization Results and Corresponding DNA Methylation Analysis
The genes found to be differentially methylated in the genome-wide analysis, supported by DMPs and DMRs, were used as training input in the prioritization software; the results of the prioritization in Endeavour showed the following 167 genes significantly related to the training genes: In the noncancer epigenome subset, the DNA methylation analysis focused on the prioritized genes revealed three differentially methylated genes, TFAP2A, AHRR, and MAPK14, in addition to FAM184B and ADAMTS2, which were already present in the results of genome-wide DNA methylation analysis (Table S5). In cancer-diagnosed subjects, two probes corresponding to AHRR were found (Table S6); one of them, cg05575921, overlapped with the results in the noncancer subset (Tables S5 and S6).
Furthermore, DMPs of AHRR, ADAMTS2, and FAM184 genes passed the sensitivity analysis for cell composition as follows: For the cancer-free subset, cg04450456 for FAM184B, cg02599361 for ADAMTS2. For the cancer-diagnosed subset, cg05575921 for the AHRR gene and two probes cg21566642 and cg05951221 for intergenic regions (Table 6).

Discussion
The present work determined a genome-wide DNA methylation signature in heavy smokers using available Illumina DNA methylation 450 K data from oral rinse-derived DNA. This is a secondary study that has provided new insights into DNA methylation profiles by focusing the analysis on the search for potential biomarkers of heavy smokers, which differs from the research of previous authors. In complementation of the described approach, we also conducted a directed evaluation of DNA methylation in heavy smokers of genes selected by bioinformatic gene prioritization software, a promising tool not previously used for this question. With this approach, the main results presented here support AHRR as an epigenomic biomarker of tobacco smoking in oral rinse but also identified ADAMTS2 and FAM184 as new genes involved in tobacco smoking as candidate biomarkers.
The present study was not limited to evaluating each differentially methylated CpG site regardless of its nearness to other CpGs. In contrast, we analyzed the data to detect differential DNA methylation at a regional level, bearing in mind the integration between DMP and DMR analyses, which was not previously achieved. In addition, we performed a combined strategy including (I) a hypothesis-free genome-wide approach for differential methylation analysis and (II) a prioritization approach that integrated the newly discovered genes and the genes previously related to oropharyngeal cancer. Therefore, the identified genes are suggested to be tobacco-related markers that are possibly related to cancer pathogenesis.

AHRR
Here, we found that the AHRR gene was differentially hypomethylated at the CpG site corresponding to the Illumina probe cg05575921 (hg19 coordinates chr5:373,378), which also intersected with a genomic region of 589 bp in chromosome 5 identified in our DMR analysis (hg19 coordinates chr5:373299-373887); this region was concordantly hypomethylated in heavy smoker subjects in both subsets of the present study (cancer-free and cancerdiagnosed), indicating the utility of this epigenetic pattern in both subsets. The AHRR gene encodes the protein aryl hydrocarbon receptor repressor, which represses the transcriptional activity of the aryl hydrocarbon receptor and mediates dioxin toxicity, xenobiotic metabolism, and the immune response [35,36]. Concordantly, another epigenome-wide association study (EWAS) focused not on oral rinse but on an invasive procedure to analyze solid tissue samples from the oral masticatory mucosa of the hard palate and concluded that hypomethylation of the aryl hydrocarbon receptor repressor AHRR is differentially methylated genes in smokers [37]. The concordance of the present result with that reported by Richter et al. demonstrates the possibility of using AHRR DNA methylation in smokers in a routine noninvasive screening test in saliva.

ADAMTS2
The present work identified a pattern of ADAMTS2 differential hypermethylation in heavy smoker subjects in comparison with controls in two neighboring CpGs corresponding to the cg02599361 and cg10208897 Illumina probes in the cancer-free subset (Tables 2, 6, S1 and S6), located in CpG islands in the gene body of ADAMTS2. This gene encodes a protein that cleaves the pro-peptides of collagen and has been implicated in connective tissue disorders and fibrosis (disintegrin-like and metalloproteinase domain with thrombospondin motifs). Interestingly, ADAMTS2 has been found to be upregulated in cancer [38,39]. Further, this gene was not differentially methylated in the cancer-diagnosed subset; therefore, future studies could explore whether an increase in DNA methylation in the CpG islands of ADAMTS2 could be adaptative or protective for cancer development in heavy smokers.

FAM184B
FAM184B is a family with sequence similarity 184 member B gene expressed in different tissues, including mononuclear cells and the digestive system, according to GTEx Consortium dataset annotation across human tissues [40]. In the present study, we found that FAM184B was differentially hypermethylated in cancer-free subjects in oral rinse samples (Tables 2, 4, and 6). However, at present, its function has not been clarified. Interestingly, the DNA methylation at two CpGs represented by the probes cg16449012 and cg08644678 (localized in different parts of the gene body) was hypermethylated in newborns of mothers exposed to smoking during their pregnancy [41]. In addition, by evaluating DNA methylation using Sequenom MassARRAY, this gene was reported to be differentially hypermethylated in oral cancer compared to adjacent tissue; unfortunately, the exact differentially methylated probe is not available to identify exact concordances [42]. Interestingly, the FAM184B gene was found to be downregulated in striated muscle tissue and therefore possibly involved in skeletal muscle dysfunction, which is a frequent extrapulmonary manifestation in chronic obstructive pulmonary disease [43].

MAPK14 and TFAP2A Genes Found by Gene Prioritization
MAPK14, a gene of the mitogen-activated protein kinase family, is involved in processes such as proliferation, differentiation, and transcription regulation [44]. MAPK14 was significantly differentially hypermethylated even after including the alcohol consumption variable in the model (p = 0.00158, 1.58 × 10 −3 , Table 6). The MAPK14 gene has been related to signaling pathways that act as an immune response against oral pathogens [45]. This gene codes for the MAPK14 enzyme, which has been identified as a possible regulator of inflammation and is a key component of the tumor microenvironment, as chronic inflammation contributes to the development, progression, and regional metastasis of oral carcinomas [46]. MAPK14 gene expression seems to be decreased at the transcription level in response to tobacco metabolites in RNAseq, but this was not confirmed by qPCR. Interestingly, MAPK14 was also found to decrease progressively at higher grades of renal clear cell carcinoma [47].
Finally, the present study found that the TFAP2A gene was differentially hypomethylated in heavy smokers compared to controls in the analysis of noncancer subjects (Table  S6) but not for the subset of cancer-diagnosed subjects. TFAP2A encodes the transcription factor AP-2 alpha protein, which binds specific ADN sequences to activate or inhibit gene expression. Interestingly, in cancer (melanoma cells), it has been related to nasopharyngeal cancer growth [48], and aberrant DNA hypermethylation has been found at its promoter, associated with reduced expression and proposed as a possible target/marker [49]. MAPK14 and TFAP2A genes were not identified in the genomic approach, which considers FDR correction. Instead, these genes were detected through gene prioritization for targeted DNA methylation analysis, demonstrating statistical significance. However, could be important to be included both genes in future studies.
Previous studies reported DNA differentially methylated genes in smoking. Richter et al. [37] reported significant hypomethylation of AHRR and CYP1B1 in the buccal and airway epithelium of smokers. Our study showed concordat differential hypomethylation for the AHRR gene (cg04066994) but not for the CYP1B1 gene, in mouthwashes, with CpG sites evaluated a very low delta of Betas showing <0.05 and with no significant p-values (p value > 0.5). Considering that the buccal rinse uses different cell types than the biopsies in the cited study [37], only partial coincidences of DNA methylation are expected.
The results from epithelial cells may have significant implications for epithelial cancer. On the other hand, the buccal rinse procedure is noninvasive and could make the AHRR marker, more easily usable on tests with personalized medicine applications. In another study on smoking and DNA methylation, Christiansen et al. identified the SLAMF7 gene as differentially hypomethylated in smokers using blood samples [50]. In our all CpG sites evaluated for SLAMF7 (cg23844325, cg07837085, cg11721194, cg04244970, and cg04345766 probes) showed a very low delta of betas values (delta of beta <<0.05) with no significant differences between groups (p-value > 0.5). This discrepancy may be due to the differences in the tissues evaluated between the two studies.
DNA methylation plays a particular role in cell differentiation and the maintenance of cell specificity. It represents an important mechanism for cells to react to persistent external stimuli, allowing somatic cells to adjust gene expression to a particular environment in a long-term manner that can be passed on to daughter cells [51]. Therefore, the deleterious effects of smoking on tissue integrity are driven by mechanisms such as changes in DNA methylation patterns and reduced expression of repair genes [52]. It is appropriate to test the DNA methylation patterns reported here in the low-cost molecular assays [53].
There are several limitations to the present study. As this investigation is a secondary analysis, we could only access the data included in the Gene Expression Omnibus Repository, and full raw fluorescence data were unavailable. The cell composition analysis in this study differentiates merely two cell types (epithelial cells and leukocytes) and is optimized for children, potentially reducing its effectiveness for adults. Nevertheless, we employed the most suitable and currently available tool to account for cell composition heterogeneity, ensuring reliable results. Lastly, this study does not address transcriptomic analysis due to the lack of expression data, a limitation linked to the mouthwash samples examined in this research.
This study contributes new markers to be verified in future longitudinal studies in oral rinse samples, which are local but noninvasive specimens, promising for future routine screening and implementable for precision/personalized medicine. Identifying altered epigenetic patterns also allows us to focus future studies on identifying strategies to restore or modify gene regulation in the cells of patients for convenience and to handle plastic epigenetic changes/patterns to benefit patient health [54].

Conclusions
Differential methylation in FAM184B, AHRR, and ADAMTS2 genes establish these genes as biomarker candidates in mouthwash samples, relevant for further studies in clinical settings. Analysis targeting these genes and related pathways is warranted.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/biomedicines11071797/s1, Table S1: Genomic differentially methylated probes in the final statistical model for heavy smokers, inside the cancer-free subset; Table S2: Genomic differentially methylated probes in the final statistical model for smoking habit, inside the cancer-diagnosed subset; Table S3: Differentially methylated regions in the initial statistical model for smoking habit in the group, inside the cancer-free subset; Table S4: Differentially methylated regions in heavy smokers, genome wide-analysis for the cancer-free subset; Table S5: Differentially methylated probes focused on genes prioritized using the final statistics model; inside the cancer-free subset; Table S6: Differentially methylated probes focused on genes prioritized using the final statistics model; resulting in the cancer-diagnosed subset. Funding: This project was funded by award code AC31041108/AC-project-10 from FODEIN-grand of Universidad Santo Tomás-Colombia.

Institutional Review Board Statement:
The study was conducted following the Declaration of Helsinki and approved by the Institutional Review Board of Universidad Santo Tomás, protocol code CAT-163 statement from 5 February 2020. The study was approved by the institutional ethics committee of Universidad Santo Tomás (01142020-730012020 statement from 30 January 2020).

Informed Consent Statement: Not applicable.
Data Availability Statement: Data used in this study was available in Gene Expression Omnibus data set GSE70977 (available at https://www.ncbi.nlm.nih.gov/geo/ accessed on 1 January 2021).