Gene–Smoking Interaction Analysis for the Identification of Novel Asthma-Associated Genetic Factors

Asthma is a complex heterogeneous disease caused by gene–environment interactions. Although numerous genome-wide association studies have been conducted, these interactions have not been systemically investigated. We sought to identify genetic factors associated with the asthma phenotype in 66,857 subjects from the Health Examination Study, Cardiovascular Disease Association Study, and Korea Association Resource Study cohorts. We investigated asthma-associated gene–environment (smoking status) interactions at the level of single nucleotide polymorphisms, genes, and gene sets. We identified two potentially novel (SETDB1 and ZNF8) and five previously reported (DM4C, DOCK8, MMP20, MYL7, and ADCY9) genes associated with increased asthma risk. Numerous gene ontology processes, including regulation of T cell differentiation in the thymus (GO:0033081), were significantly enriched for asthma risk. Functional annotation analysis confirmed the causal relationship between five genes (two potentially novel and three previously reported genes) and asthma through genome-wide functional prediction scores (combined annotation-dependent depletion, deleterious annotation of genetic variants using neural networks, and RegulomeDB). Our findings elucidate the genetic architecture of asthma and improve the understanding of its biological mechanisms. However, further studies are necessary for developing preventive treatments based on environmental factors and understanding the immune system mechanisms that contribute to the etiology of asthma.


Introduction
Asthma is one of the most prevalent chronic respiratory diseases, with a worldwide burden associated with loss of life and overall morbidity [1,2]. It affected more than 330 million people in 2019 [3,4] and results in approximately 400,000 deaths annually [3,5,6]. The general symptoms of asthma include wheezing, coughing, chest tightness or pain, and shortness of breath [7]. Asthma can lead to respiratory distress, obstructive sleep apnea, psychological disorders, seizures, and other complications [8][9][10].
However, exome sequencing revealed that the significance of rare variants in the development of asthma remains unknown [28]. Furthermore, several studies on asthma did not provide strong evidence that structural variants have a role in asthma susceptibility [29][30][31]. Further, the "missing heritability" is due to an overestimation of the role of genetics in asthma development [26,27,32]. Regarding the findings of twin studies, monozygotic twins have been found not only to be genetically identical when compared with dizygotic twins but also to have an environment identical to that of dizygotic twins [33]. Lastly, the "missing heritability" could be due to gene-gene and gene-environment interactions [27]. Systems biology applications that account for highly interconnected molecular networks have been proposed to better understand gene-gene interactions; however, the missing heritability of complex diseases still remains poorly understood. In contrast, gene-environment interaction studies have steadily progressed, and several environmental causes of asthma and allergic diseases have been identified [12,[34][35][36][37][38]. Smoking status is a major environmental factor associated with the development of asthma [36,37]. Cigarettes contain nicotine and more than 4000 compounds that bind to and chemically alter or damage DNA [39]. For instance, genetic variants in 17q21 and 20p13-p12 were found to be significantly associated with asthma in patients exposed to smoking [40][41][42][43]. Furthermore, several GWASs have investigated the interactions between single nucleotide polymorphisms (SNPs) and smoking status [44][45][46]. GWAS have made considerable efforts in better understanding missing heritability. However, due to the small number of replicates, insufficient biological plausibility, and lack of statistical power, missing heritability remains poorly understood [47].
Therefore, to further understand the genetic architecture of asthma, we aimed to discover the effects of smoking status on asthma by analyzing genetic information from the Korean Genome and Epidemiology Study (KoGES) [48]. We evaluated gene-smoking interactions using SNP, gene, and gene-set analyses. To prioritize regulatory variants from the GWAS loci, we implemented linkage disequilibrium analyses and publicly available functional annotation tools: combined annotation-dependent depletion (CADD) [49], deleterious annotation of genetic variants using neural networks (DANN) [50], and RegulomeDB [51]. The study expands our knowledge of the genetic contribution to the development of asthma, thereby providing insight into asthma susceptibility and disease mechanisms.

General Characteristics of Study Participants
As shown in Table 1, a total of 66,857 participants were enrolled in the current study. The general characteristics of the participants for asthma analysis were described by the defined groups "case" and "control" in the Health Examination Study (HEXA), Cardiovascular Disease Association Study (CAVAS), and Korea Association Resource Study (KARE) cohorts of the KoGES consortium. The HEXA cohort included 975 cases with a mean age of 55.4 ± standard deviation (SD) 8.4 years and 57,479 controls with a mean age of 53.8 ± 8.0 years. The CAVAS cohort included 95 cases and 2908 controls with a mean age of 57.9 ± 7.8 and 55.4 ± 7.8 years, respectively. The KARE cohort included 112 cases and 5308 controls with a mean age of 53.3 ± 7.9 and 51.5 ± 8.5 years, respectively. Age was significantly associated with asthma in all three cohorts (p < 0.005, t-test). The average body mass index (BMI) values of patients with asthma (cases) and controls were 24.3 ± 3.2 and 23.9 ± 2.9 kg/m 2 in the HEXA cohort (p = 0.0002, t-test), 25.5 ± 3.4 and 24.5 ± 3.0 kg/m 2 in the CAVAS cohort (p = 0.0002, t-test), and 25.0 ± 3.5 and 24.6 ± 3.0 kg/m 2 in the KARE cohort (p = 0.1536, t-test), respectively. The male-to-female ratio (SEX) of cases and controls was 29.0% and 71.0% in the HEXA cohort, 38.9% and 61.1% in the CAVAS cohort, and 34.8% and 65.2% in the KARE cohort, respectively. The SEX was significantly associated with asthma in the HEXA and KARE cohorts. The allergy (ALLER) status was significantly associated with asthma in all three cohorts (p < 0.0001, chi-square test). To investigate the effect of smoking status on asthma, we performed an association analysis, using a logistic regression model adjusted for age, sex, BMI, and ALLER status. A significant association between smoking status and asthma was found in both the HEXA (p = 0.0003, logistic regression) and KARE cohorts (p = 0.0129, logistic regression). However, such an association was not found in the CAVAS cohort (p = 0.6627, logistic regression). a Means ± standard deviation (SD); b body mass index (BMI); c allergy status (ALLER); d smoking status (Nonsmoker: never smoker/Smoker: former smoker or current smoker); e the association between smoking status and asthma was analyzed by a logistic regression model adjusted for age, sex, BMI, and ALLER status.

GWAS Results
We performed gene-environment (smoking status) interaction analysis (GWAS) to identify the genetic factors (7,060,677 SNPs from 66,857 subjects from the HEXA, CAVAS, and KARE cohorts) associated with asthma, which were analyzed after genotype quality control ( Figure 1). The p-values for the gene-environmental interactions factor were determined using the PLINK software (v1.90) [52], and the "--interaction" option. In Figure S1a, the quantile-quantile (Q-Q) plot showed no inflation of p-values. The lambda GC (λ GC , genomic control) [53] in our analysis was 0.98. Figure S1b shows the Manhattan plot of the p-values in the GWAS gene-environment interaction analysis for asthma, where the horizontal red line denotes the threshold for a 0.05 genome-wide significance level by a Bonferroni correction of 7.08 × 10 -9 . Table 2 summarizes the top 40 SNPs with p-values less than the nominal significance level of 1.00 × 10 -5 . Notably, the most significant SNP-smoking status interacting signal was rs77079226 (OR = 3.083, 95% CI 1.985-4.788, p = 5.37 × 10 −7 ), located in TMEM74 on chromosome 8. This study anticipated a strong interaction between rs77079226 and smoking status in asthma, indicating the deleterious effect of rs77079226(C). Conversely, in the case of rs2292731 (OR = 0.637, 95% CI 0.523-0.775, p = 6.68 × 10 −6 ), we found an protective interaction effect between rs2292731 and smoking status on the risk of asthma. Furthermore, KDM4C [54][55][56][57] and MMP20 [58][59][60] were previously known to be associated with asthma.

Functional Annotation
We applied follow-up bioinformatics analyses to two novel genes (SETDB1 and ZNF8), identified in this study, and five previously reported genes (KDM4C, DOCK8 MMP20, MYL7, and ADCY9) using genome-wide functional prediction scores (CADD DANN, and RegulomeDB). For the SETDB1 gene, three variants had CADD scores > 10 or DANN scores > 0.7 and were associated with four transcription factor (TF)-binding The regulation of T cell differentiation in the thymus (GO:0033081) is associated with nicotine as a main component of cigarettes via smoking [83][84][85]. In previous studies, nicotine exposure was found to affect the thymus, an important organ of the immune system, and to interfere with epithelial cell adhesion and growth [83,84]. Nicotine leads to T cell downregulation and maturation and to T helper (Th)1/Th2 imbalance and reduced immunity [83,84,[86][87][88]. Nicotine also activates the nicotinic acetylcholine receptor (nAChR) and modulates thymocyte development [83]. T cell nicotinic responses were demonstrated to occur through the α7 subunit [84]. α7 nAChR activation triggers extracellular Ca 2+ influx associated with upregulated Fas expression [89]. In a previous study, α7 nAChR was activated, and its levels increased concomitantly with caspase expression and apoptosis rate [90]. Moreover, α7 nAChR was shown to mediate nicotine pro-apoptotic effects on thymocytes by upregulating Fas expression and the Fas-mediated apoptotic pathway [91].

Discussion
In this study, we conducted a gene-smoking interaction analysis to identify candidate genetic markers associated with asthma. We combined data from Korean multi-cohorts, (HEXA, CAVAS, and KARE) and identified two novel genes. We further contributed to understanding the functional associations of five previously identified asthma-associated genes. Using follow-up bioinformatics analyses, we also confirmed that two novel genes (SETDB1 and ZNF8) and three previously reported genes (DOCK8, MMP20, and ADCY9) are potential asthma-associated genes. The SETDB1 encodes a histone lysine N-methyl transferase; this important enzyme, involved in histone modification, is related to the Th1 gene downregulation mechanism in asthma [72,73]. As shown in Figure 3, SETDB1 causes epigenetic modifications through methylation of histone H3, which surrounds the DNA on the chromosome [72,146]. When histone H3 lysine 9 (H3K9) is methylated, deposition occurs in endogenous retroviruses and enhancer regions, which are necessary for Th1 gene expression [72,73]. Downregulation of the Th1 gene leads to overexpression of the Th2 gene, which increases the number of Th2 cells, causing Th1/Th2 cell imbalance [73]. Th2 cells secrete IL-4 and IL-5 and increase IgE levels and eosinophil numbers, which can trigger inflammation, leading to asthma [146][147][148]. Moreover, SETDB1 can induce macrophage recruitment by promoting AKT/mTOR-dependent CSF-1 induction and secretion [149]. Increased macrophage levels due to overexpression of SETDB1 can cause inflammation by causing an increase in the pro-inflammatory cytokine IL-8 [150][151][152]. High expression of IL-4 by Th2 cell upregulation hyperstimulates M0 macrophages to differentiate into M2 macrophages [153]. Upregulated M2 macrophages overexpress monocyte chemotactic protein-1 (MCP-1), IL-8, IL-13, etc., which can increase the risk of inflammation and asthma onset [154][155][156]. In addition, SETDB1 interacts with the asthma-related gene SETDB2 [157]; the latter antagonizes the KDM4C gene to control H3K9 methylation [157], which affects the expression of ORMDL3, a well-known asthma-related gene [55,158]. Therefore, SETDB2 dysregulation is associated with genomic instability and increased H3K9 methylation [157], leading to Th1/Th2 cell imbalance and asthma onset [73].  Another novel gene identified in our study is ZNF8 (Figure 4). ZNF8 encodes the zinc finger protein 8, an important protein that interacts with and binds to MH1 and MH2 domains of the Smad1 protein and is related to Smad1 suppression [166]. The main symptoms of asthma include reversible airway constriction, airway hyper-responsiveness, and chronic airway inflammation [167,168]. These symptoms are a complex process involving damage to the epithelial layer, hypertrophy and hyperplasia of the smooth muscles, and subepithelial fibrosis [169,170]. The key feature of subepithelial fibrosis is that human bronchial fibroblasts, derived from patients with asthma, show a characteristic transforming growth factor (TGF)-β1-induced fibroblast-to-myofibroblast transition (FMT) [166,170]. Antifibrotic TGF-β/Smad1 pathway activation is downregulated in fibroblasts; however, profibrotic TGF-β/Smad2/3 pathways are upregulated in patients with asthma [170]. Furthermore, the TGF-β/Smad1 and TGF-β/Smad2/3 pathways are antagonistic mechanisms [166,168,170]. Moreover, high expression of TGF-β1 is associated with eosin- Furthermore, several studies have shown that nicotine from smoking can interact with SETDB1 and induce asthma by causing Th1/Th2 cell imbalance [83][84][85]159,160]. In particular, these facts need to be investigated in conjunction with late-onset asthma (LOA) disease studies [161]. The phenotype of LOA is divided into two types according to the presence or absence of eosinophilic inflammation, Th2 and non-Th2-related LOA [162,163]. Th2-related LOA has been reported to be associated with sinusitis, nasal polyps, and sometimes aspirinexacerbated respiratory disease (AERD) [164]. This type has recently been defined as uncontrolled asthma, often with severe symptoms from the onset [165]. Non-Th2-related LOA has been reported to be associated with gender, obesity, smoking, age, and corticosteroids and requires other treatment strategies, such as diet, macrolides, or smoking cessation [161,162]. In this study, nicotine exposure through smoking in adulthood degraded T cell regulation and maturation, induced Th1/Th2 imbalance and reduced immunity [83,84,[86][87][88], and upregulated Fas expression and the Fas-mediated apoptosis pathway through α7 nAChR activation [91]. It has been suggested that non-Th2-related LOA can develop into severe Th2-related LOA. Further, this study suggested that the upregulation of SETDB1 by genetic variants, such as rs139189121, rs75406390, and rs59024312 (Table 3), can accelerate Th2-related LOA during nicotine exposure from smoking. rs139189121, rs75406390, and rs59024312 are genic upstream transcript variants (Table S2) located between exon 3 and exon 4 of SETDB1, which could influence SETDB1 expression activity and transcription of the SETDB1 protein, resulting in a different asthma prognosis. Fine-mapping analysis revealed that rs139189121, rs75406390, and rs59024312 were related to an increased risk of asthma when exposed to smoking (Table S2), and it was confirmed through RegulomeDB chip-seq data (hg19). As a result, they could be bonded asthma-related proteins, such as AR [92], FOXA1 [93], GATA3 [94,95], and GATA6 [96,97] (Table 3). Therefore, further biological investigation is needed to elucidate whether rs139189121, rs75406390, and rs59024312 can affect the expression of SETDB1 and asthma prognosis.
Another novel gene identified in our study is ZNF8 (Figure 4). ZNF8 encodes the zinc finger protein 8, an important protein that interacts with and binds to MH1 and MH2 domains of the Smad1 protein and is related to Smad1 suppression [166]. The main symptoms of asthma include reversible airway constriction, airway hyper-responsiveness, and chronic airway inflammation [167,168]. These symptoms are a complex process involving damage to the epithelial layer, hypertrophy and hyperplasia of the smooth muscles, and subepithelial fibrosis [169,170]. The key feature of subepithelial fibrosis is that human bronchial fibroblasts, derived from patients with asthma, show a characteristic transforming growth factor (TGF)-β1-induced fibroblast-to-myofibroblast transition (FMT) [166,170]. Antifibrotic TGF-β/Smad1 pathway activation is downregulated in fibroblasts; however, profibrotic TGF-β/Smad2/3 pathways are upregulated in patients with asthma [170]. Furthermore, the TGF-β/Smad1 and TGF-β/Smad2/3 pathways are antagonistic mechanisms [166,168,170]. Moreover, high expression of TGF-β1 is associated with eosinophil counts and asthma severity [171]. In asthmatic lungs, inflammatory cells express and secrete TGF-β1, which leads to an increase in eosinophils [171][172][173]. In particular, it has been reported that TGF-β1 is increased in chronic asthma and that eosinophils are the main source of asthma [172]. Thus, the ZNF8 protein, transcribed from ZNF8, interacts with Smad1, binds to the MH1 and MH2 domains of the Smad1 protein, and suppresses antifibrotic function [166]. Noteworthily, nicotine exposure from smoking increased KDM4C expression through epigenetic mechanisms [174]. These results suggest that the two novel genes (SETDB1 and ZNF8) may play important roles in the etiology of asthma.
One of the main strengths of this study is that it utilizes an older cohort sample to provide a better LOA phenotype. In particular, among the two novel genes associated with asthma in this study, the association of SETDB1 with LOA could be newly suggested through our analysis pipeline. Furthermore, this study focused on analyzing East Asian subjects, who have not been thoroughly investigated, unlike different ethnic groups. In this regard, we evaluated gene-smoking interactions using GWAS and gene analyses, revealing that genes, such as DOCK8, MMP20, and ADCY9 were suggested by previous studies. Additionally, we conducted gene-set analysis using GO processes, finding that the regulation of T cell differentiation in the thymus (GO:0033081) is related to the pathogenesis of asthma.
Our study had several limitations. First, our sample size was relatively small compared to that of recent asthma GWAS cohorts, such as the Trans-National Asthma Genetic Consortium (TAGC) [175] and the UK Biobank cohort [176]. Additional asthma risk variants were identified as the sample size increased. Second, GWAS is subject to unbalanced binary traits; that is, there are fewer case samples than control samples. The imbalance could produce large type-I error rates in logistic regression results [177][178][179], and the significance of true causal variants may be overlooked [180,181]. Further machine learning approaches, such as the synthetic minority oversampling technique (SMOTE) [182] and cost-sensitive learning [183], are required to address for the imbalance-associated errors. Third, the lack of independent replication datasets did not allow us to validate all the top signal GWAS loci, genes, and pathways for asthma. Therefore, replication studies on asthma using Korean or East Asian cohorts are needed as follow-up studies. Fourth, bioinformatics analysis identified certain genetic factors and pathways related to asthma pathogenesis; however, the underlying biological mechanisms of these factors require further investigation, such as rare variants or tissue and cell-type enrichment analyses. Finally, GWAS functional analysis did not provide sufficient biological insight into the two novel genes associated with asthma. Further in vitro and vivo experimental studies related to the role of the two novel genes are needed to identify individuals at risk of asthma.  Our study had several limitations. First, our sample size was relatively small compared to that of recent asthma GWAS cohorts, such as the Trans-National Asthma Genetic Consortium (TAGC) [175] and the UK Biobank cohort [176]. Additional asthma risk variants were identified as the sample size increased. Second, GWAS is subject to unbalanced binary traits; that is, there are fewer case samples than control samples. The imbalance could produce large type-I error rates in logistic regression results [177][178][179], and the significance of true causal variants may be overlooked [180,181]. Further machine learning approaches, such as the synthetic minority oversampling technique (SMOTE) [182] and cost-sensitive learning [183], are required to address for the imbalance-associated errors. Third, the lack of independent replication datasets did not allow us to validate all the top signal GWAS loci, genes, and pathways for asthma. Therefore, replication studies on asthma using Korean or East Asian cohorts are needed as follow-up studies. Fourth, bioinformatics analysis identified certain genetic factors and pathways related to asthma pathogenesis; however, the underlying biological mechanisms of these factors require further investigation, such as rare variants or tissue and cell-type enrichment analyses. Finally, GWAS functional analysis did not provide sufficient biological insight into the two novel genes associated with asthma. Further in vitro and vivo experimental studies related to the role of the two novel genes are needed to identify individuals at risk of asthma.

Study Participants
Epidemiologic and genetic data were collected by the KoGES consortium [48] and

Study Participants
Epidemiologic and genetic data were collected by the KoGES consortium [48] and included three population-based study cohorts: HEXA (n = 61,568), CAVAS (n = 9715), and KARE (n = 8840). The KoGES consortium was a long-term follow-up cohort study conducted, in the Korean population, from 2001 to 2010. It was designed to investigate and assess genetic and environmental factors as correlates to the incidence of chronic diseases, such as type 2 diabetes, hypertension, and cancer [48]. The current study excluded subjects without Korean Biobank Array genotype data and those with missing epidemiologic information (SEX, AGE, BMI, ALLER, SMOKE, and asthma diagnosis). After applying the exclusion criteria, 66,887 participants were included in the study (HEXA, n = 58,434; CAVAS, n = 3003; KARE, n = 5420; Figure 1). Patients with asthma were defined based on their answer to the asthma history question, with participants being categorized as "case" or "control" if their answer was "yes" or "no," respectively. The allergy status was defined based on the participants' answer to the allergy history (rhinitis, atopy, allergic conjunctivitis, food allergy, etc.) question, with participants being categorized as "ALLER" or "non-ALLER" if their answer was "yes" or "no," respectively. The smoking status was defined based on the participants' answer to the smoking history question, with participants being accordingly categorized as "never-smokers," "former-smokers," or "current-smokers." However, in this study, we gathered former-smokers and currentsmokers in a single group (smokers).
All participants provided written informed consent to participate in the study. The participants' genetic information was produced using the Korean chip through the Korea Biobank Array Project of the National Institutes of Health, Korea Centers for Disease Control and Prevention [184,185]. This study was approved by the Institutional Review Board of Hanyang University (IRB no. HYUIRB-202210-013).

Quality Control
Genomic DNA was genotyped using the Korean Biobank Array (Korean Chip, KORV1.1, Thermo Fisher Scientific, Waltham, MA, USA), designed by the Center for Genome Science (Korea National Institute of Health (KNIH)), based on the platform of the UK Biobank Axiom array and manufactured by Affymetrix [184,185]. SNP imputation was performed using the IMPUTE v2 software [186] with East Asian (Chinese and Japanese, 286 samples) 1000 Genomes Project Phase 3 data as a reference panel. Then, the PLINK software (version 1.90, Boston, MA, USA) [52] was used for SNP loci quality control. We excluded SNPs with missing genotype call ratio > 0.05, minor allele frequency (MAF) < 0.01, and Hardy-Weinberg equilibrium (HWE) p-value ≤ 10 −5 . After imputation and quality control, 7,060,677 SNPs were selected for association analysis.

Genome-Wide Association Analysis
Genome-wide association analysis was conducted using a logistic regression model, with the PLINK-G × E option, to test the interaction between genetic variants and environmental factors (smoking status). Single-SNP interaction effects were assessed using the following model: where β 0 , β 1 , β 2 , and β 3 are the intercept value, effect sizes of the SNP, environmental factor, and interactions, respectively, and β 4 , β 5 , β 6 , and β 7 denote coefficients of the covariate variables SEX, AGE, BMI, and ALLER, respectively. For the ASTH phenotype, the case group was denoted as 1, and the normal group was denoted as 0. The SNP was used as a marker to indicate the genetic information of each sample, and the genotype was assumed to be an additive genetic model (AA = 0, Aa = 1, and aa = 2, where "A" and "a" indicate the major and minor alleles, respectively). ALLER was categorized as either non-ALLER (coding = 0) or ALLER (coding = 1). SMOKE was classified into two groups: non-smokers (coding = 0) and smokers (former or current smokers; coding = 1). Moreover, as three integrated cohorts were analyzed in this study, we used covariates to adjust potential population stratifications in each cohort using dummy variables (HEXA: group 1 = 0, group 2 = 0; CAVAS: group 1 = 1, group 2 = 0; and KARE: group 1 = 0, group 2 = 1). Additionally, we conducted principal component (PC) analysis (PCA), which is the most widely used method to adjust for population stratification in GWASs [187]. We used the top 10 PC score that was calculated using the PLINK software (v1.90) and the "--pca 10" option [52]. β 8 , β 9 , and β 9 -β 19 denote the effect size of Groups 1, Groups 2, and PC1-PC10 values, respectively. To identify SNPs that interact with smoking status, we considered the following null hypothesis: The p-values for the gene-environmental interaction factor, β 12 , that passed the Bonferroni correction threshold of 7.08 × 10 -9 (0.05/7,060,677) were considered statistically significant.

Gene Analysis and Gene-Set Analysis
Gene and gene-set analyses were implemented in FUMA (v1.4.0) [62]. It used the input GWAS summary statistics to compute the p-value of gene-based and gene-set p-values, using the multi-MAGMA (v1.08) tool [61]. The MAGMA tool, with the multiple linear PC regression model [188], was used to identify multiple variant combined effects by incorporating linkage disequilibrium (LD) information among multiple variants within extended +/− 250 kb downstream or upstream of the gene [61]. We performed a MAGMA gene-based analysis using a reference panel from Phase 3 of the 1000 Genome Project East Asian data. For the gene-set analysis, we conducted the GENE2FUNC process to identify the putative biological mechanisms of the prioritized genes. The gene-set p-value was computed using a gene-based p-value for 9988 GO terms obtained from MsigDB (v7.0). Bonferroni correction was used to correct for multiple testing in both the gene and gene-set analyses.

Functional Annotation
Fine-mapping analysis was performed to prioritize candidate causal variants in the genomic regions surrounding the significant GWAS hits. For each GWAS hit, a SNP set consisting of genetic variants belonging to the same gene was selected and mapped to a significant SNP. Then, the LD among SNPs was computed using the Haploview software (v4.2) [189] to view the correlation structure. This is a commonly used functional annotation process considering LD to find the causal variant for the top SNP signals from the GWAS [190][191][192]. The most commonly used measures of LD are D and r 2 . D represents a statistic for estimating recombination differences, whereas r 2 is the squared value of the correlation of the allelic states of two loci in the same gamete. Within the LD block containing a significant SNP, we extracted a list of adjacent SNPs with D > 0.9 or r 2 > 0.8. Finally, functional annotation studies were performed to examine the relationship between asthma and the fine-mapping loci.
To annotate SNPs at our candidate loci, we conducted variant annotation (ANNO-VAR) [193] (table_annovar.pl) using the hg19/GRCh37 reference genome and Ensembl gene (build 106). For intergenic SNPs, the gene boundaries overlap such that a single variant can be annotated to multiple genes. Therefore, the intergenic SNPs were mapped to the two closest upstream and downstream genes. All candidate SNPs were annotated using CADD (https://cadd.gs.washington.edu/ (accessed on 1 June 2023)), DANN (https://cbcl.ics.uci.edu/public_data/DANN/ (accessed on 1 June 2023)), and RegulomeDB scores (www.regulomedb.org/ (accessed on 1 June 2023)) to predict potential pathogenicity. For both coding and non-coding variants, CADD used more than 60 functional annotations to calculate deleterious scores. The raw CADD scores were then converted into PHRED-scaled scores ranging from 1 to 99, with higher values indicating more deleterious cases. For example, a PHRED-scaled score of 10 or greater indicates a raw CADD score in the top 10% of all possible reference genome SNPs. We also used a DANN to identify potentially functional SNPs using the VarSome prediction tool (v11.3.3, https://varsome.com (accessed on 1 June 2023)) [194]. The DANN score is a functional prediction scoring methodology with an output value in the range of 0 and 1, where higher values are more likely to indicate deleterious cases. RegulomeDB comprises datasets from the ENCODE Project Consortium, Gene Expression Omnibus, and other sources, totaling 962 datasets as well as published literature. RegulomeDB provides a SNP score that ranges from 1 to 6: the lower the SNP score, the stronger the regulatory association. The comparative characteristics of the RegulomeDB score categories are summarized in Table 4 [51]. Motif hit a Expression quantitative trait locus (eQTL); b transcription factor (TF); c deoxyribonuclease (DNase).

Conclusions
This study demonstrated that two potentially novel genes (SETDB1 and ZNF8) are associated with the development of asthma, based on gene-smoking interaction analyses using multiple cohorts of the KoGES consortium. Furthermore, the causal relationship between five genes (two potentially novel genes and three previously reported genes) and asthma was confirmed through follow-up bioinformatics analyses. Moreover, gene-set analysis revealed that the regulation of T cell differentiation in the thymus (GO:0033081) is significantly associated with the pathogenesis of asthma. Therefore, our findings indicate the importance of multi-levels analysis (GWAS loci, genes, pathways, and functional annotation) in gene-environment interaction studies, which may provide insight and an improved understanding of these complex traits. Our results reveal potential biological mechanisms that are involved in the pathogenesis of asthma. The results contribute to a greater understanding of asthma etiology and provide insights for further investigation.