Development of a Polygenic Risk Score for BMI to Assess the Genetic Susceptibility to Obesity and Related Diseases in the Korean Population

Hundreds of genetic variants for body mass index (BMI) have been identified from numerous genome-wide association studies (GWAS) in different ethnicities. In this study, we aimed to develop a polygenic risk score (PRS) for BMI for predicting susceptibility to obesity and related traits in the Korean population. For this purpose, we obtained base data resulting from a GWAS on BMI using 57,110 HEXA study subjects from the Korean Genome and Epidemiology Study (KoGES). Subsequently, we calculated PRSs in 13,504 target subjects from the KARE and CAVAS studies of KoGES using the PRSice-2 software. The best-fit PRS for BMI (PRSBMI) comprising 53,341 SNPs was selected at a p-value threshold of 0.064, at which the model fit had the greatest R2 score. The PRSBMI was tested for its association with obesity-related quantitative traits and diseases in the target dataset. Linear regression analyses demonstrated significant associations of PRSBMI with BMI, blood pressure, and lipid traits. Logistic regression analyses revealed significant associations of PRSBMI with obesity, hypertension, and hypo-HDL cholesterolemia. We observed about 2-fold, 1.1-fold, and 1.2-fold risk for obesity, hypertension, and hypo-HDL cholesterolemia, respectively, in the highest-risk group in comparison to the lowest-risk group of PRSBMI in the test population. We further detected approximately 26.0%, 2.8%, and 3.9% differences in prevalence between the highest and lowest risk groups for obesity, hypertension, and hypo-HDL cholesterolemia, respectively. To predict the incidence of obesity and related diseases, we applied PRSBMI to the 16-year follow-up data of the KARE study. Kaplan–Meier survival analysis showed that the higher the PRSBMI, the higher the incidence of dyslipidemia and hypo-HDL cholesterolemia. Taken together, this study demonstrated that a PRS developed for BMI may be a valuable indicator to assess the risk of obesity and related diseases in the Korean population.


Introduction
Obesity is a medical condition involving the excessive accumulation of fat, which causes various health problems. As obesity can impair quality of life as a risk factor for numerous metabolic diseases and cancer [1], the recent increase in its prevalence poses a threat to wellbeing in human populations [2]. A number of genetic studies have been conducted to understand the genetic basis of obesity, which is known as a heritable trait [3][4][5][6][7][8]. To date, genome-wide association studies (GWASs) have identified over 250 common genetic variants for body mass index (BMI) [9,10], a simple index generally used as an indicator of obesity [11].
Most obesity-related genes are involved in appetite-related signals, adipocyte growth and differentiation, energy expenditure regulation, or insulin metabolism and adipose tissue inflammation [12]. As an example, LEP encodes leptin that is an adipocyte-secreted

Production of Base Data for Computing BMI PRS
Summary statistics needed as base data to calculate the BMI PRS were obtained from a GWAS for BMI using 8,056,211 variants of 57,110 HEXA cohort subjects (Table 1). Association analysis between SNPs and BMI revealed 20 independent SNPs reaching genome-wide significance (p-value < 5 × 10⁻⁸) (Supplementary Figure S1 and Table S1). With the exception of rs143349795, most SNPs with genome-wide significance detected in this study have been identified in East Asian GWASs for BMI [8]. The SNP rs143349795 is located in the intron of Cyclic Nucleotide Binding Domain Containing 2 (CNBD2) (Supplementary Figure S2). The protein encoded by CNBD2 possessing cAMP-binding activity is known to be involved in spermatogenesis [27]. The above findings suggest the value of also elucidating the functional role of CNBD2 in obesity.

Derivation of PRS for BMI
We computed the PRS for BMI using the GWAS summary statistics of 57,110 HEXA cohort subjects as base data and the genotype data of 13,504 subjects from the KARE and CAVAS cohorts as target data (Figure 1). Using PRSice-2 software, the best-fit PRS for BMI (PRSBMI) was selected at a p-value threshold of 0.064, at which the model fit had the greatest R2 score (Figure 2).
Logistic regression was performed to demonstrate the association between PRSBMI and obesity-related diseases. Significant associations of PRSBMI with obesity (p = 8.73 × 10⁻⁴⁵), hypertension (p = 6.84 × 10⁻⁴), and hypo-HDL cholesterolemia (p = 2.75 × 10⁻²) were detected in subjects of the target dataset (Table 3). It was estimated that the highest-risk group of PRSBMI (the fourth quartile, Q4) had approximately 2-fold, 1.1-fold, and 1.2-fold risk for obesity, hypertension, and hypo-HDL cholesterolemia, respectively, in comparison to the lowest-risk group (the first quartile, Q1) in the test population of the target dataset (Figure 4).
Table 3 note: Analysis was performed on the target dataset. The same adjustments for age, sex, and recruitment area were performed. OR, odds ratio.

Prevalence of Obesity and Related Diseases among Genetic Risk Groups in the Population
The prevalence of obesity and related diseases (such as hypertension, T2D, dyslipidemia, hypo-HDL cholesterolemia, hyper-LDL cholesterolemia, and hyperglyceridemia) was compared according to each decile group of PRSBMI in about 13,000 subjects in the target dataset (from KARE and RURAL cohorts). Significant correlations were detected between the decile groups of PRSBMI and the prevalence of obesity, hypertension, and hypo-HDL cholesterolemia. In the test population, there were differences in the prevalence of obesity, hypertension, and hypo-HDL cholesterolemia of about 26%, 2.8%, and 3.9% between the highest- and lowest-risk groups, respectively (Figure 5).

Figure 5. Comparison of the prevalence of obesity and related diseases by PRSBMI decile. The significance of the relationship between disease prevalence and decile groups of PRSBMI was measured by correlation and regression analyses.

Incidence of Obesity and Related Diseases among Genetic Risk Groups in the Population
We evaluated whether the PRSBMI that we developed predicts the incidence of obesity and related diseases. It is assumed that the higher the PRS, the higher the incidence of such conditions. In this study, we divided the PRSBMI into quartiles and analyzed the predictive power of the quartile groups for the incidence of diseases. Follow-up survey data from 2001 to 2016 from the KARE cohort subjects were used for this purpose. Individuals who had never been followed up were excluded from the analysis.
Kaplan-Meier survival analysis followed by a log-rank test demonstrated that the incidences of dyslipidemia and hypo-HDL cholesterolemia differed significantly among the quartile groups of PRSBMI, while other diseases did not show significant differences (Figure 6). For dyslipidemia and hypo-HDL cholesterolemia, the higher-risk group of PRSBMI had a higher incidence of diseases in the test population.

Discussion
Genome-wide association studies (GWASs) have identified over 55,000 unique loci for numerous common diseases and traits since the first GWAS was reported in 2005 [28]. To date, more than 70,000 common SNPs with genome-wide significance (association p-value ≤ 5.0 × 10⁻⁸) have been accumulated and are publicly available in the GWAS catalog (https://www.ebi.ac.uk/gwas/, accessed on 1 May 2023). The majority of GWASs have been conducted in populations of European ancestry, with only about 10% of all GWAS subjects being of non-European descent [28]. For example, among studies reported from 2005 to 2016, East Asian participants accounted for only 9% of the ancestral data included in the GWAS catalog (https://www.ebi.ac.uk/gwas/, accessed on 1 May 2023). This disproportionate representation of ancestral populations prevents an accurate understanding of the transferability of GWAS results across populations and makes it difficult to apply results informed by genetic research to clinical care.
Like many other complex traits, obesity has been a major subject of large-scale GWA analysis. To date, GWASs have detected over 250 common genetic variants for BMI [9,10], including those from East Asian populations [3,5,8]. In this study, we performed a GWA analysis for BMI in the Korean population as part of generating the base data for the PRS calculation. The GWA analysis using 57,110 HEXA cohort subjects detected 20 SNPs showing genome-wide significant associations with BMI. Of these, variants in or near FTO (PGWAS = 9.8 × 10⁻²⁴), SEC16B (PGWAS = 1.4 × 10⁻²⁶), BDNF (PGWAS = 2.7 × 10⁻²¹), and TMEM18 (PGWAS = 7.4 × 10⁻¹²) had also been discovered in previous studies [29][30][31]. Meanwhile, the SNP rs143349795 in CNBD2 (PGWAS = 1.2 × 10⁻⁸) was detected for the first time in this study. The fact that rs143349795 is monomorphic in populations of European ancestry explains why no association of this SNP with BMI has been detected in Europeans.
As most of the SNPs identified in GWASs are in introns and intergenic regions, it is believed that they exert small effects on disease risk and explain only a fraction of the heritability. As such, loci identified in GWASs may not make major contributions to disease prediction or causality. To overcome this limitation, a PRS combining risk alleles across the whole genome has been developed to improve the prediction of target diseases or traits [32]. With summary statistics for most GWASs being publicly available, the Polygenic Score (PGS) catalog has recently been established to provide information on PRSs to predict the genetic predisposition to diverse phenotypes such as diseases (https://www.pgscatalog.org/). As is typically the case for GWASs, most PRSs have been developed in populations of European ancestry. This European bias presents a crucial limitation in predicting the risk of diseases across populations globally.
Against this background, we developed a PRS for predicting obesity in the Korean population. The best-fit PRS generated from our GWAS for BMI (PRSBMI) showed strong associations with BMI (p = 1.36 × 10⁻⁷³) and obesity (p = 8.73 × 10⁻⁴⁵) from linear regression and logistic regression analyses, respectively. The proportion of variance in BMI explained by PRSBMI was about 2.4% in the Korean population. Of several obesity-related quantitative traits, SBP, DBP, INS0, HDL, and TG also showed significant associations with PRSBMI (Table 2). These results agree well with the significant associations of PRSBMI with obesity-related diseases such as hypertension (p = 6.84 × 10⁻⁴) and hypo-HDL cholesterolemia (p = 2.75 × 10⁻²) (Table 3). Based on these results, we further aimed to examine whether PRSBMI could reliably predict the prevalence of obesity and related diseases in the Korean population. The distribution of PRSBMI demonstrated that individuals with a high PRSBMI tend to be more susceptible to obesity than those with a low PRSBMI. In addition, we observed that the prevalence of obesity-related diseases such as hypertension and hypo-HDL cholesterolemia increased in the high-risk PRSBMI group.
In an effort to predict the incidence of obesity and related diseases using PRSBMI in the Korean population, we also performed Kaplan-Meier survival analysis using follow-up survey data from 2001 to 2016 from the KARE cohort, one of our study cohorts. Kaplan-Meier survival analysis followed by a log-rank test demonstrated that individuals in the high-risk PRSBMI group had a nominally significantly higher likelihood of developing dyslipidemia and hypo-HDL cholesterolemia. Meanwhile, no clear increases in the incidences of other diseases were observed in the high-risk PRSBMI group in this study. This result may be partly due to the small sample size in the follow-up longitudinal study of the KARE cohort. To generalize prediction of the incidence of diseases using PRSBMI in the Korean population, it may be necessary to increase the sample size.
Given the need for studies aimed at developing genome-wide PRSs in more diverse populations, it is meaningful that our study is, to the best of our knowledge, the first to develop and test a genome-wide PRS for obesity and related diseases in an East Asian population. Our results demonstrated the promise of the PRSBMI developed in this study as a useful index to predict obesity and related diseases in the Korean population. Accordingly, our results suggest that PRSBMI could be used clinically for the early prevention of obesity and related diseases. Subsequent large-scale studies of PRSs for diverse phenotypes such as diseases may open up various avenues for the application of genetic findings in a clinical context.

Study Subjects
Subjects for the association analyses were recruited from the Korean Genome and Epidemiology Study (KoGES). KoGES is a consortium project designed by the Korea Disease Control and Prevention Agency and consists of population-based and gene-environmental study cohorts comprising approximately 225,000 participants [33]. We used the epidemiological data from three population-based cohorts in KoGES: the RURAL cohort derived from the KoGES cardiovascular disease association study (CAVAS), the KoGES Ansan and Ansung study cohort (KARE), and the KoGES Health Examinees study cohort (HEXA) [34][35][36] (Table 1). The subjects of the KARE cohort, designed as a longitudinal prospective study, have been examined every 2 years since 2001 [33].
After phasing genotype data using Eagle v2.3, SNP imputation was performed with IMPUTE4 using the 1000 Genomes Project phase 3 and the Korean reference genome (397 samples) as reference panels. After imputation, SNPs with an INFO score < 0.8 or MAF < 1% were eliminated.

Phenotyping
In the three population-based cohorts of KoGES, subjects with a BMI above 25 kg/m2 were considered obese and those with a BMI between 18.5 and 22.9 kg/m2 were considered normal, in accordance with the Asia-Pacific guidelines for obesity classification [38].
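As an illustration only, the classification rule above can be written as a small function. This is a minimal sketch and not the authors' code; the function names are hypothetical, and returning None for BMI values outside the two stated ranges is an assumption, since the text does not label those ranges.

```python
def bmi_from_height_weight(weight_kg, height_m):
    """BMI is weight in kilograms divided by the square of height in metres."""
    return weight_kg / height_m ** 2


def classify_bmi(bmi):
    """Apply the cut-offs stated in the text to a BMI value in kg/m^2.

    Returns "obese" for BMI above 25 and "normal" for 18.5 to 22.9.
    Other values return None (assumption: the text leaves them unlabeled).
    """
    if bmi > 25:
        return "obese"
    if 18.5 <= bmi <= 22.9:
        return "normal"
    return None
```

For example, a subject weighing 70 kg at a height of 1.75 m has a BMI of about 22.9 kg/m2 and falls in the normal group under this rule.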

Quality Control across the Base and Target Data for PRS Derivation
The base data for PRS derivation were obtained from the summary statistics of GWA analyses for BMI using the KBA dataset of 57,110 individuals from the HEXA cohort. Associations between SNPs and BMI were tested by linear regression analysis adjusting for sex, age, and recruitment area using PLINK v1.07 (https://zzz.bwh.harvard.edu/plink/, accessed on 1 May 2023) [40]. The KBA genotype data of 13,595 individuals from KARE and CAVAS cohorts were used as the target data for computing the PRS in the present study.
The standard GWAS QC process removed SNPs with MAF < 1%, HWE p < 1 × 10⁻⁶, or imputation INFO score < 0.8 from both the base and the target datasets. In addition, SNPs with genotype missingness > 1% were further excluded from the initial target dataset. Ambiguous SNPs (i.e., those with complementary alleles, either C/G or A/T SNPs) across the datasets, duplicate SNPs, and SNPs on sex chromosomes were removed for subsequent PRS analysis. SNPs that were mismatched between the base and target data were not considered in the data QC because the base and target data were generated from the same genotyping platform. The BMI summary statistics of the base data were on the same genome build (Human GRCh37/hg19) as the target data.
Individuals with gender discrepancy or cryptic first-degree relatives were removed from the base and target datasets. In addition, individuals with genotype missingness > 1% or very high or low heterozygosity rates were further excluded from the initial target dataset. Finally, 6,916,878 variants for 57,110 individuals and 7,975,625 variants for 13,504 individuals remained in the base and target datasets, respectively (Figure 1). The detailed QC procedure for PRS analysis is presented elsewhere [41].
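The SNP-level filters described above can be sketched as follows. This is a hedged illustration rather than the pipeline actually used in the study: the dictionary keys ("id", "a1", "a2", "maf", "hwe_p", "info", "missing", "chrom"), the chromosome codes treated as sex chromosomes, and the choice to keep the first occurrence of a duplicated SNP are all assumptions made for the example.

```python
AMBIGUOUS_PAIRS = {frozenset(("A", "T")), frozenset(("C", "G"))}
SEX_CHROMOSOMES = {"X", "Y", "23", "24"}  # assumed chromosome codes


def is_ambiguous(allele1, allele2):
    """True for strand-ambiguous SNPs, i.e. A/T or C/G allele pairs."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def passes_snp_qc(snp, check_missingness=False):
    """Apply the thresholds stated in the text to one SNP record (a dict).

    check_missingness=True adds the genotype missingness filter that the
    text applies to the target dataset only.
    """
    if snp["maf"] < 0.01 or snp["hwe_p"] < 1e-6 or snp["info"] < 0.8:
        return False
    if check_missingness and snp["missing"] > 0.01:
        return False
    if is_ambiguous(snp["a1"], snp["a2"]):
        return False
    return str(snp["chrom"]) not in SEX_CHROMOSOMES


def filter_snps(snps, check_missingness=False):
    """Drop repeated SNP IDs (first occurrence is examined), then apply QC."""
    seen, kept = set(), []
    for snp in snps:
        if snp["id"] in seen:
            continue
        seen.add(snp["id"])
        if passes_snp_qc(snp, check_missingness):
            kept.append(snp)
    return kept
```

In practice these steps are typically run with dedicated tools on full genotype files; the sketch only makes the logic of each exclusion criterion explicit.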

Derivation of PRS
PRSice-2 [26] software was used to derive the PRS from the QCed base and target data. As the target sample comprised more than 500 individuals, the target file was used as the reference panel for the LD estimation in performing the PRS calculation. For LD clumping, an r2 threshold of 0.1 was applied. The phenotype data of BMI and the covariate data such as sex, age, and recruitment area from 13,504 individuals of the QCed target dataset were concomitantly incorporated in computing the PRS. The best-fit PRS was selected for a given phenotype (here, BMI) at the p-value threshold where the model fit had the greatest R2 score.
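The thresholding step described above can be illustrated with a simplified sketch. It is not PRSice-2 itself and makes several assumptions: LD clumping is taken to have been done already, the score is a plain sum of effect size times allele dosage (PRSice-2 offers its own scoring options), and the model fit is approximated by the squared Pearson correlation between score and phenotype without covariates. All function names are hypothetical.

```python
def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele dosages (0, 1, or 2) for one individual."""
    return sum(d * b for d, b in zip(dosages, betas))


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_threshold(genotypes, betas, pvalues, phenotype, thresholds):
    """Return (threshold, r2, number of SNPs) for the best-fitting threshold.

    genotypes holds one list of dosages per individual, in the same SNP
    order as betas and pvalues (both taken from the base GWAS).
    """
    best = None
    for t in thresholds:
        keep = [j for j, p in enumerate(pvalues) if p <= t]
        if not keep:
            continue
        scores = [polygenic_score([row[j] for j in keep],
                                  [betas[j] for j in keep])
                  for row in genotypes]
        r2 = pearson_r2(scores, phenotype)
        if best is None or r2 > best[1]:
            best = (t, r2, len(keep))
    return best
```

Scanning many thresholds in this way and keeping the one with the greatest R2 mirrors the selection rule stated in the text, under the simplifications noted above.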

Validation of PRS
To validate the best-fit PRS for BMI (PRSBMI), the association between PRSBMI and BMI was tested by linear regression and Pearson's correlation analyses. In addition, associations between PRSBMI and obesity-related quantitative traits including blood pressure (SBP and DBP), glycemic traits (GLU0, GLU120, and HbA1c), and lipid traits (HDLC, LDLC, TG, and TCHL) were also tested by linear regression and Pearson's correlation analyses. For these analyses, association p-values were obtained from linear regression with adjustment for age, sex, and recruitment area in the target dataset (about 13,000 subjects from KARE and RURAL cohorts). All quantitative traits except LDLC and TCHL were natural log-transformed before association analyses. The proportion of variance for the traits explained by the PRS was computed as the R2 obtained from a full model including both PRS and covariates (age, sex, and recruitment area) minus the R2 obtained from a model including covariates alone. In addition, the associations of PRSBMI with obesity and related diseases such as T2D, hypertension, and dyslipidemia (including hyperglyceridemia, hyper-LDL cholesterolemia, and hypo-HDL cholesterolemia) were tested by logistic regression analyses adjusting for age, sex, and recruitment area in the target dataset. Statistical analyses for all association tests were performed using R software.
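The variance-explained calculation described above (R2 of the full model minus R2 of the covariate-only model) can be sketched in plain Python. The study used R software; this example is only an illustration, with hypothetical function names, ordinary least squares solved through the normal equations, and no handling of collinear predictors.

```python
def ols_r2(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept.

    columns is a list of predictor lists, each the same length as y.
    Assumes the predictors are not perfectly collinear.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def incremental_r2(prs, covariates, y):
    """R^2 of (covariates + PRS) minus R^2 of the covariates alone."""
    return ols_r2(covariates + [prs], y) - ols_r2(covariates, y)
```

Because the covariate-only model is nested within the full model, the difference is non-negative up to rounding error and can be read as the share of trait variance attributable to the PRS beyond the covariates.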

Assessment of PRS on the Prevalence and Incidence of Obesity-Related Diseases
The prevalence of obesity and related diseases was compared according to each decile group of PRSBMI in about 13,000 subjects in the target data (from KARE and RURAL cohorts). The significance of the relationship between the disease prevalence and decile groups of PRSBMI was measured by correlation and regression analyses using R software.
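A minimal sketch of the decile comparison described above is given below. It is an illustration under stated assumptions, not the authors' R code: the function name is hypothetical, groups are formed by rank so that they are of nearly equal size, tied scores are split arbitrarily, and there must be at least as many individuals as groups.

```python
def prevalence_by_quantile(scores, cases, n_groups=10):
    """Disease prevalence within each score quantile group.

    scores: one PRS value per individual.
    cases: 1 if the individual has the disease, 0 otherwise.
    Returns a list of prevalences ordered from lowest to highest score group.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    prevalences = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        prevalences.append(sum(cases[i] for i in members) / len(members))
    return prevalences
```

The difference between the last and first entries of the returned list corresponds to the prevalence gap between the highest- and lowest-risk groups.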
Kaplan-Meier survival analysis was used to assess the prognostic value of PRS on the incidence of obesity and related diseases in about 5400 subjects of the KARE longitudinal prospective cohort. In the Kaplan-Meier survival analysis, the incidence of obesity and related diseases over time was compared among quartiles of PRSBMI. The association between quartiles of PRSBMI and disease incidence was further assessed by a log-rank test using R software (version 4.3.0).
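For readers unfamiliar with the estimator, the Kaplan-Meier curve for one PRS quartile group can be sketched as follows. This is a hedged illustration, not the analysis code of the study (which used R); the function name and input layout are assumptions, and the log-rank comparison between groups is not implemented here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the disease-free probability over time.

    times: follow-up time for each subject (any consistent unit).
    events: 1 if disease onset was observed at that time, 0 if censored.
    Returns a list of (time, probability) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        onsets = 0
        leaving = 0
        # Collect every subject whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            onsets += data[i][1]
            leaving += 1
            i += 1
        if onsets > 0:
            survival *= 1.0 - onsets / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Computing one such curve per quartile of the score and plotting them together gives the kind of comparison that the log-rank test then assesses formally.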

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms241411560/s1.

Author Contributions:
All authors made a significant contribution to the work reported. N.Y. contributed to manuscript preparation, construction of tables and figures, and statistical analysis. Y.S.C. contributed to study design, data collection and synthesis, manuscript preparation and revision, and submission of the manuscript. All authors have read and agreed to the published version of the manuscript.