Polygenic Risk Score Improves Cataract Prediction in East Asian Population

Cataracts, characterized by opacities of the crystalline lens of the human eye, are the leading cause of blindness globally. Because the disease is multifactorial, its molecular mechanisms remain poorly understood, and genome-wide association studies (GWAS) in larger cohorts are needed to investigate its genetic basis. In this study, a GWAS was performed on the largest Han population to date, analyzing a total of 7079 patients and 13,256 controls from the Taiwan Biobank (TWB) 2.0 cohort. Two cataract-associated SNPs with adjusted p < 1 × 10⁻⁷ in the older group and nine SNPs with adjusted p < 1 × 10⁻⁶ in the younger group were identified. Except for AGMO, which has been reported in animal models, most variants, including rs74774546 in GJA1 and rs237885 in OXTR, had not been identified before this study. Furthermore, a polygenic risk score (PRS) was created for the younger and older populations to identify individuals at high risk of cataracts, with areas under the receiver operating characteristic curve (AUROCs) of 0.829 and 0.785, respectively, after covariate adjustments. When comparing individuals in the highest and lowest PRS quantiles, younger individuals had 17.45 times the risk and older individuals had 10.97 times the risk. Validation analysis on an independent TWB1.0 cohort revealed AUROCs of 0.744 and 0.659 in the younger and older groups, respectively.


Introduction
Cataracts are the leading cause of blindness globally, with the proportion of cataract-induced blindness ranging from 12.7% in North America to 42% in Southeast Asia [1]. The disease is characterized by opacification of the crystalline lens, a biconvex structure in the human eye that converges light and focuses images on the retina. Lens opacification causes blurry or scattered vision and glare, eventually leading to vision loss. Diagnosis involves confirming the lens opacity through a slit lamp examination. Once cataracts become severe enough to interfere with a patient's daily activities, intervention is needed. However, current medications based on the quinoid and free radical theories of cataract development have shown little efficacy [2,3]. The only effective treatment is surgery that replaces the opaque, cataractous lens with a transparent artificial one. This approach has its own limitations and complications. For example, a monofocal intraocular lens cannot adjust its focus, while a multifocal intraocular lens has the disadvantages of glare, light scattering, and halos at night. When complications such as endophthalmitis or suprachoroidal hemorrhage occur, the patient's vision may be severely impaired even after surgery. In the post-genomic era, investigating cataract-related genes to enable risk stratification by genetic variation opens new avenues for medical treatment and early intervention.
Among the various types of cataracts, such as childhood cataracts and those secondary to trauma, glaucoma, or infection, the most common form is age-related [4]. It is influenced by a range of environmental and genetic factors [5,6], including age, sunlight exposure [7], smoking [8], alcohol, metabolic syndrome [9], and iatrogenic corticosteroids [10]. Twin studies and family aggregation studies have shown that genetic factors play a role in the development of senile cataracts, with the estimated heritability ranging from 21% to 64% [11]. In a multi-ethnic genome-wide association study (GWAS), 54 cataract risk loci with several potential drug targets, such as RARB, KLF10, DNMBP, HMGA2, MVK, BMP4, CPAMD8, and JAG1, were identified [12]. Most GWAS analyses of cataracts have been limited to Western populations. In 2014, the first large-scale meta-analysis in multiethnic Asians identified two loci for age-related nuclear cataracts, namely rs7615568 in the KCNAB1 gene and rs11911275 in the CRYAA gene [13]. However, the associations were heterogeneous because the study recruited multiethnic participants, including Malay and Indian individuals.
Here, we present the largest pure Han population GWAS on cataracts in the Taiwan Biobank (TWB) 1.0 and 2.0 databases, with more than 144,000 Taiwanese individuals from both the community and teaching hospitals (http://www.twbiobank.org.tw/, accessed on 25 February 2022) [14]. Genome-wide SNP data were collected from custom SNP arrays versions 1.0 and 2.0. Principal components analysis of the genetic data confirmed that over 99% of the TWB participants are Han Chinese, differing from previous multiethnic GWAS research on cataracts [14,15]. In this study, 11,110,260 SNP variants from 20,335 TWB2.0 participants were analyzed. Cataract-related SNP loci were identified by GWAS, and differential genetic architecture for young cataract patients (aged under 60) and old cataract patients (aged over 60) is illustrated in Manhattan plots. Finally, we constructed a polygenic risk score (PRS) to identify high-risk cataract individuals, and the results were validated in 5-fold cross-validation subsets. Replication was performed on TWB1.0. We further compared our association results with BioBank Japan (http://jenger.riken.jp/en/result, accessed on 25 February 2022) and the UK Biobank (https://pheweb.org/UKB-TOPMed, accessed on 25 February 2022) (Tables S1-S3).

Study Population and Genome-Wide Association Study
The participants and their data were obtained exclusively from the TWB (https://www.biobank.org.tw/, accessed on 25 February 2022). The TWB collects extensive phenotypes, including demographics, socioeconomic status, environmental exposures, lifestyle, dietary habits, family history, and self-reported disease status, through structured questionnaires. Up to 15 April 2021, more than 144,000 participants had been recruited. The demographic and health-related survey data for 105,388 study subjects were released in December 2019. Detailed genotyping and imputation procedures are described by Wei et al. [14]. In brief, 95,252 participants had been genotyped with the custom TWB1.0 array (TWB1.0 = 27,737) or the TWB2.0 array (TWB2.0 = 68,978).
The control samples in this study were restricted to individuals aged 60 or over because most cataracts are age-related. Quality control was carried out using PLINK (v1.9) [16] to exclude samples and SNP markers that met any of the following criteria: (i) a missing call rate > 2%, (ii) a minor allele frequency (MAF) < 1%, or (iii) a significant deviation from Hardy-Weinberg equilibrium (p < 1.0 × 10⁻⁶).
After quality control, 11,785,052 variants from 7993 TWB1.0 participants (3003 cases, 4990 controls) and 11,110,260 variants from 20,335 TWB2.0 participants (7079 cases, 13,256 controls) were used in the subsequent analysis. Logistic regression analysis was performed using PLINK, with sex, age, diabetes, hypertension, dyslipidemia, asthma, glomerular filtration rate, and body mass index included as covariates.
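As a minimal illustration of the three marker-level thresholds listed above, the sketch below applies them to per-SNP summary statistics. The study ran these filters in PLINK; the `SnpStats` record, its field names, and the helper functions are hypothetical and only serve to make the exclusion logic explicit.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is illustrative.
MAX_MISSING_RATE = 0.02  # exclude if missing call rate > 2%
MIN_MAF = 0.01           # exclude if minor allele frequency < 1%
MIN_HWE_P = 1.0e-6       # exclude if Hardy-Weinberg p-value < 1e-6


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record (not a PLINK data structure)."""
    snp_id: str
    missing_rate: float
    maf: float
    hwe_p: float


def passes_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives all three exclusion criteria."""
    return (
        snp.missing_rate <= MAX_MISSING_RATE
        and snp.maf >= MIN_MAF
        and snp.hwe_p >= MIN_HWE_P
    )


def filter_snps(snps):
    """Keep only the SNPs that pass quality control, preserving order."""
    return [s for s in snps if passes_qc(s)]
```

A marker is dropped as soon as it violates any one criterion, which matches the "or" in the list of exclusion rules.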

Polygenic Risk Score (PRS) Analyses
To build the PRS prediction models, we used the standard clumping and thresholding (C + T) method. The hyperparameters for this method were the correlation cut-off r² and the p-value threshold p. The parameter spaces for r² and p were {0.2, 0.04} and {10⁻⁴, 2.5 × 10⁻⁴, 5 × 10⁻⁴, 7.5 × 10⁻⁴, 10⁻⁵}, respectively. For each combination of (r², p), we used PLINK with a window size of 10 Mb to select SNPs. For model selection, we used TWB2.0 as the training sample to report the prediction performance (AUC) and TWB1.0 as the testing sample to evaluate the AUC of the prediction model. For SNPs whose minor alleles showed protective effects on cataracts, we took the major allele rather than the minor allele as the risk allele, which resulted in positive weights for all variants. For the genetic risk estimation, individuals were divided into quartiles based on the PRS values in each study cohort. The (min,Q1) group comprised PRS values between the minimum and the 25th percentile, the (Q1,Q2) group comprised values between the 25th and 50th percentiles, and the remaining groups were defined in the same way. We calculated the PRS as the weighted sum of the risk alleles, $\mathrm{PRS} = \sum_{i=1}^{k} \beta_i \, \mathrm{SNP}_i$, where $k$ is the number of SNPs, $\mathrm{SNP}_i$ is the number of risk alleles at the $i$-th SNP, and $\beta_i$ is the corresponding logistic regression coefficient [17]. The PRS analyses were performed using PLINK 1.9. The predictive abilities of the TWB1.0 and TWB2.0 PRS were compared using the area under the receiver operating characteristic curve (AUROC) [18], computed with the R package "pROC".
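The weighted sum and the risk-allele orientation described above can be sketched in a few lines of Python. This is an illustration only, not the PLINK scoring procedure used in the study: the function names are hypothetical, the SNPs are assumed to be biallelic and diploid (so the major-allele count is 2 minus the minor-allele count), and the cut points Q1 to Q3 are assumed to be supplied by the caller.

```python
def orient_to_risk(beta, minor_allele_count):
    """Flip a protective minor allele so that the weight becomes positive.

    Assumes a biallelic SNP in a diploid genome, so the major-allele
    count equals 2 minus the minor-allele count.
    """
    if beta < 0:
        return -beta, 2 - minor_allele_count
    return beta, minor_allele_count


def polygenic_risk_score(betas, minor_allele_counts):
    """Weighted sum of risk-allele counts: sum over i of beta_i * SNP_i."""
    score = 0.0
    for beta, count in zip(betas, minor_allele_counts):
        weight, risk_count = orient_to_risk(beta, count)
        score += weight * risk_count
    return score


def quartile_group(score, q1, q2, q3):
    """Assign a PRS value to one of four groups given cut points Q1 < Q2 < Q3."""
    if score <= q1:
        return "(min,Q1)"
    if score <= q2:
        return "(Q1,Q2)"
    if score <= q3:
        return "(Q2,Q3)"
    return "(Q3,Q4)"
```

For example, a person carrying one copy of a risk-increasing minor allele (β = 0.5) and no copies of a protective minor allele (β = −0.2) scores 0.5 × 1 + 0.2 × 2, because the protective SNP is re-expressed as two copies of its major allele.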

Cross-Validation
Cross-validation (CV) is a resampling method for assessing prediction accuracy [19]. Since TWB is not split into training and testing data, we resorted to five-fold CV, which is often used in machine-learning modeling [20,21]. Five-fold CV assesses how well a classification model generalizes to independent data by splitting the dataset into five equal and mutually exclusive subsets. Each subset is used once for testing, with the other four used for training, so the procedure runs five times in total.
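The partitioning step can be sketched as follows. This is a generic illustration of k-fold splitting, not the code used in the study; the function name, the fixed random seed, and the interleaved assignment of shuffled indices to folds are assumptions made for the example. When the sample size is not divisible by five, fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train, test) index pairs with mutually exclusive test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

Each sample index appears in exactly one test fold, so every individual is used for testing once and for training four times.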

Participant Characteristics in TWB 2.0 and TWB 1.0
A total of 68,978 TWB2.0 participants and 27,737 TWB1.0 participants were considered in our study, with TWB2.0 serving as the discovery set for cataract risk loci and TWB1.0 as the validation set. In TWB2.0, 10.2% (7079/68,978) of the individuals were cataract cases who passed our quality control and were included in the discovery set (Table 1). Among the 7079 individuals, 1959 patients were under 60 years old, while 5120 patients were older than 60 years. In subsequent discussions, the former group is referred to as the younger group and the latter as the older group. The 13,256 individuals from the remaining participants who met our quality control criteria were taken as the control group, making the total sample size for the discovery set 20,335. In the validation set (TWB1.0), there were 7993 individuals, consisting of 3003 self-reported cataract cases and 4990 controls. Among the cases, 757 individuals belonged to the younger group. The baseline characteristics for both the discovery and validation sets are shown in Table 1. In the discovery set, the proportion of females was higher among the cases (71%) than among the controls (63%). The mean ages of the older and younger cases were 65 and 54 years, respectively. Since age, sex, diabetes, body mass index (BMI), hypertension, asthma, and chronic kidney disease differed significantly between the cases and controls, all of these were adjusted for in the following analysis.

Cataract Risk Loci
To investigate which SNPs are significantly associated with cataracts, we performed a GWAS on TWB2.0, and a mirror Manhattan plot was generated after the covariate adjustment (Figure 1, Tables S1-S3). A Bonferroni-corrected significance threshold of p = 4.5 × 10⁻⁹ (0.05/11,110,260) was prespecified to adjust for multiple testing. However, Bonferroni correction is thought to be too stringent and conservative [22]. Hence, associations with p-values between 1.01 × 10⁻⁷ and 1.01 × 10⁻⁵ were considered suggestive associations, and those between 4.5 × 10⁻⁹ and 1.01 × 10⁻⁷ were considered putative associations [23][24][25]. In the older group, 167 SNPs related to cataracts showed adjusted p-values of less than 1 × 10⁻⁵, including two SNPs with p < 1 × 10⁻⁷, 142 with p < 1 × 10⁻⁶, and 23 with p < 1 × 10⁻⁵. In the younger cataract group, there were nine SNPs with p < 1 × 10⁻⁶ and 31 SNPs with p < 1 × 10⁻⁵. No SNP with p < 1 × 10⁻⁵ was shared between the younger and older groups. The selected SNPs are listed in Table 2 and the details are listed in Tables S1-S3. While most SNPs were not replicated in other biobanks, rs9788929 in the gene XYLT1 had p < 0.05 in the UK Biobank, and rs2272537 in the gene ZBTB32, rs56792854 in KMT2B, and rs60128322 in PROSER3 had p < 0.05 in BioBank Japan (Tables S1-S3).
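The threshold arithmetic above can be checked with a short sketch. The Bonferroni value and the two cut-offs come from the text; the function name, the category labels, and the treatment of values that fall exactly on a boundary are assumptions made for illustration.

```python
N_VARIANTS = 11_110_260          # variants tested in TWB2.0
ALPHA = 0.05                     # family-wise error rate
BONFERRONI = ALPHA / N_VARIANTS  # approximately 4.5e-9

PUTATIVE_CUTOFF = 1.01e-7        # upper bound for "putative" associations
SUGGESTIVE_CUTOFF = 1.01e-5      # upper bound for "suggestive" associations


def classify_association(p_value):
    """Label a p-value using the tiers described in the text.

    Boundary handling (strict '<') is an assumption of this sketch.
    """
    if p_value < BONFERRONI:
        return "genome-wide significant"
    if p_value < PUTATIVE_CUTOFF:
        return "putative"
    if p_value < SUGGESTIVE_CUTOFF:
        return "suggestive"
    return "not reported"
```

Dividing 0.05 by the number of tested variants reproduces the prespecified threshold of roughly 4.5 × 10⁻⁹, and the two looser tiers sit about one and a half and three and a half orders of magnitude above it.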

Polygenic Risk Score (PRS) and Cataract Risk Prediction
To predict cataract risk, we constructed PRS models for both the younger and older cataract groups based on the associated SNPs discovered in TWB2.0. Table 3 shows different models based on various combinations of the linkage disequilibrium (LD) clumping threshold (r²) and the p-value threshold (p). The mean PRS was significantly higher among the cataract cases than among the controls across all models in both groups (Table 3 and Figure 2A,D). Considering the clinical significance, r², p, and AUC, we constructed the model with 218 selected independent SNPs for the younger cataract group (r² < 0.04 and p < 2.5 × 10⁻⁴, referred to as PRS_younger), and 287 SNPs for the older cataract group (r² < 0.04 and p < 2.5 × 10⁻⁴, referred to as PRS_older).
Regarding the PRS performance, PRS_younger and PRS_older effectively distinguished individuals with high cataract risk from those with low risk in the younger and older groups, respectively (Figure 2A,D). The association shows a dose-response effect (Figure 2B,C,E,F, Table 4). In the younger group, the individuals in the highest quantile of PRS_younger (Q3,Q4) showed a 17.45-fold increase in risk compared to those in the lowest quantile (min,Q1). The second highest (Q2,Q3) and third highest (Q1,Q2) quantiles showed 5.52- and 2.12-fold increases in risk, respectively, compared to the lowest group. In the older group, the odds ratios were 10.97, 2.48, and 2.26 for the individuals in the highest, second highest, and third highest quantiles compared to those in the lowest quantile (min,Q1). Table 4 shows the case-control distribution among the quantiles. Furthermore, Table 5 shows that the high-risk groups (the top 5% to 25% of the PRS distribution) had a significantly elevated risk of cataracts compared to the remaining population. For the younger group, the top 25% of the PRS distribution had a 6.24-fold increased risk, the top 10% had a 7.09-fold increased risk, and the top 5% had a 9.16-fold increased risk of developing cataracts compared to the remaining population. For the older group, the relative risks were 4.63, 5.48, and 6.74 when comparing the top 25%, 10%, and 5% groups to the remaining population.
For the PRS_younger and PRS_older models, the areas under the receiver operating characteristic curve (AUROCs) were 0.786 and 0.738, respectively (Figure 3). After additional covariates were included in the model, the AUROCs reached 0.829 and 0.785 in the younger and older groups, respectively (orange curves, Figure 3A,B). Replication in TWB1.0 gave an AUROC of 0.744 in the younger group and 0.659 in the older group.
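To make the quantile comparison concrete, the sketch below computes an unadjusted odds ratio and a Wald 95% confidence interval from a 2 × 2 table of cases and controls in a given PRS group versus the reference group (min,Q1). It is an illustration only: the function name is hypothetical, the counts in the usage note are invented, and the odds ratios reported in the study may come from covariate-adjusted models rather than from raw counts.

```python
import math


def odds_ratio_ci(cases_group, controls_group, cases_ref, controls_ref, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    All four cell counts must be positive. z = 1.96 gives a 95% interval.
    """
    odds_ratio = (cases_group * controls_ref) / (controls_group * cases_ref)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / cases_group + 1 / controls_group + 1 / cases_ref + 1 / controls_ref
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With made-up counts of 40 cases and 10 controls in the top group against 10 cases and 40 controls in the reference group, `odds_ratio_ci(40, 10, 10, 40)` gives an odds ratio of 16; the interval is symmetric on the log scale, not on the odds-ratio scale.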
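The AUROC has a direct interpretation: it is the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The study computed AUROCs with the R package pROC; the plain-Python function below is only a sketch of that pairwise definition, and its name and the example scores are hypothetical.

```python
def auroc(case_scores, control_scores):
    """AUROC as the fraction of case-control pairs ranked correctly.

    Ties between a case and a control contribute one half. This pairwise
    form is quadratic in sample size and meant for illustration only.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))
```

For instance, with case scores [0.9, 0.8, 0.4] and control scores [0.5, 0.3, 0.2], eight of the nine case-control pairs are ordered correctly, so the AUROC is 8/9. A value of 0.5 corresponds to a score that ranks cases and controls no better than chance.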

Table footnote: PRS_younger was used to assess the younger population (<60 years old), while PRS_older was used to assess the older population (≥60 years old). Abbreviations: OR = odds ratio with the reference being the lowest PRS quantile group (min,Q1); Q = quantile; 95% C.I. = 95% confidence interval.

Discussion
In this study, we included 20,335 individuals (7079 cases and 13,256 controls) from the Taiwan Biobank to identify cataract risk loci and build a polygenic risk score (PRS). We used the genotype data, as well as extensive phenotypes, including demographics, socioeconomic status, environmental exposures, lifestyle, dietary habits, family history, and self-reported disease status, collected using structured questionnaires answered by a Taiwanese population of mostly Han Chinese ancestry. According to a recent study investigating the population admixture of the Han Chinese residing in Taiwan, the Taiwanese subpopulations demonstrate high genetic homogeneity given Taiwan's population structure and migration history [15]. The Han Chinese are an understudied population in cataract research, and studying them is necessary for solving the genetic puzzle of cataractogenesis.
The importance of clarifying the molecular mechanism of cataracts, the world's leading cause of blindness [26], cannot be overstated. Cataracts, which can be defined as an opacity of the crystalline lens, are produced by the misfolding and aggregation of proteins [1], which impairs the transmission of light to the retina. Because genetic mutations and environmental stress can affect the protein-folding process in different ways, the molecular mechanisms by which disruptions to lens protein stability, solubility, and interactions [27][28][29] result in cataracts remain unclear. Lens proteins undergo various alterations in response to oxidative, osmotic, and other stresses [30]. The study of gene polymorphisms and new molecular markers may reveal the stresses associated with cataracts.
A total of 209 cataract-associated SNPs at a significance level of p < 1 × 10⁻⁵ were identified in our GWAS. Most of the identified SNPs were previously unreported, including the topmost SNPs, rs74774546 in GJA1 and rs237885 in OXTR, among others. While mapping a list of newly identified loci from a GWAS to genes is a known challenge, it is nonetheless necessary and relevant for further functional follow-ups. In this work, the 209 cataract-associated SNPs were mapped to a list of 30 genes (Tables S1 and S2). HMX1, GJA1, and PROSER3 were identified as the leading genes associated with cataracts in our older population. Additionally, seven SNPs were shared between the analysis of all cataract cases and the analysis of those aged above 60, three of which (rs76840465, rs28433905, and rs60128322) could be mapped to the AGMO, SCFD2, and PROSER3 genes. Furthermore, some SNPs identified in the younger population can be mapped to the CAV3, OXTR, ROR1, and ERG genes. We also checked the SNPs identified in TWB2.0 against the UK Biobank (UKB) and BioBank Japan (BBJ). The gene XYLT1 was identified in our younger population and replicated in the UKB at p < 0.05; the genes ZBTB32, KMT2B, and PROSER3 were identified in the older group and replicated in BBJ at p < 0.05. Although the associations of these SNPs with cataracts remain unclear and previously unreported, we must not rule out their relevance to the disease. Below, we briefly describe some of their essential functions, which could help guide future follow-up experiments and pathway analyses into their mechanisms related to cataracts.
Firstly, the leading SNP identified in our older population maps to the H6 family homeobox 1 (HMX1) gene, located at 4p16.1. A previous genome-wide study of two individuals from a consanguineous family found an association of HMX1 with congenital cataracts. A homozygous missense mutation (c.650A>C; p.(Gln217Pro)) that abrogates HMX1 function results in a rare oculoauricular syndrome associated with congenital cataracts, anterior segment dysgenesis, and retinal dystrophy [31,32]. Although rs145208055, located near HMX1, was identified in our older population, the link between senile cataracts and HMX1 remains to be explored.
The gap junction protein alpha 1 (GJA1) gene, located at 6q22.31, was identified in the older population. It encodes a connexin protein that forms intercellular transmembrane channels at gap junctions. These channels provide a route for the diffusion of molecules between neighboring cells and play a particularly crucial role in the heart and in embryonic development [33,34]. The gene is also related to the signaling receptor binding and protein domain-specific binding pathways. GJA1 has been associated with oculodentodigital dysplasia [35], autosomal recessive craniometaphyseal dysplasia [36], and heart malformations [37], and may also play a role in the physiology of hearing by participating in the recycling of potassium to the cochlear endolymph and in cell growth inhibition [38]. GJA3 and GJA8, but not GJA1, have been associated with cataracts in previous studies. GJA3 and GJA8 are expressed in the specialized lens fibers that maintain the homeostasis and transparency of the lens [39]. Whether GJA1 itself is involved in cataracts remains unclear.
The sec1 family domain containing 2 (SCFD2) gene, identified in the older group, is a protein-coding gene that participates in protein transport and exocytosis [40] and has been associated with multiple personality disorders. In terms of its potential role in ophthalmic diseases, a previous study found opposite effects of the Scfd2 gene on the regulators STAT1 and miR-493, which are associated with ischemia, a common pathological pathway for the neuronal cell degeneration seen in many retinal diseases [41,42]. Results from another GWAS also indicated that several variants within the SCFD2 locus achieved genome-wide significance in their association with cataracts in the Australian Shepherd breed of domestic dogs [43]. Additionally, the SCFD2 gene has been associated with adiposity and diabetes [44].
The alkylglycerol monooxygenase gene (AGMO), identified in the older population, is a protein-coding gene located at 7p21.2. It encodes a tetrahydrobiopterin- and iron-dependent enzyme that cleaves the O-alkyl bond of ether lipids. Protective roles of AGMO against cataractogenesis, central nervous system myelination abnormalities, and spermatogenesis arrest are supported by phenotypic reports of ether lipid-deficient mice [45,46]. Additionally, AGMO may play a role in the development of type II diabetes [47], which is a risk factor for cataracts.
The proline- and serine-rich 3 (PROSER3) gene, identified in the older group, is located at 19q13.12. In previous GWAS, PROSER3 has been associated with serum albumin [48], calcium [49], and sex hormone-binding globulin measurements [50]. Nonetheless, no associations with cataracts, other eye diseases, or molecular pathways have been reported so far.
The SNP rs237885, identified in the younger cataract group, maps to the oxytocin receptor (OXTR) gene located at 3p25.3. This locus also contains the coding region for caveolin 3 (CAV3). OXTR is a G-protein-coupled receptor that activates a phosphatidylinositol-calcium second messenger system [51]. The oxytocin-oxytocin receptor system plays a crucial role in the uterus during parturition and in lactation, and is associated with prosopagnosia [52]. In addition, the gene is related to pathways including myometrial relaxation and contraction pathways and RET signaling [53,54]. OXTR is also expressed in the amacrine cells of the inner nuclear layer. Transcriptome analysis revealed that the gene is implicated in the neuroactive ligand-receptor interaction, calcium signaling, and cAMP signaling pathways during age-related transcriptional changes in the human retinal pigment epithelium (RPE) [55]. Although there are no existing studies on the role of oxytocin receptors in the molecular mechanism of cataracts, epidemiological studies have found that breastfeeding is associated with a decreased likelihood of acquiring cataracts [56], through a largely unexplored molecular mechanism. It should also be noted that the oxytocin system has been associated with diabetes and adiposity [57].
The caveolin 3 (CAV3) gene, located at 3p25.3, encodes a caveolin protein that is a component of caveolae in the plasma membrane. It interacts with and regulates G-proteins and voltage-gated potassium channels. The gene is involved in pathways including smooth muscle contraction and the remodeling of adherens junctions. CAV3 mutations disrupt protein oligomerization or intracellular routing and cause limb-girdle muscular dystrophy type-1C (LGMD-1C) [58], hyperCKemia [59], or rippling muscle disease (RMD) [60]. The muscular manifestations of RMD are often misinterpreted as myotonia [61], although there is no known link between the two conditions. Because myotonia can present with cataract symptoms, a possible association between CAV3 and myotonia remains to be investigated.
ETS transcription factor ERG (ERG) is located on chromosome 21 at 21q22.2. Its association with cataracts was identified in the younger group in this study. This gene encodes a transcription factor belonging to the erythroblast transformation-specific (ETS) family, which comprises key regulators of embryonic development, cell proliferation, differentiation, angiogenesis, inflammation, and apoptosis [62,63]. The protein is required for inducing vascular cell remodeling and regulating hematopoiesis. Chromosomal translocations involving the ERG gene give rise to different fusion gene products, such as TMPRSS2-ERG and NDRG1-ERG in prostate cancer, EWS-ERG in Ewing's sarcoma, and FUS-ERG in acute myeloid leukemia [64,65]. Despite over two dozen recombination variants having been reported, the functions of these variants have not been determined, and any association with cataracts and other eye diseases remains to be unraveled.
In addition to exploring the molecular pathways of cataracts, we derived a polygenic risk score (PRS) that predicts cataract risk. The statistical significance of the PRS model supports the multifactorial nature of cataracts. A previous study created a PRS model containing six SNPs to predict cataract risk and reported a 2.47-fold increase in risk in the high PRS group compared to the low PRS group after covariate adjustments [66]. Here, we built models from more than 200 independent SNPs (218 for the younger group and 287 for the older group). In our model, younger patients within the highest PRS quantile had a 17.45-fold increased risk of acquiring cataracts compared to those in the lowest quantile, and older patients in the highest PRS quantile had a 10.97-fold increased risk. Thus, our model offers a useful genetic tool for recognizing high-risk cataract groups early. Additionally, although the number of older cases is nearly three times that of younger cases, the PRS showed a stronger effect in the younger group, which suggests that genetics may play a larger role in the younger population than in the older population. The area under the curve can also be improved by adding covariates such as age and diabetes, which suggests that environmental and genetic factors act together in the development of cataracts. Furthermore, this is the largest Taiwanese-based PRS cataract prediction model to date, which supports its potential for clinical applications.
The large-scale, multi-center biobank used in this study allowed us to estimate cataract risk with great statistical power. Previous studies have presented the differential genetic landscape of cataracts among ethnic groups. The comparison with BBJ and the UK Biobank supports the genetic disparity between the Han population and other ethnic groups, while the SNPs validated in BBJ and the UK Biobank hint at common cataractogenic molecular pathways. The PRS model shows potential for clinical application in the post-genomic era. Such a model may help ophthalmologists prompt high-risk individuals to avoid modifiable risk factors, such as UV exposure or steroid usage. Early recognition and prevention of cataracts may, as a result, reduce the demand for surgery. Nonetheless, our study is limited by the unavailability of data such as the age at diagnosis, clinical verification of diagnosis, and environmental risk factors, such as UV exposure or steroid usage. In addition, the inability to distinguish between different cataract subtypes, including the cortical, nuclear, and posterior subcapsular subtypes, decreases the statistical power to identify cataract-associated alleles. Since most genome-wide studies fail to adjust for such covariates, future investigation into gene-environment interactions is needed.
In conclusion, we analyzed data from the TWB2.0 and TWB1.0 databases. A total of 167 and 43 cataract-related SNPs were identified in the older and younger cataract groups, respectively. Further analyses are required to survey the differences in risk loci between cataract subtypes. Furthermore, a novel PRS model was built to identify patients susceptible to cataracts in each of the older and younger populations. The model was validated in an independent Han-based cohort from the TWB1.0 database. Overall, the newly identified genome-wide SNP loci, along with the PRS model, highlight the genetic basis of cataracts, open new avenues for molecular research, and have clinical significance for identifying individuals at high risk of cataracts.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.