In Silico Study to Predict the Structural and Functional Consequences of SNPs on Biomarkers of Ovarian Cancer (OC) and BPA Exposure-Associated OC

Background: Recently, we have shown that seven genes, namely GBP5, IRS2, KRT4, LINCOO707, MRPL55, RRS1 and SLC4A11, have prognostic power for the overall survival in ovarian cancer (OC). Methods: We present an analysis on the association of these genes with any phenotypes and mutations indicative of involvement in female cancers and predict the structural and functional consequences of those SNPS using in silico tools. Results: These seven genes present with 976 SNPs/mutations that are associated with human cancers, out of which 284 related to female cancers. We have then analysed the mutation impact on amino acid polarity, charge and water affinity, leading to the identification of 30 mutations in gynaecological cancers where amino acid (aa) changes lead to opposite polarity, charges and water affinity. Out of these 30 mutations identified, only a missense mutation (i.e., R831C/R804C in uterine corpus endometrial carcinomas, UCEC) was suggestive of structural damage on the SLC4A11 protein. Conclusions: We demonstrate that the R831C/R804C mutation is deleterious and the predicted ΔΔG values suggest that the mutation reduces the stability of the protein. Future in vitro studies should provide further insight into the role of this transporter protein in UCEC.


Introduction
Ovarian carcinoma (OC) is the most fatal gynaecologic malignancy, accounting for more than 200,000 deaths annually (WHO; Cancer Today). Over 80% of patients with advanced OC will relapse, and despite further good remissions from additional chemotherapy and surgery, they will usually die from their disease [1]. The median progression-free survival (PFS) for relapsed ovarian cancer (ROC) patients who last had treatment within 3-12 months previously is 4-9 months, with overall survival (OS) of~12-20 months [2]. It should be noted that there is a genetic variation of response to chemotherapy and subsequently to tumour progression [3].
A plethora of studies-primarily via genome-wide association studies-have conclusively demonstrated an association between single-nucleotide polymorphisms (SNPs) and cancer risk [4]. There is a high frequency of SNPs occurrence in the human genome. In particular, amino acid point mutations or non-synonymous single-nucleotide polymorphisms (nsSNPs) may alter the structure and subsequently affect the function of the mutated protein [5]. More than 13,000 known SNPs are in exon regions, of which 58% are nsSNPs [6]. Indeed, a number of nsSNPs are associated with an increased cancer risk [7]. For example, nsSNPs in codon 31 of the p21 gene are associated with an increased risk of cervical cancer development [8].
In this study, we aimed to predict the structural and functional consequences of SNPs mapped in genetic variants of these seven biomarkers in gynaecological malignancies.

Landscape of Mutations in Seven Biomarker Genes Based on TCGA, cBioPortal and UK Biobank
We have previously identified seven biomarkers of OC and exposure-associated OC, as discussed [17]. We found that these 7 biomarkers represent 976 and 284 SNPs/mutations associated with human cancers and female cancers, respectively. It should be noted that in Figure 1, we did not illustrate UK BioBank (PhenoScanner)-associated mutations (Table 1) as it has no overlapping/intersection with any other database (cBioPortal or TCGA).     Table 2 summarises the mutation impact on protein structure and function, including amino acid (aa) polarity, charges and water affinity.  We extracted the gynaecological cancer amino acid changes (n = 30) ( Table 3) according to the selection criteria in Figure 5.  Out of 30 gynaecological cancer amino acid changes, only 1 amino acid change, at R831C/R804C, has detected the structural damage of the protein SLC4A11, therefore, we modelled this protein (SLC4A11) with SNP at R831C/R804C in uterine corpus endometrioid carcinoma ( Figure 6). The reason for the 2 different positions is due to the presence of 3 distinct N-terminal variants of human SLC4A11: 918 amino acid splice form 1 (where the mutation is at position 831), 891 amino acid splice form 2 (where the mutation is at position 804) and 875 amino acid splice form 3 (where the mutation is at position 788) [18,19].
For the 918 amino acid variant, the R831C substitution does not alter the secondary structure, but this substitution leads to the expansion of cavity volume by 97.2 Å 3 . Cavity also refers to a pocket on the surface ( Figure 6). This substitution also results in a change between the buried and exposed state of the target variant residue. ARG is buried (RSA 7.6%) and CYS is exposed (RSA 20.7%). In the same protein, an increased z-score from −3.23 to −1.19 was noted, whereas for the mutant-type protein, the z-score changed from −3.24 to −1. 16.
For the 891 amino acid variant, the R804C substitution does not alter the secondary structure, but this substitution leads to the expansion of cavity volume by 99.792 Å 3 . Cavity also refers to a pocket on the surface (Figure 7). This substitution also results in a change between the buried and exposed state of the target variant residue. ARG is buried (RSA 6.8%) and CYS is exposed (RSA 20.0%). Similarly, an increased z-score from −3.22 to −1.09 was also recorded for the wildtype protein and a similar change (from −3.22 to −1.11) for the mutant. Moreover, we created an electrostatic potential surface for solute carrier family 4, sodium borate transporter, member 11 protein ( Figure 8). As the colour legend indicates, the red colour (negative potential) arises from an excess of negative charges near the surface and the blue colour (positive potential) occurs when the surface is positively charged. The white regions correspond to fairly neutral potentials.
Arginine (R) is a positively charged, polar and hydrophilic amino acid in proteins that has a profound role in protein structure and function that involves electrostatic interactions and protein solvation [20]. Alternatively, cysteine (C) is a non-polar, uncharged and hydrophobic amino acid, and the substitution from R to C may have a deleterious impact on the protein hydration and electrostatic interactions of the protein. When we used PROVEAN (Protein Variation Effect Analyzer), a software tool which predicts whether an amino acid substitution has an impact on the biological function of a protein, it provided a score of −7.292 with the annotation "Deleterious" for both R831C and R804C. The default score threshold is currently set at −2.5 for binary classification (i.e., deleterious vs. neutral). We have further evaluated changes in protein stability using MUpro: Prediction of Protein Stability Changes for Single-Site Mutations from Sequences [21,22], where Delta Delta G (DDG), a metric for predicting how a single point mutation will affect protein stability, was measured. In both variants, the predicted DDG was −0.704, suggesting a decrease in protein stability. Similar data were obtained from the BIOCOMP.UNIBO prediction server [23], with a DDG of −0.67 and a prediction of a disease-related mutation. Finally, we have used the DeepDDG server [24] that predicts the stability change of protein point mutations using neural networks and calculated a DDG value of −1.802 (kcal/mol).

Discussion
In this study, we provided a comprehensive overview of a wide repertoire of mutations of seven recently predicted biomarkers for OC that can be acquired using a number of in silico tools. These 7 genes present with 976 SNPs/mutations that are associated with human cancers, out of which 284 are related to female cancers that include ovarian, cervical, endometrial cancer, as well as endometrioid and uterine carcinomas. The most prevalent type of mutation occurring on six (i.e., GBP5, IRS2, KRT4, MRPL55, RRS1 and SLC4A11) out of seven genes was missense mutation, followed by silent and 3 untranslated region (3 UTR) mutations. In the case of LINC00707, being a long non-coding RNA (lncRNA), non-coding transcript exon and intron mutations were the only two types identified in both all cancers and female ones. In both cases, SLC4A11 had the largest percentage of mutations out of all 7 genes at 29.4% and 28.9%, respectively.
In missense mutations, there is a change of a single nucleotide, resulting in a codon that can produce a different amino acid. Using the Human Genome Database as a paradigm, it is evident that several missense mutations are linked with inherited predispositions to malignancies [25]. For example, in a recent analysis of more than 113,000 women, missense variants for BRCA1, BRCA2 and TP53 were associated with a risk of breast cancer [26]. Equally, a number of studies have indicated that mutations at the 3 UTR can drive oncogene activation or inactivation of tumour suppressors by altering the binding efficiency of microRNAs [27,28]. For example, a GAPDH mutation in the 3 UTR creates a miR-125b binding site, and as a result facilitates the development of OC [27].
On the other hand, the mutational landscape for the lncRNA LINC00707 is quite different. We know that lncRNAs exhibit a complex biology and are involved in a number of processes, including gene transcription or gene silencing [29]. Although there is no published data on intronic mutations and their impact on LINC00707, a recent study highlighted their importance in cancer, since 64 tumour suppressors were affected by intronic mutations, and blood cancers showed higher proportions of deep intronic mutations [30].
We have then provided a deeper insight into the percentage of mutation of each of the seven genes of interest in all cancers and in female cancers. For the latter, the largest percentage (28.9%) was attributed to SLC4A11, with GBP5 and KRT4 exhibiting a high percentage as well (21.5% and 20.4%, respectively). In this cohort of cancers, the largest datasets were of uterine endometrioid carcinoma (n = 102) and uterine corpus endometrioid carcinoma (UCEC; n = 85). UCEC is the most common female pelvic malignancy, and the sixth most common gynaecological malignancy in females, with an estimated 417,367 new cases and 97,370 deaths worldwide in 2020 [31]. Despite the wide repertoire of therapeutic options for UCEC, there is an increase in the incidence of endometrial cancer. Of note, numerous shared and cancer type-specific mutation signatures have been identified, with UCEC depicting a number of clusters with distinct mutation frequencies [32]. Out of the seven genes in question, only one study associates the IRS2 polymorphism G1057D with endometrial cancer [33].
We then analysed the mutation impact on amino acid polarity, charge and water affinity, leading to the identification of 30 mutations in gynaecological cancers where amino acid changes lead to opposite polarity, charges and water affinity. Out of 30 gynaecological cancer amino acid changes, only missense mutation (i.e., R831C/R804C in UCEC) was suggestive of structural damage on the solute carrier family 4, sodium borate transporter, member 11 protein. Therefore, we modelled this protein and provided in silico evidence of how a change from arginine (R) to cysteine (C) can exert potential deleterious consequences.
SLC4A11 is a member of the SLC4 family of bicarbonate transporters that is primarily expressed as an integral membrane protein, with aberrant expression in the cornea, thyroid, salivary gland and kidney. This transporter is also involved in sodium-mediated fluid transport in different tissues. The human SLC4A11 gene encodes three splice variants at the NH2 terminus. These include the 918 variant A, the 891 amino acid variant B and the 875 amino acid variant C [18,19]. Of these, according to UniProt, SLC4A11-B is the canonical sequence. To date, most of the work on SLC4A11 is concentrated on corneal dystrophies. Indeed, mutations of SLC4A11 are the cause of congenital hereditary endothelial dystrophy (CHED) and some cases of late-onset Fuchs endothelial corneal dystrophy (FECD) [18]. Interestingly, one the mutations found in families with autosomal recessive corneal endothelial dystrophy (CHED2) was on arginine 804 (G804A). The authors of the study argued that the mutation can alter the hydrophobic interaction of methyl groups located in the arginine stem, thus impacting on the loop stability [34].
In this study, we have shown that (1) the R831C/R804C mutation is deleterious and (2) predicted ∆∆G values suggest that the mutation reduces the stability of the protein. As mentioned, DDG is the change in Gibbs free energy (Gibbs free energy (G) = Enthalpy (H) − Temperature (T) × Entropy (S)) [24]. There is also a strong structural explanation for the change in stability: Arg-831 is in a salt bridge with nearby Glu-519, so R831C will have a large enthalpic impact. However, we acknowledge that it is difficult to further dissect the functional impact of this change in stability without embarking on in vitro studies, mutating the protein in cellular models of UCEC. We also acknowledge that the cavity hypothesis is limited by the neglect of protein-membrane interactions in YASARA. Very recently, a new artificial intelligence system (AI) that predicts 3D protein structures with high accuracy has emerged, termed AlphaFold [35]. Subsequently, we have modelled our predicted structures of the two SLC4A11 protein variants with that of AlphaFold and there is 100% alignment in the R804 transmembrane region (Supplementary Figure S1), suggesting a conserved 3D configuration irrespective of the modelling software.
In terms of its role in female reproductive organs, the only data available come from a study in OC, where high expression of SLC4A11 is a predictor for poor overall survival in serous OC (grade 3/4) [36]. Leveraging data from TCGA and GTEX, we also demonstrated significant upregulation of SLC4A11 in UCEC (Supplementary Figure S2). Future studies should concentrate on gaining a deeper understanding of the actual role of this transporter protein in UCEC and how this deleterious mutation might affect its function, as the normal function(s) of SLC4A11 in gynaecological malignancies still remains unclear.

Data Availability
Xena Repository: Somatic mutation data and sample phenotype information were extracted from the data generated by The Cancer Genome Atlas (TCGA) research network and TCGA somatic mutations (Pan-cancer Atlas), as published in the Xena repository hosted at the University of California Santa Cruz (UCSC) [37].
UK BioBank: Genetic variation/mutation data were extracted from PhenoScanner (version 2), which is a curated database holding publicly available results from large-scale genome-wide association studies (GWAS) for the UK Biobank data. This tool helps to facilitate "phenome scans", the cross-referencing of genetic variants with a broad range of phenotypes, to help aid the understanding of disease pathways and biology.
cBioPortal: Genomic alterations across a set of patients were quarried from cBioPortal (for cancer genomics), an exploratory analysis tool for exploring large-scale cancer genomic datasets that hosts data from large consortium efforts, such as TCGA and TARGET, as well as publications from individual labs. The cBioPortal assists to explore specific genes or a pathway of interest in one or more cancer types.
Statistical Analysis: All unstructured data gathering, processing, modelling and statistical analyses were conducted using R (v. 4

Protein Structure Prediction Tools
UniProt Knowledgebase: The amino acid sequence of the protein of interest was extracted from the UniProt Knowledgebase (UniProtKB) (https://www.uniprot.org (accessed on 10 November 2021)), which is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. It records the information extracted from the literature and curator-evaluated computational analysis.
Protein Data Bank (RCSB PDB): We used the Protein Data Bank (PDB) (https://www. rcsb.org (accessed on 10 November 2021)) to gather the known protein structure information of our genes of interest. It is the single worldwide archive of structural data of biological macromolecules. It includes data obtained by X-ray crystallography and nuclear magnetic resonance (NMR) spectrometry submitted by biologists and biochemists from all over the world.
Phyre2: In order to predict the three-dimensional (3D) structure of our desired protein sequence/gene, we used Phyre2 (v. 2.0). The software assists with the construction of 3D models of our protein of interest based on the alignments between the hidden Markov model (HMM) of the desired sequence and the HMMs of known structure. SWISS-MODEL: We also used a fully automated 3D protein structure homologymodelling server, SWISS-MODEL (https://swissmodel.expasy.org/ (accessed on 10 November 2021)), to predict the 3D structure of our desired protein sequence. Homology modelling is currently the most accurate method to generate reliable 3D protein structure models, as it makes use of experimental protein structures ("templates") to build models for evolutionary-related proteins ("targets").
AlphaFold: The Protein Structure Database (https://alphafold.ebi.ac.uk/ (accessed on 10 November 2021)), an AI system which is able to computationally predict protein structures with unprecedented accuracy and speed, was also used to predict the 3D structure.
YASARA Energy Minimisation Server: Energy minimisation of the protein was performed using the YASARA server (http://www.yasara.org/minimizationserver.htm (accessed on 10 November 2021)), and the YASARA application (v. 21.8.26) was used to view and save the 3D energy-minimised structure in PDB format.
PyMOL: Electrostatic potential surfaces, electron densities and three-dimensional (3D) visualisation of proteins were analysed by PyMOL (v. 2.4.1), which is an open-source molecular visualisation platform.