Computational Screening and Analysis of Lung Cancer Related Non-Synonymous Single Nucleotide Polymorphisms on the Human Kirsten Rat Sarcoma Gene

The human KRAS (Kirsten rat sarcoma) is an oncogene, involved in the regulation of cell growth and division. The mutations in the KRAS gene have the potential to cause normal cells to become cancerous in human lungs. In the present study, we focus on non-synonymous single nucleotide polymorphisms (nsSNPs), which are point mutations in the DNA sequence leading to the amino acid variants in the encoded protein. To begin with, we developed a pipeline to utilize a set of computational tools in order to obtain the most deleterious nsSNPs (Q22K, Q61P, and Q61R) associated with lung cancer in the human KRAS gene. Furthermore, molecular dynamics simulation and structural analyses of the 3D structures of native and mutant proteins confirmed the impact of these nsSNPs on the stability of the protein. Finally, the experimental results demonstrated that the structural stability of the mutant proteins was worse than that of the native protein. This study provides significant guidance for narrowing down the number of KRAS mutations to be screened as potential diagnostic biomarkers and to better understand the structural and functional mechanisms of the KRAS protein.


Introduction
Lung cancer remains the most frequent cause of cancer-related death worldwide in the past few decades [1]. Kirsten rat sarcoma (KRAS) viral oncogene homolog mutant tumors constitute the most prevalent targetable molecular subtype of non-small cell lung cancer, which accounts for most of all lung cancer cases [2][3][4]. The KRAS gene encodes a small GTPase membrane-bound protein as the signaling molecule, whose mutations are vital to cellular proliferation and survival. Thus, the precise identification of mutations in the KRAS gene and the encoded protein is extremely important for a clearer understanding of their effects on cancer cell proliferation and survival. However, the experimental methods to detect the functional mutations in a genome or even in a single gene are both time-and resource-consuming. Therefore, it is crucial to develop in silico approaches to identify the functional significant mutations that might aid in the development of cancer cells regarding the KRAS gene.
Single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variations that occur in the coding or non-coding regions of a DNA sequence. There is one variation in every 200-300 bp in the whole human genome. These types of variations account for approximately 90%

Data Collection
All information about the human KRAS gene was retrieved from public web-based resources. The reported SNP mutations in the KRAS gene was collected from the dbSNP database (http://www.ncbi.nlm.nih.gov/snp/) [5]. The amino acid sequence (UniProt ID: P01116) that encodes PCA and DSSP Analysis

Data Collection
All information about the human KRAS gene was retrieved from public web-based resources. The reported SNP mutations in the KRAS gene was collected from the dbSNP database (http: //www.ncbi.nlm.nih.gov/snp/) [5]. The amino acid sequence (UniProt ID: P01116) that encodes a KRAS protein was retrieved from the UniProt database (https://www.uniprot.org/), while the protein 3D crystal structure (shown in Figure 2) was obtained from PDB (Protein Data Bank, http://www.rcsb.org/) with PDB ID 5VQ2 [19,20].

Prediction of Functional Consequences of nsSNPs
The functional effects of nsSNPs were predicted by SIFT (Sorting Intolerant from Tolerant) [7], SNAP2 (screening of non-acceptable polymorphism 2) [10], and PROVEAN (Protein Variation Effect Analyzer) [11]. nsSNPs were assigned as deleterious mutations by the consistent predictions of all three tools.
SIFT (http://sift.bii.a-star.edu.sg) is a program that predicts whether or not an amino acid substitution is responsible for changes in the protein function. Its prediction is based on the physicochemical properties of amino acids in the protein sequence and its sequence homologies [7]. The prediction results of the SIFT program can be categorized into two classes: deleterious and tolerated. The amino acid substitution is predicted to be deleterious if a SIFT score is between 0 and 0.05, while a score between 0.05 and 1 is regarded as tolerable.
SNAP2 (https://rostlab.org/services/snap) is a neural network-based prediction server which identifies the functional effects of amino acid sequence variants [10]. The prediction score ranges from -100 (strongly neutral prediction) to 100 (strong effect prediction), which reflects the likelihood of the single amino acid mutation that may alter the native protein function. PROVEAN (http://provean.jcvi.org) is a web-based server, which utilizes an alignment-based score approach, for prediction of the functional effect of amino acid variants [11]. We submitted the query protein sequence and amino acid variations to the PROVEAN server, which performed a BLAST search to collect homologous sequences, and the scores for each mutation were calculated. The threshold for PROVEAN scores was set to −2.5 to discriminate deleterious substitutions from neutral ones.

Estimation of Evolutionary Conservation of nsSNPs
The level of evolutionary conservation of each sequence position corresponds to the evolutionary rate, which is not constant among all amino acids in a protein. The amino acid positions which evolve slowly are commonly considered as conserved sites that are important for protein structure and function. The ConSurf server (http://consurf.tau.ac.il/) was used to estimate

Prediction of Functional Consequences of nsSNPs
The functional effects of nsSNPs were predicted by SIFT (Sorting Intolerant from Tolerant) [7], SNAP2 (screening of non-acceptable polymorphism 2) [10], and PROVEAN (Protein Variation Effect Analyzer) [11]. nsSNPs were assigned as deleterious mutations by the consistent predictions of all three tools.
SIFT (http://sift.bii.a-star.edu.sg) is a program that predicts whether or not an amino acid substitution is responsible for changes in the protein function. Its prediction is based on the physicochemical properties of amino acids in the protein sequence and its sequence homologies [7]. The prediction results of the SIFT program can be categorized into two classes: deleterious and tolerated. The amino acid substitution is predicted to be deleterious if a SIFT score is between 0 and 0.05, while a score between 0.05 and 1 is regarded as tolerable.
SNAP2 (https://rostlab.org/services/snap) is a neural network-based prediction server which identifies the functional effects of amino acid sequence variants [10]. The prediction score ranges from -100 (strongly neutral prediction) to 100 (strong effect prediction), which reflects the likelihood of the single amino acid mutation that may alter the native protein function. PROVEAN (http://provean.jcvi.org) is a web-based server, which utilizes an alignment-based score approach, for prediction of the functional effect of amino acid variants [11]. We submitted the query protein sequence and amino acid variations to the PROVEAN server, which performed a BLAST search to collect homologous sequences, and the scores for each mutation were calculated. The threshold for PROVEAN scores was set to −2.5 to discriminate deleterious substitutions from neutral ones.

Estimation of Evolutionary Conservation of nsSNPs
The level of evolutionary conservation of each sequence position corresponds to the evolutionary rate, which is not constant among all amino acids in a protein. The amino acid positions which evolve slowly are commonly considered as conserved sites that are important for protein structure and function. The ConSurf server (http://consurf.tau.ac.il/) was used to estimate the level of evolutionary conservation of amino acid positions in a protein, based on the phylogenetic relationships between homologous sequences [13,21,22]. We submitted both the protein sequence and structure to the ConSurf server, which calculates the conservation scores partitioned into a discrete scale of nine bins. The positions with bin 9 indicate the most conserved sites, while the positions with bin 1 indicate the most variable sites.

Prediction of Protein Change Stability of nsSNPs
Accurate prediction of protein stability changes upon single point mutations is important for understanding protein structure and function. In the present study, we used MuPro [14] and I-Mutant 2.0 [15] to predict protein stability changes for the SNPs. MuPro (http://mupro.proteomics.ics.uci.edu/) is a support vector machine-based tool to predict protein stability changes for single amino acid mutations based on protein sequence or/and structural features [14]. I-Mutant 2.0 (http://folding. biofold.org/i-mutant/i-mutant2.0.html) is another support vector machine-based web tool to make automatic predictions of protein stability changes upon single point mutations [15]. We uploaded the protein sequence, position of mutation, and the mutant residue, and the protein stability was predicted at default temperatures and pH. The reliability index value of the prediction that ranges from 0 (unreliable) to 10 (reliable) was also calculated.

Identification of Somatic Mutations that can Cause Cancer
Furthermore, we identify the somatic mutations that can cause cancer in the KRAS gene. The COSMIC (Catalogue of Somatic Mutations in Cancer, https://cancer.sanger.ac.uk/cosmic/) website was developed for curating the somatic mutations information related to human cancer [23].

Modeling of Native and Mutant KRAS Proteins
The crystal structure of the KRAS protein was obtained from PDB (Entry ID: 5VQ2; Chain: A; Resolution: 1.96 Å) [20]. All water molecules and ligands were removed from the crystal structure, and the Modeler 9.19 package was used to map the missing parts of structure on the wild-type (WT) protein [24]. Moreover, the WT structure was mutated by each one of the three most deleterious mutants predicted in the previous sections. The three structures of mutant (MT) proteins, such as Q22K, Q61R, and Q61P, were modeled by making a point mutation in the wild-type (WT) protein structure using PyMOL software. Then, we used the DynaMut [25] web server (http://biosig.unimelb.edu.au/dynamut) for an initial assessment of the impact of point mutations on protein dynamics and stability. The WT and three MT structures are shown in Figure 3. and the Modeler 9.19 package was used to map the missing parts of structure on the wild-type (WT) protein [24]. Moreover, the WT structure was mutated by each one of the three most deleterious mutants predicted in the previous sections. The three structures of mutant (MT) proteins, such as Q22K, Q61R, and Q61P, were modeled by making a point mutation in the wild-type (WT) protein structure using PyMOL software. Then, we used the DynaMut [25] web server (http://biosig.unimelb.edu.au/dynamut) for an initial assessment of the impact of point mutations on protein dynamics and stability. The WT and three MT structures are shown in Figure 3.

Molecular Dynamics Simulation and Trajectories Analysis
We used molecular dynamics simulation (MD) techniques to investigate the mechanism of structural impacts of the mutations on KRAS. MDs were performed using GROMACS 5.1.2 [26,27] software on an Ubuntu 16.04.5 operating system running on a machine equipped with a 12 terabyte hard-disk, 63 gigabytes RAM, Intel(R) Xeon(R) CPU E5-2640 processor. The Modeler 9.19 package was used to refine the structure of the WT protein, and the PyMOL software was used to map the mutations on the structures of mutant proteins. Then, the protein systems were solvated in a cubic box with SPC (simple point charge) water molecules and the walls were located ≥12 Å from all protein atoms. The box size was set to 4.256 nm × 4.061 nm × 4.142 nm, with box vectors of 6.7 × 6.7 × 6.7 nm, and box angles were kept at 90 • for each side. The total number of atoms in WT, Q22K, Q61P, and Q61R were 29,063, 30,817, 29,638, and 30,554, respectively. The simulation was performed using the CHARMM 36 force field [28] at a neutral pH, which was neutralized by adding a number of Na+ counter ions (7, 6, 7, and 6 for WT, Q22K, Q61P, and Q61R, respectively). The energy of each solvated system was minimized with 50,000 iterations, and the steepest descent minimization was terminated when the maximum force was below 1000 KJ/mol −1 /nm −1 . After the process of energy minimization, the system was equilibrated with pressure (1 bar) and constant temperature (310 K) at a time step of 2 fs. The LINCS (LINear Constraint SolVer) [29] constraints and non-bonded pair list were updated every 10 steps under the position restraint conditions for the heavy atoms. Electrostatic interactions were calculated using the particle mesh Ewald method. The v-rescale (modified Berendsen thermostat) temperature coupling method [30] was used to maintain the constant temperature inside the box. Finally, all the systems were simulated for a duration of 100 ns MD simulations and the coordinates were saved after an interval of every 2 ps.
After the completion of MD, trajectories were analyzed to compare and observe the structural deviation among the KRAS wild-type and mutant structures (MT). The root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), solvent-accessible surface area (SASA), and secondary structure calculation, were calculated by using the g_rms, g_rmsf, g_gyrate, g_sasa, and do dssp utilities of Gromacs.

Principal Component Analysis
Principal components analysis (PCA) or essential dynamics (ED) were used to reduce the dimensionality of the molecular dynamics simulations data in order to identify the configuration space of anharmonic motion with only a few degrees of freedom. PCA is a method that analyzes the MD trajectory and extracts dominant modes in the overall molecular motion. The motion of structures in a multidimensional space was identified by the most vital eigenvectors projection in Cartesian trajectory coordinates. In the ED analysis, we constructed a covariance matrix of WT and MTs backbone Cα atoms simulation trajectories which removed the rotation and translational movements. Furthermore, we calculated the eigenvectors and eigenvalues of the covariance matrices, and the projection of the first two principal components. We achieved the principal component analysis of trajectories using the Gromacs built-in utilities, such as gmx covar and gmx anaeig.

SNP Data Set from dbSNP
The dbSNP database contains a total of 12,071 SNPs for the KRAS gene. Among the 12,071 SNPs, 261 (2.2%) are missense mutations, which is a type of nonsynonymous substitution in DNA sequences, 131 (1.1%) are coding synonymous SNPs, 2005 (16.6%) SNPs are in the mRNA 3 UTR region, 257 (2.1%) are in the 5 UTR region, and 9754 (80.8%) are in the intronic region. The remaining 42 (0.3%) are nonsense, frame shift, 3 splice site, 5 splice site and stop gained SNPs. The distribution of SNPs is illustrated in Figure 4. We selected the 261 missense mutations for further investigation on the basis of our proposed workflow. Out of 261 mutations in the KRAS gene, 106 of them were mapped to the amino acid positions on the protein sequence (UniProt ID: P01116). Next to this, we identified the most likely pathogenic mutations that confer susceptibility to human diseases regarding the KRAS gene, with six in silico tools-SIFT, SNAP2, PROVEAN, ConSurf, MuPro, and I-Mutant2.0. In order to improve the prediction accuracy, we combined those computational methods that are based on the protein structural and/or functional parameters with necessary evolutionary information. Finally, we used the COSMIC database to identify the three nsSNPs associated with lung cancer. of SNPs is illustrated in Figure 4. We selected the 261 missense mutations for further investigation on the basis of our proposed workflow. Out of 261 mutations in the KRAS gene, 106 of them were mapped to the amino acid positions on the protein sequence (UniProt ID: P01116). Next to this, we identified the most likely pathogenic mutations that confer susceptibility to human diseases regarding the KRAS gene, with six in silico tools-SIFT, SNAP2, PROVEAN, ConSurf, MuPro, and I-Mutant2.0. In order to improve the prediction accuracy, we combined those computational methods that are based on the protein structural and/or functional parameters with necessary evolutionary information. Finally, we used the COSMIC database to identify the three nsSNPs associated with lung cancer.

Screening of Missense SNPs Based on Functional Analysis
A total of 106 missense mutations were used for the prediction of their functional effects via SIFT, SNAP2, and PROVEAN tools. Out of 106 nsSNPs, SIFT predicted 70 nsSNPs as 'intolerant', with scores ≤ 0.05 and the remaining 36 nsSNPs were predicted as 'tolerated', with a score greater than 0.05. SNAP2 predicted 90 nsSNPs as 'effect', with scores > 0, out of which 54 were predicted as 'effect' with scores ranging from 50 to 100, and 36 nsSNPs were predicted as 'effect' with scores ranging from 0 to 50. The remaining 16 nsSNPs were predicted as 'neutral', with scores < 0. Moreover, all the missense SNPs were also analyzed by PROVEAN. The mutations with scores less than or equal to −2.5, in case of PROVEAN, were considered 'deleterious', while the mutations with

Screening of Missense SNPs Based on Functional Analysis
A total of 106 missense mutations were used for the prediction of their functional effects via SIFT, SNAP2, and PROVEAN tools. Out of 106 nsSNPs, SIFT predicted 70 nsSNPs as 'intolerant', with scores ≤ 0.05 and the remaining 36 nsSNPs were predicted as 'tolerated', with a score greater than 0.05. SNAP2 predicted 90 nsSNPs as 'effect', with scores > 0, out of which 54 were predicted as 'effect' with scores ranging from 50 to 100, and 36 nsSNPs were predicted as 'effect' with scores ranging from 0 to 50. The remaining 16 nsSNPs were predicted as 'neutral', with scores < 0. Moreover, all the missense SNPs were also analyzed by PROVEAN. The mutations with scores less than or equal to −2.5, in case of PROVEAN, were considered 'deleterious', while the mutations with scores greater than −2.5 were predicted to be 'neutral'. According to the default threshold, out of 106 nsSNPs, 78 were predicted to be 'deleterious' and 28 nsSNPs were predicted to be 'neutral'. The predicted results of all three tools are shown in Table 1. The 64 nsSNPs (shown in bold) were predicted to be deleterious and were chosen for further investigation. The conservational level of an amino acid highly affects the protein's overall structure and function. Evolutionary information of proteins is vital for understanding those variations which are disease-causing mutations. Therefore, another round of confirmation was carried out to test the validity of our selected mutations, by analyzing the degree of conservation for a particular amino acid via the ConSurf server. ConSurf is an evolutionary conservation analysis tool that constructs a protein structural representation map with the colorimetric conservation score. The conservation scale in the range of 7 to 9 is considered to be conserved, while those in the range of 4 to 6 and 1 to 3 are considered to be average and variable, respectively. As we all know, disease-causing mutations often exist in the functional domains and reside on highly conserved positions. Based on the protein structural representation map (shown in Figure 5) and the results mentioned in Table 2, 32 mutations (shown in Table 2 with bold) out of 64 nsSNPs were observed to be highly conserved and were found to be located on the highly exposed accessible surfaces. All of the 32 mutations were further subjected to stability inspection. Residues are predicted to be exposed (e), buried (b), functional (i.e., highly conserved and exposed; f), structural (i.e., highly conserved and buried; s), or have insufficient data (x). Numbers indicate the residue number of KRAS protein.

Screening of Deleterious nsSNPs Based on the Stability Analysis
In this step, the stability analysis of our 32 nsSNPs was conducted with I-Mutant 2.0 [15] and MuPro [14]. The MuPro server predicted 31 nsSNPs to be 'decrease' and the remaining 1 to bes 'increase'. A negative score obtained from MuPro means the mutation decreases the protein's structure stability. On the contrary, if the score is >0, it means the mutation increases the protein's structure stability. At a pH of 7.0 and a temperature of 25 °C, the I-Mutant 2.0 was used to evaluate the stability of the mutants, whether they will cause a change in the protein structure stability. For this purpose, the free energy value (DDG value) and reliability index (RI) were computed. According to I-Mutant's threshold, a DDG score less than 0 (<0) or greater than 0 (>0) will be claimed Figure 5. ConSurf output using the UniRef90 protein database. Colors of the ConSurf output indicate the degree of sequence conservation. The two pole colors (blue, purple) indicate variability and conservation, respectively. Residues are predicted to be exposed (e), buried (b), functional (i.e., highly conserved and exposed; f), structural (i.e., highly conserved and buried; s), or have insufficient data (x). Numbers indicate the residue number of KRAS protein. Table 2. Results of the evolutionary conservation analyses using the ConSurf server. SNPs indicated in bold are predicted to be highly conserved and are selected for further evaluation.

rs ID Variant
Conservation Score SWISS-PROT UniProt UniRef90

Screening of Deleterious nsSNPs Based on the Stability Analysis
In this step, the stability analysis of our 32 nsSNPs was conducted with I-Mutant 2.0 [15] and MuPro [14]. The MuPro server predicted 31 nsSNPs to be 'decrease' and the remaining 1 to bes 'increase'. A negative score obtained from MuPro means the mutation decreases the protein's structure stability. On the contrary, if the score is >0, it means the mutation increases the protein's structure stability. At a pH of 7.0 and a temperature of 25 • C, the I-Mutant 2.0 was used to evaluate the stability of the mutants, whether they will cause a change in the protein structure stability. For this purpose, the free energy value (DDG value) and reliability index (RI) were computed. According to I-Mutant's threshold, a DDG score less than 0 (<0) or greater than 0 (>0) will be claimed as decreased or increased stability, respectively. I-Mutant 2.0 showed that 28 nsSNPs (shown in Table 3 with bold) have decreased the stability of the protein structure and 4 nsSNPs have increased the stability of the protein structure.

Lung Cancer Related Mutations by COSMIC Database
By combining the predictions of the SIFT, SNAP2, PROVEAN, Consurf, I-Mutant, and MuPro tools, of 106 nsSNPs, only 28 nsSNPs were found to be more deleterious. Furthermore, we used the COSMIC database to confirm the nsSNPs which confer a lung cancer phenotype. From the COSMIC database, we identified that rs121913236 (Q22K) and rs121913240 (Q61P and Q61R) cause lung tumors. Hence, those three nsSNPs structures were selected for further MD analysis.

Molecular Dynamics Simulation
Mutations may cause conformational changes in protein structures, which might lead to zero or poor protein function or production. In order to better understand the functional and structural behaviors of the prioritized deleterious mutations, we used molecular dynamics simulation to analyze the native and mutant (Q22K, Q61P, and Q61R) proteins. Four systems were built and 100 ns MDs were run by Gromacs. The trajectory files were generated after the molecular dynamics simulation, and various analyses, such as root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), solvent-accessible surface area (SASA) variations, and PCA analysis for the native and the three mutant structures were carried out.

Structural Stability Analysis
The RMSD value of backbone and C α atoms in WT and three MTs were calculated to assess the conformational stability of the protein during the simulations. As seen in Figure 6A (backbone-RMSD), the mutations caused a notable difference in RMSD pattern between WT and MTs. The RMSD values of the three mutants were constantly higher fluctuated and increased to 80 ns, but after that the three mutants' RMSD values were stable, with a comparatively lower fluctuation rate of around 0.35 nm. For the native protein, there was just one sharp increase in the first 9 ns and two lower fluctuations at 28-40 ns and 75-80 ns. The average backbone-RMSD values for WT and the three MTs are 0.1853, 0.2108, 0.2504, and 0.2240 nm, respectively (shown in Table 4). We compared the average backbone-RMSD values, which showed the order Q61P > Q61R > Q22K > WT. Meanwhile, the Figure 6B (Cα-RMSD) graph is similar to Figure 6A Table 4). There is an interesting result; it showed a higher average value for mutants and a lower average value for the native protein. From the results, we reached the conclusion that MTs could change the protein structure and mutations are not stable like native proteins. Since the protein needs a proper and stable structure to perform its function, we can speculate that the mutations alter the stability and activity of the protein.  Table 4). We compared the average backbone-RMSD values, which showed the order Q61P > Q61R > Q22K > WT. Meanwhile, the Figure 6B (Cα-RMSD) graph is similar to Figure 6A Table 4). There is an interesting result; it showed a higher average value for mutants and a lower average value for the native protein. From the results, we reached the conclusion that MTs could change the protein structure and mutations are not stable like native proteins. Since the protein needs a proper and stable structure to perform its function, we can speculate that the mutations alter the stability and activity of the protein.

Structural Flexibility Analysis
The RMSF values of WT and the three MTs Cα amino acids were calculated to determine whether the mutants affected the dynamic behavior of the residue. As seen in Figure 7, different fluctuating behaviors were observed in mutants and WT. All cases of mutant simulations had higher average Cα-RMSF values than the WT simulation, with the average RMSF values for mutants being 0.1251, 0.1183, 0.1289 nm for Q22K, Q61P, and Q61R, respectively, while the RMSF value for WT is 0.0939 nm (Table 4). According to the fluctuation score, we ranked the collected values as follows: Q61R > Q22K > Q61P > Wild. These results show that a higher degree of flexibility was observed in mutants (Q22K, Q61P, and Q61R) than that of the native protein structure. For a small protein, a fluctuation value below 2 Å is acceptable. Figure 7 shows that, for all cases of WT and MTs, most of the higher fluctuation occurred in the N-terminal domain and residues from Ala11 to Thr74 showed significant fluctuation. In the C-terminus domain, they also showed higher fluctuations in all cases of WT and the three mutations. At the same time, Figure 7 shows that the RMSF value of each

Structural Flexibility Analysis
The RMSF values of WT and the three MTs Cα amino acids were calculated to determine whether the mutants affected the dynamic behavior of the residue. As seen in Figure 7, different fluctuating behaviors were observed in mutants and WT. All cases of mutant simulations had higher average Cα-RMSF values than the WT simulation, with the average RMSF values for mutants being 0.1251, 0.1183, 0.1289 nm for Q22K, Q61P, and Q61R, respectively, while the RMSF value for WT is 0.0939 nm (Table 4). According to the fluctuation score, we ranked the collected values as follows: Q61R > Q22K > Q61P > Wild. These results show that a higher degree of flexibility was observed in mutants (Q22K, Q61P, and Q61R) than that of the native protein structure. For a small protein, a fluctuation value below 2 Å is acceptable. Figure 7 shows that, for all cases of WT and MTs, most of the higher fluctuation occurred in the N-terminal domain and residues from Ala11 to Thr74 showed significant fluctuation. In the C-terminus domain, they also showed higher fluctuations in all cases of WT and the three mutations. At the same time, Figure 7 shows that the RMSF value of each mutant residue was greater than that of WT. Therefore, the results indicate that mutation affected the flexibility of the whole protein, not just of the residual level. Overall, the mutations alter the flexibility and activity of proteins in our cases. The RMSF results are in agreement with that of the RMSD.

Structural Compactness Analysis
The radius of gyration (Rg) is defined as the mass-weight root mean square distance of a collection of atoms from their common center of mass. Rg provides an insight into the overall dimension of the protein. Hence, Rg is also a vital parameter to describe the dynamic stability and compactness of the total protein systems. The Rg was plotted for both Cα atoms and proteins against time over the whole 100 ns simulations at 310 K. From Figure 8A (Cα-Rg), the mutant (Q22K, Q61P, and Q61R) structures show a notable fluctuation and a clearly higher average Rg value than that of the native structure. The three mutant curves are similar to the native in the 0-60 ns time period, but after that native protein showed a stable Rg value, while the three mutant curves showed a sharp increase and higher fluctuation during the last simulation time. From Figure 8B (protein-Rg), it can be seen that the trend is similar to Figure 8A (Cα-Rg). Based on the Rg graph, it was found that the conformations of these three mutants are getting more dispersed and becoming significantly different to the native conformation in the simulation time period, whereas the native structure was stable compared to the mutant. The average Cα-Rg values were 1.4960, 1.5072, 1.5130, and 1.5128 nm in native and mutant (Q22K, Q61P, and Q61R) structures, respectively (shown in Table 4). According to the fluctuation scores, we ranked the collected values as follows: Q61P > Q61R > Q22K > Wild. The Rg results suggest that the mutation changed the protein structure with increasing flexibility in its conformation. In all, the Rg results are in good agreement with that of RMSD and RMSF.

Structural Compactness Analysis
The radius of gyration (Rg) is defined as the mass-weight root mean square distance of a collection of atoms from their common center of mass. Rg provides an insight into the overall dimension of the protein. Hence, Rg is also a vital parameter to describe the dynamic stability and compactness of the total protein systems. The Rg was plotted for both Cα atoms and proteins against time over the whole 100 ns simulations at 310 K. From Figure 8A (Cα-Rg), the mutant (Q22K, Q61P, and Q61R) structures show a notable fluctuation and a clearly higher average Rg value than that of the native structure. The three mutant curves are similar to the native in the 0-60 ns time period, but after that native protein showed a stable Rg value, while the three mutant curves showed a sharp increase and higher fluctuation during the last simulation time. From Figure 8B (protein-Rg), it can be seen that the trend is similar to Figure 8A (Cα-Rg). Based on the Rg graph, it was found that the conformations of these three mutants are getting more dispersed and becoming significantly different to the native conformation in the simulation time period, whereas the native structure was stable compared to the mutant. The average Cα-Rg values were 1.4960, 1.5072, 1.5130, and 1.5128 nm in native and mutant (Q22K, Q61P, and Q61R) structures, respectively (shown in Table 4). According to the fluctuation scores, we ranked the collected values as follows: Q61P > Q61R > Q22K > Wild. The Rg results suggest that the mutation changed the protein structure with increasing flexibility in its conformation. In all, the Rg results are in good agreement with that of RMSD and RMSF. stable compared to the mutant. The average Cα-Rg values were 1.4960, 1.5072, 1.5130, and 1.5128 nm in native and mutant (Q22K, Q61P, and Q61R) structures, respectively (shown in Table 4). According to the fluctuation scores, we ranked the collected values as follows: Q61P > Q61R > Q22K > Wild. The Rg results suggest that the mutation changed the protein structure with increasing flexibility in its conformation. In all, the Rg results are in good agreement with that of RMSD and RMSF.

SASA Analysis
The solvent-accessible surface area (SASA) is the surface area of a biomolecule that is accessible to a solvent. It is used to measure the degree to which an amino acid is exposed to its environments. A lower SASA value indicates a compact protein structure, while a higher SASA value indicates a diffused structure. An increase or decrease in SASA value indicates a change in the protein's structural conformation. The SASA values of the WT and three MT proteins were analyzed for predicting how the mutations affect the structure of the native protein. The SASA values calculated for the WT and three MTs with time are shown in Figure 9, and average SASA values are depicted in Table 4. The figure clearly shows that the three mutant proteins have a higher SASA value than that the native protein.
Results from Table 4 clearly indicate that the average SASA value of the WT protein is smaller than that of the MT proteins. The rank of collected average SASA values are listed as: Q61R (96.159 nm 2 ) > Q61P (96.109 nm 2 ) > Q22K (94.806 nm 2 ) > WT (93.008 nm 2 ) ( Table 4). These values represent that mutants may change the protein's tertiary structure. The three mutant structures increased the values of SASA so that the structure expands in comparison to the native structure. Therefore, the SASA results are also in agreement with the RMSD, RMSF, and Rg results.

SASA Analysis
The solvent-accessible surface area (SASA) is the surface area of a biomolecule that is accessible to a solvent. It is used to measure the degree to which an amino acid is exposed to its environments. A lower SASA value indicates a compact protein structure, while a higher SASA value indicates a diffused structure. An increase or decrease in SASA value indicates a change in the protein's structural conformation. The SASA values of the WT and three MT proteins were analyzed for predicting how the mutations affect the structure of the native protein. The SASA values calculated for the WT and three MTs with time are shown in Figure 9, and average SASA values are depicted in Table 4. The figure clearly shows that the three mutant proteins have a higher SASA value than that the native protein. Results from Table 4 clearly indicate that the average SASA value of the WT protein is smaller than that of the MT proteins. The rank of collected average SASA values are listed as: Q61R (96.159 nm 2 ) > Q61P (96.109 nm 2 ) > Q22K (94.806 nm 2 ) > WT (93.008 nm 2 ) ( Table 4). These values represent that mutants may change the protein's tertiary structure. The three mutant structures increased the values of SASA so that the structure expands in comparison to the native structure. Therefore, the SASA results are also in agreement with the RMSD, RMSF, and Rg results.

Principal Component Analysis
Principal component analysis (PCA) or essential dynamics (ED) analysis are widely used for predicting the dynamic behaviors of a protein. We performed PCA to identify large-scale collective motions of the WT and three MTs on the trajectories generated by our simulations. The eigenvalues of the WT and MT proteins were plotted against the corresponding eigenvector index for the first fifty modes of motion ( Figure 10A). The eigenvalues indicated fluctuations of the eigenvector in the hyperspace, and the figure shows that only a few eigenvectors have larger eigenvalues which played a major role in the overall motion of the WT and MTs. It was found that the first five eigenvectors

Principal Component Analysis
Principal component analysis (PCA) or essential dynamics (ED) analysis are widely used for predicting the dynamic behaviors of a protein. We performed PCA to identify large-scale collective motions of the WT and three MTs on the trajectories generated by our simulations. The eigenvalues of the WT and MT proteins were plotted against the corresponding eigenvector index for the first fifty modes of motion ( Figure 10A). The eigenvalues indicated fluctuations of the eigenvector in the hyperspace, and the figure shows that only a few eigenvectors have larger eigenvalues which played a major role in the overall motion of the WT and MTs. It was found that the first five eigenvectors had significantly dominant motions with a higher eigenvalue, and the remaining eigenvectors were observed to have extremely low eigenvalues in the overall system ( Figure 10A). Thus, we calculated the percentage of first five principal components occupying up to the total observed fifty modes of motion. Throughout the four systems, the first five eigenvectors accounted for 67.32%, 79.94%, 73.87%, and 72.02% of the WT and three MTs (Q22K, Q61P, and Q61R) respectively. These analyses suggest that the mutation changed the structural dynamics of the mutant proteins. Furthermore, we selected the first two principal components (PC1, PC2) to analyze their projection of trajectories during the WT and MT simulations in the phase space (shown in Figure  10B-D). During the four system simulations, the results clearly show that the WT protein covered a smaller region of phase space, while all three MTs occupied a larger region of phase space. Therefore, the PCA results suggest that the WT protein is more stable than the three MT proteins, and these mutations highly altered the structural stability and flexibility. In short, the PCA results are also in agreement with the RMSD, RMSF, Rg, and SASA results, which enhances the validity of the performed analysis.

Secondary Structure Analysis
The secondary structure of the protein was investigated for general alterations in the domain layout. In order to investigate the secondary structure changes over time, the built-in do_dssp function of GROMACS was used. The secondary structure of the four protein systems (WT, Q22K, Q61P, and Q61R) during the 100 ns MD simulation was retrieved with a follow up step every 100 ps. Figure 11 represents the four secondary structures' layout, illustrating that non-significant changes can be observed in the case of mutant layouts as compared to the wild-type. Hence, the number of residues involved in the formation of each type of secondary structure for all the examined protein systems was particularly monitored to address the protein structure and outcomes more clearly, which are presented in the Figure 12. From Figure 12, the changes in the given protein secondary structure pattern illustration are not very clear. Therefore, it was necessary to obtain more secondary structure information to validate our obtained outcomes. The total secondary structure content averaged over the trajectories for four protein systems is given in Table 5. The listed values in the given table indicate the percentages of secondary structures, which reveals a minor difference between the WT and MT systems. This confirmed the results shown in the secondary structure diagram. Furthermore, we selected the first two principal components (PC1, PC2) to analyze their projection of trajectories during the WT and MT simulations in the phase space (shown in Figure 10B-D). During the four system simulations, the results clearly show that the WT protein covered a smaller region of phase space, while all three MTs occupied a larger region of phase space. Therefore, the PCA results suggest that the WT protein is more stable than the three MT proteins, and these mutations highly altered the structural stability and flexibility. In short, the PCA results are also in agreement with the RMSD, RMSF, Rg, and SASA results, which enhances the validity of the performed analysis.

Secondary Structure Analysis
The secondary structure of the protein was investigated for general alterations in the domain layout. In order to investigate the secondary structure changes over time, the built-in do_dssp function of GROMACS was used. The secondary structure of the four protein systems (WT, Q22K, Q61P, and Q61R) during the 100 ns MD simulation was retrieved with a follow up step every 100 ps. Figure 11 represents the four secondary structures' layout, illustrating that non-significant changes can be observed in the case of mutant layouts as compared to the wild-type. Hence, the number of residues involved in the formation of each type of secondary structure for all the examined protein systems was particularly monitored to address the protein structure and outcomes more clearly, which are presented in the Figure 12. From Figure 12, the changes in the given protein secondary structure pattern illustration are not very clear. Therefore, it was necessary to obtain more secondary structure information to validate our obtained outcomes. The total secondary structure content averaged over the trajectories for four protein systems is given in Table 5. The listed values in the given table indicate the percentages of secondary structures, which reveals a minor difference between the WT and MT systems. This confirmed the results shown in the secondary structure diagram.     Figure 12. Number of residues involved in the formation of each type of secondary structure for wild and mutant (Q22K, Q61P, and Q61R) KRAS proteins, with respect to simulation time. In the previous analysis, we learned that mutation changed the stability and flexibility of the protein. The mutant will cause the protein stability to weaken and the flexibility to increase. However, in the secondary structure analysis, we found that the mutation did not cause the protein to produce an obvious conformational drift. To examine how the structure changed and affected the functions upon the incorporation of these mutations, we analyzed the wild and mutants (Q22K, Q61P, and Q61R) superimposed structures at various time steps, depending on backbone RMSD (shown in Figure 13). The results show that mutants can cause severe turbulences in several loop regions of the protein. In addition, we also extracted the average structure (Wild, Q22K, Q61P, and Q61R) from the trajectory to analyze the interaction structure plot ( Figure S1) with LIGPLOT, and analyzed the hydrogen bonds ( Figure S2) between the mutation and the direct neighborhood during the simulation. Both results showed that these three mutants led to a considerable decrease in the number of H-bonds for the mutants and their direct neighborhood. These results indicate that point mutation can directly affect the stability of its interaction with the surrounding residues, which in turn results in the change in protein structure.
The results further indicated that mutant (Q22K, Q61P, Q61R) structures has more f instability and flexibility than the wild structure. It is well known that the KRAS protein is a signal switch molecule that regulates cell fates by coupling receptor activation to downstream effector pathways that control diverse cellular responses, including proliferation, differentiation, and survival. The three mutations we chose can change the stability of the natural KRAS protein; these changes induce protein structural alterations, which in turn affect its function, as reported by Chikan et al. [31]. Therefore, these three mutations may cause the KRAS protein to be unable to perform its native function. That is to say, its binding mode with GTP and GDP cannot be normally converted, which makes the regulation of the KRAS protein on the downstream invalid, leading to various diseases. Much research on KRAS mutations mainly focuses on targeted therapy drugs [32][33][34], testing for KRAS mutations [35], and clinical research [36,37]. We used molecular dynamics to study the conformation of mutated proteins, which is of great significance to the follow-up study of molecular mechanism after mutation and drug design.
HIS-PHE-VAL-ASP-GLU-TYR-ASP-PRO-THR-ILE-GLU). In addition, we also extracted the average structure (Wild, Q22K, Q61P, and Q61R) from the trajectory to analyze the interaction structure plot ( Figure S1) with LIGPLOT, and analyzed the hydrogen bonds ( Figure S2) between the mutation and the direct neighborhood during the simulation. Both results showed that these three mutants led to a considerable decrease in the number of H-bonds for the mutants and their direct neighborhood. These results indicate that point mutation can directly affect the stability of its interaction with the surrounding residues, which in turn results in the change in protein structure. Wild-Q22K: Wild-Q61P: Wild-Q61R:

Conclusions
In this study, we combined computational screening approaches and a public pathogenic database to identify three disease-associated nsSNPs (Q22k, Q61P, and Q61R), which are confirmed to be highly deleterious and can play a crucial role in the progression of lung cancer. Furthermore, the molecular dynamics simulation approach was used to validate the effect of these deleterious point mutations on the KRAS protein structure. The stability, flexibility, and compactness alterations in the mutants were observed in the RMSD, RMSF, and Rg graphs. The experimental results were further supported by an increase in the SASA values and a larger region of phase space in the PCA analysis. Finally, the secondary structure analysis results also suggest that WTs have a more stable cluster in comparison to MTs, and mutation induces structural change in WT proteins. All the results proved that these three mutations can alter the stability and function of the native KRAS protein. Overall, our study provides a comprehensive pipeline to detect the lung cancer-associated nsSNPs which are highly responsible for affecting the native protein dynamics that make the carrier, i.e., humans, more susceptible to developing oncogenic conditions. This study also provides insight and guidance for the design of therapeutic strategies against human lung cancer in the future. It should be noted that PyMol manipulation possibly produces destabilization in the presented study. The manipulation of the structure can introduce strain that cannot really be removed later by energy minimization and relaxation. We will devote ourselves to the resolution of this potential problem in further studies.
Supplementary Materials: The following are available online. Figure S1: Interaction plot of wild-type and the mutated residues with the neighborhood; Figure S2: The Number of hydrogen bonds for WT and MTs with respect to simulation time. The simulations in this work were supported by the Center for High Performance Computing, Shanghai Jiao Tong University. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Conflicts of Interest:
The authors declare that there are no competing interests.