You are currently viewing a new version of our website. To view the old version click .
International Journal of Molecular Sciences
  • Article
  • Open Access

10 December 2025

Nucleotide Substitution Biases in Related Cancer Driver Genes

,
,
and
1
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
2
Department of Molecular Genetics, The Ohio State University, Marion, OH 43302, USA
3
The Ohio State University James Comprehensive Cancer Center, Columbus, OH 43210, USA
*
Authors to whom correspondence should be addressed.
Int. J. Mol. Sci.2025, 26(24), 11903;https://doi.org/10.3390/ijms262411903 
(registering DOI)
This article belongs to the Special Issue New Advances in Cancer Genomics

Abstract

Nucleotide substitutions are common in cancer cells, and they occur in both protein coding regions and non-coding regions (5′ and 3′ UTRs and introns). Although substitutions in non-coding regions have the potential to alter gene expression, it is the alteration of coding regions that affects protein function and has the most drastic effect on cellular transformation. Mutations in certain genes (e.g., TP53, KRAS) are common to nearly all cancers, but most cancers are characterized by specific gene mutation signatures. In this report, we investigated nucleotide substitution signatures in coding regions of the top 25 most frequently mutated genes in multiple human cancers. The goal was to examine whether unique nucleotide substitution biases are associated with various cancers. A pan-cancer analysis showed that the most altered nucleotide is guanine, which is biased towards G->A transitions. A per-cancer analysis identified ten cancers with biased substitutions in certain genes. Some of these biases were expected (e.g., KRAS in gastrointestinal cancers or JAK2 in blood cancers). Our analysis revealed biased signature substitutions in 17 genes, of which 14 were characterized as drivers and constituted a closely related set of cell cycle regulators. We conclude that nucleotide substitution biases contribute to specific alterations in cancer genes that produce cellular transformation. Principle component analysis of nucleotide substitutions shows that most cancers cluster together, meaning that they have similar nucleotide changes. However, certain cancers, most notably lung, pancreas, and blood cancers, can be differentiated from each other based on specific nucleotide signatures. Thus, nucleotide substitution patterns can be used to differentiate between some cancers.

1. Introduction

Mutation is the primary driver of genetic alteration that produces both evolutionary change and cellular transformation which causes cancer [1,2]. Mutations can be generated by endogenous processes such as DNA replication or may be caused by exogeneous factors like mutagens and carcinogens. Recent comprehensive analyses of cancer genomes have shown that endogenous and exogeneous factors can cause distinct patterns of mutation signatures [3,4,5,6,7,8]. Although many forms of genomic alterations are observed in cancer cells (e.g., nucleotide substitutions, deletions, insertions, translocations, and other chromosomal re-arrangements), substitutions are the most common [9].
Not all nucleotide substitutions have a similar effect on cellular transformation. For example, synonymous mutations do not alter the protein sequence and are considered less deleterious than non-synonymous mutations. However, new evidence has shown that even synonymous mutations can act as driver of cellular transformation as they may affect transcription, translation, and mRNA stability among other factors [10,11,12,13,14,15]. Non-synonymous mutations have a greater effect and can significantly alter or even inactivate gene function. However, of the 20,000+ genes in the human genome, mutations in only a few have been shown to significantly contribute to cancer. Those genes that have the greatest potential to affect cellular transformation and immortalization have been called “driver” genes [16], while genes that do not significantly contribute to cancer are known as “passenger” genes [17].
Mutation hotspots have been identified in certain genes. A hotspot is defined as a residue or a region of a gene that is mutated more often than would be expected (e.g., found in multiple cancer patients) [18]. For example, the KRAS proto-oncogene accumulates primarily the G12D/V/C mutation [19], and IDH1 has a high-frequency signature at R132H [20]. Numerous other examples exist [21]. The accumulation of these hotspots in cancer genomes is driven by so-called “purifying selection”, which is defined as removal of harmful mutations and selection of mutations that promote fitness [22,23,24,25]. Purifying selection acts both at the evolutionary level and in cancer [18,26].
Although mutations in certain key cell cycle regulators permeate nearly every cancer (e.g., TP53 [27]), most cancers are characterized by mutations in a distinct set of genes that promote transformation and metastasis [28,29,30]. This important cancer classification has allowed development of targeted therapies that are cancer-specific [31,32] as opposed to broad spectrum chemotherapy [33]. Unlike chemotherapy, targeted therapy has the advantage of reducing side effects, though it often leads to acquired resistance [34,35]. Targeted therapy also requires a deeper molecular understanding of specific mutation that can be actionable (e.g., single molecule inhibitors can be developed) [36,37]. Thus, an understanding of the nature of somatic mutations acquired by cancer cells is important for the development of increasingly effective therapies.
With technological advances in high-throughput genome sequencing, there has been a push to identify cancer mutation signatures. Genome data have been deposited in repositories such as the Catalogue of Somatic Mutations in Cancers (COSMIC) [38], and investigators have used these data to identify cancer signature profiles. For example, an elegant study by Alexandrov et al. [9] has identified and catalogued several single-nucleotide mutation signatures associated with various cancer etiologies. Several other studies using the same COSMIC data have revealed more complex signatures such as DNA double-strand breaks patterns and have also illuminated some of the molecular processes by which they occur [21,39,40]. Other repositories, such as the cBioPortal [41,42] or The Cancer Genome Atlas [43], have also been used to identify cancer signatures [7]. These studies have laid the groundwork distinguishing cancers based on mutational profiles as well as the development of clinical therapies [8,44].
In this report, we built on these previous studies and used the COSMIC database [38] to identify signatures in the 25 most frequently mutated genes in every human cancer. Our reasoning was that while previous studies characterized genome-wide mutations, most cancers are characterized by alterations in a few key cell cycle regulators. The goal was to see if certain nucleotide substitutions are more likely to occur than others within the most frequently mutated genes rather than looking at entire genomes.

2. Results and Discussion

2.1. Pan-Cancer Nucleotide Substitution Patterns in 25 Most Frequently Mutated Genes

We first performed a pan-cancer analysis of nucleotide substitutions. Using the COSMIC database, we downloaded nucleotide substitutions data for 43 cancers and an additional file where cancers were not specified (NS). For the analysis presented in this report, we only interrogated mutation signatures in the 25 most frequently mutated genes, which represents 7,035,115 total mutations. Note that this includes multiple samples with the same mutation (e.g., hotspots). This pan-cancer parsing of the most frequently mutated genes in each cancer revealed an expected trend: TP53 was the most mutated gene followed by several other cell cycle regulators (Figure 1A). Although for TTN and MUC16 there is evidence that relevant mutations have been detected in cancer cells [45,46], the data needs to be taken with a caveat because it can lead to false positives due to the length of the genes (104,301 and 46,191 nucleotides, respectively) [47]. Longer genes are more likely to register mutations that may not be as significant to cancer development. For example, KRAS has only 5417 nucleotides, and its cancer frequency of mutations and their driver potential are well established, while the role of TTN and MUC16 mutations in driving cancers is debatable. Integrative Oncogenomics lists KRAS as a driver gene but not MUC16 and TTN [16], clearly indicating that gene function matters more than length when registering mutation frequency and cellular transformation potential.
Figure 1. Pan-cancer distribution of nucleotide changes in the 25 most frequently mutated genes in human cancers. (A). Pan-cancer analysis of most frequently mutated genes. The graph shows the cumulative rank of most mutated genes. Please see Material and Methods for description. (B). Pan-cancer percent occurrence of mutations by T, C, G, and A nucleotide types. The graph shows how likely it is for a mutation to occur in each of the four nucleotides in the 25 most frequently mutated genes. (C). Pan-cancer percent nucleotide occurrence in the 25 most frequently mutated genes. (D). Frequency of nucleotide change occurring for each of the four nucleotides. (E). Percent of nucleotide change type occurring for each of the four nucleotides. Transitions and transversions from each nucleotide are represented as a fraction of 100% (e.g., the three T->C, T->G, T->A changes should add up to 100%).
We next investigated a pan-cancer nucleotide substitution bias within the 25 most frequently mutated genes (Figure 1B, Supplementary Table S1B). By far most substitutions involve guanine (48.13%), followed by cytosine (24.18%), thymine (15.09%), and adenine (11.95%). The high frequency of guanine substitutions is not due to a higher percentage of guanines because the occurrence of the four bases is roughly equal in the coding regions of the genes studied here (Figure 1C). Thus, we interpret these findings to mean that substitutions from guanine are more likely than those from any other base. When we investigated the nature of the mutation, we discovered that mutations from thymine are biased primarily towards transversions to A (T->A) and secondarily to C (T->C) (Figure 1D,E, Supplementary Table S1B). However, cytidine and adenine mutations are biased towards transitions (C->T and A->G) while guanine mutations can occur with equal probability towards transversions (G->T, G->C) and transitions (G->A).

2.2. Cancer-Specific Biased Substitutions

Although a pan-cancer analysis reveals general trends, it has a major caveat in that not all cancers are represented equally. More data has been deposited for some cancers than others, and this can skew the analysis in favor of more represented cancers. Additionally, each cancer may acquire mutation signatures by different mechanisms. We, therefore, parsed the data by cancer type. Indeed, this analysis shows that some cancers (e.g., large intestine, and skin) are much more represented than others (e.g., uterine adnexa) (Figure 2A, Supplementary Table S2A).
Figure 2. Cancer-specific distributions of mutations. (A). A histogram showing mutation counts queried in each cancer. Numbers on top of each bar represent the exact mutations queried in each cancer. Note that repeated mutations in multiple samples (hotspots) are also included in this graph. (B). The likelihood of a nucleotide being mutated in each cancer. (C). Percent substitutions from each nucleotide in each cancer. NS = not specified.
When we analyzed substitutions arising at any of the four nucleotides, we identified that G is the most mutated base with C being the second most likely (Figure 2B, Supplementary Table S2B). This is not unexpected because genes are GC rich. We next analyzed substitution type at each nucleotide (e.g., G->A, G->T, G->C, etc.) (Figure 2C, Supplementary Table S2C). This analysis shows that changes are not complementary (e.g., an equal percentage of G->A and C->T). An incorporation of complementary changes on both strands would be expected to result from replication errors. Conversely, processes such as transcriptional strand bias can result in higher levels of mutation on the transcribed strand compared to the non-transcribed strand [48].
Other strand bias mechanisms can also operate in cancer cells. For example, C->T transition chemical mechanisms have been well documented: they can occur by deamination of CpG islands [9,49], clock-like mutation signatures [39], and decreased polymerase delta processivity [50]. A specific analysis of nucleotide changes revealed some minor discrepancies between different cancer types (Figure 2C, Supplementary Table S2C), suggesting that some of these mechanisms may operate in some cancers. Certain cancers may also appear skewed towards specific mutations (e.g., autonomic ganglia, fallopian tube), but this is due to low sample number (Supplementary Table S2A). Vaginal cancer did not have any A->G transitions, but any meaningful interpretation of this observation should be taken with a caveat because only 35 samples were reported.
To investigate whether there are cancer-specific biased substitutions, we created heatmaps (Supplementary Figure S1, Supplementary Table S3). In the interest of consistency and accurate comparison across all cancer types, we implemented a normalization process that adjusted mutation counts based on reference bases and sample sizes. This approach allowed us to capture significant mutation patterns while minimizing the impact of lower-priority variations. We created matrices for each cancer from normalized data (see Materials and Methods and Supplementary Figure S1). This identified ten cancers with high and moderate biased nucleotide substitutions in certain genes (Figure 3A). As expected for KRAS there is a bias for substitutions to occur predominantly at the guanine nucleotide because of the signature G12D/V/R/C/A and G13D. In the gastrointestinal tract (exact location not specified), G12D accounted for 28%, G12V for 14.7%, and G13D for 30.7%. Substitution at G12 to either A, C, R, and S comprised 26% of the mutations. Only one sample had a different substitution (K117N). In samples where the site was specified (large intestine and small intestine), there was a similar, albeit more moderate, bias toward KRAS G12 mutations, indicating that other KRAS mutations were present. The pancreas also showed a mutation distribution bias towards G12 (46.2% G12D, 31.3% G12V, 12.5% G12R, 2.6%.
Figure 3. Pan-cancer biased nucleotide substitutions. (A). Genes with high and moderate nucleotide bias. Most significant nucleotide changes. The table shows high nucleotide substitution bias (over 30) and moderate (over 15) extracted using the per-cancer matrix analysis. Cancers with fewer than 500 samples were excluded from the table. Also, please see Materials and Methods for description of these analyses. (B). Genes with statistically significant nucleotide skews. The graph shows the genes in each cancer with a chi-square p-value below 0.05. Please see Materials and Methods for description of analysis. (C). Connections between the genes in A generated using the STRING database. (D). Driver status of genes identified in A as reported using Integrative Oncogenomics.
G12C, 1.2% G12A, 6.2% other substitutions at any nucleotide. There were no substitutions at the G13 residue in the pancreas. This data agrees with previous observations [51,52]. Moderate G12 KRAS nucleotide skews were also identified in the peritoneum and testis (Figure 3A, Supplementary Figure S1).
We also identified moderate bias substitutions in KCNJ5 (adrenal gland), JAK2 (hematopoietic and lymphoid), and CDKN2A (penis) (Figure 3A, Supplementary Figure S1). For KCNJ5, 53% of mutations were L168R (c.503T>G) and 42.8% were G151R (c.451G>C). KCNJ5 encodes a potassium channel with high expression in the adrenal gland, and these two mutations have been identified in aldosterone-producing adenomas. Remarkably, the L168R mutation inhibits cell proliferation while the G151R has no effect [53]. In fact, they both appeared to induce apoptosis. Nevertheless, these are signature KCNJ5 mutations in adrenal glands [54]. For JAK2, the bias is caused by the signature oncogenic V617F (c.1849G>T), which causes ligand-independent constitutive phosphorylation of the tyrosine kinase [55]. In the penis, two CDKN2A truncations (R58* c.172C>T and R80* c.238C>T) account for 44.2% and 20.9% of mutations, respectively. Another 20.9% is due to the H83Y (c.247C>T) mutation [56,57,58,59,60,61,62]. To our knowledge, these truncations in CDKN2A were not reported before. However, these penile mutations should be taken with a caveat because the sample size for penis cancer is low. Thus, it is hard to conclude whether a CDKN2A mutational bias exists in penile cancers.

2.3. Statistically Significant Mutations Yielding Biased Substitutions

We next performed a more comprehensive analysis to identify biased nucleotide substitutions. We use the chi-square test to identify statistically significant nucleotide substitution skews. The null hypothesis was that mutations should occur randomly throughout the region of a gene (Supplementary Table S4). This analysis immediately revealed the genes identified in Supplementary Figure S1 using heatmaps (Figure 3A). However, it also identified additional patterns in other cancers (Figure 3B). Guanine is the most frequently mutated nucleotide in these statistically significant skews (Supplementary Figure S2A). Guanine is substituted primarily to adenine and thymine (Supplementary Figure S2B,C). Mutations at cytidine did not occur with equal frequency with guanine but were similar in frequency with adenine. Substitutions involving the thymidine base were the rarest. Thus, it appears that nucleotide substitution skews are primarily generated by G->A transitions. An analysis of the coordinates of the statistically significant skews revealed precisely that (Table 1).
Table 1. Coordinates of statistically significant mutations producing nucleotide skews.
The guanine biases are primarily caused by mutations in Ras family GTPases involved in cell cycle-regulating signal transduction (Figure 3B). We used the STRING database to understand the relationship between the other genes identified and Ras family GTPases (Figure 3C). This analysis shows that most of the identified genes constitute a closely related set of cell cycle regulators. There were three “outliers” that were more distantly related. USP8 was identified in the pituitary by having statistically significant enriched mutations at two residues (S718P/Y, P720Q/R) [63,64,65,66]. USP8 is a ubiquitin ligase involved in degrading tyrosine kinase receptors including EGFR [67]. Thus, inactivating USP8 mutations causes a gain of function phenotype by increasing or prolonging signaling from cell cycle-regulating receptors. KCNJ5 was identified in the adrenal gland and is described in the previous section. SAA1 had a pronounced mutation skew in penile cancers (c.230C>G, p.T77S). SAA1 codes for a protein within the amyloid A family of apolipoproteins involved in fat metabolism, inflammation, and tissue injury [68,69]. Not unexpectedly, the gene is highly expressed in fat and liver tissues [70]. SAA1 mutations have been identified in various cancers [69], but to our knowledge selection of the T77S mutation in penis cancers has not been well characterized. It was identified from characterization of several penis cancer cell lines [57], and remarkably all T77S substitutions were homozygous. SAA1 is characterized by four alpha helixes and a C-terminal tail. The mutation was identified in the ENST00000405158.2 isoform, which has 122 residues and is the longest transcript. Single nucleotide polymorphisms have been identified at the T77 position (rs1671926), but the clinical significance of these mutations is uncertain. ClinVar reports that the alternate G allele at position c.230 is more predominant in Asian populations, and the penile cancer cell lines were of Asian origin [57]. However, as discussed above for CDKN2A mutations, it is hard to conclude whether SAA1 mutations predispose patients to penile cancers because the sample size is too small. Thus, although this finding may be intriguing, more data is necessary to make any solid conclusions about the role of SAA1.
We used Integrative Oncogenomics to interrogate whether the nucleotide skews identified in this analysis occur within driver genes [16]. Indeed, we find that 14 out of the 17 genes identified are characterized as drivers (Figure 3D). Thus, we conclude that nucleotide substitution biases act as a purifying selection for mutations in driver cell cycle regulators.

2.4. Genetic Correlations Between Cancers

To understand whether mutations at the nucleotide level can be actionable to identify correlations between different cancers, we use principal component analysis (PCA) (Figure 4). By applying PCA, the complex, high-dimensional data is reduced to just two principal components (PC1 and PC2), which capture the most significant variations in the mutation data. This reduction simplifies the visualization and analysis of the data, allowing us to observe patterns and clusters that might indicate similarities or differences in the genetic mutation profiles of different cancers.
Figure 4. Principal component analysis (PCA) plot of nucleotide change patterns across various cancer types. The plot displays the first two principal components (PC1 and PC2), capturing the most significant variance in the mutation data. Each point represents a different cancer type, with proximity between points indicating similarity in mutation profiles.
We find that most cancers form a cluster, indicating that the nucleotide substitution changes are similar among these cancers. However, some cancers (placenta, bone, paratesticular, lung, hematopoietic, lymphoid, pancreas, large intestine, peritoneum, gastrointestinal, and central nervous system (CNS)) separate from this cluster, suggesting that these cancers have unique nucleotide substitution skews. The analysis presented here considers nucleotide substitutions in the 25 most frequently mutated genes and includes all substitutions, including rare mutations. Taken together, these data show that although most cancers display a similar nucleotide mutational spectrum, some could be differentiated even at nucleotide resolution. Generally, cancers have been classified by differences in the mutated genes, but here we show that nucleotide substitution differences can also be used to classify cancers. This suggests that for some cancers unique genomic mutational processes are at work.

3. Materials and Methods

3.1. Data Accession and Processing

Excel files with mutation data for 43 different cancer types and one non-specified were downloaded from the Catalogue of Somatic Mutations in Human Cancers (COSMIC) [38]. All files were first sorted by GENE_NAME. Then, non-substitution samples were removed according to the MUTATION_CDS column, and alternative transcripts of each gene were removed by filtering out trailing _ENST in the gene names. Finally, only those mutations occurring within coding regions were retained based on the MUTATION_AA column.

3.2. Nucleotide Substitution Parsing

For each cancer type, we first categorized the samples into 12 nucleotide mutation groups based on the ‘MUTATION_CDS’. Next, we used the value_counts function from the Pandas library in Python (version 3.0) to tally the occurrences of each gene name, identifying the top 25 most frequently mutated genes. We then iterated through these top 25 genes, counting their occurrences within each of the 12 mutation groups. The results were compiled into a matrix with rows representing the top 25 frequently mutated genes, columns representing the 12 nucleotide mutations, and the values indicating the count of each gene within each mutation group.

3.3. Gene Frequency Rank

To quantify mutation prevalence across cancers, the top 25 most frequently mutated genes in each cancer type were assigned rank values from 25 (highest frequency) to 1 (lowest frequency). These ranks were consolidated into a gene-by-cancer table and grouped by gene name. For each gene, rank values across cancers were summed to generate a cumulative pan-cancer rank representing its overall mutation frequency. The full ranked list parsed by cancer type is provided in Supplementary Table S1A, and the cumulative ranks (scaled by dividing by 10 for visualization only) are shown in Figure 1A.
To summarize nucleotide usage across these frequently mutated genes, we calculated pan-cancer nucleotide alteration frequencies by dividing the number of mutations at each reference base (T, C, G, A) by the total number of mutations across all bases (463,968). For example, 70,030 mutations at T correspond to 15.9% of all observed mutations. These distributions are shown in Figure 1B. The total counts of each substitution type from each reference base (e.g., T->C, T->G, T->A) are reported in Figure 1C, and their percentages relative to all substitutions from each reference base are summarized in Figure 1D.

3.4. Normalization

To correct for differences in cohort size and nucleotide abundance, mutation counts were normalized in two steps. First, for each cancer type, counts of each substitution (e.g., T->C, T->G, T->A) were divided by the total number of mutations observed at that reference nucleotide within that cancer. This adjusts for unequal frequencies of T, C, G, and A in coding regions. Second, these values were divided by the total number of samples in that cancer type, so that cancers with very large cohorts do not artificially appear to have stronger mutation biases. The distribution of sample sizes used in this normalization is shown in Supplementary Figure S2A.

3.5. Heatmaps

To analyze mutation patterns in both individual cancers (per-cancer) and across multiple cancer types (pan-cancer), we created heatmaps—visual representations that illustrate the intensity of mutations for each gene. Cancers with fewer than 500 samples were excluded from this analysis because small cohorts produce unstable mutation-rate estimates with high variance and sparse substitution counts, making heatmap comparison unreliable. A Python code was developed to process and visualize the heatmaps, focusing on identifying mutations that exceed a specified threshold.
The analysis begins by importing normalized mutation data, with each dataset corresponding to a different cancer type. For each dataset, a heatmap is generated to visualize the mutation patterns, with genes on the y-axis and nucleotide changes on the x-axis. Heatmaps were created for each cancer type separately (Supplementary Figure S2). To identify meaningful mutation enrichments in the heatmaps (Supplementary Figure S1) without overwhelming visual noise, we evaluated multiple cutoffs (1, 3, 5, 8, 10). A threshold of 5 produced the clearest and most biologically interpretable patterns across all cancers. Thus, the threshold of 5 was selected empirically to optimize signal-to-noise representation.

3.6. Chi-Square Calculations of Nucleotide Bias

A chi-square test was used to calculate statistically significant nucleotide biases. The calculations were conducted on normalized data (Supplementary Table S4, columns J–M). Because the chi-square test does not work well with small values, the normalized values were multiplied by 106 to make them larger (see columns N, O, P, Q, Supplementary Table S4). These adjusted larger values are the observed values. The expected values were calculated based on the observed values (see columns R, S, T, U). The null hypothesis here would be that each nucleotide should be hit equally (e.g., mutation should occur with equal probability in every residue within the coding region). The chi-square value was calculated (column W), and the chi-square probability value was extracted using an online calculator from here: https://www.socscistatistics.com/pvalues/chidistribution.aspx (accessed on 1 September 2025). The p-values were extracted at 3 degrees of freedom since there are four values (Supplementary Table S4, column X). Statistically significant p-values (<0.05) are presented in Supplementary Table S3A. Supplementary Table S3B lists coordinates of genes, and mutations are in Supplementary Table S3A.

3.7. STRING Database

The STRING database (https://string-db.org) (accessed on 23 October 2024) was used to extract connections between the statistically significant genes in Figure 3C. The parameters used were network type-full STRING; meaning of network edges-evidence; active interaction sources-textmining, experiments, databases, co-expression, neighborhood, gene fusion, co-occurrence.

3.8. Gene Driver Potential

Integrative oncogenomics (https://www.intogen.org/search) (23 October 2024) [16] was used to determine whether the gene was classified as a driver.

3.9. Principal Component Analysis

We aimed to plot the cancers in a single figure to compare the distribution of mutation types among frequently mutated genes. To achieve this, we aggregated the top 25 frequently mutated genes from each cancer type, resulting in a list of 373 identical genes. We then counted the mutation distribution for these 373 genes following the same procedure as described in the “Nucleotide Substitution Parsing” section. This produced a matrix for each cancer, where the rows represent the 373 genes, the columns represent 12 mutation groups, and the values are the mutation counts. To mitigate biases arising from the unequal distribution of T, C, G, and A in gene sequences and the varying number of samples across cancer types, we normalized the counts by dividing them by the corresponding gene name and cancer type counts. One gene was removed due to a lack of sequence information. Before plotting, we aggregated all the matrices into a single matrix by summing the counts from all the frequently mutated genes into a vector representing each cancer. Each vector contained 12 values corresponding to the mutation groups, and all vectors were compiled into a large matrix.
Finally, we standardized each row using L1 normalization and applied principal component analysis (PCA). PCA is a statistical technique that simplifies complex datasets by reducing their dimensions while preserving as much information as possible. In this case, it allowed us to compress the data into two dimensions, making it easier to visualize and compare the mutation patterns across different cancer types. The PCA plot was generated using Python and is shown in Figure 4.

4. Conclusions

Here we wanted to know whether nucleotide substitution patterns could be used to differentiate between cancers. We decided to analyze only the 25 most frequently mutated genes to eliminate noise from rare mutations. Our analysis shows that certain cancers are clearly biased by key signatures (e.g., KRAS G12 substitutions). Most importantly, we show that this analysis can quickly identify a close knit of key driver cell cycle regulators. Therefore, nucleotide substitution bias can be used in high-throughput genomics as a preliminary diagnostic tool and to classify certain cancers.
We are cognizant of the limitations of this study, including the fact that certain cancers do not have enough data to analyze. As genome sequencing becomes less expensive and more common, our data show that it may be possible to quickly diagnose cancers using raw sequencing data.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms262411903/s1.

Author Contributions

Conceptualization: G.C.M. and R.C.P.; methodology, A.K. and Y.D.; software, A.K. and Y.D.; validation, A.K., Y.D., G.C.M. and R.C.P.; formal analysis, A.K., Y.D., G.C.M. and R.C.P.; investigation, A.K., Y.D., G.C.M. and R.C.P.; resources, G.C.M. and R.C.P.; data curation, G.C.M. and R.C.P.; writing—original draft preparation, G.C.M. and R.C.P.; writing—review and editing, A.K., G.C.M. and R.C.P.; visualization, A.K., G.C.M. and R.C.P.; supervision, G.C.M. and R.C.P.; project administration, G.C.M. and R.C.P.; funding acquisition, R.C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Institutes of Health, USA, grant number R03CA304032, and the APC was funded by the same grant.

Institutional Review Board Statement

Institutional review statement not necessary because all data used in this publication is publicly available.

Data Availability Statement

All data presented here are available in public databases, COSMIC (https://cancer.sanger.ac.uk/cosmic) and cBioPortal (https://www.cbioportal.org/) (both accessed on 24 October 2024). These repositories anonymize all patient data and make it freely available to investigators. No consent is required to access or use the data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hanahan, D.; Weinberg, R.A. The hallmarks of cancer. Cell 2000, 100, 57–70. [Google Scholar] [CrossRef] [PubMed]
  2. Lynch, M. Evolution of the mutation rate. Trends Genet. 2010, 26, 345–352. [Google Scholar] [CrossRef] [PubMed]
  3. Sherman, M.A.; Yaari, A.U.; Priebe, O.; Dietlein, F.; Loh, P.; Berger, B. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat. Biotechnol. 2022, 40, 1634–1643. [Google Scholar] [CrossRef] [PubMed]
  4. Akdemir, K.C.; Le, V.T.; Kim, J.M.; Killcoyne, S.; King, D.A.; Lin, Y.P.; Tian, Y.; Inoue, A.; Amin, S.G.; Robinson, F.S.; et al. Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nat. Genet. 2020, 52, 1178–1188. [Google Scholar] [CrossRef]
  5. Li, Y.; Roberts, N.D.; Wala, J.A.; Shapira, O.; Schumacher, S.E.; Kumar, K.; Khurana, E.; Waszak, S.; Korbel, J.O.; Haber, J.E.; et al. Patterns of somatic structural variation in human cancer genomes. Nature 2020, 578, 112–121, Erratum in Nature 2023, 614, E38. [Google Scholar] [CrossRef]
  6. Tarabichi, M.; Demeulemeester, J.; Verfaillie, A.; Flanagan, A.M.; Loo, P.V.; Konopka, T. A pan-cancer landscape of somatic mutations in non-unique regions of the human genome. Nat. Biotechnol. 2021, 39, 1589–1596. [Google Scholar] [CrossRef]
  7. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 2020, 578, 82–93. [Google Scholar] [CrossRef]
  8. Koh, G.; Degasperi, A.; Zou, X.; Momen, S.; Nik-Zainal, S. Mutational signatures: Emerging concepts, caveats and clinical applications. Nat. Rev. Cancer 2021, 21, 619–637. [Google Scholar] [CrossRef]
  9. Alexandrov, L.B.; Kim, J.; Haradhvala, N.J.; Huang, M.M.; Ng, A.W.T.; Wu, Y.; Boot, A.; Covington, K.R.; Gordenin, D.A.; Bergstrom, E.N.; et al. The repertoire of mutational signatures in human cancer. Nature 2020, 578, 94–101, Erratum in Nature 2023, 614, E41. [Google Scholar] [CrossRef]
  10. Rheinbay, E.; Nielsen, M.M.; Abascal, F.; Wala, J.A.; Shapira, O.; Tiao, G.; Hornshoj, H.; Hess, J.M.; Juul, R.I.; Lin, Z.; et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 2020, 578, 102–111, Erratum in Nature 2023, 614, E40. [Google Scholar] [CrossRef]
  11. Teng, H.; Wei, W.; LI, Q.; Xue, M.; Shi, X.; Li, X.; Mao, F.; Sun, Z. Prevalence and architecture of posttranscriptionally impaired synonymous mutations in 8,320 genomes across 22 cancer types. Nucleic Acids Res. 2020, 48, 1192–1205. [Google Scholar] [CrossRef]
  12. Zeng, Z.; Bromberg, Y. Inferring Potential Cancer Driving Synonymous Variants. Genes 2022, 13, 778. [Google Scholar] [CrossRef] [PubMed]
  13. Cheng, N.; Li, M.; Zhao, L.; Zhang, B.; Yang, Y.; Zheng, C.H.; Xia, J. Comparison and integration of computational methods for deleterious synonymous mutation prediction. Brief Bioinform. 2020, 21, 970–981. [Google Scholar] [CrossRef] [PubMed]
  14. Sharma, Y.; Miladi, M.; Dukare, S.; Boulay, K.; Caudron-Herger, M.; Grob, M.; Backofen, R.; Diederichs, S. A pan-cancer analysis of synonymous mutations. Nat. Commun. 2019, 10, 2569. [Google Scholar] [CrossRef] [PubMed]
  15. Chevance, F.F.; Le Guyon, S.; Hughes, K.T. The effects of codon context on in vivo translation speed. PLoS Genet. 2014, 10, e1004392. [Google Scholar] [CrossRef]
  16. Martinez-Jimenez, F.; Muinos, F.; Sentis, I.; Deu-Pons, J.; Reyes-Salazar, I.; Arnedo-Pac, C.; Mularoni, L.; Pich, O.; Bonet, J.; Kranas, H.; et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer 2020, 20, 555–572. [Google Scholar] [CrossRef]
  17. Pon, J.R.; Marra, M.A. Driver and passenger mutations in cancer. Annu. Rev. Pathol. 2015, 10, 25–50. [Google Scholar] [CrossRef]
  18. Nesta, A.V.; Tafur, D.; Beck, C.R. Hotspots of Human Mutation. Trends Genet. 2021, 37, 717–729. [Google Scholar] [CrossRef]
  19. Prior, I.A.; Hood, F.E.; Hartley, J.L. The Frequency of Ras Mutations in Cancer. Cancer Res. 2020, 80, 2969–2974. [Google Scholar] [CrossRef]
  20. Pirozzi, C.J.; Yan, H. The implications of IDH mutations for cancer development and therapy. Nat. Rev. Clin. Oncol. 2021, 18, 645–661. [Google Scholar] [CrossRef]
  21. Alexandrov, L.B.; Nik-Zainal, S.; Wedge, D.C.; Campbell, P.I.; Stratton, M.R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013, 3, 246–259. [Google Scholar] [CrossRef]
  22. Page, R.D.M.; Holmes, E.C. Molecular Evolution: A Phylogenetic Approach; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  23. Charlesworth, B.; Morgan, M.T.; Charlesworth, D. The effect of deleterious mutations on neutral molecular variation. Genetics 1993, 134, 1289–1303. [Google Scholar] [CrossRef] [PubMed]
  24. Dukler, N.; Mughal, M.R.; Ramani, R.; Huahg, Y.F.; Siepel, A. Extreme purifying selection against point mutations in the human genome. Nat. Commun. 2022, 13, 4312. [Google Scholar] [CrossRef] [PubMed]
  25. Ohta, T.; Gillespie, J.H. Development of Neutral and Nearly Neutral Theories. Theor. Popul. Biol. 1996, 49, 128–142. [Google Scholar] [CrossRef]
  26. Bignell, G.R.; Greenman, C.D.; Davies, H.; Butler, A.P.; Edkins, S.; Andrews, J.M.; Buck, G.; Chen, L.; Beare, D.; Latimer, C.; et al. Signatures of mutation and selection in the cancer genome. Nature 2010, 463, 893–898. [Google Scholar] [CrossRef] [PubMed]
  27. Levine, A.J. Spontaneous and inherited TP53 genetic alterations. Oncogene 2021, 40, 5975–5983. [Google Scholar] [CrossRef]
  28. Chakravarty, D.; Solit, D.B. Clinical cancer genomic profiling. Nat. Rev. Genet. 2021, 22, 483–501. [Google Scholar] [CrossRef]
  29. Mendiratta, G.; Ke, E.; Aziz, M.; Liarakos, D.; Tong, M.; Stites, E.C. Cancer gene mutation frequencies for the U.S. population. Nat. Commun. 2021, 12, 5961. [Google Scholar] [CrossRef]
  30. Schneider, G.; Schmidt-Supprian, M.; Rad, R.; Saur, D. Tissue-specific tumorigenesis: Context matters. Nat. Rev. Cancer 2017, 17, 239–253. [Google Scholar] [CrossRef]
  31. Montano-Samaniego, M.; Bravo-Estupinan, D.M.; Mendez-Guerrero, O.; Alarcon-Hernandez, E.; Ibanez-Hernandez, M. Strategies for Targeting Gene Therapy in Cancer Cells with Tumor-Specific Promoters. Front. Oncol. 2020, 10, 605380. [Google Scholar] [CrossRef]
  32. Kumar, A.; Das, S.K.; Emdad, L.; Fisher, P.B. Applications of tissue-specific and cancer-selective gene promoters for cancer diagnosis and therapy. Adv. Cancer Res. 2023, 160, 253–315. [Google Scholar]
  33. Behranvand, N.; Nasri, F.; Emameh, R.Z.; Khani, P.; Hosseini, A.; Garssen, J.; Falak, R. Chemotherapy: A double-edged sword in cancer treatment. Cancer Immunol. Immunother. 2022, 71, 507–526. [Google Scholar] [CrossRef]
  34. Min, H.Y.; Lee, H.Y. Molecular targeted therapy for anticancer treatment. Exp. Mol. Med. 2022, 54, 1670–1694. [Google Scholar] [CrossRef] [PubMed]
  35. Hamid, A.B.; Petreaca, R.C. Secondary Resistant Mutations to Small Molecule Inhibitors in Cancer Cells. Cancers 2020, 12, 927. [Google Scholar] [CrossRef] [PubMed]
  36. Meisner, N.C.; Hintersteiner, M.; Uhl, V.; Weidermann, T.; Schmied, M.; Gstach, H.; Auer, M. The chemical hunt for the identification of drugable targets. Curr. Opin. Chem. Biol. 2004, 8, 424–431. [Google Scholar] [CrossRef] [PubMed]
  37. Sioud, M.; Leirdal, M. Druggable signaling proteins. Methods Mol. Biol. 2007, 361, 1–24. [Google Scholar] [PubMed]
  38. Tate, J.G.; Samford, S.; Jubb, H.C.; Sondka, Z.; Beare, D.M.; Bindal, N.; Boutselakis, H.; Cole, C.G.; Creator, C.; Dawson, E.; et al. COSMIC: The Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 2019, 47, D941–D947. [Google Scholar] [CrossRef]
  39. Alexandrov, L.B.; Jones, P.H.; Wedge, D.C.; Sale, J.E.; Campbell, P.J.; Nik-Zainal, S.; Stratton, M.R. Clock-like mutational processes in human somatic cells. Nat. Genet. 2015, 47, 1402–1407. [Google Scholar] [CrossRef]
  40. Alexandrov, L.B.; Nik-Zainal, S.; Wedge, D.C.; Aparicio, S.A.J.R.; Behjati, S.; Biankin, A.V.; Bignell, G.R.; Bolli, N.; Borg, A.; Borresen-Dale, A.L.; et al. Signatures of mutational processes in human cancer. Nature 2013, 500, 415–421, Erratum in Nature 2013, 502, 258. [Google Scholar] [CrossRef]
  41. Cerami, E.; Gao, J.; Dogrusoz, U.; Gross, B.E.; Sumer, S.O.; Aksoy, B.A.; Jacobsen, A.; Byrne, C.J.; Heuer, M.L.; Larsson, E.; et al. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012, 2, 401–404, Erratum in Cancer Discov. 2012, 2, 960. [Google Scholar] [CrossRef]
  42. Gao, J.; Aksoy, B.A.; Dogrusoz, U.; Dresdner, G.; Gross, B.; Sumer, S.O.; Sun, Y.; Jacobsen, A.; Sinha, R.; Larsson, E.; et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 2013, 6, pl1. [Google Scholar] [CrossRef] [PubMed]
  43. The Cancer Genome Atlas Research Network; Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.M.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013, 45, 1113–1120. [Google Scholar] [CrossRef] [PubMed]
  44. Petljak, M.; Alexandrov, L.B. Understanding mutagenesis through delineation of mutational signatures in human cancer. Carcinogenesis 2016, 37, 531–540. [Google Scholar] [CrossRef] [PubMed]
  45. Loescher, C.M.; Hobbach, A.J.; Linke, W.A. Titin (TTN): From molecule to modifications, mechanics, and medical significance. Cardiovasc. Res. 2022, 118, 2903–2918. [Google Scholar] [CrossRef]
  46. Wi, D.H.; Cha, J.H.; Jung, Y.S. Mucin in cancer: A stealth cloak for cancer cells. BMB Rep. 2021, 54, 344–355. [Google Scholar] [CrossRef]
  47. Lawrence, M.S.; Stojanov, P.; Polak, P.; Kryukov, G.G.; Cibulskis, K.; Sivachenko, A.; Carter, S.L.; Stewart, C.; Mermel, C.H.; Roberts, S.A.; et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499, 214–218. [Google Scholar] [CrossRef]
  48. Helleday, T.; Eshtad, S.; Nik-Zainal, S. Mechanisms underlying mutational signatures in human cancers. Nat. Rev. Genet. 2014, 15, 585–598. [Google Scholar] [CrossRef]
  49. Shen, J.C.; Rideout, W.M., 3rd; Jones, P.A. The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Res. 1994, 22, 972–976. [Google Scholar] [CrossRef]
  50. Nick McElhinny, S.A.; Stith, C.M.; Burgers, P.M.J.; Kunkel, T.A. Inefficient proofreading and biased error rates during inaccurate DNA synthesis by a mutant derivative of Saccharomyces cerevisiae DNA polymerase delta. J. Biol. Chem. 2007, 282, 2324–2332. [Google Scholar] [CrossRef]
  51. Luo, J. KRAS mutation in pancreatic cancer. Semin. Oncol. 2021, 48, 10–18. [Google Scholar] [CrossRef]
  52. Bteich, F.; Mohammadi, M.; Li, T.; Bhat, M.A.; Sofianidi, A.; Wei, N.; Kuang, C. Targeting KRAS in Colorectal Cancer: A Bench to Bedside Review. Int. J. Mol. Sci. 2023, 24, 12030. [Google Scholar] [CrossRef] [PubMed]
  53. Yang, Y.; Gomez-Sanchez, C.E.; Jaquin, D.; Prada, E.T.A.; Meyer, L.S.; Knosel, T.; Schneider, H.; Beuschlein, F.; Reincke, M.; Williams, T.A. Primary Aldosteronism: KCNJ5 Mutations and Adrenocortical Cell Growth. Hypertension 2019, 74, 809–816. [Google Scholar] [CrossRef] [PubMed]
  54. Lu, C.C.; Yen, R.F.; Peng, K.Y.; Huang, J.Y.; Wu, K.D.; Chueh, J.S.; Lin, W.Y. NP-59 Adrenal Scintigraphy as an Imaging Biomarker to Predict KCNJ5 Mutation in Primary Aldosteronism Patients. Front. Endocrinol. 2021, 12, 644927. [Google Scholar] [CrossRef] [PubMed]
  55. Staerk, J.; Constantinescu, S.N. The JAK-STAT pathway and hematopoietic stem cells from the JAK2 V617F perspective. JAKSTAT 2012, 1, 184–190. [Google Scholar] [CrossRef]
  56. McDaniel, A.S.; Hovelson, D.H.; Cani, A.K.; Liu, C.J.; Zhai, Y.; Zhang, Y.; Weizer, A.Z.; Mehra, R.; Feng, F.Y.; Alva, A.S.; et al. Genomic Profiling of Penile Squamous Cell Carcinoma Reveals New Opportunities for Targeted Therapy. Cancer Res. 2015, 75, 5219–5227. [Google Scholar] [CrossRef]
  57. Zhou, Q.H.; Deng, C.Z.; Li, Z.S.; Chen, J.P.; Yao, K.; Huang, K.B.; Liu, T.Y.; Liu, Z.W.; Qin, Z.K.; Zhou, F.J.; et al. Molecular characterization and integrative genomic analysis of a panel of newly established penile cancer cell lines. Cell Death Dis. 2018, 9, 684. [Google Scholar] [CrossRef]
  58. Soufir, N.; Queille, S.; Liboutet, M.; Thibaudeau, O.; Bachelier, F.; Delastaing, G.; Balloy, B.C.; Breuer, J.; Janin, A.; Dubertret, L.; et al. Inactivation of the CDKN2A and the p53 tumour suppressor genes in external genital carcinomas and their precursors. Br. J. Dermatol. 2007, 156, 448–453. [Google Scholar] [CrossRef]
  59. Huang, K.B.; Liu, R.Y.; Peng, Q.H.; Li, Z.S.; Jiang, L.J.; Guo, S.J.; Zhou, Q.H.; Liu, T.Y.; Deng, C.Z.; Yao, K.; et al. EGFR mono-antibody salvage therapy for locally advanced and distant metastatic penile cancer: Clinical outcomes and genetic analysis. Urol. Oncol. 2019, 37, 71–77. [Google Scholar] [CrossRef]
  60. Ali, S.M.; Pal, S.K.; Wang, K.; Palma, N.A.; Sanford, E.; Bailey, M.; He, J.; Elvin, J.A.; Chmielecky, J.; Squillace, R.; et al. Comprehensive Genomic Profiling of Advanced Penile Carcinoma Suggests a High Frequency of Clinically Relevant Genomic Alterations. Oncologist 2016, 21, 33–39. [Google Scholar] [CrossRef]
  61. Zehir, A.; Benayed, R.; Shah, R.H.; Syed, A.; Middha, S.; Kim, H.R.; Srinivasan, P.; Gao, J.; Chakravarty, D.; Devlin, S.M.; et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 2017, 23, 703–713, Erratum in Nat. Med. 2017, 23, 1004. [Google Scholar] [CrossRef]
  62. Trafalis, D.T.; Alifieris, C.E.; Kalantzis, A.; Verigos, K.E.; Vergadis, C.; Sauvage, S. Evidence for Efficacy of Treatment With the Anti-PD-1 Mab Nivolumab in Radiation and Multichemorefractory Advanced Penile Squamous Cell Carcinoma. J. Immunother. 2018, 41, 300–305. [Google Scholar] [CrossRef] [PubMed]
  63. Castellnou, S.; Vasijevic, A.; Lapras, V.; RAverot, V.; Alix, E.; Borson-Chazot, F.; Jouanneau, E.; Raverot, G.; Lasolle, H. SST5 expression and USP8 mutation in functioning and silent corticotroph pituitary tumors. Endocr. Connect. 2020, 9, 243–253. [Google Scholar] [CrossRef] [PubMed]
  64. Martins, C.S.; Camargo, R.C.; Coeli-Lacchini, F.B.; Saggioro, F.P.; Moreira, A.C.; de Castro, M. USP8 Mutations and Cell Cycle Regulation in Corticotroph Adenomas. Horm. Metab. Res. 2020, 52, 117–123. [Google Scholar] [CrossRef] [PubMed]
  65. Abraham, A.P.; Pai, R.; Beno, D.L.; Chacko, G.; Asha, H.S.; Rajaratnam, S.; Kappor, N.; Thomas, N.; Chacko, A.G. USP8, USP48, and BRAF mutations differ in their genotype-phenotype correlation in Asian Indian patients with Cushing’s disease. Endocrine 2022, 75, 549–559. [Google Scholar] [CrossRef]
  66. Rebollar-Vega, R.G.; Zuarth-Vazquez, J.M.; Hernandez-Ramirez, L.C. Clinical Spectrum of USP8 Pathogenic Variants in Cushing’s Disease. Arch. Med. Res. 2023, 54, 102899. [Google Scholar] [CrossRef]
  67. Islam, M.T.; Chen, F.; Chen, H. The oncogenic role of ubiquitin specific peptidase (USP8) and its signaling pathways targeting for cancer therapeutics. Arch. Biochem. Biophys. 2021, 701, 108811. [Google Scholar] [CrossRef]
  68. Sack, G.H., Jr. Serum Amyloid A (SAA) Proteins. Subcell. Biochem. 2020, 94, 421–436. [Google Scholar]
  69. Sun, L.; Ye, R.D. Serum amyloid A1, Structure, function and gene polymorphism. Gene 2016, 583, 48–57. [Google Scholar] [CrossRef]
  70. Fagerberg, L.; Hallstrom, B.M.; Oksvold, P.; Kampf, C.; Djureinovic, D.; Odeberg, J.; Habuka, M.; Tahmasebpoor, S.; Danielsson, A.; Edlund, K.; et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteom. 2014, 13, 397–406. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.