Use of Publication Dynamics to Distinguish Cancer Genes and Bystander Genes

de Magalhães has shown recently that most human genes have several papers in PubMed mentioning cancer, leading the author to suggest that every gene is associated with cancer, a conclusion that contradicts the widely held view that cancer is driven by a limited number of cancer genes, whereas the majority of genes are just bystanders in carcinogenesis. We have analyzed PubMed to decide whether publication metrics support the distinction of bystander genes and cancer genes. The dynamics of publications on known cancer genes followed a similar pattern: seminal discoveries triggered a burst of cancer-related publications that validated and expanded the discovery, resulting in a rise both in the number and proportion of cancer-related publications on that gene. The dynamics of publications on bystander genes was markedly different. Although there is a slow but continuous time-dependent rise in the proportion of papers mentioning cancer, this phenomenon just reflects the increasing publication bias that favors cancer research. Despite this bias, the proportion of cancer papers on bystander genes remains low. Here, we show that the distinctive publication dynamics of cancer genes and bystander genes may be used for the identification of cancer genes.


Introduction
In a recent study, de Magalhães presented the results of an analysis of PubMed publications on human genes and the ones that also mention the term "cancer" [1]. The author has shown that of the 17,371 human genes with at least one paper in PubMed, 15,233 (87.7%) also have at least one paper mentioning cancer. The overall conclusion of de Magalhães (conveyed by the title of his paper) is that "every gene can (and possibly will) be associated with cancer" and that "if a gene has not been associated with cancer yet, it probably means it has not been studied enough and will most likely be associated with cancer in the future" [1]. The author is correct in pointing out that this conclusion would have important implications for analyzing and interpreting large-scale analyses in cancer genomics, especially as it contradicts the dominant view that cancer is driven by a few hundred cancer genes, whereas the vast majority of genes are just bystanders (or passengers) in carcinogenesis (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020; [2][3][4]).
de Magalhães is aware of the fact that since cancer is the most widely studied topic in biological and biomedical sciences, "human biases can confound systematic analyses" based on publication metrics [1]. Since cancer is one of the most common diseases in modern times, publication bias and funding bias are likely to have a major impact on the number and proportion of publications mentioning cancer. It is important to take this bias into account, as in scientific research human biases (any trend or deviation from the truth in data collection, data analysis, interpretation and publication) can cause false conclusions [5,6].

Datasets Analyzed
PubMed_Timeline_Results_by_Year.csv files for human genes were downloaded from the PubMed website (https://pubmed.ncbi.nlm.nih.gov/) on 10 March 2022. We have also analyzed the PubMed dataset included as Supplemental information of the paper of de Magalhães [1].

Lists of Cancer Genes and Bystander Genes
As the gold standard of 'known' cancer driver genes, we have used the lists of 54 oncogenes and 71 tumor suppressor genes identified by Vogelstein et al. (V-genes) [10]. As another list of known cancer driver genes, we have also used the 719 cancer genes of the Cancer Gene Census (CGC-genes) [17]. Information on the involvement of CGC-genes in gene-fusion events was obtained from the gene fusion mutation dataset downloaded from the COSMIC database (https://cancer.sanger.ac.uk/cosmic/, accessed on 12 February 2022).
Lists of bystander/passenger genes, candidate oncogenes, tumor suppressor genes and tumor essential genes defined in our earlier work were from Supplementary files S2, S4, S5 and S30 of that study [3].

Analyses of Publication Dynamics of Genes in the Context of Cancer Research
Changes in the proportion of PubMed publications on genes that mention cancer were monitored for the period 1970-2021. Time points with fewer than 10 publications on the given gene were excluded from the graphical representations of publication dynamics.
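The filtering rule described above can be illustrated with a minimal Python sketch. It assumes that the yearly counts from the two timeline files (all publications on a gene, and those also mentioning cancer) have already been loaded into dictionaries keyed by year. The function name and the example counts are hypothetical and are not taken from the study.

```python
def yearly_cancer_ratio(all_by_year, cancer_by_year,
                        min_pubs=10, first_year=1970, last_year=2021):
    """Return {year: Cancer_Pubs/All_Pubs} for years with enough publications."""
    ratios = {}
    for year in range(first_year, last_year + 1):
        total = all_by_year.get(year, 0)
        # Skip time points with fewer than `min_pubs` publications on the gene.
        if total < min_pubs:
            continue
        ratios[year] = cancer_by_year.get(year, 0) / total
    return ratios


# Invented counts, for illustration only.
all_counts = {2000: 5, 2001: 20, 2002: 40}
cancer_counts = {2000: 5, 2001: 5, 2002: 30}
ratios = yearly_cancer_ratio(all_counts, cancer_counts)
print(ratios)
```

In this toy input the year 2000 is dropped because it has only 5 publications, so a single early cancer paper cannot produce a misleading ratio of 1.0.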

Statistical Analyses
The statistical package of Origin 2018 was used for data processing and statistical analysis. Statistical significance was set as a p value of <0.05.

Network Analyses
Gene co-occurrence (or connectivity) graphs of CGC fusion genes were defined by the vertex set consisting of all genes present in the gene fusion mutation dataset of the COSMIC database. Two genes (vertices in the network) were regarded as being connected by an edge in the network if they co-occur in at least one fusion gene. Pajek 32-XXL 5.14, a program for large-network analysis and visualization [18] (http://mrvar.fdv.uni-lj.si/pajek/, accessed on 12 February 2022), was used for the calculation and visualization of gene-fusion networks.
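The graph definition above can be sketched in a few lines of Python. This is only an illustration of the edge rule (two genes are connected if they co-occur in at least one fusion gene), not the Pajek workflow used in the study. The function names are hypothetical, and the example pairs are the TFE3 fusions named in the Results rather than the full COSMIC dataset.

```python
from collections import defaultdict


def build_fusion_graph(fusion_pairs):
    """Build an undirected co-occurrence graph from (gene_a, gene_b) pairs."""
    graph = defaultdict(set)
    for gene_a, gene_b in fusion_pairs:
        if gene_a == gene_b:
            continue  # ignoring self-pairs is an assumption of this sketch
        graph[gene_a].add(gene_b)
        graph[gene_b].add(gene_a)
    return dict(graph)


def partner_counts(graph):
    """Number of distinct fusion partners (vertex degree) for each gene."""
    return {gene: len(partners) for gene, partners in graph.items()}


pairs = [("PRCC", "TFE3"), ("NONO", "TFE3"), ("ASPSCR1", "TFE3"),
         ("CLTC", "TFE3"), ("SFPQ", "TFE3")]
counts = partner_counts(build_fusion_graph(pairs))
print(counts)
```

Using sets for the adjacency lists means that a pair reported several times in the dataset still contributes a single edge, which matches the "at least one fusion gene" criterion.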

Results and Discussion
We have first analyzed the dataset of de Magalhães that lists the number of publications associated with human genes in PubMed (termed All_Pubs) and those publications on the genes that also mention cancer (termed Cancer_Pubs) [1]. Assuming that the number of publications associated with a given human gene (All_Pubs) reflects the intensity of research on that gene, whereas the number of publications that also mention cancer (Cancer_Pubs) reflects the intensity of work on the role of that gene in the context of cancer research, one expects that the numbers in Cancer_Pubs would be the highest for known cancer genes. Surprisingly, the top-ranking genes in Cancer_Pubs are not particularly enriched in known cancer genes (see Supplementary file S1). Out of the 100 top-ranking genes, only 22 come from the cancer driver gene list of V-genes [10] and only 42 come from the more generous cancer gene list of the Cancer Gene Census (CGC-genes) [17].
A survey of the literature on the 100 top-ranking genes in Cancer_Pubs revealed that, for many of the "non-cancer" genes (defined as genes absent in the CGC gene set), there is no evidence for their role in carcinogenesis. We have noted that "non-cancer" genes with the highest numbers in Cancer_Pubs, such as genes TH, PC, IMPACT and ACHE, have significantly higher numbers in All_Pubs than in Cancer_Pubs (Supplementary file S1), reflecting the fact that research on these genes has not focused on cancer. Conversely, in the case of known cancer genes (V-genes) with the highest values in Cancer_Pubs (e.g., EGFR, ERBB2, TP53, BRCA1, BRAF, KRAS), the numbers in Cancer_Pubs and in All_Pubs are comparable, suggesting that the Cancer_Pubs/All_Pubs ratio (reflecting focus of research on cancer) might be a better indicator of cancer genes.
To test the suitability of Cancer_Pubs/All_Pubs ratios for the distinction of cancer genes and bystander genes, we have analyzed different subsets of the dataset of Supplementary file S1; to increase statistical power, we have used increasing publication cutoff values (0, 10, 100, 500, 1000, 5000, 10,000 publications) in All_Pubs (Supplementary file S2). Our analyses have confirmed that for the best-known cancer genes (e.g., BRCA2, KRAS, PIK3CA, ERBB2, BRAF, IDH1, NRAS, BRCA1 and ABL1), the Cancer_Pubs/All_Pubs ratio is very high, close to 1.0 (see for example Supplementary file S2, cut-off 1000).
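The ranking procedure described above can be sketched as follows. The sketch assumes that "cut-off" means keeping only genes with more than the given number of papers in All_Pubs. The function names, gene labels and counts are invented for illustration.

```python
def top_genes_by_ratio(pub_counts, cutoff, top_n):
    """Rank genes by Cancer_Pubs/All_Pubs among genes above the cut-off.

    pub_counts maps gene -> (all_pubs, cancer_pubs).
    """
    ratios = {gene: cancer / total
              for gene, (total, cancer) in pub_counts.items()
              if total > cutoff}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[:top_n]


def count_known(genes, known_cancer_genes):
    """How many of the listed genes appear in a reference cancer gene list."""
    return sum(1 for gene in genes if gene in known_cancer_genes)


# Hypothetical genes and counts.
counts = {"GENE_A": (2000, 1900), "GENE_B": (5000, 500),
          "GENE_C": (1500, 1200), "GENE_D": (800, 790)}
top = top_genes_by_ratio(counts, cutoff=1000, top_n=2)
print(top, count_known(top, {"GENE_A", "GENE_D"}))
```

GENE_D has the highest ratio in the toy data but is excluded by the cut-off, which shows how the cut-off removes genes whose ratios rest on comparatively few papers.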
In the case of the V-genes, the average value of Cancer_Pubs/All_Pubs was 0.65829 ± 0.21241 (at cut-off 0), a value that is significantly (p < 0.05) higher than the value of 0.32873 ± 0.24935 calculated for the passenger genes present in the dataset (Supplementary file S3). The average value of Cancer_Pubs/All_Pubs of 0.55428 ± 0.22967 for the CGC-genes present in the dataset (at cut-off 0) was also significantly higher (p < 0.05) than that calculated for passenger genes (Supplementary file S3). As shown in Supplementary file S3, these differences between passenger genes and cancer genes were also significant at higher publication cut-off values: at cut-off 10000, the average values of Cancer_Pubs/All_Pubs for V-genes (0.6814 ± 0.30702) and CGC-genes (0.55809 ± 0.28948) were significantly (p < 0.05) different from those of passenger genes (0.25511 ± 0.20613).
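The group averages reported above have the form mean ± standard deviation of the per-gene ratio. A minimal sketch of that summary is given below. It assumes the sample standard deviation, which the text does not specify, and it does not reproduce the significance test, which was carried out in Origin 2018. The function name and the input counts are hypothetical.

```python
from statistics import mean, stdev


def ratio_summary(pub_counts, cutoff=0):
    """Mean and sample SD of Cancer_Pubs/All_Pubs for genes above the cut-off.

    pub_counts is a list of (all_pubs, cancer_pubs) tuples, one per gene.
    At least two genes must pass the cut-off for the SD to be defined.
    """
    ratios = [cancer / total for total, cancer in pub_counts if total > cutoff]
    return mean(ratios), stdev(ratios)


# Invented counts: the third gene is removed by the cut-off of 10.
avg, sd = ratio_summary([(100, 50), (200, 150), (5, 5)], cutoff=10)
print(f"{avg:.5f} ± {sd:.5f}")
```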
As a corollary, the 100 top-ranking genes with the highest Cancer_Pubs/All_Pubs values are enriched in known cancer genes (Supplementary file S2). For example, in the case of genes with more than 1000 papers in All_Pubs, the 100 genes with the highest Cancer_Pubs/All_Pubs values contained 65 genes from the list of CGC-genes, 36 of which were also present in the list of V-genes (Supplementary file S2). These observations suggest that the Cancer_Pubs/All_Pubs ratios may help the distinction of cancer genes and bystander genes and thus may promote the identification of cancer genes. Nevertheless, it appears that the Cancer_Pubs/All_Pubs ratio per se does not necessarily reflect the importance of a gene for cancer. First, several known cancer genes of the V-gene list have rather low Cancer_Pubs/All_Pubs ratios (Supplementary file S4). For example, AR, the gene for the androgen receptor, has a Cancer_Pubs/All_Pubs value of 0.19197. Our analyses, however, confirmed that this low ratio is primarily due to misidentification of papers on the AR gene. When we carried out PubMed searches ensuring disambiguation of AR (including the terms AR and 'androgen'), the Cancer_Pubs/All_Pubs value for the AR gene increased to 0.61395 (Supplementary file S4). Thus, the case of the androgen receptor gene AR illustrates the point that it is essential to guarantee that papers relevant for the selected genes are correctly identified. To avoid misidentification, we have used terms for disambiguation; disambiguating terms for gene names were taken from UniProt (https://www.uniprot.org/, accessed on 12 February 2022). In addition to the AR gene, disambiguation significantly increased the Cancer_Pubs/All_Pubs values for the CIC, MET, APC, KIT, ATM, CBL, RB1 and MPL genes (Supplementary file S4). Our analyses have shown that disambiguation is especially critical in cases where the gene names are not unique enough to avoid confusion with other terms. For example, we have noted that the unexpectedly high numbers of publications for genes TH, PC, IMPACT and ACHE in the dataset of de Magalhães (Supplementary file S1) reflect such errors in data retrieval, rather than intense research on these genes (data not shown).
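The effect of disambiguation can be expressed as a simple check on the ratio before and after correction. The sketch below uses the 20% criterion mentioned in the descriptions of the supplementary files to separate unique from non-unique gene names. Treating the 20% as a relative change is an assumption, since the text does not say whether the change is relative or absolute, and the function name is hypothetical.

```python
def name_is_non_unique(raw_ratio, dis_ratio, threshold=0.20):
    """True if disambiguation changed the ratio by more than `threshold`.

    The change is measured relative to the uncorrected ratio (an assumption).
    """
    if raw_ratio == 0:
        # No uncorrected signal: any non-zero corrected ratio counts as a change.
        return dis_ratio != 0
    return abs(dis_ratio - raw_ratio) / raw_ratio > threshold


# AR values quoted in the text: 0.19197 before and 0.61395 after disambiguation.
ar_flag = name_is_non_unique(0.19197, 0.61395)
print(ar_flag)
```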
Another important message of the analysis of publications on the androgen receptor gene AR is that it is important to follow the temporal dynamics of the Cancer_Pubs/All_Pubs ratio, as it may reach high values for cancer genes relatively late, as a result of research efforts. For example, in the case of the AR gene, the Cancer_Pubs/All_Pubs ratio for the disambiguated publications showed a linear increase between 1995 and 2021, from 0.25 to 0.7 (see Figure 1A, Supplementary file S5). (Note that, in the absence of disambiguation, there was just a very slow increase of the Cancer_Pubs/All_Pubs parameter; Figure 1A, Supplementary file S5.)
In the case of genes that were recognized as cancer genes very early, such as TP53 (Figure 1D), …
Analysis of publications on TSHR, the gene for the thyrotropin receptor, cautions that even well-known cancer genes do not necessarily achieve high Cancer_Pubs/All_Pubs values if research on that gene focuses on questions distinct from cancer (e.g., another major disease). The Cancer_Pubs/All_Pubs ratio for this gene is very low (0.20446) and it has not changed over the period 1990-2021, illustrating that research on this gene is not limited to its role in carcinogenesis (Figure 1F). In fact, a higher fraction (0.5375) of publications on TSHR deal with hyperthyroidism or hypothyroidism than with cancer; this explains why TSHR cannot reach very high Cancer_Pubs/All_Pubs values.
Next, we have analyzed the publication dynamics of a representative set of passenger genes (randomly selected from the dataset, All_Pubs, cut-off value 1000, Supplementary file S2), with a view to determining the influence of publication bias on Cancer_Pubs/All_Pubs values. Our analyses have shown that there is a general tendency for a gradual, yearly increase in the Cancer_Pubs/All_Pubs ratio for passenger genes, such as CYP2C19, VWF, NOX4, TLR4 and LRRK2 (Supplementary file S6, Figure 2). This linear increase reflects the bias favoring inclusion of the term "cancer" in publications in life sciences, as illustrated by the fact that similar tendencies are observed for the association of the term "cancer" with neutral terms such as "protein", "cell" or "tissue" (Supplementary file S6, Figure 2F). The list of passenger genes includes genes for such intensely studied proteins as VWF (von Willebrand factor), C3 (Complement C3) and APOB (Apolipoprotein B-100), which have never been implicated in carcinogenesis, underlining the fact that in these cases the increase in the Cancer_Pubs/All_Pubs ratio simply reflects bias favoring the term 'cancer'. Nevertheless, this publication bias has limited influence on the Cancer_Pubs/All_Pubs ratio of genes: the CA2017-2021 values (the average Cancer_Pubs/All_Pubs ratios calculated for the years 2017-2021) of the passenger genes do not exceed 0.2 (Supplementary file S6, Figure 2). This observation confirms that the Cancer_Pubs/All_Pubs ratios may help the distinction of cancer genes and bystander genes and thus may promote the identification of cancer genes.
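The two quantities used in this comparison, the yearly trend and the average over the last five years, can be sketched as follows. The sketch assumes an ordinary least-squares slope for the trend and an unweighted mean of the yearly ratios for the 2017-2021 average; neither choice is stated explicitly in the text. The function names and the example ratios are hypothetical.

```python
def linear_slope(ratio_by_year):
    """Least-squares slope of the yearly ratio (change in ratio per year).

    Requires ratios for at least two different years.
    """
    years = sorted(ratio_by_year)
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(ratio_by_year[y] for y in years) / n
    sxy = sum((y - mean_x) * (ratio_by_year[y] - mean_y) for y in years)
    sxx = sum((y - mean_x) ** 2 for y in years)
    return sxy / sxx


def window_average(ratio_by_year, first_year=2017, last_year=2021):
    """Unweighted mean of the yearly ratios inside the given window."""
    values = [r for y, r in ratio_by_year.items() if first_year <= y <= last_year]
    return sum(values) / len(values)


# Invented ratios for a passenger-like gene with a slow, steady rise.
ratios = {2017: 0.10, 2018: 0.12, 2019: 0.14, 2020: 0.16, 2021: 0.18}
slope = linear_slope(ratios)
ca_recent = window_average(ratios)
print(round(slope, 4), round(ca_recent, 4))
```

A gene with this profile shows a positive slope but a recent average below 0.2, which is the pattern the text attributes to publication bias rather than to a role in carcinogenesis.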
As a further test of this assumption, we have examined the publication metrics of genes that we have identified earlier as novel oncogenes, tumor suppressor genes and tumor essential genes [3]. Although in some cases the number of publications was too low (All_Pubs < 50) to permit meaningful analyses (IDH3B, Supplementary file S7), the dynamics of Cancer_Pubs/All_Pubs ratios clearly support our conclusion that genes AURKA (Figure 3A), YAP1 (Figure 3B), SLC16A1 (Figure 3C), TP73 (Figure 3D), YES1 and TTK qualify as cancer genes (Supplementary file S7). As discussed earlier [3], in the case of these genes, a large body of evidence has accumulated to support their key role in carcinogenesis as oncogenes (AURKA, YAP1, YES1), tumor suppressor genes (TTK, SLC16A1) or as tumor essential genes (TP73).
In the case of the tumor essential gene G6PD (Figure 3E), the CAt ratio (the Cancer_Pubs/All_Pubs ratio calculated for the entire publication history of the gene) is very low (Supplementary file S7). It should be noted, however, that the gene is involved in a major disease (non-spherocytic hemolytic anemia, due to G6PD deficiency) and a large fraction of PubMed publications on this gene (>0.6) are dedicated to this disease. Despite this limitation, it is noteworthy that the publication pattern of G6PD suggests that its importance for carcinogenesis has been increasingly recognized over the last decade (Figure 3E).
In the case of SLC2A1 (Figure 3F), the gene for glucose transporter GLUT1, there is a steeper yearly rise of the Cancer_Pubs/All_Pubs value, indicating that there is a growing awareness of its importance in carcinogenesis (Supplementary file S7). The relatively high CA2017-2021 value (0.56) is especially significant if we consider that a large portion of papers on this gene deal with neurological diseases (e.g., GLUT1 deficiency syndromes, epilepsy, dystonia, mental retardation) caused by mutations of this gene, indicating that there is an upper limit for the CA2017-2021 value that is significantly lower than 1.0.
In summary, our studies indicate that publication dynamics of cancer genes and bystander genes have characteristic differences, and this may help the identification of cancer genes through analyses of dynamics of Cancer_Pubs/All_Pubs ratios for genes. As a further test of this assumption, we have analyzed the publication dynamics of the 100 top-ranking genes with the highest CAt values of Supplementary file S2 (All_Pubs cut-off 1000). The publication dynamics of 36 of these genes, present in the list of V-genes, have been discussed above (Supplementary file S5, Figure 1); the results for 29 genes that are CGC-genes are shown in Supplementary file S8. Analyses of the publication dynamics of these CGC-genes have shown that they have CAt and CA2017-2021 values that significantly exceed the values characteristic of passenger genes, consistent with their cancer gene status (Supplementary file S8, Figure 4).
Surprisingly, this analysis has revealed that the gene PRCC (encoding the proline-rich protein PRCC of unknown function) with the highest CAt value (0.97704, Supplementary file S8) is a false hit, due to significant contamination of PubMed hits with mismatches for Papillary Renal Cell Carcinoma (PRCC). Disambiguation, focusing on papers that deal with the PRCC gene, has shown that there is practically no PubMed evidence to support the cancer gene status of this gene per se. The true "cancer" hits for this gene refer to it only as the most common fusion partner of the TFE3 transcription factor.
TFE3 forms fusion-oncogenes with several other partners (e.g., NONO in renal cell carcinoma, ASPSCR1 in alveolar soft part sarcoma, CLTC in renal cell carcinoma and SFPQ in perivascular epithelioid cell tumor), suggesting that TFE3 plays a more important role in carcinogenesis than its various ancillary fusion partners ( Figure 5). Our observation is that there is no CAt based evidence for the cancer gene status of the PRCC gene, whereas in the case of TFE3, the CAt value (0.80218) is very high, suggesting that this parameter may help the distinction of contributions of gene partners to fusion oncogenes. To check this possibility, we have examined the CAt parameters for CGC genes that are hubs of fusion networks versus those that are just partners of such hub genes. This analysis has confirmed that the genes that are central hubs with numerous partners in the fusion network have very high CAt values, whereas those that have few or no partners have lower CAt values (Supplementary file S9, Figure 5). For example, analysis of the data (0 cut-off, All_Pubs) revealed that the CAt values of CGC genes with at least five fusion partners (0.75812 ± 0.20102) are significantly (p < 0.05) higher than those for CGC genes with only one fusion partner (0.5251 ± 0.24143). Thus, the example of the PRCC gene cautions that some CGC genes may have acquired their cancer gene status as fusion partners of cancer genes, rather than in their own right. It is well known that gene fusions may inactivate tumor suppressor genes or may activate proto-oncogenes by changing their regulatory properties and it is clear that in such cases the role of the genes that are fused may be non-equivalent. Whereas one of the partners is a proto-cancer gene, the other may be just a passenger gene.
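The comparison of hub genes and single-partner genes described above amounts to grouping CAt values by the number of fusion partners. A minimal sketch is given below. The thresholds (at least five partners versus exactly one) follow the text, while the function name, gene labels and CAt values are invented, and the significance test is not reproduced.

```python
from statistics import mean


def mean_cat_by_partner_group(partner_counts, cat_values, hub_min=5):
    """Mean CAt for hub genes (>= hub_min partners) and single-partner genes.

    Both groups must contain at least one gene with a known CAt value.
    """
    hubs = [cat_values[g] for g, n in partner_counts.items()
            if n >= hub_min and g in cat_values]
    singles = [cat_values[g] for g, n in partner_counts.items()
               if n == 1 and g in cat_values]
    return mean(hubs), mean(singles)


# Hypothetical genes; MID_1 (three partners) belongs to neither group.
partners = {"HUB_1": 6, "HUB_2": 5, "PARTNER_1": 1, "PARTNER_2": 1, "MID_1": 3}
cat = {"HUB_1": 0.8, "HUB_2": 0.7, "PARTNER_1": 0.5, "PARTNER_2": 0.4, "MID_1": 0.6}
hub_mean, single_mean = mean_cat_by_partner_group(partners, cat)
print(round(hub_mean, 3), round(single_mean, 3))
```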
As a final test of the utility of publication metrics for the discovery of cancer genes, we have examined whether the 35 genes among the 100 top-ranking genes with the highest CAt values that are not included in the lists of V-genes or CGC-genes also qualify as cancer genes (Supplementary file S10). Our analysis has identified three false hits due to ambiguity of gene names (HCCS, GAN and SLN); these gene names are confused with other terms (e.g., HCCS with hepatocellular carcinomas, HCCs, and SLN with sentinel lymph node, SLN). Following disambiguation, these genes had very low PubMed matches, precluding meaningful analyses. The remaining genes, however, all had publication dynamics and high CAt and CA2017-2021 values suggesting that they are highly relevant to cancer (Supplementary file S10, Figure 6).
We have surveyed the literature on these genes to explore the reasons why they are associated with cancer. Here, we summarize only the major conclusions of our analyses; for brief descriptions and key references on these genes, the reader should consult Appendix SA. In our earlier work [3], we have summarized the evidence for three of these genes (AURKA, YAP1, TWIST1) as novel cancer genes; these summaries are not repeated here.
A survey of the literature on these genes has revealed that there are three major reasons for their relevance for cancer: they may play a key role in carcinogenesis, may be important as tumor markers or may serve as targets for tumor therapy. Our survey has revealed that the majority of these genes are bona fide cancer genes that contribute to carcinogenesis, but in the case of some other genes, known as tumor markers or as genes relevant for tumor therapy, it is less clear whether they play a critical role in carcinogenesis.
Genes for tumor markers include MLANA, which encodes melanoma antigen recognized by T-cells; MKI67, the gene for proliferation marker protein Ki-67; ALDH1A1, the gene for aldehyde dehydrogenase 1A1, a marker of stem cells; and CD7, the gene for T-cell antigen CD7, a tumor marker for acute myeloblastic leukemia (AML) and acute lymphoblastic leukemia (ALL). Genes relevant for tumor therapy include ERCC1, the gene for DNA excision repair protein ERCC-1, implicated in cisplatin resistance, and ABCG2, the gene for broad substrate specificity ATP-binding cassette transporter ABCG2, which plays a role in multidrug resistance.
The remaining genes appear to be cancer genes; based on a survey of the literature, they could be assigned to various cellular processes of cancer hallmarks in which they are involved (Table 1). For example, the genes for DNA repair protein XRCC1 (XRCC1) and E3 ubiquitin-protein ligase XIAP (XIAP) are involved in the hallmark of genome and proteome maintenance, the TGFβ subfamily member Nodal homolog (NODAL) plays a role in sustained proliferation and the reprogramming of metabolism of tumor cells, and programmed cell death protein 1 (PDCD1) is involved in the evasion of immune destruction of tumor cells, whereas vascular endothelial growth factor C (VEGFC) promotes tumor progression through the induction of angiogenesis.
It is noteworthy that some of the cancer genes identified in the present work as genes with very high CAt values have been identified earlier as members of the candidate cancer gene set, showing signs of strong positive selection for driver mutations in their coding region; see for example AURKA and YAP1 [3]. The PDCD1 gene identified in the present study as a gene with a high CAt value is also present in the gene set displaying strong signs of positive selection for driver mutations [3], providing an additional argument for the cancer gene status of this gene. Interestingly, the publication dynamics of the PDCD1 gene indicates that its involvement in carcinogenesis has been recognized only recently (Figure 6C). There are two additional genes, XIAP (Figure 6D) and FOLH1 (Figure 6E), that also show significant signs of positive selection for missense mutation; their mutation parameters deviate by more than 1 SD from those for bystander genes; see Supplementary file S5 of [3]. Furthermore, a recent study has identified genes PGR and E2F1 as tumor suppressor genes under significant selection in tumors [4].
However, the majority of the cancer genes with high CAt values did not show significant signs of positive or negative selection of their coding region during tumor evolution [3]. A possible explanation for this apparent contradiction between the proposed cancer gene status of these genes and lack of selection of their coding region is that selection may act on non-coding regions that control the expression of these genes. A typical example of such driver genes is TERT, the gene for the telomerase reverse transcriptase. The coding region of this gene shows no significant sign of selection [3], whereas its promoter is a target for driver mutations in many types of cancer [19,20]. It is noteworthy in this respect that recent studies have shown that the enhancer region of one of the cancer genes identified here, TNFSF10, is subject to somatic mutation in kidney cancer [20].
An alternative explanation for the lack of selection of the coding region of some cancer genes is that these genes belong to the group of Epi-driver genes, rather than Mut-driver genes [10], i.e., they are expressed aberrantly in tumors due to aberrant promoter methylation. It is noteworthy in this respect that promoter hypermethylation has been shown to inactivate two of the cancer genes, CDKN2B and RASSF1A, that appear to function as tumor suppressor genes in various types of cancer [21][22][23][24].

Conclusions
We have shown that known cancer genes (V-genes and CGC-genes) have significantly higher Cancer_Pubs/All_Pubs (CAt and CA2017-2021) ratios than bystander genes, suggesting that these parameters of research metrics may help the identification of cancer genes.
We have also shown that, although there is significant and increasing publication and funding bias favoring the inclusion of the term "cancer" in biomedical publications, this bias does not increase Cancer_Pubs/All_Pubs ratios to an extent that would prevent the distinction of cancer genes and bystander genes. Paradoxically, this bias increases the reliability of the distinction of bystander genes and cancer genes, since the eagerness of scientists to prove the relevance of a gene for cancer research weakens the argument that if a gene has not been associated with cancer yet, it just means that its role in cancer has not been studied enough. One may argue that if a gene has been intensely studied but the CA parameters remain very low, it is very unlikely that future research will identify it as an important cancer gene.
In harmony with the expectation that high CA values may be used for the identification of cancer genes, our survey of the literature on genes with high CAt values, but not previously assigned to the cancer gene category, has shown that the majority of these genes qualify as cancer genes, involved in cellular processes contributing to carcinogenesis (Appendix SA, Table 1).

The file contains information on the terms used for disambiguation of gene names, and the impact of disambiguation on Cancer_Pubs/All_Pubs ratios. Parameters All_Pubs, Cancer_Pubs, Cancer_Pubs/All_Pubs refer to the 2021 dataset of de Magalhães; parameters All_Pubs22, Cancer_Pubs22, Cancer_Pubs22/All_Pubs22 refer to the dataset updated on 10 March 2022. DIS_All_Pubs22, DIS_Cancer_Pubs22 and DIS_Cancer_Pubs22/DIS_All_Pubs22 indicate the parameters corrected by disambiguation. Separate sections show the data for Genes with Unique Names (disambiguation changed CA value by less than 20%) and Genes with Non Unique Names (disambiguation changed CA value by more than 20%). The file also contains information on the yearly counts of PubMed publications in which the term "cancer" co-occurs with Neutral Terms "protein", "cell" or "tissue" and visual representations of changes of their Cancer_Pubs/All_Pubs ratios as a result of research efforts. Separate sections show Cancer_Pubs/All_Pubs ratios calculated for the entire publication history of genes (CAt) and the average values calculated for years 2017-2021 (CA2017-2021).
Supplementary file S7: Publication dynamics of novel oncogenes, tumor suppressor genes and tumor essential genes identified by Bányai et al., 2021 [3]. The file contains information on the yearly counts of PubMed publications on the various genes and visual representations of changes of Cancer_Pubs/All_Pubs ratios as a result of research efforts. The file contains information on the terms used for disambiguation of gene names, and the impact of disambiguation on Cancer_Pubs/All_Pubs ratios. Parameters All_Pubs, Cancer_Pubs, Cancer_Pubs/All_Pubs refer to the 2021 dataset of de Magalhães; parameters All_Pubs22, Cancer_Pubs22, Cancer_Pubs22/All_Pubs22 refer to the dataset updated on 10 March 2022. DIS_All_Pubs22, DIS_Cancer_Pubs22 and DIS_Cancer_Pubs22/DIS_All_Pubs22 indicate the parameters corrected by disambiguation. The file contains information only for genes with more than 50 publications in DIS_All_Pubs22. Separate sections show the data for Genes with Unique Names (disambiguation changed CA value by less than 20%) and Genes with Non Unique Names (disambiguation changed CA value by more than 20%). Separate sections show Cancer_Pubs/All_Pubs ratios calculated for the entire publication history of genes (CAt) and the average values calculated for years 2017-2021 (CA2017-2021).
Supplementary file S8: Publication dynamics of CGC-genes from the 100 top-ranking genes (i.e., genes with the highest CAt values) of Supplementary file S2, cut-off 1000. The file contains information on the yearly counts of PubMed publications on the various genes and visual representations of changes of Cancer_Pubs/All_Pubs ratios as a result of research efforts. The file contains information on the terms used for disambiguation of gene names, and the impact of disambiguation on Cancer_Pubs/All_Pubs ratios. Parameters All_Pubs, Cancer_Pubs, Cancer_Pubs/All_Pubs refer to the 2021 dataset of de Magalhães; parameters All_Pubs22, Cancer_Pubs22, Cancer_Pubs22/All_Pubs22 refer to the dataset updated on 10 March 2022. DIS_All_Pubs22, DIS_Cancer_Pubs22 and DIS_Cancer_Pubs22/DIS_All_Pubs22 indicate the parameters corrected by disambiguation. Separate sections show the data for Genes with Unique Names (disambiguation changed CA value by less than 20%) and Genes with Non Unique Names (disambiguation changed CA value by more than 20%). Separate sections show Cancer_Pubs/All_Pubs ratios calculated for the entire publication history of genes (CAt) and the average values calculated for years 2017-2021 (CA2017-2021).
Supplementary file S9: Impact of the number of fusion partners on the Cancer_Pubs/All_Pubs ratios of CGC-genes. The file contains information on the number of fusion partners of the various CGC genes and visual representations of changes of Cancer_Pubs/All_Pubs ratios as a function of the number of fusion partners. The file contains separate sections using increasing publication cut-off values (0, 100, 500 and 1000 publications) in All_Pubs. Genes are ranked according to the number of fusion partners.
Supplementary file S10: Publication dynamics of genes from the 100 top-ranking genes (i.e., genes with the highest CAt values) of Supplementary file S2, cut-off 1000, not assigned to the CGC category. The file contains information on the yearly counts of PubMed publications on the various genes and visual representations of changes of Cancer_Pubs/All_Pubs ratios as a result of research efforts. The file contains information on the terms used for disambiguation of gene names, and the impact of disambiguation on Cancer_Pubs/All_Pubs ratios. Parameters All_Pubs, Cancer_Pubs, Cancer_Pubs/All_Pubs refer to the 2021 dataset of de Magalhães; parameters All_Pubs22, Cancer_Pubs22, Cancer_Pubs22/All_Pubs22 refer to the dataset updated on 10 March 2022. DIS_All_Pubs22, DIS_Cancer_Pubs22 and DIS_Cancer_Pubs22/DIS_All_Pubs22 indicate the parameters corrected by disambiguation. Separate sections show the data for Genes with Unique Names (disambiguation changed CA value by less than 20%) and Genes with Non Unique Names (disambiguation changed CA value by more than 20%). Separate sections show Cancer_Pubs/All_Pubs ratios calculated for the entire publication history of genes (CAt) and the average values calculated for years 2017-2021 (CA2017-2021).
Appendix SA: The file contains a brief description of genes identified in the present study as genes with the highest CAt values. The majority of these genes are cancer genes that are new in the sense that they are not included in the most widely used cancer gene lists [10,17] (Vogelstein et al., 2013; Sondka et al., 2018). The assignment of genes to key cellular processes of carcinogenesis is summarized in Table 1.

Data Availability Statement: All data generated or analyzed during this study are included in the manuscript and supporting files.