1. Introduction
One of the great current challenges in biology is integrating knowledge into a coherent model, thus allowing predictions to be made. However, this quest heavily relies on our understanding of all the different features that define our biological question. How well do we understand the different features, and has the manner or motivation for study affected our conclusions about them? As systems biologists, we wondered if we could somehow address these questions based on the intrinsic properties of the genes. Similar studies have previously addressed this question, making a great contribution in highlighting social features (funding, transitioning to principal investigator status, model organism and scientific literature database availability) and a plethora of physicochemical properties of protein-coding genes [1]. However, little literature exists on the factors behind gene popularity that integrates biological feature information yielded by NGS-derived datasets, CRISPR screens, gene regulatory networks (GRNs), and protein–protein interaction (PPI) networks.
Scientometrics is the discipline that studies scientific and technological literature from a quantitative perspective [2]. From scientometrics it is known that literature is usually skewed to cover some subjects at a much greater depth than others. Various statistical distributions seem able to explain current and past publication trends, proposed to follow laws such as those of Bradford and Lotka, or the Pareto distribution [3]. It is, however, not clear what constitutes a “subject” nor how generally consistent these principles are. Pareto-like distributions are generated in systems exhibiting the Matthew effect, in other words, a positive feedback loop where “the rich become richer”. In scientific research, a few critical discoveries nucleate fields, of which some grow much faster than others. Throughout this paper we will call this effect “reinforcement”.
In this study we consider individual genes as “subjects” and show that literature follows a Matthew-like principle. We theorize that this is because it is easier to study genetics once a few “reference genes” have been discovered and studied. However, this also means that the number of papers might mainly reflect a social process of discovery rather than reflect the real relevance of the genes to the subject of interest.
To weigh the social driving forces behind citations against a gene’s biological relevance, we made use of unbiased datasets that may suggest a gene should be perceived as important, including single-cell RNA-seq data, protein–protein interactions, and CRISPR screens (Figure 1a). The model is by necessity semi-qualitative: there are multiple ways to encode the features mathematically. Furthermore, it will always be possible to add further features that could be of relevance. Thus, the results need to be interpreted in light of the model formulation. We try to avoid demarcating different sources as “social” or “biological”, but our choice of biological factors unavoidably reflects our own view of “importance”. Our unit of study is one gene, but we could have considered transcript isoforms, post-translationally modified proteins, or protein domains. These are all valid alternative objects of enquiry but outside our scope. We have largely ignored how different experimental methodologies have impacted citations (e.g., proteomics vs. transcriptomics). Finally, gender, class, and ethnicity could all be included as social reinforcement factors, but here we were mainly interested in the overall balance of social vs. biological factors.
If we consider the initial reporting date of a gene (i.e., the time since its discovery) as the main social predictor, then roughly 25% of the papers are a result of social reinforcement. Gene expression level is the second strongest indicator, followed by markers of disease relevance to a lesser extent. We believe that further use of unbiased data generation methods will widen the set of genes considered and hopefully enrich our understanding of cell biology.
2. Materials and Methods
2.1. Pubmed Data Retrieval and Pre-Processing
Genes and PMIDs were retrieved using the publicly available FTP service (Available online: ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz (accessed on 29 April 2020); released on 16 December 2019). Only mouse and human IDs were retained. Mouse ENSMUSG gene IDs were processed and converted to human Ensembl IDs (ENSID) using BioMart. Further Pubmed article metadata was downloaded (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline (accessed on 29 April 2020)). A custom Java program (available on our Github) was used to extract the date of publication and PMID.
Cell type keywords were defined semi-manually based on the cell type annotation in the Tabula Muris dataset. MeSH terms could have been a better choice, but the literature also suggests they are not commonly used [4]. Poorly represented cell types were removed or merged with other categories. The mapping cell type → {PMIDs} was created by searching Pubmed for keywords (see Supplementary File S1). This association was used along with the gene–PMIDs mapping to create cell type-specific lists of papers.
The measure #citation was defined as log10 (1 + number of papers for a gene), using either total paper counts or counts subsetted for one cell type. We also tried to use the rank (number of papers per gene) as a measure, hoping it would better even out the statistical distribution; however, because too many genes had a similar (low) number of papers, many ranks were tied, and we decided against this measure.
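As an illustration, the measure can be computed with a short R sketch, assuming the gene2pubmed file has been downloaded as described (three-column NCBI layout: tax_id, GeneID, PubMed_ID; the conversion to Ensembl IDs via BioMart is omitted here):

# Read gene2pubmed (tab-separated; the header line starts with "#")
g2p <- read.delim("gene2pubmed.gz", header = FALSE, comment.char = "#",
                  col.names = c("tax_id", "GeneID", "PubMed_ID"))

# Keep only human (9606) and mouse (10090) entries
g2p <- g2p[g2p$tax_id %in% c(9606, 10090), ]

# Papers per gene, then #citation = log10(1 + number of papers)
papers_per_gene <- table(g2p$GeneID)
citation <- log10(1 + as.numeric(papers_per_gene))
names(citation) <- names(papers_per_gene)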
2.2. Citation Distribution Analysis
We investigated distributions for several genes, and they followed similar trends. Comparison with the exponential distribution was made with fitdistr() from the R MASS package [
5]. Comparison with the Pareto distribution was made with the R ParetoPosStable package.
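A minimal sketch of the exponential comparison in R, assuming papers_per_gene is the vector of paper counts per gene from the sketch above; the Pareto fit with ParetoPosStable is only indicated in a comment, since its exact call is not reproduced here:

library(MASS)

x <- as.numeric(papers_per_gene)   # papers per gene (positive counts)

# Exponential fit: fitdistr() returns the rate estimate and the log-likelihood,
# which can then be compared against the corresponding Pareto fit
fit_exp <- fitdistr(x, "exponential")
fit_exp$estimate
fit_exp$loglik

# The Pareto comparison used the ParetoPosStable package (see text);
# its fitting routine is applied to the same vector x.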
2.3. Gene Family Analysis
Gene symbol nomenclature for the mouse genome (Mus musculus) was extracted from Tabula Muris datasets. Only gene symbols with a structure containing any combination of characters from a to z (English alphabet) followed by any combination of digits from zero to nine were analyzed (structures similar to Abc followed by 1, 2, 3, …, or, specified as a regular expression, [a-zA-Z]+[0-9]+), with a total of 18,330 unique gene symbols. The digits make up fFI, while the #citation of the gene with fFI = 1 was used as ffounder. For founders themselves, ffounder was set to N/A. fFI was capped at 30, and this value was also used for genes that do not have a founder, following our nomenclature. In the total model, ffounder was set to the median value whenever N/A.
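A sketch of the family decomposition in R, assuming symbols is the vector of mouse gene symbols and citation is the measure from Section 2.1, named here by gene symbol for simplicity; the cap at 30 follows the text:

# Keep symbols of the form letters followed by digits, e.g., Abc1, Abc2, ...
fam   <- grepl("^[A-Za-z]+[0-9]+$", symbols)
stem  <- sub("^([A-Za-z]+)[0-9]+$", "\\1", symbols[fam])             # family name
index <- as.numeric(sub("^[A-Za-z]+([0-9]+)$", "\\1", symbols[fam])) # family index

# Family index feature fFI, capped at 30 (the cap is also used for genes
# without a founder)
f_FI <- pmin(index, 30)

# Founder feature ffounder: #citation of the gene with family index 1;
# founders themselves are set to N/A (later imputed with the median)
founder   <- paste0(stem, "1")
f_founder <- citation[founder]
f_founder[symbols[fam] == founder] <- NA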
2.4. Gene Homology Analysis
All the mouse protein sequences were obtained from Uniprot (ID UP000000589). A Java program was used to reduce the FASTA header name to just the gene symbol, and only genes included in the Tabula Muris count tables were considered. The command “blastp -db uniprot.fa -query uniprot.fa -out results_prot.out -outfmt 6” was used to do all-against-all mapping (version 2.6.0+) [6]. Only the highest blastp “pident” score for each pair of genes was retained.
Two methods were used to enrich the graph for edges between the most similar genes. First, only the top 3 edges were retained (largely to speed up the following calculations). Second, a triangle-inequality method was used, inspired by ARACNe DPI [7]. For genes X, Y, and Z, we compared their protein identities I. If IXZ > IXY and IYZ > IXY, the edge X–Y was removed. Informally, this means that Z was sufficiently well-matched to both X and Y that the indirect path X–Z–Y makes the direct link X–Y superfluous. This algorithm was implemented in Java. The #citation of the connected genes then defined fhomology.
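A rough R rendering of the pruning and of fhomology (the original implementation was in Java); it assumes a data frame hits with columns gene1, gene2, and pident holding the best blastp identity per gene pair, and the citation vector keyed by gene symbol. For the full all-against-all case a sparser representation would be needed; the dense matrix below is only for illustration:

# Identity lookup: I[x, y] = best percent identity between proteins x and y
genes <- union(hits$gene1, hits$gene2)
I <- matrix(0, length(genes), length(genes), dimnames = list(genes, genes))
I[cbind(hits$gene1, hits$gene2)] <- hits$pident
I[cbind(hits$gene2, hits$gene1)] <- hits$pident

# DPI-like pruning: drop edge X-Y if some Z matches both ends better
keep <- logical(nrow(hits))
for (e in seq_len(nrow(hits))) {
  x <- hits$gene1[e]; y <- hits$gene2[e]
  z <- setdiff(genes, c(x, y))
  keep[e] <- !any(I[x, z] > I[x, y] & I[y, z] > I[x, y])
}
pruned <- hits[keep, ]

# fhomology: average #citation over the genes each gene remains connected to
nbr <- split(c(pruned$gene2, pruned$gene1), c(pruned$gene1, pruned$gene2))
f_homology <- sapply(nbr, function(g) mean(citation[g], na.rm = TRUE))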
2.5. Gene Expression Analysis
Gene count tables from the Tabula Muris [
8] were retrieved from
https://tabula-muris.ds.czbiohub.org/ (accessed on 29 April 2020). We used the “FACS sorted, SMART-Seq2 RNAseq libraries”, as their depth appeared better suited for co-expression analysis, and used these libraries throughout for consistency. Based on the existing cell type annotation, the number of cells in each tissue was counted. The tissue with the largest number of cells matching a given cell type was designated “the primary tissue”. The average counts were calculated for each cell type in its primary tissue (by focusing on one tissue, we did not need to consider batch effects). fexp was defined as rank (expression level).
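A condensed R sketch of the expression feature, assuming counts is a gene x cell count matrix for one cell type and tissue is the per-cell tissue annotation from the Tabula Muris metadata (both names are placeholders):

# Primary tissue: the tissue contributing the most cells of this cell type
primary <- names(which.max(table(tissue)))

# Average counts per gene within the primary tissue only
avg_expr <- rowMeans(counts[, tissue == primary, drop = FALSE])

# fexp: rank of the expression level (higher expression gives a higher rank)
f_exp <- rank(avg_expr)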
2.6. Gene Co-Expression Network Analysis
The Tabula Muris RNA-seq count table for different cell types was used again. Instead of the average, single cell counts from the primary tissue were retained. The counts were rescaled as log10 (1 + count). The first 6 PCA components were calculated by prcomp_irlba. A projection was made with UMAP [9]. The k-nearest neighbor (KNN) graph was calculated with k = 10. fcoexp was defined as average(#citations) of these neighbors.
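A sketch in R of the neighbor-based feature, assuming logcounts is a matrix of log10(1 + count) values for the primary tissue with genes as rows (so the KNN graph is over genes); this layout, and skipping the UMAP projection used for visualization, are simplifications of the sketch:

library(irlba)   # prcomp_irlba
library(FNN)     # fast k-nearest-neighbor search

# First 6 principal components of the gene profiles
pca <- prcomp_irlba(logcounts, n = 6)

# The 10 nearest neighbors of each gene in PCA space
knn <- get.knn(pca$x, k = 10)$nn.index

# fcoexp: average #citation of the 10 nearest co-expression neighbors
f_coexp <- apply(knn, 1, function(idx)
  mean(citation[rownames(logcounts)[idx]], na.rm = TRUE))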
2.7. Protein–Protein Interaction (PPI) Network Analysis
We downloaded HuRI.tsv (available online: http://www.interactome-atlas.org/ (accessed on 29 April 2020)) [10]. As these data were already in the form of a network, we could use them directly without intermediate processing. The triangle inequality was not applied, as we assumed the dataset to only consider direct interactions. fPPI was defined as average(#citations) of the neighbors of each gene.
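A minimal R sketch of fPPI, assuming HuRI.tsv holds two columns of interacting gene identifiers (called geneA and geneB here; the actual column layout should be checked against the downloaded file) mapped to the same identifiers as citation:

huri <- read.delim("HuRI.tsv", header = FALSE,
                   col.names = c("geneA", "geneB"))

# Undirected neighbor lists: each interaction counts in both directions
nbr <- split(c(huri$geneB, huri$geneA), c(huri$geneA, huri$geneB))

# fPPI: average #citation over each gene's direct interaction partners
f_PPI <- sapply(nbr, function(g) mean(citation[g], na.rm = TRUE))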
2.8. Gene Essentiality Analysis
We downloaded
Supplementary Table S3 from the CRISPR screen study online supplement [
11] and used the column “% Dependent Cell lines”. The global essentiality score as defined by their fuzzy set AdAM algorithm was within the range (0, 100), and thus we used it directly as fessential. For genes not included in the dataset, we set the corresponding value to the median. We also attempted to generate cell type-specific essentialities by manually curating the cancer cell line types and comparing them to Tabula Muris tissues; however, we often could not clearly decide which exact cell type was the origin, so this was not used in the end.
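As a sketch, assuming ess is the table from the study’s Supplementary Table S3 with a gene column and a pct_dependent column standing in for “% Dependent Cell lines”:

# fessential: the percentage of dependent cell lines, used directly
f_essential <- ess$pct_dependent[match(names(citation), ess$gene)]

# Genes absent from the screen get the median essentiality
f_essential[is.na(f_essential)] <- median(f_essential, na.rm = TRUE)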
2.9. Gene Chromatin Proximity Analysis
The mouse genome GRCm38.97 GTF-file was downloaded from Ensembl. Features of the type “gene” were retained, and gene positions were calculated as (from + to)/2. The gene symbol was extracted from the attributes field. The coordinate table was merged with the cell type-specific paper counts. For each chromosome and cell type, the features were sorted. Then, for each gene, the closest other genes were obtained, and fchromatin was defined as the average #citations of these.
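A rough R sketch, assuming the GTF has been read in its standard nine-column layout and taking, as a simplification, the immediately adjacent gene on each side as the “closest” genes:

gtf   <- read.delim("Mus_musculus.GRCm38.97.gtf", header = FALSE,
                    comment.char = "#", quote = "")
genes <- gtf[gtf$V3 == "gene", ]

# Midpoint position and gene symbol from the attributes field
genes$pos    <- (genes$V4 + genes$V5) / 2
genes$symbol <- sub('.*gene_name "([^"]+)".*', "\\1", genes$V9)

# fchromatin: average #citation of the neighboring genes on the same chromosome
per_chr <- lapply(split(genes, genes$V1), function(chr) {
  chr  <- chr[order(chr$pos), ]
  n    <- nrow(chr)
  prev <- c(NA, chr$symbol[-n])     # upstream neighbor
  nxt  <- c(chr$symbol[-1], NA)     # downstream neighbor
  data.frame(symbol = chr$symbol,
             f_chromatin = mapply(function(a, b)
               mean(citation[c(a, b)], na.rm = TRUE), prev, nxt))
})
f_chromatin <- do.call(rbind, per_chr)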
2.10. GWAS Analysis
The file gwas_catalog_v1.0-associations_e98_r2020-03-08.tsv was downloaded from the EBI GWAS catalog. We considered as targets those genes in the column “REPORTED.GENE.S.”. Intergenic SNPs were removed. The smallest p-value for any SNP was calculated for each gene but capped at 10^−40. fGWAS was defined as −rank (p-value), such that higher values implied higher relevance.
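A sketch in R; the column names P.VALUE and INTERGENIC follow the standard catalog layout after read.delim() (the text explicitly names only REPORTED.GENE.S., so the other column names are assumptions):

gwas <- read.delim("gwas_catalog_v1.0-associations_e98_r2020-03-08.tsv",
                   quote = "")

# Remove intergenic SNPs (INTERGENIC == 1 in the catalog layout)
gwas <- gwas[gwas$INTERGENIC != 1, ]

# Smallest p-value per reported gene, capped at 1e-40
p_gene <- tapply(pmax(gwas$P.VALUE, 1e-40), gwas$REPORTED.GENE.S., min)

# fGWAS: negative rank of the p-value, so more significant genes score higher
f_GWAS <- -rank(p_gene)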
2.11. COSMIC Analysis
We downloaded the file CosmicGenomeScreensMutantExport.tsv.gz and used a Java program to extract the number of mutations and length (amino acids) for each gene. The gene names were mapped to mouse genes. The feature fCOSMIC was defined as rank (number of mutations/length of gene). For genes with no COSMIC entry, the smallest value of fCOSMIC was used.
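As a sketch, assuming cosmic is the per-gene summary produced by the parsing step, with columns gene, n_mutations, and length_aa (placeholder names):

# fCOSMIC: rank of the mutation density (mutations per amino acid)
dens     <- cosmic$n_mutations / cosmic$length_aa
f_COSMIC <- rank(dens)[match(names(citation), cosmic$gene)]

# Genes without a COSMIC entry get the smallest observed value
f_COSMIC[is.na(f_COSMIC)] <- min(f_COSMIC, na.rm = TRUE)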
2.12. The Total Model (Linear)
The total model was set up as #citation = m + ∑i ci·fi. Features were first scaled to have unit variance and zero mean. The intercept was discarded. The model was fitted in R using the limma package [12]. Because the features were normalized, we here report the raw coefficient values.
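A plain lm() stand-in for the fit described (the actual fit used the limma package, but the linear model is the same), assuming features is a gene x feature matrix aligned with the citation vector:

# Scale every feature to zero mean and unit variance
X <- scale(features)

# Linear total model: #citation = m + sum_i c_i * f_i
fit <- lm(citation ~ X)

# The reported coefficients c_i (the intercept m is discarded)
coef(fit)[-1]

Because the features are scaled, the coefficients are directly comparable across features.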
2.13. The Total Model (Nonlinear)
Several neural network models were fitted using the PyTorch library [
13]. To avoid overfitting, we only considered networks with low numbers of layers. We picked one representative model with 2 ReLU layers (16 parameters); for example, having 3 ReLU layers gives similar output. Parameters were then searched using the Adam optimizer (convergence shown in
Supplementary Figure S2a). The relative importance of the features was estimated using a LIME [
14]-like approach; for each feature and for each data point (gene), the neural network was asked to predict #citations if one standard deviation was added to that feature. The average difference in #citations was taken as the indicator of importance. The neural network explanation for T cells is shown in
Supplementary Figure S2b. The RMSE was 0.55.
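The perturbation idea can be sketched independently of the model implementation; here predict_citations stands for any fitted model’s prediction function (the paper’s model was a small PyTorch network), and X is the scaled feature matrix:

# LIME-like importance: mean change in predicted #citations when one feature
# is shifted by +1 standard deviation for every gene
feature_importance <- function(predict_citations, X) {
  base <- predict_citations(X)
  sapply(colnames(X), function(f) {
    Xp <- X
    Xp[, f] <- Xp[, f] + sd(X[, f])
    mean(predict_citations(Xp) - base)
  })
}

For the linear model above, this reduces to the fitted coefficients (since the features have unit variance), which provides a convenient sanity check.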
We also tested an approach based on gradient boosting (XGBoost) [
15], resulting in an RMSE of 0.51. The model was tested in the same manner as the neural network (
Supplementary Figure S2c).
The Jupyter notebooks containing the non-linear models are provided in the Github repository.
2.14. Creation of Online Data Visualizer
The online visualizer is provided at
http://data.henlab.org/genepub and was created with the Python framework Dash (available online: https://plotly.com/dash/, version 1.10.0). Most of the underlying data is stored in SQLite3 files, which allow data to be read efficiently on demand. The files were generated using the R package RSQLite.
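As an illustration of the storage step, a minimal RSQLite sketch (gene_features and the file and table names are placeholders, not the actual ones used by the visualizer):

library(DBI)
library(RSQLite)

# Write a per-gene feature table into an SQLite3 file for on-demand reads
con <- dbConnect(SQLite(), "genepub.sqlite")
dbWriteTable(con, "gene_features", gene_features, overwrite = TRUE)
dbDisconnect(con)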
2.15. Drug Availability Analysis
The XML database from DrugBank 5.0 was downloaded and parsed in Java [
16]. This program took all drugbank-id records and looked for a gene-symbol record and all gene-name records within target records. These human gene symbols were translated into mouse gene symbols and compared. We tried taking both all targets and just the first target. Both yielded similar citation correlations and drugs-per-gene trends, but including all targets gave more weight to GABAergic genes. The all-targets approach is the one used for
Supplementary Figure S3.
4. Discussion and Conclusions
Our final model is shown in
Figure 3c. From our analysis, self-reinforcement (the Matthew effect) has a large impact on which genes we study. This is in line with similar findings from past studies [1,26,27,28,29]. Surprisingly, the effect increased post-1990. It cannot be attributed to new sequencing methods (pyrosequencing emerged only in 2005 [30]); it may rather be attributable to expression microarrays in 1995 [31], but more so to the Human Genome Project, which started in 1990 [32] and culminated in 2001 [33] (mouse in 2002 [34]). We had expected self-reinforcement to be stronger in the early days, given how few genes were known, but it is possible that the first genes were discovered because of their importance for the model organism used for study (e.g., insulin already in 1921). For simplicity, we limited ourselves to 1:1 human–mouse orthologs. As hinted by Stoeger et al., initial study in certain model organisms (especially Mus musculus and Rattus norvegicus) has had an enormous impact on genes’ citation popularity relative to their human homologues. As our study is primarily focused on human genes and we wished to retain as many genes as possible, we did not include additional species. However, the availability of model organisms has also impacted gene popularity [1].
Our model also highlights the impact of gene expression and co-expression features on gene popularity, for both coding and non-coding gene transcripts. This outcome particularly reinforces the idea that high-throughput methods like (sc)RNA-Seq, DNA-Seq, protein biology-focused methods, and CRISPR screens are central tools in the generation of unbiased datasets. This translates directly into blurring the historical perspective of traditional “one gene at a time” research (especially since 1990) and broadening the field’s scope towards a more integrative, systemic, and less biased understanding of the biological question studied by the researcher. Genes are now prioritized better according to their relevance. It is possible that this spills over into social bias, with some research into a handful of well-recognized genes being promoted instead of broadening the attention towards emerging secondary players (not exclusively restricted to families) that most likely complete the explanation for the biological event studied by the researcher. Our fcoexp + fPPI features attempt to capture the interaction between the queried genes and their important secondary players at different levels. These features, however, seem to have a low impact on gene popularity, potentially highlighting one of the main limitations of this study. GRNs are highly powerful tools for analyzing stochastic biological behavior [35].
According to our model, features like gene homology (in terms of intraspecific amino acid sequence similarity between gene products), the presence of pre-existing gene family founders, and gene index (within the same family) are key players in spurring researchers to explore gene families further. An example of this is the family of olfactory receptor (Olfr) genes. This has a direct impact on the attention that some genes receive (especially from a family perspective).
Interestingly, our model shows disease relevance and essentiality features to matter for gene popularity, hinting at a cryptic transition of the genomics (and related) field from an essentially exploratory perspective towards a more goal-oriented and context-driven strategy (fueled partially by the advent of drug-target discovery [36]). This could be due to differences in funding, among other factors [1,37]. We did not include data to further investigate the impact of the funding system, which might also indirectly affect recruitment and researchers’ interests. However, other similar studies have identified social cues, including funding, that explain part of some genes’ popularity [1]. Changes in researcher behavior may also affect citations in unclear ways; for example, newer generations of scientists tend to switch between topics more frequently [38]. Would they focus primarily on the commonly known landmark genes if they were to move to a new topic? The exponentially increasing pace of publications (Figure 1b) and the concept of the “least publishable unit” are likely to also alter behavior in ways not analyzed here.
We have here included several features that we suspected were important; more features can be constructed. Some properties may be difficult to capture, and some genes are akin to black swans—their importance relies on unlikely events. For example, the COVID-19 target
Ace2 is likely to obtain coverage disproportionate to our model’s prediction and to emphasize fage. However, even if our features were affected by false positives/negatives, this would not affect their behavior over time. That said, if gene citation or annotation style has changed over time, this would negatively affect our model. Thus, the trends in Figure 2 can be considered quantitative, even if other comparisons are better seen as qualitative. Overall, interpreting the results requires thinking carefully about the meaning of the features. The co-expression and expression features are influenced by the choice of tissues sampled. Cancer relevance may not be well represented through fessentiality, as it was calculated through a CRISPR KO screen rather than CRISPR activation, biasing it toward one type of cancer gene. There are several other caveats in interpreting essentiality as a proxy for cancer relevance [39]. However, this just raises a harder question: why else did it surge in importance in the 1990s? Many of the top drivers according to GWAS and COSMIC seem to have come earlier (Figure 3a). Was essentiality the best driver that could be found, after having run out of other strong disease candidate genes? In this regard, our analysis opens up more questions than we can answer at this time.
One other limitation in our study is that we have not investigated the impact of a changing cell type ontology. To avoid this, we have subjectively stuck with the most popular cell types. For example, the T cell type has been broken down into subtypes, and CD4 T helper cells eventually came to include not just Th1 and Th2, but also Th17 and the still somewhat ignored Th9 (as judged by citation counts). In future work it would be relevant to study, for example, how new cell types are “populated” with new genes from their founding type.
The top genes after 1990 are, in descending order, Pten, Mthfr, Pparg, Mapk1, and Tlr2, genes familiar to many biologists or clinicians as they frequently appear in textbooks (or as part of a mentioned protein complex or pathway). It is hard to imagine how we would have approached biology if we did not have at least some reference points. However, the number of drugs targeting (or known to target) a gene correlates highly with the citations (Supplementary Figure S3, r = 0.4, Pearson correlation on log scale). Thus, as the scientific field of biology has matured, we likely need to look past our “comfort zone of familiar genes” and better integrate regulatory networks to find new drug targets. Unbiased methods such as CRISPR screens and single-cell analysis are likely to be of help. To further guide colleagues toward poorly explored areas, we provide http://data.henlab.org/genepub, showing properties of genes and indicating if they appear understudied. We hope this work enables reflective analysis and helps us focus where it matters the most.