Transformer for Gene Expression Modeling (T-GEM): An Interpretable Deep Learning Model for Gene Expression-Based Phenotype Predictions

Simple Summary: Cancer is the second leading cause of death worldwide. Predicting phenotypes and understanding the markers that define them are important tasks. We propose an interpretable deep learning model called T-GEM that can predict cancer-related phenotypes and reveal phenotype-related biological functions and marker genes. We demonstrated the capability of T-GEM on cancer type prediction using TCGA data and immune cell type identification using scRNA-seq data. The code and detailed documentation are provided to facilitate easy implementation of the model in other studies.

Abstract: Deep learning has been applied in precision oncology to address a variety of gene expression-based phenotype predictions. However, the unique characteristics of gene expression data challenge the computer vision-inspired design of popular deep learning (DL) models such as the Convolutional Neural Network (CNN) and call for interpretable DL models tailored for transcriptomics studies. To address the current challenges in developing an interpretable DL model for modeling gene expression data, we propose a novel interpretable deep learning architecture called T-GEM, or Transformer for Gene Expression Modeling. We provide the detailed T-GEM model for modeling gene–gene interactions and demonstrate its utility for gene expression-based predictions of cancer-related phenotypes, including cancer type prediction and immune cell type classification. We carefully analyzed the learning mechanism of T-GEM and showed that the first layer has broader attention, while higher layers focus more on phenotype-related genes. We also showed that T-GEM's self-attention could capture important biological functions associated with the predicted phenotypes.
We further devised a method to extract the regulatory network that T-GEM learns by exploiting the attributions of self-attention weights for classifications and showed that the network hub genes were likely markers for the predicted phenotypes.


Introduction
The ever-growing genomics data from omics profiling and single-cell technologies have spurred rapid developments of deep learning (DL) methods for genomics, especially gene expression-based phenotype predictions [1,2]. In this work, we hypothesize that:
• T-GEM can accurately predict phenotypes based on gene expression;
• T-GEM captures biological functions and potential marker genes.
To assess these hypotheses, we design the T-GEM model for modeling gene–gene interactions and evaluate its ability for gene expression-based predictions and functional interpretation of cancer-related phenotypes, including cancer type prediction and immune cell type classification. The remainder of the paper is organized as follows. In the Materials and Methods, we describe the datasets and preprocessing steps. We also detail the proposed T-GEM model and present the methods developed to interpret T-GEM for its learned functions, key regulatory networks, and marker genes. In the Results, we carefully assess T-GEM's prediction performance and compare it with existing popular approaches. We further analyze the learning mechanism of T-GEM and show that T-GEM's self-attention can capture important biological functions associated with the predicted phenotypes. We also extract the regulatory network that T-GEM learns by exploiting the attributions of self-attention weights for classification and show that the network hub genes are likely markers for the predicted phenotypes. Concluding remarks are given in the Conclusion.

Materials and Methods
In this study, we proposed a novel interpretable deep learning model called T-GEM, or the Transformer for Gene Expression Modeling, to predict gene expression-based cancer-related phenotypes, including cancer type prediction and immune cell type classification. The interpretability of T-GEM is also presented.

Data Collection and Preprocessing
T-GEM is evaluated on two gene expression datasets, derived from RNA-seq and scRNA-seq, respectively.

TCGA RNA-Seq Data
The Cancer Genome Atlas (TCGA) [23] pan-cancer dataset was collected via the R/Bioconductor package TCGAbiolinks [24] in December 2018. The dataset contains 10,340 tumor samples from 33 cancer types and 713 normal samples from 23 normal tissues. All gene expression values were log-transformed and then min-max normalized. To facilitate the investigation of T-GEM's learning mechanism and functional interpretation, we selected 1708 genes from 24 gene sets comprising 14 cancer functional gene sets from CancerSEA [25] and 10 KEGG biological pathways that are less relevant to cancer.
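As a concrete illustration, the preprocessing described above can be sketched as follows; the per-gene scaling direction and the log2 pseudocount are assumptions, since the paper does not state them:

```python
import numpy as np

def preprocess(counts):
    """Log-transform then min-max normalize each gene to [0, 1].

    counts: (n_samples, n_genes) non-negative expression matrix.
    Gene-wise scaling and the log2(x + 1) transform are assumptions;
    the paper only states log transformation followed by min-max
    normalization.
    """
    logged = np.log2(counts + 1.0)          # pseudocount avoids log(0)
    lo = logged.min(axis=0, keepdims=True)  # per-gene minimum
    hi = logged.max(axis=0, keepdims=True)  # per-gene maximum
    rng = np.where(hi > lo, hi - lo, 1.0)   # guard constant genes
    return (logged - lo) / rng
```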

PBMC scRNA-Seq Data
To test T-GEM's generalizability and interpretability, we applied T-GEM to immune cell type classification using the single-cell RNA-seq (scRNA-seq) gene expression profiles of human peripheral blood mononuclear cells (PBMC; Zheng sorted) [26]. The PBMC dataset includes 10 cell populations extracted through antibody-based bead enrichment and FACS sorting. Each cell population has 2000 cells; therefore, there is a total of 20,000 cells from 10 cell classes. We followed the method in [27] to preprocess the data and selected 1000 highly variable genes as input features.

T-GEM Model
Let x ∈ R^(G×1) be the vector of normalized log-transformed gene expression of G genes and y ∈ Z be the classification label. The goal is to design a deep learning model F, parameterized by θ, such that p(y|x) = F_θ(x). The overall architecture of T-GEM follows that of a Transformer (Figure 1A), which includes multiple attention layers, each composed of multiple self-attention heads. Similar to the classical Transformer, the attention layer generates new attention values z ∈ R^(G×1), which are representations of the input gene expression x. One important feature of the attention layer is that there is a one-to-one correspondence between the input genes and the output representations, i.e., z_g is the representation of x_g (Figure 1B). This input-output consistency is a key modeling feature of T-GEM that results in a biology-aware design of the self-attention units and is an essential ingredient of an interpretable deep learning model.

The core computing element in T-GEM is the self-attention module (Figure 1B), which takes the representation of each gene from the previous layer and computes a set of new representations. We use the self-attention module of the first attention layer to explain its inner operations (Figure 1B). To help understand the connections and differences between the self-attention modules of T-GEM and the Transformer, we draw a parallel between an input gene for T-GEM and an input word for the Transformer. A notable difference, however, is that a gene is denoted by its scalar expression value, whereas a word is embedded as a word vector. Similar to the Transformer, the self-attention module of T-GEM performs two sets of operations. First, three entities, i.e., Query (q_g), Key (k_g), and Value (v_g), are computed for each gene g, where q_g = w_qg x_g, k_g = w_kg x_g, and v_g = w_vg x_g, with w_qg, w_kg, and w_vg being the weights to be learned.
Compared to the Transformer, they are similar in that all the computations are linear; they are nevertheless different because for the Transformer the linear weights are shared across all words, whereas for T-GEM the weights w_qg, w_kg, and w_vg are gene-dependent. Introducing gene-dependent weights makes T-GEM more flexible and powerful in learning gene-specific representations. It is also computationally feasible because the input gene is a scalar as opposed to a word vector. In the second operation, the representations for each Query gene g are computed. Specifically, for Query gene g, an attention weight vector a_g, or a probability distribution of the similarity between g and all other Key genes except g, is computed as

a_gi = softmax( f(q_g, k_i) ), ∀ i = 1, . . . , G and i ≠ g, (1)

where a_gi is the ith attention weight in a_g and f is a similarity function between q_g and k_i defined in Equation (2).
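A minimal sketch of this first-layer attention computation may help. The scalar-product similarity below is only a stand-in for the paper's Equation (2), and the function and variable names are illustrative:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax of a score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_weights(x, w_q, w_k, similarity=None):
    """Attention distribution a_g for every Query gene g (one head).

    x: (G,) expression vector; w_q, w_k: (G,) gene-dependent weights
    (one scalar per gene -- the key difference from a standard
    Transformer, whose projections are shared across tokens).
    `similarity` defaults to a plain product q_g * k_i; the paper's
    Equation (2) may differ, so treat this as an assumption.
    """
    G = x.shape[0]
    q = w_q * x                      # gene-dependent Queries
    k = w_k * x                      # gene-dependent Keys
    if similarity is None:
        similarity = lambda qg, ki: qg * ki
    A = np.zeros((G, G))
    for g in range(G):
        scores = np.array([similarity(q[g], k[i]) for i in range(G)])
        scores[g] = -np.inf          # a Query never attends to itself
        A[g] = softmax(scores)       # row g is the distribution a_g
    return A
```

Masking the diagonal with negative infinity before the softmax enforces the i ≠ g constraint in Equation (1), since exp(-inf) contributes zero weight.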

Equation (2) makes the Query attend to Key genes that have similar up/down behavior and consistently large magnitudes (either highly or lowly expressed) across different samples. Since Key genes with such behavior are likely to be marker genes, Equation (2) results in the Query gene paying attention to marker genes.
Once the attention weights a_g are computed, a representation of the Query gene is obtained as

z_g = a_g^T v_g−, (3)

where v_g− is the vector of the Values excluding gene g. The computation in Equation (3) means that the genes deemed more "similar" by Equation (2) (i.e., correlated marker genes) have bigger contributions to gene g's representation. This process is repeated for each gene until the representations of all the genes are computed, which concludes the operations of the self-attention module. When there are H self-attention heads, the representations from the heads are linearly combined to obtain a single set of representations z^1 for layer 1:

z^1 = layernorm( x + Z^T w_H ), (4)

where the summation represents the skip connection, w_H ∈ R^(H×1) is the weight vector, Z ∈ R^(H×G) has as its (h, g)th element the representation of Query gene g in head h, and layernorm is the layer normalization function [18]. The computations in Equations (1)–(4) are repeated to obtain the Query gene representations of the additional layers.
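Equations (3) and (4) can be sketched as follows; the exact placement of the layer normalization relative to the skip connection is an assumption, and the names are illustrative:

```python
import numpy as np

def head_output(A, v):
    """Equation (3): z_g = sum over i != g of a_gi * v_i for every
    Query g. A: (G, G) attention weights with zero diagonal (the
    self term is masked out); v: (G,) Values."""
    return A @ v

def combine_heads(x, Z, w_h, eps=1e-5):
    """Sketch of Equation (4): mix the head outputs, add the skip
    connection to the layer input x, and layer-normalize.
    Z: (H, G) per-head representations; w_h: (H,) mixing weights."""
    mixed = Z.T @ w_h               # (G,) linear combination of heads
    s = x + mixed                   # skip connection
    return (s - s.mean()) / np.sqrt(s.var() + eps)
```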
In the end, a classification layer produces the class probabilities

p(y|x) = softmax( W_c φ(z^L) ), (5)

where φ is an activation function, z^L is the representation from the last layer L, and W_c is the classification weight matrix.

Entropy of the Attention Weights of a Head
The attention weight vector a_g measures the attention that the Query gene g pays to the Key genes, and it takes the form of a probability distribution. Therefore, to assess the characteristics of the attention of a Query gene, we compute the entropy of a_g:

H(a_g) = − Σ_{i=1}^{G} a_gi log a_gi. (6)

Intuitively, a lower entropy indicates that the distribution of a_gi is more skewed, and therefore Query gene g pays higher attention to specific genes, or focused attention. In contrast, a higher entropy indicates that the distribution of a_gi is flatter, and therefore Query gene g pays similar attention to all its Key genes, or broad attention. We averaged the entropies across all samples to obtain the entropy of a Query gene.
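Equation (6) amounts to the Shannon entropy of each attention distribution; a small sketch follows (the eps guard against log 0 is an implementation detail, not from the paper):

```python
import numpy as np

def attention_entropy(a, eps=1e-12):
    """Equation (6): Shannon entropy of one attention distribution.
    Lower entropy = focused attention on a few Key genes;
    higher entropy = broad, near-uniform attention."""
    a = np.asarray(a, dtype=float)
    return float(-(a * np.log(a + eps)).sum())
```

A uniform distribution attains the maximum entropy log G, while a one-hot distribution attains entropy near zero, which is the basis of the focused-vs-broad comparison across layers.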

Attribution Scores of Attention Weights of a Head and a Layer
To assess the contributions of the attention weights to the classification, we adopted the method proposed in [28] to compute the attribution scores of the attention weights using Integrated Gradients (IG). Given the T-GEM model F with attention weights a_gi and baseline weights a'_gi, an attribution score is computed as

Attr(a_gi) = (a_gi − a'_gi) ∫_0^1 ∂F(a' + α(a − a')) / ∂a_gi dα, (7)

where the integral is taken along the straight line from a'_gi to a_gi, parameterized by α. To simplify the computation, we set the baseline a'_gi to 0. We used the PyTorch interpretability library Captum [29] to compute the attribution scores in Equation (7).
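Equation (7) is typically approximated by a Riemann sum over the interpolation path. The sketch below illustrates the idea with a user-supplied gradient function standing in for autograd; the paper itself uses Captum's implementation on the trained model:

```python
import numpy as np

def integrated_gradients(grad_fn, a, baseline=None, steps=50):
    """Midpoint Riemann-sum approximation of Equation (7).

    grad_fn(a) must return dF/da at attention weights `a`; here it is
    a user-supplied stand-in for automatic differentiation.
    """
    a = np.asarray(a, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(a)   # the paper sets the baseline to 0
    alphas = (np.arange(steps) + 0.5) / steps    # midpoints of [0, 1]
    total = np.zeros_like(a)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (a - baseline))
    return (a - baseline) * total / steps
```

For F(a) = Σ a², whose gradient is 2a, the closed-form IG attribution with a zero baseline is exactly a², which the approximation reproduces.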

Visualization of the Regulatory Networks of the Top Layer
A regulatory network representing the regulations of Query genes by Key genes was derived from the last layer using attention weight attribution scores computed from the testing samples of a specific label. We used Cytoscape [30] to visualize this network.

Training Procedure
The model was trained with the Adam optimizer [31], with a batch size of 16 for 50 epochs. The negative log-likelihood was used as the loss function. The model was trained with multiple sets of hyperparameters (Table 1), and the best model parameters were selected based on validation performance. All of the baseline models were trained following the same procedure. For the CNN, AutoKeras [32] was applied to determine the best model architecture. All model training was conducted using PyTorch 1.9.0 on NVIDIA A100 GPUs.

Performance Evaluation Metrics
The proposed model was evaluated on two multi-class datasets: TCGA has unbalanced class sizes, while the PBMC classes are of equal size. We adopted three metrics to assess the models' performance: ACC (accuracy), MCC (Matthews correlation coefficient), and AUC (the area under the ROC curve). MCC is a reliable metric that produces a high score only if the model performs well on TP, TN, FP, and FN. ACC and MCC are defined as

ACC = (TP + TN) / (TP + TN + FP + FN),
MCC = (TP × TN − FP × FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) ),

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The ROC curve plots the TPR (true positive rate) against the FPR (false positive rate), and the AUC was computed as the area under the ROC curve.
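The two formulas can be computed directly from the confusion-matrix counts; for the multi-class setting the counts would be aggregated per class, which is an assumption here:

```python
import math

def acc(tp, tn, fp, fn):
    """Accuracy from binary confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0 when any marginal
    of the confusion matrix is empty (the usual convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which is why it is informative on the unbalanced TCGA classes.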

Results and Discussion
In this section, we first compare T-GEM's performance with CNN and several classical classifiers on the TCGA and PBMC datasets. Then, we analyze the learning mechanism of T-GEM and examine the biological functions and marker genes captured by T-GEM.

T-GEM's Performance for Cancer Type Classification Using TCGA Data
We evaluate the performance of the T-GEM model on cancer type prediction using the TCGA pan-cancer gene expression data. The data contain 10,340 patients with 33 different cancer types and 713 normal samples. The classification includes 34 labels (33 cancer types and normal). For each label, we extracted 70% of the samples as the training set, 10% as the validation set, and 20% as the testing set. To facilitate the evaluation of interpretation, we selected genes from 14 cancer functional gene sets from CancerSEA, including apoptosis, angiogenesis, invasion, etc., and 10 KEGG biological pathway gene sets that are not related to cancer. A total of 1708 genes were used as input features (detailed information about the 14 functional gene sets and 10 KEGG pathways can be found in the Supplementary Materials file). Because we know which genes belong to cancer-related gene sets, we can examine the genes that each of T-GEM's attention layers focuses on and determine whether T-GEM's predictions are based on cancer-related genes and functions and, if so, which ones.
To determine the architecture and hyperparameter, T-GEM with multiple sets of hyper-parameters (Table 1) were trained, and the obtained model with the best validation performance included three layers, five heads, and no activation in the classification layer of Equation (5).
We evaluated the ACC, MCC, and AUC of T-GEM, CNN, and several classical classifiers (Table 2), where the CNN was trained using AutoKeras so that its architecture and hyperparameters were optimized. Deep learning-based methods, including T-GEM and CNN, have better overall performance, and T-GEM achieves the best overall performance, although the improvement over CNN is small, especially in AUC.
To better understand the differences in the improvements achieved by T-GEM under different performance metrics, we examined the confusion matrices of the predictions. Most of the misclassifications were associated with two cancer types, colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ). SVM misclassified all 34 READ samples as COAD, while CNN misclassified about half of the 96 COAD samples as READ. In contrast, T-GEM correctly predicted 84/96 COAD samples and 15/34 READ samples. The lower misclassification by T-GEM was clearly revealed by ACC but registered only a slight AUC improvement because of the small sample sizes of these two cancer types. Collectively, T-GEM outperformed CNN and classical machine learning algorithms and achieved lower misclassification, especially for cancer types with small sample sizes.

Investigation of T-GEM's Learning Mechanism
Next, we investigated how T-GEM learns to produce a classification prediction. Specifically, we sought to understand the importance of each layer/head for classification, what T-GEM focuses on in each layer, and the biological functions of each layer. We designed the following three analyses to address these questions.
To assess the importance of a layer/head, we first pruned that layer/head and checked the accuracy after pruning. Note that we consider the head output before the skip connection in this experiment. The larger the accuracy reduction that the pruning causes, the more important the head/layer is. To prune a head, we set the head's corresponding weight in w_H in Equation (4) to 0 while keeping the rest of the elements in w_H unchanged. Similarly, to prune a layer, we zeroed the weights of all the heads in that layer. The original test samples were used to compute the pruned model's accuracy.
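The pruning operations amount to zeroing entries of the head-mixing vector w_H from Equation (4); a sketch with illustrative names, not the authors' code:

```python
import numpy as np

def prune_head(w_h, head_idx):
    """Zero one head's mixing weight in w_H while leaving the other
    heads untouched, so only that head's contribution is removed."""
    pruned = np.array(w_h, dtype=float, copy=True)
    pruned[head_idx] = 0.0
    return pruned

def prune_layer(w_h):
    """Prune a whole layer by zeroing every head's mixing weight;
    the layer then passes through only its skip connection."""
    return np.zeros_like(np.asarray(w_h, dtype=float))
```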
As can be seen in Table 3, the accuracy after pruning layer 1 drops the most, from the original 94.92% to 57.48%, indicating that layer 1 is the essential layer. Examining the different heads in layer 1 reveals that head 1 could be the most important head. Intriguingly, we observe much smaller accuracy drops for the heads than for the layer. This could be because heads in the same layer capture similar features. Notice that there are only slight drops associated with layers 2 and 3. Recall that each layer includes a skip connection; this result therefore suggests that being able to pass the learned information from layer 1 to layers 2 and 3 could be an important mechanism for achieving the overall performance.

To further support this observation, we directly assessed the classification performance of each head/layer. To that end, we took the output from a layer/head and trained an SVM on it. The ACCs are shown in Table 4. Now, the larger the accuracy, the more important a head/layer is. Overall, the accuracies of all layers/heads are close to the original, indicating that each layer/head contains a large amount of the information needed for accurate classification. Note that the accuracies decrease from layer 1 to layer 3, which supports the conclusion from the pruning experiment that passing the outputs of layer 1 to the other layers through the skip connections ensures the reported performance of the T-GEM model. We also notice that the accuracies of the different heads, especially in layer 3, vary. This suggests that the features extracted by these heads could be different, and different heads might focus on learning information from different genes.

Recall from Equation (3) that the head output for each Query gene is computed as a linear combination of the Values weighted by the attention weight vector a_g. As an important part of T-GEM, a_g reflects, in the form of a discrete probability distribution, the attention that Query gene g pays to all Key genes.
When this distribution is flat and close to uniform, the Query gene pays broad attention to the Key genes. Otherwise, the Query gene puts more attention on a subset of genes. This distribution varies across layers and heads depending on what and how Query genes attend to Key genes. To understand whether a head pays more attention to a subset of genes (i.e., smaller entropy) or has broad attention (i.e., larger entropy), we computed the entropy of a_g for every head in each layer, averaged across all of the samples (Figure 2). We notice that the entropies of all the heads in layer 1 are similar and among the largest, suggesting that Query genes in layer 1 tend to pay similar and broad attention to Key genes. In contrast, layers 2 and 3 contain heads whose Query genes have lower entropies, and layer 3 has some of the genes with the lowest average entropy. This suggests that layers 2 and 3 start to pay more attention to specific genes. To further understand which specific genes the heads focus on, we examined the five Query genes with the lowest entropy in layer 3 (Figure 2). We found that all of them are related to cancer. EMX2 is a candidate tumor suppressor that regulates tumor progression [33]. EPCAM is a well-known marker for tumor-initiating cancer stem cells and a potential target for cancer therapy [34]. POU5F1 is a potential prognostic and diagnostic biomarker for various epithelial cancers [35]. A lack of BHMT impacts susceptibility to fatty liver and hepatocellular carcinoma [36]. Finally, GFAP is a classical marker of astrocytoma, marking less malignant and more differentiated astrocytomas [37].
Cancers 2022, 14, 4763

In summary, T-GEM pays broad attention in layer 1 but has more concentrated attention in higher layers. The genes with the most focused attention, or lowest entropies, in layer 3 are cancer-related genes.


T-GEM Makes Decisions by Focusing on Important Cancer Pathways
Next, we investigated the biological functions that T-GEM learns in each layer. To this end, we assessed the attribution of attention weights to the classification by computing their attribution scores using IG, as in Equation (7). Then, we averaged the weight attribution scores of the same Query and Key gene pairs from all heads in the same layer to obtain the attribution scores of the layer function. Then, according to [28], we thresholded the attribution scores to retain the Query and Key genes associated with the high weight attribution scores. A snapshot of layer 3 is shown in Figure 3A. We see that Query genes VCAN, MMP11, and FBN1 are connected with Key genes, indicating that the weight attribution scores between these Query and Key genes are large and, therefore, the information passed from these Key genes to the Query genes contributes to the classification. Consequently, we called these Query genes informative genes. On the contrary, we called Query genes with no linked Key genes (e.g., PGM1, PSMB4, LOX) non-informative genes.
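The edge-extraction step can be sketched as follows; the names and the simple fixed threshold are illustrative, since the paper follows the thresholding procedure of [28]:

```python
import numpy as np

def attribution_network(attr, gene_names, threshold):
    """Keep Query-Key pairs whose layer-averaged attribution score
    exceeds a threshold; the surviving edges form the regulatory
    network (Key -> Query). attr: (G, G) matrix of attribution
    scores, rows indexed by Query gene, columns by Key gene."""
    edges = []
    G = attr.shape[0]
    for g in range(G):           # Query genes
        for i in range(G):       # Key genes
            if i != g and attr[g, i] > threshold:
                edges.append((gene_names[i], gene_names[g], attr[g, i]))
    return edges
```

Query genes retaining at least one incoming edge correspond to the informative genes described above; Query genes with no surviving edges are the non-informative ones.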
To determine the biological function of each layer, we performed functional enrichment of the associated informative genes. We investigated the predicted functions for BRCA classification, as BRCA has the largest number of samples (Figure 3B). Since our genes were selected from 14 cancer functional state gene sets and 10 KEGG pathways less related to cancer, we used these 24 gene sets for the enrichment.
As we can see from Figure 3B, 4, 3, and 4 pathways were enriched for layers 1, 2, and 3, respectively (FDR < 0.05). Except for one pathway (VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION) in the first layer, all other pathways are cancer functional states, indicating that T-GEM is able to quickly focus on cancer-related genes even starting from the first layer. Layers 2 and 3 focus on specific cancer pathways, including metastasis, inflammation, and stemness, which are all highly correlated with breast cancer [38-40]. To investigate how the information related to these enriched cancer states is transferred from the input to generate the classification decision, we identified the input Key genes associated with the Query genes in all the significantly enriched pathways of each layer and performed functional enrichment on these Key genes. Then, for each layer, we linked each enriched pathway with the pathways enriched in the Key genes connected to the Query genes in that pathway by virtue of the Query-Key gene associations (Figure 3B).
As shown in Figure 3B, the Key genes from layer 1 are mainly enriched in Invasion, Cell cycle, and DNA Repair, suggesting that only genes in these cancer pathways impact the informative Query genes and thus the enriched pathways of layer 1. Interestingly, among the significantly enriched pathways of layer 1, only genes in invasion influence the significantly enriched pathways in layer 2, which are metastasis, inflammation, and stemness. Among these pathways, only invasion and metastasis contribute to the significantly enriched pathways of layer 3, which are invasion, metastasis, inflammation, and stemness, and further contribute to the classification of BRCA. Overall, this investigation reveals that T-GEM attends to more pathways in layer 1 but manages to focus on genes in key cancer pathways in layers 2 and 3 to make its decisions.

T-GEM Defines a Regulatory Network That Reveals Marker Genes for Different Cancer Types
Note that the Query-Key gene relationship obtained in the last section, as in Figure 3A, essentially defines a regulatory network that a layer learns with Key genes as the regulators. We extracted the network from the last T-GEM layer for BRCA and LUAD classification using their respective samples and examined the networks (Figure 4).

For BRCA, we first examined the nodes in this regulatory network and tested their differential expression between breast cancer patients and normal samples using the Mann-Whitney U test. Overall, 102/107 genes showed significant differential expression in breast cancer patients compared to normal samples (FDR < 0.05). We further examined the five non-differentially-expressed genes (ABAT, NFKBIA, GUCY1A1, CDH2, TPM2) and found that they nevertheless play important roles in breast cancer. For example, ABAT can suppress breast cancer metastasis [41], and NFKBIA's deletion and low expression are strongly related to unfavorable survival of breast cancer patients [42]. CDH2 is strongly associated with several stem cell-related transcription factors and could be a target for breast cancer therapy [43]. TPM2 is a potential novel tumor suppressor gene for breast cancer [44]. This result suggests that T-GEM can focus on functionally relevant genes even if they show little differential expression.

Close examination of the network (Figure 4) revealed several hub genes, including GATA3, FOXA1, COMP, IGFBP2, COX7A1, CDH2, VEGFD, AKAP12, and STC2. All of them are highly related to breast cancer, and some are known BRCA markers. For example, GATA3 is well known to be associated with favorable breast cancer pathologic features and is a promising prognostic marker for breast cancer [45]; FOXA1 is a transcription factor predominantly expressed in breast cancer and is considered a breast cancer biomarker [46,47]; COMP is an emerging independent prognostic marker for breast cancer patients [48]. IGFBP2 is directly related to estrogen receptor status and is a potential biomarker of therapeutic response in breast cancer [49]. Finally, STC2 has been shown to suppress breast cancer cell migration and invasion [50].
For LUAD, 183/199 nodes show significant differential expression in LUAD patients compared to normal samples (U-test, FDR < 0.05). Similarly, many of the remaining genes that are not differentially expressed (FBN1, COX7A2L, IGFBP2, WNT5B, CDKN1B, KIT, etc.) have been shown to be related to lung cancer. For example, FBN1 plays a crucial role in EMT, and its gene expression is highly correlated with the recurrence-free survival rate of LUAD patients [51]. COX7A2L is important for the assembly of mitochondrial respiratory supercomplexes, and it is related to the regulation of respiratory chain supercomplex assembly and function in lung cancer [52]. A high circulating level of IGFBP2 is highly associated with a poor survival rate, and it is a potential prognostic biomarker for lung cancer [53].
Similar to BRCA, we observed the hub genes NKX2-1, ST6GALNAC1, FOXA1, HSD17B6, AKAP12, DDIT4, ASPN, CEACAM5, MALAT1, GALNT6, and PIPOX (Figure 5), and they are also highly associated with LUAD. NKX2-1 controls lung cancer development, and it could be a novel biomarker for lung cancer [54]. AKAP12 acts as a tumor promoter in LUAD and is negatively correlated with patients' prognosis [55]. A high expression level of DDIT4 is associated with poor survival in LUAD patients, and it is a predictor of overall survival [56]. ASPN is found to correlate with LUAD patients' survival time [57]. HSD17B6 is shown to be a potential tumor suppressor in LUAD and a promising prognostic indicator for LUAD patients [58]. CEACAM5 is indicated as a valid clinical biomarker and promising therapeutic target in lung cancer [59]. MALAT1 is a highly conserved nuclear noncoding RNA (ncRNA) and a predictive marker for metastasis development in lung cancer [60]. ST6GALNAC1 can promote lung cancer metastasis, and it is also a novel biomarker for LUAD [61,62]. FOXA1 plays an oncogenic role in lung cancer, and its high expression level is associated with a poor overall survival rate [63]. GALNT6 can promote invasion and metastasis of LUAD, and its expression level is associated with LUAD lymph node metastasis and poor prognosis [64,65]. PIPOX is a major enzyme in the sarcosine metabolism pathway, and it is positively correlated with shorter overall survival [66]. In summary, T-GEM's attention mechanism models a regulatory network of Query genes by Key genes, and the hub genes of the top layer's network reveal important marker genes of different cancer types.

T-GEM's Performance and Interpretation Results for PBMC Single-Cell Cell-Type Prediction
To further examine the generalizability of T-GEM, we applied it to the PBMC single-cell dataset for cell type classification. The dataset contains 10 different cell types with 2000 samples for each cell type. Overall, 1000 highly variable genes were selected as the input features. We followed the same model training and selection process and obtained a T-GEM model with one layer and five heads, with GeLU as the activation function of the classification layer.
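The HVG selection step can be sketched as a simple variance ranking on log-transformed expression. This is a minimal illustration under assumptions, not the authors' pipeline (`select_hvgs` is a hypothetical helper; dedicated tools such as Scanpy's `sc.pp.highly_variable_genes` use dispersion-based variants):

```python
import numpy as np


def select_hvgs(expr, gene_names, n_top=1000):
    """Pick the n_top most variable genes from a cells x genes matrix.

    Hypothetical helper: ranks genes by variance of log1p-transformed
    expression and returns the selected names plus the reduced matrix.
    """
    log_expr = np.log1p(expr)              # stabilize variance across scales
    variances = log_expr.var(axis=0)
    top = np.argsort(variances)[::-1][:n_top]
    return [gene_names[i] for i in top], expr[:, top]
```

In the paper's setting, such a step reduces the scRNA-seq matrix to the 1000 input features used by T-GEM.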
T-GEM achieved among the best performances, outperforming random forest and decision trees by large margins while being competitive with CNN and SVM (Table 5). We further examined the confusion matrices of SVM and T-GEM separately and observed that both models have high misclassification rates for three T cell subtypes: CD4+ T helper, CD4+/CD25 T Reg, and CD4+/CD45RA+/CD25- Naïve T. However, T-GEM has lower misclassification rates for T helper cells (150/400) and T Reg cells (73/400) than SVM (157/400 and 94/400), although SVM has fewer misclassifications for Naïve T cells (50/400) than T-GEM (82/400). In conclusion, T-GEM achieved performances competitive with SVM and CNN for PBMC cell type classification using scRNA-seq data. However, CNN and SVM are not easy to interpret. We investigate the interpretability of T-GEM next.

As shown for TCGA cancer type classification, T-GEM makes decisions by focusing on the genes in cancer-related pathways. Therefore, we used the same approach to examine the functions that T-GEM learns. We enriched the informative genes from the first layer in GO Biological Processes. The result shows that most of the informative genes are enriched in ribosome- and translation-related pathways, including, for example, cytoplasmic translation, peptide biosynthetic process, ribosome biogenesis, and ribosome assembly (the detailed enrichment result is in the Supplementary Materials file). Because these cell types are FACS-sorted based on the protein activities of surface markers while T-GEM's inputs are gene expression, T-GEM seems to focus on translation-related processes that translate gene expression into protein activities. We subsequently compared the informative genes with the marker genes for these PBMC cell types defined in the CellMatch database [67] and identified the NK cell markers NKG7 and KLRB1 and the B cell markers CD79B and CD37. Note that many of the marker genes (such as CD4, CD19, etc.) were excluded from the 1000 HVGs selected for our prediction, suggesting that their expression levels are either low or not differentially expressed across the cell types. This result highlights T-GEM's ability to learn the underlying processes that relate predicted phenotypes to inputs.
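The per-class misclassification rates quoted above (e.g., 150/400 for T helper cells) can be read off a confusion matrix. The following is a small self-contained sketch, with `build_confusion` and `misclassified` as hypothetical helpers rather than the authors' evaluation code:

```python
import numpy as np


def build_confusion(y_true, y_pred, classes):
    """Confusion matrix with rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    return cm


def misclassified(cm, classes):
    """Per-class (misclassified, total) counts, e.g. (150, 400)."""
    return {
        c: (int(cm[i].sum() - cm[i, i]), int(cm[i].sum()))
        for i, c in enumerate(classes)
    }
```

Comparing such per-class counts between two models is how the T helper / T Reg / Naïve T trade-off between T-GEM and SVM was summarized above.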

T-GEM's Regulatory Networks for NK Cells and B Cells
We next derived the regulatory network of Key-Query genes for the T-GEM layer for the NK cells and B cells (Figures 6 and 7). Out of 123 nodes in this network, 114 showed differential expression in NK cells (FDR < 0.05). For NK cells, there are four hub genes: MALAT1, CD52, RPLP1, and KLRB1 (Figure 6). KLRB1 is a C-type lectin-like receptor of NK cells and a known NK cell marker. It plays an important role as an inhibitory receptor of NK cells and also works as a co-stimulatory receptor to promote the secretion of IFNγ [68]. MALAT1 is a druggable long non-coding RNA and plays an important role in NK cell-mediated immunity [69]. Figure 6. T-GEM regulatory network for NK cell. Nodes are informative genes and their associated Key genes. Links go from Key to Query genes. Genes with red color are hub genes.
We next examined the regulatory network for CD19+ B cells (Figure 7). Note that CD19 is not an HVG. Out of 128 nodes in this network, 127 showed differential expression in B cells (FDR < 0.05). CD53 is the only gene that is not differentially expressed. However, CD53 has been found to promote B cell development by maintaining IL-7R signaling [70]. We examined the hub genes of the B cell regulatory network, which include HLA-DPB1, RPL35, RPLP0, RPLP1, RPL13, and MALAT1. Interestingly, RPL35, RPLP0, RPLP1, and RPL13 are ribosomal proteins, which could contribute to the enriched translation-related functions of this layer. MALAT1 is also a hub gene for the NK cell network. It has large expression variations across different PBMC cell types and is down-regulated in B cells [71]. HLA-DPB1 has been shown to be highly expressed in B cell lines [72] and could serve as an expression marker for B cells. Taken together, these results demonstrate T-GEM's ability to learn functions and marker genes associated with predicted phenotypes.
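Hub genes such as MALAT1 fall out of a simple degree ranking on the Key→Query edge list of the derived network. A minimal sketch (the `hub_genes` helper is hypothetical, not the authors' code; hubs are ranked by total degree, in-edges plus out-edges):

```python
from collections import Counter


def hub_genes(edges, top_k=4):
    """Rank genes by total degree in a directed Key -> Query edge list.

    edges: iterable of (key_gene, query_gene) pairs extracted from the
    attention-derived regulatory network.
    """
    deg = Counter()
    for key, query in edges:
        deg[key] += 1      # out-degree contribution
        deg[query] += 1    # in-degree contribution
    return [gene for gene, _ in deg.most_common(top_k)]
```

Applied to the NK cell network, such a ranking would surface the four hubs (MALAT1, CD52, RPLP1, KLRB1) highlighted in Figure 6.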

Conclusions
We proposed T-GEM, an interpretable deep-learning architecture inspired by the Transformer for gene expression-based phenotype predictions using RNA-seq and scRNA-seq data. In the model architecture, each input gene's expression is treated as a token, and a self-attention function was proposed to model gene-gene interactions. We comprehensively investigated T-GEM's performance and learned functions for cancer type classification using TCGA data and immune cell type classification using PBMC scRNA-seq data. We showed that:
• T-GEM can provide accurate gene expression-based predictions. T-GEM had the best performance on both datasets compared with CNN and several classical classifiers.
• T-GEM learns associated cancer functions. We showed that T-GEM had broad attention in the first layer and paid attention to specific cancer-related genes in layers 2 and 3 for cancer type classification. We also revealed that T-GEM learned cancer-associated pathways at every layer and could concentrate on specific pathways important for the predicted phenotypes in layer 3.
• We extracted the regulatory network of layer 3 and showed that the network hub genes were likely cancer marker genes. We also demonstrated the generalization of these results to immune cell type classification using PBMC scRNA-seq data.
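The genes-as-tokens self-attention underlying these results can be sketched in NumPy. This is an illustrative single-head computation under simplifying assumptions (the `gene_self_attention` helper is hypothetical and omits T-GEM's multi-head structure, stacking, and classification layer); each row of the returned attention map can be read as one Query gene's weights over Key genes:

```python
import numpy as np


def gene_self_attention(x, Wq, Wk, Wv):
    """One self-attention head over genes-as-tokens (hypothetical sketch).

    x:  (n_genes, d) matrix of embedded gene-expression tokens
    Wq, Wk, Wv: (d, d) projection matrices for Query, Key, Value
    Returns the updated tokens and the (n_genes, n_genes) attention map.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product
    # numerically stable row-wise softmax
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V, attn
```

It is this attention map that, after attribution analysis, yields the Key→Query regulatory networks and hub genes discussed above.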
As far as we know, this is the first time such a model has been proposed for modeling gene expression data. Inevitably, there are still many challenges to be addressed. The high memory overhead associated with training the attention weights limits the input gene dimension. Currently, we have to carefully preselect a subset of genes, which could cause T-GEM to overlook important genes. Furthermore, the self-attention function in Equation (2) is general and allows the integration of prior biological networks, such as protein-protein interaction networks, to introduce biologically meaningful inductive bias into the learning. Research into solutions to these challenges will be our future focus.
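The proposed integration of prior networks could, for example, take the form of masking attention scores with a binary PPI adjacency matrix before the softmax, so that a gene may only attend to its known interaction partners. A hedged sketch, not part of T-GEM itself (`masked_attention_scores` is a hypothetical helper; it assumes the adjacency includes self-loops so that every row has at least one allowed entry):

```python
import numpy as np


def masked_attention_scores(scores, adjacency):
    """Restrict attention to gene pairs present in a prior network.

    scores:    (n_genes, n_genes) raw attention scores
    adjacency: binary matrix; 1 where gene i may attend to gene j
               (assumed to include self-loops on the diagonal)
    Returns row-normalized attention with disallowed pairs set to 0.
    """
    masked = np.where(adjacency.astype(bool), scores, -np.inf)
    # softmax over the allowed entries only; exp(-inf) contributes 0
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)
```

Such masking would also shrink the effective attention computation, which may help with the memory overhead noted above.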