Artiﬁcial Neural Network Analysis of Gene Expression Data Predicted Non-Hodgkin Lymphoma Subtypes with High Accuracy

: Predictive analytics using artiﬁcial intelligence is a useful tool in cancer research. A multilayer perceptron neural network used gene expression data to predict the lymphoma subtypes of 290 cases of non-Hodgkin lymphoma (GSE132929). The input layer included both the whole array of 20,863 genes and a cancer transcriptome panel of 1769 genes. The output layer was lymphoma subtypes, including follicular lymphoma, mantle cell lymphoma, diffuse large B-cell lymphoma, Burkitt lymphoma, and marginal zone lymphoma. The neural networks successfully classiﬁed the cases consistent with the lymphoma subtypes, with an area under the curve (AUC) that ranged from 0.87 to 0.99. The most relevant predictive genes were LCE2B , KNG1 , IGHV7_81 , TG , C6 , FGB , ZNF750 , CTSV , INGX , and COL4A6 for the whole set; and ARG1 , MAGEA3 , AKT2 , IL1B , S100A7A , CLEC5A , WIF1 , TREM1 , DEFB1 , and GAGE1 for the cancer panel. The characteristic predictive genes for each lymphoma subtypes were also identiﬁed with high accuracy (AUC = 0.95, incorrect predictions = 6.2%). Finally, the topmost relevant 30 genes of the whole set, which belonged to apoptosis, cell proliferation, metabolism, and antigen presentation pathways, not only predicted the lymphoma subtypes but also the overall survival of diffuse large B-cell lymphoma (series GSE10846, n = 414 cases), and most relevant cancer subtypes of The Cancer Genome Atlas (TCGA) consortium including carcinomas of breast, colorectal, lung, prostate, and gastric, melanoma, etc. (7441 cases). In conclusion, neural networks predicted the non-Hodgkin lymphoma subtypes with high accuracy, and the highlighted genes also predicted the survival of a pan-cancer series.


Introduction
Non-Hodgkin lymphomas (NHL) are a group of hematological neoplasia that originate from B-lymphocytes, T-lymphocytes, or natural killer (NK) cells, either from progenitors or mature cells [1,2]. The 2016 World Health Organization (WHO) classification of tumors of hematopoietic and lymphoid tissues integrates morphologic, immunophenotypic, genetic, and clinical features to define and classify the distinct NHL subtypes [3]. The lymphoid neoplasms are derived from cells that differentiate into T-lymphocytes or B-lymphocytes. The T-lymphocytes include subtypes of CD8-positive cytotoxic Tlymphocytes, CD4-positive T helper lymphocytes, and FOXP3-positive regulatory Tlymphocytes (Tregs). The B-lymphocytes category includes B-lymphocytes (naïve, centrocytes, and centroblasts) and plasma cells. The lymphoid neoplasms were initially classified according to the clinical behavior as indolent, aggressive, and highly aggressive [1,2,4]. Nevertheless, due to the heterogeneous evolution, the current classification focuses more  [3]. Mantle cell lymphoma develops from or has a stage of differentiation of the pregerminal center B-lymphocytes (CD5-positive naïve B-lymphocytes); follicular lymphoma, diffuse large B-cell lymphoma, and Burkitt lymphoma derived from B-lymphocytes of the germinal center (centroblasts, and centrocytes); and marginal zone B-cell lymphoma derives from pot-germinal center B-lymphocytes.  [3]. Mantle cell lymphoma develops from or has a stage of differentiation of the pregerminal center B-lymphocytes (CD5-positive naïve B-lymphocytes); follicular lymphoma, diffuse large B-cell lymphoma, and Burkitt lymphoma derived from B-lymphocytes of the germinal center (centroblasts, and centrocytes); and marginal zone B-cell lymphoma derives from pot-germinal center B-lymphocytes.

Gene Expression Data Set
Neural networks are the preferred analytical tool for many predictive data mining applications because they are convenient, flexible, and powerful [5][6][7]. Predictive neural networks are particularly useful in applications where the underlying process is complex, such as biological systems [8][9][10][11][12][13][14]. The multilayer perceptron (MLP) procedure produces a predictive model for one or more dependent (target) variables based on the values of the predictor variables [5,6]. This research used publicly available data to predict some of the NHL subtypes based on the expression of 20,863 genes and a cancer transcriptome panel of 1769 target genes. Additionally, the most relevant genes of the MLP were tested for prediction of the overall survival of one of the most frequent NHL subtypes, DLBCL, and other types of neoplasia.
The RNA had been extracted from fresh/frozen tumor specimens using trizol or Qiagen RNA extraction kit. The biotin labeling, hybridization, and scanning protocols had been performed according to the conventional Affymetrix protocols. The microarray that was used was the Affymetrix U133 plus 2.0 (GPL570, Affymetrix, Santa Clara, CA, USA; https://www.thermofisher.com/order/catalog/product/900466#/900466, last accessed on 15 August 2021). The ExpressionFileCreator software RMA-normalized the raw cel files (GenePattern31, available at https://www.genepattern.org/, last accessed on 15 August 2021), and the quality check for batch effects was performed by unsupervised clustering of the 3000 most variably expressed genes across the data set.
Pan-cancer data sets were obtained from TCGA (https://tcga-data.nci.nih.gov/, last accessed on 15 August 2021) and UCSC Xena (http://xena.ucsc.edu/, last accessed on 15 August 2021). The clinical data were available in the corresponding repositories. The data sets were quantile-normalized and log2 transformed if necessary. The risk score (also known as the prognostic index) was used to create risk groups. The risk score was calculated by multiplying the beta coefficients of the Cox model by the gene expressions (Risk score = β 1 x 1 + β 2 x 2 + . . . + β p x p where x i is the expression value and β I is the beta value of the Cox table). In the Cox, all the genes are included in a unique model [16,17].

Software
Different software was used following the manufacturer's instructions for data preparation, processing, analysis, and confirmation of results: EditPad Lite (version 8.  [18], JMP statistical discovery from SAS (JMP Pro 14.0.0, SAS Institute Inc., Heidelberg, Germany), IBM SPSS and modeler (versions 26 and 18, IBM Corporation, Armonk, NY, USA), R (version 3.6.3 (https://www.r-project.org/ (accessed on 15 August 2021))), and R Studio (version 1.3.959; https://www.rstudio.com/products/rstudio/#rstudio-desktop (accessed on 15 August 2021)). Data manipulation was mainly performed using EditPad Lite and Excel. The GSEA software was mainly used for collapsing multiple probes sets to one gene. There are several options, including max probe, median, mean, and sum of probes. In this research, the max probe was chosen. GSEA allowed determining whether a priori set of genes showed an association toward two biological states (e.g., lymphoma subtype). Nevertheless, this information was excluded from the manuscript because of length constraints. Survival analysis using R can be checked on the following web page: https://cran.r-project.org/web/views/Survival.html (accessed on 15 August 2021). Random forest for gene expression can be calculated in the following web pages http: //genesrf.iib.uam.es/and https://www.ligarto.org/rdiaz/software/software#varSelRF (based on R, accessed on 15 August 2021). IBM modeler includes several machine learning techniques.

Multilayer Perceptron Analysis
The multilayer perceptron neural network analyses predicted the lymphoma subtype based on the genes expression values. The analysis was performed to predict all subtypes simultaneously and then each subtype against the others. The multilayer perceptron was performed as we have recently described [17,[19][20][21][22][23].
This type of feed-forward neural network consisted of three types of layers ( Figure 2). In our procedure, the "best" architecture was searched. The input layer received the gene expression values for each gene to be processed. The output layer performed the prediction and classification. The hidden layers acted as the computational engine. The dependent variable was nominal because its values represented categories with no intrinsic ranking (the lymphoma subtypes). The predictor variables (covariates are scale) were the genes, the 20,863 genes of the array, and a cancer transcriptome panel of 1790 genes. The hidden layers ranged from one or two, with hyperbolic tangent or sigmoid activation function, and with automatically computed (ranging from 1 to 50) or custom number of units. The hyperbolic tangent function has the form: γ(c) = tanh(c) = (e c − e −c )/(e c + e −c ). It takes real-valued arguments and transforms them to the range (-1, 1). This sigmoid function has the form: γ(c) = 1/(1 + e −c ). It takes real-valued arguments and transforms them to the range (0, 1). The output layer used the identity, softmax, hyperbolic tangent, or sigmoid activation functions. This identity function has the form: γ(c) = c. It takes real-valued arguments and returns them unchanged. This softmax function has the form: γ(c k ) = exp(c k )/Σ j exp(c j ). It takes a vector of real-valued arguments and transforms it to a vector whose elements fall in the range (0, 1) and sum to 1. Softmax is available only if all dependent variables are categorical. The activation function chosen for the output layer determined the rescaling method. The dependent variables rescaling was standardized, normalized, adjusted normalized, or none. The training was batch, online, or mini-batch. Batch training updates the synaptic weights only after passing all training data records; that is, batch training uses information from all records in the training data set. It is most useful for "smaller" data sets. Online training updates the synaptic weights after every single training data record; that is, online training uses information from one record at a time. Online training is superior to batch for "larger" data sets with associated predictors. Minibatch training divides the training data records into groups of approximately equal size, then updates the synaptic weights after passing one group; that is, mini-batch training uses information from a group of records. Then the process recycles the data group if necessary. Mini-batch training offers a compromise between batch and online training, and it may be best for "medium-size" data sets. The optimization algorithm was the scaled conjugate gradient or gradient descent. The options of the training set, initial lambda (0.0000005), initial sigma (0.00005), interval center (0), and interval offset (±0.5) [5,6,17,[19][20][21][22][23].
The output included the diagram and the synaptic weights that were exported to an xml file. Synaptic weights display the coefficient estimates that show the relationship between the units in a given layer to the units in the following layer. Synaptic weights are based on the training sample even if the active data set is partitioned into training, testing, and holdout data. The network performance was assessed with the model summary, classification results, receiver operating characteristic (ROC) curve, cumulative gains chart, lift chart, predicted by the observed chart, and residual by the predicted chart. The independent variable importance analysis was performed. Independent variable importance analysis performs a sensitivity analysis, which computes the importance of each predictor in determining the neural network. The analysis is based on the combined training and testing samples or only on the training sample if there is no testing sample. This creates a table and a chart displaying importance and normalized importance for each predictor. The predicted value or category for each dependent variable was saved. The predicted pseudoprobability for each dependent variable was also kept. As for stopping rules, the maximum step without a decrease in error as set to 1, the maximum training time to 15 min, the minimum relative change in training error to 0.0001, and the minimum relative change in the training error ratio to 0.001. Missing values were excluded, if present [5,6,17,[19][20][21][22][23].
Mach. Learn. Knowl. Extr. 2021, 3 FOR PEER REVIEW 5 The predicted value or category for each dependent variable was saved. The predicted pseudoprobability for each dependent variable was also kept. As for stopping rules, the maximum step without a decrease in error as set to 1, the maximum training time to 15 min, the minimum relative change in training error to 0.0001, and the minimum relative change in the training error ratio to 0.001. Missing values were excluded, if present [5,6,17,[19][20][21][22][23].

Hardware
The analysis was performed in a desktop workstation equipped with an AMD Ryzen 7 3700X (8-Core processor at 3.59 GHz), 16 GB of RAM, and NVIDIA GeForce GTX 1650 GPU.

Ethical Compliance
This study used gene expression data involving humans. Ethical approval had been previously obtained from the corresponding institutions. This research complies with the World Medical Association (WMA), Declaration of Helsinki statement of ethical principles for medical research involving human subjects. The Tokai University School of Medicine Review Board had approved research in artificial intelligence (IRB-156).

Multilayer Perceptron Analysis (MLP) for Predicting all NHL Subtypes
The MLP analysis was performed using all the genes and the cancer transcription panel. The neural network architecture and performance are shown in Table 1 and Figure  3. The performance was higher in the cancer transcriptome panel rather than when using the whole set of 20,863 genes. The genes were ranked according to their normalized importance for predicting NHL subtypes. In Table 2, the top 20 genes are shown for all genes set, including the normalized importance (NI), their function, and the keyword. The most relevant genes for predicting NHL subtypes were LCE2B, KNG1, IGHV7_81, TG, and C6. In Table 3, the top 20 genes of the cancer transcriptome panel are shown, and the neural network architecture is shown in Figure 4. The most relevant genes were ARG1, MAGEA3, AKT2, IL1B, and S100A7A.

Hardware
The analysis was performed in a desktop workstation equipped with an AMD Ryzen 7 3700X (8-Core processor at 3.59 GHz), 16 GB of RAM, and NVIDIA GeForce GTX 1650 GPU.

Ethical Compliance
This study used gene expression data involving humans. Ethical approval had been previously obtained from the corresponding institutions. This research complies with the World Medical Association (WMA), Declaration of Helsinki statement of ethical principles for medical research involving human subjects. The Tokai University School of Medicine Review Board had approved research in artificial intelligence (IRB-156).

Multilayer Perceptron Analysis (MLP) for Predicting All NHL Subtypes
The MLP analysis was performed using all the genes and the cancer transcription panel. The neural network architecture and performance are shown in Table 1 and Figure 3. The performance was higher in the cancer transcriptome panel rather than when using the whole set of 20,863 genes. The genes were ranked according to their normalized importance for predicting NHL subtypes. In Table 2, the top 20 genes are shown for all genes set, including the normalized importance (NI), their function, and the keyword. The most relevant genes for predicting NHL subtypes were LCE2B, KNG1, IGHV7_81, TG, and C6. In Table 3, the top 20 genes of the cancer transcriptome panel are shown, and the neural network architecture is shown in Figure 4. The most relevant genes were ARG1, MAGEA3, AKT2, IL1B, and S100A7A.  First, all the genes of the array were used as input. Second, the input data were the genes of the cancer transcriptome panel. In both cases, the output was the 5 NHL subtypes. As shown by the ROC curve, the performance was higher for the cancer transcriptome panel. First, all the genes of the array were used as input. Second, the input data were the genes of the cancer transcriptome panel. In both cases, the output was the 5 NHL subtypes. As shown by the ROC curve, the performance was higher for the cancer transcriptome panel.

Multilayer Perceptron Analysis (MLP) for Predicting Each NHL Subtype against the Other Subtypes
The MLP analysis was repeated for predicting each individual NHL subtype against the other subtypes. For example, the neural network architecture for FL prediction had two outputs: FL vs. other NHL subtypes (i.e., MCL + DLBCL + BL + MZL). Table 4 shows the architecture characteristics and performance of each MLP using all the genes of the array. The performance was suitable, with high accuracy of the classification and low percentages of incorrect predictions (averaged incorrect predictions = 7.2%). In this series, the NHL subtype with less accuracy was DLBCL. Nevertheless, the areas under the curves were above 0.90 in all cases. Figure 5 shows the top 20 predictive genes for each NHL subtype.
The results of the MLPs using the cancer transcriptome panel are shown in Table 5 and Figure 6. The neural network performances were higher than all genes set (n = 20,863), and in all cases above the area under the curve were above 0.95 with averaged incorrect predictions of 6.2%. The network architecture for predicting FL against the other NHL subtypes is shown in Figure 7.

Multilayer Perceptron Analysis (MLP) for Predicting Each NHL Subtype against the Other Subtypes
The MLP analysis was repeated for predicting each individual NHL subtype against the other subtypes. For example, the neural network architecture for FL prediction had two outputs: FL vs. other NHL subtypes (i.e., MCL + DLBCL + BL + MZL). Table 4 shows the architecture characteristics and performance of each MLP using all the genes of the array. The performance was suitable, with high accuracy of the classification and low percentages of incorrect predictions (averaged incorrect predictions = 7.2%). In this series, the NHL subtype with less accuracy was DLBCL. Nevertheless, the areas under the curves were above 0.90 in all cases. Figure 5 shows the top 20 predictive genes for each NHL subtype.    The results of the MLPs using the cancer transcriptome panel are shown in Table 5 and Figure 6. The neural network performances were higher than all genes set (n = 20,863), and in all cases above the area under the curve were above 0.95 with averaged incorrect predictions of 6.2%. The network architecture for predicting FL against the other NHL subtypes is shown in Figure 7.

Prediction of the Overall Survival of DLBCL and Other Types of Cancer.
The MLP analysis predicted the NHL subtypes using all the 20,863 genes of the array. The characteristics of the neural network architecture, performance, and accuracy are shown in Table 1 and Figure 3. Table 2 lists the top 20 genes with the highest normalized importance for predicting the NHL subtypes. The top 30 genes were the following: LCE2B,
The predictive value for overall survival of the top 30 genes was explored for the most frequent lymphoma subtype, the DLBCL, using an independent series of 414 cases (GSE10846). Additionally, the prognostic value was also tested in the most frequent and relevant cancer subtypes of the TCGA database. Using a risk-score formula, the cases of each series were stratified into high and low-risk groups. The risk scores were calculated by multiplying the beta values of the Cox regression per gene expression values for each gene, as previously described [8][9][10][11][12][13]. The overall survival was calculated using the Kaplan-Meier and log-rank test and Cox regression analyses. In total, 7441 neoplastic samples were included in the analysis. The analysis showed that using the top 30 genes, it was possible to predict the prognosis of DLBCL and the rest of the cancer subtypes ( Table 6 and Figures 8 and 9). 1

Discussion
Neural networks are one of the preferred analytical tools in data mining because they can easily adapt to complex processes such as biological systems. Neural networks are different from the traditional statistical methods.
Despite that a linear regression model can be considered a special case of certain neural networks, linear regression is characterized by a rigid model structure and a set of Using the gene expression of 30 genes, previously identified by the MLP analysis and a risk-score formula, the overall survival of the most relevant subtypes of cancers was calculated. These prognostic genes belong to the apoptosis, cell proliferation, metabolism, and antigen presentation pathways. Colorectal cancer, Log-rank (Mantel-Cox) p-value = 2 × 10 −6 ; prostate cancer, p = 1.6 × 10 −11 ; skin melanoma, p = 2.5 × 10 −10 ; gastric + esophageal cancer, p = 3.3 × 10 −6 ; liver cancer, p = 1.4 × 10 −10 ; cervical cancer, p = 3.7 × 10 −9 .

Discussion
Neural networks are one of the preferred analytical tools in data mining because they can easily adapt to complex processes such as biological systems. Neural networks are different from the traditional statistical methods.
Despite that a linear regression model can be considered a special case of certain neural networks, linear regression is characterized by a rigid model structure and a set of assumptions that are imposed before learning from the data [5,6]. However, the neural network makes minimal demands on the model structure and assumptions, and the rela-tionships between the predictors and predicted variables do not need to be hypothesized in advance. In a neural network, the type of relationship is determined during the learning process. If a linear relationship is appropriate, the results of the neural network will be close to those of the linear regression, but in other circumstances, the appropriate relationship will be nonlinear [5,6]. In exchange for this flexibility, the synaptic weights of a neural network are not easily interpretable. Therefore, the results of the neural network cannot be easily understood by humans ("black box").
For instance, Figure 5 showed the network diagram of the multilayer perceptron analysis for predicting follicular lymphoma against the other non-Hodgkin lymphoma subtypes based on a cancer transcriptome panel. In the analysis, all the 1769 genes were ranked according to their normalized importance for prediction. The top 20 genes were the following: ITGA1, CRP, MYCT1, ARG1, KRT5, PPARGC1A, CXCR2, C9, IL5RA, PTPRR, LEP, FGF1, MAGEC1, CALML5, KRT1, DSC3, C6, ICAM4, CCL7, and WT1. The most relevant gene in this model was integrin subunit alpha 1 (ITGA1), with a normalized importance of 1, followed by C-reactive protein (CRP, 0.95), MYC target 1 (MYCT1, 0.78), arginase-1 (ARG1, 0.85), and keratin 5 (KRT5, 0.84). The relationship between these genes and the diagnosis of follicular lymphoma was analyzed using a binary linear regression (backward stepwise, conditional), and at the last step the most relevant genes were MYCT1 (odds ratio = 1.1, p < 0.001), KRT5 (odds ratio = 0.9, p < 0.001), DSC3 (odds ratio = 1.4, p < 0.001), and WT1 (odds ratio = 0.83, p < 0.001). Therefore, ITGA1 did not follow a statistically significant linear relationship. However, follicular lymphoma was characterized by an increase in MYCT1 and DSC3 and a decrease in KRT5 and WT1. Figure 5 also showed a nonlinear prediction of FL vs. others using a Bayesian network; this graphical model displays variables (nodes) in a data set and the probabilistic, or conditional, independencies between them. Causal relationships between nodes (genes) can be represented by a Bayesian network; however, the links in the network (arcs) do not necessarily represent direct cause and effect [5,6,17,21,23]. In the Bayesian network, ITGA1 is located more centrally, with a connection to CXCR2 that connects to FGF1, PPARGC1A, MAGEC1, and ICAM4. ITGA1 is an integrin upregulated by the B-lymphocytes of follicular lymphoma when cultured with follicular dendritic cells (FDCs); it is an important molecule necessary for the cell-matrix adhesion and the pathogenesis of follicular lymphoma [26]. Elevated CRP is associated with poor prognosis in follicular lymphoma [27]. ARG1 expression defines immunosuppressive subsets of tumor-associated macrophages [28], and macrophages contribute to follicular lymphoma pathogenesis [29]. Interestingly, CD19 was the least important gene, which makes pathological sense as CD19 is a pan B-cell marker common to all the non-Hodgkin lymphoma (NHL) subtypes. The relationship of these markers with follicular lymphoma pathogenesis helps explain why the prediction of the neural network was accurate.
This research predicted several subtypes of non-Hodgkin lymphoma (NHL) using multilayer perceptron neural network analysis. The predictors were the whole set of 20,863 genes of the array and a cancer transcriptome panel of 1769. The prediction accuracy and network performances were suitable, as shown in the model summary, classification results, ROC curve, cumulative gains chart, lift chart, predicted by the observed chart, and residual by the predicted chart. The prediction accuracy was higher in the case of the cancer transcriptome panel, which was expected because it is a panel for the neoplasia research. Table 2 showed the top 20 genes with predictive values to differentiate NHL subtypes. Most of these genes were related to apoptosis, cell growth, metabolism, and immune response [24,25]. These pathways were also highlighted when using the cancer transcriptome panel (Table 3). A complete analysis of the relevance of each gene is beyond the scope of this manuscript because the main aim was to confirm whether the artificial neural network could identify the NHL subtypes. However, there is information regarding the most relevant genes. For example, LCE2B has been related to laryngeal squamous cell carcinoma [30]; and macrophages express ARG1 [28], which affect the follicular lymphoma pathogenesis [29], and other lymphomas such as Hodgkin lymphoma [31]. The multilayer perceptron analysis also predicted each subtype individually, with even higher accuracy, and the predictive genes were highlighted. For example, IL10 was predictive of mantle cell lymphoma; and mantle cell lymphoma proliferates upon IL-10 in the CD40 system [32]. In diffuse large B-cell lymphoma, AKT2 is an oncogenic molecule that can be targeted by drugs [33]. In Burkitt lymphoma, TNF-alpha induces CXCR2 receptors [34]. In relation to mantle cell lymphoma, CXCL13 is associated with the migration of malignant B-lymphocytes [35].
Finally, due to the potential importance of the highlighted genes for the pathogenesis of neoplasia in general, we used the top 30 genes of the multilayer perceptron analysis to predict the overall survival of several subtypes of neoplasia (Section 3.3). We confirmed that those genes were not only capable of differentiating the different subtypes of NHL but also managed to predict the overall survival of the most relevant human cancers. Therefore, these genes are potentially relevant in the pathogenesis of human neoplasia.

Conclusions
In conclusion, using multilayer perceptron artificial neural networks, it is possible to predict the subtypes of non-Hodgkin lymphoma; and we identified a gene set that could predict the overall survival of several subtypes of cancer.
Author Contributions: Conceptualization, methodology, software, analysis, resources, writing, and funding acquisition, J.C. Revision, and resources, R.H. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Boards and the Ethics Committees of the Institutions who generated the publicly available databases. Artificial Intelligence research at Tokai University, School of Medicine, is under protocol code IRB-156.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The gene expression data (GEO data sets) were obtained from the publicly available database of the NCBI resources webpage, located at https://www.ncbi.nlm.nih. gov/gds (accessed on 15 August 2021).