Next Article in Journal
Population Pharmacokinetic Model-Based Evaluation of Intact Oxaliplatin in Rats with Acute Kidney Injury
Next Article in Special Issue
The Impact of Resampling and Denoising Deep Learning Algorithms on Radiomics in Brain Metastases MRI
Previous Article in Journal
PARP Inhibitors and Myeloid Neoplasms: A Double-Edged Sword
Previous Article in Special Issue
Impact of Interobserver Variability in Manual Segmentation of Non-Small Cell Lung Cancer (NSCLC) Applying Low-Rank Radiomic Representation on Computed Tomography
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Artificial Neural Networks Predicted the Overall Survival and Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using a Pancancer Immune-Oncology Panel

1
Department of Pathology, School of Medicine, Tokai University, 143 Shimokasuya, Isehara 259-1193, Japan
2
Monoclonal Antibodies Unit, Spanish National Cancer Research Center (Centro Nacional de Investigaciones Oncologicas, CNIO), Melchor Fernandez Almagro 3, 28029 Madrid, Spain
3
Department of Pathology, MD Anderson Cancer Center Madrid, Calle de Arturo Soria 270, 28033 Madrid, Spain
4
Department of Hematology, School of Medicine, Tokai University, 143 Shimokasuya, Isehara 259-1193, Japan
5
Sharjah Institute for Medical Research, Department of Clinical Sciences, College of Medicine, University of Sharjah, Sharjah P.O. Box 27272, United Arab Emirates
6
Division of Surgery and Interventional Science, University College London, Gower Street, London WC1E 6BT, UK
*
Author to whom correspondence should be addressed.
Cancers 2021, 13(24), 6384; https://doi.org/10.3390/cancers13246384
Submission received: 3 November 2021 / Revised: 11 December 2021 / Accepted: 17 December 2021 / Published: 20 December 2021
(This article belongs to the Special Issue Medical Imaging and Machine Learning​)

Abstract

:

Simple Summary

This research predicted the overall survival of patients and cell-of-origin molecular subtypes of diffuse large B-cell lymphoma from Tokai University using gene expression data. A pancancer immune profiling panel was analyzed using artificial neural networks, and high accuracy of prediction was found. Additionally, the results were explained with other machine learning techniques and conventional bioinformatics analyses.

Abstract

Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent subtypes of non-Hodgkin lymphomas. We used artificial neural networks (multilayer perceptron and radial basis function), machine learning, and conventional bioinformatics to predict the overall survival and molecular subtypes of DLBCL. The series included 106 cases and 730 genes of a pancancer immune-oncology panel (nCounter) as predictors. The multilayer perceptron predicted the outcome with high accuracy, with an area under the curve (AUC) of 0.98, and ranked all the genes according to their importance. In a multivariate analysis, ARG1, TNFSF12, REL, and NRP1 correlated with favorable survival (hazard risks: 0.3–0.5), and IFNA8, CASP1, and CTSG, with poor survival (hazard risks = 1.0–2.1). Gene set enrichment analysis (GSEA) showed enrichment toward poor prognosis. These high-risk genes were also associated with the gene expression of M2-like tumor-associated macrophages (CD163), and MYD88 expression. The prognostic relevance of this set of 7 genes was also confirmed within the IPI and MYC translocation strata, the EBER-negative cases, the DLBCL not-otherwise specified (NOS) (High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements excluded), and an independent series of 414 cases of DLBCL in Europe and North America (GSE10846). The perceptron analysis also predicted molecular subtypes (based on the Lymph2Cx assay) with high accuracy (AUC = 1). STAT6, TREM2, and REL were associated with the germinal center B-cell (GCB) subtype, and CD37, GNLY, CD46, and IL17B were associated with the activated B-cell (ABC)/unspecified subtype. The GSEA had a sinusoidal-like plot with association to both molecular subtypes, and immunohistochemistry analysis confirmed the correlation of MAPK3 with the GCB subtype in another series of 96 cases (notably, MAPK3 also correlated with LMO2, but not with M2-like tumor-associated macrophage markers CD163, CSF1R, TNFAIP8, CASP8, PD-L1, PTX3, and IL-10). Finally, survival and molecular subtypes were successfully modeled using other machine learning techniques including logistic regression, discriminant analysis, SVM, CHAID, C5, C&R trees, KNN algorithm, and Bayesian network. In conclusion, prognoses and molecular subtypes were predicted with high accuracy using neural networks, and relevant genes were highlighted.

1. Introduction

Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent non-Hodgkin lymphomas (NHL) in developed Western and Asian countries, representing around 25% of NHL cases [1,2,3].
DLBCL is a heterogeneous entity because of its diverse histological and genetic features and clinical evolution. There are several subtypes of DLBCL, such as T cell/histiocyte-rich large B-cell lymphoma, primary DLBCL of the mediastinum, intravascular large B-cell lymphoma, primary DLBCL of the central nervous system, Epstein–Barr virus (EBV)-positive DLBCL, etc. Additionally, some cases overlap with Burkitt lymphoma and were previously referred as “Burkitt-like”. Currently, the term High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements is used [1,2].
With current rituximab-based therapy, DLBCL is curable in around 50% of cases [4]. Therefore, at diagnosis, it is important to identify and predict which patients will clinically evolve unfavorably. The prognosis of DLBCL can be assessed with several variables, such as the International Prognostic Index (IPI), which includes several clinical and biochemical variables (age, LDH, ECOG performance status, clinical stage, and extranodal sites); cell of origin molecular subtypes (gene expression profiling, Hans, Choi, and Tally algorithms, and the Lymph2Cx platform) [5,6,7,8,9]; MYC, BCL2, and BCL6 abnormalities; and the tumor immune microenvironment [10,11,12,13]. Based on gene expression, three types of DLBCL have been defined: germinal center B-cell-like (GCB), activated B-cell-like (ABC), and not-otherwise-specified type 3 (i.e., unclassified, unspecified).
It is recommended that all cases undergo assessment of the molecular subtype at diagnosis. The gold standard is gene expression profiling (GEP) using the “lymphochip” microarray, but this technique requires the use of frozen tissue, which is not always available. Currently, the molecular subtype can be assessed using formalin-fixed paraffin-embedded tissue (FFPET) samples using the nCounter NanoString platform [8]. This array uses the gene expression of 32 genes, including the known markers of Hans’ classifier MME (CD10), BCL6, and IRF4 (MUM-1), the LMO2 gene of the Tally algorithm, and other relevant pathogenic genes such as BCL2, BTK, CARD11, MYD88, and TP53. Interestingly, the genes GCET1 and FOXP1 of the Choi algorithm are excluded in this panel.
The immuno-oncology pathway is now important in the analysis of the pathogenesis of DLBCL because through it, actionable gene expression profiles in the context of cancer immunotherapy can be identified. The nCounter pancancer immune profiling panel performs multiplex gene expression analysis in humans with 770 genes (40 housekeeping and 730 immune oncology genes) from different immune cell types, common checkpoint inhibitors, CT antigens, and genes covering both adaptive and innate immune response [14].
Some of the most impressive recent advances in artificial intelligence (AI) have been in the field of deep learning [15]. Deep learning models have neared or even exceeded human-level performance [15]. Artificial neural networks (ANNs) are a set of algorithms that were designed based on the human brain to identify patterns [11,12,16]. ANNs interpret sensory data through a kind of machine perception, labeling or clustering of raw input data/information [11,12,16]. The patterns that ANNs recognize are numerical, contained in vectors into which all real-world data (be they images, sound, text, or time series) must be translated [11]. As an approach to machine learning, ANNs can handle complex patterns found in the most challenging real-word datasets. ANNs use nonlinear modeling to identify complex relationships between variables and to create predictive models. ANNs provide an alternative predictive capability to approaches such as regression and classification trees and are characterized by being flexible and the lack of distributional assumptions [17,18]. In predictive applications, the multilayer perceptron (MLP) and the radial basis function (RBF) networks are commonly used. Both networks are supervised, because the results can be compared against known values of the target variables [10,11,12,17,18]. Both MLP and RBF have a “feedforward architecture”, because the connections in the network flow from the input layer to the output layer without any feedback loops. The architecture comprises the following parts: (1) an input layer that contains the predictors; (2) a hidden layer with unobservable nodes, or units; and (3) the output layer that contains the responses. The value of each hidden unit is some function of the predictors. The choice between MLP and RBF is influenced by the type of data to be analyzed and the level of complexity to uncover. Generally, the MLP procedure can handle more complex relationships. Conversely, the RBF procedure, which is characterized by one hidden layer, is usually faster [10,11,12,13,17,18,19,20,21].
Explainable AI (XAI) is attracting much interest in medicine [22]. XAI deals with the implementation of transparency and traceability of statistical black-box machine learning methods, particularly deep learning [22]. In the machine-based decision-making process, it is crucial to reproduce and comprehend both the learning and knowledge-extraction processes [22,23]. This is important, because for decision support it is necessary to understand the causality of learned representations [22,23]. In this research, machine learning techniques and conventional statistics were performed additionally to the neural network analyses to make the results for explainable, because explainability of AI can help to enhance trust of medical professionals in future AI systems [22,23].
In previous publications, we used publicly available data for ANNs. In this research, we used ANNs to predict the overall survival outcomes and molecular subtypes of a series of 106 cases from Tokai University Hospital, using gene expression data from the pancancer immune profiling panel, and validated the relevant marker with immunohistochemistry at protein level. We found that ANNs predicted survival and molecular subtypes with high accuracy.

2. Materials and Methods

2.1. Patients, Samples and Gene Expression Data

The series included 106 patients from Tokai University Hospital. This research complied with the Declaration of Helsinki and ethical principles regarding human experimentation. The Tokai University Institutional Review Board approved this research (protocol code IRB14R-080 and IRB-156).
The cases were diagnosed following the criteria of the 2016 revision of the World Health Organization classification of lymphoid neoplasms [3] and corresponded to DLBCL morphology. The cases were selected from 2006 to 2016, being from the years 2008–2016 in 74% of the cases. This series of cases were from the rituximab-treatment era: they were mainly treated with R-CHOP (72.4%) or R-CHOP-like (22.4%) therapy. The main clinicopathological characteristics of the samples were as follows: male sex in 65/104 (62.5%); male/female ratio 65/39 (1.67); age range (23–97); age > 60 years in 70/104 (67.3%), and low International Prognostic Index (IPI) in 27/97 (27.8%), low–intermediate in 30/97 (30.9%), high–intermediate in 14/97 (14.4%), and high in 18/97 (18.6%). Based on the Lymph2Cx assay, the cell-of-origin subtypes were GCB in 51/104 (49%), ABC in 31/104 (29.8%), and unclassified in 22/104 (21.2%). Notably, in two cases, the assay result was nonassessable (total analyzed cases: 106). Epstein–Barr virus (EBV) positivity, assessed using EBV-encoded RNA (EBER) in situ hybridization (ISH), was negative in 79/98 (80.6%) and positive in 19/98 (19.4%) of the cases. The translocation status for BCL2, MYC, and BCL6 was available in 72% of the cases: BCL2 translocation positive (TL+) cases were 19/76 (25%), MYC TL+ were 19/76 (25%), and BCL6 TL+ were 19/74 (25.7%). Cases with MYC and BCL2 rearrangements, irrespective to BCL6 translocation status, were 8/76 (10.5%). High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements (HGBL) presented in 11 cases. The overall survival of this series according to the IPI and EBER is shown in Figure 1. As expected in a conventional series of DLBCL, high IPI and EBER-positive cases were associated with poor prognoses of patients.
Whole tissue sections from formalin-fixed paraffin-embedded tissue blocks, containing more than 70% tumoral cells, were outsourced to Celgene Corporation, where RNA extraction was performed and applied to the nCounter pancancer immune profiling panel (NanoString Technologies, Inc., Seattle, WA, USA). The molecular subtype was assessed using the Lymph2Cx gene expression panel (NanoString). This panel comprises 730 immune-oncology genes and 40 housekeeping genes. The list of housekeeping genes is shown in the Supplementary Materials.
In the Tokai series, immunohistochemistry using the Hans algorithm (CD10, BCL6, and MUM-1) was performed [7].

2.2. Artificial Neural Network Analysis

The multilayer perceptron analysis (MLP) used the normalized and log2 transformed gene expression data. We used the calibrated data, which had already been normalized to positive control, for the “housekeeping gene normalization” procedure. The calibrated data were the raw data multiplied by the calibration factors. The housekeeping gene normalization was calculated using the following formula: log2((normData(,i)/hkGeomMeans(i))). Notably, an alternative option was the following: log2((normData(,i)/hkGeomMeans(i))*scalingFactor). This scaling factor could be a constant, for instance, 1000.
The complete procedure for MLP analysis was performed as we have previously described [10,11,12,13,19,20,21]. In the MLP procedure, predictive model for one or more dependent (target) variables is created based on the values of the predictor variables. The basic structure of an MLP is shown in Figure 2 and Figure 3 [10,11,12,13,19,20,21].
The dependent variables can be nominal, ordinal, or scale (continuous). The predictors can be specified as factors (categorical) or covariates (scale). In this study, the dependent variables were the overall survival outcome (dead vs. alive) and the Lymph2Cx molecular subtype (GCB, ABC, and Unspecified). The dependent variables were nominal, because their values represented categories with no intrinsic ranking. The predictors, which were the gene expression values of the pancancer immune profiling panel, were specified as covariates. The rescaling of covariates, which improves network training, was standardized. The database was partitioned by randomly assigning the cases based on relative numbers of cases: 70% to the training set and 30% to the testing set. In this analysis, the holdout partition was set at 0%. During the procedure, categorical predictors and dependent variables were temporarily recoded using one-of-c coding. When a variable has c categories, it is stored as c vectors: the first category (1,0,…,0), the next (0,1,0,…,0), and the final (0,0,…,0,1).
A series of parameters was set for architecture design. In the input layer, the nodes included the expression values of each gene. In the selection of the hidden layer, the number of layers (one or two), activation function (hyperbolic tangent or sigmoid), and number of units were specified. The hyperbolic tangent function had the form γ(c) = tanh(c) = (ece−c)/(ec + e−c), and the sigmoid function, the form γ(c) = 1/(1 + e−c). The output layer contained the target (dependent) variables. The activation functions of the output layer were the identity (γ(c) = c), softmax (γ(c k) = exp(c k)/Σjexp(c j)), the hyperbolic tangent, and the sigmoid. Notably, the activation function chosen for the output layer determined which rescaling methods were available. The rescaling of the dependent variables was standardized ((x − mean)/s), normalized ((x − min)/(max − min)), adjusted normalized ((2*(x − min)/(max − min)) − 1), and none.
The type of training determined how the network processed the records. The training types were batch, online, or minibatch. The batch, useful for small datasets, updated the synaptic weights only after passing all training data records. The online, more suitable for large datasets, updated the synaptic weights after every single training data records. The minibatch, best for medium-sized datasets, divided the training data records into groups of approximately equal size and updated the synaptic weights after passing one group. The synaptic weights were estimated using the optimization algorithms, the scaled conjugate gradient (only for batch training), and the gradient descent (for online, minibatch, and batch). The training options were different according to the type and the optimization algorithm. In the case of batch training and scaled conjugate gradient, the initial lambda was set at 0.0000005, the initial sigma at 0.00005, the interval center at 0, and the interval offset at ±0.5.
The network performance, which displays results used to determine whether the model is “good”, was assessed by the classification results, receiver operating characteristic (ROC) curve, cumulative gains chart, lift chart, predicted by observed chart, and residual by predicted chart.
The classification results, presented in Table 1, showed the classification table for each categorical dependent variable by partition and overall; the number of correctly and incorrectly classified cases were given.
The ROC curve is a graphical plot that shows the diagnostic ability of a binary classifier system as its discrimination threshold is varied. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1–specificity). It is displayed for each categorical dependent variable. For each curve, the area under the curve (AUC) is also shown. The AUC is a measure of how well a parameter can distinguish between two diagnostic groups. A value of 0.5 means that the variable under study cannot distinguish between two groups. A perfect separation leads to an AUC of 1 [10,11,12,13,17,18,19,20,21,24].
For categorical dependent variables, the predicted-by-observed chart displays clustered boxplots of predicted pseudoprobabilities for the combined training and testing samples. The x axis corresponds to the observed response categories, and the legend to the predicted categories. Using 0.5 as the pseudoprobability cutoff for classification, the proportion of the boxplot above the 0.5 mark on the y axis represents correct predictions shown in the classification table. The proportion below the 0.5 mark represents incorrect predictions. When there are only two categories in the target variable, the first two boxplots are symmetrical about the horizontal line at 0.5 [10,11,12,13,19,20,21,25].
The cumulative gains chart shows the percentage of the overall number of cases in a given category “gained” by targeting a percentage of the total number of cases. The lift chart is derived from the cumulative gains chart; the values on the y axis correspond to the ratio of the cumulative gains for each curve to the baseline [10,11,12,13,19,20,21,25].
Using a sensitivity analysis, the independent variables were ranked according to their importance for predicting the dependent variable and in determining the neural network. The importance of an independent variable is a measure of how much the network’s model-predicted value changes for different values of the independent variable. Normalized importance is simply the importance values divided by the largest importance value and expressed as percentages [10,11,12,13,19,20,21,25].
The predicted value or category and the predicted pseudoprobability for each dependent variable were saved. The synaptic weights were exported to an xml file. The missing values were excluded from the analysis. As stopping rules, the maximum steps without a decrease in error were set at 1, the minimum relative change in training error was set at 0.0001, and the in-training error ratio was set at 0.001.
If it were necessary to exactly replicate the results, the same initialization value for random number generation, the same data order, the same variable order, and the same procedure settings should be used. Random number generation was used during the procedures of assignment of partitions, random subsampling for initialization of synaptic weights, random subsampling for automatic architecture selection, and the simulated annealing algorithm used in weight initialization and automatic architecture selection [10,11,12,13,19,20,21,25].
Radial basis function (RBF) analysis was also performed in a similar manner as MLP analysis. For the RBF analysis, the best number of units in the hidden layer was specified within a range or automatically computed, and the activation function was the normalized or the ordinary radial basis function. The overlap among hidden units was computed or specified. The user-missing values were excluded. The RBF algorithm is shown in Figure 4.

2.3. Statistical Analyses and Software

All statistical analyses were performed using several types of software, either for data processing, preanalysis, final analysis, or confirmation of results [10,11,12,13,17,18,19,20,21,24,25]:
Comparisons between groups were performed using crosstabulation (chi-square tests, including the Fisher’s exact test), and nonparametric tests for independent samples (Mann–Whitney, and Kruskal–Wallis H tests). Overall survival was calculated from the time of diagnosis to the time of death or the last alive follow-up time. Survival analysis was performed using the Kaplan–Meier and log rank tests, including the Breslow and Tarone–Ware tests. The hazard risks were calculated using Cox regression analysis. The association of the most relevant genes, which were highlighted in the neural network, with molecular subtypes was performed using binary logistic regression. Risk scores were calculated by multiplying the beta values of the multivariate Cox regression analysis for overall survival of each gene with the values of the corresponding gene expressions, as previously described [10,11,12,13,17,18,19,20,21,24,25]. These analyses were performed using mainly IBM SPSS. Survival analysis using R can be checked on the following web page: https://cran.r-project.org/web/views/Survival.html (accessed on 8 December 2021) [19]. Random forest is shown in http://genesrf.iib.uam.es/ and https://www.ligarto.org/rdiaz/software/software#varSelRF (based on R, accessed on 8 December 2021) [19]. All the analyses were performed on a Ryzen 7 3700X CPU workstation with 16 GB RAM and an NVIDIA GeForce GTX 1650 GPU.
The data analysis workflow is shown in Figure 5.

2.4. Immunohistochemistry

Immunohistochemistry for MAPK3 (ERK1) and MAPK1 (ERK2) was performed in a tissue microarray of 96 cases of DLBCL (Table A5). The clinicopathological characteristics of this series of 96 cases are shown in the Table A4. The microarray had been created from paraffin-embedded formalin-fixed tissue blocks and was stained using a rabbit monoclonal primary antibody against endogenous levels of phospho-p44/42 MAPK (Erk1/2) (Thr202/Tyr204) (#4370, Cell Signaling Technology K.K. Tokyo, Japan). The staining was performed using a Leica Bon-Max slide stainer (Leica Biosystems K.K. Tokyo, Japan) following the manufacturer’s instructions [26], Bond epitope retrieval solution 1 (pH 6.0, 30 min., #AR9961, Leica), and at a 1:400 dilution. After staining, the slides were scanned in a NanoZoomer S360 digital slide scanner (#C13220-01, Hamamatsu K.K. Hamamatsu, Japan) and visualized using the NDP.view2 viewing software (#U12388-01, Hamamatsu). The MAPK-positive cells had morphology compatible with macrophages or dendritic cells. The immunohistochemical signals were evaluated as an ordinal variable as 0 (no staining, <1%), 1 + (few scattered positive cells, 1–20%), or 2 + (abundant cells, >20%) (Figure 6). The correlation of MAPK as an ordinal variable with the molecular subtype according to the Hans classifier was performed using binary logistic regression. Additional markers for correlation with MAPK were LMO2 (mouse monoclonal, 299B, CNIO, Spain), CD163 (10D6, Leica), CSF1R (FER216, CNIO, Spain), and PD-L1 (E1J2J, Cell Signaling) [12,16].

3. Results

3.1. Prediction of the Overall Survival Outcome (Dead/Alive)

3.1.1. Analysis Using the 730 Genes of the Pancancer Panel

The 730 genes of the pancancer immune profiling panel were used to predict the overall survival outcome using a multilayer perceptron (MLP) analysis. Table 1 shows the detailed information of the artificial neural network, including case processing; characteristics of the input, hidden, and output layers; a model summary for training and testing; classification; and the area under the curve. The training set included 72 of 105 cases (67%) and the testing set included 33 of 105 cases (31%). The performance of the network was satisfactory, with only 15.3% incorrect predictions. The percentages of correct classifications in the training and validation sets were 84.75% and 81.8%, respectively. The area under the curve was 0.898 for both alive and death survival outcome. According to the normalized importance, the top 10 most relevant genes for this model were CD55, ARG1, SPANXB1, CTAG1B, IFNA8, CASP1, IL2, TNFSF12, ANP32B, and CTSG (Table 2). Among the following genes on the list, 11–20, other relevant genes in the pathogenesis of cancer were identified, such as REL and CD8A.

3.1.2. Analysis Using the Top 20 Genes of the MLP

To comprehend and trust the results of the output created by the neural network, a concept known as explainable artificial intelligence (XAI), the top 20 genes identified by the MLP were correlated with the overall survival of patients. The correlation used univariate (Table A1) and multivariate Cox regression analyses (Table 3) and gene-set enrichment analysis (GSEA). The GSEA showed enrichment toward the dead phenotype, confirming that some of the genes associated toward patients who died (Figure 7). In the multivariate regression analysis, seven genes were the most relevant: ARG1, IFNA8, CASP1, TNFSF12, CTSG, REL, and NRP1 (Table 3, Step 14 (last)). The overall survival plot for each gene is shown in Figure 8. Finally, using a risk-score formula with the gene expression of 20 genes or the 7 genes, two risk groups could be defined with different overall survival outcomes (Figure 3). The high-risk group was characterized by higher expression of CD163, which is a marker of M2-like tumor-associated macrophages (TAMs), and MYD88, which is a marker of NF-kappa-B activation, cytokine secretion, and inflammatory response. High-risk vs. low-risk group, 1.7 ± 3.5 vs. 0.4 ± 1.7 (p = 0.002) and 1.2 0.7 vs. 0.9 0.4 (p = 0.008). Table A3 shows the immune oncology annotations of the top 20 genes.
The predictive value for overall survival of these seven genes was evaluated in the different subtypes/entities of DLBCL using the same risk groups and cutoffs (Figure 9). The predictive value was kept in within the IPI L+LI and H+HI strata, within the EBER-negative cases (but not in the EBER-positive cases), within MYC translocation positive and negative cases, and within the non-High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements (i.e., DLBCL NOS) (but not in the 11 High-grade B-cell lymphomas).

3.1.3. Multivariate Analysis Using the Set of Seven Genes and Clinicopathological Variables

A multivariate Cox regression analysis for prediction of the overall survival using the variables of the final set of seven genes, IPI, and EBV was calculated. The results were as follows: set of seven genes, p < 0.001, hazard risk (HR) = 3.6 (95% CI = 1.8–7.1); IPI, p = 0.055, HR = 1.9 (0.9–3.6); and EBER, p = 0.054, HR = 0.054 (0.9–4.9).
When the molecular subtypes (GCB vs. ABC+Unspecified) were included in the equation, the results were as follows: set of seven genes, p < 0.001, HR = 2.3 (1.1–5.2); EBER, p = 0.036, HR = 2.3 (1.1–5.2); IPI, p = 0.134, HR = 1.7 (0.9–3.2); and molecular subtypes, p = 0.107, HR = 1.8 (0.9–3.4).
Finally, when the set of seven genes was included in the equation with the IPI, EBV, molecular subtypes, and High-grade B-cell lymphoma, only the set of seven genes (p < 0.001, HR = 5.4) and EBER (p = 0.006, HR = 5.3) retained prognostic relevance. Therefore, the final set of seven genes was an independent prognostic factor.
In this series, both the IPI and EBER had prognostic relevance (Figure 1). Within the variables that make up the IPI (age >60, Ann Arbor stage III–IV, ECOG performance status ≥2, serum LDH level >1 x normal, and > 1 extranodal site), a univariate Cox regression analysis revealed that for overall survival, only the stage (p = 0.039, HR = 2.0), LDH (p = 0.024, HR = 2.4), and >1 extranodal site (p = 0.007, HR = 3.1) had prognostic relevance. Since these variables were within the IPI, in the final Cox model, they were not included.
Notably, a multilayer perceptron analysis could be used to perform the multivariate analysis in a nonlinear manner (Figure 10). The input variables (predictors, 15 units) were the IPI, EBER, molecular subtypes, High-grade B-cell lymphoma, and the seven genes of the set. The output variable was the overall survival outcome (dead vs. alive, two units). The hidden layer had one layer with three units. The activation function was the hyperbolic tangent in the hidden layer and softmax in the output layer. The network performance was good, with an ROC area under the curve of 0.880 and 84.4% correct classification. The most important factors, according to their normalized importance (NI), for predicting the overall survival outcome (dead/alive) were as follows: ARG1 (100% NI), REL (63.5%), CTSG (54.2%), IFNA8 (52.7%), NRP1 (52.0%), CASP1 (47.3%), TNFSF12 (36.0%), molecular subtypes (34.9%), EBER (22.8%), high-grade B-cell lymphoma (8.7%), and IPI (6.7%).

3.1.4. Additional Machine Learning Analyses

In addition to artificial neural networks, other machine learning techniques were used. Table 4 shows the overall accuracy of the tests and the numbers of fields that were used in the final model. Logistic regression, discriminant analysis, and SVM predicted the overall survival outcome with overall accuracies of 100% using the 730 genes of the panel. Besides these, decision trees also predicted overall survival with high accuracy, and above 95% in the case of CHAID and C5 trees. The CHAID method had the best accuracy among the decision trees and used only 10 genes in the model, which were RUNX1, TBK1, ATF1, CSF2, CXCL14, SMAD2, POU2F2, ADORA2A, FCGR2B, and CXCR1 (Figure 11).
The modeling for overall survival using other machine learning techniques was repeated using only the top 20 genes identified from the multilayer perceptron analysis. The most accurate model was the Bayesian network, which had an overall accuracy of 93% (Figure 11), followed by the C&R tree (77%), C5 tree (70%), KNN algorithm (70%), and logistic regression (68%).

3.1.5. Validation in an Independent Series of DLBCL

Validation of the prognostic value of the set of genes identified in the Tokai series was performed using the GSE10848 series, which includes 414 cases. Using the risk-score formula [20] with the gene expression of the 20 genes or the 7 genes, two risk groups were defined, which had different overall survival outcomes (log rank p < 0.0001, HR = 3.6 and p < 0.0001, HR = 2.4, respectively) (Figure 11).

3.2. Prediction of the Three Molecular Subtypes (GCB, ABC, and Unspecified)

The multilayer perceptron analysis correlated the 730 pancancer immune profiling genes with the molecular subtypes of GCB, ABC, and Unspecified. The artificial neural network successfully predicted the molecular subtypes with high accuracy (Figure 12). The classification was correct in 98.7% of the cases in the training set and 81.5% of those in the testing set. The area under the curve was 0.99 for both GCB and ABC and 0.98 for the Unspecified group. Table 1 shows the details of this artificial neural network. According to their normalized importance for predicting the molecular subtype, the top most relevant genes were A2M, ABCB1, ABL1, ADA, ADORA2A, AICDA, AIRE, AKT3, ALCAM, and AMBP (Table 2).
Several machine learning techniques were applied, and logistic regression, discriminant analysis, and SVM predicted the molecular subtypes with overall accuracies of 100% using the 730 genes of the panel. Decision trees also managed to predict the molecular subtypes. The C5 and CHAID trees used 7 and 8 genes, respectively, with overall accuracies of 96% (Table 4).

3.3. Prediction of the Two Molecular Subtypes (GCB, ABC+Unspecified)

3.3.1. Analysis Using the 730 Genes of the Pancancer Panel

The multilayer perceptron analysis correlated the 730 pancancer immune profiling genes with the molecular subtypes as GCB versus ABC+Unspecified (Table 1, Figure 13 and Figure 14). In comparison to the other analyses, this artificial neural network had the best prediction accuracy, with lower percentages of incorrect predictions and cross-entropy errors. The percentages of correct classification were 100% for the training set and 96.4% for the testing set. The areas under the curve were 1.0 for both GCB and ABC+Unspecified. Table 2 shows the most relevant predictive genes identified by the artificial neural network. The top 10 genes were CD37, STAT6, ATF2, ROPN1, C4B, NOTCH1, CTAG1B, ICAM3, CEACAM1, and NOD2. Other relevant genes present within the 11-20 top genes were LAG3, TP53, MAPK3, and REL.
MAP3K was tested at the protein level using immunohistochemistry in another series; the clinicopathological characteristics of this series are shown in Table A5. The frequencies with immunohistochemistry for phospho-p44/42 MAPK (Erk1/2) (Thr202/Tyr204) were as follows: 0, 32/90 (35.6%); 1+, 30/90 (33.3%); and 2+, 28/90 (31.1%). The immunohistochemistry confirmed the relevance of this marker as highlighted in the neural network. High expression of MAP3K associated with a GCB phenotype (odds ratio of non-GCB = 0.543, 95% CI 0.3–0.96, p = 0.037). The expression of MAP3K was correlated with LMO2 and macrophage markers including CSF1R, CD163, TNFAIP8, CASP8, PTX3, and PD-L1. MAP3K correlated with LMO2 (a marker of the germinal center) (odds ratio = 2.8, 95% CI: 1.1–7.2, p = 0.039). Interestingly, though MAP3K showed histological expression similar to that of macrophages, no correlation was found with markers of M2-like tumor-associated macrophages (all p > 0.05).

3.3.2. Analysis Using the Top 20 Genes of the MLP

In Table 5, the associations between the top 20 genes and the molecular subtypes, as calculated using multivariate binary logistic regression, are shown. In the final model, the most relevant genes positively associated with the ABC+Unspecified subtype were CD37, GNLY, and IL17RB; STAT6 and REL were inversely correlated. Table A2 shows the univariate analysis results. Table A4 shows the immune oncology annotations. The GSEA analysis showed a sinusoidal-like shape, with some markers associated with ABC+Unspecified and others with GCB (Figure 5).

3.3.3. Additional Machine Learning Analyses

Other machine learning techniques also predicted the molecular subtype with high accuracy. Some included the 730 genes in the model, such as logistic regression, discriminant analysis, SVM, and KNN algorithm. However, the CHAID and C5 trees used six and five genes, respectively (Table 5, Figure 13).
The modeling for overall survival using other machine learning techniques was repeated using only the top 20 genes identified by the multilayer perceptron analysis. The most accurate model was the Bayesian network, which had an overall accuracy of 93% (Figure 14), followed by the C5 tree (88%), logistic regression (68%), and discriminant analysis (86%).

3.4. Artificial Neural Network Analysis Using the Radial Basis Function

All of the data were reanalyzed with a radial basis function (RBF) ANN as in the multilayer perceptron analysis. The neural network predicted both the overall survival outcome and the molecular subtypes. The network performance for the survival outcome was poor (AUC of 0.628). However, the performances for the molecular subtypes were acceptable (0.83 and 0.85). Since the performance of the multilayer perceptron was better, the results for the RBF are not shown in this manuscript.

4. Discussion

DLBCL is heterogeneous in terms of morphological features, genetic alterations, biological characteristics, and prognosis [1,2,3]. The preferred treatment is chemoimmunotherapy with R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone). Gene expression has been extensively studied in DLBCL using microarray technology [27]. The cell of origin classification, which is based on unsupervised clustering, has dominated the field. Despite the recent advances in diagnosis and treatment, there is a need to find prognostic markers.
This research analyzed the microarray data using a novel approach based on artificial neural networks and included conventional strategies to make the results more explainable. Originally, the molecular subtypes were defined using frozen tissue. Using the Lymph2Cx assay [28], the classification can now be conducted using formalin-fixed paraffin-embedded tissue biopsies [9,29]. The Lymph2Cx panel included 37 genes, 32 “endogenous” and 5 “controls”. Among the “endogenous”, markers of the conventional Hans classifier are present, such as MME (CD10), BCL6, and IRF4 (MUM-1). Other interesting markers for the pathology of DLBCL are BTK, MYC, CARD11, LMO2, TP53, and MYD88. This research classified the cases based on the Lymph2Cx assay: the frequency of the GCB subtype was 51/104 (49%), ABC 31/104 (29.8%), and Unspecified 22/104 (21.2%).
This research used a pancancer immune profiling panel, which was fully compatible with clinically relevant formalin-fixed, paraffin-embedded (FFPE) tumor sections. This panel had an immune cell coverage that included B-cells, T-cells, CD4-positive Th1 cells, regulatory T-lymphocytes (Tregs), CD8-positive cytotoxic T-lymphocytes, exhausted CD8-positve T-lymphocytes, cytotoxic cells, dendritic cells, macrophages, mast cells, neutrophils, and NK cells. B-cells are primary mediators of the humoral immune response. T-cells mediate cell-based immunity using cytokines and directly kill target cells. CD4-positive Th1 cells release IL2 and interferon gamma and stimulate CD8-positive cytotoxic T-lymphocytes, NK cells, and macrophages. Tregs play an important role in suppressing immune responses, affecting both B- and T-cells [14]. Using the genes of this panel, we predicted the overall survival outcome and molecular subtypes with high accuracy, and the top 20 genes influencing each prediction were highlighted. The annotation of these genes regarding the immune profiling panel is shown in the Appendix A. For example, the top 20 genes that predicted overall survival belonged to the immune response, CT antigen and cell type specific (Th and mast cells). Within the immune response, the most relevant categories were cell functions (IL2, ANP32B/ARPRIL, and NRP1), chemokines (TNFSF12, CCL15, and XCL2), and regulation (IL2, CTSG, and TIRAP). Regarding the molecular subtypes, the annotations were immune response, cell type Th and cytotoxic cells, and CT antigens. The most relevant immune response categories were regulation (STAT6, NOTCH1, ICAM3, LAG3, and REL) and T- and B-cell functions (STAT6, LAG3, and TP53). Therefore, we showed that the immune response is important for survival and molecular subtype classification in the pathogenesis of DLBCL.
Artificial neural networks are the chosen tool for many predictive data mining applications because they are easy to use, flexible, and powerful [17,18]. Predictive neural networks are especially useful when the underlying processes are complex, such as the pathological background of DLBCL. This research used two types of neural networks, the multilayer perceptron (MLP) and radial basis function (RBF). The type of data and the level of complexity define the procedure to use. While the MLP procedure can find more complex relationships, RBF is faster [17,18]. This research used both types, but we found that the MLP made more accurate predictions. Therefore, the analysis was based mainly on the MLP results.
Artificial neural networks are used in predictive applications and are supervised in the sense that the model-predicted results can be compared against known values of the target variables [17,18]. An advantage of neural networks is that they make minimal demands on the model structure and assumptions, unlike traditional statistical methods. The traditional linear regression model, when using the least-square method and storing the regression coefficients, is a special case of certain neural network. However, it has a rigid model structure and a set of assumptions that are imposed before learning from the data. However, neural networks are flexible. The tradeoff is that the synaptic weights are not easily interpretable [17,18]. For example, the synaptic weights for the most relevant gene for predicting the overall survival outcome, CD55, were as follows: -0.441 for H(1:1), -0.204 for H(1:2), 0.168 for H(1:3), 0.199 for H(1:4), -0.458 for H(1:5), and -0.733 for H(1:6). The synaptic weight informs about the amplitude or the strength of the connection between two nodes (neurons). Since ANNs are black-box models because of their multilayer nonlinear structure, the explanation of the underlying process that produces the relationship between the dependent (target) and independent (predictors) variables is unintelligible, nontransparent, and untraceable by humans [30]. The overall survival outcome and the molecular subtypes of patients with diffuse large B-cell lymphoma (DLBCL) were predicted with high accuracy, and the most relevant genes were highlighted using nonlinear analysis. To make the results more understandable, i.e., explainable artificial intelligence (XAI), several machine learning methods were applied. A thorough evaluation of the relationships between the predictors and the predicted variables in these methods explained the underlying process of the neural network. For example, the MLP highlighted 20 genes with high capability to predict the overall survival outcome, and using conventional analyses such as GSEA, we confirmed the association with bad prognosis. Multivariate Cox regression analysis reduced the list to seven genes, with ARG1, TNFSF12, REL, and NRP1 associated with good (HR < 1) and IFNA8, CASP1, and CTSG with bad prognosis (HR > 1). As individual markers, these genes also predicted the prognosis, as shown in the Kaplan–Meier plots. Additionally, the risk-score formula integrated all genes, and two groups with different risk could be found among the results. Macrophages release interferon alpha-8 (IFNA8) [26], and we found that the high-risk group was associated with high expression of CD163, which is a marker of M2-like tumor-associated macrophages (TAMs). Caspase-1 (CASP1) is involved in various inflammatory processes and initiates programmed cell death [31], and we found that the high-risk group also associated with high MYD88 expression. Cathepsin G (CTSG) belongs to the complement pathway [32]. It has been related to oral squamous cell carcinoma [33], is broadly expressed in acute myeloid leukemia, and is an effective immunotherapeutic target [34]. Finally, in this research the pancancer immune profiling panel predicted with high accuracy molecular subtypes, and the most relevant markers for the ABC/non-GCB phenotype were CD37, GNLY, and CD46. Membrane cofactor protein (CD46) acts as a costimulatory factor for T-cells, which induces the differentiation of Tregs [32]. Therefore, the data showed that the immune microenvironment plays an important role in GCB and non-GCB differentiation, as shown in the germinal center dynamics under physiological conditions.
We recently described that high expression of PTX3 was associated with poor prognosis in DLBCL [25]. Though the immunohistochemistry of MAPK showed a macrophage-like pattern, no correlation was found between MAPK and PTX3. Similar results were found for the TNFAIP8 marker [12]. We also previously described the gene expression of High-grade B-cell lymphoma [35]. In this research, we identified seven genes that predicted the overall survival of patients of non-High-grade B-cell lymphoma cases. Therefore, our data suggested that AID is a poor prognostic marker of High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements, as it has a different pathological background.
Applying artificial intelligence for the analysis of gene expression not only is useful in the analysis of individual entities, but allows differentiating between different lymphoma subtypes, as we showed in non-Hodgkin lymphomas [19]. Deep neural networks are characterized by having a multilayer nonlinear structure (i.e., black-box model). Therefore, neural networks are criticized as being nontransparent because their predictions are not traceable by humans. In this research, we combined artificial neural networks and machine learning to make the results more understandable (explainable [22]). In the future, explainable artificial intelligence (XAI) may enable human users to understand, and hence trust, artificial intelligence methods and results of high prediction accuracy.

5. Conclusions

In conclusion, artificial intelligence analyses provide highly effective results. However, these artificial neural network-based models are black-box models because the relational link between input and output is unobservable. We successfully combined artificial neural networks, machine learning, and conventional biomedinformatics to predict the overall survival outcome and molecular subtypes of DLBCL. This approach identified molecular targets that indicated poor and favorable survival in DLBCL in addition to showing that MAPK3 correlated with the GCB subtype.

Supplementary Materials

The list of housekeeping genes and logistic regression are available online at https://www.mdpi.com/article/10.3390/cancers13246384/s1.

Author Contributions

Conceptualization, methodology, software, analysis, and writing, J.C.; validation, R.H.; analysis and investigation, S.H., Y.Y.K., M.M., S.T., H.I., A.I., Y.K., G.R., J.F.G. and K.A.; supervision and project administration, N.N. All authors have read and agreed to the published version of the manuscript.

Funding

Joaquim Carreras was funded by The Ministry of Education, Culture, Sports, Science, and Technology (MEXT) and the Japan Society for the Promotion of Science, grants KAKEN 15K19061 and 18K15100, and the Tokai University School of Medicine, Research Incentive Assistant Plan 2021-B04. Rifat Hamoudi was funded by the AL-JALILA Foundation (grant number AJF2018090) and the University of Sharjah (grant number 1901090258).

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of Tokai University, School of Medicine (protocol code IRB14R-080 and IRB-156).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The source codes and data from Tokai University presented in this study are available on reasonable request to the corresponding author (J.C.). The raw gene expression data are not publicly available because of a data protection policy for patient data.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Correlations between the top 20 genes of the multilayer perceptron and the overall survival of the patients (univariate analysis).
Table A1. Correlations between the top 20 genes of the multilayer perceptron and the overall survival of the patients (univariate analysis).
OrderGeneBetaSEWalddfpHazard Risk95.0% CI for HR
LowerUpper
1CD55−0.12 0.22 0.31 10.5770.9 0.6 1.4
2ARG10.12 0.09 1.56 10.2111.1 0.9 1.3
3SPANXB10.14 0.13 1.15 10.2831.2 0.9 1.5
4CTAG1B0.06 0.12 0.22 10.6411.1 0.8 1.3
5IFNA80.17 0.14 1.49 10.2221.2 0.9 1.5
6CASP10.45 0.20 5.24 10.0221.6 1.1 2.3
7IL20.04 0.15 0.09 10.7671.0 0.8 1.4
8TNFSF12−0.05 0.16 0.11 10.740.9 0.7 1.3
9ANP32B0.06 0.31 0.04 10.8511.1 0.6 1.9
10CTSG0.10 0.06 2.75 10.0971.1 1.0 1.3
11THY1−0.08 0.14 0.30 10.5860.9 0.7 1.2
12REL−0.40 0.20 3.83 10.050.7 0.5 1.0
13TLR20.24 0.16 2.20 10.1381.3 0.9 1.7
14C8A0.06 0.19 0.10 10.7521.1 0.7 1.6
15IL220.06 0.13 0.17 10.6771.1 0.8 1.4
16TIRAP−0.01 0.28 0.00 10.9671.0 0.6 1.7
17CCL15−0.03 0.22 0.02 10.9011.0 0.6 1.5
18NRP1−0.08 0.18 0.19 10.6630.9 0.7 1.3
19XCL20.12 0.12 0.97 10.3241.1 0.9 1.4
20MFGE8−0.11 0.13 0.68 10.410.9 0.7 1.2
Univariate Cox regression analysis for overall survival. Each gene was analyzed individually as a quantitative variable.
Table A2. Correlations between the top 20 genes of the multilayer perceptron and the molecular subtypes GCB vs. ABC+Unspecified (univariate analysis).
Table A2. Correlations between the top 20 genes of the multilayer perceptron and the molecular subtypes GCB vs. ABC+Unspecified (univariate analysis).
OrderGeneBetaSEWalddfpHazard Risk95.0% CI for HR
LowerUpper
1CD370.16 0.19 0.67 10.413 1.2 0.8 1.7
2STAT6−0.84 0.48 3.07 10.080 0.4 0.2 1.1
3ATF2−0.64 0.65 0.96 10.327 0.5 0.1 1.9
4ROPN10.47 0.33 2.09 10.149 1.6 0.8 3.0
5C4B0.22 0.14 2.65 10.103 1.2 1.0 1.6
6NOTCH1−0.18 0.32 0.32 10.575 0.8 0.4 1.6
7CTAG1B0.28 0.24 1.32 10.250 1.3 0.8 2.1
8ICAM30.44 0.28 2.55 10.111 1.6 0.9 2.7
9CEACAM10.38 0.18 4.36 10.037 1.5 1.0 2.1
10NOD20.27 0.18 2.26 10.133 1.3 0.9 1.9
11LAG30.35 0.14 6.22 10.013 1.4 1.1 1.9
12GNLY0.62 0.17 13.10 10.000 1.9 1.3 2.6
13TP53−0.97 0.39 6.02 10.014 0.4 0.2 0.8
14CD460.39 0.34 1.30 10.254 1.5 0.8 2.9
15MAPK3−0.49 0.49 0.99 10.319 0.6 0.2 1.6
16TREM2−0.19 0.22 0.76 10.383 0.8 0.5 1.3
17REL−1.16 0.36 10.61 10.001 0.3 0.2 0.6
18NUP107−0.59 0.58 1.04 10.309 0.6 0.2 1.7
19IL17RB0.17 0.10 2.94 10.087 1.2 1.0 1.5
20SPACA30.14 0.22 0.39 10.530 1.1 0.7 1.8
Univariate binary logistic regression analysis for molecular subtype. Each gene was analyzed individually as a quantitative variable (GCB as reference).
Table A3. Annotations of the overall survival top genes.
Table A3. Annotations of the overall survival top genes.
OrderGeneGene ClassImmune Response CategoryAnnotation
1CD55Immune ResponseN/ACD molecules, innate immune response
2ARG1Immune ResponseN/AResponse to drug
3SPANXB1CT AntigenN/AN/A
4CTAG1BCT AntigenN/AN/A
5IFNA8Immune ResponseInterleukinsInnate immune response, interleukins
6CASP1Immune ResponseN/AInnate immune response
7IL2Immune ResponseCytokines, T-Cell Functions, RegulationAdaptive immune response, anti-inflammatory cytokines, B-cell activation, cytokines and receptors, innate immune response, interleukins, Th1 and Th2 differentiation, T-cell differentiation, T-cell polarization, T-cell regulators
8TNFSF12 (TWEAK)Immune ResponseChemokines, TNF SuperfamilyChemokines and receptors, TNF superfamily members and their receptors
9ANP32B (APRIL)Immune Response—Cell Type Specific (Th)Cell FunctionsBasic cell functions, cell type specific
10CTSGImmune Response—Cell Type Specific (Mast cell)Regulation, Pathogen DefenseCell type specific, defense response to fungus, positive regulation of immune response
11THY1Immune ResponseN/ACD molecules
12RELImmune ResponseRegulationTranscription factors
13TLR2Immune ResponseTLRCD molecules, innate immune response, toll-like receptor
14C8AImmune ResponseComplementComplement pathway, innate immune response
15IL22Immune ResponseCytokinesAcute-phase response, anti-inflammatory cytokines, interleukins
16TIRAPImmune ResponseN/AInnate immune response
17CCL15Immune ResponseChemokinesAdaptive immune response, chemokines and receptors, inflammatory response
18NRP1Immune ResponseCell FunctionsBasic cell functions, CD molecules
19XCL2Immune ResponseChemokinesChemokines and receptors
20MFGE8Immune ResponseTransporter FunctionsReceptors involved in phagocytosis
N/A, not applicable; Th, T helper; TLR, toll-like receptor; CD, cluster differentiation.
Table A4. Annotations of the molecular subtypes top genes.
Table A4. Annotations of the molecular subtypes top genes.
OrderGeneGene ClassImmune Response CategoryAnnotation
1CD37Immune ResponseN/AAdaptive immune response, CD molecules
2STAT6Immune Response—Cell Type Specific (Th2 cell)Chemokines, Regulation, T-Cell FunctionsAdaptive immune response, cytokines and receptors, cell type specific, Th2 orientation, transcription factors, transcriptional regulators
3ATF2Immune ResponseN/AInnate immune response
4ROPN1CT AntigenN/AN/A
5C4BImmune ResponseComplementComplement pathway, innate immune response
6NOTCH1Immune ResponseRegulationTranscriptional regulators
7CTAG1BCT AntigenN/AN/A
8ICAM3Immune ResponseAdhesion, RegulationAdhesion, CD molecules, regulation of immune response
9CEACAM1Immune ResponseAdhesionAdhesion, CD molecules
10NOD2Immune ResponseCytokinesInnate immune response, cytokines and receptors
11LAG3Immune Response—CheckpointRegulation, T-Cell FunctionsAdaptive immune response, CD molecules, negative regulation of immune response, T-cell activation
12GNLYImmune Response—Cell Type Specific (Cytotoxic cells)Cell Functions, CytotoxicityAdaptive immune response, basic cell functions, cell type specific, cytotoxicity
13TP53Immune ResponseT-Cell FunctionsT-cell proliferation
14CD46Immune ResponseN/ACD molecules, innate immune response
15MAPK3Immune ResponseN/AInnate immune response
16TREM2Immune ResponseN/AHumoral immune response
17RELImmune ResponseRegulationTranscription factors
18NUP107Immune Response—Cell Type Specific (Th cell)Cell CycleCell type specific, M phase of mitotic cell cycle
19IL17RBImmune ResponseChemokinesChemokines and receptors
20SPACA3CT AntigenN/AN/A
N/A, not applicable; Th, T helper; TLR, toll-like receptor; CD, cluster differentiation.
Table A5. Characteristics of the series of DLBCL used in the immunohistochemical analysis.
Table A5. Characteristics of the series of DLBCL used in the immunohistochemical analysis.
VariableFrequency (%)Univariate Cox Regression for Overall Survival
pHazard R.LowerUpper
Male54/97 (55.7)0.94110.51.9
Age > 6067/97 (69.1)0.00441.610.3
Ann Arbor stage III–IV42/89 (47.2)0.061.90.93.7
ECOG performance status ≥213/78 (16.7)0.00024.31.99.4
Serum LDH high (>219)58/96 (60.4)0.0043.11.46.8
Extranodal sites >118/73 (24.7)0.0033.11.56.4
IPI
Low31/81 (38.3)Reference---
Low–intermediate25/81 (30.9)0.0083.71.49.8
High–intermediate14/81 (17.3)0.0333.31.19.9
High11/81 (13.6)0.0045.31.716.5
sIL2R high (>530)70/91 (76.9)0.0174.21.313.7
B symptoms19/80 (23.8)0.3951.40.73
Location
Nodal (+spleen)53/97 (54.6)Reference---
Waldeyer’s ring9/97 (9.3)0.1670.201.8
Gastrointestinal10/97 (10.3)0.7480.80.22.8
Other extranodal25/97 (25.8)0.2161.50.82.9
Treatment
RCHOP65/91 (71.4)Reference---
RCHOP-like20/91 (22.0)0.1361.70.83.6
Others6/91 (6.6)0.1332.50.88.5
Response to treatment
CR64/86 (74.4)Reference---
PD11/86 (12.8)6.5 × 10−1126.39.870.2
PR11/86 (12.8)1.7 × 10−812.75.330.9
Epstein–Barr virus (EBER+)12/95 (15.8)0.00431.46.4
Hans classifier
GCB31/95 (32.6)Reference---
Non-GCB64/95 (67.4)0.0132.81.36.4
Immune phenotype
CD3+0/97 (0)N/A---
CD5+14/96 (14.6)0.7360.90.42.1
CD20+93/97 (95.9)0.4170.60.12.3
CD10+29/96 (30.2)0.0110.30.10.8
MUM1+ (IRF4)76/96 (79.2)0.1931.70.83.9
BCL2+76/96 (79.2)0.0542.80.97.8
BCL6+64/96 (66.7)0.8210.90.51.8
RGS1 high (>3%)51/95 (53.7)0.0132.51.25.2
Molecular analysis
MYD88 L265P mutation3/39 (7.7)0.5420.50.14
BCL2 translocation2/42 (4.8)0.9930.90.17.4
MYC translocation7/46 (15.2)0.8140.90.32.9
BCL2/MYC double hit1/42 (2.4)0.3212.80.421.6

References

  1. Brown, J.R.; Freedman, A.S.; Aster, J.C. Pathobiology of Diffuse Large B Cell Lymphoma and Primary Mediastinal Large B Cell Lymphoma; UpToDate: Waltham, MA, USA, 2021. [Google Scholar]
  2. Aster, J.C.; Freedman, A.S. Prognosis of Diffuse Large B Cell Lymphoma; UpToDate: Waltham, MA, USA, 2021. [Google Scholar]
  3. Swerdlow, S.H.; Campo, E.; Pileri, S.A.; Harris, N.L.; Stein, H.; Siebert, R.; Advani, R.; Ghielmini, M.; Salles, G.A.; Zelenetz, A.D.; et al. The 2016 revision of the World Health Organization classification of lymphoid neoplasms. Blood 2016, 127, 2375–2390. [Google Scholar] [CrossRef] [Green Version]
  4. Smith, A.; Crouch, S.; Howell, D.; Burton, C.; Patmore, R.; Roman, E. Impact of age and socioeconomic status on treatment and survival from aggressive lymphoma: A UK population-based study of diffuse large B-cell lymphoma. Cancer Epidemiol. 2015, 39, 1103–1112. [Google Scholar] [CrossRef] [Green Version]
  5. Cho, I.; Yoon, N.; Hyeon, J.; Sim, J.; Yoo, H.Y.; Kim, S.J.; Kim, W.S.; Ko, Y.H. Comparison of the Lymph2Cx Assay and Hans Algorithm in Determining the Cell-of-Origin of Diffuse Large B-Cell Lymphomas, Not Otherwise Specified. Appl. Immunohistochem. Mol. Morphol. 2020, 28, 731–740. [Google Scholar] [CrossRef]
  6. Choi, W.W.L.; Weisenburger, D.D.; Greiner, T.C.; Piris, M.A.; Banham, A.H.; Delabie, J.; Braziel, R.M.; Geng, H.; Iqbal, J.; Lenz, G.; et al. A New Immunostain Algorithm Classifies Diffuse Large B-Cell Lymphoma into Molecular Subtypes with High Accuracy. Clin. Cancer Res. 2009, 15, 5494–5502. [Google Scholar] [CrossRef] [Green Version]
  7. Hans, C.P.; Weisenburger, D.D.; Greiner, T.C.; Gascoyne, R.D.; Delabie, J.; Ott, G.; Müller-Hermelink, H.K.; Campo, E.; Braziel, R.M.; Jaffe, E.S.; et al. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood 2004, 103, 275–282. [Google Scholar] [CrossRef] [PubMed]
  8. Rosenwald, A.; Wright, G.; Chan, W.C.; Connors, J.M.; Campo, E.; Fisher, R.I.; Gascoyne, R.D.; Muller-Hermelink, H.K.; Smeland, E.B.; Giltnane, J.M.; et al. The Use of Molecular Profiling to Predict Survival after Chemotherapy for Diffuse Large-B-Cell Lymphoma. N. Engl. J. Med. 2002, 346, 1937–1947. [Google Scholar] [CrossRef] [PubMed]
  9. Scott, D.W.; Mottok, A.; Ennishi, D.; Wright, G.W.; Farinha, P.; Ben-Neriah, S.; Kridel, R.; Barry, G.S.; Hother, C.; Abrisqueta, P.; et al. Prognostic Significance of Diffuse Large B-Cell Lymphoma Cell of Origin Determined by Digital Gene Expression in Formalin-Fixed Paraffin-Embedded Tissue Biopsies. J. Clin. Oncol. 2015, 33, 2848–2856. [Google Scholar] [CrossRef] [PubMed]
  10. Carreras, J.; Kikuti, Y.Y.; Miyaoka, M.; Hiraiwa, S.; Tomita, S.; Ikoma, H.; Kondo, Y.; Ito, A.; Nakamura, N.; Hamoudi, R. Artificial Intelligence Analysis of the Gene Expression of Follicular Lymphoma Predicted the Overall Survival and Correlated with the Immune Microenvironment Response Signatures. Mach. Learn. Knowl. Extr. 2020, 2, 35. [Google Scholar] [CrossRef]
  11. Carreras, J.; Kikuti, Y.; Miyaoka, M.; Hiraiwa, S.; Tomita, S.; Ikoma, H.; Kondo, Y.; Ito, A.; Nakamura, N.; Hamoudi, R. A Combination of Multilayer Perceptron, Radial Basis Function Artificial Neural Networks and Machine Learning Image Segmentation for the Dimension Reduction and the Prognosis Assessment of Diffuse Large B-Cell Lymphoma. AI 2021, 2, 8. [Google Scholar] [CrossRef]
  12. Carreras, J.; Kikuti, Y.Y.; Miyaoka, M.; Hiraiwa, S.; Tomita, S.; Ikoma, H.; Kondo, Y.; Ito, A.; Shiraiwa, S.; Hamoudi, R.; et al. A Single Gene Expression Set Derived from Artificial Intelligence Predicted the Prognosis of Several Lymphoma Subtypes; and High Immunohistochemical Expression of TNFAIP8 Associated with Poor Prognosis in Diffuse Large B-Cell Lymphoma. AI 2020, 1, 23. [Google Scholar] [CrossRef]
  13. Carreras, J.; Kikuti, Y.; Miyaoka, M.; Roncador, G.; Garcia, J.; Hiraiwa, S.; Tomita, S.; Ikoma, H.; Kondo, Y.; Ito, A.; et al. Integrative Statistics, Machine Learning and Artificial Intelligence Neural Network Analysis Correlated CSF1R with the Prognosis of Diffuse Large B-Cell Lymphoma. Hemato 2021, 2, 11. [Google Scholar] [CrossRef]
  14. NanoString Technologies. 530 Fairview Ave N, S., WA 98109, USA. Immune Cell Type Gene Coverage. Available online: https://www.nanostring.com/products/ncounter-assays-panels/oncology/pancancer-immune-profiling/ (accessed on 7 September 2021).
  15. Intro to Machine Learning. Available online: https://www.kaggle.com/learn/intro-to-machine-learning (accessed on 7 September 2021).
  16. Global, I. What Is Artificial Neural Network (ANN). Available online: https://www.igi-global.com/dictionary/artificial-neural-network-ann/1518 (accessed on 7 September 2021).
  17. IBM Corporation. IBM SPSS Neural Networks. New Tools for Building Predictive Models; IBM: Armonk, NY, USA, 2011. [Google Scholar]
  18. IBM Corporation. Neural Networks; IBM: Armonk, NY, USA, 2021. [Google Scholar]
  19. Carreras, J.; Hamoudi, R. Artificial Neural Network Analysis of Gene Expression Data Predicted Non-Hodgkin Lymphoma Subtypes with High Accuracy. Mach. Learn. Knowl. Extr. 2021, 3, 36. [Google Scholar] [CrossRef]
  20. Carreras, J.; Hamoudi, R.; Nakamura, N. Artificial Intelligence Analysis of Gene Expression Data Predicted the Prognosis of Patients with Diffuse Large B-Cell Lymphoma. Tokai J. Exp. Clin. Med. 2020, 45, 37–48. [Google Scholar] [PubMed]
  21. Carreras, J.; Kikuti, Y.; Roncador, G.; Miyaoka, M.; Hiraiwa, S.; Tomita, S.; Ikoma, H.; Kondo, Y.; Ito, A.; Shiraiwa, S.; et al. High Expression of Caspase-8 Associated with Improved Survival in Diffuse Large B-Cell Lymphoma: Machine Learning and Artificial Neural Networks Analyses. BioMedInformatics 2021, 1, 3. [Google Scholar] [CrossRef]
  22. Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1312. [Google Scholar] [CrossRef] [Green Version]
  23. Gershman, S.J.; Horvitz, E.J.; Tenenbaum, J.B. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 2015, 349, 273–278. [Google Scholar] [CrossRef] [PubMed]
  24. IBM Corporation. IBM SPSS Statistics Algorithms; IBM: Armonk, NY, USA, 2020; p. 1265. [Google Scholar]
  25. Carreras, J.; Kikuti, Y.Y.; Hiraiwa, S.; Miyaoka, M.; Tomita, S.; Ikoma, H.; Ito, A.; Kondo, Y.; Itoh, J.; Roncador, G.; et al. High PTX3 expression is associated with a poor prognosis in diffuse large B-cell lymphoma. Cancer Sci. 2021. [Google Scholar] [CrossRef]
  26. Fujisawa, M.; Matushima, M.; Carreras, J.; Hirabayashi, K.; Kikuti, Y.Y.; Ueda, T.; Kaneko, M.; Fujimoto, R.; Sano, M.; Teramura, E.; et al. Whole-genome copy number and immunohistochemical analyses on surgically resected intracholecystic papillary neoplasms. Pathol. Int. 2021. [Google Scholar] [CrossRef] [PubMed]
  27. Susanibar-Adaniya, S.; Barta, S.K. 2021 Update on Diffuse large B cell lymphoma: A review of current data and potential applications on risk stratification and management. Am. J. Hematol. 2021, 96, 617–629. [Google Scholar] [CrossRef] [PubMed]
  28. Baptista, M.J.; Tapia, G.; Morgades, M.; Muncunill, J.; Muñoz-Marmol, A.-M.; Montoto, S.; Gribben, J.G.; Calaminici, M.; Martinez, A.; Gonzalez-Farre, B.; et al. Using the Lymph2Cx assay for assessing cell-of-origin subtypes of HIV-related diffuse large B-cell lymphoma. Leuk. Lymphoma 2018, 60, 1087–1091. [Google Scholar] [CrossRef] [PubMed]
  29. Schmitz, R.; Wright, G.W.; Huang, D.W.; Johnson, C.A.; Phelan, J.D.; Wang, J.Q.; Roulland, S.; Kasbekar, M.; Young, R.M.; Shaffer, A.L.; et al. Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N. Engl. J. Med. 2018, 378, 1396–1407. [Google Scholar] [CrossRef]
  30. Buhrmester, V.; Munch, D.; Arens, M. Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey. Mach. Learn. Knowl. Extr. 2021, 3, 48. [Google Scholar] [CrossRef]
  31. Shi, J.; Zhao, Y.; Wang, K.; Shi, X.; Wang, Y.; Huang, H.; Zhuang, Y.; Cai, T.; Wang, F.; Shao, F. Cleavage of GSDMD by inflammatory caspases determines pyroptotic cell death. Nature 2015, 526, 660–665. [Google Scholar] [CrossRef] [PubMed]
  32. The UniProt Consortium. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2021, 49, D480–D489. [Google Scholar] [CrossRef] [PubMed]
  33. Huang, G.-Z.; Wu, Q.-Q.; Zheng, Z.-N.; Shao, T.-R.; Li, F.; Lu, X.-Y.; Ye, H.-Y.; Chen, G.-X.; Song, Y.-X.; Zeng, W.-S.; et al. Bioinformatics Analyses Indicate That Cathepsin G (CTSG) is a Potential Immune-Related Biomarker in Oral Squamous Cell Carcinoma (OSCC). OncoTargets Ther. 2021, 14, 1275–1289. [Google Scholar] [CrossRef] [PubMed]
  34. Alatrash, G.; Garber, H.R.; Zhang, M.; Sukhumalchandra, P.; Qiu, Y.; Jakher, H.; Perakis, A.A.; Becker, L.; Yoo, S.Y.; Dwyer, K.C.; et al. Cathepsin G is broadly expressed in acute myeloid leukemia and is an effective immunotherapeutic target. Leukemia 2016, 31, 234–237. [Google Scholar] [CrossRef] [Green Version]
  35. Miyaoka, M.; Kikuti, Y.Y.; Carreras, J.; Itou, A.; Ikoma, H.; Tomita, S.; Shiraiwa, S.; Ando, K.; Nakamura, N. AID is a poor prognostic marker of high-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements. Pathol. Int. 2021. [Google Scholar] [CrossRef]
Figure 1. Overall survival according to the international prognostic index (IPI) and Epstein–Barr virus infection (EBER). Cases with IPI high–intermediate/high and EBER positivity were associated with poor overall survival (p < 0.01).
Figure 1. Overall survival according to the international prognostic index (IPI) and Epstein–Barr virus infection (EBER). Cases with IPI high–intermediate/high and EBER positivity were associated with poor overall survival (p < 0.01).
Cancers 13 06384 g001
Figure 2. Basic structure of a multilayer perceptron (MLP). This figure shows the basic structure of an MLP artificial neural network, the same type of network used in this research. The network is characterized by a feedforward architecture with one hidden layer for predicting the overall survival outcome. The connections in the network flow forward from the input layer to the output layer without any feedback loop. The input layer contains the predictors (the gene expression data). The hidden layer contains unobservable nodes (units). The output layer contains the responses. The MLP network allows a second hidden layer. H (1:1) means hidden layer 1, node 1. In Table 1, the specific details of the neural networks are shown. For instance, the hidden layer of the MLP for overall survival had 6 nodes (H (1:1–6)), and that for the cell-of-origin molecular classification had 11 and 14 nodes.
Figure 2. Basic structure of a multilayer perceptron (MLP). This figure shows the basic structure of an MLP artificial neural network, the same type of network used in this research. The network is characterized by a feedforward architecture with one hidden layer for predicting the overall survival outcome. The connections in the network flow forward from the input layer to the output layer without any feedback loop. The input layer contains the predictors (the gene expression data). The hidden layer contains unobservable nodes (units). The output layer contains the responses. The MLP network allows a second hidden layer. H (1:1) means hidden layer 1, node 1. In Table 1, the specific details of the neural networks are shown. For instance, the hidden layer of the MLP for overall survival had 6 nodes (H (1:1–6)), and that for the cell-of-origin molecular classification had 11 and 14 nodes.
Cancers 13 06384 g002
Figure 3. Architecture of the multilayer perceptron (MLP). The MLP is an artificial neural network that is characterized by a feedforward structure and supervised learning. The MLP network is a function of one or more predictors (known as inputs or independent variables) that minimizes the prediction error on one or more target variables (outputs). This figure shows the general architecture for the MLP network and the corresponding notation.
Figure 3. Architecture of the multilayer perceptron (MLP). The MLP is an artificial neural network that is characterized by a feedforward structure and supervised learning. The MLP network is a function of one or more predictors (known as inputs or independent variables) that minimizes the prediction error on one or more target variables (outputs). This figure shows the general architecture for the MLP network and the corresponding notation.
Cancers 13 06384 g003
Figure 4. Architecture of the radial basis function (RBF). The RBF is a supervised feedforward learning network that is characterized by only one hidden layer. This figure shows the architecture of the three layers of the RBF network and the corresponding notation.
Figure 4. Architecture of the radial basis function (RBF). The RBF is a supervised feedforward learning network that is characterized by only one hidden layer. This figure shows the architecture of the three layers of the RBF network and the corresponding notation.
Cancers 13 06384 g004
Figure 5. Data analysis workflow. This research used the gene expression data of 106 cases of DLBCL from Tokai University. The gene set was composed of 730 genes from the pancancer immune profiling panel. Two types of artificial neural network were used: multilayer perceptron (MLP) and radial basis function (RBF). Besides these, other machine learning techniques were included in the analysis in addition to conventional statistics.
Figure 5. Data analysis workflow. This research used the gene expression data of 106 cases of DLBCL from Tokai University. The gene set was composed of 730 genes from the pancancer immune profiling panel. Two types of artificial neural network were used: multilayer perceptron (MLP) and radial basis function (RBF). Besides these, other machine learning techniques were included in the analysis in addition to conventional statistics.
Cancers 13 06384 g005
Figure 6. Immunohistochemistry for MAPK. The immunohistochemical signals were evaluated as an ordinal variable as 0 (no staining, <1%), 1 + (few scattered positive cells, 1–20%), or 2 + (abundant cells, >20%). Positive staining, brown color (DAB).
Figure 6. Immunohistochemistry for MAPK. The immunohistochemical signals were evaluated as an ordinal variable as 0 (no staining, <1%), 1 + (few scattered positive cells, 1–20%), or 2 + (abundant cells, >20%). Positive staining, brown color (DAB).
Cancers 13 06384 g006
Figure 7. Multilayer perceptron artificial neural network for predicting overall survival. The neural network predicted the overall survival outcome as dead/alive using the 730 genes of the pancancer immune oncology profiling panel. The network performance can be checked using several parameters such as the area under the curve (AUC), which had a value of 0.89. The network performance outputs are the predictive by observed chart, the cumulative gains chart, and the lift chard. The genes were ranked according to their normalized importance for prediction of the overall survival outcome, as shown in the independent variable importance chart. The top 20 genes are listed. GSEA showed enrichment toward the dead phenotype for some genes.
Figure 7. Multilayer perceptron artificial neural network for predicting overall survival. The neural network predicted the overall survival outcome as dead/alive using the 730 genes of the pancancer immune oncology profiling panel. The network performance can be checked using several parameters such as the area under the curve (AUC), which had a value of 0.89. The network performance outputs are the predictive by observed chart, the cumulative gains chart, and the lift chard. The genes were ranked according to their normalized importance for prediction of the overall survival outcome, as shown in the independent variable importance chart. The top 20 genes are listed. GSEA showed enrichment toward the dead phenotype for some genes.
Cancers 13 06384 g007
Figure 8. Overall survival according to the top genes of the MLP analysis. Because of the MLP and multivariate Cox regression analysis a final set of 7 genes were highlighted. Using a risk-score formula, the cases were divided into high and low-risk groups that had different overall survival (p < 0.001). Additionally, using a cutoff for the gene expression values, overall survival plots were calculated for each highlighted gene.
Figure 8. Overall survival according to the top genes of the MLP analysis. Because of the MLP and multivariate Cox regression analysis a final set of 7 genes were highlighted. Using a risk-score formula, the cases were divided into high and low-risk groups that had different overall survival (p < 0.001). Additionally, using a cutoff for the gene expression values, overall survival plots were calculated for each highlighted gene.
Cancers 13 06384 g008
Figure 9. Overall survival according to the top genes of the MLP analysis in DLBCL subtypes/entities. As a result of the MLP and the multivariate Cox regression analysis a final set of 7 genes were highlighted. Using a risk-score formula, the cases were divided into high and low-risk groups that had different overall survival when stratifying for IPI, Epstein Barr virus infection (EBER), MYC rearrangement and High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangement.
Figure 9. Overall survival according to the top genes of the MLP analysis in DLBCL subtypes/entities. As a result of the MLP and the multivariate Cox regression analysis a final set of 7 genes were highlighted. Using a risk-score formula, the cases were divided into high and low-risk groups that had different overall survival when stratifying for IPI, Epstein Barr virus infection (EBER), MYC rearrangement and High-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangement.
Cancers 13 06384 g009
Figure 10. Multivariate overall survival analyses. The set of 7 genes was used in addition to the IPI, EBER, molecular subtypes, and HGBL to predict the overall survival outcome (dead/alive). A multilayer perceptron analysis successfully classified the cases based on those parameters, with a network performance having an area under the curve (AUC) of 0.880. The variables were ranked according to their normalized importance for predicting the prognosis. The most relevant predictors were ARG1, REL, and CTSG.
Figure 10. Multivariate overall survival analyses. The set of 7 genes was used in addition to the IPI, EBER, molecular subtypes, and HGBL to predict the overall survival outcome (dead/alive). A multilayer perceptron analysis successfully classified the cases based on those parameters, with a network performance having an area under the curve (AUC) of 0.880. The variables were ranked according to their normalized importance for predicting the prognosis. The most relevant predictors were ARG1, REL, and CTSG.
Cancers 13 06384 g010
Figure 11. Other machine learning techniques for predicting overall survival. In addition to the artificial neural networks, other machine learning techniques were used. This figure shows the results of the CHAID decision tree and the Bayesian network. CHAID, or chi-squared automatic interaction detection, is a classification method for building decision trees by using chi-squared statistics to identify optimal splits. A Bayesian network is a graphical model that displays variables (nodes) in a dataset and the probabilistic, or conditional, independencies between them. Causal relationships between nodes may be represented but the links (arcs) do not necessarily represent direct cause and effect. Finally, the predictive value of the final set of 7 genes was tested in an independent series of DLBCL of 414 cases, and the results were reproducible.
Figure 11. Other machine learning techniques for predicting overall survival. In addition to the artificial neural networks, other machine learning techniques were used. This figure shows the results of the CHAID decision tree and the Bayesian network. CHAID, or chi-squared automatic interaction detection, is a classification method for building decision trees by using chi-squared statistics to identify optimal splits. A Bayesian network is a graphical model that displays variables (nodes) in a dataset and the probabilistic, or conditional, independencies between them. Causal relationships between nodes may be represented but the links (arcs) do not necessarily represent direct cause and effect. Finally, the predictive value of the final set of 7 genes was tested in an independent series of DLBCL of 414 cases, and the results were reproducible.
Cancers 13 06384 g011
Figure 12. Multilayer perceptron analysis for predicting molecular subtypes (GCB, ABC, unspecified). The neural network predicted molecular subtypes as GCB, ABC, and Unspecified using the 730 genes of the pancancer immune oncology profiling panel. The network performance was checked using several parameters, such as the area under the curve (AUC), which had a value of 0.99. The genes were ranked according to their normalized importance for prediction, as shown in the independent variable importance chart. The top 20 genes are listed. The molecular subtypes were based on the Lymph2Cx assay.
Figure 12. Multilayer perceptron analysis for predicting molecular subtypes (GCB, ABC, unspecified). The neural network predicted molecular subtypes as GCB, ABC, and Unspecified using the 730 genes of the pancancer immune oncology profiling panel. The network performance was checked using several parameters, such as the area under the curve (AUC), which had a value of 0.99. The genes were ranked according to their normalized importance for prediction, as shown in the independent variable importance chart. The top 20 genes are listed. The molecular subtypes were based on the Lymph2Cx assay.
Cancers 13 06384 g012
Figure 13. Multilayer perceptron analysis for predicting molecular subtypes (GCB vs. ABC+Unspecified). The neural network predicted the molecular subtypes as GCB and ABC+Unspecified using the 730 genes of the pancancer immune oncology profiling panel. The network performance was checked using several parameters, such as the area under the curve (AUC), which had a value of 1.0. The genes were ranked according to their normalized importance for prediction, as shown in the independent variable importance chart. The top 20 genes are listed. GSEA analysis had a sinusoidal-like shape, with some genes associated with the GCB and others with the ABC+Unspecified phenotype.
Figure 13. Multilayer perceptron analysis for predicting molecular subtypes (GCB vs. ABC+Unspecified). The neural network predicted the molecular subtypes as GCB and ABC+Unspecified using the 730 genes of the pancancer immune oncology profiling panel. The network performance was checked using several parameters, such as the area under the curve (AUC), which had a value of 1.0. The genes were ranked according to their normalized importance for prediction, as shown in the independent variable importance chart. The top 20 genes are listed. GSEA analysis had a sinusoidal-like shape, with some genes associated with the GCB and others with the ABC+Unspecified phenotype.
Cancers 13 06384 g013
Figure 14. Other machine learning techniques for predicting molecular subtypes. The different gene expression levels between the two molecular subtypes are shown in a boxplot figure. In addition to artificial neural networks, other machine learning techniques were used. This figure shows the results of the CHAID decision tree and the Bayesian network. Finally, the predictive value of the final set of MAPK was tested in an independent series of DLBCL of 96 cases from Tokai University, and the results confirmed the association with the GCB phenotype. The expression of MAP3K was correlated with LMO2 and M2-like tumor-associated macrophage markers including CSF1R, CD163, and PD-L1. MAP3K correlated with LMO2 (odds ratio = 2.8, p = 0.039). Interestingly, though MAP3K showed histological expression similar to that of macrophages, no correlation was found with the markers (CD163, CSF1R, TNFAIP8, CASP8, PD-L1, PTX3, and IL-10).
Figure 14. Other machine learning techniques for predicting molecular subtypes. The different gene expression levels between the two molecular subtypes are shown in a boxplot figure. In addition to artificial neural networks, other machine learning techniques were used. This figure shows the results of the CHAID decision tree and the Bayesian network. Finally, the predictive value of the final set of MAPK was tested in an independent series of DLBCL of 96 cases from Tokai University, and the results confirmed the association with the GCB phenotype. The expression of MAP3K was correlated with LMO2 and M2-like tumor-associated macrophage markers including CSF1R, CD163, and PD-L1. MAP3K correlated with LMO2 (odds ratio = 2.8, p = 0.039). Interestingly, though MAP3K showed histological expression similar to that of macrophages, no correlation was found with the markers (CD163, CSF1R, TNFAIP8, CASP8, PD-L1, PTX3, and IL-10).
Cancers 13 06384 g014
Table 1. Multilayer perceptron analysis.
Table 1. Multilayer perceptron analysis.
Dependent VariableOS OutcomeMolecular Subtype (GCB, ABC, Unspecified)Molecular Subtype (GCB, ABC+Unspecified)
Case processing
Training set72/105 (68.6%)77/104 (74.0%)76/104 (73.1%)
Testing set33/105 (31.4%)27/106 (26.0%)28/104 (26.9%)
Input layer
Covariates730730730
Units730730730
RescalingStandardizedStandardizedStandardized
Hidden Layer
Number111
Units61114
Activation functionHyperbolic tangentHyperbolic tangentHyperbolic tangent
Output layer
Num. Dependent variables111
Units232
Activation functionSoftmaxSoftmaxSoftmax
Error functionCross-entropyCross-entropyCross-entropy
Model summary
Training
Cross-entropy error27.8845.0610.594
Incorrect predictions %15.31.3<0.0001
Stopping rule used1 consecutive step(s) with no decrease in error1 consecutive step(s) with no decrease in error1 consecutive step(s) with no decrease in error
Training time00:01.400:01.100:00.6
Testing
Cross-entropy error13.84712.1552.662
Incorrect predictions %18.20 18.53.6
Classification (% correct)
Training 84.798.7100
Testing81.881.596.4
Area under the curve
0.898 (Alive)0.995 (GCB)1.0 (GCB)
0.898 (Dead)0.994 (ABC)1.0 (ABC+Unspecified)
0.989 (Unspecified)
OS, overall survival; GCB, germinal center B-cell-like; ABC, activated B-cell-like (ABC). The molecular subtypes were based on the Lymph2Cx assay.
Table 2. Top 20 genes identified by the multilayer perceptron analysis.
Table 2. Top 20 genes identified by the multilayer perceptron analysis.
Overall Survival OutcomeMolecular Subtype
Dead/AliveGCB/ABC/UnspecifiedGCB/ABC+Unspecified
OrderGeneNormalized ImportanceGeneNormalized ImportanceGeneNormalized Importance
1CD551.000 ARG11.000 CD371.000
2ARG10.982 CTAG1B0.959 STAT60.867
3SPANXB10.949 CD550.950 ATF20.830
4CTAG1B0.946 IL220.915 ROPN10.819
5IFNA80.853 PRG20.903 C4B0.814
6CASP10.851 SSX10.895 NOTCH10.805
7IL20.834 JAK30.877 CTAG1B0.797
8TNFSF120.819 PPBP0.869 ICAM30.796
9ANP32B0.795 APP0.851 CEACAM10.783
10CTSG0.784 TAL10.839 NOD20.773
11THY10.780 IL4R0.831 LAG30.773
12REL0.779 F120.815 GNLY0.767
13TLR20.775 MAP2K20.810 TP530.762
14C8A0.767 MAP4K20.798 CD460.755
15IL220.760 MAPK140.797 MAPK30.747
16TIRAP0.755 A2M0.795 TREM20.739
17CCL150.754 LILRA40.791 REL0.736
18NRP10.753 MAPK10.789 NUP1070.722
19XCL20.750 DMBT10.786 IL17RB0.718
20MFGE80.749 NT5E0.781 SPACA30.714
The genes were ranked according to their normalized importance for predicting overall survival and molecular subtypes. The molecular subtypes were based on the Lymph2Cx assay.
Table 3. Correlations between the top 20 genes of the multilayer perceptron and the overall survival of the patients.
Table 3. Correlations between the top 20 genes of the multilayer perceptron and the overall survival of the patients.
GeneBetaSEWaldDfpHazard Risk95.0% CI for HR
LowerUpper
Step 1
CD55−0.06 0.29 0.04 10.851 0.9 0.5 1.7
ARG1−0.54 0.22 6.00 10.014 0.6 0.4 0.9
SPANXB10.28 0.24 1.34 10.246 1.3 0.8 2.1
CTAG1B−0.02 0.17 0.01 10.903 1.0 0.7 1.4
IFNA80.47 0.26 3.22 10.073 1.6 1.0 2.7
CASP11.42 0.37 14.51 10.000 4.1 2.0 8.6
IL2−0.04 0.28 0.03 10.874 1.0 0.6 1.7
TNFSF12−1.00 0.37 7.28 10.007 0.4 0.2 0.8
ANP32B0.06 0.53 0.01 10.907 1.1 0.4 3.0
CTSG0.45 0.14 10.11 10.001 1.6 1.2 2.1
THY10.42 0.32 1.74 10.188 1.5 0.8 2.9
REL−0.32 0.25 1.60 10.205 0.7 0.4 1.2
TLR20.51 0.29 3.13 10.077 1.7 0.9 2.9
C8A−0.18 0.33 0.29 10.593 0.8 0.4 1.6
IL22−0.10 0.17 0.31 10.580 0.9 0.6 1.3
TIRAP−0.32 0.39 0.69 10.406 0.7 0.3 1.6
CCL15−0.17 0.33 0.29 10.593 0.8 0.4 1.6
NRP1−0.78 0.35 5.07 10.024 0.5 0.2 0.9
XCL2−0.09 0.16 0.31 10.579 0.9 0.7 1.3
MFGE8−0.04 0.30 0.02 10.888 1.0 0.5 1.7
Step 14 (last)
ARG1−0.46 0.19 5.84 10.016 0.6 0.4 0.9
IFNA80.37 0.20 3.64 10.056 1.5 1.0 2.1
CASP11.34 0.31 18.59 10.000 3.8 2.1 7.1
TNFSF12−0.78 0.27 8.42 10.004 0.5 0.3 0.8
CTSG0.37 0.13 8.00 10.005 1.4 1.1 1.9
REL−0.33 0.19 3.20 10.074 0.7 0.5 1.0
NRP1−0.53 0.28 3.67 10.055 0.6 0.3 1.0
Multivariate Cox regression analysis for overall survival; backward conditional.
Table 4. Machine learning analysis for predicting the overall survival outcome and molecular subtypes.
Table 4. Machine learning analysis for predicting the overall survival outcome and molecular subtypes.
Dependent Variable
(Target Variable)
ModelOverall Accuracy (%)Num. of Fields
(Genes)
Overall survival outcome
(Dead/Alive)
Logistic regression100730
Discriminant100730
SVM100730
CHAID tree97.110
C5 tree96.212
C&R tree86.736
Neural net.69.6730
KNN algorithm60.9730
Bayesian net.26.4730
Molecular subtype (GCB/ABC/Unspecified)Logistic regression100730
Discriminant100730
SVM100730
C5 tree96.27
CHAID tree96.28
Neural net.92.3730
Quest75.912
C&R tree756
KNN algorithm71.2730
Bayesian net.0.9730
Molecular subtype (GCB/ABC+Unspecified)Logistic regression100730
Discriminant100730
SVM100730
Neural net.99730
CHAID tree99.16
C5 tree97.15
KNN algorithm84.6730
Quest83.66
C&R tree50.9730
Bayesian net.0730
Table 5. Associations between the top 20 genes identified by the multilayer perceptron and molecular subtypes (GCB, ABC+Unspecified).
Table 5. Associations between the top 20 genes identified by the multilayer perceptron and molecular subtypes (GCB, ABC+Unspecified).
GeneBetaSEWalddfpOdds Ratio (OR)95.0% CI for OR
LowerUpper
Step 1
CD371.59 0.52 9.41 1 0.002 4.9 1.8 13.4
STAT6−2.50 0.93 7.20 1 0.007 0.1 0.0 0.5
ATF20.02 1.27 0.00 1 0.990 1.0 0.1 12.2
ROPN10.33 0.49 0.45 1 0.503 1.4 0.5 3.6
C4B−0.15 0.34 0.20 1 0.654 0.9 0.4 1.7
NOTCH1−0.42 0.80 0.27 1 0.605 0.7 0.1 3.2
CTAG1B0.46 0.62 0.56 1 0.453 1.6 0.5 5.3
ICAM30.07 0.53 0.02 1 0.897 1.1 0.4 3.0
CEACAM10.30 0.28 1.12 1 0.291 1.3 0.8 2.4
NOD2−0.01 0.42 0.00 1 0.990 1.0 0.4 2.3
LAG30.61 0.40 2.32 1 0.128 1.8 0.8 4.0
GNLY0.78 0.34 5.35 1 0.021 2.2 1.1 4.2
TP53−0.93 0.65 2.01 1 0.156 0.4 0.1 1.4
CD461.04 0.65 2.57 1 0.109 2.8 0.8 10.1
MAPK3−0.87 1.05 0.70 1 0.404 0.4 0.1 3.2
TREM2−0.82 0.42 3.77 1 0.052 0.4 0.2 1.0
REL−0.96 0.55 3.10 1 0.079 0.4 0.1 1.1
NUP1070.59 1.03 0.33 1 0.568 1.8 0.2 13.7
IL17RB0.31 0.19 2.70 1 0.100 1.4 0.9 2.0
SPACA3−0.23 0.45 0.25 1 0.618 0.8 0.3 1.9
Constant6.82 3.86 3.12 10.077 917.1
Step 14 (last)
CD371.05 0.37 8.17 10.004 2.9 1.4 5.9
STAT6−1.96 0.75 6.76 10.009 0.1 0.0 0.6
GNLY1.01 0.25 15.70 10.000 2.7 1.7 4.5
CD460.78 0.46 2.86 10.091 2.2 0.9 5.4
TREM2−0.55 0.33 2.76 10.097 0.6 0.3 1.1
REL−1.13 0.49 5.28 10.022 0.3 0.1 0.8
IL17RB0.38 0.15 6.35 10.012 1.5 1.1 2.0
Constant4.67 2.53 3.41 10.065 107.0
Multivariate binary logistic regression for molecular subtypes (GCB vs. ABC+Unclassified), backward conditional. In the regression analysis, the GCB is the reference group.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Carreras, J.; Hiraiwa, S.; Kikuti, Y.Y.; Miyaoka, M.; Tomita, S.; Ikoma, H.; Ito, A.; Kondo, Y.; Roncador, G.; Garcia, J.F.; et al. Artificial Neural Networks Predicted the Overall Survival and Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using a Pancancer Immune-Oncology Panel. Cancers 2021, 13, 6384. https://doi.org/10.3390/cancers13246384

AMA Style

Carreras J, Hiraiwa S, Kikuti YY, Miyaoka M, Tomita S, Ikoma H, Ito A, Kondo Y, Roncador G, Garcia JF, et al. Artificial Neural Networks Predicted the Overall Survival and Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using a Pancancer Immune-Oncology Panel. Cancers. 2021; 13(24):6384. https://doi.org/10.3390/cancers13246384

Chicago/Turabian Style

Carreras, Joaquim, Shinichiro Hiraiwa, Yara Yukie Kikuti, Masashi Miyaoka, Sakura Tomita, Haruka Ikoma, Atsushi Ito, Yusuke Kondo, Giovanna Roncador, Juan F. Garcia, and et al. 2021. "Artificial Neural Networks Predicted the Overall Survival and Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using a Pancancer Immune-Oncology Panel" Cancers 13, no. 24: 6384. https://doi.org/10.3390/cancers13246384

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop