High Expression of Caspase-8 Associated with Improved Survival in Diffuse Large B-Cell Lymphoma: Machine Learning and Artificial Neural Networks Analyses

Abstract: High expression of the anti-apoptotic TNFAIP8 is associated with poor survival of patients with diffuse large B-cell lymphoma (DLBCL), and one of the functions of TNFAIP8 is to inhibit the pro-apoptotic Caspase-8. We aimed to analyze the immunohistochemical expression of Caspase-8 (active subunit p18; CASP8) in a series of 97 cases of DLBCL from Tokai University Hospital, and to correlate it with other Caspase-8 pathway-related markers, including cleaved Caspase-3, cleaved PARP, BCL2, TP53, MDM2, MYC, Ki67, E2F1, CDK6, MYB and LMO2. After digital image quantification, the correlation with several clinicopathological characteristics of the patients showed that high protein expression of Caspase-8 was associated with favorable overall and progression-free survival (hazard ratios = 0.3; p = 0.005 and 0.03, respectively). Caspase-8 also positively correlated with cCASP3, MDM2, E2F1, TNFAIP8, BCL2 and Ki67. Next, the Caspase-8 protein expression was modeled using predictive analytics, and a high overall predictive accuracy (>80%) was obtained with the CHAID decision tree, Bayesian network, discriminant analysis, C5 tree, logistic regression, and artificial intelligence neural network methods (both multilayer perceptron and radial basis function); the most relevant markers were cCASP3, E2F1, TP53, cPARP, MDM2, BCL2 and TNFAIP8. Finally, the CASP8 gene expression was also successfully modeled in an independent DLBCL series of 414 cases from the Lymphoma/Leukemia Molecular Profiling Project (LLMPP). In conclusion, high protein expression of Caspase-8 is associated with a favorable prognosis of DLBCL. Predictive modeling is a feasible analytic strategy that results in a solution that can be understood (i.e., explainable artificial intelligence).


Introduction
Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent non-Hodgkin lymphomas (NHLs) in Western countries. DLBCL accounts for approximately 25% of NHLs and is heterogeneous from a clinicopathological point of view, including its histological and morphological features, genetic changes and biological characteristics [1][2][3]. Within the category of DLBCL, several distinct subtypes are recognized, such as T-cell/histiocyte-rich large B-cell lymphoma, primary DLBCL of the mediastinum, intravascular lymphoma and lymphomatoid granulomatosis [2]. The prognosis of DLBCL is variable, and with current treatment the disease is curable in 50% of the cases [2,4]. As DLBCL is heterogeneous, it is necessary to identify biomarkers with prognostic value.
The prognosis of DLBCL correlates with the International Prognostic Index (IPI) score, which includes age, serum lactate dehydrogenase, Eastern Cooperative Oncology Group (ECOG) performance status, clinical stage and the number of extranodal disease sites [5][6][7][8]. A variation of the original IPI that incorporates more detailed information about these clinical variables is the National Comprehensive Cancer Network (NCCN)-IPI [9]. Both IPIs are used in this research.
In comparison to the GCB, the ABC subtype is characterized by a more aggressive clinical evolution and constitutive activation of the anti-apoptotic nuclear factor kappa B (NF-kB) pathway [23][24][25][26]. We have recently described the prognostic value of a negative mediator of apoptosis in DLBCL, the tumor necrosis factor alpha-induced protein 8 (TNFAIP8) [27,28]. In that research, we used artificial intelligence (the multilayer perceptron neural network) to analyze the gene expression of the DLBCL series of the Lymphoma/Leukemia Molecular Profiling Project (LLMPP) and to identify the genes associated with the overall survival of the patients. TNFAIP8 was identified within the top 20 most relevant genes of the LLMPP series. Then, we validated the importance of TNFAIP8 by immunohistochemistry and by digital quantification using a machine-learning Weka-based segmentation method in a series of DLBCL from Tokai University Hospital, and we confirmed that high TNFAIP8 was associated with poor overall survival of the patients [28]. TNFAIP8 acts as a negative mediator of apoptosis and may play a role in tumor progression. TNFAIP8 suppresses TNF-mediated apoptosis by inhibiting Caspase-8 activity but not the processing of procaspase-8, subsequently resulting in inhibition of BID cleavage and Caspase-3 activation [29][30][31].
In our previous research, we quantified the protein expression of TNFAIP8 in a series from Tokai University Hospital and correlated it with two markers related to the proliferation cycle, Ki67 and MYC. We found by immunohistochemistry that the expression of TNFAIP8 was associated with poor survival of the patients and also correlated positively, and moderately, with Ki67 and MYC. Nevertheless, our previous work had the limitation of not knowing how the TNFAIP8 expression in DLBCL correlated with the apoptosis pathway (Caspase-8, Caspase-3, PARP), the regulation of which is the main function of TNFAIP8. In Figure 1 the protein-protein interactions of TNFAIP8 are shown. These interactions highlight the apoptosis (including Caspase-8), cell cycle and p53 signaling pathways. In addition, the correlations in our previous research included only a linear analysis, and more complex nonlinear analyses (which may fit biological processes better) had not been performed. Statistics and machine learning differ in their aim: statistical models infer relationships between variables, whereas machine learning is designed to make the most accurate predictions.

Figure 1. Interactions between Caspase-8 and the Caspase-8-related proteins. The aim of this research is to analyze the role of Caspase-8 in diffuse large B-cell lymphoma, focusing on the investigation of the possible pathological mechanism, the correlations with Caspase-8-related markers and the clinicopathological correlations. This network summarizes the predicted associations of Caspase-8 with the group of pathway-related proteins. The nodes are the proteins and the edges represent the predicted functional associations: action types (activation, binding, inhibition, etc.) and effect types (positive, negative, and unspecified). The basic network (left) contains only the markers (nodes) of this project; the extended network (right) includes additional nodes for better information on action types and action effects.
The purpose of this research was to analyze the expression of Caspase-8 (CASP8) in DLBCL. A series of DLBCL from Tokai University was immunostained for Caspase-8 and the protein expression was quantified by digital image analysis. Other markers of the Caspase-8 pathway, including BCL2, cCASP3, CDK6, E2F1, LMO2, MDM2, MKI67, MYB, MYC, cPARP and TP53, were analyzed as well. We performed statistical and machine learning analyses to investigate the correlations between the markers and with the clinicopathological characteristics of the samples. Then, we also used the multilayer perceptron neural network analysis to identify other genes related to CASP8 using the LLMPP dataset. We found that high expression of Caspase-8 was associated with a good prognosis of the patients.

Series of DLBCL from Tokai University Hospital
The DLBCL series of Tokai University Hospital comprises 97 cases, collected from 2006 to 2011. The clinicopathological characteristics are shown in Table 1. In summary, the male/female ratio was 54/43 (1.3) and the age ranged from 14 to 97 years, with a median of 67 and a mean of 64.2 ± 14.5. According to the International Prognostic Index (IPI), 38.3% of the patients were low, 30.9% were low-intermediate, 17.3% were high-intermediate and 13.6% were high risk. Serum IL2R was high in 77% of the cases and B symptoms were present in 24% of the cases. The location was nodal (including the spleen) in 55% of the cases. The treatment was RCHOP or RCHOP-like in 93.4% of the cases. Clinical response was achieved in 74% of the patients. The pathological characteristics showed that the cell-of-origin was non-GCB in 67% of the cases, and the immune phenotype was CD5+ in 15%, CD10+ in 30%, MUM1+ in 79%, BCL2+ in 79% and BCL6+ in 67% of the cases. The immunohistochemical expression of Regulator of G-protein signaling 1 (RGS1), a marker associated with the chemotaxis of B-lymphocytes, with germinal center formation and with a poor prognosis of DLBCL [22,32], was high in 54% of the cases. The clinicopathological variables associated with the overall survival of the patients are shown in Table 1. Relevant variables were the IPI, sIL2R, Epstein-Barr virus infection and the cell-of-origin molecular classification according to the Hans' classifier [10].

We used the series of the LLMPP for gene expression analysis [33,34]. This series, the GSE10846, is a robust and well-annotated series of 414 cases of DLBCL from Western countries that is publicly archived and available for download at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10846 (accessed on 16 April 2021). This series was last updated on 25 March 2019 (contact: Prof. Louis M. Staudt, Center for Cancer Research, National Cancer Institute, Building 10, Room 5A02, Bethesda, MD 20892, USA).
The clinicopathological features of this series are shown in detail in Table 2. In summary, the male/female ratio was 224/172 (1.3) and the age ranged from 14 to 92 years, with a median of 62.5 and a mean of 61 ± 15.5. The 5- and 10-year overall survival of the patients was 57% and 47%, respectively. The variables with prognostic value for the overall survival included, among others, the National Comprehensive Cancer Network International Prognostic Index (the enhanced NCCN-IPI) and the cell-of-origin molecular subtypes of germinal center B-cell (GCB), activated B-cell (ABC) and unclassified types (Table 2). This series is comparable to the one from Tokai University Hospital.

Immunohistochemistry and Digital Image Quantification
The immunohistochemical procedures were performed using formalin-fixed paraffin-embedded tissue sections of the lymphoma samples. The immunostaining was performed in a fully automated stainer for immunohistochemistry and in situ hybridization (Leica Biosystems Bond-Max, Leica K.K., Tokyo, Japan), including the manufacturer's ancillary reagents and consumables such as the Dewax solution (AR9222), Wash solution (AR9590), Bond epitope retrieval solution 1 and 2 (AR9961 and AR9640) and Polymer refine detection (DS9800). The staining process included the following steps: dewaxing, antigen retrieval, peroxide block, post-primary, polymer, DAB and hematoxylin. The mounting was performed in a Leica CV5030 coverslipper. The slides were visualized in an Olympus BX53 upright microscope, with a DP74 digital camera and cellSens imaging software (Olympus LifeScience, Olympus K.K., Tokyo, Japan). The whole slides were also digitalized using a Hamamatsu digital slide scanner, the NanoZoomer S360, and visualized with the NDP.view2 viewing software (Hamamatsu Photonics K.K., Hamamatsu, Japan). The representative areas of each marker were stored as JPEG images for further digital image quantification using the Fiji (ImageJ) image processing package, with an RGB and threshold strategy as we have recently described [28].
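The idea behind an RGB-and-threshold quantification can be illustrated with a minimal sketch. The function name and the three channel thresholds below are hypothetical values chosen only for illustration; the actual Fiji (ImageJ) settings are those described in [28] and are not reproduced here.

```python
def percent_positive(pixels, max_r=150, max_g=110, max_b=90):
    """Return the percentage of pixels classified as stained.

    pixels: iterable of (R, G, B) tuples with values in 0-255.
    A pixel counts as positive when every channel is at or below its
    threshold (hypothetical values), i.e. it is dark brown rather than
    blue counterstain or white background.
    """
    total = 0
    positive = 0
    for r, g, b in pixels:
        total += 1
        if r <= max_r and g <= max_g and b <= max_b:
            positive += 1
    if total == 0:
        raise ValueError("empty image")
    return 100.0 * positive / total


# Toy 4-pixel "image": one brown (DAB-like) pixel and three background pixels.
image = [(120, 80, 50), (230, 230, 240), (200, 210, 235), (255, 255, 255)]
print(percent_positive(image))  # 25.0
```

The output is a percentage of stained area, which is the same kind of quantity as the Caspase-8 expression values reported in the Results.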

Bioinformatics and Statistical Analyses

Comparison between Groups
Comparisons between groups were performed when needed using non-parametric tests (the Mann-Whitney U test or the Kruskal-Wallis test) and crosstabulations that included the Pearson Chi-Square test, Fisher's Exact test and the Likelihood Ratio test. Correlations between two quantitative variables were assessed with Pearson and Spearman correlations. Binary logistic regression was performed to calculate Odds Ratios and to correlate the expression of Caspase-8 (as a dichotomic variable) with the rest of the clinicopathological variables (also as dichotomic variables).
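For a single dichotomic predictor, the Odds Ratio from a binary logistic regression equals the cross-product ratio of the 2 × 2 crosstabulation, and its Wald confidence interval can be computed from the four cell counts. The sketch below shows this calculation; the function name and the counts are hypothetical and are not taken from the series.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% confidence interval for a 2x2 table.

    Layout (the counts used below are hypothetical):
                      outcome +   outcome -
        marker high       a           b
        marker low        c           d
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


ratio, lower, upper = odds_ratio(10, 20, 5, 40)
print(ratio)  # 4.0
```

An interval that excludes 1 corresponds to a significant association at the chosen level.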

Survival Analysis
The definitions of overall and progression-free survival were the standard ones described by Cheson BD et al. [37,38]. The overall survival was calculated from the time of diagnosis to the time of death or the last follow-up. The Kaplan-Meier analysis with the log-rank test was used to calculate survival times and for group comparisons; the analysis included the Breslow and Tarone-Ware tests when necessary. Survival analysis was also performed with Cox regression (enter method). The significance threshold was set a priori at p < 0.05.
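The Kaplan-Meier estimate used for the survival curves can be sketched in a few lines. This is a minimal illustration of the product-limit estimator, not the statistical package used in this research; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times:  follow-up time of each patient.
    events: 1 if the patient died at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored cases both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up (months) and status (1 = dead, 0 = censored).
times = [6, 12, 12, 24, 36, 48]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk until their last follow-up but never lower the curve, which is why the estimate only steps down at death times.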

Software and Artificial Neural Network Analysis
Several software packages were used in this research according to the manufacturers' instructions: R software for statistical computing version 3.6.2a-d) and Radial Basis Function analysis using the immunohistochemical data was performed following the manufacturer's instructions, and as we have thoroughly described in our previous publications [39,40]. In this neural network analysis, the prediction of Caspase-8 by the other related markers of the pathway was performed using the immunohistochemical data of Caspase-8 as a dichotomic variable (high vs. low, with the same cut-off as for the overall survival).
The LLMPP dataset was downloaded from the Gene Expression Omnibus (GEO) repository located on the National Center for Biotechnology Information (NCBI) webpage. The gene expression data of the GSE10846 series were normalized and log2 transformed. The probes were collapsed according to the maximum probe values. Therefore, each gene had one expression value, and the final series comprised a total of 20,684 genes and 414 cases. Using an artificial intelligence approach, we aimed to predict the gene expression of Caspase-8 (CASP8) from the rest of the genes of the array (20,683 genes), using the series of 414 cases of DLBCL from the LLMPP. We used the multilayer perceptron (MLP) procedure, which produced a predictive model for CASP8 (dependent, target variable) based on the values of the predictor variables. Therefore, the dependent variable was CASP8 and the covariates were the 20,683 genes. In this analysis, the dependent variable was treated as scale (continuous) because the values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Of note, this differs from our previous publications, in which the dependent (target) variables were dichotomic (high vs. low, or dead vs. alive) [27,28]. Another difference from our previous publications [27,28] is that we used the values of the collapsed probes.

In the setup, CASP8 was the dependent variable, the rest of the genes were the covariates, and the covariates were rescaled by standardization. The cases were partitioned into a training set (70%) and a testing set (30%), with no holdout set (0%). In the partition of the dataset, the cases were randomly assigned based on the relative number of cases. The architecture had a series of parameters. The hidden layers setup included the number of hidden layers (one or two), the activation function (hyperbolic tangent or sigmoid), and the number of units (automatically computed or custom). The output layer setup included the activation function (identity, softmax, hyperbolic tangent or sigmoid) and the rescaling of scale dependent variables (standardized, normalized, adjusted normalized or none). The type of training could be batch, online or mini-batch, and the optimization algorithm was either the scaled conjugate gradient or gradient descent. In the training options, the initial lambda value was 0.0000005, the initial sigma 0.00005, the interval center 0, and the interval offset ±0.5. The output displayed the network structure (description, diagram, and synaptic weights) and the network performance (model summary, predicted-by-observed chart and residual-by-predicted chart). In addition, the output also showed the case processing summary and the independent variable importance analysis. The predicted value or category for the dependent variable was saved as a new variable. The synaptic weight estimates were also exported as an XML file. The setup also included the user-missing values and the stopping rules.
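The data preparation steps described above (collapsing probes to one value per gene, standardizing the covariates and partitioning the cases 70/30) can be sketched as follows. This is an illustrative sketch and not the software used in this research. Three points are assumptions: "maximum probe value" is read as the probe with the highest mean expression, the standardization uses the sample standard deviation, and the split is made to exact proportions with a fixed seed. All function names and expression values are hypothetical.

```python
import random


def collapse_probes(probe_rows):
    """Keep one probe per gene: the one with the highest mean expression.

    probe_rows: list of (gene_symbol, [expression value per case]).
    Reading "maximum probe value" as the highest mean is an assumption.
    """
    best = {}
    for gene, values in probe_rows:
        mean = sum(values) / len(values)
        if gene not in best or mean > best[gene][0]:
            best[gene] = (mean, values)
    return {gene: values for gene, (_, values) in best.items()}


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (sample SD, n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def split_cases(n_cases, train_fraction=0.7, seed=0):
    """Randomly assign case indices to training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = round(n_cases * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Hypothetical probe-level values for two genes and three cases.
probes = [
    ("CASP8", [9.1, 9.4, 9.0]),
    ("CASP8", [10.2, 10.6, 10.1]),
    ("MYC", [11.0, 12.0, 13.0]),
]
genes = collapse_probes(probes)
train, test = split_cases(414)
print(genes["CASP8"], len(train), len(test))
```

With 414 cases, an exact 70/30 split gives 290 training and 124 testing cases; a procedure that assigns each case at random with these probabilities would give approximately, not exactly, these numbers.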

Immunohistochemical Protein Expression of Caspase-8 in DLBCL (Tokai Series)
The immunohistochemical protein expression of Caspase-8 in the series of 97 cases of DLBCL from Tokai University Hospital was localized in the cytoplasm of cells compatible with B-lymphocytes, which had the morphology of medium- or large-sized centroblasts, or of immunoblasts in some cases. In some cases with high Caspase-8 expression, the localization was perinuclear, with some extension into the nucleus. After digital image quantification, the Caspase-8 expression ranged from 0.0% to 40.2%, with a median of 3.1% and a mean of 6.7% ± 8.3. Figure 3 shows the immunohistochemical expression of Caspase-8, with characteristic pictures of low and high expression. The immunohistochemistry of the other markers is shown in Figures 3-5.

BCL2 is an apoptosis inhibitor that controls the mitochondrial membrane activity and inhibits caspase activity [31]. By immunohistochemistry, Caspase-8 protein expression was cytoplasmic and perinuclear, with some staining in the nucleus when the protein expression was high. Cleaved Caspase-3, cleaved PARP and MDM2 staining was nuclear. BCL2 expression was mainly cytoplasmic and perinuclear.

The MYC proto-oncogene is a transcription factor that binds DNA and activates the transcription of growth-related genes, promotes angiogenesis and regulates somatic reprogramming. Ki67 plays a key role in cell proliferation, with a role in chromatin organization by keeping the mitotic chromosomes dispersed. E2F1 is a transcription factor involved in cell cycle regulation (progression from G1 to S phase) and DNA replication. E2F1 binds RB1 and can mediate both cell proliferation and p53-dependent apoptosis. CDK6 is a kinase involved in the control of the cell cycle (G1/S transition) and cell differentiation [31]. By immunohistochemistry, all these markers showed nuclear staining. CDK6 also showed cytoplasmic localization.

Correlation between Caspase-8 and the Clinicopathological Variables in DLBCL (Tokai Series)
The protein expression of Caspase-8 as a quantitative variable correlated with the overall survival of the patients (Figure 6). The Cox regression analysis showed a trend toward correlation with the overall survival, with high values associated with better survival: Beta = −0.045, p value = 0.071, hazard ratio = 0.956 (95% CI, 0.911-1.004). A cut-off was searched and at 8.7% two groups of patients were identified, with different overall survival. The group of high Caspase-8 expression (>8.8%, n = 27/97, 27.8%) was characterized by a more favorable overall survival than the group of low expression (<8.8%, n = 70/97, 72.2%): Beta = −1.3, p value = 0.009, hazard ratio = 0.3 (95% CI, 0.1-0.7). This means, on average, a 70% lower risk of death, and a 233% increase in survival time. When the overall survival was compared using the Kaplan-Meier method and the log-rank test, the group of high Caspase-8 expression was characterized by a favorable prognosis, with a 3-, 5- and 10-year overall survival of 85%, 85% and 75%, respectively. Conversely, the group of low expression had an unfavorable prognosis, with a 3-, 5- and 10-year overall survival of 56%, 52% and 40%, respectively (p value = 0.005).
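In Cox regression, the hazard ratio is the exponential of the regression coefficient (Beta). The following short check, given only as an illustration, recomputes the values reported above from their coefficients; the function name is hypothetical.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio implied by a Cox regression coefficient."""
    return math.exp(beta)


# Coefficients reported above for overall survival.
hr_continuous = hazard_ratio(-0.045)  # Caspase-8 as a quantitative variable
hr_high_vs_low = hazard_ratio(-1.3)   # high vs. low Caspase-8 expression

print(round(hr_continuous, 3))   # 0.956
print(round(hr_high_vs_low, 1))  # 0.3
# Relative reduction (%) in the hazard of death for the high-expression group,
# computed from the rounded hazard ratio.
print(round(100 * (1 - round(hr_high_vs_low, 1))))  # 70
```

A hazard ratio below 1 indicates a lower hazard of death in the high-expression group; the 70% figure is one minus the rounded hazard ratio.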
Figure 5. Immunohistochemical expression of MYB, LMO2 and TNFAIP8 (Tokai series). MYB is a transcriptional activator that binds DNA and plays a role in the control of cell proliferation and differentiation. LMO2 is a nuclear marker expressed by normal B lymphocytes in the germinal centers. It also regulates hematopoietic stem cell differentiation. TNFAIP8 is a negative regulator of apoptosis and plays a role in tumor progression. It inhibits Caspase-8, subsequently inhibiting the activation of Caspase-3 [31]. We have recently described that high expression of TNFAIP8 correlates with poor survival of DLBCL patients [28]. MYB and LMO2 protein expression is nuclear; TNFAIP8 expression is cytoplasmic and perinuclear.
The protein expression of Caspase-8 was also correlated with the progression-free survival of the patients. As a quantitative variable, Caspase-8 did not correlate with the progression-free survival (p value = 0.251). Using the same cut-off as for the overall survival (8.8%), high Caspase-8 expression was associated with a more favorable progression-free survival of the patients (Beta = −0.952, p value = 0.036, hazard ratio = 0.386; 95% CI, 0.2-0.9).
Using the same cut-off as in the survival analysis (8.8%), a correlation was performed with several clinicopathological characteristics of the series. Nevertheless, no significant correlations were found (Table 3). Therefore, factors other than those conventionally tested in DLBCL may be related to the Caspase-8 expression.

Predictive Modeling of Caspase-8 Protein Expression by the Rest of Caspase-8-Related Markers (Tokai Series)
Predictive analytics was performed to model the immunohistochemical expression of Caspase-8 as a dichotomic variable (high vs. low, using the same 8.7% cut-off) with all the other Caspase-8-related markers, which were used as quantitative variables.
Twelve different models were executed, including the algorithms of C5.0 node that builds a decision tree or a rule set, logistic regression, Bayesian Network, discriminant analysis, k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Tree-AS decision tree, Chi-squared Automatic Interaction Detection (CHAID) decision tree, Classification and Regression (C&R) Tree and Neural Network.

Chi-Squared Automatic Interaction Detection (CHAID) Decision Tree
The CHAID node graph is shown in Figure 7; this decision tree predicted the Caspase-8 expression using cCASP3, BCL2, LMO2 and cPARP. The CHAID classification method builds decision trees by using chi-square statistics to identify optimal cut-offs (splits). Unlike the C&R Tree and the QUEST nodes, the CHAID method can generate non-binary trees. Therefore, a split can have more than two branches, and the trees are wider.

Figure 7. CHAID node decision tree analysis (Tokai series, immunohistochemical data). The Chi-squared automatic interaction detection (CHAID) is a classification method for building decision trees that identifies optimal splits by using chi-square statistics. CHAID examines the crosstabulations between each input field and the outcome, and tests for significance. CHAID can generate nonbinary trees (splits of more than two branches). In this analysis we aimed to predict the Caspase-8 expression as low (1) versus high (2), using the same cut-off as for the survival analysis. The Caspase-8 expression could be predicted using cleaved Caspase-3, BCL2 and cleaved PARP. This decision tree highlights the Caspase-8, cCaspase-3, cPARP apoptosis pathway.
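The chi-square statistic that CHAID uses to compare candidate splits can be computed directly from a crosstabulation. The sketch below is a simplified illustration: a full CHAID implementation also merges categories and ranks predictors by adjusted p values, which is omitted here. The function name and the counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical crosstabulations: rows are bins of a candidate predictor,
# columns are Caspase-8 low / high.
candidate_splits = {
    "marker_A": [[10, 20], [20, 10]],
    "marker_B": [[15, 15], [15, 15]],
}
scores = {name: chi_square(t) for name, t in candidate_splits.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy example the second marker is unrelated to the outcome (statistic of 0), so the first marker would be chosen for the split.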

Bayesian Network
The Bayesian network model is shown in Figure 8. A Bayesian network is a graphical model that shows the variables (i.e., nodes) in a dataset and the probabilistic, or conditional, independencies between them. Causal relationships between nodes may be represented, but the links in the network (i.e., arcs) do not necessarily represent direct cause and effect. The basic view contains a network graph of nodes that displays the relationship between the target (dependent) variable and the predictor variables, and the relationships between the predictors. The distribution view shows the conditional probabilities for each node in the network as a mini graph; the corresponding tables for cleaved Caspase-3 and E2F1 are shown below.
Figure 8. Bayesian Network (Tokai series, immunohistochemical data). The Bayesian network makes it possible to build a probabilistic model that combines observed and recorded evidence with "common-sense" real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes. Therefore, Bayesian networks are used for making predictions. Each of the nodes is one of the markers that were analyzed by immunohistochemistry in the Tokai series of DLBCL. In this analysis we aimed to predict the Caspase-8 expression (target) from the rest of the markers (predictors). Bayesian networks are very robust when information is missing and make the best possible prediction using whatever information is present. In this figure, the conditional probabilities of cCaspase-3 and E2F1 are also shown.

C5.0 Decision Tree
The C5.0 algorithm builds a decision tree by splitting the sample based on the field that provides the maximum information gain. The C5.0 node can predict only a categorical target. In this model, the Caspase-8 expression (high vs. low) was predicted by the cleaved Caspase-3 and E2F1 variables, as shown in Figure 9.

Figure 9. C5.0 node decision tree analysis (Tokai series, immunohistochemical data). The C5.0 algorithm was used to predict the Caspase-8 expression as a categorical target (low versus high, with the same cut-off as for the survival analysis) from the rest of the markers (predictors). C5.0 models are quite robust when data are missing and when there are large numbers of input fields. C5.0 models tend to be easier to understand. In this analysis we found that Caspase-8 expression could be predicted by cCaspase-3 and E2F1, highlighting the apoptosis pathway.
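The information gain criterion mentioned above can be illustrated with a minimal sketch based on Shannon entropy. It evaluates a single binary split of a quantitative marker at a given threshold; it is not the C5.0 implementation itself, and the function names, marker values and labels are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical marker values (%) and Caspase-8 classes.
marker = [1.0, 2.0, 3.0, 4.0]
caspase8 = ["low", "low", "high", "high"]
print(information_gain(marker, caspase8, threshold=2.0))  # 1.0
print(information_gain(marker, caspase8, threshold=1.0))
```

A threshold that separates the two classes perfectly recovers the full entropy of the target (1 bit here), whereas a poorer threshold yields a smaller gain.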

Integrated Analysis
The results of several tests were integrated to calculate the percentage of importance of each marker for the association with Caspase-8. The most relevant markers were cCASP3, E2F1, TP53, MDM2, BCL2 and TNFAIP8 (Table 10).

Gene Expression Analysis Based on CASP8 Expression in DLBCL (LLMPP Series)
The LLMPP DLBCL dataset, which comprises 20,684 genes, was used to identify in an unsupervised manner which genes are associated with the CASP8 expression. A multilayer perceptron analysis was performed, with CASP8 as the dependent variable (quantitative data) and the remaining 20,683 genes as predictors (also as quantitative variables). As a result of the artificial neural network analysis, the genes were ranked according to their normalized importance for the prediction of the CASP8 expression. The neural network managed to predict the CASP8 expression moderately well. According to their normalized importance, the most relevant genes were: MED29 (1st), PRH1, YIPF3, PLEKHH1, PRB4, IKZF1, CYSRT1, ACTC1, FAM160B1, TBC1D10C, TMEM176B, ADAMTS10, CTSV, CEP20, AZGP1, ZNF557, SDCCAG8, CSKMT, BGLAP and SRP54 (20th).
1, highlighted in the model; 0, not highlighted. MLP, multilayer perceptron; RBF, radial basis function; ANN, artificial neural network.

Conversely, the radial basis function (RBF) is generally faster and has only one hidden layer, but at the cost of reduced predictive power. The hidden layer(s) contain unobservable units. The value of each hidden unit is some function of the predictors. In this figure, the relevance of each marker for the prediction of Caspase-8 is shown by the width of the node and by the value of the normalized importance for prediction. The performance of the network can be checked by the area under the ROC curve: the higher the area, the better the prediction of Caspase-8 expression. The synaptic weights from the output of the network are available on request from the corresponding author (Carreras J).
To understand the relationship between CASP8 expression and the top 20 genes, the expression of CASP8 was modeled using the top 20 genes. The analysis included the following model types: regression, generalized linear, linear-AS, LSVM, random trees, Tree-AS, linear, CHAID, C&R tree and neural network. The most relevant models were the following: CHAID (correlation 0.806), neural network (0.712), regression (0.668), generalized linear (0.668), linear (0.667) and C&R tree (0.647).
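For a continuous target, a model can be summarized by the correlation between the observed and the predicted values. The sketch below computes a Pearson correlation for this purpose; that the correlations listed above were computed exactly in this way is an assumption, and the function name and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observed and predicted CASP8 expression values (log2 scale).
observed = [9.8, 10.1, 10.4, 10.9, 11.2]
predicted = [10.0, 10.0, 10.5, 10.7, 11.0]
print(round(pearson(observed, predicted), 3))
```

A value close to 1 indicates that the model ranks and scales the cases similarly to the observed expression.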
A visualization of the CHAID and neural network models is shown in the accompanying figure.

Further analysis was performed focusing on CASP8 as a dichotomic variable in the DLBCL GEO GSE10846 series. Using a ROC curve analysis, the best cut-off of CASP8 for the overall survival phenotype (dead/alive) was searched, and the value was 10.3805. Among the 414 cases of the series, CASP8 was high in 180 (43.5%) and low in 234 (56.5%). We confirmed the association of most of the previously identified top 20 genes of the neural network analysis with high CASP8 expression.

Gene Set Enrichment Analysis (GSEA) is a biostatistical method that determines whether a defined set of genes shows statistically significant, concordant differences between two biological states (e.g., phenotypes). We used GSEA to correlate the phenotype CASP8 high vs. low with several sets of genes (pathways). The whole collection of the MSigDB gene sets was used (23,677 gene sets in total, MSigDB database v7.3, updated March 2021), which includes 9 major collections: H (hallmark genes), C1 (positional), C2 (curated), C3 (regulatory target), C4 (computational), C5 (ontology), C6 (oncogenic signature), C7 (immunologic signature), and C8 (cell type signature). Of the 23,677 tested gene sets, 843 were significantly enriched at a nominal p value < 0.05, either towards high or low CASP8. For example, significantly enriched pathways of the oncogenic signature that were associated with high CASP8 were ALK, KRAS, PGF, P53 and CYCLIND1. Other correlations included sets of the immunologic signature such as macrophages (genes up-regulated in bone marrow-derived macrophages treated with IL4, GSE25088). The complete results are available on request from the corresponding author (Carreras J).

The series of the Lymphoma/Leukemia Molecular Profiling Project (LLMPP) was used to predict the expression of CASP8 as a quantitative target variable. In this analysis, the predictors were the 20,683 genes of the gene expression array. In contrast to the analysis of the Tokai cases, in the LLMPP data analyses CASP8 is predicted as a quantitative variable, which we have not performed in our previous publications (thus the novelty). In neural networks, the predicted-by-observed chart is used for continuous targets and displays a binned scatterplot of the predicted values on the vertical axis by the observed values on the horizontal axis. The importance of each predictor in making the prediction is shown in the independent variable importance figure. The synaptic weights from the output of the network and the normalized importance chart are available on request from the corresponding author (Carreras J). Typically, the modeling focuses on the predictor fields that matter most, and those that matter least are dropped or ignored. Therefore, the neural network was repeated with only the top 20 genes. In addition to the neural network analysis, this figure also shows the result of the CHAID decision tree.
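A ROC-based search for a cut-off of CASP8 against the dead/alive phenotype can be sketched as follows. The criterion used to select the reported cut-off is not stated in the text; the sketch assumes the Youden index (sensitivity + specificity − 1) and assumes that low expression predicts death. The function name and all values are hypothetical.

```python
def youden_cutoff(values, died):
    """Return the cut-off with the highest Youden index and that index.

    values: expression value of each case.
    died:   1 if the patient died, 0 if alive.
    A case is predicted "dead" when its value is below the cut-off
    (assumption: low expression is the unfavorable group).
    """
    n_dead = sum(died)
    n_alive = len(died) - n_dead
    best_j, best_c = -1.0, None
    for c in sorted(set(values)):
        true_pos = sum(1 for v, d in zip(values, died) if d == 1 and v < c)
        true_neg = sum(1 for v, d in zip(values, died) if d == 0 and v >= c)
        j = true_pos / n_dead + true_neg / n_alive - 1.0
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j


# Hypothetical log2 expression values and survival status.
expression = [9.0, 9.5, 10.0, 10.5, 11.0, 11.5]
status = [1, 1, 1, 0, 0, 0]
cutoff, j = youden_cutoff(expression, status)
print(cutoff, j)  # 10.5 1.0
```

Cases at or above the selected cut-off would form the high-expression group, and the remaining cases the low-expression group.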

Discussion
This research focused on the analysis of Caspase-8 in DLBCL from Tokai University Hospital.The protein expression of Caspase-8 was evaluated by immunohistochemistry, followed by marker quantification by digital image analysis.We found that high Caspase-8 protein expression was associated with a favorable prognosis of the patients, including a favorable overall and progression-free survival.
Apoptosis is a term to designate programmed cell death.The mechanism of cell death has multiple roles, including a function in the pathogenesis, homeostasis, and control of several types of infection, as well as in cancer [41].Excessive cell damage results in passive necrosis.On the other hand, the mechanism of cell death can be triggered by several molecular programs including cellular stress, oncogenic changes that involve tumor suppressor genes and oncogenes, several pathogens, and other immune mechanisms.Apoptosis is one of the most known and studied types of programmed cell death [41]; other types of programmed cell death are necroptosis, pyroptosis, ferroptosis, mitotic catastrophy and autophagic cell death, among others [41].The pathway of apoptosis includes an extrinsic (controlled death receptors of the TNFR superfamily) and an intrinsic (mitochondrial) pathway.Interestingly, ligation of these death receptors induces both activation of extrinsic apoptosis and necroptosis, and the balance between these two pathways determines whether the cell lives.Caspase-8 has a role in initiating of extrinsic apoptosis and inhibiting necroptosis [41].Caspase-8 activates Caspase-3 by proteolytic cleavage, and then Caspase-3 cleaves other vital cellular proteins or other caspases, which result in activation of cPARP, which eventually leads to apoptosis [42][43][44].
In DLBCL, the mechanisms of cell survival are dysregulated [45]. Dysregulation of the inhibitor of apoptosis proteins (IAPs) has been described in DLBCL [45]. For example, overexpression of XIAP (an apoptosis inhibitor) was associated with a worse outcome in DLBCL [46]. Another inhibitor, Survivin, was also found to be overexpressed in DLBCL [47], and in DLBCL of the ABC molecular type the overexpression was also associated with a poor prognosis [47]. In addition, we recently described that high expression of another apoptosis inhibitor (TNFAIP8) was associated with a poor prognosis of DLBCL [40]. In this project, the protein expression of Caspase-8 was analyzed in a series from Tokai University Hospital, and we found that high expression was associated with a favorable survival of the patients. Therefore, while anti-apoptosis seems to be associated with a poor prognosis of DLBCL, the pro-apoptotic Caspase-8 is associated with a favorable outcome of the patients.
In DLBCL there is also dysregulation of TP53 [45], which includes not only mutations or deletions of TP53, but also alterations of TP53 pathway-related markers such as BCL6, MDM2 and CDKN2A. In this research, some of these markers were analyzed by immunohistochemistry in the Tokai series, and the relationships among them and with Caspase-8 were explored, as shown in Figure 1. In addition, using several modeling analyses, we showed how these markers correlated with the Caspase-8 expression, either positively or negatively, so that a pathogenic model can be postulated. For example, the Caspase-8 expression could be calculated as 0.2 × MYC − 0.2 × MDM2 + 0.9 × E2F1 + 0.1 × BCL2 − 0.3 × TP53 − 1.7 × cPARP + 3.1 × cCASP3 − 2.697. This research focused on the analysis of Caspase-8 in a series from Tokai University Hospital, and we found that high protein expression of Caspase-8 correlated with a favorable outcome of the patients, in terms of both overall survival and progression-free survival. As shown in Figure 6, the 30% of the patients with high Caspase-8 expression had a favorable overall survival. At 10 years, around 80% of the patients with high Caspase-8 expression were still alive. Conversely, at that time only 40% were alive in the low expression group. This finding is important and, to the best of our knowledge, this association has not previously been reported in DLBCL. Nevertheless, Caspase-8 did not correlate with the conventional clinicopathological variables that are usually associated with the prognosis of DLBCL, such as the cell-of-origin molecular classification (Hans' algorithm) and the International Prognostic Index (IPI), which integrates the clinical variables of age, performance status, LDH, extranodal sites and stage. Therefore, a functional network association analysis was performed, markers associated with Caspase-8 were identified (Figure 1), and finally several types of predictive modeling were tested.
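The linear expression for Caspase-8 quoted in the paragraph above can be evaluated directly. The following minimal Python sketch uses the coefficients and the constant term exactly as given in the text. The scale of the marker inputs is not stated in this section, so the input values below are hypothetical placeholders rather than patient data, and the function name `caspase8_score` is introduced here only for illustration.

```python
# Coefficients and constant term copied from the expression quoted in the text.
COEFFICIENTS = {
    "MYC": 0.2,
    "MDM2": -0.2,
    "E2F1": 0.9,
    "BCL2": 0.1,
    "TP53": -0.3,
    "cPARP": -1.7,
    "cCASP3": 3.1,
}
INTERCEPT = -2.697


def caspase8_score(markers):
    """Return the weighted sum of the marker values plus the constant term."""
    weighted = sum(COEFFICIENTS[name] * markers[name] for name in COEFFICIENTS)
    return INTERCEPT + weighted


# Hypothetical marker values (assumption: not taken from the study data).
example_case = {name: 1.0 for name in COEFFICIENTS}
print(round(caspase8_score(example_case), 3))
```

In this expression the signs indicate the direction of each association: cCASP3 and E2F1 carry the largest positive weights, whereas cPARP carries the largest negative weight.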
Predictive analytics was performed to model the immunohistochemical expression of Caspase-8 as a dichotomic variable (high vs. low, using the same 8.7% cut-off as for the overall survival analysis), with the other Caspase-8-related markers used as quantitative predictor variables.
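The dichotomization step can be sketched in a few lines of Python. The 8.7% cut-off is taken from the text; whether a value exactly at the cut-off is assigned to the high or the low group is not stated, so the strict comparison used here is an assumption, as are the function name and the example percentages.

```python
CUTOFF_PERCENT = 8.7  # cut-off stated in the text


def dichotomize(caspase8_percent, cutoff=CUTOFF_PERCENT):
    """Label a quantified Caspase-8 percentage as 'high' or 'low'.

    Assumption: values strictly above the cut-off are 'high'.
    """
    return "high" if caspase8_percent > cutoff else "low"


# Hypothetical quantification results (percent of positive area), not study data.
example_values = [2.0, 8.7, 20.0]
labels = [dichotomize(value) for value in example_values]
print(labels)
```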
Twelve different models were executed, including the C5.0 algorithm (which builds a decision tree or a rule set), logistic regression, Bayesian Network, discriminant analysis, k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Tree-AS decision tree, Chi-squared Automatic Interaction Detection (CHAID) decision tree, Classification and Regression (C&R) Tree and Neural Network. All these data mining models are tools that enable the development of predictive models from the experimental research data. This data mining process allowed better results and data interpretation, and integrated methods of machine learning, artificial intelligence, and statistics. Of note, each method has certain strengths and is best suited to particular types of problems. Among the 12 different models that were executed, 9 models predicted the Caspase-8 protein expression as a dichotomic variable (high vs. low). When ranked according to their overall accuracy for Caspase-8 prediction, the results were as follows: CHAID tree (92%, 4 variables), Bayesian Network (88%, 12 variables), C5 tree (85%, 2 variables), logistic regression (83%, 12 variables) and Neural Network (80%, 12 variables). The results of all these types of analysis were compatible with each other, and each model provided insights into the relationship between Caspase-8 and the rest of the markers. Nevertheless, as previously stated, each method has strengths and weaknesses. For example, the decision trees had an overall accuracy that ranged from 85% for the C5 tree to 92% for the CHAID tree. This means that the prediction of Caspase-8 was successful, although variable. However, these models did not use all the markers in the final model, so the relevance of some of the markers cannot be properly assessed.
The Bayesian Network built a probabilistic model and made use of all the markers. Bayesian Networks are very robust where information is missing and make the best possible prediction using whatever information is present. Causal relationships between nodes may be represented, but the links in the network (i.e., arcs) do not necessarily represent direct cause and effect. Logistic regression (i.e., nominal regression) classifies records based on the values of the input fields. It is comparable to linear regression, but the target variable is categorical instead of numeric. This method had the strength of showing which markers were the most relevant for the prediction of Caspase-8, with information on the direction of the association (increase or decrease) and the strength of that association. Neural networks are simple models of the way the nervous system operates. The basic units are neurons, which are typically organized into layers. There are three parts in a neural network: the input, the hidden and the output layers. The network learns through training. Since the output is known, as the training progresses the network becomes increasingly accurate in replicating the known outcomes. Because deep neural networks have a multilayer non-linear structure (i.e., a black-box model), they are criticized as non-transparent, since their predictions are not traceable by humans.
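The three-part structure described above (input, hidden and output layers) can be illustrated with the forward pass of a small multilayer perceptron. This is a minimal sketch under stated assumptions, not the network fitted in this study: the hyperbolic tangent hidden activation and the softmax output are assumed choices, and the network size, weights, biases and input values are all hypothetical.

```python
import math


def softmax(values):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Forward pass: input layer -> one tanh hidden layer -> softmax output layer."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_weights, output_biases)
    ]
    return softmax(scores)


# Hypothetical network: 3 standardized markers, 2 hidden units, 2 output classes.
inputs = [0.5, -1.0, 0.25]
hidden_weights = [[0.4, -0.6, 0.1], [-0.3, 0.8, 0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, -0.7], [-1.2, 0.7]]
output_biases = [0.0, 0.0]

probs = mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases)
predicted = ("low", "high")[probs.index(max(probs))]
print(probs, predicted)
```

Training consists of adjusting the weights and biases so that the predicted class matches the known outcome more often, which is why the individual weights are difficult to interpret on their own.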
In our analysis we could rank the markers according to their normalized importance for Caspase-8 prediction, but the reason for each association remained elusive because the synaptic weights are not directly interpretable. In summary, we used a series of algorithms to create classification models. Each model used the values of the input fields (our markers) to predict the value of one output or target field (Caspase-8 as a dichotomic variable, high vs. low), and the integration of all the information made the results more understandable (explainable). As shown in Table 10, the most relevant markers associated with Caspase-8 were the following: cCASP3, E2F1, TP53, cPARP, MDM2, BCL2 and TNFAIP8. Caspase-3, PARP and BCL2 are known markers closely related to apoptosis. Therefore, it makes sense that they were highly associated with Caspase-8. Nevertheless, some of the markers are also associated with other pathways. MDM2 is a ligase that inhibits the p53- and p73-mediated cell cycle arrest and apoptosis [31]. The p53 protein is a tumor suppressor that also controls the cell cycle and induces apoptosis. The MYC proto-oncogene is a transcription factor that activates the transcription of growth-related genes and promotes angiogenesis. Ki67 has a role in chromatin organization and is a widely used marker of cell proliferation. E2F1 is also involved in the cell cycle. CDK6 is a kinase that also controls the G1/S cell cycle transition and cell differentiation [31]. MYB also controls the cell cycle and cell differentiation. LMO2 is a nuclear marker of normal B lymphocytes of the germinal centers, and DLBCL is thought to develop from these lymphocytes. Finally, TNFAIP8 is a negative regulator of apoptosis and plays a role in tumor progression [31]. In summary, the most relevant markers that we have highlighted belong to the apoptosis and cell cycle control pathways.
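The ranking by normalized importance mentioned above can be sketched as follows, assuming the common convention in which each importance value is divided by the largest importance value and expressed as a percentage. The raw importance values below are hypothetical and are not the values obtained in this study.

```python
def normalize_importance(importance):
    """Scale importance values so that the most important predictor equals 100%."""
    largest = max(importance.values())
    return {name: 100.0 * value / largest for name, value in importance.items()}


# Hypothetical raw importance values (assumption: illustration only).
raw_importance = {"cCASP3": 0.30, "E2F1": 0.15, "TP53": 0.10}

normalized = normalize_importance(raw_importance)
ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
for name, percent in ranked:
    print(f"{name}: {percent:.1f}%")
```

Such a ranking indicates how much the prediction depends on each marker, but not the direction of the association, which is why it was combined with the other models.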
Finally, the Caspase-8 gene expression as a quantitative variable was also analyzed in an independent series of DLBCL of the LLMPP, and the relationship with other genes could also be successfully explored. The most relevant gene was MED29, a component of the Mediator complex that is involved in the regulation of transcription [31]. MED29 has been related to prostate cancer [48].
Future research directions should include analyzing the same markers in larger series of DLBCL to validate our findings. In addition, in vitro or in vivo analyses may also help to clarify the pathological function of Caspase-8 in DLBCL.

Conclusions
In conclusion, high immunohistochemical protein expression of Caspase-8 is associated with a favorable overall survival and progression-free survival of the patients in a series of DLBCL from Tokai University Hospital. The relationship of Caspase-8 with other related markers could also be confirmed by predictive analytics including decision trees, Bayesian network, logistic regression and artificial neural networks. Therefore, the immunohistochemical analysis of Caspase-8 could be implemented in the routine diagnosis of DLBCL as a prognostic marker.

Figure 1. Interactions between Caspase-8 and the Caspase-8-related proteins. The aim of this research is to analyze the role of Caspase-8 in diffuse large B-cell lymphoma, focusing on the investigation of the possible pathological mechanism, the correlations with Caspase-8-related markers and the clinicopathological correlations. This network summarizes the predicted associations of Caspase-8 with the group of pathway-related proteins. The nodes are the proteins and the edges represent the predicted functional associations: action types (activation, binding, inhibition, etc.) and effect types (positive, negative, and unspecified). The basic network (left) contains only the markers (nodes) of this project; the extended network (right) includes additional nodes to provide better information on action types and action effects.

Figure 2. (a) General architecture for the multilayer perceptron artificial neural network. (b) Activation functions for the multilayer perceptron artificial neural network. (c) Error functions for the multilayer perceptron artificial neural network. (d) Notation for the multilayer perceptron artificial neural network.

Figure 3. Immunohistochemical expression in the DLBCL samples of Caspase-8, cleaved Caspase-3, cleaved PARP, MDM2 and BCL2 (Tokai series). Caspase-8 protein is a protease with a key role in programmed cell death (extrinsic apoptosis). Once activated, Caspase-8 cleaves and activates effector caspases such as Caspase-3, which in turn cleaves PARP1. It also regulates necroptosis and innate immunity. MDM2 is a ligase that inhibits the p53- and p73-mediated cell cycle arrest and apoptosis. BCL2 is an apoptosis inhibitor that controls the mitochondrial membrane activity and inhibits caspase activity [31]. By immunohistochemistry, Caspase-8 protein expression was cytoplasmic and perinuclear, with some staining in the nucleus when the protein expression was high. Cleaved Caspase-3, cleaved PARP and MDM2 staining was nuclear. BCL2 expression was mainly cytoplasmic and perinuclear.

Figure 4. Immunohistochemical expression in the DLBCL samples of TP53, MYC, Ki67, E2F1 and CDK6 (Tokai series). P53 is a tumor suppressor that controls the cell cycle and induces apoptosis. The MYC proto-oncogene is a transcription factor that binds DNA and activates the transcription of growth-related genes, promotes angiogenesis and regulates somatic reprogramming. Ki67 plays a key role in cell proliferation, with a role in chromatin organization by keeping the mitotic chromosomes dispersed. E2F1 is a transcription factor involved in cell cycle regulation (progression from G1 to S phase) and DNA replication. E2F1 binds RB1 and can mediate both cell proliferation and p53-dependent apoptosis. CDK6 is a kinase involved in the control of the cell cycle (G1/S transition) and cell differentiation [31]. By immunohistochemistry, all the markers showed nuclear staining. CDK6 also showed cytoplasmic localization.

Figure 6. Overall and progression-free survival according to the Caspase-8 expression by immunohistochemistry (Tokai series, immunohistochemical data). High percentages of Caspase-8 expression were associated with a favorable prognosis of the patients with DLBCL, including both the overall survival and the progression-free survival.

Figure 7. CHAID node decision tree analysis (Tokai series, immunohistochemical data). Chi-squared automatic interaction detection (CHAID) is a classification method for building decision trees that identifies optimal splits by using chi-square statistics. CHAID examines the cross-tabulations between each input field and the outcome, and tests for significance. CHAID can generate non-binary trees (splits of more than two branches). In this analysis we aimed to predict the Caspase-8 expression as low (1) versus high (2), using the same cut-off as for the survival analysis. The Caspase-8 expression could be predicted using cleaved Caspase-3, BCL2 and cleaved PARP. This decision tree highlights the Caspase-8, cCaspase-3, cPARP apoptosis pathway.
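The chi-square statistic that CHAID uses to evaluate a candidate split can be illustrated on a single cross-tabulation. The following minimal Python sketch computes the Pearson chi-square statistic for a contingency table of a binned predictor against the Caspase-8 class. The counts are hypothetical, and the full CHAID procedure (merging of categories and adjusted significance testing) is not reproduced here.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of predictor group and outcome.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts. Rows: predictor below / above a candidate split point.
# Columns: Caspase-8 low, Caspase-8 high.
candidate_split = [[30, 5], [10, 25]]
print(round(chi_square(candidate_split), 2))
```

A larger statistic indicates a stronger association between the candidate split and the Caspase-8 class, so the predictor and split with the most significant result are placed higher in the tree.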

Figure 9. C5.0 node decision tree analysis (Tokai series, immunohistochemical data). The C5.0 algorithm was used to predict the Caspase-8 expression as a categorical target (low versus high, same cut-off as for the survival analysis) from the rest of the markers (predictors). C5.0 models are quite robust in the presence of missing data and large numbers of input fields. C5.0 models also tend to be easier to understand. In this analysis we found that Caspase-8 expression could be predicted by cCaspase-3 and E2F1, highlighting the apoptosis pathway.

Figure 10. Artificial Neural Network analysis for the prediction of Caspase-8 by the Caspase-8-related markers (Tokai series, immunohistochemical data). The neural network model determines how the network connects the predictors (our series of 12 markers, input layer) to the targets (Caspase-8, output layer, as a dichotomic variable high versus low, same cut-off used for the survival analysis) through the hidden layers. The multilayer perceptron (MLP) allows for more complex relationships.

Figure 11. Prediction of CASP8 by 20,683 genes of the LLMPP series and modeling using the top 20 most relevant genes (gene expression data). The DLBCL gene expression data of the GEO dataset GSE10846 of the Lymphoma/Leukemia

Table 1. Clinicopathological characteristics of the DLBCL series of Tokai University Hospital.

Table 2. Clinicopathological characteristics of the DLBCL series of the LLMPP.

Table 3. Correlation between the clinicopathological characteristics of the DLBCL cases and high immunohistochemical expression of Caspase-8 (Tokai series).
*1 Error computations are based on the testing sample. *2 Determined by the testing data criterion: the "best" number of hidden units is the one that yields the smallest error in the testing data.

Table 10. Integrated analysis, ranking of markers according to relevance of Caspase-8 association.