Anomaly Detection and Artificial Intelligence Identified the Pathogenic Role of Apoptosis and RELB Proto-Oncogene, NF-kB Subunit in Diffuse Large B-Cell Lymphoma

Background: Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent lymphomas. DLBCL is phenotypically, genetically, and clinically heterogeneous. Aim: We aimed to identify new prognostic markers. Methods: We performed anomaly detection analysis, other artificial intelligence techniques, and conventional statistics using gene expression data of 414 patients from the Lymphoma/Leukemia Molecular Profiling Project (GSE10846), and immunohistochemistry in 10 reactive tonsils and 30 DLBCL cases. Results: First, an unsupervised anomaly detection analysis pinpointed outliers (anomalies) in the series, and 12 genes were identified: DPM2, TRAPPC1, HYAL2, TRIM35, NUDT18, TMEM219, CHCHD10, IGFBP7, LAMTOR2, ZNF688, UBL7, and RELB, which belonged to the apoptosis, MAPK, MTOR, and NF-kB pathways. Second, these 12 genes were used to predict overall survival using machine learning, artificial neural networks, and conventional statistics. In a multivariate Cox regression analysis, high expressions of HYAL2 and UBL7 were correlated with poor overall survival, whereas TRAPPC1, IGFBP7, and RELB were correlated with good overall survival (p < 0.01). As a single marker and only in RCHOP-like treated cases, the prognostic value of RELB was confirmed using GSEA analysis and Kaplan–Meier with log-rank test and validated in the TCGA and GSE57611 datasets. Anomaly detection analysis was successfully tested in the GSE31312 and GSE117556 datasets. Using immunohistochemistry, RELB was positive in B-lymphocytes and macrophage/dendritic-like cells, and correlation with HLA DP-DR, SIRPA, CD85A (LILRB3), PD-L1, MARCO, and TOX was explored. Conclusions: Anomaly detection and other bioinformatic techniques successfully predicted the prognosis of DLBCL, and high RELB was associated with a favorable prognosis.


Introduction
This study aimed to identify new prognostic markers of diffuse large B-cell lymphoma (DLBCL) using anomaly detection analysis. By identifying outlier cases, the genes associated with those unusual cases were identified, and their prognostic value was assessed.

Clinicopathological Characteristics and Prognosis of Diffuse Large B-Cell Lymphoma
The classification of hematologic malignancies integrates data from several sources, including pathologic characteristics, pathophysiology, treatment, and outcomes. The current classification is the World Health Organization (WHO) revised 4th edition (WHO4R) [1], which has recently been updated into the International Consensus Classification 2022 (ICC2022) [1][2][3][4], and the proposed 5th edition of the World Health Organization Classification of Haematolymphoid Tumours: Lymphoid Neoplasms (WHO5) [5]. In this classification, mature B-cell neoplasms are hematological cancers originating from lymphocytes of the B-cell lineage.
These neoplasms are classified according to several parameters, such as morphological characteristics, architectural distribution of the neoplastic cells, immunophenotypic markers, genetic alterations, and clinical features of the patients [6][7][8][9][10][11][12][13]. They are classified into different subtypes based in part on the postulated cell of origin.
DLBCL is one of the most frequent histological subtypes of hematological neoplasia, accounting for approximately 25-30% of non-Hodgkin lymphomas.
The incidence of DLBCL in the United States and the United Kingdom is approximately 7 cases per 100,000 people per year. In Europe, there are 5 cases per 100,000 people per year [14][15][16]. Interestingly, the incidence differs according to ethnicity. White Americans have a higher incidence than Black, Asian, and Native American individuals [14,15,17].
DLBCL originates from mature B cells that have the histological appearance of centroblasts or immunoblasts, which are two types of activated B cells. The histological appearance of DLBCL is variable because of the heterogeneity of the morphological characteristics of the neoplastic B lymphocytes and the tumor-immune microenvironment. This heterogeneity is shown in Figure 1.
Clinically, most patients present with a rapidly growing mass located in the lymph nodes or abdomen. In approximately 60% of cases, the disease presents at an advanced stage. Two subtypes of DLBCL have been identified on the basis of gene expression and the postulated cell-of-origin: the germinal center B cell type (GCB) and the activated B cell type (ABC) [2][33][34][35].
Clinical evolution is currently predicted using the International Prognostic Index (IPI) and its variants, the revised IPI and the National Comprehensive Cancer Network (NCCN)-IPI [33][34][35]. The IPI uses as unfavorable predictors an age > 60 years, serum lactate dehydrogenase concentration above normal, ECOG performance status ≥ 2, Ann Arbor stage III or IV, and number of extranodal disease sites > 1 [34]. Gene expression profiling also stratifies patients into two prognostic groups, with the activated B cell type associated with a poorer prognosis. Integration with other genetic factors, such as the presence of BCL2, MYC, and BCL6 translocations; copy number changes and loss of heterozygosity (LOH); and mutational profiling, has allowed the identification of different genetic subtypes (MCD, BN2, EZB, ST2, A53, and N1) [36]. Interestingly, these subtypes also showed different gene signatures, including malignant processes (proliferation signature and MYC, ribosomal proteins, glycolipid pathways), B cell differentiation, transcription factors (IRF4, BCL6, OCT2, and TCF3), oncogenic signaling (NFKB, p53, NOTCH, PI3K, and JAK2), and immune microenvironment (T follicular helper cells, CD4 T helper cells, CD8 cytotoxic T lymphocytes, regulatory T lymphocytes, natural killer cells, macrophages, dendritic cells, and fibrosis) [36].
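The five unfavorable IPI predictors listed above can be counted in a few lines. The following Python sketch is only illustrative: the `Patient` fields and the example values are hypothetical, and the mapping from score to risk group is the commonly cited grouping for the original IPI, which is an assumption because it is not stated in this text.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical field names, chosen for this illustration only.
    age_years: int
    ldh_above_normal: bool
    ecog_status: int
    ann_arbor_stage: int  # 1 to 4
    extranodal_sites: int


def ipi_score(p: Patient) -> int:
    """Count the unfavorable predictors described in the text (0 to 5)."""
    factors = [
        p.age_years > 60,
        p.ldh_above_normal,
        p.ecog_status >= 2,
        p.ann_arbor_stage >= 3,   # stage III or IV
        p.extranodal_sites > 1,
    ]
    return sum(factors)


def ipi_risk_group(score: int) -> str:
    # Assumed grouping (commonly cited for the original IPI, not given in this text).
    if score <= 1:
        return "low"
    if score == 2:
        return "low-intermediate"
    if score == 3:
        return "high-intermediate"
    return "high"


example = Patient(age_years=67, ldh_above_normal=True, ecog_status=1,
                  ann_arbor_stage=4, extranodal_sites=1)
score = ipi_score(example)
print(score, ipi_risk_group(score))
```

The revised IPI and the NCCN-IPI weight and group these factors differently, so this sketch should not be read as an implementation of either variant.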

Machine Learning and Anomaly Detection

Machine Learning
Machine learning can be defined as an analytic method that uses data and algorithms to emulate human learning and gradually improve accuracy [37]. It is a branch of artificial intelligence (AI) that uses statistical methods and algorithms to make classifications and predictions [37]. Neural networks are a subfield of machine learning, and deep learning is a subfield of neural networks [38].
A machine learning algorithm has three components: the decision process makes predictions or classifications; the error function evaluates the prediction of the model; and the model optimization process adjusts the weights autonomously to improve the performance of the model [38].
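As a minimal sketch of these three components, the following Python example fits a one-variable linear model by gradient descent. The toy data, the learning rate, and the function names are assumptions made for illustration; the models used in this study are far more elaborate.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def predict(w, b, x):
    """Decision process: turn an input into a prediction."""
    return w * x + b


def mean_squared_error(w, b):
    """Error function: evaluate how far the predictions are from the targets."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, learning_rate=0.05):
    """Optimization process: adjust the weights to reduce the error."""
    n = len(xs)
    grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


w, b = 0.0, 0.0
initial_error = mean_squared_error(w, b)
for _ in range(2000):
    w, b = gradient_step(w, b)
final_error = mean_squared_error(w, b)
print(w, b, initial_error, final_error)
```

After training, the weights approach the values used to generate the toy data, and the error is much lower than at the start.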
There are three main types of learning: supervised, unsupervised, and reinforcement learning. Supervised learning uses labeled datasets to make classifications, predictions, and regression. Unsupervised learning uses unlabeled datasets to identify patterns that are not readily apparent and to classify cases [39]. Reinforcement learning is an area of machine learning that handles sequential decision-making problems in a situation of uncertainty. Reinforcement learning learns to optimize sequential decisions by finding the best strategy [40].
Machine learning is an area of artificial intelligence that fits mathematical models to observed data. Machine learning can be broadly divided into supervised learning, unsupervised learning, and reinforcement learning (Figure 2). Deep neural networks contribute to each of these areas. The type of analysis to be performed depends on the type of data and the aim of the study [41].
In this study, anomaly detection was used to identify anomalies (rare events) in the dataset. A model was constructed from the input data (gene expression) without corresponding labels (i.e., "no supervision"). Rather than learning a mapping from input to output, the goal is to describe or understand the structure of the data. Subsequently, supervised learning was used to predict the overall survival outcome (dead vs. alive).
Different types of machine learning methods, including supervised, unsupervised, and reinforcement learning, are shown in Figure 3.

Segmentation Analysis
Segmentation is the technique of splitting cases into different groups depending on their characteristics. There are several segmentation methods, such as K-Means, Kohonen, TwoStep cluster, TwoStep-AS, and anomaly detection.
K-Means is a type of clustering analysis that is unsupervised because there is no definition of the target variable (field). The dataset is clustered into different groups to search for patterns in the input data. Within a cluster, the cases are similar to each other, but the characteristics differ between clusters. The centers of the clusters are estimated from the data, and the cases are assigned to the most similar cluster based on the input variables. Of note, the order of the data may affect the clustering output [42].
Kohonen clustering analysis is also known as knet or self-organizing map (SOM). It is a type of neural network that performs unsupervised clustering. Within a group, the cases are similar to each other and different from those of other clusters. The basic unit of the neural network is the neuron. The network architecture organizes neurons into input and output layers. All input neurons connect to output neurons, and the connections have a weight (w), which is also known as strength. The output is a map of a two-dimensional grid in which the units have no connections [43,44].
Figure 3. Types of machine learning methods for predictive data analysis. In addition to anomaly detection analysis, there are many other types of machine learning that can be classified as supervised (A), unsupervised (B), and reinforcement learning (C). Of note, this figure includes methods usually used in predictive data analysis, but it does not focus on deep learning and reinforcement learning (please refer to the documentation of popular deep learning frameworks such as TensorFlow, Keras, and PyTorch).
Images of K-Means clustering (left), Kohonen clustering analysis (middle), and anomaly detection (right) are shown in Figure 4.
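The K-Means procedure described above (estimate the cluster centers, then assign each case to the most similar cluster) can be sketched in plain Python. This is a simplified illustration, not the implementation used in the study: the points are hypothetical, and the choice of the first k points as initial centers is an assumption that also shows why the order of the data can affect the output.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    # Assumption: the first k points are the initial centers,
    # so reordering the data can change the result.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each case joins the cluster with the closest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical two-dimensional cases forming two visible groups.
points = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```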
The TwoStep cluster is also an unsupervised method. As in the K-Means and Kohonen methods, the cases are grouped in clusters with similar characteristics, whereas differences are observed between clusters. The method follows two steps. First, the raw input data are compressed into different subclusters. Second, a hierarchical clustering method joins the subclusters into larger clusters. Of note, this method is sensitive to the order of the training data. The TwoStep cluster method can handle mixed types of variables and large datasets, and it can exclude outliers. However, it cannot handle missing data.

Anomaly Detection Analysis
An anomaly is a data point or collection of data that does not follow the same pattern or have the same structure as the rest of the data [45]. Anomaly detection is a machine learning method that identifies data points, events, and/or observations that deviate from a dataset's ordinary distribution [46]. In other words, anomaly detection is a technique that allows the identification of rare events that do not fit normal patterns. Examples of applications of this technique can be found in the following link: https://paperswithcode.com/task/anomaly-detection (accessed on 18 October 2023).
The anomaly detection procedure is designed to quickly detect unusual cases for data-auditing purposes in the exploratory data analysis step before any inferential data analysis. It searches for unusual cases and can be useful for detecting outliers within a large amount of data. The algorithm is designed for generic anomaly detection, which means that the definition of an anomalous case is not specific to any particular application [47]. Therefore, it can identify outliers even if they do not follow any known pattern. This method analyzes several variables to identify clusters that include cases with similar characteristics. Later, each record is compared with the others of its peer group to identify the anomalies. An anomaly index is assigned to each record. The higher the anomaly index, the greater the deviation of a particular case from the average. An index above 2 is a good cutoff for identifying anomalies because it indicates a deviation twice the average. Of note, the identified cases should be treated as suspected anomalies because, after close analysis, they may or may not turn out to be true outliers. The algorithm is divided into three stages: modeling, scoring, and reasoning.
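The idea of an anomaly index as a ratio to the peer-group average can be sketched as follows. This is a deliberately simplified stand-in and not the algorithm of the software used in this study: it assumes the peer groups are already known, measures the deviation of a record as the sum of squared z-scores against its peer-group norms, and divides by the average deviation in that peer group. All data and names are hypothetical.

```python
from statistics import mean, pstdev


def anomaly_indices(records, peer_groups):
    """Return one index per record: its deviation / average deviation of its peer group."""
    members = {}
    for rec, group in zip(records, peer_groups):
        members.setdefault(group, []).append(rec)

    # Peer-group norms: mean and standard deviation of every variable.
    norms = {}
    for group, recs in members.items():
        columns = list(zip(*recs))
        means = [mean(col) for col in columns]
        # Guard against zero spread so the z-score stays defined.
        sds = [pstdev(col) or 1.0 for col in columns]
        norms[group] = (means, sds)

    def deviation(rec, group):
        means, sds = norms[group]
        return sum(((x - m) / s) ** 2 for x, m, s in zip(rec, means, sds))

    deviations = [deviation(r, g) for r, g in zip(records, peer_groups)]
    group_average = {
        group: mean([d for d, g in zip(deviations, peer_groups) if g == group])
        for group in members
    }
    return [
        d / group_average[g] if group_average[g] > 0 else 0.0
        for d, g in zip(deviations, peer_groups)
    ]


# Hypothetical expression values for two genes in six cases of one peer group.
records = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.0, 2.0), (1.1, 1.9), (3.0, 4.0)]
peer_groups = ["A"] * 6
indices = anomaly_indices(records, peer_groups)
suspected = [i for i, value in enumerate(indices) if value > 2]
print(indices)
print(suspected)
```

With this definition the average index within a peer group is 1, so a value above 2 corresponds to a deviation of more than twice the average, which matches the cutoff discussed above.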

Aim
This study aimed to identify new prognostic markers of DLBCL using anomaly detection analysis. By identifying outlier cases, the genes associated with those unusual cases were identified. Then, the prognostic value of the identified genes was evaluated in all cases of the series using other techniques, including several machine learning and artificial neural networks, and conventional biostatistics, such as Cox regression and Kaplan-Meier with log-rank test (Figure 5).


Materials and Methods
The gene expression of DLBCL is an important source of data for identifying prognostic markers. This study analyzed the gene expression of one of the most relevant DLBCL gene expression datasets of the Lymphoma/Leukemia Molecular Profiling Project (LLMPP). The dataset was GSE10846, which is a retrospective study of 414 DLBCL cases [48,49]. The last update of this dataset was 25 March 2019.
GSE10846 is a clinically well-characterized series of DLBCL. Despite being some years old, it serves the purpose of this research because we are looking for genes associated with the pathogenesis of DLBCL. Of note, to test the predictive value of one of the most relevant genes, only RCHOP-like cases were used.
A complete description of the clinicopathological characteristics of this series is presented in our previous publication that analyzed CSF1R expression [50]. In summary, 55% of the cases were male and aged > 60 years, NCCN-IPI was high-intermediate and high risk in 35.8%, the cell-of-origin molecular subtype was activated B cell type and unclassified in 45.8%, and the treatment was RCHOP-like in 56.3% of the cases.

The method used was anomaly detection analysis, which is a model designed to identify outliers in the gene expression data. This method is unsupervised. While traditional methods usually look into a few variables at the same time (one or two), the anomaly detection method can examine several fields (genes). The variables are analyzed to find clusters or peer groups that are similar. Each record can then be compared with others in its peer group to identify possible anomalies. The further away a case is from the typical center, the more likely it is to be abnormal. The anomaly detection algorithm is presented in the Zenodo repository [51].
The GSE10846 data were downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) public functional genomics data repository (https://www.ncbi.nlm.nih.gov/gds; accessed on 15 February 2024). The gene expression array used in this series was the GPL570, Affymetrix Human Genome U133 Plus 2.0 Array (HG-U133_Plus_2). The data were normalized and log2 transformed [50]. The series comprises 420 cases: 414 cases of DLBCL and 6 cases of reactive lymphoid tissue. The series contains 20,684 genes. When a gene had multiple probes, the gene expression values were collapsed to one value per gene using the collapse-to-maximum-expression function [50]. The output identified case anomalies and the most relevant genes that contributed to them.
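The log2 transformation and the collapse of multiple probes to one value per gene can be illustrated with a short Python sketch. It assumes that the maximum is taken per sample across the probes of a gene, which is one common reading of "collapse to the maximum expression"; the function name and the intensity values are hypothetical.

```python
import math


def collapse_to_max(probe_rows):
    """probe_rows: iterable of (gene_symbol, [value per sample]).
    Returns a dict mapping each gene to its per-sample maximum across probes."""
    collapsed = {}
    for gene, values in probe_rows:
        if gene not in collapsed:
            collapsed[gene] = list(values)
        else:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
    return collapsed


# Hypothetical raw intensities: two probes for RELB, one for HYAL2, two samples.
raw_rows = [
    ("RELB", [120.0, 64.0]),
    ("RELB", [90.0, 256.0]),
    ("HYAL2", [32.0, 16.0]),
]

# log2 transform every value, then collapse probes to genes.
log_rows = [(gene, [math.log2(v) for v in values]) for gene, values in raw_rows]
collapsed = collapse_to_max(log_rows)
print(collapsed)
```

Because log2 is monotonic, taking the maximum before or after the transformation selects the same probe value for each sample.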
Further analysis consisted of several machine learning and artificial neural network methods, as we recently published [52][53][54][55][56][57]. Finally, a conventional Cox regression for overall survival (backward conditional method) was performed using the same set of genes to easily understand the prognostic value of these markers. Table 1 describes the basics of the machine learning and neural network analyses used in this study [58].

| Model | Description |
|---|---|
| Anomaly detection | Method that quickly looks for unusual cases based on deviations from the norms of their cluster groups [51]. |
| Bayesian Network | Creates a graphical model that shows variables (nodes) linked using arcs. Probabilistic independencies between nodes are displayed. The arcs do not necessarily represent cause and effect [52,53,55,61]. |
| C5.0 | Builds a decision tree. It splits the samples on the basis of the variable that provides more information and has more weight. Then, multiple splits are made based on other variables until the cases cannot be further divided. Finally, splits with few contributions to the model are removed. This model can only predict a categorical target [58,62]. |
| C&R Tree | The classification and regression (C&R) tree is similar to the C5.0 method. All splits are binary [63]. |
| CHAID | Chi-squared Automatic Interaction Detection (CHAID) creates decision trees using calculations based on the chi-square test. Crosstabulations between the input variables and the output are examined, and the variables are ranked according to their significance for selection in the tree model [64][65][66][67][68]. |
| KNN Algorithm | Nearest Neighbor Analysis classifies cases based on their similarity to other cases. This method identifies the pattern of the data [71]. |
| Logistic regression | Also known as nominal regression, it is a method that classifies records based on predictors in a manner similar to linear regression but with a categorical target variable. |
| LSVM | The data are classified on the basis of a linear support vector machine. This method is useful for large datasets with many variables [72,73]. |
| Neural Network | Basic units, known as neurons, are organized into different layers. The input layer contains nodes with input variables (predictors). The output layer contains nodes with the target fields. Nodes are interconnected by different strengths (weights). The number of hidden layers defines the depth of the network. Using training, the weights are changed from random to optimized, and the network replicates the known outcomes [74][75][76][77][78][79]. |
| Quest | Quick, Unbiased, Efficient Statistical (QUEST) tree creates a binary classification method. All splits are binary. |
| Random Forest | This is an implementation of the bagging algorithm. A collection of decision trees is used to make predictions [80][81][82]. |
| Random Trees | It is based on the C&R methodology and uses recursive partitioning to split records into segments with similar outputs [83]. |
| SVM | A support vector machine (SVM) is suitable when the dataset contains a very large number of predictors. It is a solid classification and regression technique that does not overfit the training data [84,85]. |
| Tree-AS | This method creates a decision tree using CHAID or exhaustive CHAID, which is more time-consuming [52,53,57]. |
| XGBoost Linear | Implementation of a gradient boosting algorithm with a linear model as the base [86]. |

Additional descriptions of machine learning and neural network models are presented in the companion manuscript "Artificial Intelligence Analysis and Reverse Engineering of Molecular Subtypes of Diffuse Large B-Cell Lymphoma Using Gene Expression Data". BioMedInformatics 2024, 4, 295-320. https://doi.org/10.3390/biomedinformatics4010017 (accessed on 15 February 2024) [58].
All analyses were performed on a desktop equipped with an AMD Ryzen 9 5900X, an NVIDIA GeForce RTX 3060 Ti GPU, and 16 GB of RAM. Conventional statistics were calculated using IBM SPSS version 27.0.1.0, 64-bit edition (IBM Corporation, Orchard Rd, Armonk, NY 10504, USA).
Anomaly detection analysis was also performed using other series to confirm that the method is applicable. The GSE31312 and GSE117556 datasets were used, which have 498 and 928 cases of DLBCL, respectively.
The gene expression analysis of RELB was also performed in TCGA (n = 267) and GSE57611 (n = 30).

Anomaly Detection Analysis
The anomaly detection analysis using GSE10846 ranked the cases according to the anomaly index, which ranged from 0.813 to 1.763 (Supplementary Excel File). Of note, cases with anomaly index values of less than 1 or even 1.5 would not be considered anomalies. The distribution of anomaly index values is shown in Figure 6.

Figure 6. Anomaly detection is an unsupervised method that examines large numbers of variables to identify clusters or peer groups. Then, each record is compared to others in its peer group to identify possible anomalies. Each record (blue circle) is assigned an anomaly index. A higher index implies a greater deviation of the case from the average. In the setup, several options can be specified, such as the adjustment coefficient, number of peer groups, noise level, and noise ratio.
The model also identified the 12 genes that contributed to anomaly detection: DPM2, TRAPPC1, HYAL2, TRIM35, NUDT18, TMEM219, CHCHD10, IGFBP7, LAMTOR2, ZNF688, UBL7, and RELB (Table 2).
The anomaly detection methodology was also tested in another series of DLBCL, GSE31312. This is a series of 498 de novo adult DLBCL cases treated with RCHOP. Gene expression profiling was performed using the Affymetrix HG-U133 Plus 2.0 platform. The last update was 3 August 2020. Anomaly detection classified the series into two peer groups of 315 and 183 cases. In the peer group of 183 cases, 12 genes also contributed, but they were different (NACA4P, DAZAP2, RSP28, RPS7, TSPOAP1_AS1, MT_ND5, MIR142, MGC16275, SHOC2, CALM1, GLUL, and SIT29). Therefore, the anomaly detection method can be applied to series other than GSE10846. Of note, because of the intrinsic heterogeneity of DLBCL, including the characteristics of unusual cases (anomalies), the two series provided different results. This difference does not indicate a failure of the method. Other methods, such as artificial neural networks, can also provide different results in each analysis due to different factors, including the random number generator.
Anomaly detection was also performed using the GSE117556 dataset. This dataset belongs to a retrospective analysis of whole transcriptome data for 928 DLBCL patients from the REMoDL-B clinical trial. The platform was an Illumina HumanHT-12 WG-DASL V4.0 R2 expression beadchip [105]. RNA was extracted from formalin-fixed, paraffin-embedded (FFPE) biopsies. The method classified the series into two peer groups of 661 and 267 records. In the second peer group, 27 genes contributed.

Prediction of Overall Survival Using Machine Learning and Artificial Neural Networks Based on 12 Genes
The 12 genes previously identified in the anomaly detection analysis were used as predictors (inputs) of the prognosis of patients with DLBCL in the GSE10846 series. The prognosis was defined by the outcome of overall survival (output variable, dead versus alive). Several machine learning models and artificial neural networks were tested, including the C5.0 decision tree, logistic regression, Bayesian network, discriminant analysis, KNN algorithm, LSVM, random trees, SVM, Tree-AS, XGBoost linear, XGBoost tree, CHAID tree, Quest tree, C&R tree, random forest, and neural network.
The models were ranked according to overall accuracy (%), and the best models were the XGBoost tree, random forest, and C5 tree (Table 3).
Of note, the analysis was performed in all cases, in CHOP-like cases, and in RCHOP-like cases. Performance was assessed using the overall accuracy, which is the percentage of records for which the outcome was correctly predicted. The formula is shown in Appendix B.
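The definition of overall accuracy given above translates directly into code. The following Python sketch uses hypothetical observed and predicted outcomes; it follows the verbal definition in the text and is not copied from Appendix B.

```python
def overall_accuracy(observed, predicted):
    """Percentage of records for which the outcome was correctly predicted."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical outcomes for five patients.
observed = ["dead", "alive", "alive", "dead", "alive"]
predicted = ["dead", "alive", "dead", "dead", "alive"]
accuracy = overall_accuracy(observed, predicted)
print(accuracy)  # 80.0
```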

Cox Regression Analysis of Overall Survival Using the 12 Genes
The 12 genes were used as predictors of overall survival using conventional Cox regression analysis in the GSE10846 series. The method was backward conditional. In the last step (n = 8), only five genes retained significant values. In this model, TRAPPC1, IGFBP7, and RELB were associated with a favorable prognosis, and HYAL2 and UBL7 were associated with a poor prognosis (Table 4). When MYC and BCL2 were added to the equation, the Cox regression analysis only kept MYC as a significant predictor (p value = 0.008, HR = 1.280, 95% CI 1.066-1.536), in addition to the five genes, which had values similar to those in Table 4.
Similar results were found when NCCN-IPI was added to the equation with the five genes. NCCN-IPI was also significant (p value < 0.001, HR = 2.438, 95% CI = 1.713-3.469), as were the five genes.
In this model, the molecular subtypes of GCB and ABC had no predictive value when combined with the five genes.
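The hazard ratios and confidence intervals reported for MYC and NCCN-IPI can be related to each other with a short worked calculation. The sketch below assumes Wald-type 95% confidence intervals and a normal approximation for the test statistic, which is a common default but is not stated in the text; the helper name `wald_summary` is hypothetical. Under that assumption, the hazard ratio should lie close to the geometric midpoint of its interval, and the standard error on the log scale can be recovered from the interval width.

```python
import math


def wald_summary(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover log-scale quantities from a reported HR and its 95% CI.

    Assumes a Wald interval: exp(beta +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    # exp() of the interval midpoint on the log scale
    midpoint = math.sqrt(ci_low * ci_high)
    z = math.log(hr) / se
    # Two-sided p value from the standard normal distribution
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return {"se": se, "midpoint": midpoint, "z": z, "p": p_two_sided}


# Values as reported in the text
myc = wald_summary(1.280, 1.066, 1.536)
ipi = wald_summary(2.438, 1.713, 3.469)

for name, res in (("MYC", myc), ("NCCN-IPI", ipi)):
    print(f"{name}: midpoint={res['midpoint']:.3f}, SE={res['se']:.3f}, "
          f"z={res['z']:.2f}, p={res['p']:.4g}")
```

If the intervals were computed differently (for example, by profile likelihood), small discrepancies between the recovered and the reported values would be expected.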
Finally, the prognostic value of RELB as a single variable was tested using survival analysis with Kaplan-Meier and log-rank tests. In the DLBCL cases treated with RCHOP-like therapy, high RELB expression was associated with a favorable prognosis (Figure 7).
The prognostic value of RELB was evaluated in other series of patients. In TCGA and GSE57611, high RELB gene expression was associated with favorable overall survival (hazard ratios 0.45 and 0.1645; p values 0.0018 and 0.0171, respectively) (Appendix A, Figure A1).
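For readers less familiar with the Kaplan-Meier method used for the RELB survival curves, the following minimal sketch shows the product-limit estimator on hypothetical follow-up data. The times, the event indicators, and the function name `kaplan_meier` are assumptions for illustration only; the study itself relied on standard statistical software, and the log-rank comparison between groups is not reproduced here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time of each patient
    events : 1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set
        n_at_risk -= removed
    return curve


# Hypothetical toy cohort (years of follow-up); not study data.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

In practice, one curve is estimated for each group (for example, high versus low RELB expression), and the log-rank test assesses whether the curves differ.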

Validation of the Predictive Value of RELB for Overall Survival of Patients Using Gene Set Enrichment Analysis (RCHOP-Treated Cases)
The predictive value of RELB in DLBCL was assessed using gene set enrichment analysis (GSEA) in the RCHOP-treated cases of the LLMPP series. GSEA is a computational method that determines whether an a priori-defined set of genes shows statistically significant, concordant differences between two biological states (e.g., phenotypes) [106–108]. In this study, the phenotypes were the overall survival outcomes, dead and alive. The a priori set of genes was the RELB pathway. To define the RELB pathway, the STRING platform was used. STRING is a tool for protein-protein interaction network and functional enrichment analysis [109,110]. A functional network association analysis was performed using RELB as the hub gene to design the RELB pathway (1st shell ≤ 20 interactions; 2nd shell ≤ 5 interactions; confidence as the meaning of network edges). The network had 26 nodes and 227 edges, with an average node degree of 17.5, an average local clustering coefficient of 0.865, and a protein-protein interaction enrichment p value < 0.001 (Figure 8A).
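The reported average node degree follows directly from the number of nodes and edges, because each undirected edge contributes to the degree of two nodes. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the network density is an additional descriptive quantity that is not reported in the text.

```python
def average_node_degree(n_nodes, n_edges):
    """Average degree of an undirected network: 2E / N."""
    return 2 * n_edges / n_nodes


def network_density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


# Node and edge counts as reported for the RELB network
avg_degree = average_node_degree(26, 227)
density = network_density(26, 227)
print(f"Average node degree: {avg_degree:.1f}")
print(f"Density: {density:.2f}")
```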
Later, the genes of the RELB network were used as a pathway for the GSEA, and the results showed enrichment toward the alive phenotype (Figure 8B). Therefore, the RELB pathway was associated with a favorable overall survival of the DLBCL patients, as identified in our previous analyses of machine learning and conventional biostatistics. In the core enrichment of the GSEA plot, 13 genes were identified, with RELB in the third position (Figure 8, Table 5).
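The idea behind the GSEA enrichment score can be sketched as a running sum over a ranked gene list: the sum increases when a gene belongs to the gene set and decreases otherwise, and the score is the largest deviation from zero. The example below is a simplified, unweighted variant written for illustration; the GSEA software additionally weights hits by their correlation with the phenotype and estimates significance by permutation. The gene names G1 to G6 are placeholders, not genes from the RELB network.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified GSEA statistic)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the ranked list must contain both hits and misses")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise
        running += 1 / n_hits if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Placeholder ranking, ordered from one phenotype to the other
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
es_top = enrichment_score(ranked, {"G1", "G2"})
es_bottom = enrichment_score(ranked, {"G5", "G6"})
print(es_top, es_bottom)
```

A positive score indicates that the gene set is concentrated at the top of the ranking, and a negative score that it is concentrated at the bottom.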

Immunohistochemical Analysis of RELB and Immune Microenvironment
The histological protein expression of RELB was analyzed by immunohistochemistry in 10 reactive tonsils (i.e., reactive tissue control) and 30 cases of DLBCL NOS. In reactive tonsils, the expression of RELB was mainly located in the germinal centers of reactive follicles. There, two types of intensity were identified: strong in macrophages/dendritic cells and weak in the B lymphocytes. In DLBCL NOS, the expression was heterogeneous, and four patterns were identified: 0 (negative), 1+ (weak), 2+ (moderate), and 3+ (strong). In DLBCL, the positive cells were heterogeneous when the staining was moderate/strong, with a mixture of B-cell and macrophage/dendritic cell-like staining. Additional markers were included in the panel to investigate the expression of macrophage-related immune microenvironment markers, including HLA DP-DR, SIRPA, CD85A, PD-L1, MARCO, and TOX (TOX1). In summary, the expression of RELB partially correlated with macrophage/dendritic cell markers but was also present in the B-lymphocytes (Figures 9 and 10).

Discussion
Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent histological subtypes of non-Hodgkin lymphomas, accounting for approximately 20-30% of cases. DLBCL is a heterogeneous diagnostic category with heterogeneous morphological, genetic, and clinical characteristics. The current classification dates back to 2017 with the revised 4th edition [1], and several subtypes were defined, including T cell/histiocyte-rich large B cell lymphoma, the primary mediastinal large B cell lymphoma, intravascular B cell lymphoma, the primary DLBCL of the central nervous system, the primary cutaneous DLBCL, leg type, and EBV-positive DLBCL not-otherwise-specified (NOS) [1]. An important subtype is high-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements, which in some cases had previously been called Burkitt-like lymphoma [1,2]. In this study, our diagnostic category was diffuse large B-cell lymphoma not otherwise specified.
The molecular pathogenesis of DLBCL includes a complex and multistage pathological mechanism that results in the proliferation of a germinal center or postgerminal center B cell clone. One of the best characterized molecular changes is the acquisition of rearrangements of BCL6, BCL2, and MYC.
The MYC proto-oncogene is a transcription factor that binds to DNA nonspecifically yet recognizes the 5′-CAC[GA]TG-3′ sequence [111,112]. MYC activates the transcription of several genes that have tumor-promoting functions [111,112]. In DLBCL, MYC gene rearrangement occurs in approximately 10% of cases [113], and in 80% of translocation-positive cases, the partner is the IGH locus. The presence of MYC rearrangement, copy-number gain (amplification), and/or overexpression is associated with poor prognosis [113–115]. Despite the importance of MYC in DLBCL pathogenesis, most cases are MYC rearrangement negative. In our series, the REL high group was characterized by a lower frequency of MYC translocation: REL high vs. low, 11.5% vs. 88.5% (p = 0.009).
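The bracket notation in the 5′-CAC[GA]TG-3′ motif means that the fourth position can be either G or A, so the motif covers the two hexamers CACGTG and CACATG. As a small illustration of how such a motif can be located in a sequence, the sketch below uses a regular expression on a made-up DNA string; the sequence and the function name `find_myc_motifs` are hypothetical, and only the forward strand is scanned.

```python
import re

# 5'-CAC[GA]TG-3': position 4 may be G or A
MYC_MOTIF = re.compile(r"CAC[GA]TG")


def find_myc_motifs(sequence):
    """Return (0-based start, matched hexamer) for each motif occurrence."""
    return [(m.start(), m.group()) for m in MYC_MOTIF.finditer(sequence.upper())]


# Made-up sequence for illustration only
toy_sequence = "ttCACGTGaaCACATGcc"
matches = find_myc_motifs(toy_sequence)
print(matches)
```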
Using a novel analysis approach, we identified 12 genes with prognostic value in DLBCL: DPM2, TRAPPC1, HYAL2, TRIM35, NUDT18, TMEM219, CHCHD10, IGFBP7, LAMTOR2, ZNF688, UBL7, and RELB. The functions and biological relevance of these genes are shown in Table 1. Most of these genes have multiple functions, but a proportion of them are related to the control of apoptosis, such as HYAL2, TRIM35, TMEM219, CHCHD10, and UBL7. In DLBCL, the dysregulation of the apoptosis pathway is an important pathogenic mechanism. In up to 30% of DLBCL cases, especially in the germinal center B cell-like subtype, there is BCL2 overexpression. BCL2 is an oncogene that inhibits apoptosis and leads to the enhanced survival of tumor cells [116].
We also identified a marker of the NF-kappa-B pathway, the RELB proto-oncogene, NF-KB subunit. NF-kappa-B is a pleiotropic transcription factor involved in many biological processes, such as inflammation, immunity, differentiation, cell growth, tumorigenesis, and apoptosis. It is a pathogenic marker of DLBCL [103,104]. We found that the high expression of RELB was associated with a favorable prognosis. Our results are consistent with previously reported data in DLBCL [117,118].
The work of Chi Young Ok et al. [118] is of special interest. This study analyzed a large cohort of 533 cases of de novo DLBCL, and the gene and protein expression of the five NF-KB pathway subunits (p50, p52, p65, RELB, and c-Rel) was assessed. All subunits were expressed by GCB and ABC DLBCL, but there were differences between the two cell-of-origin subtypes. The expression of p52/RELB was associated with improved OS and PFS. When cases were stratified into GCB and ABC, p52 or p52/RELB expression status was associated with better OS and PFS only within the GCB subtype.
NF-KB signaling is an important regulator of apoptosis. Several genetic alterations and other mechanisms activate the NF-KB pathway. The constitutive activation of the NF-KB pathway contributes to cancer development, progression, and therapy resistance [119]. NF-KB signaling is categorized as canonical or noncanonical.
The canonical pathway is activated by C-like receptors 4, the TNF receptor family, and the antigen receptors BCR and TCR, whereas the noncanonical pathway is activated by other receptors, such as BAFF-R, CD40, RANK, CD30, and LTβ-R [119]. The canonical pathway includes SYK, BTK, CARD11/MALT1/BCL10, and RELA. Target genes are related to survival, anti-apoptosis, cell proliferation, inflammation, and innate immunity.
This study and the use of the anomaly detection technique have limitations. This study focused on the GSE10846 dataset. This series was made public on 28 November 2008 and was last updated on 25 March 2019. Therefore, it is a relatively old series of DLBCL cases. This retrospective study included 181 clinical samples from CHOP-treated patients and 233 samples from Rituximab-CHOP-treated patients. The array used was Affymetrix U133 plus 2.0. Currently, there is newer technology to assess gene expression data, such as Clariom assays from Thermo Fisher Scientific and next-generation sequencing (RNA-Seq), which reveals the presence and quantity of RNA molecules in biological samples. Therefore, this study used a series with relatively old technology, and approximately half of the patients had received CHOP therapy. However, this series was created by the Lymphoma/Leukemia Molecular Profiling Project (LLMPP). It is very well annotated, and the clinicopathological characteristics of the samples are complete and reliable. The analysis was first performed using all 414 cases but was later repeated using only the R-CHOP cases. For example, Figure 7 shows the overall survival of patients based on RELB expression only in RCHOP-like cases, and the prognostic relevance of RELB was maintained.
The anomaly detection procedure searches for unusual cases based on deviations from the norms of their cluster groups [47]. This procedure allows the rapid detection of unusual cases during the exploratory data analysis step before any inferential data analysis [47]. However, this algorithm is designed for generic anomaly detection, and the definition of anomalous cases is not specific to any particular application [46,47].
The anomaly detection analysis using GSE10846 ranked the cases according to the anomaly index, which ranged from 0.813 to 1.763 (Supplementary Excel File). There is no definitive cutoff for selecting anomalous cases. Cases with anomaly index values less than 1 or even 1.5 would not be considered anomalies, but the selected cases should be tested by other techniques to confirm that they are truly anomalous.
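The logic of comparing each record with its peer group can be sketched in a few lines. The example below is a simplified illustration and not the algorithm of the software used in this study: it assumes that the peer groups are already known, uses the squared Euclidean distance to the peer-group centroid as a stand-in for the deviation measure, and defines the anomaly index as a record's deviation divided by the average deviation in its peer group. The records, the group labels, the cutoff, and the function name `anomaly_indices` are all hypothetical.

```python
from collections import defaultdict


def anomaly_indices(records, peer_groups):
    """Deviation of each record relative to the mean deviation of its peer group."""
    members_by_group = defaultdict(list)
    for idx, group in enumerate(peer_groups):
        members_by_group[group].append(idx)

    indices = [0.0] * len(records)
    for members in members_by_group.values():
        n_features = len(records[members[0]])
        centroid = [
            sum(records[i][j] for i in members) / len(members)
            for j in range(n_features)
        ]
        # Squared Euclidean distance to the centroid (assumed deviation measure)
        deviation = {
            i: sum((records[i][j] - centroid[j]) ** 2 for j in range(n_features))
            for i in members
        }
        mean_deviation = sum(deviation.values()) / len(members)
        for i in members:
            # Identical records in a group are treated as typical (index 1)
            indices[i] = deviation[i] / mean_deviation if mean_deviation > 0 else 1.0
    return indices


# Hypothetical two-gene expression values; the last record is an outlier.
toy_records = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [3.0, 3.0]]
toy_groups = ["A", "A", "A", "A"]
toy_index = anomaly_indices(toy_records, toy_groups)

cutoff = 1.5  # illustrative only; the text notes there is no definitive cutoff
flagged = [i for i, value in enumerate(toy_index) if value > cutoff]
print(toy_index, flagged)
```

With this definition, the average index within a peer group is 1 by construction, so values well above 1 mark records that deviate more than is typical for their group.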
The results of the anomaly detection technique depend on the series of cases. This is a limitation because anomalous cases may have different clinicopathological characteristics and different gene expression profiles depending on the series, especially if the disease is heterogeneous, such as DLBCL. Anomaly detection was technically successful in the GSE31312 and GSE117556 datasets, but the genes identified were different. This is due to the heterogeneous profile of DLBCL and the characteristics of each series, and it is not necessarily a negative result. Moreover, we confirmed the relevance of RELB for predicting the prognosis of DLBCL not only in the GSE10846 series but also in the TCGA and GSE57611 series.
When the 12 genes were used as predictors of overall survival using a conventional Cox regression analysis in the GSE10846 series, only five genes retained a significant value in the last step. In this model, TRAPPC1, IGFBP7, and RELB were associated with a favorable prognosis, whereas HYAL2 and UBL7 were associated with a poor prognosis (Table 4). Similar results were found when NCCN-IPI was added to the equation with the five genes. NCCN-IPI was also significant (p value < 0.001, HR = 2.4), as were the five genes. Therefore, despite its limitations, this bioinformatics approach provides useful information regarding the pathogenesis of DLBCL. Of note, further analysis will include the validation of RELB in individual series of cases.
Jintao Wu et al. recently identified RELB as a potential molecular biomarker for immunotherapy in human pan-cancer [120]. Using the Cancer Genome Atlas Program (TCGA) dataset, they found that RELB was detected in human cancers and that its expression was associated with the overall survival of the patients, with a favorable prognosis in some cancers, such as glioblastoma multiforme and lung adenocarcinoma, and an unfavorable prognosis in others, such as breast cancer. Interestingly, using gene set enrichment analysis, an association of RELB with the tumor immune microenvironment and immune checkpoints was identified [120]. This is a relevant result because immuno-oncology and immunotherapies in DLBCL include monoclonal anti-CD20 antibody (rituximab), monoclonal anti-PD-1 antibodies (nivolumab and pembrolizumab), monoclonal anti-PD-L1 antibodies (avelumab, durvalumab, and atezolizumab), and chimeric antigen receptor (CAR) T-cell therapy [121,122]. The role of RELB in the pathogenesis of DLBCL is complex [103,104,118,123–126]. Further analysis of the impact of RELB on the prognosis of DLBCL and its relationship with known and well-established markers such as MYC, BCL2, and BCL6 [12,127] is warranted.

Conclusions
In conclusion, using a statistical approach based on anomaly detection and artificial intelligence of gene expression data of DLBCL, we identified pathogenic markers related to apoptosis, MAPK and MTOR, and the NF-KB pathway.High expression of the RELB proto-oncogene is associated with a favorable prognosis of DLBCL.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics4020081/s1, Anomaly detection Excel File.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Figure 1. Histological heterogeneity of DLBCL. Although DLBCL is a unique lymphoma subtype, its morphological characteristics are heterogeneous, including the neoplastic B lymphocytes and variable content of the tumor immune microenvironment. Hematoxylin and eosin stain (scale bar = 50 µm). The histological cases were retrieved from the lymphoma database of the Department of Pathology, Tokai University, School of Medicine.

Figure 2. Types of artificial intelligence methods.

Figure 3. Types of machine learning methods for predictive data analysis. In addition to anomaly detection analysis, there are many other types of machine learning that can be classified as supervised (A), unsupervised (B), and reinforcement learning (C). Of note, this figure includes methods

Figure 4. Segmentation analysis. This figure shows example images of the K-Means cluster (A), Kohonen clustering analysis (B), and anomaly detection (C).

Figure 5. Aim and methodology. The discovery set was the Lymphoma/Leukemia Molecular Profiling Project (LLMPP) GSE10846 gene expression dataset (last update 25 March 2019) of 414 cases.

Figure 6. Anomaly index values. Anomaly detection analysis identifies outliers, or unusual cases, in the data. It records information on what normal behavior looks like and identifies outliers even if they do not conform to any known pattern. It is an unsupervised method that examines large numbers of variables to identify clusters or peer groups. Then, each record is compared to others in its peer group to identify possible anomalies. Each record (blue circle) is assigned an anomaly index. A high index implies that the case deviates from its peer group more than the average case does. In the setup, several options can be specified, such as the adjustment of coefficient, number of peer groups, noise level, and noise ratio.

Figure 7. Machine learning and artificial neural networks using the LLMPP gene expression dataset. Anomaly detection analysis identified 12 genes. The prognostic value of these genes for overall survival was tested using several artificial intelligence analysis techniques. XGBoost tree (A), random forest (B), C5 tree (C), and neural network (D). Of note, the prognostic value of RELB was confirmed in the RCHOP-like cases of the LLMPP series using conventional overall survival analysis of Kaplan-Meier with log-rank tests (E). High gene expression of RELB was associated with favorable overall survival (E).

Figure 8. Protein-protein interaction analysis and gene set enrichment analysis (GSEA) of RELB gene and pathway. First, a functional network association analysis (protein-protein interaction network) focused on RELB created a pathway. Later, this RELB pathway was used in the GSEA. The GSEA confirmed the association of the RELB gene and pathway with a favorable overall survival of patients with DLBCL treated with R-CHOP therapy. Functional network association analysis (A), GSEA (B).

Figure 9. Immunohistochemical analysis of RELB in reactive tonsils and DLBCL. The protein expression of RELB was analyzed in 10 reactive tonsils (tissue control) and 30 cases of DLBCL not otherwise specified (NOS). In reactive tonsils, RELB expression was mainly present in the germinal centers of the follicles, with strong staining in macrophage/dendritic cells and weak staining in the B-lymphocytes. In DLBCL NOS, the staining was heterogeneous, ranging from 0 to 3+, and expressed by neoplastic B-lymphocytes and cells of the microenvironment.

Figure 10. Immunohistochemical analysis of RELB in relationship with other immune microenvironment markers in DLBCL NOS. The expression of RELB in DLBCL was heterogeneous, with a pattern compatible with a mixture of macrophage/dendritic cells and B-lymphocytes. Correlation with other macrophage-associated and immune microenvironment/immune checkpoint markers was performed using HLA DP-DR, SIRPA, CD85A, PD-L1, MARCO, and TOX (TOX1). Original magnification 400×.
Author Contributions: Conceptualization, J.C.; formal analysis, J.C.; investigation, J.C. and R.H.; resources, J.C.; writing-original draft preparation, J.C.; writing-review and editing, J.C. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), grant number KAKEN 23K06454. Rifat Hamoudi is funded by ASPIRE, the technology program management pillar of Abu Dhabi's Advanced Technology Research Council (ATRC), via the ASPIRE Precision Medicine Research Institute Abu Dhabi (ASPIREPMRIAD) award grant number VRI-20-10.

Institutional Review Board Statement: This study was conducted in accordance with the Declaration of Helsinki, and was approved by the Institutional Review Board of Tokai University, School of Medicine (protocol code IRB14R-080 and IRB20-156).

Table 1. A brief description of the machine learning methods used in this study.

Table 2. Genes identified in anomaly detection analysis using the GSE10846 series.

Table 3. Prediction of overall survival outcome (dead vs. alive) using machine learning and artificial neural networks, based on 12 previously identified genes in anomaly detection analysis.

Table 4. Prediction of the overall survival using Cox regression analysis based on the 12 genes.