Deep Learning for the Classification of Non-Hodgkin Lymphoma on Histopathological Images

Simple Summary

Histopathological examination of lymph node (LN) specimens allows the detection of hematological diseases. The identification and the classification of lymphoma, a blood cancer with a manifestation in LNs, are difficult and require many years of training, as well as additional expensive investigations. Today, artificial intelligence (AI) can be used to support the pathologist in identifying abnormalities in LN specimens. In this article, we trained and optimized an AI algorithm to automatically detect two common lymphoma subtypes that require different therapies, using normal LN parenchyma as a control. The balanced accuracy in an independent test cohort was above 95%, which means that the vast majority of cases were classified correctly and only a few cases were misclassified. We applied specific methods to explain which parts of the image were important for the AI algorithm and to ensure a reliable result. Our study shows that classification of lymphoma subtypes is possible with high accuracy. We think that routine histopathological applications for AI should be pursued.

Abstract

The diagnosis and the subtyping of non-Hodgkin lymphoma (NHL) are challenging and require expert knowledge, great experience, thorough morphological analysis, and often additional expensive immunohistological and molecular methods. As these requirements are not always available, supplemental methods supporting morphological-based decision making and potentially entity subtyping are required. Deep learning methods have been shown to classify histopathological images with high accuracy, but data on NHL subtyping are limited.
After annotation of histopathological whole-slide images and image patch extraction, we trained and optimized an EfficientNet convolutional neuronal network algorithm on 84,139 image patches from 629 patients and evaluated its potential to classify tumor-free reference lymph nodes, nodal small lymphocytic lymphoma/chronic lymphocytic leukemia, and nodal diffuse large B-cell lymphoma. The optimized algorithm achieved an accuracy of 95.56% on an independent test set including 16,960 image patches from 125 patients after the application of quality controls. Automatic classification of NHL is possible with high accuracy using deep learning on histopathological images, and routine diagnostic applications should be pursued.


Hardware and Software
For training and prediction with our models, we used the bwForCluster MLS&WISO Production nodes [21] that feature the Nvidia Tesla K80 (models B0 to B3, see also Model Training and Optimization) or the Nvidia GeForce RTX 2080Ti (model B4). With the Nvidia Tesla K80 nodes, we used both GPUs with a mirrored strategy from TensorFlow. With the Nvidia GeForce RTX 2080Ti nodes, we used a single GPU. Furthermore, we applied Singularity (Sylabs, https://sylabs.io/singularity/; v3.7.2, accessed on 1 May 2021).

Analytical Subsets
To ensure reliable results, patients were randomly separated into three subsets: training (60%), validation (20%), and test sets (20%). All image patches from a patient (case) were used in the respective subset. We had a checkpoint in our code to ensure that cases were used in a single subset only. These subsets were not changed during the analyses.
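The patient-level split described above can be sketched in pure Python as follows. This is a minimal illustration, not the study's actual code; the record format and function names are our own. The assertion mirrors the checkpoint mentioned in the text that no patient may appear in more than one subset.

```python
import random

def split_by_patient(patch_records, seed=42):
    """Assign every patch of a patient to exactly one subset (60/20/20).

    patch_records: list of (patient_id, patch) tuples -- a hypothetical format.
    The split is performed on patients, never on individual patches.
    """
    patients = sorted({pid for pid, _ in patch_records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    train_ids = set(patients[: int(n * 0.6)])
    val_ids = set(patients[int(n * 0.6): int(n * 0.8)])
    test_ids = set(patients[int(n * 0.8):])
    # Checkpoint: a patient must never appear in more than one subset.
    assert not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)
    subsets = {"train": [], "val": [], "test": []}
    for pid, patch in patch_records:
        if pid in train_ids:
            subsets["train"].append(patch)
        elif pid in val_ids:
            subsets["val"].append(patch)
        else:
            subsets["test"].append(patch)
    return subsets
```

Splitting on patient identifiers before collecting patches prevents leakage of near-identical patches from one patient into both training and test data.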

Model Training and Optimization
We used models from the EfficientNet family [22] for our analysis. The EfficientNet family is composed of multiple models (from B0 to B7), each a scaled version of the baseline model B0. The models were scaled by the compound scaling method introduced in [22]. With compound scaling, each consecutive model increases network width, depth, and image resolution by a set of fixed scaling coefficients. This form of scaling utilizes the observation that network width, depth, and image resolution seem to exhibit a certain relationship [22]. A model with fewer trainable weights can be trained using fewer resources, and its inference is faster [22]. In this study, we investigated up to which scaling stage compound scaling remained beneficial for predicting NHL on histopathological images. The nontrainable model parameters (such as dropout) provided in the TensorFlow implementation of the EfficientNet models were used without modification. The batch size was chosen as the maximal allowed value (in the sequence of 2^n, n ∈ ℕ), given the available GPU memory. The batch size usually becomes smaller when scaling up an EfficientNet; the image resolution increases, and the model itself becomes bigger due to the additional weights. We used the Adam optimizer with a learning rate that was selected for each model as follows: models were trained for 50 epochs (each a pass of the full training data) with various learning rates roughly in the range of 10^−5 to 10^−6. Then, the best-performing learning rate was chosen, and the respective model was trained further until there seemed to be no further performance gain. Performance was visually evaluated by the achieved validation and training accuracy, the amount of overfitting (difference between training and validation accuracy), and the smoothness of the accuracy curves.
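The two selection rules above (largest power-of-two batch size that fits GPU memory, and learning-rate selection by validation accuracy) can be sketched as follows. This is an illustrative sketch only: the memory figures, the `train_fn` interface, and all names are our own assumptions, not the study's implementation, and real GPU memory use depends on the model and framework allocator.

```python
def max_pow2_batch(image_mem_mb, gpu_mem_mb, overhead_mb):
    """Largest batch size 2**n whose rough memory estimate fits the GPU.

    All figures are illustrative; this is a back-of-the-envelope model,
    not a real memory profiler.
    """
    batch = 1
    while (batch * 2) * image_mem_mb + overhead_mb <= gpu_mem_mb:
        batch *= 2
    return batch

def select_learning_rate(train_fn, candidate_lrs, epochs=50):
    """Train briefly at each candidate rate, keep the best by validation accuracy.

    train_fn(lr, epochs) -> validation accuracy; a hypothetical stand-in for
    the actual TensorFlow training loop.
    """
    results = {lr: train_fn(lr, epochs) for lr in candidate_lrs}
    best_lr = max(results, key=results.get)
    return best_lr, results
```

In practice `train_fn` would build the EfficientNet model, compile it with the Adam optimizer at the given rate, and return the final validation accuracy.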
The models with the highest validation accuracy for each class of EfficientNet models (B0-B4) were compared, and the overall best-performing model was used to classify the test set.
For the tumor-free reference cases, a detailed classification (into LNs from lung, colon, and pancreas) was available, which we used for training of the classifier (we anticipated that this might improve accuracy). Since such a detailed classification was not available for the tumor cases, we analyzed our predictions on the test data using an aggregated class "tumor-free reference LN".
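The mapping from the fine-grained training labels to the aggregated evaluation class can be sketched as a simple lookup. The label strings below are our own hypothetical identifiers, not the study's actual class names.

```python
# Hypothetical label scheme: the classifier is trained on fine-grained classes
# (including the organ of origin for tumor-free LNs); for evaluation, the
# control predictions are pooled into one aggregated class.
AGGREGATE = {
    "ln_lung": "tumor-free reference LN",
    "ln_colon": "tumor-free reference LN",
    "ln_pancreas": "tumor-free reference LN",
    "sll_cll": "SLL/CLL",
    "dlbcl": "DLBCL",
}

def aggregate_prediction(fine_label):
    """Map a fine-grained predicted label to its aggregated evaluation class."""
    return AGGREGATE[fine_label]
```

Training on the finer labels while evaluating on the pooled class lets the network exploit organ-specific morphology without changing the clinically relevant three-class question.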

Patient Cohort, Annotation, Image Patch Extraction, and Subset Analysis
Cases from SLL/CLL (n = 129) and DLBCL (n = 119), as well as control LNs from lung, colon, and pancreas (n = 381), were identified, retrieved, assembled in a TMA, stained, and scanned. Identification of representative regions resulted in a total of 84,139 extracted 100 × 100 µm (395 × 395 px) image patches. The number of extracted image patches is displayed in Table 1. The goal to extract a minimum of 10 image patches per patient was met in all but seven cases.

Convolutional Neuronal Network Selection and Hyperparameter Optimization
Different models (B0-B4) were trained and optimized using different learning rates. Figure 3 shows the training and validation accuracy of the models with the highest validation accuracy per EfficientNet architecture (B0, B1, etc.). Since the B4 architecture did not seem to outperform the B3 architecture, we did not tune the architectures B5-B7 on our data. For the tuned models in Figure 3, the chosen learning rate and batch size were as follows: B0, 1 × 10^−6, 256; B1, 1 × 10^−5, 128; B2, 9 × 10^−6, 128; B3, 8 × 10^−6, 64; B4, 6 × 10^−6, 16. Whereas the overall accuracies of the B3 and B2 models were almost on par, the respective confusion matrices on the validation data of the B3 model were slightly more accurate. Thus, we chose the B3 model to classify the test set.
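The tuned settings reported above can be collected in a small configuration table, with final model selection by validation accuracy. The learning rates and batch sizes below are taken from the text; the accuracy values in the test are made-up placeholders, and the function names are our own.

```python
# Tuned hyperparameters per EfficientNet architecture, as reported above.
TUNED = {
    "B0": {"lr": 1e-6, "batch": 256},
    "B1": {"lr": 1e-5, "batch": 128},
    "B2": {"lr": 9e-6, "batch": 128},
    "B3": {"lr": 8e-6, "batch": 64},
    "B4": {"lr": 6e-6, "batch": 16},
}

def pick_model(val_acc):
    """Select the architecture with the highest validation accuracy.

    val_acc: dict mapping architecture name -> validation accuracy.
    Ties or near-ties (as for B2 vs. B3 here) may still warrant inspecting
    the per-class confusion matrices before the final choice.
    """
    return max(val_acc, key=val_acc.get)
```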

Figure 4 displays the normalized confusion matrix of the selected B3 model in terms of the image patches or cases. For these matrices, image patches were assigned the predicted class with the highest probability, and cases were assigned the predicted class of the majority of their patches. We used the balanced accuracy (BACC) [23] instead of the plain accuracy to account for class imbalance. The model showed a high BACC for DLBCL and the tumor-free reference (in both cases, only a single missed case was identified; Figure 4). However, the predictions for CLL displayed a lower BACC with multiple misclassifications.
Table 2 features the BACC for different quality control thresholds at the case or patch level. Any patch with a predicted probability (in terms of the highest prediction probability) of less than the patch-based quality control (PQC) threshold was filtered out. The case-based quality control (CQC) threshold filtered cases in which the proportion of patches for the predicted class was less than the threshold. From Table 2, one can see that an increase in the case-based quality control threshold improved the overall BACC up to 95.56%. A more detailed example for results with a PQC and CQC of 0.9 showed a decrease in the proportion of misclassified patches and cases (Figure S1).
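The combination of patch-based and case-based quality control with a majority vote can be sketched as follows. This is a minimal illustration under our own assumptions about the data format (one probability dict per patch); it is not the study's code.

```python
from collections import Counter

def classify_case(patch_probs, pqc=0.9, cqc=0.9):
    """Case-level call with patch- (PQC) and case-based (CQC) quality control.

    patch_probs: list of dicts mapping class name -> predicted probability,
    one dict per image patch of the case (hypothetical format).
    Returns the majority class, or None if the case fails quality control.
    """
    # PQC: discard patches whose top predicted probability is below threshold.
    kept = []
    for probs in patch_probs:
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= pqc:
            kept.append(label)
    if not kept:
        return None
    # Majority vote over the remaining patches.
    label, count = Counter(kept).most_common(1)[0]
    # CQC: require the winning class to cover at least `cqc` of kept patches.
    if count / len(kept) < cqc:
        return None
    return label
```

Cases returning None would be flagged for manual review rather than receiving an automatic classification.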
Only 3/102 patients were misclassified using high quality-control thresholds. To further investigate if the network was learning the correct features on cells, we applied SmoothGrad [24] to a selection of patches (Figure 5). SmoothGrad produces heatmaps indicating the importance of certain pixels toward the prediction of a certain class. In the heatmaps in Figure 5, we can observe high activity in the respective cells and not in noncellular structures. Thus, we concluded that our model predicted the respective class on the basis of cell morphology.
To estimate the inference latency of our model on a CPU and GPU, we classified a random image patch (each pixel drawn from a uniform distribution) multiple times. We predicted 1000 steps of our final model with a tf.data (https://www.tensorflow.org/guide/data, accessed on 1 May 2021) pipeline using a batch size of 1 that repeated the random image patch. The prediction took 203 s with 203 ms per step (i.e., per image patch) on a single thread of an Intel(R) Core(TM) i9-9880H CPU (2.3 GHz) (Intel Corporation, Santa Clara, CA, USA), and 107 s with 107 ms per step on an Nvidia Quadro T2000 (Nvidia Corporation, Santa Clara, CA, USA).
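The latency measurement can be sketched with a generic timing harness. To keep the sketch self-contained and testable without TensorFlow, `predict_fn` below is an arbitrary callable standing in for the model's batch-size-1 prediction step; the interface is our own assumption.

```python
import time

def time_inference(predict_fn, sample, steps=1000):
    """Measure mean per-step latency by repeating one input (sketch).

    predict_fn stands in for a single-batch model prediction; in the study's
    setup this would be the final EfficientNet model fed by a tf.data
    pipeline that repeats one random image patch.
    Returns the mean seconds per step.
    """
    start = time.perf_counter()
    for _ in range(steps):
        predict_fn(sample)
    total = time.perf_counter() - start
    return total / steps
```

Timing many repetitions of the same input isolates the model's compute latency from disk I/O and preprocessing variability.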

Discussion
In the present study, we evaluated and optimized a convolutional neuronal network (CNN) for the classification of histopathological images of tumor-free LNs, SLL/CLL, and DLBCL. The principal capacity of CNNs for the classification of malignant and benign diseases on scanned histopathological tissue of conventionally stained sections was previously demonstrated and is well documented [19,25–28]. Specifically, the technique has been shown to be capable of classifying carcinoma subtypes and of identifying LN metastases of carcinomas [19,29–31]. However, studies on the classification of lymphomas are relatively scarce, and normal LNs as controls have rarely been included [14,17,18,22,32,33]. In addition to the classification of lymphoma subtypes, it has been shown that molecular alterations may be detected by deep learning algorithms on histopathological tissue sections [17].
The abovementioned studies on lymphoma classification have in common that they showed that lymphoma subtyping is possible with high accuracy (often >90%) using deep learning techniques when 2-4 lymphoma subtypes are included for classification. In this regard, our study is comparable as we included normal LNs as a control and two common NHL subtypes of B-cell lineage. In line with these previous studies, our BACC was >95%.
Direct comparison of the different studies in terms of methodology is somewhat difficult, as, in addition to the included entities, the number of cases, the study design, the image input parameters, the architecture of the respective networks, and the evaluation were highly heterogeneous.

Commonly, deep learning studies require a large set of images, but there is no consensus on the minimum number of cases that should be included. The previously reported studies on lymphoma subtyping included between 34 and 259 cases per entity and a total of 2560 to 850,000 image patches [14,22,32]. One study included 867 DLBCL cases, but their algorithm was mainly designed to separate DLBCL from samples not related to lymphoma [16]. In our study, we included a total of 629 patient samples and 84,139 image patches, making it the study with the highest case number on lymphoma subtyping by deep learning to date.
Currently, the use of training, validation, and test sets is advocated. The deep learning algorithm is trained and optimized using the first two sets; the test set should only be used for the final classification. This setup was used in most studies on lymphoma, including our own, but not in all previous studies [22].
The patch size of the final images ranged between 16 × 16 px and 800 × 800 px in most studies [31]. Currently, there is no standard regarding the size of the image patches, but it seems fair to argue that smaller patches better represent cytological features, whereas larger patches better represent the tissue architecture. Some of the variation in pixel size is due to the different magnifications used [18]. Often, images are extracted at either ×200 or, as in our study, ×400 [22]. In this regard, it is important to note that previous investigations on the classification of follicular lymphoma versus reactive follicular hyperplasia, both processes that show prominent architectural changes, included rather low magnifications to ensure architectural representation [18]. As SLL/CLL and DLBCL show very distinct cell morphologies, we used a higher magnification (×400) to obtain a better cytological representation of the respective cell types. During the annotation, we tried to avoid a prominent representation of the tissue edge of the tissue cores, thereby ensuring transferability to whole slides. Although not explicitly tested, we would expect our algorithm to achieve similar results on whole-slide images, as the image patches from TMA cores and from whole slides are comparable.
Moreover, different CNN architectures have been applied in previous studies. We decided to use the EfficientNet framework because it achieved a high top-1 accuracy of 84.3% on the ImageNet dataset, while being smaller and significantly faster than network architectures achieving comparably high accuracy rates on the same dataset [34]. The EfficientNet architectures use a compound scaling method to balance width, depth, and resolution of a network, and they have successfully been applied to histopathological image classification tasks [35]. Computational time might be an important factor not only for training models, but also for application in routine diagnostics. In this regard, it would be beneficial to find an equally fast way to run inference on CPUs in a routine context, as GPUs offer low inference times but most computers worldwide are not equipped with one.
Lastly, there is also no established standard for evaluation. Most authors used a majority vote where the class with the highest probability was chosen as the final result [22,32]. If multiple magnifications were included, the final result was calculated by averaging the respective single results [18]. We believe that, if multiple classes are included in an algorithm, it might not be enough to calculate the final result on the basis of a majority vote. If, for example, the algorithm is trained on three diseases, the random chance for each class would be 33.3%. In the given example, a doubtful result of 34% probability for one class would trigger this class to be labeled as the final result. The application of quality control limits has previously been proposed and applied [19]. In the abovementioned lymphoma studies, only one group used the quality control limits at the image patch level to create heatmaps at the patient level, but this method was not applied for the final classification result [18]. For applications in the routine diagnostic setting, the implementation of quality control measures is important in our opinion. We tested the effect of quality control limits at the image patch and patient level and achieved not only an increase in accuracy, but also automatic screening for cases with doubtful results that need further review.
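The argument above, that a 34% winner among three classes barely beats random chance and should not be auto-reported, can be sketched as a triage rule. The margin value and all names below are illustrative assumptions, not thresholds from the study.

```python
def triage(class_proportions, n_classes=3, margin=0.2):
    """Flag a majority-vote result that is too close to random chance (sketch).

    class_proportions: dict mapping class name -> fraction of patches voting
    for it. With three classes, random chance is ~33.3%; a winner below
    chance + margin is routed to a pathologist instead of being reported
    automatically. `margin` is an illustrative choice, not a study value.
    """
    winner, share = max(class_proportions.items(), key=lambda kv: kv[1])
    chance = 1.0 / n_classes
    if share < chance + margin:
        return winner, "needs pathologist review"
    return winner, "auto-report"
```

Such a rule turns quality control into an automatic screen for doubtful cases, which is the behavior the quality-control thresholds in this study provided.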
In a routine diagnostic scenario, a small and resource-sparing panel of confirmatory immunohistological and/or molecular methods could be ordered after confirmation of the deep learning result by a pathologist. Specifically, in cases where LNs are reviewed for metastasis of carcinomas by pathologists with low expertise in terms of hematological neoplasia, our algorithm could raise alertness for an underlying hematological neoplasm such as SLL/CLL [36].
The limitations of our study are the sample size, the number of included entities, and the process for hyperparameter tuning. Herein, we examined a total of 629 cases. Following the random separation into training, validation, and test sets, only 378 cases were included in the training set. SLL/CLL and DLBCL both show considerable morphological variability, and many variants and specific morphological features are recognized in the current World Health Organization classification [3]. In this regard, it must be noted that some subsets of SLL/CLL may show an extensive plasmacytoid appearance [37], may exhibit large confluent proliferation centers that are not the equivalent of Richter transformation [38], or may show differences in proliferative activity and prognosis according to IGH gene homology with a germline sequence [39]. Likewise, there are specific forms of DLBCL, such as the activated B-cell-like and germinal center B-cell-like subtypes, that have distinct morphological, immunohistological, and genetic characteristics [40]. Given these variations, which are mainly due to distinct molecular changes, it becomes clear that a limited number of cases and extracted image patches per patient can only display a fraction of the overall possible morphological spectrum of SLL/CLL and DLBCL, as well as their reactive changes. Our model was trained to detect only two B-NHLs. Therefore, it cannot be expected that the algorithm will reliably classify other types of B-NHL, lymphomas of T-cell origin, or Hodgkin lymphomas that were not included in the current study. Moreover, a small number of tumor cells per image patch may be a limiting factor, and the minimal number of tumor cells per image patch needed for a reliable result is currently unclear.
Even when applying high quality-control thresholds, SLL/CLL patient samples may have been misclassified as normal lymph nodes because neoplastic cells represent only a fraction of the overall image area. Our algorithm showed 100% sensitivity and specificity for the detection of DLBCL, but slightly lower sensitivity for SLL/CLL. For screening purposes, it would be desirable to achieve a high sensitivity for lymphoma in order to avoid false negatives, while specificity is less important if additional investigations are performed. Whereas the introduction of quality thresholds reduced the number of misclassified patients to 3%, the overall problem of lower sensitivity, particularly for SLL/CLL, remained. Considering the abovementioned points, the application of deep learning for NHL classification must always be conducted under the supervision of a pathologist to avoid misdiagnosis and potentially harmful consequences for patients.

Conclusions
In the present study, we trained an efficient CNN architecture on scanned histopathological slides and showed that the classification of tumor-free LNs, SLL/CLL, and DLBCL is possible with high accuracy. The application of deep learning techniques for histopathological routine diagnostics should be pursued.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/cancers13102419/s1: Figure S1: Confusion matrix of the best-performing model in terms of the test data at the patch (left) and case level (right).

Data Availability Statement: Scanned tissue sections can be obtained from the NCT tissue biobank (https://www.nct-heidelberg.de/forschung/nct-core-services/nct-tissue-bank.html, accessed on 1 May 2021). Tissue tiles can be obtained from the corresponding author upon reasonable request. The data cannot be uploaded for open access due to current legislation. The code we used to fit our models can be accessed at https://github.com/AG-Computational-Diagnostic/pacltune-pup, accessed on 1 May 2021.