Label Distribution Learning for Automatic Cancer Grading of Histopathological Images of Prostate Cancer

Simple Summary We aimed to develop and evaluate an automatic prediction system for grading histopathological images of prostate cancer using a deep learning model and label distribution learning. Our results show that the label distribution learning improved the diagnostic performance of the automatic prediction system for the cancer grading. Abstract We aimed to develop and evaluate an automatic prediction system for grading histopathological images of prostate cancer. A total of 10,616 whole slide images (WSIs) of prostate tissue were used in this study. The WSIs from one institution (5160 WSIs) were used as the development set, while those from the other institution (5456 WSIs) were used as the unseen test set. Label distribution learning (LDL) was used to address a difference in label characteristics between the development and test sets. A combination of EfficientNet (a deep learning model) and LDL was utilized to develop an automatic prediction system. Quadratic weighted kappa (QWK) and accuracy in the test set were used as the evaluation metrics. The QWK and accuracy were compared between systems with and without LDL to evaluate the usefulness of LDL in system development. The QWK and accuracy were 0.364 and 0.407 in the systems with LDL and 0.240 and 0.247 in those without LDL, respectively. Thus, LDL improved the diagnostic performance of the automatic prediction system for the grading of histopathological images for cancer. By handling the difference in label characteristics using LDL, the diagnostic performance of the automatic prediction system could be improved for prostate cancer grading.


Introduction
In 2022, 34,500 Americans were predicted to die of prostate cancer [1]. Prostate cancer is the second leading cause of cancer-related death in men. It is also the leading cause of cancer-related morbidity in men, with 268,490 cases [1]. Although the 5-year survival rate of prostate cancer has improved in recent years, the number of deaths remains high because of the large number of patients with prostate cancer. Bone metastases from prostate cancer occur frequently and reduce patients' quality of life.
Prostate cancer is diagnosed by palpation, prostate-specific antigen testing, ultrasound examination, magnetic resonance imaging, and biopsy. The definitive diagnosis of prostate cancer is made by performing a pathological evaluation of the prostate tissue obtained through biopsy or surgery. A visual evaluation score called the Gleason score is used for pathological evaluation of prostate tissues [2,3]. It is determined based on the classification system of histological morphology with five-class patterns. For each specimen, the pattern with the largest area is regarded as the first pattern, while the pattern with the second largest area is regarded as the second pattern. The sum of the two patterns is then calculated to obtain the Gleason score. For example, if the values of the first and second patterns are 4 and 5, respectively, the Gleason score is 9 (4 + 5).
Various problems associated with the determination of the Gleason score have been identified. Hence, the International Society of Urological Pathology (ISUP) proposed a new grading system, called the ISUP grade group classification system (ISUP score), based on the Gleason score [3]. The appropriate ISUP score was assigned by grouping the Gleason scores, with higher ISUP scores indicating higher malignancy.
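For illustration, the grouping of Gleason pattern pairs into ISUP grade groups can be sketched as follows. This is a simplified mapping of the standard grade groups; the full ISUP definition [3] covers additional pattern combinations and tissue-type rules.

```python
def isup_grade(first: int, second: int) -> int:
    """Map a Gleason pattern pair (first, second) to the ISUP grade group.

    Simplified standard mapping: Gleason <= 6 -> 1; 3+4 -> 2; 4+3 -> 3;
    Gleason 8 -> 4; Gleason 9-10 -> 5.
    """
    total = first + second
    if total <= 6:
        return 1
    if total == 7:
        return 2 if first == 3 else 3  # 3+4 vs 4+3
    if total == 8:
        return 4
    return 5  # Gleason 9 or 10

# Example from the text: patterns 4 and 5 give Gleason score 9, i.e. ISUP 5.
```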
A previous study evaluated the reproducibility of the Gleason score and its interobserver variability [4][5][6]. One previous study showed moderate agreement on the Gleason score between the observers [4]. As the ISUP score is based on the Gleason score, the agreement on the ISUP score is expected to be moderate.
The recent emergence of deep learning provides powerful tools for medical image analysis [7]. Deep learning makes it possible to extract valuable features in an end-to-end manner. Many studies have used deep learning for medical image analysis in radiology [7][8][9][10] and pathology [11][12][13][14][15]. To improve the inter-observer variability of the Gleason and ISUP scores, deep learning has been used for automated grading systems [15][16][17][18][19][20][21][22][23]. Table 1 presents a summary of the results of previous studies on automatic prediction systems for Gleason and ISUP scores. Table 1 shows that the size of the dataset used in the deep learning-based systems of previous studies was less than 10,000. As shown in the previous study [10], it is difficult to construct reliable deep-learning-based systems with small-sized datasets.
This study aimed to (i) use a large dataset for the development of a deep-learning-based ISUP grading system, (ii) evaluate the generalizability of the automated ISUP grading, and (iii) incorporate label distribution learning (LDL) [24][25][26] in system development with deep learning to address a difference in label characteristics between datasets. Of these, the primary goal of the current study is to combine deep learning with LDL and evaluate the utility of LDL in ISUP grading.

Dataset
A total of 10,616 whole slide images (WSIs) were used in this study, which are available from the Prostate cANcer graDe Assessment (PANDA) challenge [14,28]. The 5160 and 5456 WSIs of the PANDA dataset were collected from Radboud University Medical Center and Karolinska Institute, respectively. A summary of the PANDA dataset is presented in Table 2. The details of the PANDA dataset are described in the published paper [14] and the Kaggle website [29]. The two institutions used different scanners with slightly different maximum microscopic resolutions. The annotation process also differed between the two institutions. At Radboud University Medical Center, labels of the 5160 WSIs were retrieved from pathology reports containing the original diagnosis; trained students read all the reports and assigned a label to each WSI. At Karolinska Institute, the 5456 WSIs were annotated by an experienced pathologist. Therefore, the labels of the 5160 WSIs from Radboud University Medical Center might have a higher degree of inconsistency than those from Karolinska Institute. In addition, Table 2 shows that the frequencies of ISUP scores differed between the institutions; the frequencies of ISUP scores were more uniform at Radboud University Medical Center than at Karolinska Institute.
To develop and evaluate our deep learning-based systems, 5160 WSIs from Radboud University Medical Center and 5456 WSIs from Karolinska Institute were used as the development (training/validation sets) and unseen test sets, respectively. Table 2 shows that there was a difference in the frequency of ISUP scores between Radboud University Medical Center and Karolinska Institute. In addition, the 5160 WSIs of Radboud University Medical Center, which might have higher label inconsistency, were used for the development of a deep learning-based system. Therefore, it was expected that the dataset splitting of the current study would cause a deterioration in diagnostic performance compared with that of the original PANDA challenge.

Baseline Convolutional Neural Network
In the PANDA challenge, many deep learning-based systems were developed using convolutional neural networks (CNNs). After the PANDA challenge, the source code of several CNNs was released as open source. For example, the source code of the first-place solution of the PANDA challenge is available from the GitHub repository [30]. Their CNN ranked 22nd (metric = 0.910) and 1st (metric = 0.940) on the public and private leaderboards of the PANDA challenge, respectively. The details of the first-place solution are described in their slides [31]. The first-place solution consists of several types of CNNs, one of which was used as the baseline CNN in this study. To implement our baseline CNN (network structure, image preprocessing, loss function, etc.), the source code of the first-place solution was used. Following the first-place solution, our baseline CNN used EfficientNet [32] (EfficientNet B1) as the network structure, pretrained on the ImageNet dataset. Because the original WSIs contained non-tissue regions, the input images of the baseline CNN were preprocessed by creating tiled images, as in the first-place solution. In image tiling, the original WSIs were split into several tiles (image patches), tiles containing only non-tissue regions were removed, and the remaining tiles were concatenated into the input image. Figure 1 shows representative images of the original WSIs and their tiles. After image tiling, each original WSI was converted into a tiled image of size 1536 × 1536 × 3 (1536 = 8 tiles × 192 pixels). Following the first-place solution, the labels of the WSIs in our baseline CNN were represented as 10-dimensional vectors, where the first and second five dimensions were derived from the ISUP score and the first pattern of the Gleason score, respectively.
For example, if the ISUP score and the first pattern of the Gleason score were 3 and 4, respectively, the scores were converted to [1, 1, 1, 0, 0] and [1, 1, 1, 1, 0], which were then concatenated as [1, 1, 1, 0, 0, 1, 1, 1, 1, 0] and used as the label of the WSI for our baseline CNN. The loss function of our baseline CNN was the binary cross-entropy loss between the 10-dimensional vectors of the outputs and labels.
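The ordinal label encoding described above can be sketched as follows; the function names are illustrative, not taken from the first-place solution's source code.

```python
import numpy as np

def ordinal_encode(score: int, n: int = 5) -> np.ndarray:
    """Encode a score k in [0, 5] as an n-dimensional binary vector
    whose first k entries are 1 (cumulative/ordinal encoding)."""
    v = np.zeros(n, dtype=np.float32)
    v[:score] = 1.0
    return v

def encode_label(isup: int, gleason_first: int) -> np.ndarray:
    """Concatenate the ordinal encodings of the ISUP score and the
    first Gleason pattern into the 10-dimensional training label."""
    return np.concatenate([ordinal_encode(isup), ordinal_encode(gleason_first)])

# encode_label(3, 4) -> [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```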


Proposed CNN with LDL
The primary goal of the current study is to combine deep learning with LDL [24][25][26]. For this purpose, LDL was incorporated into our proposed CNN. Previous studies suggested that LDL can be used to address label inconsistency issues in traditional single labels. Instead of assigning a single label, LDL covers a certain number of adjacent labels, where each label represents a different degree of description. Given the N input training WSIs with the corresponding single labels of ISUP scores and the first pattern of Gleason scores, the dataset is represented as {(x_1, y_1, z_1), …, (x_N, y_N, z_N)}, where x_i represents the WSI, y_i represents the ISUP score of the WSI (y_i ∈ [0, …, 5]), and z_i represents the first pattern of the Gleason score (z_i ∈ [0, …, 5]). For WSI x_i, the label distribution d_i^c (c = 1, 2, 3, …, D, where D is the maximum number of label distributions) was generated based on the following Gaussian function:

d_i^c = (1 / (σ√(2π))) exp(−(c − l_i)² / (2σ²)),  (1)

where l_i is the rescaled value of y_i or z_i and σ is the standard deviation of the Gaussian function. In this study, the values of l_i were obtained from y_i or z_i using linear scaling based on the value of D (the range of l_i matched [1, …, D]). d_i^c represents the description degree (probability) of a label. As the maximum of the Gaussian function is obtained at c = l_i, the probability at l_i is the highest. However, the probabilities at c = l_i − 1 or c = l_i + 1 are not 0 in general. Based on the label distribution, label inconsistency caused by label noise was handled in the LDL. The raw values of d_i^c were normalized to satisfy the following conditions: (i) d_i^c ∈ [0, 1] and (ii) Σ_c d_i^c = 1. Representative examples of label distributions are shown in Supplementary Material S1 (Figure S1A–D).
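The generation of a label distribution from a single label can be sketched as follows. This is an illustrative implementation; the exact form of the linear rescaling from the score range onto [1, D] is an assumption here.

```python
import numpy as np

def label_distribution(score: float, D: int = 18, sigma: float = 2.0,
                       score_max: float = 5.0) -> np.ndarray:
    """Generate a normalized Gaussian label distribution over D bins.

    The single label (in [0, score_max]) is linearly rescaled to l in [1, D]
    (assumed mapping), then each bin c = 1..D receives a Gaussian description
    degree centred at l, and the values are normalized to sum to 1.
    """
    l = 1.0 + score / score_max * (D - 1)  # assumed linear rescaling to [1, D]
    c = np.arange(1, D + 1, dtype=np.float64)
    d = np.exp(-((c - l) ** 2) / (2.0 * sigma ** 2))
    return d / d.sum()  # satisfies d in [0, 1] and sum(d) == 1
```

With score = 3 (e.g. an ISUP score), D = 18, and σ = 2.0, the distribution peaks near bin 11 and assigns nonzero probability to the adjacent bins, which is how LDL softens label noise.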
The network structure of our proposed CNN with LDL was based on EfficientNet. In our proposed CNN, the base part with the convolutional layers of EfficientNet was retained, while the head part of EfficientNet was replaced with three fully connected layers with 512, 100, and 2D outputs, respectively. The three layers were accompanied by batch normalization and rectified linear unit activation layers. Our proposed CNN received tiled WSIs and generated a 2D-dimensional vector. The first and second D-dimensional vectors (o_i^1 and o_i^2) output from our proposed CNN corresponded to the two vectors of d_i^c obtained from y_i and z_i. Training of our proposed CNN was accomplished by minimizing the KL divergence between the predicted and ground-truth label distributions, represented by the following equation:

Loss_i = KL_div(d_i^c_from_y_i, o_i^1) + KL_div(d_i^c_from_z_i, o_i^2),  (2)

where d_i^c_from_y_i and d_i^c_from_z_i are the label distributions obtained from y_i and z_i, respectively, with Equation (1), and KL_div is the KL divergence between two probability distributions.
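The training objective can be sketched in NumPy as follows. This is illustrative: whether the network outputs are softmax-normalized before the KL divergence, and the exact weighting between the two KL terms, are assumptions not fixed by the text above.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions, with a small epsilon for stability."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def ldl_loss(logits: np.ndarray, target_y: np.ndarray,
             target_z: np.ndarray, D: int = 18) -> float:
    """Sum of KL divergences between the ground-truth label distributions
    (from the ISUP score and the first Gleason pattern) and the two
    softmax-normalized D-dimensional halves of the 2D network output."""
    p1 = softmax(logits[:D])  # predicted distribution o^1 (ISUP)
    p2 = softmax(logits[D:])  # predicted distribution o^2 (Gleason pattern)
    return kl_div(target_y, p1) + kl_div(target_z, p2)
```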

Implementation Details
PyTorch (version 1.9.0) and PyTorch Lightning (version 1.6.0) were used as the deep learning frameworks in this study. The baseline CNN was implemented based on the open-source code of the first-place solution in the PANDA challenge. As the original source code of the first-place solution used an old version of PyTorch Lightning (version 0.8.5), the source code for our baseline CNN was revised to use the newer version of PyTorch Lightning. The hyperparameters of our baseline CNN are available from https://github.com/kentaroy47/Kaggle-PANDA-1st-place-solution/blob/master/src/configs/final_1.yaml (accessed on 6 January 2023). Briefly, the hyperparameters of our baseline CNN were as follows: number of epochs, 30; optimizer, Adam; learning rate, 3.0 × 10^−5; and size of input (tiled image), 1536 × 1536 × 3. The proposed CNN with LDL was implemented by modifying the source code of the baseline CNN. The base part of the proposed CNN was a pretrained EfficientNet (B0–B5) model. The hyperparameters specific to the proposed CNN were as follows: weight of LDL = 0.20 (Equation (2)), σ = 2.0, and D = 18.

Evaluation of CNNs
In the baseline CNN, the predicted label was obtained by summing the elements of the first 5-dimensional vector of its output; this method was implemented in the original source code of the first-place solution. In the proposed CNN with LDL, the raw predicted label was determined using the following equation:

pl_i = Σ_{c=1}^{D} c · o_{i,c}^1,

where pl_i is the raw predicted label of the proposed CNN and o_{i,c}^1 is the c-th element of o_i^1. Finally, the value of pl_i was mapped back to the original label range via linear rescaling based on the value of D, yielding the final predicted label of the proposed CNN. In the baseline CNN and proposed CNN, the predicted Gleason score was not used for evaluation. Five-fold cross-validation was performed for the baseline CNN and proposed CNN using the development set (5160 WSIs from Radboud University Medical Center). After the five-fold cross-validation, the unseen test set (5456 WSIs from the Karolinska Institute) was used to evaluate the CNNs. The prediction results for the test set were obtained using an ensemble of the five trained models for the baseline and proposed CNNs. As the whole prediction process is not available in the source code of the first-place solution, it was implemented for the baseline and proposed CNNs in this study. The evaluation metrics used in this study were the quadratic weighted kappa (QWK) and accuracy. QWK, the main metric of the PANDA challenge, was used to evaluate the agreement between the ground-truth and predicted labels. The QWK typically varies from 0 (random agreement) to 1 (complete agreement). QWK was calculated as follows: First, a confusion matrix M was constructed from the ground-truth and predicted labels. The size of M was N-by-N (N = 6; range of ground-truth and predicted labels, 0–5). In matrix M, M_{i,j} corresponded to the number of ISUP scores i (ground truth) that received a predicted value j.
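The decoding of a predicted label distribution into an ISUP score, described above for the proposed CNN, can be sketched as follows; the exact form of the inverse linear rescaling from [1, D] back to the score range is an assumption.

```python
import numpy as np

def decode_prediction(dist: np.ndarray, D: int = 18,
                      score_max: float = 5.0) -> int:
    """Decode a predicted D-dimensional label distribution into an ISUP score.

    pl = sum_c c * dist[c] is the expected bin index; it is then linearly
    rescaled from [1, D] back to [0, score_max] and rounded to an integer.
    """
    c = np.arange(1, D + 1)
    pl = float((c * dist).sum())                  # raw predicted label
    score = (pl - 1.0) / (D - 1) * score_max      # assumed inverse rescaling
    return int(round(score))
```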
The N-by-N matrix of weights w was calculated based on the difference between the actual and predicted values using the following equation:

w_{i,j} = (i − j)² / (N − 1)².

The N-by-N confusion matrix of the expected outcomes, E, was calculated assuming that there was no correlation between the ground-truth and predicted labels. From the three matrices (M, w, and E), QWK was calculated as follows:

QWK = 1 − (Σ_{i,j} w_{i,j} M_{i,j}) / (Σ_{i,j} w_{i,j} E_{i,j}).

Accuracy was defined as the ratio of the diagonal elements of matrix M calculated between the ground-truth and predicted labels. In this study, the ISUP scores were used as the ground truth in the calculation of QWK and accuracy; the Gleason scores were ignored when calculating the evaluation metrics of this study.
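The QWK computation above can be sketched as follows; this is a standard implementation of the metric, not the challenge's official code.

```python
import numpy as np

def quadratic_weighted_kappa(truth, pred, n_classes: int = 6) -> float:
    """Quadratic weighted kappa between ground-truth and predicted labels
    (integer values in 0..n_classes-1)."""
    truth = np.asarray(truth); pred = np.asarray(pred)
    # Confusion matrix M: M[i, j] = count of truth i predicted as j.
    M = np.zeros((n_classes, n_classes))
    for t, p in zip(truth, pred):
        M[t, p] += 1
    # Quadratic weights: w[i, j] = (i - j)^2 / (n_classes - 1)^2.
    i, j = np.meshgrid(np.arange(n_classes), np.arange(n_classes), indexing="ij")
    w = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected matrix E under independence, scaled to the same total count.
    E = np.outer(M.sum(axis=1), M.sum(axis=0)) / M.sum()
    return 1.0 - (w * M).sum() / (w * E).sum()
```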
To statistically test the difference in accuracy between the baseline and proposed CNNs, McNemar's test was used. A p-value < 0.05 was considered to indicate statistical significance.

Table 3 presents the results of the five-fold cross-validation of the baseline and proposed CNNs (EfficientNet B0–B5) using the development set. The cross-validated QWK and accuracy of the baseline CNN were 0.820 and 0.545, respectively. The cross-validated QWK and accuracy of the proposed CNNs were 0.817–0.850 and 0.646–0.680, respectively. In most cases, the cross-validated QWK and accuracy of the proposed CNNs were better than those of the baseline CNN. The proposed CNN with EfficientNet B3 performed best, as shown in Table 3.

Table 3. Results of the five-fold cross-validation of the baseline and proposed CNNs (EfficientNet B0–B5).

Table 4 shows the results of the five-fold cross-validation of the proposed CNNs with different D values (EfficientNet B3 only). Here, D = 12, 30, and 60 were used in addition to D = 18. The cross-validated QWK and accuracy of the proposed CNNs were 0.835–0.850 and 0.609–0.700, respectively (Table 4). The proposed CNN with D = 60 achieved the highest accuracy (Table 4). The cross-validated QWK and accuracy of the proposed CNN of EfficientNet B3 with LDL (D = 18) and that with LDL (D = 60) were relatively high (Tables 3 and 4, respectively). Hence, these two proposed CNNs were evaluated using the test set.

Table 5 shows the diagnostic performance of the baseline and proposed CNNs in the unseen test set. The QWK and accuracy of the baseline CNN were 0.240 and 0.247, respectively, in the test set. The QWK values of the two proposed CNNs were 0.301 (D = 18) and 0.364 (D = 60), and their accuracies were 0.249 and 0.407, respectively, in the test set. The difference in accuracy between the baseline CNN and the proposed CNN (EfficientNet B3 with LDL, D = 60) was statistically significant (p-value < 0.000001).
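The McNemar's test used above compares two classifiers via the discordant pairs (cases that only one of the two classifies correctly). A minimal sketch with the common continuity correction follows; the paper does not state which variant of the test was used, so this is an assumption.

```python
import math

def mcnemar_p(b: int, c: int) -> float:
    """McNemar's test with continuity correction.

    b: cases correct under classifier A only; c: correct under B only.
    Returns the p-value from a chi-squared distribution with 1 df,
    using sf(x) = P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For example, highly asymmetric discordant counts (many cases correct under only one model) yield a very small p-value, as reported for the baseline CNN versus the proposed CNN with D = 60.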
As a result, the proposed CNNs can improve the diagnostic performance of the first-place solution for the automatic prediction of prostate cancer grade. Figure 2 shows the confusion matrix between the ground-truth and predicted labels for the proposed CNN of EfficientNet B3 with LDL (D = 60). Tables 3–5 show that when the development set (5160 WSIs from Radboud University Medical Center) was used, the diagnostic performance of the baseline and proposed CNNs severely deteriorated in the test set (5456 WSIs from the Karolinska Institute). Table 5 shows that the proposed CNNs achieved better diagnostic performance for the automatic prediction of ISUP scores compared with the baseline CNN. Because the major difference between the proposed CNNs and the baseline CNN was the use of LDL, LDL was useful for improving the CNNs and predicting the ISUP scores automatically. However, for both the proposed and baseline CNNs, the diagnostic performance in the test set was worse than that in the development set.

Discussion
As previously mentioned in Materials and Methods section and Table 2, the labels of 5160 WSIs from the Radboud University Medical Center were annotated by trained students. Therefore, these labels might have a higher degree of inconsistency compared with those from the Karolinska Institute. The label inconsistency of WSIs in the Radboud University Medical Center might decrease the generalizability of the model in the development of the proposed and baseline CNNs. In addition, the difference in label frequency between Radboud University Medical Center and Karolinska Institute might deteriorate the model's generalizability. Our results and speculation indicate that when a large number of images with low-quality labels are available, the development of a deep learning model may lead to the deterioration of the generalizability of the model. This point should be carefully considered when developing deep-learning-based systems.
Issues related to label inconsistency have been reported in the original PANDA challenge. However, the discrepancy in diagnostic performance between the development and test sets was not extremely large during the original PANDA challenge because the development set consisted of WSIs obtained from the two institutions. In this study, the discrepancy in diagnostic performance between the development and test sets was significantly large because the development set was derived from one institution.
LDL has been proposed to address issues related to label inconsistency in traditional single labels. In this study, the diagnostic performance of our system was improved by incorporating LDL into the deep learning-based system. Owing to the characteristics of LDL, it might also be useful for other medical systems to resolve label inconsistency issues in grading scores (cancer staging, grading for cancer diagnosis, and so on). Table 1 presents a summary of the results of previous studies on automatic prediction systems for Gleason and ISUP scores. The size of the dataset in this study was larger than that in the previous studies. However, the diagnostic performance of our CNNs was worse than that of the CNNs in the previous studies. As discussed above, the differences in label inconsistency and label frequency between Radboud University Medical Center and Karolinska Institute may explain this performance discrepancy.
This study has several limitations. First, it was conducted using a public dataset (the PANDA dataset). Our results should be confirmed using other datasets. Second, our results were obtained using EfficientNet. Although EfficientNet is a standard deep learning model, other deep learning models should be used to evaluate the usefulness of LDL. As the first-place solution to the PANDA challenge used EfficientNet in the source code, EfficientNet was also used in this study.

Conclusions
Our proposed CNN with LDL could improve the cancer grading (ISUP scores) of histopathological images in prostate cancer. This improvement could be achieved with the aid of LDL (a strategy used to address label inconsistency issues). However, when the difference in label characteristics (label inconsistency and label frequency) between the development and test sets was observed, the generalizability of the baseline and proposed CNNs deteriorated in the unseen test set. Our results should be carefully considered when developing deep-learning-based systems.