A Deep Learning Pipeline for Grade Groups Classification Using Digitized Prostate Biopsy Specimens

Prostate cancer is a significant cause of morbidity and mortality in the USA. In this paper, we develop a computer-aided diagnostic (CAD) system for automated grade groups (GG) classification using digitized prostate biopsy specimens (PBSs). Our CAD system aims to firstly classify the Gleason pattern (GP), and then identifies the Gleason score (GS) and GG. The GP classification pipeline is based on a pyramidal deep learning system that utilizes three convolution neural networks (CNN) to produce both patch- and pixel-wise classifications. The analysis starts with sequential preprocessing steps that include a histogram equalization step to adjust intensity values, followed by a PBSs’ edge enhancement. The digitized PBSs are then divided into overlapping patches with the three sizes: 100 × 100 (CNNS), 150 × 150 (CNNM), and 200 × 200 (CNNL), pixels, and 75% overlap. Those three sizes of patches represent the three pyramidal levels. This pyramidal technique allows us to extract rich information, such as that the larger patches give more global information, while the small patches provide local details. After that, the patch-wise technique assigns each overlapped patch a label as GP categories (1 to 5). Then, the majority voting is the core approach for getting the pixel-wise classification that is used to get a single label for each overlapped pixel. The results after applying those techniques are three images of the same size as the original, and each pixel has a single label. We utilized the majority voting technique again on those three images to obtain only one. The proposed framework is trained, validated, and tested on 608 whole slide images (WSIs) of the digitized PBSs. The overall diagnostic accuracy is evaluated using several metrics: precision, recall, F1-score, accuracy, macro-averaged, and weighted-averaged. The (CNNL) has the best accuracy results for patch classification among the three CNNs, and its classification accuracy is 0.76. The macro-averaged and weighted-average metrics are found to be around 0.70–0.77. For GG, our CAD results are about 80% for precision, and between 60% to 80% for recall and F1-score, respectively. Also, it is around 94% for accuracy and NPV. To highlight our CAD systems’ results, we used the standard ResNet50 and VGG-16 to compare our CNN’s patch-wise classification results. As well, we compared the GG’s results with that of the previous work.


Introduction
The most recent statistics from the American Cancer Society showed that prostate cancer (PC) is the most prevalent type of cancer with 248,530 (26%) cases, and it is also the second leading cause of cancer-related death with 34,130 (26%) [1] among men. Prostate tumors are like many other cancers in that the initial stage does not cause death or pain. By time the tumor is recognized, it has advanced to high grade with increase mortality [1]. The pathological evaluation of prostate biopsies determines the best treatment method of PC [2]. One of the methods used to characterize the heterogeneous tumor growth patterns is the Gleason grading system, which observes in a biopsy regarding their degree of discrimination or the Gleason pattern (GP).
The GP practically ranges from GP1 through GP5. The GP1 (stroma) and GP2 (benign) represent the nonepithelium tissue. While GP3, GP4, and GP5 represent the epithelium tissue. GP3 indicates moderately differentiated glands compared with that of GP5, which represents poorly differentiated cells [3,4]. Many factors contribute to determining the stage of PC, like the prostate-specific antigen (PSA) level. However, the primary factor is the Gleason score (GS). The GS is the grading system used to determine PC's aggressiveness depending on the two most frequent GP observed in the biopsy [5]. Typically, the GS ranges from 6 to 10, where 6 illustrates low-grade cancer, i.e., the cancer is likely to grow slowly, and 10 represents high-grade, i.e., the cancer is expected to spread more rapidly. The GS grading system is often divided into only three categories, 6, 7, and 8-10 [5]. This classification is rather coarse. For example, GS7 could indicate that the most cells are GP3, followed by GP4, or that most cells are GP4, followed by GP3; however, the latter case has a much worse prognosis. Similarly, GS9 or GS10 has a worse prognosis than GS8 despite often being grouped together. Eventually, the 2014 International Society of Urological Pathology (ISUP) developed a simple grading system for PC: grade groups (GG) system, based on the visual assessment of cell differentiation and GP predominance [6], see Figure 1. The GG ranges from GG1 to GG5, with higher GG indicating greater clinical risk. Table 1 shows the relation between the Gleason grading (GP and GS) and the GG system, besides the shape of cell tissues for GP and the GG's risk level. Table 1. Grading systems for a prostate biopsy specimen are the Gleason grading system, the Gleason pattern (GP) and Gleason score (GS), as well as the grade groups (GP) system.  The discordance in GG diagnosis by different pathologists using the same biopsy is between 30-50% [3,7,8]. Agreement is greater among pathologists with urologic subspecialty training and high experience than among pathologists in general [9]. Accurate diagnosis of prostate biopsy specimens helps physicians to make essential treatment decisions [9,10]. Due to expert subspecialists' availability, the development of an automated system for assessing prostate biopsy specimens with expert-level performance could improve prostate biopsy's clinical utility.

Shape of Cell
In recent years, extensive research work was developed to diagnose the tumorous lesions for many organs, especially the prostate [11][12][13][14]. Deep learning (DL) combined with histopathology and radiology imaging, particularly magnetic resonance imaging (MRI), plays an essential role in grading the prostate's cancerous tissues [4,[15][16][17]. For histopathology, Arvaniti et al. [18] developed a DL approach to identify automated Gleason grading of prostate cancer tissue microarrays with Hematoxylin and Eosin (H&E) staining. This model's advantages were that it was trained by a dataset of about 641 patients and tested on an independent cohort of 245 patients annotated by two pathologists. Cohen's quadratic kappa statistic was used to the interannotator agreements between the model and each pathologist. The authors reported the results for the GP and GS. However, they did not mention the final classification of the GG that is considered the simple grading system for PC. Also, Bulten et al. [19] introduced automated grading prostate biopsies using a DL system. They focus on classifying the GP for the prostate biopsies, and then identifying the GS depending on the GP predominance. Their DL approach was developed using 5834 biopsies from 1243 patients. The authors did not report the overall accuracy of the GP classification; they reported only the final results for the GG. Similarly, Nagpal et al. [4] used a DL approach to improve GS for whole-slide images of prostatectomies. The system was developed using 112 million pathologist-annotated image patches from 1226 slides and tested on an independent validation dataset of 331 slides. The author considered the GG4 with GG5 as one class and did not report the individual results for both of them. The view tissue for the GG5 is very similar to the GG4, and it is considered the big challenge to differentiate between them.
According to the previous studies, the general technique to classify the GG, (see e.g., the work in [19]) is that the DL systems are developed to segment digitized prostate biopsy specimens (PBSs) into regions according to GP, after which the GG is identified depending on the GS and GP grade and its predominance. However, no study reported the overall accuracy of the GP segmentation. They reported only the final results for the GG. Our work develops a DL-based computer-aided diagnostic (CAD) system for reading digitized PBSs sections and dealing with GP as a classification problem, not as a segmentation task using patch-and pixel-wise classification methodology. This is the first time this ideas was applied, to the best of our knowledge. Finally, GP labels are used to determine the GS and GG, and their performance is comparable to expert subspecialists.
The rest of the paper is structured as follows. Section 2 describes in details the proposed DL pipeline. The performance metrics used for evaluation and the details of experimental results are given in Section 3. The limitation and highlights for our pipeline are discussed in Section 4. Finally, Section 5 concludes the work.

Methods
This work's primary objective is to develop a CAD system for accurate GG diagnosis of PBSs. The proposed overall CAD system performs GP classification, as well as identifies the GS and GG. The GP classification pipeline consists of three stages: a DL system consisting of fusion three convolution neural networks (CNN); namely, a pyramidal CNN. Also, a patch and pixel-wise classification that divided the original image into patches and labeled them according to GP, the majority voting techniques is used in this step to merge the patches images into the original size, see Figure 2. Finally, identifying the GS and GG depends on the classification of the GP.

Deep Learning System
The CNN plays an essential role in many fields of medical image analysis, especially in the segmentation [12,20,21], and classification [22,23]. Our DL system has a pyramidal architecture containing three CNNs, see Figure 2. The overall framework for our DL system is depicted in Figure 3 , which shows the training and testing phases for for patch-wise classification. For the training model, the preprocessing step is applied to prepare the input data for the CNN. The preprocessing includes histogram equalization followed by edge enhancement [24,25]. The edge enhancement is applied to make the edges visible prominently by increasing the contrast of the pixels around the specific edges. The convolution matrix, namely, mask or kernel, is utilized to enhance the edges [24,25]. Figure 4 shows the effect of applying edge enhancement and histogram equalization on the original prostate patch. After that, the PBSs images are divided into overlapping patches with three different sizes: 100 × 100, 150 × 150, and 200 × 200 pixels, see Figure 2. The overlap between successive patches is 75%. The generation of overlapped patches provides different image viewpoints that enhance the DL framework's training and validation. We select for training those patches with no more than 30% of their area labeled as background in the ground truth. Each patch is assigned a single label, being the ground truth GP of most pixels in the patch. If the winning label matches the value of the center of the given patch, then this patch is selected for training. Otherwise, we remove it from the CNN training. Algorithm 1 presents all details about the preprocessing step.
The pyramidal CNN is composed of three CNNs, and each one has a different patch size, as shown in Figure 2. We designed the three CNNs such that they have the same architecture but with different sizes. The prominent architecture of any DL-CNN base consists of input layers, hidden layers, and an output layer [26]. Our CNN's input layer is fed with the patches from the first step (preprocessing) for our proposed framework. The small CNN (CNN S ) is fed with 100 × 100 patches, the medium CNN (CNN M ) is fed with 150 × 150 patches, and the large CNN (CNN L ) is provided with 200 × 200 patches. The core of CNN is the hidden layers that contain the number of CNN parameters and weights. The architecture of the hidden layers of our CNN is represented by a series of convolution layers (CLs) intervened by max-pooling layers (MPLs) and dropout layer, followed by two fully connected layers (FCls). Finally, there is a soft-max layer to give the probability for the five classes.
In the CL, the image is convolved with kernels (multiple kernels in each layer) to extract prominent features that describe the object of interest in the input patch; these features are called feature maps. Therefore, each CL results in a volume of feature maps. In our implementation, we use kernels of size 3 × 3. In MPLs, the spatial dimensions are reduced by a factor of two. Benefits are twofold: firstly, keeping only the most prominent features, discarding those less essential, and secondly, reducing computational cost and training time. In our implementation, the stride is equal to one for all layers.   The CNN contains four convolution layers, and the number of CLs filters is 16, 32, 64, and 128. The dropout layers weight 0.1, 0.1, 0.3, 0.3. The number of units for the FCLs is 64 and 512, respectively. The training seeks to minimize the cross-entropy loss between the predicted probabilities and the ground truth labels. The dropouts layers follow FCLs to minimize network overfitting, and the dropout rate is set to 0.15 for both layers. The total and trainable parameters for CNN S , CNN M , and CNN L , are 264,421, 534,757, and 952,549, respectively, and there are no nontrainable parameters for all of them. The labeled patches are used to train the proposed CNN. During the training, the CNN uses iterative optimization to learn its weights to maximize the number of correctly classified samples during prediction.
Our DL model has numerous parameters. Therefore, we utilized a hyper-parameter tuning with a random search (RS) technique that helps to reduce overfitting and improves our model's performance. Our system's accuracy is assessed by performing training, validation, and testing for the patches. The curves for the accuracy and loss of training the CNN L , the best accuracy among three CNN, are presented in Figure 5. Also, the validation (accuracy and loss) curves are shown in Figure 5. By increasing the number of epochs, the validation accuracy rises until it reaches around 0.78, and the validation loss decreases until it gets around 0.7.

Patch-and Pixel-Wise Classification
The goal for the patch-wise technique is to label all patches generated from the digitized PBS. We apply the patch-wise classification for all three CNNs individually during the test to assign each overlapped patch a label from one to five as GP categories. After that, a pixel-wise classification is applied to obtain three images of the same size as the original. The pixel-wise technique utilizes the output from the patch-wise classification, labeled patches generated from the three CNNs, to give all pixels that contain this patch the same label. Most of the pixels appear in several batches due to overlapped batches. Therefore, overlapped pixels have multiple labels depending on their position in the image. The majority voting is the core approach for the pixel-wise classification to get a single label for each overlapped pixel. The two techniques, patch-and pixel-wise, are used on the output of the three CNNs. The results after applying those techniques are three images of the same size as the original (each pixel has three labels); then, we applied majority voting again on those three images to obtain the final pixel-based classification.

Grade Groups System
The identification of the GG label is considered our goal in this work. Each digitized PBSs has labels between 1-5 according to its GP. We utilize Table 1 that demonstrates the relation between three measurements (GP, GS, and GG) to generate the GS from GP. Thus, identifying the GG from GS.

Results
The proposed framework is trained, validated, and tested on 416, 96, and 96, respectively, with whole slide images (WSIs) of the digitized PBSs from the Radboud University Medical Center, USA, and it was analyzed by the University of Louisville Hospital, USA [19,27]. A semiautomatic labeling technique was utilized to circumvent the need for full manual annotation by pathologists. Expert pathologists defined the GP, GS, and GG for all WSIs, and the digitized PBSs are divided into overlapping patches for patch and pixel-wise classification according to the GP ground truth. Our CAD software is primarily implemented in Python and Matlab programming environments. The experimental results were also performed on a Dell Precision workstation with an Intel Xeon eight-core CPU running at 3.31 GHz and 128 GB RAM.
The total number of patches for each CNN S , CNN M , and CNN L are around 5.8, 3.6, and 1.7 million, respectively. Those patches belong to 608 (416, 96, and 96) of the WSIs of the digitized PBS. The WSIs, 608, are separated into training, validation, and testing before generating the patches that means the model doesn't see the validation or testing patches. We face a big challenge to train our model with a balanced dataset because the occurrence of the GP1 is very high with around 60% of the PBS, and the other four types (GP2, GP3, GP4, GP5) have almost 40%. Therefore, we utilized all batches generated from the four GP (GP2, GP3, GP4, GP5) during our training and randomly selected the number of patches from the GP1 that made the number of patches very close. The CNN S , CNN M , and CNN L were trained with around 130, 45, and 25,000 patches, respectively, for each group.
Our goal for this automated system is to identify the GG. Therefore, there are many steps, and each one has its results before reaching our target. We show first the accuracy for each CNN, pyramidal CNN, especially the patch-wise classification, see Tables 2 and 3. After that, we demonstrate the system's accuracy for the GG. More details are presented in the following two subsections. Table 2. Patch-wise classification accuracy for our proposed CNN S and CNN L using precision, recall, F1-score, accuracy, macro-averaged, and weighted-averaged.

Patch-Wise Classification for Each CNN
The overall diagnostic classification accuracy is evaluated using the accuracy metrics precision, recall, F1-score, accuracy, macro-averaged, and weighted-averaged [28,29]. The F1-score is computed by the formula: The overall accuracy is the proportion of correctly classified samples out of all the samples; then, it must be equal for all metrics: precision, recall, F1-score. To summarize, the following always holds for the micro-F1.
micro-F1 = micro-precision = micro-recall = accuracy (2) since for the micro-averaging case, they are also equal to their harmonic mean; in other words, the micro-F1 case. The macro-average is also calculated for each metric and computed as simple arithmetic means of our per-class. The weighted average is the weighted of each class by the number of samples from that class. The patch-wise classification results for our proposed pyramidal CNN are reported in Tables 2 and 3. For CNN S , CNN M , and CNN L , the classification accuracy is in the range 0.70-0.76. The macro-averaged and weighted-average metrics are found to be around 0.68-0.77. To highlight the advantages of using our DL (CNN), the accuracy of our pipeline is compared against standard ResNet50, and VGG-16 [22,30]. The input patches are resized into 224 × 224 pixels to fit with the image input size of the ResNet50 and VGG-16. The F1-score of VGG-16 is in the range of 0.58-0.70. In addition, the macro-averaged and weighted-average metrics are both found to be 0.63. The accuracy of our CNN L is high compared with that of the VGG-16 (0.65) and almost the same with that of ResNet50 (0.75), see Table 3. Besides the high accuracy for our CNN, there are two reasons for creating this CNN. Firstly, the computational cost for the RestNet50 and VGG-16 is high compared with that of our CNN. The number of parameters for ResNet50 and VGG-16 is almost 23 and 27 million, respectively, and our CNN is approximately one million. The second reason is that our pyramidal CNN needs a flexible size for the CNN, and a standard CNN like ResNet50 is a fixed-size that is 224 × 224.

Grade Group Results
After applying the patch-and pixel-wise classification for each CNN, we merge the CNN outputs to obtain the production of the digitized PBS as the exact size of the original one. The new digitized PBS defines the results for GP from our automated system. We identify the GS and GG, the goal of our pipeline, using the fundamental converting that presents in the Table 1. Figure 6 shows the GG examples for our automated system results, which compare the reference standard and the predicted GG from our system using the distribution of GG.
The overall GG diagnostic accuracy is summarized in Table 4, which presents the accuracy metrics precision, recall, F1-score, accuracy, and negative predictive value (NPV). Also, the confusion matrices for the grade groups results are shown in Figure 7. To validate our CAD system's results and demonstrate its value, we compared our results with the pathologists' estimated results, and previous work [19]. Firstly, the discordance for diagnosing the GG from the same biopsy is between 30% and 50% for various pathologists [3,7,8]. Therefore, our CAD system's accuracy compared with that of pathologists' estimated results is acceptable because our results are about 80% for precision, between 60-80% for recall and F1-score, and around 94% for accuracy and NPV. Secondly, the average accuracy and NPV for our automated CAD system to identify GG are 0.8767 and 0.9167, respectively, while the previous work [19] has 0.8500 and 0.9100, respectively. The obtained results show that our results compared with that of previous work are higher than two percent for an average of accuracy and almost the same for the NPV. Table 4. Grade groups classification of our CAD system using precision, recall, F1-score, accuracy, and negative predictive value (NPV).

Discussion
The treatment for prostate cancer over the years improved for men with low-risk diseases. Notably, for patients with localized prostate cancer, active surveillance is safer compared with radical prostatectomy, as verified by many trials [31]. According to the American Society of Clinical Oncology, the GG and GP grading are considered the decisionmaker according to the guideline of the American Society of Clinical Oncology [32]. The consults were recommended to enhance the consistency and quality of care due to the interobserver variability for the Gleason system [32,33]. Therefore, our automated CAD system could be a valuable decision support tool for patients' GG with localized disease and give significant downstream treatment implications.
For that purpose, we developed an automated DL architecture for classifying the GG of digitized PBS. The ground truth of our datasets was performed using many experienced urologic subspecialists. They have around 25 years of experience with diverse backgrounds, and accessed several histologic sections and immunohistochemically stained sections for every specimen. The overall accuracy for our CAD system showed a similar rate compared with that of general pathologists, which is 70%. According to [34,35], the DL system can be used to alert pathologists of what might be missed. Otherwise, defining small tissue regions depends on a pathologist's judgment that leads to overrule of false-positive categorizations. Therefore, our DL-CAD system has benefits for bolstering the selection of treatment modalities, especially for patients with localized disease.
Developing a framework with high accuracy is our ultimate goal in which DL, patch-, and pixel-wise classification were performed. The accuracy of the diagnostic results for the proposed framework using pyramidal CNN presents that CNN L had higher accuracy than that of CNN S and CNN M , see Tables 2 and 3 and Figure 5 , which show the validation curves for the CNN L . The comparison between the best CNN against the standard ResNet50 and VGG-16 [22,30], shown in Table 3, emphasizes the benefits of using the hyper-parameter tuning and the RS technique. Besides, using the new idea to identify the problem as classification one, not a segmentation, developed high accuracy. The proposed CAD system can be helpful in healthcare systems in several ways, such as decreasing the consultation-associated costs, enhancing grading consistency, and reducing treatment-related morbidity for men with low-risk diseases. The performance metrics for GG estimation are higher for G1 and G2 grades compared with that of G3, G4, G5. Therefore, our automated system could be accurately classifying low-risk cases that are eligible for more conservative management.
The GG classification plays an essential role in prostate cancer treatment [31,36]. Still, it is not a straightforward task to the extent that there is no match the results between the subspecialists and the general pathologists' for the GG classification. The subspecialists' grading is more concordant than the general pathologists' grading [37]. However, due to the difficulty of GG and inherent subjectivity, there is discordance between subspecialists. Therefore, it is critical to enhance the risk stratification for prostate cancer by overcoming those disagreements. Developing a system with high precision that human graders and predict clinical risk is our priority. Machine learning, especially DL, models could distinguish novel histoprognostic signals that the human eye can not discover [38,39], as well as assistance in stratifying patient risk like existing molecular tests [40].
Despite the promising results, our automated DL system has some limitations. Firstly, our DL model was trained and tested from a single institution. Therefore, using an external test dataset for the different centers and WSI with various staining protocols should further enhance the robustness of our automated system. Secondly, our DL models, as well the pathologists who made the ground truth labeling, treated each biopsy as an independent sample. In clinical practice, multiple biopsies are sampled from various regions of the prostate. Therefore, an update to our model could take multiple biopsies into account and give a grade group prediction at the patient level. Finally, our study concentrated on the grading of acinar adenocarcinoma in prostate biopsies. However, prostate biopsies can contain other tumor types and foreign tissue, such as colon glands. The biopsies could also include additional prognostic information, such as the detection of intraductal carcinoma [41].

Conclusions
This paper introduced a Deep Learning-based CAD system to classify the grade groups (GG) system using digitized prostate biopsy specimens (PBSs) using pyramidal CNN, with patch-and pixel-wise classifications. The proposed pipeline results highlight our system's potential to classify all five GG of the PBSs in comparison with that of other standard CNNs. The agreement between our CAD system and pathologists is comparable to inter-rater reliability among pathologists. In future work, because the digitized PBSs do not have the same direction, adding a new preprocessing step to overcome this challenge will fit our results. This processing rotates the overlapped patches with angles 45 and 90, and flipping them will enhance our pyramidal CNN accuracy. In addition, to highlight our model, we will try to test our model with a dataset from another institution. Institutional Review Board Statement: This is a publicly available dataset.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data used in this study is available from https://www.kaggle. com/c/prostate-cancer-grade-assessment, accessed on 3 October 2021.