Transfer Learning for Adenocarcinoma Classifications in the Transurethral Resection of Prostate Whole-Slide Images

Simple Summary
In this study, we trained deep learning models to classify TUR-P WSIs into prostate adenocarcinoma and benign (non-neoplastic) lesions using transfer and weakly supervised learning. Overall, the model achieved good classification performance on whole-slide images, demonstrating the potential benefit of future deployment in a practical TUR-P histopathological diagnostic workflow system.

Abstract
The transurethral resection of the prostate (TUR-P) is an option for benign prostatic diseases, especially for nodular hyperplasia patients who have moderate to severe urinary problems that have not responded to medication. Importantly, incidental prostate cancer is diagnosed at the time of TUR-P for benign prostatic disease. TUR-P specimens contain a large number of fragmented prostate tissues, which makes them time-consuming for pathologists to examine, as each fragment must be checked one by one. In this study, we trained deep learning models to classify TUR-P WSIs into prostate adenocarcinoma and benign (non-neoplastic) lesions using transfer and weakly supervised learning. We evaluated the models on TUR-P, needle biopsy, and The Cancer Genome Atlas (TCGA) public dataset test sets, achieving an ROC-AUC of up to 0.984 on the TUR-P test sets for adenocarcinoma. These results demonstrate the promising potential of deployment in a practical TUR-P histopathological diagnostic workflow system to improve the efficiency of pathologists.


Introduction
According to the Global Cancer Statistics 2020, prostate cancer is the most frequently diagnosed cancer in men in over one-half (112 of 185) of the countries of the world. It was the fifth leading cause of cancer death among men in 2020, with an estimated 1,414,259 new cases and 375,304 deaths worldwide [1]. The only way to properly diagnose prostate cancer is via histopathological confirmation [2]. Nodular hyperplasia (benign prostatic hyperplasia) is a common benign disorder of the prostate that represents a nodular enlargement of the gland caused by hyperplasia of both glandular and stromal components, resulting in an increase in the weight of the prostate. The conventional treatment for nodular hyperplasia is surgical. The transurethral resection of the prostate (TUR-P) is one of the most widely practiced surgical procedures, estimated to have been performed in about 7864 cases in Japan in 2014 [3]. Although the gold standard for diagnosing prostate cancer is the biopsy, TUR-P can also yield an incidental diagnosis of prostate cancer. In TUR-P, the electrical loop of a resectoscope excises hyperplastic prostate tissue to improve urine flow, producing many tiny tissue fragments of variable sizes during the procedure. Compared with conventional biopsy specimens (e.g., endoscopic biopsies from the gastrointestinal tract), TUR-P specimens are characterized by a very large volume of tissue and a large number of glass slides; the histopathological diagnosis of TUR-P specimens is therefore one of the most tedious and error-prone tasks, because there are a large number of tissue artifacts and determining the orientation of the specimen is difficult. Importantly, cancers, especially prostate adenocarcinoma, are detected incidentally in around 5-17% of TUR-P specimens [4][5][6][7][8][9][10]. Conventional active treatments (surgery or radiotherapy) are indicated in T1a patients with a life expectancy longer than 10 years and in the majority of T1b patients [4].
The histopathological evaluation of cancer (adenocarcinoma) in TUR-P specimens is important because the presence of cancer in more than 5% of the tissue fragments [11,12] or high-grade cancer [13,14] may affect the choice of treatment. Thus, for TUR-P specimens, reporting both the number of microscopic foci of carcinoma and the percentage of carcinomatous involvement is recommended. All these factors and burdens mentioned above highlight the benefit of establishing a histopathological screening system to detect prostate adenocarcinoma based on TUR-P specimens. Conventional glass slides of TUR-P specimens can be digitized as whole-slide images (WSIs), which could benefit from the application of computational histopathology algorithms, especially deep learning models, for aiding pathologists, reducing the burden of time-consuming diagnosis, and increasing the appropriate detection rate of prostate adenocarcinoma in TUR-P WSIs as part of a screening system.
In computational pathology, deep learning models have been widely applied to histopathological cancer classification on WSIs, cancer cell detection and segmentation, and the stratification of patient outcomes [15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Previous studies have looked into applying deep learning models for adenocarcinoma classification in stomach [28][29][30], colon [28,31], lung [29,32], and breast [33,34] histopathological specimen WSIs. In a previous study, we trained a prostate adenocarcinoma classification model on needle biopsy WSIs [35] and evaluated it on both needle biopsy and TUR-P WSI test sets to confirm its applicability to different types of specimens, achieving an ROC-AUC of up to 0.978 on the needle biopsy test sets; however, the model under-performed on TUR-P WSIs. Therefore, in this study, we trained deep learning models specifically for TUR-P WSIs. We evaluated the trained models on TUR-P, needle biopsy, and TCGA (The Cancer Genome Atlas) public dataset test sets, achieving ROC-AUCs of up to 0.984 on the TUR-P test sets, 0.913 on the needle biopsy test sets, and 0.947 on the TCGA public dataset test sets. These findings suggest that deep learning models could be very useful as routine histopathological diagnostic aids for inspecting TUR-P WSIs to detect prostate adenocarcinoma precisely.

Clinical Cases and Pathological Records
Retrospectively, a total of 2060 H&E (hematoxylin & eosin)-stained histopathological specimen slides of human prostate adenocarcinoma and benign (non-neoplastic) lesions were collected from the surgical pathology files of three hospitals, Shinyukuhashi, Wajiro, and Shinkuki hospitals (Kamachi Group Hospitals, Fukuoka, Japan), after a histopathological review of all specimens by surgical pathologists. Of the 2060 slides, 1560 were TUR-P slides obtained from 276 patients and 500 were needle biopsy slides obtained from 238 patients. All were obtained between 2017 and 2019. Histopathological specimens were selected randomly to reflect real clinical settings as much as possible. Prior to the experimental procedures, each WSI diagnosis was reviewed by at least two pathologists (each with at least five years of experience), with the final checking and verification performed by senior pathologists (each with at least ten years of experience). The pathologists had to agree on whether the diagnosis was adenocarcinoma or benign. All WSIs were scanned at a magnification of ×20 using the same Leica Aperio AT2 Digital Whole Slide Scanner (Leica Biosystems, Tokyo, Japan) and were saved in the SVS file format with JPEG2000 compression.

Dataset
The hospitals that provided histopathological specimen slides in the present study were anonymised (e.g., Hospital-A, B, and C). Table 1 breaks down the distribution of the training and validation sets of TUR-P WSIs from two domestic hospitals (Hospital-A and B). Validation sets were selected randomly from the training sets (Table 1). The test sets consisted of TUR-P, needle biopsy, and TCGA (The Cancer Genome Atlas) public dataset WSIs (Table 2). The distribution of the test sets from three domestic hospitals (Hospital-A, B, and C) and the TCGA public dataset is summarized in Table 2. Patients' pathological records were used to extract the WSIs' pathological diagnoses and to assign WSI labels. Training set WSIs were not annotated, and the training algorithm used only WSI diagnosis labels, meaning that the only information available for training was whether a WSI contained adenocarcinoma or was benign (non-neoplastic lesion); no information about the location of the cancerous tissue lesions was provided. The external prostate TCGA datasets are publicly available from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/, accessed on 18 January 2022). We confirmed that surgical pathologists were able to diagnose the test sets in Table 2 from the visual inspection of the H&E-stained WSIs alone.

Deep Learning Models
We trained the models via transfer learning using the partial fine-tuning approach [36]. This is an efficient fine-tuning approach that uses the weights of an existing pre-trained model and fine-tunes only the affine parameters of the batch-normalization layers and the final classification layer. For the model architecture, we used EfficientNetB1 [37], starting with weights pre-trained on ImageNet. Figure 1 shows an overview of the training method. The training methodology used in the present study was the same as reported in our previous studies [29,35]. For the sake of completeness, we repeat the methodology here.
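The partial fine-tuning setup can be sketched as follows (a minimal illustration under our assumptions, not the authors' released code; it uses tf.keras, with weights=None standing in for the ImageNet weights so the sketch is self-contained): every layer is frozen except the batch-normalization layers, whose affine parameters stay trainable, plus the freshly added classification head.

```python
import tensorflow as tf

def build_partially_tunable_model(num_classes: int = 1) -> tf.keras.Model:
    """EfficientNetB1 in which only the batch-normalization layers and
    the final classification layer remain trainable (partial fine-tuning)."""
    base = tf.keras.applications.EfficientNetB1(
        include_top=False,
        weights=None,  # in practice: weights="imagenet"
        input_shape=(512, 512, 3),
        pooling="avg",
    )
    # Freeze every layer except batch normalization, whose affine
    # parameters (gamma/beta) are the ones being fine-tuned.
    for layer in base.layers:
        layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)
    # The new classification head is trainable by default.
    outputs = tf.keras.layers.Dense(num_classes, activation="sigmoid")(base.output)
    return tf.keras.Model(base.input, outputs)
```

In practice, `weights="imagenet"` (or weights from another histopathology model) would be passed so that fine-tuning starts from pre-trained features.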
We performed slide tiling by extracting square tiles from the tissue regions of the WSIs. We started by detecting the tissue regions in order to eliminate most of the white background; this was conducted by thresholding a grayscale version of the WSI using Otsu's method [38]. During prediction, we tiled the tissue regions in a sliding-window fashion using a fixed-size stride (256 × 256 pixels). During training, we initially performed random balanced sampling of tiles extracted from the tissue regions, maintaining an equal balance of each label in the training batch. To do so, we placed the WSIs in a shuffled queue such that we looped over the labels in succession (i.e., we alternated between picking a WSI with a positive label and one with a negative label). Once a WSI was selected, we randomly sampled batch_size/num_labels tiles from it to form a balanced batch. During the inference step, the model weights were frozen, and the model was applied to the entire tissue region of each WSI to select the tiles with the highest probabilities. The top k tiles with the highest probabilities were then selected from each WSI and placed into a queue. During training, the selected tiles from multiple WSIs formed a training batch and were used to train the model.
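The tissue detection and tiling step can be illustrated with the following sketch (our simplified NumPy reimplementation, not the authors' code; the `min_tissue` acceptance fraction is an assumed parameter): Otsu's method picks the grayscale threshold that maximizes between-class variance, separating tissue from the white background, and a 256-pixel sliding window keeps tiles that contain enough tissue.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Otsu's method on an 8-bit grayscale image: choose the threshold
    maximizing the between-class variance of the histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 cumulative probability
    mu = np.cumsum(prob * np.arange(256))    # class-0 cumulative mean mass
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

def tissue_tile_origins(gray, tile=256, stride=256, min_tissue=0.1):
    """Slide a fixed-size window over the image and keep the (x, y)
    origins of tiles whose tissue fraction (pixels at or below the Otsu
    threshold, since the background is white) exceeds min_tissue."""
    mask = gray <= otsu_threshold(gray)
    h, w = gray.shape
    origins = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            if mask[y:y + tile, x:x + tile].mean() >= min_tissue:
                origins.append((x, y))
    return origins
```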
To maintain balance at the WSI level, we oversampled from the WSIs to ensure that the model trained on tiles from all WSIs in each epoch. We then switched to the hard mining of tiles. To perform hard mining, we alternated between training and inference. During inference, the CNN was applied in a sliding-window fashion to all the tissue regions of the WSI, and we then selected the k tiles with the highest probability of being positive. This step effectively selects the tiles that are most likely to be false positives when the WSI is negative. The selected tiles were placed in a training subset, and once that subset contained N tiles, training was initiated. We used k = 8, N = 256, and a batch size of 32.
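One inference round of the hard-mining scheme can be sketched as follows (a simplified illustration under assumed data structures; `score_tile` stands for the frozen model's positive-class probability for a tile): the k = 8 highest-scoring tiles per WSI are accumulated, paired with the WSI label, until the training subset reaches N = 256 tiles.

```python
import heapq

def hard_mine(wsis, score_tile, k=8, subset_size=256):
    """One round of hard mining: score every tile of every WSI with the
    frozen model, keep the k highest-probability tiles per WSI, and
    accumulate (tile, wsi_label) pairs until subset_size is reached.
    On negative WSIs these are exactly the likeliest false positives."""
    subset = []
    for wsi in wsis:
        top_k = heapq.nlargest(k, wsi["tiles"], key=score_tile)
        subset.extend((tile, wsi["label"]) for tile in top_k)
        if len(subset) >= subset_size:
            break  # enough hard examples collected; resume training
    return subset[:subset_size]
```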
To obtain a single prediction for each WSI from its tile predictions, we took the maximum probability over all tiles. We used the Adam optimizer [39] with binary cross-entropy as the loss function and the following parameters: beta_1 = 0.9, beta_2 = 0.999, a batch size of 32, and a learning rate of 0.001 when fine-tuning. We used early stopping by tracking the performance of the model on a validation set; training stopped automatically when there was no further improvement in the validation loss for 10 epochs. We chose the model with the lowest validation loss as the final model.
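The WSI-level aggregation and early-stopping rules above can be sketched as follows (a minimal illustration, not the authors' code; in TensorFlow the same stopping behavior is available via `tf.keras.callbacks.EarlyStopping` with `patience=10` and `restore_best_weights=True`):

```python
import numpy as np

def wsi_probability(tile_probs) -> float:
    """Aggregate tile-level probabilities into a single WSI-level
    score by taking the maximum over all tiles."""
    return float(np.max(tile_probs))

class EarlyStopper:
    """Stop when validation loss has not improved for `patience` epochs,
    remembering the epoch with the lowest validation loss."""
    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1
        self.wait = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
        else:
            self.wait += 1
        return self.wait >= self.patience  # True -> stop training
```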

Software and Statistical Analysis
The deep learning models were implemented and trained using TensorFlow [40]. AUCs were calculated in Python using the scikit-learn package [41] and plotted using matplotlib [42]. The 95% CIs of the AUCs were estimated using the bootstrap method [43] with 1000 iterations.
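The bootstrap estimation of the AUC confidence interval can be sketched as follows (our NumPy illustration, not the authors' code; in practice `sklearn.metrics.roc_auc_score` would replace the hand-written Mann-Whitney AUC): resample the cases with replacement, recompute the AUC on each resample, and take the percentile interval over 1000 iterations.

```python
import numpy as np

def auc_mann_whitney(y_true, y_score) -> float:
    """ROC-AUC via the Mann-Whitney U statistic: the probability that a
    random positive scores above a random negative (ties count 1/2)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, y_score, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC over n_iter resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_iter:
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # a resample needs both classes to define an AUC
        aucs.append(auc_mann_whitney(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```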
The true positive rate (TPR), also called sensitivity, was computed as TPR = TP / (TP + FN).
The false positive rate (FPR) was computed as FPR = FP / (FP + TN).
The true negative rate (TNR), also called specificity, was computed as TNR = TN / (TN + FP), where TP, FP, TN, and FN represent the numbers of true positive, false positive, true negative, and false negative predictions, respectively. The ROC curve was computed by varying the probability threshold from 0.0 to 1.0 and computing both the TPR and FPR at each threshold.
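The threshold sweep described above can be sketched as follows (a minimal NumPy illustration; scikit-learn's `roc_curve` computes the same curve at the distinct score thresholds):

```python
import numpy as np

def roc_curve_points(y_true, y_score, thresholds=None):
    """Sweep the probability threshold from 0.0 to 1.0 and return the
    (FPR, TPR) point at each threshold, where
    TPR = TP / (TP + FN) and FPR = FP / (FP + TN)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    points = []
    for t in thresholds:
        pred = y_score >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points
```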

Insufficient AUC Performance of WSI Prostate Adenocarcinoma Evaluation on TUR-P WSIs Using an Existing Series of Adenocarcinoma Classification Models
Prior to training a new prostate adenocarcinoma model using TUR-P WSIs, we applied existing adenocarcinoma classification models and evaluated their AUC performances on the TUR-P test sets (Table 2). The existing adenocarcinoma classification models [45] are summarized in Table 3. Table 3 shows that the Colon poorly ADC-2 (×20, 512) and Lung Carcinoma (×10, 512) models exhibited both high ROC-AUC and low log loss values compared to the other models. Thus, we used the Colon poorly ADC-2 (×20, 512) and Lung Carcinoma (×10, 512) models as initial weights for fine-tuning on the TUR-P training sets when performing transfer learning (Table 1).

True Negative Prostate Adenocarcinoma Prediction of TUR-P WSIs
Our model (TL-colon poorly ADC-2 (×20, 512)) showed true negative predictions of prostate adenocarcinoma in TUR-P WSIs (Figure 6A,B). In Figure 6A, histopathologically, there was nodular hyperplasia (benign prostatic hyperplasia) with chronic inflammation in all tissue fragments, without evidence of malignancy (Figure 6A,C-F), and none were predicted as prostate adenocarcinoma (Figure 6B,C,E). Figure 6. Representative true negative prostate adenocarcinoma prediction outputs on a whole-slide image (WSI) from the transurethral resection of the prostate (TUR-P) test sets using the model (TL-colon poorly ADC-2 (×20, 512)). Histopathologically, in (A), there was nodular hyperplasia (benign prostatic hyperplasia) with chronic inflammation without any evidence of malignancy (C-F). The heatmap images (B,C,E) show a true negative prediction of prostate adenocarcinoma. The heatmap uses the jet color map, where blue indicates low probability and red indicates high probability.

False Positive Prostate Adenocarcinoma Prediction of TUR-P WSIs
According to the histopathological reports and an additional pathologist's review, there was no prostate adenocarcinoma in these TUR-P WSIs (Figure 7A,E,H). Our model (TL-colon poorly ADC-2 (×20, 512)) showed false positive predictions of prostate adenocarcinoma (Figure 7B-D,F,G,I,J). These false positive tissue areas (Figure 7B-D,F,G,I,J) showed xanthogranulomatous inflammation (Figure 7A,C,D), macrophagic infiltration (Figure 7E,G), and squamous metaplasia with pseudo-koilocytosis (Figure 7H,J), which could be the primary causes of the false positives due to their morphological similarity to adenocarcinoma cells.

False Negative Prostate Adenocarcinoma Prediction of TUR-P WSIs
According to the histopathological report and an additional pathologist's review, in this TUR-P WSI (Figure 8A), there was a very small number of adenocarcinoma cells infiltrating a tissue fragment (Figure 8C), which pathologists had marked with blue dots. However, our model (TL-colon poorly ADC-2 (×20, 512)) did not predict any prostate adenocarcinoma cells (Figure 8B,C).

Discussion
In this study, we trained deep learning models for the classification of prostate adenocarcinoma in TUR-P WSIs. Of the four models we trained (Table 4), the best model (TL-colon poorly ADC-2 (×20, 512)) achieved ROC-AUCs in the range of 0.896-0.984 on the TUR-P test sets. The best model also achieved high ROC-AUCs on the needle biopsy (0.913) and TCGA public dataset (0.947) test sets. The model TL-lung carcinoma (×10, 512) also achieved high ROC-AUCs on all test sets, albeit lower than the best model. The other two models were trained using EfficientNetB1 [37], starting with weights pre-trained on ImageNet, at different magnifications (×10 and ×20) and tile sizes (224 × 224 px and 512 × 512 px). The models based on EfficientNetB1 (EfficientNetB1 (×10, 224) and EfficientNetB1 (×20, 512)) achieved high ROC-AUC values on the TUR-P Hospital-B test sets compared to the other test sets (Table 4). This shows that the additional pre-training on other histopathological images was beneficial. Based on the prediction heatmap images of prostate adenocarcinoma, it was obvious that the models based on EfficientNetB1 (EfficientNetB1 (×10, 224) and EfficientNetB1 (×20, 512)) incorrectly predicted the blue ink dots, which pathologists had marked during diagnosis, as prostate adenocarcinoma (Figure 3). Based on this finding, we looked over the WSIs in the TUR-P Hospital-B test sets (Table 2), and most of the adenocarcinoma-positive WSIs (28 out of 30) had ink dots, which were falsely predicted as adenocarcinoma. On the other hand, the transfer learning models (TL-colon poorly ADC-2 (×20, 512) and TL-lung carcinoma (×10, 512)) showed no false positive predictions on ink dots (Figure 3); this is because those models had been trained on WSIs with ink labelled as non-neoplastic.
The best model (TL-colon poorly ADC-2 (×20, 512)) and the second best model (TL-lung carcinoma (×10, 512)) were trained via the transfer learning approach from our existing colon poorly differentiated adenocarcinoma classification model [31] and lung carcinoma classification model [45], based on the ROC-AUC and log loss values obtained on the TUR-P test sets (TUR-P Hospital-A-B) using the existing adenocarcinoma classification models (Table 3). We used the partial fine-tuning approach [36] to train the models faster, as fewer weights are involved in the tuning. We used only 1020 TUR-P WSIs (adenocarcinoma: 79 WSIs; benign: 941 WSIs) (Table 1) without manual annotations by pathologists [28,34,44]. By training specifically on TUR-P WSIs, the models significantly improved their prediction performance on the TUR-P test sets (Table 4) compared to a previous study [35], which had lower ROC-AUC (0.737-0.909) and higher log loss (3.269-4.672) values. The combination of both models can provide accurate prostate adenocarcinoma classification on both needle biopsy [35] and TUR-P WSIs in a routine histopathological diagnostic workflow.
Nodular hyperplasia (benign prostatic hyperplasia) is a common benign disorder of the prostate; as a histopathological diagnosis, it refers to the nodular enlargement of the gland caused by hyperplasia of both glandular and stromal components within the prostatic transition zone, which results in varying degrees of urinary obstruction, sometimes requiring surgical intervention, including TUR-P [46]. Importantly, incidental prostate cancers are diagnosed at the time of TUR-P for benign prostatic disease [10]. According to the literature, cancers, particularly prostate adenocarcinoma, are detected incidentally in around 5-17% of TUR-P specimens [4][5][6][7][8][9][10], meaning that around 83-95% of TUR-P specimens are benign lesions, which is nearly identical to the ratio of adenocarcinoma in the TUR-P test sets (Table 2). Therefore, the high specificity values (0.884-0.992) of the best model are noteworthy (Table 5). Moreover, the heatmap images revealed correct true negative predictions on each non-neoplastic fragment in both adenocarcinoma (Figure 4) and benign (non-neoplastic) (Figure 6) WSIs. Thus, the heatmap images predicted by the best model would provide great benefits for pathologists who have to report detailed descriptions of many TUR-P specimens in routine clinical practice.
One limitation of this study is that it primarily included specimens from a limited number of hospitals and suppliers in Japan; therefore, the model could potentially be biased towards such specimens. Further validation on a wide variety of specimens from multiple different origins would be essential for ensuring the robustness of the model. Another potential validation study could compare the performance of the model against pathologists in a clinical setting. A further limitation of the study is that it simply performed classification with respect to adenocarcinoma regardless of the Gleason score; in clinical practice, being able to classify the Gleason score would be of more interest.

Conclusions
The best deep learning model established in the present study offers promising results that indicate that it could be beneficial as a screening aid for pathologists prior to observing histopathology on glass slides or WSIs. At the same time, the model could be used as a double-checking tool for reducing the risk of missed cancer foci (incidental adenocarcinoma in TUR-P specimens). The most important advantage of using a fully automated computational tool is that it can systematically handle large amounts of WSIs without the potential bias due to the fatigue commonly experienced by pathologists, which could drastically alleviate the heavy clinical burden of practical pathology diagnoses when using conventional microscopes.

Informed Consent Statement:
Informed consent to use histopathological samples and pathological diagnostic reports for research purposes had previously been obtained from all patients prior to surgical procedures at all hospitals, and the opportunity to refuse to participate in the research had been guaranteed in an opt-out manner.

Data Availability Statement:
The datasets generated during and/or analysed during the current study are not publicly available due to specific institutional requirements governing privacy protection but are available from the corresponding author upon reasonable request. The datasets that support the findings of this study are available from Kamachi Group Hospitals (Fukuoka, Japan), but restrictions apply to the availability of these data, which were used under a data-use agreement that was made according to the Ethical Guidelines for Medical and Health Research Involving Human Subjects as set by the Japanese Ministry of Health, Labour and Welfare (Tokyo, Japan) and, thus, are not publicly available. However, the data are available from the authors upon reasonable request for private viewing and with permission from the corresponding medical institutions within the terms of the data use agreement and if compliant with the ethical and legal requirements as stipulated by the Japanese Ministry of Health, Labour and Welfare.