Artificial Intelligence Based Algorithms for Prostate Cancer Classification and Detection on Magnetic Resonance Imaging: A Narrative Review

Due to the upfront role of magnetic resonance imaging (MRI) for prostate cancer (PCa) diagnosis, a multitude of artificial intelligence (AI) applications have been suggested to aid in the diagnosis and detection of PCa. In this review, we provide an overview of the current field, including studies published between 2018 and February 2021 describing AI algorithms for (1) lesion classification and (2) lesion detection for PCa. Our evaluation of the 59 included studies showed that most research has been conducted for the task of PCa lesion classification (66%), followed by PCa lesion detection (34%). Studies showed large heterogeneity in cohort sizes, ranging from 18 to 499 patients (median = 162), combined with different approaches for performance validation. Furthermore, 85% of the studies reported on stand-alone diagnostic accuracy, whereas 15% demonstrated the impact of AI on diagnostic thinking efficacy, indicating limited proof of the clinical utility of PCa AI applications. To introduce AI within the clinical workflow of PCa assessment, the robustness and generalizability of AI applications need to be further validated through external validation and clinical workflow experiments.


Introduction
With a worldwide estimation of 1.4 million new cases in 2020, prostate cancer (PCa) is the second most common malignancy among men worldwide [1]. Despite the high prevalence of PCa, PCa-related deaths account for merely 10% of all cancer deaths, with a five-year survival rate exceeding 98% for all PCa stages combined [2]. Considering the high PCa prevalence and low mortality rate, accurate differentiation between aggressive and non-aggressive PCa is of high importance to decrease overdiagnosis and overtreatment. Artificial intelligence (AI) techniques may have the potential to highlight important characteristics indicative of disease and could therefore provide significant aid in PCa management [3].
In 2018 and 2019, several large prospective trials concluded that the use of magnetic resonance imaging (MRI) prior to biopsy increases the detection of (more aggressive) clinically significant (cs)PCa, while decreasing detection of (non-aggressive) clinically insignificant (cis)PCa compared to transrectal ultrasound guided biopsy [4][5][6][7]. For this reason, multiparametric (mp)MRI has been included in the guidelines of the European Association of Urology (EAU) to be performed prior to biopsy [8]. It is recommended to use the Prostate Imaging Reporting and Data System (PI-RADS v2.1) to report prostate MRI [9]. Suspicious lesions are graded from highly unlikely to highly likely for csPCa using a five-point Likert scale.
Due to the upfront role of mpMRI in the diagnostic pathway of PCa, the workload of prostate MRI examinations increases. Reporting these exams, however, requires substantial expertise and is limited by a steep learning curve and inter-reader variability [10][11][12]. Computer-aided detection and/or diagnosis (CAD) applications using AI may have a role in overcoming these challenges and aid in improving the workflow of prostate MRI assessment. Before AI-CAD applications for prostate MRI can be introduced within a clinical workflow, the applications currently described in the literature and the corresponding evidence for their potential use need to be investigated.
In this review we provide an overview of studies describing AI algorithms for prostate MRI analysis from January 2018 to February 2021, in which we differentiate applications for lesion classification and lesion detection for PCa. The study methodologies, data characteristics, and level of evidence are described. We furthermore review commercially available CAD software for PCa and discuss its clinical applications.

Background Machine Learning and Deep Learning Approaches
AI encompasses various subsets of learning techniques and algorithms. Machine learning (ML) is a subset within AI comprising algorithms that learn and predict specific tasks without explicit programming. For a long time, these ML techniques have served as the main pipeline for CAD applications [13]. Contrary to classical rule-based algorithms, ML is capable of learning and improving its task over time while being exposed to large and new data [14].
ML algorithms learn and predict by extracting and utilizing features [15]. In the field of prostate MRI, these features are mainly extracted from T2-weighted sequences and diffusion-weighted imaging (DWI) with apparent diffusion coefficient (ADC) maps, and may additionally be combined with clinical parameters such as serum prostate-specific antigen (PSA) level and PSA density (PSAd). An expanding field used for image feature selection is that of radiomics. Radiomics concerns the extraction of quantitative features from a region of interest (ROI), such as an annotation of a suspicious lesion, to describe the distinctive attributes of the ROI. Both semantic features, such as size and shape, and agnostic features, such as textures, are mined. The most significant features are selected and used in the learning task of the ML algorithm [16].
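To make the feature-extraction step concrete, the following minimal Python sketch computes three toy first-order features (mean, variance, histogram entropy) from a hypothetical list of ROI intensities. It is an illustration only; real radiomics pipelines (e.g., PyRadiomics) extract dozens to hundreds of semantic and agnostic descriptors.

```python
import math
from statistics import mean, pvariance

def first_order_features(roi_pixels):
    """Toy first-order 'radiomics' features from ROI intensities.
    Illustrative only; real pipelines compute far richer descriptors."""
    mu = mean(roi_pixels)
    var = pvariance(roi_pixels)
    # Shannon entropy over a coarse 8-bin intensity histogram.
    lo, hi = min(roi_pixels), max(roi_pixels)
    width = (hi - lo) / 8 or 1          # avoid zero-width bins
    counts = [0] * 8
    for p in roi_pixels:
        counts[min(int((p - lo) / width), 7)] += 1
    n = len(roi_pixels)
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c)
    return {"mean": mu, "variance": var, "entropy": entropy}

# Hypothetical T2-weighted intensities inside an annotated lesion ROI.
roi = [112, 118, 95, 140, 133, 127, 101, 99, 150, 122]
feats = first_order_features(roi)
```

A table of such per-lesion features, one row per annotated ROI, is what the selection step of an ML pipeline consumes.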
In more recent years, a particular subset of ML, deep learning (DL), gained popularity in CAD [13]. In contrast to classical ML algorithms, DL does not require prior feature extraction, as the algorithm learns to extract complex and abstract features during training [14]. DL algorithms can be divided into typical algorithms utilizing one-dimensional feature input, and convolutional neural networks (CNN), utilizing two- and three-dimensional feature input, such as prostate mpMRI sequences. CNNs are often utilized within medical image analysis [17]. Although DL algorithms may be implemented without prior feature selection, these algorithms are limited by the need for extensive data for training. In addition, due to their complex architecture, DL algorithms are less transparent and difficult to interpret, which impedes widespread application [18].
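The building block that lets CNNs consume two-dimensional input directly is the convolution layer. The sketch below implements a single valid-mode 2D convolution with a ReLU in plain Python, using a hand-picked edge filter where a trained CNN would learn the kernel weights from labeled data; it is a conceptual toy, not a working classifier.

```python
def conv2d_relu(image, kernel):
    """Valid-mode 2D convolution followed by a ReLU, the basic CNN
    building block. `image` and `kernel` are lists of rows of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(max(s, 0))  # ReLU non-linearity
        out.append(row)
    return out

# 4x4 toy "patch" with a vertical edge; in a real CNN the kernel
# weights are learned from labeled MRI patches rather than fixed.
patch = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
feature_map = conv2d_relu(patch, vertical_edge)  # [[3, 3], [3, 3]]
```

Stacking many such learned filters, interleaved with pooling and followed by fully connected layers, yields the CNN architectures used for lesion classification.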

Materials and Methods
The Pubmed and Cochrane libraries were searched for studies describing ML algorithms for the characterization, detection, and grading of PCa on MRI. The search was limited to articles written in English from 2018 to February 2021 using the combined terms: artificial intelligence, machine learning, prostate cancer, magnetic resonance imaging, and corresponding synonyms for each term. The search was limited to these years to retrieve articles most representative of the current research field. Additional references were identified by manual search in the reference lists of included papers. Duplicates, reviews, conference abstracts, preceding articles of described algorithms, and articles not related to the topic were excluded (Figure 1).

For categorization between various AI-CAD algorithms, studies were categorized within two common tasks [13,19]:

1. Lesion classification algorithms, i.e., Computer-Aided Diagnosis (CADx). Within the first group we included algorithms that classify manually annotated regions, such as lesion segmentations. We discriminate between two-class classification algorithms, utilizing either ML or DL, and multi-class classification algorithms.

2. Lesion detection algorithms, i.e., Computer-Aided Detection (CADe). The second group included algorithms that detect and localize PCa lesions and provide the user with probability maps, segmentations, and/or attention boxes as output. We discriminate between algorithms providing two-class detection and multi-class detection.
For all studies, AI algorithm characteristics, MRI sequences used, study design and cohort size, ground truth for PCa, and performance were extracted. Studies were graded using an adaptation of the hierarchical model for diagnostic imaging efficacy from Fryback and Thornbury, applicable for assessment of AI software in clinical practice (Table 1) [20,21].
Secondly, a search for commercially available CAD software for PCa was performed to investigate currently available products for clinical application. Applications were included if the CAD was suited for prostate MRI assessment and had received Food and Drug Administration (FDA) clearance and/or European Conformity (CE) marking. For included applications, key features, market date, and literature evidence were assessed.

Table 1. Hierarchical model of efficacy to assess the contribution of AI software to the diagnostic imaging process. An adapted model from van Leeuwen et al. [21], based on Fryback and Thornbury's hierarchical model of efficacy [20].

Level 1t *: Technical efficacy. Article demonstrates the technical feasibility of the software.
Level 1c **: Potential clinical efficacy. Article demonstrates the feasibility of the software to be clinically applied. Typical measures: correlation to alternative methods, potential predictive value, biomarker studies.
Level 2: Diagnostic accuracy efficacy. Article demonstrates the stand-alone performance of the software. Typical measures: stand-alone sensitivity, specificity, area under the ROC ¶ curve, or Dice score.
Level 3: Diagnostic thinking efficacy. Article demonstrates the added value to the diagnosis. Typical measures: radiologist performance with/without AI, change in radiological judgement.
Level 4: Therapeutic efficacy. Article demonstrates the impact of the software on patient management decisions. Typical measures: effect on treatment or follow-up examinations.

AI Algorithms for Prostate Cancer Classification and Detection
In total, 59 studies were included in this review (Figure 2). Thirty-nine articles (66%) described lesion classification algorithms. Of these, 35 articles (59%) described two-class lesion classification, with 25 articles (42%) using an ML and 10 articles (17%) a DL approach. Four articles (7%) were included for multi-class lesion classification. The 20 remaining articles (34%) described lesion detection algorithms, with 17 studies (29%) for two-class lesion detection and 3 studies (5%) for multi-class lesion detection. Additionally, six commercially available AI applications for prostate MRI with either FDA clearance and/or CE marking were identified. In the next sections, topics will be summarized according to each category.

Lesion Classification (CADx)
In recent years, numerous ML and DL algorithms have been described for classification (CADx) of suspicious prostate lesions on MRI. Their task is to classify a manually annotated ROI into two or more classes, such as malignant versus benign tissue, classification between csPCa and cisPCa, or multi-class classification according to lesion aggressiveness (histopathological grading) or likelihood of csPCa (PI-RADS). Due to the different AI architectures of ML and DL and the large number of included studies, we describe two-class lesion classification for ML and DL approaches separately.

Two-Class Lesion Classification with Machine Learning
In total, twenty-five studies described two-class lesion classification with an ML approach (Table 2). Most of these algorithms follow a similar workflow (Figure 3). MR exams are used as input, either multiparametric or single-sequence MR. Suspicious regions are manually or semiautomatically annotated by expert readers and used to extract image features. Image features comprise semantic features, such as size, shape, and vascularity, and agnostic features, which describe the heterogeneity of the ROI through quantitative descriptors [16]. As shown in Table 2, image features may be extended with clinical variables such as PSAd. Subsequently, features with a strong relationship with the output labels are selected and used in the ML classification model. The output of the algorithm is a prediction score for two classes, such as malignant versus benign lesions, for annotated ROIs. Included studies comprised cohort sizes ranging from 20 to 381 patients (median = 129). The gold standard for malignant lesions was obtained via prostate biopsy (19/25 (76%)) or after radical prostatectomy (7/25 (28%)). For most studies, lesion classification was based on either classification between malignant (ISUP ≥ 1) and benign lesions or csPCa (ISUP ≥ 2) vs. cisPCa (ISUP 1).
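The feature-selection step of this workflow can be illustrated with a minimal sketch that ranks hypothetical lesion features by absolute Pearson correlation with the benign/malignant labels and keeps the strongest ones. All feature names and values below are made up for illustration; real pipelines use more robust selection methods (e.g., LASSO, mutual information) on much larger cohorts.

```python
import math

def correlation(xs, ys):
    """Pearson correlation between one feature column and binary labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def select_features(feature_table, labels, top_k):
    """Keep the `top_k` features most correlated (in absolute value)
    with the malignant/benign labels."""
    ranked = sorted(feature_table,
                    key=lambda f: abs(correlation(feature_table[f], labels)),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical per-lesion feature values for five lesions.
features = {
    "entropy": [2.1, 2.4, 2.3, 1.2, 1.1],
    "size_mm": [14, 15, 13, 6, 5],
    "psad":    [0.21, 0.19, 0.23, 0.08, 0.09],
    "noise":   [0.5, 0.1, 0.9, 0.4, 0.6],  # uninformative by design
}
labels = [1, 1, 1, 0, 0]  # malignant vs. benign ground truth
selected = select_features(features, labels, top_k=3)  # drops "noise"
```

The surviving feature columns are then fed to the ML classifier, which outputs the two-class prediction score described above.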


Figure 3. Machine learning (ML) workflow of two-class lesion classification for prostate cancer using an axial T2-weighted sequence. As input, multiparametric or singular MR sequences are used. Regions of interest (ROIs) are annotated, labeled, and used for feature extraction. A selection of features is used to train the ML algorithm. As output, the annotated region is classified in one of the two classes.
Only a limited number of studies involved multicenter data (4/25 (16%)), whereas the remaining studies utilized retrospectively collected data from a single center (21/25 (84%)). In nine studies, the performance was assessed with cross-validation methods due to a limited study cohort size. Sixteen studies assessed performance on unseen data, and a subset of these performed external validation [29,33,41,43,47]. In the work of Kan et al., validation on an internal test set yielded a per-lesion AUC for PCa characterization of 0.83. When tested on an external cohort, the per-lesion performance decreased to an AUC of 0.67, indicating the importance of external validation for robust performance assessment [33,49].
Most of the included studies for ML-based two-class lesion classification solely described the stand-alone performance of the algorithm and did not investigate the influence of CADx in a (prospective) clinical workflow, resulting in a level 2 efficacy (stand-alone performance; see also Table 1). To aid in performance interpretation, ten out of twenty-five studies compared their algorithm with visual scoring by radiologists. For example, Antonelli et al. compared the performance of the algorithm with the assessment of three radiologists to identify Gleason 4 components in suspicious MRI lesions. The algorithm yielded a higher sensitivity at a 50% threshold for lesion classification in the peripheral zone (0.93) compared with the mean sensitivity of the three radiologists (0.72) [24].
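The AUC values reported throughout these comparisons can be computed directly from the rank interpretation of the AUC (the Mann-Whitney U statistic), without plotting an ROC curve. The sketch below uses fabricated lesion scores purely to illustrate the internal-versus-external performance drop discussed above.

```python
def auc(scores, labels):
    """Area under the ROC curve via its rank (Mann-Whitney) form: the
    probability that a random positive scores above a random negative,
    counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Fabricated lesion scores: a model that separates an internal cohort
# perfectly but degrades on an external one.
internal = auc([0.9, 0.8, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0])  # 1.0
external = auc([0.9, 0.4, 0.3, 0.6, 0.2], [1, 1, 1, 0, 0])  # ~0.67
```

Because the AUC is rank-based, it is insensitive to monotone recalibration of the scores, which is one reason it is the standard benchmark across these heterogeneous studies.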
In order to investigate the added clinical value of AI-based lesion classification, Xu et al. and Zhang et al. introduced decision curve analysis (DCA) using retrospective data. DCA is utilized to assess the clinical utility and additional benefit of a prediction algorithm, e.g., assessment of an algorithm to reduce the number of unnecessary biopsies. As a result, a simulated impact on patient management is provided, which benefits the interpretation of clinical utility (efficacy level 4) [50]. Both Xu et al. and Zhang et al. showed that, compared to the treat-all-patients scheme or the treat-none scheme, ML algorithms could improve net benefit if the threshold probability of a patient or doctor was higher than 10% [46,47].
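DCA rests on the net benefit formula of Vickers and Elkin: true positives per patient minus false positives weighted by the odds implied by the threshold probability. The sketch below evaluates a hypothetical classifier against the treat-all strategy at a 10% threshold; all patient counts are made up.

```python
def net_benefit(tp, fp, n, pt):
    """Net benefit at threshold probability `pt` (Vickers & Elkin):
    true-positive rate minus false-positive rate weighted by the odds
    implied by the threshold."""
    return tp / n - (fp / n) * (pt / (1 - pt))

def net_benefit_treat_all(prevalence, pt):
    """Treat-all strategy: every patient is biopsied/treated."""
    return prevalence - (1 - prevalence) * (pt / (1 - pt))

# Hypothetical cohort: 100 patients, 40 with csPCa; a model finding
# 35 true positives at the cost of 10 false positives.
pt = 0.10  # 10% threshold probability, as in the cited studies
model = net_benefit(tp=35, fp=10, n=100, pt=pt)
treat_all = net_benefit_treat_all(prevalence=0.40, pt=pt)
# model > treat_all: the classifier adds net benefit at this threshold
```

Plotting net benefit over a range of thresholds for the model, treat-all, and treat-none strategies yields the decision curve that the cited studies report.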

Two-Class Lesion Classification with Deep Learning
In total, ten studies were included in which DL was used for two-class lesion classification (Table 3). (Table 3 note: ** efficacy level 4 was assigned for potential simulated therapeutic efficacy as determined with decision curve analysis.)
Compared to ML, DL does not require feature selection, as features are learned during training (Figure 4). An ROI is annotated on MRI. As depicted in Table 3, ROIs encompass patches or volumes around the lesion or prostate gland and may be extended with clinical features. Selected ROIs are fed into a DL classification algorithm, in which features are extracted and one of two classes is predicted for the corresponding input. Alternatively, DL can be combined with ML, in which a DL approach is used for feature extraction and an ML algorithm for classification [58]. Of the included studies, cohort size ranged from 18 to 499 patients (median = 278). Ground truth was provided by biopsy (9/10 (90%)) or radical prostatectomy (1/10 (10%)). Six out of the ten studies aimed to discriminate benign tissue from csPCa (ISUP ≥ 2) and four studies aimed to classify benign from malignant lesions (ISUP ≥ 1). Several studies applied transfer learning [52,59,60]. With this approach, algorithms pretrained for a different classification task are applied within a different but related learning task, decreasing the large labeled-data requirement [14]. Chen et al. utilized a network pretrained on diabetic retinopathy diagnosis, which was trained on a dataset of 128,000 images [52]. Zhong et al. showed that a higher AUC and accuracy could be achieved with transfer learning (AUC = 0.726, accuracy = 0.723) compared with the DL model without transfer learning (AUC = 0.687, accuracy = 0.702) [60].
Similar to the ML algorithms, most of the DL algorithms described in Table 3 solely reported the stand-alone performance of the algorithm (efficacy level 2).

Multi-Class Lesion Classification
Several recent algorithms have introduced multi-class lesion classification, utilizing both conventional ML and DL algorithms, to assess lesion aggressiveness (n = 4 studies, Table 4). Assessment of aggressiveness is important for PCa management. The histopathological grade is defined by the International Society of Urological Pathology (ISUP) [68]. Patients with cisPCa (often ISUP 2 or lower) are eligible for active surveillance (AS), whereas men with higher-grade lesions (ISUP > 2) are advised to undergo invasive treatment, such as radical prostatectomy or radiotherapy [8,69]. Multi-class lesion classification algorithms utilize ML or DL techniques to grade input ROIs into different groups according to lesion aggressiveness (Figure 5). Of the included studies, cohort size ranged from 72 to 112 patients (median = 99) and all studies used prostate biopsy as ground truth [63][64][65].
Only the study of Jensen et al. included multiple datasets from various sites [65]. Due to the smaller cohorts, two studies utilized cross-validation methods for validation. Both Chaddad et al. and Jensen et al. utilized independent data for algorithm validation [64,65]. No included study investigated additional value of CADx in a clinical setting and solely provided stand-alone performance of the algorithm (efficacy level 2).

Lesion Detection (CADe)
Besides classification of predefined ROIs on prostate MRI, several algorithms have been described that automatically detect suspicious PCa lesions (CADe). The general pipeline for these algorithms is displayed in Figure 6. Compared to lesion classification algorithms, no prior lesion annotation is necessary, as the AI method classifies the image at the voxel level rather than at the ROI level. For this reason, PCa detection algorithms could aid in automated prostate MRI assessment by presenting suspicious areas to the reader as probability maps and/or segmentations. Studies described detection algorithms for two classes (e.g., malignant versus benign) and multi-class detection, in which malignant tissue is detected and classified according to its aggressiveness.
Figure 6. Deep learning (DL) and machine learning (ML) workflow of algorithms for two-class lesion detection for prostate cancer (PCa) using an axial T2-weighted sequence. As input, multiparametric or single MR sequences are utilized. During training, features are learned and used to classify image voxels within benign or malignant classes. Algorithms provide a probability map for prostate cancer likelihood. Based on a threshold within the probability map (e.g., probability > 0.5), prostate cancer segmentations (red) or attention boxes based on prostate cancer segmentations (yellow) may be extracted.
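The thresholding step that turns a probability map into a segmentation, and the derivation of an attention box from it, can be sketched in a few lines of Python. The 4x4 map below is a made-up toy example standing in for one axial slice.

```python
def probability_map_to_mask(prob_map, threshold=0.5):
    """Threshold a per-voxel cancer-probability map (one 2D slice)
    into a binary lesion mask."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_map]

def bounding_box(mask):
    """Smallest (row_min, col_min, row_max, col_max) attention box
    enclosing all positive voxels; None for an empty mask."""
    coords = [(i, j) for i, row in enumerate(mask)
              for j, v in enumerate(row) if v]
    if not coords:
        return None
    rows, cols = [i for i, _ in coords], [j for _, j in coords]
    return (min(rows), min(cols), max(rows), max(cols))

# Made-up 4x4 probability map for one axial slice.
probs = [[0.1, 0.2, 0.1, 0.0],
         [0.1, 0.7, 0.9, 0.1],
         [0.0, 0.6, 0.8, 0.2],
         [0.0, 0.1, 0.2, 0.1]]
mask = probability_map_to_mask(probs)      # binary segmentation
box = bounding_box(mask)                   # (1, 1, 2, 2)
```

The choice of threshold trades sensitivity against the number of false-positive voxels, which is why CADe studies often report performance at several operating points.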

Two-Class Lesion Detection
In total, seventeen studies for two-class lesion detection were included (Table 5).

Most of the studies validated the performance of an original algorithm, whereas five studies performed a new validation study on existing CADe applications, assessing the robustness and generalizability of the algorithm. McGarry et al. performed a follow-up study on a previously reported radiology-pathology mapping algorithm for high-grade PCa detection [81]. The follow-up study showed that using different pathologists to annotate lesions introduced variability in model performance [80]. Schelb and colleagues simulated clinical deployment of a previously developed DL algorithm for detection of csPCa [84,85]. In this study, a new cohort of 259 patients was included for validation of the algorithm. Schelb et al. concluded that similar performance compared to PI-RADS assessment by radiologists was observed (sensitivity of 0.84 for PI-RADS ≥ 4 versus 0.83 for DL) and that regular quality assurance of the model is desirable to maintain its performance [84].
Multiple studies investigated the added diagnostic value of CADe (efficacy level 3). Zhu et al. performed a study in which integration of CADe with structured MR reports of prostate mpMRI was evaluated [89]. A DL algorithm was trained to create probability maps of csPCa, which were visualized during reporting of prostate MRI. The AUC increased from 0.83 to 0.89 with CADe assistance during reading. Furthermore, with CADe, 23/89 lesions were correctly upgraded versus 6/89 lesions incorrectly downgraded [89]. Gaur et al. and Greer et al. utilized multi-center studies to evaluate the additional value of CADe for PCa [73,76]. Readers were first asked to assess mpMRI sequences without AI assistance. For the second session, readers were instructed to perform PCa detection with probability maps created by CADe, in combination with the full mpMRI sequences. In 2020, Mehralivand et al. performed a second multi-center study utilizing the AI technique previously utilized by Gaur et al. [73,82]. Instead of probability maps, attention boxes for csPCa were provided, to avoid compromising the interaction between the radiologists and the AI system. The lesion-based AUC did not significantly increase with CADe-assisted reading (0.749 for MRI and 0.775 for CADe assistance) [82].
Cao et al. detected and classified lesions in six grade groups: normal tissue, and ISUP 1 to ISUP 5 [92]. The authors implemented a multi-class DL algorithm with ordinal encoding incorporating both T2W and ADC images. Vente and colleagues described a 2D DL segmentation approach in which zonal masks were implemented alongside mpMRI [93]. Their work assigned different classes to lesions according to the probability of the output layer, with a higher ISUP group correlating to a higher probability. A quadratic-weighted kappa score of 0.13 was achieved, indicating that lesion detection combined with grading remains a difficult task [93].
In the study of Winkel et al., an algorithm combining both detection and classification of lesions according to PI-RADSv2.1 was investigated [94]. In their work, a prototype DL-based CADe application was validated in a prospective PCa screening study involving 48 patients. The algorithm first detects lesion candidates, then reduces false positive candidates, followed by classification according to PI-RADSv2.1. Kappa statistics were applied to assess the agreement of the AI solution with PI-RADSv2.1 classification by radiologists. A weighted kappa of 0.42 was observed, showing moderate agreement with PI-RADS scoring [94]. All studies were assigned an efficacy level of 2. Although Winkel et al. utilized a prospective study design for validation of the AI performance, no AI-assisted lesion detection within a clinical workflow was performed [94]. The prospective aspect of the study, providing unique validation data, does provide stronger evidence for the algorithm [95]. Both Cao et al. and Winkel et al. incorporated comparison with radiological assessment [92,94]. Cao et al. showed a non-significant difference between the radiologists (sensitivity of 83.9% and 80.7%) and their algorithm (sensitivity of 80.5% and 79.2%) for the detection of histopathology-proven PCa lesions and csPCa lesions [92]. Winkel et al. showed that both the AI technique and the radiologist were able to identify all biopsy-verified PCa lesions [94].
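The quadratic-weighted kappa used in both studies penalizes disagreements by the squared distance between the assigned grades, so confusing ISUP 1 with ISUP 2 costs far less than confusing ISUP 1 with ISUP 5. A self-contained Python sketch of the statistic (with hypothetical toy ratings):

```python
def quadratic_weighted_kappa(a, b, n_classes):
    """Cohen's kappa with quadratic weights between two integer
    ratings in 0..n_classes-1; disagreements are penalized by the
    squared distance between the assigned grades."""
    n = len(a)
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    hist_a = [a.count(k) for k in range(n_classes)]
    hist_b = [b.count(k) for k in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * obs[i][j] / n
            den += w * hist_a[i] * hist_b[j] / (n * n)
    return 1.0 - num / den

# Hypothetical grades (0 = benign .. 3) from AI vs. pathology.
kappa = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)  # 1.0
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, which puts the reported scores of 0.13 and 0.42 in perspective.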

Commercial CAD Algorithms for Prostate MRI
For CAD to be used in clinical practice, it needs to be approved by local authorities. In the United States, the Food and Drug Administration (FDA) clears medical devices, and in Europe a CE mark is necessary. For prostate MR analysis there are currently six products commercially available, of which three are FDA cleared, one is CE marked, and two are both [96,97] (Table 7). The aim of these products is to optimize the prostate reading workflow and/or enhance lesion detection. The most important AI-based feature in five of these products is prostate segmentation to acquire volumetric information and calculate PSAd (OnQ Prostate, Cortechs.ai; PROView, GE Medical Systems; Quantib Prostate, Quantib; qp-Prostate, Quibim). One product claims to provide an image-level probability for the presence of cancer, including heatmaps to aid the radiologist in PCa detection (JPC-01K, JLK Inc.). Only a single product describes AI-based multi-class lesion detection (CADe), classifying lesion candidates according to PI-RADSv2.1 (Prostate MR, Siemens Healthineers). The performance of a prototype of this product was validated within the work of Winkel et al. (Table 6) [94]. Further scientific evidence on the performance or efficacy of the products is limited.
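PSA density, the main quantity these segmentation products derive, is simply serum PSA divided by the MRI-measured gland volume. The sketch below pairs it with the common ellipsoid approximation of prostate volume, which AI segmentation replaces with a direct voxel count; all measurement values are hypothetical.

```python
import math

def ellipsoid_volume_ml(length_cm, width_cm, height_cm):
    """Common ellipsoid approximation of prostate volume,
    L x W x H x pi/6 (1 cm^3 = 1 ml); AI segmentation replaces this
    manual estimate with a direct voxel count."""
    return length_cm * width_cm * height_cm * math.pi / 6

def psa_density(psa_ng_ml, volume_ml):
    """PSA density (PSAd): serum PSA divided by prostate volume."""
    return psa_ng_ml / volume_ml

# Hypothetical measurements: a 5 x 4 x 4 cm gland, serum PSA 6 ng/ml.
volume = ellipsoid_volume_ml(5.0, 4.0, 4.0)  # ~41.9 ml
density = psa_density(6.0, volume)           # ~0.14 (ng/ml)/ml
```

Because PSAd feeds directly into biopsy decisions, an automated and reproducible volume measurement is the main selling point of these segmentation products.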

Discussion
In this review, we identified current AI algorithms for PCa lesion classification (CADx) and detection (CADe). Our overview showed that most recent work focuses on lesion classification using ML applications with radiomics features. The included studies showed large differences in cohort sizes, ranging from 18 to 499 patients (median = 162), with different approaches for validation of algorithm performance. Few studies reach efficacy levels higher than level 2, illustrating the limited evidence for the clinical impact and utility of AI applications in PCa.
The majority of the included studies (50/59 (85%)) describe the stand-alone performance of algorithms and evaluate diagnostic accuracy with the AUC (efficacy level 2).
While the AUC enables benchmarking of different algorithms and reader performances, physicians are more interested in AI performance compared against experienced readers in order to assess clinical utility [95]. Only a few studies (9/59 (15%)) reach efficacy level 3 or higher by validating the CAD within a clinical workflow [46,47,53,55,56]. This is in concordance with a recent meta-analysis on ML classification of csPCa, in which the authors did not find improved PCa detection when using CAD in a clinical workflow [98].
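To make this distinction concrete: the stand-alone AUC (efficacy level 2) summarizes ranking performance over all thresholds, whereas a reader comparison requires sensitivity and specificity at a fixed operating point. A minimal sketch with synthetic data (all numbers are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic validation set: 1 = csPCa lesion, 0 = benign/indolent.
y_true = rng.integers(0, 2, size=200)
# Synthetic AI suspicion scores, loosely correlated with the labels.
y_score = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=200), 0.0, 1.0)

# Stand-alone diagnostic accuracy (efficacy level 2).
auc = roc_auc_score(y_true, y_score)

# A radiologist reading at a fixed threshold contributes a single
# (sensitivity, 1 - specificity) point, which can be benchmarked
# against the algorithm's full ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"stand-alone AUC = {auc:.2f}")
```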
The reported performances from the included studies require careful interpretation. For example, only a limited number of prospective studies were found, and cohort sizes were relatively small, ranging from 18 to 499 study subjects (median = 162). It is questionable whether these relatively small datasets are sufficient to train robust and generalizable AI. To illustrate, the prototype CADe system validated by Winkel et al. was trained on 2170 biparametric prostate MR examinations obtained from eight different institutions, while a commercially available AI-CADe system for breast cancer detection was trained on 189,000 mammograms obtained from four different vendors [94,99]. In addition, regarding reproducibility and generalizability, a limited number of the included studies utilized external validation. Supplementary Figures S1-S3 illustrate the observed heterogeneity in cohort sizes and validation approaches. From a clinical perspective, it is of utmost importance to validate the predictive performance of an algorithm on external data in an extensive cohort [95]. Although multiple studies utilized a split-sample approach, in which a subset of data was solely reserved for validation and therefore could assess the validity of the algorithm, split-set validation does not provide an accurate assessment of generalizability. Utilization of test data from different institutions, and potentially different MR systems, could provide more generalizable results [49,100]. Liu et al. systematically reviewed 82 articles regarding the diagnostic accuracy of DL algorithms for disease classification in medical imaging evaluated against health-care professionals [101]. A major finding of their work was the limited number of publications presenting external validation and comparing performance with health-care professionals on the same samples [101]. Castillo et al. systematically reviewed literature regarding ML classification of csPCa on MRI.
Their work confirms the lack of homogeneous reporting and external validation. The authors therefore advocate for prospective study designs to assess added clinical value on external and new patient data, combined with standardized reporting methods [98]. A possible solution to produce more comparable results is to set up a challenge in which all algorithms are validated on the same dataset. In this way, validation and benchmarking of algorithms remain centralized. An example of this approach is the grand-challenge platform (https://grand-challenge.org, accessed on 3 May 2021), in which various challenges are introduced for AI-based medical imaging tasks.
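The difference between split-sample and external validation can be sketched as follows, with synthetic data standing in for two institutions (this is an illustration of the validation schemes, not any included study's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# "Institution A" development cohort, plus an independent "institution B"
# cohort with a deliberately different data distribution.
X_a, y_a = make_classification(n_samples=300, n_features=10, random_state=0)
X_b, y_b = make_classification(n_samples=150, n_features=10, shift=0.5,
                               random_state=1)

# Split-sample validation: hold out part of the development cohort.
X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, test_size=0.3,
                                          random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc_internal = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# External validation: the same frozen model scored on institution B.
# A drop here reveals limited generalizability that the split alone hides.
auc_external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])
```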
Within this work, a distinction between ML techniques and DL techniques was made. Especially in the classification group (CADx), a large subset of studies utilized ML combined with a radiomics workflow. Compared to DL, radiomics is often favored due to the transparency of the features used to learn a classification task, as opposed to the 'black box' phenomenon observed with DL [18]. Although the radiomics pipeline facilitates transparency in AI decision making, and therefore could aid trustworthy AI, the robustness of these algorithms has yet to be assessed in larger studies [102]. A limitation of the radiomics pipeline is the reproducibility of imaging data. Factors such as inter-reader agreement for manually selected ROIs, variability due to different vendors and imaging protocols, and high numbers of correlated and clustered features may limit the reproducibility and generalizability of these models [102,103].
This review has several limitations. The first is the inclusion period between 2018 and February 2021. Due to this criterion, a multitude of proof-of-concept studies regarding PCa classification utilizing radiomics features were included, whereas studies published prior to 2018 on traditional ML techniques for distinguishing benign versus PCa, and on multi-class classification or detection according to tumor aggressiveness, were excluded. Our rationale was that multi-reader studies or extended validation studies on AI algorithms developed before 2018 would appear in more recent years. This was indeed observed for multiple included studies, in which traditional ML algorithms developed prior to 2018 were evaluated in multi-reader studies between 2018 and 2021 [73,76,82,89]. Future work could include publications prior to 2018 to reduce bias towards radiomics algorithms and introduce more work on traditional ML algorithms; this, however, would also reduce the relevance to the current scientific field of AI for PCa assessment.
Secondly, due to the observed heterogeneity in cohort sizes and validation approaches, no meta-analysis on performance was implemented, and no comparison between methods, such as ML and DL, could be made with adequate support. To overcome this, performance could be assessed by grouping the various AI approaches while weighting for the corresponding cohort sizes and validation approaches. This, however, exceeded the aims and objectives of this review and can be addressed in future work.
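A crude sketch of such cohort-size weighting is shown below; the AUCs and cohort sizes are invented for illustration, and a proper meta-analysis would additionally model variance and between-study heterogeneity:

```python
# Hypothetical reported AUCs and cohort sizes from three studies.
aucs = [0.82, 0.88, 0.75]
cohort_sizes = [48, 162, 499]

# Cohort-size-weighted pooled AUC: larger studies contribute more.
pooled_auc = sum(a * n for a, n in zip(aucs, cohort_sizes)) / sum(cohort_sizes)
print(f"pooled AUC = {pooled_auc:.3f}")  # 0.784
```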
Many of the AI features studied in the literature, such as PCa detection and diagnosis, have not yet found their way into commercial products. Current commercial CAD applications are mostly focused on prostate segmentation and volumetrics to improve the reporting workflow, and only a single application was observed with AI-supported detection and lesion classification. This finding provides insight into the gap between academic results and clinical practice and may partly explain the lack of evidence on the impact of AI in clinical practice. Recent studies on CAD prototypes from, e.g., Winkel et al. indicate the future direction of commercial CAD applications, in which AI solutions for classification and detection of PCa lesions are gaining interest. The same authors have performed a new study on CAD implementation, published after the inclusion period of this review, underlining the rapid development within this field [104]. It is expected that future work on PCa CAD applications for lesion classification and detection will continue, with initiatives arising to centralize and combine data from multiple institutions to increase the generalizability and robustness of PCa CAD [105].

Conclusions
Multiple AI algorithms for PCa classification and detection are being investigated in the current literature. Although the stand-alone performance of these algorithms appears promising for future implementation in the clinical workflow, work on generalizability and robustness still needs to be performed to assess the clinical benefit and utility of AI in this field. Future work should focus on increased cohort sizes, external validation, and benchmarking performance against expert readers to aid the development of reproducible and interpretable AI. In addition, studies incorporating CAD within a clinical workflow are necessary to demonstrate clinical utility and will guide the next steps for PCa CAD applications.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/diagnostics11060959/s1, Figure S1: Observed heterogeneity in study cohort, validation cohort, and validation approach for two-class lesion classification algorithms utilizing machine learning. Figure S2: Observed heterogeneity in study cohort, validation cohort, and validation approach for two-class lesion classification algorithms utilizing deep learning. Figure S3: Observed heterogeneity in study cohort, validation cohort, and validation approach for two-class lesion detection algorithms.