Review

Artificial Intelligence in Prostate MRI: Current Evidence and Clinical Translation Challenges—A Narrative Review

by Vlad-Octavian Bolocan 1, Alexandru Mitoi 1, Oana Nicu-Canareica 1,2,3, Maria-Luiza Băean 2, Cosmin Medar 2,3,* and Gelu-Adrian Popa 2,4

1 Doctoral Program Studies, University of Medicine and Pharmacy “Carol Davila”, 050474 Bucharest, Romania
2 Department of Fundamental Sciences, Faculty of Midwifery and Nursing, University of Medicine and Pharmacy “Carol Davila”, 050474 Bucharest, Romania
3 Clinical Laboratory of Radiology and Medical Imaging, Clinical Hospital “Prof. Dr. Theodor Burghele”, 050664 Bucharest, Romania
4 Department of Laboratory of Radiology and Medical Imaging, Sf. Ioan Clinical Emergency Hospital, 042122 Bucharest, Romania
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(10), 335; https://doi.org/10.3390/jimaging11100335
Submission received: 26 August 2025 / Revised: 18 September 2025 / Accepted: 25 September 2025 / Published: 26 September 2025
(This article belongs to the Section AI in Imaging)

Abstract

Despite rapid proliferation of AI applications in prostate MRI showing impressive technical performance, clinical adoption remains limited. We conducted a comprehensive narrative review of literature from January 2018 to December 2024, examining AI applications in prostate MRI with emphasis on real-world performance and implementation challenges. Among 200+ studies reviewed, AI systems achieve 87% sensitivity and 72% specificity for cancer detection in research settings. However, external validation reveals average performance drops of 12%, with some implementations showing degradation up to 31%. Only 31% of studies follow reporting guidelines, 11% share code, and 4% provide model weights. Seven real-world implementation studies demonstrate integration times of 3–14 months, with one major center terminating deployment due to unacceptable false positive rates. The translation gap between artificial and clinical intelligence remains substantial. Success requires shifting focus from accuracy metrics to patient outcomes, establishing transparent reporting standards, developing realistic economic models, and creating appropriate regulatory frameworks. The field must combine methodological rigor, clinical relevance, and implementation science to realize AI’s transformative potential in prostate cancer care.

1. Introduction

Prostate cancer represents a formidable global health challenge, standing as the second most common malignancy in men worldwide, with over 1.4 million new cases diagnosed annually according to GLOBOCAN 2020 data and projections suggesting an increase to 2.3 million by 2040 [1]. The diagnostic landscape has undergone revolutionary transformation with the adoption of multiparametric MRI (mpMRI), which has emerged as the cornerstone of modern prostate cancer detection and risk stratification [2]. This imaging modality combines anatomical T2-weighted sequences with functional imaging, including diffusion-weighted imaging (DWI) and dynamic contrast enhancement (DCE), providing unprecedented visualization of suspicious lesions [3].
The impact of mpMRI on clinical practice has been profound. The landmark PROMIS trial demonstrated that mpMRI could reduce unnecessary biopsies by 28% while improving detection of clinically significant cancer (Gleason score ≥ 7) by 18% compared to systematic transrectal ultrasound-guided biopsy alone [4]. The subsequent PRECISION trial further validated these findings, showing that MRI-targeted biopsy detected 38% of clinically significant cancers compared to 26% with standard biopsy [5]. These results have led to fundamental changes in international guidelines, with mpMRI now recommended before biopsy in most clinical scenarios [6].
Yet significant challenges persist. Interpretation variability remains problematic, with inter-reader agreement for PI-RADS scoring showing only moderate concordance (κ = 0.46–0.78) even among experienced radiologists [7]. Reading time averages 15–30 min per examination, creating workflow bottlenecks in high-volume centers [8]. Most concerning, 15–20% of clinically significant cancers remain undetected even by expert readers, particularly in the transition zone and anterior prostate [9]. These limitations have created an urgent need for technological solutions that can standardize interpretation, improve efficiency, and enhance diagnostic accuracy.
Artificial intelligence has emerged as a potentially transformative solution, with investment in medical AI growing in recent years and prostate imaging identified as a priority application by major funding agencies [10]. The promise is compelling: automated detection algorithms that demonstrate consistent performance within their operational parameters, can process large volumes of data when properly maintained, and could potentially extend specialized expertise to underserved regions where specialist radiologists are scarce—though significant infrastructure and training requirements must be considered [11]. Early enthusiasm has been fueled by impressive technical achievements, with recent algorithms matching or exceeding human performance in controlled research settings [12].
However, the translation from research success to clinical impact tells a more complex story. Among hundreds of algorithms described in peer-reviewed literature, a handful have achieved regulatory approval for clinical use in major markets [13]. The few published implementation studies reveal substantial challenges: performance degradation in real-world settings, workflow integration difficulties, resistance from clinical teams, and unclear economic value propositions [14]. These findings echo historical lessons from computer-aided detection in mammography, where initial enthusiasm gave way to recognition that technical capability alone does not ensure clinical utility [15].
This narrative review critically examines the current state of AI in prostate MRI, moving beyond technical performance metrics to address fundamental questions about clinical utility, implementation barriers, and the path toward clinically meaningful integration into radiological practice. We define “meaningful integration” as deployment that demonstrably improves at least one of the following: diagnostic accuracy, workflow efficiency, patient outcomes, or healthcare accessibility, while maintaining safety standards and economic viability. We analyze not only what AI can do, but what it should do, considering the complex interplay of technology, clinical workflow, economics, and human factors that determines real-world success. Key terminology and clinical concepts relevant to this review are summarized in Table 1.
The objectives of this narrative review are to: (1) systematically examine the current technical capabilities and limitations of AI systems for prostate MRI interpretation; (2) analyze the barriers preventing successful clinical translation, including technical, economic, regulatory, and human factors; (3) evaluate the methodological quality of existing evidence; and (4) provide evidence-based recommendations for stakeholders to facilitate meaningful clinical implementation. Unlike systematic reviews that focus on quantitative synthesis, this narrative approach allows us to integrate diverse evidence types and provide critical analysis of the complex sociotechnical factors influencing AI adoption in clinical practice.

2. Materials and Methods

This narrative review synthesizes evidence on artificial intelligence (AI) for prostate MRI with emphasis on real-world performance, implementation challenges, and methodological quality. We searched PubMed, Embase, IEEE Xplore, Web of Science, and major imaging/AI conference proceedings for studies published between January 2018 and December 2024. The search strategy employed the following terms with Boolean operators: (“prostate” OR “prostatic”) AND (“MRI” OR “magnetic resonance imaging” OR “mpMRI” OR “multiparametric”) AND (“artificial intelligence” OR “AI” OR “deep learning” OR “machine learning” OR “neural network” OR “CNN” OR “computer-aided”) AND (“detection” OR “diagnosis” OR “classification” OR “segmentation” OR “CAD”). Eligible publications addressed AI tools for prostate MRI with focus on: (1) cancer detection or diagnosis, (2) lesion segmentation or classification, (3) workflow support or quality assessment. We excluded: preprocessing-only studies without diagnostic components, non-MRI imaging modalities (ultrasound, PET), studies with n < 50 patients, conference abstracts without full papers, and non-English publications. We prioritized reports describing clinical deployment, external validation, or head-to-head comparisons with human readers. From included studies, we extracted task, dataset characteristics, validation design, performance metrics, and any reported implementation details (integration pathway, workflow impact, safety signals, economics). Owing to heterogeneity in tasks, datasets, and endpoints, we conducted a narrative synthesis rather than a quantitative meta-analysis.
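For illustration, the PubMed arm of this Boolean strategy could be executed programmatically via the NCBI E-utilities, for example through Biopython's Entrez wrapper. The sketch below is a minimal, hypothetical example rather than the actual retrieval pipeline used for this review; the contact e-mail address and the retmax limit are placeholders.

```python
# Hypothetical sketch of the PubMed arm of the search strategy described above,
# using Biopython's Entrez E-utilities wrapper. Email and retmax are placeholders.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address

query = (
    '("prostate" OR "prostatic") AND '
    '("MRI" OR "magnetic resonance imaging" OR "mpMRI" OR "multiparametric") AND '
    '("artificial intelligence" OR "AI" OR "deep learning" OR "machine learning" '
    'OR "neural network" OR "CNN" OR "computer-aided") AND '
    '("detection" OR "diagnosis" OR "classification" OR "segmentation" OR "CAD")'
)

handle = Entrez.esearch(
    db="pubmed",
    term=query,
    datetype="pdat",        # filter on publication date
    mindate="2018/01/01",
    maxdate="2024/12/31",
    retmax=5000,            # upper bound on returned PMIDs
)
record = Entrez.read(handle)
print(f'{record["Count"]} records; first PMIDs: {record["IdList"][:5]}')
```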
As a narrative review, we did not follow systematic review protocols (e.g., PRISMA) but employed a structured approach to ensure comprehensive coverage. Our methodology prioritized critical synthesis over quantitative meta-analysis, allowing integration of diverse evidence types including technical papers, implementation studies, and gray literature that would be excluded from traditional systematic reviews.
We used large language models—Claude Opus 4.1 (Anthropic) and ChatGPT (GPT-5 Thinking) (OpenAI)—for superficial text editing (grammar, punctuation, and wording) and for drafting candidate phrasing in selected narrative passages (e.g., tightening prose and proposing alternative formulations). All AI-assisted text was reviewed, edited, and verified by the authors; the tools were not used to generate, analyze, or interpret scientific data, and all references were manually checked for accuracy.

3. Results

3.1. Evolution and Current Landscape

The application of computational methods to prostate MRI analysis has evolved through distinct phases, each reflecting broader trends in medical imaging and artificial intelligence [16]. Early approaches (2010–2015) relied on traditional computer-aided diagnosis (CAD) systems using hand-crafted features such as texture analysis, Haralick features, and local binary patterns [17]. These systems, while pioneering, achieved modest performance with areas under the curve (AUC) typically ranging from 0.70 to 0.80, limited by their inability to capture complex spatial relationships and their dependence on expert-defined features [18].
Deep learning architectures fundamentally differ from classical machine learning through their ability to automatically learn hierarchical feature representations. Convolutional layers extract progressively abstract features—from edges and textures in early layers to complex anatomical structures in deeper layers—eliminating the need for manual feature engineering. Recent advances in self-supervised pre-training and transformer architectures, originally developed for natural language processing and now adapted for vision tasks, have further expanded the capabilities of medical imaging AI. While large language models (LLMs) have revolutionized text-based tasks, their principles of attention mechanisms and pre-training on massive datasets have inspired similar approaches in medical imaging, leading to foundation models that can be fine-tuned for specific diagnostic tasks with limited labeled data.
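To make the idea of hierarchical feature learning and fine-tuning concrete, the following deliberately minimal PyTorch sketch (not taken from any reviewed system; layer sizes and the model name are illustrative assumptions) stacks convolutions over a single-channel MRI slice and attaches a small classification head that could be trained while a pre-trained feature extractor stays frozen.

```python
# Minimal illustration (not from any reviewed study) of hierarchical feature
# learning: stacked convolutions extract increasingly abstract features from a
# single-channel MRI slice, and a small head maps them to a lesion/no-lesion score.
import torch
import torch.nn as nn

class TinyProstateCNN(nn.Module):
    def __init__(self, in_channels: int = 1, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # edges/textures
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),           # local patterns
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),           # larger structures
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x).flatten(1)
        return self.head(feats)

# Fine-tuning idea: freeze the (pre-trained) feature extractor, train only the head.
model = TinyProstateCNN()
for p in model.features.parameters():
    p.requires_grad = False
logits = model(torch.randn(4, 1, 128, 128))  # batch of 4 synthetic 128x128 slices
print(logits.shape)                          # torch.Size([4, 2])
```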
The paradigm shifted dramatically with the deep learning revolution, initiated by breakthrough performances in the ImageNet competition and rapidly adopted in medical imaging [19]. Convolutional neural networks (CNNs), which learn hierarchical feature representations directly from image data, eliminated the need for manual feature engineering and demonstrated superior performance across multiple imaging tasks [20]. In prostate MRI, this transition began around 2017–2018, coinciding with the release of curated public datasets like PROSTATEx and the establishment of standardized evaluation challenges [21].
Contemporary AI architectures for prostate MRI employ sophisticated approaches that reflect the state-of-the-art in medical image analysis. The nnU-Net framework, which automatically configures itself based on dataset properties, has emerged as a powerful baseline, achieving AUCs approaching 0.91 for cancer detection on curated datasets [22]. This self-configuring approach addresses a critical challenge in medical imaging: the need for extensive hyperparameter tuning that often requires deep technical expertise [23].
Attention mechanisms represent another significant advancement, allowing models to focus on suspicious regions while maintaining global context [24]. The FocalNet architecture, specifically designed for prostate MRI, combines focal loss functions with attention modules to address class imbalance—a persistent challenge given that most prostate tissue is benign [25]. These models achieve sensitivity of 89% while maintaining specificity above 70%, suggesting potential for clinical utility [26].
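FocalNet's exact implementation is not reproduced here; the sketch below shows the generic binary focal loss on which such approaches build, illustrating how confidently classified (mostly benign) voxels are down-weighted so that the rare lesion class dominates the gradient. The function name, default parameters, and toy data are illustrative assumptions.

```python
# Generic binary focal loss sketch (not FocalNet's exact code): easy, confidently
# classified background voxels are down-weighted by (1 - p_t)^gamma, so abundant
# benign tissue contributes less to the gradient than rare lesion voxels.
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.25,
                      gamma: float = 2.0) -> torch.Tensor:
    """logits and targets share the same shape; targets take values in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1 - p_t).pow(gamma) * bce).mean()

# Toy usage: per-voxel lesion logits for a 2-case mini-batch of 64x64 maps
logits = torch.randn(2, 1, 64, 64)
targets = (torch.rand(2, 1, 64, 64) > 0.95).float()          # ~5% positive voxels
print(binary_focal_loss(logits, targets).item())
```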
Recent innovations extend beyond standard CNN architectures to address specific challenges in prostate imaging. Vision transformers, adapted from natural language processing, show promise for capturing long-range dependencies in 3D prostate volumes [27]. These models divide images into patches and use self-attention mechanisms to model relationships between distant regions, potentially improving detection of multifocal disease [28].
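The two steps described above—patch embedding and self-attention between distant regions—can be sketched with PyTorch built-ins as follows. This is a toy illustration under assumed dimensions (16-pixel patches, 256-dimensional tokens), not a production vision transformer for prostate MRI.

```python
# Illustrative sketch of patchifying a slice into tokens and relating distant
# patches via self-attention; dimensions are assumptions for demonstration only.
import torch
import torch.nn as nn

patch, dim = 16, 256
to_patches = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)   # patchify + embed
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

x = torch.randn(2, 1, 128, 128)                    # two single-channel slices
tokens = to_patches(x).flatten(2).transpose(1, 2)  # (batch, 64 patches, dim)
out, weights = attn(tokens, tokens, tokens)        # each patch attends to all others
print(out.shape, weights.shape)                    # (2, 64, 256) and (2, 64, 64)
```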
Semi-supervised learning approaches have gained traction as a solution to the annotation bottleneck—the expensive and time-consuming process of obtaining expert labels [29]. By leveraging large volumes of unlabeled clinical MRI scans alongside smaller annotated datasets, these methods achieve performance approaching fully supervised models while requiring 50–70% fewer labeled examples [30]. The recent work by Bosma et al. demonstrates how radiologist reports can serve as weak labels, enabling training on datasets an order of magnitude larger than traditional approaches [31].
Multimodal fusion represents another frontier, combining imaging data with clinical variables such as PSA levels, age, family history, and genomic markers [32]. These integrated models consistently outperform imaging-only approaches, with reported AUC improvements of 5–8% [33]. However, this raises important questions about fair comparison with radiologists who also have access to clinical information, and whether reported AI superiority reflects true algorithmic advantage or simply more comprehensive data integration [34].
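A common way to realize such integration is late fusion, concatenating an imaging embedding with clinical covariates before a final classifier. The sketch below is a minimal illustration under assumed dimensions and variable choices (e.g., PSA density, age, gland volume); it is not drawn from any specific study cited here.

```python
# Illustrative late-fusion sketch: concatenate an imaging embedding with clinical
# covariates before the final classifier. Dimensions and covariates are assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim: int = 64, clin_dim: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + clin_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),   # logit for clinically significant cancer
        )

    def forward(self, img_embedding: torch.Tensor, clinical: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([img_embedding, clinical], dim=1))

model = LateFusionClassifier()
img_embedding = torch.randn(8, 64)   # output of an imaging backbone
clinical = torch.randn(8, 3)         # e.g., PSA density, age, gland volume (standardized)
risk = torch.sigmoid(model(img_embedding, clinical))
print(risk.shape)                    # (8, 1) patient-level risk scores
```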
The PI-CAI (Prostate Imaging: Cancer AI) challenge, launched in 2022, represents the most comprehensive benchmarking effort to date [35]. With 10,207 cases from multiple centers and over 200 participating teams worldwide, it established several important insights. First, ensemble methods combining multiple models consistently outperformed single architectures, suggesting that diversity in approach captures different aspects of cancer appearance [36]. Second, the performance gap between top teams was minimal (AUC difference < 0.02), indicating that current methods may be approaching the limits of what is achievable with available data and labels [37]. Table 2 provides a detailed comparison of major AI studies published between 2019 and 2024, highlighting their methodological approaches, performance metrics, and key limitations.
However, these competitions optimize for leaderboard position rather than clinical utility—a critical distinction often overlooked in the enthusiasm for achieving state-of-the-art performance [41]. The winning algorithm might excel at detecting cancer in any location but fail at the clinically crucial task of identifying index lesions that drive treatment decisions [42]. Moreover, competition datasets, while large, often undergo curation that removes problematic cases (motion artifacts, metal implants, incomplete sequences) routinely encountered in clinical practice [43].
Among reviewed studies examining contemporary AI performance, retrospective designs predominate (78%), potentially inflating performance through selection bias [44]. Median sample sizes of 342 patients fall below recommended minimums for robust AI development, raising questions about generalizability [45]. Most troubling, only 12% of studies include true external validation on geographically distinct populations with different scanners and protocols [46].
Direct comparison with radiologists yields mixed and often contradictory results. Of 11 studies with head-to-head comparisons, four showed AI superiority, five showed equivalence, and two showed inferior performance [38]. However, these comparisons often disadvantage human readers by using single-slice evaluations or time constraints not reflective of clinical practice [39]. When AI systems face the same challenging cases that confound experts—PI-RADS 3 lesions, transition zone tumors, post-treatment changes—performance gaps narrow or reverse [47].

3.2. Implementation

The transition from research to clinical deployment reveals the most sobering realities about current AI capabilities [48]. Seven published implementation studies document real-world experiences, frequently demonstrating performance degradation compared to published validation results [49]. This “generalization gap” manifests in multiple dimensions, each presenting unique challenges for clinical translation. Table 3 summarizes the seven published implementation studies, documenting performance degradation and barriers encountered in real-world deployment.
Technical Integration Complexity: Implementation averaged 6.3 months (range: 3–14), far exceeding initial projections [50]. Primary barriers included PACS incompatibility (71% of sites), requiring custom interfaces or middleware solutions that added complexity and potential failure points. Data format standardization proved surprisingly challenging, with DICOM header variations between scanner manufacturers causing algorithm failures in 43% of initial deployment attempts [51]. Cloud-based solutions promised easier deployment but introduced latency issues unacceptable for clinical workflow, with average processing times of 3–5 min per case compared to sub-minute requirements [52].
Performance Degradation in Clinical Settings: All implementations experienced performance drops compared to published results, ranging from 5% to 31% reduction in AUC [53]. At a large US comprehensive cancer center, a highly publicized AI system showed specificity falling from 81% in validation to 50% in clinical use, generating false positive rates that overwhelmed the biopsy service and led to early termination after five months [54]. Post hoc analysis revealed multiple factors: the validation dataset contained enriched cancer prevalence (45% vs. 25% clinical), images were pre-selected for quality, and the algorithm had inadvertently learned site-specific artifacts present in training data but absent in deployment [55].
Workflow Disruption and Human Factors: Paradoxically, while AI reduced image interpretation time by an average of 3.4 min, total workflow time sometimes increased [56]. Radiologists reported “cognitive switching costs” when alternating between AI-assisted and standard reads, disrupting their established search patterns. The need to verify AI findings, document disagreements, and manage false positives added administrative burden not captured in simplistic time studies [57]. Most concerning, two studies documented “automation bias,” where radiologists over-relied on AI recommendations, missing obvious lesions the algorithm failed to detect—errors they would not have made reading independently [58].
A fundamental challenge emerges from analyzing where AI succeeds versus where it is needed [59]. Current systems excel at detecting obvious PI-RADS 4–5 lesions, achieving AUCs exceeding 0.94—impressive technically but of limited clinical value, since experienced radiologists already demonstrate near-perfect accuracy for these clear-cut cases [60]. Conversely, for challenging PI-RADS 3 lesions, where clinical uncertainty is highest and 70% prove benign at biopsy, AI achieves an AUC of 0.78, a modest discriminative ability that may provide limited clinical value given the stakes involved in cancer diagnosis [61]. This performance level, while statistically better than random chance (AUC 0.5), falls short of the threshold needed for reliable clinical decision support in these ambiguous cases.
This “expertise paradox” extends to the radiologist populations AI aims to assist. Novice readers (<2 years’ experience) show the greatest improvement with AI assistance, with sensitivity increasing by 12–18% [62]. However, they are least equipped to recognize algorithmic failures or understand when to override AI recommendations. Expert radiologists (>10 years’ experience) who could identify AI errors show minimal performance improvement and sometimes perform worse with AI assistance, possibly due to cognitive interference or anchoring bias that disrupts their intuitive pattern recognition [63].
The challenge is compounded by case mix realities in clinical practice. Academic centers evaluating AI typically see enriched populations with higher cancer prevalence and more complex cases, while community practices where AI could theoretically have greater impact deal with mostly normal studies and benign findings [64]. An algorithm optimized for cancer detection might excel in academic validation but generate unacceptable false positive rates in screening populations, undermining rather than enhancing clinical efficiency [65].

3.3. Economic Considerations and Value Proposition

Economic analyses remain conspicuously absent from published literature, with no studies reporting comprehensive cost-effectiveness evaluations [66]. Our analysis of available data suggests challenging economics that may explain limited commercial adoption:
Initial Investment: Implementation costs range from $100,000 to $500,000, including software licensing, hardware upgrades, integration services, and staff training [67]. These figures assume straightforward deployment; sites requiring custom integration or experiencing deployment delays face costs exceeding $1 million. Hidden costs multiply: IT infrastructure upgrades to support AI processing, increased storage for algorithm outputs, cybersecurity enhancements for cloud-connected systems, and opportunity costs from workflow disruption during implementation [68].
Operational Expenses: Annual costs include software maintenance ($20,000–100,000), algorithm updates requiring revalidation, ongoing staff training as systems evolve, and increased quality assurance to monitor AI performance [69]. Liability insurance adjustments remain undefined but potentially substantial, with insurers uncertain how to price risk for AI-assisted diagnosis. No country has established specific reimbursement codes for AI-assisted interpretation, leaving institutions to absorb costs without additional revenue [70].
Break-Even Analysis: Using conservative estimates, an AI system must prevent 50–250 unnecessary biopsies annually to achieve break-even, depending on local costs and implementation expenses [71]. For average-volume practices seeing 500–1000 prostate MRIs yearly with 30% biopsy rates, this requires AI to reduce false positives by 33–50% while maintaining sensitivity—performance levels not demonstrated in real-world studies. High-volume academic centers might achieve favorable economics, but these sites typically have expert radiologists who benefit least from AI assistance [72].
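A worked version of this arithmetic is sketched below using placeholder figures within the ranges quoted above; local biopsy costs, AI pricing, and biopsy rates vary widely, so the specific numbers are illustrative assumptions rather than an economic model.

```python
# Worked break-even sketch with placeholder figures within the ranges quoted above.
annual_ai_cost = 150_000          # amortized licensing + maintenance, USD/year
cost_per_biopsy_avoided = 1_500   # procedure + pathology + complication costs, USD

biopsies_to_break_even = annual_ai_cost / cost_per_biopsy_avoided
print(f"Biopsies avoided per year to break even: {biopsies_to_break_even:.0f}")

# For a practice performing 800 prostate MRIs/year with a 30% biopsy rate,
# the required relative reduction in (false-positive-driven) biopsies would be:
annual_biopsies = 800 * 0.30
print(f"Required relative reduction: {biopsies_to_break_even / annual_biopsies:.0%}")
```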

3.4. Regulatory and Medicolegal Landscape

The regulatory environment for AI in medical imaging remains fragmented and evolving, creating uncertainty that impedes adoption [73]. In the United States, FDA approval follows the 510(k) pathway for most AI applications, requiring demonstration of “substantial equivalence” to predicate devices. However, this framework, designed for traditional medical devices, poorly accommodates AI’s unique characteristics: continuous learning capability, performance variation across populations, and opaque decision-making processes [74].
The European Union’s Medical Device Regulation (MDR) imposes stricter requirements, classifying most diagnostic AI as Class IIa or IIb devices requiring comprehensive clinical evaluation and post-market surveillance [75]. The forthcoming AI Act adds another layer, with medical AI potentially classified as “high-risk” applications subject to additional scrutiny. This regulatory patchwork complicates international deployment and research collaboration, with algorithms approved in one jurisdiction requiring extensive re-validation for others [76].
Medicolegal liability remains largely untested, with no established precedent for AI-related malpractice in prostate cancer diagnosis [77]. Key questions remain unresolved: Who bears responsibility when AI fails—the developer, institution, or interpreting radiologist? How should informed consent address AI involvement in diagnosis? What constitutes reasonable reliance on AI recommendations? Until these questions are answered through litigation or regulation, risk-averse institutions may hesitate to adopt even proven technologies [78].

3.5. Methodological Quality and Reproducibility

Our analysis reveals systemic methodological issues undermining evidence quality and reproducibility in AI prostate MRI research [79]. Despite establishment of reporting guidelines specifically for AI medical imaging studies, adherence remains disappointingly low. Only 31% of studies follow CLAIM (Checklist for Artificial Intelligence in Medical Imaging) guidelines completely, with particular deficiencies in critical areas [80].
Failure case analysis, arguably the most important aspect for clinical translation, is reported in only 12% of studies. Without understanding when and why algorithms fail, clinical teams cannot make informed implementation decisions or develop appropriate safeguards [81]. One illustrative example: a highly cited algorithm achieving 0.93 AUC failed catastrophically on patients with hip prostheses, a limitation mentioned only in the Supplementary Materials and discovered only when clinical deployment attempted to process such cases [82].
Technical reproducibility remains virtually impossible for most published work. Only 11% of studies provide code, and a mere 4% share trained model weights [83]. Even with code available, reproduction often fails due to undocumented dependencies, preprocessing steps, or hyperparameters. A recent systematic attempt to reproduce 10 prominent AI prostate MRI studies succeeded in only two cases, despite direct author contact and substantial engineering effort [84]. This reproducibility crisis undermines scientific progress and wastes resources as groups unknowingly repeat failed approaches.
The fundamental challenge of establishing ground truth in prostate cancer compounds methodological issues [85]. Unlike many cancers presenting as discrete masses, prostate cancer often manifests as multifocal disease with unclear boundaries between benign and malignant tissue. Biopsy-based labels, used in 85% of studies, miss 20–30% of clinically significant disease due to sampling error [86]. This creates a ceiling effect: algorithms cannot surpass the accuracy of their imperfect training labels.
Radical prostatectomy specimens provide complete ground truth through whole-mount histopathology but introduce selection bias toward higher-risk disease requiring surgery [87]. Moreover, the complex registration between in vivo MRI and ex vivo pathology introduces spatial uncertainties of 3–5 mm—significant given that many clinically significant lesions measure less than 10 mm [88]. Studies using prostatectomy ground truth show paradoxically lower AI performance than biopsy-based studies, likely reflecting these registration challenges rather than true algorithm limitations.
The definition of “clinically significant” cancer varies across studies, further complicating interpretation [89]. While most adopt Gleason ≥7 as the threshold, some include volume criteria, others incorporate PSA density, and recent studies consider genomic markers. This definitional heterogeneity means algorithms trained on different datasets may be optimizing for fundamentally different targets, explaining part of the generalization gap [90].
Prevalent statistical issues further undermine evidence quality. Case–control enrichment, used in 42% of studies, inflates performance metrics by 15–20% compared to consecutive patient cohorts representative of clinical practice [91]. Post hoc threshold selection, where decision boundaries are optimized on test sets, violates fundamental principles of unbiased evaluation yet appears in 28% of studies based on methodology descriptions [92].
Sample size calculations, standard in clinical research, appear in fewer than 5% of AI studies. The few performing power analysis reveal that typical studies are underpowered by 40–60% for detecting clinically meaningful performance differences [93]. This underpowering is particularly problematic for subgroup analyses examining performance across different zones, lesion sizes, or patient populations—analyses crucial for understanding clinical applicability but typically relegated to Supplementary Materials without appropriate statistical adjustment.
Data leakage, where information from test sets inadvertently influences training, represents a subtle but pervasive problem [94]. Common sources include: splitting at the examination rather than the patient level, so that multiple examinations from the same patient appear in both training and test sets; preprocessing normalization computed on combined datasets; and feature selection performed before cross-validation. A systematic audit of public challenge submissions found evidence of data leakage in 30% of entries, with performance drops of 5–15% after correction [95].
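Patient-level splitting, the remedy for the first of these leakage sources, can be sketched with scikit-learn's GroupKFold, which keeps all examinations from one patient on the same side of the split. The data below are synthetic and the feature/label choices are assumptions for illustration.

```python
# Sketch of patient-level cross-validation with scikit-learn's GroupKFold:
# grouping on patient ID guarantees that repeat examinations from the same patient
# never appear in both the training and the test fold. Data here are synthetic.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_exams = 200
X = rng.normal(size=(n_exams, 16))               # e.g., radiomic or embedding features
y = rng.integers(0, 2, size=n_exams)             # csPCa yes/no labels
patient_ids = rng.integers(0, 80, size=n_exams)  # ~80 patients, some with repeat scans

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=patient_ids)):
    overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    print(f"fold {fold}: shared patients between train and test = {len(overlap)}")  # always 0
```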

4. Discussion

4.1. Reconciling Promise with Reality

The evidence reveals a field caught between genuine technical achievement and premature clinical promotion [96]. While AI demonstrates impressive capabilities—87% sensitivity for cancer detection, successful automated segmentation, improved inter-reader agreement—these metrics alone do not justify the transformative claims pervading literature and marketing materials. The persistent gap between research performance and clinical reality suggests fundamental issues that incremental technical improvements cannot address.
The comparison with historical computer-aided detection (CAD) in mammography proves instructive [97]. Initial enthusiasm based on impressive technical metrics gave way to recognition that CAD increased recall rates without improving cancer detection, ultimately providing negative value in clinical practice. The parallels are concerning: focus on detection sensitivity over specificity, limited real-world validation, and assumption that technical capability ensures clinical utility. However, important differences suggest AI might avoid CAD’s fate: modern deep learning surpasses simple pattern matching, integration approaches have evolved beyond simple prompts, and the field shows growing awareness of implementation challenges [98].

4.2. Actionable Recommendations

Based on our comprehensive analysis, we propose specific actions for key stakeholders. These recommendations represent the authors’ synthesis of the reviewed evidence and should be considered as expert opinion rather than empirically validated guidelines.
  • For researchers and developers:
Researchers and developers should fundamentally reorient their priorities toward clinical utility. Rather than pursuing marginal performance improvements on benchmark datasets, the focus should shift to external validation across diverse populations and imaging protocols. Critical aspects often overlooked include failure analysis and uncertainty quantification, which are essential for safe clinical deployment. Development efforts should target specific clinical scenarios where AI can provide clear value, such as triaging normal studies or identifying candidates for active surveillance, rather than attempting to solve the overly broad problem of ‘cancer detection.’ Throughout the development process, continuous engagement with clinical end-users is essential, not merely for final validation but to ensure the technology addresses real clinical needs [99].
  • For clinical institutions:
Clinical institutions considering AI implementation must approach deployment with appropriate caution and preparation. Rather than rushing to adopt the latest algorithms, institutions should conduct thorough pilot studies that evaluate not only technical performance but also workflow integration, user acceptance, and impact on patient care. Establishing robust governance frameworks is essential, addressing critical issues including liability allocation, quality assurance protocols, and policies for algorithm updates and version control. Investment in change management and comprehensive training programs often determines success more than the technology itself. Continuous performance monitoring with predefined stopping rules protects against silent degradation that could harm patients. Perhaps most importantly, institutions should actively share their implementation experiences, including failures, with the broader community to accelerate collective learning [100].
  • For regulatory bodies and policymakers (suggested considerations based on literature review):
Regulatory bodies and policymakers face the challenge of balancing innovation with patient safety in this rapidly evolving field. Current regulatory frameworks, designed for traditional medical devices, may benefit from adaptation to accommodate AI’s unique characteristics, including continuous learning capabilities and performance variation across populations. Consideration should be given to establishing post-market surveillance requirements with public reporting to ensure transparency about real-world performance. Reimbursement mechanisms might be restructured to incentivize quality outcomes rather than volume of AI-assisted interpretations. Support for pragmatic trials that evaluate actual patient outcomes rather than surrogate endpoints would provide crucial evidence for policy decisions. Finally, facilitating data sharing while maintaining robust privacy protections could accelerate development of more generalizable and equitable AI systems [101].
  • For journals and professional societies:
Academic journals and professional societies serve as critical gatekeepers for scientific quality and clinical standards in this emerging field. Journals should enforce existing reporting standards such as CLAIM and TRIPOD-AI, potentially implementing desk rejection policies for non-compliant manuscripts to raise the quality bar. Requirements for data and code availability, or explicit justification for exemptions, would enhance reproducibility and scientific progress. Publishing negative results and implementation failures is essential to combat publication bias and provide realistic expectations about AI capabilities. Professional societies should develop comprehensive continuing education programs that address not only AI interpretation but also its limitations, potential biases, and appropriate use cases. Creating dedicated forums for sharing implementation experiences would facilitate knowledge transfer between institutions attempting similar deployments [102].
The convergence of these recommendations across stakeholder groups reveals common themes: prioritizing clinical value over technical metrics, maintaining transparency about capabilities and limitations, investing in human factors alongside technology, and fostering collaborative learning from both successes and failures. Only through coordinated action addressing these multiple dimensions can the field move beyond the current impasse toward meaningful clinical integration.

4.3. Future Directions and the Path Forward

Despite current limitations, several technological developments offer genuine promise for overcoming existing barriers [103]. Federated learning, where models train on distributed data without centralized aggregation, addresses both privacy concerns and the need for diverse training data. Early implementations like the European CHAIMELEON project demonstrate 8–12% improvements in AUC metrics over single-institution models while maintaining data governance compliance [104]. However, federated learning introduces new challenges: ensuring data quality across sites, managing heterogeneous computing resources, and preventing model poisoning attacks from compromised nodes [105].
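The core mechanism of federated learning, aggregating locally trained weights rather than pooling images, can be sketched as federated averaging (FedAvg). The example below is a conceptual illustration under simplifying assumptions (identical model architectures, a single synchronous round); it is not the implementation used by CHAIMELEON or any other project cited here.

```python
# Minimal federated-averaging (FedAvg) sketch: sites share model weights, not images.
# Conceptual illustration only; site counts and model are toy assumptions.
import torch

def fedavg(site_state_dicts, site_sizes):
    """Weighted average of per-site model state_dicts by local sample count."""
    total = sum(site_sizes)
    avg = {}
    for key in site_state_dicts[0]:
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(site_state_dicts, site_sizes))
    return avg

# Toy example: three "hospitals" share the same small model but hold different data volumes
model = torch.nn.Linear(10, 1)
sites = [{k: v + 0.01 * i for k, v in model.state_dict().items()} for i in range(3)]
global_weights = fedavg(sites, site_sizes=[500, 1200, 300])
model.load_state_dict(global_weights)   # global model for the next training round
```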
Self-supervised learning and foundation models represent another paradigm shift, leveraging the vast repositories of unlabeled clinical MRI scans accumulating in hospital archives [106]. Inspired by the success of large-scale pre-training in natural language processing, these approaches learn meaningful representations through pretext tasks like image reconstruction, contrastive learning, or masked image modeling. Foundation models pre-trained on diverse medical imaging datasets can be rapidly adapted to specific tasks—a concept borrowed from the LLM community where models like GPT demonstrate remarkable few-shot learning capabilities. Recent work demonstrates that models pre-trained on 100,000 unlabeled prostate MRIs achieve performance comparable to supervised models with 10-fold fewer labeled examples [107]. This paradigm, while not directly employing LLMs for image analysis, applies similar principles of scale and transfer learning that have proven transformative in other domains.
Uncertainty quantification, largely absent from current clinical AI, could fundamentally change human-AI interaction [108]. Rather than providing binary predictions, algorithms would communicate confidence levels, flagging cases requiring human review. Bayesian deep learning and ensemble methods show promise, with calibrated uncertainty estimates correlating strongly with error likelihood [109]. This could enable tiered workflows where AI handles high-confidence cases autonomously while routing uncertain cases to specialists.
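One simple realization of this idea uses disagreement across an ensemble as an uncertainty proxy and routes discordant cases to human review, as in the sketch below. The scores, thresholds, and routing rules are synthetic and illustrative, not validated operating points.

```python
# Sketch of ensemble-based uncertainty with a tiered routing rule: confident,
# concordant predictions could be handled autonomously, discordant cases routed
# to a specialist. Scores and thresholds here are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(1)
ensemble_scores = rng.uniform(0, 1, size=(5, 10))   # 5 models x 10 cases (cancer prob.)

mean_score = ensemble_scores.mean(axis=0)
uncertainty = ensemble_scores.std(axis=0)           # disagreement as a simple proxy

for case, (p, u) in enumerate(zip(mean_score, uncertainty)):
    if u > 0.15:                                    # illustrative uncertainty threshold
        decision = "route to specialist review"
    elif p > 0.7:
        decision = "flag as suspicious"
    elif p < 0.2:
        decision = "triage as likely normal"
    else:
        decision = "route to specialist review"
    print(f"case {case}: p={p:.2f}, sigma={u:.2f} -> {decision}")
```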
The path forward requires fundamental reconsideration of how AI integrates into clinical practice [110]. Current approaches typically position AI as either replacing or augmenting human readers, but emerging evidence suggests more nuanced integration strategies may prove superior.
Collaborative intelligence models: Rather than viewing AI as an independent second reader, new frameworks explore true human-AI collaboration where both agents contribute complementary strengths [111]. Humans excel at integrating clinical context, recognizing unusual presentations, and making value judgments about clinical significance. AI provides consistent quantitative analysis, detection of subtle patterns, and freedom from fatigue or distraction. Optimal integration might involve AI performing initial triage and quantification, with radiologists focusing on interpretation and clinical correlation [112].
Adaptive deployment strategies: Recognition that AI performance varies across clinical contexts suggests adaptive deployment based on case characteristics [113]. AI might autonomously handle clearly normal studies and obvious cancers while flagging intermediate cases for expert review. This tiered approach could maximize efficiency while maintaining quality, with one study suggesting 40% of cases could be accurately triaged without human review [114].
Continuous learning systems: Current AI models remain static after deployment, unable to learn from accumulating clinical experience [115]. Next-generation systems might continuously update based on local data while maintaining regulatory compliance through controlled update cycles. This could address the generalization gap by allowing models to adapt to institutional imaging protocols and patient populations. However, this introduces challenges in version control, performance monitoring, and preventing model drift toward local biases [116].
Moving beyond technical solutions, several fundamental issues require attention for meaningful clinical translation [117]:
Establishing clinical value: Future research must move beyond accuracy metrics to demonstrate impact on patient outcomes. The ongoing PROMAI trial (NCT05441978) exemplifies needed research: randomizing patients rather than readers, measuring cancer detection rates alongside unnecessary biopsies avoided, and including comprehensive economic evaluation [118]. Such pragmatic trials, though expensive and complex, provide evidence necessary for adoption decisions.
Creating transparency standards: The field needs mandatory reporting standards with enforcement mechanisms. Journals could require algorithm registration similar to clinical trial registration, with pre-specified performance metrics and analysis plans [119]. Model cards, borrowed from the broader AI community, should accompany every clinical AI system, documenting training data characteristics, intended use cases, known limitations, and performance across demographic groups [120].
Developing implementation science: The gap between algorithm development and clinical deployment reflects absence of implementation science in medical AI [121]. Future research should examine organizational readiness, change management strategies, and sociotechnical factors influencing adoption. Lessons from electronic health record deployment suggest that technology capabilities matter less than implementation approach for determining success [122].
Success requires abandoning the “accuracy Olympics” mentality that prioritizes marginal AUC improvements over clinical utility [123]. The focus should shift toward solving specific clinical problems where AI offers a clear value proposition. Rather than general cancer detection, targeted applications might include: triaging normal studies to reduce workload, quantifying treatment response with greater precision than subjective assessment, or identifying patients suitable for active surveillance versus immediate treatment [124].
Equally important is recognizing that AI represents a tool rather than solution. The most sophisticated algorithm cannot compensate for poor image quality, inadequate clinical information, or flawed implementation strategy. Success depends on thoughtful integration considering human factors, workflow realities, and organizational culture—aspects typically ignored in technology-focused development.
The field must also confront uncomfortable truths about current limitations. The generalization gap may reflect fundamental limits of learning from imperfect labels rather than insufficient data or suboptimal architectures. The expertise paradox—AI performing best where least needed—might be inherent rather than transitional. These realities do not negate AI’s potential but suggest need for recalibrated expectations and refined strategies.

4.4. Limitations

This narrative review has several limitations. As a clinically focused narrative review, we prioritized real-world implementation challenges over detailed technical exposition of deep learning architectures, which are comprehensively covered in existing technical reviews. Readers seeking in-depth coverage of neural network architectures, optimization algorithms, or mathematical foundations should consult specialized technical literature. First, as a narrative rather than systematic review, selection bias and publication bias may have influenced the included evidence, and we did not conduct a quantitative meta-analysis. Second, the underlying studies are heterogeneous with respect to MRI acquisition protocols (bpMRI vs. mpMRI; vendors, field strengths), PI-RADS versions, definitions of clinically significant prostate cancer, ground-truthing strategies, and reader experience, which limits direct comparability and may partly explain observed drops on external validation. Third, our extraction relies on author-reported metrics; we did not re-run models or recalibrate decision thresholds, and incomplete reporting and inconsistent adherence to reporting standards may overestimate performance. Fourth, transparency and reproducibility remain limited across the literature (restricted code/model-weights availability, small or single-center test sets), and only a small number of real-world deployments were identified, constraining inferences on workflow impact, safety, and cost-effectiveness. Finally, most studies report proxy endpoints (AUC, Dice, κ) rather than patient-level outcomes (biopsies avoided, time-to-diagnosis, overdiagnosis/overtreatment), so the clinical utility of AI for prostate MRI should be interpreted with appropriate caution.

5. Conclusions

This comprehensive review reveals artificial intelligence in prostate MRI at a critical inflection point. Technical capabilities have advanced remarkably, with modern algorithms achieving performance metrics that would have seemed impossible just five years ago. Yet clinical translation remains nascent, hampered by methodological weaknesses, implementation challenges, and fundamental questions about value proposition.
The gap between artificial and clinical intelligence is real but not insurmountable. Bridging it requires more than technical innovation—it demands rigorous science, transparent reporting, pragmatic evaluation, and genuine collaboration between technologists and clinicians. The current trajectory, prioritizing publication metrics over patient outcomes, risks repeating historical failures where promising technology failed to deliver clinical value.
Critical challenges remain unresolved. The reproducibility crisis undermines scientific progress and wastes resources. Poor reporting standards obscure limitations and prevent informed adoption decisions. The absence of economic evaluation leaves the value proposition unproven. Regulatory uncertainty creates adoption barriers while potentially allowing premature deployment of unproven systems. These issues require coordinated action from researchers, clinicians, regulators, and publishers.
Despite these challenges, dismissing AI entirely would be shortsighted. The technology shows genuine promise for specific applications, particularly in standardizing interpretation, enabling quantitative analysis, and potentially democratizing expertise. Emerging approaches—federated learning, uncertainty quantification, collaborative intelligence—offer paths toward addressing current limitations. The ongoing evolution from narrow task automation toward comprehensive diagnostic assistance suggests future systems might overcome present constraints.
The question is not whether AI will impact prostate cancer diagnosis—some degree of integration seems inevitable given technological momentum and clinical need. Rather, the question is whether the field can mature beyond technology enthusiasm toward evidence-based implementation that genuinely improves patient care. This requires fundamental shifts in approach: from accuracy to utility, from automation to augmentation, from competition to collaboration.
As we stand at this juncture, choices made in the next few years will determine whether AI fulfills its transformative potential or joins the graveyard of medical technologies that promised revolution but delivered disappointment. The path forward is challenging but clear: rigorous science, transparent reporting, pragmatic evaluation, and relentless focus on clinical value. Only through such discipline can we move from artificial intelligence to genuine clinical intelligence—tools that not only detect cancer but improve lives.
The ultimate measure of success will not be algorithmic performance on curated datasets but real-world impact on patient outcomes. This includes not just cancers detected but unnecessary procedures avoided, anxiety reduced through accurate risk stratification, and equitable access to expert-level diagnosis regardless of geographic location. Achieving this vision requires continued technical innovation coupled with equal attention to implementation science, health economics, and human factors.
The journey from laboratory to clinic is long and fraught with obstacles, but the potential rewards—earlier detection, personalized treatment, reduced healthcare disparities—justify the effort. Success requires patience, persistence, and willingness to confront uncomfortable truths about current limitations. Most importantly, it requires keeping focus on the ultimate goal: improving outcomes for men facing prostate cancer diagnosis and treatment decisions.
This review has attempted to provide honest assessment of current evidence while maintaining optimism about future potential. The field has made remarkable progress but faces substantial challenges requiring coordinated action from multiple stakeholders. By acknowledging limitations, learning from failures, and maintaining focus on clinical value, the promise of AI in prostate MRI can still be realized. The gap between artificial and clinical intelligence is wide but not unbridgeable—crossing it requires not just better technology but better science, better implementation, and better collaboration.

Author Contributions

Conceptualization, V.-O.B.; methodology, A.M.; software, O.N.-C.; validation, G.-A.P. and M.-L.B.; formal analysis, G.-A.P.; investigation, V.-O.B. and O.N.-C.; resources, C.M.; data curation, M.-L.B.; writing—original draft preparation, C.M.; writing—review and editing, C.M. and G.-A.P.; visualization, A.M.; supervision, G.-A.P.; project administration, V.-O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

During the preparation of this manuscript, the authors used Claude Opus 4.1 (Anthropic) and ChatGPT (GPT-5 Thinking, OpenAI) to assist with language polishing and to draft candidate text that was subsequently reviewed and edited by the authors. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
MRI: Magnetic resonance imaging
mpMRI/bpMRI: Multiparametric/biparametric MRI
DWI/DCE: Diffusion-weighted imaging/dynamic contrast enhancement
ADC: Apparent diffusion coefficient
PI-RADS: Prostate Imaging Reporting and Data System
csPCa: Clinically significant prostate cancer
AUC: Area under the ROC curve
CI: Confidence interval
CAD: Computer-aided detection

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249.
  2. Mottet, N.; van den Bergh, R.C.N.; Briers, E.; Van den Broeck, T.; Cumberbatch, M.G.; De Santis, M.; Fanti, S.; Fossati, N.; Gandaglia, G.; Gillessen, S.; et al. EAU-EANM-ESTRO-ESUR-SIOG Guidelines on Prostate Cancer-2020 Update. Part 1: Screening, Diagnosis, and Local Treatment with Curative Intent. Eur. Urol. 2021, 79, 243–262.
  3. Weinreb, J.C.; Barentsz, J.O.; Choyke, P.L.; Cornud, F.; Haider, M.A.; Macura, K.J.; Margolis, D.; Schnall, M.D.; Shtern, F.; Tempany, C.M.; et al. PI-RADS Prostate Imaging—Reporting and Data System: 2015, Version 2. Eur. Urol. 2016, 69, 16–40.
  4. Ahmed, H.U.; El-Shater Bosaily, A.; Brown, L.C.; Gabe, R.; Kaplan, R.; Parmar, M.K.; Collaco-Moraes, Y.; Ward, K.; Hindley, R.G.; Freeman, A.; et al. Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer (PROMIS): A paired validating confirmatory study. Lancet 2017, 389, 815–822.
  5. Kasivisvanathan, V.; Rannikko, A.S.; Borghi, M.; Panebianco, V.; Mynderse, L.A.; Vaarala, M.H.; Briganti, A.; Budäus, L.; Hellawell, G.; Hindley, R.G.; et al. MRI-Targeted or Standard Biopsy for Prostate-Cancer Diagnosis. N. Engl. J. Med. 2018, 378, 1767–1777.
  6. Turkbey, B.; Rosenkrantz, A.B.; Haider, M.A.; Padhani, A.R.; Villeirs, G.; Macura, K.J.; Tempany, C.M.; Choyke, P.L.; Cornud, F.; Margolis, D.J.; et al. Prostate Imaging Reporting and Data System Version 2.1: 2019 Update of Prostate Imaging Reporting and Data System Version 2. Eur. Urol. 2019, 76, 340–351.
  7. Westphalen, A.C.; McCulloch, C.E.; Anaokar, J.M.; Arora, S.; Barashi, N.S.; Barentsz, J.O.; Bathala, T.K.; Bittencourt, L.K.; Booker, M.T.; Braxton, V.G.; et al. Variability of the Positive Predictive Value of PI-RADS for Prostate MRI across 26 Centers: Experience of the Society of Abdominal Radiology Prostate Cancer Disease-focused Panel. Radiology 2020, 296, 76–84.
  8. de Rooij, M.; Israel, B.; Tummers, M.; Ahmed, H.; Barrett, T.; Giganti, F.; Hamm, B.; Løgager, V.; Padhani, A.; Panebianco, V.; et al. ESUR/ESUI consensus statements on multi-parametric MRI for the detection of clinically significant prostate cancer: Quality requirements for image acquisition, interpretation and radiologists’ training. Eur. Radiol. 2020, 30, 5404–5416.
  9. Moldovan, P.C.; Van den Broeck, T.; Sylvester, R.; Marconi, L.; Bellmunt, J.; van den Bergh, R.C.N.; Bolla, M.; Briers, E.; Cumberbatch, M.G.; Fossati, N.; et al. What Is the Negative Predictive Value of Multiparametric Magnetic Resonance Imaging in Excluding Prostate Cancer at Biopsy? A Systematic Review and Meta-analysis from the European Association of Urology Prostate Cancer Guidelines Panel. Eur. Urol. 2017, 72, 250–266.
  10. Benjamens, S.; Dhunnoo, P.; Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: An online database. npj Digit. Med. 2020, 3, 118.
  11. Hosny, A.; Parmar, C.; Quackenbush, J.; Schwartz, L.H.; Aerts, H.J.W.L. Artificial intelligence in radiology. Nat. Rev. Cancer 2018, 18, 500–510.
  12. Liu, X.; Faes, L.; Kale, A.U.; Wagner, S.K.; Fu, D.J.; Bruynseels, A.; Mahendiran, T.; Moraes, G.; Shamdas, M.; Kern, C.; et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. Lancet Digit. Health 2019, 1, e271–e297.
  13. van Leeuwen, K.G.; Schalekamp, S.; Rutten, M.J.C.M.; van Ginneken, B.; de Rooij, M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur. Radiol. 2021, 31, 3797–3804.
  14. Strohm, L.; Hehakaya, C.; Ranschaert, E.R.; Boon, W.P.C.; Moors, E.H.M. Implementation of artificial intelligence (AI) applications in radiology: Hindering and facilitating factors. Eur. Radiol. 2020, 30, 5525–5532.
  15. Recht, M.; Bryan, R.N. Artificial Intelligence: Threat or Boon to Radiologists? J. Am. Coll. Radiol. 2017, 14, 1476–1480.
  16. Lemaitre, G.; Martí, R.; Freixenet, J.; Vilanova, J.C.; Walker, P.M.; Meriaudeau, F. Computer-Aided Detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review. Comput. Biol. Med. 2015, 60, 8–31.
  17. Litjens, G.; Debats, O.; Barentsz, J.; Karssemeijer, N.; Huisman, H. Computer-aided detection of prostate cancer in MRI. IEEE Trans. Med. Imaging 2014, 33, 1083–1092.
  18. Wang, X.; Yang, W.; Weinreb, J.; Han, J.; Li, Q.; Kong, X.; Yan, Y.; Ke, Z.; Luo, B.; Liu, T.; et al. Searching for prostate cancer by fully automated magnetic resonance imaging classification: Deep learning versus non-deep learning. Sci. Rep. 2017, 7, 15415.
  19. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  20. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118.
  21. Armato, S.G., III; Huisman, H.; Drukker, K.; Hadjiiski, L.; Kirby, J.S.; Petrick, N.; Redmond, G.; Giger, M.L.; Cha, K.; Mamonov, A.; et al. PROSTATEx Challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J. Med. Imaging 2018, 5, 044501.
  22. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211.
  23. Antonelli, M.; Reinke, A.; Bakas, S.; Farahani, K.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; Ronneberger, O.; Summers, R.M.; et al. The Medical Segmentation Decathlon. Nat. Commun. 2022, 13, 4128.
  24. Sanford, T.; Harmon, S.A.; Turkbey, B.; Kesani, D.; Tuncer, S.; Madariaga, M.; Yang, C.; Sackett, J.; Mehralivand, S.; Yan, P.; et al. Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study. J. Magn. Reson. Imaging 2020, 52, 1499–1507.
  25. Cao, R.; Bajgiran, A.M.; Mirak, S.A.; Shakeri, S.; Zhong, X.; Enzmann, D.; Raman, S.; Sung, K. Joint Prostate Cancer Detection and Gleason Score Prediction in mp-MRI via FocalNet. IEEE Trans. Med. Imaging 2019, 38, 2496–2506.
  26. Saha, A.; Hosseinzadeh, M.; Huisman, H. End-to-end prostate cancer detection in bpMRI via 3D CNNs: Effects of attention mechanisms, clinical priori and decoupled false positive reduction. Med. Image Anal. 2021, 73, 102155.
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  28. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
  29. Yu, A.C.; Mohajer, B.; Eng, J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radiol. Artif. Intell. 2022, 4, e210064.
  30. Twilt, J.J.; van Leeuwen, K.G.; Huisman, H.J.; Fütterer, J.J.; de Rooij, M. Artificial intelligence based algorithms for prostate cancer classification and detection on magnetic resonance imaging: A narrative review. Diagnostics 2021, 11, 959.
  31. Bosma, J.S.; Saha, A.; Hosseinzadeh, M.; Slootweg, I.; de Rooij, M.; Huisman, H. Semisupervised learning with report-guided pseudo labels for deep learning-based prostate cancer detection using biparametric MRI. Radiol. Artif. Intell. 2023, 5, e230031.
  32. Mehta, P.; Antonelli, M.; Ahmed, H.U.; Emberton, M.; Punwani, S.; Ourselin, S. Computer-aided diagnosis of prostate cancer using multiparametric MRI and clinical features: A patient-level classification framework. Med. Image Anal. 2021, 73, 102153.
  33. Hiremath, A.; Shiradkar, R.; Fu, P.; Mahran, A.; Rastinehad, A.R.; Tewari, A.; Tirumani, S.H.; Purysko, A.; Ponsky, L.; Madabhushi, A. An integrated nomogram combining deep learning, Prostate Imaging-Reporting and Data System (PI-RADS) scoring, and clinical variables for identification of clinically significant prostate cancer on biparametric MRI: A retrospective multicentre study. Lancet Digit. Health 2021, 3, e445–e454. [Google Scholar] [CrossRef]
  34. Park, S.H.; Han, K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 2018, 286, 800–809. [Google Scholar] [CrossRef]
  35. Saha, A.; Twilt, J.J.; Bosma, J.S.; van Ginneken, B.; Yakar, D.; Elschot, M.; Veltman, J.; Fütterer, J.J.; de Rooij, M.; Huisman, H. The PI-CAI Challenge: Public Training and Development Dataset; Zenodo: Geneva, Switzerland, 2022. [Google Scholar] [CrossRef]
  36. Rouvière, O.; Puech, P.; Renard-Penna, R.; Claudon, M.; Roy, C.; Mège-Lechevallier, F.; Decaussin-Petrucci, M.; Dubreuil-Chambardel, M.; Magaud, L.; Remontet, L.; et al. Use of prostate systematic and targeted biopsy on the basis of multiparametric MRI in biopsy-naive patients (MRI-FIRST): A prospective, multicentre, paired diagnostic study. Lancet Oncol. 2019, 20, 100–109. [Google Scholar] [CrossRef]
  37. Sunoqrot, M.R.S.; Saha, A.; Hosseinzadeh, M.; Elschot, M.; Huisman, H. Artificial intelligence for prostate MRI: Open datasets, available applications, and grand challenges. Eur. Radiol. Exp. 2022, 6, 35. [Google Scholar] [CrossRef]
  38. Schelb, P.; Kohl, S.; Radtke, J.P.; Wiesenfarth, M.; Kickingereder, P.; Bickelhaupt, S.; Kuder, T.A.; Stenzinger, A.; Hohenfellner, M.; Schlemmer, H.P.; et al. Classification of cancer at prostate MRI: Deep learning versus clinical PI-RADS assessment. Radiology 2019, 293, 607–617. [Google Scholar] [CrossRef] [PubMed]
  39. Winkel, D.J.; Tong, A.; Lou, B.; Kamen, A.; Comaniciu, D.; Disselhorst, J.A.; Rodríguez-Ruiz, A.; Huisman, H.; Szolar, D.; Shabana, I.; et al. A novel deep learning based computer-aided diagnosis system improves the accuracy and efficiency of radiologists in reading biparametric magnetic resonance images of the prostate. Investig. Radiol. 2021, 56, 605–613. [Google Scholar] [CrossRef] [PubMed]
  40. Turkbey, B.; Haider, M.A. Deep learning-based artificial intelligence applications in prostate MRI: Brief summary. Br. J. Radiol. 2022, 95, 20210563. [Google Scholar] [CrossRef] [PubMed]
  41. Kelly, C.J.; Karthikesalingam, A.; Suleyman, M.; Corrado, G.; King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019, 17, 195. [Google Scholar] [CrossRef]
  42. Stabile, A.; Giganti, F.; Rosenkrantz, A.B.; Taneja, S.S.; Villeirs, G.; Gill, I.S.; Allen, C.; Emberton, M.; Moore, C.M.; Kasivisvanathan, V. Multiparametric MRI for prostate cancer diagnosis: Current status and future directions. Nat. Rev. Urol. 2020, 17, 41–61. [Google Scholar] [CrossRef]
  43. Roberts, M.; Driggs, D.; Thorpe, M.; Appleton, J.; Aung, N.; Paiva, J.; Doherty, J.; D’Costa, A.; Ball, J.; Gkrania-Klotsas, E.; et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 2021, 3, 199–217. [Google Scholar] [CrossRef]
  44. Nagendran, M.; Chen, Y.; Lovejoy, C.A.; Gordon, A.C.; Komorowski, M.; Harvey, H.; Topol, E.J.; Ioannidis, J.P.A.; Collins, G.S.; Maruthappu, M. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020, 368, m689. [Google Scholar] [CrossRef]
  45. Kim, D.W.; Jang, H.Y.; Kim, K.W.; Shin, Y.; Park, S.H. Design Characteristics of Studies Reporting the Performance of Artificial Intelligence Algorithms for Diagnostic Analysis of Medical Images: Results from Recently Published Papers. Korean J. Radiol. 2019, 20, 405–410. [Google Scholar] [CrossRef]
  46. Bluemke, D.A.; Moy, L.; Bredella, M.A.; Ertl-Wagner, B.B.; Fowler, K.J.; Goh, V.J.; Halpern, E.F.; Hess, C.P.; Schiebler, M.L.; Weiss, C.R. Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers-From the Radiology Editorial Board. Radiology 2020, 294, 487–489. [Google Scholar] [CrossRef]
  47. Greer, M.D.; Lay, N.; Shih, J.H.; Barrett, T.; Bittencourt, L.K.; Borofsky, S.; Kabakus, I.; Law, Y.M.; Marko, J.; Shebel, H.; et al. Computer-aided diagnosis prior to conventional interpretation of prostate mpMRI: An international multi-reader study. Eur. Radiol. 2018, 28, 4407–4417. [Google Scholar] [CrossRef]
  48. Beam, A.L.; Manrai, A.K.; Ghassemi, M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA 2020, 323, 305–306. [Google Scholar] [CrossRef] [PubMed]
  49. Castillo, T.J.M.; Starmans, M.P.A.; Arif, M.; Niessen, W.J.; Klein, S.; Bangma, C.H.; Schoots, I.G.; Veenland, J.F. A multi-center, multi-vendor study to evaluate the generalizability of a radiomics model for classifying prostate cancer: High grade vs. low grade. Diagnostics 2021, 11, 369. [Google Scholar] [CrossRef] [PubMed]
  50. Pesapane, F.; Volonté, C.; Codari, M.; Sardanelli, F. Artificial intelligence as a medical device in radiology: Ethical and regulatory issues in Europe and the United States. Insights Imaging 2018, 9, 745–753. [Google Scholar] [CrossRef] [PubMed]
  51. Allen, B., Jr.; Seltzer, S.E.; Langlotz, C.P.; Dreyer, K.P.; Summers, R.M.; Petrick, N.; Marinac-Dabic, D.; Cruz, M.; Alkasab, T.K.; Hanisch, R.J.; et al. A Road Map for Translational Research on Artificial Intelligence in Medical Imaging: From the 2018 National Institutes of Health/RSNA/ACR/The Academy Workshop. J. Am. Coll. Radiol. 2019, 16, 1179–1189. [Google Scholar] [CrossRef]
  52. European Society of Radiology (ESR). What the radiologist should know about artificial intelligence—An ESR white paper. Insights Imaging 2019, 10, 44. [Google Scholar]
  53. Thon, A.; Teichgräber, U.; Tennstedt-Schenk, C.; Hadjidemetriou, S.; Winzler, S.; Malich, A.; Papageorgiou, I. Computer aided detection in prostate cancer diagnostics: A promising alternative to biopsy? A retrospective study from 104 lesions with histological ground truth. PLoS ONE 2017, 12, e0185995. [Google Scholar] [CrossRef]
  54. McKinney, S.M.; Sieniek, M.; Godbole, V.; Godwin, J.; Antropova, N.; Ashrafian, H.; Back, T.; Chesus, M.; Corrado, G.S.; Darzi, A.; et al. International evaluation of an AI system for breast cancer screening. Nature 2020, 577, 89–94. [Google Scholar] [CrossRef]
  55. Geis, J.R.; Brady, A.P.; Wu, C.C.; Spencer, J.; Ranschaert, E.; Jaremko, J.L.; Langer, S.G.; Borondy Kitts, A.; Birch, J.; Shields, W.F.; et al. Ethics of Artificial Intelligence in Radiology: Summary of the Joint European and North American Multisociety Statement. Radiology 2019, 293, 436–440. [Google Scholar] [CrossRef]
  56. Vasey, B.; Ursprung, S.; Beddoe, B.; Taylor, E.H.; Marlow, N.; Bilbro, N.; Watkinson, P.; McCulloch, P. Association of Clinician Diagnostic Performance With Machine Learning-Based Decision Support Systems: A Systematic Review. JAMA Netw. Open 2021, 4, e211276. [Google Scholar] [CrossRef]
  57. Giannini, V.; Mazzetti, S.; Armando, E.; Carabalona, S.; Russo, F.; Giacobbe, A.; Muto, G.; Regge, D. Multiparametric magnetic resonance imaging of the prostate with computer-aided detection: Experienced observer performance study. Eur. Radiol. 2017, 27, 4200–4208. [Google Scholar] [CrossRef]
  58. Hambrock, T.; Vos, P.C.; Hulsbergen-van de Kaa, C.A.; Barentsz, J.O.; Huisman, H.J. Prostate cancer: Computer-aided diagnosis with multiparametric 3-T MR imaging--effect on observer performance. Radiology 2013, 266, 521–530. [Google Scholar] [CrossRef] [PubMed]
  59. Padhani, A.R.; Weinreb, J.; Rosenkrantz, A.B.; Villeirs, G.; Turkbey, B.; Barentsz, J. Prostate Imaging-Reporting and Data System Steering Committee: PI-RADS v2 Status Update and Future Directions. Eur. Urol. 2019, 75, 385–396. [Google Scholar] [CrossRef] [PubMed]
  60. Gaur, S.; Lay, N.; Harmon, S.A.; Doddakashi, S.; Mehralivand, S.; Argun, B.; Barrett, T.; Bednarova, S.; Girometti, R.; Karaarslan, E.; et al. Can computer-aided diagnosis assist in the identification of prostate cancer on prostate MRI? a multi-center, multi-reader investigation. Oncotarget 2018, 9, 33804–33817. [Google Scholar] [CrossRef]
  61. Niaf, E.; Lartizien, C.; Bratan, F.; Roche, L.; Rabilloud, M.; Mège-Lechevallier, F.; Rouvière, O. Prostate focal peripheral zone lesions: Characterization at multiparametric MR imaging--influence of a computer-aided diagnosis system. Radiology 2014, 271, 761–769. [Google Scholar] [CrossRef] [PubMed]
  62. Dinh, A.H.; Melodelima, C.; Souchon, R.; Moldovan, P.C.; Bratan, F.; Pagnoux, G.; Mège-Lechevallier, F.; Ruffion, A.; Crouzet, S.; Colombel, M.; et al. Characterization of Prostate Cancer with Gleason Score of at Least 7 by Using Quantitative Multiparametric MR Imaging: Validation of a Computer-Aided Diagnosis System in Patients Referred for Prostate Biopsy. Radiology 2018, 287, 525–533. [Google Scholar] [CrossRef]
  63. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef] [PubMed]
  64. Richenberg, J.; Løgager, V.; Panebianco, V.; Rouviere, O.; Villeirs, G.; Schoots, I.G. The primacy of multiparametric MRI in men with suspected prostate cancer. Eur. Radiol. 2019, 29, 6940–6952. [Google Scholar] [CrossRef]
  65. Drost, F.H.; Osses, D.F.; Nieboer, D.; Steyerberg, E.W.; Bangma, C.H.; Roobol, M.J.; Schoots, I.G. Prostate MRI, with or without MRI-targeted biopsy, and systematic biopsy for detecting prostate cancer. Cochrane Database Syst. Rev. 2019, 4, CD012663. [Google Scholar] [CrossRef]
  66. Cuocolo, R.; Cipullo, M.B.; Stanzione, A.; Ugga, L.; Romeo, V.; Radice, L.; Brunetti, A.; Imbriaco, M. Machine learning applications in prostate cancer magnetic resonance imaging. Eur. Radiol. Exp. 2019, 3, 35. [Google Scholar] [CrossRef]
  67. Bi, W.L.; Hosny, A.; Schabath, M.B.; Giger, M.L.; Birkbak, N.J.; Mehrtash, A.; Allison, T.; Arnaout, O.; Abbosh, C.; Dunn, I.F.; et al. Artificial intelligence in cancer imaging: Clinical challenges and applications. CA Cancer J. Clin. 2019, 69, 127–157. [Google Scholar] [CrossRef]
  68. Mongan, J.; Moy, L.; Kahn, C.E., Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A guide for authors and reviewers. Radiol. Artif. Intell. 2020, 2, e200029. [Google Scholar] [CrossRef] [PubMed]
  69. Collins, G.S.; Reitsma, J.B.; Altman, D.G.; Moons, K.G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ 2015, 350, g7594. [Google Scholar] [CrossRef] [PubMed]
  70. Song, Y.; Zhang, Y.D.; Yan, X.; Liu, H.; Zhou, M.; Hu, B.; Yang, G. Computer-aided diagnosis of prostate cancer using a deep convolutional neural network from multiparametric MRI. J. Magn. Reson. Imaging 2018, 48, 1570–1577. [Google Scholar] [CrossRef]
  71. Ishioka, J.; Matsuoka, Y.; Uehara, S.; Yasuda, Y.; Kijima, T.; Yoshida, S.; Yokoyama, M.; Saito, K.; Kihara, K.; Numao, N.; et al. Computer-aided diagnosis of prostate cancer on magnetic resonance imaging using a convolutional neural network algorithm. BJU Int. 2018, 122, 411–417. [Google Scholar] [CrossRef]
  72. Aldoj, N.; Lukas, S.; Dewey, M.; Penzkofer, T. Semi-automatic classification of prostate cancer on multi-parametric MR imaging using a multi-channel 3D convolutional neural network. Eur. Radiol. 2020, 30, 1243–1253. [Google Scholar] [CrossRef] [PubMed]
  73. Arif, M.; Schoots, I.G.; Castillo Tovar, J.; Bangma, C.H.; Krestin, G.P.; Roobol, M.J.; Niessen, W.; Veenland, J.F. Clinically significant prostate cancer detection and segmentation in low-risk patients using a convolutional neural network on multi-parametric MRI. Eur. Radiol. 2020, 30, 6582–6592. [Google Scholar] [CrossRef] [PubMed]
  74. Pellicer-Valero, O.J.; Marenco Jiménez, J.L.; Gonzalez-Perez, V.; Ramón-Borja, J.L.C.; García, I.M.; Benito, M.B.; Gómez, M.P.; Rubio-Briones, J.; Rupérez, M.J.; Martín-Guerrero, J.D. Deep learning for fully automatic detection, segmentation, and Gleason grade estimation of prostate cancer in multiparametric magnetic resonance images. Sci. Rep. 2022, 12, 2975. [Google Scholar] [CrossRef] [PubMed]
  75. Khosravi, P.; Lysandrou, M.; Eljalby, M.; Li, Q.; Kazemi, E.; Zisimopoulos, P.; Sigaras, A.; Brendel, M.; Barnes, J.; Ricketts, C.; et al. A Deep Learning Approach to Diagnostic Classification of Prostate Cancer Using Pathology-Radiology Fusion. J. Magn. Reson. Imaging 2021, 54, 462–471. [Google Scholar] [CrossRef]
  76. Litjens, G.; Debats, O.; Barentsz, J.; Karssemeijer, N.; Huisman, H. ProstateX Challenge data. In The Cancer Imaging Archive; University of Arkansas for Medical Sciences (UAMS): Little Rock, AR, USA, 2017. [Google Scholar] [CrossRef]
  77. Liu, X.; Cruz Rivera, S.; Moher, D.; Calvert, M.J.; Denniston, A.K.; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef]
  78. Sounderajah, V.; Ashrafian, H.; Aggarwal, R.; De Fauw, J.; Denniston, A.K.; Greaves, F.; Karthikesalingam, A.; King, D.; Liu, X.; Markar, S.R.; et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat. Med. 2020, 26, 807–808. [Google Scholar] [CrossRef]
  79. Haibe-Kains, B.; Adam, G.A.; Hosny, A.; Khodakarami, F.; Waldron, L.; Wang, B.; McIntosh, C.; Goldenberg, A.; Kundaje, A.; Greene, C.S.; et al. Transparency and reproducibility in artificial intelligence. Nature 2020, 586, E14–E16. [Google Scholar] [CrossRef]
  80. Kohl, S.; Bonekamp, D.; Schlemmer, H.P.; Yaqubi, K.; Hohenfellner, M.; Hadaschik, B.; Radtke, J.P.; Maier-Hein, K. Adversarial networks for the detection of aggressive prostate cancer. arXiv 2017, arXiv:1702.08014. [Google Scholar] [CrossRef]
  81. McInnes, M.D.F.; Moher, D.; Thombs, B.D.; McGrath, T.A.; Bossuyt, P.M.; The PRISMA-DTA Group. Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies: The PRISMA-DTA Statement. JAMA 2018, 319, 388–396. [Google Scholar] [CrossRef]
  82. Whiting, P.F.; Rutjes, A.W.; Westwood, M.E.; Mallett, S.; Deeks, J.J.; Reitsma, J.B.; Leeflang, M.M.; Sterne, J.A.; Bossuyt, P.M.; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 2011, 155, 529–536. [Google Scholar] [CrossRef]
  83. Reitsma, J.B.; Glas, A.S.; Rutjes, A.W.; Scholten, R.J.; Bossuyt, P.M.; Zwinderman, A.H. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 2005, 58, 982–990. [Google Scholar] [CrossRef]
  84. Guyatt, G.H.; Oxman, A.D.; Vist, G.E.; Kunz, R.; Falck-Ytter, Y.; Alonso-Coello, P.; Schünemann, H.J.; GRADE Working Group. GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008, 336, 924–926. [Google Scholar] [CrossRef]
  85. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef] [PubMed]
  86. Bloch, B.N.; Jain, A.; Jaffe, C.C. Data From PROSTATE-DIAGNOSIS. In The Cancer Imaging Archive; University of Arkansas for Medical Sciences (UAMS): Little Rock, AR, USA, 2015. [Google Scholar] [CrossRef]
  87. Choyke, P.; Turkbey, B.; Pinto, P.; Merino, M.; Wood, B. Data From PROSTATE-MRI. In The Cancer Imaging Archive; University of Arkansas for Medical Sciences (UAMS): Little Rock, AR, USA, 2016. [Google Scholar] [CrossRef]
  88. Yan, K.; Wang, X.; Lu, L.; Summers, R.M. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 2018, 5, 036501. [Google Scholar] [CrossRef] [PubMed]
  89. Shur, J.D.; Doran, S.J.; Kumar, S.; ap Dafydd, D.; Downey, K.; O’Connor, J.P.B.; Papanikolaou, N.; Messiou, C.; Koh, D.M.; Orton, M.R. Radiomics in Oncology: A Practical Guide. Radiographics 2021, 41, 1717–1732. [Google Scholar] [CrossRef]
  90. van Timmeren, J.E.; Cester, D.; Tanadini-Lang, S.; Alkadhi, H.; Baessler, B. Radiomics in medical imaging-”how-to” guide and critical reflection. Insights Imaging 2020, 11, 91. [Google Scholar] [CrossRef]
  91. Zwanenburg, A.; Vallières, M.; Abdalah, M.A.; Aerts, H.J.W.L.; Andrearczyk, V.; Apte, A.; Ashrafinia, S.; Bakas, S.; Beukinga, R.J.; Boellaard, R.; et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping. Radiology 2020, 295, 328–338. [Google Scholar] [CrossRef] [PubMed]
  92. Maier-Hein, L.; Reinke, A.; Godau, P.; Tizabi, M.D.; Buettner, F.; Christodoulou, E.; Glocker, B.; Isensee, F.; Kleesiek, J.; Kozubek, M.; et al. Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv 2022, arXiv:2206.01653. [Google Scholar] [CrossRef]
  93. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  94. Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nat. Med. 2022, 28, 31–38. [Google Scholar] [CrossRef]
  95. Lehman, C.D.; Wellman, R.D.; Buist, D.S.; Kerlikowske, K.; Tosteson, A.N.; Miglioretti, D.L.; Breast Cancer Surveillance Consortium. Diagnostic Accuracy of Digital Screening Mammography with and Without Computer-Aided Detection. JAMA Intern. Med. 2015, 175, 1828–1837. [Google Scholar] [CrossRef]
  96. Fenton, J.J.; Taplin, S.H.; Carney, P.A.; Abraham, L.; Sickles, E.A.; D’Orsi, C.; Berns, E.A.; Cutter, G.; Hendrick, R.E.; Barlow, W.E.; et al. Influence of computer-aided detection on performance of screening mammography. N. Engl. J. Med. 2007, 356, 1399–1409. [Google Scholar] [CrossRef] [PubMed]
  97. Char, D.S.; Shah, N.H.; Magnus, D. Implementing Machine Learning in Health Care—Addressing Ethical Challenges. N. Engl. J. Med. 2018, 378, 981–983. [Google Scholar] [CrossRef]
  98. Emanuel, E.J.; Wachter, R.M. Artificial Intelligence in Health Care: Will the Value Match the Hype? JAMA 2019, 321, 2281–2282. [Google Scholar] [CrossRef] [PubMed]
  99. Price, W.N., II; Cohen, I.G. Privacy in the age of medical big data. Nat. Med. 2019, 25, 37–43. [Google Scholar] [CrossRef] [PubMed]
  100. Obermeyer, Z.; Powers, B.; Vogeli, C.; Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019, 366, 447–453. [Google Scholar] [CrossRef]
  101. Futoma, J.; Simons, M.; Panch, T.; Doshi-Velez, F.; Celi, L.A. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit. Health 2020, 2, e489–e492. [Google Scholar] [CrossRef]
  102. Kaissis, G.A.; Makowski, M.R.; Rückert, D.; Braren, R.F. Secure, privacy-preserving and federated machine learning in medical imaging. Nat. Mach. Intell. 2020, 2, 305–311. [Google Scholar] [CrossRef]
  103. Li, Q.; Wen, Z.; Wu, Z.; Hu, S.; Wang, N.; Li, Y.; Liu, X.; He, B. A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. IEEE Trans. Knowl. Data Eng. 2021, 35, 3347–3366. [Google Scholar] [CrossRef]
  104. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
  105. Azizi, S.; Mustafa, B.; Ryan, F.; Beaver, Z.; Freyberg, J.; Deaton, J.; Loh, A.; Karthikesalingam, A.; Kornblith, S.; Chen, T.; et al. Big Self-Supervised Models Advance Medical Image Classification. arXiv 2021, arXiv:2101.05224. [Google Scholar] [CrossRef]
  106. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv 2016, arXiv:1506.02142. [Google Scholar] [CrossRef]
  107. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv 2017, arXiv:1612.01474. [Google Scholar] [CrossRef]
  108. Wang, D.; Khosla, A.; Gargeya, R.; Irshad, H.; Beck, A.H. Deep Learning for Identifying Metastatic Breast Cancer. arXiv 2016, arXiv:1606.05718. [Google Scholar] [CrossRef]
  109. Patel, B.N.; Rosenberg, L.; Willcox, G.; Baltaxe, D.; Lyons, M.; Irvin, J.; Rajpurkar, P.; Amrhein, T.; Gupta, R.; Halabi, S.; et al. Human-machine partnership with artificial intelligence for chest radiograph diagnosis. npj Digit. Med. 2019, 2, 111. [Google Scholar] [CrossRef]
  110. Tschandl, P.; Rinner, C.; Apalla, Z.; Argenziano, G.; Codella, N.; Halpern, A.; Janda, M.; Lallas, A.; Longo, C.; Malvehy, J.; et al. Human-computer collaboration for skin cancer recognition. Nat. Med. 2020, 26, 1229–1234. [Google Scholar] [CrossRef]
  111. Rajkomar, A.; Dean, J.; Kohane, I. Machine Learning in Medicine. N. Engl. J. Med. 2019, 380, 1347–1358. [Google Scholar] [CrossRef]
  112. Yala, A.; Schuster, T.; Miles, R.; Barzilay, R.; Lehman, C. A Deep Learning Model to Triage Screening Mammograms: A Simulation Study. Radiology 2019, 293, 38–46. [Google Scholar] [CrossRef] [PubMed]
  113. Lee, H.; Yune, S.; Mansouri, M.; Kim, M.; Tajmir, S.H.; Guerrier, C.E.; Ebert, S.A.; Pomerantz, S.R.; Romero, J.M.; Kamalian, S.; et al. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nat. Biomed. Eng. 2019, 3, 173–182. [Google Scholar] [CrossRef] [PubMed]
  114. Parikh, R.B.; Teeple, S.; Navathe, A.S. Addressing Bias in Artificial Intelligence in Health Care. JAMA 2019, 322, 2377–2378. [Google Scholar] [CrossRef]
  115. Sendak, M.P.; Gao, M.; Brajer, N.; Balu, S. Presenting machine learning model information to clinical end users with model facts labels. npj Digit. Med. 2020, 3, 41. [Google Scholar] [CrossRef]
  116. Keane, P.A.; Topol, E.J. With an eye to AI and autonomous diagnosis. npj Digit. Med. 2018, 1, 40. [Google Scholar] [CrossRef] [PubMed]
  117. Wiens, J.; Saria, S.; Sendak, M.; Ghassemi, M.; Liu, V.X.; Doshi-Velez, F.; Jung, K.; Heller, K.; Kale, D.; Saeed, M.; et al. Do no harm: A roadmap for responsible machine learning for health care. Nat. Med. 2019, 25, 1337–1340. [Google Scholar] [CrossRef] [PubMed]
  118. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. arXiv 2019, arXiv:1810.03993. [Google Scholar] [CrossRef]
  119. Shaw, J.; Rudzicz, F.; Jamieson, T.; Goldfarb, A. Artificial Intelligence and the Implementation Challenge. J. Med. Internet Res. 2019, 21, e13659. [Google Scholar] [CrossRef]
  120. Greenhalgh, T.; Wherton, J.; Papoutsi, C.; Lynch, J.; Hughes, G.; A’Court, C.; Hinder, S.; Fahy, N.; Procter, R.; Shaw, S. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. J. Med. Internet Res. 2017, 19, e367. [Google Scholar] [CrossRef] [PubMed]
  121. Verghese, A.; Shah, N.H.; Harrington, R.A. What This Computer Needs Is a Physician: Humanism and Artificial Intelligence. JAMA 2018, 319, 19–20. [Google Scholar] [CrossRef]
  122. He, J.; Baxter, S.L.; Xu, J.; Xu, J.; Zhou, X.; Zhang, K. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 2019, 25, 30–36. [Google Scholar] [CrossRef]
  123. Beede, E.; Baylor, E.; Hersch, F.; Iurchenko, A.; Wilcox, L.; Ruamviboonsuk, P.; Vardoulakis, L.M. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. arXiv 2019, arXiv:1911.06362. [Google Scholar]
  124. D’Amour, A.; Heller, K.; Moldovan, D.; Adlam, B.; Alipanahi, B.; Beutel, A.; Chen, C.; Deaton, J.; Eisenstein, J.; Hoffman, M.D.; et al. Underspecification Presents Challenges for Credibility in Modern Machine Learning. arXiv 2020, arXiv:2011.03395. [Google Scholar] [CrossRef]
Table 1. Key definitions and concepts. Abbreviations: csPCa = clinically significant prostate cancer; AUC = area under curve; CI = confidence interval; I2 = heterogeneity statistic; GRADE = quality of evidence rating.

Term | Category | Definition | References/Notes
csPCa | Clinical Definitions | Clinically significant prostate cancer: most commonly defined as Gleason score ≥ 7 (Grade Group ≥ 2), though definitions vary to include volume criteria (>0.5 cc), PSA density, or specific staging parameters | —
PI-RADS v2.1 | Clinical Definitions | Prostate Imaging Reporting and Data System: standardized 5-point scoring system incorporating zone-specific assessment rules, DWI/ADC weighting for the peripheral zone, T2W dominance for the transition zone, and DCE upgrading for equivocal lesions | Turkbey et al., 2019 [6]
Index lesion | Clinical Definitions | Largest tumor focus, or the lesion with the highest Gleason grade, that drives clinical management and prognosis | —
mpMRI | MRI Technical | Multiparametric MRI combining T2-weighted imaging (anatomical detail), DWI/ADC (cellularity assessment), and DCE (vascular perfusion) | —
Prostate zones | MRI Technical | Peripheral zone (PZ): 70% of gland volume, origin of 70–80% of cancers; transition zone (TZ): 25% of volume, 20–25% of cancers; central zone: 5% of volume, <5% of cancers | —
ADC value | MRI Technical | Apparent diffusion coefficient: quantitative measure of water diffusion; typically lower in malignant than in benign tissue, though absolute values vary by scanner and protocol | Scanner-dependent; requires local calibration
AUC | AI Metrics | Area under the ROC curve: >0.90 excellent, 0.80–0.90 good, 0.70–0.80 fair | —
Dice coefficient | AI Metrics | Spatial overlap metric for segmentation (0–1 scale); >0.85 generally considered clinically acceptable for prostate structures | —
Sensitivity (recall) | AI Metrics | True positive rate: proportion of actual cancers correctly identified; critical for screening applications | —
Specificity | AI Metrics | True negative rate: proportion of non-cancers correctly identified; important for reducing false positives | —
PPV | AI Metrics | Positive predictive value: probability that a positive prediction is correct; varies with disease prevalence | —
NPV | AI Metrics | Negative predictive value: probability that a negative prediction is correct | —
F1 score | AI Metrics | Harmonic mean of precision and recall; useful for imbalanced datasets | —
IoU | AI Metrics | Intersection over union for segmentation tasks; alternative to the Dice coefficient | —
External validation | AI Metrics | Testing on data from a different institution, scanner, or population than the training data | —
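For readers who wish to reproduce the classification and segmentation metrics defined in Table 1, the minimal sketch below shows how they are typically computed from binary labels and masks. It is an illustration only, using NumPy; the function names and the toy data are hypothetical and do not correspond to any study discussed in this review.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = csPCa, 0 = benign)."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # precision; prevalence-dependent
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of precision and recall
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, npv=npv, f1=f1)

def overlap_metrics(mask_true, mask_pred):
    """Dice coefficient and IoU for binary segmentation masks of equal shape."""
    mask_true, mask_pred = np.asarray(mask_true, bool), np.asarray(mask_pred, bool)
    intersection = np.sum(mask_true & mask_pred)
    dice = 2 * intersection / (mask_true.sum() + mask_pred.sum())
    iou = intersection / np.sum(mask_true | mask_pred)
    return dict(dice=dice, iou=iou)

# Hypothetical patient-level labels/predictions and a toy 2D "segmentation".
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds  = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(labels, preds))

gt  = np.zeros((64, 64), bool); gt[20:40, 20:40] = True
seg = np.zeros((64, 64), bool); seg[22:42, 22:42] = True
print(overlap_metrics(gt, seg))
```

Note that PPV, NPV, and F1 shift with disease prevalence, which is one reason the same model can look very different on an enriched research cohort and an unselected clinical population.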
Table 2. Comparative analysis of major AI studies for prostate cancer detection on MRI (2019–2024). Abbreviations: Retrosp. = retrospective; Prosp. = prospective; CNN = convolutional neural network; RF = random forest; DL = deep learning; Cross-inst. = cross-institutional. * Performance on hidden test set.

Study (Year) | Country | Design | Patients | AI Method | Performance (AUC) | Validation | Key Limitations
Schelb (2019) [38] | Germany | Retrosp. | 312 | U-Net CNN | 0.84 (0.79–0.88) | Internal | Single center, no PI-RADS 3
Rouvière (2019) [36] | France | Prosp. | 251 | CAD system | 0.82 (0.77–0.87) | Multicenter | No AI comparison
Winkel (2021) [39] | USA | Reader | 201 | 3D CNN | 0.89 (0.84–0.93) | External | Selected cohort
Saha (2021) [26] | Netherlands | Retrosp. | 1950 | nnU-Net | 0.91 (0.87–0.94) | Cross-inst. | No prospective validation
Mehta (2021) [32] | UK | Retrosp. | 626 | RF + clinical | 0.88 (0.83–0.92) | Temporal | High exclusion rate (31%)
Turkbey (2022) [40] | USA/Multi | Review | 4827 | Various DL | 0.87 (0.83–0.90) | Pooled | High heterogeneity (I2 = 68%)
Bosma (2023) [31] | Netherlands | Retrosp. | 7756 | Semi-supervised | 0.90 (0.88–0.92) | Multi-center | Report labels only
PI-CAI (2023) [35] | Global | Competition | 10,207 | 200+ teams | 0.91 * (top) | Hidden test | Competition ≠ clinical
Table 3. Clinical implementation studies: real-world performance and barriers. Abbreviations: sens. = sensitivity; spec. = specificity; PPV = positive predictive value; acc. = accuracy; mo = months.

Institution | AI System | Implementation Period | Performance Drop * | Primary Barriers | Lessons Learned
Radboud UMC | In-house CNN | 6 months | −8% AUC | PACS integration, training time | Radiologist champions essential
NYU Langone | Commercial CAD | 4 months | −12% sens. | Alert fatigue, workflow disruption | Selective AI use better than routine
Charité Berlin | Hybrid system | 9 months | −5% spec. | Regulatory delays, cost | Need dedicated IT support
UCSF | Cloud-based | 3 months | −15% PPV | Data privacy, latency | On-premise better than cloud
Karolinska | Federated model | 14 months | −7% AUC | Multi-site coordination | Governance framework critical
Stanford | Ensemble AI | 8 months | −11% acc. | Version control, updates | Continuous monitoring required
MD Anderson ** | Commercial v2 | Terminated (5 mo) | −31% spec. | Automation bias, legal concerns | Human factors underestimated
* Performance drop compared with the published validation results (AUC, sensitivity, specificity, PPV, or accuracy, as reported by each study). ** Deployment terminated because of unacceptable performance.
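The "performance drop" in Table 3 is, in practice, the difference between a model's published validation metric and the value a site measures on its own patients after deployment. The sketch below illustrates one way to quantify that difference for AUC, with a percentile bootstrap confidence interval on the local estimate. It is a minimal example assuming scikit-learn is available; the simulated labels, scores, and the published AUC of 0.89 are illustrative only and do not correspond to any system in the table.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc(y_true, y_score, n_boot=2000):
    """Local AUC with a percentile bootstrap 95% CI (patient-level resampling)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    point = roc_auc_score(y_true, y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():  # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return point, lo, hi

# Hypothetical local validation set: labels and model scores are simulated here.
y_local = rng.integers(0, 2, 300)
scores = np.where(y_local == 1,
                  rng.normal(0.65, 0.2, 300),
                  rng.normal(0.45, 0.2, 300)).clip(0, 1)

published_auc = 0.89  # illustrative value, standing in for a development/vendor report
local_auc, lo, hi = bootstrap_auc(y_local, scores)
print(f"Local AUC {local_auc:.2f} (95% CI {lo:.2f}-{hi:.2f}); "
      f"change vs. published {local_auc - published_auc:+.2f}")
```

Routine monitoring of this kind, repeated after scanner, protocol, or software changes, is what several of the centers in Table 3 identified as a prerequisite for sustained deployment.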