Appendix A.2. Performance Evaluation Metrics
Lesion-based recall measures the model’s ability to detect all clinically relevant lesions. A higher recall indicates fewer missed lesions, which is crucial for comprehensive lesion detection. It is calculated as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $TP$ represents the number of lesions detected by the fully automated AI/AI-assisted models that overlap with a lesion identified by the radiologist, and $FN$ represents the number of lesions identified by the radiologist that do not overlap with any lesions from the AI/AI-assisted model.
Precision evaluates the accuracy of the detected lesions. A high precision value means that most of the detected lesions are truly present, that is, there are few false positives. It is calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ is the number of AI-detected lesions that overlap with a radiologist-identified lesion, and $FP$ represents the number of lesions detected by the fully automated AI or AI-assisted models that do not overlap with any lesions identified by the radiologist.
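The two lesion-based ratios can be illustrated with a minimal Python sketch. The function name `lesion_recall_precision` and the counts are hypothetical and chosen only for illustration; how overlapping lesions are matched to obtain the counts is not shown here.

```python
from typing import Tuple


def lesion_recall_precision(tp: int, fn: int, fp: int) -> Tuple[float, float]:
    """Return (recall, precision) from lesion-level counts.

    tp: AI-detected lesions that overlap a radiologist-identified lesion
    fn: radiologist-identified lesions with no overlapping AI lesion
    fp: AI-detected lesions with no overlapping radiologist lesion
    """
    # Return NaN when a ratio is undefined (no reference or no detected lesions).
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return recall, precision


# Hypothetical counts, not taken from the study.
recall, precision = lesion_recall_precision(tp=8, fn=2, fp=4)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Returning NaN for an undefined ratio is one possible convention; it keeps such cases visible instead of silently counting them as 0 or 1.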
- Voxel-based Metrics 
- Dice Score: 
To assess the spatial overlap of the segmented regions, the Dice score (DSC) was calculated [28]. The DSC measures the overlap between the predicted and reference segmentations. It ranges from 0 to 1, where 0 indicates no overlap and 1 indicates perfect agreement between the two segmentations. The Dice score is calculated as
$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}$$
where $|A|$ and $|B|$ represent the total number of voxels in the predicted and reference segmentation masks, respectively, and $|A \cap B|$ represents the number of voxels common to both segmentations.
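As an illustration, the Dice score can be computed from two binary masks with NumPy. This is a minimal sketch: the function name `dice_score` and the toy masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention that the text does not specify.

```python
import numpy as np


def dice_score(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice score between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = int(pred.sum()) + int(ref.sum())
    if denom == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    overlap = int(np.logical_and(pred, ref).sum())
    return 2.0 * overlap / denom


# Toy example: 8 predicted voxels, 12 reference voxels, 8 voxels in common.
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:3] = True
ref = np.zeros((4, 4, 4), dtype=bool)
ref[1:3, 1:3, 1:4] = True
print(dice_score(pred, ref))
```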
The Hausdorff distance (HD) captures the largest boundary discrepancy between the predicted and reference segmentations. A lower value signifies a better match, and the metric highlights the areas with the greatest segmentation error [29]. Let $A$ be the set of boundary points in the predicted segmentation and $B$ the set of boundary points in the reference segmentation. The HD is the greatest of all the distances from a boundary point in one set to the closest boundary point in the other set.
Hausdorff distance can be calculated as
$$\mathrm{HD}(A, B) = \max\left\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\}$$
where $d(a, b)$ represents the Euclidean distance between points $a$ and $b$, sup represents the supremum (least upper bound), and inf represents the infimum (greatest lower bound).
In this formula, $\inf_{b \in B} d(a, b)$ is the shortest distance from a point $a \in A$ to its closest point in $B$. Similarly, $\inf_{a \in A} d(a, b)$ is the shortest distance from a point $b \in B$ to its closest point in $A$. The terms $\sup_{a \in A} \inf_{b \in B} d(a, b)$ and $\sup_{b \in B} \inf_{a \in A} d(a, b)$ represent the largest of these shortest distances for points in $A$ and $B$, respectively.
A smaller HD value indicates better alignment, reflecting fewer outliers or significant discrepancies, while a larger HD value highlights areas of substantial error. This metric is particularly useful for identifying extreme misalignments.
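For finite point sets the supremum and infimum reduce to a maximum and a minimum, so the symmetric Hausdorff distance can be sketched with a brute-force pairwise distance matrix. The function name `hausdorff_distance` and the toy points are hypothetical; the sketch assumes the boundary points have already been extracted and converted to physical coordinates (mm), and it is not meant for very large surfaces because memory grows with the product of the two set sizes.

```python
import numpy as np


def hausdorff_distance(a_pts, b_pts) -> float:
    """Symmetric Hausdorff distance between two finite point sets (N x 3, M x 3)."""
    a = np.asarray(a_pts, dtype=float)
    b = np.asarray(b_pts, dtype=float)
    # Pairwise Euclidean distances via broadcasting, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).max()  # largest nearest-neighbour distance from A to B
    b_to_a = d.min(axis=0).max()  # largest nearest-neighbour distance from B to A
    return float(max(a_to_b, b_to_a))


# Toy boundary points in mm (hypothetical).
A = [(0, 0, 0), (1, 0, 0)]
B = [(0, 0, 0), (4, 0, 0)]
print(hausdorff_distance(A, B))
```

In this toy case the two directed distances differ (1 mm from A to B, 3 mm from B to A), which shows why the maximum over both directions is taken.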
To assess the average alignment of the boundaries, the Average Surface Distance (ASD) was calculated. It indicates how closely the predicted and reference boundaries match on average, with smaller values signifying better alignment. The ASD is calculated as
$$\mathrm{ASD} = \frac{1}{|P| + |Q|} \left( \sum_{p \in P} d(p, Q) + \sum_{q \in Q} d(q, P) \right)$$
where $P$ and $Q$ are the sets of boundary points in the predicted segmentation and reference segmentation, respectively. The terms $|P|$ and $|Q|$ represent the number of boundary points in the predicted and reference segmentations. The distance $d(p, Q)$ denotes the shortest Euclidean distance from a boundary point $p \in P$ to the closest point in $Q$, and $d(q, P)$ represents the shortest Euclidean distance from a boundary point $q \in Q$ to the closest point in $P$.
Unlike the Hausdorff distance, which captures the worst-case error, the ASD focuses on the overall alignment quality, offering a more generalized view of segmentation performance. This metric is useful for evaluating the smoothness and precision of boundary matches between the predicted and reference segmentations.
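The contrast with the Hausdorff distance can be seen in a small sketch. It assumes the pooled form of the ASD, in which the nearest-neighbour distances in both directions are summed and divided by the total number of boundary points; other definitions average the two directions separately. The function name `average_surface_distance` and the toy points are hypothetical.

```python
import numpy as np


def average_surface_distance(p_pts, q_pts) -> float:
    """Pooled average surface distance between two finite boundary point sets."""
    p = np.asarray(p_pts, dtype=float)
    q = np.asarray(q_pts, dtype=float)
    # Pairwise Euclidean distances, shape (len(p), len(q)).
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    p_to_q = d.min(axis=1)  # nearest-neighbour distance for every point in P
    q_to_p = d.min(axis=0)  # nearest-neighbour distance for every point in Q
    return float((p_to_q.sum() + q_to_p.sum()) / (len(p) + len(q)))


# Toy boundary points in mm (hypothetical).
P = [(0, 0, 0), (1, 0, 0)]
Q = [(0, 0, 0), (4, 0, 0)]
print(average_surface_distance(P, Q))
```

For these points the nearest-neighbour distances are 0, 1, 0, and 3 mm, so the average is 1 mm while the worst-case distance between the same sets is 3 mm.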
To assess how well the centers of the segmentation masks are aligned, the Euclidean distance between the centroids of the estimated masks and reference masks was calculated as
$$d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}$$
where $(p_x, p_y, p_z)$ represent the coordinates of center point $p$ in 3D space, and $(q_x, q_y, q_z)$ represent the coordinates of point $q$ in 3D space. A smaller distance indicates better localization accuracy.
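A minimal sketch of the centroid distance follows. It assumes the centroid is the unweighted mean position of the foreground voxels, scaled by the voxel spacing to obtain millimetres; the function name `centroid_distance`, the `spacing` parameter, and the toy masks are hypothetical. Both masks are assumed to be non-empty.

```python
import numpy as np


def centroid_distance(pred, ref, spacing=(1.0, 1.0, 1.0)) -> float:
    """Euclidean distance between the centroids of two non-empty binary masks."""
    sp = np.asarray(spacing, dtype=float)
    # Mean voxel index of the foreground, converted to physical units.
    c_pred = np.argwhere(np.asarray(pred, dtype=bool)).mean(axis=0) * sp
    c_ref = np.argwhere(np.asarray(ref, dtype=bool)).mean(axis=0) * sp
    return float(np.linalg.norm(c_pred - c_ref))


# Toy masks with one foreground voxel each (hypothetical).
pred = np.zeros((5, 5, 5), dtype=bool)
pred[0, 0, 0] = True
ref = np.zeros((5, 5, 5), dtype=bool)
ref[3, 4, 0] = True
print(centroid_distance(pred, ref))
```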
To quantify the model’s ability to correctly classify all voxels that belong to the prostate or tumor, the voxel-based recall was calculated; it reflects whether the relevant voxels are included in the segmentation. Additionally, voxel-based precision evaluates how accurately the model identifies positive voxels. Higher precision indicates fewer false positives within the segmentation. Voxel-based recall and precision are calculated as
$$\mathrm{Recall}_{voxel} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision}_{voxel} = \frac{TP}{TP + FP}$$
where true positives (TPs) are voxels that the model correctly identifies as positive, false negatives (FNs) are positive voxels that the model misses, and false positives (FPs) are negative voxels that the model incorrectly identifies as positive.
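The voxel-wise counts can be obtained directly from two binary masks. The following is a minimal sketch; the function name `voxel_recall_precision` and the toy one-dimensional masks are hypothetical, and NaN is returned when a ratio is undefined.

```python
import numpy as np


def voxel_recall_precision(pred, ref):
    """Return (recall, precision) for two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.count_nonzero(pred & ref)    # positive in both masks
    fn = np.count_nonzero(~pred & ref)   # positive in reference only
    fp = np.count_nonzero(pred & ~ref)   # positive in prediction only
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return recall, precision


# Toy 1D masks (hypothetical): TP = 2, FN = 2, FP = 1.
pred = [1, 1, 1, 0, 0, 0]
ref = [0, 1, 1, 1, 1, 0]
recall, precision = voxel_recall_precision(pred, ref)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```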
To characterize the size of the annotated volumes, the volumes of the segmentations were calculated. The volume can be expressed as
$$V = N \times V_{voxel}$$
where $V$ represents the total volume of the segmentation, $N$ represents the number of voxels in the annotated region, and $V_{voxel}$ represents the volume of a single voxel.
The volume of a single voxel can be expressed as
$$V_{voxel} = \Delta x \times \Delta y \times \Delta z$$
where $\Delta x$, $\Delta y$, and $\Delta z$ represent the voxel size in mm in the $x$, $y$, and $z$ directions, respectively.
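The volume computation amounts to counting voxels and multiplying by the voxel volume. The sketch below also converts from mm³ to cc (1 cc = 1000 mm³), on the assumption that volumes are reported in cc; the function name `segmentation_volume_cc`, the toy mask, and the spacing values are hypothetical.

```python
import numpy as np


def segmentation_volume_cc(mask, spacing_mm) -> float:
    """Volume of a binary mask in cc, given the voxel spacing in mm."""
    n_voxels = int(np.count_nonzero(mask))
    voxel_volume_mm3 = float(np.prod(spacing_mm))  # dx * dy * dz
    return n_voxels * voxel_volume_mm3 / 1000.0    # 1 cc = 1000 mm^3


# Toy mask with 1000 foreground voxels and a hypothetical spacing of 0.5 x 0.5 x 3.0 mm.
mask = np.ones((10, 10, 10), dtype=bool)
print(segmentation_volume_cc(mask, (0.5, 0.5, 3.0)))
```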
The Absolute Volume Difference quantifies the absolute difference in volume between the predicted and reference segmentations. A smaller value indicates closer volumetric agreement. It can be expressed as
$$\Delta V = \left| V_{manual} - V_{AI} \right|$$
where $\Delta V$ represents the difference in volume, $V_{manual}$ represents the volume of the manual segmentation, and $V_{AI}$ represents the volume of the AI tool segmentation or the AI-assisted segmentation.
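The Relative Volume Difference can be expressed as
$$\mathrm{RVD} = \frac{\left| V_{manual} - V_{AI} \right|}{V_{manual}}$$
which expresses the Volume Difference as a fraction of the volume of the reference (manual) segmentation, $V_{manual}$ (multiplied by 100 to give a percentage), highlighting proportional size accuracy. Here, $V_{AI}$ is the volume of the AI tool segmentation or the AI-assisted segmentation.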
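Both volume differences follow directly from the two volumes. The sketch below assumes that the relative difference is the absolute difference divided by the reference (manual) volume and reports it as a fraction; the function name `volume_differences` and the example volumes are hypothetical.

```python
def volume_differences(v_manual: float, v_ai: float):
    """Return (absolute difference, relative difference) between two volumes.

    The relative difference is a fraction of the manual (reference) volume;
    it is NaN when the reference volume is zero.
    """
    abs_diff = abs(v_manual - v_ai)
    rel_diff = abs_diff / v_manual if v_manual > 0 else float("nan")
    return abs_diff, rel_diff


# Hypothetical volumes in cc, not taken from the study.
abs_diff, rel_diff = volume_differences(v_manual=2.0, v_ai=1.5)
print(f"absolute = {abs_diff:.2f} cc, relative = {rel_diff:.2f}")
```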
  Appendix A.3. Subgroup Analysis of Lesion Segmentation Performance
The PI-RADS 3 group, containing only 5 lesions, was excluded from further analysis due to the small sample size. Among the metrics tested, only recall in the PI-RADS 5 group exhibited a normal distribution, with a p-value of 0.40. Consequently, the Mann–Whitney U test was used to evaluate statistical differences between the subgroups.
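A subgroup comparison of this kind can be sketched with SciPy's `mannwhitneyu`. The values below are hypothetical and are not data from the study; the two-sided alternative is an assumption, and any correction for multiple comparisons is not shown because the text does not describe one.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-lesion metric values for two subgroups (NOT study data).
group_a = [0.41, 0.48, 0.52, 0.55, 0.60]
group_b = [0.71, 0.76, 0.80, 0.84, 0.90]

# Two-sided test of whether the two samples come from the same distribution.
result = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {result.statistic:.1f}, p = {result.pvalue:.4f}")
```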
In the comparison of PI-RADS scores, statistically significant differences were observed only for the voxel-based precision and Absolute Volume Difference metrics. No significant differences were found in any metric between lesions located in the peripheral zone (PZ) and the transition zone (TZ). Detailed results are provided in Table A2 and Table A3.
  
    
  
  
Table A2. Segmentation performance metrics for tumor comparison based on PI-RADS 4 and PI-RADS 5 scores, including mean, standard deviation (STD), median, and p-values for statistical differences. * indicates a statistically significant difference.

| Metric | PI-RADS 4 Mean | PI-RADS 4 STD | PI-RADS 4 Median | PI-RADS 5 Mean | PI-RADS 5 STD | PI-RADS 5 Median | p-Value |
|---|---|---|---|---|---|---|---|
| Dice Coefficient | 0.54 | 0.16 | 0.57 | 0.53 | 0.17 | 0.55 | 1.00 |
| Hausdorff Distance [mm] | 9.46 | 6.51 | 7.24 | 9.88 | 5.67 | 7.88 | 1.00 |
| Average Surface Distance [mm] | 1.70 | 1.51 | 1.35 | 1.99 | 1.15 | 1.78 | 0.21 |
| Recall (voxel) | 0.55 | 0.21 | 0.53 | 0.44 | 0.20 | 0.46 | 0.20 |
| Precision (voxel) | 0.64 | 0.23 | 0.69 | 0.79 | 0.17 | 0.84 | * |
| Euclidean Distance Centres [mm] | 2.77 | 2.82 | 2.19 | 3.15 | 2.34 | 2.52 | 0.80 |
| Absolute Volume Difference [cc] | 1.28 | 4.71 | 0.32 | 2.17 | 6.00 | 1.01 | * |
| Relative Volume Difference | 0.80 | 1.61 | 0.41 | 0.47 | 0.32 | 0.44 | 1.00 |
  
    
  
  
Table A3. Segmentation performance metrics for tumor comparison by prostatic region (peripheral zone and transition zone), including mean, standard deviation (STD), median, and p-values for statistical differences.

| Metric | Peripheral Zone Mean | Peripheral Zone STD | Peripheral Zone Median | Transition Zone Mean | Transition Zone STD | Transition Zone Median | p-Value |
|---|---|---|---|---|---|---|---|
| Dice Coefficient | 0.52 | 0.16 | 0.56 | 0.50 | 0.22 | 0.61 | 1.00 |
| Hausdorff Distance [mm] | 9.13 | 5.18 | 7.42 | 14.62 | 9.16 | 14.26 | 0.64 |
| Average Surface Distance [mm] | 1.73 | 0.92 | 1.66 | 3.19 | 2.63 | 1.68 | 1.00 |
| Recall (voxel) | 0.48 | 0.20 | 0.51 | 0.43 | 0.20 | 0.48 | 1.00 |
| Precision (voxel) | 0.72 | 0.21 | 0.76 | 0.76 | 0.26 | 0.82 | 1.00 |
| Euclidean Distance Centres [mm] | 2.88 | 2.12 | 2.52 | 5.43 | 5.25 | 2.89 | 1.00 |
| Absolute Volume Difference [cc] | 0.90 | 1.01 | 0.56 | 8.73 | 15.26 | 1.49 | 0.66 |
| Relative Volume Difference | 0.64 | 1.32 | 0.41 | 0.92 | 1.63 | 0.44 | 1.00 |