3D-Vision-Transformer Stacking Ensemble for Assessing Prostate Cancer Aggressiveness from T2w Images

Vision transformers represent the cutting edge of computer vision and are usually employed on two-dimensional data following a transfer learning approach. In this work, we propose a trained-from-scratch stacking ensemble of 3D-vision transformers to assess prostate cancer aggressiveness from T2-weighted images, helping radiologists diagnose this disease without performing a biopsy. We trained 18 3D-vision transformers on T2-weighted axial acquisitions and combined them into two- and three-model stacking ensembles. We defined two metrics for measuring model prediction confidence, and we trained all the ensemble combinations according to a five-fold cross-validation, evaluating their accuracy, confidence in predictions, and calibration. In addition, we optimized the 18 base ViTs and compared the best-performing base and ensemble models by re-training them on a 100-sample bootstrapped training set and evaluating each model on the hold-out test set. We compared the two distributions by calculating the median and the 95% confidence interval and performing a Wilcoxon signed-rank test. The best-performing 3D-vision-transformer stacking ensemble provided state-of-the-art results in terms of area under the receiver operating characteristic curve (0.89 [0.61–1]) and exceeded the area under the precision–recall curve of the base model by 22% (p < 0.001). However, it proved to be less confident in classifying the positive class.


Introduction
In 2020, prostate cancer (PCa) was the world's second-most-common tumor among men (accounting for 14.1% of new diagnoses, just behind lung cancer) and the fifth for mortality (6.8%) [1]. For screening PCa, physicians primarily use the prostate-specific antigen (PSA) test, which measures the amount of PSA in the blood, a marker of potential PCa [2]. However, PSA levels may also rise due to other conditions, including an enlarged or inflamed prostate [3]. Therefore, if the PSA test is positive, the patient typically undergoes a multiparametric magnetic resonance imaging (mpMRI) examination [2]. Here, T2-weighted (T2w) and diffusion-weighted (DWI) images (along with apparent diffusion coefficient (ADC) maps) are acquired to investigate the anatomy and detect the presence of the tumor, respectively. These acquisitions allow radiologists to make a preliminary diagnosis following the Prostate Imaging Reporting and Data System (PI-RADS) guidelines [4]. According to the PI-RADS standard, the radiologist assigns a numerical value between 1 and 5 to the suspected lesion: the higher the score, the greater the likelihood that the nodule in question is malignant. If PI-RADS ≥ 3, the lesion is likely to be clinically significant, and the patient undergoes a biopsy [2]. Based on the two most common patterns in the biopsy specimen, the pathologist assigns a score known as the Gleason Score (GS) to grade the tumor's aggressiveness. Along with the GS, it is also recommended to provide the group affiliation of the assigned score defined by the International Society of Urological Pathology (ISUP), as this facilitates predicting patient outcomes and communicating with patients [5]. If GS ≥ 3 + 4 (ISUP ≥ 2), the tumor is confirmed to be clinically significant [6]. However, it is often the case that the suspected lesion turns out to be indolent after a biopsy examination.
In particular, only about 20% of all biopsied PI-RADS 3 patients show intermediate/severe PCa pathology [7]. Although mpMRI investigation reduces overdiagnosis [8], it remains a qualitative diagnostic tool, highly dependent on radiologist experience and acquisition protocols [9]. For this reason, there is a need for an automated tool to support radiologists in clinical practice and make diagnosis more robust, reliable, and, above all, non-invasive.
In this work, we propose to perform the PCa aggressiveness classification task from T2w images by exploiting an ensemble of vision transformers (ViTs) [24]. ViTs are becoming increasingly popular in the medical imaging domain [25–34], usually outperforming classical CNNs [35,36], which are among the most significant networks in the deep learning field [37]. The existing literature typically employs ViTs in transfer learning scenarios by pre-training them on large datasets of natural images and fine-tuning them on specific datasets [27,28,38]. However, due to the limited availability of medical imaging data, we propose training the ViT from scratch after downsizing the architecture. Additionally, since medical data are often acquired in volumetric form, we modify the ViT architecture to train it on 3D volumes, leveraging most of the information from the acquisitions. To further enhance the performance of vanilla 3D ViTs, which we will call base models henceforth, we propose to combine them in stacking ensembles. The aim is to create a meta-model that learns how to best combine the predictions of the base 3D ViTs to harness the capabilities of a stack of models. Finally, to assess the models' confidence in making predictions in addition to their accuracy, we propose two confidence metrics based on the models' output probability.
The key contributions of our work can be summarized as follows:
• We introduce a downscaled version of the ViT architecture and train it from scratch using small datasets, addressing the challenge of limited data availability.
• We propose modifications to the ViT architecture to handle 3D input, enabling the model to effectively leverage volumetric data in the PCa aggressiveness classification task from T2w images.
• We develop stacking ensembles by combining multiple base 3D ViTs, thereby leveraging the strengths of both stronger and weaker base models to improve overall performance.
• We define two novel confidence metrics that provide insights into the models' confidence in making predictions, offering a more comprehensive evaluation beyond accuracy alone.
• We conduct comparative experiments to assess the performance of ensemble 3D ViTs against the base models in the task of PCa aggressiveness assessment from T2w images.
These contributions collectively aim to enhance the classification accuracy and reliability of PCa aggressiveness assessment, utilizing the power of ensemble models and tailored adaptations to the ViT architecture.

Dataset
The dataset exploited for this study was collected at Radboud University Medical Centre's Prostate MR Reference Center as part of the ProstateX-2 challenge [39] and contained a retrospective set of prostate MR studies. All studies included T2w, proton-density-weighted (PDw), dynamic contrast-enhanced (DCE), and DWI acquisitions, along with ADC maps. All images were acquired on two Siemens 3T MR scanners, the MAGNETOM Trio and Skyra. In this work, we exploited only axial T2w acquisitions since, according to [22], they are the most informative for this task. T2w images were acquired using a turbo spin-echo sequence with an in-plane resolution of around 0.5 mm and a slice thickness of 3.6 mm.

Base 3D ViTs
The ViT model was originally designed for handling two-dimensional data, as introduced in [24]. We modified the model to handle three-dimensional input, i.e., each embedding is obtained by flattening a 3D patch rather than a 2D one. Following the formalism presented in [24], we define our input as $x \in \mathbb{R}^{H \times W \times Z \times C}$, where $(H, W, Z)$ represents the resolution of the volumetric input and $C$ denotes the number of channels. The ViT divides the input volume into $(P, P, Z)$ patches and flattens them into one-dimensional vectors. As a result, for each input volume, the encoder receives a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot Z \cdot C)}$, where $N = HW/P^2$ represents the number of patches. In this study, we deal with grayscale images, so $C = 1$.
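As a concrete illustration of this 3D patch embedding, the following minimal PyTorch sketch (not the authors' code; the class and parameter names are ours) maps a 128 × 128 × 5 grayscale volume to N = 64 tokens with P = 16:

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Minimal 3D patch embedding: splits an (H, W, Z) volume into (P, P, Z)
    patches and projects each flattened patch to a D-dimensional token."""

    def __init__(self, patch_size=16, depth=5, in_channels=1, hidden_size=32):
        super().__init__()
        # A Conv3d whose kernel and stride equal the patch shape is equivalent
        # to flattening each 3D patch and applying a shared linear projection.
        self.proj = nn.Conv3d(in_channels, hidden_size,
                              kernel_size=(patch_size, patch_size, depth),
                              stride=(patch_size, patch_size, depth))

    def forward(self, x):  # x: (B, C, H, W, Z)
        x = self.proj(x)                     # (B, D, H/P, W/P, 1)
        return x.flatten(2).transpose(1, 2)  # (B, N, D), with N = HW / P^2

# A 128 x 128 x 5 grayscale volume yields N = 64 tokens of dimension 32.
tokens = PatchEmbed3D()(torch.randn(1, 1, 128, 128, 5))
print(tokens.shape)  # torch.Size([1, 64, 32])
```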
Initially, we attempted to apply transfer learning by fine-tuning ViTs, based on the positive results reported in the literature [27,28,38]. However, this approach resulted in poor classification performance, which we attributed to the limited size of our training set. Consequently, we explored training the ViTs from scratch to mitigate the issue of overfitting, downscaling the architecture to significantly reduce the number of learnable parameters. To determine the optimal configuration, we conducted a grid search, exploring different values for the multilayer perceptron (MLP) size (d), hidden size (D), number of layers (L), and number of attention heads (k). We set P = 16 as the patch size, which seemed reasonable for the 3D ViTs to process sufficient information; we also conducted preliminary experiments with patch sizes of P = 8 and P = 32 but observed notably inferior results. After setting the values for L and k, we derived the appropriate value of D based on Equation (1). Finally, we calculated the value of d according to Equation (2). In our grid search, we also took into consideration the value of d used in the ViT-Base architecture proposed in the original article, which was set to 3072 [24]. The parameters of each configuration explored in the grid search are summarized in Table 1. Furthermore, Figure 1 provides a visualization of a generic base 3D ViT architecture, highlighting the parameters that were varied during the grid search.

Ensemble 3D ViTs
Once we defined and trained the 18 base models according to the grid search, we explored combining these models to improve overall performance. Specifically, we implemented several stacking ensembles by concatenating the outputs of the base models and feeding them as input to an additional meta-classifier. This meta-classifier is responsible for producing the final output by performing a linear transformation of the incoming data, as depicted in the following equation:

$$y = xA^{T} + b$$

where $x$ represents the input data, $A^{T}$ denotes the transpose of the weight matrix $A$, and $b$ represents the bias term. This process yields the final output $y$, which serves as the ensemble model's prediction based on the combined knowledge of all the base models. Figure 2 provides a visual representation of a generic stacking ensemble consisting of m base 3D ViTs. In our study, we evaluated all possible combinations for m = 2 and m = 3. For each ensemble model, we passed the output of the meta-classifier through a sigmoid function, which computed the probability of the input belonging to the positive class. We considered a prediction as positive if the output probability exceeded 0.5.
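To make the ensemble wiring concrete, here is a minimal PyTorch sketch, assuming each base model emits a single logit per volume and stays frozen while the meta-classifier is trained (these conventions and the class name are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class StackingEnsemble(nn.Module):
    """Sketch of a stacking ensemble: the outputs of m base models are
    concatenated and fed to a linear meta-classifier (y = xA^T + b)."""

    def __init__(self, base_models):
        super().__init__()
        self.base_models = nn.ModuleList(base_models)
        for p in self.base_models.parameters():
            p.requires_grad = False  # base models assumed frozen here
        self.meta = nn.Linear(len(base_models), 1)  # one logit per base model

    def forward(self, x):
        feats = torch.cat([m(x) for m in self.base_models], dim=1)  # (B, m)
        return torch.sigmoid(self.meta(feats))  # probability of positive class

# Toy usage with stand-in base models; a prediction is positive if > 0.5.
bases = [nn.Sequential(nn.Flatten(), nn.Linear(128 * 128 * 5, 1)) for _ in range(3)]
probs = StackingEnsemble(bases)(torch.randn(4, 1, 128, 128, 5))  # (4, 1)
```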

Data Pre-Processing
To make the model focus on the tumor lesions, we adopted a slice selection strategy based on the lesion coordinates within the volume. For each volumetric acquisition, we chose the slice containing the lesion itself, as well as two slices above and two slices below it, totaling five slices per lesion. The rationale behind this approach was to provide the network with a section of the entire acquisition that is most influenced by the presence of the tumor lesion. By including the surrounding slices, we aimed to capture contextual information and provide the model with a more comprehensive view of the lesion and its immediate surroundings.
To address the issue of varying matrix sizes in the dataset (in terms of the number of pixels in rows and columns), we adopted a resampling approach. In particular, we selected the largest and most common matrix dimension as our reference size, which was 384 × 384 in this case, and we upsampled all the images, utilizing bilinear interpolation to match the reference size. This resampling technique allowed us to standardize the image dimensions across the dataset. The decision to choose the largest matrix dimension as the reference in our study was driven by two main considerations. Firstly, by selecting the largest matrix dimension, we aimed to minimize the need for down-sampling, thereby avoiding potential loss of valuable information that may occur during this process. Secondly, this choice was aligned with the most common size found in the dataset, reducing the number of patients that would require resampling. Another crucial aspect was that all acquisitions in the dataset shared the same slice thickness of 3 mm, eliminating the need for any modifications.
After upsampling the images to a consistent size of 384 × 384 pixels, we performed a center-cropping operation on each slice to facilitate the model's focus on the prostate gland. The center-cropping process involved extracting a smaller region from the center of each slice. Through empirical analysis, we determined that a crop size of 128 × 128 pixels was suitable for encompassing prostate glands of various sizes within the field of view while eliminating a significant portion of non-glandular tissues. Thus, for each lesion, we obtained a volume of 128 × 128 × 5 pixels, representing the cropped slices from the selected region.

To address the class imbalance issue in the dataset, we applied several data augmentation techniques to balance the training dataset. Specifically, we utilized three augmentation strategies: vertical flip, horizontal flip, and rotation. We chose these strategies to perform rigid transformations of the images while preserving the appearance and shape of the lesion(s) and prostate within each image. During rotation augmentation, we randomly selected the degree of rotation from the interval [−25°, +25°]. Importantly, we rotated all images within the same volume by the same amount to maintain consistency, and we used bilinear interpolation during the rotation process to ensure smooth transitions. Since the original training set consisted of 54 LG and 27 HG cases, we randomly selected 9 HG volumes using a fixed seed and generated three augmented versions of each selected HG volume using the techniques described above. Consequently, the final training set comprised 54 volumes of both LG and HG cases, resulting in a balanced dataset.

To ensure data harmonization across the training, validation, and test sets, we applied mean normalization by calculating the average pixel value of the training set and subtracting it from all images in the training, validation, and test sets. This normalization step standardized the pixel values across the different sets and aligned their distributions.
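The following sketch summarizes the pre-processing pipeline in PyTorch (function and argument names are hypothetical; the augmentation step is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def preprocess_volume(volume, lesion_idx, train_mean, ref=384, crop=128):
    """Sketch of the pre-processing steps described above. `volume` is a
    (num_slices, H, W) tensor, `lesion_idx` the index of the slice containing
    the lesion, and `train_mean` the training-set average pixel value."""
    # 1) Slice selection: lesion slice plus two slices above and two below.
    x = volume[lesion_idx - 2 : lesion_idx + 3].unsqueeze(1).float()  # (5, 1, H, W)
    # 2) Upsample in-plane to the 384 x 384 reference size (bilinear).
    x = F.interpolate(x, size=(ref, ref), mode="bilinear", align_corners=False)
    # 3) Center-crop to 128 x 128 to focus on the prostate gland.
    off = (ref - crop) // 2
    x = x[:, :, off : off + crop, off : off + crop]
    # 4) Mean normalization with the training-set average pixel value.
    return x.squeeze(1) - train_mean  # (5, 128, 128) volume

# Example: a 20-slice 320 x 320 acquisition with the lesion on slice 10.
out = preprocess_volume(torch.rand(20, 320, 320), lesion_idx=10, train_mean=0.5)
print(out.shape)  # torch.Size([5, 128, 128])
```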

Accuracy Metrics
To assess the models' classification performance, we evaluated the following: specificity, sensitivity, balanced accuracy, the area under the receiver operating characteristic curve (AUROC), and the area under the precision–recall curve (AUPRC). We chose to report all these metrics because each provides valuable insights into the classifier's performance from a different perspective. Together, they offer a comprehensive evaluation of the model's ability to discriminate between the two classes and help in understanding its strengths and limitations.
Here is why each metric is important:
• Balanced Accuracy: the arithmetic mean of sensitivity and specificity. This metric is useful here because the dataset is imbalanced and traditional accuracy may be unreliable.
• AUROC: summarizes the overall performance of the model with a single value that indicates the model's ability to distinguish between the two classes across all possible thresholds.
• AUPRC: quantifies the overall ability of the model to balance precision and recall across all possible thresholds. We reported this metric because our dataset has a moderate skew toward the negative class; it therefore lets us assess the models' behavior with respect to the positive class, regardless of the composition of the dataset itself [40].
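A possible computation of these accuracy metrics, using scikit-learn (our sketch; AUPRC is approximated here by average precision, a common implementation choice):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             confusion_matrix, roc_auc_score)

def accuracy_metrics(y_true, y_prob, threshold=0.5):
    """Specificity, sensitivity, balanced accuracy, AUROC, and AUPRC."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),
    }

print(accuracy_metrics([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9]))
```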

Confidence and Calibration Metrics
In contrast to the common practice in the literature, where model performance is primarily evaluated based on accuracy metrics, we recognized the importance of considering the models' prediction confidence. This is crucial, especially in high-risk domains such as medicine, where the impact of incorrect predictions can be significant. To assess the models' confidence in making predictions, we introduced two metrics based on the models' output probabilities. Specifically, we defined a reliable prediction for the negative class as one that is correct and computed with an output probability less than or equal to 0.3. Conversely, for the positive class, a reliable prediction is correct and made with an output probability greater than or equal to 0.7. In terms of notation, let us denote true positives as TP, true negatives as TN, false positives as FP, and false negatives as FN, and let $TN_{p \le 0.3}$ and $TP_{p \ge 0.7}$ denote the numbers of true negatives and true positives predicted with an output probability of at most 0.3 and at least 0.7, respectively. The confidence metrics can then be defined as follows:

$$CSP = \frac{TN_{p \le 0.3}}{TN + FP}, \qquad CSE = \frac{TP_{p \ge 0.7}}{TP + FN}$$

Ideally, CSP equals specificity and CSE equals sensitivity. When this happens, all the correct predictions are made confidently: for negative ground truth (LG in this case) instances, the model predicted a probability of belonging to the positive class less than or equal to 0.3, while for positive ground truth (HG in this case) instances, it predicted a probability greater than or equal to 0.7. CSP and CSE are hybrid metrics that combine the model's ability to make a correct prediction with its confidence in making that prediction. They thus provide a truer and more reliable measure of the model's potential, which is critical in high-risk domains such as medical imaging. By incorporating these confidence-based metrics into our evaluation process, we sought to provide a more nuanced understanding of the models' performance, allowing a more cautious and informed approach to their practical application. Notably, the reliability of the confidence metrics depends on the calibration of the model. Calibration ensures that the predicted probabilities accurately represent the true probabilities of correctness: in a well-calibrated model, if the model assigns a probability of 40% to an image representing a dog, the actual probability of correctness should be close to 40%.
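Under these definitions, CSP and CSE can be computed as in the following sketch; the denominators (all actual negatives and positives, respectively) are our reading, consistent with the statement that, ideally, CSP equals specificity and CSE equals sensitivity:

```python
import numpy as np

def csp_cse(y_true, y_prob, lo=0.3, hi=0.7):
    """Confidence metrics: CSP counts confident true negatives (p <= 0.3)
    over all actual negatives; CSE counts confident true positives
    (p >= 0.7) over all actual positives."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob > 0.5).astype(int)
    neg, pos = y_true == 0, y_true == 1
    conf_tn = np.sum(neg & (y_pred == 0) & (y_prob <= lo))  # confident TNs
    conf_tp = np.sum(pos & (y_pred == 1) & (y_prob >= hi))  # confident TPs
    csp = conf_tn / neg.sum()  # equals specificity when all TNs are confident
    cse = conf_tp / pos.sum()  # equals sensitivity when all TPs are confident
    return csp, cse

print(csp_cse([0, 0, 1, 1], [0.1, 0.4, 0.8, 0.6]))  # (0.5, 0.5)
```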
To assess the calibration of our model, we employed the Brier score (BS) [41], a widely used calibration metric. The Brier score measures the mean squared difference between the predicted probabilities and the corresponding true outcomes:

$$BS = \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^2$$

Here, $f_i$ represents the predicted probability, $o_i$ denotes the actual outcome for instance $i$, and $N$ represents the total number of predictions. In a binary classification task, a perfectly calibrated model would yield a Brier score of 0, meaning the model consistently assigns a probability of 0 to negative instances and a probability of 1 to positive instances. By computing the Brier score separately for each class, namely the negative class (BSNC) and the positive class (BSPC), we can assess the calibration of the model for each class individually:

$$BSNC = \frac{1}{N_0}\sum_{i=1}^{N_0}\left(f_i^0\right)^2, \qquad BSPC = \frac{1}{N_1}\sum_{i=1}^{N_1}\left(1 - f_i^1\right)^2$$

Here, $f_i^0$ represents the predicted probability when the ground truth is negative, $f_i^1$ represents the predicted probability when the ground truth is positive, and $N_0$ and $N_1$ denote the numbers of negative and positive instances, respectively. Lower values of BSNC indicate better calibration for the negative class, while lower values of BSPC indicate better calibration for the positive class.
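The three calibration scores can be computed directly from the formulas above, as in this minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def brier_scores(y_true, y_prob):
    """BS over all samples plus the per-class variants BSNC and BSPC."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bs = np.mean((y_prob - y_true) ** 2)
    bsnc = np.mean(y_prob[y_true == 0] ** 2)          # negatives: target is 0
    bspc = np.mean((1.0 - y_prob[y_true == 1]) ** 2)  # positives: target is 1
    return bs, bsnc, bspc

print(brier_scores([0, 0, 1, 1], [0.1, 0.4, 0.8, 0.6]))
```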

Dataset Splitting
In our study, we initially divided the dataset into two parts: 90 lesions (80%) for training and validation and 22 lesions (20%) for the test set. To ensure robustness in model evaluation, we employed two different strategies for further splitting the 90-lesion set. The first strategy involved creating two subsets: 90% for training and 10% for validation. The second strategy involved dividing it into five subsets for conducting a five-fold cross-validation (CV). We used the first splitting strategy to train the 18 base 3D ViTs, which we later used in the ensembles. We employed the second splitting strategy for optimizing both the base and ensemble models. To enhance the reliability of our findings, we bootstrapped the training set obtained from the first splitting strategy, generating 100 samples. We utilized these samples for re-training the best-performing base and ensemble models obtained from the five-fold CV. To ensure balanced representation, we performed all splits while stratifying the data based on the lesion class (approximately two-thirds LG and one-third HG) and lesion location (approximately two-fifths PZ, two-fifths AS, and one-fifth TZ). Additionally, we employed a patient-wise splitting approach to prevent any data leakage. A summary of the composition of the training, validation, and test sets with respect to tumor aggressiveness and location can be found in Table 2.

Our implementation relied on several libraries, including [45], Pillow (v. 9.0.1) [46], and Pandas [47]. We performed the training and test processes employing an Intel Core i7 ASUS desktop computer with 32 GB of RAM and an NVIDIA GeForce GTX 1650 GPU.
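A patient-wise, class-stratified split can be sketched with scikit-learn's StratifiedGroupKFold, as below; note that this toy example (with hypothetical lesion counts and patient assignments) stratifies by class only, whereas the study also stratified by lesion location:

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Toy stand-ins: one entry per lesion, a binary label (0 = LG, 1 = HG),
# and the patient each lesion belongs to (hypothetical values).
y = np.array([0] * 60 + [1] * 30)
patients = np.arange(90) // 2  # assume up to two lesions per patient
X = np.zeros((90, 1))          # placeholder features

# Patient-wise, class-stratified five-fold CV; StratifiedGroupKFold keeps
# all lesions of a patient in the same fold, preventing data leakage.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(cv.split(X, y, groups=patients)):
    assert set(patients[tr]).isdisjoint(patients[va])  # no patient overlap
```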

Base Models
We trained each of the 18 base models with the following hyperparameters: learning rate = 1 × 10⁻⁴, weight decay = 1 × 10⁻², maximum number of steps = 1000, batch size = 4, warmup steps = 1000, optimization algorithm = Adam, loss function = binary cross-entropy. To make each training run reproducible, we exploited the reproducibility flags provided by the PyTorch [42], NumPy [43], and Random [48] libraries, choosing a seed equal to 42. We trained each base model according to a five-fold CV.
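A minimal sketch of the corresponding reproducibility and training setup in PyTorch (the stand-in model and the use of BCEWithLogitsLoss, which applies the sigmoid internally, are our assumptions):

```python
import random
import numpy as np
import torch

# Reproducibility flags along the lines described above (seed = 42).
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Hypothetical training configuration mirroring the listed hyperparameters.
model = torch.nn.Linear(10, 1)  # stand-in for a base 3D ViT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on logits
```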

Ensemble Models
We explored all possible combinations of two- and three-base-model stacking ensembles. Specifically, we evaluated 153 two-model combinations and 816 three-model combinations. Each combination underwent training and evaluation using a five-fold CV approach, maintaining consistency with the dataset splitting and reproducibility seed used for the base models. To train the meta-classifier, we utilized the following set of hyperparameters: learning rate = 1 × 10⁻⁴, weight decay = 1 × 10⁻², number of epochs = 100, batch size = 4, optimization algorithm = Adam, loss function = binary cross-entropy. These hyperparameters were applied consistently across all training iterations of the meta-classifier to ensure fair and comparable performance evaluation.
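The enumerated combinations follow directly from the 18 base configurations, e.g.:

```python
from itertools import combinations

configs = range(1, 19)  # the 18 base 3D ViT configurations
two_model = list(combinations(configs, 2))    # C(18, 2) = 153 ensembles
three_model = list(combinations(configs, 3))  # C(18, 3) = 816 ensembles
print(len(two_model), len(three_model))       # 153 816
```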

Performance Evaluation
To assess the performance and calibration of the models, we evaluated their accuracy, confidence in predictions, and calibration. The selection of the best-performing base and ensemble models was based on their performance in the five-fold CV process. To compare the best-performing base and ensemble models, we conducted a 100-sample bootstrap of the entire training set. For each bootstrapped sample, we re-trained both models and evaluated them on the same hold-out test set, generating a performance distribution. From these distributions, we calculated the median performance and the 95% confidence interval (CI). This approach allowed us to obtain a robust estimation of the models' performance and CIs for a reliable assessment and comparison.
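A sketch of how the median and 95% CI can be derived from such a performance distribution (the percentile method for the CI is our assumption):

```python
import numpy as np

def median_and_ci(scores, alpha=0.05):
    """Median and 95% CI of a performance distribution; `scores` holds one
    test-set metric value per bootstrapped re-training (100 here)."""
    scores = np.asarray(scores)
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.median(scores)), (float(lo), float(hi))

# Example with a hypothetical distribution of 100 AUROC values.
rng = np.random.default_rng(42)
print(median_and_ci(rng.uniform(0.6, 1.0, size=100)))
```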

Statistical Analysis
To evaluate the statistical significance of the difference between the best-performing base and ensemble models, we conducted the Wilcoxon signed-rank test. We considered the difference between the performance distributions of the base and ensemble models statistically significant if the resulting p-value (p) was less than 0.050. We performed the statistical analysis exploiting the SciPy library [49] (v. 1.9.3).
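With SciPy, the paired comparison can be sketched as follows (the performance arrays here are hypothetical stand-ins):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired distributions: one AUPRC value per bootstrap sample
# for the base and ensemble models.
rng = np.random.default_rng(42)
base = rng.uniform(0.55, 0.75, size=100)
ensemble = base + rng.normal(0.1, 0.05, size=100)

stat, p = wilcoxon(ensemble, base)  # paired, non-parametric test
print(f"p = {p:.3g} -> significant: {p < 0.050}")
```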

Results
The best-performing ensemble of 3D ViTs, composed of configurations 5, 9, and 11, achieved, in terms of median and 95% CI, a specificity of 0.83, an AUROC of 0.89 [0.61–1], and an AUPRC of 0.87. We present the results for the best-performing base and ensemble models in terms of accuracy and confidence in Table 3 and Table 4, respectively. Additionally, in Figure 4, we include the ROC and PR curves for the two best-performing models when trained on the entire training set (non-bootstrapped). These visualizations provide further insights into the performance and discriminative capabilities of the models.
In Figure 5, we present the box plots illustrating the distributions of results for the best-performing base and ensemble models based on accuracy metrics. Similarly, in Figure 6, we provide the box plots representing the distributions of results for the base and ensemble models with respect to confidence metrics.

For clarity and comparison purposes, we present the five-fold CV accuracy and confidence results for all the base models in Tables 5 and 6, respectively. Similarly, we provide the five-fold CV results of the top 10 best-performing ensembles in terms of accuracy and prediction confidence in Tables 7 and 8, respectively. All reported values are presented as the mean and SD across the five folds.

Table 5. Prediction accuracy results of each configuration of base 3D ViT on the five-fold CV. Results are provided as mean and SD across the five folds. We use bold formatting to highlight the configuration that achieved the best performance and underlined formatting to indicate the configuration with the worst performance. Details on configurations can be viewed in Table 1.

Table 6. Prediction confidence results of each configuration of base 3D ViT on the five-fold CV. Results are provided as mean and SD across the five folds. We indicate with the ↑ symbol the metrics whose values are better when higher and with the ↓ symbol the metrics whose values are better when smaller. We use bold formatting to highlight the configuration that achieved the best performance and underlined formatting to indicate the configuration with the worst performance. Details on configurations can be viewed in Table 1.

Table 8. Prediction confidence results of the top-10 stacking ensembles on the five-fold CV. Results are provided as mean and SD across the five folds. We indicate with the ↑ symbol the metrics whose values are better when higher and with the ↓ symbol the metrics whose values are better when smaller. We use bold formatting to highlight the configuration that achieved the best performance and underlined formatting to indicate the configuration with the worst performance. Details on configurations can be viewed in Table 1.

Discussion
In this study, we proposed trained-from-scratch 3D ViTs in a stacking ensemble configuration and assessed their effectiveness in evaluating PCa aggressiveness from T2w MRI acquisitions. The approach involved training vanilla 3D ViTs in various configurations and subsequently combining them into two- and three-model ensembles. We concatenated the features extracted from each of these base models and fed them to a single fully connected layer, which yielded the final prediction. The evaluation of the trained models centered on their accuracy performance, specifically focusing on specificity, sensitivity, balanced accuracy, AUROC, and AUPRC. Additionally, we measured the calibration of these models using the BS. We further computed the BS with respect to the negative class only (BSNC) and the positive class only (BSPC), allowing us to gain insights into their calibration for each class separately. To enhance the model evaluation process, we introduced two novel metrics, CSP and CSE. The primary purpose of these metrics was to provide a comprehensive measure that combines the model's prediction capabilities with its confidence level, highlighting only those predictions made with a confidence level above a predefined threshold. CSP and CSE were designed to offer more detailed and reliable information about the model's performance, with the ultimate goal of bringing its practical application in clinical settings closer and enabling a more informed and cautious approach when utilizing the model's outputs in real-world medical scenarios.
We trained the base 3D ViT models according to a grid search, varying architecture parameters such as d, D, L, and k, for a total of 18 base 3D ViTs. By training all possible combinations of two and three models, we created 969 ensembles from these base models. We trained each ensemble following a five-fold CV and re-trained the one with the highest performance on a 100-sample bootstrapped training set. For comparison, we also optimized the 18 base ViTs using a five-fold CV and re-trained the best-performing base ViT on the same bootstrapped training set. We evaluated each of the 100 models from both the ensemble and the base ViT on a separate hold-out test set. To determine whether there was a significant difference in performance between the base and ensemble models, we conducted a statistical analysis on the distributions of results obtained from the test-set evaluations.
We evaluated our approach using the ProstateX-2 challenge dataset [39], appropriately divided into training, validation, and test sets while ensuring strict separation between patients.
According to the results, the ensemble model demonstrated strong performance in classifying LG and HG lesions, as evidenced by its high median AUROC (0.89). A high AUROC indicates a robust ability to accurately identify positive instances while effectively minimizing false-positive predictions. This performance metric holds great importance, particularly in tasks characterized by imbalanced class distributions or situations where the costs associated with false positives and false negatives are substantial, as exemplified in this case. Apart from its general performance, the model displayed notable proficiency in classifying the positive class, with a median AUPRC of 0.87. The AUPRC is a crucial evaluation metric, especially when dealing with imbalanced datasets: it measures the area under the precision–recall curve, which plots precision against recall (sensitivity) at various classification thresholds. In cases where the class distribution is imbalanced, a high AUPRC signifies that the model achieves a high precision rate while maintaining a reasonable recall rate. This means that when the model makes a positive prediction, it is highly likely to be correct (high precision), and it successfully captures a significant portion of the actual positive instances (high recall).
Regarding the calibration aspect, the model exhibited strong calibration performance, as evidenced by a BS of 0.16. The BS is a widely used metric that assesses the calibration of probabilistic predictions made by a classification model by measuring the mean squared difference between the predicted probabilities and the actual binary outcomes. A lower Brier score indicates better calibration, implying that the model's predicted probabilities align closely with the actual outcomes.
Concerning the model's confidence in its predictions, CSP reached a remarkable value equal to the classical specificity metric, 0.83. This outcome signifies that all correct predictions for the negative class (specificity) were made with high confidence, i.e., the model returned them with an output probability less than or equal to 0.3.
Upon comparing the ensemble model to the best-performing base 3D ViT, we conducted a Wilcoxon signed-rank test to assess the statistical significance of their performance differences. In terms of AUROC, the results of this analysis revealed no statistically significant difference between the two models (p = 0.73). This indicates that both the ensemble model and the base ViT performed similarly in terms of overall discriminative ability. However, a notable contrast emerged when evaluating the models' proficiency in classifying HG lesions. Indeed, the ensemble model outperformed the base ViT by 22% in AUPRC, a difference that was statistically significant (p < 0.001). The substantial improvement in AUPRC for HG lesions highlights the ensemble model's particular strength, compared to the base ViT, in accurately identifying and distinguishing severe lesions. This is crucial in medical applications, where detecting HG lesions can significantly impact patient outcomes and treatment decisions.
Indeed, while the ensemble model exhibited a significant performance improvement in classifying HG lesions, it is essential to consider its impact on calibration and confidence in predictions. A closer examination of the calibration metrics reveals that the ensemble model's performance comes at the expense of poorer calibration towards the HG class. This is evident from the higher BSPC and the lower CSE values compared to the base ViT. The higher BSPC suggests that the ensemble model's probability predictions for positive instances in the HG class may be less well calibrated, which may lead to less reliable probability estimates for high-grade lesions. Furthermore, the lower CSE value indicates reduced confidence in the ensemble model's predictions for the HG class, i.e., the model might not be as certain when making predictions for positive instances in this category. On the contrary, the ensemble model demonstrated improved confidence in its predictions for the negative class when compared to the base ViT. This was evident from the lower BSNC and higher CSP values relative to the base model. The lower BSNC suggests that the ensemble model's probability predictions for the negative class are better calibrated and align closely with the actual outcomes. Moreover, the higher CSP signifies that the ensemble model confidently assigns low probabilities (less than or equal to 0.3) to the correct negative predictions. All the cited differences in metric values were statistically significant according to the Wilcoxon signed-rank test (p < 0.001).
Furthermore, we investigated the performance of the ensemble model compared to the base model, focusing on the consistency of its predictions. The results, illustrated in Figures 5 and 6, show that the ensemble model displayed larger CIs than the base model. This suggests that the ensemble model's performance is characterized by higher variability, making its predictions less stable and more susceptible to fluctuations. While the ensemble model demonstrated superior performance in certain aspects, the broader range of its CIs indicates that its predictions may be less consistent, leading to varying results across different iterations or datasets. This variability could be attributed to the complexity introduced by the ensemble configuration, as it involves combining multiple models, each with its own characteristics.

From the five-fold CV results, several observations can be made:
• Performance of the best-performing base and ensemble models. The best-performing ensemble model contains the best-performing base model, indicating that this specific configuration of architecture parameters achieves high performance. This configuration stands out as it is present in three out of the ten best ensemble combinations, as shown in Table 7.
• Impact of the number of attention heads and embedding size. Configurations 5 and 8, both having eight attention heads and an embedding size of 32, are the most frequently recurring setups in the top-performing ensembles (three out of ten combinations each). Additionally, configurations 4, 7, and 16, all having an embedding size of 64 and four attention heads, also occur frequently (three out of ten ensembles). These architecture parameter combinations seem to contribute to improving the ensembles' performance, even though they might not perform well when used in isolation.
• Benefit of combining weaker models. Comparing Tables 5 and 7, it is evident that combining weaker models in stacking ensembles significantly improves accuracy performance. This finding highlights the benefits of leveraging ensemble methods to improve model performance, even when individual base models might not be as strong.
We compared the performance of our best-performing ensemble 3D ViT model with state-of-the-art (SoTA) studies that addressed the same clinical task of assessing PCa aggressiveness from MRI images. To ensure a fair comparison, we considered all the studies that employed T2w images, alone or combined with other image modalities. In addition, we compared our results with studies employing both radiomics and deep learning for a comprehensive evaluation. Among radiomic studies, Jensen et al. [20] achieved an AUROC of 0.830 using a K-nearest neighbors (KNN) classifier with only T2w images. Bertelli et al. [22] reported, as their best machine learning model, an AUROC of 0.750 (90% CI: [0.500–1.000]) for T2w images alone and 0.63 (90% CI: [0.17–1]) when combining T2w and ADC images. As for deep learning models, Yuan et al. [21] used an AlexNet in a transfer-learning approach, achieving an AUROC of 0.809 with T2w images alone and 0.90 when combining T2w and ADC acquisitions. Bertelli et al. [22] utilized CNNs with attention gates, resulting in an AUROC of 0.875 (90% CI: [0.639–1.000]) for T2w images only and 0.67 (90% CI: [0.30–1]) when combining T2w and ADC. In comparison, our best-performing ensemble 3D ViT model displayed SoTA performance, with an AUROC that outperformed all the other models using only T2w images. Unfortunately, a direct comparison for the positive class only was not possible due to the lack of available data in the literature. The comparison results are summarized in Table 9, providing an overview of our model's performance with respect to the SoTA studies.
This study is subject to several limitations that should be acknowledged. Firstly, we did not conduct a hyperparameter optimization, which may have limited the overall performance of our models; instead, we kept the hyperparameter values fixed throughout all the training phases and focused solely on optimizing the architecture parameters. Moreover, in our investigation of stacking ensembles, we only considered two- and three-model combinations. Expanding the ensemble to include more base models might lead to further improvements in the results; indeed, utilizing a larger ensemble could enhance model diversity and potentially increase predictive performance. Furthermore, one of the limitations lies in the choice of using only axial T2w images for training. In contrast, many of the cited works combined multiple modalities, such as T2w and ADC images, to improve model performance. Expanding the dataset to incorporate additional modalities could potentially enhance the model's ability to capture diverse and complementary information, thereby leading to more robust predictions. Overall, recognizing these limitations is essential to interpreting the findings accurately and understanding the scope of the study. Future research could address these limitations and explore new avenues for improving the performance of the 3D ViT model in assessing PCa aggressiveness.

Table 9. Comparison between our best-performing 3D-ViT stacking ensemble and several SoTA models that assess the PCa aggressiveness classification (ISUP 1 + 2 vs. rest).

Conclusions
In this study, we explored the effectiveness of 3D ViTs in stacking ensembles to improve the assessment of PCa aggressiveness from T2w images. The best-performing ensemble demonstrated strong capabilities in differentiating between LG and HG lesions, achieving an AUROC comparable to SoTA methods. Additionally, it displayed a high ability to classify positive instances, as evidenced by a high AUPRC, also outperforming the base model (p < 0.001). However, this improvement came at the cost of reduced calibration and prediction confidence for the positive class, as indicated by a lower CSE and a higher BSPC. At the same time, the ensemble model was found to be better calibrated and more confident in its predictions regarding the negative class. In addition, the larger CIs for the ensemble model suggest that its performance is less reliable and more variable compared to the base model. Overall, our study showed that 3D ViT ensembles yield promising results for PCa aggressiveness assessment from T2w images and provide improved, though less reliable, performance with respect to their base versions. This suggests that further refinements or strategies to mitigate variability may be necessary before deployment in critical real-world scenarios.
Our analysis has certain limitations. First, we did not perform hyperparameter optimization. Second, we limited our investigation to ensembles of only two and three models. Finally, we trained our models only on T2w images from a relatively small dataset; in fact, at the time of our experiments, ProstateX-2 was one of the few publicly available datasets with more than 100 lesions. Our work was a preliminary investigation that will serve as the basis for further research taking advantage of the larger datasets being collected within the Tuscany Region PAR FAS NAVIGATOR project and the EU H2020 ProCAncer-I project. In these new research endeavors, we expect to address these aspects and thereby improve the effectiveness of our approach.
Funding: This study has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 952159 (ProCAncer-I) and from the Regional Project PAR FAS Tuscany-NAVIGATOR, CUP I58D20000500002. The funders had no role in the design of the study; the collection, analysis, and interpretation of data; or writing the manuscript.
Institutional Review Board Statement: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: