The training process used the Adam optimizer with its default momentum settings, which provided stable, adaptive optimization throughout training. The initial learning rate of 0.001 was regulated by a cosine annealing scheduler that gradually reduced the rate during training, promoting smoother convergence. A batch size of 16 was chosen to balance memory usage and training efficiency. Each model was trained for a maximum of 100 epochs, with early stopping applied on the basis of validation loss to prevent overfitting. Weight decay was applied to regularize the model and discourage over-complex solutions. Training was performed on an NVIDIA RTX 3090 GPU, with each model requiring approximately 6–8 h to complete; smaller models such as YOLOv8n and YOLOv11n trained more quickly owing to their lightweight architectures and reduced computational requirements.
To improve the robustness and generalization of the models, several data augmentation techniques were applied during training: horizontal and vertical flips, random rotations within ±10 degrees, and random adjustments to brightness and contrast. Mosaic augmentation, a technique supported by the YOLO framework, was also used to combine four images into one, enhancing the model’s ability to detect objects in varied and complex contexts.
Model checkpointing was based on validation performance, with the best-performing model (in terms of F1-score) saved during training. Validation was performed after each epoch to monitor performance and enable early stopping when no further improvements were observed. Summary of training details:
It is important to note that smoothing does not alter the actual performance metrics or evaluation results. Instead, it improves the visual readability of the curves, making it easier to qualitatively assess the evolution of training dynamics, identify trends, and compare the behavior of different models throughout the training process.
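The cosine annealing schedule used for the learning rate follows a simple closed form. As a sketch (the minimum rate `lr_min = 0` is an assumption, since only the initial rate of 0.001 is stated in the text):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=0.001, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (0-indexed)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# The rate starts at lr_max and decays smoothly toward lr_min over training.
schedule = [cosine_annealing_lr(e, 100) for e in range(101)]
```

At the midpoint of training the rate has halved, and it approaches zero by the final epoch, which is what produces the smooth late-stage convergence described above.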
3.2.5. Comparative Analysis of YOLO Models
Table 3 presents a detailed architectural comparison between different variants of the YOLOv8 and YOLOv11 object detection models. It lists four specific model versions: YOLOv8n, YOLOv8m, YOLOv11n, and YOLOv11m, and compares them across five key attributes:
Parameters (in millions): This indicates the size and complexity of each model. YOLOv8 variants range from 3.2M to 25.9M parameters, while YOLOv11 variants are slightly larger, ranging from 4.1M to 27.8M.
Speed (milliseconds per image): This reflects the inference time per image. YOLOv11n achieves the fastest speed at 3.9 ms/img, with YOLOv8n close behind at 4.5 ms/img. The medium-sized models (YOLOv8m and YOLOv11m) are slower but still maintain competitive speeds around 7.6–8.2 ms/img.
Backbone Architecture: The backbone is the core feature extractor of the model. YOLOv8 models utilize variations of CSPDarknet (Tiny and Medium), while YOLOv11 models employ more advanced backbones such as RepNConvs combined with FastSAM and an Enhanced RepNBackbone for improved performance.
The advantages of each model are highlighted:
YOLOv8n is lightweight and optimized for edge devices, emphasizing speed and efficiency.
YOLOv8m offers a balanced trade-off between speed and accuracy.
YOLOv11n focuses on the fastest inference time and improved small object detection.
YOLOv11m targets high precision and is designed for reliable, robust inference.
Table 3 also illustrates the evolution from YOLOv8 to YOLOv11, highlighting improvements in processing speed, backbone sophistication, and detection capabilities. These enhancements support a range of deployment scenarios, from lightweight edge applications to high-precision inference.
The comparison of key performance metrics for the different YOLO models applied to aneurysm detection is presented in Table 4. Each metric (Precision, Recall (Sensitivity), and F-Mean) is reported with a 95% confidence interval obtained via bootstrapping (a statistical method for estimating the variability or uncertainty of a metric) across patients. These intervals indicate the expected variability in performance and provide a more robust assessment than point estimates alone. The table highlights differences in detection performance across models while reflecting uncertainty in the reported metrics, which is especially relevant for clinical interpretation.
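The per-patient bootstrapping behind these intervals can be sketched as follows; the number of replicates and the percentile method used here are assumptions, as the exact resampling protocol is not stated:

```python
import random

def bootstrap_ci(per_patient_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a
    per-patient metric (e.g., precision computed on each patient)."""
    rng = random.Random(seed)
    n = len(per_patient_scores)
    means = []
    for _ in range(n_boot):
        # Resample patients with replacement and recompute the mean metric.
        resample = [rng.choice(per_patient_scores) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling at the patient level (rather than per image) respects the correlation between slices from the same patient, which is why the resulting intervals are more clinically meaningful.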
For YOLOv8n, the model achieved a precision of 0.7692 with a 95% confidence interval of 0.72–0.81, indicating that roughly 77% of predicted aneurysms were correct. Its recall (sensitivity) was 0.4061 (0.35–0.46), meaning it detected about 41% of actual aneurysms. The F-Mean of 0.5314 (0.48–0.58) balances precision and recall, reflecting moderate overall detection performance.
YOLOv8m showed improved performance, with a precision of 0.8382 (0.79–0.88) and recall of 0.4923 (0.43–0.55), resulting in an F-Mean of 0.6232 (0.57–0.68). This indicates a higher proportion of correct detections and a better balance between precision and sensitivity compared to YOLOv8n.
YOLOv11n exhibited a different behavior, achieving a precision of 0.6423 (0.59–0.69) but a higher recall of 0.5814 (0.52–0.64), resulting in an F-Mean of 0.6101 (0.55–0.66). This model is slightly less precise but detects more actual aneurysms, showing a trade-off between overprediction and sensitivity.
YOLOv11m achieved the highest precision at 0.9123 (0.87–0.95) but a moderate recall of 0.4146 (0.36–0.47), with an F-Mean of 0.5663 (0.51–0.62). This indicates that YOLOv11m makes very few false positive predictions but misses a larger fraction of actual aneurysms, emphasizing conservative detection.
Combining this analysis with five-fold cross-validation highlights the reproducibility of the YOLO models. Consistent performance across folds and seeds indicates stable generalization despite the inherent randomness in deep learning optimization. This comparative evaluation highlights the inherent trade-offs between model sensitivity (recall) and reliability (precision), with important implications for clinical use, where missing an aneurysm can have serious consequences, yet false positives can lead to unnecessary interventions or procedures.
The achieved results highlight notable differences in performance among the evaluated YOLO models. For YOLOv8, the medium variant (YOLOv8m) outperforms the nano variant (YOLOv8n) across all metrics, showing higher precision, recall, and F-mean. In contrast, the YOLOv11 results reveal a more nuanced pattern: YOLOv11m achieves the highest precision, while YOLOv11n exhibits better recall, leading to a higher F-mean for YOLOv11n compared with YOLOv11m. These differences, combined with observed variability from random seed initializations, illustrate how model size and architecture impact both performance and robustness.
Overall, these results highlight differences in precision–recall trade-offs across the YOLO variants, and the inclusion of 95% confidence intervals provides an estimate of variability, making the comparisons more robust and clinically interpretable.
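For reference, the F-Mean reported above is the harmonic mean of precision and recall (the F1 score). Recomputing it from the rounded point estimates gives values close to those in the table; small discrepancies likely reflect fold-level averaging and rounding:

```python
def f_mean(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# E.g., YOLOv8n's point estimates (P = 0.7692, R = 0.4061) yield an
# F-Mean of roughly 0.53, matching the reported 0.5314.
```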
Comparison of Precision:
YOLOv8m attained the highest precision within the v8 family (0.8382), comfortably outperforming YOLOv8n (0.7692).
YOLOv11m achieved a significantly higher precision (0.9123) than YOLOv11n (0.6423).
This indicates that YOLOv11m was more effective at reducing false positives, making it better suited for clinical scenarios where high diagnostic confidence is required.
In contrast, while YOLOv11n’s precision was lower, it was still respectable and indicates reasonable prediction reliability.
The 7-percentage-point gain indicates that the medium-capacity model suppresses false positives more effectively, likely due to its larger backbone and detection head, which better capture contextual information around small aneurysms. Although YOLOv8n’s precision is slightly lower, a value near 0.77 still represents respectable reliability for a lightweight network designed for resource-constrained environments, such as edge inference on scanner consoles.
Comparison of Recall:
YOLOv8m also leads in recall (0.4923) versus YOLOv8n (0.4061).
YOLOv11n outperformed YOLOv11m in recall (0.5814 vs. 0.4146).
This means YOLOv11n was more successful in identifying actual aneurysms and is therefore more sensitive, reducing the risk of missing positive cases.
YOLOv11m’s lower recall suggests a conservative detection strategy, which may fail to capture small or less pronounced aneurysms.
Recall (sensitivity) measures the proportion of true objects correctly detected by the model. In our experiments, the observed sensitivities reveal notable differences between the evaluated architectures. Among the YOLOv8 models, the medium variant (YOLOv8m) achieved a sensitivity of 0.4923, outperforming the smaller YOLOv8n model (0.4061). This indicates that YOLOv8m detected a larger fraction of true findings, missing fewer targets overall.
For the YOLOv11 models, the YOLOv11n variant demonstrated the highest sensitivity across all tested models, achieving 0.5814. This suggests that the updated architecture enhances detection capability, particularly in identifying a greater portion of true positive cases. In contrast, YOLOv11m achieved a lower sensitivity of 0.4146, indicating that, despite its higher precision, it missed more true targets compared to the lighter YOLOv11n model.
Overall, these results indicate that YOLOv11n provides the best balance in terms of sensitivity, capturing the highest number of true detections among all evaluated configurations.
Comparison of F-Mean:
The harmonic mean confirms the above trends: YOLOv8m posts the highest F-Mean in the entire study (0.6232), whereas YOLOv8n records 0.5314.
YOLOv11n yielded a higher F-Mean (0.6101) than YOLOv11m (0.5663), highlighting its more balanced performance across both recall and precision.
The F1 Score reflects the harmonic mean of precision and recall and suggests that YOLOv11n provides a better trade-off for general-purpose detection tasks.
YOLOv8m therefore offers the best overall balance of precision and recall among all four models analysed, making it an excellent “single-pass” detector when one cannot run multiple specialised models in tandem.
Precision and Recall Curves:
YOLOv8m starts with higher precision and maintains a steady upward climb, intersecting YOLOv8n’s curve by epoch 30 and never relinquishing the lead.
Recall curves reveal a similar crossover: YOLOv8m overtakes YOLOv8n around epoch 50, indicating that its feature hierarchy matures later but ultimately captures more positives. In practical terms, YOLOv8n is “fast out of the gate,” useful for rapid prototyping or low-epoch fine-tuning, whereas YOLOv8m rewards longer training with superior asymptotic performance.
Both models showed increasing precision over time; however, YOLOv11m reached perfect precision faster and more consistently.
YOLOv11n’s recall continued to improve significantly over training, while YOLOv11m’s recall plateaued early, which aligns with its conservative detection strategy.
The four Precision–Recall (PR) curves illustrate the performance evolution of deep learning-based object detection models tasked with identifying aneurysms. These curves summarize performance across all classes and provide insight into the model’s reliability in prioritizing cases for radiologist review, supporting the use of this approach as a decision-support tool for aneurysm detection. The key metric is mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), a standard measure of object detection accuracy that balances precision and recall across thresholds. In our experiments, YOLOv8n and YOLOv8m achieved mAP@0.5 scores of 0.702 and 0.663, respectively, while YOLOv11n and YOLOv11m achieved scores of 0.712 and 0.662, respectively.
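Under mAP@0.5, a prediction counts as a true positive when its Intersection-over-Union (IoU) with a ground-truth box reaches 0.5. A minimal IoU helper for axis-aligned boxes illustrates the matching criterion:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted box overlapping a ground-truth aneurysm with IoU >= 0.5
# is scored as a true positive when computing mAP@0.5.
```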
The curve analysis of YOLOv8n (Figure 8) provides a detailed view of its precision–recall behavior and highlights how the model balances detection confidence with sensitivity across varying thresholds:
A noticeable improvement is observed in both the stability and shape of the curve (Figure 8). Precision remains consistently high (>0.9) up to a recall of approximately 0.4–0.5, and the subsequent decline is more gradual than in Figure 9, indicating better discriminative power and generalization. The model in Figure 8 demonstrates an improved ability to identify more true positives without a sharp increase in false positives. The smoother curve reflects enhanced robustness, likely resulting from more effective data augmentation or optimized post-processing (e.g., non-maximum suppression thresholds).
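The non-maximum suppression step mentioned above can be sketched as a greedy procedure; the 0.5 IoU threshold used here is illustrative, not necessarily the models' actual setting:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes.
    boxes: list of (x1, y1, x2, y2); scores: matching confidences."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    # Visit boxes in descending confidence order; drop any later box that
    # overlaps an already-kept box by more than the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Raising the IoU threshold keeps more overlapping candidates (favoring recall), while lowering it suppresses more aggressively (favoring precision), which is one knob behind the trade-offs discussed here.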
The curve analysis of YOLOv8m (Figure 9):
The precision remains high (>0.9) for recall values up to approximately 0.4.
After that, the precision steadily declines as recall increases.
Indicates a moderate balance between precision and recall.
Overall performance is decent but may miss more positive cases as recall increases.
The precision curve in Figure 9 remains near 1.0 for recall values below 0.4, indicating highly confident positive predictions with low false-positive rates at higher thresholds. Beyond this range, however, precision declines sharply as recall increases, reflecting the typical trade-off where improvements in recall come at the cost of additional false positives. This relatively steep drop suggests that the model’s confidence calibration may be suboptimal at higher recall levels. While the model maintains high precision for a limited subset of positives, its generalization decreases as recall rises, potentially due to limited training data or class imbalance.
Precision remains high across a larger portion of the recall range (especially between 0.3–0.7).
Indicates improved robustness of the model in detecting aneurysms with a better balance between false positives and false negatives.
The most favorable precision–recall trade-off is observed in Figure 10. A high mAP@0.5 indicates that the model maintains a strong balance between precision and recall across thresholds, which is critical in clinical applications where both false negatives and false positives carry significant consequences. Precision remains above 0.85 over a wide range of recall values (0.3 to 0.8), and the curve exhibits a more convex shape, reflecting a robust and well-calibrated model. This suggests that the model in Figure 10 has achieved improved confidence score calibration, likely due to advanced training strategies such as focal loss, hard negative mining, or ensemble learning. Additional performance gains may also result from enhanced feature extraction, a deeper backbone network, or domain-specific pretraining.
The curve starts at high precision and low recall, indicating the model is highly confident in its top predictions but initially misses many true cases.
As the threshold lowers, recall increases and precision drops, showing that the model retrieves more true aneurysms but also includes more false positives.
The area under the curve (summarized by mAP@0.5 = 0.662) suggests moderate to strong performance, especially valuable for imbalanced medical datasets.
The mAP@0.5 score of 0.662 indicates that the model shown in Figure 11 achieves a reasonable balance between correctly identifying aneurysms and minimizing false detections. This metric is particularly useful for guiding threshold selection and for comparing models in future studies. The curve exhibits the typical trade-off in which precision decreases as recall increases, reflecting the balance between capturing true positives and avoiding false positives.
The progression shown in Figure 8, Figure 9, Figure 10 and Figure 11 indicates a consistent improvement in the models’ ability to maintain high precision while expanding recall, signaling a more reliable detection framework. A comparative analysis is presented in Table 5. By combining the YOLOv8 and YOLOv11 variants in a cascading pipeline (e.g., using YOLOv11n for an initial sweep, YOLOv8m for balanced verification, and YOLOv11m for high-confidence confirmation), one can leverage the complementary strengths of these models and tailor detection performance to evolving clinical priorities.
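Such a cascade could be sketched as follows; the model callables and the IoU-based matching rule are hypothetical placeholders standing in for the trained networks, not the authors' implementation:

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cascade(image, sweep, verify, confirm, iou_thresh=0.5):
    """Three-stage cascade: a sensitive first-pass model proposes
    candidates, and each later, more precise model keeps only the
    candidates it independently reproduces (matched by IoU).
    Each model argument is a callable returning a list of boxes."""
    candidates = sweep(image)
    def matched(preds, refs):
        return [b for b in preds
                if any(iou(b, r) >= iou_thresh for r in refs)]
    verified = matched(verify(image), candidates)
    confirmed = matched(confirm(image), verified)
    return confirmed
```

In this arrangement, a high-recall model (like YOLOv11n) would supply the candidate pool, while the high-precision model (like YOLOv11m) would act as the final filter, so sensitivity and false-alarm burden can be tuned independently.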
To further improve robustness and reliability of the evaluation, we employed 5-fold cross-validation. In this protocol, the dataset was partitioned into five equally sized folds, each serving once as the validation set while the remaining folds were used for training. Performance metrics, Precision, Recall (Sensitivity), and F-Mean, were calculated for each fold, and the mean values were reported as an aggregate measure of model performance.
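The fold construction can be sketched as follows; note that in a clinical dataset the split would typically be made at the patient level to avoid leakage between slices of the same patient, so this index-level version is a simplification:

```python
import random

def five_fold_splits(items, k=5, seed=42):
    """Partition items into k folds and yield (train, val) index lists,
    with each fold serving once as the validation set."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```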
The K-fold cross-validation table presents a detailed evaluation of all tested YOLO models across five independent folds (see Table 6). For each fold, the table reports Precision, Sensitivity (Recall), and F-Mean, allowing assessment of how consistently each model performs when trained and validated on different partitions of the dataset. This structure provides a more robust and reliable performance estimate compared to a single train-test split, as it reduces the influence of dataset variability.
Across the five folds, the YOLOv8n and YOLOv8m models demonstrate stable behavior, with YOLOv8m consistently outperforming YOLOv8n in all three metrics. Similarly, the YOLOv11n and YOLOv11m models show clear performance patterns: YOLOv11n achieves the highest average sensitivity across folds, indicating better ability to detect true positives, while YOLOv11m obtains the highest precision, reflecting fewer false detections. The reported mean values at the bottom of each model’s section aggregate the fold-level results and match the previously presented overall performance metrics.
To better understand the robustness of the evaluated models, we examined the impact of training variability associated with different random seed initializations. Controlled repeated training on selected variants revealed the sensitivity of the results to seed-dependent variability. The K-fold results confirm that the observed performance differences between models are consistent and not dependent on a particular train–test split. This reinforces the reliability of the conclusions drawn about each model’s strengths and limitations.
The results across the five training runs for each model show consistent trends but also highlight some variability. For instance, in the YOLOv8n model, precision varies slightly between 0.7583 and 0.7801, while sensitivity ranges from 0.3921 to 0.4168. This indicates that while the model reliably identifies positive samples with moderate precision, its ability to detect all true positives is somewhat affected by the random seed. Similarly, YOLOv8m exhibits slightly higher precision and sensitivity across runs, suggesting that increased model capacity may reduce susceptibility to random initialization.
YOLOv11 models display a complementary pattern. YOLOv11n shows relatively small variability across runs in both precision and sensitivity, indicating robust performance under different initializations. In contrast, YOLOv11m achieves very high precision (up to 0.9239) but exhibits more variability in sensitivity (0.4025–0.4254), reflecting the influence of random starting values on the model’s ability to detect true positives.
Examining the variability across the five runs provides important insights. Small differences between runs suggest that the model’s training process is stable and likely converges to similar solutions regardless of the random seed. Larger differences, particularly in sensitivity, indicate that certain initializations can steer the model toward slightly different local minima, affecting its detection performance. Reporting these variations alongside mean performance metrics gives a more complete picture of model reliability, rather than relying on a single training instance, which may overestimate or underestimate true performance.
The performance of the evaluated YOLO models was assessed using 5-fold cross-validation, and the results are summarized in Table 7 as mean ± standard deviation (SD) for precision, sensitivity (recall), and F-mean.
Among the models, YOLOv11m achieved the highest precision (0.9123 ± 0.0076), indicating that it produced the fewest false positives on average. However, its sensitivity was relatively low (0.4146 ± 0.0085), suggesting a moderate ability to correctly detect all positive instances. In contrast, YOLOv11n showed the highest sensitivity (0.5814 ± 0.0089) among all models, although its precision (0.6423 ± 0.0077) was lower, reflecting more false positives.
The YOLOv8 series (YOLOv8n and YOLOv8m) displayed a more balanced performance. YOLOv8m achieved a higher precision (0.8382 ± 0.0096) and moderate sensitivity (0.4923 ± 0.0087), resulting in the highest F-mean among the YOLOv8 models (0.6232 ± 0.0095), suggesting good overall detection capability. YOLOv8n showed slightly lower precision and sensitivity, with an F-mean of 0.5314 ± 0.0093, indicating moderate performance.
The results indicate a trade-off between precision and sensitivity across the models. YOLOv11m favors precision at the expense of sensitivity, YOLOv11n favors sensitivity over precision, and YOLOv8m offers a balanced performance across both metrics, reflected in its competitive F-mean. The low standard deviations across all models indicate that the performance was stable and consistent across the five cross-validation folds.
In addition to lesion-level sensitivity, we evaluated the models using per-scan (or per-study) false-positive counts, which represent the average number of incorrectly predicted aneurysms. This metric provides complementary information to sensitivity, as it reflects the clinical burden of false alarms: even models with high lesion-level detection may be less practical if they generate many false positives per scan.
Table 8 summarizes the mean ± standard deviation of false positives per scan for each model across the five cross-validation folds.
The per-scan false-positive analysis (Table 8) shows that YOLOv11m achieved the lowest average number of false positives per CTA study (0.98 ± 0.20), indicating the highest specificity among the evaluated models. The YOLOv8m model also demonstrated relatively low false positives (1.12 ± 0.18), while YOLOv8n had slightly more false alarms (1.34 ± 0.22). In contrast, YOLOv11n produced the highest number of false positives per scan (1.75 ± 0.25), reflecting a tendency to over-predict aneurysms despite its higher lesion-level sensitivity. Overall, these results highlight the trade-off between sensitivity and per-scan specificity: models with higher sensitivity may generate more false positives, whereas models optimized for fewer false alarms maintain a more practical detection burden for clinical use.
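The per-scan false-positive count can be computed by averaging unmatched predictions over scans; a sketch, assuming IoU-based matching at a 0.5 threshold:

```python
def false_positives_per_scan(predictions, ground_truths, iou_thresh=0.5):
    """Mean number of unmatched predicted boxes per scan.
    predictions / ground_truths: one list of (x1, y1, x2, y2) boxes
    per scan, in the same order."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    fp_total = 0
    for preds, gts in zip(predictions, ground_truths):
        # A prediction with no ground-truth match is a false positive.
        fp_total += sum(1 for p in preds
                        if not any(iou(p, g) >= iou_thresh for g in gts))
    return fp_total / len(predictions)
```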
The results also reveal a clear trade-off between precision and recall across the different model variants. YOLOv11m achieves the highest precision but lower sensitivity, YOLOv11n offers the greatest sensitivity, and YOLOv8m provides the most balanced performance between the two metrics. These findings underscore the importance of tailoring neural network models not only for detection accuracy but also for clinical applicability, ensuring that the balance between precision and recall aligns with specific diagnostic priorities.