Comparison of Modern Convolution and Transformer Architectures: YOLO and RT-DETR in Meniscus Diagnosis
Abstract
1. Introduction
- (1) Development and experimental application of an approach based on YOLO and RT-DETR models.
- (2) Creation of a proprietary, domain-specific dataset of clinical MRI images.
- (3) Comprehensive performance evaluation and justification for model selection.
2. Literature Review
3. Materials and Methods
3.1. General Research Design
3.2. Inclusion Criteria
3.3. Dataset Creation
3.4. Data Preprocessing
3.5. Meniscus Tear Recognition Based on YOLO Models and RT-DETR
3.5.1. Network Architecture
3.5.2. RT-DETR Architecture
3.6. Evaluation Metrics
- mAP@0.5: the average precision at a single IoU threshold, where a prediction counts as correct if the intersection over union between the predicted and ground-truth regions is at least 0.5;
- mAP@0.5–0.95: the average precision averaged over ten IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, in accordance with the official COCO evaluation protocol. A minimal sketch of both computations is given after this list.
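To make the two metrics concrete, the following Python sketch computes IoU and a simplified per-threshold average precision. It is an illustration under stated assumptions (axis-aligned boxes, a single class, all-point integration of the precision–recall curve), not the paper's evaluation code, and the sample boxes are invented.

```python
# Minimal sketch of mAP@0.5 and mAP@0.5-0.95, assuming axis-aligned boxes in
# (x1, y1, x2, y2) format and a single class. All box values are illustrative.
import numpy as np

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def average_precision(preds, gts, thr):
    """AP at one IoU threshold; preds = [(confidence, box)], gts = [box]."""
    matched, tp_flags = set(), []
    for conf, box in sorted(preds, key=lambda p: -p[0]):    # high confidence first
        best = max(range(len(gts)), key=lambda j: iou(box, gts[j]), default=None)
        hit = (best is not None and best not in matched
               and iou(box, gts[best]) >= thr)
        if hit:
            matched.add(best)                               # each GT matched once
        tp_flags.append(1.0 if hit else 0.0)
    cum_tp = np.cumsum(tp_flags)
    recall = np.concatenate(([0.0], cum_tp / max(len(gts), 1)))
    precision = np.concatenate(([1.0], cum_tp / (np.arange(len(tp_flags)) + 1)))
    return float(np.trapz(precision, recall))               # area under the PR curve

preds = [(0.95, (10, 10, 50, 50)), (0.60, (100, 100, 140, 150))]
gts = [(12, 11, 49, 52), (101, 99, 142, 148)]
map50 = average_precision(preds, gts, 0.5)
map50_95 = np.mean([average_precision(preds, gts, t)
                    for t in np.arange(0.5, 1.0, 0.05)])    # ten IoU thresholds
print(f"mAP@0.5 = {map50:.3f}, mAP@0.5-0.95 = {map50_95:.3f}")
```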
4. Results
4.1. Experimental Environment and Hyperparameters
4.2. Experiment
5. Discussion
5.1. Analysis of Misclassification and False Detection
- Heterogeneity of image quality: differences in scanning characteristics (magnetic field strength, matrix size, slice parameters, signal gain settings, etc.) led to variations in the visual representation of the meniscus, which complicated uniform feature extraction.
- Use of different imaging modes (PD, T1, T2): the study included MRI images acquired in PD-, T1-, and T2-weighted modes. Each mode has distinct contrast characteristics and renders tissue structures differently, so signs of a tear could appear with varying degrees of severity depending on the acquisition mode. This placed an additional burden on the model and reduced the stability of classification across cases acquired under different imaging settings.
- Subtle or partial manifestations of pathology: In some cases, signs of meniscal tears were only faintly or partially expressed, which posed challenges for automated recognition and increased the likelihood of misclassification.
- Presence of artefacts and noise: Mechanical and software-induced artefacts, shadows, signal inhomogeneities, and occlusions by adjacent bony structures were present in some images. These factors elevated the risk of false positive detections.
- Anatomical variability and multiscale complexity: Substantial inter-patient variation in the size, shape, and positioning of the meniscus introduced additional complexity in generalization. This was particularly challenging given the limited number of training samples representing such diverse anatomical configurations.
5.2. Models Comparison in Terms of Detection Efficiency and Processing Speed
5.3. Performance Comparison of YOLOv8-x and RT-DETR-l Models
- Comparison with ground-truth annotations: the model’s predictions were compared with the ground-truth bounding boxes. A prediction was considered correct (true positive, TP) if its IoU with a ground-truth box exceeded a predefined threshold (IoU = 0.7) and the predicted class matched the ground-truth class.
- Filtering by confidence thresholds: among the TP predictions, those with confidence values exceeding specified thresholds (e.g., 0.80, 0.85, 0.90) were selected. For each threshold, the proportion of reliable predictions was calculated; the corresponding results are presented in Table 7. A sketch of this two-step procedure is shown after the list.
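The following Python sketch illustrates the procedure under stated assumptions: the boxes, classes, and confidence values are invented, and iou() is the helper defined in the Section 3.6 sketch. It is not the paper's analysis code.

```python
# Hedged sketch of the two-step reliability analysis: label predictions as TP
# via class match plus IoU >= 0.7, then filter TPs by confidence threshold.
# All values below are illustrative; iou() comes from the earlier sketch.
predictions = [          # (confidence, class_id, box)
    (0.93, 1, (30, 40, 90, 110)),
    (0.87, 1, (200, 60, 260, 130)),
    (0.82, 0, (35, 45, 92, 108)),
]
ground_truth = [         # (class_id, box)
    (1, (32, 42, 91, 112)),
    (1, (198, 58, 258, 128)),
]

tp_confidences = [
    conf for conf, cls, box in predictions
    if any(cls == gt_cls and iou(box, gt_box) >= 0.7
           for gt_cls, gt_box in ground_truth)
]

for thr in (0.80, 0.85, 0.90):
    kept = sum(c >= thr for c in tp_confidences)
    print(f"confidence >= {thr:.2f}: {kept}/{len(tp_confidences)} reliable TPs")
```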
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MRI | Magnetic Resonance Imaging |
| CNN | Convolutional Neural Network |
| YOLO | You Only Look Once |
| DETR | Detection Transformer |
| RT-DETR | Real-Time Detection Transformer |
| DICOM | Digital Imaging and Communications in Medicine |
| PNG | Portable Network Graphics |
| MSE | Mean Square Error |
| PSNR | Peak Signal-to-Noise Ratio |
| SSIM | Structural Similarity |
| NMS-free | Non-Maximum Suppression-free |
| AIFI | Attention-Enhanced Intra-Scale Feature Interaction |
| CCFM | Convolution-Driven Cross-Scale Feature Fusion |
| mAP | mean Average Precision |
| IoU | Intersection over Union |
| TP | True Positive |
| FP | False Positive |
| FN | False Negative |
| TN | True Negative |
| PACS | Picture Archiving and Communication System |
References
1. Hoover, K.B.; Vossen, J.A.; Hayes, C.W.; Riddle, D.L. Reliability of meniscus tear description: A study using MRI from the Osteoarthritis Initiative. Rheumatol. Int. 2020, 40, 635–641.
2. Grasso, D.; Gnesutta, A.; Calvi, M.; Duvia, M.; Atria, M.G.; Celentano, A.; Callegari, L.; Genovese, E.A. MRI evaluation of meniscal anatomy: Which parameters reach the best inter-observer concordance? Radiol. Med. 2022, 127, 991–997.
3. Bien, N.; Rajpurkar, P.; Ball, R.L.; Irvin, J.; Park, A.; Jones, E.; Bereket, M.; Patel, B.N.; Yeom, K.W.; Shpanskaya, K.; et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med. 2018, 15, e1002699.
4. Güngör, E.; Vehbi, H.; Cansın, A.; Ertan, M.B. Achieving High Accuracy in Meniscus Tear Detection Using Advanced Deep Learning Models with a Relatively Small Data Set. Knee Surg. Sports Traumatol. Arthrosc. 2025, 33, 450–456.
5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020.
6. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa, HI, USA, 3–8 January 2022.
7. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
8. Liu, F.; Zhou, Z.; Jang, H.; Samsonov, A.; Zhao, G.; Kijowski, R.; Li, F. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magn. Reson. Med. 2018, 79, 2379–2391.
9. Couteaux, V.; Si-Mohamed, S.; Nempont, O.; Lefevre, T.; Popoff, A.; Pizaine, G.; Villain, N.; Bloch, I.; Cotton, A.; Boussel, L. Automatic knee meniscus tear detection and orientation classification with Mask-RCNN. Diagn. Interv. Imaging 2019, 100, 235–242.
10. Kuczyński, N.; Boś, J.; Białoskórska, K.; Aleksandrowicz, Z.; Turoń, B.; Zabrzyńska, M.; Bonowicz, K.; Gagat, M. The Meniscus: Basic Science and Therapeutic Approaches. J. Clin. Med. 2025, 14, 2020.
11. Parkar, A.P.; Adriaensen, M.E.A.P.M. ESR Essentials: MRI of the Knee—Practice Recommendations by ESSR. Eur. Radiol. 2024, 34, 6590–6599.
12. Smirnov, V.V.; Savvova, M.V.; Smirnov, V.V. Magnetic Resonance Imaging in the Diagnosis of Joint Diseases; Artifex Publishing House: Obninsk, Russia, 2022; p. 170. (In Russian)
13. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912.
14. Ultralytics. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 June 2025).
15. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
16. Hidayatullah, P.; Tubagus, R. YOLOv9 Architecture Explained. Stunning Vision AI. Available online: https://article.stunningvisionai.com/yolov9-architecture (accessed on 1 June 2025).
17. Wang, Y.; Li, K.; Zhang, Y.; Han, J.; Wang, C. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
18. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
19. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. Available online: https://github.com/sunsmarterjie/yolov12 (accessed on 1 June 2025).
20. Hidayatullah, P.; Syakrani, N.; Sholahuddin, M.R.; Gelar, T.; Tubagus, R. YOLOv8 to YOLO11: A Comprehensive Architecture In-Depth Comparative Review. arXiv 2025, arXiv:2501.13400.
21. Glenn, J. Shortcut in Backbone and Neck. Issue #1200, ultralytics/ultralytics. Available online: https://github.com/ultralytics/ultralytics/issues/1200#issuecomment-1454873251 (accessed on 15 June 2025).
22. Glenn, J. Understanding SPP and SPPF Implementation. Issue #8785, ultralytics/yolov5. Available online: https://github.com/ultralytics/yolov5/issues/8785 (accessed on 15 June 2025).
23. Hu, J.; Zheng, J.; Wan, W.; Zhou, Y.; Huang, Z. RT-DETR-EVD: An Emergency Vehicle Detection Method Based on Improved RT-DETR. Sensors 2025, 25, 3327.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
25. Chen, J.; Lei, B.; Song, Q.; Ying, H.; Chen, D.Z.; Wu, J. A Hierarchical Graph Network for 3D Object Detection on Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 392–401.
26. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660.
27. Tian, J.; Jin, Q.; Wang, Y.; Yang, J.; Zhang, S. Performance analysis of deep learning-based object detection algorithms on COCO benchmark: A comparative study. J. Eng. Appl. Sci. 2024, 71, 76.
28. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279.
29. Zhao, B.; Chang, L.; Liu, Z. Fast-YOLO Network Model for X-Ray Image Detection of Pneumonia. Electronics 2025, 14, 903.
30. Mercaldo, F.; Brunese, L.; Martinelli, F.; Santone, A.; Cesarelli, M. Object Detection for Brain Cancer Detection and Localization. Appl. Sci. 2023, 13, 9158.
31. Wang, Q.; Yan, N.; Qin, Y.; Zhang, X.; Li, X. BED-YOLO: An Enhanced YOLOV10N-Based Tomato Leaf Disease Detection Algorithm. Sensors 2025, 25, 2882.
32. Roblot, V.; Giret, Y.; Antoun, M.B.; Morillot, C.; Chassin, X.; Cotten, A.; Zerbib, J.; Fournier, L. Artificial Intelligence to Diagnose Meniscus Tears on MRI. Diagn. Interv. Imaging 2019, 100, 243–249.
33. Shin, H.; Choi, G.S.; Shon, O.-J.; Kim, G.B.; Chang, M.C. Development of Convolutional Neural Network Model for Diagnosing Meniscus Tear Using Magnetic Resonance Image. BMC Musculoskelet. Disord. 2022, 23, 510.
34. Rizk, B.; Brat, H.; Zille, P.; Guillin, R.; Pouchy, C.; Adam, C.; Ardon, R.; D’Assignies, G. Meniscal Lesion Detection and Characterization in Adult Knee MRI: A Deep Learning Model Approach with External Validation. Phys. Medica 2021, 83, 64–71.
35. Li, J.; Qian, K.; Liu, J.; Huang, Z.; Zhang, Y.; Zhao, G.; Wang, H.; Li, M.; Liang, X.; Zhou, F.; et al. Identification and Diagnosis of Meniscus Tear by Magnetic Resonance Imaging Using a Deep Learning Model. J. Orthop. Transl. 2022, 34, 91–101.
36. Botnari, A.; Kadar, M.; Patrascu, J.M. A Comprehensive Evaluation of Deep Learning Models on Knee MRIs for the Diagnosis and Classification of Meniscal Tears: A Systematic Review and Meta-Analysis. Diagnostics 2024, 14, 1090.
37. He, L.-H.; Zhou, Y.-Z.; Liu, L.; Cao, W.; Ma, J.-H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032.
Dataset composition by class and visualization mode (PD, T1, T2):

| Classes | PD | T1 | T2 | Images | Objects |
|---|---|---|---|---|---|
| Normal | 556 | 202 | 242 | 1000 | 5992 |
| Tear | 682 | 134 | 184 | 1000 | 1998 |
| Method | MSE | PSNR | SSIM |
|---|---|---|---|
| Combined method | 32.55 | 41.37 | 0.92 |
| Gaussian Blur | 39.47 | 39.36 | 0.90 |
| Laplacian Filter | 39.80 | 38.21 | 0.90 |
| Bilateral Filter | 35.45 | 39.76 | 0.86 |
| Non-Local Means Denoising (NLM) | 35.69 | 39.57 | 0.86 |
| Sharpening (Unsharp Mask) | 47.10 | 35.80 | 0.85 |
| Median Blur | 42.20 | 32.26 | 0.82 |
| CLAHE | 132.33 | 24.47 | 0.74 |
| Sobel Filter | 190.83 | 21.22 | 0.68 |
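As an illustration of how the MSE, PSNR, and SSIM values in this table can be obtained, the following Python sketch scores one candidate filter against the original image using OpenCV and scikit-image. The file name and the Gaussian-blur parameters are placeholders, not the study's actual preprocessing pipeline.

```python
# Hedged sketch: compute the table's image-quality metrics for one filter.
# "slice.png" and the blur settings are hypothetical examples.
import cv2
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

original = cv2.imread("slice.png", cv2.IMREAD_GRAYSCALE)   # hypothetical MRI slice
filtered = cv2.GaussianBlur(original, (5, 5), sigmaX=1.0)  # one candidate filter

mse = mean_squared_error(original, filtered)               # lower is better
psnr = peak_signal_noise_ratio(original, filtered, data_range=255)  # higher is better
ssim = structural_similarity(original, filtered, data_range=255)    # closer to 1 is better
print(f"MSE = {mse:.2f}, PSNR = {psnr:.2f} dB, SSIM = {ssim:.2f}")
```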
| Family | Backbone | Neck | Activation | Loss | Models |
|---|---|---|---|---|---|
| YOLOv5 [13,14] | CSPDarknet53 (Focus) | PANet + SPPF | LeakyReLU | BCE + CIoU | YOLOv5-nu, YOLOv5-su, YOLOv5-mu, YOLOv5-lu, YOLOv5-xu |
| YOLOv8 [15] | C2f + CBS | PANet + SPPF | SiLU | BCE + DFL (v2) | YOLOv8-n, YOLOv8-s, YOLOv8-m, YOLOv8-l, YOLOv8-x |
| YOLOv9 [16] | ELAN-V2 + DFL v3 | BiFPN or PAN++ | SiLU | vFL (v3) + improved DFL | YOLOv9-t, YOLOv9-s, YOLOv9-m, YOLOv9-c, YOLOv9-e |
| YOLOv10 [17] | Improved C2f / Transformer | RT-DETR-like neck | GELU | DFL + Adaptive Matching | YOLOv10-n, YOLOv10-s, YOLOv10-m, YOLOv10-l, YOLOv10-x |
| YOLOv11 [18] | RTMDet-style backbone | CBAM + PAN | SiLU / GELU | Varifocal Loss + DFL | YOLOv11-n, YOLOv11-s, YOLOv11-m, YOLOv11-l, YOLOv11-x |
| YOLOv12 [19] | R-ELAN | Multi-scale fusion: Upsample + Concat, A2C2F + C3k2 + Area Attention | SiLU | Hybrid: DFL + GIoU | YOLOv12-n, YOLOv12-s, YOLOv12-m, YOLOv12-l, YOLOv12-x |
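All of the model families above are exposed through the Ultralytics Python API, so the experiments can be reproduced in outline with a few lines of code. The sketch below is a minimal illustration, not the paper's training setup: the dataset YAML name, image size, and epoch count are placeholders.

```python
# Minimal sketch: train and validate one YOLO variant and one RT-DETR variant
# with the Ultralytics API. "meniscus.yaml", imgsz, and epochs are placeholders.
from ultralytics import YOLO, RTDETR

for Model, weights in ((YOLO, "yolov8x.pt"), (RTDETR, "rtdetr-l.pt")):
    model = Model(weights)                   # load pretrained weights
    model.train(data="meniscus.yaml",        # dataset config (paths, class names)
                epochs=100, imgsz=640)
    metrics = model.val()                    # precision/recall/mAP on the val split
    print(weights, metrics.box.map50, metrics.box.map)  # mAP@0.5, mAP@0.5-0.95
```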
| Category | Training Set | Testing Set | Validation Set | Total |
|---|---|---|---|---|
| Images (normal/tear) | 2800 | 600 | 600 | 4000 |
| Architecture | Version | Precision | Recall | mAP@0.5 | mAP@0.5–0.95 |
|---|---|---|---|---|---|
| YOLOv5 | n | 0.964 | 0.945 | 0.977 | 0.587 |
| | s | 0.956 | 0.939 | 0.964 | 0.583 |
| | m | 0.956 | 0.948 | 0.972 | 0.600 |
| | l | 0.965 | 0.948 | 0.975 | 0.605 |
| | x | 0.958 | 0.941 | 0.975 | 0.604 |
| YOLOv8 | n | 0.973 | 0.947 | 0.978 | 0.594 |
| | s | 0.965 | 0.951 | 0.977 | 0.600 |
| | m | 0.968 | 0.950 | 0.974 | 0.601 |
| | l | 0.945 | 0.953 | 0.970 | 0.612 |
| | x | 0.958 | 0.961 | 0.975 | 0.616 |
| YOLOv9 | t | 0.961 | 0.947 | 0.975 | 0.589 |
| | s | 0.968 | 0.953 | 0.975 | 0.604 |
| | m | 0.959 | 0.955 | 0.974 | 0.601 |
| | c | 0.960 | 0.942 | 0.971 | 0.601 |
| | e | 0.966 | 0.962 | 0.976 | 0.605 |
| YOLOv10 | n | 0.948 | 0.941 | 0.964 | 0.571 |
| | s | 0.954 | 0.950 | 0.974 | 0.595 |
| | m | 0.953 | 0.950 | 0.978 | 0.582 |
| | l | 0.950 | 0.946 | 0.969 | 0.600 |
| | x | 0.965 | 0.931 | 0.972 | 0.612 |
| YOLOv11 | n | 0.960 | 0.937 | 0.974 | 0.596 |
| | s | 0.959 | 0.954 | 0.977 | 0.590 |
| | m | 0.949 | 0.963 | 0.977 | 0.587 |
| | l | 0.974 | 0.948 | 0.975 | 0.597 |
| | x | 0.962 | 0.942 | 0.978 | 0.606 |
| YOLOv12 | n | 0.957 | 0.946 | 0.973 | 0.592 |
| | s | 0.960 | 0.951 | 0.978 | 0.590 |
| | m | 0.970 | 0.945 | 0.979 | 0.584 |
| | l | 0.955 | 0.956 | 0.972 | 0.591 |
| | x | 0.956 | 0.956 | 0.973 | 0.595 |
| RT-DETR | l | 0.919 | 0.952 | 0.929 | 0.531 |
| | x | 0.898 | 0.889 | 0.906 | 0.434 |
| No. | Purpose of the Study | Input Data | Method | Performance Metrics |
|---|---|---|---|---|
| Our research | Application of YOLO- and RT-DETR-family models to the recognition of meniscus tears | MRI scans of the knee joint (1000 normal, 1000 tear) | YOLOv5, YOLOv8–YOLOv12 models with all available submodels (n, s, m, l, x) | mAP@0.5–0.95 = 0.616; ACC = 95.8%; TPR = 96.1% |
| [32] Roblot et al. | Creation and evaluation of an algorithm for detecting and characterizing the presence of a meniscus tear | 1123 MRI images of the knee | Convolutional neural network; fast region-based CNN (Fast R-CNN) | AUC = 92% for locating the two meniscal horns; AUC = 94% for the presence of a meniscus tear; AUC = 83% for determining tear orientation |
| [33] Shin et al. | Detection of meniscus tears and classification of tear types using MRI images | MRI images of the knee joint (1048 cases) | AlexNet | AUC = 88.9% for medial meniscus tear; AUC = 81.7% for medial and lateral meniscus tears; AUC = 92.4% for lateral meniscus tear |
| [34] Rizk et al. | Evaluation of a deep learning approach for meniscus tear detection and its characterization | 11,353 MRI examinations of the knee joint | Convolutional neural network (CNN) | AUC = 93%; TPR = 82%; TNR = 95% |
| [35] Li et al. | Diagnosis of a knee meniscus tear | Standard MRI images of the knee from 924 patients | Mask R-CNN with a ResNet50 backbone | AP = 68–80%; TPR = 74–95% |
| [36] Botnari et al. | Systematic review of DL models for knee MRI | More than 20 studies on automatic detection of meniscus tears | Overview of CNN, ResNet, DenseNet, and other models | ACC = 77–100%; TPR = 56.9–71.1%; TNR = 67–93% |
| Confidence ≥ | Test Samples | Reliable TPs (%) |
|---|---|---|
| 0.80 | 79.0% | 95.67% |
| 0.85 | 77.1% | 93.05% |
| 0.90 | 63.8% | 77.04% |