Article

Comparison of Modern Convolution and Transformer Architectures: YOLO and RT-DETR in Meniscus Diagnosis

by Aizhan Tlebaldinova 1, Zbigniew Omiotek 2, Markhaba Karmenova 3,*, Saule Kumargazhanova 1, Saule Smailova 1, Akerke Tankibayeva 1, Akbota Kumarkanova 1 and Ivan Glinskiy 1
1 School of Digital Technology and Artificial Intelligence, D. Serikbayev East Kazakhstan Technical University, Ust-Kamenogorsk 070004, Kazakhstan
2 Department of Electronics and Information Technology, Lublin University of Technology, 20-618 Lublin, Poland
3 Department of Computer Modeling and Information Technologies, S. Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070002, Kazakhstan
* Author to whom correspondence should be addressed.
Computers 2025, 14(8), 333; https://doi.org/10.3390/computers14080333
Submission received: 21 July 2025 / Revised: 11 August 2025 / Accepted: 14 August 2025 / Published: 17 August 2025

Abstract

The aim of this study is a comparative evaluation of the effectiveness of YOLO and RT-DETR family models for the automatic recognition and localization of meniscus tears in knee joint MRI images. The experiments were conducted on a proprietary annotated dataset consisting of 2000 images collected from 2242 patients across various clinics. Based on key performance metrics, the most effective representatives from each family, YOLOv8-x and RT-DETR-l, were selected. Comparative analysis based on training, validation, and testing results showed that YOLOv8-x delivered more stable and accurate outcomes than RT-DETR-l. The YOLOv8-x model achieved high values across key metrics: precision—0.958; recall—0.961; F1-score—0.960; mAP@50—0.975; and mAP@50–95—0.616. These results demonstrate the potential of modern object detection models for clinical application, providing accurate, interpretable, and reproducible diagnosis of meniscal injuries.

1. Introduction

The diagnosis of knee meniscus injuries using magnetic resonance imaging (MRI) remains a challenging task, particularly in cases where tear patterns exhibit subtle features resembling normal anatomical structures. Although MRI provides high diagnostic value, the interpretation of these images requires significant expertise and time, with considerable inter-observer variability reported in the literature [1,2]. These limitations have led to increasing interest in the application of deep learning algorithms to automate the analysis of MRI data.
The most promising deep learning architectures are those capable not only of detecting the presence of lesions but also of accurately localizing them within medical images. Convolutional neural network (CNN)-based models, such as YOLO (You Only Look Once), have demonstrated high effectiveness in the diagnosis of orthopedic pathologies, including meniscus tears [3,4]. In parallel, more recent transformer-based architectures, particularly those from the DETR (Detection Transformer) family, have shown considerable potential in processing complex and visually heterogeneous medical images [5,6,7]. These two paradigms differ fundamentally in their feature extraction strategies, which may impact model accuracy and robustness in the presence of clinical noise and subtle pathological changes.
Existing research has predominantly focused either on image classification using CNNs [1,2,3] or on the experimental application of transformer-based models in medical image segmentation tasks [4,5,6]. However, a direct comparison between YOLO and RT-DETR (Real-Time Detection Transformer) architectures for the detection of meniscal tears using real-world clinical MRI data is currently lacking in the literature. Furthermore, many prior studies rely on relatively small datasets constrained by scanner type or institutional origin, thereby limiting the generalizability of their findings [2,3,7].
The aim of this study is to perform a comparative evaluation of the performance of YOLO and RT-DETR model families for the automatic detection and localization of meniscus tears in knee MRI scans.
The main contributions of this study are as follows:
(1)
Development and experimental application of an approach based on YOLO and RT-DETR models.
For the first time, a systematic comparison of the YOLO model family and the transformer-based RT-DETR architecture has been conducted for the task of automatic detection and localization of meniscus tears in knee MRI images. Special attention is given to the robustness of the models to clinical visual variability, including faint features and background noise.
(2)
Creation of a proprietary, domain-specific dataset of clinical MRI images.
A custom annotated dataset was developed, comprising 2000 images from 2242 patients collected across various medical centers. The dataset includes scans obtained from different MRI machines, using multiple imaging sequences (PD, T1, T2) and anatomical planes (sagittal and coronal). This diversity reflects real-world clinical conditions and enhances the generalizability of the models.
(3)
Comprehensive performance evaluation and justification for model selection.
A series of experiments was conducted using key evaluation metrics (precision, recall, F1-score, mAP@50, and mAP@50–95), along with confidence curve analysis and visual assessment of both correct and incorrect predictions. Based on the aggregated results, YOLOv8-x was identified as the most reliable and robust model, underlining its potential for integration into practical computer-aided diagnosis systems.
The structure of the article is organized as follows: Section 2 provides a review of related work in the field. Section 3 details the proposed methodology and experimental approach. Section 4 presents the performance metrics and comparative analysis results. Section 5 discusses the limitations of the proposed approach and outlines possible directions for future improvement. Finally, Section 6 summarizes the key findings and concludes the study.

2. Literature Review

The application of deep learning to the analysis of MRI images of the knee, particularly for the detection of meniscus lesions, has been actively developed. Initially, methods based on CNNs were dominant, aiming to classify the presence of pathology without localization. For example, ResNet and EfficientNet models were used, which demonstrated high accuracy rates in binary and three-class classification tasks [3,8,9]. However, these approaches often failed to accurately determine the location of the lesion and did not provide interpretable results.
With the advent of object detection models in medical imaging, solutions for both classification and localization have emerged. In particular, models from the YOLO family, including YOLOv3 and YOLOv4, have been successfully applied to the automatic detection of meniscus tears, providing more accurate mapping to the anatomical structure [4]. However, many of these studies used limited datasets and rarely compared different architectures under the same conditions.
More recent transformer models, including DETR and its derivatives, have found application in medical image analysis tasks, especially where sparse or weakly expressed structures must be detected [5,6,7]. In such works, transformers show potential in processing complex visual patterns and heterogeneous data. However, their use in the context of meniscus injury diagnosis remains limited, and comparisons with CNN models on the same dataset are rare.
Performing a direct comparison of modern model families, such as YOLO and RT-DETR, in the context of real clinical MR imaging remains a pressing research challenge. The present study fills this gap by offering a quantitative and qualitative comparison of the models on a single dataset annotated by experts and reflecting the real clinical variability of MRI images.

3. Materials and Methods

3.1. General Research Design

Two modern families of object detection models were employed in this study: YOLO (versions YOLOv5 and YOLOv8–YOLOv12) and RT-DETR, both capable of simultaneous localization and classification of pathological changes. The model architectures were adapted for the task of medical image analysis, with input data consisting of a series of 2D MRI slices in 640 × 640 × 3 format. The overall methodology for automatic detection of meniscus tears is illustrated in Figure 1.
The proposed methodology comprises several sequential stages. At the initial stage, images were extracted from the source in DICOM (Digital Imaging and Communications in Medicine) format. These images were then automatically sorted based on imaging modalities, anatomical projections, and pulse sequences, followed by export to a two-dimensional format with the PNG (portable network graphics) extension. To enhance visual quality, a preprocessing step was applied using a combined method that included Non-Local Means Denoising and Unsharp Mask techniques. Manual image annotation was performed using the Label Studio platform. Subsequently, data augmentation techniques were employed to expand the size of the training dataset. During the training phase, models from the YOLO and RT-DETR families were utilized, including YOLOv5, YOLOv8–YOLOv12, RT-DETR-l, and RT-DETR-x. Model performance was assessed using standard evaluation metrics, such as precision, recall, F1-score, and mAP, based on which the most effective model was selected. In the final stage, the trained model was deployed for automatic detection of meniscus tears in new MRI images.
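For illustration, a minimal sketch of the DICOM sorting and PNG export step is given below. It assumes the pydicom and OpenCV libraries; the directory layout, the use of the SeriesDescription tag for sorting, and the intensity normalization are assumptions made for this example rather than the exact procedure used in the study (which relied on automatic sorting and the Weasis software).
```python
# Minimal sketch of the DICOM sorting and PNG export step (assumptions: pydicom and
# OpenCV are available; sorting by the SeriesDescription tag; min-max normalization).
import os
import cv2
import numpy as np
import pydicom

def export_slice(dicom_path: str, out_dir: str) -> str:
    """Read one DICOM slice, sort it by series description, and save it as an 8-bit PNG."""
    ds = pydicom.dcmread(dicom_path)
    # The series description typically encodes the sequence (PD/T1/T2) and plane (sag/cor).
    series = str(getattr(ds, "SeriesDescription", "unknown")).replace(" ", "_")
    pixels = ds.pixel_array.astype(np.float32)
    # Normalize the raw signal range to 8 bits for PNG export.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels = pixels / pixels.max() * 255.0
    target_dir = os.path.join(out_dir, series)
    os.makedirs(target_dir, exist_ok=True)
    out_path = os.path.join(target_dir, os.path.basename(dicom_path) + ".png")
    cv2.imwrite(out_path, pixels.astype(np.uint8))
    return out_path
```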

3.2. Inclusion Criteria

The original dataset was acquired using different MRI systems with varying technical specifications, including magnetic field strength, slice thickness, and manufacturer. The scans were performed with diverse scanning protocols and acquisition parameters. From the complete series of images, only those slices in which the menisci were clearly visualized were selected. In accordance with clinical diagnostic guidelines, the menisci were classified as either intact (healthy) or exhibiting signs of tear. The classification of torn menisci was based on the presence of an abnormal intrameniscal hyperintense signal extending to at least one articular surface. Additional indicators included meniscal deformation, disrupted contours, or displacement of a fragment [10]. The inclusion criteria for image analysis required clear MRI evidence of a meniscus tear confined to a single compartment of the knee joint (either medial or lateral), with no concomitant tear in the opposite meniscus.

3.3. Dataset Creation

MRI data of the knee joint were collected from patients across all regions of the country between 2022 and 2024 at the Department of Arthroscopy and Sports Trauma of the National Scientific Centre of Traumatology and Orthopaedics named after Academician N.D. Batpenov in Astana. The initial dataset included MRI scans from 2242 patients in DICOM format. The patients’ ages ranged from 11 to 75 years.
The MRI scans were acquired during routine diagnostic examinations conducted at 84 MRI centers using 1.5 T or 3.0 T MRI systems. Imaging was performed on scanners from various manufacturers and with different scanning protocol settings, resulting in variations in image quality, resolution, contrast, and dimensions. Each MRI study consisted of a series of consecutive tomographic slices obtained in different anatomical planes.
According to the scientific literature and medical textbooks [11,12], the most informative MRI parameters for diagnosing knee joint injuries include:
  • Imaging sequences: T1-weighted, T2-weighted, and proton density (PD) [11];
  • Pulse sequences: Spin-Echo (SE), Turbo Spin-Echo (TSE), Fast Spin-Echo (FSE), and Fat-Saturated sequences (FatSat, FSat, FS) [12];
  • Imaging planes: coronal and sagittal [11].
Coronal (892 images, 44.6%) and sagittal (1108 images, 55.4%) T1-, T2-, and PD-weighted MRI images acquired using SE, TSE, FSE, FSat, FatSat, and FS pulse sequences were selected as representative samples for training the object detection model. Using the Weasis software, knee joint MRI scans in DICOM format were exported to PNG format. The total number of images was 129,343, including 66,328 PD-weighted, 38,421 T1-weighted, and 24,594 T2-weighted images. However, not all images were informative for the study. For the purpose of this research, only those slices in which the menisci were clearly visualized were selected from the full MRI series of each patient, as the menisci represent the primary anatomical structure of interest.
The initial dataset included MRI scans with various types of meniscal tears corresponding to clinical classifications: vertical, longitudinal, horizontal, radial, oblique, complex (combined), and bucket-handle tears [12]. These types of meniscus injuries are well described in the literature and are considered key for accurate diagnosis.
Analysis of the tear type distribution in the dataset revealed a predominance of horizontal meniscal tears, which represented the majority of cases. Therefore, this study focused on the horizontal tear type as the target pathology, which enabled optimization of model training and improved the statistical reliability of the results.
During the data preparation phase, all images were manually annotated and categorized into two classes: 0—images without a meniscal tear; 1—images with a confirmed meniscal tear. Representative examples from the dataset are shown in Figure 2.
The data labelling was performed in a stepwise manner. A total of 4262 images were initially selected for labelling. The images selected during the labelling phase were reviewed by an expert. Following verification, the final number of labelled MRI images was 2000, with a total of 7990 annotated objects. Table 1 presents the distribution of images and objects by class.
To train the neural network on the task of automatic detection of meniscus tears, the dataset (2000 original and 2000 augmented images) was randomly split into three parts (Table 2): 70% for training (2800 images) and 15% each for validation and testing (600 images each).
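As a simple illustration of this split, a random 70/15/15 partition of the combined list of image paths could be performed as follows; the function name and random seed are placeholders.
```python
# Illustrative 70/15/15 random split of the combined original + augmented image list
# (4000 items in this study); the seed is an arbitrary placeholder.
import random

def split_dataset(image_paths, seed=42):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test
```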
All MRI data were collected in compliance with institutional ethical standards. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the National Scientific Centre of Traumatology and Orthopaedics named after Academician N.D. Batpenov (Project identification code: AP23486396; Protocol No. 2/2, dated 27 May 2024).

3.4. Data Preprocessing

After the data labelling stage was completed, image preprocessing was performed. Preprocessing is one of the key steps in MRI image analysis, aimed at improving the quality of visual data, reducing the influence of artefacts, and ensuring the stability of subsequent analysis.
The preprocessing stage, shown in Figure 3, involved the investigation and testing of contrast enhancement (CLAHE, Equalize Hist), noise reduction (Gaussian Blur, Median Blur, Bilateral Filter, Non-Local Means Denoising), and sharpening (Unsharp Mask, Laplacian Filter, Sobel Filter) techniques. The MSE (mean squared error), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity) metrics were used for performance evaluation.
For a visual assessment of the effectiveness of each filter, Figure 4 presents the image preprocessing results obtained using the tested algorithms. The displayed MRI image fragments allow a visual comparison of the degree of noise suppression and the preservation of anatomical details for each method.
A quantitative assessment of effectiveness is presented in Table 3, which reports the values of the MSE, PSNR, and SSIM metrics.
Based on the results of the comparative analysis, the combined approach of Non-Local Means Denoising and Unsharp Mask was selected, as it demonstrated the best quantitative performance (MSE = 32.55; PSNR = 41.37; SSIM = 0.92), indicating its high effectiveness in MRI image preprocessing. In addition, this method produced the most visually satisfactory results, achieving optimal noise suppression while preserving anatomical boundaries (Figure 5).
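A minimal sketch of the selected preprocessing combination and of the quality metrics from Table 3 is shown below, assuming OpenCV and scikit-image; the filter parameters (h, Gaussian sigma, sharpening weights) are illustrative assumptions, not the exact values used in the study.
```python
# Sketch of the selected preprocessing combination (Non-Local Means Denoising followed by
# an Unsharp Mask) and the quality metrics reported in Table 3. The filter parameters
# below are illustrative assumptions, not the exact values used in the study.
import cv2
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def preprocess(gray: np.ndarray) -> np.ndarray:
    """Denoise a grayscale MRI slice and then sharpen anatomical boundaries."""
    denoised = cv2.fastNlMeansDenoising(gray, h=10, templateWindowSize=7, searchWindowSize=21)
    blurred = cv2.GaussianBlur(denoised, (0, 0), sigmaX=3)
    # Unsharp mask: amplify the high-frequency residual (original minus blurred copy).
    return cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)

def quality_metrics(original: np.ndarray, processed: np.ndarray) -> dict:
    """MSE, PSNR, and SSIM between the original and preprocessed slice."""
    return {
        "MSE": mean_squared_error(original, processed),
        "PSNR": peak_signal_noise_ratio(original, processed),
        "SSIM": structural_similarity(original, processed),
    }
```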
It should be noted that the proposed approach was tested only on the dataset employed in this study. Its applicability to external datasets requires further validation.
Spatial image augmentation was applied to enhance the generalization capability of the model and to address the limitations of the training dataset. Information-preserving techniques were used to maintain anatomical structures and contextual features, including Safe Rotate, ShiftScaleRotate, Random Sized BBox Safe Crop, and Perspective transformations.
A total of 500 images from each class were selected and augmented using the four aforementioned methods, resulting in 2000 augmented images. Examples of the applied spatial transformations are presented in Figure 6.
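The following sketch illustrates how such bounding-box-safe transformations could be configured with the Albumentations library; the probabilities, limits, and crop size are assumptions for illustration.
```python
# Sketch of the bounding-box-safe spatial augmentations using the Albumentations library;
# probabilities, limits, and crop size are illustrative assumptions.
import albumentations as A

transforms = [
    A.SafeRotate(limit=15, p=1.0),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=10, p=1.0),
    A.RandomSizedBBoxSafeCrop(height=640, width=640, p=1.0),
    A.Perspective(scale=(0.02, 0.05), p=1.0),
]

def augment(image, bboxes, class_labels, transform):
    """Apply one spatial transform while keeping YOLO-format boxes valid."""
    pipeline = A.Compose(
        [transform],
        bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
    )
    out = pipeline(image=image, bboxes=bboxes, class_labels=class_labels)
    return out["image"], out["bboxes"], out["class_labels"]
```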

3.5. Meniscus Tear Recognition Based on YOLO Models and RT-DETR

Deep learning methods have recently demonstrated high efficiency in tasks such as segmentation, classification, and object detection. The integration of these models into computer-aided diagnostic systems enables the automatic detection and recognition of pathological meniscal changes on MRI scans. Among object detection algorithms, the YOLO architecture stands out as one of the most efficient and accurate solutions for such tasks.
The rapid evolution of the YOLO model family reflects continuous architectural improvements that have led to significant gains in accuracy and computational performance. The early YOLO versions were based on relatively simple convolutional networks with a limited number of anchor boxes. As the architecture evolved, newer versions incorporated deeper networks with residual layers, multi-scale learning strategies, and anchor-free approaches.
Recent YOLO models, such as YOLOv8 and above, are characterized by the introduction of advanced backbone and head components. These include enhanced feature extraction modules, attention mechanisms, and NMS-free (non-maximum suppression-free) techniques. Additionally, improvements in learning and regularization techniques have contributed to further gains in robustness and generalization.
These architectural enhancements collectively improve detection accuracy and processing speed, making YOLO models particularly well-suited for clinical applications. A comparative overview of key YOLO model versions and their architectural components is provided in Table 4.

3.5.1. Network Architecture

The YOLOv8-x architecture presented in Figure 7 implements an improved convolutional neural network structure comprising three functional blocks: backbone, neck, and head. This structure provides efficient feature extraction, feature aggregation, and multi-scale object detection. Each YOLOv8 model variant is defined by three parameters: depth_multiple, width_multiple, and max_channels, where depth_multiple determines how many bottleneck blocks are contained in the C2f block, while width_multiple and max_channels define the output channels [20]. The stem component of the YOLOv8 model consists of two convolution blocks with a stride of 2 and a kernel size of 3 [20]. These blocks transform the data into raw features and reduce the input resolution [20].
In the YOLOv8 model, the stage component uses the C2f block; there are 8 stages in the model structure. C2f performs deep feature processing while preserving spatial features, implementing a stepwise reduction in spatial resolution together with an increase in feature depth. The backbone stages apply spatial reductions, whereas the neck does not [20]; this design choice was established empirically through trial-and-error experiments [21]. Downsampling in YOLOv8 is implemented with a convolution block with a stride of 2 and a kernel size of 3, which halves the output spatial resolution.
The backbone output layer is immediately followed by the SPPF (Spatial Pyramid Pooling Fast) block in the neck. The SPPF block is designed to provide a multi-scale representation of the feature map, allowing the model to capture features at different levels of abstraction by pooling at different scales [22]. The neck also contains several concat and upsample blocks. In the YOLOv8 model, the resolution of the feature map is increased by nearest-neighbor upsampling, in which the nearest pixels are repeated, while concatenation increases the number of channels without changing the spatial size. In addition, YOLOv8 has three detection heads for small, medium, and large objects, which are connected to different feature levels, with object size defined relative to the image.
Thus, the YOLOv8 model supports multi-level feature aggregation using concat and upsample operations. Also, the YOLOv8 model effectively utilizes the C2f block, which facilitates deep layer learning with compactness and learning efficiency.
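To make the above description more concrete, a simplified PyTorch sketch of the C2f and SPPF blocks is given below; channel counts and layer details are reduced for illustration and do not reproduce the exact Ultralytics implementation.
```python
# Simplified PyTorch sketch of the C2f and SPPF blocks; illustrative only, not the
# exact Ultralytics implementation.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBlock(c, c, 3)
        self.cv2 = ConvBlock(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))  # residual connection

class C2f(nn.Module):
    """Split features, pass one part through n bottlenecks, and concatenate everything."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBlock(c_in, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBlock((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

class SPPF(nn.Module):
    """Three chained max-poolings give a multi-scale representation of the feature map."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBlock(c_in, c_hidden, 1)
        self.cv2 = ConvBlock(4 * c_hidden, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```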

3.5.2. RT-DETR Architecture

In this study, the RT-DETR models (RT-DETR-l and RT-DETR-x) were also trained on the same image dataset to enable a comparative evaluation of their performance and efficiency against the YOLOv5 and YOLOv8–YOLOv12 model families. RT-DETR is a state-of-the-art end-to-end object detector that offers real-time performance while maintaining high accuracy. The network structure of the RT-DETR architecture is based on the Transformer and is shown in Figure 8. It includes three main modules: a feature extraction backbone, a hybrid feature enhancement encoder, and a Transformer decoder with auxiliary prediction heads [23]. The network operates on a hierarchical multi-scale feature processing mechanism: the model first enhances the features at each scale and then performs cross-scale feature integration. As its backbone, RT-DETR uses an efficient CNN, typically a ResNet variant [24] or a specially optimized HGNet [25].
During feature processing, the RT-DETR model incorporates two innovative modules: AIFI (Attention-Enhanced Intra-Scale Feature Interaction) and CCFM (Convolution-Driven Cross-Scale Feature Fusion). The AIFI module performs deep feature vectorization using a lightweight Transformer encoder, followed by feature reconstruction through a skip-connection-based network. This module enhances intra-scale feature representation while maintaining computational efficiency. At the final stage of the RT-DETR pipeline, the CCFM module aggregates multi-level features, enabling a comprehensive representation across spatial scales. The detection head of RT-DETR applies a query denoising strategy inspired by the DINO approach [26], which significantly improves the quality of query-to-object matching and accelerates model convergence during training.
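A rough PyTorch sketch of the AIFI idea is shown below: the highest-level feature map is flattened into a token sequence, processed by a lightweight Transformer encoder layer, and reshaped back. Positional embeddings and other details of the original RT-DETR implementation are omitted, and the dimensions are illustrative assumptions.
```python
# Rough sketch of the AIFI idea: intra-scale self-attention over the flattened top-level
# feature map. Positional embeddings and exact RT-DETR dimensions are omitted.
import torch
import torch.nn as nn

class AIFI(nn.Module):
    def __init__(self, channels=256, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=ffn_dim,
            batch_first=True)

    def forward(self, feat):                        # feat: (batch, channels, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)    # (batch, H*W, channels)
        tokens = self.encoder(tokens)               # intra-scale self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```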

3.6. Evaluation Metrics

Standard classification and detection metrics, precision (P), recall (R), F1-score, mAP (mean Average Precision) at IoU (Intersection over Union) thresholds of 0.5 and 0.5–0.95, and the confusion matrix, were used to comprehensively evaluate model performance in the task of automatic detection of meniscus tears on knee MRI images. Intersection over Union (IoU) is used to evaluate the localization accuracy of object detection models. The evaluation is based on the degree of overlap between the predicted bounding box and the ground-truth bounding box of a given object. Let $B_{gt}$ be the ground-truth bounding box and $B_p$ the predicted bounding box. In object detection, the IoU equals the area of the overlap (intersection) between the predicted bounding box $B_p$ and the ground-truth bounding box $B_{gt}$ divided by the area of their union [27]:
$$IoU = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} \quad (1)$$
A schematic illustration of the metric is presented in Figure 9.
The value of the Intersection over Union (IoU) metric ranges from 0 to 1. An IoU of 1 indicates a perfect match between the predicted and ground truth bounding boxes, whereas an IoU of 0 indicates no overlap at all. Thus, the higher the IoU value, the more accurate the object localization by the model; conversely, lower values indicate a significant discrepancy between the predicted and actual object positions.
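Equation (1) can be implemented directly for axis-aligned boxes; the sketch below assumes boxes in (x1, y1, x2, y2) format.
```python
# Direct implementation of Equation (1) for axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_pred, box_gt):
    x1 = max(box_pred[0], box_gt[0])
    y1 = max(box_pred[1], box_gt[1])
    x2 = min(box_pred[2], box_gt[2])
    y2 = min(box_pred[3], box_gt[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pred = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_pred + area_gt - intersection
    return intersection / union if union > 0 else 0.0
```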
Let:
TP (True Positive) denote the number of correctly identified meniscus tears;
FP (False Positive) the number of incorrect predictions where a tear was falsely identified;
FN (False Negative) the number of missed (i.e., incorrectly unrecognized) tears;
TN (True Negative) the number of correctly identified cases without a tear.
The evaluation metrics used to assess the model’s performance are presented in Equations (2)–(4).
$$\mathrm{Precision}\ (P) = \frac{TP}{TP + FP} \quad (2)$$
$$\mathrm{Recall}\ (R) = \frac{TP}{TP + FN} \quad (3)$$
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \quad (4)$$
The F1-score metric is employed to assess the balance between precision and recall.
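Equations (2)–(4) translate directly into code; the helper below is a straightforward illustration.
```python
# Equations (2)-(4) expressed directly in code; guards against division by zero are added.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```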
To evaluate the performance of the object detection models, the mean Average Precision (mAP) metric was applied [28]:
  • mAP@0.5: the average precision at a fixed threshold of the intersection of the predicted and true regions (IoU ≥ 0.5);
  • mAP@0.5–0.95: the average precision computed over ten IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, in accordance with the official COCO evaluation protocol.
Additionally, a confusion matrix was constructed to visualize the distribution of true positives, false positives, true negatives, and false negatives, enabling a qualitative analysis of model prediction errors. The use of this set of evaluation metrics provides an objective and comprehensive assessment of model performance, capturing both the accuracy of predictions and the robustness to false positives and missed detections.

4. Results

4.1. Experimental Environment and Hyperparameters

The models were built on the NVIDIA DGX A100 computing platform, the main components of which were as follows: CPU—Dual AMD Rome 7742, 256 cores, 1 TB RAM; GPU—8 × NVIDIA A100 SXM4 80 GB Tensor Core; OS: Ubuntu 22.04.5 LTS. The PyTorch 2.3.0 framework and the Python 3.10.12 programming language were used to build the models. In this study, the following hyperparameters were used for model training: number of epochs—200; batch size—16; and image size—640 × 640 pixels. The IoU threshold was set to 0.7 for a more rigorous comparison of predictions with annotations. The initial learning rate was 0.01, with a momentum of 0.937 and a weight decay coefficient of 0.0005, which ensured stable convergence and prevented overfitting. The optimizer was selected automatically, and the patience = 100 parameter controlled the early stopping process, ensuring sufficient time to achieve optimal model performance. The specified parameters, along with the other key training hyperparameters, were fixed at identical values for all YOLO and RT-DETR models employed in this study, ensuring the validity of their comparison.
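For reference, the training configuration described above can be expressed with the Ultralytics API roughly as follows; the dataset description file name is a placeholder, and the same arguments are assumed to be reused for the RT-DETR runs via the RTDETR class.
```python
# Sketch of the training configuration described above using the Ultralytics API.
# The dataset YAML file name is a placeholder.
from ultralytics import YOLO, RTDETR

model = YOLO("yolov8x.pt")          # or RTDETR("rtdetr-l.pt") for the transformer family
model.train(
    data="meniscus.yaml",           # placeholder dataset description file
    epochs=200,
    batch=16,
    imgsz=640,
    lr0=0.01,                       # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    patience=100,                   # early-stopping patience
)
metrics = model.val(iou=0.7)        # stricter IoU threshold when matching predictions
```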

4.2. Experiment

A comprehensive comparative evaluation was conducted on object detection models from the YOLO family, including YOLOv5 (configurations: nu, su, mu, lu, xu), YOLOv9 (t, s, m, c, e), as well as YOLOv8 and YOLOv10–YOLOv12 (n, s, m, l, x). As an alternative approach, RT-DETR models in large and extra-large configurations were also examined. Both model families were employed to address the task of binary classification of meniscus tear presence on MRI images, followed by localization of the pathological region. The evaluation was performed using the same dataset for all models. The performance of the YOLO and RT-DETR family models on the basic metrics precision, recall, mAP@50, and mAP@50–95 is summarized in Table 5.
In medical diagnostics, particularly in the analysis of MRI images, it is critical not only to correctly identify pathological regions but also to precisely delineate their boundaries. The mAP@0.5–0.95 metric is an internationally recognized standard for evaluating detection quality [29,30]. Unlike mAP@0.5, which is calculated at a fixed IoU (IoU = 0.5) threshold, mAP@0.5–0.95 averages precision over multiple IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. This provides a more rigorous assessment of a model’s ability to accurately localize and shape predicted objects, which is especially critical in medical contexts where even minor localization errors may result in misdiagnosis or incorrect clinical interpretation.
Equally important is the recall metric, which quantifies the model’s ability to detect all relevant objects in an image. In clinical practice, missing even a single pathological region can have serious implications; therefore, high sensitivity (recall) is essential. Notably, there often exists a trade-off between precision and recall: improving one may come at the cost of the other. Thus, achieving optimal detection performance requires an appropriate balance between these two metrics [31].
Moreover, in scenarios involving imbalanced datasets, where one class is significantly more prevalent than others, precision alone can be misleading. In such cases, recall often serves as a more informative measure for minority classes [14]. Consequently, the joint use of mAP@0.5–0.95 and recall enables a comprehensive evaluation of model performance in terms of both localization accuracy and detection completeness, which is particularly important in medical imaging tasks. Figure 10 presents the results of the models that achieved the best values for both mAP@0.5–0.95 and recall across the YOLO and RT-DETR model families.
As a result of comparing the YOLO and RT-DETR family models, the models that demonstrated the highest values of the key metrics were identified. As can be seen from Figure 10, the YOLO family models achieve higher values for the mAP@50–95 metric, indicating their better overall object recognition accuracy. At the same time, the RT-DETR-l model shows decent results in the precision and especially recall metrics (0.952), indicating its high ability to detect most objects. Thus, the best-performing model among the YOLO family is YOLOv8-x, and among the RT-DETR models it is RT-DETR-l.
Figure 11 shows how the loss function changed throughout training, which reflects the performance of these models by epoch.
Figure 12 shows the change in the performance metrics (precision, recall, mAP@50, and mAP@50–95) to track the behavior of the models as they are trained.
As shown in Figure 12, all four metrics—precision, recall, mAP@50, and mAP@50–95—generally demonstrate similar behavior: initial fluctuations during the early training epochs are followed by gradual improvement and subsequent stabilization. For the YOLOv8-x model, a notable increase in all metrics is observed around the 10th epoch, with both precision and recall rapidly exceeding 0.9. The mAP@50–95 metric exhibits a more gradual but consistent upward trend throughout the training process, indicating robust recognition performance, particularly for complex or ambiguous objects.
The RT-DETR-l model achieved stable values for precision and recall within the first 10 epochs; however, its mAP@50–95 scores remained relatively lower and progressed more slowly. This suggests a limited capacity for fine-grained localization of multi-scale objects within the specified training timeframe.
The behavior of the models on the test dataset was further analyzed through precision–recall–F1 score curves as functions of the confidence threshold (Figure 13, Figure 14, Figure 15 and Figure 16), as well as via confusion matrices (Figure 17). These evaluations provided insight into how predictive performance varies with different confidence levels and helped identify common classification errors made by the models.
As illustrated in Figure 13, Figure 14, Figure 15 and Figure 16, the precision, recall, and F1-score versus confidence threshold curves provide valuable insights into how model performance varies with different levels of prediction confidence.
The YOLOv8-x model achieves 100% precision for all classes at a confidence threshold of 0.985, whereas the RT-DETR-l model reaches this level at a threshold of 0.935. In Figure 13a, the YOLOv8-x curves exhibit a steeper initial rise, indicating that high precision is achieved even at low confidence values. On the recall–confidence plot, YOLOv8-x maintains more stable and higher recall levels as confidence increases. In contrast, RT-DETR-l demonstrates an earlier decline in recall, particularly for the «tear» class. In Figure 15, the YOLOv8-x model reaches its peak F1-score of 0.96 at a confidence level of 0.341, while the RT-DETR-l model achieves a maximum F1-score of 0.938 at a higher confidence threshold of 0.72. The precision–recall curves in Figure 16 for the YOLOv8-x model lie closer to the upper right corner for both classes («normal» and «tear»), with AP values approaching 1.0. For the RT-DETR-l model, the PR curves also exhibit high values, but have a less pronounced convexity, especially for the «tear» class. Finally, the confusion matrices (Figure 17) illustrate the distribution of correct and incorrect classifications across classes, offering additional insight into model reliability and the nature of classification errors.
As evident from the confusion matrices, the YOLOv8-x model exhibited a lower number of misclassifications compared to RT-DETR-l. The «normal» class was erroneously assigned to «background» in 22 cases, whereas RT-DETR-l had 74 such errors, indicating the higher sensitivity of YOLOv8-x to normal structures. Similarly, for the «tear» class, YOLOv8-x made 4 errors confusing it with the «normal» class and 18 errors confusing it with the «background» class, while RT-DETR-l made errors in 11 and 35 cases, respectively. The highest number of correct classifications in both models is also in the «normal» class, which is most likely due to its numerical predominance in the sample.

5. Discussion

5.1. Analysis of Misclassification and False Detection

During the training and testing phases of the YOLOv8-x and RT-DETR-l models, instances of misclassification and false detection were recorded—phenomena commonly observed in medical image analysis, particularly in the context of knee joint MRI interpretation. The identified errors stemmed both from inherent limitations of the models and from the characteristics of the input data.
The MRI images utilized in this study were primarily sourced from the Batpenov National Scientific Centre for Traumatology and Orthopaedics, having been acquired from multiple regions of the country. These examinations were conducted using imaging systems from various manufacturers and under diverse scanning protocols. As a result, the dataset exhibited significant heterogeneity in image quality, including variability in resolution, contrast, scale, and clarity of anatomical structure visualization.
The model errors observed in this study can be grouped into the following main categories:
  • Heterogeneity of image quality: differences in scanning characteristics (magnetic field, matrix, slices, signal gain settings, etc.) led to variations in the visual representation of the meniscus, which complicated uniform feature extraction.
  • Use of different imaging modes (PD, T1, T2): MRI images acquired in different imaging modes (PD, T1, and T2) were used in the study. Each of these modes has different contrast characteristics and displays tissue structures differently. As a result, signs of a tear could be visualized with varying degrees of severity depending on the imaging mode, which placed an additional burden on the model and reduced the stability of classification between cases acquired under different imaging settings.
  • Subtle or partial manifestations of pathology: In some cases, signs of meniscal tears were only faintly or partially expressed, which posed challenges for automated recognition and increased the likelihood of misclassification.
  • Presence of artefacts and noise: Mechanical and software-induced artefacts, shadows, signal inhomogeneities, and occlusions by adjacent bony structures were present in some images. These factors elevated the risk of false positive detections.
  • Anatomical variability and multiscale complexity: Substantial inter-patient variation in the size, shape, and positioning of the meniscus introduced additional complexity in generalization. This was particularly challenging given the limited number of training samples representing such diverse anatomical configurations.
It should be noted that the study’s focus on horizontal meniscus tears, determined by the data selection criteria, may have influenced the nature and frequency of classification errors. The absence of training on other tear morphologies (e.g., radial, bucket-handle) limits the model’s ability to handle such cases correctly. To enhance the universality and clinical applicability of the methodology, it is necessary to expand the original dataset and conduct additional training on images with different tear types.
The obtained results highlight several aspects that can be considered in further improving the models, including increasing sensitivity to subtle features and enhancing robustness to variations in imaging protocols and image quality.

5.2. Models Comparison in Terms of Detection Efficiency and Processing Speed

The key issue when choosing the best model is to determine the requirements it should meet. The model recommended on the basis of this research should, on the one hand, detect as many objects of interest as possible appearing in the image (have high recall) and, on the other hand, perform the object detection process with the highest possible precision (have high mAP). The plot in Figure 18 provides a good overview of the defined criteria. The points visible in this graph represent pairs of values (recall, mAP50–95) for individual models. Models that meet the predefined requirements are marked in Figure 18 with a gray dashed line, and the lower part of the plot shows an enlarged area of the range of values of interest (recall, mAP50–95). On this basis, it can be seen that it is worth analyzing the properties of the following models in more detail: YOLOv11-m, YOLOv8-x, and YOLOv9-e. The YOLOv8-x model has a recall comparable to the YOLOv9-e and YOLOv11-m models, but achieved the highest value of the mAP50–95 parameter (61.6%). It should be noted that the values of the analyzed performance metrics of the RT-DETR-l and RT-DETR-x models are significantly worse than the values of the metrics of the remaining models that belong to the YOLO family.
The second criterion for model selection is that it should perform the detection process in the shortest possible time (have a short inference time). In Figure 19, the gray dashed line marks the area of pairs of values (recall, inference time) that meet the requirement of a possibly high recall value and low or medium inference time. The medium inference time was assumed to be half of the longest inference time belonging to the YOLOv12-x model (16.9 ms). The selected area is shown in the lower part of Figure 19 in an enlarged view. All the previously selected models met the second criterion. The YOLOv8-x and YOLOv9-e models have a medium inference time (7.7 ms). The YOLOv11-m model, in turn, achieved the shortest inference time (2.7 ms), but its mAP50–95 parameter value (58.7%) is significantly lower than the mAP50–95 of the other two models—60.5% (YOLOv9-e) and 61.6% (YOLOv8-x). The models belonging to the RT-DETR architecture achieved a medium inference time (from 6 to 8 ms), but their recall was lower than that of the best models from the YOLO family.
Based on the previous analysis, it can be concluded that the YOLOv8-x model can be recommended for detecting meniscus tears. The inference time of this model is 7.7 ms. It should be noted that the total image frame processing time (tproc) consists of three components: preprocessing (tpre), inference (tinfer), and post-processing (tpost) times. For the YOLOv8-x model, these times were as follows: tproc = tpre + tinfer + tpost = 0.5 ms + 7.7 ms + 2.7 ms = 10.9 ms. This value gives a theoretical processing speed of 92 frames per second (1000 ms/10.9 ms ≈ 92 FPS). The actual processing speed is lower because there are additional delays associated with operations such as image capture, displaying results on the screen, etc. As a result, the FPS depends on the model architecture and the hardware and software platform on which the model is running. The model processing speed was tested using a personal computer with the following configuration: Windows 10 64-bit operating system, Intel Core i7–4770 3.4 GHz CPU, 32 GB RAM, NVIDIA GeForce GTX 1660 Ti GPU, 6 GB GDDR6, PyTorch 2.1.2 platform, and the Python 3.10 programming language. The average processing speed was 13 FPS. This value is sufficient for using the model on typical computer systems with average computational performance. It should also be noted that FPS is more important for video processing, but in the context of the presented research, the recall and medium average precision of meniscal tear detection are more important.

5.3. Performance Comparison of YOLOv8-x and RT-DETR-l Models

Based on the data presented in the Results section, the YOLOv8-x model demonstrates significant advantages over RT-DETR-l, confirming its superior performance in medical detection and classification tasks involving knee joint MRI images. These improvements are attributed to both the architectural strengths of YOLOv8-x and its more effective training strategies, which enable the model to perform reliably even with a limited number of visual features. In particular, YOLOv8-x handles weakly expressed structures more effectively thanks to its ability to integrate features across multiple levels. In contrast, RT-DETR-l was found to be more sensitive to image quality, which reduces its robustness when faced with variations in contrast and sharpness.
YOLOv8-x demonstrates superior performance compared to RT-DETR-l across all key evaluation metrics, indicating its higher effectiveness in medical image object detection tasks. The improvements encompass both detection accuracy and the model’s ability to reliably identify and classify objects under varying conditions. The precision of YOLOv8-x is 0.958, compared to 0.919 for RT-DETR-l, while the recall reaches 0.961 versus 0.952, respectively. These metrics reflect the greater reliability of YOLOv8-x in reducing both false positives and missed detections.
The mAP@50 score for YOLOv8-x is 0.975, exceeding the 0.929 achieved by RT-DETR-l. The advantage becomes even more pronounced with the stricter mAP@50–95 criterion, where YOLOv8-x scores 0.616 compared to 0.531 for RT-DETR-l. These differences confirm the greater robustness of YOLOv8-x in accurately localizing and classifying objects, even in complex scenarios. An analysis of the precision–confidence curves shows that YOLOv8-x achieves maximum precision at a confidence threshold of 0.985, while RT-DETR-l requires only 0.935 to reach a similar level. This indicates a more confident prediction strategy in YOLOv8-x.
The curves for YOLOv8-x exhibit a steeper initial increase and more stable behavior as the confidence threshold grows, particularly for the «tear» class. In contrast, RT-DETR-l loses recall much more rapidly. Such stability is critically important in clinical contexts, where minimizing missed detections is of utmost importance.
Similar conclusions are confirmed by the analysis of the F1–confidence curves. YOLOv8-x reaches its maximum F1-score (0.96) already at a confidence of 0.341, while RT-DETR-l shows its maximum (0.93) only at a much higher threshold (0.718). This indicates a more reliable balance between precision and recall in YOLOv8-x performance, which is critical when analyzing medical images containing subtle pathological features.
A detailed analysis of the confusion matrix also confirms the advantage of the YOLOv8-x model. RT-DETR-l misclassified images with a normal meniscus as background in 74 cases, while YOLOv8-x had such errors only 22 times. For images with meniscus tears, RT-DETR-l classified them as normal in 11 cases and as background in 35 cases, while YOLOv8-x made only 4 misclassifications as normal and 18 as background. These differences indicate that RT-DETR-l is less robust to visual variability, including subtle signs of damage and background noise. This is especially important in clinical settings, where meniscus tears often manifest themselves as minor signal changes or are partially masked by anatomical structures. The YOLOv8-x model demonstrated higher accuracy and stability in such cases, as evidenced by a significantly lower number of false negatives and misclassifications.
An additional factor affecting the results is the structural imbalance of classes in the training set. Despite the presence of images with both normal and damaged menisci, objects of the «normal» class were quantitatively predominant. This is due to the annotation features: in the case of «normal», both menisci were marked as separate objects, while in the case of a tear, only one pathological structure was annotated. As a result, the proportion of objects of the «normal» class was 5992 against 1998 objects of the «tear» class. Such an imbalance does not reflect the real prevalence of pathologies but affects the behavior of the model, shifting the emphasis towards the «norm». The greatest number of errors in both models was recorded precisely between the «tear» and «background» classes, which may be due to the low contrast of individual images, in which the tear was poorly visualized, making it difficult to localize both for the model and for a person.
Overall, YOLOv8-x demonstrates higher adaptability, resistance to confidence threshold changes, better localization and classification of objects, and lower sensitivity to class imbalance compared to RT-DETR-l. Increased recognition accuracy even in conditions of mild pathologies and visual noise makes it a preferred model for automated analysis of knee joint MRI images in clinical practice.
To assess the effectiveness of the proposed approach, results from several other studies in the same domain [32,33,34,35,36] were collected. The summary metrics are presented in Table 6 and indicate a comparable level of accuracy.
In study [32], an algorithm based on CNN was described, capable of detecting meniscus tears. The main task of the study was divided into three subtasks: determining the position of both horns of the meniscus, detecting the presence of a tear, and determining the orientation of the tear. The performance metric was based on the area under the curve (AUC) analysis for each subtask. The results of the algorithm were as follows: AUC = 92% for determining the position of the two horns of the meniscus, AUC = 94% for the presence of a meniscus tear, and AUC = 83% for determining the orientation of the tear. In study [33], similar to our research, binary classification of meniscus tears in knee MRI images was performed, where the model was trained on clinical data using a multi-level CNN architecture optimized to detect both the presence of tears and their morphological type. The model demonstrated high performance for binary classification (AUC up to 0.924) and satisfactory results for common tear types. However, its accuracy decreased for rare morphologies due to their low representation in the training set. Thus, the trained model employed a relatively simple AlexNet architecture for classification without specifying tear localization or type, and without incorporating modern modules such as residual connections or attention mechanisms. In contrast, our study implements detection employing bounding boxes, leverages modern models such as YOLOv8 and RT-DETR, and applies the stricter mAP@50–95 metric, which provides improved accuracy, greater clinical interpretability, and enhanced practical applicability of our model. Study [34], similar to our study, was conducted to detect meniscus injuries, but using a convolutional neural network (CNN). As in our study, an annotated database of coronal and sagittal MRI images of the knee joint was created. The deep learning algorithm demonstrated high efficiency in detecting knee meniscus injuries and showed results close to, but not exceeding, those obtained by us. In [35], Mask R-CNN (a mask region-based convolutional neural network) was used to build a deep learning network for the detection and diagnosis of knee meniscus tears, with ResNet50 adopted as the backbone network. Mask R-CNN showed fairly good results: the average accuracy for three types of meniscus (healthy, torn, and degenerated) ranged from 68% to 80%, and sensitivity ranged from 74% to 95%. The authors of [36] presented a systematic review of deep learning methods on MRI for meniscus tears, where ACC ranged from 77 to 100%, sensitivity from 57 to 71%, and specificity from 67 to 93%, indicating significant variation and instability across solutions in the literature.
In comparison with the aforementioned studies, the YOLOv8-x model proposed in this research demonstrates excellent performance: ACC = 96.0, TPR = 96.1, F1 = 96.0, and mAP@0.5:0.95 = 0.616. Of particular note is the sensitivity value achieved, TPR = 96.1, which is the highest among all studies considered. This highlights the model’s high ability to detect even mild meniscus tears, minimizing the risk of missing pathology.
The results achieved confirm the high accuracy, stability, and reliability of the YOLOv8-x model when working with clinical MRI images. This is reflected both in the main quality metrics (mAP, precision, recall, F1) and in the error matrix analysis, which demonstrates a low number of false classifications and high sensitivity (TPR = 96.1). Of particular importance is the model’s ability to correctly distinguish between normal and damaged structures within a single image, which corresponds to real-world diagnostic conditions. In contrast to previous studies (Table 6), our research employs a larger and more diverse proprietary clinical dataset, applies the stricter mAP@50–95 metric instead of mAP@50 alone, includes a comparison of two modern architectures (YOLOv8-x and RT-DETR), and focuses on the practical clinical task of binary classification, thereby enhancing the applied value and objectivity of the results.
In addition to standard metrics, an additional quantitative assessment of the model’s prediction reliability was conducted based on the analysis of the confidence value distribution among correctly classified objects. This approach made it possible to evaluate the model’s self-confidence in its predictions and confirm its stability under different confidence thresholds. To properly interpret the obtained values, it is important to understand what confidence represents and how it is computed within the model.
In object detection tasks, the confidence metric reflects the model’s degree of certainty that an object of a specific class is indeed present within a given region of the image. As defined in the corresponding equation, the confidence is determined by two key factors:
$$\mathrm{Confidence} = P(\mathrm{Object}) \times IoU_{\mathrm{pred,\,truth}},$$
where $P(\mathrm{Object})$ is the predicted probability of the presence of an object within the bounding box, and $IoU$ is the intersection-over-union measure between the predicted and ground truth boxes. Thus, the confidence score simultaneously accounts for the probabilistic reliability of the classification and the quality of localization, serving as a comprehensive indicator of accuracy. Higher confidence values (e.g., ≥0.9) indicate strong model certainty in the correctness of the prediction, while lower values (≤0.5) may suggest unreliable or erroneous detections. During the post-processing stage, confidence scores are used to filter predictions and manage non-maximum suppression (NMS), which directly affects the model’s precision and recall [37].
After applying the YOLOv8-x model to the test dataset, the reliability of its predictions was evaluated using the following procedure:
  • Comparison with ground truth annotations: the model’s predictions were compared with the ground truth bounding boxes. A prediction was considered correct (true positive, TP) if the IoU value exceeded a predefined threshold (IoU = 0.7) and the predicted class matched the ground truth class.
  • Filtering by confidence thresholds: Among the TP predictions, those with confidence values exceeding specified thresholds (e.g., 0.80, 0.85, 0.90) were selected. For each threshold, the proportion of reliable predictions was calculated; the corresponding results are presented in Table 7, and a code sketch of this procedure is given below.
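A hypothetical sketch of this reliability analysis is given below; the data structures, box format, and helper names are assumptions for illustration.
```python
# Hypothetical sketch of the reliability analysis: true positives are identified by IoU
# and class matching, then the share of TPs above each confidence threshold is reported
# (as in Table 7). Data structures and box format are assumptions for illustration.
def _iou(a, b):
    # Axis-aligned (x1, y1, x2, y2) boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def reliability_report(predictions, ground_truths, iou_thr=0.7, conf_levels=(0.80, 0.85, 0.90)):
    """predictions / ground_truths: lists of dicts with 'box' and 'cls' ('conf' for predictions)."""
    tp_confidences = []
    matched = set()
    for pred in sorted(predictions, key=lambda p: p["conf"], reverse=True):
        for idx, gt in enumerate(ground_truths):
            if idx in matched or gt["cls"] != pred["cls"]:
                continue
            if _iou(pred["box"], gt["box"]) >= iou_thr:
                tp_confidences.append(pred["conf"])
                matched.add(idx)
                break
    total_tp = len(tp_confidences)
    return {thr: (sum(c >= thr for c in tp_confidences) / total_tp if total_tp else 0.0)
            for thr in conf_levels}
```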
The model demonstrated high reliability: over 93% of correct predictions have a confidence score ≥ 0.85. The value of 77.04% correct predictions at a confidence threshold ≥ 0.90 indicates that approximately 23% of correct predictions have a moderate level of confidence. A decrease in confidence in these cases may be attributed to blurred pathological features, the presence of noise, or variations in scanning parameters across different MRI devices. Such variability in confidence reflects the heterogeneity of clinical data and does not necessarily indicate shortcomings of the model. In real clinical practice, moderate confidence may serve as a signal for additional expert review, thereby enhancing the overall safety of diagnosis.
Figure 20 shows the general scheme of the system for detection and classification of meniscus damage in MRI images of the knee joint using the YOLOv8-x model, which demonstrated the highest quality indicators (precision—0.958; recall—0.961; mAP@50—0.975; mAP@50–95—0.616). After improving the quality of the image data, it is fed to the YOLOv8-x model. The YOLOv8-x model detects objects in the data and determines their belonging to one of the classes (normal, tear). After successful localization, the model analyzes the probabilities of the object belonging to one of the possible classes. Thus, based on these predictions, the object is assigned a label of the class for which the probability was highest—«tear» (if there are signs of damage) or «normal» (if there are no signs of damage).
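For illustration, the inference step of such a system could be expressed with the Ultralytics predict API as follows; the weights file name and image path are placeholders, and the confidence threshold is taken from the F1-confidence analysis above.
```python
# Illustrative inference step for the deployed detector using the Ultralytics predict API.
# The weights file name and image path are placeholders.
from ultralytics import YOLO

detector = YOLO("best_yolov8x.pt")                   # trained weights (placeholder name)
results = detector.predict("knee_slice.png", imgsz=640, conf=0.341)

for result in results:
    for box in result.boxes:
        label = result.names[int(box.cls)]           # "normal" or "tear"
        print(f"{label}: confidence={float(box.conf):.3f}, box={box.xyxy[0].tolist()}")
```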
To qualitatively evaluate the performance of the models, examples of both correct and incorrect predictions were visualized. In most cases of successful classification, shown in Figure 21a, the models accurately localized and recognized meniscus tears with different shapes and locations of pathologies. Incorrect predictions, shown in Figure 21b, generally occurred in conditions of pronounced visual variability, with mild signs of damage or the presence of background structures that made interpretation difficult. In some cases, the meniscus appeared visually significantly reduced, which made it difficult to extract stable signs and became a factor affecting the accuracy of classification.
Overall, the test results show that the model performs well in recognizing meniscus tears in various clinical situations. It accurately identifies important features, works consistently across differences in patient anatomy, and copes well with image noise. The model confidently recognizes even mild injuries and maintains accuracy across varying image quality. The clinical benefit of the developed system lies in its potential to serve as an auxiliary tool, particularly for junior specialists, in the interpretation of knee joint MRI images, thereby improving diagnostic accuracy and reducing the time required for pathology detection. Integrating the system into existing computer-based diagnostic platforms, including PACS (Picture Archiving and Communication System), will enhance the efficiency of specialists’ work.

6. Conclusions

This study employed models from the YOLO and RT-DETR families to detect meniscus tears in knee MRI images. The task involved two target classes, with a total of 7990 labeled samples. After completing the training stages, the best-performing models from each family were selected based on the mAP@50–95 and recall key metrics. According to the analysis, YOLOv8-x and RT-DETR-l showed the highest results. A comparative analysis of the models based on the training dynamics, key quality metrics, and behavior on the test set showed that YOLOv8-x demonstrates more stable and accurate results compared to RT-DETR-l. It copes better with detecting meniscus damage, even with a limited amount of data and the presence of visual distortions in MRI. Such efficiency is explained by the architectural advantages of YOLOv8-x, in particular, the use of advanced feature extraction mechanisms and attention to context, which is especially important when analyzing unstructured medical images. According to the final metrics, YOLOv8-x achieved precision—0.958; recall—0.961; mAP@50—0.975; and mAP@50–95—0.616. These indicators confirm the high ability of the model to detect and classify meniscus tears at different overlap levels (IoU). Despite individual cases of false positive predictions, especially in areas with pronounced structural changes, the model as a whole remained stable and highly selective. In conditions of diagnostic uncertainty, it can effectively complement traditional methods, supporting medical decision-making. Thus, YOLOv8-x has proven itself as an accurate, reliable, and effective tool for automatic recognition of meniscus tears on MRI. Although there are still areas for improvement, such as reducing false positives and increasing sensitivity to subtle lesions, the obtained results confirm the high potential of this model for use in clinical practice. In the future, it is planned to expand the training sample and add other types of damage to solve multi-class classification problems, which will increase the versatility of the model and expand the possibilities of its application in medical diagnostics.

Author Contributions

Conceptualization, A.T. (Aizhan Tlebaldinova) and Z.O.; methodology, A.T. (Aizhan Tlebaldinova), Z.O. and M.K.; software, Z.O. and I.G.; validation, A.T. (Aizhan Tlebaldinova) and Z.O.; formal analysis, A.T. (Aizhan Tlebaldinova), Z.O. and S.K.; investigation, A.T. (Aizhan Tlebaldinova), Z.O., M.K., S.S., S.K., A.T. (Akerke Tankibayeva) and A.K.; resources, Z.O. and S.K.; data curation, A.T. (Aizhan Tlebaldinova), S.S., S.K., M.K., A.T. (Akerke Tankibayeva), A.K. and I.G.; writing—original draft preparation, A.T. (Aizhan Tlebaldinova), M.K., A.T. (Akerke Tankibayeva) and A.K.; writing—review and editing, Z.O.; visualization, Z.O., M.K. and A.K.; supervision, Z.O. and A.T. (Aizhan Tlebaldinova); project administration, A.T. (Aizhan Tlebaldinova) and Z.O.; funding acquisition, A.T. (Aizhan Tlebaldinova), S.K. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan, grant number AP23486396.

Data Availability Statement

The dataset used in this study is publicly available at https://github.com/meniscanData/KneeMRI-Meniscus-Dataset (accessed on 21 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MRI: Magnetic Resonance Imaging
CNN: Convolutional Neural Network
YOLO: You Only Look Once
DETR: Detection Transformer
RT-DETR: Real-Time Detection Transformer
DICOM: Digital Imaging and Communications in Medicine
PNG: Portable Network Graphics
MSE: Mean Square Error
PSNR: Peak Signal-to-Noise Ratio
SSIM: Structural Similarity
NMS-free: Non-Maximum Suppression-free
AIFI: Attention-Enhanced Intra-Scale Feature Interaction
CCFM: Convolution-Driven Cross-Scale Feature Fusion
mAP: mean Average Precision
IoU: Intersection over Union
TP: True Positive
FP: False Positive
FN: False Negative
TN: True Negative
PACS: Picture Archiving and Communication System

References

1. Hoover, K.B.; Vossen, J.A.; Hayes, C.W.; Riddle, D.L. Reliability of meniscus tear description: A study using MRI from the Osteoarthritis Initiative. Rheumatol. Int. 2020, 40, 635–641.
2. Grasso, D.; Gnesutta, A.; Calvi, M.; Duvia, M.; Atria, M.G.; Celentano, A.; Callegari, L.; Genovese, E.A. MRI evaluation of meniscal anatomy: Which parameters reach the best inter-observer concordance? Radiol. Med. 2022, 127, 991–997.
3. Bien, N.; Rajpurkar, P.; Ball, R.L.; Irvin, J.; Park, A.; Jones, E.; Bereket, M.; Patel, B.N.; Yeom, K.W.; Shpanskaya, K.; et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med. 2018, 15, e1002699.
4. Güngör, E.; Vehbi, H.; Cansın, A.; Ertan, M.B. Achieving High Accuracy in Meniscus Tear Detection Using Advanced Deep Learning Models with a Relatively Small Data Set. Knee Surg. Sports Traumatol. Arthrosc. 2025, 33, 450–456.
5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020.
6. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022), Waikoloa, HI, USA, 3–8 January 2022.
7. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
8. Liu, F.; Zhou, Z.; Jang, H.; Samsonov, A.; Zhao, G.; Kijowski, R.; Li, F. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magn. Reson. Med. 2018, 79, 2379–2391.
9. Couteaux, V.; Si-Mohamed, S.; Nempont, O.; Lefevre, T.; Popoff, A.; Pizaine, G.; Villain, N.; Bloch, I.; Cotton, A.; Boussel, L. Automatic knee meniscus tear detection and orientation classification with Mask-RCNN. Diagn. Interv. Imaging 2019, 100, 235–242.
10. Kuczyński, N.; Boś, J.; Białoskórska, K.; Aleksandrowicz, Z.; Turoń, B.; Zabrzyńska, M.; Bonowicz, K.; Gagat, M. The Meniscus: Basic Science and Therapeutic Approaches. J. Clin. Med. 2025, 14, 2020.
11. Parkar, A.P.; Adriaensen, M.E.A.P.M. ESR Essentials: MRI of the Knee—Practice Recommendations by ESSR. Eur. Radiol. 2024, 34, 6590–6599.
12. Smirnov, V.V.; Savvova, M.V.; Smirnov, V.V. Magnetic Resonance Imaging in the Diagnosis of Joint Diseases; Artifex Publishing House: Obninsk, Russia, 2022; p. 170. (In Russian)
13. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912.
14. Ultralytics. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 June 2025).
15. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
16. Hidayatullah, P.; Tubagus, R. YOLOv9 Architecture Explained | Stunning Vision AI. Available online: https://article.stunningvisionai.com/yolov9-architecture (accessed on 1 June 2025).
17. Wang, Y.; Li, K.; Zhang, Y.; Han, J.; Wang, C. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
18. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
19. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. Available online: https://github.com/sunsmarterjie/yolov12 (accessed on 1 June 2025).
20. Hidayatullah, P.; Syakrani, N.; Sholahuddin, M.R.; Gelar, T.; Tubagus, R. YOLOv8 to YOLO11: A comprehensive architecture in-depth comparative review. arXiv 2024, arXiv:2501.13400.
21. Glenn, J. Shortcut in Backbone and Neck, Issue #1200, Ultralytics/Ultralytics. Available online: https://github.com/ultralytics/ultralytics/issues/1200#issuecomment-1454873251 (accessed on 15 June 2025).
22. Glenn, J. Understanding SPP and SPPF Implementation, Issue #8785, Ultralytics/yolov5. Available online: https://github.com/ultralytics/yolov5/issues/8785 (accessed on 15 June 2025).
23. Hu, J.; Zheng, J.; Wan, W.; Zhou, Y.; Huang, Z. RT-DETR-EVD: An Emergency Vehicle Detection Method Based on Improved RT-DETR. Sensors 2025, 25, 3327.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
25. Chen, J.; Lei, B.; Song, Q.; Ying, H.; Chen, D.Z.; Wu, J. A hierarchical graph network for 3D object detection on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 392–401.
26. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660.
27. Tian, J.; Jin, Q.; Wang, Y.; Yang, J.; Zhang, S. Performance analysis of deep learning-based object detection algorithms on COCO benchmark: A comparative study. J. Eng. Appl. Sci. 2024, 71, 76.
28. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279.
29. Zhao, B.; Chang, L.; Liu, Z. Fast-YOLO Network Model for X-Ray Image Detection of Pneumonia. Electronics 2025, 14, 903.
30. Mercaldo, F.; Brunese, L.; Martinelli, F.; Santone, A.; Cesarelli, M. Object Detection for Brain Cancer Detection and Localization. Appl. Sci. 2023, 13, 9158.
31. Wang, Q.; Yan, N.; Qin, Y.; Zhang, X.; Li, X. BED-YOLO: An Enhanced YOLOV10N-Based Tomato Leaf Disease Detection Algorithm. Sensors 2025, 25, 2882.
32. Roblot, V.; Giret, Y.; Antoun, M.B.; Morillot, C.; Chassin, X.; Cotten, A.; Zerbib, J.; Fournier, L. Artificial Intelligence to Diagnose Meniscus Tears on MRI. Diagn. Interv. Imaging 2019, 100, 243–249.
33. Shin, H.; Choi, G.S.; Shon, O.-J.; Kim, G.B.; Chang, M.C. Development of Convolutional Neural Network Model for Diagnosing Meniscus Tear Using Magnetic Resonance Image. BMC Musculoskelet. Disord. 2022, 23, 510.
34. Rizk, B.; Brat, H.; Zille, P.; Guillin, R.; Pouchy, C.; Adam, C.; Ardon, R.; D’Assignies, G. Meniscal Lesion Detection and Characterization in Adult Knee MRI: A Deep Learning Model Approach with External Validation. Phys. Medica 2021, 83, 64–71.
35. Li, J.; Qian, K.; Liu, J.; Huang, Z.; Zhang, Y.; Zhao, G.; Wang, H.; Li, M.; Liang, X.; Zhou, F.; et al. Identification and Diagnosis of Meniscus Tear by Magnetic Resonance Imaging Using a Deep Learning Model. J. Orthop. Transl. 2022, 34, 91–101.
36. Botnari, A.; Kadar, M.; Patrascu, J.M. A Comprehensive Evaluation of Deep Learning Models on Knee MRIs for the Diagnosis and Classification of Meniscal Tears: A Systematic Review and Meta-Analysis. Diagnostics 2024, 14, 1090.
37. He, L.-H.; Zhou, Y.-Z.; Liu, L.; Cao, W.; Ma, J.-H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032.
Figure 1. Complex methodology of automatic detection of meniscus tear.
Figure 2. (a) Images without a meniscus tear (class 0); (b) images with a confirmed meniscus tear (class 1). Arrows indicate the site of the meniscus tear.
Figure 3. Stages of filter preprocessing of images.
Figure 4. Results of image processing using various methods.
Figure 5. Result of filtration preprocessing.
Figure 6. Augmentation results: (a) images without meniscus tear (class 0); (b) images with confirmed meniscus tear (class 1). Arrows indicate the site of the meniscus tear.
Figure 7. Structure diagram of the YOLOv8-x network.
Figure 8. RT-DETR network architecture diagram.
Figure 9. Schematic illustration of the IoU metric.
Figure 10. Values of key quality metrics (mAP@50–95, precision, recall) for the most efficient models from the YOLO and RT-DETR families.
Figure 11. Loss curves per epoch for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 12. Performance metrics per epoch for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 13. Precision–confidence curves for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 14. Recall–confidence curves for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 15. F1-score–confidence curves for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 16. Precision–recall curves for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 17. Confusion matrices for the (a) YOLOv8-x and (b) RT-DETR-l models.
Figure 18. Recall and mAP@50–95 metric values of the built models.
Figure 19. Recall and inference time values of the built models.
Figure 20. General scheme of the meniscus damage detection and classification system.
Figure 21. Examples of the YOLOv8-x model working on a test sample: (a) correct predictions; (b) incorrect predictions.
Table 1. Distribution of images and objects by classes. PD, T1, and T2 denote the visualization mode.

Classes | PD | T1 | T2 | Images | Objects
Normal | 556 | 202 | 242 | 1000 | 5992
Tear | 682 | 134 | 184 | 1000 | 1998
Table 3. Comparison of methods based on quality metrics.

Method | MSE | PSNR | SSIM
Combined method | 32.55 | 41.37 | 0.92
Gaussian Blur | 39.47 | 39.36 | 0.90
Laplacian Filter | 39.80 | 38.21 | 0.90
Bilateral Filter | 35.45 | 39.76 | 0.86
Non-Local Means Denoising (NLM) | 35.69 | 39.57 | 0.86
Sharpening (Unsharp Mask) | 47.10 | 35.80 | 0.85
Median Blur | 42.20 | 32.26 | 0.82
CLAHE | 132.33 | 24.47 | 0.74
Sobel Filter | 190.83 | 21.22 | 0.68
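As a minimal sketch of how MSE, PSNR, and SSIM can be computed for a filtered slice against its original, the code below uses scikit-image; the file names are illustrative, and absolute values depend on image bit depth and normalization, so they will not necessarily reproduce the figures in Table 3.

```python
from skimage import io
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

# Original and filtered slices loaded as grayscale arrays (illustrative paths).
original = io.imread("slice_original.png", as_gray=True)
filtered = io.imread("slice_filtered.png", as_gray=True)
data_range = original.max() - original.min()

# MSE: lower is better; PSNR: higher is better; SSIM: closer to 1 is better.
mse = mean_squared_error(original, filtered)
psnr = peak_signal_noise_ratio(original, filtered, data_range=data_range)
ssim = structural_similarity(original, filtered, data_range=data_range)

print(f"MSE={mse:.2f}  PSNR={psnr:.2f} dB  SSIM={ssim:.2f}")
```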
Table 4. Comparison of YOLO models in network structure.

Version | Backbone | Neck | Activation | Loss | Models
YOLOv5 [13,14] | CSPDarknet53 (Focus) | PANet + SPPF | LeakyReLU | BCE + CIoU | YOLOv5-nu, YOLOv5-su, YOLOv5-mu, YOLOv5-lu, YOLOv5-xu
YOLOv8 [15] | C2f + CBS | PANet + SPPF | SiLU | BCE + DFL (v2) | YOLOv8-n, YOLOv8-s, YOLOv8-m, YOLOv8-l, YOLOv8-x
YOLOv9 [16] | ELAN-V2 + DFL v3 | BiFPN or PAN++ | SiLU | vFL (v3) + improved DFL | YOLOv9-t, YOLOv9-s, YOLOv9-m, YOLOv9-c, YOLOv9-e
YOLOv10 [17] | improved C2f / Transformer | RT-DETR-like neck | GELU | DFL + Adaptive Matching | YOLOv10-n, YOLOv10-s, YOLOv10-m, YOLOv10-l, YOLOv10-x
YOLOv11 [18] | RTMDet-style backbone | CBAM + PAN | SiLU / GELU | Varifocal Loss + DFL | YOLOv11-n, YOLOv11-s, YOLOv11-m, YOLOv11-l, YOLOv11-x
YOLOv12 [19] | R-ELAN | Multi-scale fusion: Upsample-Concat + A2C2F + C3k2 + Area Attention | SiLU | Hybrid: DFL + GIoU | YOLOv12-n, YOLOv12-s, YOLOv12-m, YOLOv12-l, YOLOv12-x
Table 2. Data splitting.

Category | Training Set | Testing Set | Validation Set | Total
Images (normal/tear) | 2800 | 600 | 600 | 4000
Table 5. Performance of YOLO and RT-DETR models.

Architecture | Version | Precision | Recall | mAP@50 | mAP@50–95
YOLOv5 | n | 0.964 | 0.945 | 0.977 | 0.587
YOLOv5 | s | 0.956 | 0.939 | 0.964 | 0.583
YOLOv5 | m | 0.956 | 0.948 | 0.972 | 0.600
YOLOv5 | l | 0.965 | 0.948 | 0.975 | 0.605
YOLOv5 | x | 0.958 | 0.941 | 0.975 | 0.604
YOLOv8 | n | 0.973 | 0.947 | 0.978 | 0.594
YOLOv8 | s | 0.965 | 0.951 | 0.977 | 0.600
YOLOv8 | m | 0.968 | 0.950 | 0.974 | 0.601
YOLOv8 | l | 0.945 | 0.953 | 0.970 | 0.612
YOLOv8 | x | 0.958 | 0.961 | 0.975 | 0.616
YOLOv9 | t | 0.961 | 0.947 | 0.975 | 0.589
YOLOv9 | s | 0.968 | 0.953 | 0.975 | 0.604
YOLOv9 | m | 0.959 | 0.955 | 0.974 | 0.601
YOLOv9 | c | 0.960 | 0.942 | 0.971 | 0.601
YOLOv9 | e | 0.966 | 0.962 | 0.976 | 0.605
YOLOv10 | n | 0.948 | 0.941 | 0.964 | 0.571
YOLOv10 | s | 0.954 | 0.950 | 0.974 | 0.595
YOLOv10 | m | 0.953 | 0.950 | 0.978 | 0.582
YOLOv10 | l | 0.950 | 0.946 | 0.969 | 0.600
YOLOv10 | x | 0.965 | 0.931 | 0.972 | 0.612
YOLOv11 | n | 0.960 | 0.937 | 0.974 | 0.596
YOLOv11 | s | 0.959 | 0.954 | 0.977 | 0.590
YOLOv11 | m | 0.949 | 0.963 | 0.977 | 0.587
YOLOv11 | l | 0.974 | 0.948 | 0.975 | 0.597
YOLOv11 | x | 0.962 | 0.942 | 0.978 | 0.606
YOLOv12 | n | 0.957 | 0.946 | 0.973 | 0.592
YOLOv12 | s | 0.960 | 0.951 | 0.978 | 0.590
YOLOv12 | m | 0.970 | 0.945 | 0.979 | 0.584
YOLOv12 | l | 0.955 | 0.956 | 0.972 | 0.591
YOLOv12 | x | 0.956 | 0.956 | 0.973 | 0.595
RT-DETR | l | 0.919 | 0.952 | 0.929 | 0.531
RT-DETR | x | 0.898 | 0.889 | 0.906 | 0.434
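To indicate how both families can be trained and evaluated under comparable settings, the sketch below uses the Ultralytics API, which supports YOLO and RT-DETR models alike; the dataset YAML file name, epoch count, and image size are illustrative assumptions rather than the exact configuration of this study.

```python
from ultralytics import YOLO, RTDETR

# Dataset description in Ultralytics YAML format (illustrative path);
# it points to train/val image folders and the two classes (normal, tear).
DATA = "meniscus.yaml"

# Train a YOLOv8-x detector starting from pretrained weights.
yolo_model = YOLO("yolov8x.pt")
yolo_model.train(data=DATA, epochs=100, imgsz=640)

# Train an RT-DETR-l detector with the same dataset definition.
detr_model = RTDETR("rtdetr-l.pt")
detr_model.train(data=DATA, epochs=100, imgsz=640)

# Evaluate both on the validation split; the reported metrics include
# precision, recall, mAP@50 and mAP@50-95, matching the columns of Table 5.
yolo_metrics = yolo_model.val(data=DATA)
detr_metrics = detr_model.val(data=DATA)
print(yolo_metrics.box.map, detr_metrics.box.map)  # mAP@50-95 for each model
```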
Table 6. Comparison with other studies.

No | Purpose of the Study | Input Data | Method | Performance Metrics
Our research | Application of YOLO and RT-DETR family models in the recognition of meniscus tears | MRI scans of the knee joint (1000 normal, 1000 tear) | YOLOv5, YOLOv8–YOLOv12 models with all available submodels (n, s, m, l, x) | mAP@0.5–0.95 = 0.616; ACC = 95.8%; TPR = 96.1%
[32] Roblot et al. | Creation and evaluation of an algorithm for detecting and characterizing the presence of a meniscus tear | 1123 MRI images of the knee | Convolutional neural network, fast region-based CNN (Fast R-CNN) | AUC = 92% for locating the two meniscal horns; AUC = 94% for the presence of a meniscus tear; AUC = 83% for the orientation of the tear
[33] Shin H. et al. | Detection of meniscus tears and classification of tear types employing MRI images | MRI images of the knee joint (1048 cases) | AlexNet | AUC = 88.9% for medial meniscus tear; AUC = 81.7% for medial and lateral meniscus tears; AUC = 92.4% for lateral meniscus tear
[34] Rizk et al. | Evaluation of a deep learning approach for meniscus tear detection and its characterization | 11,353 MRI examinations of the knee joint | Convolutional neural network (CNN) | AUC = 93%; TPR = 82%; TNR = 95%
[35] Li et al. | Diagnosis of a knee meniscus tear | Standard MRI images of the knee of 924 patients | Mask R-CNN (masked region-based convolutional neural network), ResNet50 | AP = 68–80%; TPR = 74–95%
[36] Botnari et al. | Systematic review of DL models for MRI of the knee | More than 20 studies on automatic detection of meniscus tears | Overview of CNN, ResNet, DenseNet, and other models | ACC = 77–100%; TPR = 56.9–71.1%; TNR = 67–93%
Table 7. Model reliability at different confidence thresholds.

Confidence ≥ | Test Samples | Reliable TPs (%)
0.80 | 79% | 95.67%
0.85 | 77.1% | 93.05%
0.90 | 63.8% | 77.04%
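A threshold analysis of this kind can be reproduced from per-detection records with a few lines of code; the sketch below assumes each test detection has been stored as a (confidence, matched-ground-truth) pair, and the listed records are purely illustrative rather than the study's data.

```python
# Each record: (confidence, is_true_positive). Illustrative values only.
detections = [(0.97, True), (0.91, True), (0.88, False), (0.83, True), (0.62, False)]

def threshold_summary(dets, thr):
    """Share of detections kept at the threshold, and the TP share among them."""
    kept = [d for d in dets if d[0] >= thr]
    share_kept = len(kept) / len(dets)
    tp_share = (sum(1 for _, tp in kept if tp) / len(kept)) if kept else 0.0
    return share_kept, tp_share

for thr in (0.80, 0.85, 0.90):
    share, reliable = threshold_summary(detections, thr)
    print(f"conf >= {thr:.2f}: kept {share:.1%} of detections, {reliable:.1%} are TPs")
```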
